American cars data analysis
Updated: Mar 9, 2019
This little project was inspired by my experience of buying a car in the USA. I found that the car market sits somewhere between transparent and completely opaque. If you are not interested in the background, you can skip the next few paragraphs and go straight to the charts and other content.
I call it transparent because cars are a necessity for most American families. Almost everyone in the USA can share their own understanding of cars, and lots of people can fix their own car in their garage. That is much less common in developing countries such as China. The second-hand market in America is also very sophisticated. More importantly, people are willing to do research before buying a car. All of this demand gave rise to cars.com, a website that contains information on the cars on the market, including new and used cars for sale and cars that have already been sold. Its search engine is powerful enough to support almost any research you have in mind. The website also provides a deal evaluation for second-hand cars, using labels such as great deal, good deal, and fair deal to judge the price based on the condition of the car. I think this is the most useful feature for people who do not know the market very well. Dealers like to publish their inventory on cars.com as well, because it gives them more exposure. Many dealers even have a department just for handling internet business.
I call it not transparent because the same new car, with the same color and the same features such as engine type, can have quite different prices. Of course, many other factors influence the price of a car, but that is the interesting part of this work. We now have a platform with information on the majority of cars for sale, so we can look for patterns that support or reject our hypotheses. We can also do more interesting research that cars.com does not provide. For example, one simple question is: which state sells the cheapest cars? Going further, which state has the cheapest Toyota? Which city in VA has the cheapest Toyota? Which is the cheapest hybrid car? Which is the cheapest car with a blind spot monitor?
These questions may not be hard for someone who knows the market very well. But it is better if we can back the opinion with actual numbers and show the distribution of the market.
The crawling strategy
Since cars.com does not provide these searches, we need to crawl the information on the website ourselves. I only crawled new car listings, because I am only interested in the new car market. I also think new car information is more valuable, since it is less manipulated than second-hand car information.
The crawling strategy is to run searches from the main search page of the website. The major problem is that cars.com shows at most the first 5000 results for each search. So if the search is zip code '22202' with range 'all miles', it returns the total number of new cars in the database, but it is not possible to access all of them.
So the strategy is to take the zip code of every major city in the USA, fix a make and model, and record the first 2500 items (cars) of each search. Some items will show up in more than one search, but Scrapy crawls each URL only once, so that is not a problem. I ended up with 1,905,897 items, while the cars.com website reports 2,458,124 cars in total. I am not sure what causes the gap, but I do have data from all 50 states, and there are cases showing that the number I crawled does not match the number on the website. For example, searching for all new cars in Alaska on the webpage gives 3,479 cars, but I only got 1,288. At least I can guarantee that I have data for all major cities in the US.
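The strategy above can be sketched in plain Python as follows. This is only an illustration, not my actual spider: the function names, the task dictionary layout, and the example URLs are all made up, and the real cars.com search parameters are not shown. Scrapy already skips duplicate requests by default, so the deduplication step is written out here only to make the idea explicit.

```python
from itertools import product

# Cap on recorded results per search, taken from the strategy described above.
MAX_RESULTS_PER_SEARCH = 2500


def build_search_tasks(zip_codes, make_models):
    """Return one search task per (zip code, make, model) combination.

    The dictionary keys are hypothetical; they are not cars.com parameters.
    """
    return [
        {"zip": zip_code, "make": make, "model": model,
         "limit": MAX_RESULTS_PER_SEARCH}
        for zip_code, (make, model) in product(zip_codes, make_models)
    ]


def dedupe_listings(listing_urls):
    """Keep the first occurrence of each listing URL, preserving order."""
    seen = set()
    unique = []
    for url in listing_urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


if __name__ == "__main__":
    tasks = build_search_tasks(
        ["22202", "90001"],
        [("Toyota", "Camry"), ("Ford", "F-150")],
    )
    print(len(tasks), "search tasks")

    # Overlapping searches return the same listing more than once.
    urls = [
        "https://example.com/listing/1",
        "https://example.com/listing/2",
        "https://example.com/listing/1",
    ]
    print(dedupe_listings(urls))
```

Fixing the make and model keeps each search small enough to stay under the result cap in most cities, at the cost of many more searches.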
Charts & Graphs
Before we go into any of these charts, I have to emphasize that none of these numbers are sales figures. They are stock numbers: they show the cars that are available on the market. I may crawl all the car data again in March to compare how many cars were sold during the month and how many new cars were added. At the same time, we can see how prices move: do cars become cheaper or more expensive?
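A second crawl could be compared with the first one roughly like this. It is a minimal sketch that assumes every listing carries a stable identifier (a hypothetical `listing_id`; the listing URL might serve the same purpose). Note that a listing that disappears is not necessarily a sold car: it may have been withdrawn or moved to another dealer.

```python
def compare_snapshots(old_ids, new_ids):
    """Split listing ids into removed, added, and still-listed sets."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        "removed": old_ids - new_ids,      # candidates for "sold"
        "added": new_ids - old_ids,        # new stock since the first crawl
        "still_listed": old_ids & new_ids,
    }


def price_changes(old_prices, new_prices):
    """Return new price minus old price for listings present in both crawls.

    Both arguments map a (hypothetical) listing id to a price in dollars.
    """
    common = old_prices.keys() & new_prices.keys()
    return {lid: new_prices[lid] - old_prices[lid] for lid in common}


if __name__ == "__main__":
    first = {"a1": 25000, "b2": 31000, "c3": 18000}
    second = {"b2": 30500, "c3": 18000, "d4": 42000}

    diff = compare_snapshots(first, second)
    print(len(diff["removed"]), "removed,", len(diff["added"]), "added")
    print(price_changes(first, second))
```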
Now, finally, here come the charts.
1. Which state has the most cars?
It is interesting to see that the distribution is very imbalanced.
Although there are 50 states in America, California (CA) alone has over 10% of the cars in the whole country. CA, TX, and FL together have nearly 30% of the cars.
My guess is that this is related to population: these three states are the top three in the population ranking.
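The share of cars per state can be computed with a simple count. The sketch below assumes each crawled car is a dictionary with a `state` field, which is a made-up field name for illustration. Checking the population guess would also need a population table, which is not part of the crawled data.

```python
from collections import Counter


def share_by_key(records, key):
    """Return (value, count, share of total) tuples, largest count first."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]


if __name__ == "__main__":
    # Tiny made-up sample, not real data.
    cars = [
        {"state": "CA"}, {"state": "CA"}, {"state": "CA"},
        {"state": "TX"}, {"state": "TX"},
        {"state": "AK"},
    ]
    for state, n, share in share_by_key(cars, "state"):
        print(f"{state}: {n} cars ({share:.1%})")
```

The same function works for chart 2 by passing a `make` field instead of `state`.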
2. Which make(brand) has the most cars?
Chevrolet and Ford have around 11.5% of the cars in America. That seems reasonable, since they are "American made". Toyota and Honda have around 8% of the cars, which fits their reputation for good quality and safety.
3. We can combine make and state.
In this part, I want to show the distribution of makes in each state.
I am sorry that the chart may not be very clear. We can see that the distribution of makes is not exactly the same in every state. Ford may have the largest number in one state but not in another. I will not display the full details for all the states here. Please leave a message if you need them, and I will add them at the end of this blog.
Finally, this is the question I raised in the previous section.
4. Which state sells the cheapest car?
The average price for the whole US is $36,986.50. This includes all kinds of cars, such as compact, mid-size, and large. cars.com does not show the category of a car on its detail page, so I may not be able to answer questions such as: which is the cheapest compact car? But I will look at some other interesting questions later.
In the chart, Hawaii(HI) has the cheapest cars, and Alaska(AK) has the most expensive cars.
5. Which state has the cheapest Toyota?
Florida (FL) has the cheapest Toyota. It is no surprise that the price in Alaska (AK) is the highest.
6. Which city in VA sells the cheapest Toyota?
Obviously, Falls Church wins.
The same bar charts can be made for every state and every brand (make).
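The numbers behind charts 4 to 6 all come from the same operation: filter the cars, group them, and average the price. Here is a minimal sketch, assuming each car is a dictionary with hypothetical `state`, `city`, `make`, and `price` fields. Cars without a listed price are skipped.

```python
from collections import defaultdict


def average_price_by(records, group_key, **filters):
    """Average price per group, cheapest first.

    `filters` keeps only records whose fields equal the given values,
    e.g. make="Toyota", state="VA". Records with no price are ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        if record.get("price") is None:
            continue
        if any(record.get(field) != value for field, value in filters.items()):
            continue
        group = record[group_key]
        sums[group] += record["price"]
        counts[group] += 1
    averages = {group: sums[group] / counts[group] for group in sums}
    return sorted(averages.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up sample rows, not real prices.
    cars = [
        {"state": "VA", "city": "Falls Church", "make": "Toyota", "price": 24000},
        {"state": "VA", "city": "Falls Church", "make": "Toyota", "price": 26000},
        {"state": "VA", "city": "Richmond", "make": "Toyota", "price": 28000},
        {"state": "VA", "city": "Richmond", "make": "Ford", "price": 35000},
        {"state": "FL", "city": "Miami", "make": "Toyota", "price": None},
    ]
    print(average_price_by(cars, "state"))                            # chart 4
    print(average_price_by(cars, "state", make="Toyota"))             # chart 5
    print(average_price_by(cars, "city", make="Toyota", state="VA"))  # chart 6
```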
The point of doing this is not just fun. I have no idea how Toyota runs its business, but charts like these give us a glimpse. We could also dig deeper by analysing how prices behave in light of the situation in a particular city or state. I cannot go further at this time, but I believe that if I crawl the whole dataset again in March, I will be able to see more with actual sales data.
Now, let us do more interesting data exploration.
7. Which brand(make) is the most expensive?
Maybe this is not as meaningful as the previous charts, because car prices can get quite insane. Most of the cars listed on cars.com are aimed at ordinary buyers. From the average price you may be able to see which customers a brand is targeting. But in most cases the most expensive cars, such as Rolls-Royce, are not available on cars.com. Also, for Aston Martin, although there are 53 records in the database, 13 of them do not list a price. These brands probably do not expect clients to buy their cars through cars.com.
8. What is the distribution of engine size (liters)?
Sometimes I think that engineers must choose the size of an engine with some consideration. I did not major in mechanical engineering or car design, but I can at least look at the distribution of engine sizes.
There are 1,785,939 cars (93.7%) with engine size information. The rest may be electric vehicles, or cars for which the dealer did not put the engine information on the website. If you have any idea why engine sizes are distributed like this, please leave a message.
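For reference, the coverage figure and the distribution can be computed along these lines. This is a sketch under the assumption that each car has an `engine_liters` field (a made-up name) which is `None` when the dealer gave no engine size.

```python
from collections import Counter


def engine_size_distribution(records):
    """Return (coverage, sorted list of (engine size, count)).

    Coverage is the fraction of records that have an engine size at all.
    """
    sizes = [r["engine_liters"] for r in records
             if r.get("engine_liters") is not None]
    coverage = len(sizes) / len(records) if records else 0.0
    return coverage, sorted(Counter(sizes).items())


if __name__ == "__main__":
    # Coverage reported above: cars with engine size / all crawled cars.
    print(f"{1_785_939 / 1_905_897:.1%}")

    # Made-up sample rows, not real data.
    cars = [
        {"engine_liters": 2.0}, {"engine_liters": 2.0},
        {"engine_liters": 3.5}, {"engine_liters": None},
    ]
    coverage, distribution = engine_size_distribution(cars)
    print(coverage, distribution)
```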