— Gregory Piatetsky-Shapiro”
In this post, we are going to look at a data set from the Nigerian used car market. We will explore the data, do some necessary data cleaning, and answer some questions about it. The data set contains information about 1,451 cars in the Nigerian car market; you can find more details on the data set here. Let’s read in the data and take a brief look at it.
library(tidyverse)

car <- read_csv("C:/Users/Adejumo/Downloads/car_scrape(1).csv")
head(car)
## # A tibble: 6 x 10
##   title    odometer location isimported  engine  transmission fuel  paint  price
##   <chr>       <dbl> <chr>    <chr>       <chr>   <chr>        <chr> <chr>  <dbl>
## 1 Toyota ~    60127 Lagos    Locally us~ 4-cyli~ automatic    petr~ Silv~ 2.00e6
## 2 Acura M~   132908 Lagos    Foreign Us~ 6-cyli~ automatic    petr~ Whine 3.32e6
## 3 Lexus E~   120412 Lagos    Locally us~ 6-cyli~ automatic    petr~ Silv~ 2.66e6
## 4 Mercede~    67640 Lagos    Foreign Us~ 4-cyli~ automatic    petr~ Black 9.02e6
## 5 Mercede~    92440 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Black 5.79e6
## 6 Mercede~    39979 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Brown 1.94e7
## # ... with 1 more variable: year <dbl>

glimpse(car)
## Rows: 1,451
## Columns: 10
## $ title        <chr> "Toyota Corolla", "Acura MDX", "Lexus ES 350", "Mercedes-~
## $ odometer     <dbl> 60127, 132908, 120412, 67640, 92440, 39979, 144211, 82828~
## $ location     <chr> "Lagos", "Lagos", "Lagos", "Lagos", "Abuja", "Abuja", "La~
## $ isimported   <chr> "Locally used", "Foreign Used", "Locally used", "Foreign ~
## $ engine       <chr> "4-cylinder(I4)", "6-cylinder(V6)", "6-cylinder(V6)", "4-~
## $ transmission <chr> "automatic", "automatic", "automatic", "automatic", "auto~
## $ fuel         <chr> "petrol", "petrol", "petrol", "petrol", "petrol", "petrol~
## $ paint        <chr> "Silver", "Whine", "Silver", "Black", "Black", "Brown", "~
## $ price        <dbl> 1995000, 3315000, 2655000, 9015000, 5790000, 19440000, 19~
## $ year         <dbl> 2009, 2009, 2008, 2013, 2013, 2016, 2008, 2000, 2010, 200~

skimr::skim(car)
Name | car |
Number of rows | 1451 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
title | 0 | 1 | 6 | 37 | 0 | 240 | 0 |
location | 0 | 1 | 3 | 16 | 0 | 13 | 0 |
isimported | 0 | 1 | 3 | 12 | 0 | 3 | 0 |
engine | 0 | 1 | 14 | 16 | 0 | 9 | 0 |
transmission | 0 | 1 | 6 | 9 | 0 | 2 | 0 |
fuel | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
paint | 0 | 1 | 3 | 23 | 0 | 75 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
odometer | 0 | 1 | 116802.14 | 115841.9 | 0 | 53338 | 92919 | 152769.5 | 1775588 | ▇▁▁▁▁ |
price | 0 | 1 | 8431088.23 | 13089603.8 | 400000 | 2615000 | 4215000 | 8865000.0 | 167015008 | ▇▁▁▁▁ |
year | 0 | 1 | 2008.59 | 39.2 | 1217 | 2006 | 2010 | 2014.0 | 2626 | ▁▁▇▁▁ |
From the results above, we can see that there are no missing values. However, there seems to be a problem with the year variable: it contains extreme values at both ends (a minimum of 1217 and a maximum of 2626). Let us take a closer look at it.
car %>%
  filter(year < 1960 | year > 2021)
## # A tibble: 5 x 10
##   title     odometer location isimported  engine transmission fuel  paint  price
##   <chr>        <dbl> <chr>    <chr>       <chr>  <chr>        <chr> <chr>  <dbl>
## 1 Mercedes~   403461 Lagos    Locally us~ 4-cyl~ manual       dies~ white 6.02e6
## 2 Mercedes~   701934 Lagos    Locally us~ 8-cyl~ manual       dies~ white 1.20e7
## 3 Mercedes~        0 Lagos    Locally us~ 8-cyl~ manual       dies~ white 1.20e7
## 4 Mercedes~   510053 Lagos    Locally us~ 6-cyl~ manual       dies~ white 7.50e7
## 5 Mercedes~   650923 Lagos    Locally us~ 6-cyl~ manual       dies~ blue  7.02e6
## # ... with 1 more variable: year <dbl>
There are 5 cars with impossible years, which probably arose when the data was entered. Since this is a negligible share of the 1,451 rows, we can drop them and proceed with the analysis. Next, we also transform the price column by dividing it by one million, so that prices are expressed in millions of naira.
car_data <- car %>%
  filter(!(year < 1960 | year > 2021)) %>%
  mutate(price_millions = price / 1000000, .keep = "unused")

head(car_data)
## # A tibble: 6 x 10
##   title     odometer location isimported  engine  transmission fuel  paint  year
##   <chr>        <dbl> <chr>    <chr>       <chr>   <chr>        <chr> <chr> <dbl>
## 1 Toyota C~    60127 Lagos    Locally us~ 4-cyli~ automatic    petr~ Silv~  2009
## 2 Acura MDX   132908 Lagos    Foreign Us~ 6-cyli~ automatic    petr~ Whine  2009
## 3 Lexus ES~   120412 Lagos    Locally us~ 6-cyli~ automatic    petr~ Silv~  2008
## 4 Mercedes~    67640 Lagos    Foreign Us~ 4-cyli~ automatic    petr~ Black  2013
## 5 Mercedes~    92440 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Black  2013
## 6 Mercedes~    39979 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Brown  2016
## # ... with 1 more variable: price_millions <dbl>
Now let’s start answering some interesting questions about the data.
Are locally used cars more common in the market?
Some cars are used abroad and then imported into the country, while others are bought new and used within the country. Let’s see the percentage of locally used and foreign used cars.
car_data %>%
  group_by(isimported) %>%
  summarise(count = n()) %>%
  mutate(percentage = (count / sum(count)) * 100) %>%
  ggplot(aes(x = isimported, y = percentage, fill = isimported)) +
  geom_col() +
  # `aes` is not a plot label, so it is dropped from labs(); the legend title is set in guides()
  labs(x = "Cars", y = "Percentage", title = "Percentage of the type of used car") +
  guides(fill = guide_legend(title = "Type of Use"))
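To read off exact percentages rather than estimating them from the bars, the same summary can be printed as a table. The following is only a minimal sketch on a made-up `toy` tibble (its rows are not taken from the car data set); `count()` is used here as shorthand for `group_by()` followed by `summarise(n())`.

```r
library(dplyr)

# Toy data, made up for illustration only (not rows from the car data set)
toy <- tibble(
  isimported = c("Foreign Used", "Foreign Used", "Locally used", "New")
)

# Count the rows per type, then turn the counts into percentages of the total
shares <- toy %>%
  count(isimported, name = "count") %>%
  mutate(percentage = count / sum(count) * 100)

print(shares)
```

Applied to `car_data` instead of `toy`, the same pipeline would give the numbers behind the bar chart.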
Prices by type of used car
We now know the kinds of cars sold in the market. Let’s compare their prices and see which type is the most expensive.
car_data %>%
  ggplot(aes(x = isimported, y = price_millions, colour = isimported)) +
  geom_boxplot() +
  labs(x = "Cars", y = "Price in millions", title = "Prices of various types of car") +
  # The mapped aesthetic is colour, not fill, so the legend title must be set on colour
  guides(colour = guide_legend(title = "Type of Use"))
car_data %>%
  filter(price_millions > 100)
## # A tibble: 6 x 10
##   title     odometer location isimported  engine  transmission fuel  paint  year
##   <chr>        <dbl> <chr>    <chr>       <chr>   <chr>        <chr> <chr> <dbl>
## 1 Land Rov~    13687 Lagos    Foreign Us~ 8-cyli~ automatic    petr~ Green  2019
## 2 Mercedes~       20 Lagos    New         8-cyli~ automatic    petr~ Black  2019
## 3 Mercedes~     6758 Lagos    New         12-cyl~ automatic    petr~ Black  2019
## 4 Land Rov~    18720 Lagos    Foreign Us~ 8-cyli~ automatic    petr~ Grey   2019
## 5 Lexus LX~    55530 Abuja    Foreign Us~ 8-cyli~ automatic    petr~ Black  2014
## 6 Rolls-Ro~    16069 Lagos    Locally us~ 4-cyli~ automatic    petr~ Black  2011
## # ... with 1 more variable: price_millions <dbl>
You guessed right, they are luxury cars! I wish I could be the owner of the Rolls-Royce Ghost (lol). Now let’s take a closer look at foreign and locally used cars priced below 25 million naira, since the majority of them fall in this range.
car_data %>%
  filter(isimported != "New", price_millions < 25) %>%
  ggplot(aes(x = isimported, y = price_millions, colour = isimported)) +
  geom_boxplot() +
  # Labels describe prices here (the original labels were copied from the percentage plot)
  labs(x = "Cars", y = "Price in millions", title = "Prices of foreign and locally used cars below 25 million naira") +
  guides(colour = guide_legend(title = "Type of Use"))
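A boxplot shows the comparison visually; a small summary table can put numbers on it. The sketch below computes the median and interquartile range of price per type on a made-up `toy` tibble (the values are illustrative and are not taken from the car data set).

```r
library(dplyr)

# Toy data, made up for illustration only (not rows from the car data set)
toy <- tibble(
  isimported     = c("Foreign Used", "Foreign Used", "Foreign Used",
                     "Locally used", "Locally used"),
  price_millions = c(4.0, 6.5, 9.0, 2.0, 3.0)
)

# Median and IQR are the quantities a boxplot draws, so they match the plot
price_summary <- toy %>%
  group_by(isimported) %>%
  summarise(
    n            = n(),
    median_price = median(price_millions),
    iqr_price    = IQR(price_millions)
  )

print(price_summary)
```

Swapping `toy` for the filtered `car_data` would give the medians that the boxes in the plot are centred on.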
Prices of cars in various locations
Let’s see how many locations we have.
car_data %>%
  group_by(location) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 13 x 2
##    location         count
##    <chr>            <int>
##  1 Lagos             1159
##  2 Abuja              216
##  3 Ogun                34
##  4 Lagos State         21
##  5 other                5
##  6 Abia                 2
##  7 FCT                  2
##  8 Ogun State           2
##  9 Abia State           1
## 10 Accra                1
## 11 Adamawa              1
## 12 Arepo ogun state     1
## 13 Mushin               1
There is a problem: we have both Lagos State and Lagos instead of only Lagos. Let’s fix this and also limit the locations to Abuja and Lagos only.
car_data %>%
  mutate(location = replace(location, location == "Lagos State", "Lagos")) %>%
  group_by(location) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  # %in% tests membership; == would recycle the vector and compare row by row
  filter(location %in% c("Lagos", "Abuja"))
## # A tibble: 2 x 2
##   location count
##   <chr>    <int>
## 1 Lagos     1180
## 2 Abuja      216
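A note on filtering against several values: in R, `==` recycles the shorter vector along the longer one, so `location == c("Lagos", "Abuja")` compares odd rows with "Lagos" and even rows with "Abuja" instead of testing membership. On a short summary table this can happen to give the right answer, but on the full data it silently drops matching rows. The sketch below shows the difference on a made-up `toy` tibble (not rows from the car data set).

```r
library(dplyr)

# Toy data, made up for illustration only (not rows from the car data set)
toy <- tibble(location = c("Abuja", "Lagos", "Lagos", "Abuja", "Ogun", "Lagos"))

# `==` recycles c("Lagos", "Abuja"): row i is compared with "Lagos" when i is
# odd and with "Abuja" when i is even, so most matching rows are lost
with_equals <- toy %>% filter(location == c("Lagos", "Abuja"))

# `%in%` tests each row for membership in the set, which is what is intended
with_in <- toy %>% filter(location %in% c("Lagos", "Abuja"))

nrow(with_equals)  # 2 rows survive
nrow(with_in)      # 5 rows survive
```

R only warns about recycling when the longer length is not a multiple of the shorter one, so with an even number of rows the `==` version fails without any message.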
Now let us look at the prices of cars in the above locations.
car_data %>%
  mutate(location = replace(location, location == "Lagos State", "Lagos")) %>%
  # %in% keeps every Lagos and Abuja row; == would recycle and drop about half of them
  filter(location %in% c("Lagos", "Abuja")) %>%
  ggplot(aes(x = location, y = price_millions, colour = location)) +
  geom_boxplot() +
  xlab("Location") +
  ylab("Price in Million") +
  ggtitle("Car price in various locations")
car_data %>%
  mutate(location = replace(location, location == "Lagos State", "Lagos")) %>%
  # Same plot, restricted to cars priced at 25 million naira or less
  filter(location %in% c("Lagos", "Abuja"), price_millions <= 25) %>%
  ggplot(aes(x = location, y = price_millions, colour = location)) +
  geom_boxplot() +
  xlab("Location") +
  ylab("Price in Million") +
  ggtitle("Car price in various locations")
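The location counts earlier also show other duplicate spellings, such as Ogun and Ogun State, or Abia and Abia State. If those locations were to be kept, one possible clean-up is to strip a trailing "State" with a regular expression instead of replacing one value at a time. This is only a sketch on a made-up `toy` tibble, and it assumes the duplicates differ only by that suffix; free-text entries like "Arepo ogun state" would still need a manual mapping.

```r
library(dplyr)
library(stringr)

# Toy data, made up for illustration only (not rows from the car data set)
toy <- tibble(
  location = c("Lagos", "Lagos State", "Ogun", "Ogun State", "Abia State")
)

# Remove a trailing " State" (any letter case) so both spellings collapse into one
cleaned <- toy %>%
  mutate(location = str_remove(location, regex("\\s+state$", ignore_case = TRUE)))

cleaned %>% count(location, sort = TRUE)
```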
Conclusion
Well, that is all for this post. Many other questions can be answered from this data set, such as how car prices change from year to year; the ones covered here are just a few.