How to analyze a new dataset (or, analyzing ‘supercar’ data, part 1)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I love cars.
The way they sound. The engineering. The craftsmanship. And let’s be honest: fast cars are just fun.
Given my love of cars, I frequently watch Top Gear clips on YouTube.
A couple of weeks ago, I stumbled across this:
Watching the video, I’m thinking, “253 miles per hour? You’ve got to be kidding me.”
And the next thing you know, I’m scraping ‘high-performance’ car data from a website.
Data analysis example: ‘supercar’ data
After gathering the data together, I realized it would be a great dataset to use for a data analysis example.
When I initially saw the data on the website, I was already asking questions: how much horsepower does that Bugatti have? How much torque? How much compared to other cars?
The data is rich enough to answer those questions, but it needs some data manipulation to extract the exact variables from it; which means, we’ll be able to use some dplyr verbs to manipulate and reshape our data.
Moreover, the data was scattered across several smaller datasets, making it perfect for demonstrating the process of merging data together. Note that I haven’t published any tutorials on “joins” yet, so just follow along during that section. I’ll publish a tutorial on joins later. In the meantime, this will give you a preview.
And finally, this data gives us the opportunity to put some of our visualization skills to use: creating scatterplots, histograms, bar charts, and small multiples. Moreover, we’ll be able to use these tools to practice data exploration. Remember: finding insight in data is an art, and that art must be practiced.
The following data analysis example will show you the rough process for analyzing data, end-to-end.
When I say “rough process,” what I mean is that this isn’t comprehensive. At every step, there might be more to do (e.g., get more data, do more visualizations, “polish” the charts for presentation). Having said that, even though it doesn’t show everything we might do, it does take you through the overall process at a high level.
Ok, let’s get started.
Get the data
First, we’re just going to get our data from 5 separate .txt files. (These files are available here on the Sharp Sight Labs blog; you should be able to access directly them via the following code.)
Keep in mind that .txt files are sort of a simple case. As you progress, your data might be in a database (requiring you to write some SQL code) or it might be on a website that you need to scrape.
We’ll keep this simple though: these datasets are already in comma-delimited text files.
(Also note: I scraped these from a website. I did some data wrangling to reshape the original data into the following .txt files. That process was a bit more complicated though, so I’ve left it out of the tutorial.)
In the following code, we’re just going to import the files into dataframes using
library(dplyr) library(ggplot2) ################### # IMPORT datasets # ################### df.car_torque <- read.csv(url("http://www.sharpsightlabs.com/wp-content/uploads/2014/11/auto-snout_torque_DATA.txt")) df.car_0_60_times <- read.csv(url("http://www.sharpsightlabs.com/wp-content/uploads/2014/11/auto-snout_0-60-times_DATA.txt")) df.car_engine_size <- read.csv(url("http://www.sharpsightlabs.com/wp-content/uploads/2014/11/auto-snout_engine-size_DATA.txt")) df.car_horsepower <- read.csv(url("http://www.sharpsightlabs.com/wp-content/uploads/2014/11/auto-snout_horsepower_DATA.txt")) df.car_top_speed <- read.csv(url("http://www.sharpsightlabs.com/wp-content/uploads/2014/11/auto-snout_top-speed_DATA.txt")) df.car_power_to_weight <- read.csv(url("http://www.sharpsightlabs.com/wp-content/uploads/2014/11/auto-snout_power-to-weight_DATA.txt")) #------------------------- # Inspect data with head() #------------------------- head(df.car_torque) head(df.car_0_60_times) head(df.car_engine_size) head(df.car_horsepower) head(df.car_top_speed) head(df.car_power_to_weight)
Examine the data
Next, examine the data.
Specifically, you’re going to look for duplicate records. The reason is that we’re going to join these together into one dataset. Duplicate records will cause a faulty join.
############################## # LOOK FOR DUPLICATE RECORDS # ############################## df.car_torque %>% group_by(car_full_nm) %>% summarise(count=n()) %>% filter(count!=1) df.car_0_60_times %>% group_by(car_full_nm) %>% summarise(count=n()) %>% filter(count!=1) df.car_engine_size %>% group_by(car_full_nm) %>% summarise(count=n()) %>% filter(count!=1) df.car_horsepower %>% group_by(car_full_nm) %>% summarise(count=n()) %>% filter(count!=1) df.car_top_speed %>% group_by(car_full_nm) %>% summarise(count=n()) %>% filter(count!=1) df.car_power_to_weight %>% group_by(car_full_nm) %>% summarise(count=n()) %>% filter(count!=1) # NOTE: duplicate records! # several datasets have duplicate rows for the following cars (car_full_nm) # Chevrolet Chevy II Nova SS 283 V8 Turbo Fire - [1964] 2 # Koenigsegg CCX 4.7 V8 Supercharged - [2006] 2 # Pontiac Bonneville 6.4L V8 - [1960] 2
Here, to look for duplicate records, we’ve chained together several dplyr verbs. In this process, we’re aggregating each dataframe by “car_full_nm,” counting the number of records for each car, and then filtering the resulting data, leaving only cars that have more than one record.
If you don’t understand this, that’s OK. The dplyr tutorial explains chaining, and all of these verbs. So if you don’t understand this code, read that tutorial.
Remove duplicate records to prep for ‘joins’ (i.e., dedupe)
Since we found some duplicate records, we want to remove those.
We’ll use the
############### # DEDUPE DATA # ############### df.car_0_60_times <- distinct(df.car_0_60_times ,car_full_nm) df.car_engine_size <- distinct(df.car_engine_size ,car_full_nm) df.car_horsepower <- distinct(df.car_horsepower ,car_full_nm) df.car_top_speed <- distinct(df.car_top_speed ,car_full_nm) df.car_torque <- distinct(df.car_torque ,car_full_nm) df.car_power_to_weight <- distinct(df.car_power_to_weight, car_full_nm)
Join datasets together into one data frame
Now that our dataframes are deduped, we’ll join them together.
Joins are an an entirely separate tutorial all by themselves, so I won’t discuss this code in detail.
Having said that, you’ll notice that I’m:
1. Starting with df.car_horsepower and joining df.car_torque to that. I’m joining these two datasets together on the car_full_nm variable. This joined dataframe is called df.car_spec_data.
2. Then, one by one, I’m joining all of the datasets to df.car_spec_data to create a “master” dataset.
Note also that after each line of code, I’ve added a comment (using the ‘#’ character), indicating the number of records. This is a simple check to make sure that the join worked properly. Because of the nature of
############# # JOIN DATA # ############# df.car_spec_data <- left_join(df.car_horsepower, df.car_torque, by="car_full_nm") # count after join: 1578 df.car_spec_data <- left_join(df.car_spec_data, df.car_0_60_times, by="car_full_nm") # count after join: 1578 df.car_spec_data <- left_join(df.car_spec_data, df.car_engine_size, by="car_full_nm") # count after join: 1578 df.car_spec_data <- left_join(df.car_spec_data, df.car_top_speed, by="car_full_nm") # count after join: 1578 df.car_spec_data <- left_join(df.car_spec_data, df.car_power_to_weight, by="car_full_nm") # count after join: 1578 # Test duplicates df.car_spec_data %>% group_by(car_full_nm) %>% summarise(count=n()) %>% filter(count!=1) # local data frame [0 x 2] # i.e., 0 duplicate records # Re-inspect data str(df.car_spec_data) head(df.car_spec_data)
Add new variables
Next, you’ll add new variables using the
You can learn more about
Additionally, you’ll notice some arcane syntax inside of the
The topic of regular expressions is a little more advanced, so don’t worry about it right now. Really, if you don’t know regular expressions, you might be tempted to spend a lot of time trying to figure it out. Learning regular expressions isn’t a great use of your time right now. Just let it work it’s magic and we’ll come back to the topic another time. For the time being, try to focus on the overall process.
##################### # ADD NEW VARIABLES ##################### #-------------- # NEW VAR: year #-------------- df.car_spec_data <- mutate(df.car_spec_data, year=sub(".*\[([0-9]{4})\]","\1",car_full_nm)) str(df.car_spec_data$year) #---------------- # NEW VAR: decade #---------------- df.car_spec_data <- mutate(df.car_spec_data, decade = as.factor( ifelse(substring(df.car_spec_data$year,1,3)=='193','1930s', ifelse(substring(df.car_spec_data$year,1,3)=='194','1940s', ifelse(substring(df.car_spec_data$year,1,3)=='195','1950s', ifelse(substring(df.car_spec_data$year,1,3)=='196','1960s', ifelse(substring(df.car_spec_data$year,1,3)=='197','1970s', ifelse(substring(df.car_spec_data$year,1,3)=='198','1980s', ifelse(substring(df.car_spec_data$year,1,3)=='199','1990s', ifelse(substring(df.car_spec_data$year,1,3)=='200','2000s', ifelse(substring(df.car_spec_data$year,1,3)=='201','2010s',"ERROR" ))))))))) ) ) head(df.car_spec_data) str(df.car_spec_data) #------------------------------- # NEW VAR: make_nm # (i.e., the "make" of the car; # the "brand name" of the car) #------------------------------- df.car_spec_data <- mutate(df.car_spec_data, make_nm = gsub(" .*$","", df.car_spec_data$car_full_nm)) #-------------------------- # NEW VAR: car_weight_tons #-------------------------- df.car_spec_data <- mutate(df.car_spec_data, car_weight_tons = horsepower_bhp / horsepower_per_ton_bhp) #-------------------------- # NEW VAR: torque_per_ton #-------------------------- df.car_spec_data <- mutate(df.car_spec_data, torque_per_ton = torque_lb_ft / car_weight_tons)
Inspect data again
Now that we’ve added our variables, we’ll just check our data again.
Notice that we’re doing that a lot.
At every step, you want to look and see if the data set turned out the way you expected.
Specifically, we’re going to check that we created the new variables properly. We’re going to accomplish that by using
######################################### # INSPECT DATA # - quick checks to make sure that # the variables were created properly ######################################### head(df.car_spec_data) #------------------------------ # Frequency table (AGGREGATE) # - decade #------------------------------ # CHECK 'decade' variable df.car_spec_data %>% group_by(decade) %>% summarize(count=n()) # decade count # 1 1930s 2 # 2 1940s 7 # 3 1950s 57 # 4 1960s 143 # 5 1970s 125 # 6 1980s 154 # 7 1990s 262 # 8 2000s 526 # 9 2010s 302 #------------------------------ # Frequency table (AGGREGATE) # - make #------------------------------ df.car_spec_data %>% group_by(make_nm) %>% summarise(make_count = length(make_nm)) %>% arrange(desc(make_count)) # note: the list of car makes is too long so the output hasn't # been added here
Using several dplyr verbs chained together, we’ve taken the df.car_spec_data data frame, grouped it by decade, then counted the number of records by decade. We’re doing this to check that we created our ‘decade’ variable correctly; if we found an incorrect value for ‘decade,’ or a count that looked way off, we’d have to go back and check our work.
We’re doing the same thing with the ‘make’ variable: chaining together several dplyr verbs to aggregate, count, and create a frequency table.
(I’ve omitted the output of the car-make frequency table. The output is a long list. Just run the code and you’ll see the output yourself.)
Recap: creating a master dataset
Now we have a dataset we can work with.
Let’s recap what we just did:
1. Imported multiple .txt files with
2. Inspected the data
3. Removed duplicate records from each data set (deduped)
4. Merged the datasets together into a single dataframe
5. Created new variables
6. Inspected again
Keep this process in mind. When you’re initiating a new analysis and creating your dataset, this is a rough outline of the steps you’ll probably need to execute.
Want to see part 2?
In part 2 of this data analysis example, I’m going to show how to explore this data using the core data visualization techniques.
If you want to see part two as soon as it’s published, sign up for our email list and we’ll send the link directly to you, so you don’t miss it.
The post How to analyze a new dataset (or, analyzing ‘supercar’ data, part 1) appeared first on SHARP SIGHT LABS.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.