How to really do an analysis in R (part 1, data manipulation)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For a couple of years, I’ve been writing about the importance of data analysis, saying that data analysis is essentially the foundation of data science itself.
While I admit that there’s room for argument (and I admit that the reality is a little more nuanced) I still firmly believe that data analysis is the true foundation of practical data science.
That is, if you want to learn practical data science (as opposed to theoretical, academic topics) you need to be able to absolutely shred a dataset.
If we drill into that just a little more, what that means is that you need to be able to take a dataset, wrangle it into shape, and use visualization and other techniques to find insights.
This is essentially why I’ve been consistently recommending that you start learning data science by getting a solid foundation in two core skill areas:
- Data wrangling (AKA, data manipulation)
- Data visualization
Once you learn these skills independently, you can use them together for analysis.
In practical terms, this means that as an R user, you should know
What does an analysis look like in R?
I’ve been emphasizing the importance of visualization, wrangling, and analysis for a while, but you might be wondering, “what does it look like, in practice?”
I’m glad you asked.
As it turns out, last week I started asking questions about economic shifts in the world. In particular, I was asking questions about shifts in geopolitical and geoeconomic centers of power, and the rise of Asia. After some clicking around, I stumbled on a Wikipedia dataset about shipping volumes at the “busiest ports in the world.”
Aaaand, down the rabbit hole I went.
I decided to analyze the dataset, partially to satisfy my curiosity, but also to give you an example of what’s possible once you know a few critical skills of the tidyverse.
A quick note for beginners
If you’re a beginner, you should know that the code you’re about to see is “intermediate level” code. At it’s core, the code is pretty straightforward (really, I swear), but there are lots of little details that obscure the underlying simplicity.
Having said that, in an upcoming post I’m going to strip all of this down to bare bones to show you that you can learn 80% of this very quickly. If you want me to notify you when that post is available, definitely sign up for the email list … if you’re part of the Sharp Sight email list, you’ll be notified when I publish that future post.
In the meantime, let’s dive in. I want you to see just how powerful the tidyverse is for wrangling, visualizing, and analyzing data.
First things first: we need to get the data and hammer it into shape.
Get the data
First, we need to get the data. To do this, we’re going to scrape the data from Wikipedia by using the
Admittedly, web scraping is not a topic for beginners. If you’re a beginner, you could just copy and paste the data into Excel, save it as a csv, and import it using the
However, I think this is a more interesting way to get the data (and it’s something that can be easily communicated/replicated through a blog post).
Ok, we’ll use
(As I noted, this is sort of an intermediate topic, so I’ll do a more in-depth tutorial on scraping in the future.)
###################################### # PART I: GET THE DATA # - We need to both get the data # as well as "wrangle" it into shape ###################################### #============== # LOAD PACKAGES #============== library(tidyverse) library(stringr) library(forcats) library(ggmap) library(rvest) #========================== #SCRAPE DATA FROM WIKIPEDIA #========================== # SCRAPE html.world_ports <- read_html("https://en.wikipedia.org/wiki/List_of_busiest_container_ports") df.world_ports <- html_table(html_nodes(html.world_ports, "table")[[2]], fill = TRUE)
Inspect the data
After scraping and reading in the data, we’ll examine the data using
This will give us the names of the columns, the dimensions of the data, and will print out the first several observations.
You’ll see this sort of data inspection after many of the procedures that we execute. It’s important to routinely examine your data after you change it. Essentially, as you execute a procedure, you want to make sure that the procedure did what you wanted it to. You want to make sure that everything looks “OK.”
Get into the habit of examining your data after important blocks of data wrangling code.
# INSPECT glimpse(df.world_ports) # Observations: 50 # Variables: 14 # $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26... # $ Port <chr> "Shanghai", "Singapore", "Shenzhen", "Hong Kong", "Ningbo-Zhoushan", "Busan", "Qingdao", "Gua... # $ Economy <chr> "China", "Singapore", "China", "Hong Kong", "China", "South Korea", "China", "China", "United... # $ 2014[1] <chr> "35,268", "33,869", "23,798", "22,374", "19,450", "18,423", "16,624", "16,160", "14,750", "14... # $ 2013[2] <chr> "33,617", "32,240", "23,280", "22,352", "17,351", "17,690", "15,520", "15,309", "13,641", "13... # $ 2012[3] <chr> "32,529", "31,649", "22,940", "23,117", "16,670", "17,046", "14,503", "14,744", "13,270", "12... # $ 2011[4] <chr> "31,700", "29,937", "22,570", "24,384", "14,686", "16,185", "13,020", "14,400", "13,000", "11... # $ 2010[5] <chr> "29,069", "28,431", "22,510", "23,532", "13,144", "14,157", "12,012", "12,550", "11,600", "10... # $ 2009[6] <chr> "25,002", "25,866", "18,250", "20,983", "10,502", "11,954", "10,260", "11,190", "11,124", "8,... # $ 2008[7] <chr> "27,980", "29,918", "21,414", "24,248", "11,226", "13,425", "10,320", "11,001", "11,827", "8,... # $ 2007[8] <chr> "26,150", "27,932", "21,099", "23,881", "9,349", "13,270", "9,462", "9,200", "10,653", "7,103... # $ 2006[9] <chr> "21,710", "24,792", "18,469", "23,539", "7,068", "12,039", "7,702", "6,600", "8,923", "5,950"... # $ 2005[10] <chr> "18,084", "23,192", "16,197", "22,427", "5,208", "11,843", "6,307", "4,685", "7,619", "4,801"... # $ 2004[11] <chr> "14,557", "21,329", "13,615", "21,984", "4,006", "11,430", "5,140", "3,308", "6,429", "3,814"...
Notice that we have a the following variables:
- Rank: lists the shipping-volume rank for 2014
- Port: the port name (i.e., the city/location of the port)
- Economy: the country where the port is located
- 2014[], 2013[], etc: These year columns contain the shipping volume for that given year
Again, the years themselves are spread across columns (as column names), and the shipping volumes (which we’ll want to eventually analyze) are listed under those year columns. This is not the correct, ‘tidy’ structure of the data, so we’ll eventually need to reshape it.
Also notice that almost everything has been read in as character data. We’ll need to change several of these variables to factors, numerics, etc.
Rename variables
Next, we’re going to rename our variables by transforming them all to lower case.
By default, when we read in the data, several of the variables have capitalized first letters.
In programming, there are several different style conventions (and I’ll admit that my naming conventions are occasionally unconventional), but many people will tell you that there’s no good reason to have a capitalized first letter for a variable name. In most situations, you’ll want to remove leading capitals, if for no other reason than it makes them easier to type.
To execute this, we’re going to use
On the right-hand side of the expression,
We then use the assignment operator (
So ultimately, on the right-hand side of the expression we’re retrieving and transforming to lower case. Then on the left-hand side of the expression we’re re-setting the column names.
#========================== # RENAME VARIABLES: # - transform to lower case #========================== #----------- # LOWER CASE #----------- colnames(df.world_ports) <- colnames(df.world_ports) %>% tolower() # INSPECT colnames(df.world_ports)
Get geo-spatial information (latitude/longitude)
Next, we’re going to “geocode” our observations so we can use a few geo-spatial visualization techniques later in our analysis. That is, we’re going to get the latitude and longitude associated with our observations.
More specifically, each row of our data frame (as it currently stands) represents a port. It is a specific location. Most frequently, it’s just the name of a city, but sometimes there is a specific name of a port.
We’re going to use
Working with geo-spatial data is a slightly more advanced topic, but
#=================================================== # GEOCODE # - here, we're picking up lat/long from Google Maps # using ggmaps::geocode() #=================================================== #-------------------------- # get data from google maps #-------------------------- geocodes.world_ports <- geocode(df.world_ports$port)
We then need to merge the lat/lon data back onto
#-------------------------------------------------------- # COMBINE: # - bind the new lat/long data to df.world_ports data frame #-------------------------------------------------------- df.world_ports <- cbind(df.world_ports, geocodes.world_ports) head(df.world_ports)
Manually code missing latitude/longitude data
When we used
But, the function failed for a few of the ports. Essentially,
Consequently, we need to get that data another way.
To get the lat/lon data, we’re going to manually retrieve this information at http://www.latlong.net/convert-address-to-lat-long.html. If you go there, you’ll find that you can simply click on a location on the map, and it will return the latitude and longitude.
After manually obtaining the remaining lat/long coordinates (they are recorded in the commented code below) we’ll just hand-code them into our data frame.
This is probably a slightly inelegant solution, but it will work for the time being (and it’s simple enough that I can communicate it to you in a blog post).
#========================================================================================= # RECODE lon and lat # - There are 4 lon/lat values that weren't found with geocode() # - We'll just hand code them # - The values can be obtained at http://www.latlong.net/convert-address-to-lat-long.html # # # Tanjung Pelepas, Johor Bahru: lon = 103.551035, lat = 1.362374 # Yingkou:..................... lon = 122.108231, lat = 40.266062 # Valencia, Spain:............. lon = -0.3762881, lat = 39.46991 # Malta Freeport:.............. lon = 14.537637 , lat = 35.816287 # #========================================================================================= df.world_ports <- df.world_ports %>% mutate( lon = case_when(.$port == "Tanjung Pelepas" ~ 103.551035 ,.$port == "Yingkou" ~ 122.108231 ,.$port == "Valencia" ~ -0.3762881 ,.$port == "Malta Freeport" ~ 14.537637 ,TRUE ~ .$lon ) ,lat = case_when(.$port == "Tanjung Pelepas" ~ 1.362374 ,.$port == "Yingkou" ~ 40.266062 ,.$port == "Valencia" ~ 39.46991 ,.$port == "Malta Freeport" ~ 35.816287 ,TRUE ~ .$lat ) ) # CHECK df.world_ports %>% filter(port == "Tanjung Pelepas") %>% select(lat,lon) df.world_ports %>% filter(port == "Yingkou") %>% select(lat,lon) df.world_ports %>% filter(port == "Valencia") %>% select(lat,lon) df.world_ports %>% filter(port == "Malta Freeport") %>% select(lat,lon)
To be honest, this block of code may be a little challenging to understand because of how we’re referencing the data frame inside the
Having said that, at the core, this code block us just using
Convert variables to factors
There are several variables that were parsed as character variables that we’d rather have as factors.
This being the case, we’re going to convert these variables to factors. Note that we’re ultimately using the
#============================= # convert variables to factors #============================= df.world_ports <- mutate(df.world_ports , economy = as.factor(str_trim(economy)) , port = as.factor(port) )
Create a ‘label’ variable
Next, we’re going to create an extra variable that we’ll essentially use only for labels when we create plots.
The reason for this, is that several of the port names are simply too long for some of our plots. When we begin to visualize the data (in the next blog post), you’ll see that some of the port names are so long, that they would get cut off if we used them.
That being the case, we’re going to create a new variable which will still contain the port names, but will use abbreviated names for the long-named locations.
To create this, we’re simply going to identify the names that are excessively long and create shortened names.
To identify the port names that are too long, we’re going to list them out using
After identifying long names, we’re using
Note as well that all of this is occurring inside of
#============================================== # CREATE 'LABEL' VARIABLE # - Several of the port names are too long when # we use them as plot labels, so we'll make # a new variable that has some abbreviated # names #============================================== #-------------------------------------------- # GET LEVELS # - this will let us identify which labels to # modify #-------------------------------------------- levels(df.world_ports$port) #-------------------------------------------- # RECODE: # - use fct_recode() to create new variable #-------------------------------------------- df.world_ports <- mutate(df.world_ports , port_label = fct_recode(port ,"Saigon" = "Ho Chi Minh City (Saigon)" ,"New York" = "New York and New Jersey" ,"Jakarta" = "Tanjung Priok (Jakarta)" ,"Bremen" = "Bremen/Bremerhaven" ,"Istanbul" = "Ambarli (Istanbul)" ,"Tangiers" = "Tanger-Med (Tangiers)" ,"Dubai" = "Jebel Ali (Dubai)" ,"Ningbo/Z-shan" = "Ningbo-Zhoushan" ) )
Keep in mind that this process was actually iterative. I had to recode these names and then use them in test-plots (like bar charts and small multiple plots) to see if the names were truly short enough, and to make sure that I got them all.
I’ve emphasized the importance of iterative workflow in the past, so I want to point out that it was actively used here, even if you can’t directly see that process in the finalized code.
Reshape data into long format (AKA ‘tidy format’)
Now, we’re going to put the data into “tidy” form.
What is tidy form?
It’s best explained with an example. If you print out the data using the
The ‘tidy’ form of the data would have the years in a ‘
Essentially, this means that we need to reshape the data into a new format, transposing the columns names (‘
As we do this, we will also be creating a new variable called ‘volume.’ This will contain the shipping volumes that had been contained underneath our set of old year columns (i.e.
To do this, we need to use
(Note that if data reshaping seems a little complicated, don’t worry.
#===================================== # RESHAPE WIDE TO LONG # - this data is not in 'tidy' form # - we need to reshape it to a longer # form using tidyr::gather() #===================================== # RESHAPE df.world_ports <- df.world_ports %>% gather(year,volume,4:14) # INSPECT head(df.world_ports) levels(as.factor(df.world_ports$year)) levels(df.world_ports$economy) levels(df.world_ports$port) names(df.world_ports)
Notice again that after performing this major transformation of the data, we’re inspecting the data. Specifically we’re using
Convert year to factor
When we reshaped our data and subsequently created the
We need to convert it to a factor variable. This is a straight forward use of
#================================================ # CAST YEAR AS FACTOR # - now that we have a 'year' variable after # reshaping, we're going to cast it as a factor #================================================ df.world_ports <- mutate(df.world_ports, year = as.factor(year))
Rename the levels of year
Now that we have our years in a single column (instead of spread across multiple columns), we’re going to recode the values of those years.
This is because when we scraped the data from Wikipedia and parsed it into our dataframe, it retained some artifacts that we don’t want.
Specifically, there are numbers inside of brackets that we want to remove. For example, if you examine the levels of
To do this, we’ll again use
There are more elegant ways to do this, but this is a simple method that’s relatively easy to comprehend.
#============================================= # RENAME FACTORS # - the levels of 'year' contain some info # that we don't need (e.g., the '[11]') in # '2004[11]' # - we'll recode manually to remove these #============================================= #------------------------------------- # GET LEVELS # - get the levels of the year factor # so we know what we need to change #------------------------------------- levels(df.world_ports$year) # "2004[11]" "2005[10]" "2006[9]" "2007[8]" "2008[7]" "2009[6]" "2010[5]" "2011[4]" # "2012[3]" "2013[2]" "2014[1]" #---------------------------------------- # RECODE 'year' # use mutate to change the year variable # - here, we'll use forcats::fct_recode() # to enumerate all of the mappings of # new names from the old names #---------------------------------------- df.world_ports <- mutate(df.world_ports , year = fct_recode(year ,"2014" = "2014[1]" ,"2013" = "2013[2]" ,"2012" = "2012[3]" ,"2011" = "2011[4]" ,"2010" = "2010[5]" ,"2009" = "2009[6]" ,"2008" = "2008[7]" ,"2007" = "2007[8]" ,"2006" = "2006[9]" ,"2005" = "2005[10]" ,"2004" = "2004[11]" ) ) # INSPECT NEW LEVELS levels(df.world_ports$year) head(df.world_ports)
Change ‘volume ‘ variable to numeric
Now we need to modify the
When we initially scraped the data, the shipping volume data was read in as character data because it contained commas. But later, to analyze these data, we need them to be numerics.
To convert from character to numeric, we first need to strip out the commas.
So, we’re going to use
Once again, we’re executing this inside of
#========================================================= # CHANGE 'volume' VARIABLE TO NUMERIC # - the 'volume' variable comes in as a char # - we need to transform it to a numeric # - to do this we first need to strip out the commas using # stringr::str_replace(), then re-cast the variable as # a numeric using as.numeric() #========================================================= # CAST 'volume' VARIABLE df.world_ports <- mutate(df.world_ports, volume = as.numeric(str_replace(volume, ",", ""))) # INSPECT head(df.world_ports) glimpse(df.world_ports)
Calculate shipping-volume rankings, by year
Ok, our numeric shipping volume data is now reformatted into a form that’s ready for analysis.
When we analyze our data though, it would be good to be able to rank the data by shipping volume for every year.
For example, we’ll want to answer the question, “for 2014, which port had the highest volume? the second highest? third, fourth, …”
We’ll not only want to answer that question for 2014, but we’ll want to know the rankings for each individual year. This will allow us to examine the rankings for any particular year, but also examine changes in rankings year-by-year.
Again, to do this, we need to calculate a rank.
Note that the data set already has a “
To drop the existing
After that, we’re calculating a new
You’ll notice that we’re taking the data frame
Notice that inside of
This will give us a ranking from top to bottom of the busiest ports, by year.
#======================================================= # RANK BY YEAR # - here, we're going to create a rank of "busiest port" # by year. To do this, we need to group our data #======================================================= # DROP the old 'rank' variable df.world_ports <- select(df.world_ports, -rank) # INSPECT glimpse(df.world_ports) # RE-RANK df.world_ports <- df.world_ports %>% group_by(year) %>% mutate(rank = min_rank(desc(volume))) %>% ungroup() # INSPECT AGAIN # - we now have new rankings glimpse(df.world_ports)
Create ‘continent ‘ variable
Now, we’re going to create a new variable called
This is pretty straight forward. It’s just the continent that each port is on (i.e., North America, Europe, Asia, etc).
We’ll use this variable to analyze our data by continent (this will be a good variable upon which to group, subset, etc).
To do this, we’ll first list the countries, by executing
When we do this, notice that once again our creation of this new variable is occurring inside of a call to
#=========================================== # ADD CONTINENT VARIABLE # - during analysis, it will be interesting # to analyze the data by continent # - to do this, we'll create a new variable #=========================================== #----------------------------------------------------- # LIST COUNTRIES # - these are the values that we'll have to 'collapse' # into a new factor #----------------------------------------------------- levels(df.world_ports$economy) # [1] "Belgium" "Brazil" "Canada" "China" # [5] "Egypt" "Germany" "Hong Kong" "India" # [9] "Indonesia" "Italy" "Japan" "Malaysia" # [13] "Malta" "Morocco" "Netherlands" "Oman" # [17] "Panama" "Philippines" "Saudi Arabia" "Singapore" # [21] "South Korea" "Spain" "Sri Lanka" "Taiwan" # [25] "Thailand" "Turkey" "United Arab Emirates" "United Kingdom" # [29] "United States" "Vietnam" #---------------------------------------------------------- # CREATE VARIABLE: 'continent' # - here we're using forcats::fct_collapse() to "collapse" # our country values down into a 'continent' variable #---------------------------------------------------------- df.world_ports <- df.world_ports %>% mutate(continent = fct_collapse(economy ,South_America = c("Brazil","Panama") ,North_America = c("Canada","United States") ,Asia = c("Japan","China","Hong Kong","India","Indonesia" ,"Malaysia","Oman","Philippines","Saudi Arabia" ,"Singapore","South Korea","Sri Lanka" ,"Taiwan","Thailand","United Arab Emirates","Vietnam") ,Europe = c("Belgium","Germany","Italy","Malta","Netherlands" ,"Spain","Turkey","United Kingdom") ,Africa = c("Egypt","Morocco") ) ) # INSPECT glimpse(df.world_ports) head(df.world_ports) levels(df.world_ports$continent) #---------------------------------------------- # CHECK GROUPINGS # - this is just a simple check to see that all # of the countries have been coded into the # correct continent #---------------------------------------------- df.world_ports %>% group_by(continent, economy) %>% summarize(1) %>% print.data.frame()
Reorder the variables
Ok, it’s the home stretch. We’re almost done wrangling our data.
One final thing we’ll do is re-order the variables.
They’re currently in an un-intuitive order, so we’ll reorganize them such that
To do this, we’ll first get all of the variable names using
Then we’ll use those names in a call to the
#========================================= # REORDER THE VARIABLES # - the variables are not in an intuitive # order in the data frame # - we'll re order them #========================================= # CHECK the number of cols df.world_ports %>% ncol() # 9 columns # GET NAMES df.world_ports %>% names() # [1] "port" "economy" "lon" "lat" "port_label" "year" "volume" "rank" # [9] "continent" #----------------------------------- # REORDER VARIABLES # we'll use dplyr::select to do this #----------------------------------- df.world_ports <- select(df.world_ports, rank, year, continent, economy, port, port_label, lon, lat, volume) # RE-CHECK the number of cols (just in case) df.world_ports %>% ncol() # 9 columns # INSPECT glimpse(df.world_ports) head(df.world_ports, n = 10)
Notice again that we’re doing a few small check and some data inspection.
For example, before re-ordering the variables, I used
Spot check a few values
Finally, we’re going to spot check a few values to make sure that they are correct.
We’ve performed quite a few transformations on the data, so it’s entirely possible that something went wrong. We need to make sure that the values are the same as the values in the original Wikipedia table.
To do this, we’ll simply subset down to an exact value. Then we’ll compare the value provided by our data to the value on Wikipedia.
To get specific values out of our data frame, we’re going to use
We’ll then use
#================================================================ # SPOT CHECK # - as a final test that everything is OK, we'll spot check some # numbers and compare against the original data # - ultimately, we just want to make sure that the original data # is intact after all of these transformations # - check the outputs against the original: # https://en.wikipedia.org/wiki/List_of_busiest_container_ports #================================================================ df.world_ports %>% filter(year == '2012', port == 'Guangzhou') %>% select(volume) # 14744 OK df.world_ports %>% filter(year == '2007', port == 'Guangzhou') %>% select(volume) # 9200 OK df.world_ports %>% filter(year == '2005', port == 'Rotterdam') %>% select(volume) # 9287 OK df.world_ports %>% filter(year == '2005', port == 'Yingkou') %>% select(volume) # 634 OK df.world_ports %>% filter(year == '2004', port == 'Yingkou') %>% select(volume) # NA OK df.world_ports %>% filter(year == '2007', port == 'Keelung') %>% select(volume) # NA OK df.world_ports %>% filter(year == '2014', port == 'Seattle/Tacoma') %>% select(volume) # 3456 OK df.world_ports %>% filter(year == '2009', port == 'Nagoya') %>% select(volume) # 2113 OK
They’re correct. We could check more, but these look good.
You’ll also be able to see in the next blog post, what the data look like when we visualize them. Aside from using visualizations for analysis we’ll also be able to examine them to see if anything looks incorrect.
Want to see part 2?
In the next blog post of this data analysis series, we’re going to use visualizations to analyze the data and tell a story.
To see part 2, SIGN UP for the email list.
When the next blog post is published, you’ll be notified immediately by email.
The post How to really do an analysis in R (part 1, data manipulation) appeared first on SHARP SIGHT LABS.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.