Site icon R-bloggers

How to really do an analysis in R (part 1, data manipulation)

[This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For a couple of years, I’ve been writing about the importance of data analysis, saying that data analysis is essentially the foundation of data science itself.

While I admit that there’s room for argument (and I admit that the reality is a little more nuanced) I still firmly believe that data analysis is the true foundation of practical data science.

That is, if you want to learn practical data science (as opposed to theoretical, academic topics) you need to be able to absolutely shred a dataset.

If we drill into that just a little more, what that means is that you need to be able to take a dataset, wrangle it into shape, and use visualization and other techniques to find insights.

This is essentially why I’ve been consistently recommending that you start learning data science by getting a solid foundation in two core skill areas:

Once you learn these skills independently, you can use them together for analysis.

In practical terms, this means that as an R user, you should know < inline_code>ggplot2 and < inline_code>dplyr at a minimum. I also recommend learning a few other parts of the tidyverse, like < inline_code>tidyr, and < inline_code>lubridate.

What does an analysis look like in R?

I’ve been emphasizing the importance of visualization, wrangling, and analysis for a while, but you might be wondering, “what does it look like, in practice?”

I’m glad you asked.

As it turns out, last week I started asking questions about economic shifts in the world. In particular, I was asking questions about shifts in geopolitical and geoeconomic centers of power, and the rise of Asia. After some clicking around, I stumbled on a Wikipedia dataset about shipping volumes at the “busiest ports in the world.”

Aaaand, down the rabbit hole I went.

I decided to analyze the dataset, partially to satisfy my curiosity, but also to give you an example of what’s possible once you know a few critical skills of the tidyverse.

A quick note for beginners

If you’re a beginner, you should know that the code you’re about to see is “intermediate level” code. At it’s core, the code is pretty straightforward (really, I swear), but there are lots of little details that obscure the underlying simplicity.

Having said that, in an upcoming post I’m going to strip all of this down to bare bones to show you that you can learn 80% of this very quickly. If you want me to notify you when that post is available, definitely sign up for the email list … if you’re part of the Sharp Sight email list, you’ll be notified when I publish that future post.

In the meantime, let’s dive in. I want you to see just how powerful the tidyverse is for wrangling, visualizing, and analyzing data.

First things first: we need to get the data and hammer it into shape.

 

Get the data

First, we need to get the data. To do this, we’re going to scrape the data from Wikipedia by using the < inline_code>rvest package.

Admittedly, web scraping is not a topic for beginners. If you’re a beginner, you could just copy and paste the data into Excel, save it as a csv, and import it using the < inline_code>readr package.

However, I think this is a more interesting way to get the data (and it’s something that can be easily communicated/replicated through a blog post).

Ok, we’ll use < inline_code>read_html(), < inline_code>html_nodes(), and < inline_code>html_table() from < inline_code>rvest to get the data and parse it into a dataframe.

(As I noted, this is sort of an intermediate topic, so I’ll do a more in-depth tutorial on scraping in the future.)

######################################
# PART I: GET THE DATA 
# - We need to both get the data
#   as well as "wrangle" it into shape
######################################

#==============
# LOAD PACKAGES
#==============

library(tidyverse)
library(stringr)
library(forcats)
library(ggmap)
library(rvest)


#==========================
#SCRAPE DATA FROM WIKIPEDIA
#==========================

# SCRAPE
html.world_ports <- read_html("https://en.wikipedia.org/wiki/List_of_busiest_container_ports")
df.world_ports <- html_table(html_nodes(html.world_ports, "table")[[2]], fill = TRUE)

Inspect the data

After scraping and reading in the data, we’ll examine the data using < inline_code>glimpse().

This will give us the names of the columns, the dimensions of the data, and will print out the first several observations.

You’ll see this sort of data inspection after many of the procedures that we execute. It’s important to routinely examine your data after you change it. Essentially, as you execute a procedure, you want to make sure that the procedure did what you wanted it to. You want to make sure that everything looks “OK.”

Get into the habit of examining your data after important blocks of data wrangling code.

# INSPECT
glimpse(df.world_ports)

# Observations: 50
# Variables: 14
# $ Rank     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26...
# $ Port     <chr> "Shanghai", "Singapore", "Shenzhen", "Hong Kong", "Ningbo-Zhoushan", "Busan", "Qingdao", "Gua...
# $ Economy  <chr> "China", "Singapore", "China", "Hong Kong", "China", "South Korea", "China", "China", "United...
# $ 2014[1]  <chr> "35,268", "33,869", "23,798", "22,374", "19,450", "18,423", "16,624", "16,160", "14,750", "14...
# $ 2013[2]  <chr> "33,617", "32,240", "23,280", "22,352", "17,351", "17,690", "15,520", "15,309", "13,641", "13...
# $ 2012[3]  <chr> "32,529", "31,649", "22,940", "23,117", "16,670", "17,046", "14,503", "14,744", "13,270", "12...
# $ 2011[4]  <chr> "31,700", "29,937", "22,570", "24,384", "14,686", "16,185", "13,020", "14,400", "13,000", "11...
# $ 2010[5]  <chr> "29,069", "28,431", "22,510", "23,532", "13,144", "14,157", "12,012", "12,550", "11,600", "10...
# $ 2009[6]  <chr> "25,002", "25,866", "18,250", "20,983", "10,502", "11,954", "10,260", "11,190", "11,124", "8,...
# $ 2008[7]  <chr> "27,980", "29,918", "21,414", "24,248", "11,226", "13,425", "10,320", "11,001", "11,827", "8,...
# $ 2007[8]  <chr> "26,150", "27,932", "21,099", "23,881", "9,349", "13,270", "9,462", "9,200", "10,653", "7,103...
# $ 2006[9]  <chr> "21,710", "24,792", "18,469", "23,539", "7,068", "12,039", "7,702", "6,600", "8,923", "5,950"...
# $ 2005[10] <chr> "18,084", "23,192", "16,197", "22,427", "5,208", "11,843", "6,307", "4,685", "7,619", "4,801"...
# $ 2004[11] <chr> "14,557", "21,329", "13,615", "21,984", "4,006", "11,430", "5,140", "3,308", "6,429", "3,814"...

Notice that we have a the following variables:

Again, the years themselves are spread across columns (as column names), and the shipping volumes (which we’ll want to eventually analyze) are listed under those year columns. This is not the correct, ‘tidy’ structure of the data, so we’ll eventually need to reshape it.

Also notice that almost everything has been read in as character data. We’ll need to change several of these variables to factors, numerics, etc.

Rename variables

Next, we’re going to rename our variables by transforming them all to lower case.

By default, when we read in the data, several of the variables have capitalized first letters.

In programming, there are several different style conventions (and I’ll admit that my naming conventions are occasionally unconventional), but many people will tell you that there’s no good reason to have a capitalized first letter for a variable name. In most situations, you’ll want to remove leading capitals, if for no other reason than it makes them easier to type.

To execute this, we’re going to use < inline_code>colnames() and < inline_code>tolower().

On the right-hand side of the expression, < inline_code>colnames() lists the column names and < inline_code>tolower() transforms them to lower case. Here, we’re using the pipe operator (< inline_code>%>%) to “pipe” the names into < inline_code>tolower().

We then use the assignment operator (< inline_code>< -) so the new lower-case names serve as the input of < inline_code>colnames(df.world_ports) on the left-hand side. Remember that < inline_code>colnames() can both retrieve column names and set column names.

So ultimately, on the right-hand side of the expression we’re retrieving and transforming to lower case. Then on the left-hand side of the expression we’re re-setting the column names.

< inline_code>

#==========================
# RENAME VARIABLES: 
# - transform to lower case
#==========================

#-----------
# LOWER CASE
#-----------
colnames(df.world_ports) <- colnames(df.world_ports) %>% tolower()

# INSPECT
colnames(df.world_ports)

Get geo-spatial information (latitude/longitude)

Next, we’re going to “geocode” our observations so we can use a few geo-spatial visualization techniques later in our analysis. That is, we’re going to get the latitude and longitude associated with our observations.

More specifically, each row of our data frame (as it currently stands) represents a port. It is a specific location. Most frequently, it’s just the name of a city, but sometimes there is a specific name of a port.

We’re going to use < inline_code>ggmaps::geocode() to retrieve the latitude and longitude of these locations.

Working with geo-spatial data is a slightly more advanced topic, but < inline_code>geocode() is fairly straight-forward. You give it a place name, and it returns lat/lon coordinates. In this case, we’re going to feed it the names of the ports that are contained in the variable < inline_code>df.world_ports$port.

#===================================================
# GEOCODE
# - here, we're picking up lat/long from Google Maps
#   using ggmaps::geocode()
#===================================================


#--------------------------
# get data from google maps
#--------------------------
geocodes.world_ports <- geocode(df.world_ports$port)

We then need to merge the lat/lon data back onto < inline_code>df.world_ports. To do this, we’ll use < inline_code>cbind().

#--------------------------------------------------------
# COMBINE:
# - bind the new lat/long data to df.world_ports data frame
#--------------------------------------------------------
df.world_ports <- cbind(df.world_ports, geocodes.world_ports)

head(df.world_ports)

Manually code missing latitude/longitude data

When we used < inline_code>geocode(), it correctly retrieved the lat/lon data for most of our observations.

But, the function failed for a few of the ports. Essentially, < inline_code>geocode() couldn’t find those locations (or possibly found several matching locations and didn’t know which to select). Ultimately, for these observations, the function failed to retrieve the lat/lon data.

Consequently, we need to get that data another way.

To get the lat/lon data, we’re going to manually retrieve this information at http://www.latlong.net/convert-address-to-lat-long.html. If you go there, you’ll find that you can simply click on a location on the map, and it will return the latitude and longitude.

After manually obtaining the remaining lat/long coordinates (they are recorded in the commented code below) we’ll just hand-code them into our data frame.

This is probably a slightly inelegant solution, but it will work for the time being (and it’s simple enough that I can communicate it to you in a blog post).

#=========================================================================================
# RECODE lon and lat
# - There are 4 lon/lat values that weren't found with geocode()
# - We'll just hand code them
# - The values can be obtained at http://www.latlong.net/convert-address-to-lat-long.html
#
#
# Tanjung Pelepas, Johor Bahru: lon = 103.551035, lat = 1.362374
# Yingkou:..................... lon = 122.108231, lat = 40.266062
# Valencia, Spain:............. lon = -0.3762881, lat = 39.46991
# Malta Freeport:.............. lon = 14.537637 , lat = 35.816287
#
#=========================================================================================

df.world_ports <- df.world_ports %>%
                        mutate( lon = case_when(.$port == "Tanjung Pelepas" ~ 103.551035
                                                ,.$port == "Yingkou"        ~ 122.108231
                                                ,.$port == "Valencia"       ~ -0.3762881
                                                ,.$port == "Malta Freeport" ~ 14.537637
                                                ,TRUE ~ .$lon
                                                )
                                ,lat = case_when(.$port == "Tanjung Pelepas" ~ 1.362374
                                                ,.$port == "Yingkou"        ~ 40.266062
                                                ,.$port == "Valencia"       ~ 39.46991
                                                ,.$port == "Malta Freeport" ~ 35.816287
                                                ,TRUE ~ .$lat
                                                )
                                )

# CHECK
df.world_ports %>% filter(port == "Tanjung Pelepas") %>% select(lat,lon)
df.world_ports %>% filter(port == "Yingkou") %>% select(lat,lon)
df.world_ports %>% filter(port == "Valencia") %>% select(lat,lon)
df.world_ports %>% filter(port == "Malta Freeport") %>% select(lat,lon)

To be honest, this block of code may be a little challenging to understand because of how we’re referencing the data frame inside the < inline_code>case_when() statements.

Having said that, at the core, this code block us just using < inline_code>dplyr::mutate() to modify our < inline_code>lat and < inline_code>lon variables. We’re then using < inline_code>case_when() to code those variables for specific ports. So for example, when the port name is “Valencia,” set the < inline_code>lon variable to < inline_code>39.46991. Etcetera. That’s all this is doing.

Convert variables to factors

There are several variables that were parsed as character variables that we’d rather have as factors.

This being the case, we’re going to convert these variables to factors. Note that we’re ultimately using the < inline_code>dplyr verb < inline_code>mutate to access the data frame and modify the variables. You’ll probably recall from past tutorials that < inline_code>mutate allows us to “mutate” or “change” variables that are contained inside of a data frame. Again, in this case, we’re using it to allow us to modify our data frame and convert < inline_code>economy and < inline_code>port into factors.

#=============================
# convert variables to factors
#=============================
df.world_ports <- mutate(df.world_ports
                         , economy = as.factor(str_trim(economy))
                         , port = as.factor(port)
                         )

Create a ‘label’ variable

Next, we’re going to create an extra variable that we’ll essentially use only for labels when we create plots.

The reason for this, is that several of the port names are simply too long for some of our plots. When we begin to visualize the data (in the next blog post), you’ll see that some of the port names are so long, that they would get cut off if we used them.

That being the case, we’re going to create a new variable which will still contain the port names, but will use abbreviated names for the long-named locations.

To create this, we’re simply going to identify the names that are excessively long and create shortened names.

To identify the port names that are too long, we’re going to list them out using < inline_code>levels(df.world_ports$port). When we do this, we can visually identify names that are long.

After identifying long names, we’re using < inline_code>fct_recode() to recode those values into shorter versions. Notice that we’re actually creating a new variable here, named < inline_code>port_label. This is because we want to keep the full names intact and use them when we can; there is more information in the full names, which we don’t want to loose. So in this case, we’re just making a new secondary variable to use in special cases.

Note as well that all of this is occurring inside of < inline_code>dplyr::mutate(). As I already noted, < inline_code>mutate() allows us to modify a data frame and create (or change) variables.

#==============================================
# CREATE 'LABEL' VARIABLE
# - Several of the port names are too long when
#   we use them as plot labels, so we'll make
#   a new variable that has some abbreviated
#   names
#==============================================

#--------------------------------------------
# GET LEVELS
# - this will let us identify which labels to
#   modify
#--------------------------------------------

levels(df.world_ports$port)


#--------------------------------------------
# RECODE: 
# - use fct_recode() to create new variable
#--------------------------------------------

df.world_ports <-  mutate(df.world_ports
                          , port_label = fct_recode(port
                                                    ,"Saigon" = "Ho Chi Minh City (Saigon)"
                                                    ,"New York" = "New York and New Jersey"
                                                    ,"Jakarta" = "Tanjung Priok (Jakarta)"
                                                    ,"Bremen" = "Bremen/Bremerhaven"
                                                    ,"Istanbul" = "Ambarli (Istanbul)"
                                                    ,"Tangiers" = "Tanger-Med (Tangiers)"
                                                    ,"Dubai" = "Jebel Ali (Dubai)"
                                                    ,"Ningbo/Z-shan" = "Ningbo-Zhoushan"
                                                    )
                          )

Keep in mind that this process was actually iterative. I had to recode these names and then use them in test-plots (like bar charts and small multiple plots) to see if the names were truly short enough, and to make sure that I got them all.

I’ve emphasized the importance of iterative workflow in the past, so I want to point out that it was actively used here, even if you can’t directly see that process in the finalized code.

Reshape data into long format (AKA ‘tidy format’)

Now, we’re going to put the data into “tidy” form.

What is tidy form?

It’s best explained with an example. If you print out the data using the < inline_code>head() function, you’ll notice that each year is actually in its own column. This is ‘untidy.’

The ‘tidy’ form of the data would have the years in a ‘< inline_code>year‘ variable.

Essentially, this means that we need to reshape the data into a new format, transposing the columns names (‘< inline_code>2004[11]‘, ‘< inline_code>2005[10]‘, etc) into values of a single new column.

As we do this, we will also be creating a new variable called ‘volume.’ This will contain the shipping volumes that had been contained underneath our set of old year columns (i.e. < inline_code>2004[]).

To do this, we need to use < inline_code>gather() from the < inline_code>tidyr package.

(Note that if data reshaping seems a little complicated, don’t worry. < inline_code>tidyr can be a little hard to wrap your mind around. I’ll write a separate tutorial about it in the future.)

#=====================================
# RESHAPE WIDE TO LONG
# - this data is not in 'tidy' form
# - we need to reshape it to a longer
#   form using tidyr::gather()
#=====================================

# RESHAPE
df.world_ports <- df.world_ports %>%
                        gather(year,volume,4:14)



# INSPECT
head(df.world_ports)
levels(as.factor(df.world_ports$year))
levels(df.world_ports$economy)
levels(df.world_ports$port)
names(df.world_ports)

Notice again that after performing this major transformation of the data, we’re inspecting the data. Specifically we’re using < inline_code>head() to print out a few rows, and then we’re using < inline_code>levels() several times to inspect the levels of our factor variables. Specifically, you want to look at the output of < inline_code>levels(as.factor(df.world_ports$year)) to see that the values of our new ‘< inline_code>year‘ variable are actually the names of the years. (They are.)

Convert < inline_code>year to factor

When we reshaped our data and subsequently created the < inline_code>year variable, < inline_code>year was actually created as a character variable.

We need to convert it to a factor variable. This is a straight forward use of < inline_code>as.factor() inside of < inline_code>mutate().

#================================================
# CAST YEAR AS FACTOR
# - now that we have a 'year' variable after
#   reshaping, we're going to cast it as a factor
#================================================

df.world_ports <- mutate(df.world_ports, year = as.factor(year))

Rename the levels of < inline_code>year

Now that we have our years in a single column (instead of spread across multiple columns), we’re going to recode the values of those years.

This is because when we scraped the data from Wikipedia and parsed it into our dataframe, it retained some artifacts that we don’t want.

Specifically, there are numbers inside of brackets that we want to remove. For example, if you examine the levels of < inline_code>year, you’ll notice the value < inline_code>2004[11]. We want to remove the < inline_code>[11] from this value and the similar suffixes on the other year values.

To do this, we’ll again use < inline_code>fct_recode() inside of < inline_code>mutate.

There are more elegant ways to do this, but this is a simple method that’s relatively easy to comprehend.

#=============================================
# RENAME FACTORS
# - the levels of 'year' contain some info
#   that we don't need (e.g., the '[11]') in
#   '2004[11]'
# - we'll recode manually to remove these
#=============================================

#-------------------------------------
# GET LEVELS
#  - get the levels of the year factor 
#    so we know what we need to change
#-------------------------------------

levels(df.world_ports$year)

# "2004[11]" "2005[10]" "2006[9]"  "2007[8]"  "2008[7]"  "2009[6]"  "2010[5]"  "2011[4]" 
# "2012[3]"  "2013[2]"  "2014[1]" 


#----------------------------------------
# RECODE 'year'
# use mutate to change the year variable
# - here, we'll use forcats::fct_recode()
#   to enumerate all of the mappings of
#   new names from the old names
#----------------------------------------

df.world_ports <- mutate(df.world_ports
                         , year = fct_recode(year
                                            ,"2014" = "2014[1]"
                                            ,"2013" = "2013[2]"
                                            ,"2012" = "2012[3]"
                                            ,"2011" = "2011[4]"
                                            ,"2010" = "2010[5]"
                                            ,"2009" = "2009[6]"
                                            ,"2008" = "2008[7]"
                                            ,"2007" = "2007[8]"
                                            ,"2006" = "2006[9]"
                                            ,"2005" = "2005[10]"
                                            ,"2004" = "2004[11]"
                                           )
                           )


# INSPECT NEW LEVELS
levels(df.world_ports$year)
head(df.world_ports)

Change ‘< inline_code>volume‘ variable to numeric

Now we need to modify the < inline_code>volume variable that we created when we reshaped our data using < inline_code>gather().

When we initially scraped the data, the shipping volume data was read in as character data because it contained commas. But later, to analyze these data, we need them to be numerics.

To convert from character to numeric, we first need to strip out the commas.

So, we’re going to use < inline_code>str_replace() from the < inline_code>stringr package to strip out the commas, and then cast the result as a numeric using < inline_code>as.numeric().

Once again, we’re executing this inside of < inline_code>mutate, because mutate allows us to modify variables when they are contained inside of a data frame.

#=========================================================
# CHANGE 'volume' VARIABLE TO NUMERIC
# - the 'volume' variable comes in as a char
# - we need to transform it to a numeric
# - to do this we first need to strip out the commas using 
#   stringr::str_replace(), then re-cast the variable as 
#   a numeric using as.numeric()
#=========================================================

# CAST 'volume' VARIABLE
df.world_ports <- mutate(df.world_ports, volume = as.numeric(str_replace(volume, ",", "")))


# INSPECT
head(df.world_ports)
glimpse(df.world_ports)

Calculate shipping-volume rankings, by year

Ok, our numeric shipping volume data is now reformatted into a form that’s ready for analysis.

When we analyze our data though, it would be good to be able to rank the data by shipping volume for every year.

For example, we’ll want to answer the question, “for 2014, which port had the highest volume? the second highest? third, fourth, …”

We’ll not only want to answer that question for 2014, but we’ll want to know the rankings for each individual year. This will allow us to examine the rankings for any particular year, but also examine changes in rankings year-by-year.

Again, to do this, we need to calculate a rank.

Note that the data set already has a “< inline_code>rank” variable, but this actually contains the ranks for 2014. Basically, when we used < inline_code>gather() to reshape our data, the 2014 rankings were repeated for each and every year. This is incorrect, so we need to drop the existing < inline_code>rank variable and then re-calculate.

To drop the existing < inline_code>rank, we’ll just use < inline_code>select() from < inline_code>dplyr. Syntactically, we’re using the ‘minus’ sign to drop the variable by typing “< inline_code>-rank“. This drops the variable.

After that, we’re calculating a new < inline_code>rank variable by using a chain of < inline_code>dplyr verbs.

You’ll notice that we’re taking the data frame < inline_code>df.world_ports, then grouping by < inline_code>year (because we want to establish rankings within years). Then we’re using < inline_code>mutate() to modify the data frame and create a new variable called < inline_code>rank.

Notice that inside of < inline_code>mutate(), we’re setting the value of < inline_code>rank to < inline_code>min_rank(desc(volume)). Essentially, this code ranks our observations in descending order, by volume (i.e., the largest volume receives a rank of 1, the next largest gets the rank 2, etc). But, because we used < inline_code>group_by(), it will calculate the ranks within a year.

This will give us a ranking from top to bottom of the busiest ports, by year.

#=======================================================
# RANK BY YEAR
# - here, we're going to create a rank of "busiest port"
#   by year.  To do this, we need to group our data
#=======================================================

# DROP the old 'rank' variable
df.world_ports <- select(df.world_ports, -rank)

# INSPECT
glimpse(df.world_ports)

# RE-RANK
df.world_ports <- df.world_ports %>%
                    group_by(year) %>%
                    mutate(rank = min_rank(desc(volume))) %>%
                    ungroup()

# INSPECT AGAIN
# - we now have new rankings
glimpse(df.world_ports)

Create ‘< inline_code>continent‘ variable

Now, we’re going to create a new variable called < inline_code>continent.

This is pretty straight forward. It’s just the continent that each port is on (i.e., North America, Europe, Asia, etc).

We’ll use this variable to analyze our data by continent (this will be a good variable upon which to group, subset, etc).

To do this, we’ll first list the countries, by executing < inline_code>levels(df.world_ports$economy). We’ll then use that list and code a new factor variable on the basis of those countries. To do this, we’ll use < inline_code>fct_collapse() from the < inline_code>forcats package.

< inline_code>fct_collapse() allows us to recode a factor variable and ‘collapse’ its values into more general factor groupings.

When we do this, notice that once again our creation of this new variable is occurring inside of a call to < inline_code>dplyr::mutate().

#===========================================
# ADD CONTINENT VARIABLE
# - during analysis, it will be interesting
#   to analyze the data by continent
# - to do this, we'll create a new variable
#===========================================

#-----------------------------------------------------
# LIST COUNTRIES
# - these are the values that we'll have to 'collapse'
#   into a new factor
#-----------------------------------------------------

levels(df.world_ports$economy)


# [1] "Belgium"              "Brazil"               "Canada"               "China"               
# [5] "Egypt"                "Germany"              "Hong Kong"            "India"               
# [9] "Indonesia"            "Italy"                "Japan"                "Malaysia"            
# [13] "Malta"                "Morocco"              "Netherlands"          "Oman"                
# [17] "Panama"               "Philippines"          "Saudi Arabia"         "Singapore"           
# [21] "South Korea"          "Spain"                "Sri Lanka"            "Taiwan"              
# [25] "Thailand"             "Turkey"               "United Arab Emirates" "United Kingdom"      
# [29] "United States"        "Vietnam"



#----------------------------------------------------------
# CREATE VARIABLE: 'continent'
# - here we're using forcats::fct_collapse() to "collapse"
#   our country values down into a 'continent' variable
#----------------------------------------------------------
df.world_ports <- df.world_ports %>%
                     mutate(continent = fct_collapse(economy
                                                     ,South_America = c("Brazil","Panama")
                                                     ,North_America = c("Canada","United States")
                                                     ,Asia = c("Japan","China","Hong Kong","India","Indonesia"
                                                              ,"Malaysia","Oman","Philippines","Saudi Arabia"
                                                              ,"Singapore","South Korea","Sri Lanka"
                                                              ,"Taiwan","Thailand","United Arab Emirates","Vietnam")
                                                     ,Europe = c("Belgium","Germany","Italy","Malta","Netherlands"
                                                               ,"Spain","Turkey","United Kingdom")
                                                     ,Africa = c("Egypt","Morocco")
                                                    )
                            )


# INSPECT
glimpse(df.world_ports)
head(df.world_ports)
levels(df.world_ports$continent)


#----------------------------------------------
# CHECK GROUPINGS
# - this is just a simple check to see that all
#   of the countries have been coded into the
#   correct continent
#----------------------------------------------

df.world_ports %>%
  group_by(continent, economy) %>%
  summarize(1) %>%
  print.data.frame()

Reorder the variables

Ok, it’s the home stretch. We’re almost done wrangling our data.

One final thing we’ll do is re-order the variables.

They’re currently in an un-intuitive order, so we’ll reorganize them such that < inline_code>rank and < inline_code>year come first, followed by location information of increasing specificity, until we finally end the data frame (on the far right) with the shipping volume metrics.

To do this, we’ll first get all of the variable names using < inline_code>names().

Then we’ll use those names in a call to the < inline_code>dplyr verb < inline_code>select(). The < inline_code>select() verb is typically used to “select” columns (i.e., subset your data down to a specific set of columns that you list). However, in this case, we’re going to “select” all of our columns. We’re just going to select them in a new order.

#=========================================
# REORDER THE VARIABLES
# - the variables are not in an intuitive
#   order in the data frame
# - we'll re order them 
#=========================================

# CHECK the number of cols
df.world_ports %>% ncol()
# 9 columns


# GET NAMES
df.world_ports %>% names()
# [1] "port"       "economy"    "lon"        "lat"        "port_label" "year"       "volume"     "rank"      
# [9] "continent" 


#-----------------------------------
# REORDER VARIABLES
# we'll use dplyr::select to do this
#-----------------------------------

df.world_ports <- select(df.world_ports, rank, year, continent, economy, port, port_label, lon, lat, volume) 


# RE-CHECK the number of cols (just in case)
df.world_ports %>% ncol()
# 9 columns


# INSPECT
glimpse(df.world_ports)
head(df.world_ports, n = 10)

Notice again that we’re doing a few small check and some data inspection.

For example, before re-ordering the variables, I used < inline_code>ncol() to count the number of columns. Then after re-ordering them, I checked the number of columns again, just to make sure that they are all still there. It might seem unnecessary, but these minor checks are good ways to ensure that you’ve executed your data transformations properly. Checking your work can save you a lot of headaches, so do it, even if it seems “too basic.”

Spot check a few values

Finally, we’re going to spot check a few values to make sure that they are correct.

We’ve performed quite a few transformations on the data, so it’s entirely possible that something went wrong. We need to make sure that the values are the same as the values in the original Wikipedia table.

To do this, we’ll simply subset down to an exact value. Then we’ll compare the value provided by our data to the value on Wikipedia.

To get specific values out of our data frame, we’re going to use < inline_code>filter() and < inline_code>select().

< inline_code>filter() will allow us to filter our data down to an exact row. In this case, we’re going to filter down to a specific port for a specific year.

We’ll then use < inline_code>select() to select a specific column. In this case, we’ll select < inline_code>volume, which will give us the shipping volume for that port/year combo.

#================================================================
# SPOT CHECK
# - as a final test that everything is OK, we'll spot check some 
#   numbers and compare against the original data
# - ultimately, we just want to make sure that the original data 
#   is intact after all of these transformations
# - check the outputs against the original:
#   https://en.wikipedia.org/wiki/List_of_busiest_container_ports
#================================================================

df.world_ports %>% filter(year == '2012', port == 'Guangzhou') %>% select(volume)       # 14744 OK
df.world_ports %>% filter(year == '2007', port == 'Guangzhou') %>% select(volume)       # 9200 OK
df.world_ports %>% filter(year == '2005', port == 'Rotterdam') %>% select(volume)       # 9287 OK
df.world_ports %>% filter(year == '2005', port == 'Yingkou') %>% select(volume)         # 634 OK
df.world_ports %>% filter(year == '2004', port == 'Yingkou') %>% select(volume)         # NA OK
df.world_ports %>% filter(year == '2007', port == 'Keelung') %>% select(volume)         # NA OK
df.world_ports %>% filter(year == '2014', port == 'Seattle/Tacoma') %>% select(volume)  # 3456 OK
df.world_ports %>% filter(year == '2009', port == 'Nagoya') %>% select(volume)          # 2113 OK

They’re correct. We could check more, but these look good.

You’ll also be able to see in the next blog post, what the data look like when we visualize them. Aside from using visualizations for analysis we’ll also be able to examine them to see if anything looks incorrect.

Want to see part 2?

In the next blog post of this data analysis series, we’re going to use visualizations to analyze the data and tell a story.

To see part 2, SIGN UP for the email list.

When the next blog post is published, you’ll be notified immediately by email.

The post How to really do an analysis in R (part 1, data manipulation) appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.