
Visualizing possible cities for Amazon’s new headquarters (using R)




Last week, Amazon announced that it has started to search for a new city in which to build a second headquarters.

Among several selection criteria, they indicated that they’re looking for a city with more than 1 million people, and a city with a good pool of tech talent.

While reading about Amazon’s new HQ search on a news website, I encountered a dataset of cities that might qualify: cities with over 1 million people in the metro area, and the corresponding percent of people with college degrees in each city.

The news website already visualized it, but I want to show you how to do this in R. It's an excellent little exercise to show you, once again, how to scrape, wrangle, and visualize data.

With that in mind, we're going to scrape the data, wrangle it into shape (using dplyr and a few other tools), and then visualize it as a map using ggplot2.

I’ll preface this by saying that this is an imperfect analysis. We don’t know the full and final selection criteria, and even if we did, a full analysis would be far beyond the scope of a simple blog post.

Having said that, this is a good “first pass” at such an analysis: the quick-and-dirty version.

Furthermore, if you're getting involved in data science, this will give you some hints about how to use R as an analytical tool. If you're in marketing, sales, or operations, you could very easily use this quick-and-dirty analysis as a template and starting point for some of your own work.

Ok. Let’s jump in.

First, we’ll just load the packages that we will use.

#==============
# LOAD PACKAGES
#==============

library(rvest)      # web scraping: read_html(), html_nodes(), html_table()
library(tidyverse)  # dplyr, ggplot2, readr, etc.
library(stringr)    # string manipulation: str_extract()
library(ggmap)      # geocoding: geocode()

Next, we’ll use several functions from the < inline_code>rvest package to scrape the data and read it into a dataframe.

#=======
# SCRAPE
#=======
html.amz_cities <- read_html("https://www.cbsnews.com/news/amazons-hq2-cities-second-headquarters-these-cities-are-contenders/")


df.amz_cities <- html.amz_cities %>%
  html_nodes("table") %>%
  .[[1]] %>% 
  html_table()


# inspect
df.amz_cities %>% head()

Next, we’ll change the column names.

When we scraped the data, the column names were not read in properly from the HTML table, so we need to add them manually.

#====================
# CHANGE COLUMN NAMES
#====================

# inspect initial column names
colnames(df.amz_cities)

# assign new column names
colnames(df.amz_cities) <- c("metro_area", 'state', 'population_tot', 'bachelors_degree_pct')


# inspect
df.amz_cities %>% head()

As it turns out, when we scraped the data, the original column names (the column names that appeared on the website) ended up in the first row of our newly created dataframe.

Those values aren't actual data, so we need to remove the first row.

#==============================================
# REMOVE FIRST ROW
# - when we scraped the data, the column names
#   on the table were read in as the first row
#   of data.
# - Therefore, we need to remove the first row
#==============================================

df.amz_cities <- df.amz_cities %>% filter(row_number() != 1)

Now we’re going to modify the data type of two variables.

bachelors_degree_pct and population_tot need to be numeric variables, but when we scraped them, they were read in as character variables. This being the case, we're going to use some techniques to parse/coerce them into numerics.

#===================================================================================
# MODIFY VARIABLES
# - both bachelors_degree_pct and population_tot were scraped as character variables
#    but we need them in numeric format
# - we will use techniques to parse/coerce these variable from char to numeric
#===================================================================================

#--------------------------------
# PARSE AS NUMBER: population_tot
#--------------------------------

df.amz_cities <- mutate(df.amz_cities, population_tot = parse_number(population_tot))


# check
typeof(df.amz_cities$population_tot)

# inspect
df.amz_cities %>% head()


#-----------------------------
# COERCE: bachelors_degree_pct
#-----------------------------

df.amz_cities <- mutate(df.amz_cities, bachelors_degree_pct = as.numeric(bachelors_degree_pct))
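
If you're wondering why we used parse_number() for the population but as.numeric() for the percentage: the scraped population values presumably contain formatting characters such as thousands-separator commas, which as.numeric() can't convert, while readr's parse_number() strips them out first. A quick illustration (the value here is just a made-up example):

#------------------------------------------------
# ILLUSTRATION: parse_number() vs as.numeric()
# - the value below is a made-up example
#------------------------------------------------

as.numeric("20,153,634")      # returns NA (with a coercion warning)
parse_number("20,153,634")    # returns 20153634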

Next, we’re going to create a variable that contains the city name.

When we read in the data, there was a variable for 'metro_area.' This variable contains values like "New York-Newark-Jersey City." The metro_area variable might be useful for some things, but these broad metro names are likely to cause errors when we geocode our data (to get the lat/long coordinates). For geocoding, it will be better to have a specific city name.

This being the case, we will create a new city variable by extracting the city names from the metro names. To do this, we will use stringr::str_extract() along with a regular expression that pulls out the city name that we want.

#=============================================================
# CREATE VARIABLE: city
# - here, we're using the stringr function str_extract() to
#   extract the primary city name from the metro_area variable
# - to do this, we're using a regex to pull out the city name
#   prior to the first '-' character
#=============================================================

df.amz_cities <- df.amz_cities %>% mutate(city = str_extract(metro_area, "^[^-]*"))
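
If you want to convince yourself that the regex does what we expect, you can test it on a single metro name. The pattern "^[^-]*" matches everything from the start of the string up to (but not including) the first '-' character:

# quick sanity check on the regex
str_extract("New York-Newark-Jersey City", "^[^-]*")
# [1] "New York"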


Now that we have proper city names, we will geocode our data. We will use the geocode() function to retrieve the latitude and longitude for each city. Then we will merge the geo-data back onto the dataframe using cbind().
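
One caveat before running this: geocode() calls an external geocoding service, and depending on your version of ggmap it may require you to register a Google Maps API key first. If the geocoding call below fails for you, something along these lines should help; the key shown is just a placeholder:

# NOTE: in recent versions of ggmap, geocode() requires a registered
# Google Maps API key ("YOUR_API_KEY" is a placeholder for your own key)
# register_google(key = "YOUR_API_KEY")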

#=========================================
# GEOCODE
# - here, we're getting the lat/long data
#=========================================

data.geo <- geocode(df.amz_cities$city)

#inspect

data.geo %>% head()
data.geo

#========================================
# RECOMBINE: merge geo data to data frame
#========================================

df.amz_cities <- cbind(df.amz_cities, data.geo)
df.amz_cities
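
A side note on cbind(): this works because geocode() returns one row of coordinates per input city, in the same order as the input vector. If you'd rather stay within the tidyverse, dplyr's bind_cols() would accomplish the same thing:

# tidyverse alternative to cbind() (equivalent result)
# df.amz_cities <- bind_cols(df.amz_cities, data.geo)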


Quickly, we’ll use the < inline_code>dplyr::rename() function to rename the ‘< inline_code>lon‘ variable to ‘< inline_code>long.’

#==============================================================
# RENAME VARIABLE: lon -> long
# - we'll rename 'lon' to 'long', because 'long' is consistent
#   with the name for longitude in other data sources
#   that we will use
#==============================================================

df.amz_cities <- rename(df.amz_cities, long = lon)


# get column names
df.amz_cities %>% names()


Next, to make our data a little easier to read, we will reorder the variables. We'll organize it so that city, state, and metro_area come first, followed by the geo-data, with the analysis variables (population and 'college degree percent') at the end.

#==========================================
# REORDER COLUMN NAMES
# - here, we're just doing it manually ...
#==========================================

df.amz_cities <- select(df.amz_cities, city, state, metro_area, long, lat, population_tot, bachelors_degree_pct)


# inspect

df.amz_cities %>% head()

Now, we’re going to get a map. In order to visualize the data as a map, we need a map of the United States.

To get this, we will use map_data().

#================================================
# GET USA MAP
# - this is the map of the USA states, upon which
#   we will plot our city data points
#================================================

map.states <- map_data("state")
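
If you haven't used map_data() before, it's worth a quick look at what it returns: a dataframe of points that trace the state outlines, with a 'group' variable that ggplot2 uses to keep the individual polygons separate.

# inspect the map data
# - columns include: long, lat, group, order, region, subregion
map.states %>% head()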

Finally, we’re ready to plot.

We’ll initially do a “first iteration” to check that everything looks good.

#====================================
# PLOT
# - here, we're actually creating the 
#   data visualizations with ggplot()
#====================================


#------------------------------------------------
# FIRST ITERATION
# - this is just a 'first pass' to check that
#   everything looks good before we take the time
#   to format it
#------------------------------------------------
ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct))
 



At a high level, everything looks OK: the points are in the right locations, and the overall structure of the map is what we want.

Keep in mind that, compared to the finalized version below, the 'first iteration' is much simpler to build. This is a great example of the 80/20 rule in data analysis: in this visualization, you can get 80% of the way with only 20% of the total ggplot() code.

Now that we have an initial version, we’ll polish it by adding titles, formatting theme elements, and by adjusting the legends.

#--------------------------------------------------
# FINALIZED VERSION (FORMATTED)
# - this is the 'finalized' version with all of the
#   detailed formatting
#--------------------------------------------------

ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct*.01), alpha = .5) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct*.01), shape = 1) +
  coord_map(projection = "albers", lat0 = 30, lat1 = 40, xlim = c(-121,-73), ylim = c(25,51)) +
  scale_color_gradient2(low = "red", mid = "yellow", high = "green", midpoint = .41, labels = scales::percent_format()) +
  scale_size_continuous(range = c(.9, 11),  breaks = c(2000000, 10000000, 20000000),labels = scales::comma_format()) +
  guides(color = guide_legend(reverse = T, override.aes = list(alpha = 1, size = 4) )) +
  labs(color = "Bachelor's Degree\nPercent"
       ,size = "Total Population\n(metro area)"
       ,title = "Possible cities for new Amazon Headquarters"
       ,subtitle = "Based on population & percent of people with college degrees") +
  theme(text = element_text(colour = "#444444", family = "Gill Sans")
        ,panel.background = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,axis.text = element_blank()
        ,plot.title = element_text(size = 28)
        ,plot.subtitle = element_text(size = 12)
        ,legend.key = element_rect(fill = "white")
        ) 
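
If you want to save the finished map to a file, ggsave() will write out the most recently displayed plot. The filename and dimensions below are just examples; adjust them to taste.

# optionally, save the final map
# (filename and dimensions are just examples)
ggsave("amazon_hq_candidate_cities.png", width = 12, height = 8, dpi = 300)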




A quick note: this is not supposed to be a comprehensive analysis

I want to point out that this is not intended to be comprehensive or conclusive in any way. Without detailed selection criteria, it will be difficult to come to any solid conclusions.

Rather, this is intended to give you a hint of what’s possible using R tools. If you were so inclined, you could certainly extend this into a much more comprehensive analysis by gathering more data and producing more charts.

Creating great visualizations gets easier once you master your toolkit

As you progress as a data scientist, you will get better at creating visualizations like this.

If you practice and master individual R functions, you will be able to create visualizations like this quickly.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need to master the essential tools.

You need to know what tools are important, which tools are not important, and how to practice.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons.

You’ll learn:

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.

SIGN UP NOW

