Site icon R-bloggers

Mirror, mirror on the wall, which is your favorite German car brand in the world…

[This article was first published on R – workspace13, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Or how to create a world map in R with survey data, World Bank indicators, ggplot2 and rworldmap

by timwinke

When people think about Germany, what comes to their mind? Oktoberfest, ok – but Mercedes might be second or BMW or Porsche. German car brands have a solid reputation all over the world, but how popular is each brand in different countries?

There are plenty of survey data out there but hardly anyone collects answers within a couple of days from 6 continents. A new start-up called Dalia Research found a way to use smartphone and tablet networks to conduct surveys. It’s not a separate app but works via thousands of apps where targeted users decide to take part in a survey in exchange for an incentive.

In August 2014, they asked 51 questions to young mobile users in 64 countries including Colombia, Iran and the Ukraine. This is impressive – you have access to opinions of 32 000 people collected within 4 days from all over the world – 500 respondents in each country – about their religion, what they think about the Unites States, about where the EU has global influence or if Qatar should host the 2022 FIFA World Cup, and also: “What is your favorite German car brand?”.

The survey data are now publicly available at http://blog.daliaresearch.com/ask-the-world/. Dalia Research collects interesting survey visualisations on their facebook timeline as part of a competition till Dec 15th. You can win the chance to submit your own 5 questions in the next round of the study. This is my competition post.

 

https://www.facebook.com/photo.php?fbid=10154745280545224&set=o.454021328028857&type=1

My competition post
https://www.facebook.com/photo.php?fbid=10154745280545224&set=o.454021328028857&type=1

 

Surprisingly, BMW seems to be the most popular German car brand – and Volkswagen does not reach the pole position in any country. While creating the post, I figured out many other things like how to merge cross-country data with World Bank indicators and how to create a world map using rworldmap and ggplot2. In this blog post, I will quickly show how to do it.

To get a first impression about the data have a look at the Global Study report page. It’s a bit tricky to compare country answers using their report page, so let’s download and read the csv file into R. The package RCurl facilitates https requests.

#install.packages(c("dplyr", "ggplot2", "RCurl", "rworldmap", "WDI"))
library("dplyr")   # data manipulation
library("ggplot2") # visualsation
library("RCurl") # network client interface
library("rworldmap") # get world map with iso codes
library("WDI") # World Bank indicators
library("mapproj") # Converts latitude/longitude into projected coordinates

#establish https connection to download csv file
URL <- "https://s3-eu-west-1.amazonaws.com/reports.dashboard/public/oak_fb/DaliaResearch_ProjectOak_Dataset.csv"
x <- getURL(URL, ssl.verifypeer=FALSE)
car_raw <- read.csv2(textConnection(x))

I convert the year of birth variable to get the age of the respondent and select the variable carbrand. As you can see, I’m a big fan of dplyr for data manipulation by Hadley Wickham and Romain Francois and the forward-pipe operator %>%.

# First data manipulations and variable selection
car <- tbl_df(car_raw) %>%
  mutate(age = as.numeric(format(Sys.time(), "%Y")) - ybirth, #turn year of birth to age
         carbrand = as.character(carbrand) #refers to the question: What's your favourite German car brand?
         ) %>%
  select(iso2c, countryname, device_kind, device_platform, gender, age, household_income, education,carbrand) %>%
  mutate(carbrand = ifelse(is.na(carbrand), "No response", carbrand))

 

To get summary statistics, I group by carbrand and country, i.e., the 2-digit country code iso2c, and use dplyr’s summarise
function to return the counts of respondents before sorting or arrange them.

# compute country statistics
car_countrystat <- car %>%
  group_by(iso2c, carbrand) %>%
  summarise(count = n(),
            countryname = unique(countryname)) %>%
  arrange(carbrand,desc(count)) 

head(car_countrystat) #AR indicates 

## Source: local data frame [6 x 4]
## Groups: iso2c
##
##   iso2c                          carbrand count countryname
## 1    AR                              Audi   110   Argentina
## 2    AR                               BMW   117   Argentina
## 3    AR I don't know any of these brands.    15   Argentina
## 4    AR                     Mercedes-Benz    85   Argentina
## 5    AR                       No response     1   Argentina
## 6    AR                           Porsche    87   Argentina

 

Creating bar plots for country groups

The next step is to create a barplot which stacks the car brand responses for several countries or country groups.

I wrote a function which binds country groups together. It takes some data as an argument, filters only the indicated country or countries and gives a meaningful name of the group to the variable cgroup (I know, there might be more elegant ways to do that…) For lengthy country groups like the EU-28, I use a new variable eu.

# initiate country groups
eu <- c('BE','DE','FR','IT','LU','NL','UK','IE','DK','GR','ES','PT','AT','FI','SE','CY','CZ','EE','HU','LT','LV','MT','PL','SI','BG','RO','HR', 'SK') # EU-28 countries

bindcgroups <- function(data) {
  cgroup1 <- data %>% filter(iso2c %in% c('JP')) %>% mutate(cgroup = "Japan")
  cgroup2 <- data %>% filter(iso2c %in% "CN") %>% mutate(cgroup = "China")
  cgroup3 <- data %>% filter(iso2c %in% "US") %>% mutate(cgroup = "U.S.A.")
  cgroup4 <- data %>% filter(iso2c %in% "IN") %>% mutate(cgroup = "India")
  cgroup5 <- data %>% filter(iso2c %in% eu) %>% mutate(cgroup = "EU-28")
  cgroup6 <- data %>% filter(iso2c %in% "DE") %>% mutate(cgroup = "Germany")
  cgroup7 <- data %>% filter(iso2c %in% "RU") %>% mutate(cgroup = "Russia")
  cgroup8 <- data %>% filter(iso2c %in% "UA") %>% mutate(cgroup = "Ukraine")
  cgroup9 <- data %>% mutate(cgroup = "World")
  cgroup10 <- data %>% filter(iso2c %in% "BR") %>% mutate(cgroup = "Brasil")
  cgroup11 <- data %>% filter(iso2c %in% "IR") %>% mutate(cgroup = "Iran")
  cgroup12 <- data %>% filter(iso2c %in% "KR") %>% mutate(cgroup = "South Korea") # KR Korea, Rep.
  cgroup13 <- data %>% filter(iso2c %in% "ID") %>% mutate(cgroup = "Indonesia")
  cgroup14 <- data %>% filter(iso2c %in% c("EC", "PE", "VE", "CO")) %>% mutate(cgroup = "Peru-Ecuador-Venezuela-Colombia")
  cgroup15 <- data %>% filter(iso2c %in% c('CL','AR','UY')) %>% mutate(cgroup = "Chile-Argentina-Uruguay")
  cgroup16 <- data %>% filter(iso2c %in% c('TW')) %>% mutate(cgroup = "Taiwan") 

  cgroup_bind <- rbind(cgroup1,cgroup2,cgroup3,cgroup4,cgroup5,cgroup6,cgroup7,cgroup8,cgroup9,
                     cgroup10,cgroup11,cgroup12,cgroup13,cgroup14,cgroup15,cgroup16)
  return(cgroup_bind)
}

car_barplot <- bindcgroups(data = car_countrystat)

Time for plotting! I’d like to display the percentages along the bars. This requires some computations to get the center position of each bar color. I use ggplot2 by Hadley Wickham and Winston Chang for visualisation.

# create summary statistics for country groups
car_barplot <- car_barplot %>%
  group_by(cgroup, carbrand) %>%
  summarise(countn = sum(count)) %>% #total counts by country groups and carbrand
  group_by(cgroup) %>%
  mutate(groupsum = sum(countn), #total counts per country group
         percent = countn*100/groupsum, # percentage for each carbrand
         position = cumsum(percent) - 0.5*(percent) #find center of each carbrand part along the bar
         ) 

# create bar plot
p <- ggplot(data=car_barplot, aes(x=cgroup, y=percent, fill=carbrand))
p + geom_bar(stat="identity") +
  xlab("") +  ylab("Share of responses") + theme_bw() + coord_flip() +
  guides(fill = guide_legend(
    title="What's your favourite German car brand?",
    keywidth = 0.7, keyheight = 0.7,
    reverse=F, title.position="top",
    nrow = 3, byrow=T)) +
  theme(legend.position="bottom") +
  geom_text(aes(x=cgroup, y=position, label=sprintf("%1.0f%%", percent)),size=3)

Joining the data with World Bank indicators

Country data can easily be joined with World Development Indicators provided by the World Bank. Vincent Arel-Bundock created the WDI package for R which does the retrieval job. To have a look at all indicators you can use the function WDIsearch() for an overview. I grab the 2013 GDP per capita indicator, take the geographical information for all capitals and select relevant variables.

The World Bank does not provide explicit information on Taiwan (see here) which has to be added manually as 500 people from Taiwan participated in the Dalia Research survey.

library("WDI")     # package to retrieve World Bank data
#WDIsearch() #To have a look at all indicators

wdi_gdp <- WDI(country = "all", indicator = "NY.GDP.PCAP.CD",start = 2013, end = 2013, extra = T, cache = NULL)
# NY.GDP.PCAP.CD refers to "GDP per capita (current US$)" 

wdi_gdp <- wdi_gdp %>%
  mutate(gdp2013 = NY.GDP.PCAP.CD,
         capital_lon = as.numeric(paste(longitude)), #factor to numeric
         capital_lat = as.numeric(paste(latitude))) %>%
  select(iso2c, country, gdp2013, capital, capital_lon, capital_lat)

wdi_gdp$capital <- as.character(wdi_gdp$capital)
wdi_gdp$capital[wdi_gdp$country %in% "Israel"] <- "Jerusalem" #No capital for Israel provided

taiwan <- data.frame(iso2c=as.factor("TW"),
                     country=as.factor("Taiwan"),
                     gdp2013=475.33, # http://www.tradingeconomics.com/taiwan/gdp
                     capital=as.factor("Taiwan"),
                     capital_lon=120.7120023,
                     capital_lat=22.6158015)
wdi_gdp <- rbind(wdi_gdp, taiwan)

# Now lets merge the previous survey data with the World Bank indicators
car <- car %>% merge(wdi_gdp, by="iso2c", all.x=T)
car_countrystat <- car_countrystat %>%  merge(wdi_gdp, by="iso2c", all.x=T)

head(car_countrystat)

##   iso2c                          carbrand count countryname   country
## 1    AR                              Audi   110   Argentina Argentina
## 2    AR                               BMW   117   Argentina Argentina
## 3    AR I don't know any of these brands.    15   Argentina Argentina
## 4    AR                     Mercedes-Benz    85   Argentina Argentina
## 5    AR                       No response     1   Argentina Argentina
## 6    AR                           Porsche    87   Argentina Argentina
##   gdp2013      capital capital_lon capital_lat
## 1   14760 Buenos Aires      -58.42      -34.61
## 2   14760 Buenos Aires      -58.42      -34.61
## 3   14760 Buenos Aires      -58.42      -34.61
## 4   14760 Buenos Aires      -58.42      -34.61
## 5   14760 Buenos Aires      -58.42      -34.61
## 6   14760 Buenos Aires      -58.42      -34.61

Plotting a world map with the most favourite car brand for each country

Last but not least, let’s see how to construct a simple world map which shows the most favourite car brand for each country. I get the car brand with the most frequent votes by using dplyr's windowed rank function. The first value of min_rank is the lowest value so we need to use a descending order to get the most frequent value.

#only take the top 1 car brand by country
car_topbrand <- car_countrystat %>% group_by(iso2c) %>%  filter(min_rank(desc(count)) <= 1) 

car_topbrand$carbrand <- as.factor(car_topbrand$carbrand)
levels(car_topbrand$carbrand) <- c(levels(car_topbrand$carbrand), "Volkswagen")
# Volkswagen never was number one but should be in the map legend

I use the package rworldmap by Andy South to retrieve the underlying world map. The package has the convenient function joinCountryData2Map which makes joints with other data sets simple. The 2-digit country code iso2c is the unique identifier.

library("rworldmap") # get world map with iso codes
car_map <- joinCountryData2Map(car_topbrand, joinCode = "ISO2", nameJoinColumn = "iso2c")

## 66 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 179 codes from the map weren't represented in your data

Next, I use the fortify function of the ggplot2 package to turn the SpatialPolygonsDataFrame into a data frame which indicates all “edges” of a country shape. Fortify removes all other variables so I merge the previous data with the SpatialPolygon again. Don’t forget to reorder the data frame afterwards otherwise the map will look messy.

#extract polygons out of spatial world data
car_map_poly <- fortify(car_map)
## Regions defined for each Polygons

car_map_poly <- merge(car_map_poly, car_map@data, by.x="id", by.y="ADMIN", all.x=T)
car_map_poly <- car_map_poly %>% arrange(id, order)

Now the data is in the right format for ggplot2 to plot it. The group attribute of geom_polygon is important to get the drawing of countries right. I removed grids, ticks and labels to get a clean world map. The color of each country indicates the GDP per capita and the text shows the most favourite car brand. The location of the text is the center of each country. The LON and LAT information of the country’s centroids comes along the rworldmap already.

# plot world map
ggplot() +
  coord_map(xlim = c(-180, 180), ylim = c(-60, 75))  +
  geom_polygon(data = car_map_poly, aes(long, lat, group = group,
               fill=gdp2013),size = 0.3) +
 scale_fill_gradient(trans = "log",breaks=c(0,1000, 2000,5000,10000,25000,50000,100000), low="#f7fcb9", high="#2c7fb8") +
  theme_bw() + xlab(NULL) + ylab(NULL) +
  guides(fill = guide_legend(
    title='Text: "What is your favourite German car brand?" nSmartphone and tablet advert survey n(August 4-8th 2014, 500 users per country, Dalia Research, open data) nnColor: GDP per capita 2013 (current US$) - World Development Indicators (World Bank)',
    keywidth = 0.7, keyheight = 0.7,
    reverse=F, title.position="top")) +
  theme(
    plot.background = element_blank()
   ,panel.grid.major = element_blank()
   ,panel.grid.minor = element_blank()
   ,panel.border = element_blank()
   ,axis.ticks = element_blank()
   ,axis.text.x = element_blank()
   ,axis.text.y = element_blank()
   ,legend.position = "bottom"
   ,legend.direction = "horizontal"
  ) +
geom_text(data=subset(car_map@data, !is.na(country)), aes(x=LON, y=LAT,label=carbrand), size=1.5)

 

 

Here we go! No clear sign of a wealth-car brand effect but some indication of spatial clustering in Europe and America. Of course, the data is not representative but you have additional education, device and income information of the respondents to see if there are some selection effects.

More information regarding the competition can be found under www.facebook.com/DaliaResearch where you can post your visualization and win world-wide answers to your own 5 question.

So, if you ever wanted to ask something to the world, now is your time!

 

The entire code of the post can be found here:
https://raw.githubusercontent.com/timwinke/worldmap/master/worldmapRcode.R

my blogroll


To leave a comment for the author, please follow the link and comment on their blog: R – workspace13.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.