Mirror, mirror on the wall, which is your favorite German car brand in the world…
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Or how to create a world map in R with survey data, World Bank indicators, ggplot2 and rworldmap
by timwinke
When people think about Germany, what comes to their mind? Oktoberfest, ok – but Mercedes might be second or BMW or Porsche. German car brands have a solid reputation all over the world, but how popular is each brand in different countries?
There are plenty of survey data out there but hardly anyone collects answers within a couple of days from 6 continents. A new start-up called Dalia Research found a way to use smartphone and tablet networks to conduct surveys. It’s not a separate app but works via thousands of apps where targeted users decide to take part in a survey in exchange for an incentive.
In August 2014, they asked 51 questions to young mobile users in 64 countries including Colombia, Iran and the Ukraine. This is impressive – you have access to opinions of 32 000 people collected within 4 days from all over the world – 500 respondents in each country – about their religion, what they think about the Unites States, about where the EU has global influence or if Qatar should host the 2022 FIFA World Cup, and also: “What is your favorite German car brand?”.
The survey data are now publicly available at http://blog.daliaresearch.com/ask-the-world/. Dalia Research collects interesting survey visualisations on their facebook timeline as part of a competition till Dec 15th. You can win the chance to submit your own 5 questions in the next round of the study. This is my competition post.
Surprisingly, BMW seems to be the most popular German car brand – and Volkswagen does not reach the pole position in any country. While creating the post, I figured out many other things like how to merge cross-country data with World Bank indicators and how to create a world map using rworldmap
and ggplot2
. In this blog post, I will quickly show how to do it.
To get a first impression about the data have a look at the Global Study report page. It’s a bit tricky to compare country answers using their report page, so let’s download and read the csv file into R. The package RCurl
facilitates https requests.
#install.packages(c("dplyr", "ggplot2", "RCurl", "rworldmap", "WDI")) library("dplyr") # data manipulation library("ggplot2") # visualsation library("RCurl") # network client interface library("rworldmap") # get world map with iso codes library("WDI") # World Bank indicators library("mapproj") # Converts latitude/longitude into projected coordinates #establish https connection to download csv file URL <- "https://s3-eu-west-1.amazonaws.com/reports.dashboard/public/oak_fb/DaliaResearch_ProjectOak_Dataset.csv" x <- getURL(URL, ssl.verifypeer=FALSE) car_raw <- read.csv2(textConnection(x))
I convert the year of birth variable to get the age of the respondent and select the variable carbrand
. As you can see, I’m a big fan of dplyr
for data manipulation by Hadley Wickham and Romain Francois and the forward-pipe operator %>%
.
# First data manipulations and variable selection car <- tbl_df(car_raw) %>% mutate(age = as.numeric(format(Sys.time(), "%Y")) - ybirth, #turn year of birth to age carbrand = as.character(carbrand) #refers to the question: What's your favourite German car brand? ) %>% select(iso2c, countryname, device_kind, device_platform, gender, age, household_income, education,carbrand) %>% mutate(carbrand = ifelse(is.na(carbrand), "No response", carbrand))
To get summary statistics, I group by carbrand and country, i.e., the 2-digit country code iso2c, and use dplyr’s summarise
function to return the counts of respondents before sorting or arrange
them.
# compute country statistics car_countrystat <- car %>% group_by(iso2c, carbrand) %>% summarise(count = n(), countryname = unique(countryname)) %>% arrange(carbrand,desc(count)) head(car_countrystat) #AR indicates ## Source: local data frame [6 x 4] ## Groups: iso2c ## ## iso2c carbrand count countryname ## 1 AR Audi 110 Argentina ## 2 AR BMW 117 Argentina ## 3 AR I don't know any of these brands. 15 Argentina ## 4 AR Mercedes-Benz 85 Argentina ## 5 AR No response 1 Argentina ## 6 AR Porsche 87 Argentina
Creating bar plots for country groups
The next step is to create a barplot which stacks the car brand responses for several countries or country groups.
I wrote a function which binds country groups together. It takes some data as an argument, filters only the indicated country or countries and gives a meaningful name of the group to the variable cgroup
(I know, there might be more elegant ways to do that…) For lengthy country groups like the EU-28, I use a new variable eu
.
# initiate country groups eu <- c('BE','DE','FR','IT','LU','NL','UK','IE','DK','GR','ES','PT','AT','FI','SE','CY','CZ','EE','HU','LT','LV','MT','PL','SI','BG','RO','HR', 'SK') # EU-28 countries bindcgroups <- function(data) { cgroup1 <- data %>% filter(iso2c %in% c('JP')) %>% mutate(cgroup = "Japan") cgroup2 <- data %>% filter(iso2c %in% "CN") %>% mutate(cgroup = "China") cgroup3 <- data %>% filter(iso2c %in% "US") %>% mutate(cgroup = "U.S.A.") cgroup4 <- data %>% filter(iso2c %in% "IN") %>% mutate(cgroup = "India") cgroup5 <- data %>% filter(iso2c %in% eu) %>% mutate(cgroup = "EU-28") cgroup6 <- data %>% filter(iso2c %in% "DE") %>% mutate(cgroup = "Germany") cgroup7 <- data %>% filter(iso2c %in% "RU") %>% mutate(cgroup = "Russia") cgroup8 <- data %>% filter(iso2c %in% "UA") %>% mutate(cgroup = "Ukraine") cgroup9 <- data %>% mutate(cgroup = "World") cgroup10 <- data %>% filter(iso2c %in% "BR") %>% mutate(cgroup = "Brasil") cgroup11 <- data %>% filter(iso2c %in% "IR") %>% mutate(cgroup = "Iran") cgroup12 <- data %>% filter(iso2c %in% "KR") %>% mutate(cgroup = "South Korea") # KR Korea, Rep. cgroup13 <- data %>% filter(iso2c %in% "ID") %>% mutate(cgroup = "Indonesia") cgroup14 <- data %>% filter(iso2c %in% c("EC", "PE", "VE", "CO")) %>% mutate(cgroup = "Peru-Ecuador-Venezuela-Colombia") cgroup15 <- data %>% filter(iso2c %in% c('CL','AR','UY')) %>% mutate(cgroup = "Chile-Argentina-Uruguay") cgroup16 <- data %>% filter(iso2c %in% c('TW')) %>% mutate(cgroup = "Taiwan") cgroup_bind <- rbind(cgroup1,cgroup2,cgroup3,cgroup4,cgroup5,cgroup6,cgroup7,cgroup8,cgroup9, cgroup10,cgroup11,cgroup12,cgroup13,cgroup14,cgroup15,cgroup16) return(cgroup_bind) } car_barplot <- bindcgroups(data = car_countrystat)
Time for plotting! I’d like to display the percentages along the bars. This requires some computations to get the center position of each bar color. I use ggplot2
by Hadley Wickham and Winston Chang for visualisation.
# create summary statistics for country groups car_barplot <- car_barplot %>% group_by(cgroup, carbrand) %>% summarise(countn = sum(count)) %>% #total counts by country groups and carbrand group_by(cgroup) %>% mutate(groupsum = sum(countn), #total counts per country group percent = countn*100/groupsum, # percentage for each carbrand position = cumsum(percent) - 0.5*(percent) #find center of each carbrand part along the bar ) # create bar plot p <- ggplot(data=car_barplot, aes(x=cgroup, y=percent, fill=carbrand)) p + geom_bar(stat="identity") + xlab("") + ylab("Share of responses") + theme_bw() + coord_flip() + guides(fill = guide_legend( title="What's your favourite German car brand?", keywidth = 0.7, keyheight = 0.7, reverse=F, title.position="top", nrow = 3, byrow=T)) + theme(legend.position="bottom") + geom_text(aes(x=cgroup, y=position, label=sprintf("%1.0f%%", percent)),size=3)
Joining the data with World Bank indicators
Country data can easily be joined with World Development Indicators provided by the World Bank. Vincent Arel-Bundock created the WDI
package for R which does the retrieval job. To have a look at all indicators you can use the function WDIsearch()
for an overview. I grab the 2013 GDP per capita indicator, take the geographical information for all capitals and select relevant variables.
The World Bank does not provide explicit information on Taiwan (see here) which has to be added manually as 500 people from Taiwan participated in the Dalia Research survey.
library("WDI") # package to retrieve World Bank data #WDIsearch() #To have a look at all indicators wdi_gdp <- WDI(country = "all", indicator = "NY.GDP.PCAP.CD",start = 2013, end = 2013, extra = T, cache = NULL) # NY.GDP.PCAP.CD refers to "GDP per capita (current US$)" wdi_gdp <- wdi_gdp %>% mutate(gdp2013 = NY.GDP.PCAP.CD, capital_lon = as.numeric(paste(longitude)), #factor to numeric capital_lat = as.numeric(paste(latitude))) %>% select(iso2c, country, gdp2013, capital, capital_lon, capital_lat) wdi_gdp$capital <- as.character(wdi_gdp$capital) wdi_gdp$capital[wdi_gdp$country %in% "Israel"] <- "Jerusalem" #No capital for Israel provided taiwan <- data.frame(iso2c=as.factor("TW"), country=as.factor("Taiwan"), gdp2013=475.33, # http://www.tradingeconomics.com/taiwan/gdp capital=as.factor("Taiwan"), capital_lon=120.7120023, capital_lat=22.6158015) wdi_gdp <- rbind(wdi_gdp, taiwan) # Now lets merge the previous survey data with the World Bank indicators car <- car %>% merge(wdi_gdp, by="iso2c", all.x=T) car_countrystat <- car_countrystat %>% merge(wdi_gdp, by="iso2c", all.x=T) head(car_countrystat) ## iso2c carbrand count countryname country ## 1 AR Audi 110 Argentina Argentina ## 2 AR BMW 117 Argentina Argentina ## 3 AR I don't know any of these brands. 15 Argentina Argentina ## 4 AR Mercedes-Benz 85 Argentina Argentina ## 5 AR No response 1 Argentina Argentina ## 6 AR Porsche 87 Argentina Argentina ## gdp2013 capital capital_lon capital_lat ## 1 14760 Buenos Aires -58.42 -34.61 ## 2 14760 Buenos Aires -58.42 -34.61 ## 3 14760 Buenos Aires -58.42 -34.61 ## 4 14760 Buenos Aires -58.42 -34.61 ## 5 14760 Buenos Aires -58.42 -34.61 ## 6 14760 Buenos Aires -58.42 -34.61
Plotting a world map with the most favourite car brand for each country
Last but not least, let’s see how to construct a simple world map which shows the most favourite car brand for each country. I get the car brand with the most frequent votes by using dplyr's
windowed rank function. The first value of min_rank
is the lowest value so we need to use a descending order to get the most frequent value.
#only take the top 1 car brand by country car_topbrand <- car_countrystat %>% group_by(iso2c) %>% filter(min_rank(desc(count)) <= 1) car_topbrand$carbrand <- as.factor(car_topbrand$carbrand) levels(car_topbrand$carbrand) <- c(levels(car_topbrand$carbrand), "Volkswagen") # Volkswagen never was number one but should be in the map legend
I use the package rworldmap
by Andy South to retrieve the underlying world map. The package has the convenient function joinCountryData2Map
which makes joints with other data sets simple. The 2-digit country code iso2c
is the unique identifier.
library("rworldmap") # get world map with iso codes car_map <- joinCountryData2Map(car_topbrand, joinCode = "ISO2", nameJoinColumn = "iso2c") ## 66 codes from your data successfully matched countries in the map ## 0 codes from your data failed to match with a country code in the map ## 179 codes from the map weren't represented in your data
Next, I use the fortify
function of the ggplot2
package to turn the SpatialPolygonsDataFrame into a data frame which indicates all “edges” of a country shape. Fortify
removes all other variables so I merge the previous data with the SpatialPolygon again. Don’t forget to reorder the data frame afterwards otherwise the map will look messy.
#extract polygons out of spatial world data car_map_poly <- fortify(car_map) ## Regions defined for each Polygons car_map_poly <- merge(car_map_poly, car_map@data, by.x="id", by.y="ADMIN", all.x=T) car_map_poly <- car_map_poly %>% arrange(id, order)
Now the data is in the right format for ggplot2
to plot it. The group attribute of geom_polygon
is important to get the drawing of countries right. I removed grids, ticks and labels to get a clean world map. The color of each country indicates the GDP per capita and the text shows the most favourite car brand. The location of the text is the center of each country. The LON and LAT information of the country’s centroids comes along the rworldmap
already.
# plot world map ggplot() + coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) + geom_polygon(data = car_map_poly, aes(long, lat, group = group, fill=gdp2013),size = 0.3) + scale_fill_gradient(trans = "log",breaks=c(0,1000, 2000,5000,10000,25000,50000,100000), low="#f7fcb9", high="#2c7fb8") + theme_bw() + xlab(NULL) + ylab(NULL) + guides(fill = guide_legend( title='Text: "What is your favourite German car brand?" nSmartphone and tablet advert survey n(August 4-8th 2014, 500 users per country, Dalia Research, open data) nnColor: GDP per capita 2013 (current US$) - World Development Indicators (World Bank)', keywidth = 0.7, keyheight = 0.7, reverse=F, title.position="top")) + theme( plot.background = element_blank() ,panel.grid.major = element_blank() ,panel.grid.minor = element_blank() ,panel.border = element_blank() ,axis.ticks = element_blank() ,axis.text.x = element_blank() ,axis.text.y = element_blank() ,legend.position = "bottom" ,legend.direction = "horizontal" ) + geom_text(data=subset(car_map@data, !is.na(country)), aes(x=LON, y=LAT,label=carbrand), size=1.5)
Here we go! No clear sign of a wealth-car brand effect but some indication of spatial clustering in Europe and America. Of course, the data is not representative but you have additional education, device and income information of the respondents to see if there are some selection effects.
More information regarding the competition can be found under www.facebook.com/DaliaResearch where you can post your visualization and win world-wide answers to your own 5 question.
So, if you ever wanted to ask something to the world, now is your time!
The entire code of the post can be found here:
https://raw.githubusercontent.com/timwinke/worldmap/master/worldmapRcode.R
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.