Violin plots and regional income distribution
While preparing my slides on statistical graphics, I was playing around with the data when one plot really caught my eye.
I started off by plotting the time series of GNI per capita by country, and as expected it got quite messy and incomprehensible.
    ## Download and manipulate the data
    library(FAOSTAT)
    library(ggplot2)
    raw.lst = getWDItoSYB(indicator = c("NY.GNP.PCAP.CD", "SP.POP.TOTL"))
    raw.df = raw.lst[["entity"]]
    traw.df = translateCountryCode(raw.df, from = "ISO2_WB_CODE", to = "UN_CODE")
    mraw.df = merge(traw.df, FAOregionProfile[, c("UN_CODE", "UNSD_MACRO_REG")])
    final.df = mraw.df[!is.na(mraw.df$UNSD_MACRO_REG), ]

    ## Simple ugly time series plot
    ggplot(data = final.df, aes(x = Year, y = NY.GNP.PCAP.CD)) +
        geom_line(aes(col = Country)) +
        labs(x = NULL, y = "GNI per capita")
So I decided to compute the weighted average by region to examine the regional trends.
    ## Compute regional aggregates based on UN M49 definition
    reg.df = aggRegion(aggVar = "NY.GNP.PCAP.CD", weightVar = "SP.POP.TOTL",
        data = traw.df, keepUnspecified = FALSE, aggMethod = "weighted.mean",
        relationDF = data.frame(UN_CODE = FAOregionProfile[, "UN_CODE"],
            REG_NAME = FAOregionProfile[, "UNSD_MACRO_REG"]))

    ## Plot regional aggregates
    ggplot(data = reg.df[!is.na(reg.df$NY.GNP.PCAP.CD), ],
           aes(x = Year, y = NY.GNP.PCAP.CD)) +
        geom_line(aes(col = REG_NAME)) +
        labs(x = NULL, y = "GNI per capita", col = "")
I can now see the trend clearly, but there are two problems with this approach. First, the variability within each region is vast, so the weighted average, or any single summary statistic such as a quantile, can be misleading and does not tell me what is going on within the regions. Second, since at least 65% of the countries in a region must be present in order to compute the aggregate, no statistics were available prior to 1985.
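To make the second problem concrete, here is a minimal base R sketch of a population-weighted mean that is only reported when enough countries have data. It is not the actual `aggRegion` implementation: the helper `weighted_region_mean`, the `min_share` argument and the toy numbers are all made up for illustration, and I am assuming the 65% rule counts countries rather than population.

```r
## Toy data: hypothetical values, not real GNI or population figures
toy <- data.frame(
  region = c("A", "A", "A", "B", "B", "B"),
  gni    = c(1000, 3000, NA, 20000, NA, NA),
  pop    = c(50, 10, 5, 30, 20, 10)
)

## Hypothetical helper: population-weighted mean, returned only when
## at least `min_share` of the countries in the region report a value
weighted_region_mean <- function(value, weight, min_share = 0.65) {
  present <- !is.na(value) & !is.na(weight)
  if (mean(present) < min_share) return(NA_real_)
  weighted.mean(value[present], weight[present])
}

## One aggregate per region
agg <- sapply(split(toy, toy$region), function(d) {
  weighted_region_mean(d$gni, d$pop)
})
agg
```

Region A has two of three countries reporting, so it gets an aggregate; region B has only one of three, so it is dropped. With sparse early data, whole regions disappear in the same way.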
While I was carrying out regional comparisons with box plots and violin plots, I thought: why not plot them across time as well? So here is the final graph:
    ## Time series violin plot
    ggplot(data = final.df, aes(x = as.character(Year), y = NY.GNP.PCAP.CD)) +
        geom_violin() +
        scale_y_log10() +
        facet_wrap(~UNSD_MACRO_REG, ncol = 1, scales = "free_y") +
        scale_x_discrete(breaks = as.character(seq(1960, 2010, by = 10)),
                         labels = as.character(seq(1960, 2010, by = 10))) +
        labs(x = NULL, y = "GNI per capita")
Now I can compare the regions, but at the same time I can see the within-region income distribution. It amazes me how the income distribution diverges in Europe and Oceania while America and Asia move towards a bell-shaped distribution. Growth in Africa appears to be slow, but there are several countries which are growing at a faster rate and pushing the tail of the distribution. Although some of the variability in the density may have resulted from countries gaining independence, it is still informative.
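For readers less familiar with violin plots, each violin is essentially a kernel density estimate drawn symmetrically around a vertical axis. The base R sketch below shows the idea for a single hypothetical region and year, using simulated log-normal incomes rather than the real data. ggplot2 applies its own bandwidth, trimming and scaling defaults, so this only approximates what `geom_violin()` draws.

```r
set.seed(1)

## Simulated incomes for one hypothetical region-year (illustrative only)
income <- rlnorm(40, meanlog = log(5000), sdlog = 1)

## Kernel density of log10 income, mirroring the log10 scale used in the plot
dens <- density(log10(income))

## Half-width of the violin at each income level, scaled to a maximum of 1
half_width <- dens$y / max(dens$y)

## Mirroring the density around zero gives the violin outline
outline <- data.frame(
  log10_income = dens$x,
  left  = -half_width,
  right =  half_width
)
head(outline)
```

A "bell-shaped" violin on this scale means the incomes look roughly log-normal, while a violin with two bulges points to two groups of countries pulling apart.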