[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A serious, decades-long attempt to understand different peoples’ values
David Hood (@Thoughfulnz) has been posting some interesting snippets of analysis using the World Values Survey data (like this example). This inspired me to have a look at the data myself; something that’s been on my to-do list for years. I have analysed it before, but a long while ago and for only a single specific purpose, and I wanted to get a more general overview of it.
The World Values Survey is an amazing multi-decade (started in 1981) network of scientists doing their best to create standard, internationally comparable, nationally representative data on values and their impact on social and political life.
It’s an impressive resource, with six waves of data already published and a seventh in preparation. I’ve only had a few hours today to look at it and that’s enough to make me want to come back. But let’s start with my finishing point for today – a big graphic of around 40 of the questions from the various surveys, showing how much people in different countries agree with various statements:
It really needs a big screen of course. Click on the image to open it in its own tab of the browser.
I find this fascinating and hope you do to. It’s a graphic that repays close attention, and the more I look at it the more interesting snippets I find. The ordering of the graphic with the most “agreed-with” statements at the top and the most “agreeing” countries on the right is crucial. It means that you can look up or across the graphic and look the squares of unusual colour that indicate exceptions. Here’s just some of the things that I noticed:
The United States stands out for its residents agreement with “Under some conditions, war is necessary to obtain justice”
The people of Hong Kong don’t agree that they are autonomous individuals
The people of Sweden are inclined to agree that all religions should be taught in public schools; and to disagree that “when jobs are scarce, employers should give priority to people of this country:
Residents of former Soviet Union countries (Belarus, Russia, Kazakhstan, Ukraine) stand out for not agreeing with “I see myself as part of my local community”
Data wrangling
I have in mind the plan to eventually grab all six waves of the World Values Survey, as the real benefit of this sort of data comes not just from cross-country comparisons but comparisons over time. I also wanted a way of wrangling the data that would lend itself to efficient analysis of multiple questions at once. Both of these needs pushed me away from the common way of treating survey with one row per respondent, one column per question to a more efficient data model as used in data warehousing.
At first I did this all in R, but then I found that when I got to the analysis stage and wanted to join my 38 million row fact table to the 90,000 row dimension table I ran out of memory. I needed a database so that hard disk could be used when the going gets tough, and to draw on all the optimisation built into that software for exactly this job. So I ended up with this toolchain (which is a very common one for me):
data import, cleaning and some normalising in R
upload into a database with the right disciplines and constraints, and (sometimes, not today) finishing off the cleaning and normalising
for analysis, do the heavy lifting in SQL queries but bring filtered or aggregated data into R for any statistical modelling or for presentation.
Here’s the R code for the first step. It draws first on code by David Hood for extracting the infromation on questions and answers that was encoded in the original file as attributes; then moves on to split the data into four tables representing respondents, questions, possible responses and (finally) the 38 million combinations of respondent – question – response combinations in a single long, thin table.
One of the key things I’ve done here is coded some of the responses with new classification columns such as agree – which is 1 for every response that is some kind of partly or fully agree, and 0 otherwise. This will let us analyse the multiple questions that use variants of this response set efficiently down the track. The invalid column for every variant on “not asked”, “dropped out”, “don’t know”, is also important.
Creating a database
Defining a data model
The data model I’ve split the data out into is basically this:
This is the classic data warehousing approach to this sort of data. A long, thin fact table; and shorter, wider dimension tables that hold all the text information and slowly changing information about the entities such as people, questions, etc. For those interested in these things, f_wvs is a good example of a “factless fact table” – meaning that other than indicating unique combinations of the three dimensions (this respondent gave this response to this question) it has no additional numeric information or “facts”.
There are many many advantages of this data model; as well as being convenient for analysis down the track, one of the advantages is that we can do cleaning and enhancements of the dimension data in a compact way. In this case, I can add classifications to the responses in the 28,000 row d_responses table much more efficiently (with each unique response represented only once) than if I’d tried to do it to the original data, where each response is repeated each time someone in the original survey ticked it. And those columns of 1s and 0s with names like agree and invalid are going to become crucial for efficient analysis down the track.
Here’s the SQL that I ran in SQL Server to set up that simple empty wvs schema ready to accept the data from R:
Uploading data from R
For larger datasets, my preferred approach to getting the data from R into a SQL Server database is:
define the database tables in SQL (as done above)
from R, save the tables as pipe-delimited text files with no columns. The fwrite function in data.table package is super-fast at doing this – less than a second for the 38 million rows in f_wvs in this case.
upload the data from the local disk to the database server via SQL Server’s bcp bulk import and export tool. This is a common enough task for me that I’ve made a convenience wrapper function bcp() in my frs R package of miscellaneous bits and pieces.
Here’s how that upload process looks in R:
Analysis
After all this, the analysis is the easy and fun part. Because this post is already long enough I’m going to just show the creation of that graphic I started with. My workflow at this point is writing SQL queries in SQL Server Management Studio until I’m satisfied I’ve got the right filtering and aggregation; then I copy and paste that query into my R script so everything is in one place. If the SQL gets much more complicated than the below I’d want to save separate SQL scripts rather than copy them into R.
So here it is, code to pull down the appropriately weighted responses for all questions in all countries that have an “agree” part of the response. The code below is specific to my setup in that I have an ODBC data source name “sqlserver” which I use to connect to the localhost database; that was just set up in the Windows ODBC administrator by following the prompts. If you’ve got a remote server it gets a little more complicated.
I’ve ignored statistical issues so far, in particular dealing with sampling uncertainty and the complex survey nature of the data. But I’m all set up to do this when I need to (and I need to understand better the sampling design to do it properly), down the track at some point I hope.
That’s all for now. But I’ll definitely be coming back to this data!
Related
To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.