Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Note to self – Remember to serialize R objects as RDS files when it makes sense.
Importing Stata data into R
The European Social Survey recently announced that it had added Round 7 of its survey to its cumulative dataset, which can be downloaded in CSV, SPSS or Stata format.
While my instinctive preference for storing data is to use CSV, most measurements in survey data come with detailed variable and value labels.
Furthermore, as is the case in the European Social Survey, the missing values of survey data generally take several different values to code for different forms of nonresponse, depending on whether the respondents “did not know” what to answer, provided “no answer,” or “refused to answer” the question.
For these reasons, I tried to download the European Social Survey as a Stata dataset, only to realise later that the data had been produced with Stata 14, which means that it cannot be opened with older versions of Stata unless the data were saved with the saveold command and with the appropriate argument for my version of Stata.
Fortunately, I was able to read the data in R with haven. The package, which wraps around the ReadStat C library, can import SAS, SPSS and Stata files. Once imported, the data are available as a standard data frame, with value labels accessible via functions like print_labels and as_factor.
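As a minimal sketch of that workflow, the example below builds a tiny labelled dataset, writes it to a temporary DTA file, and reads it back with read_dta. The variable name happy and its labels are invented for illustration; they are not taken from the European Social Survey.

```r
library(haven)

# Hypothetical toy data standing in for a survey file (not real ESS variables).
toy <- tibble::tibble(
  id = 1:4,
  happy = labelled(
    c(1L, 2L, 2L, 3L),
    labels = c("Unhappy" = 1L, "Neutral" = 2L, "Happy" = 3L)
  )
)

# Write a Stata file to a temporary location, then import it again.
dta_path <- tempfile(fileext = ".dta")
write_dta(toy, dta_path)
ess <- read_dta(dta_path)

# Inspect the value labels attached to the imported column.
print_labels(ess$happy)

# Convert the labelled column to a regular factor that uses the labels as levels.
happy_factor <- as_factor(ess$happy)
levels(happy_factor)
```

With the real download, only the read_dta call would be needed, pointed at the path of the DTA file.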
Saving the data as an RDS file
Another issue that I then faced with the European Social Survey dataset was its size: while only 103.5 MB compressed, the uncompressed Stata DTA file for the complete (all variables, all waves) version of the cumulative dataset takes up 3.16 GB.
In comparison, the CSV file for the same dataset, which does not contain labels or detailed missing values, is 58.1 MB compressed and 559.7 MB uncompressed.
Here again, R offers a superior alternative to both the CSV and Stata formats: by saving the file as an RDS file, which creates a serialized version of the dataset and then saves it with gzip compression, I was able to bring the size of the dataset down to 51.6 MB.
Note that, when loaded into R, the RDS object still takes around 3 GB of (live) memory.
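The difference between size on disk and size in memory can be checked with base R alone. The sketch below uses a synthetic, highly repetitive data frame as a stand-in (an assumption made for illustration; real survey data will compress differently), so the numbers it prints are not comparable to the figures quoted above.

```r
# Synthetic stand-in data: column names and values are made up for this example.
n <- 100000
df <- data.frame(
  respondent = seq_len(n),
  country = rep(c("BE", "FR", "DE", "SE"), length.out = n),
  answer = rep(1:10, length.out = n),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path, compress = "gzip")  # gzip is also what compress = TRUE uses

# Size on disk, in bytes, for each format.
file.size(csv_path)
file.size(rds_path)

# Size of the object once loaded: compression only helps on disk, not in memory.
restored <- readRDS(rds_path)
object.size(restored)
```

Because readRDS returns the object rather than restoring it under its original name, the result can be assigned to any variable.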
The full code used to convert the European Social Survey data from the DTA (Stata) to the RDS (R) format follows. The code requires the haven package, which is part of Hadley Wickham’s tidyverse package suite.
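The original listing is not reproduced here, so what follows is only a sketch of how such a conversion could look, not the author's exact code. The helper name dta_to_rds and the demonstration file are hypothetical; the real ESS file name is deliberately not guessed.

```r
library(haven)

# Hypothetical helper: read a Stata DTA file and save it as a gzip-compressed RDS file.
# By default the output path is the input path with its extension swapped to .rds.
dta_to_rds <- function(dta_path, rds_path = sub("\\.dta$", ".rds", dta_path)) {
  stopifnot(file.exists(dta_path))
  data <- read_dta(dta_path)
  saveRDS(data, rds_path, compress = "gzip")
  invisible(rds_path)
}

# Demonstration on a small temporary file; replace demo_dta with the path
# to the downloaded DTA file to convert the real dataset.
demo_dta <- tempfile(fileext = ".dta")
write_dta(
  tibble::tibble(x = labelled(c(1L, 2L), labels = c("No" = 1L, "Yes" = 2L))),
  demo_dta
)

demo_rds <- dta_to_rds(demo_dta)

# The labels survive the round trip because the RDS file stores the R object as is.
ess <- readRDS(demo_rds)
```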
Update (December 14, 2016): having discussed the issue on Twitter, it appears that the data mentioned in this note can be compressed quite efficiently in Stata. That operation, however, requires Stata 14 or above, if Stata keeps its commitment to backward compatibility. There is currently no other way to load the file in earlier versions of Stata.