How to read Stata DTA files into R
The example file, from the British Election Study, contains 2017 face-to-face post-election survey responses along with explanatory notes. Read the Stata DTA file into R with these two lines:
library(haven)
df <- read_dta("http://www.britishelectionstudy.com/wp-content/uploads/2018/01/bes_f2f_2017_v1.2.dta")
The data set is now stored as a data frame df with 357 variables. To check its properties, we type:
str(df)
This gives the following output:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2194 obs. of 357 variables:
 $ finalserialno : atomic 10115 10119 10125 10215 10216 ...
  ..- attr(*, "label")= chr "Final Serial Number"
  ..- attr(*, "format.stata")= chr "%12.0g"
 $ serial : atomic 000000399 000000398 000000400 000000347 ...
  ..- attr(*, "label")= chr "Respondent Serial Number"
  ..- attr(*, "format.stata")= chr "%9s"
 $ a01 : atomic nhs brexit society immigration ...
  ..- attr(*, "label")= chr "A1: Most important issue"
  ..- attr(*, "format.stata")= chr "%240s"
 $ a02 :Class 'labelled' atomic [1:2194] 1 0 -1 -1 1 -1 2 -1 2 2 ...
  .. ..- attr(*, "label")= chr "Best party on most important issue"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:13] NA NA NA 0 1 2 3 4 5 6 ...
  .. .. ..- attr(*, "names")= chr [1:13] "Not stated" "Refused" "Don`t know" "None/No party" ...
The above output shows that the variables are already set to the correct types. The first variable finalserialno is numeric (shown as atomic in the output), the third variable a01 is character, and the fourth variable a02 has a class of ‘labelled’, which can be converted to a factor or categorical variable (after we handle missing values).
Each variable has an associated label attribute to help with interpretation. For example, without having to look up the explanatory notes, we can see that variable a01 contains the responses to the question “A1: most important issue” and variable a02 contains the responses to “Best party on most important issue”.
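The labels can also be pulled out programmatically rather than read off the str() output. The following is a minimal sketch in base R. The small data frame toy is a hypothetical stand-in for df, with values and label attributes copied from the output above, so the snippet runs without downloading the file.

```r
# Hypothetical stand-in for df: two columns carrying "label" attributes,
# mimicking what read_dta() attaches to each variable.
toy <- data.frame(
  a01 = c("nhs", "brexit", "society"),
  a02 = c(1, 0, -1),
  stringsAsFactors = FALSE
)
attr(toy$a01, "label") <- "A1: Most important issue"
attr(toy$a02, "label") <- "Best party on most important issue"

# Label of a single variable
attr(toy$a02, "label")

# Named character vector of all variable labels (NA where a column has none)
var_labels <- vapply(
  toy,
  function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  },
  character(1)
)
var_labels
```

Replacing toy with df should give one label per survey variable, which can then be searched with grep() to locate questions on a given topic.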
Missing values
Stata supports multiple types of missing values. read_dta automatically handles missing values in numeric and character variables. For categorical variables, missing values are typically encoded by negative numbers. Section 5.3 of the explanatory notes describes the encoding for this file: -1 (Don’t know), -2 (Refused) and -999 (Not stated). We now convert all three of these values to NA.
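for (i in seq_along(df)) {
  # Only labelled (categorical) variables use negative codes for missing values.
  # Newer haven releases name this class "haven_labelled", so check for both.
  if (inherits(df[[i]], c("labelled", "haven_labelled")))
    df[[i]][df[[i]] < 0] <- NA
}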
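The loop above treats every negative code as missing. If only the three documented codes should be recoded, a more conservative variant could look like the base R sketch below. The helper name recode_missing and the toy vector are illustrative assumptions; they are not part of haven or of the survey file.

```r
# Codes documented as missing in Section 5.3 of the explanatory notes
missing_codes <- c(-1, -2, -999)

# Hypothetical helper: set only the listed codes to NA and leave any
# other values, including other negative numbers, untouched.
recode_missing <- function(x, codes = missing_codes) {
  x[x %in% codes] <- NA
  x
}

# Toy vector standing in for one categorical column (made-up values)
a02_codes <- c(1, 0, -1, -1, 1, -999, 2, -2)
a02_clean <- recode_missing(a02_codes)
a02_clean
# 1 0 NA NA 1 NA 2 NA
```

Listing the codes explicitly means an unexpected negative value stays visible in the data instead of silently becoming NA.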
Encoding categorical variables
The categorical variables of class “labelled” are stored as numeric vectors. A single command converts them into factors, so that each numeric code is replaced by its label:
df <- as_factor(df)
Note that we do this after converting the missing values to avoid spurious factor levels in the final dataset.
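To see what as_factor does to a single variable, here is a minimal sketch on a toy vector built with haven's labelled() constructor. It assumes the haven package is installed. The label "None/No party" is taken from the str() output earlier, while "Party A" and "Party B" are placeholder labels invented for illustration.

```r
# Assumes the haven package is installed; the example is skipped otherwise.
if (requireNamespace("haven", quietly = TRUE)) {
  # Hypothetical labelled vector with one value already set to NA.
  # "Party A" and "Party B" are placeholders, not labels from the survey.
  a02_toy <- haven::labelled(
    c(1, 0, NA, 2, 1),
    labels = c("None/No party" = 0, "Party A" = 1, "Party B" = 2),
    label = "Best party on most important issue"
  )

  # Convert the numeric codes to a factor whose levels are the value labels
  a02_factor <- haven::as_factor(a02_toy)
  print(a02_factor)
  print(levels(a02_factor))
}
```

Running levels() on a converted variable is a quick way to check which categories ended up in the factor.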
Find out more
You can also find out how to import and read Excel files into Displayr.