Working With SPSS© Data in R
Introduction
I needed to import SPSS© data for work. There are several options, but I’ve used both the foreign and haven R packages. I prefer haven because it integrates better with R’s tidyverse. I started using it instead of foreign once I verified that it behaves well with factors and solves the deprecated factor labels issue in newer R versions.
The Data
For this post I found the Diego Portales University National Survey. It is a publicly available survey, conducted nationwide since 2005, that asks people about their trust in institutions (e.g. government, police, firefighters) and their opinion on same-sex marriage, restrictions on smoking areas, and more.
Importing Data
#devtools::install_github("ropenscilabs/skimr")

# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
library(readr)

# Import foreign statistical formats
library(haven)

# Data
url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
sav = "2017-06-24_working_with_spss_data_in_r/udp_national_survey_2015.sav"

# mode = "wb" keeps the binary .sav file intact when downloading on Windows
if(!file.exists(sav)){download.file(url, sav, mode = "wb")}

survey = read_sav(sav)
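To see what read_sav returns without downloading the survey, here is a minimal sketch that writes a tiny SPSS file and reads it back. The toy values and the temporary file are assumptions made for illustration; only the column names are borrowed from the survey.

```r
library(haven)

# Toy data standing in for the survey (values are invented for illustration)
toy = tibble::tibble(
  Sexo_Entrevistado = labelled(
    c(1, 2, 2),
    labels = c(Hombre = 1, Mujer = 2),
    label = "Sexo Entrevistado"
  ),
  Edad_Entrevistado = c(34, 51, 27)
)

# Round trip through a temporary .sav file
tmp = tempfile(fileext = ".sav")
write_sav(toy, tmp)
back = read_sav(tmp)

# Coded columns come back as labelled vectors, not as factors
class(back$Sexo_Entrevistado)
attr(back$Sexo_Entrevistado, "labels")
```

The point of the round trip is that haven keeps the numeric codes and stores the SPSS value labels as attributes, which is why the tibbles later in the post show a dbl+lbl column type.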
Exploring data
Before exploring the data, keep in mind that the survey is in Spanish: “fecha” means date, “edad” means age, and “sexo” means sex.
# How many surveys do I have by day?
daily = survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>%
  group_by(date) %>%
  summarise(n = n())

ggplot(daily, aes(date, n)) + geom_line()
# How is the age distributed?
summary(survey$Edad_Entrevistado)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.00   32.00   48.00   47.92   61.00   89.00

# Assign the converted values back to the column; an unnamed mutate()
# would add a new column and leave Edad_Entrevistado unchanged
age = survey %>%
  mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>%
  rename(age = Edad_Entrevistado) %>%
  group_by(age) %>%
  summarise(n = n())

ggplot(age, aes(age, n)) + geom_line()
# How is the sex distributed?
survey %>%
  rename(sex_id = Sexo_Entrevistado) %>%
  group_by(sex_id) %>%
  summarise(n = n())

# A tibble: 2 x 2
     sex_id     n
  <dbl+lbl> <int>
1         1   651
2         2   651
Exploring labels
The last tibble gives no indication of what 1 and 2 mean.
survey %>%
  select(Sexo_Entrevistado) %>%
  rename(sex_id = Sexo_Entrevistado) %>%
  distinct() %>%
  mutate(sex = as_factor(sex_id))

# A tibble: 2 x 2
     sex_id    sex
  <dbl+lbl> <fctr>
1         2  Mujer
2         1 Hombre
The last column (in Spanish) shows that in this survey “1 = Male” (Hombre) and “2 = Female” (Mujer).
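The mapping that as_factor relies on can also be inspected directly. The sketch below uses a small hand-made labelled vector (an assumption for illustration, not the survey column itself) to show where the value labels live.

```r
library(haven)

# Hand-made stand-in for the sex column: codes plus their value labels
sex_id = labelled(c(2, 1, 2), labels = c(Hombre = 1, Mujer = 2))

# The code-to-text mapping is a named vector in the "labels" attribute
value_labels = attr(sex_id, "labels")
value_labels

# as_factor() uses that mapping to replace each code with its label
sex = as_factor(sex_id)
levels(sex)
```

Looking at the attribute is handy when a variable has many codes and you want the whole dictionary at once rather than the distinct values that happen to appear in the data.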
Knowing this, I could recode the values by hand:
survey %>%
  rename(sex = Sexo_Entrevistado) %>%
  mutate(sex = as.integer(sex)) %>%
  mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>%
  group_by(sex) %>%
  summarise(n = n())

# A tibble: 2 x 2
     sex     n
   <chr> <int>
1 Female   651
2   Male   651
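An alternative to recoding the numeric codes is to translate the labels themselves, so the result does not depend on remembering which number is which. This is only a sketch on invented toy data; the translation vector is a hypothetical helper, not something shipped with the survey.

```r
library(haven)
library(dplyr)

# Toy stand-in for the survey column (values invented for illustration)
toy = tibble(
  sex_id = labelled(c(1, 2, 2, 1, 2), labels = c(Hombre = 1, Mujer = 2))
)

# Hypothetical Spanish-to-English lookup, keyed by the SPSS value labels
translation = c(Hombre = "Male", Mujer = "Female")

counts = toy %>%
  mutate(
    sex = as.character(as_factor(sex_id)),  # codes -> Spanish labels
    sex = unname(translation[sex])          # Spanish labels -> English
  ) %>%
  count(sex)

counts
```

Because the lookup is keyed by label text, a mistake in the code order cannot silently swap the two groups.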
The column names are labelled as well. Here sjlabelled helps if I want to know, for example, what “P12” means. But instead of just translating labels, I’ll describe the complete dataset.
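For a quick look at what those column labels are, the following sketch reads them straight from the attributes of a toy tibble. The data, the Si/No value labels, and the dates are invented for illustration; only the two variable labels are taken from the survey description further down.

```r
library(haven)

# Toy tibble with variable labels attached (data invented for illustration)
toy = tibble::tibble(
  P12 = labelled(
    c(1, 2, 1),
    labels = c(Si = 1, No = 2),          # hypothetical value labels
    label = "Apoyo a la democracia"
  ),
  Fecha = structure(
    c("01-06-2015", "02-06-2015", "02-06-2015"),
    label = "Fecha entrevista"
  )
)

# exact = TRUE stops "label" from partially matching the "labels" attribute
var_labels = sapply(toy, function(col) attr(col, "label", exact = TRUE))

var_labels[["P12"]]
```

The get_label(survey) call used in the next section should give the same kind of named vector for the whole data frame, one entry per column.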
Describing the dataset
valid_replies = survey %>%
  mutate_if(is.labelled, as.numeric) %>%
  skim() %>%
  filter(stat == "complete") %>%
  mutate(description = get_label(survey)) %>%
  select(var, description, everything()) %>%
  select(-c(stat, level, type)) %>%
  rename(pcent_valid = value) %>%
  mutate(pcent_valid = paste0(100 * round(pcent_valid / nrow(survey), 2), '%'))

histograms = survey %>%
  mutate_if(is.labelled, as.numeric) %>%
  skim() %>%
  filter(stat == "hist") %>%
  select(var, level) %>%
  rename(histogram = level)

survey_description = valid_replies %>%
  left_join(histograms) %>%
  write_csv("2017-06-24_working_with_spss_data_in_r/survey_description.csv")

survey_description

# A tibble: 203 x 4
                 var          description pcent_valid  histogram
               <chr>                <chr>       <chr>      <chr>
 1        PONDERADOR           Ponderador        100% ▂▇▇▅▅▃▁▁▁▁
 2             Folio                Folio        100% ▇▇▇▇▇▇▇▇▇▇
 3            Región               Región        100% ▁▁▂▁▂▁▁▁▇▁
 4            Comuna               Comuna        100% ▁▁▂▁▁▂▁▁▇▁
 5             Fecha     Fecha entrevista        100%       <NA>
 6  Sexo_Encuestador   Sexo Entrevistador         91% ▂▁▁▁▁▁▁▁▁▇
 7               GSE           GSE Visual        100% ▁▁▂▁▇▁▁▆▁▁
 8 Sexo_Entrevistado    Sexo Entrevistado        100% ▇▁▁▁▁▁▁▁▁▇
 9 Edad_Entrevistado    Edad Entrevistado        100% ▇▆▅▆▇▇▅▃▃▂
10       Hora_Inicio Hora Inicio Medición        100%       <NA>
# ... with 193 more rows
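The pcent_valid column is the share of non-missing replies per variable. As a rough cross-check that does not depend on skimr, the same figure can be computed in base R. The sketch below uses an invented toy data frame, not the survey.

```r
# Toy data with one missing value (invented for illustration)
toy = data.frame(
  Sexo_Encuestador = c(1, NA, 2, 1),
  Edad_Entrevistado = c(34, 51, 27, 60)
)

# Share of non-missing values per column
pcent_valid = colMeans(!is.na(toy))

# Same formatting as in the post: rounded and shown as a percentage
formatted = paste0(100 * round(pcent_valid, 2), "%")
formatted
```

On the real data, colMeans(!is.na(survey)) should agree with the complete counts from skim() divided by nrow(survey).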
The last tibble reveals some interesting questions. For example, P12 refers to “Apoyo a la democracia”, that is, “Do you support democracy?”.