Working With SPSS© Data in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
I needed to import SPSS© data for work. There are several options, and I’ve used both the foreign and haven R packages. I prefer haven because it integrates better with R’s tidyverse. I switched to it from foreign once I verified that it behaves well with factors and avoids the deprecated factor labels problem in newer R versions.
The Data
For this post I found the Diego Portales University National Survey. It is a publicly available survey, conducted nationwide since 2005, that asks people about their trust in institutions (e.g. government, police, firefighters) and their opinion on same-sex marriage, restrictions on smoking areas, and more.
Importing Data
#devtools::install_github("ropenscilabs/skimr")
# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
library(readr)
# Import foreign statistical formats
library(haven)
# Data
url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
sav = "2017-06-24_working_with_spss_data_in_r/udp_national_survey_2015.sav"
if(!file.exists(sav)){download.file(url,sav)}
survey = read_sav(sav)
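To get a feel for the kind of column read_sav() produces without downloading the survey, here is a minimal sketch that builds a labelled vector by hand with haven::labelled(). The codes and labels mirror the sex column explored later in this post; the object name sexo and the four values are made up for illustration.

```r
library(haven)

# Hypothetical stand-in for one SPSS column: numeric codes plus value labels
sexo <- labelled(
  c(1, 2, 2, 1),
  labels = c(Hombre = 1, Mujer = 2),
  label = "Sexo Entrevistado"
)

is.labelled(sexo)     # the codes stay numeric, the labels travel as attributes
attr(sexo, "labels")  # named vector mapping each label to its code
attr(sexo, "label")   # the variable label
as_factor(sexo)       # factor whose levels are the value labels
```

The point of the sketch is that nothing is lost on import: the numeric codes and the text labels live side by side until you decide which one you want.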
Exploring data
To explore the data, keep in mind that the survey is in Spanish. So “Fecha” means date, “Edad” means age, and “Sexo” means sex.
# How many surveys do I have by day?
daily = survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>%
  group_by(date) %>%
  summarise(n = n())
ggplot(daily, aes(date, n)) + geom_line()
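The date handling in that step can be tried on its own. This is a small sketch with invented dates, assuming the day-month-year text layout implied by the format string "%d-%m-%Y"; toy and daily_toy are illustrative names, not objects from the survey.

```r
library(dplyr)

# Hypothetical interview dates stored as day-month-year text
toy <- tibble(Fecha = c("24-06-2015", "24-06-2015", "25-06-2015"))

daily_toy <- toy %>%
  mutate(date = as.Date(Fecha, format = "%d-%m-%Y")) %>%  # text to Date
  count(date)                                             # rows per day

daily_toy
```

Here count(date) is shorthand for the group_by() plus summarise(n = n()) pair used on the real data.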

# How is the age distributed?
summary(survey$Edad_Entrevistado)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.00   32.00   48.00   47.92   61.00   89.00
age = survey %>%
  # Store the integer conversion back into the same column
  mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>%
  rename(age = Edad_Entrevistado) %>%
  group_by(age) %>%
  summarise(n = n())
ggplot(age, aes(age, n)) + geom_line()

# How is the sex distributed?
survey %>%
rename(sex_id = Sexo_Entrevistado) %>%
group_by(sex_id) %>%
summarise(n = n())
# A tibble: 2 x 2
sex_id n
<dbl+lbl> <int>
1 1 651
2 2 651
Exploring labels
In the last tibble we have no idea what 1 and 2 mean, so I convert the labelled column to a factor with as_factor():
survey %>%
select(Sexo_Entrevistado) %>%
rename(sex_id = Sexo_Entrevistado) %>%
distinct() %>%
mutate(sex = as_factor(sex_id))
# A tibble: 2 x 2
sex_id sex
<dbl+lbl> <fctr>
1 2 Mujer
2 1 Hombre
The last column (in Spanish) shows us that in this survey “1 = Male” and “2 = Female”.
Knowing that, I could recode the values and repeat the count with English labels:
survey %>%
rename(sex = Sexo_Entrevistado) %>%
mutate(sex = as.integer(sex)) %>%
mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>%
group_by(sex) %>%
summarise(n = n())
# A tibble: 2 x 2
sex n
<chr> <int>
1 Female 651
2 Male 651
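As a variation on the recode above, the labels stored in the file can be turned into a factor first and then translated by name, which avoids hard-coding which number means what. This is only a sketch on invented data: toy and by_sex are hypothetical names, and the five values are made up.

```r
library(dplyr)
library(haven)

# Toy data standing in for the survey column (hypothetical values)
toy <- tibble(
  Sexo_Entrevistado = labelled(c(1, 2, 2, 1, 2), labels = c(Hombre = 1, Mujer = 2))
)

by_sex <- toy %>%
  mutate(sex = as_factor(Sexo_Entrevistado)) %>%                 # codes to Spanish labels
  mutate(sex = recode(sex, Hombre = "Male", Mujer = "Female")) %>%  # translate the levels
  count(sex)

by_sex
```

Recoding by label rather than by number means the translation still holds if another survey wave swaps the codes.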
The columns carry variable labels as well. Here sjlabelled helps if I want to know, for example, what “P12” means. But instead of just translating labels, I’ll describe the complete dataset.
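For a single variable, the lookup could look like the sketch below. It uses a hand-made two-column data frame instead of the real survey; the P12 label text is the one this post reports for that question, while toy and the values in it are invented.

```r
library(sjlabelled)

# Hypothetical stand-in for the survey with one labelled question
toy <- data.frame(Folio = 1:3)
toy$P12 <- haven::labelled(c(1, 2, 1), label = "Apoyo a la democracia")

p12_label <- get_label(toy$P12)  # variable label of a single column
p12_label

get_label(toy)                   # one entry per column of the data frame
```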
Describing the dataset
valid_replies = survey %>%
mutate_if(is.labelled,as.numeric) %>%
skim() %>%
filter(stat=="complete") %>%
mutate(description = get_label(survey)) %>%
select(var,description,everything()) %>%
select(-c(stat,level,type)) %>%
rename(pcent_valid = value) %>%
mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))
histograms = survey %>%
mutate_if(is.labelled,as.numeric) %>%
skim() %>%
filter(stat=="hist") %>%
select(var,level) %>%
rename(histogram = level)
survey_description = valid_replies %>%
left_join(histograms, by = "var") %>%
write_csv("2017-06-24_working_with_spss_data_in_r/survey_description.csv")
survey_description
# A tibble: 203 x 4
var description pcent_valid histogram
<chr> <chr> <chr> <chr>
1 PONDERADOR Ponderador 100% ▂▇▇▅▅▃▁▁▁▁
2 Folio Folio 100% ▇▇▇▇▇▇▇▇▇▇
3 Región Región 100% ▁▁▂▁▂▁▁▁▇▁
4 Comuna Comuna 100% ▁▁▂▁▁▂▁▁▇▁
5 Fecha Fecha entrevista 100% <NA>
6 Sexo_Encuestador Sexo Entrevistador 91% ▂▁▁▁▁▁▁▁▁▇
7 GSE GSE Visual 100% ▁▁▂▁▇▁▁▆▁▁
8 Sexo_Entrevistado Sexo Entrevistado 100% ▇▁▁▁▁▁▁▁▁▇
9 Edad_Entrevistado Edad Entrevistado 100% ▇▆▅▆▇▇▅▃▃▂
10 Hora_Inicio Hora Inicio Medición 100% <NA>
# ... with 193 more rows
Exploring the last tibble, there are interesting questions. For example, P12 refers to “Apoyo a la democracia”, that is, “Do you support democracy?”.
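The description pipeline above depends on the stat, level and value columns that skim() returned when this post was written, and other skimr versions may name things differently. As a rough sketch of the pcent_valid column alone, assuming nothing beyond base R and dplyr, the share of non-missing replies per column can be computed directly. The toy data and the object names are invented.

```r
library(dplyr)

# Hypothetical data with some missing replies
toy <- tibble(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA)
)

# Share of non-missing values in each column
valid <- colMeans(!is.na(toy))

toy_description <- tibble(
  var = names(valid),
  pcent_valid = paste0(round(100 * valid), "%")
)

toy_description
```

This does not reproduce the inline histograms, only the completeness figure.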