Iterating on the GSS
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Let’s say we’re working with the General Social Survey. We’re
interested in repeatedly fitting some model each year to see how some
predictor changes over time. For example, the GSS has
a longstanding question named fefam
, where respondents are asked to
give their opinion on the following statement:
It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.
Respondents may answer that they Strongly Agree, Agree, Disagree, or Strongly Disagree with the statement (as well as refusing to answer, or saying they don’t know). Perhaps we’re interested in what predicts respondents being inclined to agree with the statement.
Setup
First we load the packages we’ll need.
1 2 3 4 5 |
library(tidyverse) library(survey) library(srvyr) library(broom) library(gssr) # https://kjhealy.github.io/gssr |
Next, the data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
data(gss_all) gss_all ## # A tibble: 68,846 × 6,311 ## year id wrkstat hrs1 hrs2 evwork occ prestige wrkslf wrkgovt ## <dbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl+l> <dbl+l> ## 1 1972 1 1 [workin… NA(i) NA(i) NA(i) 205 50 2 [som… NA(i) ## 2 1972 2 5 [retire… NA(i) NA(i) 1 [yes] 441 45 2 [som… NA(i) ## 3 1972 3 2 [workin… NA(i) NA(i) NA(i) 270 44 2 [som… NA(i) ## 4 1972 4 1 [workin… NA(i) NA(i) NA(i) 1 57 2 [som… NA(i) ## 5 1972 5 7 [keepin… NA(i) NA(i) 1 [yes] 385 40 2 [som… NA(i) ## 6 1972 6 1 [workin… NA(i) NA(i) NA(i) 281 49 2 [som… NA(i) ## 7 1972 7 1 [workin… NA(i) NA(i) NA(i) 522 41 2 [som… NA(i) ## 8 1972 8 1 [workin… NA(i) NA(i) NA(i) 314 36 2 [som… NA(i) ## 9 1972 9 2 [workin… NA(i) NA(i) NA(i) 912 26 2 [som… NA(i) ## 10 1972 10 1 [workin… NA(i) NA(i) NA(i) 984 18 2 [som… NA(i) ## # … with 68,836 more rows, and 6,301 more variables: commute <dbl>, ## # industry <dbl>, occ80 <dbl>, prestg80 <dbl>, indus80 <dbl+lbl>, ## # indus07 <dbl>, occonet <dbl>, found <dbl>, occ10 <dbl+lbl>, occindv <dbl>, ## # occstatus <dbl>, occtag <dbl>, prestg10 <dbl>, prestg105plus <dbl>, ## # indus10 <dbl+lbl>, indstatus <dbl>, indtag <dbl>, marital <dbl+lbl>, ## # martype <dbl+lbl>, agewed <dbl>, divorce <dbl+lbl>, widowed <dbl+lbl>, ## # spwrksta <dbl+lbl>, sphrs1 <dbl+lbl>, sphrs2 <dbl+lbl>, … |
We select the subset of variables of interest, together with the survey weights, which we’ll use later on.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
cont_vars <- c("year", "id", "ballot", "age") cat_vars <- c("race", "sex", "fefam") wt_vars <- c("vpsu", "vstrat", "oversamp", "formwt", # weight to deal with experimental randomization "wtssall", # weight variable "sampcode", # sampling error code "sample") # sampling frame and method my_vars <- c(cont_vars, cat_vars, wt_vars) # clean up labeled vars as we go, create compwt gss_df <- gss_all %>% filter(year > 1974 & year < 2021) %>% select(all_of(my_vars)) %>% mutate(across(everything(), haven::zap_missing), # Convert labeled missing to regular NA across(all_of(wt_vars), as.numeric), across(all_of(cat_vars), as_factor), across(all_of(cat_vars), fct_relabel, tolower), across(all_of(cat_vars), fct_relabel, tools::toTitleCase), compwt = oversamp * formwt * wtssall) |
The outcome we’re interested in is fefam
:
1 2 3 4 5 6 7 8 9 10 11 12 |
gss_df %>% count(fefam) ## # A tibble: 5 × 2 ## fefam n ## <fct> <int> ## 1 Strongly Agree 2543 ## 2 Agree 8992 ## 3 Disagree 13061 ## 4 Strongly Disagree 5479 ## 5 <NA> 30138 |
We’ll recode it so that it’s a binary outcome:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
gss_df <- gss_df %>% mutate(fefam_d = forcats::fct_recode(fefam, Agree = "Strongly Agree", Disagree = "Strongly Disagree"), fefam_n = recode(fefam_d, "Agree" = 1, "Disagree" = 0)) # factor version gss_df %>% count(fefam_d) ## # A tibble: 3 × 2 ## fefam_d n ## <fct> <int> ## 1 Agree 11535 ## 2 Disagree 18540 ## 3 <NA> 30138 # numeric version, 1 is "Agree" gss_df %>% count(fefam_n) ## # A tibble: 3 × 2 ## fefam_n n ## <dbl> <int> ## 1 0 18540 ## 2 1 11535 ## 3 NA 30138 |
Unweighted models
Let’s predict whether a respondent agrees with the fefam
statement
based on their age, sex, and race. If we want to fit a model for the
entire dataset and do not care about survey weights, then we can do the
following.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
out_all <- glm(fefam_n ~ age + sex + race, data = gss_df, family="binomial", na.action = na.omit) summary(out_all) ## ## Call: ## glm(formula = fefam_n ~ age + sex + race, family = "binomial", ## data = gss_df, na.action = na.omit) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.6809 -0.9516 -0.7550 1.1813 1.8716 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.9185878 0.0399581 -48.015 < 2e-16 *** ## age 0.0323648 0.0007275 44.486 < 2e-16 *** ## sexFemale -0.2247518 0.0248741 -9.036 < 2e-16 *** ## raceBlack 0.0668275 0.0363201 1.840 0.0658 . ## raceOther 0.3659411 0.0493673 7.413 1.24e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 39921 on 29980 degrees of freedom ## Residual deviance: 37746 on 29976 degrees of freedom ## (30232 observations deleted due to missingness) ## AIC: 37756 ## ## Number of Fisher Scoring iterations: 4 |
We can get broom
to give us a tidy summary of the results:
1 2 3 4 5 6 7 8 9 10 |
tidy(out_all) ## # A tibble: 5 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -1.92 0.0400 -48.0 0 ## 2 age 0.0324 0.000728 44.5 0 ## 3 sexFemale -0.225 0.0249 -9.04 1.63e-19 ## 4 raceBlack 0.0668 0.0363 1.84 6.58e- 2 ## 5 raceOther 0.366 0.0494 7.41 1.24e-13 |
We are primarily interested in getting annual estimates of the same model. At this point it might be tempting to write a loop, taking a piece of the dataset, fitting the model, and saving the results to an object or a file. But if we do that we will end up doing things like explicitly “growing” an object a piece (i.e. a year) at a time, and we will also have to worry about keeping track of every object we create and so on. It’s better to do the iteration directly in the pipeline via grouping instead.
Again, if we are ignoring the weights, we can do this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
out_yr <- gss_df %>% group_by(year) %>% group_map_dfr(possibly(~ tidy(glm(fefam_n ~ age + sex + race, data = .x, family = "binomial", na.action = na.omit), conf.int = TRUE), otherwise = NULL)) out_yr ## # A tibble: 105 × 8 ## year term estimate std.error statistic p.value conf.low conf.high ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1977 (Intercept) -1.20 0.178 -6.75 1.47e-11 -1.55 -0.854 ## 2 1977 age 0.0483 0.00388 12.4 1.56e-35 0.0408 0.0561 ## 3 1977 sexFemale -0.341 0.118 -2.90 3.77e- 3 -0.572 -0.111 ## 4 1977 raceBlack -0.0613 0.180 -0.340 7.34e- 1 -0.412 0.295 ## 5 1977 raceOther 0.188 0.576 0.326 7.44e- 1 -0.912 1.40 ## 6 1985 (Intercept) -1.89 0.168 -11.2 2.89e-29 -2.23 -1.56 ## 7 1985 age 0.0432 0.00332 13.0 1.03e-38 0.0368 0.0498 ## 8 1985 sexFemale -0.261 0.112 -2.34 1.94e- 2 -0.481 -0.0426 ## 9 1985 raceBlack 0.148 0.189 0.782 4.34e- 1 -0.223 0.519 ## 10 1985 raceOther -0.319 0.338 -0.944 3.45e- 1 -1.00 0.329 ## # … with 95 more rows |
Pretty neat, right? That’s a compact way to produce tidied estimates for a series of regressions.
What’s happening in the pipeline? Fundamentally, we group the data by
year, run a glm()
on each year, and then immediately tidy()
the
output. The call to group_map_dfr()
binds all the tidied output
together by row, with year
acting as an index column. The .x
there
in data = .x
is a dplyr placeholder that means in general “the thing
we’re computing on right now”, and in this specific case “the data for a
specific year as we go down the list of years”.
You you can see from the output that we have no results before 1977 or
between 1977 and 1985. That’s because the fefam
question was not asked
during those years. If we just ran the glm()
call directly with
group_map_dfr()
it would fail because of this. We’d get this error:
1 2 |
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : #> contrasts can be applied only to factors with 2 or more levels |
This is the glm()
function saying “You’ve asked me to run this model
that contains a factor but I can’t because the factor has no levels”.
The reason it has no levels is that there’s no data at all, because the
question wasn’t asked that year, so the model fails.
To avoid this, we wrap the function call in possibly()
. This is a
function from purrr
that says “Try something, but if it fails do the
following instead”. So we ask, in effect,
1 |
possibly(~ tidy(glm()), otherwise = NULL) |
Each time the model fails we get a NULL
result and the assembly of our
final table of output can continue. In this case we know the dataset
well enough and the model is simple enough that the only reason the
model fails should be if the question wasn’t asked that year. But more
generally, when running a sequence of independent tasks it’s good to
have code that (a) can fail gracefully, (b) continue to run even if one
piece fails, and (c) allows for some investigation of the errors later.
The purrr
package has several related functions—possibly()
,
safely()
, and quietly()
—that make this easier in pipelined code.
With our tidied output we could make a plot like this:
1 2 3 4 5 6 7 8 9 10 |
out_yr %>% filter(term == "sexFemale") %>% ggplot(mapping = aes(x = year, y = estimate, ymin = conf.low, ymax = conf.high)) + geom_hline(yintercept = 0, linetype = "dotted") + geom_line() + geom_pointrange() + theme_bw() |
Tables and models on the weighted data
The proper used of survey weights in regression models is a thing people
argue about, but I won’t pursue that here. We can turn the tibble into a
survey object by integrating the weights and other sampling information
using Thomas Lumley’s survey
package, and Greg Freedman Ellis’s
srvyr
, which integrates survey
functions with the tidyverse.
First we create a survey design object. This is our GSS data with additional information wrapped around it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
options(survey.lonely.psu = "adjust") options(na.action="na.pass") ## Before 1975 vpsus are not available gss_svy <- gss_df %>% filter(year > 1974) %>% mutate(stratvar = interaction(year, vstrat)) %>% as_survey_design(id = vpsu, strata = stratvar, weights = compwt, nest = TRUE) gss_svy ## Stratified 1 - level Cluster Sampling design (with replacement) ## With (4555) clusters. ## Called via srvyr ## Sampling variables: ## - ids: vpsu ## - strata: stratvar ## - weights: compwt ## Data variables: year (dbl), id (dbl), ballot (dbl+lbl), age (dbl+lbl), race ## (fct), sex (fct), fefam (fct), vpsu (dbl), vstrat (dbl), oversamp (dbl), ## formwt (dbl), wtssall (dbl), sampcode (dbl), sample (dbl), compwt (dbl), ## fefam_d (fct), fefam_n (dbl), stratvar (fct) |
We’re now in a position to calculate some properly-weighted summary statistics. For example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
gss_svy %>% drop_na(fefam_d) %>% group_by(year, sex, race, fefam_d) %>% summarize(prop = survey_mean(na.rm = TRUE, vartype = "ci")) ## # A tibble: 252 × 7 ## # Groups: year, sex, race [126] ## year sex race fefam_d prop prop_low prop_upp ## <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl> ## 1 1977 Male White Agree 0.694 0.655 0.732 ## 2 1977 Male White Disagree 0.306 0.268 0.345 ## 3 1977 Male Black Agree 0.686 0.564 0.807 ## 4 1977 Male Black Disagree 0.314 0.193 0.436 ## 5 1977 Male Other Agree 0.632 0.357 0.906 ## 6 1977 Male Other Disagree 0.368 0.0936 0.643 ## 7 1977 Female White Agree 0.640 0.601 0.680 ## 8 1977 Female White Disagree 0.360 0.320 0.399 ## 9 1977 Female Black Agree 0.553 0.472 0.634 ## 10 1977 Female Black Disagree 0.447 0.366 0.528 ## # … with 242 more rows |
Let’s fit our model again, this time with svyglm()
instead of glm()
.
The arguments are slightly different: data
becomes design
, and the
“binomial” family becomes quasibinomial()
for reasons explained in the
svyglm()
documentation.
1 2 3 4 |
out_svy_all <- svyglm(fefam_n ~ age + sex + race, design = gss_svy, family = quasibinomial(), na.action = na.omit) |
Broom gives us a tidy table of the estimates:
1 2 3 4 5 6 7 8 9 10 |
tidy(out_svy_all) ## # A tibble: 5 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -1.83 0.0480 -38.1 3.04e-232 ## 2 age 0.0311 0.000853 36.4 5.29e-217 ## 3 sexFemale -0.240 0.0279 -8.63 1.40e- 17 ## 4 raceBlack 0.0285 0.0436 0.653 5.14e- 1 ## 5 raceOther 0.385 0.0589 6.52 8.87e- 11 |
Again, we are interested in annual estimates. We use group_map_dfr()
and possibly()
again:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
out_svy_yrs <- gss_svy %>% group_by(year) %>% group_map_dfr(possibly(~ tidy(svyglm(fefam_n ~ age + sex + race, design = .x, family = quasibinomial(), na.action = na.omit), conf.int = TRUE), otherwise = NULL)) out_svy_yrs ## # A tibble: 105 × 8 ## year term estimate std.error statistic p.value conf.low conf.high ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1977 (Intercept) -1.09 0.184 -5.93 3.74e- 7 -1.46 -0.720 ## 2 1977 age 0.0469 0.00403 11.6 2.63e-15 0.0388 0.0550 ## 3 1977 sexFemale -0.344 0.126 -2.73 9.05e- 3 -0.599 -0.0901 ## 4 1977 raceBlack -0.144 0.215 -0.669 5.07e- 1 -0.576 0.288 ## 5 1977 raceOther 0.276 0.552 0.500 6.19e- 1 -0.835 1.39 ## 6 1985 (Intercept) -1.90 0.205 -9.29 1.79e-12 -2.31 -1.49 ## 7 1985 age 0.0447 0.00377 11.9 3.86e-16 0.0371 0.0523 ## 8 1985 sexFemale -0.268 0.135 -1.99 5.20e- 2 -0.538 0.00243 ## 9 1985 raceBlack 0.119 0.293 0.405 6.88e- 1 -0.470 0.707 ## 10 1985 raceOther -0.486 0.279 -1.75 8.69e- 2 -1.05 0.0731 ## # … with 95 more rows |
And here’s our quick plot again:
1 2 3 4 5 6 7 8 9 10 |
out_svy_yrs %>% filter(term == "sexFemale") %>% ggplot(mapping = aes(x = year, y = estimate, ymin = conf.low, ymax = conf.high)) + geom_hline(yintercept = 0, linetype = "dotted") + geom_line() + geom_pointrange() + theme_bw() |
A quick word about group_map()
and group_map_dfr()
Just like the apply
family of functions in base R, the map
family in
purrr
is a way to iterate on data without explicitly writing a loop.
Because its operations are vectorized, R already does this with many
functions. If we start with some letters,
1 2 3 4 |
letters ## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" ## [20] "t" "u" "v" "w" "x" "y" "z" |
and we want to make them upper-case, we don’t have to write a loop. We do
1 2 3 4 |
toupper(letters) ## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" ## [20] "T" "U" "V" "W" "X" "Y" "Z" |
The toupper()
function goes along each element of letters
and
converts it. You can think of a map
operation as generalizing this
idea. For example, here’s the first few rows and the first five columns
of the mtcars
data:
1 2 3 4 5 6 7 8 9 |
head(mtcars[,1:5]) ## mpg cyl disp hp drat ## Mazda RX4 21.0 6 160 110 3.90 ## Mazda RX4 Wag 21.0 6 160 110 3.90 ## Datsun 710 22.8 4 108 93 3.85 ## Hornet 4 Drive 21.4 6 258 110 3.08 ## Hornet Sportabout 18.7 8 360 175 3.15 ## Valiant 18.1 6 225 105 2.76 |
we can take the mean of each of these columns by using map()
to feed
them one at a time to the mean()
function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
map(mtcars[,1:5], mean) ## $mpg ## [1] 20.09062 ## ## $cyl ## [1] 6.1875 ## ## $disp ## [1] 230.7219 ## ## $hp ## [1] 146.6875 ## ## $drat ## [1] 3.596563 |
The map()
function always returns a list. But variants of it return,
for example, a vector of numbers:
1 2 3 4 |
map_dbl(mtcars[,1:5], mean) ## mpg cyl disp hp drat ## 20.090625 6.187500 230.721875 146.687500 3.596563 |
When we’re working with data, we normally won’t use map()
directly.
Instead various dplyr
operations will do what we want, like this:
1 2 3 4 5 6 7 8 |
mtcars %>% summarize(across(everything(), mean)) ## mpg cyl disp hp drat wt qsec vs am ## 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 ## gear carb ## 1 3.6875 2.8125 |
We can group by some variable and then summarize across all columns, too:
1 2 3 4 5 6 7 8 9 10 |
mtcars %>% group_by(cyl) %>% summarize(across(everything(), mean)) ## # A tibble: 3 × 11 ## cyl mpg disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55 ## 2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43 ## 3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5 |
More complex grouped operations need a little more work. That’s what we
did above when we used group_map_dfr()
. The function took our grouped
survey data and fed it a year at a time to a couple of nested functions
that produced tidied output from a glm. If you try using group_map()
instead in the code above you’ll see that you’ll get a list of tidied
results rather than a data frame with the groups bound together in
row-order—that’s what the dfr
part means.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.