gssr Update
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
NORC released version 2a of the 1972-2022 General Social Survey cumulative file. I’ve updated {gssr}
, an R package that makes it more convenient for R users to work with GSS Data. One handy feature of {gssr}
is that it lets you see documentation for individual GSS variables as R help pages.
gssr
is a data package, bundling several datasets into a convenient
format. The relatively large size of the data in the package means it is
not suitable for hosting on CRAN, the
core R package repository.
Install direct from GitHub
You can install gssr from GitHub with:
1 |
remotes::install_github("kjhealy/gssr") |
Load the package:
1 2 3 4 5 |
library(gssr) #> Package loaded. To attach the GSS data, type data(gss_all) at the console. #> For the codebook, type data(gss_dict). #> For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc). #> For help on a specific GSS variable, type ?varname at the console. |
Single GSS years
You can quickly get the data for any single GSS year by using
gss_get_yr()
to download the data file from NORC and put it directly
into a tibble.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
gss18 <- gss_get_yr(2018) #> Fetching: https://gss.norc.org/documents/stata/2018_stata.zip gss18 #> # A tibble: 2,348 × 1,068 #> year id wrkstat hrs1 hrs2 evwork wrkslf wrkgovt #> <dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+l> #> 1 2018 1 3 [with … NA(i) [iap] 41 NA(i) [iap] 2 [som… 2 [pri… #> 2 2018 2 5 [retir… NA(i) [iap] NA(i) [iap] 1 [yes] 2 [som… 2 [pri… #> 3 2018 3 1 [worki… 40 NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri… #> 4 2018 4 1 [worki… 40 NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri… #> 5 2018 5 5 [retir… NA(i) [iap] NA(i) [iap] 1 [yes] 2 [som… 2 [pri… #> 6 2018 6 5 [retir… NA(i) [iap] NA(i) [iap] 1 [yes] 2 [som… 2 [pri… #> 7 2018 7 1 [worki… 35 NA(i) [iap] NA(i) [iap] 2 [som… 1 [gov… #> 8 2018 8 1 [worki… 89 [89+… NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri… #> 9 2018 9 1 [worki… 40 NA(i) [iap] NA(i) [iap] 1 [sel… 2 [pri… #> 10 2018 10 1 [worki… 40 NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri… #> # ℹ 2,338 more rows #> # ℹ 1,060 more variables: occ10 <dbl+lbl>, prestg10 <dbl+lbl>, #> # prestg105plus <dbl+lbl>, indus10 <dbl+lbl>, marital <dbl+lbl>, #> # martype <dbl+lbl>, divorce <dbl+lbl>, widowed <dbl+lbl>, #> # spwrksta <dbl+lbl>, sphrs1 <dbl+lbl>, sphrs2 <dbl+lbl>, spevwork <dbl+lbl>, #> # cowrksta <dbl+lbl>, cowrkslf <dbl+lbl>, coevwork <dbl+lbl>, #> # cohrs1 <dbl+lbl>, cohrs2 <dbl+lbl>, spwrkslf <dbl+lbl>, … |
The GSS data comes in a labelled format, mirroring the way it is encoded for Stata and SPSS platforms. The numeric codes are the content of the column cells. The labeling information is stored as an attribute of the column.
Here’s a typical workflow for getting the data ready:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 |
suppressPackageStartupMessages({ library(tidyverse) library(survey) library(srvyr) }) library(gssr) #> Package loaded. To attach the GSS data, type data(gss_all) at the console. #> For the codebook, type data(gss_dict). #> For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc). #> For help on a specific GSS variable, type ?varname at the console. ## Fn to capitalize strings nicely (from chartr) capwords <- function(x, strict = FALSE) { cap <- function(x) paste(toupper(substring(x, 1, 1)), {x <- substring(x, 2); if(strict) tolower(x) else x}, sep = "", collapse = " " ) sapply(strsplit(x, split = " "), cap, USE.NAMES = !is.null(names(x))) } ## The variables we want cont_vars <- c("year", "id", "ballot", "age") cat_vars <- c("race", "sex", "fefam") wt_vars <- c("vpsu", "vstrat", "wtssps") my_vars <- c(cont_vars, cat_vars, wt_vars) ## Get and clean up the 2018 data gss_fam <- gss_get_yr(2018) |> select(all_of(my_vars)) |> mutate( # Convert all missing to NA across(everything(), haven::zap_missing), # Convert all weight vars to numeric across(all_of(wt_vars), as.numeric), # Convert year to numeric year = as.integer(year), # Make all categorical variables factors and relabel nicely across(all_of(cat_vars), forcats::as_factor), across(all_of(cat_vars), \(x) forcats::fct_relabel(x, capwords, strict = TRUE)), fefam = forcats::fct_recode(fefam, NULL = "IAP", NULL = "DK", NULL = "NA"), young = ifelse(age < 26, "Yes", "No"), fefam_d = forcats::fct_recode(fefam, Agree = "Strongly Agree", Disagree = "Strongly Disagree"), fefam_n = recode(fefam_d, "Agree" = 0, "Disagree" = 1)) #> Fetching: https://gss.norc.org/documents/stata/2018_stata.zip ## Take a look gss_fam |> select(-vpsu, -vstrat) |> relocate(c(fefam_d, fefam_n), .after = fefam) #> # A tibble: 2,348 × 11 #> year id ballot age race sex fefam fefam_d fefam_n wtssps young #> <int> <dbl> <dbl+lbl> <dbl> <fct> <fct> <fct> <fct> <dbl> <dbl> <chr> #> 1 2018 1 1 [ballot a] 43 White Male Disa… Disagr… 1 1.91 No #> 2 2018 2 3 [ballot c] 74 White Fema… <NA> <NA> NA 0.915 No #> 3 2018 3 2 [ballot b] 42 White Male Disa… Disagr… 1 0.609 No #> 4 2018 4 2 [ballot b] 63 White Fema… Disa… Disagr… 1 0.642 No #> 5 2018 5 3 [ballot c] 71 Black Male <NA> <NA> NA 0.396 No #> 6 2018 6 1 [ballot a] 67 White Fema… Disa… Disagr… 1 0.529 No #> 7 2018 7 3 [ballot c] 59 Black Fema… <NA> <NA> NA 1.61 No #> 8 2018 8 3 [ballot c] 43 White Male <NA> <NA> NA 0.672 No #> 9 2018 9 2 [ballot b] 62 White Fema… Stro… Disagr… 1 0.594 No #> 10 2018 10 2 [ballot b] 55 White Male Disa… Disagr… 1 0.482 No #> # ℹ 2,338 more rows ## Put in a survey object options(survey.lonely.psu = "adjust") options(na.action="na.pass") gss_svy <- gss_fam |> mutate(stratvar = interaction(year, vstrat)) |> as_survey_design(id = vpsu, strata = stratvar, weights = wtssps, nest = TRUE) gss_svy #> Stratified 1 - level Cluster Sampling design (with replacement) #> With (156) clusters. #> Called via srvyr #> Sampling variables: #> - ids: vpsu #> - strata: stratvar #> - weights: wtssps #> Data variables: year (int), id (dbl), ballot (dbl+lbl), age (dbl+lbl), race #> (fct), sex (fct), fefam (fct), vpsu (dbl), vstrat (dbl), wtssps (dbl), young #> (chr), fefam_d (fct), fefam_n (dbl), stratvar (fct) |
The Cumulative Data File
The GSS cumulative data file is large. It is not loaded by default when
you invoke the package. (That is, gssr
does not use R’s “lazy loading”
facility. The data file is too big to do this without error.) To load
one of the datasets, first load the library and then use data()
to
make the data available. For example, load the cumulative GSS file like
this:
1 |
data(gss_all) |
This will take a moment. Once it is ready, the gss_all
object is
available to use in the usual way:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
gss_all #> # A tibble: 72,390 × 6,694 #> year id wrkstat hrs1 hrs2 evwork occ prestige #> <dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+lb> #> 1 1972 1 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 205 50 #> 2 1972 2 5 [retire… NA(i) [iap] NA(i) [iap] 1 [yes] 441 45 #> 3 1972 3 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 270 44 #> 4 1972 4 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 1 57 #> 5 1972 5 7 [keepin… NA(i) [iap] NA(i) [iap] 1 [yes] 385 40 #> 6 1972 6 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 281 49 #> 7 1972 7 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 522 41 #> 8 1972 8 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 314 36 #> 9 1972 9 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 912 26 #> 10 1972 10 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 984 18 #> # ℹ 72,380 more rows #> # ℹ 6,686 more variables: wrkslf <dbl+lbl>, wrkgovt <dbl+lbl>, #> # commute <dbl+lbl>, industry <dbl+lbl>, occ80 <dbl+lbl>, prestg80 <dbl+lbl>, #> # indus80 <dbl+lbl>, indus07 <dbl+lbl>, occonet <dbl+lbl>, found <dbl+lbl>, #> # occ10 <dbl+lbl>, occindv <dbl+lbl>, occstatus <dbl+lbl>, occtag <dbl+lbl>, #> # prestg10 <dbl+lbl>, prestg105plus <dbl+lbl>, indus10 <dbl+lbl>, #> # indstatus <dbl+lbl>, indtag <dbl+lbl>, marital <dbl+lbl>, … |
In addition to the integrated help, information about the variables is also contained in the gss_dict
object:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
data(gss_dict) gss_dict #> # A tibble: 6,663 × 12 #> pos variable label missing var_doc_label value_labels var_text years #> <int> <chr> <chr> <int> <chr> <chr> <chr> <list> #> 1 1 year gss year… 0 gss year for… [NA(d)] don… None <NULL> #> 2 2 wrkstat labor fo… 36 labor force … [1] working… 1. Last… <tibble> #> 3 3 hrs1 number o… 30830 number of ho… [89] 89+ ho… 1a. If … <tibble> #> 4 4 hrs2 number o… 70989 number of ho… [89] 89+ ho… 1b. If … <tibble> #> 5 5 evwork ever wor… 46944 ever work as… [1] yes; [2… 1c. If … <tibble> #> 6 6 occ r's cens… 48123 r's census o… [NA(d)] don… 2a. Wha… <tibble> #> 7 7 prestige r's occu… 48123 r's occupati… [NA(d)] don… 2a. Wha… <tibble> #> 8 8 wrkslf r self-e… 4041 r self-emp o… [1] self-em… 2e. (Ar… <tibble> #> 9 9 wrkgovt govt or … 44311 govt or priv… [1] governm… 2f. (Ar… <tibble> #> 10 10 commute travel t… 71060 travel time … [97] 97+ mi… 2g. Abo… <tibble> #> # ℹ 6,653 more rows #> # ℹ 4 more variables: var_yrtab <list>, col_type <chr>, var_type <chr>, #> # var_na_codes <chr> |
There are also a few convenience functions. For example, to see which years some questions were ask, use gss_which_years()
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
gss_all |> gss_which_years(c(industry, indus80, wrkgovt, commute)) |> print(n = Inf) ## # A tibble: 34 × 5 ## year industry indus80 wrkgovt commute ## <dbl+lbl> <lgl> <lgl> <lgl> <lgl> ## 1 1972 TRUE FALSE FALSE FALSE ## 2 1973 TRUE FALSE FALSE FALSE ## 3 1974 TRUE FALSE FALSE FALSE ## 4 1975 TRUE FALSE FALSE FALSE ## 5 1976 TRUE FALSE FALSE FALSE ## 6 1977 TRUE FALSE FALSE FALSE ## 7 1978 TRUE FALSE FALSE FALSE ## 8 1980 TRUE FALSE FALSE FALSE ## 9 1982 TRUE FALSE FALSE FALSE ## 10 1983 TRUE FALSE FALSE FALSE ## 11 1984 TRUE FALSE FALSE FALSE ## 12 1985 TRUE FALSE TRUE FALSE ## 13 1986 TRUE FALSE TRUE TRUE ## 14 1987 TRUE FALSE FALSE FALSE ## 15 1988 TRUE TRUE FALSE FALSE ## 16 1989 TRUE TRUE FALSE FALSE ## 17 1990 TRUE TRUE FALSE FALSE ## 18 1991 FALSE TRUE FALSE FALSE ## 19 1993 FALSE TRUE FALSE FALSE ## 20 1994 FALSE TRUE FALSE FALSE ## 21 1996 FALSE TRUE FALSE FALSE ## 22 1998 FALSE TRUE FALSE FALSE ## 23 2000 FALSE TRUE TRUE FALSE ## 24 2002 FALSE TRUE TRUE FALSE ## 25 2004 FALSE TRUE TRUE FALSE ## 26 2006 FALSE TRUE TRUE FALSE ## 27 2008 FALSE TRUE TRUE FALSE ## 28 2010 FALSE TRUE TRUE FALSE ## 29 2012 FALSE FALSE TRUE FALSE ## 30 2014 FALSE FALSE TRUE FALSE ## 31 2016 FALSE FALSE TRUE FALSE ## 32 2018 FALSE FALSE TRUE FALSE ## 33 2021 FALSE FALSE FALSE FALSE ## 34 2022 FALSE FALSE FALSE FALSE |
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.