Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The sjmisc-package
My last posting was about reading and writing data between R and other statistical packages like SPSS, Stata or SAS. After that, I decided to bundle all functions that are not directly related to plotting or printing tables into a new package called sjmisc.
Basically, this package covers three domains of functionality:
- reading and writing data between other statistical packages (like SPSS) and R, based on the haven and foreign packages; hence, sjmisc also includes functions to work with labelled data
- frequently used statistical tests, or at least convenient wrappers for such test functions
- frequently applied recoding and variable conversion tasks
In this posting, I want to give a short introduction to the labelling features.
Labelled Data
In software like SPSS, it is common to have value and variable labels as variable attributes. Variable values, even if categorical, are mostly numeric. In R, however, you may use labels as values directly:
> factor(c("low", "high", "mid", "high", "low"))
[1] low high mid high low
Levels: high low mid
Reading SPSS data (with haven, foreign or sjmisc) keeps the numeric values of the variables and adds the value and variable labels as attributes. See the following example from the sample dataset efc, which is part of the sjmisc package:
library(sjmisc)
data(efc)
str(efc$e42dep)
> atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
> - attr(*, "label")= chr "how dependent is the elder? - subjective perception of carer"
> - attr(*, "labels")= Named num [1:4] 1 2 3 4
> ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
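The structure shown by str() can be reproduced with base R alone. The following is a minimal sketch, not taken from sjmisc: the vector name dep and its values are made up, and it only assumes the two attribute names ("label" and "labels") that appear in the output above.

```r
# Hypothetical example data: numeric codes as they might arrive from SPSS
dep <- c(3, 3, 4, 1, 2)

# Variable label: a single string stored in the "label" attribute
attr(dep, "label") <- "how dependent is the elder?"

# Value labels: a named numeric vector stored in the "labels" attribute
attr(dep, "labels") <- c("independent" = 1, "slightly dependent" = 2,
                         "moderately dependent" = 3, "severely dependent" = 4)

# The values stay numeric; the labels travel along as metadata
str(dep)
mean(dep)
```

Because the labels are plain attributes, numeric operations such as mean() still work on the vector.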
While all plotting and table functions of the sjPlot-package make use of these attributes (see many examples here), many packages and/or functions do not consider these attributes, e.g. R base graphics:
library(sjmisc)
data(efc)
barplot(table(efc$e42dep, efc$e16sex), beside = TRUE, legend.text = TRUE)
Adding value labels as factor values
to_label is a sjmisc function that converts a numeric variable into a factor and uses the value labels stored in its attributes as factor levels. With the value labels as factor levels, the bar plot is labelled.
library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep), to_label(efc$e16sex)), beside = TRUE, legend.text = TRUE)
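To illustrate the idea behind this conversion, here is a rough base R sketch. The helper to_label_sketch and the sample vector are hypothetical and are not the sjmisc implementation; the sketch only assumes that value labels are stored as a named numeric vector in the "labels" attribute.

```r
# Hypothetical helper, not the sjmisc implementation: turn a labelled
# numeric vector into a factor whose levels are the value labels
to_label_sketch <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs)) return(as.factor(x))  # nothing to map, plain factor
  factor(as.vector(x), levels = unname(labs), labels = names(labs))
}

# Made-up data with value labels attached
dep <- c(3, 3, 4, 1, 2)
attr(dep, "labels") <- c("independent" = 1, "slightly dependent" = 2,
                         "moderately dependent" = 3, "severely dependent" = 4)

dep_f <- to_label_sketch(dep)
levels(dep_f)
table(dep_f)
```

A table built from dep_f carries the label text in its dimension names, which is why base graphics such as barplot() can then show a meaningful legend.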
to_fac is a convenient replacement of as.factor, which converts a numeric vector into a factor, but keeps the value and variable label attributes.
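The difference to as.factor can be sketched in base R. The helper to_fac_sketch, the vector sex and its labels are hypothetical illustrations, not sjmisc code; the sketch assumes the labels live in the "label" and "labels" attributes.

```r
# Hypothetical helper, not the sjmisc implementation: convert to a factor
# and copy the label attributes over from the original vector
to_fac_sketch <- function(x) {
  f <- as.factor(x)
  # exact = TRUE avoids partial matching of "label" against "labels"
  attr(f, "label") <- attr(x, "label", exact = TRUE)
  attr(f, "labels") <- attr(x, "labels", exact = TRUE)
  f
}

# Made-up labelled data
sex <- c(2, 1, 2, 2, 1)
attr(sex, "label") <- "elder's gender"
attr(sex, "labels") <- c("male" = 1, "female" = 2)

names(attributes(as.factor(sex)))  # plain as.factor() keeps no label attributes
sex_f <- to_fac_sketch(sex)
levels(sex_f)                      # levels are still the numeric codes
attr(sex_f, "label")
```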
Getting and setting value and variable labels
There are four functions that let you easily set or get value and variable labels of either a single vector or a complete data frame:
- get_var_labels() to get variable labels
- get_val_labels() to get value labels
- set_var_labels() to set variable labels (add them as vector attribute)
- set_val_labels() to set value labels (add them as vector attribute)
library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep), to_label(efc$e16sex)), beside = TRUE, legend.text = TRUE, main = get_var_labels(efc$e42dep))
get_var_labels(efc) would return the variable labels of all variables in the data frame, and get_val_labels(efc) would return a list with the value labels of all variables in the data frame.
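What such getters do for a whole data frame can be approximated in base R. This is only a sketch under assumptions: the data frame df, its columns and its labels are made up, and the code reads the "label" and "labels" attributes directly instead of calling sjmisc.

```r
# Hypothetical data frame with two labelled columns (made-up values)
df <- data.frame(age = c(74, 68, 80), dep = c(4, 1, 3))
attr(df$age, "label") <- "elder's age"
attr(df$dep, "label") <- "elder's dependency"
attr(df$dep, "labels") <- c("independent" = 1, "severely dependent" = 4)

# Variable labels of all columns: one string per column ("" if missing)
var_labels <- vapply(df, function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) "" else lab
}, character(1))

# Value labels of all columns: a list, NULL for unlabelled columns
val_labels <- lapply(df, attr, which = "labels", exact = TRUE)

var_labels
val_labels
```

The exact = TRUE argument matters here: by default attr() matches names partially, so a request for "label" on a column that only has "labels" would otherwise return the value labels.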
Restore labels from subsetted data
The base subset function as well as dplyr's filter and select functions (at least up to dplyr 0.4.1) drop label attributes (or vector attributes in general) when subsetting data. In the current development snapshot of sjmisc on GitHub (which will most likely become version 1.0.3 and be released in June or July), there are handy functions to deal with this problem: add_labels and remove_labels.
add_labels adds labels back to a subsetted data frame, based on the original data frame. remove_labels removes all label attributes. This might be necessary when working with dplyr up to 0.4.1, which sometimes throws an error when working with labelled data; this issue should be addressed in the next dplyr update.
Losing labels during subset
library(sjmisc)
data(efc)
efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))
str(efc.sub)
> 'data.frame': 296 obs. of 5 variables:
> $ e17age : num 74 68 80 72 94 79 67 80 76 88 ...
> $ e42dep : num 4 4 1 3 3 4 3 4 2 4 ...
> $ c82cop1: num 4 3 3 4 3 3 4 2 2 3 ...
> $ c83cop2: num 2 4 2 2 2 2 1 3 2 2 ...
> $ c84cop3: num 4 4 1 1 1 4 2 4 2 4 ...
Add back labels
efc.sub <- add_labels(efc.sub, efc)
str(efc.sub)
> 'data.frame': 296 obs. of 5 variables:
> $ e17age : atomic 74 68 80 72 94 79 67 80 76 88 ...
> ..- attr(*, "label")= Named chr "elder' age"
> .. ..- attr(*, "names")= chr "e17age"
> $ e42dep : atomic 4 4 1 3 3 4 3 4 2 4 ...
> ..- attr(*, "label")= Named chr "how dependent is the elder? - subjective perception of carer"
> .. ..- attr(*, "names")= chr "e42dep"
> ..- attr(*, "labels")= Named chr "1" "2" "3" "4"
> .. ..- attr(*, "names")= chr "independent" "slightly dependent" "moderately dependent" "severely dependent"
# truncated output
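The restore step can be sketched in base R as copying attributes column by column. The helper add_labels_sketch and the small data frame are hypothetical and do not reproduce the sjmisc implementation; the sketch assumes that columns are matched by name and that only the "label" and "labels" attributes need to be copied.

```r
# Hypothetical helper, not the sjmisc implementation: copy the "label" and
# "labels" attributes from `original` to same-named columns of `subsetted`
add_labels_sketch <- function(subsetted, original) {
  for (nm in intersect(names(subsetted), names(original))) {
    for (a in c("label", "labels")) {
      attr(subsetted[[nm]], a) <- attr(original[[nm]], a, exact = TRUE)
    }
  }
  subsetted
}

# Made-up labelled data
df <- data.frame(sex = c(1, 2, 1, 2), dep = c(4, 1, 3, 2))
attr(df$dep, "label") <- "elder's dependency"
attr(df$dep, "labels") <- c("independent" = 1, "severely dependent" = 4)

df_sub <- subset(df, sex == 1)
attributes(df_sub$dep)                   # label attributes are gone after subsetting
df_sub <- add_labels_sketch(df_sub, df)
attr(df_sub$dep, "label")                # restored from the original data frame
```

Matching by column name rather than by position keeps the sketch safe when the subset also drops or reorders columns.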
So, when working with labelled data, especially with data sets imported from other software packages, it comes in very handy to make use of the label attributes. The sjmisc package supports this feature and offers some useful functions for these tasks.
Tagged: R, rstats, sjmisc, sjPlot, SPSS, Statistik