Getting the data for the sjPlotting-functions into shape #rstats
I sometimes get questions on how to reproduce the samples that are posted in this blog. Currently, I’m referring to these posts:
- Plotting lm and glm models with ggplot
- Easily plotting grouped bars with ggplot
- Simplify frequency plots with ggplot in R
When I wrote these functions, I had in mind how to visualize the data that we commonly use (in our research projects specifically, and probably in the social sciences in general). These are often Likert scales or other unordered or ordered categorical variables. Let’s take the example from Easily plotting grouped bars with ggplot. I have used the variables e42dep and n1pv_ovs. The first variable is taken from this questionnaire on page 6. The second variable is country-specific, relates to the German Long-Term Care Insurance and is taken from this translated questionnaire on page 19 (it is an unordered categorical variable).
Usually, the scaling of these variables is “continuous”, i.e. the first answer is coded as 0 or 1 and all following answers are numbered sequentially. Thus, the variables I use for analysis have factor levels from 0 or 1 to n in sequential order (e.g. 1, 2, 3, 4, 5). And that is how the variables, or rather their factor levels, should be coded!
But why stick to this sequential order? Because of the zero counts! Let’s say I have 5 possible answer categories, and only answers 1, 2, 4 and 5 have been chosen, while category 3 has zero counts (no answers). How do I know whether a category (in this example the “3”) is missing due to zero counts or because it is not a valid answer category? If I had only four possible answers with the category values 1, 2, 4 and 5, there would be no zero counts, i.e. no category would be missing. For this reason, I assume a “quasi-continuous” scale and determine the zero counts by looking at which categories are missing on a sequential numeric scale.
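To make the zero-count problem concrete, here is a minimal sketch in base R. The answer vector is made up for illustration, and declaring the levels via factor() is just one way to show the effect; it is not the approach used in my script.

```r
# Hypothetical answers on a 5-point scale; nobody chose category 3.
answers <- c(1, 2, 2, 4, 5, 5, 5)

# A plain table only knows the values that occur, so category 3 is absent.
observed <- table(answers)

# Declaring all levels up front keeps category 3, with a count of zero.
complete <- table(factor(answers, levels = 1:5))

print(observed)
print(complete)
```

Without knowing that the scale runs from 1 to 5, the first table cannot tell a zero count apart from an invalid category, which is exactly why the sequential scale is assumed.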
Here is an example from the sjPlotFrequencies script showing how I do this (thanks to the Stack Overflow users for their help):

    # --------------------------------------------------------
    # Define amount of categories, include zero counts
    # --------------------------------------------------------
    # Zero counts of categories are not plotted by default just because
    # these categories don't appear in the data. If we assume a
    # "quasi-continuous" scale (categories from 1 to 4 etc.), we now
    # identify the zero counts and add / insert them into the data frame.
    # This enables us to plot zero counts as well.
    # We guess the maximum number of categories from the number
    # of supplied category labels. If no category labels were passed
    # as parameter, we assume that the maximum value found in the category
    # column represents the highest category number
    catcount <- 0
    catmin <- min(na.omit(variable))
    df <- as.data.frame(table(variable))
    # Factors have to be transformed into numeric values
    # for a continuous x-axis scale
    df$variable <- as.numeric(as.character(df$variable))
    # if categories start with zero, fix this here
    addone <- ifelse(min(df$variable) == 0, 1, 0)
    # get the highest answer category of "variable", so we know where the
    # range of the x-axis ends
    if (!is.null(categoryLabels)) {
      catcount <- length(categoryLabels)
    } else {
      # use the highest observed category (shifted by one for zero-based
      # scales); counting unique values would miss the zero-count categories
      catcount <- max(df$variable) + addone
    }
    # Create a vector of zeros
    frq <- rep(0, catcount)
    # Replace the values in frq at the indices given by df$variable
    # with df$Freq, so that the remaining indices keep their zero
    # counts
    frq[df$variable + addone] <- df$Freq
    # create new data frame. We now have a data frame with all
    # variable categories and their related counts, including
    # zero counts, but no(!) missings!
    mydat <- as.data.frame(cbind(var = 1:catcount, frq))
    # calculate missings here
    missingcount <- length(which(is.na(variable)))
    # --------------------------------------------------------
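To see the index assignment at work, here is a small run on a hypothetical zero-based variable. The sample vector is made up, no category labels are supplied, and the sketch takes the highest observed value as the last category, as described in the comments above.

```r
# Hypothetical zero-based variable with categories 0 to 3; category 2 was
# never chosen, and one answer is missing.
variable <- c(0, 0, 1, 3, 3, 3, NA)

# table() drops the NA and returns the columns "variable" and "Freq".
df <- as.data.frame(table(variable))
df$variable <- as.numeric(as.character(df$variable))

# Shift by one when the scale starts at zero, because R vectors are 1-indexed.
addone <- ifelse(min(df$variable) == 0, 1, 0)

# Assumption for this sketch: the highest observed value is the last category.
catcount <- max(df$variable) + addone

# Start with zeros everywhere, then fill in the observed counts by position.
frq <- rep(0, catcount)
frq[df$variable + addone] <- df$Freq

mydat <- data.frame(var = 1:catcount, frq = frq)
missingcount <- sum(is.na(variable))

print(mydat)
print(missingcount)
```

The third row of mydat (the original category 2) keeps its zero, so it would still get its own position on the x-axis.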
So, if you want to use the available data sets in R, make sure that your variables…
- are categorical (factors)
- have factor levels from 1 to n, sequentially increasing by 1 (1, 2, 3, 4, 5…)
- have their factor levels changed if they differ from this scheme; you can then pass the original levels as the categoryLabels parameter
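If a variable has non-sequential codes, one possible way to recode it in base R is sketched below. The codes 10, 20 and 40 and the names raw, recoded and original_levels are made up for illustration.

```r
# Hypothetical variable whose codes are not sequential (10, 20, 40).
raw <- c(10, 20, 20, 40, 10, 40, 40)

# factor() sorts the distinct codes; as.integer() then maps them to 1..n.
f <- factor(raw)
recoded <- as.integer(f)

# Keep the original codes so they can be passed on as category labels.
original_levels <- levels(f)

print(recoded)
print(original_levels)
```

Note that this collapses gaps in the coding, so it only suits variables where every valid category actually occurs in the data; otherwise a zero-count category would silently disappear.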
Since the data set I use is not public yet, I have found a freely accessible SPSS data set which can be used to reproduce my examples (http://calcnet.mth.cmich.edu/org/spss/Prjs_DataSets.htm). These data sets have neither value labels nor variable names, so I fixed this in SPSS and added variable names and value labels. You can download the disease data set here: http://www.strengejacke.de/R-Stuff/disease.sav.
If you download that data set, you can do the following:
    source("sjImportSPSS.R")
    disease <- importSPSS("/Users/danielludecke/Desktop/disease.sav")
    source("sjPlotGroupFrequencies.R")
    sjp.grpfrq(disease$sector, disease$class)

Alternatively, you can pass the variable and value labels to the plot:

    source("sjImportSPSS.R")
    disease <- importSPSS("/Users/danielludecke/Desktop/disease.sav")
    dislab <- getValueLabels(disease)
    disvars <- getVariableLabels(disease)
    source("sjPlotGroupFrequencies.R")
    sjp.grpfrq(disease$sector,
               disease$class,
               title=disvars['sector'],
               axisLabels.x=dislab[['sector']],
               legendLabels=dislab[['class']])

(The sector variable has neither a variable label nor value labels in the SPSS data set, so the title and axis-label parameters have no effect here.)