Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have until recently made it a habit to draw a clear distinction about
using R for data analysis and Microsoft Excel for other office
productivity tasks. I know there are people who use Excel to process
data and even (gasp) to teach statistics with it. But I’m a bit
snobbish that way and to date all my efforts have been in getting data
out of Excel and into R, either by simple methods like read.csv
or if
the task was more meaty by using Hadley Wickham’s marvelous readxl
package.
But this week I got a request that may sound familiar to some of you who
work in an environment where the MS Office products are ubiquitous. My
colleague wanted to be able to do a little analysis and graphing of a
modest sized dataset and they only really had MS Excel experience. To be
honest I could have just done the work in R
and provided the results
but that’s not what they asked for and would have meant I was
responsible for future updates and questions. So I decided to just
provide the data in the most useful way I could and that led to this
post. To both document the process and also to potentially help others
who have this need in the future.
Along the way I also had some nice learning experiences around functions
and both standard and non standard evaluation NSE
that I’ll document
in some future posts.
Background
My colleague wanted to be able to do some simple analysis around health care using the Centers for Disease Control and Prevention (https://www.cdc.gov), National Center for Health Statistics (https://www.cdc.gov/nchs/index.htm), National Health Interview Survey (https://www.cdc.gov/nchs/nhis/nhis_2016_data_release.htm). They wanted a series of cross tabulated sets of summary data for variable pairings (for example whether or not the respondent had a formal health care provider by region of the country). They wanted one Excel “workbook” with 12 worksheets each one of which was the summary of counts for a pair of variables. From there they could use Excel’s native plotting tools to make the graphs they needed.
A little sleuthing around CRAN helped me discover openxlsx
which seems
to be quite active, well maintained, and have a variety of features I
would need. In my case that involved a download and install first (but
I’ll comment it out in this version for you). As long as I’m at it
I’ll load dplyr
and ggplot2
(they’ll figure more prominently in my
next post).
knitr::opts_chunk$set(echo = TRUE, warning = FALSE) library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union library(ggplot2) theme_set(theme_bw()) # set theme to my personal preference # install.packages("openxlsx") require(openxlsx) ## Loading required package: openxlsx
Download and structure the data
The data my colleague wanted was from 2014 and she was kind enough to provide the URL for a compressed zipfile ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2016/personsxcsv.zip. Full documentation about the file and its contents and the methodology is here https://www.cdc.gov/nchs/nhis/nhis_2016_data_release.htm.
After downloading and uncompressing it into a local project directory (it’s about 77Mb of data, with over 100,000 rows and more than 600 columns). I could go to work on processing it to get what was really needed for my colleague.
FullFile <- read.csv(file = "personsx.csv") dim(FullFile) ## [1] 103789 606 # 606 variables too many whittle it down with wild cards
A little dplyr
to make our life easier
While strictly speaking nothing in the next few steps requires dplyr
(they can all be done in base R
) I will showcase a couple ways dplyr
can make your data analysis faster and easier.
First off, call me old-fashioned but, I abhor having lots of data in
working memory that I know for a fact I’ll never use. While my Mac has
plenty of space let’s showcase dplyr
’s ability to help us rapidly
reduce down to a more manageable dataset. A quick look at the data
dictionary provided by the CDC shows we’ll never use any of the
variables in the dataset that start with “L” or with “INT” so let’s make
them go away.
FullFile <- select(FullFile, -starts_with("L")) FullFile <- select(FullFile, -starts_with("INT")) dim(FullFile) ## [1] 103789 289 # 289 is still big but ...
The other thing we notice is that the variables are all coded as integers when we know they are truly factors. Since we know we’re going to analyze them as factors let’s take a moment to recode which will also mean an opportunity to provide more user friendly labels so we’re not constantly referring to the code book to see what “1” really represents in the data.
The function is recode_factor
and allows us to map factor labels onto
the integers to provide a more useful dataset.
FullFile$REGION <- recode_factor(FullFile$REGION, `1` = "Northeast", `2` = "Midwest", `3` = "South", `4` = "West") FullFile$SEX <- recode_factor(FullFile$SEX, `1` = "Male", `2` = "Female") FullFile$RACERPI2 <- recode_factor(FullFile$RACERPI2, `1` = "White only", `2` = "Black/African American only", `3` = "AIAN only", `4` = "Asian only", `5` = "Race group not releasable", `6` = "Multiple race") FullFile$PDMED12M <- recode_factor(FullFile$PDMED12M, `1` = "Yes", `2` = "No") ## Warning: Unreplaced values treated as NA as .x is not compatible. Please ## specify replacements exhaustively or supply .default summary(FullFile$PDMED12M) ## Yes No NA's ## 6744 96986 59
Uh oh, what are we being warned about? recode_factor
has a nice
rational solution for cases where we don’t specify all the possible
choices for a variable (see ?recode_factor
) it simply assigns them an
NA
value. In our case that’s just what we want.
Finishing the rest (and suppressing the warnings)…
FullFile$PNMED12 <- recode_factor(FullFile$PNMED12, `1` = "Yes", `2` = "No") FullFile$PNMED12M <- recode_factor(FullFile$PNMED12M, `1` = "Yes", `2` = "No") FullFile$NOTCOV <- recode_factor(FullFile$NOTCOV, `1` = "Not covered", `2` = "Covered") FullFile$COVER <- recode_factor(FullFile$COVER, `1` = "Private", `2` = "Medicaid and other public", `3` = "Other coverage", `4` = "Uninsured", `5` = "Do not know") FullFile$PLNWRKS1 <- recode_factor(FullFile$PLNWRKS1, `1` = "Through employer", `2` = "Through union", `3` = "Through workplace, but don't know if employer or union", `4` = "Through workplace, self-employed or professional association", `5` = "Purchased directly", `6` = "Through Healthcare.gov or the Affordable Care Act", `7` = "Through a state/local government or community program", `8` = "Other", `9` = "Through school", `10` = "Through parents", `11` = "Through relative other than parents") FullFile$HCSPFYR <- recode_factor(FullFile$HCSPFYR, `0` = "Zero", `1` = "Less than $500", `2` = "$500 - $1,999", `3` = "$2,000 - $2,999", `4` = "$3,000 - $4,999", `5` = "$5,000 or more") FullFile$MEDBILL <- recode_factor(FullFile$MEDBILL, `1` = "Yes", `2` = "No") FullFile$MEDBPAY <- recode_factor(FullFile$MEDBPAY, `1` = "Yes", `2` = "No") # I am thinking that earnings can be collapsed into three attributes: low; medium; high FullFile$EARNINGS <- recode_factor(FullFile$ERNYR, `1` = "$01-$34,999", `2` = "$01-$34,999", `3` = "$01-$34,999", `4` = "$01-$34,999", `5` = "$01-$34,999", `6` = "$01-$34,999", `7` = "$35,000-$74,999", `8` = "$35,000-$74,999", `9` = "$35,000-$74,999", `10` = "$35,000-$74,999", `11` = "$75,000 and over") # Education the same: low; medium; high FullFile$EDUCATION <- recode_factor(FullFile$EDUC1, `0` = "HSchool Grad or less", `1` = "HSchool Grad or less", `2` = "HSchool Grad or less", `3` = "HSchool Grad or less", `4` = "HSchool Grad or less", `5` = "HSchool Grad or less", `6` = "HSchool Grad or less", `7` = "HSchool Grad or less", `8` = "HSchool Grad or less", `9` = "HSchool Grad or less", `10` = "HSchool Grad or less", `11` = "HSchool Grad or less", `12` = "HSchool Grad or less", `13` = "HSchool Grad or less", `14` = "HSchool Grad or less", `15` = "Some college or AA degree", `16` = "Some college or AA degree", `17` = "Some college or AA degree", `18` = "Bachelor's or higher", `19` = "Bachelor's or higher", `20` = "Bachelor's or higher", `21` = "Bachelor's or higher") # Age also collapsed: low; medium; high FullFile$AGE <- cut(FullFile$AGE_P, breaks = c(-Inf, 18, 61, Inf), labels = c("Less than 18", "18 to 60", "More than 60"), right = FALSE) table(FullFile$EDUC1,FullFile$EDUCATION) ## ## HSchool Grad or less Some college or AA degree Bachelor's or higher ## 0 3424 0 0 ## 1 1679 0 0 ## 2 1633 0 0 ## 3 1787 0 0 ## 4 1716 0 0 ## 5 1822 0 0 ## 6 2726 0 0 ## 7 1860 0 0 ## 8 2644 0 0 ## 9 2908 0 0 ## 10 2884 0 0 ## 11 3186 0 0 ## 12 1774 0 0 ## 13 2290 0 0 ## 14 18413 0 0 ## 15 0 14822 0 ## 16 0 5413 0 ## 17 0 2966 0 ## 18 0 0 13883 ## 19 0 0 5956 ## 20 0 0 955 ## 21 0 0 1018 ## 96 0 0 0 ## 97 0 0 0 ## 98 0 0 0 ## 99 0 0 0
One easy way to see whether your recoding has had the desired effect is
to make a simple table that maps the original values to the new values.
That’s what I’ve done here for EDUCATION which shows the collapse of
categories and the fact that values like 99 have been mapped to NA
.
Finally let’s grab just the variables we really need including the newly recoded versions and make them a new dataset and take the full version out of memory.
OfInterest <- select(FullFile, AGE, REGION, SEX, EDUCATION, EARNINGS, PDMED12M, PNMED12M, NOTCOV, MEDBILL) str(OfInterest) ## 'data.frame': 103789 obs. of 9 variables: ## $ AGE : Factor w/ 3 levels "Less than 18",..: 2 2 2 1 1 2 2 2 3 2 ... ## $ REGION : Factor w/ 4 levels "Northeast","Midwest",..: 3 4 4 4 4 4 3 3 4 3 ... ## $ SEX : Factor w/ 2 levels "Male","Female": 1 2 1 1 2 2 2 1 1 1 ... ## $ EDUCATION: Factor w/ 3 levels "HSchool Grad or less",..: 2 1 3 1 NA 3 1 1 3 2 ... ## $ EARNINGS : Factor w/ 3 levels "$01-$34,999",..: 1 2 2 NA NA 3 1 NA 2 3 ... ## $ PDMED12M : Factor w/ 2 levels "Yes","No": 2 1 2 2 2 2 2 2 2 2 ... ## $ PNMED12M : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ... ## $ NOTCOV : Factor w/ 2 levels "Not covered",..: 2 2 2 2 2 2 2 2 2 2 ... ## $ MEDBILL : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 2 2 2 2 2 ... summary(OfInterest) ## AGE REGION SEX ## Less than 18:25680 Northeast:16883 Male :50257 ## 18 to 60 :58025 Midwest :20507 Female:53532 ## More than 60:20084 South :35826 ## West :30573 ## EDUCATION EARNINGS PDMED12M ## HSchool Grad or less :50746 $01-$34,999 :20390 Yes : 6744 ## Some college or AA degree:23201 $35,000-$74,999 :12927 No :96986 ## Bachelor's or higher :21812 $75,000 and over: 6413 NA's: 59 ## NA's : 8030 NA's :64059 ## PNMED12M NOTCOV MEDBILL ## Yes : 4978 Not covered:10506 Yes :16282 ## No :98741 Covered :92181 No :87096 ## NA's: 70 NA's : 1102 NA's: 411 ## rm(FullFile)
Driving Excel
Everything we have done so far has been in preparation for actually
providing my colleague the data she wanted in Excel so now on to the
main event. In this post I am deliberately going to NOT build
functions or use ggplot
to make the graphs she wanted. Those will be
topics for future posts.
You may have noticed that a table
is almost exactly what she wants. If
the variables of interest are AGE
and whether or not the person has
health care coverage (NOTCOV
) then the command would be
table(OfInterest$AGE,OfInterest$NOTCOV)
and we have results we need.
The openxlsx
package provides a write.xlsx
function that will accept the
table command and produce a properly formatted workbook as output. The
code below will produce SimpleExcelExample.xlsx
in your working
directory. The number of options is legion and ?write.xlsx
will
display them for you.
table(OfInterest$AGE,OfInterest$NOTCOV) ## ## Not covered Covered ## Less than 18 1312 24193 ## 18 to 60 8730 48484 ## More than 60 464 19504 write.xlsx(table(OfInterest$AGE,OfInterest$NOTCOV), file = "SimpleExcelExample.xlsx")
Finishing up the manual way
The last thing I will describe in this post is how to make one workbook
with multiple sheets. Where each sheet represents one variable pairing
(e.g. AGE
by NOTCOV
). We’ll even label the tabs (worksheets) in a
thoughtful way. So my colleague can easily see which sheet corresponds
to which variable pairing.
The code below follows this pattern
- Create a new empty workbook object
wb <- createWorkbook()
- Invent a name for the tab or worksheet inside the workbook
NameofSheet
- Make a
table
for a pair of variablesTheData <- table(OfInterest$EDUCATION,OfInterest$NOTCOV)
- Add a worksheet (tab) into the workbook
addWorksheet
- Write the table we made onto the worksheet with
writeData
- Repeat steps 2 through 5 12 times
### Manual and painful way ## Create a new workbook wb <- createWorkbook() # education by each of the other 4 variables NameofSheet <- "CoverageByEducation" TheData <- table(OfInterest$EDUCATION,OfInterest$NOTCOV) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "MedbillByEducation" TheData <- table(OfInterest$EDUCATION,OfInterest$MEDBILL) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "PDMED12MByEducation" TheData <- table(OfInterest$EDUCATION,OfInterest$PDMED12M) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "PNMED12MByEducation" TheData <- table(OfInterest$EDUCATION,OfInterest$PNMED12M) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") # earnings by each of the other 4 variables NameofSheet <- "CoverageByEarnings" TheData <- table(OfInterest$EARNINGS,OfInterest$NOTCOV) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "MedbillByEarnings" TheData <- table(OfInterest$EARNINGS,OfInterest$MEDBILL) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "PDMED12MByEarnings" TheData <- table(OfInterest$EARNINGS,OfInterest$PDMED12M) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "PNMED12MByEarnings" TheData <- table(OfInterest$EARNINGS,OfInterest$PNMED12M) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") # age by each of the other 4 variables NameofSheet <- "CoverageByAge" TheData <- table(OfInterest$AGE,OfInterest$NOTCOV) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "MedbillByAge" TheData <- table(OfInterest$AGE,OfInterest$MEDBILL) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "PDMED12MByAge" TheData <- table(OfInterest$AGE,OfInterest$PDMED12M) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") NameofSheet <- "PNMED12MByAge" TheData <- table(OfInterest$AGE,OfInterest$PNMED12M) addWorksheet(wb = wb, sheetName = NameofSheet) writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n") # saveWorkbook(wb, "BetterExcelExample.xlsx", overwrite = TRUE) ## save to working directory
All done (not really)
Hopefully this post helps you understand how to use the openxlsx
package to have R
drive Excel to help you with your data analysis. In
my next post I’ll build on this scaffolding to discuss how to make these
very same graphs in ggplot2
(which IMHO runs circles around Excel for
scientific plotting), as well as making this all more efficient through
the use of functions to take care of some of the more repetitive chores.
I hope you’ve found this useful. I am always open to comments, corrections and suggestions.
Chuck (ibecav at gmail dot com)
License
This
work is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.