The concept of “wide data” is relative. In some domains 100 columns is considered “wide”, while in others that’s perfectly normal and you’d need to have thousands (or tens of thousands!) of columns for it to be considered even remotely “wide”. The data that we work with at Fathom Data generally lies in the first domain, but from time to time we do work on data that is considerably wider.
This post touches on a couple of approaches for dealing with that sort of data. We’ll be using some HCRIS (Healthcare Cost Report Information System) data, which are available for download here. Specifically, we’ll be working with an extract from the hcris2552_10_2017.csv file, which contains “select variables in flat shape”.
How Wide is it?
How many columns are there in the CSV file?
awk -F, 'NR == 1 {print NF}' hcris2552_10_2017.csv
1708
This qualifies as “wide” in our book. The original file also has a large number of rows, but for this post we’ll only be looking at the first 100.
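If you’re working from the full file, one way to replicate this 100-row setup (just a sketch, and not necessarily how the extract used here was produced) is to have read_csv() stop early:

# Read only the first 100 rows of the (very wide) CSV.
hcris <- readr::read_csv("hcris2552_10_2017.csv", n_max = 100)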
Opening Gambit: Cram it into a Database
My initial thought was to just pop the entire file contents into a database and then do the processing in SQL. But I certainly don’t have the fortitude to create an appropriate table definition for 1708 columns by hand. Fortunately, the csvsql tool from the csvkit package will generate that table definition for me. Zero fortitude required.
csvsql --dialect postgresql hcris2552_10_2017.csv
Take a look at the resulting SQL script: it’s neatly formatted and includes appropriate constraints where relevant.
Useful options for csvsql (there are many others, but these are the ones that I’ve used frequently):
- --dialect — the required flavour of SQL (choose from firebird, mssql, mysql, oracle, postgresql, sqlite, and sybase);
- --date-format and --datetime-format — the format of date/time fields in the CSV (using strptime() placeholders);
- --no-create — don’t output CREATE TABLE, just INSERT; and
- --snifflimit — number of bytes to read for determining CSV format.
A Problem: Column Limit in PostgreSQL
But there’s a problem: the resulting table has too many columns! (PostgreSQL has a hard limit of 1600 columns per table.)
There are some options for fixing this (there are undoubtedly others, but these are the ones that came to mind immediately):
- discard some columns (starting with the empty ones) or
- split into multiple tables.
Neither of these was particularly palatable. So, although csvsql could magically produce a table definition for me, I had to resort to another approach.
Tidy Data
Looking critically at the data, I realised that most of the “columns” should not be columns at all. This was long data stored in wide format. To put it bluntly, this was untidy data.
We generally try to work with tidy data, which means that it satisfies the following rules:
- each variable must have its own column;
- each observation must have its own row; and
- each value must have its own cell.
The HCRIS data flagrantly violates rules 2 and 3.
Fortunately, R has tooling in the {tidyr} package that makes it simple to transform this kind of deviant data into something less perverse.
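To make the pivoting idea concrete before we apply it in anger, here’s a toy illustration on invented data (nothing to do with HCRIS): the q1 and q2 headers are really values of a “quarter” variable, so the table is long data masquerading as wide.

library(tidyr)

# Wide and untidy: the q1/q2 column names are values, not variables.
untidy <- data.frame(id = c(1, 2), q1 = c(10, 20), q2 = c(30, 40))

# Long and tidy: one row per (id, quarter) observation.
pivot_longer(untidy, -id, names_to = "quarter", values_to = "value")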
Change of Plan
Pull in a few fundamental R libraries.
library(readr)
library(dplyr)
library(tidyr)
library(janitor)
Use read_csv() to swiftly load the HCRIS data into a data frame.
hcris <- read_csv("hcris2552_10_2017.csv")
How many rows and columns?
dim(hcris)
[1]  100 1708
What kind of data are we looking at?
table(sapply(hcris, class))

character   logical   numeric
       61       624      1023
Mostly numeric columns, with a decent number of logical and a smattering of character. A quick check reveals that almost all of the logical columns only have NA entries. These are effectively empty columns and they’re just wasting space.
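That quick check might look something like this (one way of doing it, at least):

# Pick out the logical columns, then ask which of them are entirely NA.
logical_cols <- sapply(hcris, is.logical)
all_na <- sapply(hcris[logical_cols], function(x) all(is.na(x)))
# How many of the 624 logical columns are completely empty?
sum(all_na)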
Removing Empty Columns
The first thing that we’ll do is strip out the empty columns using {janitor}.
hcris <- remove_empty(hcris, "cols")
How many columns are we left with?
ncol(hcris)
[1] 1085
Well, that would fit into a PostgreSQL table now, but we’ve already started down this road, so let’s press on.
table(sapply(hcris, class))

character   logical   numeric
       61         1      1023
We still have a mixture of column types. And one of them is logical. Since we’ve already removed the empty columns, that one must really contain Boolean data. I did a quick investigation and found that the column concerned, util_cd, contains text data but was deemed to be logical since all of the values were "F" (which gets translated into FALSE). Digging a little further (looking at data for other years) I found that this is a categorical column (no surprise given its name) with levels "F", "N" and "L". So let’s fix this properly.
hcris <- hcris %>%
  mutate(
    util_cd = factor(
      ifelse(is.logical(util_cd) & !util_cd, "F", NA),
      levels = c("F", "N", "L")
    )
  )
One last look at the column types and counts.
table(sapply(hcris, class))

character    factor   numeric
       61         1      1023
Yup, that looks about right.
Pivoting
We’re ready to pivot the data. But there are still a few issues that we need to handle. One problem is that the columns that we are going to pivot don’t have a single type: there are both numeric and character values. We also need to identify the columns that we are not going to pivot.
# Unique row identifier.
#
RECID <- "rpt_rec_num"

# What columns are not going to pivot?
#
COMMON <- c(
  "prvdr_ctrl_type_cd", "prvdr_num", "rpt_stus_cd", "initl_rpt_sw",
  "last_rpt_sw", "trnsmtl_num", "fi_num", "adr_vndr_cd", "util_cd",
  "fy_bgn_dt", "fy_end_dt", "proc_dt", "fi_creat_dt", "fi_rcpt_dt",
  "fyear"
)
We’ll create three separate data frames: one with common columns (not to be pivoted), another with numeric columns and finally one with character columns.
hcris_common <- hcris %>%
  select(all_of(c(RECID, COMMON)))

hcris_numeric <- hcris %>%
  select(
    any_of(RECID) |          # Include unique ID
      where(is.numeric) &    # and all numeric columns
      !any_of(COMMON)        # but not common columns.
  )

hcris_character <- hcris %>%
  select(
    any_of(RECID) |          # Include unique ID
      where(is.character) &  # and all character columns
      !any_of(COMMON)        # but not common columns.
  )
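As a quick sanity check (my own addition, not strictly necessary), every column of hcris should land in exactly one of the three frames, with rpt_rec_num counted three times:

# The -2 accounts for rpt_rec_num appearing in all three data frames.
ncol(hcris_common) + ncol(hcris_numeric) + ncol(hcris_character) - 2 == ncol(hcris)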
Okay, so let’s pivot the numeric data.
hcris_numeric <- hcris_numeric %>%
  pivot_longer(
    -all_of(RECID),
    names_to = "cell",
    values_to = "nmrc",
    values_drop_na = TRUE
  )

# A tibble: 30,202 × 3
   rpt_rec_num cell           nmrc
         <dbl> <chr>         <dbl>
 1      629478 a_c1_30    25350799
 2      629478 a_c2_30     8295648
 3      629478 a_c6_30     -112553
 4      629478 a_c2_113    2977228
 5      629478 a_c1_200  160050242
 6      629478 a_c2_200  184197641
 7      629478 a_c6_200  -66508288
 8      629478 a7_1_c1_1  20064648
 9      629478 a7_1_c2_3   1016540
10      629478 a7_1_c1_5 104979100
# … with 30,192 more rows
Looks good! Now the character data.
hcris_character <- hcris_character %>%
  pivot_longer(
    -all_of(RECID),
    names_to = "cell",
    values_to = "alpha",
    values_drop_na = TRUE
  )
Unpack Field Name
It might not surprise you to learn that the cell field in the pivoted data (previously the column names) contains values for multiple quantities. The HCRIS documentation indicates how this field can be unpacked to yield a worksheet code (wksht_cd), line and column number.
A quick note about the data format seems in order. These data encode information from a large spreadsheet consisting of multiple worksheets. The cell column is an aggregate of the worksheet code, line and column number for the corresponding entry in the spreadsheet.
We could use separate() to tease this field apart. However, there are several intricacies and exceptions which are somewhat fiddly to deal with, so I’ll postpone those for another day.
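That said, here’s a naive first cut using extract() to give a flavour of it. This is a sketch only: it assumes that every cell value matches the pattern <worksheet>_c<column>_<line> (true of all the values shown above, but the exceptions just mentioned will break it), and the clmn_num and line_num names are my own placeholders.

# Split cell into worksheet code, column number and line number.
hcris_numeric %>%
  extract(
    cell,
    into = c("wksht_cd", "clmn_num", "line_num"),
    regex = "^(.+)_c(\\d+)_(\\d+)$"
  )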
Gambit Revisited
I didn’t show the SQL produced by csvsql because, well, there was a lot of it. But it’s still interesting to take a look. Let’s dump one of the pivoted tables to a CSV file now and run csvsql over it.
write_csv(hcris_numeric, file = "hcris-numeric.csv")
Take a look at the file.
head hcris-numeric.csv

rpt_rec_num,cell,nmrc
629478,a_c1_30,25350799
629478,a_c2_30,8295648
629478,a_c6_30,-112553
629478,a_c2_113,2977228
629478,a_c1_200,160050242
629478,a_c2_200,184197641
629478,a_c6_200,-66508288
629478,a7_1_c1_1,20064648
629478,a7_1_c2_3,1016540
And now run csvsql to generate a table definition suitable for PostgreSQL.
csvsql --dialect postgresql hcris-numeric.csv

CREATE TABLE "hcris-numeric" (
    rpt_rec_num DECIMAL NOT NULL,
    cell VARCHAR(15) NOT NULL,
    nmrc DECIMAL NOT NULL
);
What about SQLite with a more civilised table name and no constraints on field length?
csvsql --dialect sqlite --tables hcris_nmrc --no-constraints hcris-numeric.csv

CREATE TABLE hcris_nmrc (
    rpt_rec_num DECIMAL,
    cell VARCHAR,
    nmrc DECIMAL
);
This really is a handy tool and it’ll save you a bunch of time. Even if you need to tweak the table definition a bit, being able to automagically get something that works straight away is priceless.
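Of course, if you’re staying inside R anyway then DBI will infer a schema for you too. A minimal sketch (assuming that the {RSQLite} package is installed):

library(DBI)

# dbWriteTable() derives the column types directly from the data frame.
con <- dbConnect(RSQLite::SQLite(), "hcris.sqlite")
dbWriteTable(con, "hcris_nmrc", hcris_numeric)
dbDisconnect(con)

csvsql still earns its keep when the data need to be loaded from outside of R.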
Summary
So, just to pull this all together: we started with a CSV file with a large number of columns. First we attempted to use csvsql to create a SQL table definition that would allow us to copy the raw data straight into a database. However, the resulting table was too wide for the database to ingest. So, instead, we pivoted the data into a long format. Not only does this format translate easily into a SQL table, but it’s also a lot easier to work with for analytical purposes.