Site icon R-bloggers

TidyR Challenge: Help Me Do My Job

[This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week I was handed a drug prescription data set and asked to create some interesting graphics. But before I can even get to the fun part, I was faced with actually transforming the set into something that ggplot2 could read.

Obviously I can’t share the data, but Tyler Rinker has created a fantastic package called wakefield that creates “random data sets quickly” and I was able to mimic the prescription data structure exactly!

Fake Data with Wakefield

So the fake data set below contains one row per patient (e.g. ID) with lots and lots of columns that encapsulates three different drug types (foo, bar, and baz) with their prescription fill date and prescription supply amount:

library(wakefield)
library(tidyr)
library(dplyr)
d <- r_data_frame(
  n=100,
  id,
  r_series(date_stamp,15,name='foo_date'),
  r_series(level,15,name='foo_supply'),
  r_series(date_stamp,10,name='bar_date'),
  r_series(level,10,name='bar_supply'),
  r_series(date_stamp,3,name='baz_date'),
  r_series(level,3,name='baz_supply')
)
names(d)

##  [1] "ID"            "foo_date_1"    "foo_date_2"    "foo_date_3"   
##  [5] "foo_date_4"    "foo_date_5"    "foo_date_6"    "foo_date_7"   
##  [9] "foo_date_8"    "foo_date_9"    "foo_date_10"   "foo_date_11"  
## [13] "foo_date_12"   "foo_date_13"   "foo_date_14"   "foo_date_15"  
## [17] "foo_supply_1"  "foo_supply_2"  "foo_supply_3"  "foo_supply_4" 
## [21] "foo_supply_5"  "foo_supply_6"  "foo_supply_7"  "foo_supply_8" 
## [25] "foo_supply_9"  "foo_supply_10" "foo_supply_11" "foo_supply_12"
## [29] "foo_supply_13" "foo_supply_14" "foo_supply_15" "bar_date_1"   
## [33] "bar_date_2"    "bar_date_3"    "bar_date_4"    "bar_date_5"   
## [37] "bar_date_6"    "bar_date_7"    "bar_date_8"    "bar_date_9"   
## [41] "bar_date_10"   "bar_supply_1"  "bar_supply_2"  "bar_supply_3" 
## [45] "bar_supply_4"  "bar_supply_5"  "bar_supply_6"  "bar_supply_7" 
## [49] "bar_supply_8"  "bar_supply_9"  "bar_supply_10" "baz_date_1"   
## [53] "baz_date_2"    "baz_date_3"    "baz_supply_1"  "baz_supply_2" 
## [57] "baz_supply_3"

And you can see from the names of the columns there can be a maximum of 15 foo prescriptions, a maximum of 10 bar prescriptions, and a maximum of 3 baz prescriptions per patient.

How Can I Create A TidyR Data Set?

Here’s my solution, but as you can see there’s 3 sections of repeated code and surely they can be compacted into an elegant and beautiful piece of R poetry. Can you help me do my job?

# foo
med_dates <- d %>% 
    select(ID,foo_date_1:foo_date_15) %>% 
    gather(med_seq, med_date, foo_date_1:foo_date_15)
med_dates$med_seq <- as.integer(sub('^foo_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,foo_supply_1:foo_supply_15) %>% 
    gather(med_seq, med_supply, foo_supply_1:foo_supply_15)
med_supply$med_seq <- as.integer(sub('^foo_supply_','',med_supply$med_seq))
foo <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
foo$med_name <- 'foo'

# bar
med_dates <- d %>%
select(ID,bar_date_1:bar_date_10) %>%
gather(med_seq, med_date, bar_date_1:bar_date_10)
med_dates$med_seq <- as.integer(sub('^bar_date_','',med_dates$med_seq))
med_supply <- d %>%
select(ID,bar_supply_1:bar_supply_10) %>%
gather(med_seq, med_supply, bar_supply_1:bar_supply_10)
med_supply$med_seq <- as.integer(sub('^bar_supply_','',med_supply$med_seq))
bar <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>%
select(ID,med_date,med_supply)
bar$med_name <- 'bar'

# baz
med_dates <- d %>%
select(ID,baz_date_1:baz_date_3) %>%
gather(med_seq, med_date, baz_date_1:baz_date_3)
med_dates$med_seq <- as.integer(sub('^baz_date_','',med_dates$med_seq))
med_supply <- d %>%
select(ID,baz_supply_1:baz_supply_3) %>%
gather(med_seq, med_supply, baz_supply_1:baz_supply_3)
med_supply$med_seq <- as.integer(sub('^baz_supply_','',med_supply$med_seq))
baz <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>%
select(ID,med_date,med_supply)
baz$med_name <- 'baz'
scripts <- rbind(foo,bar,baz)
scripts

## Source: local data frame [2,800 x 4]
## 
##     ID   med_date med_supply med_name
## 1  001 2014-06-06          1      foo
## 2  002 2014-06-06          2      foo
## 3  003 2014-06-06          2      foo
## 4  004 2014-06-06          1      foo
## 5  005 2014-06-06          1      foo
## 6  006 2014-06-06          4      foo
## 7  007 2014-06-06          1      foo
## 8  008 2014-06-06          1      foo
## 9  009 2014-06-06          1      foo
## 10 010 2014-06-06          2      foo
## .. ...        ...        ...      ...
To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.