TidyR Challenge: Help Me Do My Job
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week I was handed a drug prescription data set and asked to create some interesting graphics. But before I can even get to the fun part, I was faced with actually transforming the set into something that ggplot2 could read.
Obviously I can’t share the data, but Tyler Rinker has created a fantastic package called wakefield that creates “random data sets quickly” and I was able to mimic the prescription data structure exactly!
Fake Data with Wakefield
So the fake data set below contains one row per patient (e.g. ID) with lots and lots of columns that encapsulates three different drug types (foo, bar, and baz) with their prescription fill date and prescription supply amount:
library(wakefield) library(tidyr) library(dplyr) d <- r_data_frame( n=100, id, r_series(date_stamp,15,name='foo_date'), r_series(level,15,name='foo_supply'), r_series(date_stamp,10,name='bar_date'), r_series(level,10,name='bar_supply'), r_series(date_stamp,3,name='baz_date'), r_series(level,3,name='baz_supply') ) names(d) ## [1] "ID" "foo_date_1" "foo_date_2" "foo_date_3" ## [5] "foo_date_4" "foo_date_5" "foo_date_6" "foo_date_7" ## [9] "foo_date_8" "foo_date_9" "foo_date_10" "foo_date_11" ## [13] "foo_date_12" "foo_date_13" "foo_date_14" "foo_date_15" ## [17] "foo_supply_1" "foo_supply_2" "foo_supply_3" "foo_supply_4" ## [21] "foo_supply_5" "foo_supply_6" "foo_supply_7" "foo_supply_8" ## [25] "foo_supply_9" "foo_supply_10" "foo_supply_11" "foo_supply_12" ## [29] "foo_supply_13" "foo_supply_14" "foo_supply_15" "bar_date_1" ## [33] "bar_date_2" "bar_date_3" "bar_date_4" "bar_date_5" ## [37] "bar_date_6" "bar_date_7" "bar_date_8" "bar_date_9" ## [41] "bar_date_10" "bar_supply_1" "bar_supply_2" "bar_supply_3" ## [45] "bar_supply_4" "bar_supply_5" "bar_supply_6" "bar_supply_7" ## [49] "bar_supply_8" "bar_supply_9" "bar_supply_10" "baz_date_1" ## [53] "baz_date_2" "baz_date_3" "baz_supply_1" "baz_supply_2" ## [57] "baz_supply_3"
And you can see from the names of the columns there can be a maximum of 15 foo prescriptions, a maximum of 10 bar prescriptions, and a maximum of 3 baz prescriptions per patient.
How Can I Create A TidyR Data Set?
Here’s my solution, but as you can see there’s 3 sections of repeated code and surely they can be compacted into an elegant and beautiful piece of R poetry. Can you help me do my job?
# foo med_dates <- d %>% select(ID,foo_date_1:foo_date_15) %>% gather(med_seq, med_date, foo_date_1:foo_date_15) med_dates$med_seq <- as.integer(sub('^foo_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,foo_supply_1:foo_supply_15) %>% gather(med_seq, med_supply, foo_supply_1:foo_supply_15) med_supply$med_seq <- as.integer(sub('^foo_supply_','',med_supply$med_seq)) foo <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) foo$med_name <- 'foo' # bar med_dates <- d %>% select(ID,bar_date_1:bar_date_10) %>% gather(med_seq, med_date, bar_date_1:bar_date_10) med_dates$med_seq <- as.integer(sub('^bar_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,bar_supply_1:bar_supply_10) %>% gather(med_seq, med_supply, bar_supply_1:bar_supply_10) med_supply$med_seq <- as.integer(sub('^bar_supply_','',med_supply$med_seq)) bar <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) bar$med_name <- 'bar' # baz med_dates <- d %>% select(ID,baz_date_1:baz_date_3) %>% gather(med_seq, med_date, baz_date_1:baz_date_3) med_dates$med_seq <- as.integer(sub('^baz_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,baz_supply_1:baz_supply_3) %>% gather(med_seq, med_supply, baz_supply_1:baz_supply_3) med_supply$med_seq <- as.integer(sub('^baz_supply_','',med_supply$med_seq)) baz <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) baz$med_name <- 'baz' scripts <- rbind(foo,bar,baz) scripts ## Source: local data frame [2,800 x 4] ## ## ID med_date med_supply med_name ## 1 001 2014-06-06 1 foo ## 2 002 2014-06-06 2 foo ## 3 003 2014-06-06 2 foo ## 4 004 2014-06-06 1 foo ## 5 005 2014-06-06 1 foo ## 6 006 2014-06-06 4 foo ## 7 007 2014-06-06 1 foo ## 8 008 2014-06-06 1 foo ## 9 009 2014-06-06 1 foo ## 10 010 2014-06-06 2 foo ## .. ... ... ... ...
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.