Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We often find ourselves tidying and reshaping data. Here we consider the two packages tidyr and reshape2, our aim is to see where their purposes overlap and where they differ by comparing the functions gather()
, separate()
and spread()
, from tidyr, with the functions melt()
, colsplit()
and dcast()
, from reshape2.
Data tidying
Data tidying is the operation of transforming data into a clear and simple form that makes it easy to work with. “Tidy data” represent the information from a dataset as data frames where each row is an observation and each column contains the values of a variable (i.e. an attribute of what we are observing). Compare the two data frames below (cf.Wickham (2014)) to get an idea of the differences: example.tidy
is the tidy version of example.messy
, the same information is organized in two different ways.
example.messy
## treatmenta treatmentb ## John Smith NA 2 ## Jane Doe 16 11 ## Mary Johnson 3 1
example.tidy
## name trt result ## 1 John Smith treatmenta NA ## 2 Jane Doe treatmenta 16 ## 3 Mary Johnson treatmenta 3 ## 4 John Smith treatmentb 2 ## 5 Jane Doe treatmentb 11 ## 6 Mary Johnson treatmentb 1
From the wide to the long format: gather() vs melt()
We now begin by seeing in action how we can bring data from the “wide” to the “long” format.
Let’s start loading the packages we need:
library(tidyr) library(reshape2)
and some data (from RStudio Blog – Introducing tidyr): we have measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.
set.seed(10) messy <- data.frame(id = 1:4, trt = sample(rep(c('control', 'treatment'), each = 2)), work.T1 = runif(4), home.T1 = runif(4), work.T2 = runif(4), home.T2 = runif(4)) messy
## id trt work.T1 home.T1 work.T2 home.T2 ## 1 1 treatment 0.08513597 0.6158293 0.1135090 0.05190332 ## 2 2 control 0.22543662 0.4296715 0.5959253 0.26417767 ## 3 3 treatment 0.27453052 0.6516557 0.3580500 0.39879073 ## 4 4 control 0.27230507 0.5677378 0.4288094 0.83613414
Our first step is to put the data in the tidy format, to do that we use tidyr’s functions gather()
and separate()
. Following Wickham’s tidy data definition, this data frame is not tidy because some variable values are in the column names. We bring this messy data frame from the wide to the long format by using the gather()
function (give a look at Sean C. Anderson – An Introduction to reshape2 to get an idea of the wide/long format). We want to gather all the columns, except for the id and trt ones, in two columns key and value:
gathered.messy <- gather(messy, key, value, -id, -trt) head(gathered.messy)
## id trt key value ## 1 1 treatment work.T1 0.08513597 ## 2 2 control work.T1 0.22543662 ## 3 3 treatment work.T1 0.27453052 ## 4 4 control work.T1 0.27230507 ## 5 1 treatment home.T1 0.61582931 ## 6 2 control home.T1 0.42967153
Note that in gather()
we used bare variable names to specify the names of the key, value, id and trt columns.
We can get the same result with the melt()
function from reshape2:
molten.messy <- melt(messy, variable.name = "key", value.names = "value", id.vars = c("id", "trt")) head(molten.messy)
## id trt key value ## 1 1 treatment work.T1 0.08513597 ## 2 2 control work.T1 0.22543662 ## 3 3 treatment work.T1 0.27453052 ## 4 4 control work.T1 0.27230507 ## 5 1 treatment home.T1 0.61582931 ## 6 2 control home.T1 0.42967153
We now compare the two functions by running them over the data without any further parameter and see what happen:
head(gather(messy))
## Warning: attributes are not identical across measure variables; they will ## be dropped
## key value ## 1 id 1 ## 2 id 2 ## 3 id 3 ## 4 id 4 ## 5 trt treatment ## 6 trt control
head(melt(messy))
## Using trt as id variables
## trt variable value ## 1 treatment id 1.00000000 ## 2 control id 2.00000000 ## 3 treatment id 3.00000000 ## 4 control id 4.00000000 ## 5 treatment work.T1 0.08513597 ## 6 control work.T1 0.22543662
We see a different behaviour: gather()
has brought messy
into a long data format with a warning by treating all columns as variable, while melt()
has treated trt as an “id variables”. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. Indeed, if melt()
does not receive any id.variables specification, then it will use the factor or character columns as id variables. gather()
requires the columns that needs to be treated as ids, all the other columns are going to be used as key-value pairs.
Despite those last different results, we have seen that the two functions can be used to perform the exactly same operations on data frames, and only on data frames! Indeed, gather()
cannot handle matrices or arrays, while melt()
can as shown below.
set.seed(3) M <- matrix(rnorm(6), ncol = 3) dimnames(M) <- list(letters[1:2], letters[1:3]) melt(M)
## Var1 Var2 value ## 1 a a -0.96193342 ## 2 b a -0.29252572 ## 3 a b 0.25878822 ## 4 b b -1.15213189 ## 5 a c 0.19578283 ## 6 b c 0.03012394
gather(M)
## Error in UseMethod("gather_"): no applicable method for 'gather_' applied to an object of class "c('matrix', 'double', 'numeric')"
Split a column: separate() vs colsplit()
Our next step is to split the column key into two different columns in order to separate the location and time variables and obtain a tidy data frame:
tidy <- separate(gathered.messy, key, into = c("location", "time"), sep = "\.") res.tidy <- cbind(molten.messy[1:2], colsplit(molten.messy[, 3], "\.", c("location", "time")), molten.messy[4]) head(tidy)
## id trt location time value ## 1 1 treatment work T1 0.08513597 ## 2 2 control work T1 0.22543662 ## 3 3 treatment work T1 0.27453052 ## 4 4 control work T1 0.27230507 ## 5 1 treatment home T1 0.61582931 ## 6 2 control home T1 0.42967153
head(res.tidy)
## id trt location time value ## 1 1 treatment work T1 0.08513597 ## 2 2 control work T1 0.22543662 ## 3 3 treatment work T1 0.27453052 ## 4 4 control work T1 0.27230507 ## 5 1 treatment home T1 0.61582931 ## 6 2 control home T1 0.42967153
Again, the result is the same but we need a workaround: because colsplit()
operates only on a single column we usecbind()
to insert the new two columns in the data frame. separate()
performs all the operation at once reducing the possibility of making mistakes.
From the long to the wide format: spread() vs dcast()
Finally, we compare spread()
with dcast()
using the data frame example for the spread()
documentation itself. Briefly,spread()
is complementary to gather()
and brings data from the long to the wide format.
set.seed(14) stocks <- data.frame(time = as.Date('2009-01-01') + 0:9, X = rnorm(10, 0, 1), Y = rnorm(10, 0, 2), Z = rnorm(10, 0, 4)) stocksm <- gather(stocks, stock, price, -time) spread.stock <- spread(stocksm, stock, price) head(spread.stock)
## time X Y Z ## 1 2009-01-01 -0.66184983 -0.7656438 -5.0672590 ## 2 2009-01-02 1.71895416 0.5988432 -0.7943331 ## 3 2009-01-03 2.12166699 1.3484795 0.5554631 ## 4 2009-01-04 1.49715368 -0.5856326 -1.1173440 ## 5 2009-01-05 -0.03614058 0.9761067 2.8356777 ## 6 2009-01-06 1.23194518 1.7656036 -3.0664418
cast.stock <- dcast(stocksm, formula = time ~ stock, value.var = "price") head(cast.stock)
## time X Y Z ## 1 2009-01-01 -0.66184983 -0.7656438 -5.0672590 ## 2 2009-01-02 1.71895416 0.5988432 -0.7943331 ## 3 2009-01-03 2.12166699 1.3484795 0.5554631 ## 4 2009-01-04 1.49715368 -0.5856326 -1.1173440 ## 5 2009-01-05 -0.03614058 0.9761067 2.8356777 ## 6 2009-01-06 1.23194518 1.7656036 -3.0664418
Again, the same result produced by spread()
can be obtained using dcast()
by specifying the correct formula
.
In the next session, we are going to modify the formula
parameter in order to perform some data aggregation and compare further the two packages.
Data aggregation
Up to now we made reshape2 following tidyr, showing that everything you can do with tidyr can be achieved by reshape2, too, at the price of a some workarounds. As we now go on with our simple example we will get out of the purposes of tidyr and have no more functions available for our needs. Now we have a tidy data set – one observation per row and one variable per column – to work with. We show some aggregations that are possible with dcast()
using the tips
data frame from reshape2. Tips
contains the information one waiter recorded about each tip he received over a period of a few months working in one restaurant.
head(tips)
## total_bill tip sex smoker day time size ## 1 16.99 1.01 Female No Sun Dinner 2 ## 2 10.34 1.66 Male No Sun Dinner 3 ## 3 21.01 3.50 Male No Sun Dinner 3 ## 4 23.68 3.31 Male No Sun Dinner 2 ## 5 24.59 3.61 Female No Sun Dinner 4 ## 6 25.29 4.71 Male No Sun Dinner 4
m.tips <- melt(tips)
## Using sex, smoker, day, time as id variables
head(m.tips)
## sex smoker day time variable value ## 1 Female No Sun Dinner total_bill 16.99 ## 2 Male No Sun Dinner total_bill 10.34 ## 3 Male No Sun Dinner total_bill 21.01 ## 4 Male No Sun Dinner total_bill 23.68 ## 5 Female No Sun Dinner total_bill 24.59 ## 6 Male No Sun Dinner total_bill 25.29
We use dcast()
to get information on the average total bill, tip and group size per day and time:
dcast(m.tips, day+time ~ variable, mean)
## day time total_bill tip size ## 1 Fri Dinner 19.66333 2.940000 2.166667 ## 2 Fri Lunch 12.84571 2.382857 2.000000 ## 3 Sat Dinner 20.44138 2.993103 2.517241 ## 4 Sun Dinner 21.41000 3.255132 2.842105 ## 5 Thur Dinner 18.78000 3.000000 2.000000 ## 6 Thur Lunch 17.66475 2.767705 2.459016
Averages per smoker or not in the group.
dcast(m.tips, smoker ~ variable, mean)
## smoker total_bill tip size ## 1 No 19.18828 2.991854 2.668874 ## 2 Yes 20.75634 3.008710 2.408602
There is no function in the tidyr package that allows us to perform a similar operation, the reason is that tidyr is designed only for data tidying and not for data reshaping.
Conclusions
At the beginning we have seen tidyr and reshape2 functions performing the same operations, therefore, suggesting that the two packages are similar, if not equal in what they do; lately, we have seen that reshape2’s functions can do data aggregation that is not possible with tidyr. Indeed, tidyr’s aim is data tidying while reshape2 has the wider purpose of data reshaping and aggregating. It follows that tidyr syntax is easier to understand and to work with, but its functionalities are limited. Therefore, we use tidyr gather()
and separate()
functions to quickly tidy our data and reshape2dcast()
to aggregate them.
Further readings:
- RStudio Blog – Introducing tidyr
- R-blogger – Reshape and aggregate data with the R package reshape2.
- reshape2 – R Documentation
- tidyr – R Documentation
References:
Wickham, Hadley. 2014. “Tidy data.” Journal of Statistical Software 59 (10).
The post How to reshape data in R: tidyr vs reshape2 appeared first on MilanoR.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.