Reshape and aggregate data with the R package reshape2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Creating molten data
Instead of thinking about data in terms of a matrix or a data frame where we have observations in the rows and variables in the columns, we need to think of the variables as divided in two groups: identifier and measured variables.
Identifier variables (id) identify the unit that measurements take place on. In the data.frame
below subject
and time
are the two id variables and age
, weight
and height
are the measured variables.
require(reshape2) x = data.frame(subject = c("John", "Mary"), time = c(1,1), age = c(33,NA), weight = c(90, NA), height = c(2,2)) x subject time age weight height 1 John 1 33 90 2 2 Mary 1 NA NA 2
We can go further and say that there are only id variables and a value, where the id variables also identify what measured variable the value represents. Then each row will represent one observation of one variable. This operation, called melting, produces molten data and can be obtained with the melt
function of the R package reshape2
.
molten = melt(x, id = c("subject", "time")) molten subject time variable value 1 John 1 age 33 2 Mary 1 age NA 3 John 1 weight 90 4 Mary 1 weight NA 5 John 1 height 2 6 Mary 1 height 2
All measured variables must be of the same type, e.g., numeric, factor, date. This is required because molten data is stored in a R data frame, and the value column can assume only one type.
Missing values
In a molten data format it is possible to encode all missing values implicitly, by omitting that combination of id variables, rather than explicitly, with an NA
value. However, by doing this you are throwing data away that might be useful in your analysis. Because of that, in order to represent NA
implicitly, i.e., by removing the row with NA
, we need to set na.rm = TRUE
in the call to melt. Otherwise, NA
will be present in the molten data by default.
molten = melt(x, id = c("subject", "time"), na.rm = TRUE) molten subject time variable value 1 John 1 age 33 3 John 1 weight 90 5 John 1 height 2 6 Mary 1 height 2
Reshaping your data
Now that you have a molten data you can reshape it into a data frame using dcast
function or into a vector/matrix/array using the acast
function. The basic arguments of *cast
is the molten data and a formula of the form x1 + x2 ~ y1 + y2
. The order of the variables matter, the first varies slowest, and the last fastest. There are a couple of special variables: “...
” represents all other variables not used in the formula and “.
” represents no variable, so you can do formula = x1 ~ .
It is easier to understand the way it works by doing it yourself. Try different options and see what happens:
dcast(molten, formula = time + subject ~ variable) dcast(molten, formula = subject + time ~ variable) dcast(molten, formula = subject ~ variable) dcast(molten, formula = ... ~ variable)
It is also possible to create higher dimension arrays by using more than one ~
in the formula. For example,
acast(molten, formula = subject ~ time ~ variable)
From [3], we see that we can also supply a list of quoted expressions, in the form list (.(x1, x1), .(y1, y2), .(z))
. The advantage of this form is that you can cast based on transformations of the variables: list(.(a + b), (c = round(c)))
. To use this we need the function ‘.
‘ from the plyr package to form a list of unevaluated expressions.
The function ‘.
‘ can also be used to form more complicated expressions to be used within the subset
argument of the *cast
function:
data(airquality) aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE) library(plyr) # needed to access . function acast(aqm, variable ~ Month, mean, subset = .(variable == "Ozone"))
Aggregation
Aggregation occurs when the combination of variables in the *cast
formula does not identify individuals observations. In this case an aggregation function reduces the multiple values to a single one. We can specify the aggregation function using the fun.aggregate
argument, which default to length
(with a warning). Further arguments can be passed to the aggregation function through ‘...
‘ in *cast
.
# Melt French Fries dataset data(french_fries) ffm <- melt(french_fries, id = 1:4, na.rm = TRUE) # Aggregate examples - all 3 yield the same result dcast(ffm, treatment ~ .) dcast(ffm, treatment ~ ., function(x) length(x)) dcast(ffm, treatment ~ ., length) # Passing further arguments through ... dcast(ffm, treatment ~ ., sum) dcast(ffm, treatment ~ ., sum, trim = 0.1)
Margins
To add statistics to the margins of your table, we can set the argument margins
appropriately. Set margins = TRUE
to display all margins or list individual variables in a character vector. Note that changing the order and position of the variables in the *cast
formula affects the margins that can be computed.
dcast(ffm, treatment + rep ~ time, sum, margins = "rep") dcast(ffm, rep + treatment ~ time, sum, margins = "treatment") dcast(ffm, treatment + rep ~ time, sum, margins = TRUE)
A nice piece of advice
A nice piece of advice from [1]: As mentioned above, the order of variables in the formula
matters, where variables specified first vary slowest than those specified last. Because comparisons are made most easily between adjacent cells, the variable you are most interested in should be specified last, and the early variables should be thought of as conditioning variables.
References:
[1] Wickham, H. (2007). Reshaping Data with the reshape Package. Journal of Statistical Software 21(12), 1-20. – in the INLA website.
[2] melt R documentation.
[3] cast R documentation.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.