R:case4base – reshape data with base R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Contents
Introduction
This is the first post in the R:case4base series. The aim of the series is to elaborate on very useful features of base R that are lesser known and many times substituted with custom functionality of external packages.
The simplest, yet probably one of the most common use cases would be to change the data from what is called “wide” shape to “long” shape. Base R offers a very good function for this very purpose. Meet stats::reshape
.
How to use this article
- This article is best used with an R session opened in a window next to it – you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed – it is a
case4base
after all! - If you have no time for reading, you can click here to get just the code with commentary
Basic wide to long reshape
First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):
gdi <- read.csv( stringsAsFactors = FALSE , url("https://jozef.io/post/data/ESA2010_GDI.csv") ) head(gdi[, 1:7]) ## country Y.1995 Y.1996 Y.1997 Y.1998 Y.1999 ## 1 EU 28 NA NA NA NA 5982392.8 ## 2 Euro area 19 NA NA NA NA 4393727.3 ## 3 Belgium 140734.1 141599.4 145023.2 149705.2 153804.0 ## 4 Bulgaria 1036.0 1468.1 12367.4 14921.1 16052.8 ## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0 ## 6 Denmark 566363.0 578102.0 591416.0 621236.0 614893.0 ## Y.2000 ## 1 6425313.4 ## 2 4598956.1 ## 3 161753.6 ## 4 17676.4 ## 5 1359309.0 ## 6 639955.0
Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.
The goal of the article is therefore not really in presenting these conrete results, but to focus on the technical aspects and usefulness of the presented methods.
To reshape our data.frame
from wide to long, all we have to do is:
gdi_long <- reshape(data = gdi # data.frame in wide format to be reshaped , direction = "long" # we are going from wide to long , varying = 2:23 # columns that will be stacked into 1 ) head(gdi_long) ## country time Y id ## 1.1995 EU 28 1995 NA 1 ## 2.1995 Euro area 19 1995 NA 2 ## 3.1995 Belgium 1995 140734.1 3 ## 4.1995 Bulgaria 1995 1036.0 4 ## 5.1995 Czech Republic 1995 894042.0 5 ## 6.1995 Denmark 1995 566363.0 6
Before we get into clean-up such that the output data.frame
is nice and tidy, let us first take look at the arguments of the function that we used already
data
- almost obviously, this is thedata.frame
we want to reshapevarying
- names or indices of columns which we want to stack on each other into a single column
Tip
We can see that R automatically recognizes the
Y
and the years that get translated into thetime
column. This is because the column names are in a format that reshape can guess automatically:[string].[integer]
, in our case"Y.1996"
,"Y.1997"
, etc. It has a lot of benefits to keep this naming convention for your column names before reshaping. If your names have a different character between the[string]
and the[integer]
(for example"something_1996"
,"something_1997"
), you can specify this character with thesep
argument (e.g.sep = "_"
).
Now looking back at the reshaped gdi_long
, we see that the reshape worked, however we have 4 improvements that can be done providing the function with more arguments:
- the
id
column, which is not particularly useful this way - the
Y
column, which does have the correct data, however we would perhaps like to call it something a bit more descriptive - the
time
column, which could be named differently - we may want to update the values in the
time
column to something custom
gdi_long_full <- reshape(data = gdi # data.frame in wide format to be reshaped , direction = "long" # still going from wide to long , varying = 2:23 # columns that will be stacked into 1 , idvar = "country" # what identifies the rows? , v.names = "GDI" # how will the column with values be called , timevar = "year" # how will the time column be called , times = 1995:2016 # what are the values for the timevar column ) head(gdi_long_full) ## country year GDI ## EU 28.1995 EU 28 1995 NA ## Euro area 19.1995 Euro area 19 1995 NA ## Belgium.1995 Belgium 1995 140734.1 ## Bulgaria.1995 Bulgaria 1995 1036.0 ## Czech Republic.1995 Czech Republic 1995 894042.0 ## Denmark.1995 Denmark 1995 566363.0
We easily see the solution to our 4 improvements:
- specify
idvar = "country"
argument, as this column identifies the subjects in the rows - specify
v.names = "GDI"
argument, as this will rename the column with values (our values are gross disposable income) - specify
timevar = "year"
argument, as our time is actually years (the data is measure on a yearly basis) - specify
times = 1995:2016
argument, this is shown just for completion, we could for example dotimes = -21:0
if we want the years to be measured based on 2016 instead of actual years
Basic long to wide reshape
Now that have the wide to long reshape done, the reshape from long to wide format is a formality. It works exactly the same way, we just switch the arguments around a bit:
gdi_wide <- reshape(gdi_long_full # data.frame in long format to be reshaped , direction = "wide" # going from long to wide this time , idvar = "country" # identifying the subject in rows , timevar = "year" # column with values that will change to columns , v.names = "GDI" # column with the values ) head(gdi_wide[, 1:7, drop = FALSE]) ## country GDI.1995 GDI.1996 GDI.1997 GDI.1998 ## EU 28.1995 EU 28 NA NA NA NA ## Euro area 19.1995 Euro area 19 NA NA NA NA ## Belgium.1995 Belgium 140734.1 141599.4 145023.2 149705.2 ## Bulgaria.1995 Bulgaria 1036.0 1468.1 12367.4 14921.1 ## Czech Republic.1995 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 ## Denmark.1995 Denmark 566363.0 578102.0 591416.0 621236.0 ## GDI.1999 GDI.2000 ## EU 28.1995 5982392.8 6425313.4 ## Euro area 19.1995 4393727.3 4598956.1 ## Belgium.1995 153804.0 161753.6 ## Bulgaria.1995 16052.8 17676.4 ## Czech Republic.1995 1280040.0 1359309.0 ## Denmark.1995 614893.0 639955.0
Advanced reshape
Let us now examine a bit more advanced reshape with some more data. First, we will look at the generic setup. We now have data not just for the GDI, but for 3 measurements in the columns:
ConspC
- in columnsX1995ConspC .. X2016ConspC
AGDIpC
- in columnsX1995AGDIpC .. X2016AGDIpC
GrossSaving
- in columnsX1995GrossSaving .. X2016GrossSaving
more_notpretty <- read.csv( stringsAsFactors = FALSE , file = "https://jozef.io/post/data/ESA2010_not_pretty.csv" ) head(more_notpretty[, 1:5, drop = FALSE]) ## country X1995ConspC X1996ConspC X1997ConspC X1998ConspC ## 1 EU 28 NA NA NA NA ## 2 Euro area 19 NA NA NA NA ## 3 Belgium 18168.83 18634.68 18867.78 19334.14 ## 4 Bulgaria NA 3777.06 3163.05 3326.24 ## 5 Czech Republic 148721.29 159428.17 162742.83 161855.85 ## 6 Denmark 176096.32 179576.05 182940.60 187630.27
Since these data do not have column names that R would be able to guess automatically, we will have to provide quite a few arguments:
varying
as a list of vectors, each specifying the columns for one varying variablev.names
as a vector of names for those variables
more_notpretty_long <- reshape(data = more_notpretty , direction = "long" , varying = list(2:23 , 24:45 , 46:67 ) , timevar = "year" , times = 1995:2016 , idvar = "country" , v.names = c("ConspC" , "AGDIpC" , "GrossSaving" ) ) head(more_notpretty_long) ## country year ConspC AGDIpC GrossSaving ## EU 28.1995 EU 28 1995 NA NA NA ## Euro area 19.1995 Euro area 19 1995 NA NA NA ## Belgium.1995 Belgium 1995 18168.83 21577.92 27350.1 ## Bulgaria.1995 Bulgaria 1995 NA NA 448.4 ## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46 116646.0 ## Denmark.1995 Denmark 1995 176096.32 179741.30 42398.0
Now let us showcase how easy the reshape is if we adhere to R’s favourite column naming with the same data:
more_pretty <- read.csv( stringsAsFactors = FALSE , file = "https://jozef.io/post/data/ESA2010_pretty.csv" ) head(more_pretty[, 1:5, drop = FALSE]) ## country ConspC.1995 ConspC.1996 ConspC.1997 ConspC.1998 ## 1 EU 28 NA NA NA NA ## 2 Euro area 19 NA NA NA NA ## 3 Belgium 18168.83 18634.68 18867.78 19334.14 ## 4 Bulgaria NA 3777.06 3163.05 3326.24 ## 5 Czech Republic 148721.29 159428.17 162742.83 161855.85 ## 6 Denmark 176096.32 179576.05 182940.60 187630.27
We tell R only the information it necessarily needs, same as with the simple reshape:
more_pretty_long <- reshape(data = more_pretty , direction = "long" , varying = 2:67 , idvar = "country" ) head(more_pretty_long) ## country time ConspC AGDIpC GrossSaving ## EU 28.1995 EU 28 1995 NA NA NA ## Euro area 19.1995 Euro area 19 1995 NA NA NA ## Belgium.1995 Belgium 1995 18168.83 21577.92 27350.1 ## Bulgaria.1995 Bulgaria 1995 NA NA 448.4 ## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46 116646.0 ## Denmark.1995 Denmark 1995 176096.32 179741.30 42398.0
That was really easy and we got the desired result!
Now as the very last example, we may want to get the data into an even longer form, if we also
- consider the actual variables we are measuring as
varying
- their names will turn into
times
- with
measurement
being the name fortimevar
more_longer <- reshape(data = more_pretty_long , direction = "long" , varying = 3:5 , timevar = "measurement" , times = names(more_pretty_long[, 3:5]) , v.names = "Value" ) head(more_longer) ## country time measurement Value id ## 1.ConspC EU 28 1995 ConspC NA 1 ## 2.ConspC Euro area 19 1995 ConspC NA 2 ## 3.ConspC Belgium 1995 ConspC 18168.83 3 ## 4.ConspC Bulgaria 1995 ConspC NA 4 ## 5.ConspC Czech Republic 1995 ConspC 148721.29 5 ## 6.ConspC Denmark 1995 ConspC 176096.32 6
Alternatives to base R
There are many alternatives to the base functionality, each with their own pros and cons, here is a selection of three in no particular order:
melt
andcast
from the reshape2 packagegather
andspread
from the tidyR packagemelt
anddcast
from the data.table package
TL;DR - Just want the code
No time for reading? Click here to get just the code with commentary
Exercises
- At the beginning of the article, our data had countries in rows and yearly data as columns. Reshape the data such that the countries will be in columns and the years are in rows.
reshape(reshape(gdi_long_full))
gives us a data.frame equivalent togdi_long_full
, even though we call the function twice with no extra arguments, just the data. What kind of sorcery is this? Why don’t we need to provide at least thedirection
, or thevarying
arguments?
Exercise answers
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.