R:case4base – reshape data with base R

[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

This is the first post in the R:case4base series. The aim of the series is to elaborate on very useful features of base R that are lesser known and many times substituted with custom functionality of external packages.

The simplest, yet probably one of the most common use cases would be to change the data from what is called “wide” shape to “long” shape. Base R offers a very good function for this very purpose. Meet stats::reshape.

How to use this article

  1. This article is best used with an R session opened in a window next to it – you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed – it is a case4base after all!
  2. If you have no time for reading, you can click here to get just the code with commentary

Basic wide to long reshape

First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):

gdi <- read.csv(
  stringsAsFactors = FALSE
, url("https://jozef.io/post/data/ESA2010_GDI.csv")
              )
head(gdi[, 1:7])
##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0
##      Y.2000
## 1 6425313.4
## 2 4598956.1
## 3  161753.6
## 4   17676.4
## 5 1359309.0
## 6  639955.0

Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.

The goal of the article is therefore not really in presenting these conrete results, but to focus on the technical aspects and usefulness of the presented methods.

To reshape our data.frame from wide to long, all we have to do is:

gdi_long <- reshape(data = gdi         # data.frame in wide format to be reshaped
                  , direction = "long" # we are going from wide to long
                  , varying = 2:23     # columns that will be stacked into 1
                  )

head(gdi_long)
##               country time        Y id
## 1.1995          EU 28 1995       NA  1
## 2.1995   Euro area 19 1995       NA  2
## 3.1995        Belgium 1995 140734.1  3
## 4.1995       Bulgaria 1995   1036.0  4
## 5.1995 Czech Republic 1995 894042.0  5
## 6.1995        Denmark 1995 566363.0  6

Before we get into clean-up such that the output data.frame is nice and tidy, let us first take look at the arguments of the function that we used already

  • data - almost obviously, this is the data.frame we want to reshape
  • varying - names or indices of columns which we want to stack on each other into a single column

Tip

We can see that R automatically recognizes the Y and the years that get translated into the time column. This is because the column names are in a format that reshape can guess automatically: [string].[integer], in our case "Y.1996", "Y.1997", etc. It has a lot of benefits to keep this naming convention for your column names before reshaping. If your names have a different character between the [string] and the [integer] (for example "something_1996", "something_1997"), you can specify this character with the sep argument (e.g. sep = "_").

Now looking back at the reshaped gdi_long, we see that the reshape worked, however we have 4 improvements that can be done providing the function with more arguments:

  1. the id column, which is not particularly useful this way
  2. the Y column, which does have the correct data, however we would perhaps like to call it something a bit more descriptive
  3. the time column, which could be named differently
  4. we may want to update the values in the time column to something custom
gdi_long_full <- reshape(data = gdi         # data.frame in wide format to be reshaped
                       , direction = "long" # still going from wide to long
                       , varying = 2:23     # columns that will be stacked into 1
                       , idvar = "country"  # what identifies the rows?
                       , v.names = "GDI"    # how will the column with values be called
                       , timevar = "year"   # how will the time column be called
                       , times = 1995:2016  # what are the values for the timevar column
                       )
head(gdi_long_full)
##                            country year      GDI
## EU 28.1995                   EU 28 1995       NA
## Euro area 19.1995     Euro area 19 1995       NA
## Belgium.1995               Belgium 1995 140734.1
## Bulgaria.1995             Bulgaria 1995   1036.0
## Czech Republic.1995 Czech Republic 1995 894042.0
## Denmark.1995               Denmark 1995 566363.0

We easily see the solution to our 4 improvements:

  1. specify idvar = "country" argument, as this column identifies the subjects in the rows
  2. specify v.names = "GDI" argument, as this will rename the column with values (our values are gross disposable income)
  3. specify timevar = "year" argument, as our time is actually years (the data is measure on a yearly basis)
  4. specify times = 1995:2016 argument, this is shown just for completion, we could for example do times = -21:0 if we want the years to be measured based on 2016 instead of actual years

Basic long to wide reshape

Now that have the wide to long reshape done, the reshape from long to wide format is a formality. It works exactly the same way, we just switch the arguments around a bit:

gdi_wide <- reshape(gdi_long_full      # data.frame in long format to be reshaped  
                  , direction = "wide" # going from long to wide this time
                  , idvar = "country"  # identifying the subject in rows
                  , timevar = "year"   # column with values that will change to columns
                  , v.names = "GDI"    # column with the values
                  )
head(gdi_wide[, 1:7, drop = FALSE])
##                            country GDI.1995  GDI.1996  GDI.1997  GDI.1998
## EU 28.1995                   EU 28       NA        NA        NA        NA
## Euro area 19.1995     Euro area 19       NA        NA        NA        NA
## Belgium.1995               Belgium 140734.1  141599.4  145023.2  149705.2
## Bulgaria.1995             Bulgaria   1036.0    1468.1   12367.4   14921.1
## Czech Republic.1995 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0
## Denmark.1995               Denmark 566363.0  578102.0  591416.0  621236.0
##                      GDI.1999  GDI.2000
## EU 28.1995          5982392.8 6425313.4
## Euro area 19.1995   4393727.3 4598956.1
## Belgium.1995         153804.0  161753.6
## Bulgaria.1995         16052.8   17676.4
## Czech Republic.1995 1280040.0 1359309.0
## Denmark.1995         614893.0  639955.0

Advanced reshape

Let us now examine a bit more advanced reshape with some more data. First, we will look at the generic setup. We now have data not just for the GDI, but for 3 measurements in the columns:

  • ConspC - in columns X1995ConspC .. X2016ConspC
  • AGDIpC - in columns X1995AGDIpC .. X2016AGDIpC
  • GrossSaving - in columns X1995GrossSaving .. X2016GrossSaving
more_notpretty <- read.csv(
  stringsAsFactors = FALSE
, file = "https://jozef.io/post/data/ESA2010_not_pretty.csv"
)
head(more_notpretty[, 1:5, drop = FALSE])
##          country X1995ConspC X1996ConspC X1997ConspC X1998ConspC
## 1          EU 28          NA          NA          NA          NA
## 2   Euro area 19          NA          NA          NA          NA
## 3        Belgium    18168.83    18634.68    18867.78    19334.14
## 4       Bulgaria          NA     3777.06     3163.05     3326.24
## 5 Czech Republic   148721.29   159428.17   162742.83   161855.85
## 6        Denmark   176096.32   179576.05   182940.60   187630.27

Since these data do not have column names that R would be able to guess automatically, we will have to provide quite a few arguments:

  • varying as a list of vectors, each specifying the columns for one varying variable
  • v.names as a vector of names for those variables
more_notpretty_long <- reshape(data = more_notpretty
                             , direction = "long"
                             , varying = list(2:23
                                            , 24:45
                                            , 46:67
                                            )
                             , timevar = "year"
                             , times = 1995:2016
                             , idvar = "country"
                             , v.names = c("ConspC"
                                         , "AGDIpC"
                                         , "GrossSaving"
                                         )
                             )
head(more_notpretty_long)
##                            country year    ConspC    AGDIpC GrossSaving
## EU 28.1995                   EU 28 1995        NA        NA          NA
## Euro area 19.1995     Euro area 19 1995        NA        NA          NA
## Belgium.1995               Belgium 1995  18168.83  21577.92     27350.1
## Bulgaria.1995             Bulgaria 1995        NA        NA       448.4
## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46    116646.0
## Denmark.1995               Denmark 1995 176096.32 179741.30     42398.0

Now let us showcase how easy the reshape is if we adhere to R’s favourite column naming with the same data:

more_pretty <- read.csv(
  stringsAsFactors = FALSE
, file = "https://jozef.io/post/data/ESA2010_pretty.csv"
)
head(more_pretty[, 1:5, drop = FALSE])
##          country ConspC.1995 ConspC.1996 ConspC.1997 ConspC.1998
## 1          EU 28          NA          NA          NA          NA
## 2   Euro area 19          NA          NA          NA          NA
## 3        Belgium    18168.83    18634.68    18867.78    19334.14
## 4       Bulgaria          NA     3777.06     3163.05     3326.24
## 5 Czech Republic   148721.29   159428.17   162742.83   161855.85
## 6        Denmark   176096.32   179576.05   182940.60   187630.27

We tell R only the information it necessarily needs, same as with the simple reshape:

more_pretty_long <- reshape(data = more_pretty
                           , direction = "long"
                           , varying = 2:67
                           , idvar = "country"
                           )
head(more_pretty_long)
##                            country time    ConspC    AGDIpC GrossSaving
## EU 28.1995                   EU 28 1995        NA        NA          NA
## Euro area 19.1995     Euro area 19 1995        NA        NA          NA
## Belgium.1995               Belgium 1995  18168.83  21577.92     27350.1
## Bulgaria.1995             Bulgaria 1995        NA        NA       448.4
## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46    116646.0
## Denmark.1995               Denmark 1995 176096.32 179741.30     42398.0

That was really easy and we got the desired result!

Now as the very last example, we may want to get the data into an even longer form, if we also

  • consider the actual variables we are measuring as varying
  • their names will turn into times
  • with measurement being the name for timevar
more_longer <- reshape(data = more_pretty_long
                    , direction = "long"
                    , varying = 3:5
                    , timevar = "measurement"
                    , times = names(more_pretty_long[, 3:5])
                    , v.names = "Value"
                    )
head(more_longer)
##                 country time measurement     Value id
## 1.ConspC          EU 28 1995      ConspC        NA  1
## 2.ConspC   Euro area 19 1995      ConspC        NA  2
## 3.ConspC        Belgium 1995      ConspC  18168.83  3
## 4.ConspC       Bulgaria 1995      ConspC        NA  4
## 5.ConspC Czech Republic 1995      ConspC 148721.29  5
## 6.ConspC        Denmark 1995      ConspC 176096.32  6

Alternatives to base R

There are many alternatives to the base functionality, each with their own pros and cons, here is a selection of three in no particular order:

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

Exercises

  1. At the beginning of the article, our data had countries in rows and yearly data as columns. Reshape the data such that the countries will be in columns and the years are in rows.
  2. reshape(reshape(gdi_long_full)) gives us a data.frame equivalent to gdi_long_full, even though we call the function twice with no extra arguments, just the data. What kind of sorcery is this? Why don’t we need to provide at least the direction, or the varying arguments?

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)