Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Reshaping a dataframe / table from long to wide format or wide to long format is one of the daily tasks a Data Analyst / Data Scientist would be doing. The long format is similar to the tidy
format that the tidyverse
advocates. Even while, it’s been a very common task – the tidyr
package’s solution of using spread()
and gather()
almost never was intuitive enough to be used in the code without SOing or Referring the documentation. While I used to think, It’s just me who didn’t spread
and gather
, It seemed to be mass problem when the package developer Hadley Wickham realized this to completely rewrite those functions into two new functions pivot_wider()
and pivot_longer()
(which is also equivalent to melt()
and cast()
– or simply unpivot
and pivot
)
update your tidyr
If you are an R-user, It’s highly likely that you’ve got tidyr
either as a standalone package or part of the tidyverse
packages. Please update the package the get the latest tidyr
to get these functions for reshaping.
install.packages("tidyr") or devtools::install_github("tidyverse/tidyr")
and load the library.
library(tidyr) packageVersion("tidyr") ## [1] '0.8.3.9000'
Data
Let’s take this problem from a very popular Stack Overflow Question.
#data df <- read.table(textConnection( "Code Country 1950 1951 1952 1953 1954 AFG Afghanistan 20249 19000 22532 5000 24555 ALB Albania 8097 8986 3000 11123 12246"), header=TRUE) df ## Code Country X1950 X1951 X1952 X1953 X1954 ## 1 AFG Afghanistan 20249 19000 22532 5000 24555 ## 2 ALB Albania 8097 8986 3000 11123 12246
This wide format is good for some use-cases but for most of what we do with Data Analysis, it’d require the data to be in the long format.
Why Reshaping – The Problem
Given this data, Let’s try to make a Time-series Line Chart using ggplot2
. But the format in which the data is currently shaped (wide) can’t help us in building the line chart because for a line chart using geom_line()
we need the data in the long format – where the x-axis and y-axis are columns (ideally with x being a Time variable and y being a continuous variable) and also that we’ve got the country which could help us comparing both the countries.
Let’s do that!
library(tidyverse) df_pivoted <- pivot_longer(df, cols = -c("Code","Country"), names_to = "year") df_pivoted ## # A tibble: 10 x 4 ## Code Country year value ## <fct> <fct> <chr> <int> ## 1 AFG Afghanistan X1950 20249 ## 2 AFG Afghanistan X1951 19000 ## 3 AFG Afghanistan X1952 22532 ## 4 AFG Afghanistan X1953 5000 ## 5 AFG Afghanistan X1954 24555 ## 6 ALB Albania X1950 8097 ## 7 ALB Albania X1951 8986 ## 8 ALB Albania X1952 3000 ## 9 ALB Albania X1953 11123 ## 10 ALB Albania X1954 12246
I’m personally glad that I don’t have to check the confusing syntax of gather()
instead the intuitve name of pivot_longer()
(which is literally what we’ve to do) flows through my fingers.
As you can see above, the pivot_longer()
function as any tidyverse function supports %>%
which also means the first argument is the dataframe itself, then the list of cols
we would like to pivot_longer()
and the next essential argument names_to
– new column name under whcih these are rolled up. This gives us the reshaped data which we could use for plotting. If you came to this article, just to see reshaping, you’re good by this place. But if you want to connect better with the reason why we wanted to reshape in the first place, Let’s go do the line graph!
Line Graph with reshaped data
Now that we know how to reshape the data (from wide to long format), we can use ggplot2
’s geom_line()
to plot the (time-series) lines.
df %>% pivot_longer(-c("Code","Country"),"year") %>% mutate(year = as.integer(str_replace(year, "X",""))) %>% ggplot() + geom_line(aes(year, value, group = Country, color = Country)) + labs(title = "Line Graph after Reshaping", subtitle = "I love pivot_longer()", caption = "I was always confused with gather and spread") + theme_minimal()
In the above code, I’ve repeated the reshaping again – just to imply a point that this is why %>%
s are cool because we can build intuitive ETL (kind-of) pipeline and visualization together step-by-step.
The End
Well, We finally ended up with a beautiful (opinionated) line graph to compare two countries’ (made-up values) which came from a wide format data originally which we reshaped to long format data.
References
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.