Workaround for tidyr::spread with duplicate row identifiers
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The problem
The tidyverse
api is one the easiest APIs in R
to follow – thanks the concept of grammer based code. But every now and again you come across an error that boggles your mind, my personal one is the error spread delivers when I have no indexing for my data: Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)...
. I am sure we have all seen this some or other time.
Well, lets reproduce this error:
library(dplyr) library(tidyr) df ## # A tibble: 5 x 2 ## age gender ## <chr> <fct> ## 1 21 Male ## 2 17 Female ## 3 32 Female ## 4 29 Male ## 5 15 Male # Spread df %>% spread(gender, age) ## Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)
To anyone doing the analysis, the output that he/she expects is quite trivial, the output should be:
## # A tibble: 3 x 2 ## Female Male ## * <chr> <chr> ## 1 17 21 ## 2 32 29 ## 3 <NA> 15
But given that there is no index for spread
to use and nothing stopping the 2nd and 3rd value for Male to collapse to a single row, spread
decides its better to throw an error. There are multiple questions on stackoverflow of people asking how to spread the df
without getting the warning. Here are some links:
And the list goes on…
The solution
Luckily I stumbled accross a really nice issue on the tidyr
github repo page. The user, markdly, openend the issue, but in the end added a nice workaround for these kind of problems. The idea is to add an index using group_by
and row_number
and just removing the column after the widening of the data.
df <- df %>% group_by(gender) %>% mutate(grouped_id = row_number()) df ## # A tibble: 5 x 3 ## # Groups: gender [2] ## age gender grouped_id ## <chr> <fct> <int> ## 1 21 Male 1 ## 2 17 Female 1 ## 3 32 Female 2 ## 4 29 Male 2 ## 5 15 Male 3 # Now Spread df %>% spread(gender, age) %>% select(-grouped_id) ## # A tibble: 3 x 2 ## Female Male ## * <chr> <chr> ## 1 17 21 ## 2 32 29 ## 3 <NA> 15
Using this method of giving spread
an index ensures no duplicated row names are introduced and a rectangular dataset can remain.
I hope that this little gem of an insight into working with spread
has saved you some time!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.