Data manipulation with tidyr
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Hello everyone! In this article, I will show you how you can use tidyr
for data manipulation. tidyr
is a package by Hadley Wickham that makes it easy to tidy your data. It is often used in conjunction with dplyr
. Data is said to be tidy when each column represents a variable, and each row represents an observation.
I will demonstrate the usage of the following four functions from the tidyr
package:
gather
– converts wide data to longer format. It is analogous to themelt
function fromreshape2
.spread
– converts long data to wider format. It is analogous to thecast
function fromreshape2
.unite
– combines two or more columns into a single column.separate
– splits one column into two or more columns.
I will use the mtcars
dataset from the datasets
library. If you are not familiar with it, this is what it looks like:
library(tidyr) library(dplyr) head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Let us include the names of the cars in a column called car
for easier manipulation.
mtcars$car <- rownames(mtcars) mtcars <- mtcars[, c(12, 1:11)]
gather
gather
takes the form (from the help file):
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
where ...
is the specification of the columns to gather.
We can replicate what melt
does as follows:
mtcarsNew <- mtcars %>% gather(attribute, value, -car) head(mtcarsNew) car attribute value 1 Mazda RX4 mpg 21.0 2 Mazda RX4 Wag mpg 21.0 3 Datsun 710 mpg 22.8 4 Hornet 4 Drive mpg 21.4 5 Hornet Sportabout mpg 18.7 6 Valiant mpg 18.1 tail(mtcarsNew) car attribute value 347 Porsche 914-2 carb 2 348 Lotus Europa carb 2 349 Ford Pantera L carb 4 350 Ferrari Dino carb 6 351 Maserati Bora carb 8 352 Volvo 142E carb 2
As you can see, it gathers all the columns except car
and places their name and value into the attritube
and value
column respectively.
The great thing about tidyr
is that you can gather
only certain columns and leave the others alone. If we want to gather
all the columns from mpg
to gear
and leave the carb
and car
columns as they are, we can do it as follows:
mtcarsNew <- mtcars %>% gather(attribute, value, mpg:gear) head(mtcarsNew) car carb attribute value 1 Mazda RX4 4 mpg 21.0 2 Mazda RX4 Wag 4 mpg 21.0 3 Datsun 710 1 mpg 22.8 4 Hornet 4 Drive 1 mpg 21.4 5 Hornet Sportabout 2 mpg 18.7 6 Valiant 1 mpg 18.1
spread
spread
takes the form(from the help file):
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
We can replicate what cast
does as follows:
mtcarsSpread <- mtcarsNew %>% spread(attribute, value) head(mtcarsSpread) car carb mpg cyl disp hp drat wt qsec vs am gear 1 AMC Javelin 2 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2 Cadillac Fleetwood 4 10.4 8 472 205 2.93 5.250 17.98 0 0 3 3 Camaro Z28 4 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 Chrysler Imperial 4 14.7 8 440 230 3.23 5.345 17.42 0 0 3 5 Datsun 710 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 6 Dodge Challenger 2 15.5 8 318 150 2.76 3.520 16.87 0 0 3
unite
unite
takes the form (from the help file):
unite(data, col, ..., sep = "_", remove = TRUE)
where ...
represents the columns to unite and col
represents the column to add.
Let us create some fake data:
set.seed(1) date <- as.Date('2016-01-01') + 0:14 hour <- sample(1:24, 15) min <- sample(1:60, 15) second <- sample(1:60, 15) event <- sample(letters, 15) data <- data.frame(date, hour, min, second, event) data date hour min second event 1 2016-01-01 7 30 29 u 2 2016-01-02 9 43 36 a 3 2016-01-03 13 58 60 l 4 2016-01-04 20 22 11 q 5 2016-01-05 5 44 47 p 6 2016-01-06 18 52 37 k 7 2016-01-07 19 12 43 r 8 2016-01-08 12 35 6 i 9 2016-01-09 11 7 38 e 10 2016-01-10 1 14 21 b 11 2016-01-11 3 20 42 w 12 2016-01-12 14 1 32 t 13 2016-01-13 23 19 52 h 14 2016-01-14 21 41 26 s 15 2016-01-15 8 16 25 o
Now, let us combine the date
, hour
, min
, and second
columns into a new column called datetime
. Usually, datetime in R is of the form Year-Month-Day Hour:Min:Second.
dataNew <- data %>% unite(datehour, date, hour, sep = ' ') %>% unite(datetime, datehour, min, second, sep = ':') dataNew datetime event 1 2016-01-01 7:30:29 u 2 2016-01-02 9:43:36 a 3 2016-01-03 13:58:60 l 4 2016-01-04 20:22:11 q 5 2016-01-05 5:44:47 p 6 2016-01-06 18:52:37 k 7 2016-01-07 19:12:43 r 8 2016-01-08 12:35:6 i 9 2016-01-09 11:7:38 e 10 2016-01-10 1:14:21 b 11 2016-01-11 3:20:42 w 12 2016-01-12 14:1:32 t 13 2016-01-13 23:19:52 h 14 2016-01-14 21:41:26 s 15 2016-01-15 8:16:25 o
separate
separate
takes the form (from the help file):
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
We can get back the original data we created using separate
as follows:
data1 <- dataNew %>% separate(datetime, c('date', 'time'), sep = ' ') %>% separate(time, c('hour', 'min', 'second'), sep = ':') data1 date hour min second event 1 2016-01-01 07 30 29 u 2 2016-01-02 09 43 36 a 3 2016-01-03 13 59 00 l 4 2016-01-04 20 22 11 q 5 2016-01-05 05 44 47 p 6 2016-01-06 18 52 37 k 7 2016-01-07 19 12 43 r 8 2016-01-08 12 35 06 i 9 2016-01-09 11 07 38 e 10 2016-01-10 01 14 21 b 11 2016-01-11 03 20 42 w 12 2016-01-12 14 01 32 t 13 2016-01-13 23 19 52 h 14 2016-01-14 21 41 26 s 15 2016-01-15 08 16 25 o
It first splits the datetime
column into date
and time
, and then splits time
into hour
, min
, and second
.
That brings us to the end of the article. If you have questions or feedback, feel free to leave a comment or reach out to me on Twitter.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.