Site icon R-bloggers

Data Manipulation with dplyr

[This article was first published on DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation. Here, I will provide a basic overview of some of the most useful functions contained in the package.

For this article, I will be using the airquality dataset from the datasets package. The airquality dataset contains information about air quality measurements in New York from May 1973 – September 1973.

The head of the dataset looks like this:

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Before we dive into the functions, let’s load up the two packages:

library(datasets)
library(dplyr)

Okay, now let’s get to the functions.

Filter
The filter function will return all the rows that satisfy a following condition. For example below will return all the rows where Temp is larger than 70.

filter(airquality, Temp > 70)
  Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72     5   2
2    12     149 12.6   74     5   3
3     7      NA  6.9   74     5  11
4    11     320 16.6   73     5  22
5    45     252 14.9   81     5  29
6   115     223  5.7   79     5  30
...

Another example of filter is to return all the rows where Temp is larger than 80 and Month is after May.

filter(airquality, Temp > 80 & Month > 5)
  Ozone Solar.R Wind Temp Month Day
1    NA     286  8.6   78     6   1
2    NA     287  9.7   74     6   2
3    NA     186  9.2   84     6   4
4    NA     220  8.6   85     6   5
5    NA     264 14.3   79     6   6
...

Mutate
Mutate is used to add new variables to the data. For example lets adds a new column that displays the temperature in Celsius.

mutate(airquality, TempInC = (Temp - 32) * 5 / 9)
  Ozone Solar.R Wind Temp Month Day  TempInC
1    41     190  7.4   67     5   1 19.44444
2    36     118  8.0   72     5   2 22.22222
3    12     149 12.6   74     5   3 23.33333
4    18     313 11.5   62     5   4 16.66667
5    NA      NA 14.3   56     5   5 13.33333
...

Summarise
The summarise function is used to summarise multiple values into a single value. It is very powerful when used in conjunction with the other functions in the dplyr package, as demonstrated below. na.rm = TRUE will remove all NA values while calculating the mean, so that it doesn’t produce spurious results.

summarise(airquality, mean(Temp, na.rm = TRUE))
  mean(Temp)
1   77.88235

Group By
The group_by function is used to group data by one or more variables. Will group the data together based on the Month, and then the summarise function is used to calculate the mean temperature in each month.

summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))
  Month mean(Temp)
1     5   65.54839
2     6   79.10000
3     7   83.90323
4     8   83.96774
5     9   76.90000

Sample
The sample function is used to select random rows from a table. The first line of code randomly selects ten rows from the dataset, and the second line of code randomly selects 15 rows (10% of the original 153 rows) from the dataset.

sample_n(airquality, size = 10)
sample_frac(airquality, size = 0.1)

Count
The count function tallies observations based on a group. It is slightly similar to the table function in the base package. For example:

count(airquality, Month)
  Month  n
1     5 31
2     6 30
3     7 31
4     8 31
5     9 30

This means that there are 31 rows with Month = 5, 30 rows with Month = 6, and so on.

Arrange
The arrange function is used to arrange rows by variables. Currently, the airquality dataset is arranged based on Month, and then Day. We can use the arrange function to arrange the rows in the descending order of Month, and then in the ascending order of Day.

arrange(airquality, desc(Month), Day)
  Ozone Solar.R Wind Temp Month Day
1    96     167  6.9   91     9   1
2    78     197  5.1   92     9   2
3    73     183  2.8   93     9   3
4    91     189  4.6   93     9   4
5    47      95  7.4   87     9   5
6    32      92 15.5   84     9   6

Pipe
The pipe operator in R, represented by %>% can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.

For example, let’s say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month. The conventional way to write the code for this would be:

filteredData <- filter(airquality, Month != 5)
groupedData <- group_by(filteredData, Month)
summarise(groupedData, mean(Temp, na.rm = TRUE))

With piping, the above code can be rewritten as:

airquality %>% 
    filter(Month != 5) %>% 
    group_by(Month) %>% 
    summarise(mean(Temp, na.rm = TRUE))

This is a very basic example, and the usefulness may not be very apparent, but as the number of operations/functions perfomed on the data increase, the pipe operator becomes more and more useful!

That brings us to the end of this article. I hope you enjoyed it and found it useful. If you have questions, feel free to leave a comment or reach out to me on Twitter.

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.