Having written a lot of R code over the last few years, I’ve developed a set of constructs for my most common tasks. Like idioms in a natural language (e.g. “break a leg”), I grasp their meaning automatically, without having to think about it. Because they make me more and more productive in R, these idioms have become ingrained in my head (and muscle memory) and, in large part, inform how I approach problems.
It’s no wonder, then, that I’m hesitant to embrace new packages that threaten to displace these idioms; switching is tantamount to learning a new dialect after investing a lot of time becoming fluent in another.
On occasion, though, a package comes along whose benefits are so compelling (here’s looking at you, Hadley Wickham, Dirk Eddelbuettel, and Romain François) that it incites me to take the plunge, employ new patterns, and learn new idioms. The most recent package to accomplish this is dplyr, which (among other things) reimplements five of my most common R data-manipulation idioms, often in blazing-fast parallel C++.
This post serves both as advocacy for dplyr (by comparing form and speed) and as a Rosetta Stone: a personal reference for translating my old R idioms.
The examples use a dataset documenting crimes in the US by state, available here.
library(dplyr)
crime.by.state <- read.csv("CrimeStatebyState.csv")
Filtering rows
# base R
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                crime.by.state$State=="New York", ]

# dplyr
crime.ny.2005 <- filter(crime.by.state, State=="New York", Year==2005)
There is a lot going on in my base R solution. It uses logical subsetting to extract the chosen rows from crime.by.state. Specifically, it creates two boolean vectors: one that is TRUE only where the “Year” column’s value is 2005, and one that is TRUE only where the “State” column’s value is “New York”. It then combines these vectors with a logical AND, so that the resulting boolean vector is TRUE only where the year was 2005 and the state was New York. This vector is then used to subset crime.by.state, keeping all columns. In contrast, the dplyr solution reads much more naturally, and in far fewer characters. According to my (crude) benchmarks, the dplyr solution appears to be about twice as fast.
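To make those mechanics concrete, the same subsetting can be broken into explicit intermediate steps (the is.2005, is.ny, and keep.rows names below are just for illustration, not part of the original code):

# build the two logical vectors separately
is.2005 <- crime.by.state$Year == 2005
is.ny   <- crime.by.state$State == "New York"

# element-wise AND: TRUE only where both conditions hold
keep.rows <- is.2005 & is.ny

# subset those rows; leaving the column index empty keeps every column
crime.ny.2005 <- crime.by.state[keep.rows, ]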
A quick note before moving on: we could have drastically cut down on the number of characters in the base R solution by “attaching” the crime.by.state dataset, eliminating the need to preface the “Year” and “State” names with “crime.by.state$”. There are two reasons why I don’t do this: (1) I consider it bad form in a lot of circumstances (for example, it can become confusing when more than one dataset is loaded), and (2) RStudio will tab-auto-complete a column name after it is prefaced with “name-of-dataframe$”, which drastically increases my coding speed. My only complaint(?) about dplyr is that it disallows this prefacing syntax and requires me to look up the column names (and spell them correctly).
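For reference, the attach()-based version I’m avoiding would look roughly like this; the bare column names are resolved through the attached data frame, which is exactly the kind of implicit lookup that gets confusing with multiple datasets:

# attach() puts the data frame's columns on the search path
attach(crime.by.state)
crime.ny.2005 <- crime.by.state[Year == 2005 & State == "New York", ]
detach(crime.by.state)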
Arranging and ordering
# base R
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, decreasing=TRUE), ]

# dplyr
crime.ny.2005 <- arrange(crime.ny.2005, desc(Count))
The base R solution calls order() to get the permutation of row indices that puts “Count” in decreasing order, and uses that index vector to subset the crime.ny.2005 data frame. The dplyr solution appears to be about 20% faster.
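A tiny, made-up example shows what order() actually returns, an index permutation rather than the sorted values themselves:

counts <- c(10, 50, 30)
order(counts, decreasing = TRUE)           # index permutation: 2 3 1
counts[order(counts, decreasing = TRUE)]   # values in decreasing order: 50 30 10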
Selecting columns
# base R
crime.ny.2005 <- crime.ny.2005[, c("Type.of.Crime", "Count")]

# dplyr
crime.ny.2005 <- select(crime.ny.2005, Type.of.Crime, Count)
This example is relatively self-explanatory. Here the base R solution appears to be faster, by about 30%.
Creating new columns
# base R
crime.ny.2005$Proportion <- crime.ny.2005$Count / sum(crime.ny.2005$Count)

# dplyr
crime.ny.2005 <- mutate(crime.ny.2005, Proportion=Count/sum(Count))
Very often, I have to create a new column that is a function of one or more existing columns. Here, we are creating a new column that represents the proportion of all crimes (across every type) accounted for by each particular crime. Incredibly, base R beats dplyr at this task: it is about 18 times faster.
If I had to guess, I think this is because of the nuances of R’s vectorization. In the base R solution, a vector of crime counts is extracted. R recognizes that it is being divided by a scalar (the sum of the counts) and automatically recycles that scalar so that the lengths of the two operands match. Both vectors are stored contiguously, and the resulting element-wise division is blindingly fast. In contrast, I suspect that in the dplyr solution the sum of the counts column is re-evaluated for each element of the count vector, although I am not sure.
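The recycling behaviour itself is easy to see on a toy vector (this is just an illustration of the mechanism, not part of the benchmark):

counts <- c(1, 2, 3, 4)

# sum(counts) is a length-one vector (10); R recycles it to the length of
# counts and performs the division element-wise in a single vectorized pass
counts / sum(counts)   # 0.1 0.2 0.3 0.4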
Aggregation and summarization
# base R
summary1 <- aggregate(Count ~ Type.of.Crime, data=crime.ny.2005, FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime, data=crime.ny.2005, FUN=length)
summary.crime.ny.2005 <- merge(summary1, summary2, by="Type.of.Crime")

# dplyr
by.type <- group_by(crime.ny.2005, Type.of.Crime)
summary.crime.ny.2005 <- summarise(by.type,
                                   num.types = n(),
                                   counts = sum(Count))
This is the arena in which dplyr really shines over base R. In the original dataset, each crime is identified both by a specific name (“Burglary”, “Aggravated assault”) and by a broader category (“Property Crime” or “Violent Crime”).
Before this point, the data frame we are working with looks like this:
  Type.of.Crime  Count
  Violent Crime    874
  Violent Crime   3636
  Violent Crime  35179
  Violent Crime  46150
 Property Crime  68034
 Property Crime 302220
 Property Crime  35736
In this pedagogical example we want to aggregate by the type of crime and (a) count the number of specific crimes that fall into each category, and (b) sum all the crimes committed in those categories. Base R makes it very easy to do one of these aggregations at a time, but to get both values it requires two calls to aggregate() and then a merge() of the results. The dplyr solution, on the other hand, is relatively intuitive and requires just two function calls.
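Run against the seven rows shown above, the dplyr version should produce something along these lines (the base R merge gives the same numbers, but under the default Count.x and Count.y column names, and the exact ordering and print formatting may differ):

   Type.of.Crime num.types counts
  Property Crime         3 405990
   Violent Crime         4  85839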
All together now
We haven’t showcased the best part of dplyr yet… it presents itself when we combine all of these statements:
# base R
crime.by.state <- read.csv("CrimeStatebyState.csv")
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                crime.by.state$State=="New York",
                                c("Type.of.Crime", "Count")]
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, decreasing=TRUE), ]
crime.ny.2005$Proportion <- crime.ny.2005$Count / sum(crime.ny.2005$Count)
summary1 <- aggregate(Count ~ Type.of.Crime, data=crime.ny.2005, FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime, data=crime.ny.2005, FUN=length)
final <- merge(summary1, summary2, by="Type.of.Crime")

# dplyr
crime.by.state <- read.csv("CrimeStatebyState.csv")
final <- crime.by.state %.%
  filter(State=="New York", Year==2005) %.%
  arrange(desc(Count)) %.%
  select(Type.of.Crime, Count) %.%
  mutate(Proportion=Count/sum(Count)) %.%
  group_by(Type.of.Crime) %.%
  summarise(num.types = n(), counts = sum(Count))
When all combined, the base R solution took 60 seconds (over 10,000 iterations) and the dplyr solution took 30 seconds. Perhaps more importantly, the dplyr code uses far fewer lines and assignments, and is terser and probably more readable with its neat-o “%.%” operator.
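My benchmarking code isn’t shown in this post, but a comparison along these lines can be reproduced by wrapping each pipeline in a function and timing many iterations with system.time(); the run.base() and run.dplyr() wrappers below are just one way to set that up, not the code I actually used:

# hypothetical wrappers around the two pipelines above, for timing purposes
run.base <- function() {
  d <- crime.by.state[crime.by.state$Year == 2005 &
                      crime.by.state$State == "New York",
                      c("Type.of.Crime", "Count")]
  d <- d[order(d$Count, decreasing = TRUE), ]
  d$Proportion <- d$Count / sum(d$Count)
  merge(aggregate(Count ~ Type.of.Crime, data = d, FUN = sum),
        aggregate(Count ~ Type.of.Crime, data = d, FUN = length),
        by = "Type.of.Crime")
}

run.dplyr <- function() {
  crime.by.state %.%
    filter(State == "New York", Year == 2005) %.%
    arrange(desc(Count)) %.%
    select(Type.of.Crime, Count) %.%
    mutate(Proportion = Count / sum(Count)) %.%
    group_by(Type.of.Crime) %.%
    summarise(num.types = n(), counts = sum(Count))
}

# time 10,000 iterations of each approach
system.time(for (i in 1:10000) run.base())
system.time(for (i in 1:10000) run.dplyr())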
I would be remiss if I didn’t mention at least one of the other benefits of dplyr…
dplyr’s functions are generalized to handle more than just data frames (like we were using here). Just as easily as it handles data frames, dplyr can also handle data.tables, as well as remote (and out-of-memory) databases like MySQL, PostgreSQL, SQLite, and BigQuery, by translating your verbs into the appropriate SQL on the fly.
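As a rough taste of this, here is a sketch using the SQLite backend; src_sqlite() and copy_to() are the backend helpers as I understand them, so check the documentation for the dplyr version you have installed:

# create an on-disk SQLite database and copy the data frame into it
crime.db  <- src_sqlite("crime.sqlite3", create = TRUE)
crime.tbl <- copy_to(crime.db, crime.by.state, name = "crime")

# the same verb now generates SQL and runs it inside the database
filter(crime.tbl, State == "New York", Year == 2005)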
There are still other neat features of dplyr, but perhaps these are reason enough to give it a shot. I know my own code may never look the same.