Create a data transformation pipeline
All data transformation functions in dplyr can be connected through the pipe %>%
operator to create powerful and yet expressive data transformation pipelines.
- Use the pipe operator %>% to combine multiple dplyr functions into one pipeline
%>% filter(___) %>% select(___) %>% arrange(___)
Using the %>% operator
The pipe operator %>% is a special part of the tidyverse universe. It is used to combine multiple functions and run them one after the other. In this setting, the input of each function is the output of the previous function.
Imagine we have the pres_results data frame and want to create a smaller, more transparent data frame for answering the question: In which states was the Democratic Party the most popular choice in the 2016 US presidential election? To accomplish this task we would need to take the following steps:
- filter() the data frame for the rows where the year variable equals 2016.
- select() the two variables state and dem, since we are not interested in the rest of the columns.
- arrange() the filtered and selected data frame based on the dem column in a descending way.
The steps and functions described above should be run one after the other, where the input of each function is the output of the previous step. Applying the things you learned so far, you could accomplish this task by taking the following steps:
result <- filter(pres_results, year==2016)
result <- select(result, state, dem)
result <- arrange(result, desc(dem))
result
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
The first function takes the pres_results data frame, filters it according to the task description and assigns it to the variable result. Then, each subsequent function takes the result variable as input and overwrites it with its own output.
The %>% operator provides a practical way of combining the steps above into what looks like a single step. It takes a data frame as the initial input. Then it applies a sequence of functions, passing the output of each function on as the input of the next. The same task as above can be accomplished using the pipe operator %>% like this:
pres_results %>%
  filter(year==2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
We can interpret the code in the following way:
- We define the original data set as a starting point.
- Using the %>% operator right after the data frame tells R that a function is coming which takes the previously defined data frame as input.
- We use each function as usual, but skip the first parameter. The data frame input is automatically provided by the output of the previous step.
- As long as we add the %>% operator after a step, R will expect an additional step.
- In our example the pipeline closes with an arrange() function. It gets the filtered and selected version of the pres_results data frame as input and sorts it based on the dem column in a descending way. Finally, it gives back the output.
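The points above can be checked on a small example. The following is a minimal sketch, assuming dplyr is installed; toy_results is a made-up tibble that only mimics the column names of pres_results, and its values are purely illustrative. It runs the same three steps once with explicit first arguments and once through the pipe, then compares the two results.

```r
library(dplyr)

# Made-up toy data standing in for pres_results (values are illustrative only)
toy_results <- tibble(
  year  = c(2012, 2016, 2016, 2016),
  state = c("AA", "AA", "BB", "CC"),
  dem   = c(0.50, 0.40, 0.70, 0.55),
  rep   = c(0.45, 0.55, 0.25, 0.40)
)

# Step by step: the data frame is passed explicitly as the first argument
step_by_step <- filter(toy_results, year == 2016)
step_by_step <- select(step_by_step, state, dem)
step_by_step <- arrange(step_by_step, desc(dem))

# Piped: the left-hand side is supplied as the first argument of each call
piped <- toy_results %>%
  filter(year == 2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

# Compare the two results; they are expected to match
identical(step_by_step, piped)
```

Written this way, the pipeline reads top to bottom in the same order as the task description, without repeating the result variable on every line.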
One difference between the two approaches is that the %>% operator stores neither the intermediate results nor the final result. To save the resulting data frame we need to assign the output to a variable:
result <- pres_results %>%
  filter(year==2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))
result
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
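To make the point about saving results concrete, here is a small sketch under the same assumptions as before: dplyr is installed, and toy_results is a hypothetical tibble with made-up values. It shows that a pipeline leaves its input untouched and that the output only persists once it is assigned.

```r
library(dplyr)

# Made-up toy data (values are illustrative only)
toy_results <- tibble(
  year  = c(2012, 2016, 2016),
  state = c("AA", "AA", "BB"),
  dem   = c(0.50, 0.40, 0.70)
)

# Without an assignment the pipeline result is only printed, not stored
toy_results %>%
  filter(year == 2016) %>%
  arrange(desc(dem))

# The input data frame is unchanged and still has all of its rows
nrow(toy_results)

# Assigning the pipeline output keeps the result for later use
result <- toy_results %>%
  filter(year == 2016) %>%
  arrange(desc(dem))
nrow(result)
```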
Exercise: Austrian Life Expectancy
Use the %>% operator on the gapminder data set and create a simple data frame to answer the following question: How did the life expectancy in Austria change over the last decades? Required packages are already loaded.
- Define the gapminder data frame as the base data frame.
- Filter only the rows where the country column is equal to Austria by piping gapminder to the filter() function.
- Select only the columns year and lifeExp from the filtered result.
- Arrange the selected columns based on the year column.
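As a side note on the last step: arrange() sorts in ascending order unless the column is wrapped in desc(). The sketch below illustrates this on a hypothetical tibble named toy, assuming dplyr is installed; the lifeExp numbers are placeholders, not real life expectancies.

```r
library(dplyr)

# Hypothetical data with placeholder values
toy <- tibble(
  year    = c(2007, 1997, 2002),
  lifeExp = c(3, 1, 2)
)

# Default ordering: smallest year first
ascending <- toy %>% arrange(year)

# desc() reverses the ordering: largest year first
descending <- toy %>% arrange(desc(year))

ascending$year
descending$year
```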
Exercise: European GDP Per Capita
Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which European country had the highest GDP per capita in 2007? Required packages are already loaded.
- Define the gapminder tibble as the input.
- Filter only the rows where the year column is equal to 2007.
- Use a second layer of filter and keep only the rows where the continent column is equal to Europe.
- Select only the columns country and gdpPercap.
- Arrange the results based on the gdpPercap column in a descending way.
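The "second layer of filter" mentioned above can be written either as two chained filter() steps or as a single filter() call with two conditions separated by a comma. The following sketch, assuming dplyr is installed, compares both forms on toy_gap, a made-up tibble that only borrows the column names of gapminder; none of the values are real.

```r
library(dplyr)

# Made-up rows that mimic gapminder's column names (values are not real)
toy_gap <- tibble(
  country   = c("A", "B", "C", "D"),
  continent = c("Europe", "Europe", "Asia", "Europe"),
  year      = c(2007, 2002, 2007, 2007),
  gdpPercap = c(30000, 28000, 12000, 41000)
)

# Two chained filter() steps, one condition each
two_steps <- toy_gap %>%
  filter(year == 2007) %>%
  filter(continent == "Europe")

# One filter() call; conditions separated by a comma are combined with AND
one_step <- toy_gap %>%
  filter(year == 2007, continent == "Europe")

# Compare the two results; they are expected to match
identical(two_steps, one_step)
```

Which form to use is largely a matter of readability: chained steps make each condition stand out, while the single call is more compact.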
Exercise: Americas Population
Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which country on the continent Americas had the largest population in 2007?
- Define the gapminder tibble as the input.
- Filter only the rows where the year column is equal to 2007.
- Use a second layer of filter and keep only the rows where the continent column is equal to Americas.
- Select only the columns country and pop.
- Arrange the results based on the pop column in a descending way.
Quiz: Malformed Code
gapminder %>%
  filter(year == 2007, continent == "Americas") %>%
  select(gapminder, country, pop) %>%
  arrange(desc(pop)) %>%
Take a look at the code above. What mistakes does it contain?
- The gapminder tibble should not be defined in the select() function.
- There should be no %>% applied after the last line.
- There will be no output, because you cannot use these functions in this order.
- The desc() function should be applied on the whole arrange() function and not on a single column.
Create a data transformation pipeline is an excerpt from the course Introduction to R, which is available for free at quantargo.com