Create a data transformation pipeline

Quantargo Blog

2 years ago

[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

All data transformation functions in dplyr can be connected through the pipe operator %>% to create powerful yet expressive data transformation pipelines.

Use the pipe operator %>% to combine multiple dplyr functions into one pipeline

___ %>%
  filter(___) %>%
  select(___) %>%
  arrange(___)

Using the %>% operator

The pipe operator %>% is a special part of the tidyverse universe. It combines multiple functions and runs them one after the other, so that the input of each function is the output of the previous one. Imagine we have the pres_results data frame and want to create a smaller, easier-to-read data frame to answer the question: In which states was the Democratic Party the most popular choice in the 2016 US presidential election? To accomplish this task we need to take the following steps:

filter() the data frame for the rows where the year variable equals 2016.
select() the two variables state and dem, since we are not interested in the rest of the columns.
arrange() the filtered and selected data frame by the dem column in descending order.

The steps and functions described above should be run one after the other, where the input of each function is the output of the previous step. Applying the things you learned so far, you could accomplish this task by taking the following steps:

result <- filter(pres_results, year==2016)
result <- select(result, state, dem)
result <- arrange(result, desc(dem))
result
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows

The first function takes the pres_results data frame, filters it according to the task description and assigns it to the variable result. Then, each subsequent function takes the result variable as input and overwrites it with its own output.
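The same sequence can also be written without an intermediate variable by nesting the calls, although the code then has to be read from the inside out. The sketch below illustrates this on a small made-up data frame; toy_results and its values are hypothetical stand-ins for pres_results and are not part of the original data.

```r
library(dplyr)

# Hypothetical stand-in for pres_results; all values are made up for illustration.
toy_results <- tibble(
  state = c("AA", "BB", "CC", "AA"),
  year  = c(2016, 2016, 2016, 2012),
  dem   = c(0.40, 0.60, 0.50, 0.45),
  rep   = c(0.55, 0.35, 0.45, 0.50)
)

# Step-by-step version: each call overwrites the intermediate variable.
stepwise <- filter(toy_results, year == 2016)
stepwise <- select(stepwise, state, dem)
stepwise <- arrange(stepwise, desc(dem))

# Nested version: the innermost call (filter) runs first, the outermost (arrange) last.
nested <- arrange(select(filter(toy_results, year == 2016), state, dem), desc(dem))

# Both approaches produce the same data frame.
identical(stepwise, nested)
```

Both forms give the same result, but the nested form becomes hard to follow as more steps are added.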

The %>% operator provides a practical way to combine the steps above into what looks like a single step. It takes a data frame as the initial input, applies a sequence of functions, and passes the output of each function on as the input to the next one. The same task as above, this time also keeping the rep column, can be accomplished using the pipe operator %>% like this:

pres_results %>%
  filter(year==2016) %>%
  select(state, dem, rep) %>%
  arrange(desc(dem))
# A tibble: 51 x 3
  state   dem    rep
  <chr> <dbl>  <dbl>
1 DC    0.905 0.0407
2 CA    0.617 0.316 
3 HI    0.610 0.294 
# … with 48 more rows

We can interpret the code in the following way:

We define the original data set as a starting point.
Using the %>% operator right after the data frame tells R that a function is coming which takes the previously defined data frame as input.
We use each function as usual, but skip the first parameter. The data frame input is automatically provided by the output of the previous step.
As long as we add the %>% operator after a step, R will expect an additional step.
In our example the pipeline ends with an arrange() function. It receives the filtered and selected version of the pres_results data frame as input, sorts it by the dem column in descending order, and returns the output.
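To make the "skip the first parameter" rule concrete, the following minimal sketch compares a piped call with the equivalent direct call. It assumes a small made-up data frame; toy_results and its values are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Hypothetical stand-in for pres_results; all values are made up for illustration.
toy_results <- tibble(
  state = c("AA", "BB", "CC", "AA"),
  year  = c(2016, 2016, 2016, 2012),
  dem   = c(0.40, 0.60, 0.50, 0.45)
)

# Piped call: the data frame on the left becomes the first argument of filter().
piped <- toy_results %>% filter(year == 2016)

# Direct call: the data frame is passed explicitly as the first argument.
direct <- filter(toy_results, year == 2016)

# The two calls are interchangeable.
identical(piped, direct)
```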

One difference between the two approaches is that the %>% operator does not permanently save the intermediate or final results. To save the resulting data frame, we need to assign the output to a variable:

result <- pres_results %>%
  filter(year==2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

result
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
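As a rough check of this behavior, the sketch below runs a pipeline with and without assignment. It again uses a made-up data frame; toy_results and top_dem are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical stand-in for pres_results; all values are made up for illustration.
toy_results <- tibble(
  state = c("AA", "BB", "CC", "AA"),
  year  = c(2016, 2016, 2016, 2012),
  dem   = c(0.40, 0.60, 0.50, 0.45),
  rep   = c(0.55, 0.35, 0.45, 0.50)
)

# Without assignment the result is only printed; nothing is stored.
toy_results %>%
  filter(year == 2016) %>%
  select(state, dem)

# The input data frame still has all of its rows and columns.
dim(toy_results)

# With assignment the final result is kept for later use.
top_dem <- toy_results %>%
  filter(year == 2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))
```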

Exercise: Austrian Life Expectancy

Use the %>% operator on the gapminder data set and create a simple data frame to answer the following question: How did the life expectancy in Austria change over the last decades? Required packages are already loaded.

Define the gapminder data frame as the base data frame
Filter only the rows where the country column is equal to Austria by piping gapminder to the filter() function.
Select only the columns: year and lifeExp from the filtered result.
Arrange the selected columns by the year column.

Exercise: European GDP Per Capita

Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which European country had the highest GDP per capita in 2007? Required packages are already loaded.

Define the gapminder tibble as the input
Filter only the rows where the year column is equal to 2007
Use a second layer of filter and keep only the rows where the continent column is equal to Europe
Select only the columns: country and gdpPercap
Arrange the results by the gdpPercap column in descending order

Exercise: Americas Population

Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which country on the continent Americas had the largest population in 2007?

Define the gapminder tibble as the input
Filter only the rows where the year column is equal to 2007
Use a second layer of filter and keep only the rows where the continent column is equal to Americas
Select only the columns: country and pop
Arrange the results by the pop column in descending order

Quiz: Malformed Code

gapminder %>%
  filter(year == 2007, continent == "Americas") %>%
  select(gapminder, country, pop) %>%
  arrange(desc(pop)) %>%

Take a look at the code above. What mistakes does it contain?

The gapminder tibble should not be defined in the select() function.
There should be no %>% applied after the last line.
There will be no output, because you cannot use these functions in this order.
The desc() function should be applied on the whole arrange() function and not on a single column.

Create a data transformation pipeline is an excerpt from the course Introduction to R, which is available for free at quantargo.com