
Introduction to dplyr

[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers].

What is dplyr

There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data.

Anthony Goldbloom, Founder and CEO of Kaggle

Clean data is essential in any data science project, because the results can only be as good as the underlying data. Cleaning data is also the part that usually consumes the most time and causes the most pain for data scientists. R already offers a broad set of tools and functions to manipulate data frames. However, due to its long history, the available base R toolset is fragmented and hard for new users to pick up.

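The dplyr package facilitates the data transformation process through a consistent collection of functions. These functions support different transformations on data frames, including filtering rows, selecting columns, sorting rows, adding or modifying columns, and computing grouped summaries.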

Multiple data frames can also be joined together by common attribute values.
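
As a minimal sketch of such transformations and a join, the following example assumes dplyr is installed. The orders and customers data frames and all their columns are invented for illustration and are not part of the original article.

```r
library(dplyr)

# Hypothetical example data, invented for illustration
orders <- data.frame(
  order_id    = 1:5,
  customer_id = c(1, 2, 1, 3, 2),
  amount      = c(20, 35, 15, 50, 10)
)
customers <- data.frame(
  customer_id = c(1, 2, 3),
  name        = c("Ada", "Ben", "Cleo")
)

# Keep only the rows that match a condition
large_orders <- filter(orders, amount >= 20)

# Keep only selected columns
amounts_only <- select(orders, order_id, amount)

# Sort rows by amount, largest first
sorted <- arrange(orders, desc(amount))

# Add a new column derived from an existing one
with_tax <- mutate(orders, amount_gross = amount * 1.2)

# Join two data frames by a common attribute value
orders_named <- inner_join(orders, customers, by = "customer_id")
```

Each call takes a data frame and returns a new one, so the original orders data frame is left unchanged.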

The consistency of dplyr functions improves usability and enables users to connect transformations together to form data pipelines. These pipelines can also be seen as a high-level query language, much like SQL for database queries. It is even possible to translate data pipelines to other backends, including databases.
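
To illustrate the comparison with SQL, here is a small sketch of a pipeline next to a roughly comparable query. The sales data frame is invented for illustration, and the SQL in the comment is only meant for orientation. The pipe operator %>% used here is explained further below. Translation to a database backend is not shown; in practice it is typically handled by the separate dbplyr package.

```r
library(dplyr)

# Hypothetical example data, invented for illustration
sales <- data.frame(
  region  = c("north", "south", "north", "south", "east"),
  revenue = c(100, 80, 50, 120, 30)
)

# Roughly comparable SQL (for orientation only):
#   SELECT region, SUM(revenue) AS total
#   FROM sales
#   WHERE revenue > 40
#   GROUP BY region
#   ORDER BY total DESC
revenue_by_region <- sales %>%
  filter(revenue > 40) %>%            # WHERE
  group_by(region) %>%                # GROUP BY
  summarise(total = sum(revenue)) %>% # SELECT with an aggregate
  arrange(desc(total))                # ORDER BY
```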


Function Framework

Every data transformation function in dplyr accepts a data frame as its first argument and returns the transformed data frame as its output. A blueprint for a typical dplyr function call looks like this:

transformed <- dplyr_function(my_data_frame, 
                              param_one, 
                              param_two, 
                              ...) 

The dplyr_function can be customized further through additional arguments (param_one, param_two) placed after the first data frame parameter (my_data_frame).
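
A concrete instance of this blueprint might look as follows. This is only a sketch: it uses the mtcars data set that ships with R, and the choice of filter() and of the two conditions is ours, not the article's.

```r
library(dplyr)

# First argument: the data frame to transform (mtcars ships with R).
# Remaining arguments: conditions that customize what filter() keeps.
four_cyl <- filter(mtcars, cyl == 4, mpg > 30)

# The result is again a data frame, so it can be passed to another dplyr function
is.data.frame(four_cyl)
```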

The real power of dplyr comes with the pipe operator %>%, which allows users to chain dplyr functions into data pipelines. The pipe injects the data frame resulting from the previous calculation as the first argument of the next one. A data transformation consisting of three functions looks like

dplyr_function_three(
  dplyr_function_two(
    dplyr_function_one(my_data_frame)))

but can be written with the pipe as

my_data_frame %>%
  dplyr_function_one() %>%
  dplyr_function_two() %>%
  dplyr_function_three()

Because the pipeline lists the functions in the order in which they are actually applied, it is easier to read than nested function calls, which must be read from the inside out.
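
The equivalence of the two styles can be checked with a small sketch. It uses the mtcars data set that ships with R; the three chosen functions are just one possible example.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- arrange(
  select(
    filter(mtcars, cyl == 6),
    mpg, hp),
  desc(mpg))

# Piped calls: read from top to bottom, in the order the steps are applied
piped <- mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))

# Both versions should produce the same data frame
identical(nested, piped)
```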


Introduction to dplyr is an excerpt from the course Introduction to R, which is available for free at quantargo.com
