Data Manipulation Techniques with dplyr

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers.]

Data manipulation is the process of adjusting or rearranging data to make it organized and easier to read.

It is a crucial step in almost every kind of analysis.

Whether you want to study customer behavior, identify trends, or make predictions, you first need to rearrange the data into the shape the analysis requires.

As such, data manipulation techniques provide many benefits.

Data Manipulation Techniques Steps

Data manipulation involves several steps, and it helps to understand their general order.

  1. Build a database from your data sources.
  2. Cleanse the data: with data manipulation you can clean, rearrange, and restructure it.
  3. Develop the data frame for further analysis.
  4. Analyze the data to make sense of the information and produce meaningful insights.

In this tutorial, we explain data manipulation with the dplyr package.

dplyr is the next iteration of plyr, focusing only on data frames. The main advantage of the dplyr package is speed.

Load Library

library(dplyr)
library(readr)

Getting Data

mydata <- read_csv("D:/RStudio/Map/charts.csv")

This dataset contains 327387 observations and 5 variables. You can download the dataset from here.

Piping %>%

head(mydata, 10)
    date     order Title                    Name                                         week
  1 8/4/1958     1 Poor Little Fool         Ricky Nelson                                    1
  2 8/4/1958     2 Patricia                 Perez Prado And His Orchestra                   1
  3 8/4/1958     3 Splish Splash            Bobby Darin                                     1
  4 8/4/1958     4 Hard Headed Woman        Elvis Presley With The Jordanaires              1
  5 8/4/1958     5 When                     Kalin Twins                                     1
  6 8/4/1958     6 Rebel-'rouser            Duane Eddy His Twangy Guitar And The Rebels     1
  7 8/4/1958     7 Yakety Yak               The Coasters                                    1
  8 8/4/1958     8 My True Love             Jack Scott                                      1
  9 8/4/1958     9 Willie And The Hand Jive The Johnny Otis Show                            1
 10 8/4/1958    10 Fever                    Peggy Lee                                       1
mydata %>% head(10)
10 %>% head(mydata, .)
    date     order Title                    Name                                         week
  1 8/4/1958     1 Poor Little Fool         Ricky Nelson                                    1
  2 8/4/1958     2 Patricia                 Perez Prado And His Orchestra                   1
  3 8/4/1958     3 Splish Splash            Bobby Darin                                     1
  4 8/4/1958     4 Hard Headed Woman        Elvis Presley With The Jordanaires              1
  5 8/4/1958     5 When                     Kalin Twins                                     1
  6 8/4/1958     6 Rebel-'rouser            Duane Eddy His Twangy Guitar And The Rebels     1
  7 8/4/1958     7 Yakety Yak               The Coasters                                    1
  8 8/4/1958     8 My True Love             Jack Scott                                      1
  9 8/4/1958     9 Willie And The Hand Jive The Johnny Otis Show                            1
 10 8/4/1958    10 Fever                    Peggy Lee                                       1
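
The three calls above return the same rows because the pipe passes its left-hand side into the call on its right. The sketch below checks that equivalence on a small stand-in data frame; `toy` and the three result names are illustrative and are not part of the original dataset code.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

# Toy stand-in for mydata, built from three rows shown in the output above
toy <- data.frame(
  order = 1:3,
  Title = c("Poor Little Fool", "Patricia", "Splish Splash"),
  stringsAsFactors = FALSE
)

plain  <- head(toy, 2)        # ordinary function call
piped  <- toy %>% head(2)     # left-hand side becomes the first argument
dotted <- 2 %>% head(toy, .)  # the dot marks where the left-hand side goes

identical(plain, piped)
identical(plain, dotted)
```

The dot form is mostly useful when the piped value is not the first argument of the function being called.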

In dplyr, column operations are handled by the select and mutate functions.

Select

The select function is used for selecting columns.

mydata %>%
  select(date, order, Title, Name, 'week')
    date     order Title                    Name                                         week
  1 8/4/1958     1 Poor Little Fool         Ricky Nelson                                    1
  2 8/4/1958     2 Patricia                 Perez Prado And His Orchestra                   1
  3 8/4/1958     3 Splish Splash            Bobby Darin                                     1
  4 8/4/1958     4 Hard Headed Woman        Elvis Presley With The Jordanaires              1
  5 8/4/1958     5 When                     Kalin Twins                                     1
  6 8/4/1958     6 Rebel-'rouser            Duane Eddy His Twangy Guitar And The Rebels     1
  7 8/4/1958     7 Yakety Yak               The Coasters                                    1
  8 8/4/1958     8 My True Love             Jack Scott                                      1
  9 8/4/1958     9 Willie And The Hand Jive The Johnny Otis Show                            1
 10 8/4/1958    10 Fever                    Peggy Lee                                       1
mydata %>%
  select(date:Name, weeks_popular='week')
    date     order Title                    Name                                        weeks_popular
  1 8/4/1958     1 Poor Little Fool         Ricky Nelson                                            1
  2 8/4/1958     2 Patricia                 Perez Prado And His Orchestra                           1
  3 8/4/1958     3 Splish Splash            Bobby Darin                                             1
  4 8/4/1958     4 Hard Headed Woman        Elvis Presley With The Jordanaires                      1
  5 8/4/1958     5 When                     Kalin Twins                                             1
  6 8/4/1958     6 Rebel-'rouser            Duane Eddy His Twangy Guitar And The Rebels             1
  7 8/4/1958     7 Yakety Yak               The Coasters                                            1
  8 8/4/1958     8 My True Love             Jack Scott                                              1
  9 8/4/1958     9 Willie And The Hand Jive The Johnny Otis Show                                    1
 10 8/4/1958    10 Fever                    Peggy Lee                                               1
mydata %>%
  select(-'order')
    date     Title                    Name                                         week
  1 8/4/1958 Poor Little Fool         Ricky Nelson                                    1
  2 8/4/1958 Patricia                 Perez Prado And His Orchestra                   1
  3 8/4/1958 Splish Splash            Bobby Darin                                     1
  4 8/4/1958 Hard Headed Woman        Elvis Presley With The Jordanaires              1
  5 8/4/1958 When                     Kalin Twins                                     1
  6 8/4/1958 Rebel-'rouser            Duane Eddy His Twangy Guitar And The Rebels     1
  7 8/4/1958 Yakety Yak               The Coasters                                    1
  8 8/4/1958 My True Love             Jack Scott                                      1
  9 8/4/1958 Willie And The Hand Jive The Johnny Otis Show                            1
 10 8/4/1958 Fever                    Peggy Lee                                       1
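
The three select patterns above (keep by name, keep a range while renaming, drop with a minus sign) can be tried on a two-row stand-in for the dataset. This is a minimal sketch: `toy`, `kept`, `renamed`, and `dropped` are illustrative names, and the rows are copied from the output shown earlier.

```r
library(dplyr)

# Two rows copied from the printed output, same five columns as mydata
toy <- tibble(
  date  = c("8/4/1958", "8/4/1958"),
  order = c(1, 2),
  Title = c("Poor Little Fool", "Patricia"),
  Name  = c("Ricky Nelson", "Perez Prado And His Orchestra"),
  week  = c(1, 1)
)

kept    <- toy %>% select(Title, Name)                      # keep columns by name
renamed <- toy %>% select(date:Name, weeks_popular = week)  # column range plus a rename
dropped <- toy %>% select(-order)                           # drop a single column

names(kept)
names(renamed)
names(dropped)
```

Bare column names are used here; the quoted forms in the article's code select the same columns.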

Mutate

mydata %>%
  select(date:Name, weeks_popular='week') %>%
  mutate(is_collab = grepl('Featuring', Name)) %>%
  select(Name, is_collab, everything())
    Name                                        is_collab date     order Title                    weeks_popular
  1 Ricky Nelson                                FALSE     8/4/1958     1 Poor Little Fool                     1
  2 Perez Prado And His Orchestra               FALSE     8/4/1958     2 Patricia                             1
  3 Bobby Darin                                 FALSE     8/4/1958     3 Splish Splash                        1
  4 Elvis Presley With The Jordanaires          FALSE     8/4/1958     4 Hard Headed Woman                    1
  5 Kalin Twins                                 FALSE     8/4/1958     5 When                                 1
  6 Duane Eddy His Twangy Guitar And The Rebels FALSE     8/4/1958     6 Rebel-'rouser                        1
  7 The Coasters                                FALSE     8/4/1958     7 Yakety Yak                           1
  8 Jack Scott                                  FALSE     8/4/1958     8 My True Love                         1
  9 The Johnny Otis Show                        FALSE     8/4/1958     9 Willie And The Hand Jive             1
 10 Peggy Lee                                   FALSE     8/4/1958    10 Fever                                1
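
Every row printed above happens to be FALSE, so the effect of the new column is hard to see. The sketch below repeats the same mutate step on a stand-in table that includes one collaboration; the second artist name is a made-up placeholder, not a row from the dataset.

```r
library(dplyr)

toy <- tibble(
  Name = c("Ricky Nelson", "Artist A Featuring Artist B"),  # second name is made up
  week = c(1, 3)                                            # illustrative values
)

flagged <- toy %>%
  # fixed = TRUE matches the literal text instead of treating it as a regex
  mutate(is_collab = grepl("Featuring", Name, fixed = TRUE)) %>%
  # move the new column next to Name and keep every other column after it
  select(Name, is_collab, everything())

flagged$is_collab
names(flagged)
```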

In dplyr, row operations are handled by the filter, distinct and arrange functions.

Filter

mydata %>%
  select(date:Name, weeks_popular='week') %>%
  filter(weeks_popular >= 20, Name == 'Drake' | Name == 'Taylor Swift')
    date       order Title                  Name         weeks_popular
  1 2/3/2007      43 Tim McGraw             Taylor Swift            20
  2 8/4/2007      39 Teardrops On My Guitar Taylor Swift            20
  3 8/11/2007     33 Teardrops On My Guitar Taylor Swift            21
  4 8/18/2007     34 Teardrops On My Guitar Taylor Swift            22
  5 8/25/2007     38 Teardrops On My Guitar Taylor Swift            23
  6 9/1/2007      49 Teardrops On My Guitar Taylor Swift            24
  7 9/8/2007      50 Teardrops On My Guitar Taylor Swift            25
  8 12/15/2007    44 Teardrops On My Guitar Taylor Swift            26
  9 12/22/2007    30 Teardrops On My Guitar Taylor Swift            27
 10 12/29/2007    24 Teardrops On My Guitar Taylor Swift            28
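
In filter, conditions separated by commas are combined with AND, while `|` means OR. When the OR compares one column against several values, base R's `%in%` is a common shorthand. The sketch below compares both spellings on stand-in data; the values in `toy` are illustrative only.

```r
library(dplyr)

# Illustrative values, not taken from the real chart data
toy <- tibble(
  Name          = c("Taylor Swift", "Taylor Swift", "Drake", "Ricky Nelson"),
  weeks_popular = c(20, 5, 36, 25)
)

# Same condition as in the article: at least 20 weeks AND one of two artists
with_or <- toy %>%
  filter(weeks_popular >= 20, Name == "Drake" | Name == "Taylor Swift")

# Equivalent condition written with %in%
with_in <- toy %>%
  filter(weeks_popular >= 20, Name %in% c("Drake", "Taylor Swift"))

identical(with_or, with_in)
```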

Distinct

The distinct function keeps only the unique rows of a data frame.

mydata %>%
  select(date:Name, weeks_popular='week') %>%
  filter(Name == 'Jack Scott') %>%
  distinct(Title)

    Title                                         
  1 My True Love                     
  2 Leroy                            
  3 With Your Love                   
  4 Geraldine                        
  5 Goodbye Baby                     
  6 Save My Soul                     
  7 I Never Felt Like This           
  8 The Way I Walk                   
  9 There Comes A Time               
 10 What In The World's Come Over You
 11 Burning Bridges                  
 12 Oh, Little One                   
 13 Cool Water                       
 14 It Only Happened Yesterday       
 15 Patsy                            
 16 Is There Something On Your Mind  
 17 A Little Feeling (Called Love)   
 18 My Dream Come True               
 19 Steps 1 And 2                    
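
When distinct is given column names, it returns only those columns. If the other columns are needed as well, the `.keep_all` argument keeps the first row found for each distinct value. A minimal sketch on stand-in data follows; the `week` values are illustrative.

```r
library(dplyr)

toy <- tibble(
  Name  = c("Jack Scott", "Jack Scott", "Jack Scott"),
  Title = c("My True Love", "My True Love", "Leroy"),
  week  = c(1, 2, 1)  # illustrative values
)

titles     <- toy %>% distinct(Title)                    # one row per title, Title column only
first_rows <- toy %>% distinct(Title, .keep_all = TRUE)  # first row per title, all columns kept

titles$Title
names(first_rows)
```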

In dplyr, group operations are handled by the group_by, summarise and count functions.

Group_by & Summarise

mydata %>%
  select(date:Name, weeks_popular='week') %>%
  filter(Name == 'Kalin Twins') %>%
  group_by(Title) %>%
  summarise(total_weeks_popular = max(weeks_popular))
   Title         total_weeks_popular
 1 Forget Me Not                  15
 2 When                            9
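
The example above takes the maximum of weeks_popular per title, which gives the total number of weeks only if the week column is a running count of weeks on the chart (an assumption based on how the values increase in the printed output). The sketch below shows the same grouping on stand-in data and adds a row count per group with `n()`; `chart_entries` is an illustrative column name.

```r
library(dplyr)

# Illustrative running counts for two titles
toy <- tibble(
  Title         = c("When", "When", "Forget Me Not", "Forget Me Not", "Forget Me Not"),
  weeks_popular = c(1, 2, 1, 2, 3)
)

per_title <- toy %>%
  group_by(Title) %>%
  summarise(
    total_weeks_popular = max(weeks_popular),  # highest running count per title
    chart_entries       = n()                  # number of rows in each group
  )

per_title
```

summarise returns one row per group, ordered by the grouping column.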

Arrange

mydata %>%
  select(date:Name, weeks_popular='week') %>%
  filter(Name == 'Drake') %>%
  group_by(Title) %>%
  summarise(total_weeks_popular = max(weeks_popular)) %>%
  arrange(desc(total_weeks_popular), Title) %>%
  head(10)
    Title                   total_weeks_popular
  1 God's Plan                               36
  2 Hotline Bling                            36
  3 Controlla                                26
  4 Fake Love                                25
  5 Headlines                                25
  6 Nice For What                            25
  7 Best I Ever Had                          24
  8 In My Feelings                           22
  9 Nonstop                                  22
 10 Started From The Bottom                  22
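
In the arrange call above, the second argument acts as a tie-breaker: rows with the same total_weeks_popular are ordered alphabetically by Title. The sketch below reproduces that behavior with three rows taken from the printed output; `toy` and `ranked` are illustrative names.

```r
library(dplyr)

# Three rows taken from the output above, deliberately out of order
toy <- tibble(
  Title               = c("Hotline Bling", "Controlla", "God's Plan"),
  total_weeks_popular = c(36, 26, 36)
)

ranked <- toy %>%
  # sort by weeks descending, then break ties alphabetically by Title
  arrange(desc(total_weeks_popular), Title)

ranked$Title
```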

Count

mydata %>%
  select(date:Name, weeks_popular='week')  %>%
  count(Name) %>%
  arrange(desc(n))
    Name             n
  1 Taylor Swift   1021
  2 Elton John      889
  3 Madonna         857
  4 Kenny Chesney   758
  5 Drake           742
  6 Tim McGraw      731
  7 Keith Urban     673
  8 Stevie Wonder   659
  9 Rod Stewart     657
 10 Mariah Carey    621
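
count also accepts a `sort` argument, so the count-then-arrange pattern above can be written in one call. The sketch below compares the two forms on stand-in data; the number of rows per artist is illustrative.

```r
library(dplyr)

# Illustrative rows: three for Taylor Swift, two for Drake, one for Madonna
toy <- tibble(
  Name = c("Drake", "Taylor Swift", "Taylor Swift", "Drake", "Taylor Swift", "Madonna")
)

two_step <- toy %>%
  count(Name) %>%
  arrange(desc(n))            # same pattern as in the article

one_step <- toy %>%
  count(Name, sort = TRUE)    # sort = TRUE orders by n, largest first

identical(two_step$Name, one_step$Name)
```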

Conclusion

With the dplyr package, data can be modified the way we want with very little effort.
