Getting started with dplyr in R using Titanic Dataset
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
dplyr
is one of the most popular r-packages and also part of tidyverse
that’s been developed by Hadley Wickham. The mere fact that dplyr
package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us – in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr
so that any data analyst could leverage its potential using a much familiar – Titanic Dataset.
dplyr
library can be installed directly from CRAN and loaded into R session like any other R package.
#install.packages('dplyr') library(dplyr) # Loading Dplyr package
Let us start by reading the input training file using the base r function read.csv
train <- read.csv('../input/train.csv',stringsAsFactors = F, header = T)
Getting the total number of rows in the given data frame (even though it’s been very straightforward with nrow()
in base-r, this being a dplyr
starter-kit, we’ll start with that.
train%>% count() n 891
The above code just gives the row count of the data frame that’s been passed with the pipe %>% operator. The pipe operator works very similar to the | (pipe) operator in Unix environment where the output of the current operation is fed as the input of the following operation. Similarly, in dplyr
or any other package that supports pipe operator, the functions in it will always take only data frame as the first argument hence the function can be called in two ways like below:
count(train) #Without pipe, passing the df as the first argument train %>% count() #with pipe, more convenient and more readability n 891 n 891
But dplyr’s real flavor starts with the following 5 functions (or as most people call, verbs of dplyr):
- select()
- filter()
- arrange()
- mutate()
- summarise()
- group_by()
And let us see what every one of these does!
select
select()
as the name suggests selects the columns that are required from a given dataframe and if multiple columns are required or not required, then one_of()
could be used within select.
select(train,Age) #without pipe Age 22 38 26 35 35 NA 54 2 27 14 4 #multicolumn selection train %>% select(one_of('Sex','Age')) Sex Age male 22 female 38 female 26 female 35 male 35 male NA male 54 male 2 female 27 female 14 female 4 #multicolumn rejection train %>% select(-one_of('Age','Sex')) PassengerId Survived Pclass Name SibSp Parch Ticket Fare Cabin Embarked 1 0 3 Braund, Mr. Owen Harris 1 0 A/5 21171 7.2500 S 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 0 PC 17599 71.2833 C85 C 3 1 3 Heikkinen, Miss. Laina 0 0 STON/O2. 3101282 7.9250 S 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 0 113803 53.1000 C123 S 5 0 3 Allen, Mr. William Henry 0 0 373450 8.0500 S 6 0 3 Moran, Mr. James 0 0 330877 8.4583 Q 7 0 1 McCarthy, Mr. Timothy J 0 0 17463 51.8625 E46 S 8 0 3 Palsson, Master. Gosta Leonard 3 1 349909 21.0750 S 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 2 347742 11.1333 S 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 1 0 237736 30.0708 C
Like selecting a column with entire column name (or multiple column names with one_of()), select could also be used with a few more string ops.
train %>% select(starts_with('P')) PassengerId Pclass Parch 1 3 0 2 1 0 3 3 0 4 1 0 5 3 0 6 3 0 7 1 0 8 3 1 9 3 2 10 2 0 train %>% select(ends_with('e')) Name Age Fare Braund, Mr. Owen Harris 22 7.2500 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38 71.2833 Heikkinen, Miss. Laina 26 7.9250 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35 53.1000 Allen, Mr. William Henry 35 8.0500 Moran, Mr. James NA 8.4583 McCarthy, Mr. Timothy J 54 51.8625 Palsson, Master. Gosta Leonard 2 21.0750 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27 11.1333 Nasser, Mrs. Nicholas (Adele Achem) 14 30.0708 Sandstrom, Miss. Marguerite Rut 4 16.7000
group_by
group_by
is a lot similar to SQL Group by but more versatile. It is related to concept of “split-apply-combine”. Let us understand group_by with a starter example of finding out number of male and number of female – which logically could be the count of each Sex Type (once grouped by Sex).
train %>% group_by(Sex) %>% count() Sex n female 314 male 577
Aha! That seems simple and now let us do a two level grouping to understand how many of survived of each gender.
train %>% group_by(Survived, Sex) %>% count() train %>% group_by(Sex, Survived) %>% count() Survived Sex n 0 female 81 0 male 468 1 female 233 1 male 109 Sex Survived n female 0 81 female 1 233 male 0 468 male 1 109
mutate and summarise
That’s minimally group_by
, but the true power of group_by is unveiled only when it is coupled with mutate
and summarise
functions.
Mutate function adds a new column based on the given expression while summarise function summarises the dataset based on the given function and let us see the difference in action with the following example.
Let us get the average age of all survivors (and non-survivors): so this must be group_by -ed based on Survived while summarised by Age so that we will get a summarised mean value.for two groups.
train %>% group_by(Survived) %>% summarise(mean(Age)) #Remember we have got NAs, so mean() wouldn't work and to bypass NAs, na.rm = T must be passed. train %>% group_by(Survived) %>% summarise(average_age = mean(Age,na.rm=T)) Survived mean(Age) 0 NA 1 NA Survived average_age 0 30.62618 1 28.34369
That’s summarise()
giving us the summary of the dataframe. If we need to create a new column, values filled for all 891 datapoints, that’s where mutate plays its role. Let us create a new column, Age_Bracket
containing value Minor
if Age is less than 18 else Major
train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% select(starts_with('Age')) #In fact this can be coupled with Survivor list to see the impact of this Age_bracket train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% group_by(Survived,Age_Bracket) %>% summarise(pnt = (n()/nrow(train))*100) Age Age_Bracket 22 Major 38 Major 26 Major 35 Major 35 Major NA NA 54 Major 2 Minor 27 Major 14 Minor Survived Age_Bracket pnt 0 Major 41.750842 0 Minor 5.836139 0 NA 14.029181 1 Major 25.701459 1 Minor 6.846240 1 NA 5.836139
That’s how dplyr
can get more powerful with group_by coupled with mutate
or summarise for feautre engineering and for better data visualization. But this doesn’t stop here, because one of the most important function a dataanalyst would require is sorting and that’s what arrange()
does.
arrange
Extracting last 4 results after sorting the fare in asending order:
train %>% arrange(Fare) %>% tail(4) PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 870 319 1 1 Wick, Miss. Mary Natalie female 31 0 2 36928 164.8667 C7 S 871 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45 1 1 36928 164.8667 S 872 690 1 1 Madill, Miss. Georgette Alexandra female 15 0 1 24160 211.3375 B5 S 873 731 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.3375 B5 S 874 780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) female 43 0 1 24160 211.3375 B3 S
Arrange in descending order:
train %>% arrange(desc(Age)) %>% head(5) PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 S 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 C 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 Q
filter
filter
does row_wise
filter ( similar to what select did with columns). filter()
takes a logical expression and evaluates them and results the only_true datapoints. So to be clear, all that matters to filter() function is if the expression evaluates to TRUE.
Let us start with filtering (extracting) only male and getting their Embarked station count.
train %>% filter(Sex == 'male') %>% group_by(Embarked) %>% count() Embarked n C 95 Q 41 S 441 #Getting the count of everyone whose age is lesser than 18 train %>% filter(Age < 18) %>% count() n 113 train %>% filter(grepl('wick',train$Name)) PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 344 0 2 Sedgwick, Mr. Charles Frederick Waddington male 25 0 0 244361 13 S
And this is dplyr in a nut shell and hope you get a decent start with this article, if you are a beginner. Please share your thoughts and suggestions in comments!. The notebook used here is available on my github.
References
Related Post
- How to apply Monte Carlo simulation to forecast Stock prices using Python
- Analysing iOS App Store iTunes Reviews in R
- Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R
- Creating Reporting Template with Glue in R
- Predict Employee Turnover With Python
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.