Getting started with dplyr in R using Titanic Dataset
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
dplyr is one of the most popular r-packages and also part of tidyverse that’s been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us – in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr so that any data analyst could leverage its potential using a much familiar – Titanic Dataset.
dplyr library can be installed directly from CRAN and loaded into R session like any other R package.
#install.packages('dplyr')
library(dplyr) # Loading Dplyr package
Let us start by reading the input training file using the base r function read.csv
train <- read.csv('../input/train.csv',stringsAsFactors = F, header = T)
Getting the total number of rows in the given data frame (even though it’s been very straightforward with nrow() in base-r, this being a dplyr starter-kit, we’ll start with that.
train%>% count() n 891
The above code just gives the row count of the data frame that’s been passed with the pipe %>% operator. The pipe operator works very similar to the | (pipe) operator in Unix environment where the output of the current operation is fed as the input of the following operation. Similarly, in dplyr or any other package that supports pipe operator, the functions in it will always take only data frame as the first argument hence the function can be called in two ways like below:
count(train) #Without pipe, passing the df as the first argument train %>% count() #with pipe, more convenient and more readability n 891 n 891
But dplyr’s real flavor starts with the following 5 functions (or as most people call, verbs of dplyr):
- select()
- filter()
- arrange()
- mutate()
- summarise()
- group_by()
And let us see what every one of these does!
select
select() as the name suggests selects the columns that are required from a given dataframe and if multiple columns are required or not required, then one_of() could be used within select.
select(train,Age) #without pipe
Age
22
38
26
35
35
NA
54
2
27
14
4
#multicolumn selection
train %>% select(one_of('Sex','Age'))
Sex Age
male 22
female 38
female 26
female 35
male 35
male NA
male 54
male 2
female 27
female 14
female 4
#multicolumn rejection
train %>% select(-one_of('Age','Sex'))
PassengerId Survived Pclass Name SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James 0 0 330877 8.4583 Q
7 0 1 McCarthy, Mr. Timothy J 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard 3 1 349909 21.0750 S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 2 347742 11.1333 S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 1 0 237736 30.0708 C
Like selecting a column with entire column name (or multiple column names with one_of()), select could also be used with a few more string ops.
train %>% select(starts_with('P'))
PassengerId Pclass Parch
1 3 0
2 1 0
3 3 0
4 1 0
5 3 0
6 3 0
7 1 0
8 3 1
9 3 2
10 2 0
train %>% select(ends_with('e'))
Name Age Fare
Braund, Mr. Owen Harris 22 7.2500
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38 71.2833
Heikkinen, Miss. Laina 26 7.9250
Futrelle, Mrs. Jacques Heath (Lily May Peel) 35 53.1000
Allen, Mr. William Henry 35 8.0500
Moran, Mr. James NA 8.4583
McCarthy, Mr. Timothy J 54 51.8625
Palsson, Master. Gosta Leonard 2 21.0750
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27 11.1333
Nasser, Mrs. Nicholas (Adele Achem) 14 30.0708
Sandstrom, Miss. Marguerite Rut 4 16.7000
group_by
group_by is a lot similar to SQL Group by but more versatile. It is related to concept of “split-apply-combine”. Let us understand group_by with a starter example of finding out number of male and number of female – which logically could be the count of each Sex Type (once grouped by Sex).
train %>% group_by(Sex) %>% count() Sex n female 314 male 577
Aha! That seems simple and now let us do a two level grouping to understand how many of survived of each gender.
train %>% group_by(Survived, Sex) %>% count() train %>% group_by(Sex, Survived) %>% count() Survived Sex n 0 female 81 0 male 468 1 female 233 1 male 109 Sex Survived n female 0 81 female 1 233 male 0 468 male 1 109
mutate and summarise
That’s minimally group_by, but the true power of group_by is unveiled only when it is coupled with mutate and summarise functions.
Mutate function adds a new column based on the given expression while summarise function summarises the dataset based on the given function and let us see the difference in action with the following example.
Let us get the average age of all survivors (and non-survivors): so this must be group_by -ed based on Survived while summarised by Age so that we will get a summarised mean value.for two groups.
train %>% group_by(Survived) %>% summarise(mean(Age)) #Remember we have got NAs, so mean() wouldn't work and to bypass NAs, na.rm = T must be passed. train %>% group_by(Survived) %>% summarise(average_age = mean(Age,na.rm=T)) Survived mean(Age) 0 NA 1 NA Survived average_age 0 30.62618 1 28.34369
That’s summarise() giving us the summary of the dataframe. If we need to create a new column, values filled for all 891 datapoints, that’s where mutate plays its role. Let us create a new column, Age_Bracket containing value Minor if Age is less than 18 else Major
train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% select(starts_with('Age'))
#In fact this can be coupled with Survivor list to see the impact of this Age_bracket
train %>%
mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>%
group_by(Survived,Age_Bracket) %>%
summarise(pnt = (n()/nrow(train))*100)
Age Age_Bracket
22 Major
38 Major
26 Major
35 Major
35 Major
NA NA
54 Major
2 Minor
27 Major
14 Minor
Survived Age_Bracket pnt
0 Major 41.750842
0 Minor 5.836139
0 NA 14.029181
1 Major 25.701459
1 Minor 6.846240
1 NA 5.836139
That’s how dplyr can get more powerful with group_by coupled with mutate or summarise for feautre engineering and for better data visualization. But this doesn’t stop here, because one of the most important function a dataanalyst would require is sorting and that’s what arrange() does.
arrange
Extracting last 4 results after sorting the fare in asending order:
train %>% arrange(Fare) %>% tail(4) PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 870 319 1 1 Wick, Miss. Mary Natalie female 31 0 2 36928 164.8667 C7 S 871 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45 1 1 36928 164.8667 S 872 690 1 1 Madill, Miss. Georgette Alexandra female 15 0 1 24160 211.3375 B5 S 873 731 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.3375 B5 S 874 780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) female 43 0 1 24160 211.3375 B3 S
Arrange in descending order:
train %>% arrange(desc(Age)) %>% head(5) PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 S 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 C 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 Q
filter
filter does row_wise filter ( similar to what select did with columns). filter() takes a logical expression and evaluates them and results the only_true datapoints. So to be clear, all that matters to filter() function is if the expression evaluates to TRUE.
Let us start with filtering (extracting) only male and getting their Embarked station count.
train %>%
filter(Sex == 'male') %>%
group_by(Embarked) %>%
count()
Embarked n
C 95
Q 41
S 441
#Getting the count of everyone whose age is lesser than 18
train %>% filter(Age < 18) %>% count()
n
113
train %>% filter(grepl('wick',train$Name))
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
344 0 2 Sedgwick, Mr. Charles Frederick Waddington male 25 0 0 244361 13 S
And this is dplyr in a nut shell and hope you get a decent start with this article, if you are a beginner. Please share your thoughts and suggestions in comments!. The notebook used here is available on my github.
References
Related Post
- How to apply Monte Carlo simulation to forecast Stock prices using Python
- Analysing iOS App Store iTunes Reviews in R
- Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R
- Creating Reporting Template with Glue in R
- Predict Employee Turnover With Python
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.