Getting started with dplyr in R using Titanic Dataset

Posted on December 28, 2017 by Abdul Majed Raja in R bloggers | 0 Comments

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

dplyr is one of the most popular r-packages and also part of tidyverse that’s been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us – in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr so that any data analyst could leverage its potential using a much familiar – Titanic Dataset.

dplyr library can be installed directly from CRAN and loaded into R session like any other R package.

#install.packages('dplyr')
library(dplyr) # Loading Dplyr package

Let us start by reading the input training file using the base r function read.csv

train <- read.csv('../input/train.csv',stringsAsFactors = F, header = T)

Getting the total number of rows in the given data frame (even though it’s been very straightforward with nrow() in base-r, this being a dplyr starter-kit, we’ll start with that.

train%>% count()
n
891

The above code just gives the row count of the data frame that’s been passed with the pipe %>% operator. The pipe operator works very similar to the | (pipe) operator in Unix environment where the output of the current operation is fed as the input of the following operation. Similarly, in dplyr or any other package that supports pipe operator, the functions in it will always take only data frame as the first argument hence the function can be called in two ways like below:

count(train) #Without pipe, passing the df as the first argument
train %>% count() #with pipe, more convenient and more readability
n
891
n
891

But dplyr’s real flavor starts with the following 5 functions (or as most people call, verbs of dplyr):

select()
filter()
arrange()
mutate()
summarise()
group_by()

And let us see what every one of these does!

select

select() as the name suggests selects the columns that are required from a given dataframe and if multiple columns are required or not required, then one_of() could be used within select.

select(train,Age) #without pipe
Age
22
38
26
35
35
NA
54
2
27
14
4


#multicolumn selection
train %>% select(one_of('Sex','Age'))
Sex	Age
male	22
female	38
female	26
female	35
male	35
male	NA
male	54
male	2
female	27
female	14
female	4


#multicolumn rejection
train %>% select(-one_of('Age','Sex'))
PassengerId	Survived	Pclass	Name	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	1	0	A/5 21171	7.2500		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	0	0	STON/O2. 3101282	7.9250		S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	0	0	373450	8.0500		S
6	0	3	Moran, Mr. James	0	0	330877	8.4583		Q
7	0	1	McCarthy, Mr. Timothy J	0	0	17463	51.8625	E46	S
8	0	3	Palsson, Master. Gosta Leonard	3	1	349909	21.0750		S
9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	0	2	347742	11.1333		S
10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	1	0	237736	30.0708		C

Like selecting a column with entire column name (or multiple column names with one_of()), select could also be used with a few more string ops.

train %>% select(starts_with('P'))
PassengerId	Pclass	Parch
1	3	0
2	1	0
3	3	0
4	1	0
5	3	0
6	3	0
7	1	0
8	3	1
9	3	2
10	2	0


train %>% select(ends_with('e'))
Name	Age	Fare
Braund, Mr. Owen Harris	22	7.2500
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	38	71.2833
Heikkinen, Miss. Laina	26	7.9250
Futrelle, Mrs. Jacques Heath (Lily May Peel)	35	53.1000
Allen, Mr. William Henry	35	8.0500
Moran, Mr. James	NA	8.4583
McCarthy, Mr. Timothy J	54	51.8625
Palsson, Master. Gosta Leonard	2	21.0750
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	27	11.1333
Nasser, Mrs. Nicholas (Adele Achem)	14	30.0708
Sandstrom, Miss. Marguerite Rut	4	16.7000

group_by

group_by is a lot similar to SQL Group by but more versatile. It is related to concept of “split-apply-combine”. Let us understand group_by with a starter example of finding out number of male and number of female – which logically could be the count of each Sex Type (once grouped by Sex).

train %>% group_by(Sex) %>% count()
Sex	n
female	314
male	577

Aha! That seems simple and now let us do a two level grouping to understand how many of survived of each gender.

train %>% group_by(Survived, Sex) %>% count()
train %>% group_by(Sex, Survived) %>% count()
Survived	Sex	n
0	female	81
0	male	468
1	female	233
1	male	109

Sex	Survived	n
female	0	81
female	1	233
male	0	468
male	1	109

mutate and summarise

That’s minimally group_by, but the true power of group_by is unveiled only when it is coupled with mutate and summarise functions.
Mutate function adds a new column based on the given expression while summarise function summarises the dataset based on the given function and let us see the difference in action with the following example.

Let us get the average age of all survivors (and non-survivors): so this must be group_by -ed based on Survived while summarised by Age so that we will get a summarised mean value.for two groups.

train %>% group_by(Survived) %>% summarise(mean(Age))
#Remember we have got NAs, so mean() wouldn't work and to bypass NAs, na.rm = T must be passed. 
train %>% group_by(Survived) %>% summarise(average_age = mean(Age,na.rm=T))
Survived	mean(Age)
0	NA
1	NA
Survived	average_age
0	30.62618
1	28.34369

That’s summarise() giving us the summary of the dataframe. If we need to create a new column, values filled for all 891 datapoints, that’s where mutate plays its role. Let us create a new column, Age_Bracket containing value Minor if Age is less than 18 else Major

train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% select(starts_with('Age'))
#In fact this can be coupled with Survivor list to see the impact of this Age_bracket
train %>% 
mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% 
group_by(Survived,Age_Bracket) %>% 
summarise(pnt = (n()/nrow(train))*100)
Age	Age_Bracket
22	Major
38	Major
26	Major
35	Major
35	Major
NA	NA
54	Major
2	Minor
27	Major
14	Minor
Survived	Age_Bracket	pnt
0	Major	41.750842
0	Minor	5.836139
0	NA	14.029181
1	Major	25.701459
1	Minor	6.846240
1	NA	5.836139

That’s how dplyr can get more powerful with group_by coupled with mutate or summarise for feautre engineering and for better data visualization. But this doesn’t stop here, because one of the most important function a dataanalyst would require is sorting and that’s what arrange() does.

arrange

Extracting last 4 results after sorting the fare in asending order:

train %>% arrange(Fare) %>%  tail(4) 
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
870	319	1	1	Wick, Miss. Mary Natalie	female	31	0	2	36928	164.8667	C7	S
871	857	1	1	Wick, Mrs. George Dennick (Mary Hitchcock)	female	45	1	1	36928	164.8667		S
872	690	1	1	Madill, Miss. Georgette Alexandra	female	15	0	1	24160	211.3375	B5	S
873	731	1	1	Allen, Miss. Elisabeth Walton	female	29	0	0	24160	211.3375	B5	S
874	780	1	1	Robert, Mrs. Edward Scott (Elisabeth Walton McMillan)	female	43	0	1	24160	211.3375	B3	S

Arrange in descending order:

train %>% arrange(desc(Age)) %>% head(5)
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
631	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	0	0	27042	30.0000	A23	S
852	0	3	Svensson, Mr. Johan	male	74.0	0	0	347060	7.7750		S
97	0	1	Goldschmidt, Mr. George B	male	71.0	0	0	PC 17754	34.6542	A5	C
494	0	1	Artagaveytia, Mr. Ramon	male	71.0	0	0	PC 17609	49.5042		C
117	0	3	Connors, Mr. Patrick	male	70.5	0	0	370369	7.7500		Q

filter

filter does row_wise filter ( similar to what select did with columns). filter() takes a logical expression and evaluates them and results the only_true datapoints. So to be clear, all that matters to filter() function is if the expression evaluates to TRUE.
Let us start with filtering (extracting) only male and getting their Embarked station count.

train %>% 
filter(Sex == 'male') %>%
group_by(Embarked) %>%
count()
Embarked	n
C	95
Q	41
S	441


#Getting the count of everyone whose age is lesser than 18
train %>% filter(Age < 18) %>% count()
n
113


train %>% filter(grepl('wick',train$Name))
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
344	0	2	Sedgwick, Mr. Charles Frederick Waddington	male	25	0	0	244361	13		S

And this is dplyr in a nut shell and hope you get a decent start with this article, if you are a beginner. Please share your thoughts and suggestions in comments!. The notebook used here is available on my github.

References

Titanic Dataset from Kaggle

Kaggle Kernel of the above Notebook

Github Code

Notebook Viewer

Related Post

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.