
Predicting Titanic deaths on Kaggle III: Bagging

[This article was first published on Wiekvoet, and kindly contributed to R-bloggers.]
This is the third post on predicting the Titanic deaths. The first one used random forest, the second boosting (gbm). The aim of this third post is to use bagging. In contrast to the former posts I abandoned dplyr in this post; it gave some now-you-see-it-now-you-don't errors.

Data

The data preparation is supposed to be the same as in the previous posts.
library(ipred)
library(rpart)
library(lattice)

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 35
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100))
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
tt$A <- factor(grepl('A',tt$Cabin))
tt$B <- factor(grepl('B',tt$Cabin))
tt$C <- factor(grepl('C',tt$Cabin))
tt$D <- factor(grepl('D',tt$Cabin))
tt$E <- factor(grepl('E',tt$Cabin))
tt$F <- factor(grepl('F',tt$Cabin))
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$PC <- factor(grepl('PC',tt$Ticket))
tt$STON <- factor(grepl('STON',tt$Ticket))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
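As a quick sanity check on the derived variables (this snippet is my addition, not in the original post), the collapsed Title levels can be tabulated against the train/test split to verify that both parts contain the same levels:

# check that the collapsed Title levels occur in both train and test
table(tt$Title,tt$status)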

Age

The first step is again to predict the missing ages. Even though I have all data available in one data.frame, I still think the correct approach is to create the age model using only the training data. Note that I am not too impressed with the age model; perhaps this should also be optimized.
forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in% 
   c('Age','Sex','Pclass','SibSp',
   'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
   'ncabin','PC','STON','oe')]

ipbag1 <- bagging(Age ~.,data=forage)
ipbag1
Bagging regression trees with 25 bootstrap replications 

Call: bagging.data.frame(formula = Age ~ ., data = forage)
plot(tt$Age~predict(ipbag1,tt))
tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(ipbag1,tt[is.na(tt$AGE),])
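Since the age model might be worth optimizing, one possible check (my addition, not in the original post) is to cross-validate its root mean squared error with the same errorest() utility used below for the survival model, as a baseline to compare candidate settings against:

# not in the original post: 10-fold cross-validated RMSE of the age model
set.seed(42)
errorest(Age ~ ., data=forage, model=bagging,
    est.para=control.errorest(k=10))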

Selecting the survival model

ipred, the package in which bagging resides, comes with a nice general purpose cross-validation utility. In the end, I decided the two parameters to be optimized are ns, the size of the bags, and minsplit, the minimum number of observations that must exist in a node for a split to be attempted. nbagg, the number of bootstrap replications, just needs to be big enough. Regarding nbagg, I have the feeling that for this particular problem, with relatively few records, a relatively high nbagg may be needed to get reproducible models.
# titanic is the training subset of tt, as in the previous posts
titanic <- tt[tt$status=='train',]
di1 <- subset(titanic,select=c(
    age,SibSp,Parch,Fare,Sex,Pclass,
        Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived))
dso <- expand.grid(ns=seq(100,300,25),nbagg=c(500),minsplit=1:6)
la <- lapply(1:nrow(dso),function(ii) {
   ee <-    errorest(Survived ~ .,
    ns=dso$ns[ii],
    control=rpart.control(minsplit=dso$minsplit[ii], cp=0, 
       xval=0,maxsurrogate=0),
    nbagg=dso$nbagg[ii],
    model=bagging,
    data=di1,
    est.para=control.errorest(k=20)
    )
    cc <- c(ns=dso$ns[ii],minsplit=dso$minsplit[ii],nbagg=dso$nbagg[ii],error=ee$error)
    print(cc)
    cc
  })
las <- do.call(rbind,la) 
las <- as.data.frame(las)
xyplot(error ~ ns, groups= minsplit, data=las,auto.key=TRUE,type='l')
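To see which setting comes out lowest in this particular run (keeping in mind the noise discussed below), the row with the smallest cross-validated error can be pulled out directly; this snippet is my addition:

# row of the grid with the lowest cross-validated error in this run
las[which.min(las$error),]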

Predictions

Based on the plot I have chosen ns=275 and minsplit=5. But, to be honest, in a previous run I had chosen ns=150 and minsplit=2, which from this plot is obviously a silly choice. Given the high variability in the plot between parameter settings that are quite similar, and the totally different result of that earlier run, I actually think there is quite a lot of noise in the validation. What the plot really shows, then, is that there is relatively little difference between the settings.
Having said that, these new settings got me just over 0.8 in the Kaggle score, while the previous settings were just below.
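One way to gauge how much of that variability is noise (my addition, not in the original post) is to repeat the cross-validation a few times at a single setting and look at the spread of the error estimates. This can take a while with nbagg=500:

# repeat 20-fold cross-validation five times at one setting to see
# how much the error estimate itself varies between runs
reps <- sapply(1:5, function(i) {
   errorest(Survived ~ ., data=di1,
       model=bagging, ns=275, nbagg=500,
       control=rpart.control(minsplit=5, cp=0, xval=0, maxsurrogate=0),
       est.para=control.errorest(k=20))$error
})
summary(reps)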
 bagmod <- bagging(Survived ~.,ns=275,nbagg=500,
    control=rpart.control(minsplit=5, cp=0, xval=0,maxsurrogate=0),
    data=di1)

# the derived variables were added to tt, so predict on the test rows of tt
pp <- predict(bagmod,tt[tt$status=='test',])

out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='bag8aug.csv',
    row.names=FALSE,
    quote=FALSE)