
Titanic Kaggle competition pt. 2

[This article was first published on numbr crunch - Blog, and kindly contributed to R-bloggers.]
< size="4">Logistic Regression Continued< >
I’m finally getting back to tackling the Titanic competition. In my last entry, I started with some basic models (only females live, only 1st and 2nd class females live, etc.), and then moved on to logistic regression. My logistic regression model at the time was not performing that well, but I was also only using four features. The forums on Kaggle contain a lot of great information; for example, I found that lots of other people were using logistic regression, and that feature engineering was what helped them increase their model performance. I’ll describe what I did below; you can also follow along on Github.

So let’s begin with the feature engineering that I did (a rough code sketch follows the list):

  • Replaced NAs in test$Fare with the median of the remaining test$Fare values
  • Created a new feature called “Family” that tallies the total individuals per family as SibSp + Parch + 1
  • Converted Fare to ranges 1 through 6 in $20 increments (only used in logistic regression), where range 6 is any Fare greater than $100
  • Extracted people’s titles from the Name feature and put them into a new feature called “Title”
  • Replaced NAs in the Age feature with the average age of individuals sharing that Title
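In code, the steps above look roughly like the following condensed sketch (column names follow the standard Kaggle Titanic data; details like the title regex may differ slightly from the full version on Github):

# Fill the missing test fare with the median of the remaining fares
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
# Family size = siblings/spouses + parents/children + the passenger
train$Family <- train$SibSp + train$Parch + 1
test$Family <- test$SibSp + test$Parch + 1
# Fare ranges 1-6 in $20 increments, capped at range 6 for fares over $100
train$FareRange <- pmin(floor(train$Fare / 20) + 1, 6)
# Pull the title (e.g. "Mr", "Miss") out of "Surname, Title. First Names"
train$Title <- sub("^.*, ([^.]*)\\..*$", "\\1", train$Name)
# Impute missing ages with the average age for each title
avg_age <- tapply(train$Age, train$Title, mean, na.rm = TRUE)
train$Age[is.na(train$Age)] <- avg_age[train$Title[is.na(train$Age)]]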

Once all the feature engineering was complete, I used the updated features in a new basic logistic regression model to see how it performed.

# Fit a logistic regression on the engineered features
logist <- glm(Survived ~ Sex + Pclass + Age + SibSp + Pclass:Sex + Title + Fare,
              data = train, family = "binomial")
# Predict survival probabilities, then threshold at 0.5
test$Survived <- predict(logist, newdata = test, type = "response")
test$Survived[test$Survived > 0.5] <- 1
test$Survived[test$Survived != 1] <- 0
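For scoring, Kaggle expects a two-column CSV of PassengerId and Survived. A minimal sketch that writes it out (the file name is arbitrary):

# Write predictions in Kaggle's submission format
submission <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submission, file = "logistic_submission.csv", row.names = FALSE)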

This entry gave me a score of 0.78947, a significant improvement over the original logistic regression model without any feature engineering. At this point, my options for improving on logistic regression would be to tweak the model parameters or do some additional feature engineering. Since I haven’t had much success with parameter tweaking and I’m tired of feature engineering, I’m going to move on to another model.

< size="4">Decision Trees< >
After spending so much time on the message boards, I found that others were using decision trees as another classification option. Two common packages used for decision trees in R are “rpart” and “party”. The “rpart” package is pretty simple to use with the data as it exists.

install.packages("rpart")
library("rpart")
# Classification tree on the engineered features
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + Family,
             data = train, method = "class")
Prediction <- predict(fit, test, type = "class")
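One nice thing about rpart is that you can look at the splits the tree actually learned; a quick sketch using base plotting (the uniform and margin arguments just tidy the layout):

# Visualize the fitted tree's splits
plot(fit, uniform = TRUE, margin = 0.1)
text(fit, use.n = TRUE, cex = 0.8)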

This achieved a slight improvement in my score compared to my logistic regression model. My goal at this point was to learn how to boost or bag the decision tree model to see what kind of performance increase I could get. My explanation of boosting and bagging won’t be as good as this one. I tried to understand the documentation for the “adabag” package (a boosting and bagging package for R) but I could not get it to work. I went back to Trevor Stephens’ webpage and found that he used a package called “party” to run an ensemble decision tree model. This model does require some tweaking to the data before running: the factor levels for any features used must match between the train and test sets. My code for that follows the short adabag aside below.
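As that aside, this is roughly the shape my adabag attempt took (an untested sketch, so treat it as illustrative only; I never got predictions out of it):

# Untested sketch of bagging with "adabag"; mfinal sets the number of trees
install.packages("adabag")
library(adabag)
train$Survived <- as.factor(train$Survived)  # adabag models a factor response
bag_fit <- bagging(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                   data = train, mfinal = 50)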

# Ensemble Decision Trees
# Match train and test factor levels before running the ensemble model
table(train$Embarked)
table(test$Embarked)
which(train$Embarked == "")
train$Embarked[c(62,830)] <- "S"
train$Embarked <- factor(train$Embarked)
table(train$Title)
table(test$Title)
which(train$Title == 0)
train$Title[c(370,444)] <- "Miss"
train$Title <- factor(train$Title)
train$Title[train$Title %in% c("Capt","Major","Sir")] <- "Col"
train$Title[train$Title %in% c("Countess","Lady")] <- "Mrs"
train$Title[train$Title %in% c("Mlle")] <- "Miss"
train$Title <- factor(train$Title)

# Install package and run ensemble decision tree
install.packages("party")
library(party)
set.seed(500)
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + Family,
               data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))
Prediction <- predict(fit, test, OOB = TRUE, type = "response")

Hooray! I eked out a slightly higher score! My original goal was to stop once I reached 0.80, but I’m fine with rounding 0.79904 up, which puts me in the top 25% of submissions. At this point, the time I’m spending improving my models and updating my features is producing diminishing returns, and there are plenty of other data problems I would rather tackle. I really enjoyed playing around with the data and the models, seeing what worked and what didn’t, and I learned a lot doing it. One of the scariest things I learned was how little I actually understood about the models I was running. I have a basic understanding of linear regression, logistic regression, and decision trees, but I don’t always know how to improve my performance if I get stuck. A lot of that comes down to understanding the models well enough to tweak parameters or choose an alternate model based on the data. Anyway, I think this can be a topic for a future post. Until then…

