Titanic Kaggle competition pt. 2
I'm finally getting back to tackling the Titanic competition. In my last entry, I started with some basic models (only females live, only 1st- and 2nd-class females live, etc.), and then moved on to logistic regression. My logistic regression model at the time was not performing that well, but I was also only using four features. The forums on Kaggle contain a lot of great information; for example, I found that lots of other people were using logistic regression, and that feature engineering was what helped them increase their model performance. I'll describe what I did below; you can also follow along on GitHub.
So let's begin with the feature engineering I did (a rough code sketch follows the list):
- Replaced the NAs in test$Fare with the median of the remaining test$Fare values
- Created a new feature called "Family", which tallies the total individuals per family as SibSp + Parch + 1
- Converted fares to ranges (only used in logistic regression) from 1 through 6 in $20 increments, where range 6 is any fare greater than $100
- Extracted passengers' titles from the Name feature and put them into a new feature called Title
- Replaced the NAs in the Age feature with the average age of individuals sharing the same Title
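Here is a minimal sketch of how those steps can be implemented. I'm assuming the standard Kaggle column names (Name, Fare, Age, SibSp, Parch) and names formatted like "Braund, Mr. Owen Harris"; FareRange and mean.age are my own labels, so adapt as needed.
# Fill the missing test fare with the median of the remaining fares
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
# Family size: siblings/spouses + parents/children + the passenger themselves
train$Family <- train$SibSp + train$Parch + 1
test$Family <- test$SibSp + test$Parch + 1
# Fare ranges 1 through 6 in $20 steps; range 6 catches anything over $100
train$FareRange <- cut(train$Fare, breaks = c(-Inf, 20, 40, 60, 80, 100, Inf), labels = 1:6)
test$FareRange <- cut(test$Fare, breaks = c(-Inf, 20, 40, 60, 80, 100, Inf), labels = 1:6)
# Titles sit between the comma and the first period in Name
train$Title <- gsub("^.*, ([A-Za-z ]+)\\..*$", "\\1", train$Name)
test$Title <- gsub("^.*, ([A-Za-z ]+)\\..*$", "\\1", test$Name)
# Impute missing ages with the mean age of passengers sharing that title
mean.age <- aggregate(Age ~ Title, data = train, FUN = mean)
for (t in mean.age$Title) {
  train$Age[is.na(train$Age) & train$Title == t] <- mean.age$Age[mean.age$Title == t]
  test$Age[is.na(test$Age) & test$Title == t] <- mean.age$Age[mean.age$Title == t]
}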
Once all the feature engineering was complete, I used the updated features in a new basic logistic regression model to see how it performed.
# Logistic regression with the new Title feature and a Pclass:Sex interaction
logist <- glm(Survived ~ Sex + Pclass + Age + SibSp + Pclass:Sex + Title + Fare,
              data = train, family = "binomial")
# Predict survival probabilities, then threshold at 0.5 for a 0/1 label
test$Survived <- predict(logist, newdata = test, type = "response")
test$Survived <- ifelse(test$Survived > 0.5, 1, 0)
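To turn those 0/1 predictions into an actual Kaggle submission, something like the following works; this step isn't shown in the post, and the filename is arbitrary:
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "logistic_submission.csv", row.names = FALSE)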
Decision Trees
After spending so much time on the message boards, I found that others were using decision trees as another classification option. Two common packages used for decision trees in R are “rpart” and “party”. The “rpart” package is pretty simple to use with the data as it exists.
install.packages("rpart")
library("rpart")
# Single classification tree on the engineered features
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + Family,
             data = train, method = "class")
Prediction <- predict(fit, test, type = "class")
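The post doesn't visualize the tree, but base graphics can plot an rpart fit directly if you want a sanity check on the splits:
plot(fit, uniform = TRUE, margin = 0.1)
text(fit, use.n = TRUE, cex = 0.8)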
# Ensemble Decision Trees
# Match the factor levels between train and test for the ensemble decision trees
# Two passengers have a blank Embarked value; assign the most common port, "S"
table(train$Embarked)
table(test$Embarked)
which(train$Embarked == "")
train$Embarked[c(62, 830)] <- "S"
train$Embarked <- factor(train$Embarked)  # re-factor to drop the empty "" level
# Two Title values in train came through as 0; recode them as "Miss"
table(train$Title)
table(test$Title)
which(train$Title == 0)
train$Title[c(370, 444)] <- "Miss"
train$Title <- factor(train$Title)
# Collapse rare titles into more common equivalents so the levels line up
train$Title[train$Title %in% c("Capt", "Major", "Sir")] <- "Col"
train$Title[train$Title %in% c("Countess", "Lady")] <- "Mrs"
train$Title[train$Title %in% c("Mlle")] <- "Miss"
train$Title <- factor(train$Title)  # re-factor to drop the now-unused levels
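Only the train recoding is shown above; for the factor levels to actually match, test presumably needs the same collapsing. A hedged sketch, assuming test$Title is still a character vector at this point and allowing for test-only titles such as "Dona":
test$Title[test$Title %in% c("Capt", "Major", "Sir")] <- "Col"
test$Title[test$Title %in% c("Countess", "Lady", "Dona")] <- "Mrs"
test$Title[test$Title %in% c("Mlle", "Ms")] <- "Miss"
# Force identical level sets; anything left unmapped becomes NA, so check first
test$Title <- factor(test$Title, levels = levels(train$Title))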
# Install package and run ensemble decision tree
install.packages("party")
library(party)
set.seed(500)  # for reproducibility
# Conditional inference forest: 2000 trees, 3 candidate variables per split
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + Family,
               data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))
Prediction <- predict(fit, test, OOB = TRUE, type = "response")
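As a follow-up that isn't in the original workflow, party can report variable importance for a cforest fit, which is a quick way to see which of the engineered features the forest actually leans on:
vi <- varimp(fit)  # conditional = TRUE is more rigorous but much slower
sort(vi, decreasing = TRUE)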