
The Good ol’ Titanic Kaggle Competition pt. 1

[This article was first published on numbr crunch - Blog, and kindly contributed to R-bloggers].
After the rule-based submissions, I began playing around with logistic regression. So far, none of my attempts at logistic regression have improved my score, but I have some ideas for tomorrow (I already reached my submission limit for today). I do realize now that I need a plan for my logistic regression models: I need to determine which features are most likely to provide signal instead of blindly plugging in different ones. Since the code for this portion is short, I included it below.

# Vineet Abraham
# Kaggle Titanic Problem

rm(list=ls())
train <- read.csv("~/Documents/RStudio/Titanic/train.csv")
test <- read.csv("~/Documents/RStudio/Titanic/test.csv")
str(train)
table(train$Survived)
prop.table(table(train$Survived))
test$Survived <- rep(0, nrow(test))

# First submission, assume everybody dies
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

# margin = 2 normalises each column, so the values read as rates within a class or sex
prop.table(table(train$Survived, train$Pclass), margin = 2)
# Roughly three quarters of 3rd class passengers died, most 1st class passengers lived
prop.table(table(train$Survived, train$Sex), margin = 2)
# Most males died
prop.table(table(train$Sex, train$Pclass))
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1

# Second submission, all 1st class females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

test$Survived <- rep(0, nrow(test))
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1
test$Survived[test$Sex == "female" & test$Pclass == 2] <- 1

# Third submission, only 1st and 2nd class females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

test$Survived <- rep(0, nrow(test))
test$Survived[test$Sex == "female"] <- 1

# Fourth submission, only females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

# Fill missing Age and Fare values with the mean of the observed values
ave_agetr <- mean(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- ave_agetr
ave_agete <- mean(test$Age, na.rm = TRUE)
test$Age[is.na(test$Age)] <- ave_agete
ave_farete <- mean(test$Fare, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- ave_farete
logist <- glm(Survived ~ Sex + Fare + Pclass + Age, data = train, family = "binomial")
# Predicted survival probabilities, thresholded at 0.5 to get 0/1 labels
surv_prob <- predict(logist, newdata = test, type = "response")
test$Survived <- as.integer(surv_prob > 0.5)

# Fifth submission, Logistic regression using Sex, Fare, Pclass, and Age
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###
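
One possible way to put a plan behind the logistic regression attempts, without spending Kaggle submissions, is to hold out part of the training data and score candidate models locally. The sketch below is only an illustration under stated assumptions: the data frame `toy` is simulated (it is not the Kaggle data), the coefficients used to generate its labels are made up, and the 25% holdout size is an arbitrary choice. `drop1()` with `test = "Chisq"` gives a likelihood-ratio test for removing each term, which can serve as a rough guide to which features carry signal.

```r
set.seed(42)  # reproducible toy data and split

# Synthetic stand-in for train.csv: simulated rows, NOT the real Kaggle data.
n <- 400
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70)),
  Fare   = round(runif(n, 5, 100), 2)
)
# Made-up relationship, used only to generate labels for this demo.
logit <- 2.5 * (toy$Sex == "female") - 0.8 * toy$Pclass - 0.02 * toy$Age + 0.5
toy$Survived <- rbinom(n, 1, plogis(logit))

# Hold out 25% of the rows as a local validation set.
holdout  <- sample(n, size = n / 4)
fit_rows <- toy[-holdout, ]
val_rows <- toy[holdout, ]

fit <- glm(Survived ~ Sex + Fare + Pclass + Age, data = fit_rows,
           family = "binomial")

# Likelihood-ratio test for dropping each term in turn.
term_tests <- drop1(fit, test = "Chisq")
print(term_tests)

# Threshold predicted probabilities at 0.5 and score on the holdout rows.
val_prob <- predict(fit, newdata = val_rows, type = "response")
val_pred <- as.integer(val_prob > 0.5)
val_acc  <- mean(val_pred == val_rows$Survived)
print(val_acc)
```

With the real data, `toy` would be replaced by `train` after the mean imputation step. The holdout accuracy is only an estimate of the leaderboard score, but it lets different feature sets be compared before submitting.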

It’s been over two months since I finished the Data Science certificate program through the University of Washington. Since then I’ve been trying to figure out my next step. The annoying thing about the internet is that it gives you too many options. Every time I search “learning data science”, “how to become a data scientist”, or “what data science tools should I learn”, I get completely inundated with information. I can’t tell you how many times one article has led to several others, and in the end I can’t even remember where I started. In all of this noise, I’ve realized one thing: you just HAVE TO START SOMEWHERE. I’ve done Kaggle in the past and I’m pretty familiar with R, so I figured I would go back to the Titanic problem and see what happens. I won’t rehash the entire problem, but basically you are given a set of features about passengers on the Titanic, which you use to build a model that predicts whether each passenger died or survived. I have to give a shoutout to Trevor Stevens and his blog for getting me started.

For my analysis, I started with some simple proportion tables to see what impact different categorical features had on survival. You can see my code on GitHub for all the details. Passenger Class and Sex were the most obvious features to test, since they have only 3 and 2 levels respectively and seem likely to provide some insight on survival (unlike the Embarked feature). I found that 3rd class passengers and males were the most likely to die. I created a few submissions based on sex and class. My females-only prediction is currently my best score at 0.76555.
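
As a rough illustration of the proportion tables and the females-only rule described above, here is a minimal sketch on a tiny made-up data frame (the rows in `toy` are invented for the example and are not the Kaggle data). Passing `margin = 2` to `prop.table()` normalises each column, so the entries read as the share of each class or sex that died or survived. The same rule that goes into a submission can also be scored against known labels locally.

```r
# Toy stand-in for train.csv: made-up rows, NOT the real Kaggle data.
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0),
  Sex      = c("female", "female", "female", "female", "male",
               "male", "male", "male", "male", "male"),
  Pclass   = c(1, 2, 3, 1, 1, 2, 3, 1, 3, 3)
)

# Column-wise proportions: each column sums to 1.
by_class <- prop.table(table(toy$Survived, toy$Pclass), margin = 2)
by_sex   <- prop.table(table(toy$Survived, toy$Sex), margin = 2)
print(round(by_class, 2))
print(round(by_sex, 2))

# Score the "only females live" rule against the known labels.
rule_pred <- as.integer(toy$Sex == "female")
rule_acc  <- mean(rule_pred == toy$Survived)
print(rule_acc)
```

On the real training set, the same two lines for `rule_pred` and `rule_acc` (with `train` in place of `toy`) would give a local accuracy to compare against the leaderboard score.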

