The Good Ol' Titanic Kaggle Competition pt. 1
# Vineet Abraham
# Kaggle Titanic Problem
rm(list=ls())
train <- read.csv("~/Documents/RStudio/Titanic/train.csv")
test <- read.csv("~/Documents/RStudio/Titanic/test.csv")
str(train)
table(train$Survived)
prop.table(table((train$Survived)))
test$Survived <- rep(0, nrow(test))
# First submission, assume everybody dies
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###
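# margin = 2 normalizes within each column, so each class (or sex) sums to 1
prop.table(table(train$Survived, train$Pclass), margin = 2)
# Roughly three quarters of 3rd class passengers died; most 1st class passengers lived
prop.table(table(train$Survived, train$Sex), margin = 2)
# Most males died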
prop.table(table(train$Sex, train$Pclass))
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1
# Second submission, all 1st class females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###
test$Survived <- rep(0, nrow(test))
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1
test$Survived[test$Sex == "female" & test$Pclass == 2] <- 1
# Third submission, only 1st and 2nd class females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###
test$Survived <- rep(0, nrow(test))
test$Survived[test$Sex == "female"] <- 1
# Fourth submission, only females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###
ave_agetr <- mean(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- ave_agetr
ave_agete <- mean(test$Age, na.rm = TRUE)
test$Age[is.na(test$Age)] <- ave_agete
ave_farete <- mean(test$Fare, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- ave_farete
logist <- glm(Survived ~ Sex + Fare + Pclass + Age, data = train, family = "binomial")
# Predicted survival probabilities, thresholded at 0.5 into 0/1 labels
prob <- predict(logist, newdata = test, type = "response")
test$Survived <- as.integer(prob > 0.5)
# Fifth submission, Logistic regression using Sex, Fare, Pclass, and Age
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###
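Every submission above ends with the same two lines. As a minimal sketch, they could be wrapped in a small helper. The name `write_submission` and its arguments are hypothetical and not part of the original script, and the toy data frame below only stands in for the real test set.

```r
# Hypothetical helper (not in the original script): builds the two-column
# submission data frame and writes it to a CSV file.
write_submission <- function(test, file = "submission.csv") {
  stopifnot(all(c("PassengerId", "Survived") %in% names(test)))
  submit <- data.frame(PassengerId = test$PassengerId,
                       Survived = as.integer(test$Survived))
  write.csv(submit, file = file, row.names = FALSE)
  invisible(submit)
}

# Toy stand-in for the Kaggle test set (made-up rows, for illustration only)
toy_test <- data.frame(PassengerId = 1:4,
                       Sex = c("female", "male", "female", "male"),
                       Pclass = c(1, 3, 3, 2))

# The "only females live" rule from the fourth submission
toy_test$Survived <- as.integer(toy_test$Sex == "female")

# Write to a temporary file so the sketch does not touch a real submission
out_file <- tempfile(fileext = ".csv")
submit <- write_submission(toy_test, out_file)
```

With the real data, the call would presumably be `write_submission(test)` after each change to `test$Survived`.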
For my analysis, I started with some simple proportion tables to see how different categorical features affected survival. You can see my code on GitHub for all the details. Passenger class and sex were the most obvious features to test: they have only 3 and 2 levels respectively, and they seemed more likely to say something about survival than a feature like Embarked. I found that 3rd class passengers and males were the most likely to die, so I created a few submissions based on sex and class. My females-only prediction is currently my best score, at 0.76555.
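To read a statement like "most males died" straight off a proportion table, it helps to normalize within each group rather than over the whole table. The sketch below shows the difference using the `margin` argument of `prop.table`. The data frame is made up for illustration and is not the Kaggle training set.

```r
# Made-up rows for illustration; not the Kaggle training set
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1, 0, 1),
  Sex      = c("male", "male", "male", "male",
               "female", "female", "female", "female")
)

counts <- table(toy$Survived, toy$Sex)

# No margin: each cell is a share of all passengers (all cells sum to 1)
joint <- prop.table(counts)

# margin = 2: each column sums to 1, so a cell is the share of that sex
# that died (row "0") or survived (row "1")
by_sex <- prop.table(counts, margin = 2)

print(by_sex)
```

Using `margin = 1` instead would normalize within each row, which answers the different question of what share of survivors (or of those who died) belonged to each sex.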