Recursive binary partitioning
All “recursive binary partitioning” algorithms producing a decision tree iteratively repeat two steps for every generated node that realizes a split:
- Determine a covariate to split at
- Determine a split within the set of values of that covariate
Because step two splits a covariate’s set of values into two parts, the method is called “binary partitioning”. And because those two steps traverse a tree from the root node to the terminal nodes (leaves) in the usual recursive manner, it is called “recursive binary partitioning”.
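To make the two steps concrete, here is a toy base-R sketch of the recursion. It greedily minimizes misclassifications on numeric covariates only – this is not party’s actual test-based procedure, and all function names are my own:

```r
majority <- function(y) names(which.max(table(y)))

grow_tree <- function(x, y, depth = 0, max_depth = 2) {
  # stop if the maximum depth is reached or the node is already pure
  if (depth == max_depth || length(unique(y)) == 1)
    return(list(leaf = TRUE, pred = majority(y)))
  best <- NULL
  for (j in seq_along(x))                    # step 1: loop over covariates
    for (s in sort(unique(x[[j]]))[-1]) {    # step 2: loop over split points
      left <- x[[j]] < s
      err  <- sum(y[left]  != majority(y[left])) +
              sum(y[!left] != majority(y[!left]))
      if (is.null(best) || err < best$err) best <- list(j = j, s = s, err = err)
    }
  if (is.null(best)) return(list(leaf = TRUE, pred = majority(y)))
  left <- x[[best$j]] < best$s
  list(leaf = FALSE, var = names(x)[best$j], split = best$s,
       left  = grow_tree(x[left, , drop = FALSE],  y[left],  depth + 1, max_depth),
       right = grow_tree(x[!left, , drop = FALSE], y[!left], depth + 1, max_depth))
}

# tiny example: children survive, adults don't
x <- data.frame(age = c(2, 8, 30, 40, 50, 60))
y <- c("yes", "yes", "no", "no", "no", "no")
tree <- grow_tree(x, y)   # splits at age < 30
```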
What makes the party approach special?
Both steps, as well as the stopping criterion of the algorithm, are based on permutation tests instead of optimizing an information measure. So essentially a hypothesis test is performed for every covariate separately against the null hypothesis “The covariate X is independent of the response variable.”. If this hypothesis cannot be rejected at a configured significance level, then the algorithm stops because no split is assumed to improve the classification. Otherwise the covariate with the strongest association is chosen for the next split. Optimizing an information measure instead (as rpart does) continues until a specified level of information purity is reached – but because this approach disregards statistical meaningfulness, it is prone to overfitting. For that reason the researcher is expected to prune the tree afterwards – cutting off branches which are assumed to impair generalization.
Further advantages stem from letting statistical tests guide the unfolding of the tree. The authors of party note, for example, that the bias towards selecting a covariate with many possible splits is greatly reduced compared to other common approaches. For the gory details and benchmarks I highly recommend the well-written vignette provided with the party package.
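As a minimal illustration of the underlying idea – not party’s actual implementation, which uses asymptotic or Monte Carlo approximations of the conditional null distribution – a permutation test of independence can be sketched in a few lines of base R (the data here is made up):

```r
# permutation test of independence between a binary covariate and a 0/1 response:
# test statistic is the absolute difference in group means
perm_test <- function(x, y, n_perm = 1000) {
  stat <- function(x, y)
    abs(mean(y[x == levels(x)[1]]) - mean(y[x != levels(x)[1]]))
  observed <- stat(x, y)
  perms <- replicate(n_perm, stat(x, sample(y)))  # shuffle y to simulate the null
  mean(perms >= observed)   # p-value: share of permutations at least as extreme
}

set.seed(1)
sex      <- factor(rep(c("female", "male"), each = 50))
survived <- rbinom(100, 1, ifelse(sex == "female", 0.7, 0.2))
perm_test(sex, survived)   # small p-value: independence rejected, so split here
```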
Representing the data
The initial data set contains for every passenger the following information:
- class in {1,2,3}
- sex
- name (contains titles – {Mr., Mrs., Miss., Master., Dr., …})
- age (partially missing)
- number of people on board being sibling or spouse
- number of people on board being parent or child
- ticket number
- fare
- cabin (mostly missing)
- embarked in {S,Q,C}
I decided to use the following representations:
- class (as is)
- sex (as is)
- title in {Mister, Doctor, Soldier, Reverend, Missis, Miss, Master}
- age category in {0-4, 5-10, 11-15, 16-59, >59}
- number of people on board being sibling or spouse
- number of people on board being parent or child
- ticket (only the leading letters, or '123' if it consists of digits only)
- cabin (the first letter or “unknown”)
- embarked
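The derivations can be sketched as follows. This is a hedged sketch: the column names follow Kaggle’s original CSV, and the helper is my own, not the actual prepare-data.R (it also extracts the raw title token only – mapping rare titles like “Col.” to “Soldier” would be a further step):

```r
derive_features <- function(df) {
  # title: the token between the comma and the first period in the name column
  title <- sub(".*,\\s*([^.]+)\\..*", "\\1", df$name)
  # age category with the cut points listed above; missing ages stay NA
  ageCat <- cut(df$age, breaks = c(-Inf, 4, 10, 15, 59, Inf),
                labels = c("0-4", "5-10", "11-15", "16-59", ">59"))
  # ticket: the leading letters, or "123" when the ticket is digits only
  ticket <- ifelse(grepl("^[0-9]+$", df$ticket), "123",
                   sub("^([A-Za-z./]+).*", "\\1", df$ticket))
  # cabin: the first letter, or "unknown" when missing
  cabin <- ifelse(is.na(df$cabin) | df$cabin == "", "unknown",
                  substr(df$cabin, 1, 1))
  data.frame(title, ageCat, ticket, cabin)
}

derive_features(data.frame(
  name = "Braund, Mr. Owen Harris", age = 3,
  ticket = "12345", cabin = "", stringsAsFactors = FALSE
))
```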
In the end the formula to be modelled with SVMs and decision trees was in both cases as follows:
survived ~ class + title + parch + ticket + cabin + ageCat + embarked + sibsp + sex
Finding a good tree
To find a good tree I performed a 100-fold cross validated grid search on five settings of party::ctree():
- testtype in {Bonferroni, Univariate, Teststatistic} (no “MC”)
- teststat in {quad} (no “max”)
- mincriterion in {90%, 95%, 99%}
- minsplit in {60, 40, 20, 10}
- minbucket in {30, 20, 10, 5}
- maxdepth in {0, 4, 5, 6}
A quick initial check led me to disregard “max” because it didn’t seem to improve the performance, and “MonteCarlo” because it easily took ten times longer or more to come up with a tree while not showing interesting performance.
At the top of the results are multiple trees calculated using “Univariate” as the test type and a significance level of either 10% or 5%. The top trees of the grid search also showed the best performance on the test set evaluated by Kaggle – 79.4% of the cases classified correctly.
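The grid itself and a cross-validation loop can be sketched roughly like this. The original grid-search.r is not shown in the post, so the fold handling and scoring below are my assumptions; ctree() and ctree_control() come from the party package and are only referenced inside the (here uncalled) helper:

```r
# all combinations of the five settings listed above
grid <- expand.grid(
  testtype     = c("Bonferroni", "Univariate", "Teststatistic"),
  mincriterion = c(0.90, 0.95, 0.99),
  minsplit     = c(60, 40, 20, 10),
  minbucket    = c(30, 20, 10, 5),
  maxdepth     = c(0, 4, 5, 6),
  stringsAsFactors = FALSE
)

# k-fold cross-validated accuracy for one parameter combination
cv_accuracy <- function(params, data, formula, folds = 100) {
  idx <- sample(rep(seq_len(folds), length.out = nrow(data)))
  accs <- sapply(seq_len(folds), function(k) {
    fit <- ctree(formula, data = data[idx != k, ],
                 controls = ctree_control(
                   teststat     = "quad",
                   testtype     = params$testtype,
                   mincriterion = params$mincriterion,
                   minsplit     = params$minsplit,
                   minbucket    = params$minbucket,
                   maxdepth     = params$maxdepth))
    mean(predict(fit, newdata = data[idx == k, ]) == data[idx == k, ]$survived)
  })
  mean(accs)
}

# results <- sapply(seq_len(nrow(grid)),
#                   function(i) cv_accuracy(grid[i, ], data$train, formula))
```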
The winning tree
setwd("[...]/kaggle/titanic/tree")
source("../prepare-data.R")
library(party)

df_train <- read.table("../data/train.csv", sep=",", header=TRUE)
df_test  <- read.table("../data/test.csv",  sep=",", header=TRUE)
data <- read_data(df_train, df_test)

t <- ctree(
  survived ~ class + title + parch + ticket + cabin +
    ageCat + embarked + sibsp + sex,
  data$train,
  controls = ctree_control(
    teststat     = "quad",
    testtype     = "Univariate",
    mincriterion = .95,
    minsplit     = 10,
    minbucket    = 5,
    maxdepth     = 0
  )
)

plot(t)
Most of what the tree tells us is no news – women and children show a higher survival rate, and the lower the class, the worse the prospects. Beyond that, the tree seems to indicate that among women and children travelling third class, the survival rate is lower for those accompanied by more than two siblings or spouses. I would rather have assumed that the more people you are close to on board, the higher your chance of survival.
Let’s give SVMs a try
To choose a promising SVM setup I performed a 100-fold cross-validated grid search on the parameters of three kernels using kernlab::ksvm() and the same data set as above:
- rbfdot for C x sigma = {10^-2,…,10^5} x {10^-5,…,10^1}
- tanhdot for C x scale x offset = {10^-2,…,10^5} x {10^-5,…,10^1} x {-2.4,-1.8,…,2.4}
- polydot for C x degree x offset = {10^-2,…,10^2} x {1,…,5} x {0,1}
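In code, the three parameter grids look like this (a sketch – the actual grid-search.r isn’t shown in the post, and the fitting helper is my own; note that kernlab’s tanhdot kernel takes kpar = list(scale, offset), matching the winning configuration below):

```r
rbf_grid  <- expand.grid(C = 10^(-2:5), sigma = 10^(-5:1))
tanh_grid <- expand.grid(C = 10^(-2:5), scale = 10^(-5:1),
                         offset = seq(-2.4, 2.4, by = 0.6))
poly_grid <- expand.grid(C = 10^(-2:2), degree = 1:5, offset = c(0, 1))

# fitting one rbfdot configuration (ksvm is only referenced, not called, here)
fit_rbf <- function(formula, data, C, sigma) {
  kernlab::ksvm(formula, data = data, kernel = "rbfdot",
                C = C, kpar = list(sigma = sigma))
}

nrow(rbf_grid) + nrow(tanh_grid) + nrow(poly_grid)   # configurations to test
```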
And here you find the results of the grid search. One SVM got my out-of-sample accuracy as high as 81.3%, which is almost 2 percentage points above the best performance I got with trees. But – and that is a big but – in a real-world scenario one does not have the luxury of so easily assessing an estimate of the out-of-sample error (which I get by submitting my predictions for the test set to Kaggle). This means that in practice one would have chosen the kernel configuration which performed best in the grid search – and the big surprise is that the top SVMs of the grid search show – surprise, surprise – exactly the same out-of-sample accuracy as the best-performing tree: 79.4%.
setwd("[...]/kaggle/titanic/svm/")
source("./grid-search.r")
source("../prepare-data.R")
library(kernlab)

df_train <- read.table("../data/train.csv", sep=",", header=TRUE)
df_test  <- read.table("../data/test.csv",  sep=",", header=TRUE)
data <- read_data(df_train, df_test, shuffle=TRUE)

formula <- survived ~ class + title + parch + ticket + cabin +
  ageCat + embarked + sibsp + sex

m <- ksvm(
  formula,
  data   = data$train,
  kernel = "tanhdot",
  C      = 100,
  kpar   = list(scale = 0.01, offset = -1.2)
)
So in this case the match SVM vs. tree ends with a solid tie!
Of course I expected the SVM to leave the tree model behind, and it would make sense to consider the tie an indication that 79.4% is about the best to be gotten from my initially chosen data representation. But that is not necessarily the right conclusion at all, given that despite the identical measured out-of-sample performance, the top SVM and the top tree agree in only 91% of the cases! Most of the disagreements apparently belong to terminal node 11, which is pretty much a 50/50 classification for the tree, as can be seen on the tree plot. Given that about 44% – less than 50% – of the passengers falling into this node did not survive, the tree exclusively predicts non-survival. The SVM, on the other hand, predicts – for whatever mysterious reason – survival for almost all of those passengers!
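The agreement figure is straightforward to compute once both prediction vectors exist. An illustrative sketch with made-up 0/1 predictions – in the real comparison they would come from predict() on the fitted ctree and ksvm models:

```r
# made-up predictions standing in for the two models' outputs on the test set
tree_pred <- c(0, 0, 1, 1, 0, 1, 0, 0, 1, 0)
svm_pred  <- c(0, 1, 1, 1, 0, 1, 0, 1, 1, 0)

mean(tree_pred == svm_pred)              # share of agreement (0.8 here, ~0.91 in the post)
table(tree = tree_pred, svm = svm_pred)  # cross-table showing where they disagree
```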
The post Titanic challenge on Kaggle with decision trees (party) and SVMs (kernlab) appeared first on joy of data.