Before my course on « big data and economics » at the University of Barcelona in July, I wanted to post a series of articles on classification techniques, to give some insight into machine learning tools.
Machine learning algorithms are commonly said to be black boxes, and I wanted to revisit that claim. First of all, isn’t the same true of regression models, like generalized additive models (with splines)? Do you really know what the algorithm is doing? Even for the logistic regression: textbooks give the mathematical formulas, but what is actually done when I run it in R?
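To give the flavor of what I mean, here is a minimal sketch (my own illustrative code, not the series’ code nor the glm source) of the Newton-Raphson idea, equivalent to iteratively reweighted least squares, that sits behind glm for the logistic regression. The function name, the simulated data and all variable names are made up for the example.

# a minimal sketch of Newton-Raphson for the logistic regression (illustrative only)
logit_newton = function(X, y, n_iter = 25){
  beta = rep(0, ncol(X))
  for(i in 1:n_iter){
    p = as.vector(1/(1 + exp(-X %*% beta)))                   # current fitted probabilities
    W = diag(p * (1 - p))                                     # weights from the variance function
    beta = beta + solve(t(X) %*% W %*% X, t(X) %*% (y - p))   # one Newton step
  }
  as.vector(beta)
}
# compare with glm() on simulated data
set.seed(1)
X_sim = cbind(1, matrix(rnorm(200), 100, 2))
y_sim = rbinom(100, 1, 1/(1 + exp(-X_sim %*% c(-1, 2, 1))))
cbind(newton = logit_newton(X_sim, y_sim),
      glm    = coef(glm(y_sim ~ X_sim[,-1], family = binomial)))

After a handful of iterations the two columns coincide, which is exactly the kind of check the posts in this series will rely on.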
When I started working in academia, someone told me something like « if you really want to understand a theory, teach it ». That has been my motto for more than 15 years. I now want to add a second part to that statement: « if you really want to understand an algorithm, recode it ». So let’s try this… My ambition is to recode (more or less) most of the standard algorithms used in predictive modeling, from scratch, in R. Over the next two weeks, I plan to cover
- the logistic regression
- the logistic regression with splines
- the logistic regression with kernels (and knn)
- the penalized logistic regression, ridge
- the penalized logistic regression, lasso
- the heuristics of neural nets
- an introduction to SVM
- classification trees
- bagging and random forests
- gradient boosting (and adaboost)
I will use two datasets for illustration. The first one is inspired by the cover of « Foundations of Machine Learning » by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. With this dataset, at least, it will be possible to plot the predictions (since there are only two, continuous, features):
x = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x, x2=y, y=as.factor(z))
plot(x, y, pch=c(1,19)[1+z])
Here is some code to visualize the prediction (here, the probability of being a black point):
rmatrix_model = function(model){
  # predicted probabilities on a 101 x 101 grid over the unit square
  u = seq(0, 1, length=101)
  p = function(x, y) predict(model, newdata=data.frame(x1=x, x2=y), type="response")
  v = outer(u, u, p)
  return(v)
}
nice_graph = function(v){
  u = seq(0, 1, length=101)
  # two-colour map, split at probability 1/2, with the observations on top
  image(u, u, v, xlab="Variable 1", ylab="Variable 2",
        col=clr10[c(1,10)], breaks=c(0,5,10)/10)
  points(x, y, pch=19, cex=1.5, col="white")
  points(x, y, pch=c(1,19)[1+z], cex=1.5)
  contour(u, u, v, levels=.5, add=TRUE)
}
reg = glm(y ~ x1 + x2, data=df, family=binomial)
nice_graph(rmatrix_model(reg))
Note that colors are defined here as
clr10= c("#ffffff","#f7fcfd","#e5f5f9","#ccece6","#99d8c9","#66c2a4","#41ae76","#238b45","#006d2c","#00441b")
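(These hex codes appear to be the nine-class BuGn palette from ColorBrewer with pure white prepended, so an equivalent definition, assuming the RColorBrewer package is installed, would be)

# apparently equivalent definition, assuming RColorBrewer is available
library(RColorBrewer)
clr10 = c("#ffffff", brewer.pal(9, "BuGn"))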
The same functions can be used to visualize the predictions of some nonlinear model.
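For instance, with a quadratic polynomial in each feature (my own illustrative choice of nonlinear model, not necessarily the one behind the original figure):

# an illustrative nonlinear model: logistic regression on quadratic polynomials
# (with only 10 observations, glm may warn about fitted probabilities of 0 or 1)
reg_nl = glm(y ~ poly(x1, 2) + poly(x2, 2), data = df, family = binomial)
nice_graph(rmatrix_model(reg_nl))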
The second one is a dataset I got from Gilbert Saporta, about heart attacks, where the binary variable of interest is whether the patient survived.
myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
                      header=TRUE, sep=";")
myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1     # 1 = survival, 0 = death
y = myocarde$PRONO
X = as.matrix(cbind(1, myocarde[,1:7]))           # design matrix with an intercept
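As a quick sanity check (simply calling the standard function here, which later posts will recode), the benchmark logistic regression on this dataset would be

# benchmark fit with the built-in glm(), which the from-scratch code should reproduce
reg_myocarde = glm(PRONO ~ ., data = myocarde, family = binomial)
coef(reg_myocarde)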
So far, I do not plan to say (too much) about the choice of tuning parameters (and cross-validation), about comparing models, etc. The goal here is simply to understand what is going on when we call glm, glmnet, gam, random forest, svm, xgboost, or any other function, to get a predictive model.