Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We will start, in our Data Science course, to discuss classification techniques (in the context of supervised models). Consider the following case, with 10 points, and two classes (red and blue)
> clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1)) > clr2 <- c(rgb(1,0,0,.2),rgb(0,0,1,.2)) > x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) > y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) > z <- c(1,1,1,1,1,0,0,1,0,0) > df <- data.frame(x,y,z) > plot(x,y,pch=19,cex=2,col=clr1[z+1])
To get a prediction, i.e. a partition of the space in two parts, consider some logistic regression
> reg=glm(z~x+y,data=df,family=binomial) > summary(reg) Call: glm(formula = z ~ x + y, family = binomial, data = df) Deviance Residuals: Min 1Q Median 3Q Max -1.6593 -0.4400 0.2564 0.5830 1.5374 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.706 1.999 -0.854 0.393 x -5.489 5.360 -1.024 0.306 y 8.568 5.515 1.554 0.120 (Dispersion parameter for binomial family taken to be 1) Null deviance: 13.4602 on 9 degrees of freedom Residual deviance: 8.1445 on 7 degrees of freedom AIC: 14.144 Number of Fisher Scoring iterations: 5
Given some point, the predicted class is obtained using
> pred_1 <- function(x,y){ + predict(reg,newdata=data.frame(x=x, + y=y),type="response")>.5 + }
(here, the predicted class is simply the one that is the most likely). To visualize it use
> x_grid<-seq(0,1,length=101) > y_grid<-seq(0,1,length=101) > z_grid <- outer(x_grid,y_grid,pred_1) > image(x_grid,y_grid,z_grid,col=clr2) > points(x,y,pch=19,cex=2,col=clr1[z+1])
Since the logistic regression is a (generalized) linear model, the line that separate the two regions is a straight line.
What if we had three classes, and not two (red, yellow and blue)
> clr1=c(rgb(1,0,0,1),rgb(1,1,0,1),rgb(0,0,1,1)) > clr2=c(rgb(1,0,0,.2),rgb(1,1,0,.2), + rgb(0,0,1,.2)) > x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) > y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) > z=c(1,2,2,2,1,0,0,1,0,0) > df=data.frame(x,y,z) > plot(x,y,pch=19,cex=2,col=clr1[z+1])
We should use here a multinomial model. Recall that for the logistic regression model, if
i.e.
while
To derive a multivariate extension, write
and
Here, maximum likelihood techniques can be used, since
where variable
> library(nnet) > reg=multinom(z~x+y,data=df) # weights: 12 (6 variable) initial value 10.986123 iter 10 value 0.794930 iter 20 value 0.065712 iter 30 value 0.064409 iter 40 value 0.061612 iter 50 value 0.058756 iter 60 value 0.056225 iter 70 value 0.055332 iter 80 value 0.052887 iter 90 value 0.050644 iter 100 value 0.048117 final value 0.048117 stopped after 100 iterations
and the output is
> summary(reg) Call: multinom(formula = z ~ x + y, data = df) Coefficients: (Intercept) x y 1 14.44215 -126.90076 47.68488 2 -114.49676 65.23419 100.31341 Std. Errors: (Intercept) x y 1 23.03561 144.8639 52.99049 2 50.54085 267.1013 98.60386 Residual Deviance: 0.09623369 AIC: 12.09623
In the binomial case, there was only one interesting case (once we have the probability to be equal to 1, we have the probability to be null). Here, there are two. So we have two lines, for the three parameters. The prediction function, for the class, is here
> pred_2 <- function(x,y){ + which.max(predict(reg, + newdata=data.frame(x=x,y=y),type="probs")) + }
Now, to visualize predictions, use
> x_grid<-seq(0,1,length=101) > y_grid<-seq(0,1,length=101) > z_grid <- outer(x_grid,y_grid,pred_2) Error in outer(x_grid, y_grid, pred_2) : dims [produit 10201] do not match the length of object [1]
Unfortunately, it is not working. Neither is
> z_grid <- outer(x_grid,y_grid,function(x,y) pred_2(x,y)) Error in outer(x_grid, y_grid, function(x, y) pred_2(x, y)) : dims [produit 10201] do not match the length of object [1]
If any one has a suggestion, I’d be glad to know. Just because we really want to get our graph, use
> Outer <- function(x,y,fun) { + mat <- matrix(NA, length(x), length(y)) + for (i in seq_along(x)) { + for (j in seq_along(y)) mat[i,j]=fun(x[i],y[j]) + } + return(mat)} > z_grid <- Outer(x_grid,y_grid,pred_2)
(which is slow, but at least, it’s running). The prediction is
> image(x_grid,y_grid,z_grid,col=clr2) > points(x,y,pch=19,cex=2,col=clr1[z+1])
We now have three regions, the frontier being linear, and the intersection being the equiprobable case.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.