Now we move on to the second part of Exercise 5.2, which requires implementing regularized logistic regression using Newton's Method.
Plot the data:
x <- read.csv("ex5Logx.dat", header=FALSE)
y <- read.csv("ex5Logy.dat", header=FALSE)
y <- y[,1]
d <- data.frame(x1=x[,1], x2=x[,2], y=factor(y))

require(ggplot2)
p <- ggplot(d, aes(x=x1, y=x2)) +
  geom_point(aes(colour=y, shape=y))
p   # display the scatter plot
We will now fit a regularized logistic regression model to this data.
The hypothesis function in logistic regression is:
\(h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }=P(y=1\vert x;\theta)\)
In this exercise, we take \(x\) in \(\theta^T x\) to be all monomials of \(u\) and \(v\) up to the sixth power:
\( x=\left[\begin{array}{c} 1\\ u\\ v\\ u^2\\ uv\\ v^2\\ u^3\\ \vdots\\ uv^5\\ v^6\end{array}\right] \)
where \(x_0 = 1,\ x_1 = u,\ x_2 = v,\ \ldots,\ x_{27} = v^6\), for 28 features in total.
I defined the function mapFeature, which maps the original inputs to the feature vector.
mapFeature <- function(u, v, degree=6) {
  ## all monomials u^(i-j) * v^j for 0 <= j <= i <= degree
  out <- sapply(0:degree, function(i)
    sapply(0:i, function(j) u^(i-j) * v^j)
  )
  out <- unlist(out)
  return(out)
}
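As a quick check of the mapping (a small example I am adding here): there are \((6+1)(6+2)/2 = 28\) monomials of degree at most six in two variables, so the returned vector should have 28 elements, starting with \(1\), \(u\) and \(v\).

## sanity check of mapFeature on an arbitrary point
f <- mapFeature(0.5, -0.25)
length(f)    # 28
head(f, 3)   # 1.00  0.50 -0.25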
Regularized Logistic Regression:
The cost function of regularized logistic regression is defined as:
\( J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2} \)
Notice that this function works for both regularized (lambda > 0) and unregularized (lambda = 0) logistic regression. The regularization term at the end shrinks \(\theta\) towards smaller values, which gives a smoother, more general fit that is more likely to perform well on new data (when making predictions).
Newton's Method:
The update rule for Newton's Method is:
\(\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J\)
In the regularized version of logistic regression, the gradient \(\nabla_{\theta}J\) and the Hessian \(H\) have different forms:
\(\nabla_{\theta}J = \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)} + \frac{\lambda}{m} \begin{bmatrix} 0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}\)
\(H = \frac{1}{m} \sum_{i=1}^m \left[ h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) x^{(i)} (x^{(i)})^T \right] + \frac{\lambda}{m} \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}\)
The zero in the top-left corner of the regularization matrix means the intercept \(\theta_0\) is not penalized. Also notice that when lambda = 0, these reduce to the formulas for unregularized logistic regression.
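For reference, and to match the R code below, these can be written in vectorized form (my restatement, with \(X\) the design matrix whose rows are the \(x^{(i)}\), \(\tilde\theta\) equal to \(\theta\) with its first entry zeroed, \(\tilde I\) the identity matrix with its first diagonal entry zeroed, and the product inside diag taken elementwise):

\( \nabla_{\theta}J = \frac{1}{m} X^T\left(h_\theta(X) - y\right) + \frac{\lambda}{m}\tilde\theta, \qquad H = \frac{1}{m} X^T \mathrm{diag}\!\left(h_\theta(X)\,(1-h_\theta(X))\right) X + \frac{\lambda}{m}\tilde I \)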
Here is my implementation:
## sigmoid function
g <- function(z) {
  1 / (1 + exp(-z))
}

## hypothesis function
h <- function(theta, x) {
  g(x %*% theta)
}

## cost function
J <- function(theta, x, y, lambda=1) {
  m <- length(y)
  j <- -1/m * ( y %*% log(h(theta, x)) + (1-y) %*% log(1 - h(theta, x)) )
  r <- theta^2
  r[1] <- 0                      # do not penalize the intercept
  j <- j + lambda/(2*m) * sum(r)
  return(j)
}

## gradient
grad <- function(theta, x, y, lambda=1) {
  m <- length(y)
  r <- lambda/m * theta
  r[1] <- 0                      # do not penalize the intercept
  g <- 1/m * t(x) %*% (h(theta, x) - y) + r
  return(g)
}

## Hessian
Hessian <- function(theta, x, lambda=1) {
  m <- nrow(x)
  n <- ncol(x)
  r <- lambda/m * diag(n)
  r[1, 1] <- 0                   # do not penalize the intercept
  ## weight each example by h * (1 - h) before forming X' W X
  w <- as.vector(h(theta, x) * (1 - h(theta, x)))
  H <- 1/m * t(x) %*% (w * x) + r
  return(H)
}
First, I calculate theta for lambda = 1.
require(plyr)
colnames(x) <- c("u", "v")
x <- mdply(x, mapFeature)       # map each (u, v) pair to the 28 features
x <- x[, c(-1, -2)]             # drop the original u and v columns
x <- as.matrix(x)

theta <- matrix(rep(0, ncol(x)), ncol=1)
lambda <- 1
j <- rep(0, 10)
for (i in 1:10) {
  theta <- theta - solve(Hessian(theta, x, lambda)) %*% grad(theta, x, y, lambda)
  j[i] <- J(theta, x, y, lambda)
}
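A quick sanity check on the implementation (a sketch using the feature matrix x and labels y built above): with \(\theta = 0\) the hypothesis is 0.5 for every example, so the cost should be \(-\log(0.5) = \log 2 \approx 0.693\), and the penalty term contributes nothing.

## cost at theta = 0 should be about 0.693 for any lambda
theta.zero <- matrix(0, nrow=ncol(x), ncol=1)
J(theta.zero, x, y, lambda=1)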
To check that the algorithm is converging properly, we plot the values of the cost function against the number of iterations.
ggplot() +
  aes(x=1:10, y=j) +
  geom_point(colour="red") +
  geom_line() +
  xlab("Iteration") +
  ylab("Cost J")
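The same thing can be checked numerically (a small sketch using the vector j computed above): the change in cost between successive Newton steps should shrink towards zero.

## successive changes in the cost; near zero once converged
diff(j)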
Now we repeat the iteration for lambda = 0 and lambda = 10 so that we can compare the fitted models.
theta0 <- matrix(rep(0, ncol(x)), ncol=1)
for (i in 1:10) {
  theta0 <- theta0 - solve(Hessian(theta0, x, lambda=0)) %*% grad(theta0, x, y, lambda=0)
}

theta10 <- matrix(rep(0, ncol(x)), ncol=1)
for (i in 1:10) {
  theta10 <- theta10 - solve(Hessian(theta10, x, lambda=10)) %*% grad(theta10, x, y, lambda=10)
}
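To see the shrinkage effect of the penalty described earlier (a sketch comparing the three fits; the intercept is left out since it is not penalized), the magnitude of the coefficients should generally decrease as lambda increases.

## L2 norm of the coefficients (excluding the intercept) for lambda = 0, 1, 10
sapply(list(lambda0=theta0, lambda1=theta, lambda10=theta10),
       function(t) sqrt(sum(t[-1]^2)))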
Finally, we calculate the decision boundary and visualize it.
require(reshape2)
u <- seq(-1, 1.2, len=200)
v <- seq(-1, 1.2, len=200)
z0 <- matrix(0, length(u), length(v))
z1 <- matrix(0, length(u), length(v))
z10 <- matrix(0, length(u), length(v))

## evaluate theta' * x over a grid of (u, v) values for each lambda
for (i in 1:length(u)) {
  for (j in 1:length(v)) {
    features <- mapFeature(u[i], v[j])
    z0[i, j] <- features %*% theta0
    z1[i, j] <- features %*% theta
    z10[i, j] <- features %*% theta10
  }
}
rownames(z0) <- rownames(z1) <- rownames(z10) <- as.character(u)
colnames(z0) <- colnames(z1) <- colnames(z10) <- as.character(v)

z0.melted <- data.frame(melt(z0), lambda=0)
z1.melted <- data.frame(melt(z1), lambda=1)
z10.melted <- data.frame(melt(z10), lambda=10)
zz <- rbind(z0.melted, z1.melted, z10.melted)
zz$lambda <- factor(zz$lambda)
colnames(zz) <- c("u", "v", "z", "lambda")

## the decision boundary is where theta' * x = 0
p + geom_contour(data=zz, aes(x=u, y=v, z=z, group=lambda, colour=lambda), bins=1)
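Since the decision boundary is exactly the set of points where \(\theta^T x = 0\), an alternative to bins = 1 (a sketch, assuming the breaks argument of geom_contour) is to request the single contour at level zero:

## draw the contour exactly at z = 0, i.e. where the predicted probability is 0.5
p + geom_contour(data=zz, aes(x=u, y=v, z=z, group=lambda, colour=lambda), breaks=0)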
The red line (lambda = 0) fits the crosses most tightly. As lambda increases, the fit becomes looser and more general.
PS: oddly, the legend in the figure above is not shown properly.