Gradient descent in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It has been well over a year since my last entry, I have been rather quiet because someone has been rather loud Just last week I found some time to rewrite a draft on gradient descent from about two years ago, so here we are – back in business!
Gradient descent is a fundamental concept in machine learning, and therefore one practitioners should be familiar with. In this post we will explore gradient descent from both theoretical and practical perspectives. First, we will establish some general definitions, review cost functions in the context of regression and binary classification, and introduce the chain rule of calculus. Then, we will put it all into practice to build a linear and a logistic regression models from the ground up. This is a short, introductory guide where a basic knowledge of statistics and calculus should be most helpful. Ready, set, go!
Introduction
Most predictive algorithms learn by estimating a set of parameters – also known as weights, or coefficients – that minimise the cost function, a measure of prediction error. In gradient-based learning, that search process invariably employs gradient descent or one of its numerous extensions. Gradients are partial derivatives of the cost function with respect to each model parameter, . On a high level, gradient descent is an iterative procedure that computes predictions and updates parameter estimates by subtracting their corresponding gradients weighted by a learning rate . This procedure will find estimates that incur ever smaller prediction errors, provided i) the learning rate is adequate, ii) the maximum number of iterations is sufficient and iii) the underlying dataset is suitable for modelling.
How gradient descent works will become clearer once we establish a general problem definition, review cost functions and derive gradient expressions using the chain rule of calculus, for both linear and logistic regression.
Problem definition
We start by establishing a general, formal definition. Suppose we have a matrix carrying p predictors observed across n observations, and the vector of parameters comprising the model intercept and the slopes from each individual predictor. Our aim is to predict some target y using a combination of and :
- In linear regression, we would like to predict the continuous target variable . The model prediction from any observation i takes the form
- In logistic regression, we would like to predict the binary target variable . The model prediction from any observation i takes the form , where is the logit of y, and is the logistic function that transforms logit to target probability
Note that in linear and logistic regression, in one shot you can directly access all n predictions and logit values, respectively, via the matrix product after appending a column of ones to the left of .
Cost function
In both linear and logistic regression, the aim is to find the parameter estimates that minimise the prediction error. This error is generally measured using a cost function:
- In linear regression, one can use the mean squared error (MSE), . It measures the average squared difference between the observed and predicted target values (lower is better)
- In logistic regression, one can use the binary cross-entropy, . It measures the amount of uncertainty shared between the observed and estimated probability distributions (lower is better)
Putting it together, we have a recipe to predict a target using and , and measure the error between resulting prediction and the target. What are we missing then?
Gradient descent
As we saw earlier, gradient descent is mainly driven by the partial derivatives of the cost function with respect to each individual parameter, . So, how do we calculate these?
To this end, we can use of the chain rule of calculus, which posits that for any number of functions, to simplify the derivative of a cost function with respect to any parameter. Assuming , which turns out to be and the logit for linear and logistic regressions respectively, we find that:
- For linear regression using a MSE cost function , we can decompose the derivative of the cost function into two parts,
- For logistic regression using a binary cross-entropy cost function , we can decompose the derivative of the cost function into three parts, , or equivalently
In both cases the application of gradient descent will iteratively update the parameter vector using the aforementioned equation . Also, it is worth noting that none of the predictors contributes to the intercept partial derivative, i.e. .
Now, I encourage you to apply the chain rule and try deriving these gradient expressions by hand, using the provided definitions of , and accordingly
Let’s get started with R
Let us now apply these learnings to build two models using R:
- Linear regression, to predict the age of orange trees from their circumference
- Logistic regression, to predict engine shape from miles-per-gallon (mpg) and horsepower
In both cases we will implement batch gradient descent, where all training observations are used in each iteration. Mini-batch and stochastic gradient descent are popular alternatives that use instead a random subset or a single training observation, respectively, making them computationally more efficient when handling large sample sizes. Make sure that you have RColorBrewer
installed.
Linear regression
Our first model, based on the Orange
dataset, will have the following structure:
In the code below we will configure gradient descent such that in each of 25 iterations, a prediction is made and the two parameters and are updated using the gradient expressions presented earlier, using the learning rate . We will also designate the two parameters as b0
and b1
, then plot and store their estimates under the lists intercepts
and slopes
, respectively. Finally, we will create a MSE cost function contour to visualise the trajectory of the same parameter estimates in the course of gradient descent.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Sun Nov 14 19:58:50 2021 —————————— | |
# Regression Oranges | |
library(RColorBrewer) | |
data(“Orange“) | |
# Determine number of iterations | |
niter <- 25 | |
# Determine learning rate / step size | |
alpha <- 1e-4 | |
set.seed(1) | |
b0 <- rnorm(1) # intercept | |
b1 <- rnorm(1) # slope | |
# Set palette | |
cols <- colorRampPalette(rev(brewer.pal(n = 7, name = “RdBu“)))(niter) | |
# Plot | |
layout(matrix(c(1,1,2,2,3), nrow = 1, ncol = 5)) | |
plot(age ~ circumference, data = Orange, pch = 16, | |
xlab = “Circumference (mm)“, | |
ylab = “Age (days)“, | |
col = rgb(0,0,1,.5)) | |
# Perform gradient descent | |
slopes <- rep(NA, niter) | |
intercepts <- rep(NA, niter) | |
for(i in 1:niter){ | |
# prediction | |
y <- b0 + b1 * Orange$circumference | |
# b0 = b0 – dJ/da * alpha | |
b0 <- b0 – sum(y – Orange$age) / nrow(Orange) * alpha | |
# b1 = b1 – dJ/db * alpha | |
b1 <- b1 – sum((y – Orange$age) * Orange$circumference) / nrow(Orange) * alpha | |
abline(a = b0, b = b1, col = cols[i], lty = 2) | |
# Save estimates over all iterations | |
intercepts[i] <- b0 | |
slopes[i] <- b1 | |
} | |
title(“Regression Fit“) | |
# Cost function contour | |
allCombs <- expand.grid(b0 = seq(–50, 50, length.out = 100), | |
b1 = seq(7.5, 8.5, length.out = 100)) | |
res <- matrix(NA, 100, 100) | |
# a by rows, b by cols | |
for(i in 1:nrow(allCombs)){ | |
y <- allCombs$b0[i] + allCombs$b1[i] * Orange$circumference | |
res[i] <- sum((y – Orange$age)**2)/(2*nrow(Orange)) | |
} | |
# Plot contour | |
contour(t(res), xlab = expression(beta[1]), | |
ylab = expression(beta[0]), axes = F, nlevels = 25) | |
axis(1, at = c(0, .5, 1), labels = c(7.5, 8, 8.5)) | |
axis(2, at = c(0, .5, 1), labels = c(–50, 0, 50)) | |
points((slopes – min(allCombs$b1)) / diff(range(allCombs$b1)), | |
(intercepts – min(allCombs$b0)) / diff(range(allCombs$b0)), | |
pch = 4, col = cols) | |
title(“MSE contour“) | |
# Add colorbar | |
z = matrix(1:niter, nrow = 1) | |
image(1, 1:niter, z, | |
col = cols, axes = FALSE, xlab = ““, ylab = ““) | |
title(“Iteration No.“) | |
axis(2, at = c(1, seq(5, niter, by=5))) |
By colouring the iterations from blue (early) to red (late) using a diverging palette, we can see how gradient descent arrived at parameter estimates that minimise the prediction error! In addition, the contour suggests that an optimal regression fit can be obtained using a small-intercept, large-coefficient combination or vice versa. Finally, note that the variation in the values of in the right-hand panel is concealed by the scale used in the MSE contour plot, intentionally set for ease of contour visualisation.
Logistic regression
Our second model, based on the mtcars
dataset, will have the following structure:
which is reminiscent of the generalised linear model employed in the past introduction to Bayesian models. In the code below, we will configure gradient descent such that in each of 100 iterations, a prediction is made and the three parameters , and are updated using the gradient expressions presented earlier, using the learning rate . We will also designate the three parameters as b0
, b1
and b2
and store their estimates under the lists intercepts
, slopes1
and slopes2
, respectively. Lastly, we will calculate and visualise the decision boundary from each iteration and examine how the parameter estimates change in the process.
CALCULATING THE DECISION BOUNDARY
The decision boundary can be determined from setting the logit expression to zero – equivalent to a target probability of 50% – then solving for either mpg or horsepower:
Solving for mpg here, we only need to pick at least two arbitrary values of horsepower, find their corresponding mpg values and connect the points with a line, in each iteration
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Sun Nov 14 20:13:12 2021 —————————— | |
# Logistic regression mtcars | |
library(RColorBrewer) | |
data(“mtcars“) | |
# Determine number of iterations | |
niter <- 100 | |
# Determine learning rate / step size | |
alpha <- 1e-5 | |
set.seed(1) | |
b0 <- rnorm(1) # intercept | |
b1 <- rnorm(1) # slope1 | |
b2 <- rnorm(1) # slope2 | |
# Set palette | |
cols <- colorRampPalette(rev(brewer.pal(n = 7, name = “RdBu“)))(niter) | |
# Plot | |
layout(matrix(c(1,1,2,2,3), nrow = 1, ncol = 5)) | |
plot(mpg ~ hp, data = mtcars, pch = 16, | |
xlab = “Horsepower“, ylab = “Miles per gallon (mpg)“, | |
col = ifelse(mtcars$vs, rgb(0,0,1,.5), rgb(1,0,0,.5))) | |
legend(“topright“, legend = c(“Straight“, “V-shaped“), | |
col = c(rgb(0,0,1,.5), rgb(1,0,0,.5)), pch = 16, | |
bty = “n“) | |
# Perform gradient descent | |
slopes1 <- rep(NA, niter) | |
slopes2 <- rep(NA, niter) | |
intercepts <- rep(NA, niter) | |
for(i in 1:niter){ | |
# prediction | |
logit <- b0 + b1 * mtcars$mpg + b2 * mtcars$hp | |
y <- 1 / (1 + exp(–logit)) | |
# b0 = b0 – dJ/da * alpha | |
b0 <- b0 – sum(y – mtcars$vs) * alpha | |
# b1 = b1 – dJ/db * alpha | |
b1 <- b1 – sum((y – mtcars$vs) * mtcars$mpg) * alpha | |
# b2 = b2 – dJ/db * alpha | |
b2 <- b2 – sum((y – mtcars$vs) * mtcars$hp) * alpha | |
# Solve for b0 + b1 * mtcars$mpg + b2 * mtcars$hp = 0 | |
# mtcars$mpg = -(b2 * mtcars$hp + b0) / b1 | |
db_points <- –(b2 * c(40, 400) + b0) / b1 | |
# Plot decision boundary | |
lines(c(40, 400), db_points, col = cols[i], lty = 2) | |
# Save estimates over all iterations | |
intercepts[i] <- b0 | |
slopes1[i] <- b1 | |
slopes2[i] <- b2 | |
} | |
title(“Decision Boundary“) | |
plot(x=1:niter, intercepts, type = “l“, lty=1, | |
ylim = c(–1, 1), xlab = “Iteration No.“, ylab = “Coefficient value“) | |
abline(h = 0, lty = 2, col=“grey“) | |
lines(x=1:niter, slopes1, lty=3) | |
lines(x=1:niter, slopes2, lty=4) | |
legend(“topright“, | |
legend = c(expression(beta[0]), | |
expression(beta[1]), | |
expression(beta[2])), | |
lty = c(1, 3, 4), | |
bty = “n“) | |
title(“Coefficient Estimation“) | |
# Add colorbar | |
z = matrix(1:niter, nrow = 1) | |
image(1, 1:niter, z, | |
col = cols, axes = FALSE, xlab = ““, ylab = ““) | |
title(“Iteration No.“) | |
axis(2, at = c(1, seq(20, niter, by=20))) |
Applying the same colouring scheme, we can see how gradient descent arrived at parameter estimates that minimise the prediction error! Note how the decision boundary neatly separates most members of the two engine shape classes. In addition, the right-hand panel shows how the slope parameters and change steadily, at different rates until they plateau, while the intercept is readily defined early on.
Conclusion
Well done, we have just learned and applied gradient descent! Time to recap all the steps taken:
- A general problem definition was set to assign predictors, target, parameters and learning rate to well defined mathematical entities
- The MSE and the binary cross-entropy cost functions were introduced
- The chain rule of calculus was presented and applied to arrive at the gradient expressions based on linear and logistic regression with MSE and binary cross-entropy cost functions, respectively
- For demonstration, two basic modelling problems were solved in R using custom-built linear and logistic regression, each based on the corresponding gradient expressions
Finally, here are some ideas of where to go from here:
- Customisation. One major advantage of knowing the nuts and bolts of gradient descent is the high degree of customisation it allows. You could, for example, use the above recipe to add parameter regularisation to a cost function and derive the updated gradient expressions
- Interpretation. Although the focus of our two exercises was computational, I urge you to make sense of the resulting parameter estimates. It seems obvious that older trees have wider circumferences, not so much that V-shaped engines provide less mpg. Can we create models that explore causality? How much does a unit change in each predictor impact the target prediction? To me, these sort of questions serve as an enjoyable sanity check
- Experimentation. Use the code above as a template to experiment with different learning rates and number of iterations. Recall that the learning rate determines the step size of each parameter update, while the number of iterations determines how long learning takes place. Do your experiments conform to expected results?
- Benchmarking. It could be interesting to compare our estimation results with those obtained from using
lm()
andglm()
or similar. Do parameter estimates agree between? Which algorithm is faster? How do they differ?
I hope this tutorial has rekindled your interest in machine learning, inspired novel applications and demystified the underlying mathematics. As usual, please enter a comment below – a question, a suggestion, praise or criticism. Until next time!
Citation
de Abreu e Lima, F (2023). poissonisfish: Gradient descent in R. From https://poissonisfish.com/2023/04/11/gradient-descent/
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.