Using Norms to Understand Linear Regression


Introduction

In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, $(x_1, x_2, \ldots, x_n)$, using a single number, $s$. In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, $x_i$, and the summary, $s$, using a formula of the form,

$$\sum_i |x_i - s|^p,$$

where $p$ was either 0, 1 or 2.
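
To make this concrete, here is a minimal R sketch (the toy data and the `discrepancy()` helper are my own, not from the original post) that minimizes this quantity over a grid of candidate summaries and recovers the mode, median and mean as $p$ moves from 0 to 2:

```r
# Toy data: the mode and median are 2, the mean is 3
x <- c(1, 2, 2, 2, 3, 4, 7)

# Total discrepancy between the data and a candidate summary s,
# measured by the p-th power of the L_p norm of the discrepancies
discrepancy <- function(s, x, p) {
  if (p == 0) {
    sum(x != s)        # L0: how many elements the summary fails to match
  } else {
    sum(abs(x - s)^p)  # L1 / L2: summed absolute or squared error
  }
}

# Grid search over candidate summaries for each choice of p
candidates <- seq(min(x), max(x), by = 0.01)
best_summary <- function(p) {
  candidates[which.min(sapply(candidates, discrepancy, x = x, p = p))]
}

best_summary(0)  # 2, the mode
best_summary(1)  # 2, the median
best_summary(2)  # 3, the mean
```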

The $L_p$ Norms

In this post, I'd like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in $\mathbb{R}^n$. Specifically, we were minimizing a measure of discrepancies that was almost identical to the $L_p$ family of norms that can be used to measure the size of vectors. Understanding $L_p$ norms makes it much easier to describe several modern generalizations of classical linear regression.

To extend our previous approach to the more standard notion of an $L_p$ norm, we simply take the sum we used before and rescale things by taking a $p$-th root. This gives the formula for the $L_p$ norm of any vector, $v = (v_1, v_2, \ldots, v_n)$, as,

$$|v|_p = \left( \sum_i |v_i|^p \right)^{\frac{1}{p}}.$$

When $p = 2$, this formula reduces to the familiar formula for the length of a vector:

$$|v|_2 = \sqrt{\sum_i v_i^2}.$$

In the last post, the vector we cared about was the vector of elementwise discrepancies, $v = (x_1 - s, x_2 - s, \ldots, x_n - s)$. We wanted to minimize the overall size of this vector in order to make $s$ a good summary of $x_1, \ldots, x_n$. Because we were interested only in minimizing the size of this vector, it didn't matter that we skipped taking the $p$-th root at the end: one vector, $v_1$, has a smaller norm than another vector, $v_2$, exactly when the $p$-th power of its norm is smaller than the $p$-th power of the other. What was essential wasn't the scale of the norm, but rather the value of $p$ that we chose. Here we'll follow that approach again. Specifically, we'll again be working consistently with the $p$-th power of an $L_p$ norm:

$$|v|_p^p = \sum_i |v_i|^p.$$
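
Both quantities are one-liners in R; this small sketch (the helper names are mine) just restates the formulas above:

```r
# L_p norm of a vector v (for p > 0)
lp_norm <- function(v, p) sum(abs(v)^p)^(1 / p)

# Its p-th power, which is the quantity we'll actually minimize
lp_norm_pth_power <- function(v, p) sum(abs(v)^p)

v <- c(3, -4)
lp_norm(v, 2)            # 5, the familiar Euclidean length
lp_norm_pth_power(v, 2)  # 25
lp_norm_pth_power(v, 1)  # 7
```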

The Regression Problem

Using $L_p$ norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we're instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.

Instead of a single list, we'll now work with two vectors: $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$. Because we like simple models, we'll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:

$$y_i \approx \beta_0 + \beta_1 x_i.$$

In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for $\beta_0$ and $\beta_1$, we have to compute a vector of discrepancies: $v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))$. The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression. In particular, we'll work with three norms: the $L_0$, $L_1$ and $L_2$ norms.

As we did in the single vector case, here we'll define the discrepancies as,

$$d_i = |y_i - (\beta_0 + \beta_1 x_i)|^p,$$

and the total error as,

$$E_p = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p,$$

which is just the $p$-th power of the $L_p$ norm of the vector of discrepancies.
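
This objective is easy to write down in R. The sketch below uses simulated data and a `total_error()` helper of my own; neither comes from the original post:

```r
# Total error E_p for a candidate line (beta0, beta1) on data (x, y)
total_error <- function(beta, x, y, p) {
  sum(abs(y - (beta[1] + beta[2] * x))^p)
}

# Simulated data with a known linear relationship plus noise
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)

total_error(c(2, 3), x, y, p = 2)  # small: this candidate line is near the truth
total_error(c(0, 0), x, y, p = 2)  # much larger: a poor candidate line
```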

Several Forms of Regression

In general, we want to estimate a set of regression coefficients that minimize this total error. Different forms of linear regression arise when we alter the value of $p$. As before, let's consider three settings:

$$E_0 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^0$$

$$E_1 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^1$$

$$E_2 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^2$$

What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the $L_0$ norm doesn't seem to provide a very useful form of linear regression, but I'd be interested to see examples of its use.
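
As a rough brute-force illustration of that point (simulated data, code of my own), we can enumerate every line that passes exactly through a pair of data points and count how many points each one fits exactly; with continuous noise, no line does better than two:

```r
# Brute-force L0 "regression": for every line through a pair of data points,
# count how many observations it fits exactly
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

best_hits <- 0
for (i in 1:(length(x) - 1)) {
  for (j in (i + 1):length(x)) {
    slope <- (y[j] - y[i]) / (x[j] - x[i])
    intercept <- y[i] - slope * x[i]
    hits <- sum(abs(y - (intercept + slope * x)) < 1e-8)
    best_hits <- max(best_hits, hits)
  }
}
best_hits  # 2: no line passes through more than a single pair of points
```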

In contrast, minimizing $E_1$ and minimizing $E_2$ define quite interesting and familiar forms of linear regression. We'll start with $E_2$ because it's the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the $L_2$ case, we select $\beta_0$ and $\beta_1$ to minimize,

$$E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2,$$

which is the summed squared error over all of the $(x_i, y_i)$ pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the $L_2$ norm of the vector of discrepancies.
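
As a quick sanity check (on simulated data; this is an illustration of my own, not code from the post), minimizing $E_2$ numerically recovers essentially the same coefficients that R's built-in `lm()` reports:

```r
# Fit a line by numerically minimizing E_2 and compare with lm()
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)

e2 <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)

fit_l2 <- optim(c(0, 0), e2)  # direct minimization of E_2 (Nelder-Mead)
fit_l2$par                    # approximately (2, 3)
coef(lm(y ~ x))               # essentially the same estimates
```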

Although OLS regression is clearly king, the coefficients we get from minimizing $E_1$ are also quite widely used: using the $L_1$ norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach to regression is robust because large outliers, which would produce errors greater than 1, are not amplified by the squaring operation used in defining OLS regression, but instead contribute only their absolute values. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.
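
Here is a small sketch of that robustness (again on simulated data, this time with a few injected outliers; the `e1()` helper is mine): the $E_1$ fit stays close to the underlying line, while the OLS fit gets pulled by the contaminated points.

```r
# Compare LAD (minimize E_1) with OLS on data containing a few large outliers
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)
y[1:5] <- y[1:5] + 50          # contaminate five observations

e1 <- function(beta) sum(abs(y - (beta[1] + beta[2] * x)))

optim(c(0, 0), e1)$par         # roughly (2, 3): barely affected by the outliers
coef(lm(y ~ x))                # OLS is visibly dragged toward the outliers

# In practice you would use a dedicated LAD fitter, for example
# quantreg::rq(y ~ x), which solves the same L1 problem by linear programming.
```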

We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called $L_2$ regression), we predict the mean of $y_i$ given the value of $x_i$. And when we use LAD regression (which would be better called $L_1$ regression), we predict the median of $y_i$ given the value of $x_i$. Just as I said in the previous post, the core theoretical tool that we need to understand is the $L_p$ norm. For single-number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there's more: it also leads naturally to the two most popular forms of regularized regression.

Regularization

If you're not familiar with regularization, the central idea is that we don't exclusively try to find the values of $\beta_0$ and $\beta_1$ that minimize the discrepancy between $\beta_0 + \beta_1 x_i$ and $y_i$, but also simultaneously try to satisfy a competing requirement that $\beta_1$ not get too large. Note that we don't try to control the size of $\beta_0$ because it describes the overall scale of the data rather than the relationship between $x$ and $y$.

Because these objectives compete, we have to combine them into a single objective, which we do by taking a weighted sum of the two. And because both the discrepancy objective and the size of the coefficients can be described in terms of norms, we'll assume that we want to minimize the $L_p$ norm of the discrepancies plus a multiple of the $L_q$ norm of the $\beta$'s. This means that we end up trying to minimize an expression of the form,

$$\left( \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p \right) + \lambda |\beta_1|^q.$$

In most regularized regression models that I've seen in the wild, people tend to use $p = 2$ and $q = 1$ or $q = 2$. When $q = 1$, this model is called the LASSO. When $q = 2$, this model is called ridge regression. In a future post, I'll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.
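
In the meantime, here is a rough numerical sketch of the objective above (simulated data; the `penalized_error()` helper is my own illustration, not a production fitter), showing how increasing $\lambda$ pulls the slope toward zero under both the $q = 1$ and $q = 2$ penalties:

```r
# Penalized regression objective: p = 2 for the fit, an L_q penalty on the
# slope; lambda controls the strength of the penalty
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)

penalized_error <- function(beta, lambda, q) {
  sum((y - (beta[1] + beta[2] * x))^2) + lambda * abs(beta[2])^q
}

# As lambda grows, the estimated slope shrinks toward zero
for (lambda in c(0, 10, 100, 1000)) {
  ridge_slope <- optim(c(0, 0), penalized_error, lambda = lambda, q = 2)$par[2]
  lasso_slope <- optim(c(0, 0), penalized_error, lambda = lambda, q = 1)$par[2]
  cat("lambda =", lambda,
      "  ridge slope =", round(ridge_slope, 3),
      "  lasso slope =", round(lasso_slope, 3), "\n")
}

# In practice, packages such as glmnet fit the LASSO (q = 1) and ridge (q = 2)
# penalties efficiently, with lambda typically chosen by cross-validation.
```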
