Partial Regression Plots in Julia, Python, and R

R on Data & The World

12 months ago

[This article was first published on R on Data & The World, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Partial regression plots – also called added variable plots, among other things – are a type of diagnostic plot for multivariate linear regression models. More specifically, they attempt to show the effect of adding a new variable to an existing model by controlling for the effect of the predictors already in use. They’re useful for spotting points with high influence or leverage, as well as seeing the partial correlation between the response and the new predictor.

For a model with predictors \(X_1,…,X_n\) and a response \(Y\), the partial regression plot for an additional predictor \(X_{n+1}\) can be constructed like so:

Compute the residuals from a model regressing \(Y\) against \(X_1,…,X_n\).
Compute the residuals from a model regressing \(X_{n+1}\) against \(X_1,…,X_n\).
Plot the residuals from the first model to the second model.

For this example, I will code up basic examples in Julia, Python, and R. For the data, I’ll use the sat data set from R’s faraway package, which I saved to a file beforehand. This dataset contains records for SAT scores in one year for each U.S. state along with four predictors:

Expenditures by the school system expend
Average pupil/teacher ratio ratio
Estimated average teacher salary salary
Percentage of eligible students taking the SAT that year takers

The base model will have SAT scores as a function of ratio, salary, and takers, so we’ll make partial regression plots for expend.

In R

Consider the SAT scores versus expend:

SAT versus expend

There could be a minor but potentially useful correlation there. However, the partial regression plot looks like this:

library(faraway)
library(ggplot2)
data(sat)
expend_resid <- resid(lm(data=sat, expend ~ ratio + salary + takers))
total_resid <- resid(lm(data=sat, total ~ ratio + salary + takers))
ggplot(data=NULL) + geom_point(aes(x=expend_resid, y=total_resid))

Partial regression plot in R.

The partial correlation appears to be negligble at best:

cor(expend_resid, total_resid)
## [1] 0.06295197

Presumably because of the very high correlation between expend and salary, which is already in the model:

cor(sat$expend, sat$salary)
## [1] 0.8698015

As for the code, it’s pretty straightforward, since the fact that we can get the residuals directly from the model object makes this very succinct. Technically nothing needs to be imported once the SAT data has been saved to a file, though I strongly prefer using ggplot2 to the base plotting system.

In Python

Python is a bit different in two ways. On the one hand, training data residuals can’t be extracted from the LinearRegression object from scikit-learn, so the code is a bit longer here:

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
sat = pd.read_csv("sat.csv", index_col=0)
expend_regression = LinearRegression().fit(sat[["ratio", "salary", "takers"]], sat["expend"])
total_regression = LinearRegression().fit(sat[["ratio", "salary", "takers"]], sat["total"])
expend_residuals = sat["expend"] - expend_regression.predict(sat[["ratio", "salary", "takers"]])
total_residuals = sat["total"] - total_regression.predict(sat[["ratio", "salary", "takers"]])
plt.scatter(expend_residuals, total_residuals)

On the other hand, the statsmodels package, which you’ll probably have installed a part of the typical Python data science stack, actually has a function to do that automatically (with default annotations and the regression line between the two sets of residuals, to boot):

import pandas as pd
import statsmodels.api as sm
sat = pd.read_csv("sat.csv", index_col=0)
sm.graphics.plot_partregress("total", "expend", ["ratio", "salary", "takers"], data=sat)

In Julia

Julia is fairly similar to Python in that its GLM package shares the inability to get residuals for the training data from the model object. There are some additional differences in Gadfly’s plotting function, as the default background is transparent.

using CSV
using DataFrames
using Gadfly
using GLM
sat = CSV.File("sat.csv") |> DataFrame
expend_regression = lm(@formula(expend ~ ratio + salary + takers), sat)
total_regression = lm(@formula(total ~ ratio + salary + takers), sat)
expend_residuals = sat["expend"] - predict(expend_regression, sat)
total_residuals = sat["total"] - predict(total_regression, sat)
residual_df = DataFrame(ExpendResiduals=expend_residuals, TotalResiduals=total_residuals)
plot(residual_df, x="ExpendResiduals", y="TotalResiduals", Geom.point, Theme(background_color="white"))

If you’re trying to recreate these plots, note that Gadfly only supports SVG as a file format by default, unlike R and matplotlib. You’ll need the Cairo and Fontconfig libraries if you want to output PNG files.

To leave a comment for the author, please follow the link and comment on their blog: R on Data & The World.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.