R Tutorial Series: Multiple Linear Regression
In R, multiple linear regression is only a small step away from simple linear regression. In fact, the same lm() function can be used for this technique, with the addition of one or more predictors. This tutorial will explore how R can be used to perform multiple linear regression.
Tutorial Files
Before we begin, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains information used to estimate undergraduate enrollment at the University of New Mexico (Office of Institutional Research, 1990). Note that all code samples in this tutorial assume that the data has already been read into an R variable (named datavar in the samples) and that the variable has been attached.
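As a minimal sketch of that setup step, the code below reads a CSV file into a variable and attaches it. The file written here is a temporary stand-in with made-up values, not the tutorial's actual dataset; only the column names ROLL, UNEM, HGRAD, and INC and the variable name datavar are taken from the tutorial's code samples.

```r
# Write a small stand-in CSV (illustrative, made-up values only)
csvPath <- tempfile(fileext = ".csv")
writeLines(c(
  "ROLL,UNEM,HGRAD,INC",
  "5500,8.0,9500,1900",
  "6000,7.0,9700,2000",
  "6600,7.5,10000,1950",
  "7500,6.0,11500,2100"
), csvPath)

# Read the file into an R variable; with the real file you would pass its
# path in your working directory instead of csvPath
datavar <- read.csv(csvPath)

# Inspect the column names and types
str(datavar)

# Attach the data frame so columns can be referenced by name alone
attach(datavar)
meanEnrollment <- mean(ROLL)
detach(datavar)
```

Passing the data frame through the data argument of lm(), as the samples in this tutorial do, works whether or not the data has been attached.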
Creating A Linear Model With Two Predictors
The lm() function
In R, the lm(), or “linear model,” function can be used to create a multiple regression model. The lm() function accepts a number of arguments (“Fitting Linear Models,” n.d.). The following list explains the two most commonly used parameters.
- formula: describes the model
- data: the variable that contains the dataset
Note that the formula argument follows a specific format. For multiple linear regression, this is “YVAR ~ XVAR1 + XVAR2 + … + XVARi” where YVAR is the dependent, or predicted, variable and XVAR1, XVAR2, etc. are the independent, or predictor, variables.
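To make the formula format concrete, here is a short sketch that stores a formula in a variable and inspects it. The names enrollFormula and builtFormula are hypothetical and chosen only for illustration; the variable names ROLL, UNEM, and HGRAD come from the tutorial's dataset.

```r
# A formula is an ordinary R object and can be stored in a variable
enrollFormula <- ROLL ~ UNEM + HGRAD

# Confirm the object type
class(enrollFormula)

# List the variables it refers to: the dependent variable first,
# followed by the predictors
all.vars(enrollFormula)

# The same formula can be assembled from character strings, which can be
# handy when the list of predictors is long
builtFormula <- reformulate(c("UNEM", "HGRAD"), response = "ROLL")
builtFormula
```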
It is recommended that you save a newly created linear model into a variable. By doing so, the model can be used in subsequent calculations and analyses without having to retype the entire lm() function each time. The sample code below demonstrates how to create a linear model with two predictors and save it into a variable. In this particular case, we are using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL).
- > #create a linear model using lm(FORMULA, DATAVAR)
- > #predict the fall enrollment (ROLL) using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD)
- > twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)
- > #display model
- > twoPredictorModel
The output of the preceding function is pictured below.
From this output, we can determine that the intercept is -8255.8, the coefficient for the unemployment rate is 698.2, and the coefficient for number of spring high school graduates is 0.9. Therefore, the complete regression equation is Fall Enrollment = -8255.8 + 698.2 * Unemployment Rate + 0.9 * Number of Spring High School Graduates. This equation tells us that, holding the other predictor constant, the predicted fall enrollment for the University of New Mexico increases by 698.2 students for every one percentage point increase in the unemployment rate and by 0.9 students for every additional spring high school graduate. Suppose that our research question asks what the expected fall enrollment is, given this year’s unemployment rate of 9% and spring high school graduating class of 100,000 students. We can use the regression equation to calculate the answer to this question, as follows.
- > #what is the expected fall enrollment (ROLL) given this year’s unemployment rate (UNEM) of 9% and spring high school graduating class (HGRAD) of 100,000
- > -8255.8 + 698.2 * 9 + 0.9 * 100000
- [1] 88028
- > #the predicted fall enrollment, given a 9% unemployment rate and 100,000 student spring high school graduating class, is 88,028 students.
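The hand calculation above uses rounded coefficients. As a sketch of an alternative, the coefficients can be pulled from the model with coef(), or the whole calculation can be handed to predict(). The data frame below (named enroll, a hypothetical name) holds made-up values so that the example runs on its own; its results will not match the tutorial's figures.

```r
# Illustrative, made-up data with the tutorial's column names
enroll <- data.frame(
  UNEM  = c(8, 7, 7.5, 6, 6.5, 9, 10, 8.5),
  HGRAD = c(9500, 9700, 10000, 11500, 14500, 15000, 15500, 16000),
  ROLL  = c(5500, 6000, 6600, 7500, 8700, 9400, 10000, 10200)
)

twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data = enroll)

# Extract the unrounded coefficients as a named vector
b <- coef(twoPredictorModel)

# Apply the regression equation by hand
byHand <- b[["(Intercept)"]] + b[["UNEM"]] * 9 + b[["HGRAD"]] * 100000

# Let predict() do the same calculation; the column names in newdata
# must match the predictor names in the formula
newYear <- data.frame(UNEM = 9, HGRAD = 100000)
byPredict <- predict(twoPredictorModel, newdata = newYear)

# Both approaches should agree
all.equal(byHand, unname(byPredict))
```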
Creating A Linear Model With Three or More Predictors
When creating a model with more than two predictors, the lm() function can again be used. Simply continue adding variables to the FORMULA argument until all of them are accounted for. A three-predictor model is demonstrated below. It seeks to predict the fall enrollment (ROLL) via the unemployment rate (UNEM), number of spring high school graduates (HGRAD), and per capita income (INC).
- > #create a linear model using lm(FORMULA, DATAVAR)
- > #predict the fall enrollment (ROLL) using the unemployment rate (UNEM), number of spring high school graduates (HGRAD), and per capita income (INC)
- > threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, datavar)
- > #display model
- > threePredictorModel
The output of the preceding function is pictured below.
From this output, we can determine that the intercept is -9153.3, the coefficient for the unemployment rate is 450.1, the coefficient for number of spring high school graduates is 0.4, and the coefficient for per capita income is 4.3. Therefore, the complete regression equation is Fall Enrollment = -9153.3 + 450.1 * Unemployment Rate + 0.4 * Number of Spring High School Graduates + 4.3 * Per Capita Income. This equation tells us that, holding the other predictors constant, the predicted fall enrollment for the University of New Mexico increases by 450.1 students for every one percentage point increase in the unemployment rate, by 0.4 students for every additional spring high school graduate, and by 4.3 students for every additional dollar of per capita income. Let’s revisit our research question, this time including a per capita income of $30,000.
- > #what is the expected fall enrollment (ROLL) given this year’s unemployment rate (UNEM) of 9%, spring high school graduating class (HGRAD) of 100,000, and a per capita income (INC) of $30,000
- > -9153.3 + 450.1 * 9 + 0.4 * 100000 + 4.3 * 30000
- [1] 163897.6
- > #the predicted fall enrollment, given a 9% unemployment rate, 100,000 student spring high school graduating class, and $30,000 per capita income, is 163,898 students.
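Rather than retyping the whole formula, a predictor can also be added to an existing model with update(). This is only a sketch of one possible workflow, not the approach the tutorial itself uses, and the enroll data frame holds made-up values for illustration.

```r
# Illustrative, made-up data with the tutorial's column names
enroll <- data.frame(
  UNEM  = c(8, 7, 7.5, 6, 6.5, 9, 10, 8.5),
  HGRAD = c(9500, 9700, 10000, 11500, 14500, 15000, 15500, 16000),
  INC   = c(1900, 2000, 1950, 2100, 2200, 2150, 2400, 2300),
  ROLL  = c(5500, 6000, 6600, 7500, 8700, 9400, 10000, 10200)
)

twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data = enroll)

# ". ~ . + INC" keeps the existing response and predictors and adds INC
threePredictorModel <- update(twoPredictorModel, . ~ . + INC)
formula(threePredictorModel)

# Fitting the full formula directly should give the same coefficients
directModel <- lm(ROLL ~ UNEM + HGRAD + INC, data = enroll)
all.equal(coef(threePredictorModel), coef(directModel))
```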
Summarizing The Models
A multiple linear regression model can be used to do much more than just calculate expected values. Here, the summary(OBJECT) function is a useful tool. It is capable of generating a wealth of important information about a linear model. The example below demonstrates the use of the summary function on the two models created during this tutorial.
- > #use summary(OBJECT) to display information about the linear model
- > summary(twoPredictorModel)
- > summary(threePredictorModel)
The output of the preceding functions is pictured below.
The summary(OBJECT) function has provided us with t-test, F-test, R-squared, residual, and significance values. All of this data can be used to answer important questions related to our models.
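The values that summary() prints can also be retrieved individually, since the summary is stored as a list. The sketch below shows a few of its components for a two-predictor model fitted to made-up data (the enroll data frame and the name modelSummary are hypothetical).

```r
# Illustrative, made-up data with the tutorial's column names
enroll <- data.frame(
  UNEM  = c(8, 7, 7.5, 6, 6.5, 9, 10, 8.5),
  HGRAD = c(9500, 9700, 10000, 11500, 14500, 15000, 15500, 16000),
  ROLL  = c(5500, 6000, 6600, 7500, 8700, 9400, 10000, 10200)
)

twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data = enroll)
modelSummary <- summary(twoPredictorModel)

# Coefficient table: estimates, standard errors, t values, and p-values
modelSummary$coefficients

# R-squared and adjusted R-squared
modelSummary$r.squared
modelSummary$adj.r.squared

# F statistic with its numerator and denominator degrees of freedom
modelSummary$fstatistic

# Residual standard error
modelSummary$sigma
```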
Alternative Modeling Options
Although lm() was used in this tutorial, note that there are alternative modeling functions available in R, such as glm() and rlm() (the latter from the MASS package). Depending on your circumstances, it may be beneficial or necessary to investigate alternatives to lm() before choosing how to conduct your regression analysis.
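As a brief sketch of how one of these alternatives relates to lm(), the code below fits the same two-predictor model with glm() using a Gaussian family, which should reproduce the lm() coefficients. The data is made up for illustration, and the names lmFit and glmFit are hypothetical.

```r
# Illustrative, made-up data with the tutorial's column names
enroll <- data.frame(
  UNEM  = c(8, 7, 7.5, 6, 6.5, 9, 10, 8.5),
  HGRAD = c(9500, 9700, 10000, 11500, 14500, 15000, 15500, 16000),
  ROLL  = c(5500, 6000, 6600, 7500, 8700, 9400, 10000, 10200)
)

# Ordinary linear model
lmFit <- lm(ROLL ~ UNEM + HGRAD, data = enroll)

# Generalized linear model with a Gaussian family and identity link,
# which corresponds to ordinary least squares
glmFit <- glm(ROLL ~ UNEM + HGRAD, family = gaussian(), data = enroll)

# The two sets of coefficients should agree
all.equal(coef(lmFit), coef(glmFit))
```

Other families, such as binomial() or poisson(), are where glm() departs from lm().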
Complete Multiple Linear Regression Example
To see a complete example of how multiple linear regression can be conducted in R, please download the multiple linear regression example (.txt) file.
References
Fitting Linear Models. (n.d.). Retrieved November 22, 2009 from http://sekhon.berkeley.edu/library/stats/html/lm.html
Office of Institutional Research (1990). Enrollment Forecast [Data File]. Retrieved November 22, 2009 from http://lib.stat.cmu.edu/DASL/Datafiles/enrolldat.html