Simple regression models in R

[This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers.]

Linear regression models are among the simplest, yet most powerful, models you can use in R to fit observed data and predict quantitative phenomena.

Say you know that a certain variable y is somewhat correlated with a certain variable x, and you can reasonably get an idea of what y would be given x. A classic example is the price of a house (y) versus its size in square metres (x). It is reasonable to assume that, within a certain range and for the same location, square metres are correlated with the price of the house. As a first rough approximation, one could try out the hypothesis that price is directly proportional to square metres.

Now what if you would like to predict the possible price of a flat of 80 square meters? Well, an initial approach could be gathering the data of houses in the same location and with similar characteristics and then fit a model.

A linear model with one predicting variable might be the following:

y = α + β·x

α and β are found by looking for the two values that minimise the error of the model relative to the observed data. In essence, finding the two coefficients is equivalent to finding the “closest” line to our data. Of course this will still be an estimate, and the line will probably not pass through any of the observed points exactly.
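As a sketch of what this fitting procedure does, with a single predictor the least-squares coefficients have a closed form: β = cov(x, y)/var(x) and α = mean(y) − β·mean(x). The snippet below (with made-up data, purely for illustration) checks this against lm():

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for y = alpha + beta * x
beta  <- cov(x, y) / var(x)        # slope minimising the squared error
alpha <- mean(y) - beta * mean(x)  # the fitted line passes through the means

# lm() should recover the same coefficients
fit <- lm(y ~ x)
c(alpha = alpha, beta = beta)
coef(fit)
```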

In R, fitting these models is quite easy. Let's code an example using the mtcars dataset, setting x = displacement (disp) and y = weight (wt). It looks reasonable to assume that displacement and weight are correlated.

[Figure: scatterplot of wt against disp, showing a roughly linear positive trend]

So far our assumption seems reasonable.

#-------------------------------------------------------------------------------
# Gathering the data
#-------------------------------------------------------------------------------
# Loading the entire mtcars dataset
data(mtcars)
# Subsetting the dataset for our use
dat <- subset(mtcars,select=c(mpg,disp,hp,drat,wt))
dat
# Take a peek at the data plotting each feature against the other
plot(dat)
# Attach for rapid calls
attach(dat)
# Plot data to use
plot(disp,wt,type='p',xlab='Disp',ylab='Wt',main='Regression polynomials')
# Add a legend
legend("topleft",c("n=1","n=2","n=3"), col=c("red","green","blue"), lwd=3)

Let’s fit the model.

Overall the fitted line (in red) seems a good fit to our data, and this is confirmed by the R-squared value, which is above 0.7. There is no minimum R-squared value for a model to be significant; however, a value below roughly 0.3 usually indicates a poor fit, suggesting you should change the model (R-squared varies between 0 and 1). Take a close look at the p-values too: a low p-value signals that the variable (in this case disp) is statistically significant. Indeed, the confidence intervals for the estimated coefficient of disp do not include 0.

[Figure: disp vs wt with the fitted linear model (red line)]

#-------------------------------------------------------------------------------
# Linear regression y = a + bx
#-------------------------------------------------------------------------------
# Fit the model
model1 <- lm(wt ~ disp)
# Parameters and information about the model
model1
coef(model1)
summary(model1)
# Plot the model
abline(model1,col='red',lwd=2)
##################################################################################
# OUTPUT
##################################################################################
# > summary(model1)
#
# Call:
# lm(formula = wt ~ disp)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.89044 -0.29775 -0.00684 0.33428 0.66525
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.5998146 0.1729964 9.248 2.74e-10 ***
# disp 0.0070103 0.0006629 10.576 1.22e-11 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4574 on 30 degrees of freedom
# Multiple R-squared: 0.7885, Adjusted R-squared: 0.7815
# F-statistic: 111.8 on 1 and 30 DF, p-value: 1.222e-11
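Rather than reading these numbers off the printed summary, you can extract them from the fitted object directly. A small sketch (refitting model1 here so the snippet stands on its own):

```r
data(mtcars)
model1 <- lm(wt ~ disp, data = mtcars)

s <- summary(model1)
s$r.squared                    # multiple R-squared
s$coefficients[, "Pr(>|t|)"]   # p-values for the intercept and disp
confint(model1)                # 95% confidence intervals for the coefficients
```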

Sometimes the relation between two variables is anything but linear, in which case a quadratic or cubic model can be a better choice, as used below.

The quadratic model seems of little use in this case, since it is hardly more helpful than the original red line. The cubic model looks like a better fit to our data than the two previous models, and could perhaps be the one we were looking for.

In principle you could go on and fit higher-order polynomials to your data. However, this might not be the optimal choice: remember that you are only looking at a sample of all the available data, some of which may be distorted by measurement or other errors. By fitting your model as closely as you can to the data, you implicitly make it very sensitive to those errors; in other words, you risk overfitting it. Depending on what you are modelling, you have to balance this trade-off between a good fit and an overly sensitive model.
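One way to keep this trade-off in check is to compare models using criteria that penalise extra terms, such as adjusted R-squared or AIC, instead of raw R-squared, which can only increase as terms are added. A sketch using the three fits from this post (refitted here so the snippet stands on its own):

```r
data(mtcars)
m1 <- lm(wt ~ disp, data = mtcars)
m2 <- lm(wt ~ disp + I(disp^2), data = mtcars)
m3 <- lm(wt ~ disp + I(disp^2) + I(disp^3), data = mtcars)

# Raw R-squared never decreases as terms are added...
sapply(list(m1, m2, m3), function(m) summary(m)$r.squared)
# ...while adjusted R-squared and AIC penalise the extra complexity
sapply(list(m1, m2, m3), function(m) summary(m)$adj.r.squared)
AIC(m1, m2, m3)
```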

[Figure: disp vs wt with the linear (red), quadratic (green) and cubic (blue) fits]

#-------------------------------------------------------------------------------
# Parabola y = a + bx + cx^2
#-------------------------------------------------------------------------------
model2 <- lm(wt ~ disp+I(disp^2))
summary(model2)
coef(model2)
# Predicted vs original
predicted <- fitted(model2)
original <- wt
# Plot model2
curve(predict(model2,data.frame(disp=x)),col='green',lwd=2,add=TRUE)
#-------------------------------------------------------------------------------
# Polynomial n=3 y = a +bx + cx^2 + dx^3
#-------------------------------------------------------------------------------
model3 <- lm(wt ~ disp+ I(disp^2) + I(disp^3))
summary(model3)
coef(model3)
# Predicted vs original
predicted <- fitted(model3)
original <- wt
# Plot model3
curve(predict(model3,data.frame(disp=x)),col='blue',lwd=2,add=TRUE)
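As an aside, the same polynomial fits can be written more compactly with poly(). By default poly() uses orthogonal polynomials, which changes the individual coefficients but not the fitted curve; the I(disp^2)/I(disp^3) form used above is the raw parameterisation:

```r
data(mtcars)
m3_raw  <- lm(wt ~ disp + I(disp^2) + I(disp^3), data = mtcars)
m3_poly <- lm(wt ~ poly(disp, 3), data = mtcars)  # orthogonal polynomials

# Different coefficients, same fitted curve
all.equal(unname(fitted(m3_raw)), unname(fitted(m3_poly)))
```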

Now what if you have more information available? It is usually good to use it and put it into your model: R-squared will surely increase, but we will need to check the p-values and the F-statistic for statistical significance.

If, for instance, you could use other variables such as gross horsepower (hp), rear axle ratio (drat) and miles per (US) gallon (mpg), you could fit a linear model that takes all these new variables into account:

y = a + b·x1 + c·x2 + d·x3 + e·x4

Of course the model is now much harder to visualise as a whole; however, we can, for instance, plot the fitted values ŷ against a single predictor such as disp.

[Figure: observed (black) and predicted (red) wt plotted against disp]

Furthermore, by taking a look at the summary of the model, we can see which variables influence the outcome the most and which do not. This last model is probably better than the cubic model fitted above because it uses more information. In R it is quite easy, and fun, to play around with these models and combine them.

#-------------------------------------------------------------------------------
# Another linear model using more information: y = a + b*x1 + c*x2 + d*x3 + e*x4
#-------------------------------------------------------------------------------
# Plot data
plot(disp,wt,type='p',xlab='Disp',ylab='Wt',main='Linear regression')
# Add a legend
legend("topleft",c("Observ.","Predicted"), col=c("black","red"), lwd=3)
model4 <- lm(wt ~ disp+mpg+hp+drat)
summary(model4)
coef(model4)
predicted <- fitted(model4)
original <- wt
mat <- cbind(predicted,original)
points(disp,predicted, type="p", col="red")
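Once a model like this is fitted, predict() gives point estimates, and optionally intervals, for new observations. The values for the hypothetical car below are chosen purely for illustration:

```r
data(mtcars)
model4 <- lm(wt ~ disp + mpg + hp + drat, data = mtcars)

# A hypothetical new car (values chosen for illustration only)
new_car <- data.frame(disp = 200, mpg = 20, hp = 120, drat = 3.5)
predict(model4, newdata = new_car)                          # point estimate of wt
predict(model4, newdata = new_car, interval = "prediction") # with a 95% interval
```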
