Simple Linear Regression in R

[This article was first published on finnstats and kindly contributed to R-bloggers.]

With simple linear regression in R, we build models to investigate and forecast the relationship between two variables. The most basic relationship we can think of is a straight line.

Let’s take a look at the first linear relationship that we are going to create.

Simple Linear Regression in R

Let’s load the Boston Housing data from the mlbench package.

library(mlbench)
data("BostonHousing2")
head(BostonHousing2)
dim(BostonHousing2)

The data set contains 506 rows and 19 columns.

Now we can check the association between the average number of rooms in a house and the median house price from this data set.

Scatter Plot

Now we can use ggplot2 to make a scatterplot.

library(ggplot2)
ggplot(BostonHousing2,mapping = aes(y=medv,x=rm)) +
  geom_point() +
  xlab("Average number of Rooms") +
  ylab("Median House Price")

The median house price and the average number of rooms show a clear positive, roughly linear relationship.
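The scatterplot suggests fitting a straight line to these two variables. The following is a minimal sketch, assuming the mlbench and ggplot2 packages are installed; the object name boston_fit is illustrative and does not come from the original post.

```r
library(mlbench)
library(ggplot2)
data("BostonHousing2")

# Regress median house price (medv) on average number of rooms (rm)
boston_fit <- lm(medv ~ rm, data = BostonHousing2)
coef(boston_fit)  # intercept and slope of the fitted line

# Overlay the least-squares line on the scatterplot
ggplot(BostonHousing2, aes(x = rm, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  xlab("Average number of Rooms") +
  ylab("Median House Price")
```

The slope is the estimated change in median house price for each additional room, in the units of medv.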

Ok, let’s see another example of the relationship between the price of a diamond and the number of carats using a fancy hexbin plot.

Let’s look at the dataset first (diamonds ships with ggplot2):

head(diamonds)
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
library(hexbin)
ggplot(diamonds, mapping = aes(x = carat, y = price)) +
  geom_hex(bins = 50)

Looking at the plot, the relationship doesn’t appear to be linear, but we can make it much closer to linear with a small change: taking the logarithm of both variables.

ggplot(diamonds, mapping = aes(x = log10(carat), y = log10(price))) +  geom_hex(bins=50)

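We’ll use the log of carat to forecast the log of a diamond’s price. Note that the model uses the natural logarithm, log(), while the plot used log10(); in a log-log model the base changes only the intercept, not the slope.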

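# Fit the log-log model. The result is stored under the name `lm`; this works
# because R still finds the lm() function when it is called, although a
# distinct object name is usually clearer.
lm <- lm(log(price) ~ log(carat), data = diamonds)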
summary(lm)
Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.50833 -0.16951 -0.00591  0.16637  1.33793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
log(carat)  1.675817   0.001934   866.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2627 on 53938 degrees of freedom
Multiple R-squared:  0.933,	Adjusted R-squared:  0.933 
F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
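In a log-log model the slope can be read as an elasticity: multiplying carat by some ratio multiplies the predicted price by that ratio raised to the slope. The sketch below works this out in base R, with the slope copied from the summary output above; the helper price_multiplier is a hypothetical name introduced only for illustration.

```r
# Slope copied from the summary output above
slope <- 1.675817

# Multiplicative change in predicted price when carat is multiplied by `ratio`
price_multiplier <- function(ratio, b = slope) ratio^b

price_multiplier(1.01)  # a 1% heavier diamond: about 1.7% higher predicted price
price_multiplier(2)     # doubling the carat: roughly 3.2 times the predicted price
```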

Assumptions

For the linear model we have the following assumptions:

  1. Linearity (a straight-line relationship between log price and log carat)
  2. Homoscedasticity (the noise terms have the same variance)
  3. Normality (the noise terms are normally distributed)
  4. Independence (the noise terms are independent)

1. Linearity

Plot the residuals against the fitted values.

plot(lm, which = 1)

The red line helps reveal any patterns in the residuals. Here it is essentially flat, which indicates no trend in the residuals, so the linearity assumption looks satisfied.

2. Homoscedasticity

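Let’s look at the spread of the residuals using the scale-location plot, which shows the square root of the absolute standardized residuals against the fitted values.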

plot(lm, which = 3)

Here we’d like to see the points spread evenly as we move from left to right. There are no obvious trends in this plot.
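As a rough numerical complement to the plot, one can compare the standard deviation of the residuals across groups of fitted values. This is only a sketch, assuming ggplot2 is installed for the diamonds data; the names fit, bins and resid_sd are illustrative, and the choice of five groups is arbitrary.

```r
library(ggplot2)  # provides the diamonds data

fit <- lm(log(price) ~ log(carat), data = diamonds)

# Split observations into (roughly) equal-sized groups by fitted value;
# unique() guards against duplicated break points caused by tied values
bins <- cut(fitted(fit),
            breaks = unique(quantile(fitted(fit), probs = seq(0, 1, by = 0.2))),
            include.lowest = TRUE)

# Residual standard deviation within each group
resid_sd <- tapply(resid(fit), bins, sd)
round(resid_sd, 3)
```

Similar values across the groups are consistent with constant variance; large differences would point to heteroscedasticity.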

3. Normality

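Here we use which = 2, which draws a normal Q-Q plot of the standardized residuals. Points lying close to the diagonal line suggest the residuals are approximately normal.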

plot(lm, which = 2)
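The Q-Q plot compares quantiles of the standardized residuals with those of a standard normal distribution. The same comparison can be sketched as a small table, assuming ggplot2 is installed for the diamonds data; the names fit, probs and comparison are illustrative.

```r
library(ggplot2)  # provides the diamonds data

fit <- lm(log(price) ~ log(carat), data = diamonds)

# Selected quantiles of the standardized residuals next to the
# corresponding standard normal quantiles
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
comparison <- rbind(
  residuals = quantile(rstandard(fit), probs),
  normal    = qnorm(probs)
)
round(comparison, 2)
```

The closer the two rows agree, the more plausible the normality assumption; differences in the outer columns correspond to the tails of the Q-Q plot.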

4. Independence

There is no diagnostic plot for this assumption. Checking it requires some understanding of the data’s origins, meaning, and collection methods, so there are no shortcuts.

Conclusion

The diagnostic plots raise no obvious concerns, so, provided the independence assumption is reasonable for how the data were collected, we can use the formula below to forecast log(price).

log(price) = 8.448661 + 1.675817 × log(carat)
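To turn the formula into a price forecast, predict on the log scale and then exponentiate. A minimal sketch, assuming ggplot2 is installed for the diamonds data; the carat values and the names fit, new_diamonds and predicted_price are chosen only for illustration.

```r
library(ggplot2)  # provides the diamonds data

fit <- lm(log(price) ~ log(carat), data = diamonds)

# Hypothetical diamonds to forecast
new_diamonds <- data.frame(carat = c(0.5, 1, 2))

# predict() returns log(price); exp() converts back to the price scale
log_price <- predict(fit, newdata = new_diamonds)
new_diamonds$predicted_price <- exp(log_price)
new_diamonds
```

Exponentiating a log-scale prediction is the simplest back-transformation; under normally distributed errors it is better read as an estimate of the typical (median) price than of the mean price.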
