Robust Regressions: Dealing with Outliers in R

It is often the case that a dataset contains significant outliers: observations that lie far outside the range of the majority of the data. Let us see how we can use robust regression to deal with this issue.

I described in another tutorial how to run a linear regression in R. However, a standard linear regression does not account for outliers in the data. So, how can we solve this?

Plots

A useful way of dealing with outliers is to run a robust regression: a regression that adjusts the weight assigned to each observation in order to reduce the influence of outliers on the fitted model.

In this particular example, we will build a regression to analyse internet usage in megabytes across different observations. You will see that we have several outliers in this dataset. Specifically, there are three instances where internet consumption is vastly higher than in the other observations.

Let’s see how we can use a robust regression to mitigate the effect of these outliers.

Firstly, let’s plot Cook’s distance and the QQ Plot:
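Both plots can be produced with base R’s plot method for linear models; here is a minimal sketch, using the same model formula as the OLS regression fitted in the next section:

#plot.lm() diagnostics: which = 4 gives Cook's distance,
#which = 2 gives the normal QQ plot of the standardized residuals
ols <- lm(usage ~ income + videohours + webpages + gender + age, data = mydata)
plot(ols, which = 4)
plot(ols, which = 2)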

[Figure: Cook’s distance plot]

[Figure: QQ plot]

We can see that the plot of Cook’s distance shows clear outliers, and the QQ plot demonstrates the same, with a significant number of observations deviating from the straight reference line (suggesting the residuals are not normally distributed).

When we get a summary of our data, we see that the maximum value for usage far exceeds both the mean and the median:

summary(mydata)
      age            gender          webpages     
 Min.   :19.00   Min.   :0.0000   Min.   : 10.00  
 1st Qu.:27.00   1st Qu.:0.0000   1st Qu.: 20.00  
 Median :37.00   Median :1.0000   Median : 25.00  
 Mean   :37.94   Mean   :0.5124   Mean   : 29.64  
 3rd Qu.:48.75   3rd Qu.:1.0000   3rd Qu.: 32.00  
 Max.   :60.00   Max.   :1.0000   Max.   :950.00  
   videohours          income          usage       
 Min.   : 0.0000   Min.   :    0   Min.   :   500  
 1st Qu.: 0.4106   1st Qu.: 3503   1st Qu.:  3607  
 Median : 1.7417   Median : 6362   Median :  9155  
 Mean   : 4.0229   Mean   : 6179   Mean   : 11955  
 3rd Qu.: 4.7765   3rd Qu.: 8652   3rd Qu.: 19361  
 Max.   :45.0000   Max.   :12000   Max.   :108954

OLS Regression

Let’s now run a standard OLS regression and see what we come up with.

#OLS Regression
summary(ols <- lm(usage ~ income + videohours + webpages + gender + age, data = mydata))
Call:
lm(formula = usage ~ income + videohours + webpages + gender + 
    age, data = mydata)

Residuals:
   Min     1Q Median     3Q    Max 
-10925  -2559   -479   2266  46030 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.499e+03  4.893e+02  -5.107 3.96e-07 ***
income       7.874e-01  4.658e-02  16.903  < 2e-16 ***
videohours   1.172e+03  3.010e+01  38.939  < 2e-16 ***
webpages     5.414e+01  3.341e+00  16.208  < 2e-16 ***
gender      -1.227e+02  2.650e+02  -0.463    0.643    
age          8.781e+01  1.116e+01   7.871 9.50e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4113 on 960 degrees of freedom
Multiple R-squared:  0.8371,	Adjusted R-squared:  0.8363 
F-statistic: 986.8 on 5 and 960 DF,  p-value: < 2.2e-16

We see that most of our coefficient estimates are statistically significant at the 5% level (gender being the exception), and the R-squared is reasonably high at 0.8371.

However, we need to bear in mind that this regression does not account for the significant outliers present in our dataset.

Cook’s Distance

A method we can use to determine outliers in our dataset is Cook’s distance. As a rule of thumb, if Cook’s distance is greater than 1, or if the distance in absolute terms is significantly greater than others in the dataset, then this is a good indication that we are dealing with an outlier.
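As a quick check against this rule of thumb, we can list any observations whose Cook’s distance exceeds 1 directly (a one-line sketch, reusing the ols model fitted above):

#Indices of observations with Cook's distance greater than 1
which(cooks.distance(ols) > 1)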

#Compute Cook's distance and standardized residuals
library(MASS)  #provides stdres() here and rlm() below
dist <- cooks.distance(ols)
dist <- data.frame(dist)
s <- stdres(ols)

#Sort in order of absolute standardized residuals
sabs <- abs(s)
a <- cbind(mydata, dist, s, sabs)
asorted <- a[order(-sabs), ]
asorted[1:10, ]
    age gender webpages videohours income  usage        dist
966  35      0       80 45.0000000   6700 108954 2.058817511
227  32      1       27  0.0000000   7830  24433 0.010733761
93   37      1       30  0.0000000  10744  27213 0.018822466
419  25      1       20  0.4952778   1767  17843 0.010178820
568  36      1       37  1.6666667   8360  25973 0.007652459
707  42      1       40  4.2347222   8527  29626 0.006085165
283  39      0       46  0.0000000   6244  22488 0.007099847
581  40      0       24  0.0000000   9746  23903 0.010620785
513  29      1       23  1.1111111  11398  25182 0.015107423
915  30      0       42  0.0000000   9455  22821 0.010412391
            s      sabs
966 11.687771 11.687771
227  4.048576  4.048576
93   4.026393  4.026393
419  3.707761  3.707761
568  3.627891  3.627891
707  3.583459  3.583459
283  3.448079  3.448079
581  3.393124  3.393124
513  3.353123  3.353123
915  3.162762  3.162762

We have added Cook’s distance and the standardized residuals to our dataset. Observe that the observation with the greatest internet usage also has the highest Cook’s distance and the highest standardized residual.

Huber and Bisquare Weights

At this point, we can adjust the weights assigned to each observation in order to reduce the influence of the outliers on our regression results.

Let’s see how we can do this using Huber and Bisquare weights.
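For intuition, the two weighting schemes can be written as simple functions of the scaled residual e. The following is a minimal sketch, where k = 1.345 and c = 4.685 are the default tuning constants used by rlm in the MASS package:

#Huber: full weight inside the threshold k, then weights decay as k/|e|
huber_weight <- function(e, k = 1.345) ifelse(abs(e) <= k, 1, k / abs(e))

#Bisquare: weights descend smoothly and reach exactly 0 beyond c,
#so extreme outliers are effectively dropped from the fit
bisquare_weight <- function(e, c = 4.685) ifelse(abs(e) <= c, (1 - (e / c)^2)^2, 0)

Note that Huber weights never reach 0, whereas bisquare weights do; this is exactly the pattern we will see in the fitted weights below.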

Huber Weights

#Huber weights (psi.huber is the default psi function for rlm in MASS)
summary(rr.huber <- rlm(usage ~ income + videohours + webpages + gender + age, data = mydata))
Call: rlm(formula = usage ~ income + videohours + webpages + gender + 
    age, data = mydata)
Residuals:
     Min       1Q   Median       3Q      Max 
-11605.7  -2207.7   -177.2   2430.9  49960.3 

Coefficients:
            Value      Std. Error t value   
(Intercept) -3131.2512   423.0516    -7.4016
income          0.9059     0.0403    22.4945
videohours   1075.0703    26.0219    41.3140
webpages       57.1909     2.8880    19.8028
gender       -173.4154   229.0994    -0.7569
age            88.6238     9.6455     9.1881

Residual standard error: 3449 on 960 degrees of freedom

huber <- data.frame(usage = mydata$usage, resid = rr.huber$resid, weight = rr.huber$w)
huber2 <- huber[order(rr.huber$w), ]
huber2[1:10, ]
     usage    resid     weight
966 108954 49960.32 0.09284397
227  24433 16264.20 0.28518764
93   27213 15789.65 0.29375795
419  17843 15655.03 0.29628629
707  29626 14643.43 0.31675392
568  25973 14605.87 0.31756653
283  22488 13875.58 0.33427984
581  23903 13287.62 0.34907331
513  25182 13080.99 0.35458486
105  19606 12478.59 0.37170941

Bisquare weighting

#Bisquare weighting
rr.bisquare <- rlm(usage ~ income + videohours + webpages + gender + age, data=mydata, psi = psi.bisquare)
summary(rr.bisquare)
Call: rlm(formula = usage ~ income + videohours + webpages + gender + 
    age, data = mydata, psi = psi.bisquare)
Residuals:
     Min       1Q   Median       3Q      Max 
-11473.6  -2177.8   -119.7   2491.9  50000.1 

Coefficients:
            Value      Std. Error t value   
(Intercept) -3204.9051   426.9458    -7.5066
income          0.8985     0.0406    22.1074
videohours   1077.1598    26.2615    41.0167
webpages       56.3637     2.9146    19.3384
gender       -156.7096   231.2082    -0.6778
age            90.2113     9.7343     9.2673

Residual standard error: 3434 on 960 degrees of freedom

bisqr <- data.frame(usage = mydata$usage, resid = rr.bisquare$resid, weight = rr.bisquare$w)
bisqr2 <- bisqr[order(rr.bisquare$w), ]
bisqr2[1:10, ]
     usage    resid       weight
227  24433 16350.56 0.0000000000
966 108954 50000.09 0.0000000000
93   27213 15892.10 0.0005829225
419  17843 15700.87 0.0022639112
707  29626 14720.97 0.0264753021
568  25973 14694.59 0.0274546293
283  22488 13971.53 0.0604133298
581  23903 13389.67 0.0944155930
513  25182 13192.86 0.1072333000
105  19606 12536.39 0.1543038994

In both of the above instances, observe that observation 966 is assigned a much lower weight of about 0.093 under Huber weighting, while bisquare weighting assigns it, together with observation 227, a weight of exactly 0.

In this regard, we are allowing the respective regressions to adjust the weights in a way that gives less importance to outliers in our model.
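To see how much the weighting has moved the estimates, we can place the coefficients from all three fits side by side (a short sketch reusing the ols, rr.huber and rr.bisquare objects from above):

#Compare coefficient estimates across the three fits
round(cbind(OLS = coef(ols), Huber = coef(rr.huber), Bisquare = coef(rr.bisquare)), 4)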

Conclusion

In this tutorial, you have learned:

  • How to screen for outliers using Cook’s distance and QQ plots
  • Why standard linear regressions do not necessarily adjust for outliers
  • How to use weighting techniques to adjust for such anomalies

If you have any questions on anything I have covered in this tutorial, please leave a comment and I will do my best to address your query. You can also find a video-based tutorial on this topic here.

Dataset

internetoutliers.csv

