Linear regression models with robust parameter estimation

Ralph

12 years ago

[This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There are situations in regression modelling where robust methods could be considered to handle unusual observations that do not follow the general trend of the data set. There are various packages in R that provide robust statistical methods which are summarised on the CRAN Robust Task View.

As an example of using robust statistical estimation in a linear regression framework consider the CPUs data that was used in previous posts on linear regression and variable selection. For this data we could fit a model with six variables using least squares and also with a fast MM-estimator from the robustbase package.

First step is to make the functions and data available for analysis:

require(MASS)
require(robustbase)

The linear model using least squares is fitted as follows:

> cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)

The summary for this model:

> summary(cpu.mod1)
 
Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-195.841  -25.169    5.409   26.528  385.749 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 ** 
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263    
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,     Adjusted R-squared: 0.8609 
F-statistic: 215.5 on 6 and 202 DF,  p-value: < 2.2e-16

The linear model using an MM-estimator is fitted as follows:

cpu.robmod1 = lmrob(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, control = lmrob.control(max.it = 100))

Note that we need to increase the default number of iterations (50) to allow the routine to converge to a solution. The summary for this model:

> summary(cpu.robmod1)
 
Call:
lmrob(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus, control = lmrob.control(max.it = 100))
 
Weighted Residuals:
      Min        1Q    Median        3Q       Max 
-144.0045   -9.4554    0.7691   13.3757  759.6953 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.6634297  6.1697645  -0.594 0.553329    
syct         0.0063112  0.0043877   1.438 0.151868    
mmin         0.0098798  0.0034726   2.845 0.004897 ** 
mmax         0.0024463  0.0004525   5.407 1.80e-07 ***
cach         0.8702102  0.2551245   3.411 0.000782 ***
chmin        2.4078436  1.3319413   1.808 0.072130 .  
chmax        0.1016861  0.1494902   0.680 0.497145    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Robust residual standard error: 19.27 
Convergence in 65 IRWLS iterations
 
Robustness weights: 
 15 observations c(1,8,9,10,31,32,96,97,98,153,154,156,169,199,200) are outliers with |weight| = 0 ( < 0.00048); 
 14 weights are ~= 1. The remaining 180 ones are summarized as
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002472 0.831300 0.963700 0.849800 0.989500 0.998900 
Algorithmic parameters: 
tuning.chi         bb tuning.psi refine.tol    rel.tol 
 1.5476400  0.5000000  4.6850610  0.0000001  0.0000001 
 nResample     max.it     groups    n.group   best.r.s   k.fast.s      k.max  trace.lev compute.rd 
       500        100          5        400          2          1        200          0          0 
seed : int(0)

The two models differ in the variables that are considered important and the output from the lmrob function provides a summary of the weights that have been allocated to the data. A total of fifteen of the data points have been allocated very small weights by the fitting algorithm.

Related posts:

The update function for simplifying model selection.
Data analysis using simple linear regression models.
The dropterm function for simplifying models.

To leave a comment for the author, please follow the link and comment on their blog: Software for Exploratory Data Analysis and Statistical Modelling.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.