Introduction to R for Data Science :: Session 7 [Multiple Linear Regression Model in R + Categorical Predictors, Partial and Part Correlation]
Welcome to Introduction to R for Data Science Session 7: Multiple Regression + Dummy Coding, Partial and Part Correlations [Multiple Linear Regression in R. Dummy coding: various ways to do it in R. Factors. Inspecting the multiple regression model: regression coefficients and their interpretation, confidence intervals, predictions. Introducing {lattice} plots + ggplot2. Assumptions: multicollinearity and testing it with the {car} package. Predictive models with categorical and continuous predictors. Influence plot. Partial and part (semi-partial) correlation in R.]
The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to access the learning material presented thus far.
Data Science Serbia Course Pages [in Serbian]
Startit Course Pages [in Serbian]
Lecturers
- dipl. ing. Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Data Science Serbia
- Goran S. Milovanović, PhD, Data Scientist at DiploFoundation, Data Science Serbia
Summary of Session 7, 9 June 2016 :: Multiple Regression + Dummy Coding, Partial and Part Correlations.
Multiple Regression + Dummy Coding, Partial and Part Correlations. Multiple Linear Regression in R. Dummy coding: various ways to do it in R. Factors. Inspecting the multiple regression model: regression coefficients and their interpretation, confidence intervals, predictions. Introducing {lattice} plots + ggplot2. Assumptions: multicollinearity and testing it with the {car} package. Predictive models with categorical and continuous predictors. Influence plot. Partial and part (semi-partial) correlation in R.
Intro to R for Data Science SlideShare :: Session 7
R script :: Session 7
########################################################
# Introduction to R for Data Science
# SESSION 7 :: 9 June, 2016
# Multiple Linear Regression in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################

# clear workspace
rm(list = ls())

#### packages
library(datasets)
library(broom)
library(ggplot2)
library(lattice)
library(QuantPsyc)

#### load data
data(iris)
str(iris)

#### simple linear regression: Sepal Length vs Petal Length
# Predictor vs Criterion {ggplot2}
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 2, colour = "black") +
  geom_point(size = 1, colour = "white") +
  geom_smooth(method = "lm", colour = "black") +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length") +
  theme(legend.position = "none")

# What is wrong here?
# let's see...
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(reg)
# Hm, everything seems fine to me...
# And now for something completely different (but in R)...

#### Problems with linear regression in iris
# Predictor vs Criterion {ggplot2} - group separation
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")

# Predictor vs Criterion {ggplot2} - separate regression lines
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_smooth(method = "lm") +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")

### Ooops...
### overview and considerations
plot(iris[, c(1, 3, 5)],
     main = "Inspect: Sepal vs. Petal Length \nfollowing the discovery of the Species...",
     cex.main = .75,
     cex = .6)

### better... {lattice}
xyplot(Petal.Length ~ Sepal.Length | Species, # {lattice} xyplot
       data = iris,
       xlab = "Sepal Length",
       ylab = "Petal Length"
)
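If you also want a least-squares line drawn in each panel, xyplot() accepts a type argument where "p" plots the points and "r" adds a regression fit; a minimal sketch (not part of the session script):

# per-panel regression lines in {lattice}:
# type = c("p", "r") draws points plus a least-squares line in each panel
xyplot(Petal.Length ~ Sepal.Length | Species,
       data = iris,
       type = c("p", "r"),
       xlab = "Sepal Length",
       ylab = "Petal Length")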

### Petal.Length and Sepal.Length: EDA and distributions
par(mfcol = c(2, 2))
# boxplot Petal.Length
boxplot(iris$Petal.Length,
        horizontal = TRUE,
        xlab = "Petal Length")
# histogram: Petal.Length
hist(iris$Petal.Length,
     main = "",
     xlab = "Petal Length",
     prob = TRUE)
lines(density(iris$Petal.Length),
      lty = "dashed", lwd = 2.5, col = "red")
# boxplot Sepal.Length
boxplot(iris$Sepal.Length,
        horizontal = TRUE,
        xlab = "Sepal Length")
# histogram: Sepal.Length
hist(iris$Sepal.Length,
     main = "",
     xlab = "Sepal Length",
     prob = TRUE)
lines(density(iris$Sepal.Length),
      lty = "dashed", lwd = 2.5, col = "blue")

# Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Petal Length",
            ylab = "Density",
            main = "P(Petal Length|Species)",
            col.line = "red"
)

densityplot(~ Sepal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Sepal Length",
            ylab = "Density",
            main = "P(Sepal Length|Species)",
            col.line = "blue"
)

# Linear regression in subgroups
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1]) # setosa
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w1, ])
tidy(reg)
w2 <- which(iris$Species == species[2]) # versicolor
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w2, ])
tidy(reg)
w3 <- which(iris$Species == species[3]) # virginica
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w3, ])
tidy(reg)

#### Dummy Coding: Species in the iris dataset
is.factor(iris$Species)
levels(iris$Species)
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg)
glance(reg)
# Never forget what the regression coefficient for a dummy variable means:
# It tells us about the effect of moving from the baseline level to the respective level!
# Here: baseline = setosa (cmp. levels(iris$Species) vs. the output of tidy(reg))
# NOTE: watch for the order of levels!
levels(iris$Species) # Levels: setosa versicolor virginica
iris$Species <- factor(iris$Species,
                       levels = c("versicolor", "virginica", "setosa"))
levels(iris$Species)
# baseline is now: versicolor
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg)
# The regression coefficients (!): figure out what has happened!

### another way to do dummy coding
rm(iris); data(iris) # ...just to fix the order of Species back to default
levels(iris$Species)
contrasts(iris$Species) <- contr.treatment(3, base = 1)
contrasts(iris$Species) # this is probably what you remember from your stats class...
iris$Species <- factor(iris$Species,
                       levels = c("virginica", "versicolor", "setosa"))
levels(iris$Species)
contrasts(iris$Species) <- contr.treatment(3, base = 1) # baseline is now: virginica
contrasts(iris$Species) # consider carefully what you need to do

### Petal.Length ~ Species (Dummy Coding) + Sepal.Length
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# BTW: since is.factor(iris$Species) == TRUE, R does the dummy coding in lm() for you
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients

# compare w. Simple Linear Regression
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients

### Comparing nested models
reg1 <- lm(Petal.Length ~ Sepal.Length, data = iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# reg1 is nested under reg2
# terminology: reg2 is a "full model"
# this terminology will be used quite often in Logistic Regression
# NOTE: Nested models
# There is a set of coefficients for the nested model (reg1) such that it
# can be expressed in terms of the full model (reg2); in our case it is simple
# HOME: - figure it out.
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
# NOTE: for the partial F-test, see:
# http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf

# Influence Plot
regFrame <- augment(reg2)
# influence data: the same measures are also available from influence.measures()
infReg <- as.data.frame(influence.measures(reg2)$infmat)
# data.frame for ggplot2
plotFrame <- data.frame(residual = regFrame$.std.resid,
                        leverage = regFrame$.hat,
                        cookD = regFrame$.cooksd)
ggplot(plotFrame, aes(y = residual, x = leverage)) +
  geom_point(size = plotFrame$cookD * 100, shape = 1) +
  ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
  theme(plot.title = element_text(size = 8, face = "bold")) +
  ylab("Standardized Residual") + xlab("Leverage")
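To see exactly which dummy (indicator) columns lm() builds from a factor, you can inspect the design matrix with model.matrix(); a minimal sketch, using the same formula as reg2 above:

# the design matrix that lm() constructs from the formula:
# with default treatment contrasts, Species expands into two indicator
# columns, while the baseline level (setosa) is absorbed by the intercept
X <- model.matrix(Petal.Length ~ Species + Sepal.Length, data = iris)
head(X)
colnames(X) # "(Intercept)" "Speciesversicolor" "Speciesvirginica" "Sepal.Length"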

#### Multiple Regression - by the book
# Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
# (that's from your reading list, to remind you...)
data(stackloss)
str(stackloss)
# Data set description
# URL: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
# Air.Flow represents the rate of operation of the plant.
# Water.Temp is the temperature of cooling water circulated through coils in the absorption tower.
# Acid.Conc. is the concentration of the circulating acid.
# stack.loss (the dependent variable) is 10 times the percentage of the ingoing ammonia
# that escapes from the absorption column unabsorbed;
# that is, an (inverse) measure of the overall efficiency of the plant.
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                     data = stackloss)
# let's see:
summary(stacklossModel)
glance(stacklossModel) # {broom}
tidy(stacklossModel) # {broom}

# predict new data
obs <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(stacklossModel, obs)

# confidence intervals
confint(stacklossModel, level = .95) # 95% CI
confint(stacklossModel, level = .99) # 99% CI
# 95% CI for Acid.Conc. only
confint(stacklossModel, "Acid.Conc.", level = .95)

# default regression diagnostic plots in R
plot(stacklossModel)
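Beyond the point prediction, predict() for lm objects can also return interval estimates; a short sketch reusing stacklossModel and obs from above:

# interval estimates around the point prediction:
# "confidence" = interval for the mean response at obs;
# "prediction" = wider interval for an individual new observation
predict(stacklossModel, obs, interval = "confidence", level = 0.95)
predict(stacklossModel, obs, interval = "prediction", level = 0.95)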

# multicollinearity
library(car) # John Fox's car package
VIF <- vif(stacklossModel)
VIF
sqrt(VIF)
# Variance Inflation Factor (VIF)
# The increase in the ***variance*** of a regression coefficient due to collinearity
# NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg. coefficient is vs. what it would be
# if there were no correlations with the other predictors in the model
# NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE)
Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
Tolerance
# NOTE: you can also inspect multicollinearity in the multiple regression model
# by conducting a Principal Component Analysis over the predictors;
# when the time is right.

#### R for partial and part (semi-partial) correlations
library(ppcor) # a good one; there are many ways to do this in R

#### partial correlation in R
dataSet <- iris
str(dataSet)
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method = "pearson")
irisPCor$estimate # partial correlations
irisPCor$p.value # results of significance tests
irisPCor$statistic # t-test on n-2-k degrees of freedom; k = num. of variables conditioned on
# partial correlation between x and y while controlling for z
partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                        dataSet$Sepal.Width,
                        method = "pearson")
partialCor$estimate
partialCor$p.value
partialCor$statistic
# NOTE:
# Formally, the partial correlation between X and Y given a set of n
# controlling variables Z = {Z1, Z2, ..., Zn}, written ρXY·Z, is the
# correlation between the residuals RX and RY resulting from the
# linear regression of X with Z and of Y with Z, respectively.
# The first-order partial correlation (i.e. when n = 1) is the difference
# between a correlation and the product of the removable correlations
# divided by the product of the coefficients of alienation of the
# removable correlations. [source: https://en.wikipedia.org/wiki/Partial_correlation]
# NOTE: coefficient of alienation = 1 - R2 (R2 = "r-squared")
# coefficient of alienation = the proportion of variance "unaccounted for"

#### semi-partial correlation in R
# NOTE: Semi-partial correlation is the correlation of two variables
# with variation from a third or more other variables removed only
# from the ***second variable***
# NOTE: The first variable <- rows, the second variable <- columns
# cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients (2015)
# Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University
# URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/
irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
irisSPCor$p.value
irisSPCor$statistic
partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                      dataSet$Sepal.Width,
                      method = "pearson")
# NOTE: this is the correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
# when the variance of dataSet$Petal.Length (2nd variable) due to dataSet$Sepal.Width
# is removed!
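The residual-based definition quoted in the NOTE above can be checked directly: regress both variables on the control variable and correlate the residuals. A minimal sketch reusing dataSet from above:

# partial correlation as a correlation of residuals:
# regress X and Y each on the control Z, then correlate the residuals
resX <- resid(lm(Sepal.Length ~ Sepal.Width, data = dataSet))
resY <- resid(lm(Petal.Length ~ Sepal.Width, data = dataSet))
cor(resX, resY) # should match partialCor$estimate from pcor.test() above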
partCor$estimate
partCor$p.value
partCor$statistic
# NOTE: In multiple regression, this is the semi-partial (or part) correlation
# that you need to inspect:
# assume a model with X1, X2, X3 as predictors, and Y as a criterion.
# You need a semi-partial correlation of X1 and Y following the removal of X2 and X3 from Y.
# It goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
# in Step 2, you take the residuals of Y, call them RY; in Step 3, you regress (correlate)
# RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation
# that you're looking for.
# Give a thought to the following discussion on categorical predictors:
# http://stats.stackexchange.com/questions/133203/partial-correlation-and-multiple-regression-controlling-for-categorical-variable
# What's your take on this?
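A minimal sketch of the three steps just described, with purely illustrative role assignments (Petal.Length standing in for Y, Sepal.Length for X1, and Sepal.Width and Petal.Width for the hypothetical X2 and X3):

# Step 1: regress the criterion Y on the other predictors X2 and X3
fitY <- lm(Petal.Length ~ Sepal.Width + Petal.Width, data = iris)
# Step 2: take the residuals of Y
RY <- resid(fitY)
# Step 3: correlate RY with X1 -> the part correlation described above
cor(RY, iris$Sepal.Length)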
Readings :: Session 8: Logistic Regression [16 June 2016, @Startit.rs, 19h CET]
- To be announced. In the meantime: a beautiful, concise theoretical background is available from Wikipedia.