
LR01: Correlation


This is the first of a series of posts on the subject of linear regression, using R for computational demonstrations and examples. I hope you find it useful, but I am aware it may contain typos and conceptual errors (mostly when I try to think instead of just repeating what others thought…). Help correcting/improving these notes is appreciated. This first post deals with the subject of correlation.



## install.packages("UsingR")
library(UsingR)

## Pearson's data
## Father and son data
data(father.son)
dim(father.son)
## [1] 1078    2
str(father.son)
## 'data.frame':	1078 obs. of  2 variables:
##  $ fheight: num  65 63.3 65 65.8 61.1 ...
##  $ sheight: num  59.8 63.2 63.3 62.8 64.3 ...
head(father.son)   ## First six pairs
##    fheight  sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
tail(father.son)   ## Last six pairs
##       fheight  sheight
## 1073 67.70657 59.81693
## 1074 66.99681 70.75232
## 1075 71.33181 68.26774
## 1076 71.78314 69.30589
## 1077 70.73837 69.30199
## 1078 70.30609 67.01500
x <- father.son$fheight
y <- father.son$sheight

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80),
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80),
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

abline(a = 0, b = 1, lty = 2)   ## dashed line y = x: son exactly as tall as his father

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80),
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

abline(a = 0, b = 1, lty = 2)

## Families where father is 72 inches tall (to the nearest inch)
## plotted in a vertical strip.
abline(v=c(71.5, 72.5), lty = 3)
with(subset(father.son, fheight >= 71.5 & fheight < 72.5),
     points(fheight, sheight, col="red"))

The (Pearson) correlation coefficient is defined as

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, $$

where $(x_1, y_1)$, $(x_2, y_2)$, $\cdots$, $(x_n, y_n)$ are the $n$ sample pairs, and $\bar{x}$ and $\bar{y}$ are the sample means.
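A compact R translation of this formula (a sketch added here, not part of the original post, which builds the computation up step by step below; the name pearson_r is made up):

pearson_r <- function(x, y) {
  ## numerator: sum of the products of deviations from the means
  ## denominator: roots of the sums of squared deviations (no division by n - 1)
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}
pearson_r(x, y)   ## should agree with cor(x, y)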

## Generating correlated Normal data

diffr1 <- function(n, rho, SDx = 1, SDy = 1) {
  meanx <- 3; meany <- 3
  
  x1 <- rnorm(n = n)
  x2 <- rnorm(n = n)
  x3 <- rho*x1 + sqrt(1-rho^2)*x2   ## standard normal, with correlation rho with x1
  
  x <- meanx + SDx*x1
  y <- meany + SDy*x3
  
  r <- round(cor(x, y), 3)
  
  plot(x, y, xlim = c(0,6), ylim = c(0,6),
       xaxs = "i", yaxs = "i", main = paste("rho =", rho, "; r = ", r))
}
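A quick sanity check of the construction used in diffr1 (added here, not in the original post): if x1 and x2 are independent standard normals, then rho*x1 + sqrt(1 - rho^2)*x2 is again standard normal and has correlation rho with x1.

## Sanity check (illustrative values; not in the original post)
set.seed(42)
rho_chk <- 0.6
z1 <- rnorm(1e5)
z2 <- rnorm(1e5)
z3 <- rho_chk*z1 + sqrt(1 - rho_chk^2)*z2
c(sd(z3), cor(z1, z3))   ## should be close to 1 and 0.6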

## Alternative, using multivariate normal.
library(mvtnorm)

diffr2 <- function(n, rho, SDx = 1, SDy = 1) {
  ## mean vector
  mu <- c(3, 3)
  ## variance-covariance matrix.
  sigma <- matrix(c(SDx^2      , rho*SDx*SDy,
                    rho*SDx*SDy, SDy^2), nrow = 2, byrow = TRUE)
  
  sim <- rmvnorm(n, mu, sigma)
  x <- sim[, 1]
  y <- sim[, 2]
  
  r <- round(cor(x, y), 3)
  
  plot(x, y, xlim = c(0,6), ylim = c(0,6),
       xaxs = "i", yaxs = "i", main = paste("rho =", rho, "; r = ", r))
}
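For completeness (a usage sketch added here; the original post only runs alternative 1), the second alternative has exactly the same interface:

## Not run in the original post
diffr2(rho = 0.80, n = 50)   ## one scatterplot of bivariate normal data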

## Let's use alternative 1
diffr <- diffr1
set.seed(123)
par(mai = c(.2, .2, .2, .2), mgp = c(1.5, 0.2, 0),
    tck = -.01, mfrow = c(1,3))
diffr(rho = 0.80, n = 50)
diffr(rho = 0, n = 50)
diffr(rho = -0.80, n = 50)
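For completeness, the extreme values of rho can be simulated as well (extra calls added here, not in the original post); all the points then fall exactly on a straight line:

diffr(rho = 1, n = 50)    ## perfect positive linear association: r = 1
diffr(rho = -1, n = 50)   ## perfect negative linear association: r = -1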

## What happens if r=1?
plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80),
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

Let’s get the averages of the fathers’ and the sons’ heights using R:

(meanx <- mean(x))
## [1] 67.6871
(meany <- mean(y))
## [1] 68.68407
plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80),
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

## Point of averages
abline(v=meanx, col="green")
abline(h=meany, col="green")

Let’s calculate those products, $(x_i - \bar{x})(y_i - \bar{y})$, with R:

num_prod <- (x-meanx)*(y-meany)

## See the first results
num_prod[1:30]
##  [1] 23.4987260 24.2659096 14.5921950 11.3980443 28.8386686 20.7193070
##  [7] 10.6602839 13.8920687  6.6026866  3.3860013 29.8522714 15.8488740
## [13] 12.0660676 11.7337326 10.1656882  9.8610550  4.8145636  6.6206614
## [19]  1.1508456  2.9634055 -0.3923032 -5.8598975 10.8071150  8.9388582
## [25]  8.1936002  7.3768431  8.1223945  4.3498337  6.4173945  5.7854412

Most are positive, but some are negative.
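A quick count confirms this (an added check, not in the original post):

table(num_prod >= 0)   ## FALSE = negative products, TRUE = positive (or zero) products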

col <- ifelse(num_prod >= 0, "blue", "red")
plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80),
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i", col = col,
     main = "Pearson's data",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

abline(v=meanx, col="green")
abline(h=meany, col="green")

(sum_pos_num_prod <- sum(num_prod[num_prod >= 0]))
## [1] 4975.409
(sum_neg_num_prod <- sum(num_prod[num_prod < 0]))
## [1] -803.8301

Then:

(total_sum_num_prod <- sum_pos_num_prod + sum_neg_num_prod)
## [1] 4171.579

will be positive.
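Of course, the same total can be obtained directly (an added check, not in the original post):

sum(num_prod)   ## same value as total_sum_num_prod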

(kind_of_sdx <- sqrt(sum((x - meanx)^2)))
## [1] 90.08021
(kind_of_sdy <- sqrt(sum((y - meany)^2)))
## [1] 92.37197

and we are set!

(r <- total_sum_num_prod / (kind_of_sdx * kind_of_sdy))
## [1] 0.5013383
cor(x, y)
## [1] 0.5013383
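As a side note (added here, not in the original post), the “kind of” standard deviations above are just the usual sample standard deviations multiplied by $\sqrt{n - 1}$, a fact used in the rewriting that follows:

sd(x) * sqrt(length(x) - 1)   ## equals kind_of_sdx
sd(y) * sqrt(length(y) - 1)   ## equals kind_of_sdy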

The formula for $r$ above is the sampling counterpart of the definition that we gave for $\rho$ (the population correlation coefficient). Remember? No? OK, here it goes again:

$$ \rho = E\left[\text{su}(X)\,\text{su}(Y)\right] = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right] = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}, $$

where $\text{su}$ stands for “standard units” (subtract the mean, then divide by the standard deviation). In fact, dividing the numerator and the denominator of $r$ by $n - 1$ gives

$$ r = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x\,s_y}, $$

leading to:

$$ r = \frac{\operatorname{cov}(x, y)}{s_x\,s_y}, $$

that resembles the population counterpart, and we can easily “verify” with R:

(r <- cov(x, y)/ (sd(x)*sd(y)))
## [1] 0.5013383

Writing the covariance and the standard deviations out explicitly,

$$ r = \frac{\operatorname{cov}(x, y)}{s_x\,s_y} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{1}{n-1}\sum_{i=1}^{n}\text{su}(x_i)\,\text{su}(y_i), $$

shows that $r$ is nothing other than the sum of the products of the variables in standard units, divided by $n - 1$.

Again, using R for “confirmation”:

su <- function(x)
  (x-mean(x))/sd(x)

sux <- su(x)
suy <- su(y)

n <- length(x)         ## Could also be length(y)

sum(sux * suy) / (n - 1)
## [1] 0.5013383
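The same computation can be written with R’s built-in scale(), which converts a vector to standard units (an equivalent one-liner added here, not in the original post):

sum(scale(x) * scale(y)) / (n - 1)   ## scale() subtracts the mean and divides by the sd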

(It wouldn’t be a bad idea to try to derive this version yourself. You never know.)

Facts about the sample correlation coefficient:

  1. $\rho$ and $r$ generally differ ($\rho$ is a population parameter, $r$ its sample counterpart), and both are pure numbers, without units.
  2. The sign of $r$ gives the direction of the linear association:

    • positive $r$ $\Longleftrightarrow$ positive linear association (as with father and son’s height data)
    • negative $r$ $\Longleftrightarrow$ negative linear association
  3. $|r| \leq 1$ (why? See the Cauchy-Schwarz inequality if interested).
  4. $r(x, y) = r(y, x)$.
  5. $r(x, y) = r(x, c + y)$ ($c$ is any constant).
  6. $r(x, y) = r(x, c \cdot y)$ ($c$ is a positive constant; a negative constant changes the sign of $r$).
  7. Combining 5 and 6: $r(x, y) = r(x, c_1 + c_2 \cdot y)$ for any constant $c_1$ and any positive constant $c_2$.

“Proofs” using R:

cor(x, y)
## [1] 0.5013383
## 4
cor(y, x)
## [1] 0.5013383
## 5
cor(x, 5 + y)
## [1] 0.5013383
## 6
cor(x, 3 * y)
## [1] 0.5013383
## 7 (5 and 6)
cor(x, 5 + 3 * y)
## [1] 0.5013383
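Two more quick checks one might add (not part of the original “proofs”): a negative scaling constant flips the sign of $r$ (fact 6), and an exact linear function of $x$ attains the bound of fact 3:

cor(x, -3 * y)      ## equals -cor(x, y): a negative constant flips the sign
cor(x, 5 + 3 * x)   ## exactly 1: perfect linear association gives |r| = 1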

Next

LR02: SD line, GoA, Regression
