Visualizing principal components with R and Sochi Olympic Athletes
Principal Components Analysis (PCA) is a dimensionality reduction method. Here we explain PCA step by step using data about Sochi Olympic curlers.
It is hard to visualize a high-dimensional space. When I took linear algebra, the book and teachers spoke about it as if it were easy to visualize a hyperspace, but later, when I took the Coursera course Neural Networks for Machine Learning, Geoffrey Hinton gave the wise advice, “To deal with a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it.” In other words, people cannot visualize a high-dimensional space, so we use a simpler problem (two dimensions of Olympic athlete data) to explain PCA.
First, we have one-dimensional data, where the only dimension is the curler’s height.
Next, we add a second dimension: the curler’s weight. Notice there is a strong correlation between height and weight. Because of this redundancy, two dimensions are not necessary to represent most of the information.
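As a quick sanity check on that claim, cor() gives the correlation directly. A minimal sketch, using the same inline curler data that appears in the full code listing below:

# Pearson correlation between the curlers' height and weight.
# Data copied from the code listing later in this post.
ath <- data.frame(
  height = c(1.73, 1.78, 1.70, 1.73, 1.71, 1.93, 1.70,
             1.69, 1.84, 1.75, 1.83, 1.80, 1.80, 1.64),
  weight = c(66, 84, 74, 66, 73, 80, 58, 60, 88, 85, 80, 71, 85, 69)
)
cor(ath$height, ath$weight)  # a value near 1 means the two dimensions are largely redundant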
By the way, if you look carefully at the first two images, notice the horizontal placement of the curlers is identical: adding the second axis moves the curlers only vertically.
After performing PCA, there are two principal components. Because we want to reduce two dimensions to one, we ignore the second principal component and project the data onto the first component, shown as red squares. The black lines join each original point (green) to its projection (red) onto a one-dimensional line.
The blue line illustrates the first principal component. It is onto this one-dimensional line that the two-dimensional space is projected.
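For the curious, the projection arithmetic is simple: take each point’s score on the first component, multiply by the component’s unit direction vector, and add back the column means that PCA removed. A minimal sketch (the variable names here are mine, not the listing’s):

# Project the 2-D curler data onto the first principal component "by hand".
# Assumes the `ath` data frame from the snippet above.
x  <- as.matrix(ath)
pc <- prcomp(x, center = TRUE, scale. = FALSE)

v1     <- pc$rotation[, 1]               # unit vector along the first component
scores <- pc$x[, 1]                      # each point's coordinate on that axis
proj   <- outer(scores, v1)              # n x 2 matrix of projected (centered) points
proj   <- sweep(proj, 2, pc$center, "+") # undo the centering: back to meters and kilograms

head(proj)  # these coordinates are the red squares in the plot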
Now we can show the same projections from the previous graph on their own one-dimensional strip chart, which captures most of the variation of the two-dimensional space in a single dimension.
So, in general, PCA reduces the number of dimensions by projecting high-dimensional data into a lower-dimensional space. With higher-dimensional data, it is often useful to keep more of the principal components. For graphing, two or three principal components are retained. For other purposes, the optimal number of components may be chosen using a scree plot, or as the minimum number of components that captures some percentage of the variation, say 90% (see the sketch below).
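To illustrate both selection rules, here is a short sketch; the 90% threshold is just the example figure from the paragraph above:

# Scree plot: look for the "elbow" where extra components add little.
pc <- prcomp(as.matrix(ath), center = TRUE, scale. = FALSE)
screeplot(pc, type = "lines", main = "Scree plot")

# Alternatively, keep the smallest number of components whose
# cumulative proportion of variance reaches 90%.
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
which(cumsum(var_explained) >= 0.90)[1]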
Here is the R code.
# Read data from CSV
# Download from http://www.danasilver.org/static/assets/sochi-2014-athletes/athletes.csv
# See below for faster option.
athletes <- read.csv('athletes.csv')

# Subset data: curlers' height and weight, complete cases only
ath <- athletes[athletes$sport == 'Curling', c('height', 'weight')]
ath <- ath[complete.cases(ath), ]

# ALTERNATIVELY, instead of downloading:
ath <- structure(list(
  height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93, 1.7, 1.69,
             1.84, 1.75, 1.83, 1.8, 1.8, 1.64),
  weight = c(66L, 84L, 74L, 66L, 73L, 80L, 58L, 60L,
             88L, 85L, 80L, 71L, 85L, 69L)),
  .Names = c("height", "weight"),
  row.names = c(536L, 624L, 640L, 820L, 930L, 949L, 1191L,
                1632L, 1818L, 2349L, 2583L, 2609L, 2641L, 2696L),
  class = "data.frame")

# Plot 1 dimension (just height)
png('pca1-stripchart.png')
stripchart(ath$height, col = "green", pch = 19, cex = 2,
           xlab = "Height (m)",
           main = "Curlers at Sochi 2014 Winter Olympics")
dev.off()

# Plot 2 dimensions
x <- as.matrix(ath)
plot2d <- function(col = 3) {
  plot(x, col = col, pch = 19, cex = 2,
       xlab = "Height (m)", ylab = "Weight (kg)",
       main = "Curlers at Sochi 2014 Winter Olympics")
}
png('pca2-scatterplot.png')
plot2d()
dev.off()

# Perform PCA
pcX <- prcomp(x, retx = TRUE, scale. = FALSE, center = TRUE)

# Transform points: project each observation onto the first principal
# component (scores on PC1 times the PC1 loading vector), then undo
# the centering to return to the original data space.
transformed <- pcX$x[, 1] %*% t(pcX$rotation[, 1])
transformed <- scale(transformed, center = -pcX$center, scale = FALSE)

# Plot PCA projection
plot_pca <- function() {
  plot2d()
  points(transformed, col = 2, pch = 15, cex = 2)
  segments(x[, 1], x[, 2], transformed[, 1], transformed[, 2])
}
png('pca3-pca-projection.png')
plot_pca()
dev.off()

# Draw first principal component over scatterplot
png('pca4-first-component-on-scatterplot.png')
plot_pca()
lm.fit <- lm(transformed[, 2] ~ transformed[, 1])
abline(lm.fit, col = "blue")
dev.off()

# Plot first principal component by itself
png('pca5-first-component-stripchart.png')
stripchart(pcX$x[, 1], col = "red", cex = 2, pch = 15,
           xlab = "First principal component")
dev.off()
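One design choice worth flagging: the listing runs PCA on the raw units (meters versus kilograms), so the variable with the larger numeric spread, weight, dominates the first component. That is fine for this demonstration, but when variables are on incomparable scales it is common to standardize them first. A sketch, not part of the original post:

# Standardize each column to unit variance before PCA, so height and
# weight contribute on a comparable scale.
pc_scaled <- prcomp(x, center = TRUE, scale. = TRUE)
summary(pc_scaled)  # proportion of variance explained by each component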
This was tested on R 3.0.2 (64-bit). Thank you to Dana Silver for the Sochi athlete data and to cbeleites for explaining how to plot PCA projections with line segments.