[This article was first published on everyday analytics, and kindly contributed to R-bloggers].
Introduction
I work in consulting. If you’re a consultant at a certain type of company, agency, organization, consultancy, whatever, this can sometimes mean travelling a lot.
Many business travellers ‘in the know’ have heard the old joke that if you want to stay at any type of hotel anywhere in the world and get a great rate, all you have to do is say that you work for IBM.
The point is that my line of business requires travel, and sometimes that is a lot of the time, like say almost all of last year. Inevitable comparisons to George Clooney’s character in Up in the Air were made (ironically I started to read that book, then left it on a plane in a seatback pocket), requests about favours involving duty free, and of course many observations and gently probing questions about frequent flier miles (FYI I’ve got more than most people, but a lot less than the entrepreneur I sat next to one time, who claimed to have close to 3 million).
But I digress.
Background
The point is that, as I said, I spent quite a bit of time travelling for work last year. Apparently the story with frequent fliers miles is that it’s best just to pick one airline and stick with it – and this also worked out well as most companies, including my employer, have preferred airlines and so you often don’t have much of a choice in the matter.
In my case this means flying Delta.
So I happened to notice in one of my many visits to Delta’s website that they have data on all of their aircraft in a certain site section. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta’s fleet, coming at it with ‘fresh eyes’? Which planes are similar? Which are dissimilar?
The data set comprises 33 variables on 44 aircraft taken from Delta.com, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, as well as categorical data on, say, whether a particular aircraft has Wi-Fi or video. These binary categorical variables were transformed into quantitative variables by assigning them values of either 1 or 0, for yes or no respectively.
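As a minimal sketch of that conversion step, assuming the raw columns came in as 'yes'/'no' text (the cleaned file used below already has them as 1/0), it is a one-liner per column in R:

# Hypothetical example: recode a yes/no amenity column as 1/0
# (assumes a column named Wifi holding the strings "yes"/"no")
data$Wifi <- ifelse(data$Wifi == "yes", 1, 0)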
Aircraft data card from Delta.com
Analysis
As this is a data set with many variables (33), I thought it would be an interesting opportunity to practice using a dimensionality reduction method to make the information easier to visualize and analyze.
First let’s just look at the intermediary quantitative variables related to the aircraft physical characteristics: cruising speed, total accommodation, and other quantities like length and wingspan. These variables sit roughly in the middle of the data frame, so we can visualize all of them at once using a scatterplot matrix, which is what R produces by default when plot() is called on a data frame.
data <- read.csv(file="delta.csv", header=T, sep=",", row.names=1)

# Scatterplot matrix of intermediary (size/non-categorical) variables
plot(data[,16:22])
We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft’s overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware.
The exception here is the variable right in the middle, which is the number of engines. There is one lone outlier [the Boeing 747-400 (74S)], which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall see as the analysis progresses that this is not really important, as there are other variables which discern the aircraft from one another much more strongly.
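If you want to confirm which aircraft that outlier is, a quick sketch (assuming the engines column is named Engines, as it appears in the loadings output further down):

# Identify the aircraft with more than two engines
row.names(data)[data$Engines > 2]
# should return the lone four-engined Boeing 747-400 (74S)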
How do we more easily visualize a high-dimensional data set like this one? By using a dimensionality reduction technique like principal components analysis.
Principal Components Analysis
Next let’s say I know nothing about dimensionality reduction techniques and just naively apply principal components to the data in R:
# Naively apply principal components analysis to raw data and plot
pc <- princomp(data)
plot(pc)

Taking that approach, we can see that the first principal component has a standard deviation of around 2200 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principal component is just the range in miles.
# First component dominates greatly. What are the loadings?
summary(pc) # 1 component has > 99% variance
loadings(pc) # Can see all variance is in the range in miles

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 2259.2372556 6.907940e+01 2.871764e+01 2.259929e+01
Proportion of Variance 0.9987016 9.337038e-04 1.613651e-04 9.993131e-05
Cumulative Proportion 0.9987016 9.996353e-01 9.997966e-01 9.998966e-01
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Seat.Width..Club. -0.144 -0.110
Seat.Pitch..Club. -0.327 -0.248 0.189
Seat..Club.
Seat.Width..First.Class. 0.250 -0.160 -0.156 0.136
Seat.Pitch..First.Class. 0.515 -0.110 -0.386 0.112 -0.130 0.183
Seats..First.Class. 0.258 -0.124 -0.307 -0.109 0.160 0.149
Seat.Width..Business. -0.154 0.142 -0.108
Seat.Pitch..Business. -0.514 0.446 -0.298 0.154 -0.172 0.379
Seats..Business. -0.225 0.187
Seat.Width..Eco.Comfort. 0.285 -0.224
Seat.Pitch..Eco.Comfort. 0.159 0.544 -0.442
Seats..Eco.Comfort. 0.200 -0.160
Seat.Width..Economy. 0.125 0.110
Seat.Pitch..Economy. 0.227 0.190 -0.130
Seats..Economy. 0.597 -0.136 0.345 -0.165 0.168
Accommodation 0.697 -0.104 0.233
Cruising.Speed..mph. 0.463 0.809 0.289 -0.144 0.115
Range..miles. 0.999
Engines
Wingspan..ft. 0.215 0.103 -0.316 -0.357 -0.466 -0.665
Tail.Height..ft. -0.100 -0.187
Length..ft. 0.275 0.118 -0.318 0.467 0.582 -0.418
Wifi
Video
Power
Satellite
Flat.bed
Sleeper
Club
First.Class
Business
Eco.Comfort
Economy
This is because the scales of the different variables in the data set differ greatly; we can see this by plotting the variances of the columns in the data frame (linear scale on the left, logarithmic on the right):
# Verify by plotting variance of columns
mar <- par()$mar
par(mar=mar+c(0,5,0,0))
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8)
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8, log='x')
par(mar=mar)
We correct for this by scaling the data using the scale() function. We can then verify that the variances across the different variables are equal so that when we apply principal components one variable does not dominate.
# Scale
data2 <- data.frame(scale(data))

# Verify variance is uniform
plot(sapply(data2, var))
After applying the scale() function the variance is now constant across variables
Now we can apply principal components to the scaled data. Note that this can also be done automatically in a call to the prcomp() function by setting the parameter scale=TRUE. Now we see a result which is more along the lines of what we would expect:
# Proceed with principal components
pc <- princomp(data2)
plot(pc)
plot(pc, type='l')
summary(pc) # 4 components is both 'elbow' and explains >85% variance
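As an aside, here is a minimal sketch of that one-step alternative mentioned above (the documented argument name in prcomp() is scale., and this assumes no column in the data is constant):

# Equivalent approach: let prcomp() center and scale the raw data itself
pc_alt <- prcomp(data, scale. = TRUE)
summary(pc_alt)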
Great, so now we’re in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type, two of which I’ve read about are:
- Pick the number of components which explain 85% or greater of the variation
- Use the ‘elbow’ method of the scree plot (on right)
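The first rule can be checked directly from the cumulative proportion of variance explained; a quick sketch using the princomp object from above:

# Cumulative proportion of variance explained by the components
vars <- pc$sdev^2
cumsum(vars) / sum(vars)  # look for where this first exceeds 0.85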
Here we are fortunate in that these two agree, so we will retain the first four principal components. We put these into a new data frame and plot.
# Get principal component vectors using prcomp instead of princomp
pc <- prcomp(data2)

# First four principal components
comp <- data.frame(pc$x[,1:4])

# Plot
plot(comp, pch=16, col=rgb(0,0,0,0.5))
So what we are looking at here are twelve 2-D projections of data which live in a 4-D space. You can see there’s a clear outlier in all the dimensions, as well as some bunching together in the different projections.
Normally, I am a staunch opponent of 3D visualization, as I’ve spoken strongly about previously. The one exception to this rule is when the visualization is interactive, which allows the user to explore the space and not lose meaning due to three dimensions being collapsed into a 2D image. Plus, in this particular case, it’s a good excuse to use the very cool, very awesome rgl package.
Click on the images to view the interactive 3D versions (requires a modern browser). You can see better in the 3D projections that the data are confined mainly to one plane on the left (components 1-3), with the exception of the outlier, and that there is also bunching in the other dimensions (components 1, 3, 4 on the right).
library(rgl)

# Multi 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3)
plot3d(comp$PC1, comp$PC3, comp$PC4)
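For the curious, one way to export such scenes as standalone interactive HTML pages is sketched below (not necessarily how the linked versions were produced; it assumes a recent rgl plus the htmlwidgets package, and the file name is just a placeholder):

library(rgl)
library(htmlwidgets)

# Draw the scene, capture it as a widget, and save to a self-contained HTML page
plot3d(comp$PC1, comp$PC2, comp$PC3)
saveWidget(rglwidget(), "pca_components_1_2_3.html")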
So, now that we’ve simplified the complex data set into a lower dimensional space we can visualize and work with, how do we find patterns in the data, in our case, the aircraft which are most similar? We can use a simple unsupervised machine learning technique like clustering.
Cluster Analysis
Here because I’m not a data scientist extraordinaire, I’ll stick to the simplest technique and do a simple k-means – this is pretty straightforward to do in R.
First, how do we determine the number of clusters? The simplest method is to look at the within-groups sum of squares and pick the ‘elbow’ in the plot, similar to the scree plot we made for the PCA previously. Here I used the code from R in Action:
# Determine number of clusters (applied to the principal component scores in comp)
wss <- (nrow(comp)-1)*sum(apply(comp, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(comp, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
However, it should be noted that it is very important to set the nstart and iter.max parameters (I’ve found 25 and 1000, respectively, to be okay values to use), which the example in Quick-R fails to do; otherwise you can get very different results each time you run the algorithm, as below.
Clustering without the nstart parameter can lead to variable results for each run
Clustering with the nstart and iter.max parameters leads to consistent results, allowing proper interpretation of the scree plot
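For reference, the same scree-plot loop with those two parameters set (a sketch using the values mentioned above):

# Within-groups sum of squares, with nstart and iter.max set for stable results
wss <- (nrow(comp)-1)*sum(apply(comp, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(comp, centers=i, nstart=25, iter.max=1000)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")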
Here we can see that the “elbow” in the scree plot is at k = 4, so we apply the k-means clustering function with k = 4 and plot.
# From scree plot, elbow occurs at k = 4
# Apply k-means with k = 4
k <- kmeans(comp, 4, nstart=25, iter.max=1000)

library(RColorBrewer)
library(scales)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(comp, col=k$clust, pch=16)
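From here the natural next step is to see which aircraft landed in which cluster; a minimal sketch (keeping in mind that the cluster numbering itself is arbitrary and can change between runs):

# Cluster sizes
table(k$cluster)

# List the aircraft assigned to each cluster
for (i in 1:4) {
  cat("Cluster", i, ":\n")
  print(row.names(data)[k$cluster == i])
}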