Visualizing Principal Components

systematicinvestor

9 years ago

[This article was first published on Systematic Investor » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Principal Component Analysis (PCA) is a procedure that converts observations into linearly uncorrelated variables called principal components (Wikipedia). The PCA is a useful descriptive tool to examine your data. Today I will show how to find and visualize Principal Components.

Let’s look at the components of the Dow Jones Industrial Average index over 2012. First, I will download the historical prices and sector infromation for all components of the Dow Jones Industrial Average index.

###############################################################################
# Load Systematic Investor Toolbox (SIT)
# http://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
    source(con)
close(con)

	#*****************************************************************
	# Find Sectors for each company in DOW 30
	#****************************************************************** 
	tickers = spl('XLY,XLP,XLE,XLF,XLV,XLI,XLB,XLK,XLU')
	tickers.desc = spl('ConsumerCyclicals,ConsumerStaples,Energy,Financials,HealthCare,Industrials,Materials,Technology,Utilities')
	
	sector.map = c()
	for(i in 1:len(tickers)) {
		sector.map = rbind(sector.map, 
				cbind(sector.spdr.components(tickers[i]), tickers.desc[i])
			)
	}
	colnames(sector.map) = spl('ticker,sector')

	#*****************************************************************
	# Load historical data
	#****************************************************************** 
	load.packages('quantmod')	
	tickers = dow.jones.components()
	
	sectors = factor(sector.map[ match(tickers, sector.map[,'ticker']), 'sector'])
		names(sectors) = tickers
	
	data <- new.env()
	getSymbols(tickers, src = 'yahoo', from = '2000-01-01', env = data, auto.assign = T)
		for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T)	
	
	bt.prep(data, align='keep.all', dates='2012')
	
	# re-order sectors, because bt.prep can change the order of tickers
	sectors = sectors[data$symbolnames]
	
	# save data for later examples
	save(data, tickers, sectors, file='bt.pca.test.Rdata')

Next, let’s run the Principal Component Analysis (PCA) on the companies returns during 2012 and plot percentage of variance explained for each principal component.

	#*****************************************************************
	# Principal component analysis (PCA), for interesting discussion
	# http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html
	#****************************************************************** 
	prices = data$prices	
	ret = prices / mlag(prices) - 1
	
	p = princomp(na.omit(ret))
	
	loadings = p$loadings[]
	p.variance.explained = p$sdev^2 / sum(p$sdev^2)

	# plot percentage of variance explained for each principal component	
	barplot(100*p.variance.explained, las=2, xlab='', ylab='% Variance Explained')

The first principal component, usually it is market returns, explains around 45% of variance during 2012.

Next let’s plot all companies loadings on the first and second principal components and highlight points according to the sector they belong.

	#*****************************************************************
	# 2-D Plot
	#****************************************************************** 		
	x = loadings[,1]
	y = loadings[,2]
	z = loadings[,3]
	cols = as.double(sectors)
	
	# plot all companies loadings on the first and second principal components and highlight points according to the sector they belong
	plot(x, y, type='p', pch=20, col=cols, xlab='Comp.1', ylab='Comp.2')
	text(x, y, data$symbolnames, col=cols, cex=.8, pos=4)
	
	legend('topright', cex=.8,  legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n')

Please notice that the companies in the same sector tend to group together on the plot.

Next, let’s go one step further and create a 3D plot using first, second, and third principal components

	#*****************************************************************
	# 3-D Plot, for good examples of 3D plots
	# http://statmethods.wordpress.com/2012/01/30/getting-fancy-with-3-d-scatterplots/
	#****************************************************************** 				
	load.packages('scatterplot3d') 
	
	# plot all companies loadings on the first, second, and third principal components and highlight points according to the sector they belong
	s3d = scatterplot3d(x, y, z, xlab='Comp.1', ylab='Comp.2', zlab='Comp.3', color=cols, pch = 20)
		
	s3d.coords = s3d$xyz.convert(x, y, z)
	text(s3d.coords$x, s3d.coords$y, labels=data$symbolnames, col=cols, cex=.8, pos=4)
		
	legend('topleft', cex=.8,  legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n')

The 3D chart does add a bit of extra info, but I find the 2D chart easier to look at.

In the next post, I will demonstrate clustering based on the selected Principal components and after that I want to explore the interesting discussion presented in the using PCA for spread trading post.

Happy Holidays

To view the complete source code for this example, please have a look at the bt.pca.test() function in bt.test.r at github.

To leave a comment for the author, please follow the link and comment on their blog: Systematic Investor » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.