In a previous post, I introduced freqparcoord, my new package with Yingkang Xie. Here I’ll illustrate some other uses of the package.
The freqparcoord package visualizes multivariate data by plotting the most frequent cases in the data, as defined by multivariate density estimation. The example in the previous post illustrated typical height, weight and age differences by position among baseball players.
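To make "most frequent" concrete, here is a minimal base-R sketch that ranks cases by a crude k-nearest-neighbor density estimate. The helper `knn_density()` and the simulated data are hypothetical, written only for illustration; the estimator inside freqparcoord may well differ in its details.

```r
# Hypothetical helper, not part of freqparcoord: a crude k-nearest-neighbor
# density estimate. A point whose k-th nearest neighbor is close lies in a
# dense region, so the density is taken as proportional to 1 / r_k^d.
knn_density <- function(x, k) {
  x <- scale(as.matrix(x))          # center and scale each column
  d <- as.matrix(dist(x))           # all pairwise Euclidean distances
  # After sorting a row, element 1 is the point itself (distance 0),
  # so element k + 1 is the distance to the k-th nearest neighbor.
  rk <- apply(d, 1, function(row) sort(row)[k + 1])
  1 / rk^ncol(x)
}

set.seed(1)
# Made-up data: rows 1-200 form a tight clump, rows 201-250 are spread thinly.
x <- rbind(matrix(rnorm(400, sd = 0.3), ncol = 2),
           matrix(runif(100, 3, 10), ncol = 2))

dens <- knn_density(x, k = 10)
most_frequent <- order(dens, decreasing = TRUE)[1:5]  # rows of the 5 densest cases
most_frequent
```

The rows selected this way are the ones that would be drawn as lines in a parallel coordinates plot; sorting in the other direction would pick out the sparsest cases instead.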
But the package also allows plotting the least frequent cases, which is perfect for outlier hunting. Let’s apply this to the baseball data, say finding 3 outliers for each position:
> library(freqparcoord)
> data(mlb)
> freqparcoord(mlb,-3,4:6,7)
Here columns 4:6 are height, weight and age, while column 7 is position; under default settings, the data will be faceted vertically by position. Here is what is displayed:
All variables are centered and scaled. To get a better idea as to what the extreme values are, we can set the keepidxs argument, and then show the original data corresponding to the displayed lines:
> p <- freqparcoord(mlb,-3,4:6,7,keepidxs=4)
> p
> p$xdisp[,c(1,4:7)]
               Name Height Weight   Age PosCategory
237  Ivan_Rodriguez     69    218 35.25     Catcher
994      So_Taguchi     70    163 37.66  Outfielder
674    Julio_Franco     73    188 48.52   Infielder
964     Barry_Bonds     74    228 42.60  Outfielder
36        Toby_Hall     75    240 31.36     Catcher
35  A.J._Pierzynski     75    245 30.17     Catcher
891  Mike_Restovich     76    257 28.16  Outfielder
547      Tony_Clark     79    245 34.71   Infielder
155   C.C._Sabathia     79    290 26.61     Pitcher
275   Richie_Sexson     80    237 32.17   Infielder
559   Randy_Johnson     82    231 43.47     Pitcher
910       Jon_Rauch     83    260 28.42     Pitcher
Sexson was flagged due to his height and weight, while Franco emerged because he is 48 years old!
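The idea of scoring cases on the scaled data, group by group, and then looking up the original rows can be sketched in base R. Everything here is an assumption made for illustration: the toy data frame is made up (it is not mlb), and the k-th-nearest-neighbor distance is used as a simple sparseness score rather than the package's actual computation.

```r
# Hypothetical toy data standing in for mlb; all values are made up.
set.seed(2)
toy <- data.frame(
  height = c(rnorm(60, 73, 1), 83),     # row 61 is unusually tall ...
  weight = c(rnorm(60, 200, 8), 260),   # ... and unusually heavy
  pos    = rep(c("A", "B"), c(30, 31))
)

z <- scale(toy[, c("height", "weight")])  # centered and scaled variables

# Within each group, score every row by the distance to its k-th nearest
# neighbor: a large distance means a sparse region, i.e. a candidate outlier.
k <- 5
score <- numeric(nrow(toy))
for (g in unique(toy$pos)) {
  idx <- which(toy$pos == g)
  d <- as.matrix(dist(z[idx, ]))
  score[idx] <- apply(d, 1, function(row) sort(row)[k + 1])
}

# Row index of the most isolated case in each group ...
out_idx <- sapply(split(seq_len(nrow(toy)), toy$pos),
                  function(idx) idx[which.max(score[idx])])

# ... used to show the original, unscaled values.
toy[out_idx, ]
```

Keeping the row indices is what makes it possible to read off the extreme cases in their original units, which is the role keepidxs plays in the call above.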
Our package can also be used for cluster hunting. Here we again look for the most frequent cases, but now locally most frequent, meaning that their estimated density values are local maxima.
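As a rough illustration of "locally most frequent", the base-R sketch below marks a point as a local maximum when none of its nearest neighbors has a higher estimated density. The names `k` and `nbhd` are hypothetical stand-ins chosen for this sketch, the data are simulated, and the k-nearest-neighbor density estimate is an assumption; none of this is taken from the package's source.

```r
set.seed(3)
# Two well-separated bivariate clumps, with true modes near (0,0) and (6,6).
x <- rbind(matrix(rnorm(600), ncol = 2),
           matrix(rnorm(600, mean = 6), ncol = 2))
n <- nrow(x)
k <- 30      # neighbors used for the density estimate (assumed role)
nbhd <- 100  # neighborhood size for the local-maximum check (assumed role)

d <- as.matrix(dist(x))
ord <- t(apply(d, 1, order))   # row i: all points ordered by distance from point i
rk <- d[cbind(seq_len(n), ord[, k + 1])]  # distance to the k-th nearest neighbor
dens <- 1 / rk^2               # 2-D k-NN density, up to a constant

# Point i is a local maximum if no point among its nbhd nearest neighbors
# has a higher estimated density. Column 1 of ord is the point itself.
is_locmax <- sapply(seq_len(n), function(i) {
  all(dens[ord[i, 2:(nbhd + 1)]] <= dens[i])
})
modes <- x[is_locmax, , drop = FALSE]
modes
```

A larger neighborhood makes the check stricter and so tends to return fewer local maxima, while a smaller one lets through more minor bumps in the estimated density.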
Let’s try it on simulated data, with known clusters. We’ll generate from a mixture of 3 bivariate normals, with means at (0,0), (1,2) and (3,3). (The package includes a function rmixmvnorm() for such experiments.) Here are the results, both graphical and text:
> cv <- 0.5*diag(2)
> x<- rmixmvnorm(10000,2,3,list(c(0,0),c(1,2),c(3,3)),
+ list(cv,cv,cv))
> p <- freqparcoord(x,m=1,method="locmax",
+ keepidxs=1,k=50,klm=600)
           [,1]       [,2]
[1,] -0.3556997  0.2423760
[2,] -0.0993228 -0.2510209
[3,]  0.2507786  0.1883437
[4,]  1.1386850  2.2073467
[5,]  2.7581227  2.8935957
Of course, the results depend on the argument values used, and the reader may wish to experiment with other values of klm. But the five local maxima found clearly fall into 3 clusters, the correct number: three near (0,0), one near (1,2) and one near (3,3), the correct locations. Note that we did NOT specify the number of clusters.
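The grouping can be checked by hand. The short sketch below copies the five printed local maxima and assigns each to the nearest of the three true means; it uses base R only, and the variable names are just illustrative.

```r
# The five local maxima printed above, copied by hand.
maxima <- rbind(c(-0.3556997,  0.2423760),
                c(-0.0993228, -0.2510209),
                c( 0.2507786,  0.1883437),
                c( 1.1386850,  2.2073467),
                c( 2.7581227,  2.8935957))
centers <- rbind(c(0, 0), c(1, 2), c(3, 3))  # true mixture means

# Squared distance from every maximum to every center (a 5 x 3 matrix).
d2 <- sapply(seq_len(nrow(centers)), function(j)
  rowSums(sweep(maxima, 2, centers[j, ])^2))

nearest <- apply(d2, 1, which.min)  # index of the closest true mean
nearest                             # 1 1 1 2 3
sqrt(apply(d2, 1, min))             # distance to that mean, all below 0.5
```

So the first three maxima belong to the component at (0,0), the fourth to (1,2) and the fifth to (3,3), each lying within half a unit of its true mean.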
And there’s more! Watch this space for future posts.
The package is on CRAN but the latest version is here.