[This article was first published on Life in Code, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Here are the results of a “Curse of Dimensionality” homework assignment for Terran Lane’s Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It’s neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root for an L-norm metric (e.g. Euclidean = L2 norm). It took me a while to “get” how the maximum metric worked — it’s simpler than I was trying to make it, though I’m still not sure I can explain it properly.
In general, I’m a big fan of R’s lattice package, though I’m never quite happy with the legend/key settings.
The code is below. One thing that I’ve learned from this exercise is that I prefer not to have the foreach call in the outermost loop. For one, error and/or progress reporting from inside a foreach isn’t immediately emitted to the console, and it’s annoying/confusing to have a long run with no feedback. Second and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex job means one or more cores will finish early and sit idle. Too many elements + very simple job means too many jobs will be spawned, and (I’m guessing here) a lot of time will be wasted with job setup and teardown.
dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4, rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){ ## examine distance between 2 points in dims dimensions for different metrics ## chosen at random according to rfun (rnorm by default) ## require(foreach) require(doMC) registerDoMC registerDoMC(cores=cores) ## use multicore/foreach on the outside (each sd) finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% { ## pre-allocate, don't append/rbind! ret = data.frame(.sd=1:(reps*length(dims))*length(methods), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods)) ii = 1 for (.dim in dims) { for (method in .methods) { ## reset .rep counter .rep = 1 while (.repFor plotting, I’ve used:
xyplot(.dist ~ (.dim)|method, groups=.sd, data =aa, pch='.', par.settings=list(background=list(col='darkgrey')), layout=c(3,1), scales=list(y=list(relation='free'), alternating=1), auto.key=list(space='bottom', columns=5, points=F, lines=T), xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)', ) xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.', par.settings=list(background=list(col='darkgrey')), layout=c(5,1), scales=list(y=list(relation='free'), alternating=1), auto.key=list(space='bottom', columns=3, points=F, lines=T), xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)', )
To leave a comment for the author, please follow the link and comment on their blog: Life in Code.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.