Use R! – Part 2
Here is a follow-up to my first post about using R. For our yearly KU Leuven Geology PhD Seminar (08-09/05/2012), I quickly put this script together from several examples I had run into in the past, as well as some things I have been doing myself in R. Be aware that these analyses of the Meuse dataset are not always very useful! The point was simply to demonstrate several numerical/statistical techniques within R, in an attempt to add a few more R converts to the community. Below are the example graphs, some explanation of what they represent, and the R code that I sent to all members of the statistics and numerical analysis discussion group.
The Meuse dataset consists of 155 samples of topsoil heavy metal concentrations (ppm), along with a number of soil and landscape variables, collected in a flood plain of the river Meuse, near the village of Stein. Below are a map of the relative cadmium concentrations and the scatterplot matrix for the different metal concentrations.
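For reference, loading the data and drawing these two figures takes just a few lines (condensed from the full script at the end of this post):

library(gstat) # the meuse dataset ships with the gstat package
data(meuse) # load the example dataset
# map of the sample locations; symbol size scales with cadmium concentration
plot(meuse$x, meuse$y, cex=(meuse$cadmium)^(1/3), pch=16,
     xlab='X coordinate', ylab='Y coordinate',
     main='Cadmium concentration (ppm)')
plot(meuse[,3:6]) # scatterplot matrix of the four metal concentrations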
Linear mixed effects models are suited for linear dependence modelling when your dataset has one or more categorical variables and you expect both a general relationship and slight differences between the classes. One thing we could do with the Meuse dataset is predict the copper concentration as a function of the lead concentration. The bold black line is the result of ordinary linear regression. By allowing an additional random effect for each flood frequency class, one obtains the other three lines, i.e. one for each class (this was done with lme4). This is probably not very useful for this dataset, but I have used this extension of linear regression several times myself, relating different material properties with, for example, the stratigraphy as categorical variable. It definitely is a useful technique within the geosciences.
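Condensed from the script below, the whole comparison comes down to a few lines; the formula-with-data form used here is equivalent to the meuse$ syntax in the full script:

library(lme4)
classes <- as.numeric(meuse$ffreq) # flood frequency classes 1, 2, 3
plot(meuse$lead, meuse$copper, col=classes, pch=classes,
     xlab='Lead (ppm)', ylab='Copper (ppm)')
abline(lm(copper ~ lead, data=meuse)$coef, lwd=2) # bold overall regression line
# random intercept and slope for each flood frequency class:
fit <- lmer(copper ~ lead + (lead | ffreq), data=meuse)
coef(fit)$ffreq # one intercept/slope pair per class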
Next is the non-linear modelling of the metal concentrations’ dependence on elevation, organic matter content and the distance from the river Meuse. While the concentrations show mutually linear dependence, this is not true for their relation to the other three variables, so we can, for example, use an artificial neural network (data transformations might of course also work, but I wanted to show how to use ANNs in R with AMORE).
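The network used here has three standardized inputs, twenty sigmoid hidden units and four linear outputs; condensed from the script below:

library(AMORE)
input <- scale(meuse[-c(42,43), c(7,9,14)]) # elevation, organic matter, distance to the Meuse
target <- scale(meuse[-c(42,43), 3:6]) # the four metal concentrations
ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1,
             error.criterium="LMS", Stao=NA, hidden.layer="sigmoid",
             output.layer="purelin", method="ADAPTgdwm")
trainedANN <- train(ann, input, target, show.step=1, n.shows=1000)
fit <- sim(trainedANN$net, input) # network predictions for the training data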
The two figures below show the model fit (predicted vs target values) after training a single network, and the leave-one-out cross-validation results that estimate the true predictive error. Especially in the lead and zinc plots, it becomes apparent that the training error is smaller than the true predictive error. Tuning the ANN parameters would probably yield better results, but that was outside the scope of this demonstration.
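To make the leave-one-out idea concrete, here is a sketch stripped to its essentials, with an ordinary linear model standing in for the ANN (the full ANN version is the cv() function in the script below):

loo <- sapply(1:nrow(meuse), function(i) {
  m <- lm(copper ~ lead, data=meuse[-i,]) # refit without sample i
  predict(m, newdata=meuse[i,]) # predict the held-out sample
})
mean((meuse$copper - loo)^2) # estimate of the true predictive error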
Next is data clustering, an unsupervised machine learning technique. For the Meuse dataset, it would be nice to discover different classes based on the metal concentrations. The lead vs copper scatterplot nicely shows different classes arising from the distance to the Meuse. Suppose we did not know these distances: would we be able to detect these classes from the concentration data alone?
Two algorithms are compared here: k-means and model-based clustering. For the first, the number of classes has to be specified in advance; the plots are created for one up to ten classes. The model-based clustering (done with MCLUST) uses expectation-maximization methods to estimate the optimal number of classes. This example clearly demonstrates the disadvantages of k-means clustering: all clusters are equal in size and have a more or less spherical shape. Model-based clustering overcomes these problems and provides a decent classification according to the distance to the river Meuse, as shown in the plot above.
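Condensed from the script below, the two classifications come down to the following; note that centers=3 is just an illustrative choice here (the script loops over one to ten classes), while Mclust selects the number of components itself:

km <- kmeans(data.frame(lead=meuse$lead, copper=meuse$copper),
             centers=3, iter.max=100, nstart=100)$cluster
library(mclust)
mb <- Mclust(data.frame(lead=meuse$lead, copper=meuse$copper))$classification
table(km, mb) # cross-tabulate the two partitions to compare them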
Finally, some geostatistics of course! Below are the examples of ordinary kriging of the log-transformed zinc concentrations, as provided in the gstat demo.
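Condensed from the gstat demo code in the script below:

library(gstat)
data(meuse); coordinates(meuse) <- ~ x+y # promote to spatial data
data(meuse.grid); gridded(meuse.grid) <- ~ x+y # prediction grid
v <- vgm(0.581, "Sph", 900, nug=0.0554) # spherical variogram model
g <- gstat(id="ln.zinc", formula=log(zinc) ~ 1, data=meuse,
           nmax=40, nmin=20, maxdist=1000, model=v)
x <- predict(g, meuse.grid) # ordinary kriging
spplot(x["ln.zinc.pred"], main="log(zinc) ordinary kriging predictions")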
And here is the script, if you want to try it yourself:
####################################################################
# PhD Seminar KU Leuven Geology Section: R examples, by B. Rogiers #
####################################################################
# Download and install R from: http://cran.freestatistics.org/
# Open this script in RGui, or a more advanced R editor like RStudio: http://rstudio.org/
# Execute the code lines one by one

install.packages(c('gstat','lme4','AMORE','mclust')) # install packages; only necessary the first time!
library(gstat) # load the geostatistics package, which contains our dataset
data(meuse) # load the example dataset
?meuse # help file for the Meuse river data set
plot(meuse$x, meuse$y, cex=(meuse$cadmium)^(1/3), pch=16,
     xlab='X coordinate', ylab='Y coordinate',
     main='Cadmium concentration (ppm)') # plot sample locations and cadmium concentration
plot(meuse[,3:6]) # concentration scatterplot matrix
n <- nrow(meuse) # number of samples

# Dependence modelling

# Linear mixed models
library(lme4) # load necessary package
classes <- as.numeric(meuse$ffreq) # choose data classes: flood frequency 1, 2, 3
plot(meuse$lead, meuse$copper, col=classes, pch=classes,
     xlab='Lead (ppm)', ylab='Copper (ppm)',
     main='Flood frequency classes') # plot lead and copper concentrations for each of the classes
abline(lm(meuse$copper ~ meuse$lead)$coef, lwd=2) # draw a fitted linear regression line
?lmer # help file for the lmer function
lmer(meuse$copper ~ meuse$lead + (meuse$lead|classes)) # fit a linear mixed effects model, with a random effect for each class
coefs <- coef(lmer(meuse$copper ~ meuse$lead + (meuse$lead|classes))) # save the coefficients of this model
for(i in 1:3) abline(a=coefs$classes[i,1], b=coefs$classes[i,2], col=i, lty=i) # plot the 3 resulting lines

# Neural networks
library(AMORE) # load necessary package
plot(meuse[,c(3:7,9,14)]) # scatterplot matrix of some numeric parameters
input <- meuse[-c(42,43),c(7,9,14)] # use elevation, distance to the Meuse, and organic matter content (input data), to model
target <- meuse[-c(42,43),c(3:6)] # the cadmium, copper, lead and zinc concentrations (target values); rows 42 and 43 contain NA values
input <- scale(input, center=TRUE, scale=TRUE) # standardize input data
target <- scale(target, center=TRUE, scale=TRUE) # standardize target data
?newff # help file for the newff function
ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1,
             error.criterium="LMS", Stao=NA, hidden.layer="sigmoid",
             output.layer="purelin", method="ADAPTgdwm") # create a neural network
?train # help file for the train function
trainedANN <- train(ann, input, target, show.step=1, n.shows=1000) # train the neural network
fit <- sim(trainedANN$net, input) # get the predictions for all data
par(mfrow=c(2,2)) # plot 2 x 2 figures
plot(fit[,1], target[,1], xlab='Model prediction', ylab='Target variable', main='Cadmium') # observations vs predictions
plot(fit[,2], target[,2], xlab='Model prediction', ylab='Target variable', main='Copper')
plot(fit[,3], target[,3], xlab='Model prediction', ylab='Target variable', main='Lead')
plot(fit[,4], target[,4], xlab='Model prediction', ylab='Target variable', main='Zinc')
par(mfrow=c(1,1)) # reset plot options

# Uncertainty quantification
# http://episun7.med.utah.edu/~alun/teach/stats/week08.pdf
hist(meuse$cadmium) # histogram of the cadmium concentrations
mean(meuse$cadmium) # mean of the cadmium concentrations
var(meuse$cadmium)/n # variance of the sample mean
dataSample <- meuse$cadmium # use all data
# n <- 15; dataSample <- sample(meuse$cadmium, size=n) # use a smaller data sample instead

# Bootstrapping
# Estimate the variance of a statistic; good for smaller samples or awkward
# distributions. Generate a set of random samples with replacement and
# evaluate the statistic on each of them.
# Non-parametric bootstrap:
s <- 1000 # number of resampling cases
t <- rep(0, s) # initialize the resulting vector of means
for(i in 1:s) t[i] <- mean(sample(dataSample, replace=TRUE)) # loop through all resampling cases, and calculate the mean
mean(t) # bootstrapped estimate of the mean
var(t) # bootstrapped estimate of the variance of the mean
# Semi-parametric bootstrap:
a <- 0.1 # smoothing parameter
for(i in 1:s) t[i] <- mean(sample(dataSample, replace=TRUE) + a*rnorm(n)) # add random normal error to each sample
mean(t) # bootstrapped estimate of the mean
var(t) # bootstrapped estimate of the variance of the mean

# Jack-knifing
# Estimate the variance and bias of a statistic; a more orderly version of the
# bootstrap: generate n samples by leaving one observation out at a time.
t <- rep(0, n) # initialize the resulting vector of means
for(i in 1:n) t[i] <- mean(dataSample[-i]) # loop through all cases
mean(t) # jack-knifed estimate of the mean
(n-1)/n * sum((t - mean(t))^2) # jack-knifed estimate of the variance of the mean
n*mean(dataSample) - (n-1)*mean(t) # jack-knifed bias-corrected estimate

# Cross-validation
# Evaluating the predictive error of a model fitted to all data
cv <- function(input, target, sample) # predict one sample through cross-validation, with the ANN we created above
{
  inputCV <- input[-sample,]; targetCV <- target[-sample,] # remove the left-out sample from the data
  ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1,
               error.criterium="LMS", Stao=NA, hidden.layer="sigmoid",
               output.layer="purelin", method="ADAPTgdwm")
  trainedANNCV <- train(ann, inputCV, targetCV, show.step=500, n.shows=1)
  return(sim(trainedANNCV$net, input)[sample,]) # predict the left-out sample
}
results <- matrix(nrow=nrow(input), ncol=4) # initialize the results matrix
for(i in 1:nrow(input)) results[i,] <- cv(input, target, i) # loop through all samples, applying the above function
par(mfrow=c(2,2)) # plot 2 x 2 figures
for(i in 1:4)
{
  plot(results[,i], target[,i], xlab='Model prediction', ylab='Target variable',
       main=c('Cadmium','Copper','Lead','Zinc')[i]) # plot observations vs predictions for each element
  print(mean((target[,i] - results[,i])^2)) # print the cross-validation mean squared error
}
par(mfrow=c(1,1)) # reset plot options

# Clustering
# Unsupervised machine learning technique to find classes in a dataset
plot(meuse$lead, meuse$copper, cex=(meuse$dist.m)^(1/3)/7, pch=16,
     xlab='Lead (ppm)', ylab='Copper (ppm)',
     main='Distance to Meuse river') # let us classify the lead and copper concentrations; two groups are clearly present: close to and far away from the Meuse

# K-means clustering
# http://en.wikipedia.org/wiki/K-means_clustering
?kmeans # help file for the kmeans function
par(ask=TRUE) # adapt graphical parameters; now press enter in the console window to display plots
for(i in 1:10) # loop from 1 to 10 k-means classes
{
  classes <- kmeans(x=data.frame(lead=meuse$lead, copper=meuse$copper),
                    centers=i, iter.max=100, nstart=100)$cluster # classify
  plot(meuse$lead, meuse$copper, col=classes, pch=classes,
       xlab='Lead (ppm)', ylab='Copper (ppm)',
       main=paste('K-means clustering, with', i, 'classes')) # plot data with classes
}
par(ask=FALSE) # reset graphical parameters

# Model-based clustering
# http://www.stat.washington.edu/raftery/Research/mbc.html
library(mclust) # load necessary package
?Mclust # help file for the Mclust function
classes <- Mclust(data.frame(lead=meuse$lead, copper=meuse$copper))$classification # classify
plot(meuse$lead, meuse$copper, col=classes, pch=classes,
     xlab='Lead (ppm)', ylab='Copper (ppm)',
     main='Model-based clustering') # plot data with classes

# Geostatistics
?gstat # help file for the gstat function
coordinates(meuse) <- ~ x+y # set the x and y variables as coordinates
data(meuse.grid) # load the data grid for estimating zinc concentrations
gridded(meuse.grid) <- ~ x+y # set the data as gridded, with x and y as coordinates
v <- vgm(0.581, "Sph", 900, nug=0.0554) # variogram model
g <- gstat(id="ln.zinc", form=log(zinc) ~ 1, data=meuse, nmax=40, nmin=20,
           maxdist=1000, model=v) # initialize the gstat object
plot(variogram(g), model=v) # plot the variogram model fit
x <- predict(g, meuse.grid) # predict using ordinary kriging
spplot(x["ln.zinc.pred"], main="log(zinc) ordinary kriging predictions",
       col.regions=terrain.colors(30)) # plot the ordinary kriging predictions
x$ln.zinc.se <- sqrt(x$ln.zinc.var) # convert the kriging variance to the kriging standard error for plotting
spplot(x["ln.zinc.se"], main="log(zinc) ordinary kriging prediction standard errors",
       col.regions=terrain.colors(30)) # plot the ordinary kriging standard errors
# Need more examples?
demo(cokriging)
demo(examples)

# Other useful info:
# The R project website: https://www.r-project.org/
# List of packages: http://cran.freestatistics.org/web/packages/available_packages_by_name.html
# ggplot2, a visualization package loosely based on "The Grammar of Graphics" by Leland Wilkinson: http://had.co.nz/ggplot2/
# R graph gallery: http://addictedtor.free.fr/graphiques/
# R & LaTeX:
# knitr: http://yihui.name/knitr/
# Sweave: http://www.statistik.lmu.de/~leisch/Sweave/