It is fairly common to find missing values in a dataset. A few missing values are generally not a problem: those records can be deleted listwise, meaning the entire record is simply removed from the analysis. The problem is that even a limited amount of missing data, scattered across records, can translate into a significant number of records being omitted. In the example below, about two-thirds of the records would end up being omitted due to missing values.
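To see how quickly listwise deletion eats into the sample, it is enough to count the complete cases. The short sketch below is illustrative and separate from the example later in the post (the object names, the seed, and the simulated data are my own): with 50 missing cells scattered across 50 records on 10 variables, typically only around a third of the records remain complete.

set.seed(1)
m <- replicate(10, rpois(50, 100))   # 50 records on 10 variables, no missing values yet
m[sample(length(m), 50)] <- NA       # knock out 50 of the 500 cells at random (about 10%)
sim <- data.frame(m)

nrow(sim)            # 50 records in total
nrow(na.omit(sim))   # complete cases left after listwise deletion (usually around a third)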
The distribution of the missing values in the data is very important. If the data are missing at random, the problem is less serious than when there is a pattern of missingness that is, at least to some extent, dependent on the values of the variables themselves.
There are many approaches that can be used to impute missing data. The easiest is to calculate the mean of each variable and substitute it for each of that variable's missing values. The problem with this is that it reduces the variance and the absolute value of the covariance. Another common approach is Expectation-Maximization (EM). This technique iteratively works through the data, filling in each missing value from the observed values in the same record, while preserving the covariance structure of the data.
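To see the variance problem concretely, the short sketch below (simulated data and illustrative names, not part of the post's own example) substitutes the mean for one variable's missing values and compares the variances. The full simulation and the EM routine follow.

set.seed(2)
x <- rnorm(100, mean = 50, sd = 10)
x[sample(100, 20)] <- NA                                 # 20% of the values go missing

x.mean.imp <- x
x.mean.imp[is.na(x.mean.imp)] <- mean(x, na.rm = TRUE)   # mean substitution

var(x, na.rm = TRUE)   # variance of the observed values
var(x.mean.imp)        # noticeably smaller: every imputed value sits exactly at the mean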
library(e1071)  # provides rdiscrete() for drawing the random missing-cell positions

raw <- replicate(10, rpois(50, 100))   # 50 records on 10 Poisson(100) variables
raw.orig <- raw                        # keep a copy of the complete data
rand.miss <- rdiscrete(50, probs = rep(1:length(raw)), values = seq(1, length(raw)))
raw[rand.miss] <- NA                   # knock out 50 cells
raw <- data.frame(raw)

var(na.omit(raw))                      # covariance after listwise deletion

EMalg <- function(x, tol = .001){
  missvals <- is.na(x)
  new.impute <- x
  old.impute <- x
  count.iter <- 1
  reach.tol <- 0
  sig <- as.matrix(var(na.exclude(x)))                  # starting covariance from complete cases
  mean.vec <- as.matrix(apply(na.exclude(x), 2, mean))  # starting means from complete cases

  while(reach.tol != 1) {
    for(i in 1:nrow(x)) {
      pick.miss <- c(missvals[i, ])
      if (sum(pick.miss) != 0) {
        inv.S <- solve(sig[!pick.miss, !pick.miss])  # we need the inverse of the covariance

        # Run the EM: replace the missing cells with their conditional expectation
        # given the observed cells in the same record
        new.impute[i, pick.miss] <- mean.vec[pick.miss] +
          sig[pick.miss, !pick.miss] %*% inv.S %*%
          (t(new.impute[i, !pick.miss]) - t(t(mean.vec[!pick.miss])))
      }
    }

    sig <- var(new.impute)                             # update the covariance matrix
    mean.vec <- as.matrix(apply(new.impute, 2, mean))  # update the mean vector

    if(count.iter > 1){  # we don't want this to run on the first iteration or else it fails
      reach.tol <- 1
      for(l in 1:nrow(new.impute)){
        for(m in 1:ncol(new.impute)){
          if(abs(old.impute[l, m] - new.impute[l, m]) > tol) {
            reach.tol <- 0   # any cell still moving by more than tol means another pass
          }
        }
      }
    }

    count.iter <- count.iter + 1  # used for debugging purposes to ensure the process is iterating properly
    old.impute <- new.impute
  }
  return(new.impute)
}

raw.imputed <- EMalg(raw, tol = .0001)
var(raw.imputed)                       # compare with the covariance after listwise deletion

plot(raw.imputed[, 2], raw.imputed[, 3], pch = 16,
     main = "Scatterplot of Missing Data", sub = "Missing Values in Red",
     xlab = "X", ylab = "Y")

# overlay the imputed values on the plot
plot.imputed <- raw.imputed[row.names(subset(raw, is.na(raw[, 2]) | is.na(raw[, 3]))), ]
points(plot.imputed[, 2], plot.imputed[, 3], pch = 16, col = 'red')
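The key step inside the row loop is the conditional mean of a multivariate normal distribution. Splitting a record into its missing part x_miss and observed part x_obs, with the current mean and covariance estimates partitioned the same way, the imputed values are

E[x_miss | x_obs] = mu_miss + Sigma_miss,obs * Sigma_obs,obs^-1 * (x_obs - mu_obs)

which is what the assignment to new.impute[i, pick.miss] computes, with mean.vec and sig holding the current estimates. Those estimates are then recalculated from the filled-in data on each pass, and the loop stops once no imputed value changes by more than tol.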
Once the missing values are imputed it is important to review the data and run the standard assumption tests before proceeding with further analysis. This is only one of many approaches for imputing missing data. Other approaches include random forests and other machine learning techniques that train a model on the observed portion of the data to predict the missing values.
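As one concrete example of the random forest route, the missForest package wraps this up in a single call. The sketch below is a minimal illustration, assuming that package is installed and reusing the raw data frame from the example above; the object names rf.result and raw.rf.imputed are just placeholders.

library(missForest)   # random-forest-based imputation; install.packages("missForest") if needed

set.seed(3)
rf.result <- missForest(raw)        # iteratively fits a random forest per variable to predict its NAs
raw.rf.imputed <- rf.result$ximp    # the imputed data frame
rf.result$OOBerror                  # out-of-bag estimate of the imputation error

var(raw.rf.imputed)                 # compare with var(raw.imputed) from the EM approach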