Missing Data in R
[This article was first published on Quantitative Finance Collector, and kindly contributed to R-bloggers.]
Probably all of us have run into the problem of handling missing data. From basic portfolio correlation matrix estimation to advanced multi-factor analysis, how to impute missing data remains a hot topic. Missing data are unavoidable and arise in more settings than the term usually suggests, and ignoring them will generally lead to biased estimates. The following approaches are often applied to handle the problem:
1. Simple deletion strategies: pairwise deletion and listwise deletion. The former may lead to inconsistent results, for example correlation or covariance matrices that are not positive definite; with the latter, the number of deleted cases can be enormous unless very few values are missing.
2. So-called “working around” strategies: for example, Full Information Maximum Likelihood (FIML) integrates out the missing data when fitting the desired model.
3. Imputation strategies: the most widely used methods in both academia and industry, which replace a missing value with an estimate of the actual value for that case. For instance, ‘hot-deck’ imputation replaces the missing value with the observed value from another, similar case in the same dataset for which that variable is not missing; mean imputation replaces it with the mean of the variable in question; Expectation Maximization (EM) arrives at the best point estimates of the true values, given the model (which is itself estimated on the basis of the imputed values); regression-mean imputation replaces the missing value with the conditional regression mean; and multiple imputation derives several imputed values from a prediction equation rather than a single one.
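To make the difference between these strategies concrete, here is a minimal sketch in plain Python (the post itself contains no code, so the toy data and all function names below are my own assumptions for illustration). It shows listwise deletion, mean imputation and regression-mean imputation on a two-variable dataset where `None` marks a missing value.

```python
from statistics import mean

# Toy data (hypothetical): each row is one case (x, y); None marks a missing value.
rows = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (None, 8.0), (5.0, 10.0)]

def listwise_delete(rows):
    """Listwise deletion: keep only the cases with no missing value at all."""
    return [r for r in rows if None not in r]

def mean_impute(rows):
    """Mean imputation: replace a missing value by the mean of the observed values of its variable."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [tuple(m if v is None else v for v, m in zip(r, col_means)) for r in rows]

def regression_impute_y(rows):
    """Regression-mean imputation: replace a missing y by the fitted value
    of an OLS regression of y on x, estimated on the complete cases."""
    complete = listwise_delete(rows)
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    xbar, ybar = mean(xs), mean(ys)
    slope = sum((x - xbar) * (y - ybar) for x, y in complete) / sum((x - xbar) ** 2 for x in xs)
    intercept = ybar - slope * xbar
    # Only y is imputed here, and only when x is observed; other gaps are left as they are.
    return [(x, intercept + slope * x) if y is None and x is not None else (x, y)
            for x, y in rows]

print(listwise_delete(rows))
print(mean_impute(rows))
print(regression_impute_y(rows))
```

Note how listwise deletion throws away two of the five cases, while mean imputation keeps every case but ignores the relationship between x and y that regression-mean imputation exploits.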
I came across an easy-to-use missing data imputation tool named Amelia II, developed by Professor Gary King of Harvard University. As its webpage puts it: Amelia II “multiply imputes” missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements a bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches and can handle many more variables….Unlike…other statistically rigorous imputation software, it virtually never crashes.
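The idea of “multiply imputing” with the help of the bootstrap can be sketched in a few lines of plain Python. This is not Amelia II's actual algorithm, only a simplified illustration under my own assumptions: a single variable y is imputed from a fully observed x by a linear regression refitted on a bootstrap resample of the complete cases, and the m completed datasets are then combined with Rubin's rules. The data and all function names are hypothetical.

```python
import random
from statistics import mean, variance

# Toy data (hypothetical): x is fully observed, y has missing values marked None.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, None), (6.0, 11.8)]

def fit_line(pairs):
    """OLS fit of y on x; returns (intercept, slope, residual standard deviation)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    xbar, ybar = mean(xs), mean(ys)
    slope = sum((x - xbar) * (y - ybar) for x, y in pairs) / sum((x - xbar) ** 2 for x in xs)
    intercept = ybar - slope * xbar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in pairs)
    return intercept, slope, (rss / max(len(pairs) - 2, 1)) ** 0.5

def multiply_impute(rows, m=5, seed=0):
    """Return m completed copies of rows.
    Assumes the complete cases contain at least two distinct x values."""
    rng = random.Random(seed)
    complete = [r for r in rows if r[1] is not None]
    completed = []
    for _ in range(m):
        # Bootstrap the complete cases; redraw if every sampled x is identical.
        while True:
            sample = [rng.choice(complete) for _ in complete]
            if len({x for x, _ in sample}) > 1:
                break
        a, b, sd = fit_line(sample)
        # Fill each missing y with the prediction plus random noise.
        completed.append([(x, y if y is not None else a + b * x + rng.gauss(0, sd))
                          for x, y in rows])
    return completed

def pool_mean_y(completed):
    """Rubin's rules for the mean of y: pooled estimate and total variance."""
    m = len(completed)
    estimates = [mean(y for _, y in d) for d in completed]
    within = mean(variance([y for _, y in d]) / len(d) for d in completed)
    between = variance(estimates)
    return mean(estimates), within + (1 + 1 / m) * between

completed = multiply_impute(rows, m=5, seed=1)
estimate, total_variance = pool_mean_y(completed)
print(estimate, total_variance)
```

Because each completed dataset comes from a different bootstrap fit, the spread between the m estimates reflects the uncertainty due to the missing values, which a single imputed value would hide.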
Amelia II is built on the R language, so users have to install R before running it. Installing Amelia is straightforward: download and run the exe file, and that’s it. For me, the beauty of Amelia II is its friendly interface; I don’t even need to run R myself. Double-clicking Amelia II opens its main window.
As you can see from the input and output menus, it supports csv files: simply importing a csv file with missing data returns a csv with imputed data. Amazing, isn’t it?
Download the software and help documents at http://gking.harvard.edu/amelia/.