[This article was first published on Quantitative Finance Collector, and kindly contributed to R-bloggers].
Probably all of us have run into the problem of handling missing data. From basic portfolio correlation matrix estimation to advanced multiple factor analysis, how to impute missing data remains a hot topic. Missing data are unavoidable, and the problem is broader than the term usually suggests; ignoring missing data will generally lead to biased estimates. The following approaches are often applied to handle the problem:
1, simple deletion strategies, including pairwise deletion and listwise deletion. The former may lead to inconsistent results, for example non-positive-definite correlation or covariance matrices; with the latter, the number of deleted cases may be enormous unless there are very few missing values;
2, so-called "working around" strategies. For example, Full Information Maximum Likelihood (FIML) integrates out the missing data when fitting the desired model;
3, imputation strategies. These are the most widely used methods in both academia and industry, and they replace a missing value with an estimate of the actual value for that case. For instance, "hot-deck" imputation replaces the missing value with the observed value from another, similar case in the same dataset for which that variable was not missing; mean imputation replaces the missing value with the mean of the variable in question; Expectation Maximization (EM) arrives at the best point estimates of the true values, given the model (which is itself estimated on the basis of the imputed values); regression-mean imputation replaces the missing value with the conditional regression mean; and multiple imputation derives several imputed values, rather than a single one, from a prediction equation.
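To make the difference between these strategies concrete, here is a minimal sketch in plain Python. The toy data and all function names are hypothetical and are not taken from Amelia II or any other package; the sketch only illustrates listwise deletion, pairwise deletion, mean imputation and regression-mean imputation on a tiny dataset.

```python
from statistics import mean

# Hypothetical toy data: rows are cases, columns are variables x, y, z.
# None marks a missing value.
rows = [
    (1.0, 2.0, 1.5),
    (2.0, None, 2.5),
    (3.0, 6.0, None),
    (4.0, 8.0, 4.5),
    (None, 10.0, 5.5),
]

def listwise_delete(data):
    """Keep only the cases with no missing value in any variable."""
    return [r for r in data if all(v is not None for v in r)]

def pairwise_cases(data, i, j):
    """Keep the cases where both variables i and j are observed."""
    return [(r[i], r[j]) for r in data
            if r[i] is not None and r[j] is not None]

def mean_impute(data):
    """Replace each missing value by the mean of its column's observed values."""
    ncol = len(data[0])
    col_means = [mean(r[c] for r in data if r[c] is not None)
                 for c in range(ncol)]
    return [tuple(col_means[c] if r[c] is None else r[c] for c in range(ncol))
            for r in data]

def regression_impute(data, target, predictor):
    """Replace missing target values by the fitted value of a simple
    least-squares regression of target on predictor (complete pairs only)."""
    pairs = [(r[predictor], r[target]) for r in data
             if r[predictor] is not None and r[target] is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    intercept = my - slope * mx
    out = []
    for r in data:
        r = list(r)
        if r[target] is None and r[predictor] is not None:
            r[target] = intercept + slope * r[predictor]
        out.append(tuple(r))
    return out

complete = listwise_delete(rows)          # cases usable for every variable
xy = pairwise_cases(rows, 0, 1)           # cases usable for the x-y pair only
mean_filled = mean_impute(rows)
reg_filled = regression_impute(rows, target=1, predictor=0)

print(len(complete), len(xy))
print(mean_filled[1])
print(reg_filled[1])
```

Listwise deletion keeps fewer cases than pairwise deletion does for the x-y pair, which is the trade-off described in item 1. Mean imputation ignores the relationship between variables, while the regression version uses it.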
Amelia II is built on the R language, so users have to install R before running it. Installing Amelia is straightforward: download and run the exe file, and that's it. For me, the beauty of Amelia II is its friendly interface; I don't even need to run R myself. Double-clicking Amelia II opens its main window. Its input and output menus show that it supports csv files: simply importing a csv file with missing data returns a csv with imputed data. Amazing, isn't it?
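The csv-in, csv-out workflow can be sketched in a few lines of plain Python. This is only an illustration of the idea of producing several completed datasets from one incomplete file. It is not Amelia II's algorithm: the random draw from observed values used here is a deliberately simple stand-in, and the column names, data and function names are all hypothetical.

```python
import csv
import io
import random

# Hypothetical input: a small csv where empty cells are missing values.
raw = "ret_a,ret_b\n0.01,0.02\n,0.01\n0.03,\n-0.02,-0.01\n"

def read_csv(text):
    """Parse csv text into a header and rows of floats (None if empty)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [[float(c) if c != "" else None for c in row] for row in reader]
    return header, rows

def impute_once(rows, rng):
    """Fill each missing cell with a random draw from the observed
    values of the same column (a stand-in imputation model)."""
    ncol = len(rows[0])
    observed = [[r[c] for r in rows if r[c] is not None] for c in range(ncol)]
    return [[rng.choice(observed[c]) if r[c] is None else r[c]
             for c in range(ncol)]
            for r in rows]

def to_csv(header, rows):
    """Serialise a completed dataset back to csv text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

header, rows = read_csv(raw)
rng = random.Random(0)   # fixed seed so the sketch is reproducible
m = 5                    # number of completed datasets to produce
completed = [impute_once(rows, rng) for _ in range(m)]
outputs = [to_csv(header, d) for d in completed]

# Pool an estimate across the completed datasets, here the mean of ret_a.
col_means = [sum(r[0] for r in d) / len(d) for d in completed]
pooled_mean_a = sum(col_means) / m

print(outputs[0])
print(pooled_mean_a)
```

Observed cells are left unchanged in every completed dataset; only the missing cells differ from one dataset to the next, and that variation is what multiple imputation uses to reflect uncertainty about the missing values.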
Download the software and help documents at http://gking.harvard.edu/amelia/.
Read the full post at Missing Data in R.