Statistical matching, or when one single data source is not enough
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I was recently asked how to go about matching several datasets where different samples of individuals were interviewed. This sounds like a big problem: say that you have datasets A and B, that A contains one sample of individuals and B another sample of individuals. How could you possibly match the datasets? Matching datasets requires a common identifier. For instance, suppose that A contains socio-demographic information on a sample of individuals I, while B contains information on wages and hours worked for the same sample of individuals I. Then yes, it is possible to match/merge/join both datasets.
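To make the easy case concrete, here is a minimal sketch in base R. The data frames `A` and `B`, the `id` column and all the values are made up for illustration; the only point is that both tables describe the same individuals.

```r
# Toy data: both datasets describe the SAME individuals, identified by `id`
# (names and values are hypothetical, for illustration only)
A <- data.frame(id  = 1:4,
                age = c(25, 40, 33, 58),
                sex = c("F", "M", "F", "M"))

B <- data.frame(id    = c(3, 1, 4, 2),
                wage  = c(3100, 2400, 4200, 3600),
                hours = c(40, 32, 40, 38))

# Exact join on the common identifier; row order in B does not matter
AB <- merge(A, B, by = "id")
AB
```

Each individual ends up on one row with the variables from both sources. This only works because `id` refers to the same person in `A` and in `B`.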
But that was not what I was asked about; I was asked about a situation where the same population gets sampled twice, and each sample answers a different survey. For example, survey A is about labour market information and survey B is about family structure. Would it be possible to combine the information from both datasets?
To me, this sounded a bit like a missing data imputation problem, but one where all the information about the variables of interest is missing! I started digging a bit, and found that not only is there already quite some literature on the topic, there is even a package for it, called {StatMatch}, with a very detailed vignette.
The vignette is so detailed that I will not write any code here; I just wanted to share this package!
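That said, for a rough feel of the underlying idea, here is a small sketch in base R of a nearest-neighbour "hot deck" match. This is not the {StatMatch} API, just my own illustration under simplifying assumptions: the data frames `rec` and `don`, the variable names and the values are all made up, and the two samples share only `age` and `sex`.

```r
# Recipient sample: labour market survey (has wage, lacks n_children)
# All values are hypothetical, for illustration only
rec <- data.frame(age  = c(25, 40, 33, 58),
                  sex  = c("F", "M", "F", "M"),
                  wage = c(2400, 3600, 3100, 4200))

# Donor sample: family structure survey, answered by DIFFERENT people
don <- data.frame(age        = c(27, 35, 61, 43, 30),
                  sex        = c("F", "F", "M", "M", "M"),
                  n_children = c(0, 2, 3, 1, 0))

# For each recipient, pick the donor of the same sex who is closest in age
nearest_donor <- function(i) {
  candidates <- which(don$sex == rec$sex[i])
  candidates[which.min(abs(don$age[candidates] - rec$age[i]))]
}
idx <- vapply(seq_len(nrow(rec)), nearest_donor, integer(1))

# "Fused" dataset: recipient records plus the variable borrowed from donors
fused <- cbind(rec, n_children = don$n_children[idx])
fused
```

The fused dataset looks like a single survey, but `n_children` was never observed for these individuals: it is borrowed from similar-looking people. Any relationship between `wage` and `n_children` in `fused` therefore rests on the assumption that the two are independent once `age` and `sex` are accounted for, which is a strong assumption and the reason the literature on the topic exists.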
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.