“Fast, cheap, correct: Pick two.” Does this phrase apply to statistical matching algorithms? In the case of Optmatch, you can have all three. “Cheap” is easy: it is open source. You can download it for free. Today I’m going to explain how to make the matching process both faster and more substantively relevant using a technique we call “pre-stratification”: splitting your data into smaller matching problems.
The Problem
Ben Hansen and I often receive messages from Optmatch users of the form: “I have a very large matching problem, and Optmatch is taking a very, very long time to complete. Is there anything I can do?” For example, say that you have some data of the form:
> fake.data <- data.frame(z = rep(c(1,0), 3000), x = runif(6000))
Here z is an indicator of whether the unit received “treatment” and x is a covariate you wish to match on (or a summary of covariates, such as a propensity score). Invoking Optmatch on these data directly could take a long time. I don’t suggest you try it, but it might look something like this:
> library(optmatch)
> my.matching <- pairmatch(mdist(z ~ x, data = fake.data))
In this code, mdist prepares a treated-by-control matrix in which each entry is the Mahalanobis distance between one treated unit and one control unit. The pairmatch function then finds the best set of treatment-control pairs, minimizing the average distance within pairs. We’ll see another example of mdist below, and more examples of both functions are in the online documentation (e.g. ?mdist).
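To make the treated-by-control structure concrete, the following base R sketch builds such a matrix by hand for a tiny data set. The data frame small and the scaling by the overall standard deviation of x are assumptions made for illustration; mdist may standardize differently, so treat this as a picture of the matrix's shape rather than a reproduction of its output.

```r
# Illustrative only: a hand-built treated-by-control distance matrix.
# `small` is a hypothetical data set, not part of the original example.
set.seed(42)
small <- data.frame(z = c(1, 1, 1, 0, 0, 0, 0), x = runif(7))

treated <- small$x[small$z == 1]
control <- small$x[small$z == 0]

# With a single covariate, the Mahalanobis distance reduces to the absolute
# difference divided by a standard deviation. Assumption: the overall
# standard deviation of x is used as the scale.
dist.matrix <- abs(outer(treated, control, "-")) / sd(small$x)

# Label rows (treated units) and columns (control units) by row number.
dimnames(dist.matrix) <- list(treated = as.character(which(small$z == 1)),
                              control = as.character(which(small$z == 0)))

dim(dist.matrix)  # one row per treated unit, one column per control unit
```

Each row holds the distances from one treated unit to every control, which is why the matrix for 3000 treated and 3000 control units becomes so large.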
Stratification
Of course, this example will be slow. You are telling Optmatch to compare 3000 treated units with 3000 control units, which means 9 million candidate pairs: a very large search space. We recommend limiting the comparisons of treated and control units based on another covariate, stratifying the data into smaller subgroups prior to matching. Usually the best choice is a categorical variable with substantive meaning.
For example, say you have a continuous covariate (x), a treatment indicator (z), and a 0/1 indicator for male or female (gender):
> fake.data <- data.frame(z = rep(c(1,0), 3000), x = runif(6000), gender = rep(c(0, 1), each = 1500, times = 2))
Perhaps previous studies indicate gender to be an important determinant of whether subjects self-select into treatment (you can check this by comparing the numbers of male and female subjects in the treatment and control groups). Limiting matches to same-gender pairs will likely improve the quality of your matches (as compared to ignoring gender) and will also speed up the matching process.
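One simple way to run that check is a cross-tabulation of gender against treatment. The sketch below uses base R only and rebuilds the fake data so it stands alone; in the simulated data the groups are balanced by construction, so the interesting case is your own data.

```r
# Rebuild the simulated data so this snippet is self-contained.
set.seed(1)
fake.data <- data.frame(z = rep(c(1, 0), 3000),
                        x = runif(6000),
                        gender = rep(c(0, 1), each = 1500, times = 2))

# Counts of control (z = 0) and treated (z = 1) units within each gender.
gender.by.z <- table(gender = fake.data$gender, z = fake.data$z)
gender.by.z

# Share treated within each gender; very different shares would suggest
# that gender is related to treatment selection.
share.treated <- prop.table(gender.by.z, margin = 1)[, "1"]
share.treated
```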
Using mdist again, we can create a set of distances for male subjects (treated and control) and, separately, for female subjects by indicating that gender should be a stratifying variable. The distances can then be fed to pairmatch:
> distances <- mdist(z ~ x | gender, data = fake.data)
> my.matches <- pairmatch(distances)
> summary(my.matches)
Structure of matched sets:
 1:1
3000
Effective Sample Size:  3000
(equivalent number of matched pairs).

sum(matched.distances)=4.33
(within 5.01 of optimum).
Percentiles of matched distances:
      0%      50%      95%     100%
5.83e-10 8.88e-04 4.75e-03 8.96e-03
Watching the R process as this ran, I saw it use about 200 MB of RAM to compute the distances and the pair match, and it took only a few seconds. By comparison, the 3000 by 3000 matching task took all my available memory (forcing other apps to be pushed to swap) and did not complete in the 10 minutes I allowed it to run.
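Part of the difference comes from the sheer number of treated-control pairs that have to be considered. The base R sketch below counts those pairs with and without stratifying on gender. The helper count_pairs is hypothetical (it is not an Optmatch function), and the pair count is only a rough proxy for running time and memory, not a prediction of either.

```r
# Rebuild the simulated data so this snippet is self-contained.
set.seed(1)
fake.data <- data.frame(z = rep(c(1, 0), 3000),
                        x = runif(6000),
                        gender = rep(c(0, 1), each = 1500, times = 2))

# Hypothetical helper: number of treated-by-control candidate pairs,
# overall (strata = NULL) or within each level of a stratifying variable.
count_pairs <- function(z, strata = NULL) {
  if (is.null(strata)) strata <- rep("all", length(z))
  counts <- table(strata, z)  # rows: strata; columns: z = 0 and z = 1
  as.numeric(counts[, "1"]) * counts[, "0"]
}

unstratified <- count_pairs(fake.data$z)
stratified <- count_pairs(fake.data$z, fake.data$gender)

sum(unstratified)  # pairs in the single large problem
sum(stratified)    # pairs summed over the two smaller problems
```

With two equally sized strata the total number of pairs is halved, and each individual problem is a quarter of the original size. A stratifying variable with more levels shrinks the problems further.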
Propensity Scores
Clearly, stratification improves the execution time of matches. I’ve also found that stratified matches do very well compared to propensity score models that include the stratification variable. In other words, if I were to build a propensity score model of treatment and include gender, I could do so in this fashion:
> match.model <- glm(z ~ x + gender, data = fake.data, family = binomial)
> my.matches <- pairmatch(mdist(match.model))
But like the original method, this would require searching a 3000 by 3000 entry matrix. Again, the faster way is to include the stratifying variable in the propensity model and also use it directly in the match:
> match.model <- glm(z ~ x + gender, data = fake.data, family = binomial)
> my.matches <- pairmatch(mdist(match.model, structure.fmla = ~ gender))
Like the stratification above using Mahalanobis distance, the propensity score example completes much more quickly when gender is used to stratify the data before the match.
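For intuition about what a stratified propensity score distance looks like, here is a minimal base R sketch on a smaller simulated data set. It assumes the distance is the absolute difference in the fitted linear predictor (logit scale) with no further rescaling, which may not match what mdist computes internally. The object by.gender and the reduced sample size are introduced here for illustration only.

```r
# Smaller simulated data with the same structure as the example above.
set.seed(7)
fake.data <- data.frame(z = rep(c(1, 0), 300),
                        x = runif(600),
                        gender = rep(c(0, 1), each = 150, times = 2))

# Propensity score model that includes the stratifying variable.
match.model <- glm(z ~ x + gender, data = fake.data, family = binomial)

# Fitted linear predictor (log odds of treatment) for every unit.
fake.data$score <- predict(match.model, type = "link")

# One treated-by-control matrix per gender, instead of a single matrix
# covering all treated and all control units.
by.gender <- lapply(split(fake.data, fake.data$gender), function(d) {
  abs(outer(d$score[d$z == 1], d$score[d$z == 0], "-"))
})

sapply(by.gender, dim)  # rows and columns of each per-gender matrix
```

Each element of by.gender is a separate, smaller matching problem, which is the same idea as supplying gender as the stratifying variable in the match.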
Conclusion
While I’ve been arguing from a speed perspective so far, I also think that stratification improves the substantive quality of matches. Variables with strong theoretical importance, or that previous studies have shown to be strong predictors of treatment selection, make good stratifying variables because they strengthen the rationale for the matching strategy. Even readers unfamiliar with the matching literature understand that stratifying limits comparisons to comparable units. Ultimately, convincing others that a particular matching strategy allows for valid causal inference is a matter of rhetoric. Stratification can be another tool in creating believable matching scenarios.
The rhetorical aspect of matching can also be improved by quantitative analysis of the match quality, specifically balance testing. I’ve written on testing balance on this website before. The Optmatch and RItools documentation provide more examples of matching and balance testing strategies. Once you have sped up the matching process using stratification, it is easy to compare balance across many different matching strategies to find the one that best fits your data. This again is an example of faster matching providing higher quality results. “Fast, cheap, correct”: Optmatch has them all.