Site icon R-bloggers

Designing and Analyzing Studies with Optmatch and RItools (Part 1)

[This article was first published on Mark M. Fredrickson, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am currently writing a brief “how-to” for the APSA Section on Experimental Research newsletter on using Optmatch and RItools. The complete paper (a work in progress) can be found on my github page. I have the basics of the paper sketched in, but I would love to get feedback from the online R community, so I will be releasing the paper in several installments. Part 1: Creating fake data and blocking.

Data

Let us begin by creating some data in the style of the potential outcomes framework. Let U be all meaningful covariates related to the outcomes Y_c and Y_t. We observe X, but do not observe W, partitions of U. The covariates are a mix of discrete and continuous random variables.

n <- 100
x1 <- rbinom(n, 10, 0.25)
x2 <- rbinom(n, 1, 0.6)
x3 <- rnorm(n, 50, 10)
x4 <- rnorm(n, 0, 1)
x5 <- runif(n, 0, 100)
w1 <- rnorm(n, 0, 1)
w2 <- rbinom(n, 1, 0.1)
w3 <- runif(n, 0, 1)
X <- data.frame(x1, x2, x3, x4, x5)
W <- data.frame(w1, w2, w3)

The outcome Y is a continuous measure that is a function of the covariates and the treatment indicator. We first create Y_c from the covariates, and Y_t is simply Y_c + tau, where tau is the treatment effect.

tau <- 10
yc <- 0.25 * x1 + 4 * x2 + exp(x4) + x5 + 10 * w1 * w2 - w3 * 
    x3
yt <- yc + tau

Blocking and Randomization

To implement blocking, we use the matching procedures in the Optmatch package for R. Optmatch implements a procedure known as “optimal full matching” that minimizes the average distance between matched sets. Optmatch was designed with observational studies in mind, where the researcher has discovered “treatment” and “control” groups. Optmatch will then find matches between similar treated and control units. This strategy is known as “bipartite matching.” For more on matching (and using Optmatch in an observational study) see Rosenbaum (2010).

In our situation, we do not have an existing randomization vector for our data, but we still wish to create similar subsets of our data. Therefore we need to create the two partitions of the data that Optmatch will use. The most straightforward way to create the splitting vector is to do so randomly.

s <- vector("logical", n)
s[sample.int(n, n/2)] <- T

To create the blocks, we use the pairmatch function. pairmatch will create matches with one observation from each random set. Optmatch allows tuning the number of observations allowed from each random set. See the documentation for fullmatch for more

We need to specify a distance matrix between observations, and we can use the convenience function mdist to create a distance matrix based on the Malhanobis distance between observations.

blocks.all <- pairmatch(mdist(s ~ x1 + x2 + x3 + x4 + x5, 
  data = cbind(s, X)))

For reasons of convenience or theoretical importance, we may wish to privilege certain variables and force the matching within levels of those variables. For example, if units are clustered within a geographic unit — cities within a state — we can limit matches to within the state. This is also a useful technique when matching large numbers of subjects (see my website for more details on speeding up the matching process). To limit matches within blocks, we specify a factor indicating unit membership. In our case, let us just match within the binary variable x2. Prior to doing so, we will create a new split that places 50% of each level in the partitions.

count.x2.1 <- sum(x2)
X.ordered <- X[order(x2), ]
s.x2.0 <- sample.int((n - count.x2.1), (n - count.x2.1)/2)
s.x2.1 <- sample.int(count.x2.1, count.x2.1/2)
s.x2 <- vector("logical", n)
s.x2[c(s.x2.0, s.x2.1 + (n - count.x2.1))] <- T
blocks.x2 <- pairmatch(mdist(s ~ x1 + x3 + x4 + x5 | x2, data = cbind(s = s.x2, 
    X.ordered)))

For simplicity, we will continue with the single stratum blocking, but splitting up matching problems into smaller blocks is a very useful technique to have at your disposal. Once we have blocks, we can then randomize within the blocks. As we used a pair-matching strategy, we will randomize to two treatment levels, call them “treatment” and “control.” Since each observation is matched to one other we have n/2 = 50 blocks. For each block, we can flip a coin and assign either the first or second unit to the treatment condition.

tmp <- rbinom(n/2, 1, p = 0.5)
z <- vector("logical", n)
for (i in 1:(n/2)) {
    if (tmp[i] == 1) {
        z[i * 2 - 1] <- T
    }
    else {
        z[i * 2] <- T
    }
}

As our last manipulation to the data, create a variable that is the observed outcome Y_c if z = 0 and Y_t if z = 1. For illustration purposes later in the document, I also create a randomization and outcome that ignores the blocking structure.

all.data <- cbind(X, z, b = blocks.all)
all.data$y <- ifelse(z, yt, yc)
tmp <- vector("logical", n)
tmp[sample.int(n, n/2)] <- T
all.data$z.unblocked <- tmp
all.data$y.unblocked <- ifelse(tmp, yt, yc)

Preview

In the next post we will look at testing balance (similar distribution of covariates in the treatment and control groups) and analyzing blocked experiments from a randomization inference perspective.

To leave a comment for the author, please follow the link and comment on their blog: Mark M. Fredrickson.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.