Question and Answer: Generating Binary and Discrete Response Data
[This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers.]
I was recently contacted by a reader with two very specific questions, and I thought this would be a good topic to respond to publicly. He would like to simulate his data:
I have firm-level data, and the model is a discrete choice model in which the main explanatory variable is also a binary choice. My first question is: how can I calibrate the data generation model?
Answer:
This is a fundamental question for any kind of econometric model. How you calibrate your data determines the inherent structure of your data, which in turn determines what method you should use to attempt to recover your parameters. Some data generating processes do not yet have econometric solutions, yet many do.
In general you can calibrate your data by modifying i. the parameters, ii. the distribution of the explanatory variables, or iii. the distribution of the errors.
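As one illustration of option i., the intercept of a probit can be chosen so that the simulated data hit a target share of successes. The sketch below assumes independent standard normal regressors and a standard normal error, in which case the latent index has mean B0 and variance 1 plus the sum of the squared slopes. The names target and slopes, and the 30% target itself, are illustrative assumptions of mine.

```r
set.seed(101)                      # arbitrary seed, for reproducibility only
Nobs   <- 10^5
target <- 0.30                     # hypothetical target share of Y = 1
slopes <- c(B1 = -0.1, B2 = 0, B3 = -0.2)

# With X1..X3 ~ N(0,1) independent and a N(0,1) error, the latent index
# has variance 1 + sum(slopes^2), so
#   P(Y = 1) = pnorm(B0 / sqrt(1 + sum(slopes^2)))
# Solving for B0 gives the intercept that matches the target share.
B0 <- qnorm(target) * sqrt(1 + sum(slopes^2))

X <- cbind(cons = 1, X1 = rnorm(Nobs), X2 = rnorm(Nobs), X3 = rnorm(Nobs))
B <- c(B0 = B0, slopes)
P <- pnorm(X %*% B)                # success probability for each observation
Y <- rbinom(Nobs, 1, P)            # binary outcome

mean(Y)                            # should be close to target
```

Changing the distribution of the regressors (option ii.) would change the variance term in the formula, so the closed form above only holds under the stated assumptions.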
In the binary response case the most common models are the probit and the logit. To simulate data for either one, you generate the underlying linear index and pass it through the appropriate CDF (the normal for the probit, the logistic for the logit), which gives you the probability of a success for each observation. Finally, you make a random draw based on those probabilities for each outcome being simulated.
I have several code examples demonstrating this:
Stata: (Reverse Engineering a Probit) (Probit vs Logit)
R:
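Nobs <- 10^4
# Design matrix: a constant and three standard normal regressors
X <- cbind(cons=1, X1=rnorm(Nobs),X2=rnorm(Nobs),X3=rnorm(Nobs))
# True coefficients
B <- c(B0=-.2, B1=-.1,B2=0,B3=-.2)
# Probability of success from the normal CDF (probit)
P <- pnorm(X%*%B)
# Draw the binary outcome and combine it with the regressors
SData <- as.data.frame(cbind(Y=rbinom(Nobs,1,P), X))
# Estimate the probit; the estimates should be close to B
summary(glm(Y ~ X1 + X2 + X3, family = binomial(link = "probit"), data = SData))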
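The reader's setting has a binary main explanatory variable. A minimal sketch of that case, written in the equivalent latent-variable form (Y = 1 when the index plus a standard normal error is positive), might look like the following. The variable D, its 40% incidence, and the coefficient values are assumptions made purely for illustration. D is drawn independently of the error here, so it is exogenous by construction.

```r
set.seed(202)                      # arbitrary seed, for reproducibility only
Nobs <- 10^4
D  <- rbinom(Nobs, 1, 0.4)         # hypothetical binary explanatory variable
X1 <- rnorm(Nobs)                  # a continuous control
B  <- c(B0 = -0.2, BD = 0.5, B1 = -0.1)   # illustrative coefficients

# Latent index plus a standard normal error; Y = 1 when the sum is positive.
# This is equivalent to drawing Y with rbinom() and probability pnorm(index).
index <- B[["B0"]] + B[["BD"]] * D + B[["B1"]] * X1
Ystar <- index + rnorm(Nobs)
Y     <- as.numeric(Ystar > 0)

SDataD <- data.frame(Y = Y, D = D, X1 = X1)
fit <- glm(Y ~ D + X1, family = binomial(link = "probit"), data = SDataD)
coef(fit)                          # estimates should be near B
```

Replacing rnorm(Nobs) with rlogis(Nobs) in the error draw, and the probit link with the logit link, gives the logit version of the same exercise. If the binary regressor is itself the outcome of a choice that is correlated with the error, a plain probit would no longer be appropriate, and this sketch does not cover that case.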
Discrete Data
As for discrete data with more than two outcomes, it is less clear what the optimal choice is. I prefer the multinomial logistic regression, which is basically an extension of the logit model to several outcomes, with a few interesting caveats.
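For reference, the standard multinomial logit probabilities, with the first outcome as the base category (its coefficient vector fixed at zero), can be written as follows. The symbols x_i, beta_j and J are my own notation rather than the original post's; in the R code for this section, num and den play the roles of the numerator and denominator.

```latex
P(Y_i = j \mid x_i) = \frac{\exp(x_i \beta_j)}{\sum_{k=0}^{J} \exp(x_i \beta_k)},
\qquad j = 0, 1, \dots, J, \qquad \beta_0 = 0 .
```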
Stata: (Simulating Multinomial Logit)
R: (here is an article dealing specifically with using R to create discrete response data http://works.bepress.com/joseph_hilbe/3/)
Nobs <- 10^4
X <- cbind(cons=1, X1=rnorm(Nobs),X2=rnorm(Nobs),X3=rnorm(Nobs))
# Coefficients, each input vector (c) is associated with a different outcome
B <- cbind(0, c(B0=-.2, B1=-.1,B2=0,B3=-.2), c(B0=.3, B1=0,B2=.6,B3=.4))
# Everything is relative to the first outcome (Y=0), the base category, whose coefficients are fixed at 0
num <- exp(X%*%B) # Numerator
den <- apply(num,1,sum) # Denominator
P <- num * 1/cbind(den,den,den) # Probability
CP <- cbind(P[,1],P[,1]+P[,2]) # Cumulative probabilities
U <- runif(Nobs) # Draw from the uniform distribution
Y <- rep(0,Nobs) ; Y[U>CP[,1]]<-1; Y[U>CP[,2]]<-2 # Calculate outcome
SData <- as.data.frame(cbind(Y=Y, X)) # Combine data
library(nnet) # Provides multinom()
summary(Mlogit <- multinom(Y ~ X1 + X2 + X3, data = SData))
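An equivalent way to draw the outcomes, sketched below, is the random utility form of the multinomial logit: each option receives its index plus an independent standard Gumbel (type I extreme value) error, and the option with the highest utility is chosen. This is an alternative to the cumulative-probability draw in the code for this section, not the code from the original post, and the names V, E and Yalt are hypothetical.

```r
set.seed(303)                      # arbitrary seed, for reproducibility only
Nobs <- 10^5
X <- cbind(cons = 1, X1 = rnorm(Nobs), X2 = rnorm(Nobs), X3 = rnorm(Nobs))
# Same coefficient layout as in the post: first column is the base outcome
B <- cbind(0, c(-0.2, -0.1, 0, -0.2), c(0.3, 0, 0.6, 0.4))

V <- X %*% B                                        # systematic utility per option
E <- matrix(-log(-log(runif(Nobs * 3))), Nobs, 3)   # standard Gumbel errors
Yalt <- max.col(V + E) - 1                          # chosen option, coded 0, 1, 2

# Implied multinomial logit probabilities, for comparison
P <- exp(V) / rowSums(exp(V))
shares <- tabulate(Yalt + 1, nbins = 3) / Nobs      # simulated outcome shares
rbind(simulated = shares, implied = colMeans(P))
```

The two rows of the final table should agree closely. The inverse-CDF draw used in the post is cheaper, while the utility form makes it easier to experiment with other error distributions.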