Last week I posted about using simulation metamodeling to verify the results of an analytical solution of a model. After publishing it I realized that the solution presented there can be improved by using knowledge of the simulation model's structure.
The story is about calculating the variance of the random variable given by the formula:
q(p-2+q/N)
where p and N are model parameters and q is a random variable with a binomial distribution with N Bernoulli trials and success probability 2-p. As I wrote there, the properly specified formula for the metamodel has the following form:
v ~ (p + I(p ^ 2) + I(p ^ 3) + I(p ^ 4)) * (N + I(1 /N))
where v is an estimate of the variance. This model was discussed in more detail in my last post, so I omit the repetition here.
However, one can easily notice that for p=1 and p=2 the variance of the random variable is equal to 0. In these cases the success probability 2-p equals 1 and 0 respectively, so q is constant (equal to N and 0) and has no variance.
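This degeneracy is easy to confirm in R before fitting anything (a minimal sketch; the value of N and the number of draws are arbitrary):
N <- 5
q1 <- rbinom(10, N, 2 - 1)  # p = 1: every draw equals N
q2 <- rbinom(10, N, 2 - 2)  # p = 2: every draw equals 0
var(q1 * (1 - 2 + q1 / N))  # 0, the variable is constant
var(q2 * (2 - 2 + q2 / N))  # 0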
In estimating the metamodel this can be taken into account in two ways:
- the weight of the observations where p equals 1 or 2 can be increased (so we use Weighted Least Squares);
- the predictions at the points where p equals 1 or 2 can be constrained to be exactly 0 (so we use Constrained Least Squares).
The code given below estimates the original model and the two corrected versions:
library(plyr)
library(limSolve)

# simulate 1000 draws of the random variable and estimate its variance
sim <- function(param) {
  q <- rbinom(1000, param$N, 2 - param$p)
  c(v = var((param$p - 2 + q / param$N) * q))
}

set.seed(1)
data.set <- expand.grid(p = seq(1, 2, len = 101), N = 1:10)
data.set <- ddply(data.set, .(p, N), sim, .progress = "text")

# ordinary least squares
lm.coef <- coef(lm(v ~ (p + I(p ^ 2) + I(p ^ 3) + I(p ^ 4))
              * (N + I(1 / N)), data = data.set))

# weighted least squares: large weight on the zero-variance cases
weights <- ifelse(data.set$v == 0, 100000, 1)
wlm.coef <- coef(lm(v ~ (p + I(p ^ 2) + I(p ^ 3) + I(p ^ 4))
                 * (N + I(1 / N)), data = data.set,
                 weights = weights))

# constrained least squares: force an exact fit at the zero-variance points
A <- with(data.set, cbind(1, p, p ^ 2, p ^ 3, p ^ 4, N, 1 / N,
  p * N, p / N, p ^ 2 * N, p ^ 2 / N,
  p ^ 3 * N, p ^ 3 / N, p ^ 4 * N, p ^ 4 / N))
colnames(A) <- names(lm.coef)
B <- data.set$v
E <- A[B == 0, ]
F <- B[B == 0]
lsei.coef <- lsei(A, B, E, F)$X
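As a side note, the matrix A could also be produced by model.matrix instead of being typed by hand. This is only a sketch; it assumes that model.matrix orders the interaction columns the same way as the manual cbind above, which the last line checks directly:
A2 <- model.matrix(v ~ (p + I(p ^ 2) + I(p ^ 3) + I(p ^ 4))
                   * (N + I(1 / N)), data = data.set)
max(abs(A2 - A))  # 0 if the column orders agree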
Now we can compare the coefficients from these models to their analytical values given in the last post:
true.coef <- c(32, -88, 88, -38, 6, -8, -26, 20,
               75, -18, -79, 7, 36, -1, -6)
coef.dev <- cbind(lm.coef, wlm.coef, lsei.coef) - true.coef
print(coef.dev, digits = 2)
dotchart(coef.dev[, 1], pch = 19)
points(coef.dev[, 2], 1:nrow(coef.dev), col = "red", pch = 19)
abline(v = 0, col = "gray", lty = 2)
The code first tabulates the deviations of the estimates from the true values:
              lm.coef wlm.coef lsei.coef
(Intercept)   -0.6056    0.060     0.060
p              1.5061   -0.235    -0.235
I(p^2)        -1.3576    0.323     0.323
I(p^3)         0.5243   -0.185    -0.185
I(p^4)        -0.0731    0.038     0.038
N              0.0088   -0.060    -0.060
I(1/N)         1.1510    0.354     0.354
p:N           -0.0065    0.173     0.173
p:I(1/N)      -3.0646   -0.965    -0.965
I(p^2):N      -0.0102   -0.182    -0.182
I(p^2):I(1/N)  2.9942    0.953     0.953
I(p^3):N       0.0115    0.084     0.084
I(p^3):I(1/N) -1.2730   -0.404    -0.404
I(p^4):N      -0.0029   -0.014    -0.014
I(p^4):I(1/N)  0.1991    0.062     0.062
and next plots a visualization of the differences (black points represent the OLS estimates and red points the WLS estimates).
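For readability one could also add a legend to the chart (a minimal sketch extending the plotting code above):
legend("topright", legend = c("OLS", "WLS"), col = c("black", "red"), pch = 19)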
It can be seen that:
- the precision of estimation is improved by using the additional information about the problem structure;
- the estimates from WLS and CLS are identical to the third decimal place.
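The second point can be checked directly (a sketch; the exact value depends on the simulation seed):
max(abs(wlm.coef - lsei.coef))  # small, consistent with the table above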