Post 1: A Bayesian 2PL IRT model
In this post, we define the Two-Parameter Logistic (2PL) IRT model, derive the complete conditionals that will form the basis of the sampler, and discuss our choice of prior specification.
2PL IRT Model specification
\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}
\newcommand{\Egiven}[2]{\mathbb{E}\left[\left. #1 \right| #2\right]}
\newcommand{\Var}[1]{\text{Var}\left[#1\right]}
\newcommand{\Vargiven}[2]{\text{Var}\left[\left. #1 \right| #2\right]}

Let $U_{pi}$ denote the scored (0/1) response of person $p = 1, \dots, P$ to item $i = 1, \dots, I$. The 2PL IRT model is
\begin{aligned}
U_{pi} & \stackrel{indep}{\sim} \text{Bernoulli}(\pi_{pi}) \\
\ln \frac{\pi_{pi}}{1 - \pi_{pi}} & = a_i(\theta_p - b_i) \label{eq:pipi}
\end{aligned}
where $\theta_p$ is the ability of person $p$, $a_i$ is the discrimination of item $i$, and $b_i$ is the difficulty of item $i$.
This model becomes a Bayesian model when we place prior distributions on the item and person parameters. In the chapter and this supplement, we use the following prior specification
\begin{aligned}
\theta_p & \stackrel{iid}{\sim} \text{Normal}(0, \sigma_\theta^2) \\
\sigma_\theta^2 & \sim \text{Inverse-Gamma}(\alpha_\theta, \beta_\theta) \\
a_i & \stackrel{iid}{\sim} \text{Log-Normal}(\mu_a, \sigma^2_a) \\
b_i & \stackrel{iid}{\sim} \text{Normal}(0, \sigma_b^2)
\end{aligned}
where $\alpha_\theta$, $\beta_\theta$, $\mu_a$, $\sigma^2_a$, and $\sigma^2_b$ are constants that we choose below in the prior specification section.
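It can help to see the generative model in code before deriving the sampler. The following sketch is not part of the chapter or this supplement; the particular hyperparameter values, seed, and object names are illustrative assumptions only.

## A minimal sketch of simulating a response matrix from the 2PL model.
set.seed(314)
P <- 100   # number of persons
I <- 20    # number of items

sigma2.theta <- 1.5                                   # pretend this were drawn from the Inverse-Gamma prior
theta <- rnorm(P, mean = 0, sd = sqrt(sigma2.theta))  # person abilities
a     <- rlnorm(I, meanlog = 0, sdlog = 0.5)          # item discriminations
b     <- rnorm(I, mean = 0, sd = 1)                   # item difficulties

## pi[p, i] = logit^{-1}( a_i * (theta_p - b_i) )
eta   <- outer(theta, b, "-") * matrix(a, nrow = P, ncol = I, byrow = TRUE)
pi.pi <- plogis(eta)

## U[p, i] ~ Bernoulli(pi[p, i])
U <- matrix(rbinom(P * I, size = 1, prob = pi.pi), nrow = P, ncol = I)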
Derivation of the complete conditionals
The idea of a Gibbs sampler is that we can sample from the joint-posterior by iteratively sampling from conditional posteriors. In this section, we derive the conditional posteriors, which are commonly referred to as complete conditionals because they are conditional on all other parameters and data in the model.
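Schematically, one sweep of such a sampler might be organized as below. This is only a pseudo-code sketch, not the sampler built in later posts; the draw.* functions are hypothetical placeholders for the complete-conditional updates derived in this section, with hyperparameters omitted for brevity.

## One Gibbs sweep: each block is drawn from its complete conditional
## while all other blocks are held at their current values.
gibbs.sweep <- function(state, U) {
  state$theta        <- draw.theta(U, state$a, state$b, state$sigma2.theta)
  state$sigma2.theta <- draw.sigma2.theta(state$theta)
  state$a            <- draw.a(U, state$theta, state$b)
  state$b            <- draw.b(U, state$theta, state$a)
  state
}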
We begin with the likelihood function. Since each response $U_{pi}$ is a conditionally independent Bernoulli($\pi_{pi}$) draw, the likelihood function is:
\begin{equation*}
f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b) = \prod_{p=1}^P \prod_{i=1}^I \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}}
\end{equation*}
where each $\pi_{pi}$ depends on $\theta_p$, $a_i$, and $b_i$ through Equation \ref{eq:pipi}.
By applying Bayes' rule and using the conditional independence relationships in the model, we find that the joint posterior is proportional to:
\begin{equation}
f(\boldsymbol\theta, \sigma^2_\theta, \boldsymbol a, \boldsymbol b | \boldsymbol{U}) \propto f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b) \cdot f(\boldsymbol\theta | \sigma^2_\theta) \cdot f(\sigma^2_\theta) \cdot f(\boldsymbol a) \cdot f(\boldsymbol b)
\label{eq:post}
\end{equation}
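It can be handy to evaluate the right-hand side of Equation \ref{eq:post} on the log scale, for example inside Metropolis steps. The function below is a sketch under the priors above, not code from the chapter; the argument names are assumptions.

## Unnormalized log joint posterior: log-likelihood plus log-priors.
## U is the P-by-I response matrix; hyperparameters are passed as constants.
log.joint.posterior <- function(theta, sigma2.theta, a, b, U,
                                alpha.theta, beta.theta, mu.a, sigma2.a, sigma2.b) {
  eta     <- outer(theta, b, "-") * matrix(a, nrow = nrow(U), ncol = ncol(U), byrow = TRUE)
  log.lik <- sum(dbinom(U, size = 1, prob = plogis(eta), log = TRUE))

  log.prior <-
    sum(dnorm(theta, mean = 0, sd = sqrt(sigma2.theta), log = TRUE)) +
    ## Inverse-Gamma(alpha, beta) log-density for sigma2.theta, written out directly
    (alpha.theta * log(beta.theta) - lgamma(alpha.theta) -
       (alpha.theta + 1) * log(sigma2.theta) - beta.theta / sigma2.theta) +
    sum(dlnorm(a, meanlog = mu.a, sdlog = sqrt(sigma2.a), log = TRUE)) +
    sum(dnorm(b, mean = 0, sd = sqrt(sigma2.b), log = TRUE))

  log.lik + log.prior
}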
The complete conditionals are then essentially read off of Equation \ref{eq:post} by dropping any factor that does not contain the parameter of interest. To illustrate this technique, we derive the complete conditional for $\boldsymbol\theta$ formally:
\begin{aligned}
f(\boldsymbol\theta | \sigma^2_\theta, \boldsymbol a, \boldsymbol b, \boldsymbol{U}) & \propto
\frac{ f(\boldsymbol\theta, \sigma^2_\theta, \boldsymbol a, \boldsymbol b | \boldsymbol{U}) }
     { f(\sigma^2_\theta, \boldsymbol a, \boldsymbol b) } \\
& \propto \frac{ f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b) \cdot
                 f(\boldsymbol\theta | \sigma^2_\theta) \cdot
                 f(\sigma^2_\theta) \cdot
                 f(\boldsymbol a) \cdot
                 f(\boldsymbol b) }{
                 f(\sigma^2_\theta) \cdot
                 f(\boldsymbol a) \cdot
                 f(\boldsymbol b)
               } \\
& \propto f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b) \cdot f(\boldsymbol\theta | \sigma^2_\theta)
\end{aligned}
which gives the same result as the heuristic.
Repeating this process, we find the other complete conditionals:
\begin{aligned}
f(\theta_p | \text{rest}) & \propto
  \prod_{i=1}^I
  \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}}
  \; f_\text{Normal}(\theta_p | 0, \sigma_\theta^2) ~, \\
f(\sigma^2_\theta | \text{rest}) & \propto
  \left[ \prod_{p=1}^P f_\text{Normal}(\theta_p | 0, \sigma_\theta^2) \right]
  f_\text{Inverse-Gamma}\left(\sigma^2_\theta \left|
    \alpha_\theta,
    \beta_\theta \right.
  \right)
  \\
& \propto
  f_\text{Inverse-Gamma}\left(\sigma^2_\theta \left|
    \alpha_\theta + \frac{P}{2},
    \beta_\theta + \frac{1}{2} \sum_{p=1}^P \theta^2_p \right.
  \right) \\
f(a_i | \text{rest}) & \propto
  \prod_{p=1}^P
  \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}}
  \; f_\text{Log-Normal}(a_i | \mu_a, \sigma_a^2) ~, \\
f(b_i | \text{rest}) & \propto
  \prod_{p=1}^P
  \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}}
  \; f_\text{Normal}(b_i | 0, \sigma_b^2) ~,
\end{aligned}
where $\text{rest}$ stands in for the conditioning variables, $\pi_{pi}$ is defined in Equation \ref{eq:pipi}, and $f_\text{Name}(X | \cdot)$ represents the density of the random variable named $X$, which has a $\text{Name}$ distribution.
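Unlike the $\sigma^2_\theta$ conditional, the conditionals for $\theta_p$, $a_i$, and $b_i$ are not standard distributions, so a sampler typically only needs to evaluate them up to a constant. Below is a minimal sketch for a single $\theta_p$, not code from the post; u.p is an assumed name for the vector of person $p$'s responses to the $I$ items.

## Log of the complete conditional for theta_p, up to an additive constant:
## Bernoulli log-likelihood of person p's responses plus the Normal log-prior.
log.cc.theta.p <- function(theta.p, u.p, a, b, sigma2.theta) {
  eta <- a * (theta.p - b)                            # a_i * (theta_p - b_i)
  sum(dbinom(u.p, size = 1, prob = plogis(eta), log = TRUE)) +
    dnorm(theta.p, mean = 0, sd = sqrt(sigma2.theta), log = TRUE)
}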
Details of prior specification
Person ability
We model the ability of person $p$ as an iid draw from a normal distribution with a zero mean and an unknown variance $\sigma^2_\theta$:
\begin{equation*}
\theta_p \stackrel{iid}{\sim} \text{Normal}\left(0, \sigma^2_\theta\right) \quad .
\end{equation*}
Although this normal distribution fulfills the mathematical role of a prior for $\theta_p$, its interpretation in IRT is as the population distribution of person ability. Therefore, we are interested in estimating the value of $\sigma^2_\theta$ from the data.
One approach to estimating $\sigma^2_\theta$ is to build a new layer of the model hierarchy so that our sampler will also give us output for $\sigma^2_\theta$, which we can then use to draw inferences. To do this, we give $\sigma^2_\theta$ a prior. A computationally convenient choice is the conjugate prior for the variance of a normal distribution, the inverse-gamma distribution:
\begin{equation*}
\sigma^2_\theta \sim \text{Inverse-Gamma}\left(\alpha_\theta, \beta_\theta\right) \quad
\end{equation*}
where the $\alpha_\theta$ and $\beta_\theta$ parameters are the inverse-gamma distribution's shape and scale parameters, respectively. Since the prior is conjugate, the posterior is also inverse-gamma:
\begin{equation*}
\sigma^2_\theta \left| \boldsymbol\theta \right. \sim
\text{Inverse-Gamma}\left(
  \alpha_\theta + \frac{P}{2},
  \beta_\theta + \frac{1}{2} \sum_{p=1}^P \theta^2_p
\right)
\end{equation*}
where $P$ is the total number of persons.
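Because R has no built-in inverse-gamma sampler, a common trick is to draw from a gamma distribution and invert: if $X \sim \text{Gamma}(\text{shape}, \text{rate})$ then $1/X \sim \text{Inverse-Gamma}(\text{shape}, \text{scale} = \text{rate})$. A sketch of the resulting update is below; the function name is an assumption, not from the post.

## Draw sigma^2_theta from its conjugate complete conditional.
draw.sigma2.theta <- function(theta, alpha.theta, beta.theta) {
  P     <- length(theta)
  shape <- alpha.theta + P / 2
  rate  <- beta.theta + sum(theta^2) / 2   # the Inverse-Gamma scale parameter
  1 / rgamma(1, shape = shape, rate = rate)
}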
To set values for $\alpha_\theta$ and $\beta_\theta$, we consider the effect of those values on the posterior of $\sigma^2_\theta$. Since the posterior is a well known distribution, we can consult the Wikipedia page to find the formula for its mean. The posterior mean is:
\begin{aligned}
\Egiven{\sigma^2_\theta}{\boldsymbol\theta}
& = \frac{\beta_\theta + \frac{1}{2} \sum_{p=1}^P \theta^2_p}{\alpha_\theta + \frac{P}{2} - 1}
\end{aligned}
If $\alpha_\theta$ and $\beta_\theta$ are small compared to $P$, then the posterior mean will be driven almost entirely by the data (via the $\theta_p$ estimates):
\begin{aligned}
\sum_{p=1}^P \theta^2_p & \approx \sigma^2_\theta P &
& \text{and} &
\Egiven{\sigma^2_\theta}{\boldsymbol\theta} & \approx \frac{ \frac{\sigma^2_\theta P}{2} }{ \frac{P}{2} } = \sigma^2_\theta
\end{aligned}
If $\alpha_\theta$ and $\beta_\theta$ are much larger than $P$, then the posterior mean will be approximately
\begin{aligned}
\Egiven{\sigma^2_\theta}{\boldsymbol\theta} & \approx \frac{ \beta_\theta }{ \alpha_\theta - 1} = \E{\sigma^2_\theta}
\end{aligned}
which is the prior mean.
If $\alpha_\theta$ and $\beta_\theta$ are of a similar size as $P$, then the posterior mean will be "shrunk" towards the prior mean. This can be useful for stabilizing estimation or for incorporating prior information about the value of $\sigma^2_\theta$ from past studies. For example, if a past study with $n$ subjects estimated $s$ as the value of $\sigma^2_\theta$, then setting
\begin{aligned}
\alpha_\theta & = \frac{n}{2} &
& \text{and} &
\beta_\theta & = s \cdot \frac{n}{2}
\end{aligned}
will incorporate the prior information on $\sigma^2_\theta$ into our study in a principled way.
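These regimes can be checked numerically. The snippet below is illustrative (the numbers are assumptions, not from the chapter); it computes the posterior mean for a small, a very large, and a comparable choice of $\alpha_\theta$ and $\beta_\theta$.

## Posterior mean of sigma^2_theta as a function of the hyperparameters.
post.mean.sigma2 <- function(theta, alpha.theta, beta.theta) {
  (beta.theta + sum(theta^2) / 2) / (alpha.theta + length(theta) / 2 - 1)
}

set.seed(1)
theta <- rnorm(500, mean = 0, sd = sqrt(2))   # "true" sigma^2_theta = 2

post.mean.sigma2(theta, alpha.theta = 1,      beta.theta = 1)            # ~ 2: data-driven
post.mean.sigma2(theta, alpha.theta = 5000,   beta.theta = 5000)         # ~ 1: prior-driven
post.mean.sigma2(theta, alpha.theta = 100/2,  beta.theta = 1 * 100/2)    # shrunk toward the past estimate s = 1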
For our purposes, we choose small values of $\alpha_\theta$ and $\beta_\theta$. This leads to a "flat" prior for $\sigma^2_\theta$ so that the posterior mean will be approximately unbiased.
Item difficulty
We model the difficulty of item $i$ as an iid draw from a normal distribution with a zero mean and a variance of $\sigma^2_b$:
\begin{equation*}
b_i \stackrel{iid}{\sim} \text{Normal}\left(0, \sigma^2_b\right) \quad .
\end{equation*}
Since items are often regarded as fixed, we are not usually interested in the variance of the item parameters. Therefore we choose a large value of $\sigma^2_b$ so that the prior is flat over typical values.
For our purposes, we choose $\sigma^2_b = 100$, which is flat because it is nearly two orders of magnitude larger than a typical item difficulty parameter.
Item discrimination
We model the discrimination of item $i$ as an iid draw from a log-normal distribution with log-scale mean $\mu_a$ and variance $\sigma^2_a$:
\begin{equation*}
a_i \stackrel{iid}{\sim} \text{Log-Normal}(\mu_a, \sigma^2_a) \quad .
\end{equation*}
Due to the multiplicative nature of $a_i$ in Equation \ref{eq:pipi}, we chose a log-normal distribution for its prior. This is because the log-normal distribution is the limiting distribution for random variables which are multiplied together, in the same way that the normal distribution is the limiting distribution for random variables which are added together.
The analogue of a mean-zero normal distribution is a mode-one log-normal. The mode of a log-normal is given by
\begin{equation*}
\text{mode}\left[ a_i \right] = e^{\mu_a - \sigma^2_a} \quad ,
\end{equation*}
which is equal to 1 when $\mu_a = \sigma^2_a$. The variance of a mode-one log-normal is given by
\begin{aligned}
\Var{a_i} & = \left( e^{\sigma^2_a} - 1 \right) e^{2\mu_a + \sigma^2_a} \\
& = \left( e^{\mu_a} - 1 \right) e^{3\mu_a} \quad
\end{aligned}
which is zero when $\mu_a = 0$ and can be arbitrarily large as $\mu_a$ increases.
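As a quick numeric sanity check (not part of the original post), we can verify that setting $\mu_a = \sigma^2_a$ gives a mode-one log-normal and that its variance matches the expression above:

## Check the mode-one property and the variance formula for mu_a = sigma^2_a.
set.seed(42)
mu.a  <- 1.185
draws <- rlnorm(1e6, meanlog = mu.a, sdlog = sqrt(mu.a))

## The density is maximized at exp(mu_a - sigma^2_a) = exp(0) = 1
optimize(function(x) dlnorm(x, meanlog = mu.a, sdlog = sqrt(mu.a)),
         interval = c(0.01, 10), maximum = TRUE)$maximum
# approximately 1

## Simulated variance vs. the closed-form expression (e^mu - 1) e^(3 mu)
var(draws)
(exp(mu.a) - 1) * exp(3 * mu.a)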
Due to the multiplicative and highly skewed nature of the log-normal distribution, it is somewhat difficult to gain an intuition for which values of $\mu_a$ and $\sigma^2_a$ lead to an informative prior, and which values lead to a flat prior. Our approach is to approximately match the spread of the prior on the item difficulty parameter $b_i$, by matching the upper quantile of the log-normal to a truncated version of the prior for $b_i$.
We can find the appropriate values of $\mu_a$ and $\sigma^2_a$ with the following R code:
require(truncnorm)
## If the above command fails, you must install the
## truncnorm library. To do so, un-comment the install.packages
## line below, run it, and follow the directions it gives.
##
## install.packages('truncnorm')

## We use the quantile function to determine the critical
## value where 95% of the mass of a normal distribution
## truncated at zero (a=0), with a mean of 0 (mean=0) and
## standard deviation of 10 (sd=10) lies.
qtruncnorm(p=.95, mean=0, sd=10, a=0)
# [1] 19.59964

## We used a guess and check method to find the
## corresponding quantile for a mode 1 log-normal distribution
xx = 1.185; qlnorm(.95, xx, sqrt(xx))
# [1] 19.6004
As a visual check, below we compare the CDF of a log-normal with $\mu_a = \sigma^2_a = 1.185$ with the CDF of a truncated version of the prior for $b_i$ (a truncated normal):
Note from the zoomed-in CDF that the log-normal puts much less mass on values near zero. This is expected because for a given constant $c$, we want values of $a_i$ less than $1/c$ to be rare in the same way that values greater than $c$ are rare. Further note that the 95% quantiles of both distributions are equal.
The code to produce the graph is given below. Note that it requires you to have run the previous R code from this page.
## Set a one-row, two-column grid for the graphs
par(mfrow=c(1,2))
## Adjust the margins of the plot
par(mar=c(4,4,4,2))

## Plot the zoomed in CDF of the truncated normal
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)),
       from=0, to=1, main="Zoomed in",
       lty='dotted', lwd=2,
       xlab='quantile', ylab='probability')
## Add the CDF of the log-normal to the graph
curve( plnorm(q=x, meanlog=1.185, sdlog=sqrt(1.185)),
       from=0, to=1, add=TRUE, lwd=2, col='blue')
## Draw an arrow approximately where the two CDFs cross
arrows( x0=0.325, y0=.038, x1=0.45, y1=.038, length=1/8)
## Draw the legend
legend( "topleft", c('truncated b_i prior', 'log-normal'),
        col=c('black','blue'), lwd=2, lty=c('dotted', 'solid'))

## Draw the zoomed out truncated normal CDF in the second panel
## N.B. It draws in the second panel because add=TRUE is missing.
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)),
       from=0, to=60, main="Zoomed out",
       lwd=2, lty='dotted',
       xlab='quantile', ylab='probability')
## Add the CDF of the log-normal to the graph
curve( plnorm(q=x, meanlog=1.185, sdlog=sqrt(1.185)),
       from=0, to=60, add=TRUE, lwd=2, col='blue')
## Draw an arrow approximately where the two CDFs cross
arrows( x0=10, y0=.95, x1=16.6, y1=.95, length=1/8)

## Reset to a 1 by 1 grid for any future graphs
par(mfrow=c(1,1))
For our purposes, we choose $\mu_a = \sigma^2_a = 1.185$ for the prior of the item discrimination parameter because those values approximately match it to the prior for the item difficulty parameter.