Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post focuses on one of the more curious models in contemporary statistics, a specification for proportions that is either called fractional logit or quasi-Binomial. While in most cases in statistics, the evaluation of a model necessarily involves trade-offs in which a model is best applied to certain cases but not others, in this case fractional logit is simply the wrong model. The reason is that fractional logit is missing a normalizing constant, which in essence means that it doesn’t divide by the total to make a true probability. While this distinction may seem esoteric, I use examples in this blog post to indicate why this is a quite serious problem.
Importantly, recent work has resolved this issue with fractional logit and provides alternatives. It turns out based on recent machine learning research that the model can be fixed, resulting in a true statistical distribution, albeit a much more complicated one. Furthermore, there are excellent other alternatives like (ordered) beta regression that are more flexible and straightforward to estimate. For this reason, there is no reason to use fractional logit anymore, hence the title of this post.
I know this will be controversial for some, but I do not see any point in beating around the bush. This model just doesn’t have a clear purpose anymore given known biases and alternatives that are both useful and unbiased.
If you want the code for this post, you can see the full Rmarkdown file here.
The Problem: What a Probability Is
We can first start with the definition of the model, which comes from a 1996 paper by Papke and Wooldridge. Given a linear model \(X\beta\), a link function \(G(\cdot)\) (logit), and an outcome \(y_i\) that can take on any value between 0 and 1, we have a likelihood as follows:
\[ L(y_i) = G(x_i\beta)^{y_i}(1-G(x_i\beta))^{(1 – y_i)} \] To get the intuition behind this model, the definition is in fact the same for the Bernoulli distribution. The only difference is that instead of \(y_i\) being a binary variable with values of 0 or 1, \(y_i\) could be anything from 0 to 1, including 0.2, 0.4, etc. Inserting these continuous values into a discrete distribution is the neat trick that Papke and Wooldridge came up with and labeled fractional for a fraction of 0 and 1. They also showed that it was possible to find a maximum to the likelihood function and obtain estimates for the regression coefficients \(\beta\) on the logit scale.
This model represented a substantial step forward from what was available at the time, mainly OLS. There were plenty of GLMs but not for a fractional or proportional response. No one was using beta regression as that model would not be proposed until 2004. For these reasons, fractional logit became very popular and has almost 5,000 citations according to Google Scholar.
However, this might lead to some head-scratching–sticking continuous data in a discrete distribution is, at a minimum, kind of odd. It turns out, after much additional research, that it is indeed not a great idea. The reason is that the distribution above is not a true probability distribution–it does not integrate to 1.
You may have some vague idea about integration and probability density functions (PDFs) from stats classes, but if you do you likely aren’t entirely sure what it’s necessary. Thinking about integrals isn’t easy for anybody, so instead I’ll use a discrete example.
Imagine that we are considering the probability of college basketball teams winning the March Madness tournament. Right now, it seems the top 3 teams are Gonzaga (#1), Arizona (#2) and Kansas (#3). Let’s examine a simple binary probability as to which of these three of these teams will make it into the final four \(Pr(FF=1)\).
The question is, how do we want to calculate that probability? We could calculate it as being proportional to the number of games they won in the regular season. Given the teams’ conference records, suppose we have the following ranking for most likely to make it to the final four:
- Gonzaga: 10 (W) – 2 (L)
- Arizona: 11 (W) – 4 (L)
- Kansas: 9 (W) – 4 (L)
Gonzaga won the most games compared to losing games. But we might be concerned about the different number of games in different conferences. Gonzaga only played 12 conference games, Arizona played 15, and Kansas played 13. The question is here: which denominator do we use to weight the number of games? Without such a denominator, we don’t have probabilities that sum to 1.
We could consider wins as a share of the total number of games played in their conference season. In that case, we have the following ranking:
- Arizona: 0.3666667
- Gonzaga: 0.3333333
- Kansas: 0.3
In this case, Arizona is first, Gonzaga is second and Kansas is third. To only consider the relative proportion of wins versus losses, we first divide wins by losses and then by the sum of those ratios. That would give us the following ranking:
- Gonzaga: 0.5
- Arizona: 0.275
- Kansas: 0.225
The work of figuring out the denominator is what is called normalizing to get probabilities that sum to 1. If we don’t have such a denominator, we lack a crucial reference point by which to calculate probabilities. Essentially, what is it exactly that we are comparing the teams’ performance to? The fractional logit issue is the continuous version of the same thing: we have data from 0 to 1 and a function that converts that to a likelihood, but we don’t have a denominator that allows us to make it a probability. As a result, fractional logit is not a statistical distribution and lacks a PDF/CDF (and any way to simulate data from it).
OK So What?
The reader at this point might think, OK, so I’ve broken a statistical taboo. What’s going to happen to me, the statistical gods will strike me with lightning?1 Well, weird things start happening when you lack a normalizing constant as fractional logit does. In my paper on ordered beta regression, I simulated data and compared fractional logit with beta regression, OLS, and other alternatives, and its performance varied wildly. For that reason I did not recommend it as an alternative, but that was just one simulation (if a very big one).
However, there is more thorough research on this topic by people working in the machine learning literature. Gabriel Loaiza-Ganem and John Cunningham looked at variational auto-encoders, which are models for pixels that make up images. Apparently people using these models had been employing fractional logit, and they were likewise concerned about the lack of normalization. They went as far as identifying what the normalization constant is, which turns out to be the following for a given value of the linear predictor \(G(X\beta)\), which I denote as \(\mu\) (see their eqn (7)):
\[ C(\mu) = \begin{cases} \frac{2 \text{tanh}^{-1}(1 – 2\mu)}{1 – 2\mu} & \text{if } \mu \ne 0.5 \\ 2 & \text{otherwise} \end{cases} \]
This thing, to be honest, is kind of ugly, and has a discontinuity when the proportion is equal to 0.5 or 50%. The authors figured out how to sample from this distribution by using a Taylor-series approximation around the discontinuity, which is pretty cool, and they decided to name it the continuous Bernoulli distribution. I am cutting and pasting their Figure 1 to show what the distribution looks like:
Essentially, the distribution allows for mildly sloping lines across 0 to 1 that can be either upward or downward. The value of the normalizing constant \(C(\mu)\) in the leftward panel is much bigger towards the boundaries of the distribution, which intuitively makes sense. As the distribution moves towards extreme values, the denominator has to change to take into account the non-linearity in the outcome.
Furthermore, they show in the paper that the lack of normalization has a clear impact on performance. They use an image-learning task and examine how continuous Bernoulli without normalization (i.e., fractional logit) compares to continuous Bernoulli with normalization. Again, I’ll copy and paste their key result here:
In this figure, CB represents continuous Bernoulli and B is the fractional logit. As can be seen, the images are much sharper with continuous Bernoulli (normalized) than fractional logit. The authors point out that this is likely due to the normalizing constant becoming so large towards the extremes of the distribution: the un-normalized distribution has a hard time knowing where the boundaries are. Or, as in our basketball analogy, understanding what the relative comparison point is. It still gets fairly close to the average value but it can’t take into account the full distribution.
So there you have it. There is no reason to use fractional logit anymore because someone fixed it. On the other hand, you could use my ordered beta regression model, which did not exist at the time of this paper but does essentially the same thing (and doesn’t have issues with discontinuities that require Taylor series expansion). Still, the authors’ accomplishment is remarkable in creating a statistical distribution ex nihilo.
Example
This distribution is not available currently in R, though it can be implemented fairly straightforwardly in Stan. It is also available in tensorflow
in Python, but as I’m not primarily a Python user, I’ll stick with R. I produce code below that can fit this model with the R package brms
as a custom family, and in the future I plan to add support for it to ordbetareg
. I still think ordered beta regression makes more sense as a default, especially with the issues with the normalizing constant in the continuous Bernoulli, but it is great to have this model as another robust alternative for bounded continuous variables.
To demonstrate how to fit continuous Bernoulli, I first generate data using the rordbeta
function in ordbetareg
that will create proportion data from 0 to 1 inclusive. I’ll add a covariate X
to predict the mean of the distribution on the logit scale (which continuous Bernoulli also uses) with a coefficient of 2.5:
library(ordbetareg) N <- 500 X <- runif(n=N) Y <- rordbeta(n=N, mu = plogis(-2 + 2.5*X),cutpoints=c(-3,3)) hist(Y)
The outcome has a distinct U-shape and 70 discrete responses (0 or 1).
The code below defines the custom family for brms
to work (truly amazing how easy this is to do):
c_bernoulli <- custom_family("c_bernoulli", dpars="mu", links="logit", lb=0,ub=1, type="real") # define log density function # some code from spinkey https://discourse.mc-stan.org/t/continuous-bernoulli/26886 stan_funs <- " //normalization constant real c_norm(real mu) { if(mu==0.5) { return log(2); } else { real const = (log(2 - 2*mu) - log(2*mu))/(2 * (1 - 2*mu)); return(log(const)); } } // log PDF for continuous Bernoulli real c_bernoulli_lpdf(real y, real mu) { // unnormalized density real lp = y * log(mu) + (1 - y) * log1m(mu); // normalized density lp += c_norm(mu); return lp; }" stanvars <- stanvar(scode = stan_funs, block = "functions") # posterior predictions posterior_predict_c_bernoulli <- function(i, prep, ...) { # need inverse CDF function for continuous bernoulli inv_cdf <- function(u=NULL, mu=NULL) { if(mu==0.5) { out <- u } else { out <- (log(u * (2 * mu - 1) + 1 - mu) - log(1 - mu))/(log(mu) - (log(1-mu))) } return(out) } mu <- brms::get_dpar(prep, "mu", i = i) u <- runif(n=length(mu)) inv_cdf(u,mu) } posterior_epred_c_bernoulli <- function(prep) { # expected value mu <- brms::get_dpar(prep, "mu") if(mu==0.5) { return(0.5) } else { (mu / (2 * mu - 1)) + (1 / (2*atanh(1 - 2*mu))) } } fit_c_bern <- brm( Y ~ X, data = tibble(Y=Y, X=X), family = c_bernoulli, stanvars = stanvars, backend="cmdstanr", refresh=0 ) ## Running MCMC with 4 sequential chains... ## ## Chain 1 finished in 0.9 seconds. ## Chain 2 finished in 0.9 seconds. ## Chain 3 finished in 0.9 seconds. ## Chain 4 finished in 0.8 seconds. ## ## All 4 chains finished successfully. ## Mean chain execution time: 0.9 seconds. ## Total execution time: 4.0 seconds. library(marginaleffects) avg_slopes(fit_c_bern) ## ## Term Estimate 2.5 % 97.5 % ## X 0.5085 0.4429 0.5669 ## ## Prediction type: response ## Columns: type, term, estimate, conf.low, conf.high
The avg_slopes
function from the marginaleffects
package gives the marginal effect of the parameter back-transformed to 0/1. The continuous Bernoulli estimates a very big marginal effect for X
(it is not the same as the true coefficient because that was on the logit scale, not the true scale). We can plot the posterior predictive distribution:
pp_check(fit_c_bern)
It’s a bit off, but pretty close, and we probably shouldn’t expect it to be perfect given that the data was generated from the ordered beta distribution, not the continuous Bernoulli.
We can then compare those results to the ordered beta regression in ordbetareg
:
fit_ordbeta <- ordbetareg( Y ~ X, data = tibble(Y=Y, X=X), backend="cmdstanr", refresh=0 ) ## Running MCMC with 4 sequential chains... ## ## Chain 1 finished in 2.5 seconds. ## Chain 2 finished in 2.4 seconds. ## Chain 3 finished in 2.5 seconds. ## Chain 4 finished in 2.5 seconds. ## ## All 4 chains finished successfully. ## Mean chain execution time: 2.5 seconds. ## Total execution time: 10.2 seconds. avg_slopes(fit_ordbeta) ## ## Term Estimate 2.5 % 97.5 % ## X 0.4887 0.4136 0.556 ## ## Prediction type: response ## Columns: type, term, estimate, conf.low, conf.high pp_check(fit_ordbeta)
The marginal effect of X is fairly close to the continuous Bernoulli’s, and the posterior distribution is much closer to the true distribution. But again, we generated the data from the ordered beta distribution so it’s not surprising that the continuous Bernoulli estimate is different. It might take some time to figure out exactly where and when the distributions will differ. For this relatively simple example, the marginal effects of both distributions are fairly close. We might expect beta regression to differ more widely from the continuous Bernoulli when there are more extreme values in the beta distribution (high dispersion).
In any case, though, there are clearly alternatives to fractional logit, and it is safe to let the model go. If the fractional logit specification is preferred, than the continuous Bernoulli (as implemented in the code above) is a correct distribution. I still think ordered beta (or the beta variants in general) are preferable as they don’t have a strange discontinuity in the middle of the function and their properties are well-known.
With some positive probability, yes.↩︎
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.