- Valid \(P\)-values Are Uniform Under the Null Model
- Posterior Predictive \(P\)-values
- Table 1: Some \(P\)-values, \(S\)-values, Maximum Likelihood Ratios, and Likelihood Ratio Statistics
- Frequentist Interval Estimate Percentiles
- Refer to Long-Run Coverage
- Brown et al. (2017) Reanalysis
- Table 2: \(P\)-values, \(S\)-values, and Likelihoods for Targeted Hypotheses About Hazard Ratios for Brown et al.
- Plot the point estimate and 95% compatibility interval
- \(P\)-value and \(S\)-value Functions
- Figure 3: \(P\)-value (Compatibility) Function
- Cumulative Confidence (Compatibility Distribution)
- Figure 4: \(S\)-value (Surprisal) Function
- Likelihood (Support) Functions
- Figure S1: Relative Likelihood Function
- Figure S2: Deviance Function \(2ln(MLR)\)
- Statistical Package Citations
- Environment
- References
The following post provides some of the code that was used to construct the figures and tables from Rafi & Greenland, 2020.1 An enhanced PDF version of the paper can be found here. For further discussion of the computations, see the appendix of the main paper, along with our technical supplement.2
Disclaimer: I am responsible for all the code and mistakes below, and none of them can be attributed to my coauthors or my fellow package developers.
In order to recreate the functions, I would recommend installing the latest version of concurve from CRAN, as it has patched some issues with graphing when the outcome is binary. Use the script below to get the latest version and load the R package. A number of other R packages are also used in this post and are listed below.
install.packages("concurve")
library("concurve")
Valid \(P\)-values Are Uniform Under the Null Model
Here we show that valid \(P\)-values have specific properties when the null model is true. We first generate two variables (\(Y\), \(X\)) that come from the same normal distribution with a \(\mu\) of 0 and a \(\sigma\) of 1, each with a total of 1000 observations, so we assume that there is no relationship between these two variables. We run a simple t-test between \(Y\) and \(X\), iterate this 100000 times to compute 100000 \(P\)-values, and then plot the overall distribution of the \(P\)-values as a histogram.
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(1031)
n.sim <- 100000
t.sim <- numeric(n.sim)
n.samp <- 1000
for (i in 1:n.sim) {
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  t <- t.test(X, Y)
  t.sim[i] <- t$p.value
}
library("ggplot2")
ggplot(NULL, aes(x = t.sim)) +
  geom_histogram(bins = 30, col = "black", fill = "#99c7c7", alpha = 0.25) +
  labs(title = "Distribution of P-values Under the Null", x = "P-value") +
  scale_x_continuous(breaks = seq(0, 1, 0.10)) +
  theme_bw()
This can also be shown using the TeachingDemos R package, which has a function dedicated to showing this phenomenon.
library("TeachingDemos") RNGkind(kind = "L'Ecuyer-CMRG") set.seed <- 1031 obs_p <- Pvalue.norm.sim(n = 1000, mu = 0, mu0 = 0, sigma = 1, sigma0 = 1, test= "t", alternative = "two.sided", alpha = 0.05, B = 100000)
ggplot(NULL, aes(x = obs_p)) +
  geom_histogram(bins = 30, col = "black", fill = "#99c7c7", alpha = 0.25) +
  labs(title = "Distribution of P-values Under the Null", x = "P-value") +
  scale_x_continuous(breaks = seq(0, 1, 0.10)) +
  theme_bw()
As you can see, when the null model is true, the distribution of \(P\)-values is uniform. Valid \(P\)-values are uniform under the null hypothesis, and their corresponding \(S\)-values, \(s = -\log_{2}(p)\), are exponentially distributed. We run the same simulation as before, but then convert the obtained \(P\)-values into \(S\)-values to see how they are distributed.
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(1031)
n.sim <- 100000
t.sim <- numeric(n.sim)
n.samp <- 1000
for (i in 1:n.sim) {
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  t <- t.test(X, Y)
  t.sim[i] <- t$p.value
}
ggplot(NULL, aes(x = -log2(t.sim))) +
  geom_histogram(bins = 30, col = "black", fill = "#d46c5b", alpha = 0.5) +
  labs(title = "Distribution of S-values Under the Null",
       x = "S-value (Bits of Information)") +
  theme_bw()
Posterior Predictive \(P\)-values
Despite the name, posterior predictive \(P\)-values are not considered valid frequentist \(P\)-values because they do not meet the uniformity criterion; instead, they are pulled toward 0.5. For further discussion, see our technical supplement, along with Greenland (2019) for comprehensive theoretical discussions.2, 3
A quick excerpt from our main paper and the technical supplement explains why:
As discussed in the Supplement, in Bayesian settings one may see certain \(P\)-values that are not valid frequentist \(P\)-values, the primary example being the posterior predictive \(P\)-value;4,5 unfortunately, the negative logs of such invalid \(P\)-values do not measure surprisal at the statistic given the model, and so are not valid \(S\)-values.
The decision rule “reject H if \(p\leq\alpha\)” will reject H 100\(\alpha\)% of the time under sampling from a model M obeying H (i.e., the Type-1 error rate of the test will be \(\alpha\)) provided the random variable \(P\) corresponding to \(p\) is valid (uniform under the model M used to compute it), but not necessarily otherwise.6 This is one reason why frequentist writers reject invalid \(P\)-values (such as posterior predictive \(P\)-values, which highly concentrate around 0.50) and devote considerable technical coverage to uniform \(P\)-values.4,5,6 A valid \(P\)-value (“U-value”) translates into an exponentially distributed \(S\)-value with a mean of 1 nat or \(\log_{2}(e)=1.443\) bits, where \(e\) is the base of the natural logs.
Uniformity is also central to the “refutational information” interpretation of the \(S\)-value used here, for it is necessary to ensure that the \(P\)-value \(p\) from which \(s\) is derived is in fact the percentile of the observed value of the test statistic in the distribution of the statistic under M, thus making small \(p\) surprising under M and making \(s\) the corresponding degree of surprise. Because posterior predictive \(P\)-values do not translate into sampling percentiles of the statistic under the hypothesis (in fact, they are pulled toward 0.5 from the correct percentiles),4,5 the resulting negative log does not measure surprisal at the statistic given M, and so is not a valid \(S\)-value in our terms.
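As a quick sanity check on the claim about the mean (a minimal sketch of my own, not from the paper), if \(P\) is uniform on (0, 1), then \(S = -\log_{2}(P)\) is exponentially distributed with rate \(\ln(2)\), i.e., with mean \(\log_{2}(e) \approx 1.443\) bits:

set.seed(1031)
p <- runif(1e6) # valid P-values under the null are uniform(0, 1)
s <- -log2(p) # corresponding S-values in bits
mean(s) # ~ 1.443 = log2(exp(1))
ks.test(s, "pexp", rate = log(2)) # agrees with an exponential of rate ln(2)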
And indeed, we can show this phenomenon below with simulations. Here we fit a simple Bayesian regression model with a weakly informative prior, normal(0, 10), using rstanarm, where both the predictor and response variable come from the same distribution and have the same location and scale parameters (\(\mu = 0\), \(\sigma = 1\)). We then calculate the observed test statistics, along with their distributions, and convert them into posterior predictive \(P\)-values, which we plot using bayesplot functions. Then, we iterate this 1000 times to examine the distribution of posterior predictive \(P\)-values and compare them to standard \(P\)-values, which are known to be uniform.
But first, we’ll generate the distribution of the test statistic from one model.
library("bayesplot") library("rstan") library("rstanarm") rstan_options(auto_write = TRUE) options(mc.cores = 4) teal <- c("#99c7c7") color_scheme_set("teal") RNGkind(kind = "L'Ecuyer-CMRG") n.samp <- 1000 X <- rnorm(n.samp, mean = 0, sd = 1) Y <- rnorm(n.samp, mean = 0, sd = 1) df <- data.frame(X, Y) mod1 <- stan_glm(Y ~ X, data = df, chains = 1, cores = 4, refresh = 0, prior = normal(0, 10)) yrep <- posterior_predict(mod1) h <- ppc_stat(Y, yrep, stat = "median", freq = TRUE, binwidth = 0.01) + labs(title = "Distribution of Posterior Test Statistic") + theme_bw() + yaxis_text(on = TRUE) values_all <- h[["plot_env"]][["T_yrep"]] prob_to_find <- h[["plot_env"]][["T_y"]] quantInv <- function(distr, value) ecdf(distr)(value) posterior_p_value <- quantInv(values_all, prob_to_find) l <- ppc_stat_2d(Y, yrep) + theme_bw() + labs(title = "Distribution of Posterior Test Statistic") plot(h)
plot(l)
print(prob_to_find)
[1] 0.03618
The above is the posterior predictive test statistic for one model, along with its distribution. Now let's iterate this process, convert the results into posterior predictive \(P\)-values, and generate their distribution.
rstan_options(auto_write = TRUE)
options(mc.cores = 4)
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(1031)
n.sim <- 1000
ppp <- numeric(n.sim)
n.samp <- 1000
quantInv <- function(distr, value) ecdf(distr)(value)
for (i in 1:length(ppp)) {
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  df <- data.frame(X, Y)
  mod1 <- stan_glm(Y ~ X, data = df, chains = 1, cores = 4, iter = 1000,
                   refresh = 0, prior = normal(0, 10))
  yrep <- posterior_predict(mod1)
  h <- ppc_stat(Y, yrep, stat = "median")
  values_all <- h[["plot_env"]][["T_yrep"]]
  prob_to_find <- h[["plot_env"]][["T_y"]]
  ppp[i] <- quantInv(values_all, prob_to_find)
}

(Stan emitted repeated warnings here that the bulk and tail effective sample sizes were too low and that the largest R-hat was 1.05, indicating the chains may need more iterations; see http://mc-stan.org/misc/warnings.html.)

ggplot(NULL, aes(x = ppp)) +
  geom_histogram(bins = 30, col = "black", fill = "#99c7c7", alpha = 0.25) +
  labs(title = "Distribution of Posterior Predictive P-values",
       subtitle = "From Simulation of 1000 Simple Linear Regressions",
       x = "Posterior Predictive P-value") +
  scale_x_continuous(breaks = seq(0, 1, 0.10)) +
  theme_bw()
We can also show this in Stata 16.1, using the new bayesstats ppvalues command, which generates posterior predictive \(P\)-values.
clear
set obs 10000

/** Generate variables from a normal distribution like before */
generate x = rnormal(0, 1)
generate y = rnormal(0, 1)

/** We set up our model here */
bayesmh y x, likelihood(normal({var})) prior({var}, normal(0, 10)) ///
    prior({y:}, normal(0, 10)) rseed(1031) saving(coutput_pred, replace) mcmcsize(1000)

/** We use the bayespredict command to make predictions from the model */
bayespredict (mean:@mean({_resid})) (var:@variance({_resid})), ///
    rseed(1031) saving(coutput_pred, replace)

/** Then we calculate the posterior predictive P-values */
bayesstats ppvalues {mean} using coutput_pred

number of observations (_N) was 0, now 10,000

Model summary
------------------------------------------------------------------------------
Likelihood:
  y ~ normal(xb_y,{var})
Priors:
  {y:x _cons} ~ normal(0,10)                                               (1)
        {var} ~ normal(0,10)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_y.

Bayesian normal regression                       MCMC iterations  =      3,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =      1,000
                                                 Number of obs    =     10,000
                                                 Acceptance rate  =      .1878
                                                 Efficiency:  min =     .05592
                                                              avg =     .08647
Log marginal-likelihood = -14154.084                          max =      .1143

------------------------------------------------------------------------------
             |                                            Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
y            |
           x |  .0044782   .0103859   .001389   .0033005   -.012539   .0282905
       _cons |  .0183664   .0100446    .00094    .017438  -.0008961   .0360122
-------------+----------------------------------------------------------------
         var |  .9880526   .0134149    .00142    .988618   .9624949   1.016522
------------------------------------------------------------------------------

file coutput_pred.dta saved
file coutput_pred.ster saved

Posterior predictive summary     MCMC sample size =     1,000

-----------------------------------------------------------
           T |      Mean  Std. Dev.   E(T_obs)  P(T>=T_obs)
-------------+---------------------------------------------
        mean | -.0005311   .0093848   .0003538          .47
-----------------------------------------------------------
Note: P(T>=T_obs) close to 0 or 1 indicates lack of fit.
As we can see from the R plot above, the posterior predictive \(P\)-values are generally concentrated around 0.5, and their distribution hardly resembles the uniform distribution of \(P\)-values under the null model. Further, the negative base-2 log transformation of posterior predictive \(P\)-values is not exponentially distributed, making them invalid \(S\)-values in our terms.
ggplot(NULL, aes(x = (-log2(ppp)))) +
  geom_histogram(bins = 30, col = "black", fill = "#d46c5b", alpha = 0.25) +
  labs(title = "Distribution of -log2(Posterior Predictive P-values)",
       subtitle = "S-values Produced From Posterior Predictive P-values",
       x = "-log2(Posterior Predictive P-value)") +
  theme_bw()
Table 1: Some \(P\)-values, \(S\)-values, Maximum Likelihood Ratios, and Likelihood Ratio Statistics
Here we look at some common \(P\)-values and how they correspond to other information statistics and likelihood measures.
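For intuition (a sketch of my own, not the paper's appendix code), the columns of Table 1 are linked under a normal approximation: for a two-sided \(P\)-value \(p\), take \(Z = \Phi^{-1}(1 - p/2)\); then \(s = -\log_{2}(p)\), the maximum likelihood ratio is \(e^{Z^{2}/2}\), and the deviance statistic is \(Z^{2}\).

# hypothetical helper (not from the paper): map a two-sided P-value to its
# S-value, maximum likelihood ratio, and deviance statistic, assuming a
# normal (Z) test statistic
p_to_measures <- function(p) {
  z <- qnorm(1 - p / 2) # |Z| giving the two-sided P-value p
  data.frame(p = p,
             s_bits = -log2(p), # S-value in bits
             mlr = exp(z^2 / 2), # maximum likelihood ratio
             deviance = z^2) # deviance statistic 2ln(MLR)
}
p_to_measures(c(0.99, 0.50, 0.05, 0.005)) # reproduces the matching rows of Table 1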
pvalue <- c(0.99, 0.90, 0.50, 0.25, 0.10, 0.05, 0.025, 0.01, 0.005, 0.0001,
            paste("5 sigma (~ 2.9 in 10 million)"),
            paste("1 in 100 million (GWAS)"),
            paste("6 sigma (~ 1 in a billion)"))
svalue <- round(c(-log2(c(0.99, 0.90, 0.50, 0.25, 0.10, 0.05, 0.025, 0.01, 0.005, 0.0001)),
                  21.7, 26.6, 29.9), 2)
pvalue[1:10] <- formatC(as.numeric(pvalue[1:10]), format = "e", digits = 2)
mlr <- round(c(1.00, 1.01, 1.26, 1.94, 3.87, 6.83, 12.3, 27.6, 51.4, 1935,
               (5.2 * (10^5)), (1.4 * (10^7)), (1.3 * (10^8))), 2)
# likelihood ratio statistic
lr <- round(c(0.00016, 0.016, 0.45, 1.32, 2.71, 3.84, 5.02, 6.63, 7.88, 15.1,
              26.3, 32.8, 37.4), 2)
table1 <- data.frame(pvalue, svalue, mlr, lr)
colnames(table1) <- c("P-value (compatibility)", "S-value (bits)",
                      "Maximum Likelihood Ratio", "Deviance Statistic 2ln(MLR)")
kbl(table1, format = "html", padding = 45, longtable = TRUE, color = "#777") %>%
  row_spec(row = 0, color = "#777", bold = T) %>%
  column_spec(column = 1:4, color = "#777", bold = F) %>%
  kable_classic(full_width = TRUE, html_font = "Cambria",
                lightable_options = "basic", font_size = 17,
                fixed_thead = list(enabled = F)) %>%
  footnote(general_title = "Abbreviations: ", title_format = "underline",
           general = c("Table 1 $P$-values and binary $S$-values, with corresponding maximum-likelihood ratios (MLR) and deviance (likelihood-ratio) statistics for a simple test hypothesis H under background assumptions A"))
P-value (compatibility) | S-value (bits) | Maximum Likelihood Ratio | Deviance Statistic 2ln(MLR) |
---|---|---|---|
0.99 | 0.01 | 1.000e+00 | 0.00 |
0.9 | 0.15 | 1.010e+00 | 0.02 |
0.5 | 1.00 | 1.260e+00 | 0.45 |
0.25 | 2.00 | 1.940e+00 | 1.32 |
0.1 | 3.32 | 3.870e+00 | 2.71 |
0.05 | 4.32 | 6.830e+00 | 3.84 |
0.025 | 5.32 | 1.230e+01 | 5.02 |
0.01 | 6.64 | 2.760e+01 | 6.63 |
0.005 | 7.64 | 5.140e+01 | 7.88 |
1e-04 | 13.29 | 1.935e+03 | 15.10 |
5 sigma (~ 2.9 in 10 million) | 21.70 | 5.200e+05 | 26.30 |
1 in 100 million (GWAS) | 26.60 | 1.400e+07 | 32.80 |
6 sigma (~ 1 in a billion) | 29.90 | 1.300e+08 | 37.40 |
Abbreviations: MLR, maximum likelihood ratio.
Table 1: \(P\)-values and binary \(S\)-values, with corresponding maximum-likelihood ratios (MLR) and deviance (likelihood-ratio) statistics for a simple test hypothesis H under background assumptions A.
For further discussion of the 5 and 6 sigma cutoffs, see the cited discussion.7
Frequentist Interval Estimate Percentiles
Refer to Long-Run Coverage
Here we simulate a study in which one group of 100 participants has a \(\mu\) of 100 and a \(\sigma\) of 20, while the second group has the same number of participants but a mean of 80 and a standard deviation of 20. We compare the groups using Welch's \(t\)-test, generate 95% compatibility intervals 100 times, and then plot them. Since we know the true mean difference is 20, we wish to see how often the interval estimates cover this parameter value.
# simulate one study and return its 95% compatibility interval
sim <- function() {
  x <- rnorm(100, 100, 20)
  y <- rnorm(100, 80, 20)
  t.test(x = x, y = y, conf.level = 0.95)$conf.int[]
}
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(1031)
z <- replicate(100, sim(), simplify = FALSE)
df <- data.frame(do.call(rbind, z))
df$studynumber <- (1:length(z))
intrvl.limit <- c("lower.limit", "upper.limit", "studynumber")
colnames(df) <- intrvl.limit
df$point <- ((df$lower.limit + df$upper.limit) / 2)
df$covered <- (df$lower.limit <= 20 & 20 <= df$upper.limit)
df$coverageprob <- ((as.numeric(table(df$covered)[2]) / nrow(df) * 100))
ggplot(data = df, aes(x = studynumber, y = point,
                      ymin = lower.limit, ymax = upper.limit)) +
  geom_pointrange(mapping = aes(color = covered), fill = "#99c7c7",
                  size = .60, alpha = 0.35) +
  geom_hline(yintercept = 20, lty = 1, color = "red", alpha = 0.30) +
  coord_flip() +
  labs(title = "Simulated 95% Frequentist Interval Estimates",
       subtitle = "True Mean Difference is 20",
       x = "Study Number", y = "Estimate (Point + Interval)",
       caption = "Red Intervals Do Not Cover Population Parameter") +
  theme_bw() + # use a white background
  theme(legend.position = "none") +
  annotate(geom = "text", x = 102, y = 30, label = "Overall Coverage (%) =",
           size = 3.5, color = "#99c7c7", alpha = 0.950) +
  annotate(geom = "text", x = 102, y = 35, label = df$coverageprob,
           size = 3.5, color = "#99c7c7", alpha = 0.950) +
  theme(text = element_text(size = 13.5)) +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 10))
This shows that the intervals vary simply as a result of randomness, and that the proper interpretation of the attached percentile concerns long-run coverage of the true parameter. However, this does not preclude interpretation of a single interval estimate from a study.8 As we write in the “Replace unrealistic ‘confidence’ claims with compatibility measures” section of our paper:1
The fact that “confidence” refers to the procedure behavior, not the reported interval, seems to be lost on most researchers. Remarking on this subtlety, when Jerzy Neyman discussed his confidence concept in 1934 at a meeting of the Royal Statistical Society, Arthur Bowley replied, “I am not at all sure that the ‘confidence’ is not a confidence trick.”9 And indeed, 40 years later, Cox and Hinkley warned, “interval estimates cannot be taken as probability statements about parameters, and foremost is the interpretation ‘such and such parameter values are consistent with the data.’”8 Unfortunately, the word “consistency” is used for several other concepts in statistics, while in logic it refers to an absolute condition (of noncontradiction); thus, its use in place of “confidence” would risk further confusion.
To address the problems above, we exploit the fact that a 95% CI summarizes the results of varying the test hypothesis H over a range of parameter values, displaying all values for which \(p > 0.05\),10 and hence \(s < 4.32\) bits.3,11 Thus the CI contains a range of parameter values that are more compatible with the data than are values outside the interval, under the background assumptions.3,12 Unconditionally (and thus even if the background assumptions are uncertain), the interval shows the values of the parameter which, when combined with the background assumptions, produce a test model that is “highly compatible” with the data in the sense of having less than 4.32 bits of information against it. We thus refer to CIs as compatibility intervals rather than confidence intervals;3,11,13 their abbreviation remains “CI.” We reject calling these intervals “uncertainty intervals,” because they do not capture uncertainty about background assumptions.13
Indeed, this also relates to our paper on deemphasizing the model assumptions that are often behind common statistical outputs, and why it is necessary to treat them as uncertain rather than given.14 For example, a 95% interval estimate is assumed to be free of bias in its construction; however, its coverage claims no longer hold once there are systematic errors in play. Many of these common, classical statistical procedures are designed to deal with random variation, and less so with bias (a relatively new field, with shiny new methods).
Brown et al. (2017) Reanalysis
The data are taken from the Brown et al. studies.15, 16
Here we take the reported statistics from the Brown et al. studies in order to run statistical tests of different alternative test hypotheses and use those results to construct various functions.
We calculate the standard errors from the point estimate and the confidence (compatibility) limits.
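Under a normal approximation on the log scale, the back-calculation is

\[\widehat{SE} = \frac{\ln(UL) - \ln(LL)}{2 \times 1.96} = \frac{\ln(2.59/0.997)}{3.92} \approx 0.244,\]

which can then be checked by recovering the reported limits from the point estimate.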
se <- log(2.59 / 0.997) / 3.92
logUL <- log(2.59)
logLL <- log(0.997)
logpoint <- log(1.61)
logpoint + (1.96 * se)
[1] 0.9536
logpoint - (1.96 * se)
[1] -0.001097
Table 2: \(P\)-values, \(S\)-values, and Likelihoods for Targeted Hypotheses About Hazard Ratios for Brown et al.
testhypothesis <- c("Halving of hazard", "No effect (null)", "Point estimate",
                    "Doubling of hazard", "Tripling of hazard", "Quintupling of hazard")
hazardratios <- c(0.5, 1, 1.61, 2, 3, 5)
pvals <- c(1.6e-06, 0.05, 1.00, 0.37, 0.01, 3.2e-06)
svals <- round(-log2(pvals), 2)
mlr2 <- c((1.0 * 10^5), 6.77, 1.00, 1.49, 26.2, (5.0 * 10^4))
# exact likelihood ratios computed from the Z-scores
lr <- round(c(exp((((log(0.5 / 1.61)) / se)^2) / (2)),
              exp((((log(1 / 1.61)) / se)^2) / (2)),
              exp((((log(1.61 / 1.61)) / se)^2) / (2)),
              exp((((log(2 / 1.61)) / se)^2) / (2)),
              exp((((log(3 / 1.61)) / se)^2) / (2)),
              exp((((log(5 / 1.61)) / se)^2) / (2))), 3)
LR <- formatC(lr, format = "e", digits = 2)
table2 <- data.frame(testhypothesis, hazardratios, pvals, svals, mlr2, LR)
colnames(table2) <- c("Test Hypothesis", "HR", "P-values", "S-values", "MLR", "LR")
kbl(table2, format = "html", padding = 45, longtable = TRUE, color = "#777") %>%
  row_spec(row = 0, color = "#777", bold = T) %>%
  column_spec(column = 1:6, color = "#777", bold = F) %>%
  kable_classic(full_width = TRUE, html_font = "Cambria",
                lightable_options = "basic", font_size = 15,
                fixed_thead = list(enabled = F))
Test Hypothesis | HR | P-values | S-values | MLR | LR |
---|---|---|---|---|---|
Halving of hazard | 0.50 | 0.00 | 19.25 | 1.00e+05 | 1.02e+05 |
No effect (null) | 1.00 | 0.05 | 4.32 | 6.77e+00 | 6.77e+00 |
Point estimate | 1.61 | 1.00 | 0.00 | 1.00e+00 | 1.00e+00 |
Doubling of hazard | 2.00 | 0.37 | 1.43 | 1.49e+00 | 1.49e+00 |
Tripling of hazard | 3.00 | 0.01 | 6.64 | 2.62e+01 | 2.62e+01 |
Quintupling of hazard | 5.00 | 0.00 | 18.25 | 5.00e+04 | 5.03e+04 |
Plot the point estimate and 95% compatibility interval
label <- c("Brown et al. (2017)\n JAMA", "Brown et al. (2017)\n J Clin Psychiatry") point <- c(1.61, 1.7) lower <- c(0.997, 1.1) upper <- c(2.59, 2.6) df <- data.frame(label, point, lower, upper)
Here we plot the 95% compatibility interval estimate reported from the high-dimensional propensity score analysis.
ggplot(data = df, mapping = aes(x = label, y = point,
                                ymin = lower, ymax = upper, group = label)) +
  geom_pointrange(color = "#007C7C", size = 1.5, alpha = 0.4) +
  geom_hline(yintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.5) +
  coord_flip() +
  scale_y_log10(breaks = scales::pretty_breaks(n = 10), limits = c(0.80, 4)) +
  labs(title = "Association Between Serotonergic Antidepressant Exposure \nDuring Pregnancy and Child Autism Spectrum Disorder",
       subtitle = "Reported 95% Compatibility Intervals From Primary Results",
       x = "Study",
       y = "Hazard Ratio (JAMA) + Adjusted Pooled Odds Ratio (J Clin Psychiatry)") +
  annotate(geom = "text", x = 1.5, y = 1, size = 5,
           label = "Null parameter value of 1", color = "#d46c5b", alpha = 0.75) +
  annotate(geom = "text", x = 2.2, y = 1.8, size = 5,
           label = "Point Estimate = 1.61, Lower Limit = 0.997, Upper Limit = 2.59",
           color = "#000000", alpha = 0.75) +
  annotate(geom = "text", x = 1.2, y = 1.8, size = 5,
           label = "Point Estimate = 1.7, Lower Limit = 1.1, Upper Limit = 2.6",
           color = "#000000", alpha = 0.75) +
  theme_light() +
  theme(axis.text.y = element_text(angle = 90, hjust = 1, size = 13)) +
  theme(axis.text.x = element_text(size = 13)) +
  theme(title = element_text(size = 13))
Once again, the authors reported that15
In the 2837 pregnancies (7.9%) exposed to antidepressants, 2.0% (95% CI, 1.6%-2.6%) of children were diagnosed with autism spectrum disorder. Risk of autism spectrum disorder was significantly higher with serotonergic antidepressant exposure (4.51 exposed vs 2.03 unexposed per 1000 person-years; between-group difference, 2.48 [95% CI, 2.33-2.62] per 1000 person-years) in crude (HR, 2.16 [95% CI, 1.64-2.86]) and multivariable-adjusted analyses (HR, 1.59 [95% CI, 1.17-2.17]) (Table 2). After inverse probability of treatment weighting based on the HDPS, the association was not significant (HR, 1.61 [95% CI, 0.997-2.59]) (Table 2).
and concluded with
In children born to mothers receiving public drug coverage in Ontario, Canada, in utero serotonergic antidepressant exposure compared with no exposure was not associated with autism spectrum disorder in the child. Although a causal relationship cannot be ruled out, the previously observed association may be explained by other factors.
We will show, using the graphical and tabular summaries below, why a more nuanced summary such as
“After HDPS adjustment for confounding, a 61% hazard elevation remained; however, under the same model, every hypothesis from no elevation up to a 160% hazard increase had \(p\) > 0.05; Thus, while quite imprecise, these results are consistent with previous observations of a positive association between serotonergic antidepressant prescriptions and subsequent ASD. Because the association may be partially or wholly due to uncontrolled biases, further evidence will be needed for evaluating what, if any, proportion of it can be attributed to causal effects of prenatal serotonergic antidepressant use on ASD incidence.”
would have been far more appropriate, as we discuss in our paper.1
\(P\)-value and \(S\)-value Functions
In order to use this information to construct a \(P\)-value function, we
can use the concurve
functions.
library("concurve")
We enter the reported point estimates and compatibility limits and produce all possible intervals, \(P\)-values, and \(S\)-values. This is calculated assuming normal approximations, using the curve_rev() function.
curve1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59, measure = "ratio")
The result is stored in the object curve1.
We can generate a table of the relevant statistics from this reanalysis, similar to the table above, this time using the concurve function curve_table(), which gives us interval estimates at various percentiles (25%, 50%, 75%, 95%, etc.). These are in turn used to construct the entire confidence (compatibility) distribution or \(P\)-value function.
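For intuition (a sketch of my own, not concurve's internal code), any row of the table below can be re-derived under the normal approximation: the limits at level \(1 - \alpha\) are \(\exp(\ln(\widehat{HR}) \pm z_{\alpha/2}\,\widehat{SE})\), and the \(S\)-value at those limits is \(-\log_{2}(\alpha)\).

# hypothetical re-derivation of the 95% row, assuming a normal
# approximation on the log hazard-ratio scale (not concurve's own code)
level <- 0.95
z <- qnorm(1 - (1 - level) / 2) # 1.96 for a 95% interval
se <- log(2.59 / 0.997) / 3.92 # back-calculated standard error
exp(log(1.61) + c(-1, 1) * z * se) # ~ 0.999 and 2.595, as in the table
-log2(1 - level) # S-value at the limits: ~ 4.32 bits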
kbl(curve_table(curve1[[1]]), format = "html", padding = 45, longtable = TRUE,
    color = "#777", row.names = F,
    col.names = colnames(curve_table(curve1[[1]]))) %>%
  row_spec(row = 0, color = "#777", bold = T) %>%
  column_spec(column = 1:7, color = "#777", bold = F) %>%
  kable_classic(full_width = TRUE, html_font = "Cambria",
                lightable_options = "basic", font_size = 17,
                fixed_thead = list(enabled = F)) %>%
  footnote(general_title = c("Table of Statistics for Various Interval Estimate Percentiles"),
           general = " ")
Lower Limit | Upper Limit | Interval Width | Interval Level (%) | CDF | P-value | S-value (bits) |
---|---|---|---|---|---|---|
1.490 | 1.740 | 0.250 | 25.0 | 0.625 | 0.750 | 0.415 |
1.366 | 1.897 | 0.531 | 50.0 | 0.750 | 0.500 | 1.000 |
1.217 | 2.131 | 0.914 | 75.0 | 0.875 | 0.250 | 2.000 |
1.178 | 2.200 | 1.021 | 80.0 | 0.900 | 0.200 | 2.322 |
1.134 | 2.286 | 1.152 | 85.0 | 0.925 | 0.150 | 2.737 |
1.079 | 2.403 | 1.325 | 90.0 | 0.950 | 0.100 | 3.322 |
0.999 | 2.595 | 1.596 | 95.0 | 0.975 | 0.050 | 4.322 |
0.933 | 2.779 | 1.846 | 97.5 | 0.988 | 0.025 | 5.322 |
0.860 | 3.015 | 2.155 | 99.0 | 0.995 | 0.010 | 6.644 |
Table of Statistics for Various Interval Estimate Percentiles
Figure 3: \(P\)-value (Compatibility) Function
Plot the \(P\)-value (Compatibility) function of the Brown et al. data
Now we plot the \(P\)-value function from the data stored in curve1, using the ggcurve() function from concurve.
ggcurve(curve1[[1]], type = "c", measure = "ratio", nullvalue = c(1),
        levels = c(0.5, 0.75, 0.95)) +
  labs(title = expression(paste(italic(P), "-value (Compatibility) Function")),
       subtitle = expression(paste(italic(P), "-values for a range of hazard ratios (HR)")),
       x = "Hazard Ratio (HR)",
       y = expression(paste(italic(P), "-value"))) +
  geom_vline(xintercept = 1.61, lty = 1, color = "gray", alpha = 0.2) +
  geom_vline(xintercept = 2.59, lty = 1, color = "gray", alpha = 0.2) +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 11),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        text = element_text(size = 11))
It is practically the same as the published version from Rafi & Greenland, 2020.1
curve1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59, measure = "ratio")
curve2 <- curve_rev(point = 1.7, LL = 1.1, UL = 2.6, measure = "ratio")
Cumulative Confidence (Compatibility Distribution)
Although we do not cover this figure in our paper, it is easy to construct using concurve's ggcurve() function by specifying type = "cdf", which shows the median estimate within the confidence distribution. The horizontal line that meets the curve is at approximately the same place as the maximum of the \(P\)-value function/confidence (compatibility) curve.
ggcurve(curve1[[2]], type = "cdf", measure = "ratio", nullvalue = c(1)) +
  labs(title = "P-values for a range of hazard ratios (HR)",
       subtitle = "Association Between Serotonergic Antidepressant Exposure \nDuring Pregnancy and Child Autism Spectrum Disorder",
       x = "Hazard Ratio (HR)",
       y = "Cumulative Compatibility") +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 11),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        text = element_text(size = 11))
We will also calculate the likelihoods so that we can generate the corresponding likelihood functions.
lik2 <- curve_rev(point = 1.7, LL = 1.1, UL = 2.6, type = "l", measure = "ratio")
lik1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59, type = "l", measure = "ratio")
We can also see how consistent these results are with previous studies conducted by the same research group, given the overlap of the functions, which can be compared using the plot_compare() function. Let's compare the relative likelihood functions from both studies to see how consistent the results are.
plot_compare(data1 = lik1[[1]], data2 = lik2[[1]], type = "l1", measure = "ratio",
             nullvalue = TRUE,
             title = "Brown et al. 2017. J Clin Psychiatry. vs. \nBrown et al. 2017. JAMA.",
             subtitle = "J Clin Psychiatry: OR = 1.7, 1/6.83 LI: LL = 1.1, UL = 2.6 \nJAMA: HR = 1.61, 1/6.83 LI: LL = 0.997, UL = 2.59",
             xaxis = "Hazard Ratio / Odds Ratio") +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 11),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        text = element_text(size = 11))
and the \(P\)-value functions.
plot_compare(data1 = curve1[[1]], data2 = curve2[[1]], type = "c", measure = "ratio",
             nullvalue = TRUE,
             title = "Brown et al. 2017. J Clin Psychiatry. vs. \nBrown et al. 2017. JAMA.",
             subtitle = "J Clin Psychiatry: OR = 1.7, 1/6.83 LI: LL = 1.1, UL = 2.6 \nJAMA: HR = 1.61, 1/6.83 LI: LL = 0.997, UL = 2.59",
             xaxis = "Hazard Ratio / Odds Ratio") +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 11),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        text = element_text(size = 11))
Figure 4: \(S\)-value (Surprisal) Function
Plot the \(S\)-value (Surprisal) function of the Brown et al. data with ggcurve().
ggcurve(data = curve1[[1]], type = "s", measure = "ratio", nullvalue = c(1),
        title = "S-value (Surprisal) Function",
        subtitle = "S-Values (surprisals) for a range of hazard ratios (HR)",
        xaxis = "Hazard Ratio",
        yaxis1 = "S-value (bits of information)") +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 11),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        text = element_text(size = 11))
which is quite close to the figure from our paper.
Likelihood (Support) Functions
Here we provide the code to show how we constructed the likelihood functions from our paper, which are the supplementary figures found here (S1) and here (S2).
Calculate and Plot Likelihood (Support) Intervals.
Here we use the reported estimates to construct the \(Z\)-scores and standard errors, which are in turn used to compute the likelihoods.
hrvalues <- seq(from = 0.65, to = 3.98, by = 0.01)
se <- log(2.59 / 0.997) / 3.92
zscore <- sapply(hrvalues, function(i) (log(i / 1.61) / se))
Figure S1: Relative Likelihood Function
We then standardize all the likelihoods by their maximum at the likelihood function,17 to produce relative likelihoods, likelihood intervals, and their corresponding relative likelihood function.18
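In symbols, with \(Z\) as computed above, the relative likelihood for a hazard ratio \(HR\) is

\[RL(HR) = \frac{L(HR)}{L(\widehat{HR})} = e^{-Z^{2}/2} = \frac{1}{MLR},\]

which is exactly what the support variable in the next code block computes.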
support <- exp((-zscore^2) / 2)
likfunction <- data.frame(hrvalues, zscore, support)
ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
  geom_line() +
  geom_vline(xintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.75) +
  geom_hline(yintercept = (1 / 6.83), lty = 1, color = "#333333", alpha = 0.05) +
  geom_ribbon(aes(x = hrvalues, ymin = min(support), ymax = support),
              fill = "#d46c5b", alpha = 0.10) +
  labs(title = "Relative Likelihood Function",
       subtitle = "Relative likelihoods for a range of hazard ratios",
       x = "Hazard Ratio (HR)",
       y = "Relative Likelihood \n1/MLR") +
  annotate(geom = "text", x = 1.65, y = 0.2, label = "1/6.83 LI = 0.997, 2.59",
           size = 4, color = "#000000") +
  theme_bw() +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 11),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        text = element_text(size = 11)) +
  scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(expand = expansion(mult = c(0.01, 0.0125)),
                     breaks = seq(0, 1, 0.10))
The \(\frac{1}{6.83}\) likelihood interval corresponds to the 95% compatibility interval, as shown by the figure from our paper.
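As a quick check (my own, not from the paper), the 6.83 cutoff is simply the maximum likelihood ratio corresponding to \(p = 0.05\) under the normal approximation:

# the MLR at the 95% limits: exp(z^2 / 2) with z = 1.96 gives ~ 6.83, so
# the 1/6.83 likelihood interval matches the 95% compatibility interval
exp(qnorm(0.975)^2 / 2)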
Below we use the calculated \(Z\)-scores to construct the log-likelihood function, which is the upward-concave parabola \(\frac{Z^{2}}{2} = \ln(MLR)\).
support <- (zscore^2) / 2
likfunction <- data.frame(hrvalues, zscore, support)
ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
  geom_line() +
  geom_vline(xintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.75) +
  geom_ribbon(aes(x = hrvalues, ymin = support, ymax = max(support)),
              fill = "#d46c5b", alpha = 0.10) +
  labs(title = "Log-Likelihood Function",
       subtitle = "Log-likelihoods for a range of hazard ratios",
       x = "Hazard Ratio (HR)",
       y = "ln(MLR)") +
  theme_bw() +
  theme(axis.title.x = element_text(size = 13),
        axis.title.y = element_text(size = 13)) +
  scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(expand = expansion(mult = c(0.01, 0.0125)),
                     breaks = seq(0, 7, 1)) +
  theme(text = element_text(size = 15)) +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 12))
Figure S2: Deviance Function \(2ln(MLR)\)
Known as the deviance function, this is twice the log maximum-likelihood ratio, \(2\ln(MLR) = Z^{2}\), and it maps directly onto the \(S\)-value function.
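Under the normal approximation, the chain of relations (consistent with Table 1) is

\[2\ln(MLR) = Z^{2}, \qquad p = 2\left(1 - \Phi(|Z|)\right), \qquad s = -\log_{2}(p),\]

so each deviance value corresponds to a unique \(P\)-value and \(S\)-value, which is why the two function shapes track each other.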
support <- (zscore^2)
likfunction <- data.frame(hrvalues, zscore, support)
ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
  geom_line() +
  geom_vline(xintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.75) +
  geom_hline(yintercept = 3.84, lty = 1, color = "#333333", alpha = 0.05) +
  geom_ribbon(aes(x = hrvalues, ymin = support, ymax = max(support)),
              fill = "#d46c5b", alpha = 0.10) +
  annotate(geom = "text", x = 1.65, y = 4.4, label = "1/6.83 LI = 0.997, 2.59",
           size = 4, color = "#000000") +
  theme_bw() +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 11),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        text = element_text(size = 11)) +
  scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(expand = expansion(mult = c(0.01, 0.0125)),
                     breaks = seq(0, 14, 2)) +
  labs(title = "Deviance Function",
       subtitle = "Deviance statistics for a range of hazard ratios",
       x = "Hazard Ratio (HR)",
       y = " Deviance Statistic \n2ln(MLR)")
And the version from our paper can be found here.
It is important to note that although we have calculated these likelihood functions manually here, they can also be generated easily using the curve_lik() function, which takes inputs from the ProfileLikelihood R package. For further discussion, please see the following article, which gives several examples for a wide variety of models.
Further, we urge some caution. Although we endorse the construction and presentation of likelihood functions along with \(P\)-value and \(S\)-value functions and the accompanying tabulations, the use of pure likelihood methods has been highly controversial among some statisticians, with some going as far as to say that likelihood is blind; not all statisticians believe this, and some have responded to such criticisms in kind (see the discussants).19 Thus, we support providing both \(P\)-value/\(S\)-value and likelihood-based functions for a complete picture.
Indeed, we can see how they all easily map to one another when plotted side by side, with the following script.
library("cowplot") plot_grid(confcurve, surprisalcurve, relsupportfunction, deviancefunction, ncol = 2, nrow = 2)
A clearer plot comparing the four functions is shown below.
Additional adjustments that were made to the figures from Rafi & Greenland, 20201 were done using Adobe Illustrator and Photoshop.
All errors are ours, and we welcome critical feedback and reporting of errors. To report possible errors in our analyses, please post them as a comment below or as a bug here. We will compile them into a public errata if any errors or flaws are reported to us.
Statistical Package Citations
Please remember to cite the R packages if you use any of the R scripts from above. The citation for our paper1 can be found below in the References section, or it can be downloaded here for a reference manager.
citation("concurve") citation("TeachingDemos") citation("ProfileLikelihood") citation("rstan") citation("rstanarm") citation("bayesplot") citation("knitr") citation("kableExtra") citation("ggplot2") citation("cowplot") citation("Statamarkdown")
Environment
The analyses were run on:
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] Statamarkdown_0.5.2 kableExtra_1.3.4 cowplot_1.1.1 ggplot2_3.3.3 concurve_2.7.7 rmarkdown_2.7

loaded via a namespace (and not attached):
[1] readxl_1.3.1 uuid_0.1-4 backports_1.2.1 systems_1.0.1 plyr_1.8.6
[6] igraph_1.2.6 splines_4.0.5 crosstalk_1.1.1 rstantools_2.1.1 inline_0.3.17
[11] digest_0.6.27 htmltools_0.5.1.1 rsconnect_0.8.17 fansi_0.4.2 magrittr_2.0.1
[16] openxlsx_4.2.3 credentials_1.3.0 RcppParallel_5.1.2 matrixStats_0.58.0 officer_0.3.18
[21] xts_0.12.1 svglite_2.0.0 askpass_1.1 prettyunits_1.1.1 colorspace_2.0-0
[26] rvest_1.0.0 haven_2.4.0 xfun_0.22 dplyr_1.0.5 callr_3.7.0
[31] crayon_1.4.1 bcaboot_0.2-1 jsonlite_1.7.2 lme4_1.1-26 survival_3.2-10
[36] zoo_1.8-9 glue_1.4.2 survminer_0.4.9 gtable_0.3.0 webshot_0.5.2
[41] V8_3.4.0 car_3.0-10 pkgbuild_1.2.0 rstan_2.21.1 abind_1.4-5
[46] scales_1.1.1 DBI_1.1.1 rstatix_0.7.0 miniUI_0.1.1.1 Rcpp_1.0.6
[51] viridisLite_0.4.0 xtable_1.8-4 foreign_0.8-81 km.ci_0.5-2 stats4_4.0.5
[56] StanHeaders_2.21.0-7 DT_0.18 htmlwidgets_1.5.3 httr_1.4.2 threejs_0.3.3
[61] ellipsis_0.3.1 farver_2.1.0 pkgconfig_2.0.3 loo_2.4.1 sass_0.3.1
[66] utf8_1.2.1 labeling_0.4.2 reshape2_1.4.4 tidyselect_1.1.0 rlang_0.4.10
[71] later_1.1.0.1 munsell_0.5.0 pbmcapply_1.5.0 cellranger_1.1.0 tools_4.0.5
[76] cli_2.4.0 generics_0.1.0 broom_0.7.6 ggridges_0.5.3 evaluate_0.14
[81] stringr_1.4.0 fastmap_1.1.0 yaml_2.2.1 sys_3.4 processx_3.5.1
[86] knitr_1.32 zip_2.1.1 survMisc_0.5.5 purrr_0.3.4 nlme_3.1-152
[91] mime_0.10 rstanarm_2.21.1 xml2_1.3.2 shinythemes_1.2.0 compiler_4.0.5
[96] bayesplot_1.8.0 rstudioapi_0.13 curl_4.3 ggsignif_0.6.1 statmod_1.4.35
[101] tibble_3.1.1 bslib_0.2.4 stringi_1.5.3 highr_0.9 ps_1.6.0
[106] blogdown_1.3 forcats_0.5.1 gdtools_0.2.3 lattice_0.20-41 Matrix_1.3-2
[111] nloptr_1.2.2.2 markdown_1.1 shinyjs_2.0.0 KMsurv_0.1-5 vctrs_0.3.7
[116] pillar_1.6.0 lifecycle_1.0.0 jquerylib_0.1.3 data.table_1.14.0 ProfileLikelihood_1.1
[121] flextable_0.6.5 httpuv_1.5.5 R6_2.5.0 bookdown_0.22 promises_1.2.0.1
[126] gridExtra_2.3 rio_0.5.26 codetools_0.2-18 MASS_7.3-53.1 boot_1.3-27
[131] colourpicker_1.1.0 gtools_3.8.2 assertthat_0.2.1 openssl_1.4.3 withr_2.4.2
[136] metafor_2.4-0 shinystan_2.5.0 hms_1.0.0 grid_4.0.5 minqa_1.2.4
[141] tidyr_1.1.3 carData_3.0-4 ggpubr_0.4.0 shiny_1.6.0 base64enc_0.1-3
[146] dygraphs_1.1.1.6
References