AIC in the well-specified linear model: theory and simulation
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Theory
AIC
Consider the AIC for the usual linear model $Y=X\beta+\varepsilon$:

$$
\mathrm{AIC}=\frac{1}{2}\ln\left(2\pi e\hat{\sigma}^2\right)+\frac{p+1}{N} (\#eq:AIC)
$$
where $p$ is the dimension of the covariate vector $X$ and $\hat{\sigma}^2$ is the ML estimate of the conditional variance of $Y|X$. The expectation of @ref(eq:AIC) under model assumptions can be found by using the fact that, for a $\chi^2$ random variable with $\nu$ degrees of freedom¹:

$$
E\left(\ln\chi^2_\nu\right)=\ln 2+\psi\left(\frac{\nu}{2}\right), (\#eq:ElogX2)
$$

which, together with $N\hat{\sigma}^2\sim V(Y|X)\,\chi^2_{N-p}$, leads to:

$$
E(\mathrm{AIC})=\frac{1}{2}\ln\left[2\pi e\,V(Y|X)\right]+\frac{1}{2}\ln\left(\frac{N-p}{N}\right)-\frac{1}{2}\frac{1}{N-p}+\frac{p+1}{N}+O(N^{-2}). (\#eq:EAIC)
$$
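The digamma identity above is easy to verify numerically; here is a quick Monte Carlo sanity check, with an arbitrary choice of $\nu$:

```r
# Monte Carlo check of E(ln chi^2_nu) = ln(2) + digamma(nu / 2)
set.seed(1)
nu <- 7  # arbitrary choice of degrees of freedom
mc_estimate <- mean(log(rchisq(1e6, df = nu)))
closed_form <- log(2) + digamma(nu / 2)
c(mc = mc_estimate, exact = closed_form)  # the two agree closely
```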
Now consider two such models, with different covariate vectors $X_1$ and $X_2$ of dimension $p_1$ and $p_2$ respectively, both assumed to be well specified. Denote, as before:

$$
\mathrm{AIC}_i=\frac{1}{2}\ln\left(2\pi e\hat{\sigma}_i^2\right)+\frac{p_i+1}{N}. (\#eq:AICi)
$$

Taking the difference of the corresponding expectations @ref(eq:EAIC):

$$
E(\mathrm{AIC}_1-\mathrm{AIC}_2)=\frac{1}{2}\ln\left(\frac{V(Y|X_1)}{V(Y|X_2)}\right)+\frac{p_1-p_2}{2N}+O(N^{-2}). (\#eq:DeltaEAIC)
$$
Assuming, without loss of generality, that $p_1\leq p_2$, we have:

$$
E(\mathrm{AIC}_1-\mathrm{AIC}_2)<0\iff N<\frac{p_2-p_1}{\ln\left(V(Y|X_1)/V(Y|X_2)\right)}. (\#eq:AICCondition)
$$
If the two conditional variances are close, expanding the logarithm to first order gives the approximate condition:

$$
N\lesssim\frac{V(Y|X_2)}{V(Y|X_1)-V(Y|X_2)}\,(p_2-p_1). (\#eq:AICConditionApprox)
$$
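As a concrete illustration with made-up numbers (assume model 2 reduces the conditional variance by 1 % at the cost of one extra parameter), the exact and approximate thresholds are nearly identical:

```r
# Hypothetical variances: model 2 reduces the conditional variance by 1%
V1 <- 1.01; V2 <- 1.00
p1 <- 1; p2 <- 2
exact  <- (p2 - p1) / log(V1 / V2)    # threshold from the exact condition
approx <- V2 / (V1 - V2) * (p2 - p1)  # first-order approximation
c(exact = exact, approx = approx)     # ~100.5 vs. ~100
```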
Cross-entropy
In parallel to AIC, we can consider the exact “information criterion” provided by the model's in-sample cross-entropy under the true data generating process. For a single linear model, the in-sample cross-entropy is:

$$
\mathrm{CE}_\text{in}=\frac{1}{2}\ln\left(2\pi e\hat{\sigma}^2\right)+\frac{1}{2}\,\frac{\sigma^2-\hat{\sigma}^2+\frac{1}{N}(\beta-\hat{\beta})^TX^TX(\beta-\hat{\beta})}{\hat{\sigma}^2}, (\#eq:InSampleCrossEntropy)
$$

where $\sigma^2\equiv V(Y|X)$ denotes the true conditional variance.
To compute its expectation, note that:

- The quadratic form $(\beta-\hat{\beta})^TX^TX(\beta-\hat{\beta})/\sigma^2$ in the numerator and $N\hat{\sigma}^2/\sigma^2$ in the denominator are conditionally independent $\chi^2$ variables with $p$ and $N-p$ degrees of freedom, respectively. This can be seen by rewriting them as $\epsilon^TH\epsilon/\sigma^2$ and $\epsilon^T(I-H)\epsilon/\sigma^2$, respectively, where $H=X(X^TX)^{-1}X^T$ is the usual hat matrix.
- For a $\chi^2$ random variable with $\nu$ degrees of freedom, $E\left(\frac{1}{\chi^2_\nu}\right)=\frac{1}{\nu-2}$.
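The second fact can also be checked by simulation, again with an arbitrary $\nu$:

```r
# Monte Carlo check of E(1 / chi^2_nu) = 1 / (nu - 2)
set.seed(1)
nu <- 10
mc_inv <- mean(1 / rchisq(1e6, df = nu))
c(mc = mc_inv, exact = 1 / (nu - 2))  # both close to 0.125
```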
Using these results, we can show that:
$$
E(\mathrm{CE}_\text{in}|X)=E(\mathrm{AIC}|X)+O(N^{-2}) (\#eq:AICvsCE)
$$
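For completeness, here is a sketch of the computation behind this result, applying the two facts above to the non-logarithmic terms of Eq. @ref(eq:InSampleCrossEntropy):

$$
\begin{aligned}
E\left(\frac{\sigma^2-\hat{\sigma}^2}{2\hat{\sigma}^2}\,\middle|\,X\right)&=\frac{1}{2}\left(\frac{N}{N-p-2}-1\right)=\frac{p+2}{2(N-p-2)},\\
E\left(\frac{(\beta-\hat{\beta})^TX^TX(\beta-\hat{\beta})}{2N\hat{\sigma}^2}\,\middle|\,X\right)&=\frac{1}{2}\,E\left(\chi^2_p\right)E\left(\frac{1}{\chi^2_{N-p}}\right)=\frac{p}{2(N-p-2)},
\end{aligned}
$$

and the two terms sum to $\frac{p+1}{N-p-2}=\frac{p+1}{N}+O(N^{-2})$, matching the AIC correction term to leading order.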
Before rushing to the (wrong) conclusion that $\mathrm{AIC}_1-\mathrm{AIC}_2$ will correspondingly estimate a difference of expected cross-entropies, let us notice that the relevant in-sample cross-entropy for model evaluation is Eq. @ref(eq:InSampleCrossEntropy) with $X$ corresponding to the full covariate vector: this is the target we should try to estimate (at least to the extent that our goal is predicting $Y$ given $X$). For this reason, strictly speaking, Eq. @ref(eq:AICvsCE) is exact only if our model is well specified as a model of $Y|X$. Otherwise, in order to estimate $E(\mathrm{CE}_\text{in}|X)$ consistently, we should use Takeuchi’s Information Criterion (TIC) rather than AIC.

A bit more pragmatically, in the real world we could assume the remainder of @ref(eq:AICvsCE) to be $O(N^{-1})$ (rather than $O(N^{-2})$), but generally small with respect to the leading-order AIC correction $\frac{p+1}{N}$. This will be the case if the models being compared are approximately well specified.
Simulation
Setup
We take the data generating process to be:

$$
Y=mX+q+\varepsilon, (\#eq:DGPSim)
$$

$$
X\sim\mathcal{N}(0,1),\qquad\varepsilon\sim\mathcal{N}(0,1),\qquad\varepsilon\perp X. (\#eq:DGPSim2)
$$
```r
library(tibble)

m <- 0.1
q <- 0

rxy <- function(n) {
  tibble(
    x = rnorm(n, sd = 1),
    y = m * x + q + rnorm(n, sd = 1)
  )
}
```
We compare the model without vs. with the slope term (that is, fixing $m=0$ vs. estimating $m$), which we will denote by suffixes $1$ and $1\oplus X$, respectively. The functions below compute AIC and in-sample cross-entropy from the corresponding `lm` objects. We also define a “Naive Information Criterion” $\mathrm{NIC}\equiv\ln(\hat{\sigma})$.
```r
nic <- function(fit) {
  p <- length(coef(fit))
  n <- nobs(fit)
  sigma_hat <- sigma(fit) * sqrt((n - p) / n)  # ML estimate of sigma
  log(sigma_hat)
}

aic <- function(fit) {
  p <- length(coef(fit))
  n <- nobs(fit)
  sigma_hat <- sigma(fit) * sqrt((n - p) / n)
  log(sigma_hat) + (p + 1) / n + 0.5 * (1 + log(2 * pi))
}

ce <- function(fit, data) {
  p <- length(coef(fit))
  n <- nobs(fit)
  sigma_hat <- sigma(fit) * sqrt((n - p) / n)
  y_hat <- fitted(fit)
  mu <- data$x * m + q  # true conditional mean (sigma^2 = 1 in the DGP)
  res <- 0
  res <- res + 0.5 / (sigma_hat^2)
  res <- res + log(sigma_hat)
  res <- res + mean(0.5 * (y_hat - mu)^2 / (sigma_hat^2))
  res <- res + 0.5 * log(2 * pi)
  return(res)
}
```
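As a sanity check on the AIC defined above: since our version is per observation, in natural-log units, and drops the conventional factor $-2$, it should coincide with `stats::AIC(fit) / (2 * nobs(fit))`. A minimal self-contained check:

```r
# Check: the per-observation AIC above equals stats::AIC / (2 * N)
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 0.1 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)
p <- length(coef(fit))
n <- nobs(fit)
sigma_hat <- sigma(fit) * sqrt((n - p) / n)
aic_manual <- log(sigma_hat) + (p + 1) / n + 0.5 * (1 + log(2 * pi))
all.equal(aic_manual, AIC(fit) / (2 * n))  # TRUE
```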
From our results above, we expect:
$$
E(\mathrm{AIC}_{1\oplus X}-\mathrm{AIC}_1)<0\iff N>\frac{1}{\ln(1+m^2)}=\frac{1}{m^2}\left(1+O(m^2)\right), (\#eq:DeltaEAICSim)
$$

$$
E\left((\mathrm{CE}_\text{in})_i\right)=E(\mathrm{AIC}_i)+O(N^{-2},m^2N^{-1}). (\#eq:AICvsCESim)
$$
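Plugging in $m=0.1$, the expected crossover sample size is:

```r
# Crossover sample size for m = 0.1
m <- 0.1
1 / log(1 + m^2)  # ~100.5, close to the approximation 1 / m^2 = 100
```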
I will use the tidyverse for plotting results.

```r
library(dplyr)
library(ggplot2)
```
In order to make results reproducible, let's:

```r
set.seed(840)
```
Results
We simulate data at different sample sizes from the data generating process described above, fitting models $1$ and $1\oplus X$ to each sample.
```r
fits <- tidyr::expand_grid(
  n = 10 ^ seq(from = 1, to = 3, by = 0.5),
  b = 1:1e3
) |>
  mutate(data = lapply(n, rxy)) |>
  group_by(n, b, data) |>
  tidyr::expand(model = c(y ~ 1, y ~ x)) |>
  ungroup() |>
  mutate(
    fit = lapply(row_number(), \(i) lm(model[[i]], data = data[[i]])),
    ce = sapply(row_number(), \(i) ce(fit[[i]], data[[i]])),
    aic = sapply(fit, aic),
    nic = sapply(fit, nic),
    model = format(model)
  ) |>
  select(-c(fit, data))
```
The plots below show the dependence on sample size of $E(\Delta\mathrm{AIC})$ and $E(\Delta\mathrm{CE}_\text{in})$, as well as the AIC selection frequencies. Notice that for $N=\frac{1}{m^2}$, even though $E(\Delta\mathrm{AIC})=0$, the selection frequency of the “complex” model $1\oplus X$ is still below 50 %. This is because the distribution of $\Delta\mathrm{AIC}$ is asymmetric, as seen in the density plot below, and $E(\Delta\mathrm{AIC})<\mathrm{median}(\Delta\mathrm{AIC})$.
```r
fits |>
  mutate(
    is_baseline = model == "y ~ 1",
    delta_ce = ce - ce[is_baseline],
    delta_aic = aic - aic[is_baseline],
    delta_nic = nic - nic[is_baseline],
    .by = c(n, b)
  ) |>
  filter(!is_baseline) |>
  summarise(
    `E( ΔCE )` = mean(delta_ce),
    `E( ΔAIC )` = mean(delta_aic),
    `E( ΔNIC )` = mean(delta_nic),
    .by = n
  ) |>
  tidyr::pivot_longer(-n, names_to = "metric", values_to = "value") |>
  ggplot(aes(x = n, y = value, color = metric)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(aes(xintercept = 1 / m^2), linetype = "dotted") +
  scale_x_log10("Sample Size") +
  coord_cartesian(ylim = c(-0.025, 0.025)) +
  ylab(expression(IC)) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ggtitle("AIC vs. in-sample cross-entropy", "Expected values") +
  NULL
```
```r
fits |>
  filter(aic == min(aic), .by = c(n, b)) |>
  summarise(count = n(), .by = c(n, model)) |>
  ggplot(aes(fill = model, x = n, y = count)) +
  geom_col() +
  scale_x_log10("Sample Size") +
  ylab("Count") +
  theme(legend.position = "bottom") +
  ggtitle("AIC model selection frequencies")
```
```r
fits |>
  filter(n %in% c(10, 100, 1000)) |>
  mutate(delta_aic = aic - aic[model == "y ~ 1"], .by = c(n, b)) |>
  filter(model != "y ~ 1") |>
  mutate(expec = -0.5 * log(1 + m^2) + 0.5 / n) |>
  ggplot(aes(x = delta_aic, color = as.factor(n))) +
  geom_density() +
  coord_cartesian(xlim = c(-0.1, NA)) +
  labs(x = "ΔAIC", y = "Density", color = "Sample Size") +
  ggtitle("ΔAIC probability density")
```
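The mean-vs.-median gap at $N=1/m^2$ can also be checked with a small self-contained simulation, re-deriving $\Delta\mathrm{AIC}$ from the two residual sums of squares rather than reusing the `fits` table:

```r
# Distribution of delta AIC = AIC_{1+X} - AIC_1 at N = 1 / m^2
set.seed(1)
m <- 0.1
n <- 100
delta_aic <- replicate(2000, {
  x <- rnorm(n)
  y <- m * x + rnorm(n)
  rss1 <- sum(resid(lm(y ~ 1))^2)
  rss2 <- sum(resid(lm(y ~ x))^2)
  0.5 * log(rss2 / rss1) + 1 / n  # equals aic(fit2) - aic(fit1)
})
c(mean = mean(delta_aic), median = median(delta_aic))
# the mean is ~0 while the median is positive
```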
Finally, here is something I cannot explain. The plot below shows the scatterplot of in-sample cross-entropy differences vs. AIC differences. It is well known that AIC only estimates the expectation of these differences, averaged over potential training samples. One may ask whether AIC has anything to say about the actual cross-entropy difference between the estimated models, conditional on the realized training sample.

Assuming I have made no errors here, the tilted-U shape of this scatterplot is a clear negative answer. What is especially interesting is that these differences are apparently negatively correlated. I fail to see where the negative correlation and the U-shape come from.
```r
fits |>
  filter(n == 100) |>
  mutate(
    is_baseline = model == "y ~ 1",
    delta_ce = ce - ce[is_baseline],
    delta_aic = aic - aic[is_baseline],
    .by = c(n, b)
  ) |>
  filter(!is_baseline) |>
  ggplot(aes(x = delta_aic, y = delta_ce)) +
  geom_point(size = 1, alpha = 0.2) +
  lims(x = c(-0.02, 0.01), y = c(-0.01, 0.03)) +
  labs(x = "ΔAIC", y = "ΔCE") +
  ggtitle("AIC vs. in-sample cross-entropy", "Point values for N = 100") +
  NULL
```
See e.g. arXiv:1503.06266↩︎
The same equation actually gives the expectation of AIC conditional on the in-sample covariate vector $X$. Since this conditioning differs between the two models, involving $X_1$ and $X_2$ respectively, in our comparison of expected values we must in general interpret these as unconditional expectations.↩︎