Model Selection Criteria: Derivation and Comparison of AIC and BIC

Master the rigorous derivation and institutional intuition behind AIC and BIC. Learn how to navigate the bias-variance trade-off in General Linear Models.

The Formal Theorem

Let $\mathcal{M}$ be a statistical model with $k$ parameters and likelihood $L(\theta \mid y)$. Given a sample size $n$, the information criteria are defined as:
$$
\begin{aligned}
\text{AIC} &= -2\ln(\hat{L}) + 2k \\
\text{BIC} &= -2\ln(\hat{L}) + k \ln(n)
\end{aligned}
$$
where $\hat{L}$ denotes the maximized likelihood of the model given the data $y$.
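
As a minimal sketch of the two definitions, the snippet below computes AIC and BIC for polynomial regressions with Gaussian errors. The helper names (`gaussian_log_likelihood`, `aic`, `bic`), the synthetic data, and the choice to count the error variance as a parameter are assumptions made for illustration; they are not taken from the text above.

```python
import numpy as np

def gaussian_log_likelihood(y, y_hat):
    """Maximized Gaussian log-likelihood, with the error variance at its MLE (RSS / n)."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    sigma2_hat = rss / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

def aic(log_lik, k):
    """AIC = -2 ln(L_hat) + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """BIC = -2 ln(L_hat) + k ln(n)."""
    return -2.0 * log_lik + k * np.log(n)

# Synthetic data (illustrative only): a straight line plus Gaussian noise
rng = np.random.default_rng(0)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

results = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    k = degree + 2  # (degree + 1) regression coefficients plus the error variance
    ll = gaussian_log_likelihood(y, y_hat)
    results[degree] = (aic(ll, k), bic(ll, k, n))
    print(f"degree={degree}  AIC={results[degree][0]:.2f}  BIC={results[degree][1]:.2f}")
```

Within each criterion, the candidate with the lowest value is preferred. For any single model the two values differ by exactly $k(\ln(n) - 2)$, which is the only place the sample size enters the comparison.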

Analytical Intuition.

Statistical modeling is a standing contest between complexity and clarity. Imagine a mapmaker charting a coastline: include every pebble and the map becomes unreadable, but simplify too much and the shape is lost.

AIC (Akaike Information Criterion) acts as the optimistic arbiter. It estimates, up to a constant shared by all candidate models, the expected Kullback-Leibler divergence between the fitted model and the 'true' unknown process, so minimizing it targets predictive accuracy. It rewards fit while charging a constant penalty of $2$ per parameter.

BIC (Bayesian Information Criterion) is the pragmatic realist. By scaling its per-parameter penalty to $\ln(n)$, it acknowledges that as our evidence pool grows, our tolerance for 'noise-chasing' parameters must shrink. It seeks the 'true' model hidden within the data: if that model is among the candidates, BIC asymptotically selects it, favoring the most parsimonious structure that explains the underlying mechanism.

Together, they form the twin lenses through which we view the trade-off between the bias of simplicity and the variance of complexity, so that our models describe the signal rather than the echoes of the noise.
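
The origin of the two penalties can be sketched in two lines, stated without proof and under the usual regularity conditions. The symbols $\tilde{y}$ (an independent replicate of the data, of the same size as $y$), $\hat{\theta}(y)$ (the maximum likelihood estimate computed from $y$), and $p(y \mid \mathcal{M})$ (the marginal likelihood of the model under some prior on $\theta$) are notational assumptions introduced here.

```latex
\begin{aligned}
\text{AIC:}\quad & \mathbb{E}_{y}\,\mathbb{E}_{\tilde{y}}\!\left[-2\ln L\big(\hat{\theta}(y) \mid \tilde{y}\big)\right] \;\approx\; \mathbb{E}_{y}\!\left[-2\ln \hat{L}\right] + 2k \\
\text{BIC:}\quad & -2\ln p(y \mid \mathcal{M}) \;=\; -2\ln \hat{L} + k\ln(n) + O(1)
\end{aligned}
```

The first line says that the in-sample fit is optimistic about out-of-sample fit by roughly $2k$ on the $-2\ln L$ scale, and the expected out-of-sample value equals the expected Kullback-Leibler divergence up to a term that does not depend on the model. The second line is the Laplace approximation to the marginal likelihood, which is why BIC differences approximate log posterior odds between models under equal prior model probabilities.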
CAUTION

Institutional Warning.

Students often confuse the 'truth' assumption. AIC does not assume the true model is in the candidate set; it prioritizes predictive performance. BIC assumes a true model exists among the candidates and prioritizes consistent identification of that model in the limit as $n \to \infty$.

Academic Inquiries.

01

Why does BIC penalize complexity more harshly than AIC?

Because the penalty term $k \ln(n)$ grows with the sample size $n$, whereas AIC's $2k$ remains constant. Once $\ln(n) > 2$, that is, $n > e^2 \approx 7.39$ (any sample of at least 8 observations), BIC's penalty exceeds AIC's.
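
The crossover can be checked numerically. This is a small sketch; the function names `aic_penalty` and `bic_penalty` are hypothetical helpers, not part of any library.

```python
import math

def aic_penalty(k):
    """Complexity penalty in AIC: 2k."""
    return 2 * k

def bic_penalty(k, n):
    """Complexity penalty in BIC: k ln(n)."""
    return k * math.log(n)

# Smallest integer sample size at which BIC's per-parameter penalty exceeds AIC's
crossover_n = next(n for n in range(1, 100) if bic_penalty(1, n) > aic_penalty(1))
print(crossover_n)  # 8, the first integer above e^2 ≈ 7.39

# How the BIC penalty per parameter grows with n (AIC's stays at 2)
for n in (5, 8, 100, 10_000):
    print(n, round(bic_penalty(1, n), 3))
```

Because both penalties are proportional to $k$, the crossover point does not depend on the number of parameters.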

02

Can I compare models with different transformations of the dependent variable using AIC?

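No, not directly. A model fitted to a transformed response, such as $\ln(y)$, has a likelihood that is a density over $\ln(y)$ rather than over $y$, so its AIC differs from the AIC on the original scale by a Jacobian term. All models being compared must define their likelihood for the same response variable on the same observations; a transformed-response model can only enter the comparison after its likelihood is re-expressed on the original scale.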
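
A hedged illustration of the Jacobian issue: the sketch below fits an i.i.d. Normal model to $y$ and to $\ln(y)$, then moves the second likelihood back to the $y$ scale using the change of variables $f_Y(y) = f_Z(\ln y) / y$. The data are synthetic, and the helper `gaussian_max_log_lik` and all variable names are assumptions made for this example.

```python
import numpy as np

def gaussian_max_log_lik(z):
    """Maximized log-likelihood of an i.i.d. Normal(mu, sigma^2) model for z."""
    n = len(z)
    sigma2_hat = np.var(z)  # MLE of the variance (divides by n)
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

# Synthetic, strictly positive data (illustrative only)
rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.8, size=500)

k = 2  # mean and variance in both models

# Model A: y itself is Normal
aic_normal = -2 * gaussian_max_log_lik(y) + 2 * k

# Model B: ln(y) is Normal. The raw likelihood is a density over ln(y), not y.
ll_log_scale = gaussian_max_log_lik(np.log(y))
aic_log_raw = -2 * ll_log_scale + 2 * k  # NOT comparable with aic_normal

# Change of variables: ln f_Y(y) = ln f_Z(ln y) - ln(y), summed over observations
ll_y_scale = ll_log_scale - np.sum(np.log(y))
aic_log_adjusted = -2 * ll_y_scale + 2 * k  # comparable with aic_normal

print(f"Normal model for y:              AIC = {aic_normal:.1f}")
print(f"Normal model for ln(y), raw:      AIC = {aic_log_raw:.1f}")
print(f"Normal model for ln(y), adjusted: AIC = {aic_log_adjusted:.1f}")
```

The raw and adjusted values for the log model differ by $2\sum_i \ln(y_i)$, which can be large in either direction, so only `aic_normal` and `aic_log_adjusted` belong in the same comparison.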

03

What happens if I use AIC or BIC when my model is misspecified?

AIC remains a valid tool for comparing predictive performance under misspecification. BIC's theoretical justification as an approximation of the posterior probability of the 'true' model becomes ambiguous if the true model is absent.

Standardized References.

  • Definitive Institutional Source: Burnham, K. P., & Anderson, D. R., Model Selection and Multimodel Inference.

Institutional Citation

Reference this proof in your academic research or publications.

NICEFA Visual Mathematics. (2026). Model Selection Criteria: Derivation and Comparison of AIC and BIC: Visual Proof & Intuition. Retrieved from https://www.nicefa.org/library/general-linear-models-/model-selection-criteria--derivation-and-comparison-of-aic-and-bic
