Model Selection Criteria: Derivation and Comparison of AIC and BIC

Analytical Intuition.

Statistical modeling is a constant trade-off between complexity and clarity. Imagine a mapmaker charting a coastline: include every pebble and the map becomes unreadable, but simplify too much and the shape is lost. AIC (Akaike Information Criterion) acts as an optimistic arbiter. It estimates, up to a constant, the expected Kullback-Leibler divergence between a fitted model and the 'true' unknown process, so it effectively targets predictive accuracy. It rewards fit while charging a constant penalty of $2$ per estimated parameter. BIC (Bayesian Information Criterion), however, is the pragmatic realist. By scaling its per-parameter penalty by $\ln(n)$, it acknowledges that as our evidence pool grows, our tolerance for 'noise-chasing' parameters must shrink. It seeks the 'true' model: if that model is among the candidates, BIC asymptotically selects it, favoring the most parsimonious structure that explains the underlying mechanism. Together, they form twin lenses on the trade-off between the bias of simplicity and the variance of complexity, helping our models describe the signal rather than the echoes of the noise.
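To make the trade-off concrete, here is a minimal sketch that scores polynomial regressions of increasing degree with both criteria. It assumes the standard definitions $\mathrm{AIC} = 2k - 2\ln\hat{L}$ and $\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}$, Gaussian errors, and simulated data from a quadratic; the function names, the seed, and the data-generating coefficients are illustrative choices, not part of the original text.

```python
import numpy as np


def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood given least-squares residuals."""
    n = residuals.size
    sigma2_hat = np.sum(residuals**2) / n  # MLE of the noise variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)


def aic_bic(log_lik, k, n):
    """Return (AIC, BIC) for k estimated parameters and n observations."""
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic


# Simulated data (assumption): the true mean is quadratic in x.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=1.0, size=n)

results = {}
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    k = degree + 2  # degree + 1 coefficients, plus the noise variance
    results[degree] = aic_bic(gaussian_log_likelihood(residuals), k, n)

for degree, (aic, bic) in results.items():
    print(f"degree={degree}  AIC={aic:.2f}  BIC={bic:.2f}")
```

Lower values are better for both criteria. The gap between the two columns is exactly $k(\ln(n) - 2)$, so it widens as more parameters are added, which is why BIC tends to be more reluctant than AIC to accept the higher-degree fits.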

Academic Inquiries.

Why does BIC penalize complexity more harshly than AIC?

Because BIC's penalty term $k \ln(n)$, for $k$ estimated parameters, grows with the sample size $n$, whereas AIC's $2k$ does not depend on $n$. Once $\ln(n) > 2$, that is, $n > e^2 \approx 7.39$ (so from $n = 8$ onward), BIC's penalty exceeds AIC's.
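The crossover can be checked numerically. The short sketch below compares the two penalty terms for a fixed number of parameters; the helper names and the choice $k = 3$ are illustrative assumptions.

```python
import math


def aic_penalty(k):
    """AIC complexity penalty: 2 per estimated parameter."""
    return 2 * k


def bic_penalty(k, n):
    """BIC complexity penalty: ln(n) per estimated parameter."""
    return k * math.log(n)


k = 3  # illustrative number of parameters
for n in (5, 7, 8, 100, 10_000):
    heavier = "BIC" if bic_penalty(k, n) > aic_penalty(k) else "AIC"
    print(f"n={n:>6}  AIC penalty={aic_penalty(k):.2f}  "
          f"BIC penalty={bic_penalty(k, n):.2f}  heavier: {heavier}")
```

Because both penalties are linear in $k$, the crossover point depends only on $n$, not on how many parameters the model has.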

Can I compare models with different transformations of the dependent variable using AIC?

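Not directly. The likelihood of a transformed response (for example $\ln y$) is a density on a different scale from the likelihood of $y$, and the two differ by the Jacobian of the transformation. AIC values are comparable only when every model's likelihood is defined for the same response variable, so either keep the response on one scale or add the log-Jacobian term that maps the transformed model's likelihood back to the original scale.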
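As a hedged illustration of the Jacobian issue, the sketch below fits one Gaussian linear model to $y$ and another to $\ln y$, then puts the second likelihood back on the scale of $y$. It assumes a strictly positive response and uses the change-of-variables identity $f_Y(y) = f_Z(\ln y)\,/\,y$, so the log-likelihood on the original scale is the log-scale value minus $\sum_i \ln y_i$. The simulated data, seed, and function names are assumptions made for the example.

```python
import numpy as np


def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood given least-squares residuals."""
    n = residuals.size
    sigma2_hat = np.sum(residuals**2) / n  # MLE of the noise variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)


def fit_line_residuals(x, target):
    """Least-squares residuals of a straight-line fit of target on x."""
    slope, intercept = np.polyfit(x, target, 1)
    return target - (intercept + slope * x)


# Simulated positive response (assumption): log-normal around a linear trend.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 3, size=n)
y = np.exp(0.5 + 0.8 * x + rng.normal(scale=0.5, size=n))

k = 3  # intercept, slope, noise variance (same count in both models)

ll_raw = gaussian_log_likelihood(fit_line_residuals(x, y))
ll_log_scale = gaussian_log_likelihood(fit_line_residuals(x, np.log(y)))

# Log-Jacobian of z = ln(y): sum of ln|dz/dy| = -sum(ln y).
log_jacobian = -np.sum(np.log(y))
ll_log_on_y_scale = ll_log_scale + log_jacobian

aic_raw = 2 * k - 2 * ll_raw
aic_log_naive = 2 * k - 2 * ll_log_scale          # not comparable to aic_raw
aic_log_adjusted = 2 * k - 2 * ll_log_on_y_scale  # comparable to aic_raw

print(f"AIC, model for y:                 {aic_raw:.2f}")
print(f"AIC, model for ln(y), naive:      {aic_log_naive:.2f}")
print(f"AIC, model for ln(y), on y scale: {aic_log_adjusted:.2f}")
```

Only the first and third numbers should be compared. The naive value differs from the adjusted one by $2\sum_i \ln y_i$, a quantity that depends on the units of $y$ and has nothing to do with model quality.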

What happens if I use AIC or BIC when my model is misspecified?

AIC remains a valid tool for comparing predictive performance under misspecification. BIC's theoretical justification as an approximation to the posterior probability of the 'true' model becomes ambiguous if the true model is absent from the candidate set.

NICEFA Visual Mathematics. (2026). Model Selection Criteria: Derivation and Comparison of AIC and BIC: Visual Proof & Intuition. Retrieved from https://www.nicefa.org/library/general-linear-models-/model-selection-criteria--derivation-and-comparison-of-aic-and-bic
