Derivation of the F-statistic for Overall Model Significance and its Distribution

Derive the F-statistic for overall model significance in General Linear Models, understanding its distribution, geometric intuition, and practical implications for BSc students.

The Formal Theorem

Consider the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, where $\mathbf{y}$ is an $n \times 1$ vector of observations, $\mathbf{X}$ is an $n \times p$ design matrix of full column rank (including a column of ones for the intercept, so there are $p-1$ predictors), $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_{p-1})^T$ is a $p \times 1$ vector of coefficients, and $\boldsymbol{\epsilon}$ is an $n \times 1$ vector of error terms with $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$. The null hypothesis for overall model significance is $H_0: \beta_1 = \dots = \beta_{p-1} = 0$ (all predictor coefficients are zero, excluding the intercept) against the alternative $H_1: \text{at least one } \beta_j \neq 0$ for $j \in \{1, \dots, p-1\}$. The F-statistic for testing this hypothesis is given by:
$$
\begin{aligned}
F &= \frac{SSR / (p-1)}{SSE / (n-p)}, \\
SSR &= \mathbf{y}^T \left( \mathbf{H} - \frac{1}{n}\mathbf{J} \right) \mathbf{y} \quad \text{(Regression Sum of Squares)}, \\
SSE &= \mathbf{y}^T (\mathbf{I} - \mathbf{H}) \mathbf{y} \quad \text{(Residual Sum of Squares)}.
\end{aligned}
$$
Here, $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix, $\mathbf{J}$ is an $n \times n$ matrix of all ones, $p-1$ is the degrees of freedom for the regression (the number of predictors), and $n-p$ is the degrees of freedom for the residuals. Under the null hypothesis $H_0$, the F-statistic follows an F-distribution with $p-1$ and $n-p$ degrees of freedom:
$$F \sim F(p-1, n-p)$$
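
The quadratic forms in the theorem can be evaluated directly on simulated data. The following is a minimal sketch, assuming NumPy is available; the function name `overall_f_statistic`, the simulated coefficients, and the sample size are illustrative choices, not part of the theorem. As a sanity check it also computes $F$ from $R^2 = SSR/SST$ through the standard identity $F = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}$, which should agree up to rounding.

```python
import numpy as np

def overall_f_statistic(X, y):
    """Return (F, SSR, SSE) for a design matrix X whose first column is all ones.

    Hypothetical helper for illustration; it forms the n x n hat matrix
    explicitly, which is fine for small n but wasteful for large data sets.
    """
    n, p = X.shape
    # Hat matrix H = X (X^T X)^{-1} X^T; solve() avoids an explicit inverse.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    J = np.ones((n, n))
    ssr = y @ (H - J / n) @ y          # y^T (H - J/n) y
    sse = y @ (np.eye(n) - H) @ y      # y^T (I - H) y
    f_stat = (ssr / (p - 1)) / (sse / (n - p))
    return f_stat, ssr, sse

# Simulated data (assumed values, chosen only for the example).
rng = np.random.default_rng(0)
n = 50
predictors = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), predictors])
y = 1.0 + 2.0 * predictors[:, 0] - 1.0 * predictors[:, 1] + rng.normal(size=n)

f_stat, ssr, sse = overall_f_statistic(X, y)

# Cross-check through R^2 = SSR / SST.
p = X.shape[1]
sst = np.sum((y - y.mean()) ** 2)
r_squared = ssr / sst
f_from_r2 = (r_squared / (p - 1)) / ((1 - r_squared) / (n - p))
print(f_stat, f_from_r2)
```

To turn the statistic into a p-value one would compare it with the upper tail of $F(p-1, n-p)$, for example with a statistics library; that step is left out here to keep the sketch dependency-light.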

Analytical Intuition.

Imagine we're cinematographers trying to predict the box office success of a new film. Our 'model' is a combination of factors: star power, budget, genre, etc. The F-statistic is like a discerning critic, asking: "Does this ensemble of factors truly explain the film's success better than just guessing the average success of all films ever?" It quantifies the 'signal-to-noise ratio.' We compare the variance our model *explains* (the $SSR$ from our stellar cast and budget) to the variance it *fails to explain* (the $SSE$ due to unforeseen audience whims or bad marketing). If our model's explained variance is significantly larger than the unexplained noise, the F-statistic soars, suggesting our film's success isn't just random. This cinematic battle of variances determines if our predictive factors genuinely contribute, indicating overall model significance beyond mere chance. A high $F$ value suggests our story has predictive power.

Institutional Warning.

Students often confuse the F-test for overall significance with the individual t-tests for coefficients. A non-significant F-value is also often misread as showing that every predictor is useless; it says only that the predictors, taken together, do not significantly improve on the intercept-only model.
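
The one case in which the overall F-test and a coefficient t-test coincide is a model with a single predictor, where $F = t^2$ for the slope. The sketch below checks this numerically using only the Python standard library; the simulated intercept, slope, and sample size are assumptions made for the example. With two or more predictors the tests answer different questions and this identity no longer applies.

```python
import random

# Simulated single-predictor data (assumed values, for illustration only).
random.seed(1)
n = 30
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 + 1.5 * xi + random.gauss(0.0, 1.0) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares fit of y = intercept + slope * x.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual

mse = sse / (n - 2)                     # here p = 2, so n - p = n - 2
f_stat = (ssr / 1) / mse                # p - 1 = 1 regression degree of freedom
t_stat = slope / (mse / sxx) ** 0.5     # t-statistic for the slope
print(f_stat, t_stat ** 2)              # the two values agree up to rounding
```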

Academic Inquiries.

01

Why are $SSR$ and $SSE$ divided by $\sigma^2$ to get chi-squared distributions?

A quadratic form $\mathbf{z}^T \mathbf{A} \mathbf{z}$, where $\mathbf{z} \sim N(\mathbf{0}, \mathbf{I})$ and $\mathbf{A}$ is a symmetric idempotent matrix of rank $r$, follows a $\chi^2(r)$ distribution. Here, $\boldsymbol{\epsilon} / \sigma \sim N(\mathbf{0}, \mathbf{I})$. The identity $SSE = \boldsymbol{\epsilon}^T (\mathbf{I} - \mathbf{H}) \boldsymbol{\epsilon}$ always holds, and $SSR = \boldsymbol{\epsilon}^T (\mathbf{H} - \frac{1}{n}\mathbf{J}) \boldsymbol{\epsilon}$ holds under $H_0$. Dividing by $\sigma^2$ rewrites each as a quadratic form in the standard normal vector $\boldsymbol{\epsilon}/\sigma$, so $SSE/\sigma^2 \sim \chi^2(n-p)$ and, under $H_0$, $SSR/\sigma^2 \sim \chi^2(p-1)$.
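
The two identities used in this answer, and the ranks of the matrices involved, follow in a few lines. The steps rely on two facts: $\mathbf{H}\mathbf{X} = \mathbf{X}$, and $\mathbf{H}\mathbf{1} = \mathbf{1}$ because $\mathbf{1}$ is a column of $\mathbf{X}$. For a symmetric idempotent matrix the rank equals the trace.

$$
\begin{aligned}
(\mathbf{I} - \mathbf{H})\mathbf{X} &= \mathbf{X} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X} = \mathbf{0}, \\
SSE &= (\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})^T (\mathbf{I} - \mathbf{H}) (\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) = \boldsymbol{\epsilon}^T (\mathbf{I} - \mathbf{H}) \boldsymbol{\epsilon}, \\
\left(\mathbf{H} - \tfrac{1}{n}\mathbf{J}\right)\mathbf{1} &= \mathbf{1} - \mathbf{1} = \mathbf{0}, \\
SSR &= (\beta_0\mathbf{1} + \boldsymbol{\epsilon})^T \left(\mathbf{H} - \tfrac{1}{n}\mathbf{J}\right) (\beta_0\mathbf{1} + \boldsymbol{\epsilon}) = \boldsymbol{\epsilon}^T \left(\mathbf{H} - \tfrac{1}{n}\mathbf{J}\right) \boldsymbol{\epsilon} \quad \text{(under } H_0\text{)}, \\
\operatorname{rank}(\mathbf{I} - \mathbf{H}) &= \operatorname{tr}(\mathbf{I}) - \operatorname{tr}(\mathbf{H}) = n - p, \\
\operatorname{rank}\left(\mathbf{H} - \tfrac{1}{n}\mathbf{J}\right) &= \operatorname{tr}(\mathbf{H}) - \operatorname{tr}\left(\tfrac{1}{n}\mathbf{J}\right) = p - 1.
\end{aligned}
$$

Here $\operatorname{tr}(\mathbf{H}) = \operatorname{tr}\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\right) = p$ by the cyclic property of the trace.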

02

How does Cochran's Theorem apply here?

Cochran's Theorem states that if $\mathbf{z} \sim N(\mathbf{0}, \mathbf{I})$ and $\sum_{i=1}^k \mathbf{A}_i = \mathbf{I}$, where each $\mathbf{A}_i$ is a symmetric idempotent matrix of rank $r_i$, then $\mathbf{z}^T \mathbf{A}_i \mathbf{z} \sim \chi^2(r_i)$ and these chi-squared variables are independent. In our context, $\mathbf{y} - \mathbf{X}\boldsymbol{\beta} = \boldsymbol{\epsilon}$, and under $H_0$ the model reduces to $\mathbf{y} = \beta_0 \mathbf{1} + \boldsymbol{\epsilon}$. The identity splits into three orthogonal projection matrices, $\mathbf{I} = \frac{1}{n}\mathbf{J} + (\mathbf{H} - \frac{1}{n}\mathbf{J}) + (\mathbf{I} - \mathbf{H})$, of ranks $1$, $p-1$, and $n-p$; the last two decompose $SST$ into $SSR + SSE$. The matrices $\mathbf{P}_{SSR} = \mathbf{H} - \frac{1}{n}\mathbf{J}$ and $\mathbf{P}_{SSE} = \mathbf{I} - \mathbf{H}$ therefore satisfy the conditions of Cochran's theorem, which gives both the chi-squared distributions and the independence of $SSR/\sigma^2$ and $SSE/\sigma^2$ under $H_0$. Their ratio, each divided by its degrees of freedom, is the F-statistic.
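
The conditions Cochran's theorem needs can be checked numerically for a concrete design matrix. This is a small sketch, assuming NumPy and a randomly generated $\mathbf{X}$ with an intercept column; the sizes `n = 12` and `p = 4` are arbitrary. It verifies that $\frac{1}{n}\mathbf{J}$, $\mathbf{H} - \frac{1}{n}\mathbf{J}$, and $\mathbf{I} - \mathbf{H}$ are symmetric and idempotent, sum to the identity, and have ranks $1$, $p-1$, and $n-p$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 12, 4                                   # arbitrary illustrative sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
A_mean = np.ones((n, n)) / n                   # (1/n) J, projection onto the ones vector
A_reg = H - A_mean                             # matrix of the SSR quadratic form
A_res = np.eye(n) - H                          # matrix of the SSE quadratic form
parts = [A_mean, A_reg, A_res]

sums_to_identity = np.allclose(A_mean + A_reg + A_res, np.eye(n))
all_symmetric = all(np.allclose(A, A.T) for A in parts)
all_idempotent = all(np.allclose(A @ A, A) for A in parts)
reg_res_orthogonal = np.allclose(A_reg @ A_res, 0.0)

# For a symmetric idempotent matrix the rank equals the trace.
ranks = [int(round(np.trace(A))) for A in parts]

print(sums_to_identity, all_symmetric, all_idempotent, reg_res_orthogonal)
print(ranks, [1, p - 1, n - p])
```

A numerical check like this illustrates the algebra for one matrix; it does not replace the proof, which holds for any full-rank design matrix containing a column of ones.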

03

What is the role of the centering matrix $\mathbf{P}_{\mathbf{1}} = \mathbf{I} - \mathbf{1}\mathbf{1}^T/n$ and of $\frac{1}{n}\mathbf{J}$ in $SSR$?

The centering matrix accounts for the intercept. Since $\mathbf{J} = \mathbf{1}\mathbf{1}^T$, the matrix $\frac{1}{n}\mathbf{J}$ projects onto the intercept column and $\mathbf{P}_{\mathbf{1}} = \mathbf{I} - \frac{1}{n}\mathbf{J}$ removes the mean. $SST = \mathbf{y}^T \mathbf{P}_{\mathbf{1}} \mathbf{y}$ measures total variability around the mean $\bar{y}$. $SSR = \mathbf{y}^T (\mathbf{H} - \frac{1}{n}\mathbf{J}) \mathbf{y}$ measures the variability explained by the predictors *beyond* what is explained by the mean alone. If the model had only an intercept, $\mathbf{H}$ would simplify to $\frac{1}{n}\mathbf{J}$, making $SSR = 0$. Subtracting $\frac{1}{n}\mathbf{J}$ ensures that the degrees of freedom for $SSR$ correctly reflect the number of *additional* parameters introduced by the predictors (i.e., $p-1$).
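
The intercept-only case is easy to confirm numerically. The sketch below, assuming NumPy and an arbitrary made-up response vector, builds the hat matrix for a design matrix consisting of a single column of ones and checks that it equals $\frac{1}{n}\mathbf{J}$, so the $SSR$ quadratic form vanishes while $SST$ is the usual sum of squared deviations from the mean.

```python
import numpy as np

n = 6
ones = np.ones((n, 1))                          # intercept-only design matrix
H = ones @ np.linalg.solve(ones.T @ ones, ones.T)
J = np.ones((n, n))

y = np.array([2.0, 4.0, 1.0, 5.0, 3.0, 6.0])    # arbitrary illustrative data

ssr = y @ (H - J / n) @ y                       # zero up to rounding when H = J/n
centering = np.eye(n) - J / n                   # centering matrix I - 11^T/n
sst = y @ centering @ y                         # equals sum((y - mean)^2)

print(np.allclose(H, J / n), ssr, sst)
```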

04

What happens to the F-statistic if the model does not include an intercept?

If the model does not include an intercept, $\mathbf{X}$ is an $n \times p$ matrix of predictors with no column of ones, and the sums of squares are calculated differently. $SST$ would typically be $\mathbf{y}^T \mathbf{y}$ (the total uncorrected sum of squares), $SSR$ would be $\mathbf{y}^T \mathbf{H} \mathbf{y}$, and $SSE = \mathbf{y}^T (\mathbf{I} - \mathbf{H}) \mathbf{y}$. The degrees of freedom would also change: $df_{regression} = p$ (the number of predictors) and $df_{residual} = n-p$. The F-test, with $F = \frac{SSR/p}{SSE/(n-p)}$, would then test $H_0: \beta_1 = \dots = \beta_p = 0$ (all coefficients are zero), and under $H_0$ the statistic follows $F(p, n-p)$.
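
A no-intercept version of the computation might look like the following sketch, assuming NumPy and simulated data with made-up coefficients. Note that no centering is applied anywhere: the uncorrected total $\mathbf{y}^T\mathbf{y}$ splits into $\mathbf{y}^T\mathbf{H}\mathbf{y}$ and $\mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 3                                    # arbitrary illustrative sizes
X = rng.normal(size=(n, p))                     # predictors only, no column of ones
beta = np.array([1.0, -0.5, 0.0])               # assumed coefficients for the simulation
y = X @ beta + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix
ssr = y @ H @ y                                 # uncentered regression sum of squares
sse = y @ (np.eye(n) - H) @ y                   # residual sum of squares
sst_uncorrected = y @ y                         # total uncorrected sum of squares

f_stat = (ssr / p) / (sse / (n - p))            # p and n - p degrees of freedom
print(f_stat, ssr + sse, sst_uncorrected)
```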

Institutional Citation

Reference this proof in your academic research or publications.

NICEFA Visual Mathematics. (2026). Derivation of the F-statistic for Overall Model Significance and its Distribution: Visual Proof & Intuition. Retrieved from https://www.nicefa.org/library/general-linear-models-/derivation-of-the-f-statistic-for-overall-model-significance-and-its-distribution
