Derivation of the Ordinary Least Squares (OLS) Estimator: β̂ = (X'X)⁻¹X'Y

The Formal Theorem

Consider the linear model $Y = X\beta + \epsilon$, where $Y \in \mathbb{R}^{n \times 1}$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix of full column rank, and $\epsilon \sim (0, \sigma^2 I_n)$. The OLS estimator $\hat{\beta}$ minimizes the residual sum of squares function $S(\beta) = (Y - X\beta)'(Y - X\beta)$. The unique solution is given by:

$$ \hat{\beta} = (X'X)^{-1}X'Y $$
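The closed form can be reached in a few lines of matrix calculus. The following is a sketch of the standard argument using only the symbols defined in the theorem; it assumes the identities $\partial(a'\beta)/\partial\beta = a$ and $\partial(\beta'A\beta)/\partial\beta = 2A\beta$ for symmetric $A$.

$$
\begin{aligned}
S(\beta) &= (Y - X\beta)'(Y - X\beta) \\
         &= Y'Y - 2\beta'X'Y + \beta'X'X\beta, \\
\frac{\partial S}{\partial \beta} &= -2X'Y + 2X'X\beta, \\
\frac{\partial S}{\partial \beta}\Big|_{\beta = \hat{\beta}} = 0
  \;&\Longrightarrow\; X'X\hat{\beta} = X'Y
  \;\Longrightarrow\; \hat{\beta} = (X'X)^{-1}X'Y, \\
\frac{\partial^2 S}{\partial \beta\,\partial \beta'} &= 2X'X.
\end{aligned}
$$

Because $X$ has full column rank, $X'X$ is positive definite, so the Hessian $2X'X$ is positive definite and the stationary point is the unique global minimum of $S$.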

Analytical Intuition.

Imagine the data as a cloud of points in high-dimensional space. As $\beta$ varies, the model $X\beta$ traces out a flat subspace of possible predictions: the column space of $X$. Because the observed data $Y$ rarely falls exactly in this subspace due to noise, we seek the orthogonal projection of $Y$ onto the subspace spanned by the columns of $X$. Think of $X\hat{\beta}$ as the 'shadow' cast by $Y$ onto that subspace. The OLS estimator $\hat{\beta}$ gives the coordinates of this shadow with respect to the columns of $X$. By minimizing the squared distance between the observed $Y$ and our projection $X\hat{\beta}$, we ensure that the residual vector $e = Y - X\hat{\beta}$ is orthogonal to every column of $X$. This orthogonality condition, $X'(Y - X\hat{\beta}) = 0$, acts as a geometric 'locking' mechanism: it pins the solution to the unique point of the subspace closest to $Y$, leaving in $e$ only the part of $Y$ that the predictors cannot explain.
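The orthogonality condition can be checked numerically. The sketch below uses NumPy on simulated data; the sample size, coefficients, noise scale, and seed are arbitrary assumptions, not values from the text. It solves the normal equations $X'X\hat{\beta} = X'Y$ and confirms that the residual vector is orthogonal to every column of $X$ up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n, p = 200, 3  # assumed sample size and number of columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])  # hypothetical "true" coefficients
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X'X b = X'Y (avoids forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Residual vector e = Y - X beta_hat and the orthogonality check X'e = 0.
e = Y - X @ beta_hat
orthogonality = X.T @ e

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("beta_hat:", beta_hat)
print("max |X'e|:", np.abs(orthogonality).max())
```

Solving the linear system with `np.linalg.solve` rather than computing $(X'X)^{-1}$ is a common numerical choice; both express the same estimator when $X$ has full column rank.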

CAUTION

Institutional Warning.

Students often confuse the OLS estimator $\hat{\beta}$, which is computed from the sample, with the true population parameter $\beta$, which is unknown. Many also struggle to distinguish the 'normal equations' $X'X\hat{\beta} = X'Y$ from the final closed-form estimator, failing to realize that the former are simply the orthogonality condition $X'(Y - X\hat{\beta}) = 0$ rearranged, and that the closed form follows from them only when $X'X$ is invertible.

Academic Inquiries.

What happens if X is not full column rank?

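If $\text{rank}(X) < p$, the matrix $X'X$ is singular (non-invertible). The normal equations then have infinitely many solutions, so $\hat{\beta}$ is not uniquely determined, although the fitted values $X\hat{\beta}$ still are. The model is said to suffer from perfect multicollinearity.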
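A small numerical illustration of the rank-deficient case, as a sketch with NumPy on simulated data (the data-generating process, the duplicated column, and the seed are assumptions made for the example). One column is an exact multiple of another, so two different coefficient vectors satisfy the normal equations and give identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1  # exact linear dependence between two predictors
X = np.column_stack([np.ones(n), x1, x2])
Y = 1.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)  # hypothetical data-generating process

rank = np.linalg.matrix_rank(X)  # smaller than the number of columns p = 3

# With a rank-deficient X, lstsq returns the minimum-norm least-squares solution.
b_min_norm, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Adding any null-space vector of X changes the coefficients but not X b.
null_vec = np.array([0.0, 2.0, -1.0])  # X @ null_vec = 2*x1 - x2 = 0
b_other = b_min_norm + 5.0 * null_vec

fitted_a = X @ b_min_norm
fitted_b = X @ b_other

print("rank of X:", rank)
print("coefficients differ:", not np.allclose(b_min_norm, b_other))
print("fitted values agree:", np.allclose(fitted_a, fitted_b))
```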

Is the OLS estimator always the best choice?

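Only in a specific sense. Under the Gauss-Markov assumptions (linearity, no perfect multicollinearity, zero conditional mean of errors, and homoscedastic, uncorrelated errors), OLS has the smallest variance among all linear unbiased estimators. If these assumptions are violated, or if a small bias is acceptable in exchange for lower variance, other estimators like Ridge or Lasso may be preferred.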
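To make the comparison with Ridge concrete, here is a minimal sketch using the standard ridge closed form $(X'X + \lambda I)^{-1}X'Y$. The penalty value `lam`, the simulated nearly collinear predictors, and the omission of an intercept are all assumptions chosen to keep the example short; in practice the penalty is usually tuned, for example by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])  # no intercept, to keep the sketch short
Y = x1 + x2 + rng.normal(scale=0.5, size=n)  # hypothetical data-generating process

lam = 1.0  # assumed penalty strength
p = X.shape[1]

# OLS: solve X'X b = X'Y. Ridge: solve (X'X + lam * I) b = X'Y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
print("norms:", np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

For any $\lambda > 0$ the ridge coefficient vector has a smaller Euclidean norm than the OLS one; that shrinkage is the source of both its bias and its reduced variance.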

Does the derivation require the assumption that $\epsilon$ is normally distributed?

No. The OLS derivation only requires minimizing the sum of squares; the point estimate makes no assumption about the distribution of $\epsilon$. Normality is needed only for exact finite-sample hypothesis tests and confidence intervals.
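As a quick check that the point estimate does not depend on normality, the sketch below draws skewed, mean-zero errors (a centered exponential, chosen here purely as an example of a non-normal distribution) and verifies that $\hat{\beta}$ from the normal equations still attains a lower residual sum of squares than randomly perturbed coefficient vectors. The helper `rss`, the coefficients, and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])  # hypothetical coefficients
eps = rng.exponential(scale=1.0, size=n) - 1.0  # skewed, mean-zero, non-normal errors
Y = X @ beta_true + eps


def rss(b):
    """Residual sum of squares S(b) = (Y - Xb)'(Y - Xb)."""
    r = Y - X @ b
    return r @ r


beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Evaluate S at randomly perturbed coefficient vectors; none should beat beta_hat.
perturbed = beta_hat + rng.normal(scale=0.1, size=(1000, 2))
rss_perturbed = np.array([rss(b) for b in perturbed])
beats_ols = int((rss_perturbed < rss(beta_hat)).sum())

print("perturbations with lower RSS than beta_hat:", beats_ols)
```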

Standardized References.

Seber, G.A.F. and Lee, A.J., Linear Regression Analysis.

Institutional Citation

Reference this proof in your academic research or publications.

NICEFA Visual Mathematics. (2026). Derivation of the Ordinary Least Squares (OLS) Estimator: β̂ = (X'X)⁻¹X'Y: Visual Proof & Intuition. Retrieved from https://www.nicefa.org/library/general-linear-models-/derivation-of-the-ordinary-least-squares--ols--estimator--------x-x---x-y
