Derivation of the Ordinary Least Squares (OLS) Estimator: β̂ = (X'X)⁻¹X'Y

Master the OLS estimator derivation: \( \hat{\beta} = (X'X)^{-1}X'Y \). Explore the geometric orthogonality, matrix calculus, and Gauss-Markov foundations.


The Formal Theorem

Consider the linear model \( Y = X\beta + \epsilon \), where \( Y \in \mathbb{R}^{n \times 1} \) is the response vector, \( X \in \mathbb{R}^{n \times p} \) is the design matrix of full column rank, and \( \epsilon \sim (0, \sigma^2 I_n) \). The OLS estimator \( \hat{\beta} \) minimizes the residual sum of squares function \( S(\beta) = (Y - X\beta)'(Y - X\beta) \). The unique solution is given by:
\[ \hat{\beta} = (X'X)^{-1}X'Y \]
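
The result can be reached in a few lines of matrix calculus. What follows is a sketch of the standard argument in the same symbols as the theorem; it assumes only that \( X \) has full column rank, so that \( X'X \) is positive definite.

```latex
\begin{aligned}
S(\beta) &= (Y - X\beta)'(Y - X\beta) \\
         &= Y'Y - 2\beta'X'Y + \beta'X'X\beta \\[4pt]
\frac{\partial S}{\partial \beta} &= -2X'Y + 2X'X\beta \\[4pt]
\left.\frac{\partial S}{\partial \beta}\right|_{\beta = \hat{\beta}} = 0
  &\;\Longrightarrow\; X'X\hat{\beta} = X'Y \\
  &\;\Longrightarrow\; \hat{\beta} = (X'X)^{-1}X'Y \\[4pt]
\frac{\partial^2 S}{\partial \beta \, \partial \beta'} &= 2X'X \succ 0
\end{aligned}
```

The second line uses the fact that \( Y'X\beta \) is a scalar and therefore equals its transpose \( \beta'X'Y \). The last line is the second-order condition: because the Hessian \( 2X'X \) is positive definite, the stationary point is the unique global minimum.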

Analytical Intuition.

Imagine the response vector \( Y \) as a single point in \( n \)-dimensional space. The set of all possible predictions \( X\beta \) forms a flat subspace: the column space of \( X \). Because the observed \( Y \) rarely lies exactly in this subspace due to noise, we seek the orthogonal projection of \( Y \) onto the subspace spanned by the columns of \( X \). Think of \( X\hat{\beta} \) as the 'shadow' cast by \( Y \) onto that subspace. The OLS estimator \( \hat{\beta} \) gives the coordinates of this shadow with respect to the columns of \( X \). Minimizing the squared distance between the observed \( Y \) and the fitted vector \( X\hat{\beta} \) forces the residual vector \( e = Y - X\hat{\beta} \) to be orthogonal to every column of \( X \). This orthogonality condition, \( X'(Y - X\hat{\beta}) = 0 \), acts as a geometric 'locking' mechanism: it pins the solution to the unique point of the column space closest to \( Y \), separating the part of \( Y \) that the predictors can explain from the part they cannot.
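
The orthogonality condition can be checked numerically. The following is a minimal NumPy sketch on simulated data; the sample size, the coefficient values, and all variable names are illustrative assumptions and not part of the original proof.

```python
import numpy as np

# Hypothetical simulated data: n observations, p columns (intercept included).
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])   # illustrative values only
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X'X b = X'Y for the OLS estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Fitted vector (the "shadow") and residual vector e = Y - X beta_hat.
Y_hat = X @ beta_hat
e = Y - Y_hat

# Orthogonality: X'e should vanish up to floating-point error.
print(np.allclose(X.T @ e, 0.0, atol=1e-8))

# The projection matrix P = X (X'X)^{-1} X' maps Y onto the column space of X.
P = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(P @ Y, Y_hat))

# Pythagoras: ||Y||^2 = ||Y_hat||^2 + ||e||^2, since Y_hat and e are orthogonal.
print(np.isclose(Y @ Y, Y_hat @ Y_hat + e @ e))
```

Each check compares two quantities that the geometry says must coincide, so all three are expected to agree up to rounding error.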

Institutional Warning.

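Students often confuse the OLS estimator \( \hat{\beta} \), a random quantity computed from the sample, with the true population parameter \( \beta \), a fixed unknown. Furthermore, many struggle to distinguish between the 'normal equations' \( X'X\hat{\beta} = X'Y \) and the final closed-form estimator, failing to realize that the former is simply the orthogonality condition \( X'(Y - X\hat{\beta}) = 0 \) rearranged, while the latter follows from it only when \( X'X \) is invertible.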
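
Both distinctions can be made concrete with a small simulation. The sketch below is hypothetical: the helper names `ols_normal_equations` and `ols_closed_form`, the value of `beta_true`, and the simulated design are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
beta_true = np.array([0.5, 1.5])          # hypothetical population parameter
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])

def ols_normal_equations(X, Y):
    """Solve X'X b = X'Y directly, without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ols_closed_form(X, Y):
    """Evaluate (X'X)^{-1} X'Y literally; valid only when X'X is invertible."""
    return np.linalg.inv(X.T @ X) @ X.T @ Y

# Two independent samples from the same model give two different estimates,
# while beta_true stays fixed.
Y1 = X @ beta_true + rng.normal(size=n)
Y2 = X @ beta_true + rng.normal(size=n)
b1 = ols_normal_equations(X, Y1)
b2 = ols_normal_equations(X, Y2)
print("true beta:       ", beta_true)
print("estimate, draw 1:", b1)
print("estimate, draw 2:", b2)

# With full column rank, both routes give the same numbers.
print(np.allclose(b1, ols_closed_form(X, Y1)))
```

In numerical work, solving the normal equations (or using a least-squares routine) is generally preferred to forming the explicit inverse, although the two agree mathematically whenever \( X'X \) is invertible.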

Academic Inquiries.

01

What happens if X is not full column rank?

If \( \text{rank}(X) < p \), the matrix \( X'X \) is singular (non-invertible). The normal equations then have infinitely many solutions, so \( \hat{\beta} \) is not uniquely determined, and the model is said to suffer from perfect multicollinearity.
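
A rank-deficient design is easy to construct and inspect. In the following sketch the design matrix is a hypothetical example in which the third column is the exact sum of the first two; `np.linalg.lstsq` is used because it still returns one particular solution (the minimum-norm one) when no unique solution exists.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Hypothetical design: column 3 is an exact linear combination of columns 1 and 2.
X = np.column_stack([x1, x2, x1 + x2])
Y = rng.normal(size=n)

p = X.shape[1]
print(np.linalg.matrix_rank(X), "<", p)          # rank is below the column count

# lstsq still returns one solution: the minimum-norm one.
b_min, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)

# Any vector in the null space of X can be added without changing the fit.
null_dir = np.array([1.0, 1.0, -1.0])            # x1 + x2 - (x1 + x2) = 0
b_alt = b_min + 3.0 * null_dir

print(np.allclose(X @ null_dir, 0.0))
print(np.allclose(X @ b_min, X @ b_alt))         # identical fitted values
print(np.allclose(X.T @ X @ b_alt, X.T @ Y))     # b_alt also solves the normal equations
```

The coefficient vectors `b_min` and `b_alt` differ, yet they produce the same fitted values: the projection of \( Y \) onto the column space is still unique even though its coordinates are not.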

02

Is the OLS estimator always the best choice?

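Under the Gauss-Markov assumptions (linearity, no perfect multicollinearity, zero conditional mean of errors, and homoscedastic, uncorrelated errors), OLS is the best linear unbiased estimator: it has the smallest variance among all linear unbiased estimators. If these assumptions are violated, or if some bias is acceptable in exchange for lower variance, other estimators may be preferred. Generalized least squares handles heteroscedastic or correlated errors, while Ridge or Lasso are common choices when predictors are nearly collinear.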
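
To illustrate one of the alternatives named above, here is a minimal sketch of the Ridge estimator \( (X'X + \lambda I)^{-1}X'Y \) next to OLS on nearly collinear predictors. The data, the helper names `ols` and `ridge`, and the penalty value are assumptions for illustration; intercept handling and predictor standardization, which are usual in practice, are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
z = rng.normal(size=n)
# Hypothetical nearly collinear predictors: the second is the first plus small noise.
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])
Y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

def ols(X, Y):
    """OLS estimate from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ridge(X, Y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'Y for a penalty lam >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

b_ols = ols(X, Y)
b_ridge = ridge(X, Y, lam=1.0)   # lam = 1.0 is an arbitrary illustrative choice

print("cond(X'X):", np.linalg.cond(X.T @ X))
print("OLS:  ", b_ols, " norm:", np.linalg.norm(b_ols))
print("Ridge:", b_ridge, " norm:", np.linalg.norm(b_ridge))
```

For any positive penalty the Ridge coefficient vector has a smaller Euclidean norm than the OLS one, which is the shrinkage that trades a little bias for reduced variance.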

03

Does the derivation require the assumption that \( \epsilon \) is normally distributed?

No. The OLS derivation only requires minimizing the sum of squares; it makes no assumption about the distribution of \( \epsilon \) for the point estimate. Normality is needed only for exact finite-sample hypothesis tests and confidence intervals, such as the usual t and F procedures.
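
A short simulation makes the point tangible. The sketch below draws strongly skewed, mean-zero errors (a centered exponential distribution, chosen here purely as an example) and applies the same closed-form estimator; the parameter values, sample size, and number of replications are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 2000
beta_true = np.array([2.0, -1.0])          # hypothetical population parameter
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])

# Precompute the linear map A = (X'X)^{-1} X', so that beta_hat = A @ Y.
A = np.linalg.solve(X.T @ X, X.T)

estimates = np.empty((reps, 2))
for r in range(reps):
    # Centered exponential errors: mean zero, variance one, strongly skewed.
    eps = rng.exponential(scale=1.0, size=n) - 1.0
    Y = X @ beta_true + eps
    estimates[r] = A @ Y

print("true beta:        ", beta_true)
print("mean of estimates:", estimates.mean(axis=0))
```

Nothing in the computation of `A @ Y` refers to the error distribution, and the average of the estimates across replications should sit close to `beta_true`, consistent with unbiasedness under mean-zero errors.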

Standardized References.

  • Definitive Institutional Source: Seber, G.A.F. and Lee, A.J., Linear Regression Analysis.

Institutional Citation

Reference this proof in your academic research or publications.

NICEFA Visual Mathematics. (2026). Derivation of the Ordinary Least Squares (OLS) Estimator: β̂ = (X'X)⁻¹X'Y: Visual Proof & Intuition. Retrieved from https://www.nicefa.org/library/general-linear-models-/derivation-of-the-ordinary-least-squares--ols--estimator--------x-x---x-y
