Question 1

Why does the cross-product term $ 2\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) $ vanish in the decomposition?

Accepted Answer

The term vanishes as a direct consequence of the Ordinary Least Squares (OLS) normal equations. When the model includes an intercept, the first normal equation forces the residuals to sum to zero, $ \sum (Y_i - \hat{Y}_i) = 0 $. The remaining normal equations make the residuals orthogonal to every regressor, and because $ \hat{Y}_i $ is a linear combination of the regressors, the residuals are also orthogonal to the fitted values: $ \sum (Y_i - \hat{Y}_i) \hat{Y}_i = 0 $. Combining the two, $ \sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = \sum (Y_i - \hat{Y}_i)\hat{Y}_i - \bar{Y}\sum (Y_i - \hat{Y}_i) = 0 - \bar{Y}(0) = 0 $, so the cross-product term $ 2\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) $ is zero as well.
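The argument can be checked numerically. The following is a minimal sketch, assuming a simple regression with one regressor and an intercept; the data values and the helper name `fit_simple_ols` are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x (intercept included)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data (assumed, not from the original text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)
resid = [yi - yh for yi, yh in zip(y, y_hat)]

# The two OLS properties used in the answer.
sum_resid = sum(resid)                                       # ~0
resid_dot_fitted = sum(e * yh for e, yh in zip(resid, y_hat))  # ~0

# The cross-product term and the three sums of squares.
cross = 2 * sum(e * (yh - y_bar) for e, yh in zip(resid, y_hat))
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum(e ** 2 for e in resid)

print(sum_resid, resid_dot_fitted, cross)
print(sst, ssr + sse)
```

Up to floating-point rounding, the first three printed values are zero and the last two agree, which is the decomposition $ \text{SST} = \text{SSR} + \text{SSE} $.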

Question 2

Is $ R^2 $ always between 0 and 1?

Accepted Answer

For OLS regression models that include an intercept, $ R^2 $ always lies between 0 and 1. $ \text{SSR} $, $ \text{SSE} $, and $ \text{SST} $ are sums of squares and therefore non-negative, and because $ \text{SST} = \text{SSR} + \text{SSE} $ holds in this setting, $ \text{SSR} $ cannot exceed $ \text{SST} $. However, if the model omits the intercept or is estimated by a method other than OLS, the cross-product term need not vanish and the decomposition can fail. The outcome then depends on the definition used: $ R^2 = 1 - \text{SSE}/\text{SST} $ can be negative, while $ R^2 = \text{SSR}/\text{SST} $ can exceed 1. A negative value under the first definition means the model fits the sample worse than simply predicting $ \bar{Y} $ for every observation.
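A small numerical illustration of the no-intercept case, as a sketch under assumed data: the response below hovers around 10 with almost no slope, so a line forced through the origin fits badly. The variable names and data values are hypothetical.

```python
# Illustrative data (assumed): y is nearly constant around 10.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.2, 9.9, 10.1, 9.8, 10.0]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)

# OLS with an intercept: y = b0 + b1 * x.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
fit_int = [b0 + b1 * xi for xi in x]
sse_int = sum((yi - fi) ** 2 for yi, fi in zip(y, fit_int))
r2_with_intercept = 1 - sse_int / sst

# OLS through the origin: y = b * x, with b = sum(x*y) / sum(x^2).
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
fit_org = [b * xi for xi in x]
sse_org = sum((yi - fi) ** 2 for yi, fi in zip(y, fit_org))
ssr_org = sum((fi - y_bar) ** 2 for fi in fit_org)

# The two common definitions no longer agree without an intercept.
r2_one_minus = 1 - sse_org / sst   # negative for this data
r2_ratio = ssr_org / sst           # greater than 1 for this data

print(r2_with_intercept, r2_one_minus, r2_ratio)
```

With the intercept, the value stays inside $ [0, 1] $. Without it, the two definitions diverge in opposite directions, which is why it matters to state which formula a reported $ R^2 $ uses.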

Question 3

What is the relationship between $ R^2 $ and the Pearson correlation coefficient $ r $?

Accepted Answer

For a simple linear regression model (one independent variable plus an intercept) fitted by OLS, the coefficient of determination $ R^2 $ equals the square of the Pearson product-moment correlation coefficient $ r $ between the independent variable $ X $ and the dependent variable $ Y $. That is, $ R^2 = (\text{corr}(X, Y))^2 $. In multiple linear regression with an intercept, $ R^2 $ equals the square of the multiple correlation coefficient, which is the Pearson correlation between the observed values $ Y_i $ and the predicted values $ \hat{Y}_i $, i.e., $ R^2 = (\text{corr}(Y, \hat{Y}))^2 $. The simple-regression result is a special case, because there $ \hat{Y}_i $ is a linear function of $ X_i $.
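Both identities can be verified on a toy data set. This is a minimal sketch assuming a simple regression with an intercept; the data and the helper name `pearson` are illustrative only.

```python
import math

def pearson(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

# Illustrative data (assumed, not from the original text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS fit with an intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# R^2 from the sums of squares.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
sst = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1 - sse / sst

r_xy_squared = pearson(x, y) ** 2         # (corr(X, Y))^2
r_yyhat_squared = pearson(y, y_hat) ** 2  # (corr(Y, Y_hat))^2

print(r2, r_xy_squared, r_yyhat_squared)
```

The three printed numbers agree up to floating-point rounding.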

Question 4

Does a high $ R^2 $ imply that a model is a good predictor or that the independent variables cause the dependent variable?

Accepted Answer

No, not necessarily. A high $ R^2 $ primarily indicates that a large proportion of the variance in the dependent variable $ Y $ is explained by the independent variables $ X $ within the sample data, suggesting a good *fit*. However, it does not imply causation; correlation is not causation. A high $ R^2 $ also doesn't guarantee predictive accuracy on new data (the model could be overfit) or that the model's underlying assumptions are met. Other diagnostic checks, such as residual analysis, out-of-sample validation, and theoretical justification, are essential for assessing a model's overall quality and reliability.
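The gap between in-sample fit and predictive accuracy can be shown with a deliberately misspecified model. In this sketch, which uses assumed data, a straight line is fitted to points generated from $ y = x^2 $ and then used to predict outside the fitted range. The out-of-sample $ R^2 $ here is computed as $ 1 - \text{SSE}/\text{SST} $ with the test-set mean as the baseline, which is one common convention rather than the only one.

```python
def fit_line(x, y):
    """Closed-form OLS for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def r_squared(y, y_pred):
    """1 - SSE/SST, with SST taken about the mean of the given y values."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

# Training data: the true relationship is quadratic, not linear (assumed).
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [xi ** 2 for xi in x_train]

# New data from the same relationship, outside the training range.
x_test = [6.0, 7.0, 8.0, 9.0, 10.0]
y_test = [xi ** 2 for xi in x_test]

b0, b1 = fit_line(x_train, y_train)
r2_train = r_squared(y_train, [b0 + b1 * xi for xi in x_train])
r2_test = r_squared(y_test, [b0 + b1 * xi for xi in x_test])

print(r2_train)  # high: the line tracks the training points closely
print(r2_test)   # negative: worse than predicting the test mean
```

The in-sample $ R^2 $ is above 0.9 even though the linear form is wrong, and the same model performs worse than a constant on the new points. A residual plot of the training fit would already reveal the curvature, which is why such diagnostics complement $ R^2 $.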

Decomposition of Total Sum of Squares: SST = SSR + SSE and its Implications for R²
