Review of Linear Regression
An overview of linear regression.
Formulation
Linear regression can be viewed from two perspectives: the scalar form and the matrix form. For clear presentation, we use the following notation:
- \(n\) is the number of samples
- \(d\) is the number of parameters, i.e., the intercept plus \(d-1\) feature coefficients
- The Scalar Form: The linear regression model is given by
\[\begin{aligned} \forall i=1, 2, \dots, n, \quad y_i &= \boldsymbol{\beta}^\top \boldsymbol{x}_i + \epsilon_i \\ &= \beta_0 \cdot 1 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{d-1} x_{i(d-1)} + \epsilon_i \end{aligned}\]
- \(y_i \in \mathbb{R}\) is the dependent (response) variable
- \(\boldsymbol{x}_i=(1,x_{i1},\dots,x_{i(d-1)})^\top \in \mathbb{R}^{d}\) is the feature vector, with a leading 1 for the intercept
- \(\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_{d-1})^\top \in \mathbb{R}^d\) is the parameter vector, where \(\beta_0 \in \mathbb{R}\) is the intercept and \(\beta_1, \beta_2, \dots, \beta_{d-1} \in \mathbb{R}\) are the coefficients of the features
- \(\epsilon_i \in \mathbb{R}\) is the error term
We solve for the parameter vector \(\boldsymbol{\beta}\) through the following optimization problem:
- Goal (Minimize SSE): \[\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \dots - \beta_{d-1} x_{i(d-1)})^2\]
- Solution (Set \(\frac{\partial \mathrm{SSE}}{\partial \beta_j}\) to 0): \[\forall j = 1, 2,\dots,d-1, \quad \beta_j = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_{ij} - \bar{x}_j)}{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}\] \[\beta_0 = \bar{y} - \sum_{j=1}^{d-1} \beta_j \bar{x}_j\] Note that these per-coordinate formulas are exact for simple regression (\(d = 2\)) or when the features are mutually uncorrelated in the sample; in general, setting the partial derivatives to zero yields a system of normal equations that must be solved jointly, which is exactly what the matrix form below does.
- The Matrix Form: The linear regression model is given by
\[\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\] \[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1(d-1)} \\ 1 & x_{21} & x_{22} & \dots & x_{2(d-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{n(d-1)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{d-1} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\]
- \(\boldsymbol{y} \in \mathbb{R}^n\) is the response vector
- \(\boldsymbol{X} \in \mathbb{R}^{n \times d}\) is the feature matrix
- \(\boldsymbol{\beta} \in \mathbb{R}^d\) is the parameter vector
- \(\boldsymbol{\epsilon} \in \mathbb{R}^n\) is the error vector
Again, we solve for the parameter vector \(\boldsymbol{\beta}\) through the following optimization problem:
- Goal (Minimize SSE): \[\min_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta}\|_2^2 \Leftrightarrow \min_{\boldsymbol{\beta}} (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta})^\top (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta})\]
- Solution (Set \(\frac{\partial \mathrm{SSE}}{\partial \boldsymbol{\beta}}\) to 0): \[\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y},\] where \(\hat{\boldsymbol{\beta}}\) is the ordinary least squares (OLS) estimator of the parameter vector \(\boldsymbol{\beta}\). (A NumPy sketch of this computation follows the list.)
- Note: The solution is only valid when \(\boldsymbol{X}^\top \boldsymbol{X}\) is invertible, which requires that the columns of \(\boldsymbol{X}\) are linearly independent. This is an assumption of linear regression (see Assumption 3 below).
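To make the matrix-form solution concrete, here is a minimal NumPy sketch (the simulated data and variable names are illustrative assumptions, not part of the formulation above) that builds a design matrix with an intercept column and computes the OLS estimator by solving the normal equations rather than forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a data set: n samples, d - 1 features, plus an intercept column of ones.
n, d = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # (n, d) design matrix
beta_true = np.array([2.0, 1.0, -3.0, 0.5])                     # (beta_0, ..., beta_{d-1})
y = X @ beta_true + rng.normal(size=n)                          # add N(0, 1) errors

# OLS estimator beta_hat = (X^T X)^{-1} X^T y, computed by solving the
# normal equations (X^T X) beta = X^T y instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```

Equivalently, `np.linalg.lstsq(X, y, rcond=None)[0]` returns the same least-squares solution and degrades more gracefully when \(\boldsymbol{X}\) is close to rank deficient.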
Assumptions
The following assumptions of linear regression are quoted from “Introductory Econometrics: A Modern Approach, Sixth Edition” by Jeffrey M. Wooldridge, with slight modifications.
- Linear in Parameters: In the population model, the dependent variable \(y\) is related to the independent variables \(\boldsymbol{x}\) through a linear relationship as \[\begin{aligned} y &= \boldsymbol{\beta}^\top \boldsymbol{x} + \epsilon \\ &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{d-1} x_{d-1} + \epsilon, \end{aligned}\] where \(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \dots, \beta_{d-1})^\top\) is the unknown parameter vector (constant) and \(\epsilon\) is an unobserved random error.
- Random Sampling: We have a random sample of size \(n\), \(\{(\boldsymbol{x}_i, y_i) : i=1, 2, \dots, n\}\), following the linear relationship.
- No Perfect Collinearity: In the sample, none of the independent variables is constant, and there are no exact linear relationships among the independent variables. In other words, the columns of \(\boldsymbol{X}\) are linearly independent, i.e., there does not exist a nonzero vector \(\boldsymbol\alpha \in \mathbb{R}^d\) such that \[\boldsymbol{X} \boldsymbol\alpha = \boldsymbol{0},\] where \(\boldsymbol{0} \in \mathbb{R}^n\) is the zero vector. (A rank-check sketch follows this list.)
- Zero Conditional Mean: The error \(\epsilon\) has an expected value of zero given any values of the explanatory variables. In other words, \[\mathbb{E}[\epsilon \mid \boldsymbol{x}] = 0\]
- Homoscedasticity: The error \(\epsilon\) has the same variance given any value of the explanatory variables. In other words, \[\mathrm{Var}(\epsilon | \boldsymbol{x}) = \sigma^2\]
- Normality: The error \(\epsilon\) is independent of the explanatory variables and follows a normal distribution with mean 0 and variance \(\sigma^2\). In other words, \[\epsilon \sim \mathcal{N}(0, \sigma^2)\]
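Assumption 3 is straightforward to check numerically. The sketch below (an illustrative helper of my own, not from the source) flags perfect collinearity by comparing the rank of the design matrix with its number of columns, which is exactly the condition for \(\boldsymbol{X}^\top \boldsymbol{X}\) to be invertible:

```python
import numpy as np

def has_no_perfect_collinearity(X: np.ndarray) -> bool:
    """Return True if the columns of X are linearly independent,
    i.e. X^T X is invertible and the OLS formula is well defined."""
    d = X.shape[1]
    return bool(np.linalg.matrix_rank(X) == d)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
print(has_no_perfect_collinearity(X))                              # True
print(has_no_perfect_collinearity(np.column_stack([X, X[:, 1]])))  # False: duplicated column
```

Near-collinearity (an almost rank-deficient \(\boldsymbol{X}\)) does not violate the assumption, but it makes \(\boldsymbol{X}^\top \boldsymbol{X}\) ill-conditioned and the estimates numerically unstable.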
Notes
- Unbiasedness: If Assumptions 1-4 hold, then the OLS estimator \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}\) is unbiased.
Proof: Notice that
\[\begin{aligned} \hat{\boldsymbol{\beta}} &= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} \\ &= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top (\boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}) \\ &= \boldsymbol{\beta} + (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} \\ \end{aligned}\]Then we compute the conditional expectation
\[\begin{aligned} \mathbb{E}[\hat{\boldsymbol{\beta}} \mid \boldsymbol{X}] &= \mathbb{E}[\boldsymbol{\beta} + (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} \mid \boldsymbol{X}] \\ &= \mathbb{E}[\boldsymbol{\beta} \mid \boldsymbol{X}] + (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \mathbb{E}[\boldsymbol{\epsilon} \mid \boldsymbol{X}] \\ &= \boldsymbol{\beta} + (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{0} \\ &= \boldsymbol{\beta} \end{aligned}\]
- BLUE (Gauss-Markov Theorem): If Assumptions 1-5 hold, then the OLS estimator \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}\) is the best linear unbiased estimator (BLUE), where BLUE is defined as follows:
- The term linear means that the estimator \(\boldsymbol{\tilde{\beta}}\) is a linear function of the observed response vector \(\boldsymbol{y}\), \[\boldsymbol{\tilde{\beta}} = \boldsymbol{A} \boldsymbol{y}\] , where \(\boldsymbol{A} \in \mathbb{R}^{d \times n}\) is a matrix of constants. And for the OLS estimator, we have \(\boldsymbol{A} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top\).
- The term unbiased means that the expected value of the estimator is equal to the true value of the parameter being estimated, \[\mathbb{E}[\boldsymbol{\tilde{\beta}} \mid \boldsymbol{X}] = \boldsymbol{\beta}\]
- The term best means that the OLS estimator has the smallest covariance among all linear unbiased estimators, \[\mathrm{Cov}(\boldsymbol{\tilde{\beta}} \mid \boldsymbol{X}) \succeq \mathrm{Cov}(\hat{\boldsymbol{\beta}} \mid \boldsymbol{X})\] for every linear unbiased estimator \(\boldsymbol{\tilde{\beta}}\), where \(\succeq\) means that the matrix on the left minus the matrix on the right is positive semidefinite.
Proof:
- Step 1. Construct a Linear Estimator: This step is tricky. Suppose that \(\boldsymbol{\tilde{\beta}}\) is any linear estimator of \(\boldsymbol{\beta}\) in the form \[\begin{aligned} \boldsymbol{\tilde{\beta}} &= \boldsymbol{C} \boldsymbol{y} \\ &= \left[(\boldsymbol{X} ^ \top \boldsymbol{X})^{-1} \boldsymbol{X} ^ \top + \boldsymbol{D}\right] \boldsymbol{y} \\ &= \hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{y}, \end{aligned}\] where \(\boldsymbol{D} \in \mathbb{R}^{d \times n}\) is a matrix of constants and \(\boldsymbol{C} = (\boldsymbol{X} ^ \top \boldsymbol{X})^{-1} \boldsymbol{X} ^ \top + \boldsymbol{D} \in \mathbb{R}^{d \times n}\).
- Step 2. Satisfy Unbiasedness: Then we compute the conditional expectation of \(\boldsymbol{\tilde{\beta}}\) given \(\boldsymbol{X}\) as follows: \[\begin{aligned} \mathbb{E}[\boldsymbol{\tilde{\beta}} \mid \boldsymbol{X}] &= \mathbb{E}[\hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{y} \mid \boldsymbol{X}] \\ &= \boldsymbol{\beta} + \boldsymbol{D} \mathbb{E}[\boldsymbol{y} \mid \boldsymbol{X}] \\ &= \boldsymbol{\beta} + \boldsymbol{D} \mathbb{E}[\boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \mid \boldsymbol{X}] \\ &= \boldsymbol{\beta} + \boldsymbol{D} \boldsymbol{X} \boldsymbol{\beta} \\ \end{aligned}\] In order for \(\boldsymbol{\tilde{\beta}}\) to be unbiased for every possible \(\boldsymbol{\beta}\), we must have \[\boldsymbol{D} \boldsymbol{X} = \boldsymbol{0},\] where \(\boldsymbol{0} \in \mathbb{R}^{d \times d}\) is the zero matrix.
- Step 3. Compute Covariance: Next, we compute the covariance of \(\boldsymbol{\tilde{\beta}}\) given \(\boldsymbol{X}\) as follows:
\[\begin{aligned} \mathrm{Cov}(\boldsymbol{\tilde{\beta}} \mid \boldsymbol{X}) &= \mathrm{Cov}(\hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{y} \mid \boldsymbol{X}) \\ &= \mathrm{Cov}(\hat{\boldsymbol{\beta}} + \boldsymbol{D} (\boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}) \mid \boldsymbol{X}) \\ &= \mathrm{Cov}(\hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{\epsilon} \mid \boldsymbol{X}) \\ &= \underbrace{\mathbb{E}[(\hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{\epsilon})(\hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{\epsilon})^\top \mid \boldsymbol{X}]}_{u} - \underbrace{\mathbb{E}[\hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{\epsilon} \mid \boldsymbol{X}] \mathbb{E}[\hat{\boldsymbol{\beta}} + \boldsymbol{D} \boldsymbol{\epsilon} \mid \boldsymbol{X}]^\top}_{v} \end{aligned}\] And we compute the two terms \(u\) and \(v\) as follows (the cross terms involving \(\mathbb{E}[\boldsymbol{\epsilon} \mid \boldsymbol{X}] = \boldsymbol{0}\) drop out by Assumption 4):
\[\begin{aligned} u &= \mathbb{E}[(\boldsymbol{\beta} + (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} + \boldsymbol{D} \boldsymbol{\epsilon})\cdot (\boldsymbol{\beta} + (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} + \boldsymbol{D} \boldsymbol{\epsilon})^\top \mid \boldsymbol{X}] \\ &= \mathbb{E}\left[(\boldsymbol{\beta} + [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top + \boldsymbol{D}]\boldsymbol{\epsilon}) \cdot (\boldsymbol{\beta} + [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top + \boldsymbol{D}]\boldsymbol{\epsilon})^\top \mid \boldsymbol{X}\right] \\ &= \boldsymbol{\beta} \boldsymbol{\beta}^\top + [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top + \boldsymbol{D}] \mathbb{E}[\boldsymbol{\epsilon} \boldsymbol{\epsilon}^\top \mid \boldsymbol{X}] [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top + \boldsymbol{D}]^\top \\ &= \boldsymbol{\beta} \boldsymbol{\beta}^\top + \sigma^2 [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top + \boldsymbol{D}] [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top + \boldsymbol{D}]^\top \\ &= \boldsymbol{\beta} \boldsymbol{\beta}^\top + \sigma^2 [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} (\boldsymbol{X}^\top \boldsymbol{X})(\boldsymbol{X}^\top \boldsymbol{X})^{-1} + (\boldsymbol{X}^\top \boldsymbol{X})^{-1}(\underbrace{\boldsymbol{D}\boldsymbol{X}}_{\boldsymbol{0}})^\top + \underbrace{\boldsymbol{D} \boldsymbol{X}}_{\boldsymbol{0}} (\boldsymbol{X}^\top \boldsymbol{X})^{-1} + \boldsymbol{D} \boldsymbol{D}^\top] \\ &= \boldsymbol{\beta} \boldsymbol{\beta}^\top + \sigma^2 [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} + \boldsymbol{D} \boldsymbol{D}^\top] \\ \\ v &= (\boldsymbol{\beta} + \boldsymbol{D} \mathbb{E}[\boldsymbol{\epsilon} \mid \boldsymbol{X}]) (\boldsymbol{\beta} + \boldsymbol{D} \mathbb{E}[\boldsymbol{\epsilon} \mid \boldsymbol{X}])^\top \\ &= \boldsymbol{\beta} \boldsymbol{\beta}^\top \end{aligned}\]Then we have \[\mathrm{Cov}(\boldsymbol{\tilde{\beta}} \mid \boldsymbol{X}) = \sigma^2 [(\boldsymbol{X}^\top \boldsymbol{X})^{-1} + \boldsymbol{D} \boldsymbol{D}^\top]\] At the same time, we know that \[\mathrm{Cov}(\hat{\boldsymbol{\beta}} \mid \boldsymbol{X}) = \sigma^2 (\boldsymbol{X}^\top \boldsymbol{X})^{-1}\] Finally we have \[\mathrm{Cov}(\tilde{\boldsymbol{\beta}} \mid \boldsymbol{X}) = \mathrm{Cov}(\hat{\boldsymbol{\beta}} \mid \boldsymbol{X}) + \sigma^2 \boldsymbol{D} \boldsymbol{D}^\top \succeq \mathrm{Cov}(\hat{\boldsymbol{\beta}} \mid \boldsymbol{X})\]
That is, the OLS estimator \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}\) is the best linear unbiased estimator (BLUE).
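To make the unbiasedness and Gauss-Markov statements above concrete, here is a small Monte Carlo sketch (my own illustration, not from the source text): it repeatedly redraws the error vector for a fixed design matrix, then checks that the OLS estimates average out to the true \(\boldsymbol{\beta}\) and that an alternative linear unbiased estimator (weighted least squares with arbitrary weights, so that \(\boldsymbol{C}\boldsymbol{X} = \boldsymbol{I}\)) has larger variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design matrix (intercept + 2 features) and true parameters.
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
sigma = 1.0

# An alternative *linear unbiased* estimator: weighted least squares with
# arbitrary (wrong) weights W. Since C X = I, it is unbiased, but by the
# Gauss-Markov theorem it cannot beat OLS under homoscedastic errors.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
C = np.linalg.solve(X.T @ W @ X, X.T @ W)   # shape (d, n)

ols_estimates, wls_estimates = [], []
for _ in range(5000):
    eps = rng.normal(scale=sigma, size=n)   # redraw errors, keep X fixed
    y = X @ beta_true + eps
    ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
    wls_estimates.append(C @ y)

ols_estimates = np.array(ols_estimates)
wls_estimates = np.array(wls_estimates)

print("mean OLS estimate:", ols_estimates.mean(axis=0))  # approx beta_true (unbiased)
print("mean WLS estimate:", wls_estimates.mean(axis=0))  # also approx beta_true
print("OLS variances:    ", ols_estimates.var(axis=0))   # componentwise, typically smaller ...
print("WLS variances:    ", wls_estimates.var(axis=0))   # ... than these (OLS is BLUE)
```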
- Hypothesis Testing: If Assumptions 1-6 hold, then we can test hypotheses about individual coefficients \(\beta_j\) using the following statistics (a worked sketch follows this list):
- Null Hypothesis: \[H_0: \beta_j = 0,\] where \(\beta_j\) is the \(j\)-th element of the parameter vector \(\boldsymbol{\beta}\).
- z-score (for \(\sigma^2\) known): The z-score is given by \[\begin{aligned} z_j &= \frac{\hat{\beta}_j - \beta_j}{\sqrt{\mathrm{Var}(\hat{\beta}_j)}} \\ &= \frac{\hat{\beta}_j - \beta_j}{\sqrt{\sigma^2 (\boldsymbol{X}^\top \boldsymbol{X})^{-1}_{jj}}} \\ &\sim \mathcal{N}(0, 1), \end{aligned}\] where \(\mathcal{N}(0, 1)\) is the standard normal distribution.
- t-statistic (for \(\sigma^2\) unknown): First, we need to estimate the variance of the error term \(\sigma^2\) as follows: \[\begin{aligned} \hat{\sigma}^2 &= \frac{1}{n-d} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \\ &= \frac{1}{n-d} (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})^\top (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}}) \end{aligned}\] Then the t-statistic is given by \[\begin{aligned} t_j &= \frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 (\boldsymbol{X}^\top \boldsymbol{X})^{-1}_{jj}}} \\ &\sim t_{n-d}, \end{aligned}\] where \(t_{n-d}\) is the t-distribution with \((n-d)\) degrees of freedom.
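As a worked illustration of the t-test above, here is a minimal sketch under the same simulated-data assumptions as the earlier snippets (the only library routine assumed beyond NumPy is `scipy.stats.t.sf` for the p-values): it computes \(\hat{\sigma}^2\), the standard errors, the t-statistics under \(H_0: \beta_j = 0\), and two-sided p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data; the last feature has a true coefficient of 0, so H0 holds for it.
n, d = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([2.0, 1.0, -3.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# OLS fit.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased estimate of the error variance: SSE / (n - d).
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - d)

# Standard errors, t-statistics under H0: beta_j = 0, and two-sided p-values.
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - d)

for j, (b, t, p) in enumerate(zip(beta_hat, t_stats, p_values)):
    print(f"beta_{j}: estimate={b: .3f}, t={t: .2f}, p={p:.3f}")
```

In practice, a library such as statsmodels reports these same quantities in its OLS summary; the manual computation above simply mirrors the formulas.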