Multiple Linear Regression (continued)



A response variable $Y$ is linearly related to $p-1$ different explanatory variables $X^{(1)},\ldots,X^{(p-1)}$. The regression model is given by

\[ Y_i = \beta_0 + \beta_1 X_i^{(1)} + \cdots + \beta_{p-1} X_i^{(p-1)} + \varepsilon_i, \qquad i=1,\ldots,n, \tag{1}\]

where the errors $\varepsilon_i$ are independent with mean zero and variance $\sigma^2$, and are assumed to be normally distributed (working assumption). Equation (1) can be expressed in matrix notation as

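\[ Y = \mathbf{X} \beta + \varepsilon, \qquad \mbox{where} \qquad Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n\end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 & X_1^{(1)} & \cdots & X_1^{(p-1)} \\ 1 & X_2^{(1)} & \cdots & X_2^{(p-1)} \\ \vdots & \vdots & & \vdots \\ 1 & X_n^{(1)} & \cdots & X_n^{(p-1)} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.\]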

Fitted values and residuals

The fitted value for the $i$-th observation is $ \widehat{Y}_i = b_0 + b_1 X_i^{(1)} + \cdots + b_{p-1} X_i^{(p-1)}$, and the residual is $ e_i = Y_i - \widehat{Y}_i$. In matrix notation, the vector of fitted values, $ \widehat{Y} $, can be expressed as

$$ \widehat{Y} = \mathbf{X} \mathbf{b} = \mathbf{X} \widehat{\beta} = \mathbf{X} ( \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T Y. $$

The $n \times n$ matrix $ \mathbf{H} = \mathbf{X} ( \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T $ is called the hat matrix. Thus $ \widehat{Y} = \mathbf{H} Y$. The vector of residuals, denoted by $\mathbf{e}$ (with $i$-th coordinate $e_i$, for $i=1,\ldots,n$), can therefore be expressed as

$$ \mathbf{e} = Y - \widehat{Y} = Y - \mathbf{H} Y = (I_n - \mathbf{H}) Y = ( I_n - \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T ) Y. $$

Hat matrix: check that the matrix $\mathbf{H}$ has the property that $\mathbf{H}\mathbf{H} = \mathbf{H}$ and $(I_n - \mathbf{H})(I_n - \mathbf{H}) = I_n - \mathbf{H}$. A square matrix $\mathbf{A}$ having the property that $\mathbf{A}\mathbf{A} = \mathbf{A}$ is called an idempotent matrix. So both $\mathbf{H}$ and $I_n - \mathbf{H}$ are idempotent matrices. The important implication of the equation $$ \widehat Y = \mathbf{H} Y $$ is that the matrix $ \mathbf{H}$ maps the response vector $Y$ to a linear combination of the columns of the matrix $ \mathbf{X}$, namely the vector of fitted values $\widehat{Y}$. Similarly, the matrix $I_n - \mathbf{H}$ applied to $Y$ gives the residual vector $\mathbf{e}$.
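
The idempotence check can be written out in a few lines. The sketch below uses only the definition $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ and assumes that $\mathbf{X}^T\mathbf{X}$ is invertible:

$$ \mathbf{H}\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\left[\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\right]\mathbf{X}^T = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \mathbf{H}, $$

$$ (I_n - \mathbf{H})(I_n - \mathbf{H}) = I_n - 2\mathbf{H} + \mathbf{H}\mathbf{H} = I_n - 2\mathbf{H} + \mathbf{H} = I_n - \mathbf{H}. $$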

Properties of residuals: Many properties of the residuals can be deduced by studying the properties of the matrix $\mathbf{H}$. Some of them are listed below. $$ \sum_i e_i = 0 \qquad \mbox{and} \qquad \sum_i X_i^{(j)}e_i = 0, \quad \mbox{for}~ j=1,\ldots,p-1. $$ These are consequences of the following: $$ \mathbf{X}^T\mathbf{e} = \mathbf{X}^T(I_n - \mathbf{H})Y = \mathbf{X}^TY - \mathbf{X}^T\mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T Y = \mathbf{X}^T Y - \mathbf{X}^T Y = 0. $$ The first coordinate of $\mathbf{X}^T\mathbf{e}$ is $\sum_i e_i$, because the first column of $\mathbf{X}$ consists of ones, and the remaining coordinates are the sums $\sum_i X_i^{(j)} e_i$. Also note that $ \widehat{Y} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T Y$, and hence $$ \sum_i \widehat Y_i e_i = \widehat Y^T \mathbf{e} = Y^T \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{e} = 0. $$
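
These properties are easy to confirm numerically. The following is a minimal sketch in plain Python, not a recommended implementation: the data are made up, and the helper names (`transpose`, `matmul`, `inverse`, `hat_matrix`) are illustrative assumptions rather than part of the notes. It builds the hat matrix for a small design with an intercept and two predictors, then checks idempotence and the orthogonality of the residuals to the columns of the design matrix.

```python
# Pure-Python sketch (no external libraries); data and helper names are hypothetical.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < 1e-12:
            raise ValueError("matrix is singular (X^T X must be invertible)")
        M[c], M[pivot] = M[pivot], M[c]
        pv = M[c][c]
        M[c] = [v / pv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def hat_matrix(X):
    """H = X (X^T X)^{-1} X^T."""
    Xt = transpose(X)
    return matmul(matmul(X, inverse(matmul(Xt, X))), Xt)

# Made-up data: n = 5 observations, p = 3 (intercept plus two predictors).
X = [[1, 1.0, 2.0],
     [1, 2.0, 1.0],
     [1, 3.0, 4.0],
     [1, 4.0, 3.0],
     [1, 5.0, 6.0]]
Y = [[3.1], [3.9], [7.2], [7.8], [11.1]]
n = len(X)

H = hat_matrix(X)
Y_hat = matmul(H, Y)                               # fitted values H Y
e = [[y[0] - yh[0]] for y, yh in zip(Y, Y_hat)]    # residuals (I - H) Y

HH = matmul(H, H)
max_idempotence_gap = max(abs(HH[i][j] - H[i][j]) for i in range(n) for j in range(n))
Xt_e = matmul(transpose(X), e)                     # should vanish up to rounding error

print("max |HH - H| entry:", max_idempotence_gap)
print("X^T e:", [row[0] for row in Xt_e])
```

Both printed quantities should be zero up to floating-point rounding. Forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly mirrors the formulas in the text; numerical software usually prefers a QR or similar decomposition for stability.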

ANOVA

The matrix viewpoint gives a coherent way of representing the different components of the analysis of variance of the response in regression. As before, we need to deal with the objects

$$ SSTO = \sum_i (Y_i - \overline{Y})^2, \qquad SSE = \sum_i(Y_i - \widehat Y_i)^2 = \sum_i e_i^2, \qquad \mbox{and}~~SSR = SSTO - SSE. $$

The degrees of freedom of $SSR$ is $p - 1$. The degrees of freedom of $SSTO$ is $n - 1$, and d.f.$(SSE)$ = d.f.$(SSTO)$ $-$ d.f.$(SSR)$ $= n - 1 - (p-1) = n-p$. Moreover,

$$ \overline{Y} = \frac{1}{n} \sum_i Y_i = \frac{1}{n} Y^T \mathbf{1}, $$

$$ SSTO = \sum_i Y_i^2 - \frac{1}{n}\Big(\sum_i Y_i\Big)^2 = Y^T Y - \frac{1}{n} Y^T \mathbf{J} Y, $$

$$ SSE = \mathbf{e}^T \mathbf{e} = Y^T(I_n-\mathbf{H}) (I_n-\mathbf{H})Y = Y^T (I_n-\mathbf{H}) Y = Y^T Y - \widehat \beta^T \mathbf{X}^T Y, $$

where $\mathbf{1}$ is the $n \times 1$ vector of ones and $\mathbf{J} = \mathbf{1}\mathbf{1}^T$ is the $n \times n$ matrix of ones.
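
The final expression for $SSE$ is stated without its intermediate step. One way to fill it in, using only $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, $\widehat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Y$ and the symmetry of $(\mathbf{X}^T\mathbf{X})^{-1}$, is

$$ Y^T (I_n - \mathbf{H}) Y = Y^T Y - \left[ Y^T \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \right] \mathbf{X}^T Y = Y^T Y - \widehat\beta^T \mathbf{X}^T Y, \qquad \mbox{since}~~ \widehat\beta^T = Y^T \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}. $$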

We can use the ANOVA decomposition to test $H_0 : \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 $ (no regression effect) against $ H_1$: not all $ \beta_j $, $j=1,\ldots,p-1$, are equal to zero. The test statistic is $$ F^* = \frac{\frac{SSR}{\mbox{d.f.}(SSR)}}{\frac{SSE}{\mbox{d.f.}(SSE)}} = \frac{SSR/(p-1)}{SSE/(n-p)}. $$ Under $H_0$ and the assumption of normal errors, $F^*$ has an $F_{p-1, n-p}$ distribution. So, reject $H_0$ in favor of $H_1$ at level $\alpha$ if $F^* > F(1-\alpha;p-1,n-p)$.
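
As an illustration, the ANOVA quantities and $F^*$ can be computed from a design matrix and a response vector. This is a sketch under stated assumptions: the data are made up, the function names `solve` and `anova_f` are hypothetical, and the least squares estimate is obtained by solving the normal equations $\mathbf{X}^T\mathbf{X}\,\mathbf{b} = \mathbf{X}^T Y$ with plain Gaussian elimination. The critical value $F(1-\alpha;p-1,n-p)$ is not computed here; it would come from an $F$ table or statistical software.

```python
# Pure-Python sketch; data and function names are hypothetical.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A assumed nonsingular)."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def anova_f(X, Y):
    """Return the least squares estimate, the ANOVA sums of squares, and F*."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in range(p)] for a in range(p)]
    XtY = [sum(X[i][a] * Y[i] for i in range(n)) for a in range(p)]
    b = solve(XtX, XtY)                              # normal equations
    fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
    y_bar = sum(Y) / n
    ssto = sum((y - y_bar) ** 2 for y in Y)
    sse = sum((y - f) ** 2 for y, f in zip(Y, fitted))
    ssr = ssto - sse
    f_star = (ssr / (p - 1)) / (sse / (n - p))       # MSR / MSE
    return {"b": b, "SSTO": ssto, "SSE": sse, "SSR": ssr, "F": f_star, "df": (p - 1, n - p)}

# Made-up data: n = 6 observations, p = 3 (intercept plus two predictors).
X = [[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0],
     [1, 4.0, 3.0], [1, 5.0, 6.0], [1, 6.0, 5.0]]
Y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.6]

res = anova_f(X, Y)
print(res["SSTO"], res["SSE"], res["SSR"], res["F"], res["df"])
```

The returned degrees of freedom `(p - 1, n - p)` are the ones to use when looking up the critical value.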

Inference on Multiple Linear Regression

We can ask the same questions regarding estimation of various parameters as we did in the case of regression with one predictor variable.

Mean and standard error of estimates: We already checked that (with $ \mathbf{b} \equiv \widehat \beta$) $E(\mathbf{b}) = \beta$ and $$ \mbox{Var}(\mathbf{b}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}, ~~\mbox{so that}~~ \widehat{\mbox{Var}}(\mathbf{b}) = \mbox{MSE} ~ (\mathbf{X}^T \mathbf{X})^{-1} $$ is the estimated variance-covariance matrix of $\mathbf{b}$, where $MSE = SSE/(n-p)$. Denote by $s(b_j)$ the standard error of $b_j = \widehat \beta_j$. Then $s^2(b_j)$ is the $(j+1)$-th diagonal entry of the $p \times p$ matrix $\widehat{\mbox{Var}}(\mathbf{b})$, for $j = 0,1,\ldots,p-1$.

ANOVA : Under $H_0 : \beta_1=\beta_2=\cdots=\beta_{p-1} =0$, the F-ratio $F^* = MSR/MSE$ has an $F_{p-1,n-p}$ distribution. So, reject $H_0$ in favor of $H_1$: $\beta_j \neq 0$ for at least one $j \in\{1,\ldots,p-1\}$, at level $\alpha$ if $F^* > F(1-\alpha;p-1,n-p)$.

Hypothesis tests for individual parameters : Under $H_0 : \beta_j = \beta_j^0$, for a given $j \in \{1,\ldots,p-1\}$, $$ t^* = \frac{b_j-\beta_j^0}{s(b_j)} \sim t_{n-p}. $$ So, if $H_1 : \beta_j \neq \beta_j^0$, then reject $H_0$ in favor of $H_1$ at level $\alpha$ if $ |t^*| > t(1-\alpha/2;n-p)$.

Confidence intervals for individual parameters : Based on the result above, a $100(1-\alpha)$% two-sided confidence interval for $\beta_j$ is given by $$ b_j \pm t(1-\alpha/2;n-p)s(b_j).$$
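
The test and the interval use the same ingredients, which a small sketch makes explicit. The numbers below are made up for illustration, the function names are hypothetical, and the critical value is supplied by hand from a standard $t$ table ($t(0.975; 3) \approx 3.182$) rather than computed.

```python
# Sketch with hypothetical inputs; t_crit must be looked up for the chosen alpha and n - p.

def t_statistic(b_j, s_bj, beta_j0=0.0):
    """t* = (b_j - beta_j^0) / s(b_j)."""
    return (b_j - beta_j0) / s_bj

def confidence_interval(b_j, s_bj, t_crit):
    """b_j +/- t(1 - alpha/2; n - p) * s(b_j)."""
    half_width = t_crit * s_bj
    return (b_j - half_width, b_j + half_width)

b_j, s_bj = 0.60, 0.25      # made-up estimate and standard error
t_crit = 3.182              # t(0.975; 3): alpha = 0.05, n - p = 3 (table value)

t_star = t_statistic(b_j, s_bj)              # tests H0: beta_j = 0
reject = abs(t_star) > t_crit
ci = confidence_interval(b_j, s_bj, t_crit)
print(t_star, reject, ci)
```

The two-sided test of $H_0 : \beta_j = \beta_j^0$ rejects at level $\alpha$ exactly when $\beta_j^0$ falls outside the $100(1-\alpha)$% interval, so the two outputs always agree.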

Estimation of mean response : Since $$ E(Y|X_h) = \beta^T X_h, \qquad \mbox{where} \qquad X_h = \begin{bmatrix}1 \\X_h^{(1)}\\ \vdots \\X_h^{(p-1)} \end{bmatrix}, $$ an unbiased point estimate of $E(Y|X_h)$ is $\widehat Y_h = \mathbf{b}^T X_h = b_0 + b_1X_h^{(1)} + \cdots + b_{p-1}X_h^{(p-1)}$. Using the Working-Hotelling procedure, a $100(1-\alpha)$% confidence region for the entire regression surface (that is, a confidence region for $E(Y|X_h)$ for all possible values of $X_h$) is given by $$ \widehat Y_h \pm \sqrt{p F(1-\alpha;p,n-p)} \hspace{.05in} s (\widehat Y_h), $$ where $s(\widehat Y_h)$ is the estimated standard error of $\widehat Y_h$ and is given by $$ s^2(\widehat Y_h) = (MSE) \cdot X_h^T (\mathbf{X}^T \mathbf{X})^{-1}X_h. $$ The last formula can be deduced from the fact that $$ \mbox{Var}(\widehat Y_h) = \mbox{Var}(X_h^T \mathbf{b}) = X_h^T \mbox{Var}(\mathbf{b}) X_h = \sigma^2 X_h^T (\mathbf{X}^T\mathbf{X})^{-1} X_h. $$ Also, using the fact that $(\widehat Y_h - X_h^T \beta)/s(\widehat Y_h) \sim t_{n-p}$, a pointwise $100(1-\alpha)$% two-sided confidence interval for $E(Y|X_h) = X_h^T \beta$ is given by $$ \widehat Y_h \pm t(1-\alpha/2;n-p) s(\widehat Y_h). $$ Extensions to the case where we want to simultaneously estimate $E(Y|X_h)$ for $g$ different values of $X_h$ can be achieved using either the Bonferroni procedure or the Working-Hotelling procedure.

Simultaneous prediction of new observations : Analogous to the one variable regression case, we consider the simultaneous prediction of new observations $Y_{h(new)} = \beta^T X_h + \varepsilon_{h(new)}$ for $g$ different values of $X_h$. Use $s(Y_{h(new)} - \widehat Y_{h(new)})$ to denote the estimated standard deviation of the prediction error when $X=X_h$. We have $$ s^2(Y_{h(new)} - \widehat Y_{h(new)}) = (MSE) (1+X_h^T (\mathbf{X}^T\mathbf{X})^{-1}X_h). $$ The Bonferroni procedure yields simultaneous $100(1-\alpha)$% prediction intervals of the form $$ \widehat Y_h \pm t(1-\alpha/(2g);n-p) \hspace{.06in} s(Y_{h(new)} - \widehat Y_{h(new)}). $$ Scheffé's procedure gives the following simultaneous prediction intervals $$ \widehat Y_h \pm \sqrt{gF(1-\alpha;g,n-p)} \hspace{.06in} s(Y_{h(new)} - \widehat Y_{h(new)}). $$
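
The extra "1" in the prediction variance comes from the new error term. As a short derivation, assuming that $\varepsilon_{h(new)}$ is independent of the errors in the original sample (and hence of $\mathbf{b}$), and writing the predicted value as $\widehat Y_h = X_h^T \mathbf{b}$:

$$ \mbox{Var}(Y_{h(new)} - \widehat Y_h) = \mbox{Var}(\varepsilon_{h(new)}) + \mbox{Var}(X_h^T \mathbf{b}) = \sigma^2 + \sigma^2 X_h^T (\mathbf{X}^T\mathbf{X})^{-1} X_h = \sigma^2 \left(1 + X_h^T (\mathbf{X}^T\mathbf{X})^{-1} X_h\right). $$

Replacing $\sigma^2$ by $MSE$ gives the estimated variance used in the prediction intervals.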

Coefficient of multiple determination : The quantity $ R^2 = 1 - \frac{SSE}{SSTO} = \frac{SSR}{SSTO}$ is a measure of association between the response $Y$ and the predictors $X^{(1)},\ldots,X^{(p-1)}$. This has the interpretation that $R^2$ is the proportion of variability in the response explained by the predictors. Another interpretation is that $R^2$ is the maximum squared correlation between $Y$ and any linear function of $X^{(1)},\ldots,X^{(p-1)}$.

Adjusted $R^2$ : If one adds predictor variables to the regression model, then $R^2$ never decreases, and it typically increases. To take into account the number of predictors, the measure called adjusted multiple $R$-squared, or, $$ R_a^2 = 1-\frac{MSE}{MSTO} = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1- \left(\frac{n-1}{n-p}\right) \frac{SSE}{SSTO}, $$ is used. Notice that $R_a^2 < R^2$ whenever $p > 1$ and $SSE > 0$, and when the number of observations is not too large, $R_a^2$ can be substantially smaller than $R^2$. Even though $R_a^2$ does not have as nice an interpretation as $R^2$, in multiple linear regression it is considered to be a better measure of association.
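
A small numerical sketch shows how far the two measures can differ in a small sample. The sums of squares and sample sizes below are made up, and the function names are hypothetical.

```python
# Sketch with hypothetical inputs.

def r_squared(sse, ssto):
    """R^2 = 1 - SSE / SSTO."""
    return 1.0 - sse / ssto

def adjusted_r_squared(sse, ssto, n, p):
    """R_a^2 = 1 - ((n - 1) / (n - p)) * SSE / SSTO, with p counting the intercept."""
    return 1.0 - ((n - 1) / (n - p)) * (sse / ssto)

sse, ssto, n, p = 2.4, 6.0, 5, 2     # made-up values
r2 = r_squared(sse, ssto)
r2_adj = adjusted_r_squared(sse, ssto, n, p)
print(r2, r2_adj)
```

With only $n = 5$ observations, the penalty factor $(n-1)/(n-p) = 4/3$ pulls $R_a^2$ well below $R^2$; as $n$ grows with $p$ fixed, the factor approaches 1 and the two measures nearly coincide.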

Contributors:

Valerie Regalia
Debashis Paul
