Inference in Simple Linear Regression



Fact: Under the normal regression model, $(b_0,b_1)$ and $SSE$ are independently distributed, and
$\frac{b_0 - \beta_0}{s(b_0)} \sim t_{n-2}$, $\qquad \frac{b_1 - \beta_1}{s(b_1)} \sim t_{n-2}$, $\qquad SSE \sim \sigma^2 \chi_{n-2}^2$.
Confidence intervals for $\beta_0$ and $\beta_1$: The $100(1-\alpha)\%$ two-sided confidence interval for $\beta_i$ is
$\left(b_i - t(1-\alpha/2;n-2)\, s(b_i),\; b_i + t(1-\alpha/2;n-2)\, s(b_i)\right)$

for $i=0,1$, where $t(1-\alpha/2;n-2)$ is the $(1-\alpha/2)$ quantile of the $t_{n-2}$ distribution, i.e., the upper cut-off point satisfying $P(t_{n-2} > t(1-\alpha/2;n-2)) = \alpha/2$.
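
The interval formula above can be sketched as a small helper. This is only an illustration: the function name `t_interval` is hypothetical, and the critical value is assumed to come from a $t$ table. The numbers plugged in are the slope estimate, its standard error, and $t(0.975;17)$ from the housing example later in this section.

```python
def t_interval(estimate, std_err, t_crit):
    """Two-sided interval: estimate -/+ t_crit * std_err.

    t_crit is assumed to be t(1 - alpha/2; n - 2), looked up from a t table.
    """
    half_width = t_crit * std_err
    return estimate - half_width, estimate + half_width


# Housing example values: b1 = 2.941, s(b1) = 0.5412, t(0.975; 17) = 2.1098
lower, upper = t_interval(2.941, 0.5412, 2.1098)
print(f"95% CI for beta_1: ({lower:.2f}, {upper:.2f})")
```

The same helper applies to $\beta_0$ by passing $b_0$, $s(b_0)$, and the critical value for the desired level.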

Hypothesis tests for $\beta_0$ and $\beta_1$ : $H_0 : \beta_i = \beta_{i0}$ ($i=0$ or $1$).
Test statistic : $T_i = \frac{b_i - \beta_{i0}}{s(b_i)}$.

Alternative: $H_1 : \beta_i > \beta_{i0}$. Reject $H_0$ at level $\alpha$ if $T_i > t(1-\alpha;n-2)$ or, equivalently, if the P-value $P(t_{n-2} > T_i^{observed})$ is less than $\alpha$.
Alternative: $H_1 : \beta_i < \beta_{i0}$. Reject $H_0$ at level $\alpha$ if $T_i < t(\alpha;n-2)$ or, equivalently, if the P-value $P(t_{n-2} < T_i^{observed})$ is less than $\alpha$.
Alternative: $H_1 : \beta_i \neq \beta_{i0}$. Reject $H_0$ at level $\alpha$ if $|T_i| > t(1-\alpha/2;n-2)$ or, equivalently, if the P-value $P(|t_{n-2}| > |T_i^{observed}|)$ is less than $\alpha$.
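
The three P-value rules can be sketched in code. The block below is a standard-library-only illustration, not a reference implementation: the function names are hypothetical, and the tail probability of $t_{n-2}$ is approximated by integrating the $t$ density with Simpson's rule. The hypothesized value $\beta_{10} = 0$ in the usage lines is an assumption chosen for illustration; $b_1$ and $s(b_1)$ come from the housing example later in this section.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(x, df, steps=2000):
    """Approximate P(t_df > x) with Simpson's rule on [0, |x|] (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3        # P(0 < t_df < |x|)
    tail = 0.5 - central           # P(t_df > |x|), using symmetry about 0
    return tail if x >= 0 else 1.0 - tail


def p_value(t_obs, df, alternative):
    """P-value for the 'greater', 'less', or 'two-sided' alternative."""
    if alternative == "greater":
        return t_upper_tail(t_obs, df)
    if alternative == "less":
        return 1.0 - t_upper_tail(t_obs, df)
    if alternative == "two-sided":
        return 2.0 * t_upper_tail(abs(t_obs), df)
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


# Illustrative test of H0: beta_1 = 0 (assumed) with b1 = 2.941, s(b1) = 0.5412, df = 17
t_obs = (2.941 - 0.0) / 0.5412
alpha = 0.05
for alt in ("greater", "less", "two-sided"):
    p = p_value(t_obs, 17, alt)
    print(f"{alt:>9}: P-value = {p:.5f}, reject H0: {p < alpha}")
```

In practice a statistics library's $t$ distribution functions would replace the hand-rolled integration; the numerical approach is used here only to keep the sketch free of dependencies.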

Inference for mean response at $X = X_h$

Point estimate: $\widehat Y_h = b_0 + b_1 X_h$.

Fact: $E(\widehat Y_h) = \beta_0 + \beta_1 X_h = E(Y_h)$, $Var(\widehat Y_h) = \sigma^2(\widehat Y_h) = \sigma^2\left[\frac{1}{n} + \frac{(X_h - \overline{X})^2}{\sum_i (X_i - \overline{X})^2}\right]$. Estimated variance is $s^2(\widehat Y_h) = MSE \left[\frac{1}{n} + \frac{(X_h - \overline{X})^2}{\sum_i (X_i - \overline{X})^2}\right]$.

Distribution: $\frac{\widehat Y_h - E(Y_h)}{s(\widehat Y_h)} \sim t_{n-2}$.

Confidence interval: $100(1-\alpha)$% confidence interval for $E(Y_h)$ is $(\widehat Y_h - t(1-\alpha/2;n-2) s(\widehat Y_h),\widehat Y_h + t(1-\alpha/2;n-2) s(\widehat Y_h))$.
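
As a rough numerical illustration of this interval, the sketch below evaluates $s(\widehat Y_h)$ and the 95% confidence interval for $E(Y_h)$ at $X_h = 18.5$. The inputs are the rounded summary values from the housing example later in this section, together with $t(0.975;17) = 2.1098$ quoted there; the function name `mean_response_se` is hypothetical.

```python
import math


def mean_response_se(mse, n, x_h, x_bar, sxx):
    """s(Y_hat_h) = sqrt(MSE * [1/n + (X_h - X_bar)^2 / Sxx])."""
    return math.sqrt(mse * (1.0 / n + (x_h - x_bar) ** 2 / sxx))


# Rounded values from the housing example
y_hat_h = 28.981 + 2.941 * 18.5
se_mean = mean_response_se(mse=11.95, n=19, x_h=18.5, x_bar=15.719, sxx=40.805)
t_crit = 2.1098  # t(0.975; 17), taken from the text

ci_mean = (y_hat_h - t_crit * se_mean, y_hat_h + t_crit * se_mean)
print(f"s(Y_hat_h) = {se_mean:.3f}")
print(f"95% CI for E(Y_h): ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
```

Because of the $(X_h - \overline{X})^2$ term, the interval is narrowest at $X_h = \overline{X}$ and widens as $X_h$ moves away from the sample mean.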

Prediction of a new observation $Y_{h(new)}$ at $X = X_h$

Prediction : $\widehat Y_{h(new)} = \widehat Y_h = b_0 + b_1 X_h$.

Error in prediction : $Y_{h(new)} - \widehat Y_{h(new)} = Y_{h(new)} - \widehat Y_h$.

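Fact: Since the new observation $Y_{h(new)}$ is independent of the data used to compute $\widehat Y_h$, the variances add: $\sigma^2(Y_{h(new)} - \widehat Y_h) = \sigma^2(Y_{h(new)}) + \sigma^2(\widehat Y_h) = \sigma^2 + \sigma^2(\widehat Y_h) = \sigma^2\left[1+\frac{1}{n}+ \frac{(X_h - \overline{X})^2}{\sum_i (X_i - \overline{X})^2}\right]$.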

Estimate of $\sigma^2(Y_{h(new)} - \widehat Y_h)$ is $s^2(Y_{h(new)} - \widehat Y_h) = MSE \left[1+\frac{1}{n}+ \frac{(X_h - \overline{X})^2}{\sum_i (X_i - \overline{X})^2}\right]$.

Distribution : $\frac{Y_{h(new)} - \widehat Y_h}{s(Y_{h(new)} -\widehat Y_h)} \sim t_{n-2}$.

Prediction interval : $100(1-\alpha)$% prediction interval for $Y_{h(new)}$ is $(\widehat Y_h - t(1-\alpha/2;n-2) s(Y_{h(new)}-\widehat Y_h),\widehat Y_h + t(1-\alpha/2;n-2) s(Y_{h(new)}-\widehat Y_h))$.
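
A minimal sketch of the prediction interval, again using the rounded summary values and $t(0.975;17) = 2.1098$ from the housing example later in this section. The function name `prediction_se` is hypothetical.

```python
import math


def prediction_se(mse, n, x_h, x_bar, sxx):
    """s(Y_h(new) - Y_hat_h) = sqrt(MSE * [1 + 1/n + (X_h - X_bar)^2 / Sxx])."""
    return math.sqrt(mse * (1.0 + 1.0 / n + (x_h - x_bar) ** 2 / sxx))


# Rounded values from the housing example
y_hat_h = 28.981 + 2.941 * 18.5
se_pred = prediction_se(mse=11.95, n=19, x_h=18.5, x_bar=15.719, sxx=40.805)
t_crit = 2.1098  # t(0.975; 17), taken from the text

pred_int = (y_hat_h - t_crit * se_pred, y_hat_h + t_crit * se_pred)
print(f"s(pred) = {se_pred:.3f}")
print(f"95% prediction interval: ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
```

The only change from the standard error of the mean response is the leading $1$ inside the bracket, which contributes an extra $MSE$ to the estimated variance. The prediction interval is therefore always wider than the confidence interval for $E(Y_h)$ at the same $X_h$.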

Confidence band for the regression line: At $X=X_h$, the $100(1-\alpha)$% confidence band for the regression line is given by $\widehat Y_h \pm w_\alpha\, s(\widehat Y_h)$, where $w_\alpha = \sqrt{2F(1-\alpha; 2, n-2)}$.

Here $F(1-\alpha;2,n-2)$ is the $1-\alpha$ upper cut-off point (or, $(1-\alpha)$ quantile) for the $F_{2,n-2}$ distribution ($F$ distribution with d.f. $(2,n-2)$).
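
The multiplier $w_\alpha$ can be computed without a table in this particular case. When the numerator degrees of freedom equal 2, the $F_{2,\nu}$ distribution function has the closed form $1 - (1 + 2x/\nu)^{-\nu/2}$, which can be inverted directly. The sketch below relies on that special case; the function names are hypothetical, and $n = 19$, $\alpha = 0.05$ are taken from the housing example later in this section.

```python
import math


def f_quantile_num_df_2(p, den_df):
    """p-quantile of F(2, den_df), inverting the CDF 1 - (1 + 2x/den_df)^(-den_df/2).

    Valid only when the numerator degrees of freedom equal 2.
    """
    return (den_df / 2.0) * ((1.0 - p) ** (-2.0 / den_df) - 1.0)


def band_multiplier(alpha, n):
    """w_alpha = sqrt(2 * F(1 - alpha; 2, n - 2))."""
    return math.sqrt(2.0 * f_quantile_num_df_2(1.0 - alpha, n - 2))


w = band_multiplier(alpha=0.05, n=19)
print(f"F(0.95; 2, 17) = {f_quantile_num_df_2(0.95, 17):.4f}")
print(f"w_alpha        = {w:.4f}")
```

For these values $w_\alpha$ exceeds $t(0.975;17) = 2.1098$, so the band is wider than the pointwise confidence interval for $E(Y_h)$ at any single $X_h$. This is the price of covering the whole line simultaneously.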

Example 1: Simple linear regression

We consider a data set on housing prices. Here $Y =$ selling price of a house (in \$1000) and $X =$ size of the house (in 100 square feet). The summary statistics are given below:

$n = 19$, $\overline{X} = 15.719$, $\overline{Y} = 75.211$

$\sum_i(X_i - \overline{X})^2 = 40.805$, $\sum_i (Y_i - \overline{Y})^2 = 556.078$, $\sum_i (X_i - \overline{X})(Y_i - \overline{Y}) = 120.001$.

Estimates of $\beta_1$ and $\beta_0$ :

\[b_1 = \frac{\sum_i (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_i(X_i - \overline{X})^2} = \frac{120.001}{40.805} = 2.941\]

and

\[b_0 = \overline{Y} - b_1 \overline{X} = 75.211 - (2.941)(15.719) = 28.981.\]

Fit and Prediction: The fitted regression line is $\widehat Y = 28.981 + 2.941 X$. When $X = X_h = 18.5$, the predicted value, which is an estimate of the mean selling price (in \$1000) of a house of size 1850 sq. ft., is $\widehat Y_h = 28.981 + (2.941) (18.5) = 83.39$.
MSE: The degrees of freedom (df) $= n-2 = 17$. $SSE = \sum_i(Y_i - \overline{Y})^2 - b_1^2\sum_i(X_i - \overline{X})^2 = 203.17$. So, $MSE = \frac{SSE}{n-2} = \frac{203.17}{17} = 11.95$.
Standard Error Estimates: $s^2(b_0) = MSE \left[\frac{1}{n} + \frac{\overline{X}^2}{\sum_i(X_i - \overline{X})^2} \right] = 73.00$, $\qquad s(b_0) = \sqrt{s^2(b_0)} = 8.544$.
$s^2(b_1) = \frac{MSE}{\sum_i(X_i - \overline{X})^2} = 0.2929$, $\qquad s(b_1) = \sqrt{s^2(b_1)} = 0.5412$.
Confidence Intervals: We assume that the errors are normal in order to find confidence intervals for the parameters $\beta_0$ and $\beta_1$. We use the fact that $\frac{b_0 - \beta_0}{s(b_0)} \sim t_{n-2}$ and $\frac{b_1 - \beta_1}{s(b_1)} \sim t_{n-2}$, where $t_{n-2}$ denotes the $t$-distribution with $n-2$ degrees of freedom. Since $t(0.975;17) = 2.1098$, the 95% two-sided confidence interval for $\beta_1$ is $2.941 \pm (2.1098)(0.5412) = (1.80, 4.08)$. Since $t(0.95;17) = 1.740$, the 90% two-sided confidence interval for $\beta_0$ is $28.981\pm (1.740)(8.544) = (14.12,43.84)$.
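
The point estimates, $MSE$, and standard errors in this example can be reproduced from the summary statistics alone. The sketch below carries full floating-point precision through each step, so some results differ slightly in the last digit from the hand calculation, which rounds $b_1$ to 2.941 before computing $b_0$. The variable names are illustrative.

```python
import math

# Summary statistics from the example
n = 19
x_bar, y_bar = 15.719, 75.211
sxx = 40.805    # sum of (X_i - X_bar)^2
syy = 556.078   # sum of (Y_i - Y_bar)^2
sxy = 120.001   # sum of (X_i - X_bar)(Y_i - Y_bar)

# Least-squares estimates
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Error sum of squares and mean squared error (df = n - 2)
sse = syy - b1 ** 2 * sxx
mse = sse / (n - 2)

# Estimated standard errors of the coefficients
se_b0 = math.sqrt(mse * (1.0 / n + x_bar ** 2 / sxx))
se_b1 = math.sqrt(mse / sxx)

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")
print(f"SSE = {sse:.2f}, MSE = {mse:.2f}")
print(f"s(b0) = {se_b0:.3f}, s(b1) = {se_b1:.4f}")
```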

Contributors

Agnes Oshiro

(Source: Spring 2012 STA108 Handout 4)