# Simple Linear Regression (with one predictor)

## Model

$$X$$ and $$Y$$ are the predictor and response variables, respectively. Fit the model
$Y_i = \beta_0+\beta_1X_i+\epsilon_i, \quad i = 1,2,...,n$
where $$\epsilon_1 ,..., \epsilon_n$$ are uncorrelated with $$E(\epsilon_i)=0$$ and $$Var(\epsilon_i)=\sigma^2$$ for each $$i$$.

## Interpretation

Look at the scatter plot of $$Y$$ (vertical axis) versus $$X$$ (horizontal axis). Consider narrow vertical strips around the different values of $$X$$:
1. Means (measure of center) of the points falling in the vertical strips lie (approximately) on a straight line with slope $$\beta_1$$ and intercept $$\beta_0$$.
2. Standard deviations (measure of spread) of the points falling in each vertical strip are (roughly) the same.

## Estimation of $$\beta_0$$ and $$\beta_1$$

We employ the method of least squares to estimate $$\beta_0$$ and $$\beta_1$$. That is, we minimize the sum of squared errors $$Q(\beta_0,\beta_1) = \sum_{i=1}^n(Y_i-\beta_0-\beta_1X_i)^2$$. This involves differentiating $$Q(\beta_0,\beta_1)$$ with respect to the parameters $$\beta_0$$ and $$\beta_1$$ and setting the derivatives to zero, which gives us the normal equations:
$nb_0 + b_1\sum_{i=1}^nX_i = \sum_{i=1}^nY_i$
$b_0\sum_{i=1}^nX_i+b_1\sum_{i=1}^nX_i^2 = \sum_{i=1}^nX_iY_i$
Solving these equations, we have:
$b_1=\frac{\sum_{i=1}^nX_iY_i-n\overline{X}\,\overline{Y}}{\sum_{i=1}^nX_i^2-n\overline{X}^2} = \frac{\sum_{i=1}^n(X_i-\overline{X})(Y_i-\overline{Y})}{\sum_{i=1}^n(X_i-\overline{X})^2}, \quad b_0 = \overline{Y}-b_1\overline{X}$
$$b_0$$ and $$b_1$$ are the estimates of $$\beta_0$$ and $$\beta_1$$, respectively, and are sometimes denoted $$\widehat\beta_0$$ and $$\widehat\beta_1$$.
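The closed-form solution above can be computed directly. The sketch below uses NumPy and a small made-up data set (the `X` and `Y` values are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical sample data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(X)
Xbar, Ybar = X.mean(), Y.mean()

# Least-squares estimates from the closed-form solution:
# b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2),  b0 = Ybar - b1*Xbar
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar
```

For these data the estimates are $$b_1 = 1.96$$ and $$b_0 = 0.14$$.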

## Prediction

The fitted regression line is given by the equation: $\widehat{Y} = b_0 + b_1X$ and is used to predict the value of $$Y$$ given a value of $$X$$.
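As a small illustration, the fitted line from the hypothetical estimates $$b_0 = 0.14$$, $$b_1 = 1.96$$ (not values from any real data set) can be used as a prediction function:

```python
def predict(x, b0=0.14, b1=1.96):
    """Return the fitted value b0 + b1*x (hypothetical coefficients)."""
    return b0 + b1 * x

y_hat = predict(3.5)  # 0.14 + 1.96 * 3.5 = 7.0
```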

## Residuals

These are the quantities $$e_i = Y_i - \widehat{Y}_i = Y_i - (b_0 + b_1X_i)$$, where $$\widehat{Y}_i = b_0 + b_1X_i$$. Note that $$\epsilon_i = Y_i - \beta_0 - \beta_1X_i$$. This means that the $$e_i$$'s estimate the $$\epsilon_i$$'s. Some properties of the regression line and residuals are:
1. $$\sum_{i}e_i = 0$$.
2. $$\sum_{i}e_i^2 \leq \sum_{i}(Y_i - u_0 - u_1X_i)^2$$ for any $$(u_0, u_1)$$ (with equality when $$(u_0, u_1)$$ = $$(b_0, b_1)$$).
3. $$\sum_{i}Y_i = \sum_{i}\widehat{Y}_i$$.
4. $$\sum_{i}X_ie_i = 0$$.
5. $$\sum_{i}\widehat{Y}_ie_i = 0$$.
6. The regression line passes through the point $$(\overline{X},\overline{Y})$$.
7. The slope $$b_1$$ of the regression line can be expressed as $$b_1 = r_{XY}\frac{s_Y}{s_X}$$, where $$r_{XY}$$ is the correlation coefficient between $$X$$ and $$Y$$, and $$s_X$$ and $$s_Y$$ are the standard deviations of $$X$$ and $$Y$$, respectively.
The error sum of squares, denoted $$SSE$$, is given by $SSE = \sum_{i=1}^ne_i^2 = \sum_{i=1}^n(Y_i - \overline{Y})^2 - b_1^2\sum_{i=1}^n(X_i-\overline{X})^2.$
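The residual properties and the SSE shortcut formula can be checked numerically. The sketch below refits the hypothetical data from earlier (values chosen only for illustration) and verifies that the sums in properties 1, 3, 4, and 5 vanish and that the two expressions for $$SSE$$ agree:

```python
import numpy as np

# Hypothetical sample data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar

Yhat = b0 + b1 * X   # fitted values
e = Y - Yhat         # residuals

# Properties 1, 3, 4, 5: these sums are zero (up to floating-point error)
assert abs(e.sum()) < 1e-9             # sum of residuals
assert abs(Y.sum() - Yhat.sum()) < 1e-9  # sum of Y equals sum of fitted values
assert abs((X * e).sum()) < 1e-9       # residuals uncorrelated with X
assert abs((Yhat * e).sum()) < 1e-9    # residuals uncorrelated with fitted values

# SSE via residuals matches the shortcut formula
SSE = np.sum(e ** 2)
SSE_alt = np.sum((Y - Ybar) ** 2) - b1 ** 2 * np.sum((X - Xbar) ** 2)
```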

## Estimation of $$\sigma^2$$

It can be shown that $$E(SSE) = (n-2)\sigma^2.$$ Therefore, $$\sigma^2$$ is estimated by the mean squared error, i.e., $$MSE = \frac{SSE}{n-2}.$$ Note also that this justifies the statement that the degrees of freedom of the errors is $$n-2$$, which is the sample size $$(n)$$ minus the number of regression coefficients ($$\beta_0$$ and $$\beta_1$$) being estimated.
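Continuing with the hypothetical data used throughout (values chosen only for illustration), the estimate of $$\sigma^2$$ is obtained by dividing $$SSE$$ by $$n-2$$:

```python
import numpy as np

# Hypothetical sample data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)
Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar

e = Y - (b0 + b1 * X)
SSE = np.sum(e ** 2)
MSE = SSE / (n - 2)  # divide by n - 2: two coefficients were estimated
```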

## Contributors

• Debashis Paul (UCD)
• Scott Brunstein (UCD)