# 12.3: Linear Regression

## Linear Regression

Suppose that a pair \(\{X, Y\}\) of random variables has a joint distribution. A value \(X(\omega)\) is observed. It is desired to estimate the corresponding value \(Y(\omega)\). Obviously there is no rule for determining \(Y(\omega)\) unless \(Y\) is a function of \(X\). The best that can be hoped for is some estimate based on an average of the errors, or on the average of some function of the errors.

Suppose \(X(\omega)\) is observed, and by some rule an estimate \(\widehat{Y} (\omega)\) is returned. The error of the estimate is \(Y(\omega) - \widehat{Y} (\omega)\). The most common measure of error is the mean of the square of the error

\(E[(Y - \widehat{Y})^2]\)

The choice of the mean square has two important properties: it treats positive and negative errors alike, and it weights large errors more heavily than smaller ones. In general, we seek a rule (function) \(r\) such that the estimate \(\widehat{Y} (\omega)\) is \(r(X(\omega))\). That is, we seek a function \(r\) such that

\(E[(Y - r(X))^2]\) is a minimum.

The problem of determining such a function is known as the *regression problem*. In the unit on Regression, we show that this problem is solved by the conditional expectation of \(Y\), given \(X\). At this point, we seek an important partial solution.

**The regression line of \(Y\) on \(X\)**

We seek the best straight line function for minimizing the mean squared error. That is, we seek a function \(r\) of the form \(u = r(t) = at + b\). The problem is to determine the coefficients \(a, b\) such that

\(E[(Y - aX - b)^2]\) is a minimum

We write the error in a special form, then square and take the expectation.

\(\text{Error} = Y - aX - b = (Y - \mu_Y) - a(X - \mu_X) + \mu_Y - a\mu_X - b = (Y - \mu_Y) - a(X - \mu_X) - \beta\), where \(\beta = b - \mu_Y + a\mu_X\)

\(\text{Error squared} = (Y - \mu_Y)^2 + a^2 (X - \mu_X)^2 + \beta^2 - 2\beta (Y - \mu_Y) + 2 a \beta (X - \mu_X) - 2a(Y - \mu_Y) (X - \mu_X)\)

Taking expectations, the terms linear in \(Y - \mu_Y\) and \(X - \mu_X\) vanish, leaving

\(E[(Y - aX - b)^2] = \sigma_Y^2 + a^2 \sigma_X^2 + \beta^2 - 2a \text{Cov} [X, Y]\)

Standard procedures for determining a minimum (with respect to \(a\) and \(\beta\)) show that this occurs for \(\beta = 0\) and

\(a = \dfrac{\text{Cov} [X,Y]}{\text{Var}[X]}\), \(b = \mu_Y - a \mu_X\)
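As a sketch of one such procedure, assuming \(\sigma_X^2 > 0\): the term \(\beta^2\) is nonnegative and, for any \(a\), can be made zero by the choice of \(b\). The remaining terms can be handled by completing the square in \(a\):

\(\sigma_Y^2 + a^2 \sigma_X^2 - 2a \text{Cov} [X, Y] = \sigma_X^2 \left(a - \dfrac{\text{Cov} [X, Y]}{\sigma_X^2}\right)^2 + \sigma_Y^2 - \dfrac{\text{Cov} [X, Y]^2}{\sigma_X^2}\)

The squared term is zero exactly when \(a = \text{Cov} [X, Y]/\sigma_X^2\), and no other choice of \(a\) gives a smaller value.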

Thus the optimum line, called the *regression line of* \(Y\) *on* \(X\), is

\(u = \dfrac{\text{Cov} [X,Y]}{\text{Var}[X]} (t - \mu_X) + \mu_Y = \rho \dfrac{\sigma_Y}{\sigma_X} (t - \mu_X) + \mu_Y = \alpha(t)\)

The second form is commonly used to define the regression line. For certain theoretical purposes, this is the preferred form. But for *calculation*, the first form is usually the more convenient. Only the covariance (which requires both means) and the variance of \(X\) are needed. There is no need to determine \(\text{Var} [Y]\) or \(\rho\).
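The calculation form can be sketched in a few lines of Python for a discrete pair. This is only an illustration: the function name `regression_line`, the layout `P[i][j] = P(X = xs[i], Y = ys[j])`, and the small joint distribution are assumptions made here, not taken from the text. Exact fractions are used so that the coefficients can be checked by hand.

```python
from fractions import Fraction


def regression_line(xs, ys, P):
    """Return (a, b) for the regression line u = a*t + b of Y on X.

    xs, ys: the values taken by X and Y.
    P[i][j]: P(X = xs[i], Y = ys[j]); the entries are assumed to sum to 1.
    """
    cells = [(x, y, P[i][j]) for i, x in enumerate(xs) for j, y in enumerate(ys)]
    EX = sum(x * p for x, y, p in cells)
    EY = sum(y * p for x, y, p in cells)
    VX = sum(x * x * p for x, y, p in cells) - EX ** 2   # Var[X]
    CV = sum(x * y * p for x, y, p in cells) - EX * EY   # Cov[X, Y]
    a = CV / VX
    b = EY - a * EX
    return a, b


# Hypothetical joint distribution, chosen only for illustration.
xs, ys = [0, 1, 2], [0, 1]
P = [[Fraction(2, 10), Fraction(1, 10)],
     [Fraction(1, 10), Fraction(2, 10)],
     [Fraction(1, 10), Fraction(3, 10)]]

a, b = regression_line(xs, ys, P)
print(a, b)   # 14/69 26/69
```

Note that neither \(\text{Var} [Y]\) nor \(\rho\) is computed anywhere in the function.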

Example \(\PageIndex{1}\) The simple pair of Example 3 from "Variance"

```
jdemo1
jcalc
Enter JOINT PROBABILITIES (as on the plane)  P
Enter row matrix of VALUES of X  X
Enter row matrix of VALUES of Y  Y
 Use array operations on matrices X, Y, PX, PY, t, u, and P
EX = total(t.*P)
EX =  0.6420
EY = total(u.*P)
EY =  0.0783
VX = total(t.^2.*P) - EX^2
VX =  3.3016
CV = total(t.*u.*P) - EX*EY
CV = -0.1633
a = CV/VX
a =  -0.0495
b = EY - a*EX
b =   0.1100           % The regression line is u = -0.0495t + 0.11
```

Example \(\PageIndex{2}\) The pair in Example 6 from "Variance"

Suppose the pair \(\{X, Y\}\) has joint density \(f_{XY}(t, u) = 3u\) on the triangular region bounded by \(u = 0\), \(u = 1 + t\), \(u = 1- t\). Determine the regression line of \(Y\) on \(X\).

**Analytic Solution**

By symmetry, \(E[X] = E[XY] = 0\), so \(\text{Cov} [X, Y] = 0\). The slope is therefore \(a = 0\), and the regression line is the horizontal line

\(u = E[Y] = 3\int_0^1 u^2 \int_{u - 1}^{1 - u} \ dt du = 6 \int_{0}^{1} u^2 (1 - u)\ du = 1/2\)

Note that the pair is uncorrelated, but by the rectangle test is not independent. With zero values of \(E[X]\) and \(E[XY]\), the approximation procedure is not very satisfactory unless a very large number of approximation points are employed.

Example \(\PageIndex{3}\) Distribution of Example 5 from "Random Vectors and MATLAB" and Example 12 from "Function of Random Vectors"

The pair \(\{X, Y\}\) has joint density \(f_{XY} (t, u) = \dfrac{6}{37} (t + 2u)\) on the region \(0 \le t \le 2\), \(0 \le u \le \text{max} \{1, t\}\) (see Figure 12.3.1). Determine the regression line of \(Y\) on \(X\). If the value \(X(\omega) = 1.7\) is observed, what is the best mean-square linear estimate of \(Y(\omega)\)?

**Figure 12.3.1.** Regression line for Example 12.3.3

**Analytic Solution**

\(E[X] = \dfrac{6}{37} \int_{0}^{1} \int_{0}^{1} (t^2 + 2tu)\ dudt + \dfrac{6}{37} \int_{1}^{2} \int_{0}^{t} (t^2 + 2tu)\ dudt = 50/37\)

The other quantities involve integrals over the same regions with appropriate integrands, as follows:

| Quantity | Integrand | Value |
| --- | --- | --- |
| \(E[X^2]\) | \(t^3 + 2t^2 u\) | 779/370 |
| \(E[Y]\) | \(tu + 2u^2\) | 127/148 |
| \(E[XY]\) | \(t^2u + 2tu^2\) | 232/185 |

Then

\(\text{Var} [X] = \dfrac{779}{370} - \left(\dfrac{50}{37}\right)^2 = \dfrac{3823}{13690}\), \(\text{Cov}[X, Y] =\dfrac{232}{185} - \dfrac{50}{37} \cdot \dfrac{127}{148} = \dfrac{1293}{13690}\)

and

\(a = \text{Cov}[X, Y]/\text{Var}[X] = \dfrac{1293}{3823} \approx 0.3382\), \(b = E[Y] - aE[X] = \dfrac{6133}{15292} \approx 0.4011\)

The regression line is \(u = at + b\). If \(X(\omega) = 1.7\), the best linear estimate (in the mean square sense) is \(\widehat{Y} (\omega) = 1.7a + b = 0.9760\) (see Figure 12.3.1 for an approximate plot).
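The arithmetic in this analytic solution can be rechecked with exact fractions. The following Python sketch starts from the four moments given above and is only a verification aid; the variable names are chosen here for illustration.

```python
from fractions import Fraction as F

# Moments taken from the analytic solution above.
EX, EX2 = F(50, 37), F(779, 370)
EY, EXY = F(127, 148), F(232, 185)

VX = EX2 - EX ** 2            # Var[X]
CV = EXY - EX * EY            # Cov[X, Y]
a = CV / VX                   # slope of the regression line
b = EY - a * EX               # intercept
y_hat = a * F(17, 10) + b     # linear estimate at X = 1.7

print(VX, CV)                 # 3823/13690 1293/13690
print(a, b)                   # 1293/3823 6133/15292
print(f"{float(y_hat):.4f}")  # 0.9760
```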

**Approximation**

```
tuappr
Enter matrix [a b] of X-range endpoints  [0 2]
Enter matrix [c d] of Y-range endpoints  [0 2]
Enter number of X approximation points  400
Enter number of Y approximation points  400
Enter expression for joint density  (6/37)*(t+2*u).*(u<=max(t,1))
Use array operations on X, Y, PX, PY, t, u, and P
EX = total(t.*P)
EX =  1.3517                   % Theoretical = 1.3514
EY = total(u.*P)
EY =  0.8594                   % Theoretical = 0.8581
VX = total(t.^2.*P) - EX^2
VX =  0.2790                   % Theoretical = 0.2793
CV = total(t.*u.*P) - EX*EY
CV =  0.0947                   % Theoretical = 0.0944
a = CV/VX
a =   0.3394                   % Theoretical = 0.3382
b = EY - a*EX
b =   0.4006                   % Theoretical = 0.4011
y = 1.7*a + b
y =   0.9776                   % Theoretical = 0.9760
```
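For readers without the m-procedures, a comparable grid approximation can be sketched in plain Python. This is not a reimplementation of `tuappr`: the function name `grid_regression`, the use of cell midpoints, and the rescaling of the weights to sum to one are assumptions of this sketch. Its output will be close to, but not necessarily identical with, the MATLAB figures above.

```python
def grid_regression(density, x_range, y_range, n):
    """Approximate the regression coefficients (a, b) on an n-by-n midpoint grid.

    density(t, u) is evaluated at the cell midpoints; the resulting weights
    are rescaled to sum to 1 (a choice made for this sketch).
    """
    (x0, x1), (y0, y1) = x_range, y_range
    dx, dy = (x1 - x0) / n, (y1 - y0) / n
    s = st = su = stt = stu = 0.0
    for i in range(n):
        t = x0 + (i + 0.5) * dx
        for j in range(n):
            u = y0 + (j + 0.5) * dy
            w = density(t, u)
            s += w
            st += w * t
            su += w * u
            stt += w * t * t
            stu += w * t * u
    EX, EY = st / s, su / s
    VX = stt / s - EX ** 2      # approximate Var[X]
    CV = stu / s - EX * EY      # approximate Cov[X, Y]
    a = CV / VX
    return a, EY - a * EX


def f_xy(t, u):
    # Joint density of this example; zero outside the region u <= max(t, 1).
    return (6 / 37) * (t + 2 * u) if u <= max(t, 1) else 0.0


a, b = grid_regression(f_xy, (0, 2), (0, 2), 400)
print(f"a = {a:.4f}, b = {b:.4f}, estimate at 1.7 = {1.7 * a + b:.4f}")
```

The density is discontinuous along part of the boundary of the region, so the grid values converge rather slowly to the theoretical ones as the number of points grows.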

**An interpretation of \(\rho^2\)**

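Let \(X^* = (X - \mu_X)/\sigma_X\) and \(Y^* = (Y - \mu_Y)/\sigma_Y\) be the standardized variables, so that \(E[(X^*)^2] = E[(Y^*)^2] = 1\) and \(E[X^* Y^*] = \rho\). The analysis above shows the minimum mean squared error is given by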

\(E[(Y - \widehat{Y})^2] = E[(Y - \rho \dfrac{\sigma_Y}{\sigma_X} (X - \mu_X) - \mu_Y)^2] = \sigma_Y^2 E[(Y^* - \rho X^*)^2]\)

\(= \sigma_Y^2 E[(Y^*)^2 - 2\rho X^* Y^* + \rho^2(X^*)^2] = \sigma_Y^2 (1 - 2\rho^2 + \rho^2) = \sigma_Y^2 (1 - \rho^2)\)

If \(\rho = 0\), then \(E[(Y - \widehat{Y})^2] = \sigma_Y^2\), the mean squared error in the case of zero linear correlation. Then, \(\rho^2\) is interpreted as the *fraction of uncertainty removed by the linear rule and X*. This interpretation should not be pushed too far, but is a common interpretation, often found in the discussion of observations or experimental results.
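The identity \(E[(Y - \widehat{Y})^2] = \sigma_Y^2 (1 - \rho^2)\) can be checked numerically. The Python sketch below uses a small hypothetical joint distribution (not from the text) and exact fractions. It computes the mean squared error of the regression line directly and compares it with \(\sigma_Y^2 (1 - \rho^2)\).

```python
from fractions import Fraction as F

# Hypothetical joint distribution: (x, y) -> probability.
pmf = {(0, 0): F(2, 10), (0, 1): F(1, 10),
       (1, 0): F(1, 10), (1, 1): F(2, 10),
       (2, 0): F(1, 10), (2, 1): F(3, 10)}


def E(g):
    """Expectation of g(X, Y) under pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


EX, EY = E(lambda x, y: x), E(lambda x, y: y)
VX = E(lambda x, y: x * x) - EX ** 2
VY = E(lambda x, y: y * y) - EY ** 2
CV = E(lambda x, y: x * y) - EX * EY

a = CV / VX
b = EY - a * EX
mse = E(lambda x, y: (y - a * x - b) ** 2)   # E[(Y - Yhat)^2], computed directly
rho2 = CV ** 2 / (VX * VY)                   # rho squared

print(rho2, mse, VY * (1 - rho2))            # 49/414 73/345 73/345
```

For this distribution the linear rule removes only about 12 percent of the variance of \(Y\).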

**More general linear regression**

Consider a jointly distributed class \(\{Y, X_1, X_2, \cdot\cdot\cdot, X_n\}\). We wish to determine a function \(U\) of the form

\(U = \sum_{i = 0}^{n} a_i X_i\), with \(X_0 = 1\), such that \(E[(Y - U)^2]\) is a minimum

If \(U\) satisfies this minimum condition, then \(E[(Y - U)V] = 0\), or, equivalently

\(E[YV] = E[UV]\) for all \(V\) of the form \(V = \sum_{i = 0}^{n} c_i X_i\)

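To see this, set \(W = Y - U\) and let \(d^2 = E[W^2]\). For any \(\alpha\), the combination \(U - \alpha V\) is again of the required linear form and \(W + \alpha V = Y - (U - \alpha V)\), so by the minimum property of \(U\)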

\(d^2 \le E[(W + \alpha V)^2] = d^2 + 2\alpha E[WV] + \alpha^2 E[V^2]\)

If \(E[V^2] > 0\) and we select the special value

\(\alpha = -\dfrac{E[WV]}{E[V^2]}\), then \(0 \le -\dfrac{2E[WV]^2}{E[V^2]} + \dfrac{E[WV]^2}{E[V^2]^2} E[V^2] = -\dfrac{E[WV]^2}{E[V^2]}\)

This implies \(E[WV]^2 \le 0\), which can only be satisfied by \(E[WV] =0\), so that

\(E[YV] = E[UV]\)

On the other hand, if \(E[(Y - U)V] = 0\) for all \(V\) of the form above, then \(E[(Y- U)^2]\) is a minimum. Consider

\(E[(Y - V)^2] = E[(Y - U + U - V)^2] = E[(Y - U)^2] + E[(U - V)^2] + 2E[(Y - U) (U - V)]\)

Since \(U - V\) is of the same form as \(V\), the last term is zero. The first term is fixed. The second term is nonnegative, with zero value iff \(U - V = 0\) a.s. Hence, \(E[(Y - V)^2]\) is a minimum when \(V = U\).

If we take \(V\) to be 1, \(X_1, X_2, \cdot\cdot\cdot, X_n\), successively, we obtain \(n + 1\) linear equations in the \(n + 1\) unknowns \(a_0, a_1, \cdot\cdot\cdot, a_n\), as follows.

(1) \(E[Y] = a_0 + a_1 E[X_1] + \cdot\cdot\cdot + a_n E[X_n]\)

(2) \(E[YX_i] = a_0 E[X_i] + a_1 E[X_1X_i] + \cdot\cdot\cdot + a_n E[X_n X_i]\) for \(1 \le i \le n\)

For each \(i = 1, 2, \cdot\cdot\cdot, n\), we take (2) - \(E[X_i] \cdot (1)\) and use the calculating expressions for variance and covariance to get

\(\text{Cov} [Y, X_i] = a_1 \text{Cov} [X_1, X_i] + a_2 \text{Cov} [X_2, X_i] + \cdot\cdot\cdot + a_n \text{Cov} [X_n, X_i]\)

These \(n\) equations plus equation (1) may be solved algebraically for the \(a_i\).

In the important special case that the \(X_i\) are uncorrelated (i.e. \(\text{Cov}[X_i, X_j] = 0\) for \(i \ne j\)), we have

\(a_i = \dfrac{\text{Cov}[Y, X_i]}{\text{Var} [X_i]}\) for \(1 \le i \le n\)

and

\(a_0 = E[Y] - a_1 E[X_1] - a_2 E[X_2] - \cdot\cdot\cdot - a_n E[X_n]\)

In particular, this condition holds if the class \(\{X_i : 1 \le i \le n\}\) is iid as in the case of a simple random sample (see the section on "Simple Random Samples and Statistics").

Examination shows that for \(n = 1\), with \(X_1 = X\), \(a_0 = b\), and \(a_1 = a\), the result agrees with that obtained in the treatment of the regression line, above.

Example \(\PageIndex{4}\) Linear regression with two variables.

Suppose \(E[Y] = 3\), \(E[X_1] = 2\), \(E[X_2] = 3\), \(\text{Var}[X_1] = 3\), \(\text{Var}[X_2] = 8\), \(\text{Cov}[Y, X_1] = 5\), \(\text{Cov} [Y, X_2] = 7\), and \(\text{Cov} [X_1, X_2] = 1\). Then the three equations are

\(a_0 + 2a_1 + 3a_2 = 3\)

\(0 + 3a_1 + 1 a_2 = 5\)

\(0 + 1a_1 + 8a_2 = 7\)

Solution of these simultaneous linear equations with MATLAB gives the results \(a_0 = - 1.9565\), \(a_1 = 1.4348\), and \(a_2 = 0.6957\).
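The same system can be solved exactly outside MATLAB. The Python sketch below is not the MATLAB computation referred to above. It uses a small Gauss-Jordan routine, `solve`, written here for illustration. The first row of the coefficient matrix encodes equation (1) and the other two rows encode the covariance equations for \(X_1\) and \(X_2\).

```python
from fractions import Fraction as F


def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination with exact fractions.

    Assumes A is square and nonsingular (next() raises StopIteration otherwise).
    """
    n = len(A)
    M = [[F(v) for v in row] + [F(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        # Find a row with a nonzero pivot in this column and swap it into place.
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


# Unknowns are ordered (a0, a1, a2).
A = [[1, 2, 3],
     [0, 3, 1],
     [0, 1, 8]]
rhs = [3, 5, 7]

a0, a1, a2 = solve(A, rhs)
print(a0, a1, a2)                                   # -45/23 33/23 16/23
print([round(float(v), 4) for v in (a0, a1, a2)])   # [-1.9565, 1.4348, 0.6957]
```

The exact values \(-45/23\), \(33/23\), and \(16/23\) round to the four-place decimals quoted above.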