14.4: Linear Regression


    Linear regression gives the equation of the best-fit line through the data in a scatter plot, where “best” means the line that minimizes the sum of squared deviations (least squares). Let’s begin with the equation of a line:

    \[ y = a + bx \]

    where \(a\) is the intercept and \(b\) is the slope.

    [Figure: graph of the straight line \(y = a + bx\), showing the intercept \(a\) and the slope \(b\).]

    The data, the collection of \((x,y)\) points, rarely lie on a perfect straight line in a scatter plot. So we write

    \[ y^{\prime} = a + b x \]

    as the equation of the best-fit line. The quantity \(y^{\prime}\) is the value of \(y\) predicted from the value of \(x\), while \(y\) is the measured value. Now consider the scatter plot:

    [Figure: scatter plot of \((x,y)\) data with the best-fit line; the vertical distances between the data points and the line are the deviations.]

    The difference between the measured and predicted value at data point \(i\), \(d_{i} = y_{i} - y^{\prime}_{i}\), is the deviation. The quantity

    \[ d^{2}_{i} = (y_{i} - y^{\prime}_{i})^{2} = (y_{i} - (a + b x_{i}))^{2} \]

    is the squared deviation. The sum of the squared deviations is

    \[ E = \sum_{i=1}^{n} d_{i}^{2} = \sum_{i=1}^{n} (y_{i} - (a + b x_{i}))^{2} \]

    The least squares solution for \(a\) and \(b\) is the choice of \(a\) and \(b\) that minimizes \(E\), the sum of squared deviations, over all possible lines. This minimization problem is handled with differential calculus by setting the partial derivatives of \(E\) to zero:

    \[ \frac{\partial E}{\partial a}=0 \;\;\;\;\; \mbox{and} \;\;\;\;\; \frac{\partial E}{\partial b}=0 \]
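
    Carrying out the differentiation explicitly, these two conditions expand to the so-called normal equations:

    \[\begin{eqnarray*} \frac{\partial E}{\partial a} = -2\sum_{i=1}^{n}\left(y_{i} - a - b x_{i}\right) = 0 & \Longrightarrow & na + b\sum x_{i} = \sum y_{i} \\ \frac{\partial E}{\partial b} = -2\sum_{i=1}^{n} x_{i}\left(y_{i} - a - b x_{i}\right) = 0 & \Longrightarrow & a\sum x_{i} + b\sum x_{i}^{2} = \sum x_{i} y_{i} \end{eqnarray*}\]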

    Solving these two equations simultaneously for \(a\) and \(b\) gives

    \[ a = \frac{(\sum y_{i})(\sum x_{i}^{2}) - (\sum x_{i})(\sum x_{i} y_{i})}{n(\sum x_{i}^{2}) - (\sum x_{i})^{2}} \]

    and

    \[ b = \frac{n(\sum x_{i} y_{i}) - (\sum x_{i})(\sum y_{i})}{n(\sum x_{i}^{2}) - (\sum x_{i})^{2}} \]
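
    These formulas translate directly into code. Here is a minimal Python sketch, assuming the data arrive as two equal-length lists of numbers; the function name least_squares_line is an illustrative choice, not part of the text:

        def least_squares_line(x, y):
            """Intercept a and slope b of the least-squares line y' = a + b*x."""
            n = len(x)
            sum_x = sum(x)
            sum_y = sum(y)
            sum_xy = sum(xi * yi for xi, yi in zip(x, y))
            sum_x2 = sum(xi * xi for xi in x)
            denom = n * sum_x2 - sum_x ** 2          # n(sum x^2) - (sum x)^2
            a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
            b = (n * sum_xy - sum_x * sum_y) / denom
            return a, b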

    Example 14.3: Continue with the data from Example 14.1 and find the best-fit line. The data again are:

    Subject \(x\) \(y\) \(xy\) \(x^{2}\) \(y^{2}\)
    A 6 82 492 36 6724
    B 2 86 172 4 7396
    C 15 43 645 225 1849
    D 9 74 666 81 5476
    E 12 58 696 144 3364
    F 5 90 450 25 8100
    G 8 78 624 64 6084
    \(n=7\) \(\sum x=57\) \(\sum y=511\) \(\sum xy=3745\) \(\sum x^{2}=579\) \(\sum y^{2}=38993\)

    Using the sums of the columns, compute:

    \[\begin{eqnarray*} a & = & \frac{(\sum y_{i})(\sum x_{i}^{2}) - (\sum x_{i})(\sum x_{i} y_{i})}{n(\sum x_{i}^{2}) - (\sum x_{i})^{2}} \\ & = & \frac{(511)(579) - (57)(3745)}{(7)(579) - (57)^{2}} \\ & = & 102.493 \end{eqnarray*}\]

    and

    \[\begin{eqnarray*} b & = & \frac{n(\sum x_{i} y_{i}) - (\sum x_{i})(\sum y_{i})}{n(\sum x_{i}^{2}) - (\sum x_{i})^{2}} \\ & = & \frac{(7)(3745) - (57)(511)}{(7)(579) - (57)^{2}} \\ & = & -3.622 \end{eqnarray*}\]

    So

    \[\begin{eqnarray*} y^{\prime} & = & a + bx \\ y^{\prime} & = & 102.493 - 3.622 x \end{eqnarray*}\]
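
    As a quick check, the least_squares_line sketch above reproduces these values from the example data:

        x = [6, 2, 15, 9, 12, 5, 8]
        y = [82, 86, 43, 74, 58, 90, 78]
        a, b = least_squares_line(x, y)
        print(round(a, 3), round(b, 3))   # 102.493 -3.622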

    [Figure: scatter plot of the example data with the best-fit line \(y^{\prime} = 102.493 - 3.622 x\).]

    14.4.1: Relationship between correlation and slope

    The correlation coefficient \(r\) and the slope \(b\) of the best-fit line are related by

    \[ r = \frac{b s_{x}}{s_{y}} \]

    where

    \[\begin{eqnarray*} s_{x} & = & \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \overline{x})^{2}}{n-1}} \\ s_{y} & = & \sqrt{\frac{\sum_{i=1}^{n}(y_{i} - \overline{y})^{2}}{n-1}} \end{eqnarray*}\]

    are the standard deviations of the \(x\) and \(y\) datasets considered separately.
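
    As a sketch of how this identity can be checked numerically on the example data (assuming Python 3.10+ for statistics.linear_regression and statistics.correlation; statistics.stdev uses the same \(n-1\) denominator as \(s_{x}\) and \(s_{y}\) above):

        import statistics

        x = [6, 2, 15, 9, 12, 5, 8]
        y = [82, 86, 43, 74, 58, 90, 78]

        b = statistics.linear_regression(x, y).slope       # least-squares slope
        r_from_slope = b * statistics.stdev(x) / statistics.stdev(y)

        print(round(r_from_slope, 3))                      # -0.944
        print(round(statistics.correlation(x, y), 3))      # -0.944 (direct Pearson r)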


    This page titled 14.4: Linear Regression is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Gordon E. Sarty via source content that was edited to the style and standards of the LibreTexts platform.