10.5: Linear Regression

    Learning Objectives
    • Explain the purpose of linear regression in modeling the relationship between two variables.
    • Identify the components of a linear regression equation: the y-intercept (a) and the slope (b).
    • Describe how the y-intercept and slope define the position and steepness of the regression line.
    • Use a linear regression model to make predictions based on the trend between variables.

    A line of regression is the straight line that best represents the relationship between two correlated variables in a dataset. In regression analysis, it is used to predict the value of the dependent variable from the value of the independent variable. The line is determined by the least squares method, which minimizes the sum of the squared vertical differences between the observed data values and the values predicted by the line. In this sense, it is the line that best fits the data points. Below is a graph of a line of regression drawn through a scatterplot, followed by the equation for the line.

    Figure \(\PageIndex{1}\): Graph of Line of Regression and Scatterplot
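
    The least squares idea can be checked numerically. The sketch below is only an illustration: the four data points and the helper name `sum_squared_errors` are hypothetical and do not come from this section. It compares the sum of squared vertical differences for the least squares line of that small dataset with the sums for a few nearby lines.

```python
# Hypothetical data, chosen only to illustrate the least squares idea.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical differences between each y and a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# For this data the least squares line has intercept 0 and slope 1.9.
best = sum_squared_errors(0.0, 1.9, xs, ys)

# Nearby lines with a shifted intercept or slope fit worse.
others = {
    "intercept + 0.5": sum_squared_errors(0.5, 1.9, xs, ys),
    "intercept - 0.5": sum_squared_errors(-0.5, 1.9, xs, ys),
    "slope + 0.2": sum_squared_errors(0.0, 2.1, xs, ys),
    "slope - 0.2": sum_squared_errors(0.0, 1.7, xs, ys),
}

print(f"least squares line: {best:.2f}")
for label, value in others.items():
    print(f"{label}: {value:.2f}")
```

    Every shifted line gives a larger sum than the least squares line, which is what "best fit" means here.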
    Definition: Equation for the Line of Regression

    \(y'=a+bx\)

    Where,

    • \(y'\) is the estimated value of \(y\) at \(x\). Note that it does not represent the derivative of y as in a calculus class.
    • The intercept of the line is \(a = \dfrac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2}\)
    • The slope of the line is \(b = \dfrac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2}\)
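
    The two formulas above translate directly into code. The following is a minimal sketch, assuming the data are given as two equal-length lists; the function name `regression_coefficients` and the sample points are hypothetical and used only for illustration.

```python
def regression_coefficients(xs, ys):
    """Return (a, b) for the line y' = a + b*x using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Both formulas share the same denominator.
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


# Hypothetical points lying exactly on y = 1 + 2x, so a = 1 and b = 2.
a, b = regression_coefficients([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

    Because both coefficients share one denominator, it is computed once; it is zero only when every x-value is the same, in which case no slope can be found.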

    Examples

    Example \(\PageIndex{1}\)

    Professor Martinez is conducting a study to understand the relationship between the number of hours students study per week and their performance on the midterm exam in Math 400, an advanced calculus course at the university. She collects data from 8 randomly selected students in her class. The exam is out of 100 points, and time is measured in hours per week. It was shown in Section 10.3 that the data values are positively correlated. Find the line of regression and estimate the exam score of a student who studies for 8 hours per week. Round the results to three decimal places.

    Data Values
    x: Hours Studied Per Week    y: Midterm Exam Score (out of 100 points)
    10                           51
    10                           53
    12                           64
    13                           68
    14                           71
    15                           79
    16                           84
    20                           92

    Table \(\PageIndex{1}\) Hours Studied Per Week and Midterm Scores out of 100 Points.

    Solution

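    In this example, the formulas for \(a\) and \(b\) are applied by hand.

    Step 1) Add three columns to the table for \(xy\), \(x^2\), and \(y^2\). Fill in the columns by performing the required computations. The formulas for \(a\) and \(b\) use only the \(xy\) and \(x^2\) columns; the \(y^2\) column is not needed for them.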

    Data Values
    \(x\)    \(y\)    \(xy\)    \(x^2\)    \(y^2\)
    10       51       510       100        2601
    10       53       530       100        2809
    12       64       768       144        4096
    13       68       884       169        4624
    14       71       994       196        5041
    15       79       1185      225        6241
    16       84       1344      256        7056
    20       92       1840      400        8464

    Table \(\PageIndex{2}\) Key Columns Needed to Compute the Regression Coefficients a and b

    Step 2) Find the sum of each column.

    • \(n\) = 8
    • \(\sum x\) = 110
    • \(\sum y\) = 562
    • \(\sum x^2\) = 1590
    • \(\sum y^2\) = 40932
    • \(\sum xy\) = 8055
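
    The column sums in Step 2 can be double-checked with a few lines of code. This is a small sketch using the data from Table 1; the variable names are illustrative choices, not part of the original example.

```python
# Data from Table 1: hours studied per week (x) and midterm score (y).
hours = [10, 10, 12, 13, 14, 15, 16, 20]
scores = [51, 53, 64, 68, 71, 79, 84, 92]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))

print(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
# 8 110 562 1590 40932 8055
```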

    Step 3) Plug the information into the formulas for the regression coefficients and compute them using the order of operations.

    \(a = \dfrac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2} = \dfrac{(562)(1590)-(110)(8055)}{8(1590)-(110)^2} = \dfrac{893{,}580 - 886{,}050}{12{,}720-12{,}100} = \dfrac{7{,}530}{620} \approx 12.145\)

    \(b = \dfrac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2} = \dfrac{8(8055)-(110)(562)}{8(1590)-(110)^2} = \dfrac{64{,}440 - 61{,}820}{12{,}720-12{,}100} = \dfrac{2{,}620}{620} \approx 4.226\)
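
    The arithmetic in Step 3 can be reproduced from the sums found in Step 2. The sketch below keeps the numerators and the shared denominator as separate integers so each intermediate value can be compared with the hand computation; the variable names are illustrative.

```python
# Sums from Step 2.
n, sum_x, sum_y, sum_x2, sum_xy = 8, 110, 562, 1590, 8055

denominator = n * sum_x2 - sum_x ** 2            # 12,720 - 12,100 = 620
numerator_a = sum_y * sum_x2 - sum_x * sum_xy    # 893,580 - 886,050 = 7,530
numerator_b = n * sum_xy - sum_x * sum_y         # 64,440 - 61,820 = 2,620

# Round only at the end, to three decimal places.
a = round(numerator_a / denominator, 3)
b = round(numerator_b / denominator, 3)

print(a, b)  # 12.145 4.226
```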

    Step 4) Write out the equation for the line of regression using the computed regression coefficients.

    \(y' = 12.145+4.226x\)

    Step 5) Use the line of regression to compute \(y'\) (estimated Midterm Exam score) when x = 8 (hours per week). Plug x = 8 into the equation and compute the estimated value.

    \(y' = 12.145+4.226x = 12.145+4.226(8) = 12.145 + 33.808 = 45.953\)

    Thus, if a student studies 8 hours per week, the estimated Midterm Exam score is around 46 points.
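
    The prediction step is a single evaluation of the line. The sketch below uses a hypothetical helper named `predict` and compares the estimate from the rounded coefficients with the estimate from the unrounded fractions 7530/620 and 2620/620. The two differ slightly in the third decimal place, but both round to about 46 points.

```python
def predict(a, b, x):
    """Evaluate the line of regression y' = a + b*x at x."""
    return a + b * x


# Coefficients rounded to three decimal places, as in Step 4.
rounded = predict(12.145, 4.226, 8)

# Unrounded coefficients taken from the fractions in Step 3.
exact = predict(7530 / 620, 2620 / 620, 8)

print(f"{rounded:.3f}")  # 45.953
print(f"{exact:.3f}")    # 45.952
```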

    It is more efficient to work out a linear regression problem using technology. The example above will be reworked using the TI-84+ calculator in the next example.

    Example \(\PageIndex{2}\)

    Professor Martinez is conducting a study to understand the relationship between the number of hours students study per week and their performance on the midterm exam in Math 400, an advanced calculus course at the university. She collects data from 8 randomly selected students in her class. The exam is out of 100 points, and time is measured in hours per week. It was shown in Section 10.3 that the data values are positively correlated. Find the line of regression and estimate the exam score of a student who studies for 8 hours per week. Use the TI-84+ calculator. Round the results to three decimal places.

    Data Values
    x: Hours Studied Per Week    y: Midterm Exam Score (out of 100 points)
    10                           51
    10                           53
    12                           64
    13                           68
    14                           71
    15                           79
    16                           84
    20                           92

    Table \(\PageIndex{3}\) Hours Studied Per Week and Midterm Scores out of 100 Points.

    Solution

    Step 1) Press the [STAT] button, make sure that the [EDIT] menu and [1:Edit] are selected, then press [ENTER].

    Figure \(\PageIndex{2}\): Edit Function to Enter Data in TI-84+

    Step 2) Enter the x-values in List 1 [\(L_1\)] and the y-values in List 2 [\(L_2\)].

    Figure \(\PageIndex{3}\): Lists (L1 and L2) Columns Where X and Y Values are Entered

    Step 3) Press the [STAT] button again, use the right arrow to select [CALC], use the down arrow to select [8: LinReg(a+bx)], and then press [ENTER].

    Figure \(\PageIndex{4}\): Linear Regression Function to Compute Regression Coefficients

    Step 4) Make sure that the X-list is set to \(L_1\) and the Y-list is set to \(L_2\). Use the down arrow to select [Calculate] and press [ENTER].

    Figure \(\PageIndex{5}\): Check Screen to Ensure Proper Lists are Selected

    Step 5) On the output page, \(a\) and \(b\) will be on the first two lines. After rounding to three places, they are \(a = 12.145\) and \(b = 4.226\).

    Figure \(\PageIndex{6}\): Output of Regression Coefficients (a = 12.145 and b = 4.226)

    Step 6) Use the line of regression to compute \(y'\) (estimated Midterm Exam score) when x = 8 (hours per week). Plug x = 8 into the equation and compute the estimated value.

    \(y' = 12.145+4.226x = 12.145+4.226(8) = 12.145 + 33.808 = 45.953\)

    Thus, if a student studies 8 hours per week, the estimated Midterm Exam score is around 46 points.

    Example \(\PageIndex{3}\)

    A health researcher at the Health Department at a large university is conducting a study to explore the relationship between physical activity and health outcomes among college students aged 18–25 years old. The researcher is specifically interested in determining whether there is a correlation between the number of hours students work out per week and the number of days they spend ill in a year. The researcher collected the data provided in the table below. It was shown in Section 10.3 that the data values are negatively correlated. Find the line of regression and estimate the number of days spent ill in a year for a student who works out 9 hours per week. Use the TI-84+ calculator. Round the results to three decimal places.

    Data Values
    X: Hours Worked Out per Week    Y: Days Spent Ill in a Year
    0                               14
    2                               10
    4                               8
    5                               6
    7                               5
    10                              3
    12                              2

    Table \(\PageIndex{4}\) Hours Worked Out Per Week and Days Ill Per Year.

    Solution

    In this example, the TI-84+ calculator will be used to compute the regression coefficients. Follow the steps in Example 2 to compute the regression coefficients. The output is provided in the image below.

    Figure \(\PageIndex{7}\): Output of Regression Coefficients (a = 12.251 and b = –0.944)

    Therefore, the line of regression is \(y' = 12.251 - 0.944x\).

    The estimated value is \(y' = 12.251 - 0.944(9) = 12.251 -8.496 = 3.755\).

    Thus, according to the linear model, a person who works out 9 hours per week is estimated to be ill for around 4 days during the year.
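
    The coefficients for this example can also be confirmed from the summation formulas. The sketch below uses the data in Table 4 with illustrative variable names; the negative slope reflects that more hours of exercise go with fewer days of illness.

```python
# Data from Table 4: hours worked out per week (x) and days ill per year (y).
workout_hours = [0, 2, 4, 5, 7, 10, 12]
days_ill = [14, 10, 8, 6, 5, 3, 2]

n = len(workout_hours)
sum_x = sum(workout_hours)                                     # 40
sum_y = sum(days_ill)                                          # 48
sum_x2 = sum(x * x for x in workout_hours)                     # 338
sum_xy = sum(x * y for x, y in zip(workout_hours, days_ill))   # 171

denominator = n * sum_x2 - sum_x ** 2                          # 766
a = round((sum_y * sum_x2 - sum_x * sum_xy) / denominator, 3)  # 12.251
b = round((n * sum_xy - sum_x * sum_y) / denominator, 3)       # -0.944

# Estimated days ill for 9 hours of exercise, using the rounded values.
estimate = round(a + b * 9, 3)
print(a, b, estimate)  # 12.251 -0.944 3.755
```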

    Example \(\PageIndex{4}\)

    A researcher is exploring if there is any correlation between the amount of money students spend on lunch and their GPA in a college setting. Hypothetically, we are testing if students who spend more money on lunch tend to have higher or lower GPAs.

    The researcher collected 10 pairs of data representing the amount of money students spend on lunch and their corresponding GPA. The data are provided in the table below. It was shown in Section 10.3 that the data values are not correlated. Find the line of regression and estimate the GPA of a student who spends $13.00 on lunch. Use the TI-84+ calculator. Round the results to three decimal places.

    Data Values
    Amount Spent on Lunch ($)    GPA
    $10.00                       1.95
    $7.50                        3.20
    $4.00                        3.60
    $8.45                        2.80
    $6.95                        3.40
    $9.00                        2.70
    $8.90                        2.56
    $12.50                       3.30
    $19.80                       3.00
    $6.90                        3.49

    Table \(\PageIndex{5}\) Amount Spent on Lunch in Dollars and Grade Point Average (GPA).

    Solution

    Since the data values are not correlated, a line of regression would not give valid estimates. Therefore, the line is not computed and no GPA estimate is made.
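
    The lack of correlation can be illustrated by computing the correlation coefficient \(r\) for the data in Table 5. The sketch below rests on two assumptions that are not stated in this section: that Section 10.3 uses the standard Pearson formula for \(r\), and that significance is judged against the common two-tailed 0.05 critical value of 0.632 for \(n = 10\).

```python
import math

# Data from Table 5: amount spent on lunch in dollars (x) and GPA (y).
spent = [10.00, 7.50, 4.00, 8.45, 6.95, 9.00, 8.90, 12.50, 19.80, 6.90]
gpa = [1.95, 3.20, 3.60, 2.80, 3.40, 2.70, 2.56, 3.30, 3.00, 3.49]

n = len(spent)
sum_x, sum_y = sum(spent), sum(gpa)
sum_x2 = sum(x * x for x in spent)
sum_y2 = sum(y * y for y in gpa)
sum_xy = sum(x * y for x, y in zip(spent, gpa))

# Pearson correlation coefficient (assumed to match Section 10.3).
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Assumed critical value: two-tailed, alpha = 0.05, n = 10.
CRITICAL_VALUE = 0.632

print(f"r = {r:.3f}")  # r = -0.256
print("correlated" if abs(r) > CRITICAL_VALUE else "not correlated")
```

    Under these assumptions \(|r|\) falls well below the critical value, which is consistent with the decision not to compute a line of regression.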

    Attributions

    "10.1: Regression" by Kathryn Kozak is licensed under CC BY-SA 4.0


    This page titled 10.5: Linear Regression is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Toros Berberyan, Tracy Nguyen, and Alfie Swan via source content that was edited to the style and standards of the LibreTexts platform.