10.5: Linear Regression

    Learning Objectives
    • Explain the purpose of linear regression in modeling the relationship between two variables.
    • Identify the components of a linear regression equation: the y-intercept (a) and the slope (b).
    • Describe how the y-intercept and slope define the position and steepness of the regression line.
    • Use a linear regression model to make predictions based on the trend between variables.

    A line of regression is the straight line that best represents the relationship between two correlated variables in a dataset. It is used in regression analysis to predict the value of a dependent variable from the value of an independent variable. The line is determined by the least squares method, which minimizes the sum of the squared vertical differences between the observed data points and the values predicted by the line. In this sense, the line of regression is the line that best fits the points. Below is a graph of the line of regression with the corresponding equation for the line.

    Figure \(\PageIndex{1}\): Graph of Line of Regression and Scatterplot
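
    The least squares idea can be illustrated with a short Python sketch. The data below are invented for this illustration (they do not come from the text), and the function name sse is a hypothetical helper. The sketch computes the sum of squared errors for the least-squares line of the toy data and for a few nearby lines, to show that the nearby lines fit worse.

```python
def sse(a, b, xs, ys):
    """Sum of squared vertical distances between the points and y' = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Toy data invented for this illustration (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares intercept and slope for the toy data, worked out by hand.
a_best, b_best = 2.2, 0.6
best = sse(a_best, b_best, xs, ys)

# Nudge the intercept and slope; every nudged line should fit worse.
nudged = [
    sse(a_best + da, b_best + db, xs, ys)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.2, 0.0, 0.2)
    if (da, db) != (0.0, 0.0)
]

print(f"SSE of least-squares line: {best:.3f}")
print(f"Smallest SSE among nudged lines: {min(nudged):.3f}")
```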
    Definition: Equation for the Line of Regression

    \(y'=a+bx\)

    Where,

    • \(y'\) is the estimated value of \(y\) at \(x\). Note that it does not represent the derivative of y as in a calculus class.
    • The intercept of the line is \(a = \dfrac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2}\)
    • The slope of the line is \(b = \dfrac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2}\)
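
    The two formulas above translate directly into code. The following is a minimal Python sketch, assuming the data are given as two equal-length lists; the function name regression_coefficients is a hypothetical choice for this illustration.

```python
def regression_coefficients(xs, ys):
    """Return (a, b) for y' = a + b*x using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denom = n * sum_x2 - sum_x ** 2  # shared denominator of both formulas
    if denom == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


# Points that lie exactly on y = 1 + 2x, so the fit should recover a = 1, b = 2.
a, b = regression_coefficients([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

    Both coefficients share the same denominator, so it is computed once. The denominator is zero only when every x-value is identical, in which case no unique line exists.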

    Examples

    Example \(\PageIndex{1}\)

    Professor Martinez is conducting a study to understand the relationship between the number of hours students study per week and their performance on the midterm exam in Math 400, an advanced calculus course at the university. She collects data from 8 randomly selected students in her class. The exam is out of 100 points, and time is measured in hours per week. It was shown in Section 10.3 that the data values are positively correlated. Find the line of regression and estimate the exam score if a student studies for 8 hours per week. Round the results to three decimal places.

    Data Values
    x: Hours Studied Per Week y: Midterm Exam Score (out of 100 points)
    10 51
    10 53
    12 64
    13 68
    14 71
    15 79
    16 84
    20 92

    Table \(\PageIndex{1}\) Hours Studied Per Week and Midterm Scores out of 100 Points.

    Solution

    In this example, the formulas for \(a\) and \(b\) are applied by hand.

    Step 1) Add three columns to the table for \(xy\), \(x^2\), and \(y^2\). Fill in the columns by performing the required computations.

    Data Values
    \(x\) \(y\) \(xy\) \(x^2\) \(y^2\)
    10 51 510 100 2601
    10 53 530 100 2809
    12 64 768 144 4096
    13 68 884 169 4624
    14 71 994 196 5041
    15 79 1185 225 6241
    16 84 1344 256 7056
    20 92 1840 400 8464

    Table \(\PageIndex{2}\) Key Columns Needed to Compute the Regression Coefficients a and b

    Step 2) Find the sum of each column.

    • \(n\) = 8
    • \(\sum x\) = 110
    • \(\sum y\) = 562
    • \(\sum x^2\) = 1590
    • \(\sum y^2\) = 40932
    • \(\sum xy\) = 8055
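
    Steps 1 and 2 can be checked with a few lines of Python. This is only a sketch; the variable names (hours, scores, rows) are chosen for this illustration.

```python
# Data from the example: hours studied per week and midterm scores.
hours = [10, 10, 12, 13, 14, 15, 16, 20]
scores = [51, 53, 64, 68, 71, 79, 84, 92]

# Step 1: one row per student with the columns x, y, xy, x^2, y^2.
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(hours, scores)]

# Step 2: add up each column.
n = len(rows)
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(col) for col in zip(*rows))

print("n =", n)
print("sum x =", sum_x, " sum y =", sum_y)
print("sum x^2 =", sum_x2, " sum y^2 =", sum_y2, " sum xy =", sum_xy)
```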

    Step 3) Plug the information into the formulas for the regression coefficients and compute them using the order of operations.

    \(a = \dfrac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2}\)\(= \dfrac{(562)(1590)-(110)(8055)}{8(1590)-(110)^2}\)\(=\dfrac{893,580 - 886,050}{12,720-12,100} =\dfrac{7,530}{620} = 12.145\)

    \(b = \dfrac{8(8055)-(110)(562)}{8(1590)-(110)^2}\)\(=\dfrac{64,440 - 61,820}{12,720-12,100}=\dfrac{2,620}{620}=4.226\)
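
    The arithmetic in Step 3 can be reproduced exactly with Python's fractions module, which avoids rounding until the final step. This is a sketch that takes the column sums from Step 2 as given.

```python
from fractions import Fraction

# Column sums from Step 2.
n, sum_x, sum_y, sum_x2, sum_xy = 8, 110, 562, 1590, 8055

denom = n * sum_x2 - sum_x ** 2                      # 12,720 - 12,100
a = Fraction(sum_y * sum_x2 - sum_x * sum_xy, denom)  # 7,530 / 620
b = Fraction(n * sum_xy - sum_x * sum_y, denom)       # 2,620 / 620

# Round only at the end, to three decimal places.
print("a =", a, "=", round(float(a), 3))
print("b =", b, "=", round(float(b), 3))
```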

    Step 4) Write out the equation for the line of regression using the computed regression coefficients.

    \(y' = 12.145+4.226x\)

    Step 5) Use the line of regression to compute \(y'\) (estimated Midterm Exam score) when x = 8 (hours per week). Plug x = 8 into the equation and compute the estimated value.

    \(y' = 12.145+4.226x = 12.145+4.226(8) = 12.145 + 33.808 = 45.953\)

    Thus, if a student studies 8 hours per week, the estimated Midterm Exam score is around 46 points.
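
    The prediction step is a single evaluation of the line. The sketch below uses a hypothetical helper named predict with the rounded coefficients from Step 4. It also checks whether the requested x-value lies inside the range of the observed data: here 8 hours is below the smallest observed value of 10 hours, so the estimate extends the line beyond the data and should be read with some caution.

```python
def predict(x, a=12.145, b=4.226):
    """Estimated y' on the regression line y' = a + b*x."""
    return a + b * x


hours = [10, 10, 12, 13, 14, 15, 16, 20]  # observed x-values from the example

x_new = 8
estimate = predict(x_new)

# True only if x_new falls between the smallest and largest observed x.
in_range = min(hours) <= x_new <= max(hours)

print(f"Estimated score at x = {x_new}: {estimate:.3f}")
print("Within observed range of x:", in_range)
```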

    It is more efficient to work out a linear regression problem using technology. The example above will be reworked using the TI-84+ calculator in the next example.

    Example \(\PageIndex{2}\)

    Professor Martinez is conducting a study to understand the relationship between the number of hours students study per week and their performance on the midterm exam in Math 400, an advanced calculus course at the university. She collects data from 8 randomly selected students in her class. The exam is out of 100 points, and time is measured in hours per week. It was shown in Section 10.3 that the data values are positively correlated. Find the line of regression and estimate the exam score if a student studies for 8 hours per week. Use the TI-84+ calculator. Round the results to three decimal places.

    Data Values
    x: Hours Studied Per Week y: Midterm Exam Score (out of 100 points)
    10 51
    10 53
    12 64
    13 68
    14 71
    15 79
    16 84
    20 92

    Table \(\PageIndex{3}\) Hours Studied Per Week and Midterm Scores out of 100 Points.

    Solution

    Step 1) Press the [STAT] button, make sure that [EDIT] and [1: Edit] are selected, then press [ENTER].

    Figure \(\PageIndex{2}\): Edit Function to Enter Data in TI-84+

    Step 2) Enter the x-values in List 1 [\(L_1\)] and the y-values in List 2 [\(L_2\)].

    Figure \(\PageIndex{3}\): Lists (L1 and L2) Columns Where X and Y Values are Entered

    Step 3) Press the [STAT] button again, use the right arrow to select [CALC], use the down arrow to select [8: LinReg(a+bx)], and then press [ENTER].

    Figure \(\PageIndex{4}\): Linear Regression Function to Compute Regression Coefficients

    Step 4) Make sure that X-list has \(L_1\) and the Y-list has \(L_2\). Use the down arrow to select [Calculate] and press [ENTER].

    Figure \(\PageIndex{5}\): Check Screen to Ensure Proper Lists are Selected

    Step 5) On the output page, \(a\) and \(b\) will be on the first two lines. After rounding to three places, they are \(a = 12.145\) and \(b = 4.226\).

    Figure \(\PageIndex{6}\): Output of Regression Coefficients (a = 12.145 and b = 4.226)

    Step 6) Use the line of regression to compute \(y'\) (estimated Midterm Exam score) when x = 8 (hours per week). Plug x = 8 into the equation and compute the estimated value.

    \(y' = 12.145+4.226x = 12.145+4.226(8) = 12.145 + 33.808 = 45.953\)

    Thus, if a student studies 8 hours per week, the estimated Midterm Exam score is around 46 points.
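
    For readers without a TI-84+, the calculator output can be cross-checked in Python. The sketch below does not reproduce the calculator's internal method; it uses an algebraically equivalent form of the summation formulas, with the slope computed from deviations about the means and the intercept from \(a = \bar{y} - b\bar{x}\). The names s_xy and s_xx are chosen for this illustration.

```python
hours = [10, 10, 12, 13, 14, 15, 16, 20]
scores = [51, 53, 64, 68, 71, 79, 84, 92]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Sums of products and squares of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)

print(f"a = {a:.3f}, b = {b:.3f}")
```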

    Example \(\PageIndex{3}\)

    A health researcher at the Health Department at a large university is conducting a study to explore the relationship between physical activity and health outcomes among college students aged 18–25. The researcher is specifically interested in determining whether there is a correlation between the number of hours students work out per week and the number of days they spend ill in a year. The data the researcher collected are provided in the table below. It was shown in Section 10.3 that the data values are negatively correlated. Find the line of regression and estimate the number of days spent ill in a year if a student works out 9 hours per week. Use the TI-84+ calculator. Round the results to three decimal places.

    Data Values
    X: Hours Worked Out per Week Y: Days Spent Ill in a Year
    0 14
    2 10
    4 8
    5 6
    7 5
    10 3
    12 2

    Table \(\PageIndex{4}\) Hours Worked Out Per Week and Days Ill Per Year.

    Solution

    In this example, the TI-84+ calculator will be used to compute the regression coefficients. Follow the steps in Example 2 to compute the regression coefficients. The output is provided in the image below.

    Figure \(\PageIndex{7}\): Output of Regression Coefficients (a = 12.251 and b = –0.944)

    Therefore, the line of regression is \(y' = 12.251 - 0.944x\).

    The estimated value is \(y' = 12.251 - 0.944(9) = 12.251 -8.496 = 3.755\).

    Thus, according to the linear model, if a person works out 9 hours per week, they will be ill for around 4 days during the year.
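
    The coefficients for this example can also be verified from the summation formulas. The following Python sketch uses the data in the table above; the variable names are chosen for this illustration. The coefficients are rounded to three decimal places before the prediction is made, as in the worked solution.

```python
workout_hours = [0, 2, 4, 5, 7, 10, 12]
days_ill = [14, 10, 8, 6, 5, 3, 2]

n = len(workout_hours)
sum_x = sum(workout_hours)
sum_y = sum(days_ill)
sum_x2 = sum(x * x for x in workout_hours)
sum_xy = sum(x * y for x, y in zip(workout_hours, days_ill))

denom = n * sum_x2 - sum_x ** 2
a = round((sum_y * sum_x2 - sum_x * sum_xy) / denom, 3)
b = round((n * sum_xy - sum_x * sum_y) / denom, 3)

# Estimated days ill for a student who works out 9 hours per week.
estimate = a + b * 9

print(f"a = {a}, b = {b}")
print(f"Estimate at x = 9: {estimate:.3f}")
```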

    Example \(\PageIndex{4}\)

    A researcher is exploring if there is any correlation between the amount of money students spend on lunch and their GPA in a college setting. Hypothetically, we are testing if students who spend more money on lunch tend to have higher or lower GPAs.

    The researcher collected 10 pairs of data representing the amount of money students spend on lunch and their corresponding GPA. The data are provided in the table below. It was shown in Section 10.3 that the data values are not correlated. Find the line of regression and estimate the GPA if a student spends $13.00 on lunch. Use the TI-84+ calculator. Round the results to three decimal places.

    Data Values
    Amount Spent on Lunch ($) GPA
    $ 10.00 1.95
    $ 7.50 3.20
    $ 4.00 3.60
    $ 8.45 2.80
    $ 6.95 3.40
    $ 9.00 2.70
    $ 8.90 2.56
    $ 12.50 3.30
    $ 19.80 3.00
    $ 6.90 3.49

    Table \(\PageIndex{5}\) Amount Spent on Lunch in Dollars and Grade Point Average (GPA).

    Solution

    Since there is no correlation, a line of regression would not be a valid model for these data, so it is not computed.
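
    To see how weak the linear relationship is, the correlation coefficient r can be computed for the lunch data. The Python sketch below only computes r from deviations about the means; it does not reproduce the significance test from Section 10.3, and the variable names are chosen for this illustration.

```python
import math

lunch = [10.00, 7.50, 4.00, 8.45, 6.95, 9.00, 8.90, 12.50, 19.80, 6.90]
gpa = [1.95, 3.20, 3.60, 2.80, 3.40, 2.70, 2.56, 3.30, 3.00, 3.49]

n = len(lunch)
x_bar = sum(lunch) / n
y_bar = sum(gpa) / n

# Sums of products and squares of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(lunch, gpa))
s_xx = sum((x - x_bar) ** 2 for x in lunch)
s_yy = sum((y - y_bar) ** 2 for y in gpa)

# Pearson correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"r = {r:.3f}")
```

    A value of r this close to zero indicates that a straight line explains little of the variation in GPA, which is consistent with the decision not to compute a regression line.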

    Attributions

    "10.1: Regression" by Kathryn Kozak is licensed under CC BY-SA 4.0

    Exercises

    1. A café owner wants to determine if there is a significant correlation between the daily temperature and the number of iced coffee drinks sold. The owner records the daily temperature and the number of iced coffee drinks sold for five randomly selected days, listed below. Test for correlation with \( \alpha = 0.05 \) using r and Pearson's Correlation Matrix (PMC); refer to the PMC table in the book for the critical value. If there is enough evidence of a linear relationship, determine the line of regression and make a prediction when x = 50 degrees.
    Bivariate Data
    Temperature (°F) # of Iced Coffees Sold
    72 35
    78 42
    85 53
    88 56
    91 60

    Scan the QR code or click on it to open the MyOpenMath version of the above question with step-by-step guidance.
    MyOpenMath is a free online learning platform designed to support math instruction through automated homework, quizzes, and assessments. You must register for MyOpenMath and sign in to view the question.

    QR code linking to the MyOpenMath version of the question above with step-by-step guided problem-solving.

    2. A café owner wants to determine if there is a significant correlation between the daily temperature and the number of iced coffee drinks sold. The owner records the daily temperature and the number of iced coffee drinks sold for five randomly selected days, listed below. Test for correlation with \( \alpha = 0.05 \). Use the traditional method, locating the critical values in the t-distribution table. If there is enough evidence of a linear relationship, determine the line of regression and make a prediction when x = 50 degrees.
    Bivariate Data
    Temperature (°F) # of Iced Coffees Sold
    72 35
    78 42
    85 53
    88 56
    91 60

    3. A café owner wants to determine if there is a significant correlation between the daily temperature and the number of iced coffee drinks sold. The owner records the daily temperature and the number of iced coffee drinks sold for five randomly selected days, listed below. Test for correlation with \( \alpha = 0.05 \). Use the p-value method. If there is enough evidence of a linear relationship, determine the line of regression and make a prediction when x = 50 degrees.
    Bivariate Data
    Temperature (°F) # of Iced Coffees Sold
    72 35
    78 42
    85 53
    88 56
    91 60

    4. A researcher wants to investigate whether there is a significant linear relationship between gas prices and average household income in different cities. The data below show average gas prices and corresponding household income (in thousands of dollars) for seven cities. Test for correlation with \( \alpha = 0.01 \) using r and Pearson's Correlation Matrix (PMC); refer to the PMC table in the book for the critical value. If there is enough evidence of a linear relationship, determine the line of regression and make a prediction when x = 2.75.
    Bivariate Data
    Gas Price ($) Household Income (in $1,000s)
    3.10 45
    3.25 52
    3.40 60
    3.55 66
    3.70 72
    3.85 78
    4.00 85

    5. A researcher wants to investigate whether there is a significant linear relationship between gas prices and average household income in different cities. The data below show average gas prices and corresponding household income (in thousands of dollars) for seven cities. Test for correlation with \( \alpha = 0.01 \). Use the traditional method, locating the critical values in the t-distribution table. If there is enough evidence of a linear relationship, determine the line of regression and make a prediction when x = 2.75.
    Bivariate Data
    Gas Price ($) Household Income (in $1,000s)
    3.10 45
    3.25 52
    3.40 60
    3.55 66
    3.70 72
    3.85 78
    4.00 85

    6. A researcher wants to investigate whether there is a significant linear relationship between gas prices and average household income in different cities. The data below show average gas prices and corresponding household income (in thousands of dollars) for seven cities. Test for correlation with \( \alpha = 0.01 \). Use the p-value method. If there is enough evidence of a linear relationship, determine the line of regression and make a prediction when x = 2.75.
    Bivariate Data
    Gas Price ($) Household Income (in $1,000s)
    3.10 45
    3.25 52
    3.40 60
    3.55 66
    3.70 72
    3.85 78
    4.00 85

    7. A researcher believes that students who study more hours per week might experience lower levels of stress. To test this, she surveys 6 college students and records how many hours they study per week and their self-reported stress level on a scale of 1 to 10 (10 = highest stress). Test for correlation with \( \alpha = 0.05 \) using r and Pearson's Correlation Matrix (PMC); refer to the PMC table in the book for the critical value.
    Bivariate Data
    Study Hours Stress Level
    4 10
    6 8
    8 9
    10 10
    12 4
    14 3

    8. A researcher believes that students who study more hours per week might experience lower levels of stress. To test this, she surveys 6 college students and records how many hours they study per week and their self-reported stress level on a scale of 1 to 10 (10 = highest stress). Test for correlation with \( \alpha = 0.05 \). Use the traditional method, locating the critical values in the t-distribution table.
    Bivariate Data
    Study Hours Stress Level
    4 10
    6 8
    8 9
    10 10
    12 4
    14 3

    9. A researcher believes that students who study more hours per week might experience lower levels of stress. To test this, she surveys 6 college students and records how many hours they study per week and their self-reported stress level on a scale of 1 to 10 (10 = highest stress). Test for correlation with \( \alpha = 0.05 \). Use the p-value method.
    Bivariate Data
    Study Hours Stress Level
    4 10
    6 8
    8 9
    10 10
    12 4
    14 3

    Answers

    If you are an instructor and want the solutions to all the exercise questions for each section, please email Toros Berberyan.


    This page titled 10.5: Linear Regression is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Toros Berberyan, Tracy Nguyen, and Alfie Swan via source content that was edited to the style and standards of the LibreTexts platform.