
3: Intro to Linear Regression


    Strešlau, the capital of Ruritania

    Regression is a set of methods that seek to learn the specific relationship between one or more influenced (dependent, response) variables and one or more influencing (independent, predictor) variables. There are many regression methods, each taking a different approach to determining how best to quantify that relationship.

    As is tradition, this chapter starts with our first definition of "best fit" and derives many results from that definition. The treatment is entirely mathematical: probability distributions are not considered until Chapter 5.


    Figure \(\PageIndex{1}\): Scatter plot of the sample data and a line of best fit for that data. Note that the slope of the line is negative. This indicates that increasing values of \(x\) tend to correspond to lower values of \(y\). Regression detects such trends.

     

    Let \(x\) and \(y\) be numeric variables. The linear relationship between \(x\) and \(y\) can be summarized by a line that "best" fits the observed data. That is, we can summarize the relationship between \(x\) and \(y\) using a linear equation:

    \begin{equation}
    y = \beta_0 + \beta_1 x + \varepsilon \label{eq:ch2a-lobf}
    \end{equation}

    Here, the parameter \(\beta_1\) represents the slope and the parameter \(\beta_0\) represents the y-intercept (the value of \(y\) on the line when \(x=0\)). The slope is usually the quantity of primary interest; it measures the effect of \(x\) on \(y\). The \(\varepsilon\) term represents the vertical distance between an observation and the population line of best fit. It contains everything that affects \(y\) that is not included in \(x\).
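
    To make these pieces concrete, here is a minimal sketch in Python of how the slope, intercept, and \(\varepsilon\) combine to produce each observed \(y\). The parameter values and the choice of a normal distribution for \(\varepsilon\) are illustrative assumptions only.

    ```python
    # Simulate observations from the population line y = beta0 + beta1*x + epsilon.
    # The parameter values and the normal error distribution are illustrative
    # assumptions made for this sketch only.
    import numpy as np

    rng = np.random.default_rng(seed=314)

    beta0, beta1 = 10.0, -0.8            # population intercept and slope (made up)
    x = rng.uniform(0, 20, size=50)      # predictor values
    epsilon = rng.normal(0, 2, size=50)  # everything affecting y other than x

    y = beta0 + beta1 * x + epsilon      # observations scatter around the line
    ```

    Because the slope here is negative, the simulated points drift downward as \(x\) increases, much like the pattern in Figure \(\PageIndex{1}\).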

    We said that the line given in equation \ref{eq:ch2a-lobf} "best" fits the observed data. What we mean by "best" determines where we go from here. In thinking about "best," it may help to see some sample data and the "line of best fit" for it (Figure \(\PageIndex{1}\), above).

    A good statistician will ask:

    What makes this line the "best"?

    A good statistician will answer:

    It depends. 

     

    Note that there are at least three definitions of "best" that we can use:

    1. Maximize the likelihood that the data were generated
    2. Minimize the sum of the absolute value of the residuals
    3. Minimize the sum of the square of the residuals

    All three definitions are entirely legitimate — as are many other definitions. However, each leads to different estimation methods and estimators. 
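
    To see how these definitions differ in practice, the following sketch evaluates all three criteria for one candidate line on a small made-up dataset; the normal errors (with standard deviation 1) assumed for the likelihood are purely for illustration.

    ```python
    # Evaluate the three "best" criteria for one candidate line on toy data.
    # The data, the candidate (b0, b1), and the normal errors with sd = 1 used
    # for the likelihood are all illustrative assumptions.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([4.8, 4.1, 3.2, 2.9, 1.9])

    b0, b1 = 5.5, -0.7                    # one candidate intercept and slope
    residuals = y - (b0 + b1 * x)

    loglik = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * residuals**2)  # definition 1
    sad = np.sum(np.abs(residuals))       # definition 2: sum of absolute residuals
    sse = np.sum(residuals**2)            # definition 3: sum of squared residuals

    print(f"log-likelihood = {loglik:.3f}, sum |e| = {sad:.3f}, sum e^2 = {sse:.3f}")
    ```

    Each definition corresponds to searching over candidate \((b_0, b_1)\) pairs for the one that optimizes its own criterion, so the three searches can settle on different lines.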

     

    Note

    While different estimation methods will usually give different estimates, the substantive conclusions will rarely differ meaningfully in a well-formed model.

     

    Note that the result will be a line represented by

    \begin{equation}
    \hat{y} = b_0 + b_1 x \label{eq:lm2a-model}
    \end{equation}

    Using Latin characters indicates that these are based on your particular sample; they are sample estimates. Contrast this with using Greek characters to indicate population parameters. The "hat" on the \(y\) indicates that this is an estimate. All together, this is our model equation. It is the equation of the line of best fit based on the data you collected.

    The first definition leads to "maximum likelihood estimation," which will be covered in Chapter 12. It is an excellent technique that can be generalized to many more settings than can ordinary least squares. Its greatest strength is that it makes use of the researcher's greater understanding of the data-generating process (Chapter 14 to Chapter 18). Its greatest weakness is the mathematics involved.

    The second definition leads to a type of robust regression frequently termed "median regression." This method is helpful when there are outliers in the data that you cannot (or should not) remove. The drawback is that estimating the two parameters (\(\beta_0\) and \(\beta_1\)) has no closed-form solution. In other words, estimation requires a repetitive sequence of steps and can only approximate the estimates. Furthermore, that approximation process is computationally intensive. Because of this, median regression was little used until recently, and the statistical theory behind it is not as well explored as that of other methods. We will see this in Chapter 11: Quantile Regression.
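
    As a rough illustration of that repetitive, approximate process, the sketch below minimizes the sum of absolute residuals with a general-purpose numerical optimizer on made-up data (including one outlier); dedicated quantile-regression software uses faster, specialized algorithms.

    ```python
    # Approximate "median regression" estimates by numerically minimizing the
    # sum of absolute residuals. The data are made up, and this generic
    # optimizer stands in for the specialized algorithms used in practice.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([4.8, 4.1, 3.2, 2.9, 12.0])   # the last point is an outlier

    def sum_abs_resid(params):
        b0, b1 = params
        return np.sum(np.abs(y - (b0 + b1 * x)))

    # The optimizer repeatedly adjusts (b0, b1) until the criterion stops
    # improving -- an approximation, not a closed-form answer.
    fit = minimize(sum_abs_resid, x0=[0.0, 0.0], method="Nelder-Mead")
    print(fit.x)                                # approximate (b0, b1)
    ```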

    The most popular definition of "best," and the one that starts our journey, is the final definition. It leads to an estimation method called "ordinary least squares" (OLS). It is rather straightforward to minimize a sum of squared values using differential calculus. One strength is that this process produces an equation — a closed-form solution with no need for iteration. This means that the process returns mathematically exact values. The drawback is that OLS is limited in the types of processes it can model.
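
    As a preview of where that calculus leads (the derivation appears later in the chapter), minimizing the sum of squared residuals produces the standard closed-form estimators

    \begin{equation}
    b_1 = \frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}
    \end{equation}

    where \(\bar{x}\) and \(\bar{y}\) are the sample means.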

    We start exploring ordinary least squares immediately.
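
    As a quick sketch of those formulas in action, here they are applied to a small made-up dataset, ending with a prediction at a new (also made-up) value of \(x\):

    ```python
    # Compute the OLS slope and intercept "by hand" and make one prediction.
    # The data and the new x value are made up for illustration.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([4.8, 4.1, 3.2, 2.9, 1.9])

    x_bar, y_bar = x.mean(), y.mean()

    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
    b0 = y_bar - b1 * x_bar                                            # intercept

    y_hat = b0 + b1 * 3.5        # fitted value for a new x of 3.5
    print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, prediction at x = 3.5: {y_hat:.3f}")
    ```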

     

      

    Learning Objectives

    By the end of this chapter, you should be able to:

    1. Conceptual Foundations & Scalar Representation
      • Define a simple linear regression model in both words and its fundamental scalar equation form: \(Y_i = \beta_0 + \beta_1\ x_i + \varepsilon_i\).

      • Interpret the meaning of each component in the regression equation — the outcome (\(Y\)), predictor (\(x\)), intercept (\(\beta_0\)), slope (\(\beta_1\)), and error term (\(\varepsilon\)) — within the context of a real-world example.

      • Translate a research question about a relationship between two variables into a testable regression framework.

    2. Estimation & First Results
      • Explain the core principle of Ordinary Least Squares (OLS): finding the line that minimizes the sum of the squared vertical distances (errors) between observed data points and the regression line.

      • Calculate the OLS estimates for the slope (\(b_1\)) and intercept (\(b_0\)) using their standard formulas, given a small dataset.

      • Construct the estimated regression equation (\(\hat{Y} = b_0 + b_1\, x\)) and use it to make a prediction for a new value of \(x\).

    3. Model Assumptions & Their Meaning
      • List and describe the four key assumptions of the Classical Linear Model for the simple regression case: Linearity, Independence, Homoscedasticity, and Normality of errors.

      • Explain why these assumptions are necessary for OLS estimators to have desirable properties (like being BLUE: Best Linear Unbiased Estimators).

      • Recognize potential violations of these assumptions (e.g., patterns in a residual plot) and articulate their implications for the validity of the model's results.

    4. Model Fit & Interpretation
      • Define the Proportional Reduction in Error (PRE) and explain how it quantifies the improvement of the regression model over a simple baseline model (like predicting using the mean of \(Y\)).

      • Calculate and interpret the R-squared statistic as a specific PRE measure, explaining what it reveals about the strength of the linear relationship.

      • Distinguish between the goodness-of-fit of a model (e.g., R-squared) and the statistical significance of its individual parameters.

    5. Synthesis & Communication
      • Perform a complete, beginner-level regression analysis: from stating the model, to estimating coefficients, checking assumptions, and interpreting the model's fit.

      • Communicate the results of a simple linear regression clearly in writing, summarizing the estimated relationship, its practical significance, and the model's limitations.


    Chapter Sections

     

     


    This page titled 3: Intro to Linear Regression is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
