7: Introduction to Linear Regression

Linear regression is a very powerful statistical technique. Many people have some familiarity with regression just from reading the news, where graphs with straight lines are overlaid on scatterplots. Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.

Figure 7.1 shows two variables whose relationship can be modeled perfectly with a straight line. The equation for the line is

\[y = 5 + 57.49x\]

Imagine what a perfect linear relationship would mean: you would know the exact value of y just by knowing the value of x. This is unrealistic in almost any natural process. For example, if we took family income x, this value would provide some useful information about how much financial support y a college may offer a prospective student. However, there would still be variability in financial support, even when comparing students whose families have similar financial backgrounds.
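To make the idea of a perfect linear relationship concrete, the short sketch below evaluates the line given for Figure 7.1. It assumes, following the figure's caption, that x is the number of shares in an order and y is the total cost in dollars. The function name `total_cost` and the share counts are made up for illustration and are not the twelve orders in the figure.

```python
def total_cost(shares):
    """Evaluate y = 5 + 57.49 * x, the line given for Figure 7.1.

    Assumes x is the number of shares and y is the total cost in dollars.
    """
    return 5 + 57.49 * shares


# Hypothetical order sizes, chosen only for illustration.
for shares in [1, 10, 25]:
    print(f"{shares:>3} shares -> total cost {total_cost(shares):.2f}")
```

Because the relationship is exact, knowing x determines y with no leftover variability. That is what fails in the financial support example.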

Linear regression assumes that the relationship between two variables, x and y, can be modeled by a straight line:

\[y = \beta _0 + \beta _1x \tag{7.1}\]

where \(\beta _0\) and \(\beta _1\) represent two model parameters (\(\beta\) is the Greek letter beta). These parameters are estimated using data, and we write their point estimates as \(b_0\) and \(b_1\). When we use x to predict y, we usually call x the explanatory or predictor variable, and we call y the response.
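As a rough illustration of estimating the parameters from data, the sketch below simulates observations scattered around a known line and then computes point estimates. It uses the least squares criterion, which is only one possible line-fitting criterion and is an assumption here. The "true" values of the parameters, the noise level, and all variable names are invented for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed "true" parameters, chosen only for illustration.
beta0, beta1 = 5.0, 2.0

# Simulated data: points on the line plus random scatter.
xs = [i * 0.2 for i in range(51)]  # explanatory variable, 0 to 10
ys = [beta0 + beta1 * x + rng.gauss(0, 1.5) for x in xs]  # response

# Least squares point estimates (one possible criterion, assumed here).
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx           # estimate of the slope
b0 = y_bar - b1 * x_bar  # estimate of the intercept

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")

# Use the fitted line to predict the response at a new x value.
x_new = 4.0
y_hat = b0 + b1 * x_new
print(f"predicted y at x = {x_new}: {y_hat:.2f}")
```

The estimates will be close to, but not exactly equal to, the values used to generate the data, which is the distinction between the parameters and their point estimates.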

It is rare for all of the data to fall on a straight line, as seen in the three scatterplots in Figure 7.2. In each case, the data fall around a straight line, even if none of the observations fall exactly on the line. The first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between x and y. The second plot shows an upward trend that, while evident, is not as strong as the first. The last plot shows a very weak downward trend in the data, so slight we can hardly notice it. In each of these examples, we will have some uncertainty regarding our estimates of the model parameters, \(\beta _0\) and \(\beta _1\). For instance, we might wonder, should we move the line up or down a little, or should we tilt it more or less?
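One way to get a feel for this uncertainty is to simulate it. The sketch below repeatedly generates data around the same downward-sloping line, fits a line to each simulated data set, and compares how much the fitted slope varies when the scatter is small versus large. The least squares fit, the parameter values, the noise levels, and the helper names `fit_line` and `simulate_fit` are all assumptions made for illustration. They are not taken from the data behind Figure 7.2.

```python
import random


def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line (an assumed criterion)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def simulate_fit(noise_sd, rng, beta0=10.0, beta1=-1.0, n=30):
    """Simulate n points around a hypothetical line and fit a line to them."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0, noise_sd) for x in xs]
    return fit_line(xs, ys)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
spreads = {}
for noise_sd in (0.5, 5.0):
    # Fit 200 simulated data sets and record the range of slope estimates.
    slopes = [simulate_fit(noise_sd, rng)[1] for _ in range(200)]
    spreads[noise_sd] = max(slopes) - min(slopes)
    print(f"noise sd {noise_sd}: slope estimates span {spreads[noise_sd]:.2f}")
```

With little scatter the fitted slopes cluster tightly, while with heavy scatter they vary much more from one data set to the next. This mirrors the question of how much the line could be tilted and still describe the data reasonably.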

Figure 7.1: Requests from twelve separate buyers were simultaneously placed with a trading company to purchase Target Corporation stock (ticker TGT, April 26th, 2012), and the total cost of the shares was reported. Because the cost is computed using a linear formula, the linear fit is perfect.

Figure 7.2: Three data sets where a linear model may be useful even though the data do not all fall exactly on the line.

As we move forward in this chapter, we will learn different criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.

We will also see examples in this chapter where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful. One such case is shown in Figure 7.3 where there is a very strong relationship between the variables even though the trend is not linear. We will discuss nonlinear trends in this chapter and the next, but the details of fitting nonlinear models are saved for a later course.

Contributors

  • David M Diez (Google/YouTube)
  • Christopher D Barr (Harvard School of Public Health)
  • Mine Çetinkaya-Rundel (Duke University)