
6.2: Functional Fit


A second requirement of ordinary least squares (OLS) is that the expected value of the residuals is constant (and zero). Intuitively, this means that the residuals randomly bounce above and below our estimates (the regression curve). If the residuals fall above (or below) our estimates more frequently than expected, then the curve should be moved up (or down) to provide a "better" fit.

We used this requirement in many places in prior chapters. It is entirely equivalent to the assumption that the underlying model (the expected/predicted values) consistently fits the data; that is, that there is no systematic error.


Be aware, however, that the mathematics behind OLS forces the average residual to be zero. This means two things. First, the OLS model is "self-correcting," in that the "line of best fit" provides the best linear fit to the data. Second, it is impossible to detect a systematic error in the measurements from the residuals alone.
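
To see this forcing in action, here is a minimal sketch (the data-generating process is an arbitrary choice of mine): even when the errors carry a hidden constant shift, the fitted residuals still average to zero, with the shift absorbed into the intercept.

set.seed(370)

x = runif(50)
e = rnorm(50) + 3          ## errors with a hidden mean of 3
y = 1 + 2*x + e

mod = lm(y ~ x)
mean( residuals(mod) )     ## numerically zero, forced by OLS
coef(mod)[1]               ## the shift of 3 is absorbed into the intercept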

    Graphical Check

Graph the residuals against each of the independent variables and look for non-linear patterns in the plot (parabolas, cubics, etc.). If such a pattern exists, your model is misspecified. A fix is to transform the independent variable to eliminate that pattern. This is one place where the graphical "test" is superior to the numeric test: if you can identify the pattern, you have the fix. For instance, if the residuals have the pattern in Figure \(\PageIndex{1}\), then the solution may be to use \(x^2\) in place of (or in addition to) \(x\). To see this, run the following code to obtain Figure \(\PageIndex{1}\).

set.seed(370)

x = seq(0, 3, length=20)     ## the independent variable
n = length(x)
e = rnorm(n)                 ## random noise
y = 4 + 2*x^2 + e            ## the truth is quadratic in x

mod = lm(y ~ x)              ## ... but we fit a model linear in x
E = residuals(mod)

plot(x, E)                   ## residuals plot
    

    Note the strong quadratic shape to this residuals plot (Figure \(\PageIndex{1}\)). This strongly suggests that the model is misspecified.

Figure \(\PageIndex{1}\): A residuals plot for a misspecified model. Note that the residuals show a definite quadratic form. Fixing this issue may be as simple as including \(x^2\) as an additional independent variable.
    Note

As an aside, note that there are only three "runs" in the residuals. From the discussion of the runs test, a run is a maximal sequence of consecutive values on one side of the prediction line. In the figure above, the first run consists of the first 5 values; the second, the next 12 values; and the third, the last 3 values.

Since the number of runs has a known distribution under the null hypothesis (each residual's sign behaving as an independent Bernoulli trial), we can calculate the probability of observing this number of runs. Thus, we can calculate a p-value for the hypothesis that the model is properly specified.
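
To make that counting concrete, here is a minimal base-R sketch that tallies the runs in the residuals from the misspecified model above (it assumes the x and E objects from the previous code block still exist, and that no residual is exactly zero):

sgn  = sign( E[order(x)] )   ## residual signs, in the order of x
runs = rle(sgn)              ## collapse into maximal runs of one sign

length(runs$lengths)         ## the number of runs
runs$lengths                 ## the length of each run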

This example clearly shows that the model is misspecified. There is still information contained in the residuals, and it would be wrong to ignore it.

    Since the residuals plot has a prominent quadratic shape, a solution could be to include \(x^2\) in the model:

x2 = x^2                     ## the new independent variable
mod = lm(y ~ x + x2)         ## the re-specified model
E2 = residuals(mod)

plot(x, E2)                  ## residuals plot against x
plot(x2, E2)                 ## residuals plot against x2
    

With this change, we re-examine the residuals: one residuals plot for each independent variable. Since we now have two independent variables (\(x\) and \(x^2\)), we need to examine two residuals plots (Figure \(\PageIndex{2}\)). Note that the transformation was successful. Neither plot shows anything other than what appears to be random bouncing across the line \(y=0\).

Figure \(\PageIndex{2}\): Residuals plots for the properly specified model. There is one residuals plot per independent variable. Note that neither plot suggests anything other than a lack of pattern in the residuals.
    Caution

There is a tendency to feel sad when some requirement is not met by the model/data, as above. However, do not feel sad. Feel happy, because you have learned something new about the relationships in the data! We know more!

    Celebrate! 🥳

    A Numeric Test

That last sentence leads us to a numeric test. Compare the plots of Figures \(\PageIndex{1}\) and \(\PageIndex{2}\). The residual is colored blue if it is positive; pink, otherwise. In Figure \(\PageIndex{1}\), there are long, unbroken streaks (runs) of blue and pink. In Figure \(\PageIndex{2}\), on the other hand, the length of those runs is much reduced and their number is increased.

The test suggested by the above is, unsurprisingly, called the runs test. It is implemented in the lawstat package as the function runs.test. It takes just one piece of information: the residuals, in the order of the independent variable. This test is also implemented in the randtests and snpar packages, as well as in the KnoxStats package, which is the one I use in this book.

The null hypothesis of the runs test is that successive values are independent and have mean \(0\).

    Here, I demonstrate the runs.test function in the KnoxStats package. Note that this function currently requires the lawstat package to be installed.

    library(KnoxStats)
    set.seed(370)
    
    x = runif(100)
    e = rnorm(100)
    
    runs.test(e, order=x)      ## The runs test
    

    In this version of the runs.test function, the first slot goes to the residuals, and the second slot goes to the independent variable.

    The output of this code is

            Runs Test - Two sided
    
    data:  e, as ordered by x
    Standardized Runs Statistic = 1.6081, p-value = 0.1078
    

As usual, check the p-value. Since the p-value of \(0.1078\) is greater than the \(\alpha\) level of \(0.05\), we fail to reject the null hypothesis that the expected value of the residuals is constant. Thus, the model passes this test.
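
For intuition about what the packaged function computes, here is a minimal base-R sketch of the standardized (Wald-Wolfowitz) runs statistic. This is my own reconstruction from the usual formulas, not the KnoxStats source, and the packaged version may differ in details such as continuity corrections. The function name runsStat is hypothetical.

## A sketch of the standardized runs statistic (assumes no residual is exactly zero)
runsStat = function(e, order.by) {
  s  = sign( e[order(order.by)] )    ## signs, in the order of the independent variable
  R  = length( rle(s)$lengths )      ## the observed number of runs
  n1 = sum(s > 0)
  n2 = sum(s < 0)
  n  = n1 + n2

  muR  = 1 + 2*n1*n2/n                            ## E[R] under the null
  varR = 2*n1*n2*(2*n1*n2 - n)/( n^2 * (n - 1) )  ## Var[R] under the null

  z = (R - muR)/sqrt(varR)                        ## standardized statistic
  c(runs = R, z = z, p.value = 2*pnorm(-abs(z)))  ## two-sided p-value
}

runsStat(e, order.by = x)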

    Exploration of the Effects of Non-Constant Expected Value

    To see the effect of a non-constant expected value, let us revisit one of the proofs from Section 3.3: Our First Assumptions.

    What is the expected value of \(b_1\)?

    Proof.

    \begin{align}
    E[b_1] &= E\left[ \frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\overline{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\right] \\[1em]
    &= E\left[ \frac{\sum_{i=1}^n (x_i-\bar{x})Y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} \right] \\[1em]
    &= \frac{\sum_{i=1}^n (x_i-\bar{x})E\left[Y_i\right]}{\sum_{i=1}^n (x_i - \bar{x})^2} \\[1em]
    &= \frac{\sum_{i=1}^n (x_i-\bar{x})(\beta_0 + \beta_1 x_i + E\left[\varepsilon_i\right])}{\sum_{i=1}^n (x_i - \bar{x})^2} \\[1em]
&= \beta_0\frac{\sum_{i=1}^n (x_i-\bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} + \beta_1 \frac{\sum_{i=1}^n (x_i-\bar{x})x_i}{\sum_{i=1}^n (x_i - \bar{x})^2} + \frac{\sum_{i=1}^n (x_i-\bar{x})E\left[\varepsilon_i\right]}{\sum_{i=1}^n (x_i - \bar{x})^2} \\[1em]
    &= \beta_0\frac{0}{\sum_{i=1}^n (x_i - \bar{x})^2} + \beta_1 \frac{\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} + \frac{\sum_{i=1}^n (x_i-\bar{x}) E\left[\varepsilon_i\right]}{\sum_{i=1}^n (x_i - \bar{x})^2} \\[1em]
    &= \beta_1 + \frac{\sum_{i=1}^n (x_i-\bar{x}) E\left[\varepsilon_i\right]}{\sum_{i=1}^n (x_i - \bar{x})^2}
    \end{align}

Now, this last line reduces to \(\beta_1\) if the expected value of \(\varepsilon\) does not depend on \(x\). So, if the expected value is constantly zero, the OLS estimator of \(\beta_1\) is clearly unbiased. I leave it as an exercise to show that if it is constant but non-zero, the OLS estimator of \(\beta_1\) remains unbiased.

    If, however, the assumption of a constant expected residual value is violated, the OLS estimate of \(\beta_1\) is biased. This is not a good thing. It means that your predictions are wrong... even "on average."
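
A short simulation can make this bias visible. In this sketch (the data-generating process is my own, for illustration), the expected error depends on \(x\), and the slope estimates no longer center on the true \(\beta_1 = 2\):

set.seed(370)

b1 = numeric(1000)
for(i in 1:1000) {
  x = runif(30)
  e = rnorm(30) + 3*x^2      ## E[eps] depends on x: the assumption is violated
  y = 1 + 2*x + e
  b1[i] = coef( lm(y ~ x) )[2]
}

mean(b1)                     ## far from the true slope of 2: b1 is biased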

    But, what about the OLS estimator of \(\beta_0\), the y-intercept? What effect does a non-zero expected residual value have on it? To see, let us revisit the proof of the unbiasedness of \(b_0\).

    \begin{align}
    E\left[ b_0\right] &= E\left[ \overline{Y} - \bar{x} b_1\right] \\[1em]
    &= E\left[ \overline{Y} \right] - \bar{x}\, E\left[ b_1\right] \\[1em]
    &= \left(\beta_0 + \bar{x}\beta_1 + E\left[ \varepsilon_i\right] \right) - \bar{x} \left(\beta_1 + \frac{\sum_{i=1}^n (x_i-\bar{x}) E\left[ \varepsilon_i\right]}{\sum_{i=1}^n (x_i - \bar{x})^2}\right) \\[1em]
    E\left[ b_0\right] &= \beta_0 + E\left[ \varepsilon_i\right] - \bar{x} \frac{\sum_{i=1}^n (x_i-\bar{x}) E\left[ \varepsilon_i\right]}{\sum_{i=1}^n (x_i - \bar{x})^2}
    \end{align}

Thus, there are two places where a non-constant expected value ("poor model fit") affects the OLS estimator of \(\beta_0\). The result above shows:

• If \(E[\varepsilon_i]=0\), that is, if the expected value of the errors is zero, then \(b_0\) is unbiased for \(\beta_0\).
• However, if \(E[\varepsilon_i]\) is constant but non-zero, then \(E[b_0] = \beta_0 + E[\varepsilon_i]\).
• And, if \(E[\varepsilon_i]\) is a function of \(x\), then \(E[b_0]\) is \(\beta_0 + E[\varepsilon_i]\) plus a term that depends on the values of \(x\).

    I leave it as an exercise for you to show that \(E[b_0] = \beta_0 + E[\varepsilon_i]\) if \(\bar{x}=0\), regardless of whether the residuals are correlated with the independent variable. This is yet another reason some disciplines center their data before analyzing it.
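
The following sketch illustrates both the second bullet and the centering remark (the specific numbers are arbitrary choices of mine): with a centered \(x\) and a constant error shift of \(3\), the intercept estimates center on \(\beta_0 + 3\), while the slope estimates remain centered on \(\beta_1\).

set.seed(370)

b0 = numeric(1000)
b1 = numeric(1000)
for(i in 1:1000) {
  x = runif(30)
  x = x - mean(x)            ## center the independent variable
  e = rnorm(30) + 3          ## a constant, non-zero expected error
  y = 1 + 2*x + e

  b = coef( lm(y ~ x) )
  b0[i] = b[1]
  b1[i] = b[2]
}

mean(b0)                     ## approximately 1 + 3 = 4: beta0 plus the error shift
mean(b1)                     ## approximately 2: the slope remains unbiased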

    Caution

The actual assumption is that the expected value of the residuals is constantly zero. Again, because of the mathematics of OLS, the mean residual is guaranteed to be zero. So, there is no way to test whether the expected value of the residuals is constantly zero, only whether it is constant.

    Of all assumptions/requirements, this is the most important to meet. If your residuals depend on the value of \(x\), then both the \(b_0\) and \(b_1\) estimators are biased. If the expected value of the residuals is not zero, then just the \(b_0\) estimator is biased.

    It is even worse. Because OLS mathematically forces \(\bar{e}=0\), one cannot test if the expected value of the residuals really is zero. One must rely on the assumption that the data were collected without systematic error. That is, the statistician must trust the scientist to measure things correctly.

    Question

When would the residuals be a function of the independent variable?

This is an excellent question that you need to grapple with. It fundamentally means that you are missing an important variable from your model. Perhaps that missing variable is not an independent variable at all: perhaps it is a confounding variable, or perhaps it is another dependent variable, which requires multivariate regression (beyond the scope of this book). It definitely means your model is weak and should be rejected as incomplete (if possible). The sketch below illustrates the simplest case, an omitted variable.
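
In this sketch (the setup is hypothetical and of my own construction), the model omits a relevant variable z, and the residuals inherit its pattern:

set.seed(370)

x = runif(100)
z = runif(100)               ## a relevant variable we "forget"
y = 1 + 2*x + 4*z + rnorm(100)

mod = lm(y ~ x)              ## z is omitted from the model
E   = residuals(mod)

cor(E, z)                    ## the residuals carry z's signal
plot(z, E)                   ## a clear upward trend, not random bouncing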

