
3.1: Scalar Representation


    This section and the next show how our definition of "best" leads, mathematically, to specific results. That derivation can be done by representing the regression problem in scalar or in matrix form. At one level, there is no difference between the two representations. At another level, one representation may make certain proofs, and the understanding behind them, easier and more apparent. And so, let us begin with the scalar representation of the regression problem. From experience, it seems to make more sense than starting with the matrix representation (Chapter 4: Matrices and Linear Regression).

    ✦•················• ✦ •··················•✦

    [Scatter plot of sample data, a fitted line, and the residuals]
    Figure \(\PageIndex{1}\): Sample data and a line of best fit for that data. Also marked are the residuals, the differences between what was observed (dots) and what is predicted by the model (line). This particular line of best fit minimizes the sum of squared residuals.

    Finding the OLS Line

    Ordinary least squares estimation defines "best" as "having the lowest sum of squared errors." Each error (or residual) is the vertical distance between an observed point and the corresponding point on the line (see Figure \(\PageIndex{1}\)). With this explanation of what we mean by "error" in this context, we can use our third definition of "best" to obtain the OLS estimators of \(\beta_0\) and \(\beta_1\).

    Remember that to optimize (maximize or minimize) a function using calculus, one takes its derivative(s) with respect to the parameter(s) of interest, sets the resulting equations equal to 0, then solves the system of equations.

    Note

    Also, one should perform the second derivative test to determine the type of optimization point found: minimum, maximum, or saddle point (neither).

    And so, the first step is to form the objective function that we want to minimize. Since we seek to minimize the sum of squared errors, that objective function, \(Q\), is the sum of squared errors:
    \begin{align}
    Q &= \sum_{i=1}^n \varepsilon_i^2 \\[1em]
    &= \sum_{i=1}^n \left( y_i - \hat{y}_i\right)^2 \\[1em]
    &= \sum_{i=1}^n \big(\, y_i - \left(\beta_0 + \beta_1 x_i\right)\, \big)^2 \\[1em]
    &= \sum_{i=1}^n \big( y_i - \beta_0 - \beta_1 x_i \big)^2
    \end{align}
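
    For readers who like to see a formula as code, here is a minimal Python sketch of this objective function. The function name sse and its argument names are my own labels, not part of the text; the function simply evaluates \(Q\) for a candidate intercept and slope.

        def sse(b0, b1, xs, ys):
            """Sum of squared errors Q for a candidate intercept b0 and slope b1."""
            return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

    Ordinary least squares chooses the \(b_0\) and \(b_1\) that make this quantity as small as possible.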

    OLS Estimator of \(\beta_0\)

    Now that we have the objective function, \(Q\), we take its derivative with respect to each parameter, set it equal to 0, and solve for that parameter.

    Let us start with \(\beta_0\):

    \begin{align}
    \frac{\partial}{\partial \beta_0}Q &= \sum_{i=1}^n -2\left( y_i - \beta_0 - \beta_1 x_i \right) \\[1em]
    0 &\stackrel{\text{set}}{=} \sum_{i=1}^n -2\left( y_i - b_0 - b_1 x_i \right) \\
    \text{Dividing both sides by \(-2\) and distributing the sum, we have} \hfill & \nonumber \\
    0 &= \sum_{i=1}^n y_i - \sum_{i=1}^n b_0 - b_1 \sum_{i=1}^n x_i \\[1em]
    &= n\bar{y} - nb_0 - nb_1 \bar{x}
    \end{align}

    This immediately leads to

    \[b_0 = \bar{y} - b_1\bar{x} \]

    This is called the OLS estimator of \(\beta_0\). Note that this formula needs \(\bar{x}\) and \(\bar{y}\), both of which are easily calculated from the data. This formula also needs \(b_1\), which is the OLS estimate of \(\beta_1\). Thus, we will need to determine the value of \(b_1\) before we can use this formula.

    OLS Estimator of \(\beta_1\)

    And so, to obtain a formula for \(b_1\), we take the derivative of \(Q\) with respect to the second parameter, \(\beta_1\):

    \begin{align}
    \frac{\partial}{\partial \beta_1}Q &= \sum_{i=1}^n -2x_i \left( y_i - \beta_0 - \beta_1 x_i \right) \\[1em]
    0 &\stackrel{\text{set}}{=} \sum_{i=1}^n -2x_i \left( y_i - b_0 - b_1 x_i \right) \\
    \text{Dividing both sides by \(-2\) and distributing the sum, we have} \hfill & \nonumber \\
    0 &= \sum_{i=1}^n x_i y_i - b_0\sum_{i=1}^n x_i - b_1 \sum_{i=1}^n x_i^2 \\[1em]
    &= \sum_{i=1}^n x_i y_i - n b_0 \bar{x} - b_1 \sum_{i=1}^n x_i^2 \\
    \text{Substituting our estimator \(b_0\), we have} \hfill & \nonumber \\
    0 &= \sum_{i=1}^n x_i y_i - \left(\bar{y} - b_1\bar{x}\right) n\bar{x} - b_1 \sum_{i=1}^n x_i^2 \\[1em]
    &= \sum_{i=1}^n x_i y_i - n\bar{x}\,\bar{y} + b_1 n\bar{x}^2 - b_1 \sum_{i=1}^n x_i^2 \\[1em]
    b_1 \left( \sum_{i=1}^n x_i^2 - n\bar{x}^2 \right) &= \sum_{i=1}^n x_i y_i - n\bar{x}\,\bar{y} \\
    \text{Finally, we have} \hfill & \nonumber \\
    b_1 &= \frac{\displaystyle\sum_{i=1}^n x_i y_i - n\bar{x}\, \bar{y}}{\displaystyle\sum_{i=1}^n x_i^2 - n\bar{x}^2} \label{eq:lm2a-penultimateOLSstep}
    \end{align}

    The Two Estimators

    Thus, the two OLS estimators of \(\beta_0\) and \(\beta_1\) are

    \begin{equation} \left\{
    \begin{array}{ll}
    b_0 &= \bar{y} - b_1\bar{x} \\[1em]
    b_1 &= \displaystyle \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\, \bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2} \\
    \end{array} \right. \label{eq:olsEstimators}
    \end{equation}
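
    As a rough sketch of how these two formulas translate into code (the function name ols_estimates and the paired lists xs and ys are my own labels, not part of the text):

        def ols_estimates(xs, ys):
            """OLS estimates (b0, b1) computed directly from the formulas above."""
            n = len(xs)
            xbar = sum(xs) / n
            ybar = sum(ys) / n
            numerator = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
            denominator = sum(x ** 2 for x in xs) - n * xbar ** 2  # must be non-zero (see below)
            b1 = numerator / denominator
            b0 = ybar - b1 * xbar
            return b0, b1

    Note that the denominator in this sketch is exactly the quantity discussed next.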

    Note that this mathematical process had but one requirement:

    \begin{equation}
    \sum_{i=1}^n x_i^2 - n\bar{x}^2 \ne 0
    \end{equation}

    If that requirement is not met by the data, then the denominator of \(b_1\) in Equation \(\ref{eq:lm2a-penultimateOLSstep}\) is zero, which leads to dividing by zero, Armageddon, and a really bad hair day. However, note that

    \begin{equation}
    \sum_{i=1}^n x_i^2 - n\bar{x}^2 = (n-1)\, s^2_x
    \end{equation}

    As such, this requirement is met when the variance of \(x\) is non-zero. In other words, we require that the independent variable varies.

    Note

    For a mathematician, this is an important observation. For a statistician, it gives insight into how to "break" OLS: Measure all of the observations using the same value of the independent variable. In other words, to a statistician, the steps have meaning beyond the mathematics.
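
    To see that failure mode concretely, here is a tiny hypothetical check in Python: if every observation shares the same value of the independent variable, the denominator of \(b_1\) collapses to zero.

        xs = [5, 5, 5, 5]  # no variation in the independent variable
        n = len(xs)
        xbar = sum(xs) / n
        print(sum(x ** 2 for x in xs) - n * xbar ** 2)  # 0.0, so computing b1 would require dividing by zero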

    Also note that some sources will give the formula for \(b_1\) as:

    \begin{equation}
    b_1 = \frac{\displaystyle\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle \sum_{i=1}^n (x_i - \bar{x})^2}\label{eq:ch2-sxx}
    \end{equation}

    I leave it as an exercise to show these:

    Exercise \(\PageIndex{1}\)

    \(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\) is equivalent to \(\sum_{i=1}^n x_i y_i - n\bar{x}\, \bar{y}\)

    Exercise \(\PageIndex{2}\)

    \( \sum_{i=1}^n (x_i - \bar{x})^2\) is equivalent to \(\sum_{i=1}^n x_i^2 - n\bar{x}^2 \)
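
    Neither exercise requires a computer, but a quick numerical check (which is not a proof) can confirm that the expressions agree on a particular data set. The data below are arbitrary values chosen only for illustration; the sketch uses Python's statistics module for the sample variance.

        import statistics

        xs = [1.0, 2.0, 4.0, 7.0]
        ys = [2.0, 3.0, 5.0, 11.0]
        n = len(xs)
        xbar = sum(xs) / n
        ybar = sum(ys) / n

        # Exercise 1: the two forms of the numerator
        print(sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)))  # 31.5
        print(sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar)  # 31.5

        # Exercise 2, plus the variance identity: the two forms of the denominator
        print(sum((x - xbar) ** 2 for x in xs))           # 21.0
        print(sum(x ** 2 for x in xs) - n * xbar ** 2)    # 21.0
        print((n - 1) * statistics.variance(xs))          # 21.0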

    Definition: Sum of Squared Deviations, Sxx

    We will come across the denominator of equation \(\ref{eq:ch2-sxx}\) in many settings. Thus, to save ink, we will symbolize it as \(S_{xx}\) and define it as:

    \begin{equation}
    S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2
    \end{equation}

    Thus, our OLS line of best fit is the line defined by the set of points \((x, \hat{y})\), where

    \begin{equation}
    \hat{y} = b_0 + b_1 x \label{eq:lm2-olsModel}
    \end{equation}

    Note that \(\hat{y}\) is the expected value of \(Y\) (dependent variable), given that value of \(x\) (independent variable). In other words,
    \begin{equation}
    \hat{y}_i = E[Y | x_i]
    \end{equation}

    It is the conditional mean of \(Y\) given \(x_i\); the expected value of \(Y\), given this value of \(x_i\); the mean of \(Y\) when the independent variable has value \(x_i\).

    Note

    There is a difference between an "expected" and a "predicted" value. The expected value is the mean: if you were to rerun the universe a gazillion times, collecting the same amount of data and estimating \(\hat{y}\) each time, the expected value is the average of all of those \(\hat{y}\)s. The predicted value is the value of an additional observation measured at that same value of \(x\).

    That the two are both on the line is happy happenstance — but happenstance nonetheless. The main difference, as you will discover, is in the intervals that surround that value. Intervals on predictions (prediction intervals) are wider than intervals on the mean/expected value (confidence intervals). This is because we are more uncertain about a future value than we are about an average.

    And this is as far as we can go without making additional assumptions. As such, it marks a great place for a toy example.

    A Toy Example

    Example \(\PageIndex{1}\)

    Let us measure two variables on four subjects. Those two variables are \(x\) and \(y\). For the first subject, the value of \(x\) is \(-2\) and the value of \(y\) is \(3\). For the second subject the \(x\) and \(y\) values are \(0\) and \(0\). For the third subject, the values are \(0\) and \(2\). For the fourth subject, they are \(2\) and \(-1\).

    Given this information, let us calculate the ordinary least squares estimators of \(\beta_0\) and \(\beta_1\).

    Solution:
    First, the formulas for \(b_0\) and \(b_1\) require we calculate \(\bar{x}\) and \(\bar{y}\). To do this, we need the data. Here they are

    Sample data
     x    y
    -2    3
     0    0
     0    2
     2   -1

    The means are \(0\) and \(1\), respectively. And, with that, we can use the formula for \(b_1\) (Equation \(\ref{eq:olsEstimators}\)b):

    \begin{align}
    b_1 &= \displaystyle \frac{\displaystyle \sum_{i=1}^n x_i y_i - n\bar{x}\, \bar{y}}{\displaystyle \sum_{i=1}^n x_i^2 - n\bar{x}^2} \\[1em]
    &= \displaystyle \frac{\Big( -2(3) + 0(0) + 0(2) + 2(-1) \Big) - 4(0)1}{ \Big( (-2)^2 + (0)^2 + (0)^2 + (2)^2 \Big) - 4(0)^2} \\[1em]
    &= \displaystyle \frac{-8 - 0}{ \phantom{-}8 - 0 } \\[1em]
    &= -1
    \end{align}

    For the OLS estimator of the intercept, \(\beta_0\), we have (Equation \(\ref{eq:olsEstimators}\)a):
    \begin{align}
    b_0 &= \bar{y} - b_1 \bar{x} \\[1em]
    &= 1 - (-1) 0 \\[1em]
    &= 1
    \end{align}

    Thus, the OLS line of best fit is the line defined by the set of points \((x, \hat{y})\), where
    \begin{equation}
    \hat{y} = 1 - 1x
    \end{equation}
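
    If you would like to check this arithmetic in software, here is a minimal sketch. It assumes the hypothetical ols_estimates function from the sketch earlier in this section.

        xs = [-2, 0, 0, 2]
        ys = [3, 0, 2, -1]
        b0, b1 = ols_estimates(xs, ys)
        print(b0, b1)  # 1.0 -1.0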

    Figure \(\PageIndex{2}\) below shows the points and the OLS line of best fit.

    [Scatter plot of the toy data with the OLS line of best fit]
    Figure \(\PageIndex{2}\): Graphic of the data and the OLS line of best fit for the toy data of Example \(\PageIndex{1}\).

    So, what does the equation mean? It means that the expected value of \(Y\) when \(x=0\) is 1, the y-intercept. It also means that for every one-unit increase in the value of \(x\), the expected value of \(Y\) increases by \(-1\) (that is, decreases by 1), which is the value of the slope.

    To go beyond this rote interpretation, we need to know what the numbers represent. That information was lacking in this toy example.

    Note

    From a scientific standpoint, it is dangerous to interpret the y-intercept when \(x=0\) is outside the observed range of the data (x-values). Models are best when you are trying to understand the relationship within the observed ranges of the independent variable(s). This is interpolation — "inter" from "within."

    Trying to use the model to understand relationships outside the observed values of the independent variables is called "extrapolation," where "extra" means "outside." Extrapolation is dangerous: all curves look linear at a small enough scale (remember Newton's Method from Calculus). Thus, while fitting the data with a line may be a good approximation at one scale, it may not make sense over a wider range, where the non-linearity of the relationship may become more pronounced.

    Another Toy Example: Handedness

    Example \(\PageIndex{2}\)

    Let us measure two variables on four Ruritanian subjects. Those two variables are handedness and time to cut a sheet of paper. Sound familiar? The difference here is that the independent variable is dichotomous.

    Given the information in the table, let us calculate — and interpret — the ordinary least squares estimators of \(\beta_0\) and \(\beta_1\).

    Solution:
    Here are the data:

    Handedness data
    Handedness   Time
    Left            3
    Right           1
    Right           2
    Left            2

    The first thing to do is change our independent variable into a numeric variable. When the variable is dichotomous (has only two possible values), this is easy. Set one value to 0 and the other to 1. So, without loss of generality, let us follow the alphabet and replace Left with \(0\) and Right with \(1\). With this transformation, the values for handedness are now {0, 1, 1, 0} and we can use the same procedure as we used in Example \(\PageIndex{1}\).

    First, the formulas for \(b_0\) and \(b_1\) require we calculate \(\bar{x}\) and \(\bar{y}\). They are \(0.5\) and \(2\), respectively. With that, we can use the formula for \(b_1\) (Equation \(\ref{eq:olsEstimators}\)b):

    \begin{align}
    b_1 &= \displaystyle \frac{\displaystyle\sum_{i=1}^n x_i y_i - n\bar{x}\, \bar{y}}{\displaystyle\sum_{i=1}^n x_i^2 - n\bar{x}^2} \\[1em]
    &= \displaystyle \frac{\Big( 0(3) + 1(1) + 1(2) + 0(2) \Big) - 4(0.5)2}{ \Big( (0)^2 + (1)^2 + (1)^2 + (0)^2 \Big) - 4(0.5)^2} \\[1em]
    &= \displaystyle \frac{3 - 4}{ 2 - 1 } \\[1em]
    &= -1
    \end{align}

    For the OLS estimator of the intercept, \(\beta_0\), we have (Equation \(\ref{eq:olsEstimators}\)a):

    \begin{align}
    b_0 &= \bar{y} - b_1 \bar{x} \\[1em]
    &= 2 - (-1) 0.5 \\[1em]
    &= 2.5
    \end{align}

    Thus, the OLS line of best fit is the line defined by the set of points \((x, \hat{y})\), where

    \begin{equation}
    \hat{y} = 2.5 - 1\ x
    \end{equation}

    Now that we have some context, what does this equation mean?

    Remember that an \(x\)-value of \(0\) indicates we are discussing left-handed people. Thus, the expected value of \(Y\) for the lefties is \(2.5 - (1)0 = 2.5\). The expected value of \(Y\) for right-handed people is \(2.5 - (1)1 = 1.5\).

    Thus, the y-intercept is the expected value for the base level (lefties). The "slope" is the "effect of handedness" (moving from left- to right-handed) on that expected value.

    ───── ⋆⋅☆⋅⋆ ─────

    You have seen an analysis of this type in your past introductory statistics course. This is just the two-sample t-procedure under the guise of linear models.
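
    As a quick numerical illustration of that connection (again assuming the hypothetical ols_estimates sketch from earlier in this section), the intercept from the 0/1 coding is the left-handed group mean, and the slope is the difference between the right-handed and left-handed group means:

        xs = [0, 1, 1, 0]  # Left = 0, Right = 1
        ys = [3, 1, 2, 2]
        b0, b1 = ols_estimates(xs, ys)

        left_mean = sum(y for x, y in zip(xs, ys) if x == 0) / xs.count(0)
        right_mean = sum(y for x, y in zip(xs, ys) if x == 1) / xs.count(1)

        print(b0, b1)                  # 2.5 -1.0
        print(left_mean, right_mean)   # 2.5 1.5
        print(right_mean - left_mean)  # -1.0, the same as b1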

    Note

    Since we can compare the means of two groups in the regression realm (Example \(\PageIndex{2}\), above), can we compare the means of more than two groups? In other words, can we extend linear models to ANOVA? The answer is Yes! In fact, ANOVA is built on a base of linear models, as we will see in the future (4.4: Multicollinearity and Categorical Independent Variables).


    Last Thought

    Please do not forget that all statistical procedures have requirements that must be met. So far, the only requirement is that there is variation in the independent variable. To draw stronger conclusions, such as calculating confidence intervals and testing hypotheses, we will need to impose stronger requirements. We will do that in the future. For now, let us just require that there is variation in the independent variable.


    This page titled 3.1: Scalar Representation is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
