12.1.1: Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the same individuals.
- The predictor variable is labeled on the horizontal or \(x\)-axis.
- The response variable is labeled on the vertical or \(y\)-axis.
How to Interpret a Scatterplot:
- Look for the overall pattern and for deviations from that pattern.
- Look for outliers, individual values that fall outside the overall pattern of the relationship.
- A positive linear relation results when larger values of one variable are associated with larger values of the other.
- A negative linear relation results when larger values of one variable are associated with smaller values of the other.
- A scatterplot shows no association if no obvious pattern is present.
Use technology to make a scatterplot for the following sample data set:
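The original sample data table is not reproduced here, so the following is a minimal sketch of how a scatterplot could be made with Python and matplotlib, using made-up values for hours studied (predictor) and exam score (response):

```python
# Hypothetical data; the textbook's sample data set is not shown here.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

hours_studied = [1, 2, 2, 3, 4, 5, 5, 6]        # predictor, plotted on the x-axis
exam_scores = [62, 65, 70, 74, 78, 81, 86, 90]  # response, plotted on the y-axis

fig, ax = plt.subplots()
ax.scatter(hours_studied, exam_scores)
ax.set_xlabel("Hours Studied")
ax.set_ylabel("Exam Score")
ax.set_title("Exam Score vs. Hours Studied")
fig.savefig("scatterplot.png")
```

Any statistics technology (a graphing calculator, spreadsheet, or statistical software) follows the same pattern: the predictor values go on the horizontal axis and the response values on the vertical axis.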
Correlation Coefficient
The sample correlation coefficient measures the direction and strength of the linear relationship between two quantitative variables. There are several different types of correlation coefficients; we will use the Pearson Product Moment Correlation Coefficient (PPMCC), named after the biostatistician Karl Pearson. We write the lowercase \(r\) for the sample correlation coefficient and the Greek letter \(\rho\), pronounced “rho” (rhymes with “sew”), for the population correlation coefficient.

Interpreting the Correlation:
- A positive \(r\) indicates a positive association (positive linear slope).
- A negative \(r\) indicates a negative association (negative linear slope).
- \(r\) is always between \(-1\) and \(1\), inclusive.
- If \(r\) is close to \(1\) or \(-1\), there is a strong linear relationship between \(x\) and \(y\).
- If \(r\) is close to \(0\), there is a weak linear relationship between \(x\) and \(y\). There may be a non-linear relation or there may be no relation at all.
- Like the mean, \(r\) is strongly affected by outliers. Figure 12-1 gives examples of correlations with their corresponding scatterplots.
Figure 12-1: Scatterplots illustrating a range of correlation values.
When you have a correlation that is very close to \(-1\) or \(1\), then the points on the scatter plot will line up in an almost perfect line. The closer \(r\) gets to \(0\), the more scattered your points become.
Take a moment and see if you can guess the approximate value of \(r\) for the scatterplots below.
Solution
Scatterplot A: \(r = 0.98\), Scatterplot B: \(r = 0.85\), Scatterplot C: \(r = -0.85\).
When \(r\) equals \(-1\) or \(1\), all the points in the scatterplot line up in a straight line. As the points disperse, \(r\) gets closer to zero. The correlation tells you only the direction and strength of a linear relationship. It does not tell you the slope of the line, nor does it detect nonlinear relationships. For instance, in Figure 12-2, three scatterplots are overlaid on the same set of axes. All three data sets have \(r = 1\) even though they have different slopes.
Figure 12-2: Three data sets with different slopes, all with \(r = 1\).
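The fact that a perfectly linear data set has \(r = 1\) regardless of its slope is easy to check numerically. This sketch computes \(r\) directly from its definition, using made-up slopes (none of these numbers come from the book's figures):

```python
def correlation(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    ss_yy = sum((yi - ybar) ** 2 for yi in y)
    return ss_xy / (ss_xx * ss_yy) ** 0.5

x = [1, 2, 3, 4, 5]
# Three perfectly linear data sets with different positive slopes:
results = [correlation(x, [m * xi + 2 for xi in x]) for m in (0.5, 1.0, 3.0)]
# Each r equals 1 (up to floating-point rounding), whatever the slope is.
```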
For the next example in Figure 12-3, \(r = 0\) would indicate no linear relationship; however, there is clearly a non-linear pattern with the data.
Figure 12-3: A clear nonlinear pattern for which \(r = 0\).
Figure 12-4 shows a correlation of \(r = 0.874\), which is close to one, indicating a strong linear relationship. However, there is an outlier, called a leverage point, which is inflating the value of the correlation. If you remove the outlier, then \(r = 0\) and there is no upward or downward trend in the data.
Figure 12-4: A leverage point inflating the correlation; removing it gives \(r = 0\).
Calculating Correlation
To calculate the correlation coefficient by hand we would use the following formula.
\[r = \frac{\sum \left(x_{i} - \bar{x}\right) \left(y_{i} - \bar{y}\right)}{\sqrt{ \sum \left(x_{i} - \bar{x}\right)^{2} \sum \left(y_{i} - \bar{y}\right)^{2} }} = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}\]
Instead of doing all of these sums by hand we can use the output from summary statistics. Recall that the formula for a variance of a sample is \(s_{x}^{2} = \frac{\sum \left(x_{i} - \bar{x}\right)^{2}}{n-1}\). If we were to multiply both sides by the degrees of freedom, we would get \(\sum \left(x_{i} - \bar{x}\right)^{2} = (n-1) s_{x}^{2}\).
We use these sums of squares \(\sum \left(x_{i} - \bar{x}\right)^{2}\) frequently, so for shorthand we will use the notation \(SS_{xx} = \sum \left(x_{i} - \bar{x}\right)^{2}\). The same would hold true for the \(y\) variable; just changing the letter, the variance of \(y\) would be \(s_{y}^{2} = \frac{\sum \left(y_{i} - \bar{y}\right)^{2}}{n-1}\), therefore \(SS_{yy} = (n-1) s_{y}^{2}\).
The numerator of the correlation formula multiplies, for each data point, its horizontal distance from the mean of the \(x\) values by its vertical distance from the mean of the \(y\) values, and then sums these products. This is time-consuming to find by hand, so we use the algebraically equivalent formula \(\sum \left(x_{i} - \bar{x}\right) \left(y_{i} - \bar{y}\right) = \sum (xy) - n \cdot \bar{x} \bar{y}\), and for short we write \(SS_{xy} = \sum (xy) - n \cdot \bar{x} \bar{y}\).
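Both shortcut identities are easy to verify numerically. The sketch below uses made-up data and the Python `statistics` module to check that \(SS_{xx} = (n-1)s_x^2\) and that \(SS_{xy} = \sum(xy) - n\bar{x}\bar{y}\) match the definitional sums:

```python
import statistics

# Made-up data for illustration only
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0]
n = len(x)
xbar, ybar = statistics.mean(x), statistics.mean(y)

# SS_xx from the definition vs. from the sample variance
ss_xx_def = sum((xi - xbar) ** 2 for xi in x)
ss_xx_var = (n - 1) * statistics.variance(x)

# SS_xy from the definition vs. the computational shortcut
ss_xy_def = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
ss_xy_short = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
```

Up to floating-point rounding, each pair of values agrees, which is why summary statistics from a calculator are enough to build the sums of squares.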
To start each problem, use descriptive statistics to find the sum of squares.
- \(SS_{xx} = (n-1) s_{x}^{2}\)
- \(SS_{yy} = (n-1) s_{y}^{2}\)
- \(SS_{xy} = \sum (xy) - n \cdot \bar{x} \bar{y}\)
Use the following data to calculate the correlation coefficient.
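The data table for this exercise is not reproduced here, so as a sketch of the procedure, assume the following made-up \(x\) and \(y\) values and compute \(r\) from the summary statistics exactly as described above:

```python
import statistics

# Hypothetical data; the textbook's data table is not shown here.
x = [4, 5, 6, 7, 8, 9, 10]
y = [10, 12, 11, 15, 16, 18, 17]
n = len(x)

# Summary statistics, as from a calculator's descriptive-statistics output
xbar, ybar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Sums of squares from the shortcut formulas
ss_xx = (n - 1) * sx ** 2
ss_yy = (n - 1) * sy ** 2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

# Correlation coefficient
r = ss_xy / (ss_xx * ss_yy) ** 0.5  # about 0.936: strong positive linear relationship
```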
When is a correlation statistically significant? The next subsection shows how to run a hypothesis test for correlations.