# 2.3: Correlation


Just as we moved (in §§4 and 5) from describing histograms with words (like *symmetric*) to describing them with numbers (like the *mean*), we now build a numeric measure of the strength and direction of the linear association in a scatterplot.

[def:corrcoeff] Given bivariate quantitative data \(\{(x_1,y_1), \dots , (x_n,y_n)\}\), the **[Pearson] correlation coefficient** of this dataset is \[r=\frac{1}{n-1}\sum_{i=1}^n \frac{(x_i-\overline{x})}{s_x}\frac{(y_i-\overline{y})}{s_y}\] where \(s_x\) and \(s_y\) are the standard deviations of the \(x\) and \(y\) datasets, respectively, considered separately.
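As a quick check of the definition, here is a minimal Python sketch that computes \(r\) directly from the formula above. The function name `pearson_r` and the sample data are our own, chosen purely for illustration:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient computed directly from the definition:
    the sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # sample standard deviations (divide by n - 1, then take square roots)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Perfectly linear data with positive slope gives r = 1
# (up to floating-point rounding):
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

Statistical software and calculators use an algebraically equivalent formula, but this direct translation makes the role of the standardized values visible.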

We collect some basic properties of the correlation coefficient in the following fact.

[fact:corrcoefff] For any bivariate quantitative dataset \(\{(x_1,y_1), \dots ,(x_n,y_n)\}\) with correlation coefficient \(r\), we have

- \(-1\le r\le 1\) is always true;
- if \(|r|\) is near \(1\) – meaning that \(r\) is near \(\pm 1\) – then the linear association between \(x\) and \(y\) is *strong*;
- if \(r\) is near \(0\) – meaning that \(r\) is positive or negative, but near \(0\) – then the linear association between \(x\) and \(y\) is *weak*;
- if \(r>0\) then the linear association between \(x\) and \(y\) is positive, while if \(r<0\) then the linear association between \(x\) and \(y\) is negative;
- \(r\) is the same no matter what units are used for the variables \(x\) and \(y\) – meaning that if we change the units of either variable, \(r\) will not change;
- \(r\) is the same no matter which variable is being used as the explanatory and which as the response variable – meaning that if we switch the roles of the \(x\) and the \(y\) in our dataset, \(r\) will not change.

It is also nice to have some examples of scatterplots with various correlation coefficients.

Many electronic tools which compute the correlation coefficient \(r\) of a dataset also report its square, \(r^2\). The reason is explained in the following fact.

[fact:rsquared] If \(r\) is the correlation coefficient between two variables \(x\) and \(y\) in some quantitative dataset, then its square \(r^2\) is the fraction (often expressed as a percentage) of the variation of \(y\) which is associated with variation in \(x\).

[eg:rsquared] If the square of the correlation coefficient between the independent variable *how many hours a week a student studies statistics* and the dependent variable *how many points the student gets on the statistics final exam* is \(.64\), then 64% of the variation in scores for that class is associated with variation in how much the students study. The remaining 36% of the variation in scores is due to other random factors, like whether a student was coming down with a cold on the day of the final, or happened to sleep poorly the night before the final because of neighbors having a party, or other factors unrelated to studying time.
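The arithmetic in this example is simple enough to spell out. Note that knowing \(r^2=.64\) alone does not determine the sign of \(r\), which could be \(+.8\) or \(-.8\):

```python
r_squared = 0.64                # given in the example
explained = r_squared           # fraction of score variation tied to study time
unexplained = 1 - r_squared     # fraction due to other random factors
print(f"{explained:.0%} explained, {unexplained:.0%} unexplained")

# the correlation coefficient itself is one of two values
r_magnitude = r_squared ** 0.5
print(f"r is +{r_magnitude} or -{r_magnitude}")
```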