3.3: Measures of Association between Two Variables
- Measuring the Association of Two Variables
- Correlation Analysis
- Constructing a Scatter Plot
- Correlation is Not the Same as Causation
3.3.1 Introducing Inferential Analysis
Inferential analysis is how we use data from a sample to infer what a relationship might look like in the population. It is a method for drawing conclusions, that is, for inferring trends about a larger population based on the sample analyzed.
In future chapters, we will learn that there are essentially two ways to go about this:
- Estimating Parameters: taking a statistic computed from sample data (such as the sample mean) and using it to estimate the corresponding value that describes the entire population (such as the population mean), often together with a confidence interval that conveys the precision of the estimate.
- Hypothesis Tests: using sample data to determine whether the evidence is strong enough to reject an assumption (hypothesis) about the population.
Inferential analysis allows us to study the relationship between variables within a sample, supporting conclusions and generalizations that accurately represent the population. There are many types of inferential tests in the field of statistics. While we will learn those techniques later, in this section we will discuss two commonly calculated measures of association. Once those measures are introduced, we will conclude the section by noting that correlation is not causation.
3.3.2 Covariance
Covariance is a statistical measure used to describe how two quantitative variables move together. If the values of both variables tend to be above (or below) their means at the same time, the covariance is positive. If one variable tends to be above its mean when the other is below, the covariance is negative. Covariance is a foundational concept in economics and statistics, especially in topics such as portfolio theory, regression analysis, and market risk.
For example, an economist might study the relationship between disposable income and consumer spending. If these two variables tend to increase and decrease together, the covariance between them will be positive. If one increases while the other decreases, the covariance will be negative.
It’s important to note that the magnitude of covariance depends on the units of measurement of the variables. If income is measured in dollars and spending is in euros, the covariance will be in dollar-euro units. This makes it difficult to interpret the strength of association without standardizing (e.g., via correlation).
Population Covariance:
\[
\text{Cov}(X, Y) = \frac{\sum (x_i - \mu_X)(y_i - \mu_Y)}{N}
\]
Sample Covariance:
\[
s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
\]
Where:
- \( x_i \) and \( y_i \) are the data values
- \( \mu_X \) and \( \mu_Y \) are the population means of \( X \) and \( Y \)
- \( \bar{x} \) and \( \bar{y} \) are the sample means
- \( N \) is the population size
- \( n \) is the sample size
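As a concrete illustration, the two formulas can be sketched in a few lines of Python (a minimal sketch; the function names `pop_cov` and `sample_cov` are our own, chosen for this example):

```python
from statistics import mean

def pop_cov(xs, ys):
    """Population covariance: products of deviations from the means, divided by N."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def sample_cov(xs, ys):
    """Sample covariance: same numerator, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
```

Python 3.10+ also ships `statistics.covariance`, which uses the sample (n − 1) denominator.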
Example 1
An economics student wants to measure the relationship between weekly hours worked and weekly income (in dollars) for a sample of three workers. The data are as follows:
| Person | Hours Worked (x) | Weekly Income (y) |
|---|---|---|
| A | 20 | 250 |
| B | 30 | 400 |
| C | 40 | 550 |
a) Compute the sample covariance between hours worked and income.
Solution
Step 1: Find the sample means
\[
\bar{x} = \frac{20 + 30 + 40}{3} = 30, \quad \bar{y} = \frac{250 + 400 + 550}{3} = 400
\]
Step 2: Apply the sample covariance formula
\[
s_{xy} = \frac{(20 - 30)(250 - 400) + (30 - 30)(400 - 400) + (40 - 30)(550 - 400)}{3 - 1}
\]
\[
= \frac{(-10)(-150) + (0)(0) + (10)(150)}{2} = \frac{1500 + 0 + 1500}{2} = \frac{3000}{2} = 1500
\]
Answer: The sample covariance is 1500 (hours·dollars). The positive value indicates that hours worked and income tend to move together.
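The arithmetic above can be checked directly in Python (a quick verification sketch, not part of the original example):

```python
from statistics import mean

hours = [20, 30, 40]
income = [250, 400, 550]
xbar, ybar = mean(hours), mean(income)  # 30 and 400
# Sample covariance: sum of deviation products over n - 1
s_xy = sum((x - xbar) * (y - ybar)
           for x, y in zip(hours, income)) / (len(hours) - 1)
print(s_xy)  # 1500.0
```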
Example 2
A policy analyst is studying whether there is a relationship between unemployment rate (%) and monthly consumer spending (in $1000s) across four U.S. states.
| State | Unemployment Rate (x) | Spending (y) |
|---|---|---|
| W | 4.2 | 3.5 |
| X | 5.0 | 3.0 |
| Y | 5.8 | 2.6 |
| Z | 6.5 | 2.2 |
a) Compute the sample covariance.
b) Based on the sign of the covariance, describe the relationship.
Solution
Step 1: Compute the sample means
\[
\bar{x} = \frac{4.2 + 5.0 + 5.8 + 6.5}{4} = 5.375, \quad
\bar{y} = \frac{3.5 + 3.0 + 2.6 + 2.2}{4} = 2.825
\]
Step 2: Use the sample covariance formula
\[
s_{xy} = \frac{
(4.2 - 5.375)(3.5 - 2.825) +
(5.0 - 5.375)(3.0 - 2.825) +
(5.8 - 5.375)(2.6 - 2.825) +
(6.5 - 5.375)(2.2 - 2.825)
}{4 - 1}
\]
\[
= \frac{
(-1.175)(0.675) +
(-0.375)(0.175) +
(0.425)(-0.225) +
(1.125)(-0.625)
}{3}
\]
\[
= \frac{-0.7931 - 0.0656 - 0.0956 - 0.7031}{3} = \frac{-1.6574}{3} \approx -0.5525
\]
Answer
a) The sample covariance is approximately −0.5525 (%·$1000s).
b) The negative sign suggests that as unemployment increases, consumer spending tends to decrease — a relationship consistent with economic theory.
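Because the intermediate products were rounded to four decimals, the numerator appears as −1.6574; carrying exact arithmetic (sketched here with Python's `fractions` module) gives a numerator of −1.6575 and the same covariance of −0.5525:

```python
from fractions import Fraction
from statistics import mean

# Exact rational versions of the data, so no rounding occurs mid-calculation
unemp = [Fraction("4.2"), Fraction("5.0"), Fraction("5.8"), Fraction("6.5")]
spend = [Fraction("3.5"), Fraction("3.0"), Fraction("2.6"), Fraction("2.2")]
xbar, ybar = mean(unemp), mean(spend)  # 5.375 and 2.825, exactly
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(unemp, spend)) / (4 - 1)
print(float(s_xy))  # -0.5525
```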
3.3.3 Correlation Coefficient
The correlation coefficient is a standardized measure that describes the strength and direction of a linear relationship between two quantitative variables. It is based on covariance, but unlike covariance, the correlation coefficient is unitless and always falls between −1 and 1. This makes it easier to compare relationships across variables that may have different units of measurement.
Economists frequently use correlation to examine the association between income and spending, education and wages, or inflation and unemployment. While correlation does not imply causation, it is a useful way to assess how two variables move together.
- A positive correlation means that as one variable increases, the other tends to increase.
- A negative correlation means that as one variable increases, the other tends to decrease.
- A correlation close to 0 suggests little to no linear relationship.
- A correlation close to −1 or 1 indicates a strong linear relationship.
Figure 4 provides some common interpretations for correlation coefficient values.
| Size of Correlation | Interpretation |
|---|---|
| .90 to 1.00 (-.90 to -1.00) | Very high positive (negative) correlation |
| .70 to .90 (-.70 to -.90) | High positive (negative) correlation |
| .50 to .70 (-.50 to -.70) | Moderate positive (negative) correlation |
| .30 to .50 (-.30 to -.50) | Low positive (negative) correlation |
| .00 to .30 (.00 to -.30) | Little, if any, correlation |
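These bands translate directly into a small lookup function (a sketch; the function name `describe_correlation` is ours, and the cut-points are those of the table):

```python
def describe_correlation(r):
    """Map a correlation coefficient r to the verbal interpretation in the table above."""
    size = abs(r)
    if size >= 0.90:
        label = "very high"
    elif size >= 0.70:
        label = "high"
    elif size >= 0.50:
        label = "moderate"
    elif size >= 0.30:
        label = "low"
    else:
        return "little, if any, correlation"
    sign = "negative" if r < 0 else "positive"
    return f"{label} {sign} correlation"

print(describe_correlation(-0.998))  # very high negative correlation
```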
Correlation Coefficient Equations:
Population Correlation Coefficient:
\[
\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}
\]
Sample Correlation Coefficient:
\[
r = \frac{s_{xy}}{s_x s_y}
\]
Where:
- \( \text{Cov}(X, Y) \) is the population covariance
- \( \sigma_X \) and \( \sigma_Y \) are the population standard deviations of \( X \) and \( Y \)
- \( s_{xy} \) is the sample covariance
- \( s_x \) and \( s_y \) are the sample standard deviations of \( x \) and \( y \)
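Putting the pieces together, the sample correlation can be sketched by dividing the sample covariance by the product of the sample standard deviations (the function name `sample_corr` is ours; `statistics.stdev` uses the n − 1 denominator, matching the formulas above):

```python
from statistics import mean, stdev

def sample_corr(xs, ys):
    """r = s_xy / (s_x * s_y)."""
    mx, my = mean(xs), mean(ys)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return s_xy / (stdev(xs) * stdev(ys))
```

Python 3.10+ also provides `statistics.correlation`, which computes the same quantity.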
Example 1
A labor economist wants to evaluate the relationship between education level (in years) and hourly wage (in dollars) for three workers. The data are:
| Person | Years of Education (x) | Wage (y) |
|---|---|---|
| A | 12 | 15 |
| B | 14 | 18 |
| C | 16 | 22 |
a) Compute the sample correlation coefficient between education and wage.
Solution
Step 1: Compute sample means
\[
\bar{x} = \frac{12 + 14 + 16}{3} = 14, \quad \bar{y} = \frac{15 + 18 + 22}{3} \approx 18.33
\]
Step 2: Compute sample standard deviations
\[
s_x = \sqrt{\frac{(12 - 14)^2 + (14 - 14)^2 + (16 - 14)^2}{2}} = \sqrt{\frac{4 + 0 + 4}{2}} = \sqrt{4} = 2
\]
\[
s_y = \sqrt{\frac{(15 - 18.33)^2 + (18 - 18.33)^2 + (22 - 18.33)^2}{2}} = \sqrt{\frac{11.11 + 0.11 + 13.44}{2}} = \sqrt{12.33} \approx 3.51
\]
Step 3: Compute sample covariance
\[
s_{xy} = \frac{(12 - 14)(15 - 18.33) + (14 - 14)(18 - 18.33) + (16 - 14)(22 - 18.33)}{2}
= \frac{(-2)(-3.33) + (0)(-0.33) + (2)(3.67)}{2}
= \frac{6.66 + 0 + 7.34}{2} = \frac{14}{2} = 7
\]
Step 4: Compute correlation
\[
r = \frac{7}{2 \times 3.51} = \frac{7}{7.02} \approx 0.997
\]
Answer: The correlation coefficient is approximately 0.997, indicating a very strong positive linear relationship between education and wage.
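As a check, the same answer falls out of a short Python computation (full precision gives r ≈ 0.9966, which rounds to the 0.997 above):

```python
from statistics import mean, stdev

edu = [12, 14, 16]
wage = [15, 18, 22]
mx, my = mean(edu), mean(wage)
# Sample covariance, then divide by the product of sample standard deviations
s_xy = sum((x - mx) * (y - my) for x, y in zip(edu, wage)) / (len(edu) - 1)
r = s_xy / (stdev(edu) * stdev(wage))
print(round(r, 3))  # 0.997
```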
Example 2
A public health economist examines the relationship between cigarette tax rates and adult smoking rates across four states. The data are:
| State | Tax per Pack ($) (x) | Smoking Rate (%) (y) |
|---|---|---|
| A | 1.00 | 22 |
| B | 1.50 | 18 |
| C | 2.00 | 15 |
| D | 2.50 | 12 |
a) Compute the sample correlation between tax rate and smoking rate.
Solution
Step 1: Compute Sample Means
\[
\bar{x} = \frac{1.00 + 1.50 + 2.00 + 2.50}{4} = 1.75, \quad
\bar{y} = \frac{22 + 18 + 15 + 12}{4} = 16.75
\]
Step 2: Compute standard deviations
\[
s_x = \sqrt{\frac{(1.00 - 1.75)^2 + (1.50 - 1.75)^2 + (2.00 - 1.75)^2 + (2.50 - 1.75)^2}{3}}
= \sqrt{\frac{0.5625 + 0.0625 + 0.0625 + 0.5625}{3}}
= \sqrt{0.4167} \approx 0.645
\]
\[
s_y = \sqrt{\frac{(22 - 16.75)^2 + (18 - 16.75)^2 + (15 - 16.75)^2 + (12 - 16.75)^2}{3}}
= \sqrt{\frac{27.5625 + 1.5625 + 3.0625 + 22.5625}{3}}
= \sqrt{18.25} \approx 4.27
\]
Step 3: Compute sample covariance
\[
s_{xy} = \frac{
(1.00 - 1.75)(22 - 16.75) +
(1.50 - 1.75)(18 - 16.75) +
(2.00 - 1.75)(15 - 16.75) +
(2.50 - 1.75)(12 - 16.75)
}{3}
\]
\[
= \frac{
(-0.75)(5.25) +
(-0.25)(1.25) +
(0.25)(-1.75) +
(0.75)(-4.75)
}{3}
= \frac{-3.9375 - 0.3125 - 0.4375 - 3.5625}{3}
= \frac{-8.25}{3} = -2.75
\]
Step 4: Compute correlation
\[
r = \frac{-2.75}{0.645 \times 4.27} = \frac{-2.75}{2.7542} \approx -0.998
\]
Answer: The correlation is approximately −0.998, suggesting a very strong negative linear relationship: higher cigarette taxes are strongly associated with lower smoking rates.
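Verifying in Python: carrying full precision (rather than rounding s_x and s_y to three digits first) gives r ≈ −0.997, essentially the same very strong negative correlation:

```python
from statistics import mean, stdev

tax = [1.00, 1.50, 2.00, 2.50]
smoke = [22, 18, 15, 12]
mx, my = mean(tax), mean(smoke)
# Sample covariance over n - 1, then standardize by the sample standard deviations
s_xy = sum((x - mx) * (y - my) for x, y in zip(tax, smoke)) / (len(tax) - 1)
r = s_xy / (stdev(tax) * stdev(smoke))
print(round(r, 3))  # -0.997
```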
3.3.4 Trend Lines and Scatter Plots
We previously learned about one of the most useful graphs for displaying the relationship between two quantitative variables: the scatter plot. If we add a trend line to the scatter plot, it can be used to determine whether a linear (straight-line) correlation exists between two variables. Each individual observation appears as a point in the plot, fixed by the values of both variables for that individual. The scatter plot below in Figure 2 shows several types of correlation.
A positive correlation exists when two variables operate in unison, so that when one variable rises or falls, the other does the same. A negative correlation exists when two variables move in opposition to one another, so that when one variable rises, the other falls. As Figure 2 shows below, a "weak" correlation can occur if the points are not close to the trend line, but can also occur if the slope of the trend line is shallow. A shallower trend line corresponds to a correlation coefficient closer to 0, and a steeper trend line corresponds to a correlation coefficient closer to 1 or −1.
Curvilinear Relationships: As shown in Figure 3, it is important to note that not all relationships between x and y follow a straight line. There are many curvilinear relationships, in which one variable increases as the other increases until the relationship reverses itself, so that one variable finally decreases while the other continues to increase (Levin & Fox, 2006).

How to Examine a Scatter Plot
As suggested by Moore, Notz & Fligner (2013), in any graph of data, look for the overall pattern and for striking deviations. The overall pattern of a scatter plot can be described by the direction, form, and strength of the relationship. A significant deviation from the trend line may signal that the observation is an outlier, i.e., an individual value that falls outside the overall pattern of the relationship.
You interpret a scatter plot by looking for trends in the data as you move from left to right: if the data show an uphill pattern, this indicates a positive relationship between x and y; as the x-values increase (move right), the y-values tend to increase (move up).


