13.1: Associations Among Variables

    • Chanler Hilley, Kennesaw State University
    • University of Missouri System


    Thus far, all of our analyses have focused on comparing the value of a continuous variable across different groups via mean differences. We will now turn away from means and look instead at how to assess the relationship between two continuous variables in the form of correlations. As we will see, the logic behind correlations is the same as it was behind group means, but we will now have the ability to assess an entirely new data structure.

    Variability and Covariance

    A common theme throughout statistics is the notion that individuals will differ on different characteristics and traits, which we call variance. In inferential statistics and hypothesis testing, our goal is to find systematic reasons for differences and rule out random chance as the cause. By doing this, we are using information on a different variable—which so far has been group membership, like in ANOVA—to explain this variance. In correlations, we will instead use a continuous variable to account for the variance.

    Because we have two continuous variables, we will have two characteristics or scores on which people will vary. What we want to know is whether people vary on the scores together. That is, as one score changes, does the other score also change in a predictable or consistent way? This notion of variables differing together is called covariance (the prefix co- meaning “together”).

    Let’s look at our formula for variance on a single variable:

    \[
    \Large
    s^2=\frac{\sum{(X-M)^2}}{N-1}
    \nonumber
    \]

    We use X to represent a person’s score on the variable at hand, and M to represent the mean of that variable. The numerator of this formula is the sum of squares, which we have seen several times for various uses. Recall that squaring a value is just multiplying that value by itself. Thus, we can write the same equation as:

    \[
    \Large
    s^2=\frac{\sum{((X-M)(X-M))}}{N-1}
    \nonumber
    \]

    This is the same formula and works the same way as before, where we multiply the deviation score by itself (we square it) and then sum across squared deviations.
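    The equivalence above is easy to verify numerically. Below is a minimal sketch using a small hypothetical data set (the scores are invented purely for illustration):

```python
# Verifying that the sum of squares is each deviation multiplied
# by itself, then summed. The scores below are hypothetical.
from statistics import mean

X = [3, 5, 7, 9]          # hypothetical scores on one variable
M = mean(X)               # M = 6
N = len(X)

# Sum of squares: each deviation score times itself, summed.
SS = sum((x - M) * (x - M) for x in X)

# Sample variance divides the sum of squares by N - 1.
s2 = SS / (N - 1)
print(SS, s2)   # 20.0 6.666...
```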

    Now, let’s look at the formula for covariance. In this formula, we will still use \(X\) to represent the score on one variable, and we will now use \(Y\) to represent the score on the second variable. We will still use \(M\), now with subscripts, to represent the mean of each variable. The formula for covariance (\(\text{cov}_{XY}\), with the subscript \(XY\) to indicate covariance between the \(X\) and \(Y\) variables) is:

    \[
    \Large
    \text{cov}_{XY}=\frac{\sum{((X-M_X)(Y-M_Y))}}{N-1}
    \nonumber
    \]

    As we can see, this formula has exactly the same structure as the previous one. Now, instead of multiplying the deviation score by itself on one variable, we take the deviation scores from a single person on each variable and multiply them together. We do this for each person (exactly as we did for variance) and then sum the products to get our numerator. This numerator is called the sum of products:

    \[
    \Large
    SP=\sum{((X-M_X)(Y-M_Y))}
    \nonumber
    \]

    We will calculate the sum of products using the same table we used to calculate the sum of squares. In fact, the table for sum of products is simply a sum of squares table for X, plus a sum of squares table for Y, with a final column of products. The table would include all of the following columns, with values computed for each individual in the data set:

    • \(X\)
    • \(X-M_X\)
    • \((X-M_X)^2\)
    • \(Y\)
    • \(Y-M_Y\)
    • \((Y-M_Y)^2\)
    • \((X-M_X)(Y-M_Y)\)

    In the next section, you will see that this table works the same way it did before (remember that the column headers tell you exactly what to do in that column). We list our raw data for the X and Y variables in the X and Y columns, respectively, then add them up so we can calculate the mean of each variable. We then take those means and subtract them from the appropriate raw score to get our deviation scores for each person on each variable, and the columns of deviation scores will both add up to zero. We will square our deviation scores for each variable to get the sums of squares for X and Y so that we can compute the variance and standard deviation of each. (We will use the standard deviation in our equation below.) Finally, we take the deviation score from each variable and multiply them together to get our product score. Summing this column will give us our sum of products. It is very important that you multiply the raw deviation scores from each variable, not the squared deviation scores.

    Our sum of products will go into the numerator of our formula for covariance, and then we only have to divide by N − 1 to get our covariance. Unlike the sum of squares, both our sum of products and our covariance can be positive, negative, or zero, and their signs will always match (e.g., if our sum of products is positive, our covariance will always be positive). A positive sum of products and covariance indicates that the two variables are related and move in the same direction. That is, as one variable goes up, the other will also go up, and vice versa. A negative sum of products and covariance means that the variables are related but move in opposite directions when they change, which is called an inverse relationship. In an inverse relationship, as one variable goes up, the other variable goes down. If the sum of products and covariance are zero, then the variables are not related: as one variable goes up or down, the other variable does not change in a consistent or predictable way.
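    The calculation and the sign behavior described above can be sketched in a few lines of code. The paired scores below are hypothetical, chosen so that one pairing moves together and the other moves in opposite directions:

```python
# Covariance as the sum of products divided by N - 1.
# All data below are hypothetical, for illustration only.
from statistics import mean

X = [2, 4, 6, 8]
Y = [1, 3, 5, 7]          # moves with X    -> positive covariance
Z = [7, 5, 3, 1]          # moves against X -> negative covariance

def covariance(a, b):
    ma, mb = mean(a), mean(b)
    # Sum of products: each person's deviation on one variable
    # times that same person's deviation on the other, summed.
    sp = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sp / (len(a) - 1)

print(covariance(X, Y))   # positive (about 6.67)
print(covariance(X, Z))   # negative (about -6.67)
```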

    The previous paragraph brings us to an important definition about relationships between variables. What we are looking for in a relationship is a consistent or predictable pattern. That is, the variables change together, either in the same direction or opposite directions, in the same way each time. It doesn’t matter if this relationship is positive or negative, only that it is not zero. If there is no consistency in how the variables change within a person, then the relationship is zero and does not exist. We will revisit this notion of direction vs. zero relationship later on.

    Visualizing Relationships

    Chapter 2 covered many different forms of data visualization, and visualizing data remains an important first step in understanding and describing our data before we move into inferential statistics. Nowhere is this more important than in correlation. Correlations are visualized by a scatter plot, where our X variable values are plotted on the x-axis, the Y variable values are plotted on the y-axis, and each point or marker in the plot represents a single person’s score on X and Y. Figure \(\PageIndex{1}\) shows a scatter plot for hypothetical scores on job satisfaction (X) and worker well-being (Y). We can see from the axes that each of these variables is measured on a 10-point scale, with 10 being the highest on both variables (high satisfaction and good well-being) and 1 being the lowest (dissatisfaction and poor well-being). When we look at this plot, we can see that the variables do seem to be related. The higher scores on job satisfaction tend to also be the higher scores on well-being, and the same is true of the lower scores.

    Scatter plot showing a positive relationship between job satisfaction (x-axis) and well-being (y-axis); as job satisfaction increases, well-being tends to increase.
    Figure \(\PageIndex{1}\): Plotting job satisfaction and well-being scores. (“Scatter Plot Job Satisfaction and Well-Being” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

    Figure \(\PageIndex{1}\) demonstrates a positive relationship. As scores on X increase, scores on Y also tend to increase. Although this is not a perfect relationship (if it were, the points would form a single straight line), it is nonetheless very clearly positive. This is one of the key benefits of scatter plots: they make it very easy to see the direction of the relationship. As another example, Figure \(\PageIndex{2}\) shows a negative relationship between job satisfaction (X) and burnout (Y), which is how stressed, unenergetic, and unhappy someone is at their job. As we can see from this plot, higher scores on job satisfaction tend to correspond with lower scores on burnout. As with Figure \(\PageIndex{1}\), this is not a perfect relationship, but it is still a clear one. As these figures show, points in a positive relationship move from the bottom left of the plot to the top right, and points in a negative relationship move from the top left to the bottom right.

    Scatterplot showing a negative relationship between Job Satisfaction (x-axis) and Burnout (y-axis); as job satisfaction increases, burnout decreases.
    Figure \(\PageIndex{2}\): Plotting job satisfaction and burnout scores. (“Scatter Plot Job Satisfaction and Burnout” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

    Scatter plots can also indicate that there is no relationship between the two variables. In these scatter plots (for example, Figure \(\PageIndex{3}\), which plots job satisfaction and job performance), there is no interpretable shape or line in the scatter plot. The points appear randomly throughout the plot. If we tried to draw a straight line through these points, it would basically be flat. The low scores on job satisfaction have roughly the same scores on job performance as do the high scores on job satisfaction. Scores in the middle or average range of job satisfaction have some scores on job performance that are about equal to the high and low levels and some scores on job performance that are a little higher, but the overall picture is one of inconsistency.

    Scatter plot showing no clear relationship between job satisfaction (x-axis) and job performance (y-axis); data points are scattered randomly.
    Figure \(\PageIndex{3}\): Plotting no relationship between job satisfaction and job performance. (“Scatter Plot Job Satisfaction and Job Performance” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

    As we can see, scatter plots are very useful for giving us an approximate idea of whether there is a relationship between the two variables and, if there is, whether that relationship is positive or negative. They are also the only way to assess one of the characteristics of correlations discussed next: form.

    Video: Introduction to Correlation

    Introduction to Correlation on YouTube.

    Three Characteristics

    When we talk about correlations, there are three traits that we need to know in order to truly understand the relationship (or lack of relationship) between X and Y: form, direction, and magnitude. We will discuss each of them in turn.

    Form

    The first characteristic of relationships between variables is their form. The form of a relationship is the shape it takes in a scatter plot, and a scatter plot is the only way to assess the form of a relationship. There are three forms we look for: linear, curvilinear, or no relationship. A linear relationship is what we saw in Figures \(\PageIndex{1}\), \(\PageIndex{2}\), and \(\PageIndex{3}\). If we drew a line through the middle of the points in any of those scatter plots, the best-fitting line would be a straight one. The term linear comes from the word line. A linear relationship is what we will always assume when we calculate correlations: all of the correlations presented here are only valid for linear relationships. Thus, it is important to plot our data to make sure we meet this assumption.

    The relationship between two variables can also be curvilinear. As the name suggests, a curvilinear relationship is one in which a line through the middle of the points in a scatter plot will be curved rather than straight. Two examples are presented in Figures \(\PageIndex{4}\) and \(\PageIndex{5}\).

    Scatter plot with points showing an upward trend. X-axis ranges from 1 to 10, y-axis ranges from 2 to 10. Points increase in value as x increases.
    Figure \(\PageIndex{4}\): Exponentially increasing curvilinear relationship. (“Curvilinear Relation Increasing” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
    Scatter plot showing a curved pattern with points rising to a peak at the center and then falling, forming an arch shape. The x-axis ranges from 2 to 16, and the y-axis from 1 to 7.
    Figure \(\PageIndex{5}\): Inverted-U curvilinear relationship. (“Curvilinear Relation Inverted U” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

    Curvilinear relationships can take many shapes, and the two examples above are only a small sample of the possibilities. What they have in common is that they both have a very clear pattern but that pattern is not a straight line. If we try to draw a straight line through them, we would get a result similar to what is shown in Figure \(\PageIndex{6}\).

    Scatter plot with points forming an upward and downward arch, intersected horizontally by a straight line at y = 4. x-axis ranges from 2 to 14, y-axis from 1 to 7.
    Figure \(\PageIndex{6}\): Overlaying a straight line on a curvilinear relationship. (“Curvilinear Relation Inverted U with Straight Line” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

    Although that line is the closest it can be to all points at the same time, it clearly does a very poor job of representing the relationship we see. Additionally, the line itself is flat, suggesting there is no relationship between the two variables even though the data show that there is one. This is important to keep in mind, because the math behind our calculations of correlation coefficients will only ever produce a straight line—we cannot create a curved line with the techniques discussed here.
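    We can verify this numerically. For data with a perfect inverted-U pattern, the linear correlation comes out at exactly zero, even though the relationship is obvious in a plot. The data below are hypothetical:

```python
# Why linear correlation misses a curvilinear pattern: Y follows a
# perfect inverted U over X, yet the upward and downward halves
# cancel, so the linear correlation is zero. Hypothetical data.
from statistics import mean, stdev

X = list(range(1, 10))               # 1 through 9
Y = [-(x - 5) ** 2 for x in X]       # perfect inverted U, peak at x = 5

mx, my = mean(X), mean(Y)
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
r = sp / ((len(X) - 1) * stdev(X) * stdev(Y))
print(r)   # 0.0 -- no *linear* relationship detected
```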

    Finally, sometimes when we create a scatter plot, we end up with no interpretable relationship at all. An example of this is shown in Figure \(\PageIndex{7}\). The points in this plot show no consistency in relationship, and a line through the middle would once again be a straight, flat line.

    Scatter plot with many black dots clustered near the center, with values between 2 and 8 on both x and y axes. No clear pattern or trend is visible.
    Figure \(\PageIndex{7}\): No relationship. (“Scatter Plot No Relation” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

    Sometimes when we look at scatter plots, it is tempting to be swayed by a few points that fall far away from the rest and seem to imply that there may be some sort of relationship. These points are called outliers, and we will discuss them in more detail later in the chapter. Because outliers can be common, it is important to formally test for a relationship between our variables rather than rely on visualization alone. That is the point of hypothesis testing with correlations, which we will cover in depth soon. First, however, we need to describe the other two characteristics of relationships: direction and magnitude.

    Direction

    The direction of the relationship between two variables tells us whether the variables change in the same way at the same time or in opposite ways at the same time. We saw this concept earlier when first discussing scatter plots, and we used the terms positive and negative. A positive relationship is one in which X and Y change in the same direction: as X goes up, Y goes up, and as X goes down, Y also goes down. A negative relationship is just the opposite: X and Y change together in opposite directions: as X goes up, Y goes down, and vice versa.

    As we will see soon, when we calculate a correlation coefficient, we are quantifying the relationship demonstrated in a scatter plot. That is, we are putting a number to it. That number will be either positive, negative, or zero, and we interpret the sign of the number as our direction. If the number is positive, it is a positive relationship, and if it is negative, it is a negative relationship. If it is zero, then there is no relationship. The direction of the relationship corresponds directly to the slope of the hypothetical line we draw through scatter plots when assessing the form of the relationship. If the line has a positive slope that moves from bottom left to top right, the relationship is positive, and vice versa for negative. If the line is flat, it has no slope and there is no relationship, which will in turn yield a zero for our correlation coefficient.

    Magnitude

    The number we calculate for our correlation coefficient, which we will describe in detail below, corresponds to the magnitude of the relationship between the two variables. The magnitude is how strong or how consistent the relationship between the variables is. Higher numbers mean greater magnitude, which means a stronger relationship.

    Our correlation coefficients will take on any value between −1.00 and 1.00, with 0.00 in the middle, which again represents no relationship. A correlation of −1.00 is a perfect negative relationship; as X goes up by some amount, Y goes down by the same amount, consistently. Likewise, a correlation of 1.00 indicates a perfect positive relationship; as X goes up by some amount, Y also goes up by the same amount. Finally, a correlation of 0.00, which indicates no relationship, means that as X goes up by some amount, Y may or may not change by any amount, and it does so inconsistently.

    The vast majority of correlations do not reach −1.00 or 1.00. Instead, they fall in between, and we use rough cutoffs to describe how strong the relationship is based on this number. Importantly, the sign of the number (the direction of the relationship) has no bearing on how strong the relationship is. The only thing that matters is the magnitude, or the absolute value, of the correlation coefficient. A correlation of −1 is just as strong as a correlation of 1. We generally use values of .10, .30, and .50 as indicating weak, moderate, and strong relationships, respectively.
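    Putting the pieces together, here is a short sketch that rescales the covariance into a correlation coefficient and applies the rough cutoffs above. The data, and the label used for values below .10, are illustrative choices of my own rather than part of the text:

```python
# Correlation coefficient: covariance rescaled by the two standard
# deviations, then classified with the rough .10/.30/.50 cutoffs.
# The paired scores below are hypothetical.
from statistics import mean, stdev

def pearson_r(a, b):
    ma, mb = mean(a), mean(b)
    sp = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sp / ((len(a) - 1) * stdev(a) * stdev(b))

def strength(r):
    # Direction (the sign) is ignored; only magnitude matters.
    m = abs(r)
    if m >= 0.50:
        return "strong"
    if m >= 0.30:
        return "moderate"
    if m >= 0.10:
        return "weak"
    return "negligible"   # label below .10 is my own choice

X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 5]       # hypothetical; imperfectly tracks X

r = pearson_r(X, Y)
print(round(r, 2), strength(r))   # 0.8 strong
```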

    The strength of a relationship, just like the form and direction, can also be inferred from a scatter plot, though this is much more difficult to do. Some examples of weak and strong relationships are shown in Figures \(\PageIndex{8}\) and \(\PageIndex{9}\), respectively. Weak correlations still have an interpretable form and direction, but it is much harder to see. Strong correlations have a very clear pattern, and the points tend to form a line. The examples show two different directions, but remember that the direction does not matter for the strength, only the consistency of the relationship and the size of the number, which we will see next.

    A scatter plot with many small black dots clustered between x-values 2 to 7 and y-values 3 to 8, showing no clear pattern or trend.
    Figure \(\PageIndex{8}\): Weak positive correlation. (“Scatter Plot Weak Positive Correlation” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
    Scatter plot with many points showing a negative correlation; as the x-values increase, the y-values decrease. The points are clustered with some spread.
    Figure \(\PageIndex{9}\): Strong negative correlation. (“Scatter Plot Strong Negative Correlation” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

    This page titled 13.1: Associations Among Variables is shared under a not declared license and was authored, remixed, and/or curated by Chanler Hilley, Kennesaw State University via source content that was edited to the style and standards of the LibreTexts platform.