Statistics LibreTexts

16: Correlation, Similarity, and Distance


    Introduction

    We continue our introduction to inferential statistics. Recall that when we analyze a data set, we generally begin by describing it (central tendency, measures of variability) and by plotting the data. To begin our introduction to correlation and regression, we first describe how to produce graphs that help show linear association or, in some cases, cause and effect; the latter is perhaps the primary reason for using regression.

    Graphical representation

    The previous statistical procedures we have examined have used one or more categorical or qualitative variables (Chapter 3). For example,

    1. Chi-Square Analyses: all variables are categorical, including the response variable (Chapter 9).
    2. T-tests: one categorical (Factor) variable and one (Dependent, Outcome, Response) variable that is continuous or interval scale (Chapters 8.5, 10).
    3. ANOVA Analyses: one or more categorical variables (Factors, the independent variables) and one (Dependent, Outcome, Response) variable that is continuous or interval scale (Chapters 12, 14).

    The convention in graphing ANOVA (or Chi-Square) results is to place the Factor or Independent variables on the X-axis and the dependent variable (Response) on the Y-axis. We called these bar charts (Chapter 4.1).

    A bar chart with error bars, showing a data set of number of matings vs. age (classified as old or young). Age populations are further divided by sex.
    Figure \(\PageIndex{1}\): Bar chart with error bars.
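
As a rough illustration of what a bar chart with error bars summarizes, the sketch below computes a mean (bar height) and a standard error of the mean (error bar half-length) for each group. It is only a sketch under assumptions: the source does not say what the error bars in the figure represent, so the use of ±1 SEM is assumed here, and the function name `bar_heights_and_errors` and the mating counts are hypothetical, invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def bar_heights_and_errors(groups):
    """Return {group: (mean, sem)} for a dict mapping group name -> list of values.

    Assumption: error bars show +/- 1 standard error of the mean (SEM).
    """
    summary = {}
    for name, values in groups.items():
        sem = stdev(values) / sqrt(len(values))  # sample SD divided by sqrt(n)
        summary[name] = (mean(values), sem)
    return summary


# Hypothetical mating counts, invented for illustration only
matings = {
    "old": [2, 4, 6, 8],
    "young": [1, 3, 5, 7],
}

summary = bar_heights_and_errors(matings)
for name, (m, sem) in summary.items():
    print(f"{name}: mean = {m:.2f}, SEM = {sem:.2f}")
```

Other choices for the error bars (standard deviation, confidence interval) would change only the line that computes `sem`.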

    Box plots (Chapter 4.3) are also useful, and perhaps the preferred choice to display this type of comparison (one involving groups) (Fig. \(\PageIndex{2}\)).

    Box plot of matings vs. age (old or young).
    Figure \(\PageIndex{2}\): Box plots.
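
A box plot is drawn from a five-number summary of each group: minimum, first quartile, median, third quartile, and maximum. The sketch below computes that summary with Python's standard library. The function name `five_number_summary` and the data are hypothetical, and the "inclusive" quartile rule is an assumption; statistical packages use slightly different quartile conventions, so values may differ a little from those shown by other software.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Assumption: quartiles use the 'inclusive' interpolation rule.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical mating counts for one group, invented for illustration only
old = [1, 2, 3, 4, 5, 6, 7, 8, 9]

five = five_number_summary(old)
print(five)
```

The box spans the first to third quartile, the line inside the box marks the median, and the whiskers extend toward the minimum and maximum (many packages draw extreme points separately as outliers).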

    In correlation (and regression) analyses we will have two or more continuous or interval scale variables. To show relationships among continuous variables, a scatter plot, also called an X-Y plot, works well (Chapter 4.5).

    In correlation, no causation is implied, so either variable can be placed on the X-axis. The convention of graphing in regression is to place the independent variable on the X-axis and the dependent variable on the Y-axis (Fig. \(\PageIndex{3}\)). Another consideration: if one variable is considered fixed and the other random, then the fixed variable would be assigned to the horizontal axis.

    Scatter plot of number of matings vs. mass, with 4 groups shown on the same plot: female old, female young, male old, and male young.
    Figure \(\PageIndex{3}\): Scatterplot with groups.

    To produce a scatterplot in Rcmdr, select Graph → Plot, then select the Y and X variables. Use a combination of Options, Frame, and Edit Attributes selections to modify the default graph.
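
For readers working outside Rcmdr, a scatterplot with groups can also be scripted. The following is a minimal sketch, not the Rcmdr procedure: it assumes the third-party matplotlib library is installed, and the function names (`split_by_group`, `scatter_by_group`), the records, and the output file name are all hypothetical. Following the convention described above, the variable treated as independent (here, mass) goes on the X-axis.

```python
from collections import defaultdict


def split_by_group(records):
    """Turn (group, x, y) records into {group: ([x, ...], [y, ...])}."""
    series = defaultdict(lambda: ([], []))
    for group, x, y in records:
        xs, ys = series[group]
        xs.append(x)
        ys.append(y)
    return dict(series)


def scatter_by_group(records, xlabel, ylabel, path):
    """Draw one scatter series per group and save the figure to `path`.

    Assumption: matplotlib is installed; it is imported only when plotting.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for group, (xs, ys) in split_by_group(records).items():
        ax.scatter(xs, ys, label=group)
    ax.set_xlabel(xlabel)  # independent variable on the horizontal axis
    ax.set_ylabel(ylabel)  # dependent variable on the vertical axis
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


# Hypothetical (group, mass, matings) records, invented for illustration only
records = [
    ("female old", 10.2, 3),
    ("female old", 11.0, 4),
    ("male young", 12.5, 6),
    ("male young", 13.1, 5),
]

series = split_by_group(records)
print(series)
```

Calling `scatter_by_group(records, "Mass", "Number of matings", "scatter.png")` would then write the grouped plot to a file.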

    • 16.1: Product-moment correlation
      Correlations as methods of describing direction and magnitude of the linear association between two variables. Discussion of the Pearson product-moment correlation for use in describing association between continuous, ratio-scale data.
    • 16.2: Causation and partial correlation
      The difference between correlation and causation, and the danger of confounding variables creating spurious correlations between measured variables. Partial correlation as a method of determining whether a measured third variable is correlated with the two variables of interest. Note: source material is missing a dataset used for one worked example.
    • 16.3: Data aggregation and correlation
      How correlations among grouped or aggregated data may differ from the underlying individual correlations. Discussion of the ecological fallacy.
    • 16.4: Spearman and other correlations
      Types of correlation estimators besides the Pearson product moment correlation. Includes discussion of Spearman rank correlation, Kendall's tau, and other correlations.
    • 16.5: Instrument reliability and validity
      Discussion of types of instruments and instrument reliability. Types of reliability and reliability estimators.
    • 16.6: Similarity and distance
      Distance correlation as a measure of association strength between variables regardless of relationship linearity. Examples of distance measures as used in biology.
    • 16.7: References and suggested readings


    This page titled 16: Correlation, Similarity, and Distance is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Michael R Dohm via source content that was edited to the style and standards of the LibreTexts platform.