13.3: Identifying Associations
Associations can be identified using observational studies. There are two main types of observational studies: retrospective and prospective. We begin by defining a retrospective study.
A retrospective study is an observational study in which researchers obtain information about subjects from data that has already been collected or from an event that has already occurred.
For this type of study, the analysis is performed on data that were already collected. The point is to take information that already exists and see whether there are associations between variables. Researchers have to be careful with this type of analysis, especially when there are many observations for each of the variables. Too many observations can result in observed associations between variables with very tenuous connections. For example, the website tylervigen.com reports that the total revenue generated by arcades is very strongly associated with the number of computer science doctorates awarded in the United States. Does this mean that one causes the other, or is it more likely that both variables are increasing with time, producing an apparent relationship? Using data that have already been collected bypasses important protocols that are required to establish a causal relationship, so it is usually impossible to tell whether two variables are causally related or merely associated with one another.
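The point about two variables that both increase with time can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the two series, their slopes, and their noise levels are invented for this example and are not the arcade or doctorate data. Each series is generated independently of the other, yet both rise with time.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

years = np.arange(2000, 2020)        # 20 hypothetical years
t = years - years[0]

# Two made-up series that each grow over time but are otherwise generated
# independently of one another (neither is used to compute the other).
series_a = 10.0 + 0.8 * t + rng.normal(scale=1.0, size=t.size)
series_b = 500.0 + 30.0 * t + rng.normal(scale=40.0, size=t.size)

# Correlation of the raw series: large, because both rise with time.
r_raw = np.corrcoef(series_a, series_b)[0, 1]

# Correlation of year-to-year changes: differencing removes the shared
# linear trend, so what is left is mostly the independent noise.
r_diff = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]

print(f"correlation of raw series:           {r_raw:.2f}")
print(f"correlation of year-to-year changes: {r_diff:.2f}")
```

Comparing the two printed values suggests how much of an observed association may be due to a shared time trend rather than to any direct connection between the variables.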
Retrospective studies are quite common in research. They can give researchers information on issues that may exist and should be studied further, without a major investment of resources. Such an analysis can also help validate research concepts and ideas with the aim of developing more sophisticated research hypotheses. The next type of observational study relies on data that will be collected in the future.
A prospective study is a type of observational study in which researchers gather information about subjects from data that has not yet been collected or from an event that is planned but has not happened yet.
Retrospective and prospective studies are both types of observational studies, so the researchers do not attempt to manipulate anything about the lives of the individuals participating in the study, other than the fact that they are now being observed. The difference between a retrospective study and a prospective study has to do with whether the data exist when the researchers design the study and decide what hypotheses they are going to investigate. For example, researchers can observe a population to try to learn if there is a relationship between income and race. The researchers could collect data from a website that has income data categorized by race or ethnicity for a specified city. The researchers could then use these data to determine if race and income appear to be associated. This is an example of a retrospective study. On the other hand, researchers could design a study that uses results from an upcoming survey on income, race, and ethnicity. This would be an example of a prospective study.
Prospective studies are also often longitudinal; that is, individuals are tracked through time. Consider a situation where a researcher wants to determine if race and ethnicity are a factor in determining income after graduation at a particular university. The researcher surveys 100 graduating seniors to obtain baseline information. After graduation, the alumni report their income each year. Such a study can allow the researcher to determine if there is a long-term association between race and ethnicity and income. Because the study was designed prior to obtaining any data, this is a prospective study.
A prospective study is a longitudinal study if the same individuals are repeatedly observed over time.
The strength of a linear association between two variables is measured by a numerical data summary called the correlation. The correlation is a measure that can be as small as \(-1\), indicating a perfect negative linear association, and can be as large as \(1\), indicating a perfect positive linear association. If the correlation is equal to 0, then no linear association exists. Values of the correlation near \(-1\) and \(1\) indicate strong associations and values of the correlation near \(0\) indicate weak associations. The correlation is often represented using the Greek letter rho (\(\rho\)).
The correlation between two variables is a numerical summary measure, varying between \(-1\) and \(1\), that indicates the strength of a linear association between the two variables. The linear trend is increasing when the correlation is positive and is decreasing when the correlation is negative. When the correlation is equal to 0 there is no linear trend.
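The text does not give a formula for the correlation. A common choice for computing it from paired observations, assumed here, is the Pearson sample correlation \(r = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}}\), which estimates \(\rho\). The following minimal sketch implements that formula directly; the function name `correlation` and the small data sets are made up for illustration.

```python
import math

def correlation(x, y):
    """Pearson sample correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(correlation(x, [2.0, 4.0, 6.0, 8.0, 10.0]))  # points on a rising line
print(correlation(x, [10.0, 8.0, 6.0, 4.0, 2.0]))  # points on a falling line
print(correlation(x, [2.0, 1.0, 4.0, 3.0, 5.0]))   # rising trend with scatter
```

The first two data sets lie exactly on a line, so they give the extreme values \(1\) and \(-1\). The third has an upward trend with some scatter and gives a value strictly between \(0\) and \(1\).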
The scatterplots shown in Figure \(\PageIndex{1}\) and the figures that follow graphically depict the strength of the association between two variables for various values of the correlation. In each of these plots, 100 observations from two variables were simulated with a linear association whose strength is indicated by the corresponding value of the correlation coefficient. In Figure \(\PageIndex{1}\) we observe what such a relationship may look like when \(\rho=0\). In this case the value of the correlation indicates that there is no linear relationship between the two variables. In examining the plot we can look for a trend, in this case a linear trend. The plot shown in Figure \(\PageIndex{1}\) shows essentially a cloud of points with no trend. For example, if someone tells us the value of the first variable, represented on the horizontal axis, then we would have essentially no information about the corresponding value of the second variable, represented on the vertical axis.
Figure \(\PageIndex{1}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=0.00\). (Public domain image created by Alan M. Polansky)
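The text does not say how the 100 observations in each plot were simulated. One common way to generate paired data with a chosen correlation, assumed here purely for illustration, is to mix a standard normal variable with independent standard normal noise. The sketch below uses the values of \(\rho\) that appear in this section's figure captions; the function name `simulate_pair` and the seed are hypothetical choices.

```python
import numpy as np

def simulate_pair(rho, n=100, seed=0):
    """Draw n (x, y) pairs of standard normals with population correlation rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    # y mixes x with independent noise; the weights keep Var(y) = 1
    # and make the population correlation between x and y equal to rho.
    y = rho * x + np.sqrt(1.0 - rho**2) * noise
    return x, y

# Values of rho taken from the figure captions in this section.
for rho in (0.0, 0.25, 0.40, 0.60, 0.75, 0.90, 0.99, 1.0):
    x, y = simulate_pair(rho)
    sample_r = np.corrcoef(x, y)[0, 1]
    print(f"rho = {rho:.2f}   sample correlation = {sample_r:.2f}")
```

With only 100 observations the sample correlation will generally differ somewhat from the \(\rho\) used to generate the data, which is why a simulated cloud can show a faint apparent trend even when \(\rho=0\).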
Now consider Figure \(\PageIndex{2}\), where we observe what such a relationship may look like when \(\rho=0.25\). In this plot we can observe a slight upward trend in the data cloud. That is, if the first variable is larger, then the second variable also tends to be larger. However, the plot also indicates that there would be some degree of uncertainty if we attempted to predict the second variable based on the value of the first variable. For example, if the first variable is equal to \(-1\) then we would expect the second variable to be roughly between \(-3\) and \(1\). Correspondingly, if the first variable is equal to \(2\) then we would expect the second variable to be roughly between \(0\) and \(3\). Finally, note that the upward trend is linear, meaning that the cloud of points could be visualized as being loosely clustered around a line.
Figure \(\PageIndex{2}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=0.25\). (Public domain image created by Alan M. Polansky)
Considering the relationship shown in Figure \(\PageIndex{3}\) we observe what paired data may look like when \(\rho=0.40\). In this plot we can observe a slight upward trend in the data cloud similar to what we observe in Figure \(\PageIndex{2}\). While the strength of the linear relationship is stronger as indicated by the value of the correlation,
Figure \(\PageIndex{3}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=0.40\). (Public domain image created by Alan M. Polansky)
Figure \(\PageIndex{5}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=0.60\). (Public domain image created by Alan M. Polansky)
Figure \(\PageIndex{6}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=0.75\). (Public domain image created by Alan M. Polansky)
Figure \(\PageIndex{7}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=0.90\). (Public domain image created by Alan M. Polansky)
Figure \(\PageIndex{8}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=0.99\). (Public domain image created by Alan M. Polansky)
Figure \(\PageIndex{9}\): Scatterplot of data from two variables having a linear association with correlation coefficient \(\rho=1.00\). (Public domain image created by Alan M. Polansky)


