# 6.1: Analysis of Correlation

To start with relationships, one need first to find a correlation, e.g., to measure the extent and sign of relation, and to prove if this is statistically reliable.

Note that correlation does not reflect the nature of relationship (Figure $$\PageIndex{1}$$). If we find a significant correlation between variables, this could mean that A depends on B, B depends on A, A and B depend on each other, or A and B depend on a third variable C but have no relation to each other. A famous example is the correlation between ice cream sales and home fires. It would be strange to suggest that eating ice cream causes people to start fires, or that experiencing fires causes people to buy ice cream. In fact, both of these parameters depend on air temperature$$^{[1]}$$.

Figure $$\PageIndex{1}$$ Correlation and causation (taken from XKCD, http://xkcd.com/552/).

Numbers alone could be misleading, so there is a simple rule: plot it first.

### Plot it first

The most striking example of relationships where numbers alone do to provide a reliable answer, is the Anscombe’s quartet, four sets of two variables which have almost identical means and standard deviations:

Code $$\PageIndex{1}$$ (Python):

classic.desc <- function(.x) {c(mean=mean(.x, na.rm=TRUE),var=var(.x, na.rm=TRUE))}
sapply(anscombe, classic.desc)


(Data anscombe is embedded into R. To compact input and output, several tricks were used. Please find them yourself.)

Linear model coefficients (see below) are also quite similar but if we plot these data, the picture (Figure $$\PageIndex{2}$$) is radically different from what is reflected in numbers:

Code $$\PageIndex{2}$$ (Python):

a.vars <- data.frame(i=c(1, 5), ii=c(2, 6), iii=c(3, 7), iv=c(4, 8))
oldpar <- par(mfrow=c(2, 2), mar=c(4, 4, 1, 1))
for (i in 1:4) { plot(anscombe[a.vars[, i]], pch=19, cex=1.2); abline(lm(anscombe[rev(a.vars[, i])]), lty=2) }


(For aesthetic purposes, we put all four plots on the same figure. Note the for operator which produces cycle repeating one sequence of commands four times. To know more, check ?"for".)

To the credit of nonparametric and/or robust numerical methods, they are not so easy to deceive:

Code $$\PageIndex{3}$$ (Python):

robust.desc <- function(.x) {c(median=median(.x, na.rm=TRUE),IQR=IQR(.x, na.rm=TRUE), mad=mad(.x, na.rm=TRUE))}
sapply(anscombe, robust.desc)


Figure $$\PageIndex{2}$$ Anscombe’s quartet, plotted together with lines from linear models.

This is correct to guess that boxplots should also show the difference. Please try to plot them yourself.

### Correlation

To measure the extent and sign of linear relationship, we need to calculate correlation coefficient. The absolute value of the correlation coefficient varies from 0 to 1. Zero means that the values of one variable are unconnected with the values of the other variable. A correlation coefficient of $$1$$ or $$-1$$ is an evidence of a linear relationship between two variables. A positive value of means the correlation is positive (the higher the value of one variable, the higher the value of the other), while negative values mean the correlation is negative (the higher the value of one, the lower of the other).

It is easy to calculate correlation coefficient in R:

Code $$\PageIndex{4}$$ (Python):

cor(5:15, 7:17)
cor(5:15, c(7:16, 23))


(By default, R calculates the parametric Pearson correlation coefficient $$r$$.)

In the simplest case, it is given two arguments (vectors of equal length). It can also be called with one argument if using a matrix or data frame. In this case, the function cor() calculates a correlation matrix, composed of correlation coefficients between all pairs of data columns.

Code $$\PageIndex{5}$$ (Python):

cor(trees)


As correlation is in fact the effect size of covariance, joint variation of two variables, to calculate it manually, one needs to know individual variances and variance of the difference between variables:

Code $$\PageIndex{6}$$ (Python):

with(trees, cor(Girth, Height))
(v1 <- var(trees$Girth)) (v2 <- var(trees$Height))
(v12 <- var(trees$Girth - trees$Height))
(pearson.r <- (v1 + v2 - v12)/(2*sqrt(v1)*sqrt(v2)))


Another way is to use cov() function which calculates covariance directly:

Code $$\PageIndex{7}$$ (Python):

with(trees, cov(Girth, Height)/(sd(Girth)*sd(Height)))


To interpret correlation coefficient values, we can use either symnum() or Topm() functions (see below), or Mag() together with apply():

Code $$\PageIndex{8}$$ (Python):

noquote(apply(cor(trees), 1:2, function(.x) Mag(.x, squared=FALSE)))  # asmisc.r


If the numbers of observations in the columns are unequal (some columns have missing data), the parameter use becomes important. Default is everything which returns NA whenever there are any missing values in a dataset. If the parameter use is set to complete.obs, observations with missing data are automatically excluded. Sometimes, missing data values are so dispersed that complete.obs will not leave much of it. In that last case, use pairwise.complete.obs which removes missing values pair by pair.

Pearson’s parametric correlation coefficients characteristically fail with the Anscombe’s data:

Code $$\PageIndex{9}$$ (Python):

diag(cor(anscombe[, 1:4], anscombe[, 5:8]))


To overcome the problem, one can use Spearman’s $$\rho$$ (“rho”, or rank correlation coefficient) which is most frequently used nonparametric correlation coefficient:

Code $$\PageIndex{10}$$ (Python):

with(trees, cor(Girth, Height, method="spearman"))
diag(cor(anscombe[, 1:4], anscombe[, 5:8], method="s"))


(Spearman’s correlation is definitely more robust!)

The third kind of correlation coefficient in R is nonparametric Kendall’s $$\tau$$ (“tau”):

Code $$\PageIndex{11}$$ (Python):

with(trees, cor(Girth, Height, method="k"))
diag(cor(anscombe[, 1:4], anscombe[, 5:8], method="k"))


It is often used to measure association between two ranked or binary variables, i.e. as an alternative to effect sizes of the association in contingency tables.

How to check if correlation is statistically significant? As a null hypothesis, we could accept that correlation coefficient is equal to zero (no correlation). If the null is rejected, then correlation is significant:

Code $$\PageIndex{12}$$ (Python):

with(trees, cor.test(Girth, Height))


The logic of cor.test() is the same as in tests before (Table 5.1.1, Figure 5.1.1). In terms of p-value:

The probability of obtaining the test statistic (correlation coefficient), given the initial assumption of zero correlation between the data is very low—about 0.3%. We would reject H$$_0$$ and therefore accept an alternative hypothesis that correlation between variables is present. Please note the confidence interval, it indicates here that the true value of the coefficient lies between 0.2 and 0.7. with 95% probability.

It is not always easy to read the big correlation table, like in the following example of longley macroeconomic data. Fortunately, there are several workarounds, for example, the symnum() function which replaces numbers with letters or symbols in accordance to their value:

Code $$\PageIndex{13}$$ (Python):

symnum(cor(longley))


The second way is to represent the correlation matrix with a plot. For example, we may use the heatmap: split everything from $$-1$$ to $$+1$$ into equal intervals, assign the color for each interval and show these colors (Figure $$\PageIndex{3}$$):

Code $$\PageIndex{14}$$ (Python):

cor.l <- cor(longley)
dimnames(cor.l) <- lapply(dimnames(cor.l), abbreviate)
palette.l <- rgb.palette(length(unique(abs(cor.l))))
library(lattice)
levelplot(abs(cor.l), col.regions=palette.l, xlab="", ylab="")


(We shortened here long names with the abbreviate() command.)

The other interesting way of representing correlations are correlation ellipses (from ellipse package). In that case, correlation coefficients are shown as variously compressed ellipses; when coefficient is close to $$-1$$ or $$+1$$, ellipse is more narrow (Figure $$\PageIndex{4}$$). The slope of ellipse represents the sign of correlation (negative or positive):

Code $$\PageIndex{15}$$ (Python):

library(ellipse)
colors <- cm.colors(7)
plotcorr(cor.l, type="lower", col=colors[5*cor.l + 2])


Several useful ways to visualize and analyze correlations present in the asmisc.r file supplied with this book:

Code $$\PageIndex{16}$$ (Python):

tox <- read.table("data/poisoning.txt", h=TRUE)
tox.cor <- cor(tox, method="k")


We calculated here Kendall’s correlation coefficient for the binary toxicity data to make the picture used on the title page. Pleiad() not only showed (Figure $$\PageIndex{5}$$) that illness is associated with tomato and Caesar salad, but also found two other correlation pleiads: coffee/rice and crab dip/crisps. (By the way, pleiads show one more application of R: analysis of networks.)

Figure $$\PageIndex{3}$$ Heatmap: graphical representation of the correlation matrix.

Function Cor() outputs correlation matrix together with asterisks for the significant correlation tests:

Code $$\PageIndex{17}$$ (Python):

tox <- read.table("data/poisoning.txt", h=TRUE)
Cor(tox, method="kendall", dec=2) # asmisc.r


Figure $$\PageIndex{4}$$ Correlation coefficients as ellipses.

Finally, function Topm() shows largest correlations by rows:

Code $$\PageIndex{18}$$ (Python):

tox <- read.table("data/poisoning.txt", h=TRUE)
tox.cor <- cor(tox, method="k")
Topm(tox.cor, level=0.4) # asmisc.r


Data file traits.txt contains results of the survey where most genetically apparent human phenotype characters were recorded from many individuals. Explanation of these characters are in trait_c.txt file. Please analyze this data with correlation methods.

Figure $$\PageIndex{5}$$ Correlation pleiads for the toxicity data.

### References

1. There are, however, advanced techniques with the goal to understand the difference between causation and correlation: for example, those implemented in bnlearn package.