25 Modeling continuous relationships in R
25.1 Computing covariance and correlation (Section 24.3)
Let’s first look at our toy example of covariance and correlation. We start by generating a set of X values.
Then we create a related Y variable by adding some random noise to the X variable:
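A minimal sketch of this setup, in the tidyverse style used throughout the book (the sample size, seed, and noise level here are illustrative choices, not necessarily the book's own):

```r
library(dplyr)

set.seed(123456)  # illustrative seed, for reproducibility

# generate a set of X values, then create a related Y variable
# by adding Gaussian random noise to X
df <- tibble(x = round(100 * runif(10))) %>%
  mutate(y = round(x + 10 * rnorm(10)))
```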
We compute the deviations and multiply them together to get the crossproduct:
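Continuing with the `df` data frame generated above, one way to sketch this step:

```r
# deviation of each score from its variable's mean, and the
# product of the two deviations (the crossproduct) per observation
df <- df %>%
  mutate(
    x_dev = x - mean(x),
    y_dev = y - mean(y),
    crossproduct = x_dev * y_dev
  )
```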
And then we compute the covariance and correlation:
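Still working with `df` from above, the covariance is the sum of crossproducts divided by N - 1, and the correlation rescales the covariance by the two standard deviations:

```r
# covariance: sum of crossproducts, divided by N - 1
covXY <- sum(df$crossproduct) / (nrow(df) - 1)

# correlation: covariance divided by the product of the standard deviations
corXY <- covXY / (sd(df$x) * sd(df$y))

# these should match R's built-in functions
cov(df$x, df$y)
cor(df$x, df$y)
```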
25.2 Hate crime example
Now we will look at the hate crime data from the fivethirtyeight package. First we need to prepare the data by getting rid of NA values and creating abbreviations for the states. To do the latter, we use the state.name and state.abb variables that come with R, along with the match() function that will match the state names in the hate_crimes variable to those in the list.
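A sketch of this preprocessing, followed by the one-tailed Pearson test whose output is shown below (the variable names hateCrimes and state_abb are illustrative; the book's actual code may differ):

```r
library(dplyr)
library(tidyr)
library(fivethirtyeight)  # provides the hate_crimes data frame

hateCrimes <- hate_crimes %>%
  # match() returns, for each state name, its position in state.name;
  # we use that position to index the matching entry in state.abb
  mutate(state_abb = state.abb[match(state, state.name)]) %>%
  drop_na(avg_hatecrimes_per_100k_fbi, gini_index)

# the District of Columbia does not appear in state.name,
# so its abbreviation must be filled in by hand
hateCrimes <- hateCrimes %>%
  mutate(state_abb = ifelse(state == "District of Columbia", "DC", state_abb))

# one-tailed test of the correlation between hate crime rate
# and income inequality (Gini index)
cor.test(
  hateCrimes$avg_hatecrimes_per_100k_fbi,
  hateCrimes$gini_index,
  alternative = "greater"
)
```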
## 
##  Pearson's product-moment correlation
## 
## data:  hateCrimes$avg_hatecrimes_per_100k_fbi and hateCrimes$gini_index
## t = 3, df = 48, p-value = 0.001
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.21 1.00
## sample estimates:
##  cor 
## 0.42
Remember that we can also compute the p-value using randomization. To do this, we shuffle the order of one of the variables, so that we break the link between the X and Y variables — effectively making the null hypothesis (that the correlation is less than or equal to zero) true. Here we will first create a function that takes in two variables, shuffles the order of one of them (without replacement) and then returns the correlation between that shuffled variable and the original copy of the second variable.
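A sketch of such a function (the name shuffleCorr is illustrative):

```r
# shuffle the order of x (sampling without replacement) and return
# its correlation with the intact y variable
shuffleCorr <- function(x, y) {
  x_shuffled <- sample(x)  # sample() with no size argument permutes x
  cor(x_shuffled, y)
}
```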
Now we take the distribution of correlations obtained after shuffling and compare it to our observed correlation, in order to obtain the empirical probability of our observed data under the null hypothesis.
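Continuing with the hateCrimes data frame and the shuffleCorr() function sketched above (the number of shuffles and the seed are illustrative):

```r
set.seed(123456)  # illustrative seed

# build a null distribution of correlations from 2500 shuffles
shuffleDist <- replicate(
  2500,
  shuffleCorr(hateCrimes$avg_hatecrimes_per_100k_fbi, hateCrimes$gini_index)
)

corObserved <- cor(
  hateCrimes$avg_hatecrimes_per_100k_fbi,
  hateCrimes$gini_index
)

# empirical p-value: proportion of shuffled correlations
# at least as large as the observed correlation
mean(shuffleDist >= corObserved)
```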
##  0.0066
This value is fairly close (though a bit larger) to the one obtained using cor.test() above.
25.3 Robust correlations (Section 24.3.2)
In the previous chapter we also saw that the hate crime data contained one substantial outlier, which appeared to drive the significant correlation. The Spearman correlation provides a robust alternative. To compute it, we first need to convert the data into their ranks, which we can do using the rank() function:
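Assuming the hateCrimes data frame prepared earlier (the rank variable names are illustrative):

```r
library(dplyr)

# convert each variable to its ranks
hateCrimes <- hateCrimes %>%
  mutate(
    hatecrimes_rank = rank(avg_hatecrimes_per_100k_fbi),
    gini_rank = rank(gini_index)
  )
```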
We can then compute the Spearman correlation by applying the Pearson correlation to the rank variables:
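Using the rank variables created above:

```r
# Pearson correlation applied to the ranks is the Spearman correlation
cor(hateCrimes$hatecrimes_rank, hateCrimes$gini_rank)

# equivalently, cor() can compute it directly from the raw data
cor(
  hateCrimes$avg_hatecrimes_per_100k_fbi,
  hateCrimes$gini_index,
  method = "spearman"
)
```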
##  0.057
We see that this is much smaller than the value obtained using the Pearson correlation on the original data. We can assess its statistical significance using randomization:
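Reusing the shuffleCorr() function sketched earlier, now applied to the rank variables (shuffle count and seed again illustrative):

```r
set.seed(123456)  # illustrative seed

# null distribution for the rank-based correlation
shuffleDistRank <- replicate(
  2500,
  shuffleCorr(hateCrimes$hatecrimes_rank, hateCrimes$gini_rank)
)

corRankObserved <- cor(hateCrimes$hatecrimes_rank, hateCrimes$gini_rank)

# empirical one-tailed p-value
mean(shuffleDistRank >= corRankObserved)
```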
##  0.0014
Here we see that the p-value is substantially larger and far from significance.