7.2: Violent Crime, Wealth, Region
That was fun!
Let's now try this with two independent variables. We will model the violent crime rate in 2000 using the GSP per capita in 1990 and the region of the state. This will give us the opportunity to reiterate and emphasize that these methods are not constrained to numeric independent variables. As in the second Toy Example in Chapter 3, we can represent categorical independent variables appropriately and model using ordinary least squares estimation.
Model the 2000 violent crime rate using the 1990 GSP per capita and determine whether the relationship differs across the nine census regions.
If you find an appropriate model, estimate the expected 2000 violent crime rate for a state in the Pacific region with a GSP per capita of $50,000 in 1990. Also, predict the 2000 violent crime rate for that same state.
The Preamble
This preamble is mostly the same as the one in the last section, so only run these lines if you are starting a new session in R. However, there is one new piece. Because R (by default) now loads string data as character strings rather than as factors, we must explicitly tell read.csv to convert strings into factors.
### Preamble
library(KnoxStats)   # course support package (helper functions used below)
library(lawstat)     # provides runs.test
library(lmtest)      # provides bptest (Breusch-Pagan)

dt = read.csv("http://rur.kvasaheim.com/data/crime.csv", stringsAsFactors = TRUE)
attach(dt)           # make the variables accessible by name
summary(dt)          # quick overview of all variables
Note that there are many variables in this particular data set. Since we are modeling the violent crime rate in 2000 using the GSP per capita in 1990 and the state's region, we will only use the variables vcrime00, gspcap90, and census9.
Again, remember that back in the early part of this century, R used to read strings in the data as factors by default. However, starting with version 4.0.0, it changed the default behavior. So, if you are loading categorical variables and using them in your analysis, you should include stringsAsFactors = TRUE in your read.csv call.
Just be aware of this.
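As a quick sanity check, you can confirm that census9 loaded as a factor and peek at the 0/1 indicator (dummy) variables R builds from it behind the scenes. This is just an illustrative sketch; it is not a required step in the analysis.

is.factor(dt$census9)    # TRUE, because we used stringsAsFactors = TRUE
levels(dt$census9)       # the nine census regions
head( model.matrix(~ census9, data=dt) )   # the dummy columns R creates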
Finding the Model
The following creates the interaction model between a numeric and a categorical variable. This particular type of interaction analysis is referred to as the Analysis of Covariance, ANCOVA:
modEd1 = lm(vcrime00 ~ gspcap90 * census9)
summary.aov(modEd1)
The interaction model allows for the effect to vary between the levels. In terms of this problem, the interaction model allows the effect of the 1990 GSP per capita on the 2000 violent crime rate to be different in each of the nine census regions.
It does not force it to be different. It only allows it to be different.
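In symbols, the interaction model looks something like this (a sketch written with indicator variables for the regions, with the base region absorbed into the intercept):

\( \text{vcrime00}_i = \beta_0 + \beta_1\,\text{gspcap90}_i + \sum_{j=2}^{9} \gamma_j\,\mathbb{1}\left[\text{census9}_i = j\right] + \sum_{j=2}^{9} \delta_j\,\text{gspcap90}_i \cdot \mathbb{1}\left[\text{census9}_i = j\right] + \varepsilon_i \)

The \(\delta_j\) terms are what allow the slope on gspcap90 to differ by region. If every \(\delta_j = 0\), the slope is the same in all nine regions.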
· · ─ ·✶· ─ · ·
Because of the writings of a 14th-century friar by the name of William of Ockham, there is a bias in science toward creating models that are as simple as possible, without being too simple (his doctrine of efficient reasoning).
Non sunt multiplicanda entia sine necessitate.
The usual translation is "Things are not to be multiplied without necessity." In other words, simpler models tend to be more helpful than complicated ones. Realize that they are more "helpful" and not more "correct." To drive this point home, allow me to quote George E. P. Box (1976):
Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity. [emphasis mine]
Note, however, that this doctrine/belief did not originate with Billy. It goes back to — at least — Aristotle in his Posterior Analytics:
"We may assume the superiority ceteris paribus of the demonstration which derives from fewer postulates or hypotheses."
▪──── ⚔ ────▪
The results from the above code are given here:

                 Df  Sum Sq Mean Sq F value   Pr(>F)
gspcap90          1  816163  816163  26.149 1.32e-05 ***
census9           8  794820   99353   3.183   0.0087 **
gspcap90:census9  8  273868   34233   1.097   0.3901
Residuals        33 1029986   31212
Note that the p-value (last column) for the interaction term (gspcap90:census9) is greater than our usual \(\alpha = 0.05\) (\(\mathrm{p-value} = 0.3901\)). This tells us that the interaction term is not statistically significant. In other words, we can remove it from our model without adversely affecting our understanding of the dependent variable.
When the model has no interaction terms, it is called an additive model. The following code fits the additive model.
modEd2 = lm(vcrime00 ~ gspcap90 + census9)
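As an aside (a sketch of an equivalent route, not a required step), you can reach the same conclusion about the interaction term by comparing the two nested models directly with a partial F-test:

anova(modEd2, modEd1)   # partial F-test of the eight interaction terms

The F-statistic and p-value from this comparison match the gspcap90:census9 line in the table above.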
Testing the Requirements
As usual, the next step is to test the assumptions. Note that this process is the same, even if the particulars differ. The census9 variable is categorical, not numeric, which requires us to think a bit more about how to perform the assumption testing.
Normality
Again, the first assumption to test is the Normality of the residuals:
e = residuals(modEd2)

# Normality checking
overlay(e)
shapiroTest(e)
According to the Shapiro-Wilk test there is no significant evidence that the residuals come from a non-Normal distribution (\(\mathrm{p-value}=0.1732\)). Thus, the model passes this test. ✅
Constant Expected Value (Functional Form)
The second requirement we test is that the expected value of the residuals is constant against each of the independent variables.
# Expected Value
plot(gspcap90, e)
runs.test(e, order=gspcap90)
The runs test indicates that there is no evidence the expected values are not constant (\(\mathrm{p-value}=0.2038\)). Thus, this test is passed, too. ✅
Yippee!
But wait! This only tested for constant expected value of the residuals against one of the two independent variables. It is required that the expected value is constant against all of them.
Here is the problem with just using the runs test when the independent variable is categorical: the ordering within each level is not uniquely defined. As always, graph the data and think about what the graphic is telling you. Look at the residuals plot.
plot(census9,e)
Holy side-by-side box-and-whiskers plot, Batman! This makes sense, because we are plotting a numeric variable (e) across nine levels. We need to test that the expected value (mean) is the same in each group.
From your prior statistics class, this screams ANOVA!
summary(aov(e~census9))
The p-value returned is \(1\), which is greater than \(\alpha\). So we fail to reject the null hypothesis of equal expected values.
Thus, this requirement is entirely passed. (sigh of relief) ✅
Well, this result should not be surprising. Because of the mathematics of OLS, the means in each group will be centered at zero. Thus, you should expect p-values of 1 whenever doing this test.
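For the curious, here is a sketch of why: the OLS normal equations force the residual vector to be orthogonal to every column of the design matrix, and each region contributes a 0/1 indicator column (the base region follows from the intercept column). Thus the residuals must sum, and hence average, to zero within every region:

\( X^\top \mathbf{e} = \mathbf{0} \quad \Longrightarrow \quad \sum_{i:\ \text{census9}_i = j} e_i = 0 \ \text{ for each region } j \)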
Constant Variance
The last requirement is that the residuals have a constant variance against each of the independent variables. For the numeric variable, this is not a problem:
# Heteroskedasticity
plot(gspcap90, e)
hetero.test(e, gspcap90)
With a p-value of \(0.7008\) for the heteroskedasticity test, this test is passed for the numeric independent variable.
For the categorical variable, remember that we need to test equality of variance across several groups. From your previous statistics course, you may recall that the Fligner-Killeen test does just this:
plot(census9, e)
fligner.test(e ~ census9)
According to the Fligner-Killeen test, there is no evidence of heteroskedasticity (\(\mathrm{p-value}=0.7869\)).
Thus, this model passes the homoskedastic requirement. Vahooo! ✅
We could have taken care of both tests for heteroskedasticity using the Breusch-Pagan test. However, if the model fails that test, we would have no clue as to how to fix it. Breaking it into two parts allows us the additional information of which variable caused the issues.
Personally, I run the Breusch-Pagan test to determine if there is a violation, then the separate tests to get information on where the issue lies.
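For reference, the omnibus check is a single line; bptest comes from the lmtest package we loaded in the preamble:

bptest(modEd2)   # Breusch-Pagan test across all independent variables

If its p-value falls below \(\alpha\), run the variable-by-variable tests above to locate the offending variable.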
Multicollinearity
The last thing to check is multicollinearity. Let us use the vif function from the car package (attach it with library(car) if it is not already loaded).
vif(modEd2)
This gives output
             GVIF Df GVIF^(1/(2*Df))
gspcap90 1.226757  1        1.107591
census9  1.226757  8        1.012855
So, what does this output mean?
Focus on the GVIF^(1/(2*Df)) column. This holds the VIF values, adjusted for dimensionality. Since both are below our usual cut-off of 5, there is no multicollinearity concern. ✅
If we had used the interaction model, we would have run
vif(modEd1, type="predictor")
That version also adjusts for interaction terms.
In reality, it never hurts to run the version that includes type="predictor". If there are no interaction terms, then it returns exactly the same thing as the default type="terms".
The Final Model
This model is appropriate, and we can now see the estimates. However, if we are in the "model creation" or "model selection" mode, we need to determine if both variables are statistically significant. If either is not, then that variable needs to be dropped and the new model tested.
To get p-values for the variables, just run
summary.aov(modEd2)
Yeppers, that is summary.aov that you are using. It provides statistical significance of the variables, while summary.lm provides the statistical significance of the levels of the categorical variables.
The output from the summary.aov(modEd2) command is
            Df  Sum Sq Mean Sq F value   Pr(>F)
gspcap90     1  816163  816163  25.664 9.07e-06 ***
census9      8  794820   99353   3.124  0.00753 **
Residuals   41 1303854   31801
The p-value for the gspcap90 variable is less than \(\alpha\), so that variable has a significant effect on the violent crime rate. The p-value for the census9 variable is also less than \(\alpha\), so it, too, needs to be included in the final model.
In short, this is the model we can use. The abbreviated regression table from this model is
Coefficients:
Estimate Pr(>|t|)
(Intercept) 1.125e+02 0.308
gspcap90 1.550e-02 6.7e-05 ***
census9East South Central 7.518e+01 0.535
census9Middle Atlantic -6.777e+01 0.608
census9Mountain -1.707e+02 0.101
census9New England -1.708e+02 0.122
census9Pacific -1.316e+02 0.263
census9South Atlantic 1.590e+02 0.125
census9West North Central -4.953e+01 0.638
census9West South Central 1.197e+02 0.323
From this, we can conclude that there is a statistically significant, and positive(!), effect of average state wealth on the violent crime rate. We get that conclusion from the gspcap90 line in the table.
The rest of the table compares the effect of each level of the census9 variable to the base category, East North Central. As no p-value is less than \(\alpha\), we cannot conclude that any of the regions is statistically different in its effect from the East North Central region.
What about when compared to the Mountain region?
Changing the Base Category
In other words, how do we change the base level? First, we have to specify that we want the Mountain region to be the base category against which everything else is calculated. Then, we need to re-fit the model with the new base.
census9 = set.base(census9, "Mountain")
modEd3  = lm(vcrime00 ~ gspcap90 + census9)
summary(modEd3)
The regression table now indicates that the violent crime rate in the Mountain region is significantly lower than that in the East South Central, South Atlantic, and West South Central regions. Here is the regression table:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.209772 96.710953 -0.602 0.550559
gspcap90 0.015502 0.003493 4.438 6.7e-05 ***
census9East North Central 170.671828 101.668158 1.679 0.100815
census9East South Central 245.850643 109.804042 2.239 0.030645 *
census9Middle Atlantic 102.903945 121.830985 0.845 0.403211
census9New England -0.152624 96.646604 -0.002 0.998748
census9Pacific 39.030098 105.450428 0.370 0.713193
census9South Atlantic 329.709936 89.146396 3.699 0.000637 ***
census9West North Central 121.136876 92.324591 1.312 0.196793
census9West South Central 290.395590 109.258299 2.658 0.011159 *
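An aside: set.base is a helper from the course package (at least, that is my understanding); if it is unavailable, base R's relevel function does the same job:

# Base-R equivalent of set.base: make "Mountain" the reference level
census9 = relevel(census9, ref="Mountain")
modEd3  = lm(vcrime00 ~ gspcap90 + census9)
summary(modEd3)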
Question: How does the violent crime rate in the different regions compare to the Pacific region?
census9 = set.base(census9, "Pacific")
modEd3  = lm(vcrime00 ~ gspcap90 + census9)
summary(modEd3)
The violent crime rate in the Pacific region is significantly lower than that in the South Atlantic and the West South Central regions.
Question: How do the regions compare to the South Atlantic region?
census9 = set.base(census9, "South Atlantic")
modEd3  = lm(vcrime00 ~ gspcap90 + census9)
summary(modEd3)
The violent crime rate in the South Atlantic region is significantly higher than in the Mountain, New England, Pacific, and West North Central regions.
Please remember the multiple comparisons issue: these individual analyses are only valid if you perform just one of them. Performing multiple comparisons requires adjusting the alpha level.
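One adjusted option (a sketch, and only one of several reasonable choices) is Tukey's honest significant difference procedure, which corrects for all pairwise region comparisons at once:

# All pairwise region comparisons, with a multiplicity adjustment
modAov = aov(vcrime00 ~ gspcap90 + census9)
TukeyHSD(modAov, which="census9")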
Creating the Graphic
The following code generates the graphic in Figure \(\PageIndex{1}\), below.
par(mar=c(4,4,0,1)+0.5, family="serif", las=1)
par(xaxs="i", yaxs="i")
par(cex.lab=1.2, font.lab=2)

plot.new()
plot.window( xlim=c(0,75), ylim=c(0,2500) )
axis(1); axis(2)
title(xlab="GSP per Capita (1990) [$000]", line=2.75)
title(ylab="Violent Crime Rate (2000)", line=3.5)

xx = seq(15,70)*1000
yyPac = predict(modEd2, newdata=data.frame(gspcap90=xx, census9="Pacific"))
yyMtn = predict(modEd2, newdata=data.frame(gspcap90=xx, census9="Mountain"))
yySAt = predict(modEd2, newdata=data.frame(gspcap90=xx, census9="South Atlantic"))

lines(xx/1000, yyPac, col="steelblue", lwd=2)
lines(xx/1000, yyMtn, col="gold", lwd=2)
lines(xx/1000, yySAt, col="pink", lwd=2)

points(gspcap90/1000, vcrime00, pch=21, bg="lightsteelblue")
It would be helpful to have a legend, but let us leave that for another day!
Also, you need to be able to determine what each line of this script does.
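If you do want a legend today, here is a minimal sketch; the placement and labels are my choices, not part of the original script:

# Add a legend matching the three prediction lines
legend("topleft",
       legend = c("Pacific", "Mountain", "South Atlantic"),
       col    = c("steelblue", "gold", "pink"),
       lwd    = 2, bty = "n")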
Confidence Interval for Slope
We can obtain confidence intervals for the effects using the same method as before. The interpretation follows the same rules, but the table is much bigger:
confint(modEd3)
We are 95% confident that the effect of the GSP per capita on the violent crime rate is between 8 and 23 additional violent crimes (per 100,000 population) for every $1,000 increase in GSP per capita.
The confidence intervals for the effects of each of the levels are relative to the base level. Thus, we are 95% confident that the average violent crime rate in the West North Central region is between 21 and 396 lower than that in the South Atlantic region (the base level).
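To produce those rescaled slope numbers yourself (a quick sketch), multiply the gspcap90 row of the confint output by 1000:

# Effect of a $1,000 increase in GSP per capita
confint(modEd3)["gspcap90", ] * 1000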
Confidence Interval for Y
Again, we can estimate the expected violent crime rate, given the GSP per capita and the region.
predict(modEd2, newdata=data.frame(gspcap90=50000, census9="Pacific"), interval="confidence")
We are 95% confident that the expected violent crime rate in a Pacific-region state with a GSP per capita of $50,000 is between 537 and 975, with a point estimate of 756 violent crimes per 100,000 people.
Prediction Interval for Y
Finally, we can also calculate a prediction interval for a new observation:
predict(modEd2, newdata=data.frame(gspcap90=50000, census9="Pacific"), interval="prediction")
We are 95% sure that the violent crime rate for a new observation of a Pacific-region state with a GSP per capita of $50,000 is between 334 and 1177, with a best guess of 756 violent crimes per 100,000 people.
As is always the case, the width of the prediction interval is larger than the width of the confidence interval. Remember why this is the case.
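For reference, here is a sketch of the standard formulas. For a state with predictor values \(x_0\), the two intervals are

\( \widehat{Y}_0 \pm t^{*} \sqrt{\hat{\sigma}^2\, x_0^\top (X^\top X)^{-1} x_0} \quad \text{(confidence interval)} \)

\( \widehat{Y}_0 \pm t^{*} \sqrt{\hat{\sigma}^2 \left(1 + x_0^\top (X^\top X)^{-1} x_0\right)} \quad \text{(prediction interval)} \)

The extra \(\hat{\sigma}^2\) inside the prediction interval is the variability of an individual state around the regional mean, which is why the prediction interval is always the wider of the two.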


