Statistics LibreTexts

7.2: Violent Crime, Wealth, Region


    That was fun!

    Let's now try this with two independent variables. We will model the violent crime rate in 2000 using the GSP per capita in 1990 and the region of the state. This will give us the opportunity to reiterate and emphasize that these methods are not constrained to numeric independent variables. As in the second Toy Example in Chapter 3, we can represent categorical independent variables appropriately and model using ordinary least squares estimation.

    The Example

    Model the 2000 violent crime rate using the 1990 GSP per capita and determine whether the relationship differs across the nine census regions.

    If you find an appropriate model, estimate the expected 2000 violent crime rate for a state in the Pacific region with a GSP per capita of $50,000 in 1990. Also, predict the 2000 violent crime rate for that same state.

    The Preamble

    This preamble is mostly the same as the one in the last section, so only run these lines if you are starting a new session in R. However, there is a new piece: because R (by default) now loads string data as character strings rather than as factors, we include stringsAsFactors = TRUE in the read.csv call.

    ### Preamble
    library(KnoxStats)
    library(lawstat)
    library(lmtest)
    
    dt = read.csv("http://rur.kvasaheim.com/data/crime.csv", stringsAsFactors = TRUE)
    attach(dt)
    
    summary(dt)
    

    Note that there are many variables in this particular data set. Since we are modeling the violent crime rate in 2000 using the GSP per capita in 1990 and the state's region, we will only use the variables vcrime00, gspcap90, and census9.

    Caution

    Again, remember that back in the early part of this century, R read strings in data files as factors by default. However, as of version 4.0.0 (released in 2020), that default changed. So, if you are loading categorical variables and using them in your analysis, you should include stringsAsFactors = TRUE in your read.csv call.

    Just be aware of this.
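    To see the difference concretely, here is a tiny self-contained sketch (with a hypothetical two-row data frame, not the crime data):

```r
# With modern defaults, string columns stay as character vectors:
tmp <- data.frame(region = c("Pacific", "Mountain"), stringsAsFactors = FALSE)
class(tmp$region)        # "character"

# Converting after the fact works just as well as stringsAsFactors = TRUE:
tmp$region <- factor(tmp$region)
class(tmp$region)        # "factor" -- now levels() and releveling are available
```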

    Finding the Model

    The following creates the interaction model between a numeric and a categorical variable. This particular type of interaction analysis is referred to as the Analysis of Covariance, ANCOVA:

    modEd1 = lm(vcrime00 ~ gspcap90 * census9)
    summary.aov(modEd1)
    

    The interaction model allows the effect to vary between the levels. In terms of this problem, the interaction model allows the effect of the 1990 GSP per capita on the 2000 violent crime rate to differ across each of the nine census regions.

    It does not force it to be different. It only allows it to be different.
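    In symbols: let \(R_2, \dots, R_9\) be indicator variables for eight of the nine regions (the base region is absorbed into the intercept). The interaction model is

    \[ \mathbb{E}[\text{vcrime00}] = \beta_0 + \beta_1\,\text{gspcap90} + \sum_{j=2}^{9} \gamma_j R_j + \sum_{j=2}^{9} \delta_j \left(\text{gspcap90} \cdot R_j\right) \]

    so the slope for region \(j\) is \(\beta_1 + \delta_j\). If all the \(\delta_j\) happen to be zero, the slopes coincide; the model allows, but does not force, them to differ.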

    · · ─ ·✶· ─ · ·

    [Image: George E. P. Box (Source: Wikipedia)]

    Because of the writings of a 14th-century friar by the name of William of Ockham, there is a bias in science toward creating models that are as simple as possible, without being too simple (his principle of parsimony).

    Non sunt multiplicanda entia sine necessitate.

    The usual translation is "Entities are not to be multiplied without necessity." In other words, simpler models tend to be more helpful than complicated ones. Realize that they are more "helpful" and not more "correct." To drive this point home, allow me to quote George E. P. Box (1976):

    Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity. [emphasis mine]

    Note, however, that this doctrine/belief did not originate with Billy. It goes back to — at least — Aristotle in his Posterior Analytics:

    "We may assume the superiority ceteris paribus of the demonstration which derives from fewer postulates or hypotheses."

    ▪──── ⚔ ────▪

    The results from the above code are given here

                     Df  Sum Sq Mean Sq F value   Pr(>F)
    gspcap90          1  816163  816163  26.149 1.32e-05 ***
    census9           8  794820   99353   3.183   0.0087 **
    gspcap90:census9  8  273868   34233   1.097   0.3901
    Residuals        33 1029986   31212
    

    Note that the p-value (last column) for the interaction term (gspcap90:census9) is greater than our usual \(\alpha = 0.05\) (\(\mathrm{p-value} = 0.3901\)). This tells us that the interaction term is not statistically significant. In other words, we can remove it from our model without adversely affecting our understanding of the dependent variable.

    When the model has no interaction terms, it is called an additive model. The following code fits the additive model.

    modEd2 = lm(vcrime00 ~ gspcap90 + census9)
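    In symbols, the additive model keeps a single common slope with region-specific intercepts. With \(R_j\) again an indicator for region \(j\), it is

    \[ \mathbb{E}[\text{vcrime00}] = \beta_0 + \beta_1\,\text{gspcap90} + \sum_{j=2}^{9} \gamma_j R_j \]

    That is, the nine regression lines are parallel; only their heights differ.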
    

    Testing the Requirements

    As usual, the next step is to test the assumptions. Note that this process is the same, even if the particulars differ. The census9 variable is categorical, not numeric. That requires we think a bit more about how to perform the assumption testing.

    Normality

    Again, the first assumption to test is the Normality of the residuals:

    e = residuals(modEd2)
    
    # Normality checking
    overlay(e)
    shapiroTest(e)
    

    According to the Shapiro-Wilk test there is no significant evidence that the residuals come from a non-Normal distribution (\(\mathrm{p-value}=0.1732\)). Thus, the model passes this test. ✅

    Constant Expected Value (Functional Form)

    The second requirement we test is that the expected value of the residuals is constant against each of the independent variables.

    # Expected Value
    plot(gspcap90, e)
    runs.test(e, order=gspcap90)
    

    The runs test indicates that there is no evidence the expected values are not constant (\(\mathrm{p-value}=0.2038\)). Thus, this test is passed, too. ✅

    Yippee!

    But wait! This only tested for constant expected value of the residuals against one of the two independent variables. It is required that the expected value is constant against all of them.

    Here is the problem with just using the runs test when the independent variable is categorical: The ordering within each level is not uniquely defined. For a hint, always graph and think about what the graphic is telling you. Look at the residuals plot.

    plot(census9,e)
    

    Holy side-by-side box-and-whiskers plot, Batman! This makes sense, because we are plotting a numeric variable (e) across nine levels. We need to test that the expected value (mean) is the same in each group.

    From your prior statistics class, this screams ANOVA!

    summary(aov(e~census9))
    

    The p-value returned is \(1\), which is greater than \(\alpha\). So we fail to reject the null hypothesis of equal expected values.

    Thus, this requirement is entirely passed. (sigh of relief)

    Caution

    Well, this result should not be surprising. Because of the mathematics of OLS, the means in each group will be centered at zero. Thus, you should expect p-values of 1 whenever doing this test.
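    You can verify this directly with the residuals already computed; each group mean should be zero to machine precision:

```r
# Group means of the OLS residuals across the nine regions.
# Because the census9 indicators are in the model, each mean is
# (numerically) zero by construction.
round(tapply(e, census9, mean), 10)
```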

    Constant Variance

    The last requirement is that the residuals have a constant variance against each of the independent variables. For the numeric variable, this is not a problem:

    # Heteroskedasticity
    plot(gspcap90,e)
    hetero.test(e,gspcap90)
    

    With a p-value of \(0.7008\) for the heteroskedasticity test, this test is passed for the numeric independent variable.

    For the categorical variable, remember that we need to test equality of variance across several groups. From your previous statistics course, you may recall that the Fligner-Killeen test does just this:

    plot(census9,e)
    fligner.test(e~census9)
    

    According to the Fligner-Killeen test, there is no evidence of heteroskedasticity (\(\mathrm{p-value}=0.7869\)).

    Thus, this model passes the homoskedastic requirement. Vahooo!

    Note

    We could have taken care of both tests for heteroskedasticity using the Breusch-Pagan test. However, if the model fails that test, we would have no clue as to how to fix it. Breaking it into two parts allows us the additional information of which variable caused the issues.

    Personally, I run the Breusch-Pagan test to determine if there is a violation, then the separate tests to get information on where the issue lies.
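    Since the lmtest package is already loaded in the preamble, that combined check is a single line (a large p-value would indicate no overall evidence of heteroskedasticity):

```r
# Breusch-Pagan test against all model terms at once
bptest(modEd2)
```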

    Multicollinearity

    The last thing to check is multicollinearity. Let us use the vif function in the car package, loading that package first:

    library(car)
    vif(modEd2)

    This gives output

                 GVIF Df GVIF^(1/(2*Df))
    gspcap90 1.226757  1        1.107591
    census9  1.226757  8        1.012855
    

    So, what does this output mean?

    Focus on the GVIF^(1/(2*Df)) column. This holds the VIF values, adjusted for dimensionality. Since both are below our usual cut-off of 5, there is no multicollinearity concern.

    If we had used the interaction model, we would have run

    vif(modEd1, type="predictor")

    That version also adjusts for interaction terms.

    Note

    In reality, it never hurts to run the version that includes type="predictor". If there are no interaction terms, then it returns exactly the same thing as the default type="terms".

    The Final Model

    This model is appropriate, and we can now see the estimates. However, if we are in the "model creation" or "model selection" mode, we need to determine if both variables are statistically significant. If either is not, then that variable needs to be dropped and the new model tested.

    To get p-values for the variables, just run

    summary.aov(modEd2)
    

    Yeppers, that is summary.aov that you are using. It provides statistical significance of the variables, while summary.lm provides the statistical significance of the levels of the categorical variables.

    The output from the summary.aov(modEd2) command is

                Df  Sum Sq Mean Sq F value   Pr(>F)
    gspcap90     1  816163  816163  25.664 9.07e-06 ***
    census9      8  794820   99353   3.124  0.00753 **
    Residuals   41 1303854   31801
    

    The p-value for the gspcap90 variable is less than alpha, so that variable has a significant effect on the violent crime rate. The p-value for the census9 variable is also less than alpha, so it too needs to be included in the final model.

    In short, this is the model we can use. The abbreviated regression table from this model is

    Coefficients:
                                Estimate        Pr(>|t|)
    (Intercept)                1.125e+02           0.308
    gspcap90                   1.550e-02         6.7e-05 ***
    census9East South Central  7.518e+01           0.535
    census9Middle Atlantic    -6.777e+01           0.608
    census9Mountain           -1.707e+02           0.101
    census9New England        -1.708e+02           0.122
    census9Pacific            -1.316e+02           0.263
    census9South Atlantic      1.590e+02           0.125
    census9West North Central -4.953e+01           0.638
    census9West South Central  1.197e+02           0.323
    

    From this, we can conclude that there is a statistically significant, and positive(!), effect of average state wealth on the violent crime rate. We get that conclusion from the gspcap90 line in the table.

    The rest of the table compares the effect of each level of the census9 variable to the base category, East North Central. As no p-value is less than alpha, we cannot conclude that any of the regions is statistically different in its effect from the East North Central region.

    What about when compared to the Mountain region?

    Changing the Base Category

    In other words, how do we change the base level? First, we have to specify that we want the Mountain region to be the base category against which everything else is calculated. Then, we need to re-fit the model with the new base.

    census9 = set.base(census9, "Mountain")
    modEd3 = lm(vcrime00 ~ gspcap90 + census9)
    summary(modEd3)
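    If you do not have the KnoxStats package available, base R's relevel function (from the standard stats package) should do the same job as the set.base helper:

```r
# Hypothetical base-R equivalent: make "Mountain" the reference level.
# Assumes census9 is a factor containing a level named "Mountain".
census9 <- factor(c("Mountain", "Pacific", "Mountain", "New England"))
census9 <- relevel(census9, ref = "Mountain")
levels(census9)[1]   # "Mountain" is now the base category
```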
    

    The regression table now indicates that the violent crime rate in the Mountain region is significantly lower than that in the East South Central, South Atlantic, and West South Central regions. Here is the regression table

    Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)               -58.209772  96.710953  -0.602 0.550559    
    gspcap90                    0.015502   0.003493   4.438  6.7e-05 ***
    census9East North Central 170.671828 101.668158   1.679 0.100815    
    census9East South Central 245.850643 109.804042   2.239 0.030645 *  
    census9Middle Atlantic    102.903945 121.830985   0.845 0.403211    
    census9New England         -0.152624  96.646604  -0.002 0.998748    
    census9Pacific             39.030098 105.450428   0.370 0.713193    
    census9South Atlantic     329.709936  89.146396   3.699 0.000637 ***
    census9West North Central 121.136876  92.324591   1.312 0.196793    
    census9West South Central 290.395590 109.258299   2.658 0.011159 *  
    

    Question: How does the violent crime rate in the different regions compare to the Pacific region?

    census9 = set.base(census9, "Pacific")
    modEd3 = lm(vcrime00 ~ gspcap90 + census9)
    summary(modEd3)
    

    The violent crime rate in the Pacific region is significantly lower than that in the South Atlantic and the West South Central regions.

    Question: How do the regions compare to the South Atlantic region?

    census9 = set.base(census9, "South Atlantic")
    modEd3 = lm(vcrime00 ~ gspcap90 + census9)
    summary(modEd3)
    

    The violent crime rate in the South Atlantic region is significantly higher than in the Mountain, New England, Pacific, and West South Central regions.

    Please remember the multiple-comparisons issue: these individual analyses are only valid if you perform just one of them. Making multiple comparisons requires adjusting the alpha level.
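    For instance, a Bonferroni correction divides alpha by the number of comparisons made. With nine regions, comparing every pair gives a much stricter threshold (a quick sketch):

```r
alpha <- 0.05
k <- choose(9, 2)    # 36 pairwise region comparisons
alpha / k            # about 0.0014, the Bonferroni-adjusted threshold
```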

    Creating the Graphic

    The following code generates the graphic in Figure \(\PageIndex{1}\), below.

    par(mar=c(4,4,0,1)+0.5, family="serif", las=1)
    par(xaxs="i", yaxs="i")
    par(cex.lab=1.2, font.lab=2)
    
    plot.new()
    plot.window( xlim=c(0,75), ylim=c(0,2500) )
    
    axis(1); axis(2)
    title(xlab="GSP per Capita (1990) [$000]", line=2.75)
    title(ylab="Violent Crime Rate (2000)", line=3.5)
    
    xx = seq(15,70)*1000
    yyPac = predict(modEd2, newdata=data.frame(gspcap90=xx, census9="Pacific"))
    yyMtn = predict(modEd2, newdata=data.frame(gspcap90=xx, census9="Mountain"))
    yySAt = predict(modEd2, newdata=data.frame(gspcap90=xx, census9="South Atlantic"))
    
    lines(xx/1000,yyPac, col="steelblue", lwd=2)
    lines(xx/1000,yyMtn, col="gold", lwd=2)
    lines(xx/1000,yySAt, col="pink", lwd=2)
    
    points(gspcap90/1000,vcrime00, pch=21, bg="lightsteelblue")
    
    Figure \(\PageIndex{1}\): Plot of the violent crime rate in 2000 against the GSP per capita in 1990 (in thousands of dollars). The ordinary least squares lines of best fit are included. The pink line is the estimate for the South Atlantic states; the steel-blue line, the Pacific states; and the gold line, the Mountain states.

    It would be helpful to have a legend, but let us leave that for another day!
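    If you do want one, something like the following (run after the plotting script above, with colors matching the three lines) would work:

```r
# Add a legend to the existing plot; colors match the fitted lines
legend("topleft",
       legend = c("South Atlantic", "Pacific", "Mountain"),
       col    = c("pink", "steelblue", "gold"),
       lwd    = 2, bty = "n")
```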

    Also, you need to be able to determine what each line of this script does.

    Confidence Interval for Slope

    We can obtain confidence intervals for the effects using the same method as before. The interpretation follows the same rules, but the table is much bigger:

    confint(modEd3)
    

    We are 95% confident that the effect of the GSP per capita on the violent crime rate is between 8 and 23 additional violent crimes (per 100,000 population) for every $1,000 increase in GSP per capita.
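    Those bounds come from rescaling the per-dollar slope, since gspcap90 is measured in dollars. A quick check using the fitted model:

```r
# Per-dollar confidence bounds for the slope, rescaled to a
# $1,000 change in GSP per capita
confint(modEd2)["gspcap90", ] * 1000
```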

    The confidence intervals for the effects of the levels are relative to the base level. Thus, we are 95% confident that the average violent crime rate in the West North Central region is between 21 and 396 lower than that in the South Atlantic region (the base level).

    Confidence Interval for Y

    Again, we can estimate the expected value of a violent crime rate, given the GSP per capita and the region.

    predict(modEd2, newdata=data.frame(gspcap90=50000, census9="Pacific"), interval="confidence")
    

    We are 95% confident that the expected violent crime rate in a Pacific-region state with a GSP per capita of $50,000 is between 537 and 975, with a point estimate of 756 violent crimes per 100,000 people.

    Prediction Interval for Y

    Finally, we can also calculate a prediction interval for a new observation:

    predict(modEd2, newdata=data.frame(gspcap90=50000, census9="Pacific"), interval="prediction")
    

    We are 95% sure that the violent crime rate for a new observation of a Pacific-region state with a GSP per capita of $50,000 is between 334 and 1177, with a best guess of 756 violent crimes per 100,000 people.

    Note

    As is always the case, the width of the prediction interval is larger than the width of the confidence interval. Remember why this is the case.
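    The reason is visible in the standard error formulas. For a new point \(\mathbf{x}_0\), with \(s^2\) the estimated residual variance,

    \[ \mathrm{se}_{\text{conf}} = s\sqrt{\mathbf{x}_0^{\top}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{x}_0} \qquad\text{while}\qquad \mathrm{se}_{\text{pred}} = s\sqrt{1 + \mathbf{x}_0^{\top}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{x}_0} \]

    The extra 1 under the radical is the new observation's own variability, which the expected value does not have. That is why the prediction interval is always wider.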


    This page titled 7.2: Violent Crime, Wealth, Region is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
