
7.3: Cows in Děčín


To illustrate the process of model selection, let us examine the Děčín ballot measure of 2009. That ballot measure sought to constitutionally restrict the number of cows that could be housed within the city limits. While this extended example may seem rather dated, it covers some interesting issues in statistical modeling, as well as the kinds of questions we can answer with a model.

    The Example

The voters of Děčín are being sent to the polls to vote on a constitutional referendum that proposes to limit the number of cows that can be housed within the city limits. This is not the first time that Ruritanians have been sent to the polls to vote on this or a closely related issue. Given the information from previous votes and the demographics of Děčín voters, what is the probability that this ballot measure will pass?

Before attempting any analysis, we need to search the literature to learn which variables should be present in the model and in which direction each should affect the dependent variable. From that literature review, we hypothesize that the vote in favor of such ballot measures depends on three variables: the age of the population, the religiosity of the population, and whether the ballot measure also restricts chickens. The hypothesized directions are as follows: kraj that are more religious should vote in favor of cow restrictions at a higher rate; measures that also ban chickens should have a harder time passing; and measures voted on more recently should have a harder time passing, as the young tend to support cows and the elderly tend to oppose them (wanting quiet, dung-free neighborhoods).

    With this theory and the resulting hypotheses, we can take our next step: Getting to know the data.

    Get to Know the Data (The Preamble)

Before we begin trying to answer this question, we must get to know our data. There are several tools available to us to visualize the data: histograms, scatter plots, and quantile-quantile plots. In addition to visualizing the data, we should calculate several descriptive statistics for the variables of interest.

library(KnoxStats)

# Load the ballot-measure data and examine each variable's summary statistics
cows = read.csv("http://rur.kvasaheim.com/data/cows.csv")
summary(cows)
    

    Here are some descriptive statistics on the variables in the cows dataset.

           Year Passed    Chicken Ban    Religious Percent 
Minimum       1998           0                 51.00  
Maximum       2008           1                 85.00  
Median        2004           1                 67.50  
Mean          2004           0.5938            66.75  
Variance      6.07           0.2490            88.19
CV            0.58           0.8404             0.14
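
Beyond the numerical summaries, a few plots help us see the shape of the data. Here is a minimal sketch using base R graphics; the column names propWin and religPct are the ones used in the modeling section below.

    # Histogram of one independent variable
    hist(cows$religPct, main="", xlab="Percent Religious")

    # Bivariate scatter plot against the dependent variable
    plot(cows$religPct, cows$propWin,
         xlab="Percent Religious", ylab="Proportion in Favor")

    # Quantile-quantile plot of the dependent variable
    qqnorm(cows$propWin)
    qqline(cows$propWin)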
    

    Variability

    Since we have multiple independent variables, we should calculate both univariate and bivariate descriptive statistics. The table above provides the univariate descriptive statistics. The primary univariate question to ask about the independent variables here is whether there is sufficient variation. The two measures we need to examine are the variance and the coefficient of variation. If both of these numbers are small, then there may be an issue.

    In this data, the variance of the Chicken Ban variable is small and potentially worrisome; however, its coefficient of variation (a scaled standard deviation),

    \begin{equation}
    c_v = \left|\frac{s}{\ \bar{x}\ }\right|
    \end{equation}

    indicates that there is no serious issue (the value is close to 1).

    Note

As this is a dichotomous variable, the mean is the proportion of values equal to 1. Thus, about 60% of the values are 1 and 40% are 0 --- more than sufficient variation.

    None of the three variables have small enough variation to cause us concern.
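
To obtain these values directly, here is a minimal sketch, assuming the cows data frame loaded above and the column names used in the modeling section:

    # Coefficient of variation: the scaled standard deviation |s / xbar|
    cv <- function(x) abs( sd(x) / mean(x) )

    # Variance and CV for each independent variable
    sapply(cows[, c("yearPassed", "chickens", "religPct")], var)
    sapply(cows[, c("yearPassed", "chickens", "religPct")], cv)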

    Relationships

After getting to know the variables individually, it is important to get to know the relationships between them. This can be done through correlation tests and bivariate scatter plots. Independent variables with strong correlations with the dependent variable should be considered for inclusion in the model. Independent variables with strong correlations with other independent variables should be a concern. Remember that the mathematics of OLS regression requires that no independent variable be a linear combination of the others. If independent variables are highly correlated, the statistical properties of the method weaken.

    Question:

    If independent variables are highly correlated, the statistical properties of the method weaken. Why?

The pairwise correlations are provided in the table below. Of the three independent variables, only Chicken Ban and Religious Percent have a statistically significant correlation (\(t=3.2869; \nu=30; \text{p-value}=0.0026\)). Should the level of correlation be a concern? Perhaps. Their correlation is \(r=0.5146\), which corresponds to an \(R^2\) value of just \(0.2648\). As such, the correlation may not be large enough to severely affect our coefficient estimates. Let us just remember this relationship for the future.

Table \(\PageIndex{1}\): Correlation table

                          Year Passed   Chicken Ban   Religious Percent
    Year Passed               ---          0.1903          0.2399
    Chicken Ban              0.1903          ---           0.5146
    Religious Percent        0.2399        0.5146            ---

    This table provides the correlations between the variables in the cows data. The correlation between Chicken Ban and Percent Religious is statistically significant (\(t=3.2869; \nu=30; \text{p-value}=0.0026\)). This is the sole statistically significant correlation.
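
In R, these correlations and the accompanying test can be obtained as follows (a sketch, again assuming the column names from the modeling section):

    # Pairwise correlations among the three independent variables
    cor(cows[, c("yearPassed", "chickens", "religPct")])

    # Pearson product-moment correlation test for the pair of note
    cor.test(cows$chickens, cows$religPct)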

    Note

The issue is actually more than a statistics issue. If two independent variables are highly correlated with each other, it is logically impossible to determine which affects the dependent variable, or how much of the effect to attribute to each. Statistics can, however, tease out the independent relationships better than we could without it.

    As a rule of thumb, if the correlation is greater than \(r = 0.90\), there may be a serious logical issue. If two variables are so highly correlated, which of the two is the "correct" independent variable? How can one tell? Can both be good? Is the commonality between them the real independent variable?

Figure \(\PageIndex{1}\): Correlation plots between the three independent variables. The correlation between Chicken Ban and Percent Religious was statistically significant according to the Pearson product-moment correlation test. This is evident in this graph, as well.

    Variance Inflation Factor

This problem is a bit more extensive than suggested above. Recall that one of the mathematical requirements is that the rank of the design matrix equals \(p\) (the number of parameters to be estimated). This requirement fails if one variable is perfectly correlated with (is a linear function of) another variable. It also fails if a variable is a linear function of a combination of the other variables.

Thus, while checking the bivariate correlations is helpful, it is not the full answer we need. We need to check whether each independent variable is a linear function of (or close to) a combination of the others.

To determine the level of multicollinearity, we can use the "variance inflation factor" (VIF). Recall how to calculate the VIF for a given independent variable: regress it on all the other independent variables, calculate the \(R^2_i\) from that regression, and calculate the VIF from

    \begin{equation}
    \text{VIF}_i = \frac{1}{1-R^2_i}
    \end{equation}

Once you have calculated the VIF for each independent variable, compare the values to the "usual" rule of thumb, which depends on your discipline. The three typical boundaries are 5, 8, and 10. If any of your VIF values is greater than 10, then there is an issue with multicollinearity. If all are less than 5, then there is no issue. If the values fall between those extremes, then you should think about the effect of multicollinearity on your estimators. What do those VIF values correspond to? A VIF of 5 means the other independent variables explain 80% of the variation in that variable (\(R^2_i = 0.80\)); a VIF of 10 means they explain 90% (\(R^2_i = 0.90\)). Keep that in mind.

    Of course, if we think about the effects of multicollinearity, rather than just detecting "severe levels," then we may wish to eschew such tests, assume multicollinearity is an issue, and adjust for it. On the other hand, if there is no reason to think that the model should suffer from multicollinearity, we will want to avoid such adjustments.

    These are questions of science, not of statistics.

    Know the science behind your theory.

    Model the Data

    The example asked us to determine the probability that the ballot measure will pass. Before we can answer that question, we need to model the proportion of the vote in favor of the ballot measure using our independent variables; that is, we need to be able to predict the proportion of the vote in favor of the ballot measure with the information we have.

    Thus, the dependent variable will be propWin and the independent variables will be yearPassed, chickens, and religPct. For now, let us assume a linear relationship between the independent variables and the dependent variable.

    Model Selection

    Unless you have a lot of independent variables, I recommend you start with the interaction model.

    Caution

Some will disagree and recommend starting with the simplest model and building complexity from there. There tends to be little difference between the two model-building methods. In either case, one has to worry about the multiple-comparisons issue. How we should address it in the realm of model building is still unknown. We are certain of two things, however. First, the Bonferroni procedure is far too conservative. Second, doing nothing is not an acceptable option.

    The interaction model includes the effects of each independent variable singly (main effects) as well as all possible combinations of those variables (interaction effects).

Table \(\PageIndex{2}\): The Grammar of Formulas

    ~     y ~ x1               Separates the dependent variable (left-hand side) from the independent variables (right-hand side).
    +     y ~ x1 + x2          Adds the following variable to the formula.
    -     y ~ x1 + x2 - 1      Removes the following variable from the formula. In the example, -1 removes the constant term.
    :     y ~ x1:x2            Multiplies the preceding and following variables in the formula.
    *     y ~ x1 * x2          Crosses the preceding and following variables in the formula.
    ^     y ~ (x1+x2+x3)^2     Includes interactions up to the specified level.
    I()   y ~ I(x1 + x2)       Replaces the formula grammar inside the parentheses with algebraic grammar. In this example, the sole independent variable is the sum of the two variables.

Table \(\PageIndex{2}\): The symbols and their meanings in the grammar of formulas. I sure wish I could locate the book that created these, but I cannot find it anymore. It is in the Oklahoma State University library... somewhere. It's a blue book, if that helps.

    R uses the usual formula grammar (the table above). Its use takes a little practice to get the hang of, but it is entirely logical (for the most part). For instance, if you wish to fit the model \(y = \beta_0 + \beta_1x + \varepsilon\), you would use y ~ x.

    However, if you wish to force the y-intercept to be 0, that is to fit the model \(y = \beta_1 x + \varepsilon\), then you have a choice:

    • y ~ x - 1
    • y ~ x + 0

    The two are identical.

    The first is logically interpreted as the usual model, "less the intercept term" (hence the -1). This is the usual method I use. The second exists for backward compatibility. I strongly encourage you to use the first method.

    Some other examples of this grammar include:

Table \(\PageIndex{3}\): Formula Examples. Each algebraic form is followed by its equivalent formula form(s).

\(y = \beta_0 + \beta_1 x1 + \beta_2 x2 + \beta_3 x1 x2\)
    y ~ x1*x2

\(y = \beta_0 + \beta_1 x1 + \beta_2 x2 + \beta_3 x1 x2\)
    y ~ x1 + x2 + x1:x2
    y ~ (x1 + x2)^2

\(y = \beta_0 + \beta_1 x1 x2\)
    y ~ x1:x2

\(y = \beta_0 + \beta_1 x1^3 + \beta_2\sin(x2)\)
    y ~ I(x1^3) + I(sin(x2))

\(y = \beta_0 + \beta_1 x1 + \beta_2 x2 + \beta_3 x3 + \beta_4 x1x2 + \beta_5 x1x3 + \beta_6 x2x3 + \beta_7 x1x2x3\)
    y ~ x1*x2*x3

\(y = \beta_0 + \beta_1 x1 + \beta_2 x2 + \beta_3 x3 + \beta_4 x1x2 + \beta_5 x1x3 + \beta_6 x2x3\)
    y ~ (x1 + x2 + x3)^2
    y ~ x1*x2*x3 - x1:x2:x3

    Note that the first two examples above are the same model. These illustrate what comprises a "crossed" model. Also note that the "wedge" operator, ^, indicates how many variables are multiplied in the terms. So ^2 indicates you want all one-way and two-way interactions for the model, and ^3 means you want all one-, two-, and three-way interactions.

    Thus, we can see the following pairs are equivalent:

    • y ~ (x1 + x2 + x3)^3 and y ~ (x1 * x2 * x3)
    • y ~ (x1 + x2 + x3)^2 and y ~ (x1 * x2 * x3) - x1:x2:x3

    With this brief introduction to the grammar of formulas, we can return to our example. We have three independent variables; the formula to give a full three-way interaction model is

    propWin ~ yearPassed * chickens * religPct

    As we will use this model a bit, we save the linear regression results into a variable. Thus, the two lines to run are

mod1 = lm(propWin ~ yearPassed * chickens * religPct, data=cows)
    summary(mod1)
    

These lines give the following output (well, just the first, fourth, and fifth columns of that output):

                                  t value   Pr(>|t|)
    (Intercept)                     1.148      0.262
    yearPassed                     -0.901      0.377
    chickens                       -1.084      0.289
    religPct                        1.557      0.133
    yearPassed:chickens             0.950      0.352
    yearPassed:religPct             0.510      0.615
    chickens:religPct               0.979      0.338
    yearPassed:chickens:religPct   -0.895      0.379
    

    The line starting yearPassed:chickens:religPct is the three-way interaction term. As it is the highest interaction, it is the only one we can interpret here. Note that it is not statistically significant (\(\text{p-value}=0.379\)). Thus, removing that term will do two things. First, it will simplify the model. Second, it will not significantly harm the model's descriptive (or predictive) ability.

    That second model can be written as either

mod2 = lm(propWin ~ yearPassed * chickens * religPct - yearPassed:chickens:religPct, data=cows)
    

    or as

mod2 = lm( propWin ~ (yearPassed + chickens + religPct)^2, data=cows )
    

    The two formulas are equivalent.
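
To check the statistical significance of the two-way interactions, look at the ANOVA-style summary of the model:

    summary.aov(mod2)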

    Note that the summary.aov(mod2) command indicates that none of the three two-way interactions are statistically significant. Thus, these two-way interactions should be removed from the model.

    Caution

Again, some would advocate removing just the least significant effect and then refitting the new model. Others would suggest refitting three different models, one for each combination of the interactions. There is no "always best" answer, other than the one that your science suggests.

    This leaves a model with no interactions — an additive model. Fitting the additive model and checking the statistical significance of the variables is as above.

mod3 = lm(propWin ~ yearPassed + chickens + religPct, data=cows)
    summary.aov(mod3)
    

    Note that all three variables are significant according to this output (the chicken variable is statistically significant because we specified an effect direction). Thus, this is our provisional model.

    Question:

Given that the additive model is the appropriate model, does the effect of religiosity (religPct) change over time?

    The Additive Model

    That is, the equation we will use to fit the data is

    \begin{equation}
    \texttt{propWin} = \beta_0 + \beta_1 (\texttt{yearPassed}) + \beta_2 (\texttt{chickens}) + \beta_3 (\texttt{religPct}) + \varepsilon
    \end{equation}

    If \(\varepsilon \sim N(0, \sigma)\), then we know

    \begin{align}
    E[\texttt{propWin}] = \beta_0 + \beta_1 (\texttt{yearPassed}) + \beta_2 (\texttt{chickens}) + \beta_3 (\texttt{religPct})
    \end{align}

    Check the Assumptions

    But, does this model violate any of the assumptions of OLS regression? All of the usual tests (Shapiro-Wilk, Breusch-Pagan, and runs) pass.
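
For concreteness, here is one way those checks could be run; this is a sketch, as the lmtest and lawstat packages are my assumptions (the book's own functions may differ):

    # Normality of the residuals: Shapiro-Wilk test (base R)
    shapiro.test( residuals(mod3) )

    # Constant variance: Breusch-Pagan test (lmtest package)
    library(lmtest)
    bptest(mod3)

    # Independence: runs test on the residuals (lawstat package)
    library(lawstat)
    runs.test( residuals(mod3) )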

    What about multicollinearity? Remember that the effect of multicollinearity is to inflate the standard errors (reduce the t-value, increase the p-value). Thus, if multicollinearity exists, fixing it will make the variables even more statistically significant.

    How do we test it? We can do it the hard way or the easy way. The hard way is to estimate three regression equations, calculate the individual \(R^2\) values, and calculate the resulting VIF values. So, by hand, this would be:

vif1 = lm(yearPassed ~ chickens + religPct, data=cows)
vif2 = lm(chickens ~ yearPassed + religPct, data=cows)
vif3 = lm(religPct ~ yearPassed + chickens, data=cows)
    
    1/(1-summary(vif1)$r.squared)
    1/(1-summary(vif2)$r.squared)
    1/(1-summary(vif3)$r.squared)
    

    Or, you can use the car package:

    library(car)
    vif(mod3)
    

    The results of these VIF checks are

    yearPassed   chickens   religPct
      1.067965   1.368963   1.399954
    

    None of these three is even close to the lowest "Rule of Thumb." As such, multicollinearity is not an issue in this model.

    The Results

    The regression table for model mod3 is given below. Notice that all three variables of interest are statistically significant at the \(\alpha = 0.05\) level. Additionally, the model has an \(\overline{R}^2\) of \(0.7565\), which is a great fit in most of the social sciences. The direction of the coefficients also agrees with theory.

    Caution

You may claim that the Chicken variable is not statistically significant at the \(\alpha=0.05\) level. However, the provided p-values are two-tailed (non-directional) p-values, while our hypotheses were all directional (one-tailed). To get the one-tailed p-values, halve the two-tailed p-values (this works here because each estimated effect is in the hypothesized direction). With that, all three independent variables are statistically significant.

    Thus, the equation for the line of best fit is approximately

    \begin{align*}
    E[\texttt{propWin}] &= 0.1512 \\[1ex]
    & \qquad - 0.0201 (\texttt{yearPassed}) \\[1ex]
    & \qquad - 0.0373 (\texttt{chickens}) \\[1ex]
    & \qquad + 0.0095 (\texttt{religPct})
    \end{align*}

                             Estimate    Std. Error    t-value     p-value
    Constant Term              0.1512        0.0659      2.293      0.0295
    Year Passed (post 2000)   -0.0201        0.0036     -5.618   << 0.0001
    Banned Chickens           -0.0373        0.0200     -1.868      0.0723 
    Percent Religious          0.0095        0.0011      8.801   << 0.0001
    

Table \(\PageIndex{4}\): Results table for the regression of the proportion supporting a generic ballot measure limiting the number of cows housed in the city against the three included variables. The \(R^2\) for the model is \(0.7801\); the \(\overline{R}^2\), \(0.7565\). The p-values are based on two-tailed tests, while the hypotheses were one-tailed. As such, all three explanatory variables are statistically significant at the standard level of significance (\(\alpha = 0.05\)).

From this, we can make the following observations. First, the expected vote proportion declines by about 2 percentage points per year. Second, ballot measures that also ban chickens are less likely to pass. Finally, the more religious a kraj, the more likely a cow ban is to pass.

    Question:

    What does the \(0.1512\) represent in this context (in real terms)?

    Predicting Děčín

According to this model, what is the expected vote in Děčín? To answer this, we need information about Děčín and its ballot measure, specifically the values of the independent variables:

yearPassed = 9 (the vote takes place in 2009, and the variable is coded as years after 2000)
chickens = 0
religPct = 48

    With this information, and under the assumption that the model is correct, we have our prediction that 42% of the Děčín voters will vote in favor of this ballot measure.
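
For those following along by hand, the arithmetic, using the less-rounded coefficient estimates from the script in the final section, is

\begin{align*}
E[\texttt{propWin}] &= 0.151221 - 0.020095\,(9) - 0.037331\,(0) + 0.009452\,(48) \\[1ex]
&\approx 0.4241
\end{align*}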

    Thankfully, R does not require us to do this calculation by hand. The R code for predicting the percent of Děčín voters voting in favor of this ballot measure can be

    DECIN = data.frame(yearPassed=9, chickens=0, religPct=48)
    predict(mod3, newdata=DECIN)
    

    The first line was used to make the code more readable. It is also helpful to first define the variable DECIN if you are going to make predictions for Děčín using several models.

    If neither of these methods appeals to you and you wish to do this in one line, that line would be

    predict(mod3, newdata=data.frame(yearPassed=9, chickens=0, religPct=48))
    

    Note the inclusion of the predict function, which predicts the dependent variable value given values for each of the independent variables (read the help file on predict; we will use this function frequently).

    Question:

    How does the estimate change if you specify more variables than are in the model for Děčín? For instance:
    DECIN = data.frame(yearPassed=9, chickens=0, religPct=48, age=45.2, channels=12, turnout=0.72)

    Graphing the Results

Now that we have confidence in our model, we can use it to predict the effects of each of the three independent variables on the vote in favor of these ballot measures. There are three independent variables, so we cannot create a single graph that displays the results. However, as one of the variables is dichotomous, we can show the results in just two graphs (one for each continuous independent variable).

    Both of these graphs will have the vote in favor as the dependent variable (vertical axis). One of the two graphs will have percent religious as the primary independent variable, whereas the other will have the year passed as the primary independent variable. The chicken variable will be present in both graphs, signified by two separate curves, one where the ballot measure banned chickens and one where it did not:

    Figure \(\PageIndex{2}\): Prediction graphs of our cows model. These graphs contain two independent variables plotted against the dependent variable, with the dichotomous independent variable included as separate lines. Note that the effect of each of the three independent variables is made manifest by these two graphs.
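
The code that produced these figures is not shown; however, a minimal sketch of the religiosity graph, using predict and holding yearPassed at its mean, might look like the following:

    # Predictions across the observed range of religiosity (51 to 85),
    # with and without a chicken ban
    rp  <- 51:85
    nd0 <- data.frame(yearPassed=mean(cows$yearPassed), chickens=0, religPct=rp)
    nd1 <- data.frame(yearPassed=mean(cows$yearPassed), chickens=1, religPct=rp)

    plot(rp, predict(mod3, newdata=nd0), type="l",
         xlab="Percent Religious", ylab="Proportion Voting in Favor")
    lines(rp, predict(mod3, newdata=nd1), lty=2)
    legend("topleft", legend=c("No chicken ban", "Chicken ban"), lty=1:2)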

    The graphs illustrate the results of the model — this is their purpose. Although the graphs "illustrate the story," we must still "tell the story" of the graphics, including numbers from the prediction table. The following paragraphs explain the graphics.

Both graphics show that adding a chicken ban to the referendum tends to reduce the vote in favor of the referendum. All things being equal, a ballot measure banning chickens will have 3.7 percentage points fewer people vote for it than a like measure not banning chickens \((s=1.9988, t=-1.87, \text{p-value}=0.0723)\).

The top graphic illustrates the effect of passing time on the proportion of the vote in favor of these referenda: as the year increases by one, the proportion voting in favor of the referendum decreases by about 2 percentage points, on average \((t=-5.62, \text{p-value} \ll 0.0001)\).

The bottom graphic shows the effect of religiosity on the ballot outcome: kraj with higher levels of religiosity tend to vote in favor of these measures at a higher rate than kraj with lower levels of religiosity. In fact, increasing the level of religiosity in the kraj by 1 percentage point will tend to increase the vote in favor of the ballot measure by 0.95 percentage points \((t=8.80, \text{p-value} \ll 0.0001)\).

    Note the interweaving of the graphic discussion with concrete, numerical effects (and statistical significance in parentheses) from the prediction table. This combination aids the reader in interpreting the graphic(s) in terms of statistical language.

    Answering the Question with the Parametric Bootstrap

Thus, we have a prediction that 42% of the voters will support the ballot measure. However, this is not an answer to the original question, which asked for the probability of the ballot measure passing. From a modeling standpoint, this probability depends on the coefficient estimates, which are just estimates of the true population values, and the standard errors, which are measures of our certainty in those estimates.

In the ordinary least squares method, those parameter estimates are random variables, since they are functions of the data. In other words, if we re-ran human history, the estimated effects would be different, since reality would be different. Furthermore, as these are random variables, they have an associated distribution --- the Normal distribution. In fact, the distribution of each parameter estimate is Normal, with expected value equal to the estimate and standard deviation equal to the standard error. Thus, for example, the effect of yearPassed is \(b_1 \sim N(\mu=-0.0201, \sigma=0.0036)\); of chickens, \(b_2 \sim N(\mu=-0.0373, \sigma=0.0200)\); and of religPct, \(b_3 \sim N(\mu=0.0095, \sigma=0.0011)\).

    Let us leverage these facts to (virtually) re-run human history many, many, many times, get the parameter estimates for each history, and predict the outcome of the ballot measure in Děčín. In other words, let us perform a Monte Carlo (simulation) analysis. The steps are the same as with any Monte Carlo analysis we have done (Kennedy 2008). The only difference is what we do within the loop. Here, we draw random numbers from the appropriate distribution and calculate the predicted vote.

    Before you look at the following algorithm, write your own in your notes and compare it to the one below:

    1. Initialize variables
    2. Perform loop
      1. Draw from the four distributions
      2. Predict the Děčín outcome
    3. Calculate the proportion of times the ballot measure garnered more than 50% of the vote

    One can also store the random numbers inside the loop and predict outside the loop. Also, if the statistical program allows it, you can avoid the loop and just draw all the numbers at once. This last has the advantage of being very fast. It is also the method I use here, in the R script:

    # Initialize variables
      outcome <- numeric()
      trials  <- 1e6
      
      # Coefficient estimates
      b.intc  <-  0.151221
      b.year  <- -0.020095
      b.cban  <- -0.037331
      b.rpct  <-  0.009452
      
      # Coefficient standard errors
      s.intc  <-  0.065938
      s.year  <-  0.003577
      s.cban  <-  0.019988
      s.rpct  <-  0.001074
      
      # Distributions (the "loop")
e.intc  <- rnorm(trials, mean=b.intc, sd=s.intc)
  e.year  <- rnorm(trials, mean=b.year, sd=s.year)
  e.cban  <- rnorm(trials, mean=b.cban, sd=s.cban)
  e.rpct  <- rnorm(trials, mean=b.rpct, sd=s.rpct)
      
      outcome <- e.intc + e.year*9 + e.cban*0 + e.rpct*48
    

    At this point, the variable outcome holds the proportion of people voting in favor of the ballot measure in one million simulated elections. To answer the question, we just need to determine the proportion of those elections in which the outcome is greater than \(0.50\): mean(outcome>0.50) will work.

    Figure \(\PageIndex{3}\): Plot of the predicted vote outcomes from the Monte Carlo experiment described in the text. Note that, while the expected proportion of the vote in favor of the ballot measure is 42%, there is still a 20% chance of the ballot measure passing, given that our model is correct.

Of course, the numbers are nice, but a histogram may tell a better story. The following code will produce a histogram similar to that above.

    hist(outcome, main="", xlab="Proportion Vote for Ballot Measure", breaks=-1:99/100)
    hist(outcome[outcome>0.50], main="", yaxt="n", breaks=-1:99/100, col=2, add=TRUE)
    axis(1, at=0.50, labels="50%")
    

The histogram of the Děčín predictions is presented above (Figure \(\PageIndex{3}\)). Note that the expected outcome is still the 42% we found above, but the histogram makes manifest a spread to that prediction that the single point prediction did not show. In fact, prior to this analysis, we may have concluded that there was no possibility of the ballot measure passing in Děčín based on our model; now, we see that there is a 20% chance of it passing.


    Thus, we have an estimated answer to our original question. Given that our model is correct, there is approximately a 20% chance that the ballot measure to limit the number of cows in the city will pass in Děčín, with a point prediction of 42% in favor of the bill.

    The actual results of the 2009 ballot measure in Děčín were that the ballot measure passed with 53% of the vote. This result is well within the 95% prediction interval suggested by Figure \(\PageIndex{3}\). Also, the fact that the ballot measure passed should not be too surprising, since this model gave it a 20% probability of passing, and 20% is not a rare event by any stretch of the imagination.

    A Fundamental Problem

    There is a really big problem with these results, however. Run the following code and interpret it.

    mean(outcome>1) + mean(outcome<0)
    
    Always

    This is an important exercise: Always check that predictions make sense.

    In a future chapter, we will revisit this issue and propose a solution.



    This page titled 7.3: Cows in Děčín is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
