8.2: The South Sudanese Referendum
This page presents a case study in electoral forensics, applying linear regression to a pressing political question: Was the 2011 South Sudanese independence referendum conducted fairly?
We begin with a fundamental statistical test for electoral fairness: whether a vote's content is independent of its probability of being invalidated. You will transform vote-count data, perform a linear regression on logit-transformed proportions, and interpret the results as evidence for or against electoral fairness. You will also learn how to visualize this analysis effectively by plotting the data points, a prediction curve, and confidence bands in the original proportion units. The workflow demonstrates how statistical modeling, paired with thoughtful visualization, can provide powerful insights into complex real-world events.
✦•················• ✦ •··················•✦
Electoral Forensics Background
(Source: Wikipedia)
Free and fair elections are one of the requirements for a legitimate democratic system; furthermore, being a legitimate democratic State is necessary for some forms of external assistance. As such, many not-so-democratic States wish to appear democratic. They hold elections, but the elections are either fraudulent or the electoral system (rules governing the elections) is unfair.
There are many definitions of fairness in an election, but they all contain the same requirement: a person's vote has the same probability of being counted as anyone else's. In other words, the probability of a vote being invalidated is independent of the characteristics of the person casting the vote, including who the vote was for. This aspect of fairness can actually be tested in elections where the number of invalidated votes is counted: If the proportion of the vote for a specific candidate or position is not independent of the proportion of the vote invalidated in the electoral division, then there is evidence against this assumption of fairness.
The Research Question
And so, with this background in elections and democracy, ...
Research Question:
Does the 2011 independence referendum in southern Sudan indicate an issue with fairness?
Narrative Solution
As one of the conditions of the 2005 Naivasha Agreement, which ended the civil war in Sudan, the South was allowed to vote on independence from the North. That referendum was held January 9–15, 2011. Official results stated that 98.83% of the South Sudanese voted against unity and in favor of independence.
The xsd2011referendum data set contains the number of votes in favor of independence (Secession), the number of votes declared invalid (Invalid), and the total number of votes cast (Votes). Load it and save it into the xsd variable without attaching the data.
Preamble
library(KnoxStats)
library(car)
library(lmtest)
xsd = read.csv("https://rur.kvasaheim.com/data/xsd2011referendum.csv")
Because we need to determine if there is a relationship between the proportion of the vote for a specific side and the proportion of the vote invalidated in the electoral division, and because we just have vote counts, we need to create those proportions. The proportion of the vote for the candidate is the number of votes for the candidate divided by the number of valid votes. The invalidation rate is the number of invalid ballots divided by the number of cast ballots.
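In symbols, the two proportions described above are

\[ p_{\text{sec}} = \frac{\text{Secession}}{\text{Votes} - \text{Invalid}}, \qquad p_{\text{inv}} = \frac{\text{Invalid}}{\text{Votes}}, \]

where the denominator of the first is the number of valid votes and the denominator of the second is the number of ballots cast.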
xsd$Valid = xsd$Votes - xsd$Invalid
xsd$pSec = xsd$Secession/xsd$Valid
xsd$pInv = xsd$Invalid/xsd$Votes
Once that is done, we need to transform these proportions using the logit transformation (why?), perform linear regression, and check for a (linear) relationship. If a relationship exists in the transformed variables, then a relationship exists in the untransformed variables. First, however, it is always a good idea to plot the variables to see if there is an obvious answer to the question. Figure \(\PageIndex{1}\) gives a plot of the proportion of the vote invalidated against the proportion of the vote in favor of independence.
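For reference, the logit is the log-odds transformation. It maps a proportion \(p \in (0,1)\) onto the entire real line, which is why it suits an unconstrained linear fit: any fitted value, once back-transformed, is a valid proportion.

\[ \operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right), \qquad \operatorname{logit}^{-1}(x) = \operatorname{logistic}(x) = \frac{1}{1+e^{-x}}. \]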
The plot suggests a strong relationship between the two variables, which is evidence of an election that is not fair. Given the direction of the slope, it appears that those areas voting most strongly in favor of independence had a much lower probability of having their votes rejected.
As we are using the logit transform, we must drop any electoral division (here, county) that has zero invalid votes or zero votes in favor of secession. We need to do this because the domain of the logit function is \(p \in (0, 1)\).
Question:
How will removing these counties affect the conclusions drawn?
To easily do this in R, we can use the which function, which returns which entries have the provided condition. Thus,
dr = which(xsd$Invalid==0)
returns a vector of values \(\{15, 19, 23, 24, 28, 46, 47, 49, 50, 57, 72, 73\}\).
These numbers correspond to the counties that had zero invalid votes cast. Storing this vector in the variable dr (for "drop") allows us to remove those counties from any subsequent calculations. As such, our proportion calculations are:
p.ind = xsd$Secession[-dr]/xsd$Votes[-dr]
p.inv = xsd$Invalid[-dr]/xsd$Votes[-dr]
The negative sign tells R to return all values in the vector other than those entries (that is, to drop them).
And so, the two lines to transform the dependent variable and fit the OLS model are
l.inv = logit(p.inv)
model.xsd = lm(l.inv ~ p.ind)
The results of the linear regression on the transformed dependent variable are given in the table below. There is a very strong relationship between the proportion of the vote invalidated in the county and the proportion of the vote in favor of secession: Those counties with a greater proportion of people voting for independence also had a lower proportion of the vote invalidated. That there is a strong relationship between these two variables is troubling.
|                                     | Estimate | StdErr | t-value | p-value    |
|-------------------------------------|----------|--------|---------|------------|
| Constant term                       |   1.8978 | 0.7690 |   2.468 | 0.0155     |
| Proportion of Vote for Independence |  -9.3991 | 0.8287 | -11.342 | << 0.0001  |
Table: Results table for the South Sudan referendum. The results are in logit units. Note the high level of statistical significance in the effect of the proportion of the vote in favor of independence. This very strongly suggests a lack of fairness in the election.
To make this relationship more obvious, and to make our point stronger, we can plot the data, the prediction curve, and the 95% Working-Hotelling confidence bands on the same plot.
Recall that what confidence intervals are to univariate data, confidence bands are to bivariate data.
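For reference, the Working-Hotelling bands widen the usual pointwise confidence interval so that the stated coverage holds simultaneously across all values of \(x\). In their standard form, the band at \(x_h\) is

\[ \hat{y}(x_h) \;\pm\; \sqrt{2\,F_{1-\alpha;\,2,\,n-2}}\;\, s\{\hat{y}(x_h)\}, \]

where \(s\{\hat{y}(x_h)\}\) is the standard error of the fit at \(x_h\) (the se.fit value that predict returns below).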
The Graphing Philosophy of R
The philosophy behind base R graphics is to start with a fresh plot and paint successive layers on top of it. This allows us to build graphs that tell the story, one element at a time. To make the graph described above, we need to
- Plot the points (displayed in proportion units),
- Plot the prediction curve (displayed in proportion units, but calculated in logit units),
- Plot the 95% confidence bands (displayed in proportion units, but calculated in logit units).
The first step has been done already (Figure \(\PageIndex{1}\)).
The second step requires the repeated use of the predict function. First, to make things easier, let us define newX as a series of "proportion of vote in favor of independence" values for which we want to make predictions:
newX = seq(0, 1, length=1e4)
This creates a vector containing 10,000 values equally spaced between 0 and 1.
With this, our predict statement will be
l.pred = predict(model.xsd, newdata=data.frame(p.ind=newX), se.fit=TRUE)
The se.fit=TRUE parameter, which requests the standard error of the fit at each x-value, will be important for calculating the confidence bands. This is a convenience from R; we saw how to calculate this value earlier in the book.
Remember that these predictions are in logit units. To get them into level units, we just apply the logistic function to these point predictions:
p.pred = logistic(l.pred$fit)
The $fit selects only the fitted predictions from the l.pred variable. This is necessary as we are also using the se.fit=TRUE parameter.
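As a quick sanity check on the back-transformation, we can use the fitted coefficients from the regression table above. In logit units the fitted line is \(\hat{\ell}(x) = 1.8978 - 9.3991x\), so the predicted invalidation rate at, say, \(x = 1\) (a county voting unanimously for secession) is

\[ \hat{p}(1) = \operatorname{logistic}(1.8978 - 9.3991) = \frac{1}{1 + e^{7.5013}} \approx 0.00055, \]

an invalidation rate of roughly 0.06%, consistent with the low end of the plot.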
Now that we have the predictions in the original units, we merely paint them on the current plot (from Step 1):
lines(newX, p.pred)
The third step requires us to calculate the 95% confidence bands and paint them on the plot as well. For want of better estimates, let us use the Working-Hotelling bands. In R, we can calculate them with the predictWH function:
prwh = predictWH(model.xsd)
Once again, we must back-transform these two variables using the logistic function. So, our final confidence bands are
ucb = logistic(prwh$ucb)
lcb = logistic(prwh$lcb)
Finally, we paint this on the current plot with
lines(newX, ucb, col="grey")
lines(newX, lcb, col="grey")
This gives Figure \(\PageIndex{2}\), below.
Note that the predictions are curved in these units; they are straight in logit units. Also note the confidence bands are wider where the value of \(x\) is farther from \(\bar{x}\). Lastly, note that no horizontal line can fit between the two confidence bands. This illustrates that there is a statistically significant relationship between the two variables at the \(\alpha=0.05\) level.
Question:
Why does the fact that no horizontal line fits between the two confidence bands imply a statistically significant relationship between the two variables at the \(\alpha=0.05\) level?
Note that the figure gives the same information as the regression table. The difference is that the graphic tells a clear story.
Graphs usually make the points more manifest.
The Graphic's Code
Here is the code I used to produce Figure \(\PageIndex{2}\). There are some interesting things in it. Please pore over the code to understand what each line does.
par(mar=c(4,4,0,0)+0.5)
par(xaxs="i", yaxs="i")
par(family="serif", las=1)
par(font.lab=2, cex.lab=1.2)
par(xpd=NA)

plot.new()
plot.window(xlim=c(0,1), ylim=c(0,0.1))
axis(1, at=0:10/10); axis(2)

title(xlab="Support for Secession", line=2.5)
title(ylab="Invalidation Rate", line=2.75)

points(p.ind, p.inv, pch=21, bg="lavender")
lines(newX, p.pred)
lines(newX, ucb, col="grey")
lines(newX, lcb, col="grey")


