The death of one man is a tragedy. The death of millions is a statistic.
– Josef Stalin, Potsdam 1945
950,000 – 1,200,000
– Estimate of Soviet repression deaths, 1937-1938 (Ellman 2002)
Stalin’s infamous quote about the statistical character of the deaths of millions is worth giving some thought. The clear intent of his statement is that the death of an individual touches us personally and its force cannot be denied, but that the deaths of a multitude are incomprehensible and, as a consequence, mere statistics, more easily ignored. I’d argue that Stalin was half right. A statistic is an abstraction, a description of events beyond our personal experience, and so hard to visualise. Few if any of us can imagine what the deaths of millions are “really” like, but we can imagine one death, and this gives the lone death its feeling of immediate tragedy, a feeling that is missing from Ellman’s cold statistical description.
Yet it is not so simple: without numbers, without counts, without a description of what happened, we have no chance of understanding what really happened, no opportunity even to try to summon the missing feeling. And in truth, as I write this, sitting in comfort on a Saturday morning, half a world and a whole lifetime away from the Gulags, when I put the Ellman estimate next to the Stalin quote a dull dread settles in my stomach and a chill settles over me. The Stalinist repression is something truly beyond my experience, but with a combination of statistical data and those recorded personal histories that have come down to us, it is not entirely beyond my comprehension. Because what Ellman’s numbers tell us is this: over a two-year period, Stalinist repression wiped out the equivalent of every man, woman and child currently alive in the city where I live. Each one of those deaths had its own story, was its own tragedy, and only some of those are known to us now. Even so, with a few carefully chosen statistics, the scale of the atrocity starts to come into focus.
Thus it is no small thing to say that the first task of the statistician and the scientist is to summarise the data, to find some collection of numbers that can convey to an audience a sense of what has happened. This is the job of descriptive statistics, but it’s not a job that can be done using the numbers alone. You are a data analyst, not a statistical software package. Part of your job is to take these statistics and turn them into a description. When you analyse data, it is not sufficient to list off a collection of numbers. Always remember that what you’re really trying to do is communicate with a human audience. The numbers are important, but they need to be put together into a meaningful story that your audience can interpret. That means you need to think about framing. You need to think about context. And you need to think about the individual events that your statistics are summarising.
Ellman, Michael. 2002. “Soviet Repression Statistics: Some Comments.” Europe-Asia Studies 54 (7). Taylor & Francis: 1151–72.
Note for non-Australians: the AFL is an Australian rules football competition. You don’t need to know anything about Australian rules in order to follow this section.
The choice to use Σ to denote summation isn’t arbitrary: it’s the Greek upper case letter sigma, which is the analogue of the letter S in that alphabet. Similarly, there’s an equivalent symbol used to denote the multiplication of lots of numbers: because multiplications are also called “products”, we use the Π symbol for this; the Greek upper case pi, which is the analogue of the letter P.
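As a small illustration of the two symbols, R's built-in sum() and prod() functions play the roles of Σ and Π respectively. The vector x below is made up purely for the example.

```r
# A made-up vector of five numbers (illustrative only)
x <- c(2, 3, 5, 7, 11)

# Sigma: add all the elements together
total <- sum(x)

# Pi: multiply all the elements together
product <- prod(x)

print(total)
print(product)
```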
Note that, just as we saw with the combine function c() and the remove function rm(), the sum() function has unnamed arguments. I’ll talk about unnamed arguments later in Section 8.4.1, but for now let’s just ignore this detail.
Or at least, the basic statistical theory – these days there is a whole subfield of statistics called robust statistics that tries to grapple with the messiness of real data and develop theory that can cope with it.
As we saw earlier, it does have a function called mode(), but it does something completely different.
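To make the difference concrete, here is a minimal sketch in base R. The data are made up, and most_frequent() is a hypothetical helper written for this example, not a function that ships with R.

```r
# Made-up data in which the value 3 occurs most often
x <- c(1, 3, 3, 3, 5, 5)

# mode() reports how R stores the object, not the most frequent value
storage <- mode(x)
print(storage)

# Hypothetical helper: return the most frequent value(s) as character strings
most_frequent <- function(v) {
  counts <- table(v)                        # frequency of each distinct value
  is_max <- as.vector(counts == max(counts))
  names(counts)[is_max]                     # labels of the most common value(s)
}

modal_value <- most_frequent(x)
print(modal_value)
```

Because table() labels its counts with character strings, this sketch returns the modal value as text; wrap the result in as.numeric() if you need a number.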
This is called a “0-1 loss function”, meaning that you either win (1) or you lose (0), with no middle ground.
Well, I will very briefly mention the one that I think is coolest, for a very particular definition of “cool”, that is. Variances are additive. Here’s what that means: suppose I have two variables X and Y, whose variances are $
With the possible exception of the third question.
Strictly, the assumption is that the data are normally distributed, which is an important concept that we’ll discuss more in Chapter 9, and will turn up over and over again later in the book.
The assumption again being that the data are normally distributed!
The “−3” part is something that statisticians tack on to ensure that the normal curve has kurtosis zero. It looks a bit stupid, just sticking a “−3” at the end of the formula, but there are good mathematical reasons for doing this.
I haven’t discussed how to compute z-scores, explicitly, but you can probably guess. For a variable X, the simplest way is to use a command like (X - mean(X)) / sd(X). There’s also a fancier function called scale() that you can use, but it relies on somewhat more complicated R concepts that I haven’t explained yet.
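A minimal sketch of both approaches, using a made-up vector X. The only wrinkle is that scale() returns a one-column matrix with extra attributes, so the example strips those with as.vector() before comparing.

```r
# Made-up scores (illustrative only)
X <- c(2, 4, 4, 4, 5, 5, 7, 9)

# The direct route: subtract the mean, divide by the standard deviation
z_manual <- (X - mean(X)) / sd(X)

# The scale() route: returns a matrix, so convert it back to a plain vector
z_scaled <- as.vector(scale(X))

# The two sets of z-scores should agree up to rounding error
print(all.equal(z_manual, z_scaled))
```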
Technically, because I’m calculating means and standard deviations from a sample of data, but want to talk about my grumpiness relative to a population, what I’m actually doing is estimating a z-score. However, since we haven’t talked about estimation yet (see Chapter 10) I think it’s best to ignore this subtlety, especially as it makes very little difference to our calculations.
Though some caution is usually warranted. It’s not always the case that one standard deviation on variable A corresponds to the same “kind” of thing as one standard deviation on variable B. Use common sense when trying to determine whether or not the z-scores of two variables can be meaningfully compared.
Actually, even that table is more than I’d bother with. In practice most people pick one measure of central tendency, and one measure of variability only.
Just like we saw with the variance and the standard deviation, in practice we divide by N−1 rather than N.
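As a quick check of the N−1 divisor, the sketch below computes a covariance by hand and compares it with R's built-in cov() function. The vectors x and y are made up for the example.

```r
# Two made-up variables of equal length (illustrative only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
N <- length(x)

# Sum of the cross-products of deviations, divided by N - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)

# R's built-in function uses the same N - 1 divisor
cov_builtin <- cov(x, y)

print(cov_manual)
print(cov_builtin)
```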
This is an oversimplification, but it’ll do for our purposes.
If you are reading this after having already completed Chapter 11 you might be wondering about hypothesis tests for correlations. R has a function called cor.test() that runs a hypothesis test for a single correlation, and the psych package contains a version called corr.test() that can run tests for every correlation in a correlation matrix; hypothesis tests for correlations are discussed in more detail in Section 15.6.
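For the curious, a minimal sketch of cor.test() on two made-up variables looks like this. Only the base R function is shown, so that the example runs without installing anything; the data are invented and carry no meaning.

```r
# Two made-up variables (illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Test the null hypothesis that the true correlation is zero
result <- cor.test(x, y)

# The estimated correlation matches what cor() reports
print(result$estimate)

# The p-value for the test
print(result$p.value)
```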
An alternative usage of cor() is to correlate one set of variables with another subset of variables. If X and Y are both data frames with the same number of rows, then cor(x = X, y = Y) will produce a correlation matrix that correlates all variables in X with all variables in Y.
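Here is a minimal sketch of that usage. The data frames and their column names (a, b, u, v, w) are made up for the example; the point is only the shape of the output, which has one row per variable in X and one column per variable in Y.

```r
# Two made-up data frames with the same number of rows (illustrative only)
X <- data.frame(a = c(1, 2, 3, 4, 5),
                b = c(5, 3, 4, 1, 2))
Y <- data.frame(u = c(2, 4, 6, 8, 10),
                v = c(1, 0, 1, 0, 1),
                w = c(3, 1, 4, 1, 5))

# Correlate every variable in X with every variable in Y
m <- cor(x = X, y = Y)

# Rows are the variables of X, columns are the variables of Y
print(dim(m))
print(m)
```

In this made-up data u is exactly twice a, so the entry in row a, column u is a perfect correlation.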
It’s worth noting that, even though we have missing data for each of these variables, the output doesn’t contain any NA values. This is because, while describe() also has an na.rm argument, the default value for this function is na.rm = TRUE.
The technical term here is “missing completely at random” (often written MCAR for short). Makes sense, I suppose, but it does sound ungrammatical to me.