6: Introduction to Probability

    [God] has afforded us only the twilight … of Probability.

    – John Locke

    Up to this point in the book, we’ve discussed some of the key ideas in experimental design, and we’ve talked a little about how you can summarize a data set. To a lot of people, this is all there is to statistics: it’s about calculating averages, collecting all the numbers, drawing pictures, and putting them all in a report somewhere. Kind of like stamp collecting, but with numbers. However, statistics covers much more than that. In fact, descriptive statistics is one of the smallest parts of statistics, and one of the least powerful. The bigger and more useful part of statistics is that it provides tools that let you make inferences about data.

    Once you start thinking about statistics in these terms – that statistics is there to help us draw inferences from data – you start seeing examples of it everywhere. For instance, there is a data set available from the Centers for Disease Control and Prevention (CDC) that tracks self-reported symptoms of anxiety and depression. We might be interested in the percentage of respondents who self-reported symptoms of depression in September 2022, which was 24.2%.

    This kind of value is unremarkable in papers or in everyday life, but let’s think about what it entails. A polling company has conducted a survey, usually a pretty big one because they can afford it. The study consisted of calling or texting a sample of households in the United States, of which there are an estimated 128 million. The last timeframe of the study had a sample of 45,006, from which the researchers obtained a response rate of 4.7%. So, researchers interviewed about 2115 people to represent 128 million households. Of those 2115 people, 512 reported symptoms of depression. Clearly, the actual percentage of people with depressive symptoms remains unknown, since we only interviewed 2115 people. Even assuming that no one lied to the polling company, the only thing we can say with 100% confidence is that the true percentage of Americans with depressive symptoms is somewhere between 512/128000000 (about 0.0004%) and 127998397/128000000 (about 99.999%), because the other 1603 respondents did not report symptoms.
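
    The arithmetic behind these figures can be checked with a short Python sketch. The variable names are illustrative only, and the sketch follows the text in treating each respondent as standing for one of the 128 million households. The upper bound subtracts the 1603 respondents who did not report symptoms, since they are the only ones known not to have them.

```python
# Figures quoted in the text; variable names are illustrative.
households = 128_000_000   # estimated number of U.S. households
contacted = 45_006         # sample size in the last timeframe of the study
response_rate = 0.047      # 4.7% response rate
reported = 512             # respondents reporting depressive symptoms

respondents = round(contacted * response_rate)  # people actually interviewed
not_reported = respondents - reported           # respondents without symptoms

# Percentage observed in the sample
sample_pct = 100 * reported / respondents

# Guaranteed bounds: everyone who was not interviewed could be a "no"
# (lower bound) or a "yes" (upper bound).
lower_pct = 100 * reported / households
upper_pct = 100 * (households - not_reported) / households

print(f"respondents: {respondents}")
print(f"sample percentage: {sample_pct:.1f}%")
print(f"guaranteed range: {lower_pct:.4f}% to {upper_pct:.3f}%")
```

    The guaranteed range is so wide that it is useless in practice, which is exactly why something stronger than certainty is needed.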

    So, on what basis is it legitimate for the polling company, the newspaper, and the readership to conclude that the percent of Americans with depressive symptoms is about 24.2%?

    The answer to the question is pretty obvious: if I call 2115 people at random, and 512 of them say they have symptoms of depression, it seems very unlikely that these are the only 512 people out of the entire number of households who actually have depressive symptoms. In other words, we assume that the data collected by the polling company is pretty representative of the population at large. But how representative? Would we be surprised to discover that the true percentage is actually 23%? 29%? 37%? At this point everyday intuition starts to break down a bit. No one would be surprised by 24%, and everybody would be surprised by 37%, but it’s a bit hard to say whether 29% is plausible. We need some more powerful tools than just looking at the numbers and guessing.
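
    As a rough preview of what such tools look like, here is a minimal Python sketch. It rests on an assumption the text does not make: that the 2115 respondents are a simple random sample whose answers are independent, so the number reporting symptoms follows a binomial distribution. A real survey with a low response rate need not behave this way. The function names `binom_pmf` and `tail_prob` are hypothetical helpers introduced for illustration.

```python
import math

def binom_pmf(k, n, p):
    """P(exactly k 'yes' answers among n independent respondents).

    Computed on the log scale so that large n does not overflow.
    """
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def tail_prob(k, n, p):
    """Chance of a count at least as far from the expected n*p as k, on k's side."""
    if k <= n * p:
        counts = range(0, k + 1)   # k is below expectation: sum the lower tail
    else:
        counts = range(k, n + 1)   # k is above expectation: sum the upper tail
    return sum(binom_pmf(j, n, p) for j in counts)

n, observed = 2115, 512
for true_rate in (0.23, 0.29, 0.37):
    prob = tail_prob(observed, n, true_rate)
    print(f"assumed true rate {true_rate:.0%}: tail probability {prob:.2e}")
```

    Under this idealized model, the tail probability shrinks rapidly as the assumed true rate moves away from the observed 24.2%, which turns the vague sense of "surprise" into a number.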

    Inferential statistics provides the tools that we need to answer these sorts of questions, and since these kinds of questions lie at the heart of the scientific enterprise, they take up the lion’s share of every introductory course on statistics and research methods. However, the theory of statistical inference is built on top of probability theory. And it is to probability theory that we must now turn. This discussion of probability theory is basically background: there’s not a lot of statistics per se in this chapter, and you don’t need to understand this material in as much depth as the other chapters in this part of the book. Nevertheless, because probability theory does underpin so much of statistics, it’s worth covering some of the basics.


    This page titled 6: Introduction to Probability is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Danielle Navarro.