
22.4: Probability Distributions


    In the study of statistics, we do not navigate uncertainty without a map. Our guides are probability distributions — mathematical functions that describe the likelihood of different outcomes for a random process. You have likely encountered several of these foundational distributions already, such as the Binomial for counts of success, the Poisson for events over time or space, and the ubiquitous Normal distribution that models so many natural phenomena. Others, like the Chi-square, Student’s t, and F distributions, emerge powerfully as essential tools for statistical inference, particularly when forming confidence intervals and testing hypotheses.

    Each distribution tells a story: some arise naturally from the world around us, while others are derived as crucial consequences of our need to analyze data and draw conclusions. The coming sections will explore these distributions — and others — in detail. We will examine their properties, their origins, and, most importantly, their practical applications, building the necessary toolkit to model variability and make informed statistical decisions.

    ✦•················• ✦ •··················•✦

     

    The Uniform Distribution

The most important distribution is the Uniform distribution. All other distributions can be generated from it, either directly or through a limiting process. This distribution is so important that 20th-century statistics textbooks printed a table of random Uniform values as Table 1 (usually inside the front cover).

    The Uniform distribution is a continuous probability distribution with a fixed upper and lower bound and with all elements of its sample space having the same likelihood.

    \begin{equation}
    f(y; a,b) = \left\{ \begin{array}{ll}\frac{1}{b-a} \qquad\qquad & y \in (a, b) \\ 0 & \text{otherwise} \end{array} \right.
    \end{equation}

    This probability distribution has two parameters, \(a\) and \(b\), where \(a\) is the minimum possible value returned, and \(b\) is the maximum possible value returned.

    From this description, we have the following results if \(Y \sim Unif(a,b)\):

    • The sample space is \(\mathcal{S} = [a, b]\).
    • The expected value is \(E[Y] = \frac{b+a}{2}\).
• The variance is \(\sigma^2 = \frac{(b-a)^2}{12}\).
    • The skew is zero.

    Finally, the cumulative distribution function, CDF, is

    \begin{equation}
F(y; a,b) = \left\{ \begin{array}{ll}0&y \le a \\ \frac{y-a}{b-a} \qquad\qquad & y \in (a, b) \\ 1 & y \ge b \end{array} \right.
    \end{equation}

pdf of a Uniform distribution
    Figure \(\PageIndex{1}\): A typical Uniform distribution. Note that the likelihood (vertical axis) is zero outside the shown sample space of \(\mathcal{S} = [\text{min}, \text{max}] = [a,b]\). Also note that the likelihood within its sample space is a constant.
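
If you want to check these results yourself, here is a minimal base-R sketch (the endpoints \(a=2\) and \(b=10\) and the sample size are arbitrary choices of mine, not values from this section):

set.seed(42)                       # for reproducibility
a <- 2; b <- 10                    # arbitrary endpoints
y <- runif(1e5, min = a, max = b)  # 100,000 draws from Unif(a, b)
mean(y); (b + a) / 2               # sample mean vs. (b + a)/2
var(y);  (b - a)^2 / 12            # sample variance vs. (b - a)^2/12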

     

    The Bernoulli Distribution

    Jacob Bernoulli
    Jacob Bernoulli, c.1687
    (Source: Wikipedia)

    Arguably, the Bernoulli is the grandfather of all discrete distributions. It models a random variable that has two possible outcomes, which we call "failure" and "success," or 0 and 1. This is its "sample space," the set of all outcomes with non-zero likelihood:
    \begin{equation}
    \mathcal{S} = \{0,1\}
    \end{equation}

    The Bernoulli has a single parameter, \(\pi\), which is the probability of the random variable being 1. We will frequently refer to \(\pi\) as the "success probability." The success probability is a real number between 0 and 1, inclusive. Thus, we can write \(\pi \in \Pi = [0, 1]\).

    The probability mass function (pmf) for the Bernoulli is
    \begin{align}
    f(y) &= \left\{ \begin{array}{ll} 1-\pi & y=0 \\ \pi & y=1 \\ 0 & \text{otherwise} \end{array} \right.
    \end{align}

A Bernoulli distribution with \(\pi = 0.77\)
    Figure \(\PageIndex{2}\): A Bernoulli distribution. Here, the probability of success is \(\pi = 0.77\), hence the probability of a success (the spike at 1) is greater than the probability of a failure (spike at 0).

     

    Technically, the probability mass function must return a value for every real \(y\), hence the need for the third line in the definition of the Bernoulli pmf. With that being said, it is frequently left off and assumed. Thus, we will often see it as
    \begin{align}
    f(y) &= \left\{ \begin{array}{ll} 1-\pi & y=0 \\ \pi & y=1 \\ \end{array} \right.
    \end{align}
    without any loss of understanding.

The cumulative distribution function (CDF) of the Bernoulli distribution is
    \begin{align}
    F(y) &= \left\{ \begin{array}{ll} 0 & y<0 \\ 1-\pi & 0 \leq y < 1 \\ 1 & 1 \leq y \\ \end{array} \right.
    \end{align}

    Similarly, this will frequently be written as
    \begin{equation}
    F(y) = 1-\pi \qquad \text{for } 0 \leq y < 1
    \end{equation}
    without confusion.

    The mean (expected value) of a Bernoulli random variable is
    \begin{align}
    E[Y] &\stackrel{\text{def}}{=} \sum_{i \in \mathcal{S}}\ y_i\ f(y_i) \\[1em]
    &= 0 \times (1-\pi) + 1 \times (\pi) \\[1em]
    &= \pi
    \end{align}

    Its variance is
    \begin{align}
    V[Y] &\stackrel{\text{def}}{=} \sum_{i \in \mathcal{S}}\ (y_i-\mu)^2\ f(y_i) \\[1em]
    &= (0-\pi)^2 \times (1-\pi) + (1-\pi)^2 \times (\pi) \\[1em]
    &= \pi^2(1-\pi) + \pi(1-\pi)^2 \\[1em]
    &= \pi^2 - \pi^3 + \pi - 2\pi^2 + \pi^3 \\[1em]
    &= \pi(1-\pi)
    \end{align}
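
Both results are easy to verify by simulation. Here is a quick base-R sketch; rbinom() with size = 1 draws Bernoulli values, and \(\pi = 0.77\) is just an arbitrary choice matching the figure:

set.seed(1)
p <- 0.77                               # the success probability pi
y <- rbinom(1e5, size = 1, prob = p)    # 100,000 Bernoulli draws
mean(y); p                              # compare to E[Y] = pi
var(y);  p * (1 - p)                    # compare to V[Y] = pi(1 - pi)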

    Skew (or skewness) measures the asymmetry of a probability distribution around its mean. Positive skew (right-skew) indicates a long tail on the right, meaning most data is clustered on the left with some high outliers, while negative skew (left-skew) has a long tail on the left with low outliers. Knowing skew is important for several reasons, some of which are:

    • Interpretation of Averages
      In a skewed distribution, the mean is pulled toward the tail. Knowing the skew warns you that the mean may be misleading (e.g., average income is often skewed high by outliers, so the median is more representative).
    • Informing Data Transformations & Model Choice
      Many statistical models assume symmetry. Identifying skew helps you decide if you need to transform the data (e.g., using a log transformation for positive skew) or choose a non-parametric test.
    • Risk and Decision-Making
      In finance, positive skew of returns might be desirable (small chance of huge gains), while negative skew signals a risk of large losses.

    The skew of a Bernoulli is
    \begin{align}
    \gamma_3(Y) &\stackrel{\text{def}}{=} E\left[\left(\frac{Y-\mu}{\sigma}\right)^3\right] \\[1em]
    &= \sum_{i \in \{0,1\}}\ \left(\frac{y_i-\mu}{\sigma}\right)^3\ f(y_i) \\[1em]
&= \left(\frac{0-\pi}{\sqrt{\pi(1-\pi)}}\right)^3 \times (1-\pi) + \left(\frac{1-\pi}{\sqrt{\pi(1-\pi)}}\right)^3 \times (\pi) \\[1em]
    & \text{\ldots\ some algebra \ldots} \nonumber \\[1em]
    &= \frac{1-2\pi}{\sqrt{\pi(1-\pi)}}
    \end{align}

    Here, we are defining skew as the "third standardized moment." Other definitions are available, including the Hildebrand ratio, which you may have learned in your introductory statistics course,
    \begin{equation}
    H \stackrel{\text{def}}{=} \frac{\overline{Y} - \widetilde{Y}}{S}
    \end{equation}

     

    Example \(\PageIndex{1}\)

    What is the Hildebrand ratio for a Bernoulli distribution?

    Solution.
    Because the median of the Bernoulli is not a smooth function, let us break this proof into three cases. Note, we will assume \(\pi \in (0,1)\). If \(\pi =0\) or \(\pi = 1\), the statistics are not interesting.

    Case 1: \(\pi > 0.500\)
Let \(Y \sim Bern(\pi)\). Let \(\pi > 0.500\). Because more than half of the probability mass is at 1, the median is \(\widetilde{Y} = 1\). From the probability mass function and the definition of the Hildebrand ratio, we have the following:
\begin{align}
H &\stackrel{\text{def}}{=} \frac{\overline{Y} - \widetilde{Y}}{S} \\[1em]
&= \frac{\pi - 1}{\sqrt{\pi(1-\pi)}} \\[1em]
&= -\sqrt{\frac{1-\pi}{\pi}}
    \end{align}

    Since this is always negative, we know that the Bernoulli is negatively skewed if \(\pi > 0.500\).

Case 2: \(\pi < 0.500\)
Now the median is \(\widetilde{Y} = 0\), so, as previously, we have the following:
\begin{align}
H &\stackrel{\text{def}}{=} \frac{\overline{Y} - \widetilde{Y}}{S} \\[1em]
&= \frac{\pi - 0}{\sqrt{\pi(1-\pi)}} \\[1em]
&= \sqrt{\frac{\pi}{1-\pi}}
    \end{align}

    Since this is always positive, we know that the Bernoulli is positively skewed if \(\pi < 0.500\).

    Case 3: \(\pi = 0.500\)
Here the median can be taken to be \(\widetilde{Y} = 0.500 = \pi\), so, as previously, we have the following:
\begin{align}
H &\stackrel{\text{def}}{=} \frac{\overline{Y} - \widetilde{Y}}{S} \\[1em]
&= \frac{\pi - \pi}{\sqrt{\pi(1-\pi)}} \\[1em]
    &= 0
    \end{align}

    Thus, the Bernoulli is symmetric when \(\pi=0.500\).

    \(\blacksquare\)

    Note

    In the previous example, I required \(\pi \notin \{0,1\}\). This is not a restriction for applied statistics. Why? What does it mean if \(\pi=0\) or \(\pi=1\)? When would such a thing happen? Why would we need to study it?

    Kurtosis measures the "tailedness" of a probability distribution, indicating how much data is in the tails compared to the center. A high kurtosis means heavy tails and more outliers, while low kurtosis means lighter tails and fewer outliers. Knowing kurtosis is valuable for a few reasons:

    • Risk and Outlier Analysis
      In fields like finance and quality control, high kurtosis signals a higher probability of extreme events (both crashes and windfalls). This is critical for stress-testing and risk management, as it warns of more frequent outliers than a Normal process would predict.
    • Statistical Modeling
      Many statistical tests (e.g., ANOVA, regression) assume Normally distributed errors. High kurtosis can violate this assumption, leading to incorrect conclusions. Checking kurtosis helps validate your model and guides you toward more robust analytical methods.
    • Process Understanding
      In engineering or manufacturing, the shape of a distribution reveals process behavior. Low kurtosis might indicate a process with consistent, controlled variation, while high kurtosis could suggest intermittent, severe faults.

    The kurtosis of a Bernoulli is
    \begin{align}
    \gamma_4(Y) &\stackrel{\text{def}}{=} E\left[\left(\frac{Y-\mu}{\sigma}\right)^4 \right]\\[1em]
    &= \sum_{i \in \{0,1\}}\ \left(\frac{y_i-\mu}{\sigma}\right)^4\ f(y_i) \\[1em]
&= \left(\frac{0-\pi}{\sqrt{\pi(1-\pi)}}\right)^4 \times (1-\pi) + \left(\frac{1-\pi}{\sqrt{\pi(1-\pi)}}\right)^4 \times (\pi) \\[1em]
    & \text{\ldots\ some algebra \ldots} \nonumber \\[1em]
    &= \frac{1 - 3\pi(1-\pi)}{\pi(1-\pi)}
    \end{align}

    Here, we are defining kurtosis as the "fourth standardized moment."

    The usual measure is called the "excess kurtosis." This is just the kurtosis minus 3. Why 3? The kurtosis of the Normal distribution is 3. Thus, the excess kurtosis measures how its kurtosis differs from the Normal. Thus, the excess kurtosis of the Bernoulli is

    \begin{equation}
    \frac{1-6\pi(1-\pi)}{\pi(1-\pi)}
    \end{equation}
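
Both the skew and the excess-kurtosis formulas can be checked by computing the standardized moments directly from a simulation. A base-R sketch (again, \(\pi = 0.77\) is an arbitrary choice):

set.seed(2)
p <- 0.77
y <- rbinom(1e6, size = 1, prob = p)
z <- (y - mean(y)) / sd(y)                              # standardized values
mean(z^3); (1 - 2 * p) / sqrt(p * (1 - p))              # empirical skew vs. formula
mean(z^4) - 3; (1 - 6 * p * (1 - p)) / (p * (1 - p))    # empirical excess kurtosis vs. formula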

    Note

While the Bernoulli distribution is heavily used only in Chapter 15: Binary Dependent Variables, it is a simple distribution that serves as a basis for better understanding other probability distributions.

      

    The Binomial Distribution

    The Binomial distribution arises from modeling independent repeated trials where the variable is the number of successes out of a known number of trials.

    Definition: Binomial Random Variable

    Let \(X_i \stackrel{\text{iid}}{\sim} Bern(\pi)\). If we define \(Y = \sum_{i=1}^n X_i\), then \(Y \sim Bin(n,\pi)\).

    Here, the symbol \(\stackrel{\text{iid}}{\sim}\) indicates the random variables are independent and identically distributed. In other words, they constitute a random sample.

    Interpreting the above definition gives us the following five requirements for a random variable to follow a Binomial distribution:

    1. The number of trials, \(n\), is known.
    2. Each trial has two possible outcomes: success and failure.
    3. The probability of a success, \(\pi \in \Pi = (0, 1)\), does not change from trial to trial.
    4. The trials are independent.
    5. The random variable is the number of successes in those \(n\) trials.

    If your random variable follows all five of these conditions, then it follows a Binomial distribution with parameters \(n\) and \(\pi\).

    If you are interested, the probability mass function (pmf) of a Binomial random variable is
    \begin{equation}
    P[Y=y;\ n, \pi]\ =\ \binom{n}{y}\ \pi^{y} \left(1-\pi\right)^{n-y}
    \end{equation}

    A sample Binomial distribution.
    Figure \(\PageIndex{3}\): A graphic of the probability mass function (pmf) of a typical Binomial distribution. Here, \(n=8\) and \(\pi=0.77\). What is the mean?

    The Binomial distribution is symmetric only when \(\pi = 0.500\). If \(\pi < 0.500\), then it is skewed right. Otherwise, it is skewed left. The sample space of a Binomial random variable is \(\mathcal{S} = \{0, 1, 2, \ldots, n\}\). The expected value is \(E[Y] = n\pi\) and the variance is \(V[Y] = n\pi(1-\pi)\).

    A Simple Normal Approximation

    Under certain circumstances, the Binomial distribution can be approximated with the Normal distribution. This arises from the Central Limit Theorem (Section: The Central Limit Theorem). This is especially important if we are examining the distribution of a proportion instead of a count.

    Theorem \(\PageIndex{1}\): Normal Approximation to the Binomial

    Let \(Y \sim Bin(n,\ \pi)\). The distribution of the number of successes is
    \begin{equation}
    Y \stackrel{\cdot}{\sim} N\big(n\pi,\ n\pi(1-\pi)\big)
    \end{equation}

    Here, the symbol \(\stackrel{\cdot}{\sim}\) indicates the distribution is approximate. As one would expect, the approximation improves as the sample size, \(n\), increases. This is due entirely to the Central Limit Theorem (Section: The Central Limit Theorem) and the fact that a Binomial random variable is the sum of independent and identically distributed (iid) Bernoulli random variables.

    Also, the approximation improves if any of a variety of "continuity corrections" are used.
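
To see the approximation (and one simple continuity correction) in action, here is a base-R sketch comparing the exact Binomial probability \(P[Y \le 45]\) to its Normal approximation; the values \(n = 100\), \(\pi = 0.40\), and the cutoff 45 are arbitrary choices of mine:

n <- 100; p <- 0.40
pbinom(45, size = n, prob = p)                         # exact P[Y <= 45]
pnorm(45,   mean = n * p, sd = sqrt(n * p * (1 - p)))  # Normal approximation
pnorm(45.5, mean = n * p, sd = sqrt(n * p * (1 - p)))  # with a continuity correction (closer)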

    Theorem \(\PageIndex{2}\): Distribution of Sample Proportion

    Let \(Y \sim Bin(n,\ \pi)\). The distribution of the sample proportion, \(P \stackrel{\text{def}}{=} \frac{Y}{n}\) is
    \begin{equation}
    P \stackrel{\cdot}{\sim}\ N\left(\pi,\ \frac{\pi(1-\pi)}{n}\right)
    \end{equation}

    As one would expect from the Central Limit Theorem, the approximation improves as the sample size, \(n\), increases.

    Proof.
    This proof proceeds from approximating the distribution of \(Y\) with a Normal distribution (Theorem: Normal Approximation to the Binomial), then using the characteristics of the Normal distribution to obtain the answer. I leave this as an exercise.

     

    The Poisson Distribution

    Simon Denis Poisson
    Siméon Denis Poisson, c.1840
    (Source: Wikipedia)

    The Poisson distribution arises from counting the number of successes over a time period or an area. Contrast this with the Binomial, where the successes were counted over the number of trials.

    The probability mass function of the Poisson distribution is
    \begin{equation}
P[Y=y;\ \lambda] = \frac{e^{-\lambda} \lambda^y}{y!}
    \end{equation}

The expected value is \(E[Y] = \lambda\), and the variance is \(V[Y] = \lambda\). The sample space is \(\mathcal{S} = \{0, 1, 2, \ldots\}\). It is skewed right, regardless of the value of \(\lambda\); however, that skew goes to zero as \(\lambda \to \infty\). If the skew is defined as the standardized third moment, we can "easily" show that
    \begin{equation}
    \gamma_3(Y) = E\left[\left(\frac{Y-\mu}{\sigma}\right)^3\right] = \lambda^{-1/2}
    \end{equation}

    Similarly, if we define the kurtosis as the standardized fourth moment, we know the excess kurtosis converges to zero as \(\lambda \to \infty\)
    \begin{equation}
\textrm{excess kurtosis} = E\left[\frac{(Y-\mu)^4}{\sigma^4}\right] - 3 = \lambda^{-1}
    \end{equation}

    These previous results show that the Poisson distribution becomes more and more Normal as \(\lambda \to \infty\).

    A typical Poisson distribution.
Figure \(\PageIndex{4}\): Here, note that the Poisson continues ad infinitum to the right. This Poisson has parameter \(\lambda = 3.15\).

     

    Note

The Poisson distribution is an example of an "infinitely divisible" distribution. This means that any Poisson random variable can be written as the sum of independent Poisson random variables. This characteristic is rare, but it also holds for the Normal distribution. It is important because the Central Limit Theorem applies to sums of random variables. Since the Poisson distribution is a sum of other Poisson distributions, the CLT tells us that the Poisson distribution converges to the Normal distribution (as \(\lambda \to \infty\)).

    The Poisson distribution is the number of successes over a time period or an area. This suggests that the Poisson distribution arises as a limiting case of the Binomial distribution when \(n \to \infty\), and \(\mu = n\pi\) is a constant value. We prove this in the next theorem.

    Theorem \(\PageIndex{3}\): Poisson as the Limit of a Binomial

    If \(Y_n \sim Bin\left(n, \pi \right)\), then
    \begin{equation}
    \lim_{n \to \infty}\ Y_n \sim Pois\left(\lambda\right)
    \end{equation}
    as long as \(\mu = n\pi = \lambda\) remains constant.

    Proof.
This proof heavily relies on Stirling's approximation, \(n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\), which you could have seen in your calculus course.

    \begin{align}
    P[Y=y;\ n,\pi] &= \binom{n}{y}\ \pi^{y} \left(1-\pi\right)^{n-y} \\[1em]
    &= \frac{n!}{(n-y)!\ y!}\ \pi^{y} \left(1-\pi\right)^{n-y} \\[1em]
    &\approx \frac{\sqrt{2\pi n}\left(\frac{n}{e}\right)^n}{\left( \sqrt{2\pi (n-y)}\left(\frac{n-y}{e}\right)^{n-y} \right)\ y!}\quad \pi^{y} \left(1-\pi\right)^{n-y} \\[1em]
    &= \sqrt{\frac{n}{n-y}}\ \frac{n^n e^{-y}}{(n-y)^{n-y}\ y!}\ \pi^{y} \left(1-\pi\right)^{n-y}
    \end{align}

    Now, let \(n \to \infty\) to give us

    \begin{align}
    &= 1\ \frac{n^n e^{-y}}{(n-y)^{n-y}\ y!}\ \pi^{y} \left(1-\pi\right)^{n-y}
    \end{align}

    and holding \(n\pi = \lambda\) (i.e., \(\pi = \lambda/n\)), we have

    \begin{align}
    P[Y=y;\ n,\pi] &= \frac{n^n\ e^{-y}}{n^{n-y}\left( 1-\frac{y}{n}\right)^{n-y}\ y!}\quad \left( \frac{\lambda}{n} \right)^y \left( 1-\frac{\lambda}{n} \right)^{n-y} \\[1em]
    &= \frac{\lambda^y \left( 1-\frac{\lambda}{n} \right)^{n-y} e^{-y}}{\left( 1-\frac{y}{n} \right)^{n-y}\ y!} \\[1em]
    \end{align}

    As \(n \to \infty\), we have \(n-y \approx n\). This gives

    \begin{align}
    P[Y=y;\ n,\pi] &\approx \frac{\lambda^y \left( 1-\frac{\lambda}{n} \right)^{n} e^{-y}}{\left( 1-\frac{y}{n} \right)^{n}\ y!}
    \end{align}

    Remember from calculus that \(n \to \infty\) means \(\left( 1-\frac{y}{n} \right)^{n} \to e^{-y}\), by definition. Now, applying this limit, we have

    \begin{align}
    P[Y=y;\ n,\pi] &= \frac{\lambda^y e^{-\lambda}e^{-y}}{e^{-y}\ y!} \\[1em]
    &= \frac{\lambda^y e^{-\lambda}}{y!}
    \end{align}

    Since this is the probability mass function of the Poisson distribution, the theorem is proven.

    \(\blacksquare\)

    Thus, we have shown that the Poisson distribution is the limiting distribution of a Binomial distribution when the number of trials goes to infinity and the expected value remains constant (the probability of success goes to zero).

    Stop and think about why this fact suggests that the Poisson can model successes over time or space.
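
Here is a numerical sketch of this limit in base R: holding \(\lambda = n\pi\) fixed at 3.15 (an arbitrary value matching the figure) and letting \(n\) grow, the Binomial pmf values approach the Poisson pmf values.

lambda <- 3.15
y <- 0:10
dpois(y, lambda)                              # Poisson pmf
dbinom(y, size = 20,   prob = lambda / 20)    # Binomial with n = 20
dbinom(y, size = 2000, prob = lambda / 2000)  # Binomial with n = 2000 (much closer to the Poisson)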

    Note

    The reason for understanding the Poisson distribution is that it is used to model counts (number of successes over time or space). It is the focus of count regression of Chapter 17: Count Dependent Variables.

      

    The Exponential Distribution


    Pierre-Simon de Laplace, c.1850
    (Source: Wikipedia)

    The exponential distribution is widely used in various fields due to its property of modeling the time between events in a Poisson process, where events occur continuously and independently at a constant average rate. In reliability engineering and survival analysis, it is a fundamental tool for modeling the lifespan of components or the time until failure of a system, particularly for items with a constant hazard rate, meaning they are memoryless. In queueing theory, it is used to model the inter-arrival times of customers at a service point or the service times themselves, forming the basis for models like M/M/1 queues. Other key applications include telecommunications, where it models the time between incoming calls or data packets; finance, for modeling the time until a default event in credit risk modeling; and epidemiology, for describing the incubation period of certain diseases. Its mathematical simplicity and the memoryless property make it a convenient and powerful first-choice model for analyzing waiting times and durations.

    Its sample space is all positive real numbers, \(\mathcal{S} = (0, \infty)\).

    The probability density function (pdf) of an Exponential is either

    \begin{equation}
f(y; \lambda) = \lambda\ e^{-\lambda y}
    \end{equation}

    or

    \begin{equation}
f(y; \theta) = \frac{1}{\theta}\ e^{-y/\theta}
    \end{equation}

Note that there are two main parameterizations of the Exponential distribution. I tend to use the \(\lambda\) parameterization because it's more fun to write a \(\lambda\) than a \(\theta\). Note that the \(\theta\) is occasionally written as a \(\beta\) and represents the "scale" of the distribution. So, that's three parameterizations for the Exponential. That's not a record, by any means. All it means to us is that we need to know which parameterization is being used. I will stick with the \(\lambda\) parameterization for fun and profit.

The expected value and standard deviation are \(\mu = \sigma = 1/\lambda\). The median is \(\ln 2 / \lambda\). The skew is \(\gamma_3 = 2\), and the excess kurtosis is \(6\).

    The cumulative distribution function (CDF) is

    \begin{equation}
F(y; \lambda) = \left\{ \begin{array}{ll}1-e^{-\lambda y} \qquad & y \ge 0\\ 0 & \text{otherwise} \end{array}\right.
    \end{equation}

    An Exponential distribution
    Figure \(\PageIndex{5}\): The probability density function for an Exponential distribution. Here, \(\lambda = 0.2\). Thus, the expected value is 5... and the height of the Exponential at its mode is 0.2.

       

    The Memoryless Property

Because it is one of the defining hallmarks of the Exponential distribution, let's look at its memoryless property. In terms of linear regression, it's little more than a curiosity. However, for statistics and data science in general, it is quite useful. Note that the exponential distribution is uniquely memoryless among continuous distributions, meaning the probability of an event occurring in the next interval is independent of how much time has already passed. Formally, for \( X \sim \text{Exp}(\lambda) \) and \( s, t > 0 \), the property states \( P(X > s + t \,|\, X > s) = P(X > t) \). This characteristic makes it ideal for modeling waiting times where the future is independent of the past, such as radioactive particle decay or the arrival of the next customer at a service desk.

    Proof.
    Using the definition of conditional probability and the exponential survival function \( P(X > x) = e^{-\lambda x} \):

    \begin{align*}
    P(X > s + t \,|\, X > s) &= \frac{P(X > s + t \cap X > s)}{P(X > s)} \\
    &= \frac{P(X > s + t)}{P(X > s)} \quad \text{(since } s+t > s\text{)} \\
    &= \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} \\
    &= e^{-\lambda t} \\
    &= P(X > t).
    \end{align*}
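
A quick numerical check of this property using base R's pexp(); the rate \(\lambda = 0.2\) and the values \(s = 3\) and \(t = 4\) are arbitrary choices of mine:

lambda <- 0.2; s <- 3; tt <- 4     # tt plays the role of t in the statement above
(1 - pexp(s + tt, rate = lambda)) / (1 - pexp(s, rate = lambda))  # P(X > s + t | X > s)
1 - pexp(tt, rate = lambda)                                       # P(X > t)
exp(-lambda * tt)                                                 # both equal e^(-lambda t)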

    As a Transformed Uniform???

The last thing I want to discuss is not unique to the Exponential. I mentioned above in the Uniform distribution section that it was the grandfather of all distributions in that all distributions can be generated from the Uniform. Frequently, this is done through easy definitions (like the Bernoulli) or limiting cases (like the Poisson and Normal). However, it can also be done through the "Probability Integral Transform," which was introduced by Ronald Fisher in his 1932 edition of the book Statistical Methods for Research Workers.

    What it means is that the standard Uniform distribution can be transformed into (just about) any other distribution. For instance, an Exponential random variable can be generated from a Uniform through this transformation:

    \begin{equation}
-\frac{1}{\lambda} \log(U) \sim Exp(\lambda)
    \end{equation}

    Here, \(U \sim Unif(0, 1)\), a standard Uniform distribution.
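
A sketch of this transformation in base R: transform standard Uniform draws and compare the result to the Exponential's theoretical mean and quantiles (the rate \(\lambda = 0.2\) and the sample size are arbitrary):

set.seed(7)
lambda <- 0.2
u <- runif(1e5)                           # standard Uniform draws
y <- -log(u) / lambda                     # transformed draws
mean(y); 1 / lambda                       # sample mean vs. E[Y] = 1/lambda
quantile(y, c(0.25, 0.50, 0.75))          # sample quartiles ...
qexp(c(0.25, 0.50, 0.75), rate = lambda)  # ... vs. the Exponential quartiles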

    This is not unique to the Exponential. Also, you should just file this under "hmmmm... interesting" and nothing more.

     

    The Normal (Gaussian) Distribution

    Carl Friedrich Gauss
    Carl Friedrich Gauss c.1840
    (Source: Wikipedia)

The Normal distribution, also known as the Gaussian distribution and the Gauss-Laplace distribution, is ubiquitous in statistics. This is due to the Central Limit Theorem (see Section: The Central Limit Theorem), which states that the distribution of the sample sum (or sample mean) approaches a Normal distribution, regardless of how the original variable is distributed (as long as the variance is finite and the sample is random).

    The probability density function (pdf) of the Normal is
    \begin{equation}
    f(y;\ \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{(y-\mu)^2}{\sigma^2} \right]\label{eq:appS-Normal}
    \end{equation}

    Note

    The probability mass function (pmf) describes the probability of a discrete random variable taking on a specific exact value. The probability density function (pdf) describes the relative likelihood of a continuous random variable falling near a particular value, where probability is given by the area under the pdf curve over an interval.

In short: a pmf gives actual probabilities for distinct outcomes, while a pdf gives a density that must be integrated to find probabilities for a range of values.

    The Normal distribution is symmetric, has expected value \(E[Y] = \mu\), variance \(V[Y] = \sigma^2\), and sample space \(\mathcal{S} = ℝ\), all real values.

    Gaussian (Normal) distribution
Figure \(\PageIndex{6}\): A Normal distribution showing the mean and several other values. Recall from your introductory statistics course that about 68% of the distribution is between \(\mu -\sigma\) and \(\mu + \sigma\); 95% between \(\mu -2\sigma\) and \(\mu + 2\sigma\); and 99.7% between \(\mu -3\sigma\) and \(\mu + 3\sigma\).
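
Those percentages are easy to recover from the Normal CDF in base R; pnorm() with its default arguments gives the standard Normal:

pnorm(1) - pnorm(-1)   # about 0.683
pnorm(2) - pnorm(-2)   # about 0.954
pnorm(3) - pnorm(-3)   # about 0.997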

     

    Example \(\PageIndex{2}\)

    Calculate the skew of the Normal distribution.

    Solution.
    There are several ways to show that the Normal is symmetric (skew = 0). Here are three.

    Method 1: Math
    The mathematical definition of symmetry is that \(f(y)\) is symmetric about the vertical line \(\mu\) if \(f(\mu - y) = f(\mu + y)\). Here is the proof:
    \begin{align}
    f(\mu - y) &= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{(\mu - y - \mu)^2}{\sigma^2} \right] \\[1em]
    &= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{( - y)^2}{\sigma^2} \right] \\[1em]
    &= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{(y)^2}{\sigma^2} \right]
    \end{align}

    and

    \begin{align}
    f(\mu + y) &= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{(\mu + y - \mu)^2}{\sigma^2} \right] \\[1em]
    &= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{(y)^2}{\sigma^2} \right]
    \end{align}
    Thus since \(f(\mu - y) = f(\mu + y)\), we have shown \(f(y)\) is symmetric about \(\mu\).

    Method 2: Hildebrand
    We can also use the Hildebrand rule to show that the Normal distribution is symmetric:
    \begin{align}
    H &= \frac{\overline{Y} - \widetilde{Y}}{S} \\[1em]
    &= \frac{\mu - \mu}{\sigma} \\[1em]
    &= 0
    \end{align}
    Thus, since \(H=0\), the Normal distribution is a symmetric distribution.

    Note that I skipped over the part where I prove \(\widetilde{Y} = \mu\). I leave that for later (Theorem: The Median of a Normal).

    Method 3: Third Standardized Moment
    We can also use the third standardized moment to show that the Normal distribution is symmetric:
    \begin{align}
    f(y) &= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{(y - \mu)^2}{\sigma^2} \right] \\[1em]
    \gamma_3 &= E\left[\left(\frac{Y - \mu}{\sigma}\right)^3\right] \\
    &= \int_{ℝ}\ f(y)\ \left(\frac{y - \mu}{\sigma}\right)^3\ \text{d}y
    \end{align}
    I leave it as an exercise to expand the cube, separately calculate \(E[Y^2]\) and \(E[Y^3]\), and do the integration. There is a lot of algebra, but not much more than that if you are careful.

    \(\blacksquare\)

There is a wealth of information on the Normal distribution. Arguably, it is the most studied distribution in statistics. It is also the most important distribution in statistics because of the Central Limit Theorem. For these reasons, I offer little to say about it beyond the CLT.

    Theorem \(\PageIndex{4}\): The Median of a Normal

    The median of a Normal distribution is \(\mu\).

    Proof.
    By definition for a continuous distribution, the median is the value \(\tilde{y}\) such that \(F(\tilde{y}) = 0.500\). Thus, this proof reduces to a "mere calculation:"

    \begin{align}
    F(\mu) &= \int_{-\infty}^{\mu}\ \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{1}{2}\ \frac{(y - \mu)^2}{\sigma^2} \right]\ \mathrm{d}y
    \end{align}

By the symmetry of the Normal density about \(\mu\) (see Example \(\PageIndex{2}\)), the area to the left of \(\mu\) equals the area to the right; since the total area under the density is 1, this integral equals \(0.500\). Thus, we have shown that \(\mu\) is the median of a Normal distribution.

    \(\blacksquare\)

     

    The Chi-Square Distribution

    Prasanta Chandra Mahalanobis
    Prasanta C. Mahalanobis, 1962
    (Source: Wikipedia)

    The Chi-square distribution arose from multiple areas. One was the need to model the variation of the sample. A second was in examining categorical variables. It is used in the Chi-square goodness-of-fit test and the Chi-square test of independence. In both of those cases, the test statistic only approximately follows a Chi-square distribution.

    If you want it (and I'm not entirely sure why you would), here is the probability density function of the Chi-square distribution based on its one parameter, \(\nu\), the "number of degrees of freedom:"

    \begin{equation}
    f(y;\ \nu) = \frac{1}{2^{\nu/2}\ \Gamma\left(\frac{\nu}{2}\right)}y^{\nu/2-1} e^{-y/2}
    \end{equation}
    where \(0 < \nu\) and the gamma function is defined as

    \begin{equation}
    \Gamma(y) \stackrel{\text{def}}{=} \int_0^\infty\ t^{y-1}e^{-t}\ \text{d}t
    \end{equation}

    Note

The Chi-square distribution is actually the distribution of the squared standardized term appearing in the exponent of the Normal probability density function. That is, if \(Y \sim N(\mu,\ \sigma^2)\), then

    \begin{equation}
    \frac{(Y-\mu)^2}{\sigma^2} \sim \chi^2_{\nu=1}
    \end{equation}

    The square root of this fraction is called the "Mahalanobis distance." The Mahalanobis distance is unitless, scale-invariant, and takes into account the correlations of the data set. One sees it in data science applications, especially in clustering and outlier detection.

When you read probability papers from around the turn of the 20th century, you may see \(\chi\) as a variable of interest. Originally, it was just a curvy, fancy \(x\) used to illustrate that the Normal distribution could approximate the Binomial distribution:

    \begin{equation}
    \chi = \frac{x - n\pi}{\sqrt{n\pi(1-\pi)}}
    \end{equation}

    With this, \(\chi \stackrel{\cdot}{\sim} N(0,\ 1)\).

    At the start of the 20th century, statisticians were able to turn this equation into a definition for the Chi-square distribution:

    Definition: The Chi-Square Distribution

    Let \(Z_i\) be a random sample of size \(\nu\) from a standard Normal distribution, that is \(Z_i \stackrel{\text{iid}}{\sim} N(0,\, 1)\), then

    \begin{equation}
    \sum_{i=1}^{\nu} Z_i^2 \sim \chi^2_{\nu}
    \end{equation}

    In this definition, \(\nu\) (pronounced "nu") is the number of degrees of freedom, the number of those Normal distributions that are independent.

    It is this definition that is most helpful in determining what is (and what is not) a Chi-square random variable.

    Note the following:

    • The expected value of a Chi-square distribution is \(\nu\), and the variance is \(2\nu\).
    • The sample space is \(\mathcal{S} = (0, \infty)\).
    • It is always positively skewed (\(\gamma_3(Y) = \sqrt{8/\nu}\)), although that skew goes to zero as \(\nu \to \infty\).
    • Its excess kurtosis also goes to 0 as \(\nu \to \infty\) (ex.kurt. = \(12/\nu\)).
• If the number of degrees of freedom is 2 or less, then the probability density function is strictly decreasing (its first derivative is always less than zero). Among other things, this means that the mode is zero if \(\nu \le 2\).
    Figure \(\PageIndex{7}\): A Chi-square distribution. Here, \(\nu=5\).
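
A simulation sketch of the definition in base R: sum \(\nu\) squared standard Normal draws many times and compare the resulting sample mean and variance to \(\nu\) and \(2\nu\) (here \(\nu = 5\), matching the figure; the sample size is arbitrary):

set.seed(3)
nu <- 5
z <- matrix(rnorm(1e5 * nu), ncol = nu)  # 100,000 rows of nu standard Normal draws
y <- rowSums(z^2)                        # each row becomes one Chi-square draw
mean(y); nu                              # compare to E[Y] = nu
var(y);  2 * nu                          # compare to V[Y] = 2 nu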

     

    The Student's t Distribution

    William Sealy Gosset in 1908
    William Sealy Gosset in 1908
    (Source: Wikipedia)

    Arguably, most of the previous distributions arise from Nature. The Student's t distribution arises from a need to test certain hypotheses. This distribution is named after its creator, William Sealy Gosset (as pictured in the figure to the right). Gosset worked for Guinness Brewery as a master brewer (and statistician) near the end of the 19th century. He applied the well-known statistical techniques in his job, but found that these techniques had major flaws. Apparently, he felt that he was rejecting far too many bushels of grain based on current statistical techniques.

    Specifically, Gosset determined that the test statistic

    \begin{equation}
Z = \frac{\overline{Y}-\mu}{s/\sqrt{n}}
    \end{equation}

    did not follow a standard Normal distribution as expected, when the sample size was small. In fact, for his small samples, it was not even "sufficiently" close.

    The problem was that Gosset was working with small samples of barley, while most statistics at the time were concerned with large samples. In my opinion, this is Gosset's greatest contribution: paying attention to small-sample properties of estimators. While Guinness supported him, they did not want a repeat of a previous employee who published trade secrets in a scientific journal. Thus, to get his discoveries out there, Gosset had to publish under a pseudonym. He chose "Student."

    The following is Gosset's definition of his Student's t distribution.

    Definition: The Student's t Distribution

    Let \(Z \sim N(0,\, 1)\) and \(V \sim \chi^2_{\nu}\) with \(Z \perp V\). Define the following ratio:

    \begin{equation}
    T = \frac{Z}{\sqrt{V/\nu}}
    \end{equation}

    The random variable \(T\) follows a Student's t distribution with \(\nu\) degrees of freedom.

    If you care to see it (and some do like the mathematics), this is the probability density function calculated by Fisher:
    \begin{equation}
    f(y;\ \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\ \Gamma\left(\frac{\nu}{2}\right)} \left( 1 + \frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}}
    \end{equation}

    Again, \(\Gamma(\cdot)\) is the gamma function.

    Here are some results:

    • Its sample space is \(\mathcal{S} = ℝ\).
    • The mean of the t distribution is \(E[Y] = 0\), if \(\nu > 1\).
    • The variance is \(V[Y] = \frac{\nu}{\nu-2}\), if \(\nu>2\); \(\infty\), if \(\nu \in (1,2]\); and undefined elsewhere.
• It has zero skew, \(\gamma_3(Y) = 0\), provided \(\nu > 3\).
• Its excess kurtosis is \(\frac{6}{\nu-4}\), if \(\nu > 4\); \(\infty\), if \(2 < \nu \le 4\); and undefined elsewhere.

    As \(\nu \to \infty\), the Student's t distribution converges to the Normal distribution. The proof of this is a nice exploration of the mathematics you already know. You should make sure you can prove it.
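
Here is a base-R sketch that illustrates both the definition and the convergence: build \(T\) from Normal and Chi-square draws, then watch a t quantile approach the corresponding Normal quantile as \(\nu\) grows (the values of \(\nu\) and the sample size are arbitrary choices):

set.seed(11)
nu <- 4
z <- rnorm(1e5)
v <- rchisq(1e5, df = nu)
tdraws <- z / sqrt(v / nu)                     # draws built from the definition
quantile(tdraws, 0.975); qt(0.975, df = nu)    # simulated vs. exact t quantile
qt(0.975, df = c(4, 30, 1000)); qnorm(0.975)   # t quantiles approach the Normal quantile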

    Student's t distribution
    Figure \(\PageIndex{8}\): A typical Student's t distribution. Here, \(\nu=4\).

     

    Note

While the t-distribution was formulated by Gosset to deal with the distribution of a test statistic, it was actually first seen as a posterior distribution a couple of decades earlier. However, as Bayesian and frequentist statisticians were rarely aware of each other's work, the results by Helmert and Lüroth remained unknown to Gosset and Fisher.

     

    The Cauchy Distribution

    Augustin-Louis Cauchy
    Augustin-Louis Cauchy c.1840
    (Source: Wikipedia)

    The Cauchy distribution is named after Augustin-Louis Cauchy, a French mathematician who specialized in complex analysis and abstract algebra. The distribution is originally (and most helpfully) defined as

    Definition: The Cauchy Distribution

    Let \(Z_1 \sim N(0,\ 1)\), \(Z_2 \sim N(0,\ 1)\), and \(Z_1 \perp Z_2\). Then, the ratio

    \begin{equation}
    Y \stackrel{\text{def}}{=} \frac{Z_1}{Z_2}
    \end{equation}

    follows a standard Cauchy distribution.

    Note that this is just a t distribution with one degree of freedom. As such, the probability density function for the standard Cauchy is

    \begin{equation}
    f(y) = \frac{1}{\pi\left(1+y^2\right)}
    \end{equation}

Although the standard Cauchy is symmetric with median 0, neither its mean nor its variance exists. I leave this proof as an exercise for you.
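
One way to see the missing mean numerically is to watch the running average of Cauchy draws, which never settles down, unlike the Normal. A base-R sketch (the seed and sample size are arbitrary):

set.seed(5)
y <- rcauchy(1e5)                    # standard Cauchy draws
x <- rnorm(1e5)                      # standard Normal draws, for contrast
tail(cumsum(y) / seq_along(y))       # running mean keeps jumping around
tail(cumsum(x) / seq_along(x))       # running mean settles near 0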

    The standard Cauchy can be generalized to different medians (locations) and spreads (scales):

    \begin{equation}
    f(y;\ \eta, \gamma) \stackrel{\text{def}}{=} \frac{1}{\pi \gamma \left( 1+\left(\frac{y-\eta}{\gamma}\right)^2 \right)}
    \end{equation}

Again, neither the mean nor the variance exists; however, the median and mode are now \(\eta\) ("eta"), and the interquartile range is \(2\gamma\) ("gamma").

    Finally, its cumulative distribution function (CDF) is

    \begin{equation}
F(y;\ \eta, \gamma) = \frac{1}{\pi} \arctan\left(\frac{y-\eta}{\gamma}\right)+\frac{1}{2}
    \end{equation}

    A standard Cauchy distribution
    Figure \(\PageIndex{9}\): This is a Cauchy distribution. It is also a t distribution with 1 degree of freedom. Note that, while this distribution is symmetric and bell-shaped, it has neither a variance nor an expected value. This is because the tails of the distribution are especially heavy (large probability of observing events in the tails). Also, note that the Central Limit Theorem does not apply to the Cauchy distribution. Why?

     

    As befits the magical world of mathematics, the Cauchy distribution pops up in many places. For me, the most interesting place is as the Witch of Agnesi.

    I find it very interesting to see how frequently mathematical equations find their way into many different areas of mathematics and science. The key, as I see it, is to understand how the mathematical expressions came into being, then see what applies to the current probability/statistics problem. A trip through history may suggest that there is "no progress in the history of [mathematical] knowledge —merely a continuous and sublime recapitulation" (Umberto Eco, The Name of the Rose, 1980).

    Note

The Cauchy distribution is one of the most helpful distributions available. Its moments (the first and all higher) do not exist. This means the Central Limit Theorem does not apply to it. As such, it is frequently used to illustrate the importance of finite variances.

     

     

    Snedecor's F Distribution

    George W. Snedecor
    George W. Snedecor in 1915
    (Source: Wikipedia)

    The F distribution, developed by George W. Snedecor (Calculation and Interpretation of Analysis of Variance and Covariance, 1934) and named for Ronald Fisher, is frequently used for testing compound hypotheses, notably in the analysis of variance (ANOVA) procedure (see the Ruritanian Crops example).

    Definition: Snedecor's F Distribution

    Let \(V_1 \sim \chi^2(\nu_1)\), \(V_2 \sim \chi^2(\nu_2)\), and \(V_1 \perp V_2\). Define the following ratio:

    \begin{equation}
    F = \frac{V_1/\nu_1}{V_2/\nu_2} = \frac{\nu_2\ V_1}{\nu_1\ V_2}
    \end{equation}

    The random variable F follows Snedecor's F distribution with \(\nu_1\) and \(\nu_2\) degrees of freedom.

    Here are some results:

    • The support set is \(\mathcal{S} = (0, \infty)\).
    • If \(\nu_2>2\), then the mean of the F distribution is
      \begin{equation}E[F] = \frac{\nu_2}{\nu_2-2}\end{equation}
      If \(\nu_2 \le 2\), then the expected value is infinite.
    • When \(\nu_2>4\), its variance is
\begin{equation}V[F] = \frac{2\nu_2^2(\nu_1 +\nu_2-2)}{\nu_1(\nu_2-2)^2(\nu_2-4)}\end{equation}
      If \(\nu_2 \le 4\), then the variance is infinite.
    • The distribution (function) is always right-skewed.
    Snedecor's F distribution
    Figure \(\PageIndex{10}\): An F distribution. Here, \(\nu_1=15 \) and \(\nu_2=2 \).
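
A simulation sketch of the definition in base R: form the ratio of two scaled Chi-squares and compare the sample mean and a quantile to the theoretical values. I use \(\nu_1 = 15\) and \(\nu_2 = 10\) here (arbitrary, with \(\nu_2 > 4\) so that the mean and variance exist):

set.seed(9)
nu1 <- 15; nu2 <- 10
v1 <- rchisq(1e5, df = nu1)
v2 <- rchisq(1e5, df = nu2)
fdraws <- (v1 / nu1) / (v2 / nu2)            # draws built from the definition
mean(fdraws); nu2 / (nu2 - 2)                # compare to E[F]
quantile(fdraws, 0.95); qf(0.95, nu1, nu2)   # compare a quantile to qf()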

     

    If you want it, here is its probability density function. Note, however, that the definition of the F distribution is much more helpful.
    \begin{equation}
    f(y;\ \nu_1,\nu_2) = \frac{1}{y\ B\left( \frac{\nu_1}{2}, \frac{\nu_2}{2} \right)}\ \sqrt{\frac{ {\phantom{I^I}y}^{\nu_1}\ \nu_1^{\nu_1}\ \nu_2^{\nu_2} }{\left(y\ \nu_1 +\nu_2\right)^{\nu_1+\nu_2}}}
    \end{equation}

    In this formula, \(B(\cdot,\cdot)\) is the beta function, defined as
    \begin{equation}
    B(x,y) \stackrel{\text{def}}{=} \int_0^1\ t^{x-1} (1-t)^{y-1}\ \text{d}t
    \end{equation}

    Interesting... There seems to be some relationship between the beta function and the Binomial distribution.

    By the way, if you would prefer writing the beta function in terms of the gamma function,
    \begin{equation}
    B(x,y) = \frac{\Gamma(x)\ \Gamma(y)}{\Gamma(x+y)}
    \end{equation}

    Most likely, you will have seen the gamma function in your calculus class. If not, there is nothing to worry about, since the computer will be performing the calculations for you. Focus on how these distributions arise (i.e., their actual definition) and how they are used.

     

    Making These Graphics

    So, how did I generate these graphics? Of course, I used R. This allowed me to save the script and share it with y'all. The script actually requires no additional packages; it runs on the base R you download from CRAN. 
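
The full script is not reproduced here, but the following is a minimal base-R sketch of the kind of code that produces figures like these (the plotting choices are mine, not necessarily the original script's):

# a Normal pdf, as in Figure 6
curve(dnorm(x), from = -4, to = 4, xlab = "y", ylab = "density",
      main = "A Normal distribution")
# a Chi-square pdf with 5 degrees of freedom, as in Figure 7
curve(dchisq(x, df = 5), from = 0, to = 20, xlab = "y", ylab = "density",
      main = "A Chi-square distribution")
# a Bernoulli pmf with pi = 0.77, as in Figure 2
plot(c(0, 1), c(1 - 0.77, 0.77), type = "h", lwd = 4, xlab = "y",
     ylab = "probability", main = "A Bernoulli distribution")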

    Enjoy!

     

     


    This page titled 22.4: Probability Distributions is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
