11.5: The Multinomial Distribution

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) 
\(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)

\(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\bs}{\boldsymbol}\) \(\newcommand{\var}{\text{var}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\)

Basic Theory

Multinomial trials

A multinomial trials process is a sequence of independent, identically distributed random variables \(\bs{X} =(X_1, X_2, \ldots)\) each taking \(k\) possible values. Thus, the multinomial trials process is a simple generalization of the Bernoulli trials process (which corresponds to \(k = 2\)). For simplicity, we will denote the set of outcomes by \(\{1, 2, \ldots, k\}\), and we will denote the common probability density function of the trial variables by \[ p_i = \P(X_j = i), \quad i \in \{1, 2, \ldots, k\} \] Of course \(p_i \gt 0\) for each \(i\) and \(\sum_{i=1}^k p_i = 1\). In statistical terms, the sequence \(\bs{X}\) is formed by sampling from the distribution.

As with our discussion of the binomial distribution, we are interested in the random variables that count the number of times each outcome occurred. Thus, let \[ Y_i = \#\left\{j \in \{1, 2, \ldots, n\}: X_j = i\right\} = \sum_{j=1}^n \bs{1}(X_j = i), \quad i \in \{1, 2, \ldots, k\} \] Of course, these random variables also depend on the parameter \(n\) (the number of trials), but this parameter is fixed in our discussion so we suppress it to keep the notation simple. Note that \(\sum_{i=1}^k Y_i = n\) so if we know the values of \(k - 1\) of the counting variables, we can find the value of the remaining variable.
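The definition of the counting variables can be illustrated with a short simulation. This is only a sketch: the function name `count_outcomes`, the probability vector, and the seed are illustrative choices, not part of the text.

```python
import random
from collections import Counter

def count_outcomes(n, p, rng):
    """Run n multinomial trials with outcome probabilities p; return [Y_1, ..., Y_k]."""
    k = len(p)
    outcomes = list(range(1, k + 1))
    # Each trial X_j takes the value i with probability p[i - 1], independently.
    trials = rng.choices(outcomes, weights=p, k=n)
    tally = Counter(trials)
    # Counter returns 0 for outcomes that never occurred.
    return [tally[i] for i in outcomes]

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
y = count_outcomes(10, [0.5, 0.3, 0.2], rng)
print(y, sum(y))  # the counts always sum to n = 10
```

Because the counts sum to \(n\), any \(k - 1\) of them determine the last one.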

Basic arguments using independence and combinatorics can be used to derive the joint, marginal, and conditional densities of the counting variables. In particular, recall the definition of the multinomial coefficient: for nonnegative integers \((j_1, j_2, \ldots, j_k)\) with \(\sum_{i=1}^k j_i = n\), \[ \binom{n}{j_1, j_2, \ldots, j_k} = \frac{n!}{j_1! j_2! \cdots j_k!} \]
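As a minimal sketch, the multinomial coefficient can be computed with exact integer arithmetic. The function name `multinomial_coefficient` is a hypothetical helper introduced here for illustration.

```python
from math import factorial

def multinomial_coefficient(counts):
    """Return n! / (j_1! j_2! ... j_k!), where n = sum(counts)."""
    result = factorial(sum(counts))
    for j in counts:
        # Integer division is exact here: every partial quotient is an integer.
        result //= factorial(j)
    return result

print(multinomial_coefficient([2, 1, 1]))  # 4! / (2! 1! 1!)
print(multinomial_coefficient([3, 2]))     # k = 2 gives the binomial coefficient
```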

Joint Distribution

For nonnegative integers \((j_1, j_2, \ldots, j_k)\) with \(\sum_{i=1}^k j_i = n\), \[ \P(Y_1 = j_1, Y_2 = j_2, \ldots, Y_k = j_k) = \binom{n}{j_1, j_2, \ldots, j_k} p_1^{j_1} p_2^{j_2} \cdots p_k^{j_k} \]

Proof

By independence, any sequence of trials in which outcome \(i\) occurs exactly \(j_i\) times for \(i \in \{1, 2, \ldots, k\}\) has probability \(p_1^{j_1} p_2^{j_2} \cdots p_k^{j_k}\). The number of such sequences is the multinomial coefficient \(\binom{n}{j_1, j_2, \ldots, j_k}\). Thus, the result follows from the additive property of probability.

The distribution of \(\bs{Y} = (Y_1, Y_2, \ldots, Y_k)\) is called the multinomial distribution with parameters \(n\) and \(\bs{p} = (p_1, p_2, \ldots, p_k)\). We also say that \( (Y_1, Y_2, \ldots, Y_{k-1}) \) has this distribution (recall that the values of \(k - 1\) of the counting variables determine the value of the remaining variable). Usually, it is clear from context which meaning of the term multinomial distribution is intended. Again, the ordinary binomial distribution corresponds to \(k = 2\).
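The joint density can be evaluated directly from the formula. The following sketch assumes illustrative names (`multinomial_pmf`, `compositions`) and an arbitrary parameter choice; it also checks that the density sums to 1 over all possible count vectors.

```python
from math import factorial, prod

def multinomial_pmf(counts, p):
    """P(Y_1 = j_1, ..., Y_k = j_k) with n = sum(counts) and probabilities p."""
    coef = factorial(sum(counts))
    for j in counts:
        coef //= factorial(j)
    return coef * prod(pi ** j for pi, j in zip(p, counts))

def compositions(n, k):
    """Yield every tuple of k nonnegative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

p = [0.5, 0.3, 0.2]  # illustrative parameter vector
total = sum(multinomial_pmf(y, p) for y in compositions(3, len(p)))
print(multinomial_pmf((1, 1, 1), p), total)
```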

Marginal Distributions

For each \(i \in \{1, 2, \ldots, k\}\), \(Y_i\) has the binomial distribution with parameters \(n\) and \(p_i\): \[ \P(Y_i = j) = \binom{n}{j} p_i^j (1 - p_i)^{n-j}, \quad j \in \{0, 1, \ldots, n\} \]

Proof

There is a simple probabilistic proof. If we think of each trial as resulting in outcome \(i\) or not, then clearly we have a sequence of \(n\) Bernoulli trials with success parameter \(p_i\). Random variable \(Y_i\) is the number of successes in the \(n\) trials. The result could also be obtained by summing the joint probability density function given above over all of the other variables, but this would be much harder.
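The "harder" route can at least be checked numerically for a small case. This sketch sums the joint density over the other variables and compares the result with the binomial density; the parameters \(n = 4\) and \(\bs{p} = (0.5, 0.3, 0.2)\) are arbitrary choices for illustration.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, p):
    """Joint density of the counting variables at the point counts."""
    coef = factorial(sum(counts))
    for j in counts:
        coef //= factorial(j)
    return coef * prod(pi ** j for pi, j in zip(p, counts))

n, p = 4, [0.5, 0.3, 0.2]

# Marginal of Y_1 by brute force: sum the joint density over (Y_2, Y_3).
summed = [
    sum(multinomial_pmf((j, a, n - j - a), p) for a in range(n - j + 1))
    for j in range(n + 1)
]

# Binomial density with parameters n and p_1.
binomial = [comb(n, j) * p[0] ** j * (1 - p[0]) ** (n - j) for j in range(n + 1)]

max_gap = max(abs(s - b) for s, b in zip(summed, binomial))
print(max_gap)  # agreement up to floating-point rounding
```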

Grouping

The multinomial distribution is preserved when the counting variables are combined. Specifically, suppose that \((A_1, A_2, \ldots, A_m)\) is a partition of the index set \(\{1, 2, \ldots, k\}\) into nonempty subsets. For \(j \in \{1, 2, \ldots, m\}\) let \[ Z_j = \sum_{i \in A_j} Y_i, \quad q_j = \sum_{i \in A_j} p_i \]

\(\bs{Z} = (Z_1, Z_2, \ldots, Z_m)\) has the multinomial distribution with parameters \(n\) and \(\bs{q} = (q_1, q_2, \ldots, q_m)\).

Proof

Again, there is a simple probabilistic proof. Each trial, independently of the others, results in an outcome in \(A_j\) with probability \(q_j\). For each \(j\), \(Z_j\) counts the number of trials which result in an outcome in \(A_j\). This result could also be derived from the joint probability density function given above, but again, this would be a much harder proof.
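The grouping property can be verified by enumeration in a small case. The sketch below uses an assumed example with \(k = 4\), \(n = 3\), and the partition \(A_1 = \{1, 2\}\), \(A_2 = \{3, 4\}\); the helper names are illustrative.

```python
from math import factorial, prod

def multinomial_pmf(counts, p):
    """Joint density of the counting variables at the point counts."""
    coef = factorial(sum(counts))
    for j in counts:
        coef //= factorial(j)
    return coef * prod(pi ** j for pi, j in zip(p, counts))

def compositions(n, k):
    """Yield every tuple of k nonnegative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

n = 3
p = [0.1, 0.2, 0.3, 0.4]
groups = [[0, 1], [2, 3]]  # zero-based indices for A_1 = {1, 2}, A_2 = {3, 4}
q = [sum(p[i] for i in g) for g in groups]

# Distribution of Z obtained by summing the joint density of Y over each group.
z_dist = {}
for y in compositions(n, len(p)):
    z = tuple(sum(y[i] for i in g) for g in groups)
    z_dist[z] = z_dist.get(z, 0.0) + multinomial_pmf(y, p)

# Compare with the multinomial density with parameters n and q.
max_gap = max(abs(prob - multinomial_pmf(z, q)) for z, prob in z_dist.items())
print(max_gap)
```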

Conditional Distribution

The multinomial distribution is also preserved when some of the counting variables are observed. Specifically, suppose that \((A, B)\) is a partition of the index set \(\{1, 2, \ldots, k\}\) into nonempty subsets. Suppose that \((j_i : i \in B)\) is a sequence of nonnegative integers, indexed by \(B\), such that \(j = \sum_{i \in B} j_i \le n\). Let \(p = \sum_{i \in A} p_i\).

The conditional distribution of \((Y_i: i \in A)\) given \((Y_i = j_i: i \in B)\) is multinomial with parameters \(n - j\) and \((p_i / p: i \in A)\).

Proof

Again, there is a simple probabilistic argument and a harder analytic argument. If we know \(Y_i = j_i\) for \(i \in B\), then there are \(n - j\) trials remaining, each of which, independently of the others, must result in an outcome in \(A\). The conditional probability of a trial resulting in \(i \in A\) is \(p_i / p\).
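The conditional result can also be checked numerically. This sketch assumes a small example with \(k = 3\), \(n = 4\), \(A = \{1, 2\}\), \(B = \{3\}\), and the observation \(Y_3 = 1\); all names and parameter values are illustrative.

```python
from math import factorial, prod

def multinomial_pmf(counts, p):
    """Joint density of the counting variables at the point counts."""
    coef = factorial(sum(counts))
    for j in counts:
        coef //= factorial(j)
    return coef * prod(pi ** j for pi, j in zip(p, counts))

n = 4
p = [0.5, 0.3, 0.2]
j3 = 1             # observed value of Y_3, so j = 1
p_A = p[0] + p[1]  # p = sum of p_i over A = {1, 2}
m = n - j3         # number of trials left for outcomes in A

# P(Y_3 = j3), found by summing the joint density over (Y_1, Y_2).
prob_B = sum(multinomial_pmf((a, m - a, j3), p) for a in range(m + 1))

# Conditional density of (Y_1, Y_2) given Y_3 = j3, straight from the definition.
conditional = [multinomial_pmf((a, m - a, j3), p) / prob_B for a in range(m + 1)]

# Claimed form: multinomial with parameters n - j and (p_i / p : i in A).
claimed = [
    multinomial_pmf((a, m - a), [p[0] / p_A, p[1] / p_A]) for a in range(m + 1)
]

max_gap = max(abs(c - d) for c, d in zip(conditional, claimed))
print(max_gap)
```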

Combinations of the basic results involving grouping and conditioning can be used to compute any marginal or conditional distributions.

Moments

We will compute the mean and variance of each counting variable, and the covariance and correlation of each pair of variables.

For \(i \in \{1, 2, \ldots, k\}\), the mean and variance of \(Y_i\) are

\(\E(Y_i) = n p_i\)
\(\var(Y_i) = n p_i (1 - p_i)\)

Proof

Recall that \(Y_i\) has the binomial distribution with parameters \(n\) and \(p_i\).

For distinct \(i, \; j \in \{1, 2, \ldots, k\}\),

\(\cov(Y_i, Y_j) = - n p_i p_j\)
\(\cor(Y_i, Y_j) = -\sqrt{p_i p_j \big/ \left[(1 - p_i)(1 - p_j)\right]}\)

Proof

From the bilinearity of the covariance operator, we have \[ \cov(Y_i, Y_j) = \sum_{s=1}^n \sum_{t=1}^n \cov[\bs{1}(X_s = i), \bs{1}(X_t = j)] \] If \(s = t\), the two indicator variables cannot both equal 1, so their product is 0 and their covariance is \(0 - p_i p_j = -p_i p_j\). If \(s \ne t\), the covariance is 0 by independence. Summing the \(n\) nonzero terms gives the covariance. The correlation then follows from the definition of correlation and the variances of \(Y_i\) and \(Y_j\) given above.
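Written out, the step from the covariance to the correlation uses only the variance formula \(\var(Y_i) = n p_i (1 - p_i)\) from the previous result: \[ \cor(Y_i, Y_j) = \frac{\cov(Y_i, Y_j)}{\sqrt{\var(Y_i) \var(Y_j)}} = \frac{-n p_i p_j}{\sqrt{n p_i (1 - p_i) \, n p_j (1 - p_j)}} = -\sqrt{\frac{p_i p_j}{(1 - p_i)(1 - p_j)}} \] The factor \(n\) cancels between the numerator and the denominator.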

From the last result, note that the number of times outcome \(i\) occurs and the number of times outcome \(j\) occurs are negatively correlated, but the correlation does not depend on \(n\).
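As a sanity check, the covariance and correlation can be computed exactly by enumerating the joint density for two different values of \(n\). This is a sketch under assumed parameters (\(\bs{p} = (0.5, 0.3, 0.2)\), \(n = 2\) and \(n = 5\)); the function names are illustrative.

```python
from math import factorial, prod, sqrt

def multinomial_pmf(counts, p):
    """Joint density of the counting variables at the point counts."""
    coef = factorial(sum(counts))
    for j in counts:
        coef //= factorial(j)
    return coef * prod(pi ** j for pi, j in zip(p, counts))

def compositions(n, k):
    """Yield every tuple of k nonnegative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def exact_cov_cor(n, p, i, j):
    """Covariance and correlation of Y_i, Y_j (zero-based indices) by enumeration."""
    m_i = m_j = s_i = s_j = cross = 0.0
    for y in compositions(n, len(p)):
        w = multinomial_pmf(y, p)
        m_i += w * y[i]
        m_j += w * y[j]
        s_i += w * y[i] ** 2
        s_j += w * y[j] ** 2
        cross += w * y[i] * y[j]
    cov = cross - m_i * m_j
    var_i = s_i - m_i ** 2
    var_j = s_j - m_j ** 2
    return cov, cov / sqrt(var_i * var_j)

p = [0.5, 0.3, 0.2]
cov2, cor2 = exact_cov_cor(2, p, 0, 1)
cov5, cor5 = exact_cov_cor(5, p, 0, 1)
print(cov2, cov5)  # the covariance scales with n
print(cor2, cor5)  # the correlation is the same for both values of n
```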

If \(k = 2\), then the number of times outcome 1 occurs and the number of times outcome 2 occurs are perfectly negatively correlated: the correlation is \(-1\).

Proof

This follows immediately from the result above on covariance since we must have \(i = 1\) and \(j = 2\), and \(p_2 = 1 - p_1\). Of course we can also argue this directly since \(Y_2 = n - Y_1\).

Examples and Applications

In the dice experiment, select the number of aces. For each die distribution, start with a single die and add dice one at a time, noting the shape of the probability density function and the size and location of the mean/standard deviation bar. When you get to 10 dice, run the simulation 1000 times and compare the relative frequency function to the probability density function, and the empirical moments to the distribution moments.

Suppose that we throw 10 standard, fair dice. Find the probability of each of the following events:

Scores 1 and 6 occur once each and the other scores occur twice each.
Scores 2 and 4 occur 3 times each.
There are 4 even scores and 6 odd scores.
Scores 1 and 3 occur twice each given that score 2 occurs once and score 5 three times.

Answer

0.00375
0.0178
0.205
0.0879
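The four probabilities can be reproduced with the joint density together with the grouping and conditioning results. The following is a sketch, with `multinomial_pmf` as an illustrative helper; the grouping used in each part is noted in the comments.

```python
from math import factorial, prod

def multinomial_pmf(counts, p):
    """Joint density of the counting variables at the point counts."""
    coef = factorial(sum(counts))
    for j in counts:
        coef //= factorial(j)
    return coef * prod(pi ** j for pi, j in zip(p, counts))

# Part 1: counts for scores 1..6 are (1, 2, 2, 2, 2, 1), each with probability 1/6.
a = multinomial_pmf([1, 2, 2, 2, 2, 1], [1 / 6] * 6)

# Part 2: group the outcomes as {2}, {4}, and everything else (probability 4/6).
b = multinomial_pmf([3, 3, 4], [1 / 6, 1 / 6, 4 / 6])

# Part 3: group the outcomes as even and odd, each with probability 1/2.
c = multinomial_pmf([4, 6], [1 / 2, 1 / 2])

# Part 4: given score 2 once and score 5 three times, 6 trials remain over
# scores {1, 3, 4, 6}, each with conditional probability 1/4; group {4, 6}.
d = multinomial_pmf([2, 2, 2], [1 / 4, 1 / 4, 1 / 2])

for value in (a, b, c, d):
    print(f"{value:.3g}")
```

Each printed value should agree with the corresponding listed answer to three significant figures.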

Suppose that we roll 4 ace-six flat dice (faces 1 and 6 have probability \(\frac{1}{4}\) each; faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\) each). Find the joint probability density function of the number of times each score occurs.

Answer

\(f(u, v, w, x, y, z) = \binom{4}{u, v, w, x, y, z} \left(\frac{1}{4}\right)^{u+z} \left(\frac{1}{8}\right)^{v + w + x + y}\) for nonnegative integers \(u, \, v, \, w, \, x, \, y, \, z\) that sum to 4

In the dice experiment, select 4 ace-six flats. Run the experiment 500 times and compute the joint relative frequency function of the number of times each score occurs. Compare the relative frequency function to the true probability density function.

Suppose that we roll 20 ace-six flat dice. Find the covariance and correlation of the number of 1's and the number of 2's.

Answer

covariance: \(-0.625\); correlation: \(-0.2182\)
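The covariance and correlation formulas from the Moments section can be evaluated directly for this exercise. This is a small sketch with illustrative variable names, using \(p_1 = \frac{1}{4}\) and \(p_2 = \frac{1}{8}\) for the ace-six flat die.

```python
from math import sqrt

n = 20
p1, p2 = 1 / 4, 1 / 8  # probabilities of scores 1 and 2 on an ace-six flat die

# cov(Y_1, Y_2) = -n p_1 p_2
covariance = -n * p1 * p2

# cor(Y_1, Y_2) = -sqrt(p_1 p_2 / [(1 - p_1)(1 - p_2)]), which does not depend on n
correlation = -sqrt(p1 * p2 / ((1 - p1) * (1 - p2)))

print(round(covariance, 4), round(correlation, 4))
```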

In the dice experiment, select 20 ace-six flat dice. Run the experiment 500 times, updating after each run. Compute the empirical covariance and correlation of the number of 1's and the number of 2's. Compare the results with the theoretical results computed previously.
