
12.1: The Likelihood


    From a theoretical standpoint, the likelihood is just a generalization of probability. Where probability is bounded by both 0 and 1, the likelihood is only bounded below by 0. Values with higher probabilities are more likely to be observed. The same is true of likelihood: Values with higher likelihoods are more likely to be observed.

    In the discrete case, the likelihood and the probability (mass) are the same. In the continuous case, the likelihood is the probability density. In other words, you really have come across the likelihood before. In your previous statistics course, the likelihood was called the "probability density" for continuous random variables and the "probability mass" for discrete random variables.

    The Probability and the Likelihood

    The difference between the likelihood and the probability mass or density is only one of emphasis. The probability mass (or density) is a function of the observable values, given the parameters of the distribution. The likelihood is a function of the parameters, given the observed values (the data). That difference is illustrated in the next two examples.

    Example \(\PageIndex{1a}\): Probability and the Binomial

    Given that the success probability of a binomial random variable is \(\pi = 0.25\), what is the probability of observing exactly one success out of two trials?

    Solution.
    The probability mass function of the binomial distribution is

    \begin{equation}
    f(x;\ \pi,n) = \binom{n}{x}\ \pi^x\ (1-\pi)^{n-x}
    \end{equation}

    In this particular instance, the probability mass function is
    \begin{equation}
    f(x;\ \pi=0.25,n=2) = \binom{2}{x}\ 0.25^x\ (1-0.25)^{2-x}
    \end{equation}

    And now calculating the probability gives
    \begin{align}
    f(1;\ \pi=0.25,n=2) &= \binom{2}{1}\ 0.25^1\ (1-0.25)^{2-1} \\
    &= 2\ (0.25)^1\ (0.75)^{1} \\
    &= 2\ (0.25)\ (0.75) \\
    &= 0.375
    \end{align}

    Thus, the probability that I observe exactly one success in two trials, given that the success probability is 0.25, is 0.375, which is a probability of 3 in 8.

    \(\blacksquare\)
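    If you would like to verify this arithmetic on a computer, here is a minimal Python sketch. The helper name binom_pmf is my own choice for this illustration; it simply codes the probability mass function above.

    from math import comb

    def binom_pmf(x, n, pi):
        """Binomial probability mass: f(x; pi, n) = C(n, x) pi^x (1 - pi)^(n - x)."""
        return comb(n, x) * pi**x * (1 - pi)**(n - x)

    # Probability of exactly one success in two trials when pi = 0.25
    print(binom_pmf(1, 2, 0.25))  # 0.375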

    Example \(\PageIndex{1b}\): Likelihood and the Binomial

    Given that I observed exactly one success in two trials, what is the likelihood that the success probability is \(\pi = 0.25\)?

    Notice that this question is very similar to the previous. The difference is subtle. The previous question asked about the probability of an observation. This one asks about the likelihood of the parameter.

    Solution.
    The likelihood for a binomial random variable is
    \begin{equation}
    f(\pi;\ x, n) = \binom{n}{x}\ \pi^x\ (1-\pi)^{n-x}
    \end{equation}

    In this particular instance, the likelihood is
    \begin{equation}
    f(\pi;\ x=1,n=2) = \binom{2}{1}\ \pi^1\ (1-\pi)^{2-1}
    \end{equation}

    Thus, the value of the likelihood for \(\pi=0.25\) is
    \begin{align}
    f(0.25;\ x=1,n=2) &= \binom{2}{1}\ 0.25^1\ (1-0.25)^{2-1} \\[1ex]
    &= 2\ (0.25)^1\ (0.75)^{1} \\[1ex]
    &= 2\ (0.25)\ (0.75) \\[1ex]
    &= 0.375
    \end{align}

    Thus, we have calculated that the likelihood of \(\pi = 0.25\) is 0.375. Is this a lot? It actually depends heavily on the number of data points. In general, the larger your sample size, the smaller the likelihood (of observing that particular data). Thus, the likelihood can only meaningfully be interpreted in relation to other likelihoods based on the same data.

    \(\blacksquare\)
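    The same arithmetic can be checked with a short sketch that holds the data fixed and lets \(\pi\) vary. Again, the function name is only an illustrative choice.

    from math import comb

    # Same formula as before, but now the data (x = 1, n = 2) are held fixed
    # and the parameter pi is what varies.
    def binom_lik(pi, x=1, n=2):
        return comb(n, x) * pi**x * (1 - pi)**(n - x)

    for pi in (0.10, 0.25, 0.40, 0.50):
        print(pi, binom_lik(pi))
    # 0.10 -> 0.18, 0.25 -> 0.375, 0.40 -> 0.48, 0.50 -> 0.50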

    This last part deserves to be repeated.

    Why? Because we will come across statistics like the AIC and the BIC that can be used for model selection. Since both depend on the likelihood, they can only be compared across models fit to the same data, that is, the same values of the dependent variable.

    Caution

    The likelihood can only meaningfully be interpreted in relation to other likelihoods based on the same data.

    The Differences?

    The probability and the likelihood are numerically the same. The use and interpretation, however, are different. With probability, we are looking at a function of the possible outcomes. With likelihood, we are looking at a function of the possible values of the parameters. Thus, in the first case, we could ask which value of \(x\) is most probable. In the second case, we would ask which value of the parameter \(\pi\) is most likely.

    Maximizing the Likelihood

    Likelihood cries out to be maximized. Is \(\pi = 0.25\) the maximum likelihood estimate in the previous example? No. Calculate the likelihood of \(\pi=0.40\) to see that \(0.25\) is not the maximum (the value of \(\pi\) that produces the largest likelihood value). If you calculated \(f(0.40;\ x=1,n=2) = 0.48\), then you did the calculations correctly.

    Note that \(f(0.40;\ x=1,n=2) > f(0.25;\ x=1,n=2)\). Thus, \(\pi=0.25\) is not the maximum likelihood estimate of \(\pi\) in this case (for this data). What is? Such optimization requires using calculus. From the above, you should be able to see that the objective function is

    \begin{align}
    Q(\pi) &= \binom{2}{1}\ \pi^1\ (1-\pi)^{2-1} = 2\ \pi\ (1-\pi)
    \end{align}


    This is a function of the parameter, since we are trying to determine the value of \(\pi\) that is most likely, given the data. The optimization proceeds as expected:

    \begin{align}
    \frac{\text{d}}{\text{d}\pi}\ Q(\pi) &= 2(1-2\pi) \\[1em]
    0 & \stackrel{\text{set}}{=} 2(1-2\hat{\pi}) \\[1em]
    0 &= 1-2\hat{\pi} \\[1em]
    1 &= 2\hat{\pi} \\[1em]
    \frac{1}{2} &= \hat{\pi}
    \end{align}

    Thus, given that we observed 1 success in 2 trials (the data), the maximum likelihood estimate of \(\pi\) is \(\hat{\pi} = 0.500\).

    For some reason, I am not surprised at this outcome. Are you?
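    If you prefer to let a computer do the optimization, here is a minimal sketch using scipy.optimize. The bounds and the "bounded" method are my choices for this illustration, not part of the derivation above.

    from scipy.optimize import minimize_scalar

    # Maximize Q(pi) = 2 pi (1 - pi) by minimizing its negative over [0, 1].
    result = minimize_scalar(lambda pi: -2 * pi * (1 - pi),
                             bounds=(0.0, 1.0), method="bounded")

    print(result.x)      # approximately 0.5, matching the calculus above
    print(-result.fun)   # the maximum likelihood value, approximately 0.5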

    Figure \(\PageIndex{1}\): A graphic showing how the likelihood varies as the parameter changes. Unsurprisingly, the maximum likelihood occurs at \(\hat{\pi} = 0.500\).

    Figure \(\PageIndex{1}\) shows how the value of the likelihood changes as the parameter value changes. It can be shown that the likelihood for this problem ranges from 0 to 0.5.

    In general, one can show that the maximum likelihood estimator of \(\pi\) is \(\hat{\pi} = x/n\), where \(x\) is the number of successes and \(n\) is the number of trials. I will leave that as an exercise.

    To prove this, you would perform the same steps, but leave the \(x\) and \(n\) in the likelihood. If you do the calculations and the calculus correctly, you will end up with

    \begin{equation}
    \hat{\pi} = \frac{x}{n}
    \end{equation}
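    The following is not a proof, only a quick numerical sanity check that the likelihood does peak near \(x/n\) for a few choices of \(x\) and \(n\). It is a sketch assuming numpy is available; the grid resolution is arbitrary.

    import numpy as np
    from math import comb

    # A (non-rigorous) numerical check that the likelihood peaks near x/n.
    pis = np.linspace(0.001, 0.999, 9999)

    for x, n in [(1, 2), (3, 10), (7, 20)]:
        lik = comb(n, x) * pis**x * (1 - pis)**(n - x)
        print(x, n, pis[np.argmax(lik)], x / n)  # grid argmax is close to x/n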

    The Poisson Examples

    A second distribution that you probably saw in your previous statistics class is the Poisson distribution. It has just one parameter, \(\lambda\), the average rate. In this example, we will determine the maximum likelihood estimator of \(\lambda\).

    Example \(\PageIndex{2a}\): Valné Shromáždění

    Let \(Y\) be the number of Ruritanians walking through the door of the Valné Shromáždění, the general assembly building of Ruritania. The King would like to estimate the average number of people entering between noon and 1pm. To do this, the King had his Secretary of the Interior count the number of people entering the building during that hour on Monday.

    On Monday, the Secretary counted \(y=17\). With this information, calculate the estimate of \(\lambda\) using maximum likelihood estimation.

    Solution.
    The likelihood for a discrete distribution, like the Poisson, is just the probability mass function:
    \begin{equation}
    \mathcal{L}(\lambda;\ y) = \frac{e^{-\lambda}\ \lambda^y}{y!}
    \end{equation}

    That is the likelihood contribution of a single observation. Here, we only took one measurement (on Monday), so this is also the entire likelihood.

    The next step is to maximize the likelihood with respect to the parameter, \(\lambda\):
    \begin{align}
    \frac{d}{d\lambda} \mathcal{L}(\lambda;\ y=17) &= \frac{d}{d\lambda} \left( \frac{e^{-\lambda}\ \lambda^{17}}{17!} \right) \\[2em]
    &= \frac{17\ e^{-\lambda}\ \lambda^{16} - e^{-\lambda}\ \lambda^{17}}{17!} \\[1em]
    0 &\stackrel{\text{set}}{=} 17\ e^{-\hat{\lambda}}\ \hat{\lambda}^{16} - e^{-\hat{\lambda}}\ \hat{\lambda}^{17} \\[1em]
    0 &= 17\ \hat{\lambda}^{16} - \hat{\lambda}^{17} \\[1em]
    0 &= 17 - \hat{\lambda}
    \end{align}

    (To move between the last three lines, we multiplied both sides by \(17!\) and then divided by the strictly positive quantities \(e^{-\hat{\lambda}}\) and \(\hat{\lambda}^{16}\).)

    Thus, the maximum likelihood estimate of \(\lambda\) is \(\hat{\lambda} = 17\).

    \(\blacksquare\)

    And so, we report to His Majesty that our estimate of the average number of people passing through the doors of the Valné Shromáždění is 17 per hour.
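    If you would like to check this numerically, here is a minimal sketch assuming numpy and scipy are available. The grid of candidate \(\lambda\) values is my own choice.

    import numpy as np
    from scipy.stats import poisson

    # Evaluate the likelihood of lambda given the single count y = 17,
    # then locate its maximum on a fine grid.
    lams = np.linspace(0.1, 40.0, 4000)
    lik = poisson.pmf(17, lams)

    print(lams[np.argmax(lik)])  # approximately 17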

    By the way, a graphic of the likelihood function for varying values of \(\lambda\) is given in Figure \(\PageIndex{2}\). Note that the function achieves its maximum when \(\lambda = 17\). Thus, the MLE for \(\lambda\) is \(\hat{\lambda} = 17\).

    Figure \(\PageIndex{2}\): A graphic showing how the likelihood varies as the parameter changes. This is based on the information in Example \(\PageIndex{2a}\).
    Example \(\PageIndex{2b}\): Valné Shromáždění, Again

    His Majesty liked the report, especially the font (he likes serifs). However, he asked an excellent question: "Bylo by lepší měřit více než jednou?" ("Would it be better to measure more than once?")

    To address his point, the Secretary of the Interior decided to take multiple measurements over several days. So, for the next week, he measured the number of people entering the Valné Shromáždění an hour at a time, randomly selecting the time of day each time. Here is that data: \(15, 20, 23, 34, 23\).

    With that new data, what is the maximum likelihood estimator of \(\lambda\), given these \(n=5\) measurements?

    Solution.
    From the previous example, we know that the likelihood of a single observation is
    \begin{equation}
    \mathcal{L}(\lambda;\ y) = \frac{e^{-\lambda}\ \lambda^y}{y!}
    \end{equation}

    Thus, the likelihood of \(n\) independent observations is
    \begin{equation}
    \mathcal{L}(\lambda;\ y,n) = \prod_{i=1}^n\ \frac{e^{-\lambda}\ \lambda^{y_i}}{y_i!}
    \end{equation}

    How do we know this? Remember from your introductory statistics class the product rule for independent events.

    Lemma \(\PageIndex{1}\): Independent Events

    Let \(A\) and \(B\) be two independent events. The probability of both events happening is the product of the individual probabilities. That is,

    \begin{equation}
    P[A \cap B] = P[A] \cdot P[B]
    \end{equation}

    Note

    This lemma can easily be extended to any finite number of events. The requirement is that the events are (mutually) independent. The result is that the probability of all of them occurring is the product of the probability of each occurring.

    Next, since there is a product involved, it will usually be easier to maximize the logarithm of the likelihood. Because the logarithm is a strictly increasing function, the log-likelihood attains its maximum at the same parameter value as the likelihood itself:

    \begin{equation}
    \ell(\lambda;\ y,n) = \sum_{i=1}^n\left( -\lambda + y_i\ \log \lambda - \log y_i! \right)
    \end{equation}

    And so, we maximize this target function with respect to \(\lambda\) to obtain our estimator:

    \begin{align}
    \frac{ \text{d} }{ \text{d} \lambda} \ell(\lambda;\ y,n) &= \frac{ \text{d} }{ \text{d} \lambda} \sum_{i=1}^n\ \left( -\lambda + y_i\ \log \lambda - \log y_i! \right) \\[1em]
    &= \sum_{i=1}^n\ -1 + \sum_{i=1}^n \frac{y_i}{\lambda} \\[1em]
    &= -n + \frac{n\bar{y}}{\lambda} \\[1em]
    0 &\stackrel{\text{set}}{=} -n + \frac{n\bar{y}}{\hat{\lambda}} \\[1em]
    n &= \frac{n\bar{y}}{\hat{\lambda}}
    \end{align}

    Thus, with multiple measurements, the maximum likelihood estimator of \(\lambda\) is
    \begin{equation}
    \hat{\lambda} = \bar{y}
    \end{equation}

    For the Secretary's five measurements, \(\bar{y} = (15 + 20 + 23 + 34 + 23)/5 = 23\), so the estimate reported to the King is \(\hat{\lambda} = 23\) people per hour.

    Before moving on, think about the result to ensure that it makes sense. This is always an important step!

    \(\blacksquare\)
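    Here is a minimal numerical check of this result using the Secretary's counts. It is a sketch assuming numpy; the grid of candidate \(\lambda\) values is my choice.

    import numpy as np

    y = np.array([15, 20, 23, 34, 23])   # the Secretary's five counts

    # Poisson log-likelihood, dropping the log(y_i!) term (it does not involve lambda).
    lams = np.linspace(1.0, 50.0, 4901)
    loglik = np.array([np.sum(-lam + y * np.log(lam)) for lam in lams])

    print(lams[np.argmax(loglik)])  # approximately 23
    print(y.mean())                 # 23.0, the sample mean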

    Caution

    At the end of every result, you should think about its consequences. Make sure the results make sense. If they do not, then double-check your work or see the world in a more subtle light.

    The Exponential Example

    Another important distribution is the Exponential distribution. It is used to model the time until some event occurs. Actuaries may use it to model (estimate) the time until a person dies or gets into an automobile accident or gets sued or some other wonderful event.

    It has a single parameter, \(\lambda\), which is the rate. If you are having déjà vu again, do not worry. There is an intimate connection between the Poisson and Exponential distributions. If the times between arrivals are independent and follow an exponential distribution with rate \(\lambda\), then the number of arrivals in a unit of time follows a Poisson distribution with mean \(\lambda\). This means that the average time between arrivals is \(1/\lambda\). Double-check that this actually makes sense.

    The following example deals with this distribution.

    Example \(\PageIndex{3}\): Ruritanian Lifetimes

    His Majesty has some additional work for us. He would like to estimate the average lifetime of Ruritanians.

    Let us use maximum likelihood estimation to provide an estimator for \(\lambda\), the average rate of a person dying (NOT the average time until death).

    Solution.
    The probability density function for the exponential distribution, when parameterized on its rate, is
    \begin{equation}
    f(x;\ \lambda) = \lambda\ e^{-\lambda x}
    \end{equation}

    Thus, the likelihood function for a single observation is
    \begin{equation}
    \mathcal{L}(\lambda;\ x) = \lambda\ e^{-\lambda x}
    \end{equation}

    And, the likelihood function for \(n\) independent observations is
    \begin{equation}
    \mathcal{L}(\lambda;\ x,n) = \prod_{i=1}^n\ \lambda\ e^{-\lambda x_i}
    \end{equation}

    As this is a product, the log-likelihood will be easier to differentiate. It is
    \begin{equation}
    \ell(\lambda;\ x,n) = \sum_{i=1}^n\ \left( \log\lambda - \lambda x_i \right)
    \end{equation}

    Now, we maximize it.
    \begin{align}
    \frac{\text{d}}{\text{d}\lambda} \ell(\lambda;\ x,n) &= \frac{\text{d}}{\text{d}\lambda} \sum_{i=1}^n\ \left( \log\lambda - \lambda x_i \right) \\[1em]
    &= \sum_{i=1}^n\ \frac{1}{\lambda} - \sum_{i=1}^n\ x_i \\[1em]
    &= \frac{n}{\lambda} - n\bar{x} \\[1em]
    0 &\stackrel{\text{set}}{=} \frac{n}{\hat{\lambda}} - n\bar{x} \\[1em]
    0 &= \frac{1}{\hat{\lambda}} - \bar{x} \\[1em]
    \hat{\lambda} &= \frac{\ 1\ }{\bar{x}}
    \end{align}

    \(\blacksquare\)
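    The example gives no actual Ruritanian lifetimes, so the following sketch uses made-up numbers purely to illustrate that the numerical maximizer agrees with \(1/\bar{x}\).

    import numpy as np

    # Hypothetical lifetimes in years -- made up for illustration, not real data.
    x = np.array([61.0, 74.5, 83.2, 55.9, 90.1])

    # Exponential log-likelihood: sum of log(lambda) - lambda * x_i.
    lams = np.linspace(0.001, 0.100, 10000)
    loglik = np.array([np.sum(np.log(lam) - lam * x) for lam in lams])

    print(lams[np.argmax(loglik)])  # numerical maximizer
    print(1 / x.mean())             # closed-form MLE: 1 / x-bar, about 0.0137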

    From this, it can be shown that the maximum likelihood estimator of the mean of an exponential distribution is
    \begin{equation}
    \hat{\mu} = \bar{x}
    \end{equation}

    All it takes is knowing that the expected value of an exponential distribution is \(\mu = 1/\lambda\).

    ───── ⋆⋅☆⋅⋆ ─────

    Since the original question dealt with the average age, we would want to calculate \(\hat{\mu}\), not \(\hat{\lambda}\). I leave it as an exercise to show that a maximum likelihood estimator of \(\mu\) for the following parameterization of the exponential distribution
    \begin{equation}
    f(x; \mu) = \frac{1}{\mu}\ e^{-x/\mu}
    \end{equation}

    is \(\hat{\mu} = \bar{x}\).

    Note

    It should be noted that the maximum likelihood estimator is awesome in that functions "pass through." In other words, it can be shown that

    \begin{equation}
    \widehat{g(\theta)}_{\text{MLE}} = g\left(\hat{\theta}_{\text{MLE}}\right)
    \end{equation}

    In words, the maximum likelihood estimator of a function of a parameter is that function of the maximum likelihood estimator of the parameter.

    The proof is straightforward, but beyond the scope of this textbook.
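    As a small illustration of this "pass-through" property, using the same made-up lifetimes as in the sketch above:

    import numpy as np

    # The same hypothetical lifetimes used in the exponential sketch above.
    x = np.array([61.0, 74.5, 83.2, 55.9, 90.1])

    lam_hat = 1 / x.mean()   # MLE of the rate: lambda-hat = 1 / x-bar
    mu_hat = 1 / lam_hat     # invariance: the MLE of mu = 1/lambda is 1 / lambda-hat

    print(mu_hat)            # 72.94
    print(x.mean())          # 72.94 -- identical, as the invariance property promises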

    This is as good a time as any to mention two "drawbacks" to using maximum likelihood to estimate parameters. The first is that there is no guarantee that the estimator is unique. The second is that there is no guarantee that the estimator is unbiased. While these seem bad, there is a nifty theorem that states the MLE is asymptotically unbiased; that is, as the sample size increases, its bias goes to zero.

    Figure \(\PageIndex{3}\): A graphic showing how the likelihood varies as the parameter changes.

    Figure \(\PageIndex{3}\) shows the likelihood graph for \(\lambda\) of an Exponential distribution. From this graphic, you should be able to estimate the value of \(\mu\), the average age of death in Ruritania.

    Caution

    In a future course, you may be dealing with maximum likelihood estimators frequently. Note that the graphic above tells a story beyond the estimate itself. It also gives insight into how precise the estimate is. The flatter the likelihood is around the estimate, the greater the uncertainty.

    To see this, note that you probably had more difficulty estimating the maximum in Figure \(\PageIndex{3}\) because of the curve's flatness. The likelihood in Figure \(\PageIndex{2}\) is sharper, so it is much easier to determine the maximum value. If you want to explore this, please check out the Fisher Information, which is defined as

    \begin{equation}
    \mathcal{I}(\theta) = E\left[\left. \left(\frac{\partial}{\partial \theta} \log f(x;\theta)\right)^2\ \right|\ \theta \right]
    \end{equation}
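    As a rough illustration of the connection between curvature and precision, the following sketch computes the observed information, the negative second derivative of the log-likelihood at the MLE, which is a sample-based cousin of the Fisher Information. For the Poisson counts of Example \(\PageIndex{2b}\), differentiating the log-likelihood above twice gives \(\sum_i y_i / \hat{\lambda}^2\); that formula, and the interpretation of "larger means sharper," are the only assumptions here.

    import numpy as np

    y = np.array([15, 20, 23, 34, 23])   # the Poisson counts from Example 2b
    lam_hat = y.mean()                   # MLE of lambda

    # Observed information: the negative second derivative of the Poisson
    # log-likelihood at the MLE, which works out to sum(y_i) / lambda-hat^2.
    obs_info = y.sum() / lam_hat**2

    print(obs_info)  # about 0.217; larger values mean a sharper likelihood peak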


    This page titled 12.1: The Likelihood is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
