# 1.2: Probability Measures

Now we are ready to formally define probability.

### Definition 1.2.1

A *probability measure* on the sample space \(\Omega\) is a function, denoted \(P\), from subsets of \(\Omega\) to the real numbers \(\mathbb{R}\), such that the following hold:

- \(P(\Omega) = 1\)
- If \(A\) is any event in \(\Omega\), then \(P(A) \geq 0\).
- If events \(A_1\) and \(A_2\) are disjoint, then \(P(A_1\cup A_2) = P(A_1) + P(A_2)\).

More generally, if \(A_1, A_2, \ldots, A_n, \ldots\) is a sequence of *pairwise disjoint* events, i.e., \(A_i\cap A_j = \varnothing\) for every \(i \neq j\), then

$$P(A_1\cup A_2\cup \cdots \cup A_n \cup\cdots) = P(A_1) + P(A_2) + \cdots + P(A_n) + \cdots.\notag$$
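To make the three axioms concrete, here is a minimal Python sketch that checks them by brute force on a small finite sample space. The three-outcome space, its weights, and the helper names `make_measure`, `all_events`, and `satisfies_axioms` are all hypothetical and chosen only for illustration. On a finite sample space, additivity for two disjoint events is enough, since the general version follows by applying it repeatedly.

```python
from fractions import Fraction
from itertools import chain, combinations


def make_measure(weights):
    """Return a function P with P(event) = sum of the weights of its outcomes."""
    def P(event):
        return sum((weights[w] for w in event), Fraction(0))
    return P


def all_events(omega):
    """List every subset of a finite sample space as a frozenset."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def satisfies_axioms(omega, P):
    """Check the three axioms over all events of a finite sample space."""
    events = all_events(omega)
    total_is_one = P(omega) == 1                      # first axiom
    nonnegative = all(P(A) >= 0 for A in events)      # second axiom
    additive = all(                                   # third axiom
        P(A | B) == P(A) + P(B)
        for A in events
        for B in events
        if not (A & B)                                # only disjoint pairs
    )
    return total_is_one and nonnegative and additive


# Hypothetical sample space with three outcomes of unequal probability.
omega = frozenset({"a", "b", "c"})
P = make_measure({"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)})
axioms_hold = satisfies_axioms(omega, P)

# Weights that do not sum to 1 violate the first axiom.
bad_P = make_measure({"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(1, 2)})
bad_axioms_hold = satisfies_axioms(omega, bad_P)

print(axioms_hold, bad_axioms_hold)
```

Exact fractions are used instead of floating-point numbers so that the equality checks are not affected by rounding.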

So essentially, we are defining probability to be an **operation** on the events of a sample space, which assigns numbers to events in such a way that the three properties stated in Definition 1.2.1 are satisfied.

Definition 1.2.1 is often referred to as the **axiomatic definition of probability**, where the three properties give the three **axioms** of probability. These three axioms are all we need to assume about the operation of probability in order for many other desirable properties of probability to hold, which we now state.

### Properties of Probability Measures

Let \(\Omega\) be a sample space with probability measure \(P\). Also, let \(A\) and \(B\) be any events in \(\Omega\). Then the following hold.

- \(P(A^c) = 1 - P(A)\)
- \(P(\varnothing) = 0\)
- If \(A \subseteq B\), then \(P(A) \leq P(B)\).
- \(P(A)\leq 1\)
- **Addition Law:** \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
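Before proving these properties, it can help to see them hold on a specific case. The following sketch checks all five on a hypothetical fair six-sided die, which is an assumption made for illustration and not an example from this section. The events `A`, `B`, and `C` are likewise arbitrary choices.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
omega = frozenset(range(1, 7))


def P(event):
    """Equally likely outcomes: P(event) = |event| / |omega|."""
    return Fraction(len(event), len(omega))


A = frozenset({2, 4, 6})         # the roll is even
B = frozenset({2, 3, 4, 5, 6})   # the roll is at least 2, so A is a subset of B
C = frozenset({1, 2, 3})         # the roll is at most 3

complement_rule = P(omega - A) == 1 - P(A)
empty_set_rule = P(frozenset()) == 0
monotonicity = A <= B and P(A) <= P(B)
bounded_by_one = all(P(E) <= 1 for E in (A, B, C, omega))
addition_law = P(A | C) == P(A) + P(C) - P(A & C)

print(complement_rule, empty_set_rule, monotonicity, bounded_by_one, addition_law)
```

A check like this only confirms the properties for one measure and a few events. It does not replace the general proofs, which rely on the axioms alone.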

### Exercise 1.2.1

Can you prove the five properties of probability measures stated above using only the three axioms of probability measures stated in Definition 1.2.1?

**Answer**

(1) For the first property, note that by definition of the complement of an event \(A\) we have

$$A\cup A^c = \Omega \quad\text{and}\quad A\cap A^c = \varnothing.\notag$$

In other words, given any event \(A\), we can represent the sample space \(\Omega\) as a disjoint union of \(A\) with its complement. Thus, by the first and third axioms, we derive the first property:

$$1 = P(\Omega) = P(A\cup A^c) = P(A) + P(A^c)\notag$$

$$\Rightarrow P(A^c) = 1 - P(A)\notag$$

(2) For the second property, note that we can write \(\Omega = \Omega\cup\varnothing\), and that this is a disjoint union, since anything intersected with the empty set will necessarily be empty. So, using the first and third axioms, we derive the second property:

$$1 = P(\Omega) = P(\Omega\cup\varnothing) = P(\Omega) + P(\varnothing) = 1 + P(\varnothing)\notag$$

$$\Rightarrow P(\varnothing) = 0\notag$$

(3) For the third property, suppose \(A \subseteq B\). Then we can write \(B = A\cup(B\cap A^c)\), and this is a disjoint union, since \(A\) and \(A^c\) are disjoint. By the third axiom, we have

$$P(B) = P(A\cup(B\cap A^c)) = P(A) + P(B\cap A^c). \label{disjoint}$$

By the second axiom, we know that \(P(B\cap A^c) \geq 0\). Thus, if we remove it from the right-hand side of Equation \ref{disjoint}, we are left with something smaller, which proves the third property:

$$P(B) = P(A) + P(B\cap A^c) \geq P(A) \quad\Rightarrow\quad P(B) \geq P(A)\notag$$

(4) For the fourth property, we will use the third property that we just proved. By definition, any event \(A\) is a subset of the sample space \(\Omega\), i.e., \(A\subseteq \Omega\). Thus, by the third property and the first axiom, we derive the fourth property:

$$P(A) \leq P(\Omega) = 1 \quad\Rightarrow\quad P(A) \leq 1\notag$$

(5) For the fifth property, note that we can write the union of events \(A\) and \(B\) as the union of the following two disjoint events:

$$A\cup B = A\cup (A^c\cap B),\notag$$

in other words, the union of \(A\) and \(B\) is given by the union of all the outcomes in \(A\) with all the outcomes in \(B\) that are *not* in \(A\). Furthermore, note that event \(B\) can be written as the union of the following two disjoint events:

$$B = (A\cap B) \cup (A^c\cap B),\notag$$

in other words, \(B\) is written as the disjoint union of all the outcomes in \(B\) that are also in \(A\) with the outcomes in \(B\) that are *not* in \(A\). Applying the third axiom to this expression for \(B\) gives an expression for \(P(A^c\cap B)\), which we substitute into the expression for \(P(A\cup B)\), also obtained from the third axiom, in order to derive the fifth property:

$$ P(B) = P(A\cap B) + P(A^c\cap B) \Rightarrow P(A^c\cap B) = P(B) - P(A\cap B) \notag$$

$$P(A\cup B) = P(A) + P(A^c\cap B) \Rightarrow P(A\cup B) = P(A) + P(B) - P(A\cap B)\notag$$

Note that the axiomatic definition (Definition 1.2.1) does not tell us how to *compute* probabilities. It simply defines a formal, mathematical behavior of probability. In other words, the axiomatic definition describes how probability should theoretically *behave* when applied to events. To compute probabilities, we use the properties stated above, as the next example demonstrates.

### Example 1.2.1

Continuing in the context of Example 1.1.5, let's define a probability measure on \(\Omega\). Assuming that the coin we toss is *fair*, the outcomes in \(\Omega\) are **equally likely**, meaning that each outcome has the *same probability* of occurring. There are four outcomes, and their probabilities must add up to the probability of the sample space (third axiom of probability in Definition 1.2.1), which is 1 (first axiom). It follows that the probability of each outcome is \(\frac{1}{4} = 0.25\).

So, we can write

$$P(hh) = P(ht) = P(th) = P(tt) = 0.25.\notag$$

The reader can verify this defines a probability measure satisfying the three axioms.

With this probability measure on the outcomes we can now compute the probability of any event in \(\Omega\) by simply *counting* the number of outcomes in the event and dividing by four, the total number of outcomes. Thus, we find the probability of events \(A\) and \(B\) previously defined:

$$P(A) = P(\{hh, ht, th\}) = \frac{3}{4} = 0.75\notag$$

$$P(B) = P(\{ht, th\}) = \frac{2}{4} = 0.50.\notag$$
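The counting computation above can be sketched in a few lines of Python. The sample space is built from the four outcomes \(hh, ht, th, tt\) and the events \(A\) and \(B\) are copied from the displayed equations. The function name `P` and the variable names are our own choices. As a further check, the sketch also applies the complement rule and the Addition Law to these events.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a coin: hh, ht, th, tt.
omega = {"".join(tosses) for tosses in product("ht", repeat=2)}


def P(event):
    """Equally likely outcomes: count the outcomes in the event, divide by |omega|."""
    return Fraction(len(event & omega), len(omega))


A = {"hh", "ht", "th"}
B = {"ht", "th"}

p_A = P(A)
p_B = P(B)

# Complement rule: the only outcome not in A is tt.
p_A_complement = 1 - P(A)

# Addition Law: here B is a subset of A, so the union is A itself.
p_union = P(A) + P(B) - P(A & B)

print(p_A, p_B, p_A_complement, p_union)
```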

We consider the case of equally likely outcomes further in the next section: Section 1.3.

There is another, more empirical, approach to defining probability, given by using *relative frequencies* and a version of the Law of Large Numbers.

### Relative Frequency Approximation

To *estimate* the probability of an event \(A\), repeat the random experiment several times (each repetition is called a *trial*) and count the number of times \(A\) occurred, i.e., the number of times the resulting outcome is in \(A\). Then, we approximate the probability of \(A\) using **relative frequency**:

$$P(A) \approx \frac{\text{number of times}\ A\ \text{occurred}}{\text{number of trials}}.\notag$$

### Law of Large Numbers

As the number of trials increases, the relative frequency approximation approaches the theoretical value of \(P(A)\).
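A short simulation illustrates this behavior. The sketch below assumes the two-toss experiment with a fair coin and the event \(A = \{hh, ht, th\}\), whose theoretical probability is \(0.75\). The helper `relative_frequency`, the trial counts, and the fixed seed are illustrative choices, and the estimates will vary with a different seed.

```python
import random


def relative_frequency(event, num_trials, rng):
    """Estimate P(event) as (number of times event occurred) / (number of trials)."""
    hits = 0
    for _ in range(num_trials):
        # One trial: toss a fair coin twice and record the outcome, e.g. "ht".
        outcome = rng.choice("ht") + rng.choice("ht")
        if outcome in event:
            hits += 1
    return hits / num_trials


A = {"hh", "ht", "th"}
rng = random.Random(0)  # fixed seed so that a run can be repeated

estimates = {
    n: relative_frequency(A, n, rng)
    for n in (10, 100, 1_000, 10_000, 100_000)
}

for n, estimate in estimates.items():
    print(f"{n:>7} trials: relative frequency = {estimate:.4f}")
```

With few trials the estimate can be far from \(0.75\), while with many trials it typically settles close to it.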

This approach to defining probability is sometimes referred to as the **frequentist definition of probability**. Under this definition, probability represents a *long-run average*. The two approaches to defining probability are compatible: it can be shown that a probability measure defined using relative frequencies satisfies the axiomatic definition.