2.9: Probability Spaces Revisited
In this section we discuss probability spaces from the more advanced point of view of measure theory. The previous two sections on positive measures and existence and uniqueness are prerequisites. The discussion is divided into two parts: first those concepts that are shared rather equally between probability theory and general measure theory, and second those concepts that are for the most part unique to probability theory. In particular, it's a mistake to think of probability theory as a mere branch of measure theory. Probability has its own notation, terminology, point of view, and applications that make it an incredibly rich subject on its own.
Basic Concepts
Our first discussion concerns topics that were discussed in the section on positive measures. So no proofs are necessary, but you will notice that the notation, and in some cases the terminology, is very different.
Definitions
We can now give a precise definition of the probability space, the mathematical model of a random experiment.
A probability space \((S, \mathscr S, \P)\) consists of three essential parts:
- A set of outcomes \(S\).
- A \(\sigma\)-algebra of events \(\mathscr S\).
- A probability measure \(\P\) on the sample space \( (S, \mathscr S) \).
Often the special notation \( (\Omega, \mathscr{F}, \P) \) is used for a probability space in the literature—the symbol \( \Omega \) for the set of outcomes is intended to remind us that these are all possible outcomes. However in this text, we don't insist on the special notation, and use whatever notation seems most appropriate in a given context.
In probability, \(\sigma\)-algebras are not just important for theoretical and foundational purposes, but are important for practical purposes as well. A \(\sigma\)-algebra can be used to specify partial information about an experiment—a concept of fundamental importance. Specifically, suppose that \(\mathscr{A}\) is a collection of events in the experiment, and that we know whether or not \(A\) occurred for each \(A \in \mathscr{A}\). Then in fact, we can determine whether or not \(A\) occurred for each \(A \in \sigma(\mathscr{A})\), the \(\sigma\)-algebra generated by \(\mathscr{A}\).
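On a finite set of outcomes, the generated \(\sigma\)-algebra can be computed directly, since closing under complements and finite unions is enough. The following Python sketch illustrates the idea for a hypothetical die experiment; the function name `generated_sigma_algebra` and the two example events are assumptions made for illustration only.

```python
from itertools import combinations

def generated_sigma_algebra(outcomes, collection):
    """Return sigma(collection) on a finite set, as a set of frozensets."""
    S = frozenset(outcomes)
    events = {frozenset(), S} | {frozenset(A) for A in collection}
    changed = True
    while changed:
        changed = False
        current = list(events)
        # Close under complements.
        for A in current:
            comp = S - A
            if comp not in events:
                events.add(comp)
                changed = True
        # Close under pairwise (hence finite) unions.
        for A, B in combinations(current, 2):
            U = A | B
            if U not in events:
                events.add(U)
                changed = True
    return events

# Hypothetical partial information about a die roll:
# we learn whether the score is even and whether it is at most 2.
sigma = generated_sigma_algebra(range(1, 7), [{2, 4, 6}, {1, 2}])
print(len(sigma))                    # 16 events, built from 4 atoms
print(frozenset({4, 6}) in sigma)    # True: "even and greater than 2" is decided
print(frozenset({3}) in sigma)       # False: we cannot tell 3 from 5
```

The four atoms here are \(\{2\}\), \(\{4, 6\}\), \(\{1\}\), and \(\{3, 5\}\), so knowing the two generating events determines exactly the unions of these atoms.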
Technically, a random variable for our experiment is a measurable function from the sample space into another measurable space.
Suppose that \( (S, \mathscr S, \P) \) is a probability space and that \( (T, \mathscr T) \) is another measurable space. A random variable \( X \) with values in \( T \) is a measurable function from \( S \) into \( T \).
- The probability distribution of \( X \) is the mapping on \( \mathscr T \) given by \( B \mapsto \P(X \in B) \).
- The collection of events \(\{\{X \in B\}: B \in \mathscr T\}\) is a sub \(\sigma\)-algebra of \(\mathscr S\), and is the \(\sigma\)-algebra generated by \(X\), denoted \(\sigma(X)\).
If we observe the value of \(X\), then we know whether or not each event in \(\sigma(X)\) has occurred. More generally, we can construct the \( \sigma \)-algebra associated with any collection of random variables.
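For a concrete finite case, the sketch below builds \(\sigma(X)\) and the distribution of \(X\) for a hypothetical experiment of two fair coin tosses, with \(X\) the number of heads. The experiment and all names in the code are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical experiment: two fair coin tosses, X = number of heads.
S = [(a, b) for a in "HT" for b in "HT"]
P = {s: Fraction(1, 4) for s in S}

def X(s):
    return s.count("H")

T = sorted({X(s) for s in S})  # the values of X: [0, 1, 2]

def subsets(values):
    """All subsets of a finite list of values."""
    return [set(c) for r in range(len(values) + 1) for c in combinations(values, r)]

# sigma(X) is the collection of inverse images {X in B} for B a subset of T.
sigma_X = {frozenset(s for s in S if X(s) in B) for B in subsets(T)}

# The distribution of X, given here by its values on singletons.
dist = {t: sum(P[s] for s in S if X(s) == t) for t in T}

print(len(sigma_X))   # 8 events, fewer than the 16 subsets of S
print(dist)
```

The event \(\{(H, T)\}\) is not in \(\sigma(X)\): observing only the number of heads does not reveal the order of the tosses.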
Suppose that \( (T_i, \mathscr T_i) \) is a measurable space for each \( i \) in an index set \( I \), and that \(X_i\) is a random variable taking values in \( T_i \) for each \( i \in I \). The \( \sigma \)-algebra generated by \( \{X_i: i \in I\} \) is \[ \sigma\{X_i: i \in I\} = \sigma\left\{\{X_i \in B_i\}: B_i \in \mathscr T_i, \; i \in I\right\} \]
If we observe the value of \(X_i\) for each \(i \in I\) then we know whether or not each event in \(\sigma\{X_i: i \in I\}\) has occurred. This idea is very important in the study of stochastic processes.
Null Events, Almost Sure Events, and Equivalence
Suppose that \( (S, \mathscr S, \P) \) is a probability space.
Define the following collections of events:
- \(\mathscr{N} = \{A \in \mathscr S: \P(A) = 0\} \), the collection of null events
- \(\mathscr{M} = \{A \in \mathscr S: \P(A) = 1\}\), the collection of almost sure events
- \( \mathscr{D} = \mathscr{N} \cup \mathscr{M} = \{A \in \mathscr S: \P(A) = 0 \text{ or } \P(A) = 1 \} \), the collection of essentially deterministic events
The collection of essentially deterministic events \( \mathscr D \) is a sub \( \sigma \)-algebra of \( \mathscr S \).
In the section on independence, we showed that \( \mathscr{D} \) is also a collection of independent events.
Intuitively, equivalent events or random variables are those that are indistinguishable from a probabilistic point of view. Recall first that the symmetric difference between events \( A \) and \( B \) is \( A \bigtriangleup B = (A \setminus B) \cup (B \setminus A) \); it is the event that occurs if and only if one of the events occurs, but not the other, and corresponds to exclusive or. Here is the definition for events:
Events \(A\) and \(B\) are equivalent if \( A \bigtriangleup B \in \mathscr{N} \), and we denote this by \( A \equiv B \). The relation \( \equiv \) is an equivalence relation on \( \mathscr S \). That is, for \( A, \, B, \, C \in \mathscr S \),
- \(A \equiv A\) (the reflexive property).
- If \(A \equiv B\) then \(B \equiv A\) (the symmetric property).
- If \(A \equiv B\) and \(B \equiv C\) then \(A \equiv C\) (the transitive property).
Thus \(A \equiv B\) if and only if \(\P(A \bigtriangleup B) = \P(A \setminus B) + \P(B \setminus A) = 0\) if and only if \(\P(A \setminus B) = \P(B \setminus A) = 0\). The equivalence relation \( \equiv \) partitions \( \mathscr S \) into disjoint classes of mutually equivalent events. Equivalence is preserved under the set operations.
Suppose that \( A, \, B \in \mathscr S \). If \( A \equiv B \) then \( A^c \equiv B^c \).
Suppose that \( A_i, \, B_i \in \mathscr S \) for \( i \) in a countable index set \( I \). If \( A_i \equiv B_i \) for \( i \in I \) then
- \( \bigcup_{i \in I} A_i \equiv \bigcup_{i \in I} B_i \)
- \( \bigcap_{i \in I} A_i \equiv \bigcap_{i \in I} B_i \)
Equivalent events have the same probability.
If \( A, \, B \in \mathscr S \) and \(A \equiv B\) then \(\P(A) = \P(B)\).
The converse trivially fails, and a counterexample is given below. However, the null and almost sure events do form equivalence classes.
Suppose that \( A \in \mathscr S \).
- If \(A \in \mathscr{N}\) then \(A \equiv B\) if and only if \(B \in \mathscr{N}\).
- If \(A \in \mathscr{M}\) then \(A \equiv B\) if and only if \(B \in \mathscr{M}\).
We can extend the notion of equivalence to random variables taking values in the same space. Thus suppose that \( (T, \mathscr T) \) is another measurable space. If \( X \) and \( Y \) are random variables with values in \( T \), then \( (X, Y) \) is a random variable with values in \( T \times T \), which is given the usual product \( \sigma \)-algebra \( \mathscr T \otimes \mathscr T \). We assume that the diagonal set \( D = \{(x, x): x \in T\} \in \mathscr T \otimes \mathscr T \), which is almost always true in applications.
Random variables \(X\) and \(Y\) taking values in \(T\) are equivalent if \( \P(X = Y) = 1 \). Again we write \( X \equiv Y \). The relation \( \equiv \) is an equivalence relation on the collection of random variables that take values in \(T\). That is, for random variables \( X \), \( Y \), and \( Z \) with values in \( T \),
- \(X \equiv X\) (the reflexive property).
- If \(X \equiv Y\) then \(Y \equiv X\) (the symmetric property).
- If \( X \equiv Y\) and \(Y \equiv Z\) then \(X \equiv Z\) (the transitive property).
So the collection of random variables with values in \( T \) is partitioned into disjoint classes of mutually equivalent variables.
Suppose that \(X\) and \(Y\) are random variables taking values in \(T\) and that \( X \equiv Y \). Then
- \( \{X \in B\} \equiv \{Y \in B\} \) for every \(B \in \mathscr T\).
- \( X \) and \( Y \) have the same probability distribution on \( (T, \mathscr T) \).
Again, the converse to part (b) fails with a passion, and a counterexample is given below. It often happens that a definition for random variables subsumes the corresponding definition for events, by considering the indicator variables of the events. So it is with equivalence.
Suppose that \(A, \, B \in \mathscr S\). Then \(A \equiv B\) if and only if \(\bs 1_A \equiv \bs 1_B\).
Equivalence is preserved under a deterministic transformation of the variables. For the next result, suppose that \( (U, \mathscr U) \) is yet another measurable space, along with \( (T, \mathscr T) \).
Suppose \( X, \, Y \) are random variables with values in \( T \) and that \( g: T \to U \) is measurable. If \(X \equiv Y\) then \( g(X) \equiv g(Y) \).
Suppose again that \( (S, \mathscr S, \P) \) is a probability space corresponding to a random experiment. Let \( \mathscr V \) denote the collection of all real-valued random variables for the experiment, that is, all measurable functions from \( S \) into \( \R \). With the usual definitions of addition and scalar multiplication, \( (\mathscr V, +, \cdot) \) is a vector space. However, in probability theory, we often do not want to distinguish between random variables that are equivalent, so it's nice to know that the vector space structure is preserved when we identify equivalent random variables. Formally, let \( [X] \) denote the equivalence class generated by a real-valued random variable \( X \in \mathscr V \), and let \( \mathscr W \) denote the collection of all such equivalence classes. In modular notation, \( \mathscr W\) is the set \(\mathscr V \big/ \equiv \). We define addition and scalar multiplication on \( \mathscr{W} \) by \[ [X] + [Y] = [X + Y], \; c [X] = [c X]; \quad [X], \; [Y] \in \mathscr{W}, \; c \in \R \]
\( (\mathscr W, +, \cdot) \) is a vector space.
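The key step behind this result is that the operations on equivalence classes are well defined, that is, they do not depend on the representatives chosen. A short sketch of the argument, using only the definitions above: suppose that \( X \equiv X^\prime \) and \( Y \equiv Y^\prime \), and that \( c \in \R \). Then \[ \{X + Y \ne X^\prime + Y^\prime\} \subseteq \{X \ne X^\prime\} \cup \{Y \ne Y^\prime\}, \quad \{c X \ne c X^\prime\} \subseteq \{X \ne X^\prime\} \] Hence \[ \P(X + Y \ne X^\prime + Y^\prime) \le \P(X \ne X^\prime) + \P(Y \ne Y^\prime) = 0, \quad \P(c X \ne c X^\prime) \le \P(X \ne X^\prime) = 0 \] so \( X + Y \equiv X^\prime + Y^\prime \) and \( c X \equiv c X^\prime \). The vector space axioms for \( \mathscr W \) are then inherited from those of \( \mathscr V \).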
Often we don't bother to use the special notation for the equivalence class associated with a random variable. Rather, it's understood that equivalent random variables represent the same object. Spaces of functions in a general measure space are studied in the chapter on Distributions, and spaces of random variables are studied in more detail in the chapter on Expected Value.
Completion
Suppose again that \( (S, \mathscr S, \P) \) is a probability space, and that \( \mathscr N \) denotes the collection of null events, as above. Suppose that \( A \in \mathscr N \) so that \( \P(A) = 0 \). If \( B \subseteq A \) and \( B \in \mathscr S \), then we know that \( \P(B) = 0 \) so \( B \in \mathscr{N} \) also. However, in general there might be subsets of \( A \) that are not in \( \mathscr S \). This leads naturally to the following definition.
The probability space \( (S, \mathscr S, \P) \) is complete if \( A \in \mathscr N \) and \( B \subseteq A \) imply \( B \in \mathscr S \) (and hence \( B \in \mathscr{N} \)).
So the probability space is complete if every subset of an event with probability 0 is also an event (and hence also has probability 0). We know from our work on positive measures that every \( \sigma \)-finite measure space that is not complete can be completed. So in particular a probability space that is not complete can be completed. To review the construction, recall that the equivalence relation \( \equiv \) that we used above on \( \mathscr S \) is extended to \( \mathscr{P}(S) \) (the power set of \( S \)).
For \( A, \, B \subseteq S \), define \( A \equiv B \) if and only if there exists \( N \in \mathscr{N} \) such that \( A \bigtriangleup B \subseteq N \). The relation \( \equiv \) is an equivalence relation on \( \mathscr{P}(S) \).
Here is how the probability space is completed:
Let \( \mathscr S_0 = \{A \subseteq S: A \equiv B \text{ for some } B \in \mathscr S \} \). For \( A \in \mathscr S_0 \), define \( \P_0(A) = \P(B) \) where \( B \in \mathscr S \) and \( A \equiv B \). Then
- \( \mathscr S_0 \) is a \( \sigma \)-algebra of subsets of \( S \) and \( \mathscr S \subseteq \mathscr S_0 \).
- \( \P_0 \) is a probability measure on \( (S, \mathscr S_0) \).
- \( (S, \mathscr S_0, \P_0) \) is complete, and is the completion of \( (S, \mathscr S, \P) \).
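The construction can be traced by hand on a very small example. The Python sketch below uses a hypothetical three-point space in which \( \{2, 3\} \) is a null event but \( \{2\} \) and \( \{3\} \) are not events; the space and all names in the code are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations

S = frozenset({1, 2, 3})

# Hypothetical incomplete space: the sigma-algebra has four events,
# and {2, 3} is null although its subsets {2} and {3} are not events.
prob = {
    frozenset(): Fraction(0),
    frozenset({1}): Fraction(1),
    frozenset({2, 3}): Fraction(0),
    S: Fraction(1),
}
null_events = [N for N, p in prob.items() if p == 0]

def power_set(s):
    items = list(s)
    return (frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r))

# A belongs to the completed sigma-algebra if A is equivalent to some event B,
# meaning that the symmetric difference of A and B lies inside a null event.
completed = {}
for A in power_set(S):
    for B in prob:
        if any((A ^ B) <= N for N in null_events):
            completed[A] = prob[B]
            break

for A, p in sorted(completed.items(), key=lambda item: sorted(item[0])):
    print(sorted(A), p)
```

In this example the completion is the full power set of \( S \), with \( \P_0(A) = 1 \) if \( 1 \in A \) and \( \P_0(A) = 0 \) otherwise.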
Product Spaces
Our next discussion concerns the construction of probability spaces that correspond to specified distributions. To set the stage, suppose that \( (S, \mathscr S, \P) \) is a probability space. If we let \( X \) denote the identity function on \( S \), so that \( X(x) = x \) for \( x \in S \), then \( \{X \in A\} = A \) for \( A \in \mathscr S \) and hence \( \P(X \in A) = \P(A) \). That is, \( \P \) is the probability distribution of \( X \). We have seen this before—every probability measure can be thought of as the distribution of a random variable. The next result shows how to construct a probability space that corresponds to a sequence of independent random variables with specified distributions.
Suppose \( n \in \N_+ \) and that \( (S_i, \mathscr S_i, \P_i) \) is a probability space for \( i \in \{1, 2, \ldots, n\} \). The corresponding product measure space \( (S, \mathscr S, \P) \) is a probability space. If \( X_i: S \to S_i \) is the \( i \)th coordinate function on \( S\) so that \( X_i(\bs x) = x_i \) for \( \bs x = (x_1, x_2, \ldots, x_n) \in S \) then \( (X_1, X_2, \ldots, X_n) \) is a sequence of independent random variables on \( (S, \mathscr{S}, \P) \), and \( X_i \) has distribution \( \P_i \) on \( (S_i, \mathscr S_i) \) for each \( i \in \{1, 2, \ldots, n \} \).
Proof
Of course, the existence of the product space \( (S, \mathscr S, \P) \) follows immediately from the more general result for products of positive measure spaces. Recall that \( S = \prod_{i=1}^n S_i \) and that \( \mathscr S \) is the \( \sigma \)-algebra generated by sets of the form \( \prod_{i=1}^n A_i \) where \( A_i \in \mathscr S_i \) for each \( i \in \{1, 2, \ldots, n\} \). Finally, \( \P \) is the unique positive measure on \( (S, \mathscr S) \) satisfying \[ \P\left(\prod_{i=1}^n A_i\right) = \prod_{i=1}^n \P_i(A_i) \] where again, \( A_i \in \mathscr S_i \) for each \( i \in \{1, 2, \ldots, n\} \). Clearly \( \P \) is a probability measure since \( \P(S) = \prod_{i=1}^n \P_i(S_i) = 1 \). Suppose that \( A_i \in \mathscr S_i \) for \( i \in \{1, 2, \ldots, n\} \). Then \( \{X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n\} = \prod_{i=1}^n A_i \in \mathscr S\). Hence \[ \P(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = \prod_{i=1}^n \P_i(A_i) \] If we fix \( i \in \{1, 2, \ldots, n\} \) and let \( A_j = S_j \) for \( j \ne i \), then the displayed equation gives \( \P(X_i \in A_i) = \P_i(A_i) \), so \( X_i \) has distribution \( \P_i \) on \( (S_i, \mathscr S_i) \). Returning to the displayed equation we have \[ \P(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = \prod_{i=1}^n \P(X_i \in A_i) \] so \( (X_1, X_2, \ldots, X_n) \) are independent.
Intuitively, the given probability spaces correspond to \( n \) random experiments. The product space then is the probability space that corresponds to the experiments performed independently. When modeling a random experiment, if we say that we have a finite sequence of independent random variables with specified distributions, we can rest assured that there actually is a probability space that supports this statement.
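In the discrete case the product construction is easy to carry out explicitly. The following Python sketch builds the product of two hypothetical finite probability spaces (a biased coin and a fair three-sided spinner, both assumptions for illustration) and checks that the coordinate variables have the right marginals and satisfy the product rule for one choice of events.

```python
from fractions import Fraction
from itertools import product

# Hypothetical factor spaces, each given by a probability mass function.
P1 = {"H": Fraction(2, 3), "T": Fraction(1, 3)}
P2 = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Product measure on S = S1 x S2; specifying it on points suffices here
# because the spaces are finite.
P = {(x1, x2): P1[x1] * P2[x2] for x1, x2 in product(P1, P2)}

def prob(event):
    """Probability of the set of points x in S for which event(x) is true."""
    return sum((P[x] for x in P if event(x)), Fraction(0))

A1, A2 = {"H"}, {1, 3}
joint = prob(lambda x: x[0] in A1 and x[1] in A2)   # P(X1 in A1, X2 in A2)
marg1 = prob(lambda x: x[0] in A1)                  # P(X1 in A1)
marg2 = prob(lambda x: x[1] in A2)                  # P(X2 in A2)

print(sum(P.values()))          # total mass 1
print(joint == marg1 * marg2)   # the product rule holds for these events
```

Here `marg1` and `marg2` agree with \( \P_1(A_1) \) and \( \P_2(A_2) \), as the theorem asserts.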
We can extend the last result to an infinite sequence of probability spaces. Suppose that \( (S_i, \mathscr S_i) \) is a measurable space for each \( i \in \N_+ \). Recall that the product space \( \prod_{i=1}^\infty S_i \) consists of all sequences \( \bs x = (x_1, x_2, \ldots) \) such that \( x_i \in S_i \) for each \( i \in \N_+ \). The corresponding product \( \sigma \)-algebra \( \mathscr S \) is generated by the collection of cylinder sets. That is, \( \mathscr S = \sigma(\mathscr B) \) where \[ \mathscr{B} = \left\{\prod_{i=1}^\infty A_i: A_i \in \mathscr S_i \text{ for each } i \in \N_+ \text{ and } A_i = S_i \text{ for all but finitely many } i \in \N_+\right\} \]
Suppose that \( (S_i, \mathscr{S}_i, \P_i) \) is a probability space for \(i \in \N_+ \). Let \( (S, \mathscr S) \) denote the product measurable space so that \( \mathscr S = \sigma(\mathscr B) \) where \( \mathscr B \) is the collection of cylinder sets. Then there exists a unique probability measure \( \P \) on \( (S, \mathscr S) \) that satisfies \[ \P\left(\prod_{i=1}^\infty A_i\right) = \prod_{i=1}^\infty \P_i(A_i), \quad \prod_{i=1}^\infty A_i \in \mathscr B\] If \( X_i: S \to S_i \) is the \( i \)th coordinate function on \( S\) for \( i \in \N_+ \), so that \( X_i(\bs x) = x_i \) for \( \bs x = (x_1, x_2, \ldots) \in S \), then \( (X_1, X_2, \ldots) \) is a sequence of independent random variables on \( (S, \mathscr{S}, \P) \), and \( X_i \) has distribution \( \P_i \) on \( (S_i, \mathscr S_i) \) for each \( i \in \N_+ \).
Proof
The proof is similar to the one for positive measure spaces in the section on existence and uniqueness. First recall that the collection of cylinder sets \( \mathscr B \) is a semi-algebra. We define \( \P: \mathscr{B} \to [0, 1] \) as in the statement of the theorem. Note that all but finitely many factors are 1. The consistency conditions are satisfied, so \( \P \) can be extended to a probability measure on the algebra \( \mathscr A \) generated by \( \mathscr{B} \). That is, \( \mathscr A \) is the collection of all finite, disjoint unions of cylinder sets. The standard extension theorem and uniqueness theorem now apply, so \( \P\) can be extended uniquely to a measure on \( \mathscr S = \sigma(\mathscr A)\). The proof that \( (X_1, X_2, \ldots) \) are independent and that \( X_i \) has distribution \( \P_i \) for each \( i \in \N_+ \) is just as in the previous theorem.
Once again, if we model a random process by starting with an infinite sequence of independent random variables, we can be sure that there exists a probability space that supports this sequence. The particular probability space constructed in the last theorem is called the canonical probability space associated with the sequence of random variables. Note also that it was important that we had probability measures rather than just general positive measures in the construction, since the infinite product \( \prod_{i=1}^\infty \P_i(A_i) \) is always well defined. The next section on Stochastic Processes continues the discussion of how to construct probability spaces that correspond to a collection of random variables with specified distributional properties.
Probability Concepts
Our next discussion concerns topics that are unique to probability theory and do not have simple analogies in general measure theory.
Independence
As usual, suppose that \( (S, \mathscr S, \P) \) is a probability space. We have already studied the independence of collections of events and the independence of collections of random variables. A more complete and general treatment results if we define the independence of collections of collections of events, and most importantly, the independence of collections of \( \sigma \)-algebras. This extension actually occurred already, when we went from independence of a collection of events to independence of a collection of random variables, but we did not note it at the time. In spite of the layers of set theory, the basic idea is the same.
Suppose that \( \mathscr{A}_i \) is a collection of events for each \( i \) in an index set \( I \). Then \( \mathscr{A} = \{\mathscr{A}_i: i \in I\} \) is independent if and only if for every choice of \( A_i \in \mathscr{A}_i \) for \( i \in I \), the collection of events \(\{ A_i: i \in I\} \) is independent. That is, for every finite \(J \subseteq I \), \[ \P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} \P(A_j) \]
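The requirement that the product rule hold for every finite \( J \subseteq I \), and not just for pairs, matters. The Python sketch below checks a standard example with two fair coin tosses, where three events are pairwise independent but not independent as a collection. The helper `is_independent` and the event names are assumptions introduced for this illustration.

```python
from fractions import Fraction
from itertools import combinations, product

S = list(product("HT", repeat=2))  # two fair coin tosses, uniform probability

def P(E):
    return Fraction(len(E), len(S))

A = {s for s in S if s[0] == "H"}      # first toss is heads
B = {s for s in S if s[1] == "H"}      # second toss is heads
C = {s for s in S if s[0] == s[1]}     # the two tosses agree
events = [A, B, C]

def is_independent(collection):
    """Check the product rule for every sub-collection with at least 2 events."""
    for r in range(2, len(collection) + 1):
        for sub in combinations(collection, r):
            inter = set(S).intersection(*sub)
            rhs = Fraction(1)
            for E in sub:
                rhs *= P(E)
            if P(inter) != rhs:
                return False
    return True

pairwise = all(is_independent([E, F]) for E, F in combinations(events, 2))
mutual = is_independent(events)
print(pairwise, mutual)   # True False
```

The failure occurs for \( J = \{1, 2, 3\} \): the intersection of the three events has probability \( 1/4 \), while the product of the probabilities is \( 1/8 \).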
As noted above, independence of random variables, as we defined previously, is a special case of our new definition.
Suppose that \( (T_i, \mathscr T_i) \) is a measurable space for each \( i \) in an index set \( I \), and that \( X_i \) is a random variable taking values in a set \( T_i \) for each \( i \in I \). The independence of \( \{X_i: i \in I\} \) is equivalent to the independence of \( \{\sigma(X_i): i \in I\} \).
Independence of events is also a special case of the new definition, and thus our new definition really does subsume our old one.
Suppose that \( A_i \) is an event for each \( i \in I \). The independence of \( \{A_i: i \in I\} \) is equivalent to the independence of \( \{\mathscr{A}_i: i \in I\} \) where \( \mathscr{A}_i = \sigma\{A_i\} = \{S, \emptyset, A_i, A_i^c\} \) for each \( i \in I \).
For every collection of objects that we have considered (collections of events, collections of random variables, collections of collections of events), the notion of independence has the basic inheritance property.
Suppose that \( \mathscr{A} \) is a collection of collections of events.
- If \( \mathscr{A} \) is independent then \( \mathscr{B} \) is independent for every \( \mathscr{B} \subseteq \mathscr{A} \).
- If \( \mathscr{B} \) is independent for every finite \( \mathscr{B} \subseteq \mathscr{A} \) then \( \mathscr{A} \) is independent.
Our most important collections are \( \sigma \)-algebras, and so we are most interested in the independence of a collection of \( \sigma \)-algebras. The next result allows us to go from the independence of certain types of collections to the independence of the \( \sigma \)-algebras generated by these collections. To understand the result, you will need to review the definitions and theorems concerning \( \pi \)-systems and \( \lambda \)-systems. The proof uses Dynkin's \( \pi \)-\( \lambda \) theorem, named for Eugene Dynkin.
Suppose that \( \mathscr{A}_i \) is a collection of events for each \( i \) in an index set \( I \), and that \( \mathscr{A}_i \) is a \( \pi \)-system for each \( i \in I \). If \( \left\{\mathscr{A}_i: i \in I\right\} \) is independent, then \( \left\{\sigma(\mathscr{A}_i): i \in I\right\} \) is independent.
Proof
In light of the previous result, it suffices to consider a finite set of collections. Thus, suppose that \( \{\mathscr{A}_1, \mathscr{A}_2, \ldots, \mathscr{A}_n\} \) is independent. Now, fix \( A_i \in \mathscr{A}_i \) for \( i \in \{2, 3, \ldots, n\} \) and let \( E = \bigcap_{i=2}^n A_i \). Let \( \mathscr{L} = \{B \in \mathscr S: \P(B \cap E) = \P(B) \P(E)\} \). Trivially \( S \in \mathscr{L} \) since \( \P(S \cap E) = \P(E) = \P(S) \P(E) \). Next suppose that \( A \in \mathscr{L} \). Then \[ \P(A^c \cap E) = \P(E) - \P(A \cap E) = \P(E) - \P(A) \P(E) = [1 - \P(A)] \P(E) = \P(A^c) \P(E) \] Thus \( A^c \in \mathscr{L} \). Finally, suppose that \( \{A_j: j \in J\} \) is a countable collection of disjoint sets in \( \mathscr{L} \). Then \[ \P\left[\left(\bigcup_{j \in J} A_j \right) \cap E \right] = \P\left[ \bigcup_{j \in J} (A_j \cap E) \right] = \sum_{j \in J} \P(A_j \cap E) = \sum_{j \in J} \P(A_j) \P(E) = \P(E) \sum_{j \in J} \P(A_j) = \P(E) \P\left(\bigcup_{j \in J} A_j \right) \] Therefore \( \bigcup_{j \in J} A_j \in \mathscr{L} \) and so \( \mathscr{L} \) is a \( \lambda \)-system. Trivially \( \mathscr{A}_1 \subseteq \mathscr{L} \) by the original independence assumption, so by the \( \pi \)-\( \lambda \) theorem, \( \sigma(\mathscr{A}_1) \subseteq \mathscr{L} \). Thus, we have that for every \( A_1 \in \sigma(\mathscr{A}_1) \) and \( A_i \in \mathscr{A}_i \) for \( i \in \{2, 3, \ldots, n\} \), \[ \P\left(\bigcap_{i=1}^n A_i \right) = \prod_{i=1}^n \P(A_i) \] Thus we have shown that \( \left\{\sigma(\mathscr{A}_1), \mathscr{A}_2, \ldots, \mathscr{A}_n\right\} \) is independent. Repeating the argument \( n - 1 \) additional times, we get that \( \{\sigma(\mathscr{A}_1), \sigma(\mathscr{A}_2), \ldots, \sigma(\mathscr{A}_n)\} \) is independent.
The next result is a rigorous statement of the strong independence that is implied by the independence of a collection of events.
Suppose that \( \mathscr{A} \) is an independent collection of events, and that \( \left\{\mathscr{B}_j: j \in J\right\} \) is a partition of \( \mathscr{A} \). That is, \( \mathscr{B}_j \cap \mathscr{B}_k = \emptyset \) for \( j \ne k \) and \( \bigcup_{j \in J} \mathscr{B}_j = \mathscr{A} \). Then \( \left\{\sigma(\mathscr{B}_j): j \in J\right\} \) is independent.
Proof
Let \( \mathscr{B}_j^* \) denote the set of all finite intersections of sets in \( \mathscr{B}_j \), for each \( j \in J \). Then clearly \( \mathscr{B}_j^* \) is a \( \pi \)-system for each \( j \), and \( \left\{\mathscr{B}_j^*: j \in J\right\} \) is independent. By the previous theorem, \( \left\{\sigma(\mathscr{B}_j^*): j \in J\right\} \) is independent. But clearly \( \sigma(\mathscr{B}_j^*) = \sigma(\mathscr{B}_j) \) for \( j \in J \).
Let's bring the result down to earth. Suppose that \( A, B, C, D \) are independent events. In our elementary discussion of independence, you were asked to show, for example, that \( A \cup B^c \) and \( C^c \cup D^c \) are independent. This is a consequence of the much stronger statement that the \( \sigma \)-algebras \( \sigma\{A, B\} \) and \( \sigma\{C, D\} \) are independent.
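The claim about \( A \cup B^c \) and \( C^c \cup D^c \) can be checked numerically on a small model. The Python sketch below represents four independent events on \( \{0, 1\}^4 \) with hypothetical probabilities \( 1/2, 1/3, 1/4, 1/5 \); these values and the names in the code are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical probabilities of four independent events A, B, C, D.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4), Fraction(1, 5)]

# An outcome s records, coordinate by coordinate, whether A, B, C, D occurred.
S = list(product([0, 1], repeat=4))

def weight(s):
    """Probability of outcome s under independence of the four events."""
    w = Fraction(1)
    for bit, pi in zip(s, p):
        w *= pi if bit else 1 - pi
    return w

def prob(event):
    return sum((weight(s) for s in S if event(s)), Fraction(0))

def E(s):   # A union B^c, an event in sigma{A, B}
    return s[0] == 1 or s[1] == 0

def F(s):   # C^c union D^c, an event in sigma{C, D}
    return s[2] == 0 or s[3] == 0

lhs = prob(lambda s: E(s) and F(s))
rhs = prob(E) * prob(F)
print(lhs == rhs)   # True
```

The same check succeeds for any pair of events chosen from \( \sigma\{A, B\} \) and \( \sigma\{C, D\} \), which is what the independence of the two \( \sigma \)-algebras means.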
Exchangeability
As usual, suppose that \( (S, \mathscr S, \P) \) is a probability space corresponding to a random experiment. Roughly speaking, a sequence of events or a sequence of random variables is exchangeable if the probability law that governs the sequence is unchanged when the order of the events or variables is changed. Exchangeable variables arise naturally in sampling experiments and many other settings, and are a natural generalization of a sequence of independent, identically distributed (IID) variables. Conversely, it turns out that any exchangeable sequence of variables can be constructed from an IID sequence. First we give the definition for events:
Suppose that \(\mathscr A = \{A_i: i \in I\}\) is a collection of events, where \(I\) is a nonempty index set. Then \( \mathscr A \) is exchangeable if the probability of the intersection of a finite number of the events depends only on the number of events. That is, if \(J\) and \(K\) are finite subsets of \(I\) and \(\#(J) = \#(K)\) then \[\P\left( \bigcap_{j \in J} A_j\right) = \P \left( \bigcap_{k \in K} A_k\right)\]
Exchangeability has the same basic inheritance property that we have seen before.
Suppose that \(\mathscr A\) is a collection of events.
- If \(\mathscr A \) is exchangeable then \(\mathscr B\) is exchangeable for every \(\mathscr B \subseteq \mathscr A\).
- Conversely, if \(\mathscr B\) is exchangeable for every finite \(\mathscr B \subseteq \mathscr A\) then \(\mathscr A\) is exchangeable.
For a collection of exchangeable events, the inclusion exclusion law for the probability of a union is much simpler than the general version.
Suppose that \(\{A_1, A_2, \ldots, A_n\}\) is an exchangeable collection of events. For \(J \subseteq \{1, 2, \ldots, n\}\) with \(\#(J) = k\), let \(p_k = \P\left( \bigcap_{j \in J} A_j\right)\). Then \[\P\left(\bigcup_{i = 1}^n A_i\right) = \sum_{k=1}^n (-1)^{k-1} \binom{n}{k} p_k\]
Proof
The inclusion-exclusion rule gives \[\P \left( \bigcup_{i = 1}^n A_i \right) = \sum_{k = 1}^n (-1)^{k - 1} \sum_{J \subseteq \{1, 2, \ldots, n\}, \; \#(J) = k} \P \left( \bigcap_{j \in J} A_j \right)\] But \(p_k = \P\left( \bigcap_{j \in J} A_j\right)\) for every \( J \subseteq \{1, 2, \ldots, n\} \) with \( \#(J) = k \), and there are \( \binom{n}{k} \) such subsets.
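A familiar instance is the matching problem. For a uniformly random permutation of \( \{1, 2, \ldots, n\} \), let \( A_i \) be the event that position \( i \) is a fixed point. These events are exchangeable with \( p_k = (n - k)! / n! \). The Python sketch below compares the simplified formula with a brute-force count; the choice \( n = 5 \) and the variable names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial

n = 5  # hypothetical size, small enough to enumerate all permutations

# Brute force: probability that a uniform random permutation has a fixed point.
perms = list(permutations(range(n)))
with_match = sum(1 for s in perms if any(s[i] == i for i in range(n)))
brute = Fraction(with_match, len(perms))

# Simplified inclusion-exclusion with p_k = (n - k)! / n!.
formula = sum(
    (-1) ** (k - 1) * comb(n, k) * Fraction(factorial(n - k), factorial(n))
    for k in range(1, n + 1)
)

print(brute, formula)   # both equal 19/30
```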
The concept of exchangeability can be extended to random variables in the natural way. Suppose that \( (T, \mathscr T) \) is a measurable space.
Suppose that \(\mathscr A \) is a collection of random variables, each taking values in \(T\). The collection \(\mathscr A\) is exchangeable if for any \(\{X_1, X_2, \ldots, X_n\} \subseteq \mathscr A \), the distribution of the random vector \((X_1, X_2, \ldots, X_n)\) depends only on \(n\).
Thus, the distribution of the random vector is unchanged if the coordinates are permuted. Once again, exchangeability has the same basic inheritance property as a collection of independent variables.
Suppose that \(\mathscr{A}\) is a collection of random variables, each taking values in \( T \).
- If \(\mathscr A\) is exchangeable then \(\mathscr B\) is exchangeable for every \(\mathscr B \subseteq \mathscr A\).
- Conversely, if \(\mathscr B\) is exchangeable for every finite \(\mathscr B \subseteq \mathscr A\) then \(\mathscr A\) is exchangeable.
Suppose that \( \mathscr A \) is a collection of random variables, each taking values in \( T \), and that \( \mathscr A \) is exchangeable. Then trivially the variables are identically distributed: if \( X, \, Y \in \mathscr A \) and \( A \in \mathscr T \), then \( \P(X \in A) = \P(Y \in A) \). Also, the definition of exchangeable variables subsumes the definition for events:
Suppose that \(\mathscr A\) is a collection of events, and let \(\mathscr B = \{\bs 1_A: A \in \mathscr A \}\) denote the corresponding collection of indicator random variables. Then \(\mathscr A\) is exchangeable if and only if \(\mathscr B\) is exchangeable.
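Sampling without replacement gives exchangeable indicator variables that are not independent. The Python sketch below uses a hypothetical urn with 3 red and 2 green balls and 3 draws; it checks that the joint probability mass function of the indicators is invariant under permutations of the coordinates, from which the condition in the definition follows by taking marginals. The urn and all names are assumptions for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

urn = ["R", "R", "R", "G", "G"]   # hypothetical urn
n = 3                             # number of draws without replacement

# Every ordered sample of n distinct balls is equally likely.
samples = list(permutations(range(len(urn)), n))
joint = Counter()
for s in samples:
    x = tuple(1 if urn[i] == "R" else 0 for i in s)   # X_i = 1 if draw i is red
    joint[x] += Fraction(1, len(samples))

# Exchangeability: the joint pmf is unchanged when coordinates are permuted.
exchangeable = all(
    joint[x] == joint[tuple(x[i] for i in perm)]
    for x in list(joint)
    for perm in permutations(range(n))
)

# The indicators are identically distributed but not independent.
p_first = sum(q for x, q in joint.items() if x[0] == 1)
p_second = sum(q for x, q in joint.items() if x[1] == 1)
p_first_two = sum(q for x, q in joint.items() if x[0] == 1 and x[1] == 1)

print(exchangeable)                       # True
print(p_first, p_second, p_first_two)     # 3/5 3/5 3/10
```

Since \( 3/10 \ne (3/5)^2 \), the first two indicators are dependent, so exchangeability is strictly weaker than being IID.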
Tail Events and Variables
Suppose again that we have a random experiment modeled by a probability space \( (S, \mathscr S, \P) \).
Suppose that \((X_1, X_2, \ldots)\) is a sequence of random variables. The tail sigma algebra of the sequence is \[ \mathscr T = \bigcap_{n=1}^\infty \sigma\{X_n, X_{n+1}, \ldots\} \]
- An event \(B \in \mathscr T\) is a tail event for the sequence.
- A random variable \( Y \) that is measurable with respect to \( \mathscr T \) is a tail random variable for the sequence.
Informally, a tail event (random variable) is an event (random variable) that can be defined in terms of \(\{X_n, X_{n+1}, \ldots\}\) for each \(n \in \N_+\). The tail sigma algebra for a sequence of events \( (A_1, A_2, \ldots) \) is defined analogously (or simply let \(X_k = \bs{1}(A_k)\), the indicator variable of \(A_k\), for each \(k\)). For the following results, you may need to review some of the definitions in the section on Convergence.
Suppose that \((A_1, A_2, \ldots)\) is a sequence of events.
- If the sequence is increasing then \(\lim_{n \to \infty} A_n = \bigcup_{n=1}^\infty A_n\) is a tail event of the sequence.
- If the sequence is decreasing then \(\lim_{n \to \infty} A_n = \bigcap_{n=1}^\infty A_n\) is a tail event of the sequence.
Proof
- If the sequence is increasing then \( \bigcup_{n=1}^\infty A_n = \bigcup_{n=k}^\infty A_n \in \sigma\{A_k, A_{k+1}, \ldots\}\) for every \( k \in \N_+ \).
- If the sequence is decreasing then \( \bigcap_{n=1}^\infty A_n = \bigcap_{n=k}^\infty A_n \in \sigma\{A_k, A_{k+1}, \ldots\} \) for every \( k \in \N_+ \).
Suppose again that \( (A_1, A_2, \ldots) \) is a sequence of events. Each of the following is a tail event of the sequence:
- \(\limsup_{n \to \infty} A_n = \bigcap_{n=1}^\infty \bigcup_{i=n}^\infty A_i\)
- \(\liminf_{n \to \infty} A_n = \bigcup_{n=1}^\infty \bigcap_{i=n}^\infty A_i\)
Proof
- The events \( \bigcup_{i=n}^\infty A_i \) are decreasing in \( n \) and hence \( \limsup_{n \to \infty} A_n = \lim_{n \to \infty} \bigcup_{i=n}^\infty A_i \in \mathscr T \) by the previous result.
- The events \( \bigcap_{i=n}^\infty A_i \) are increasing in \( n \) and hence \( \liminf_{n \to \infty} A_n = \lim_{n \to \infty} \bigcap_{i=n}^\infty A_i \in \mathscr T \) by the previous result.
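As a simple worked example (the events \( A \) and \( B \) here are hypothetical, introduced only for illustration), suppose that \( A_n = A \) when \( n \) is odd and \( A_n = B \) when \( n \) is even. Then for every \( n \in \N_+ \), \[ \bigcup_{i=n}^\infty A_i = A \cup B, \quad \bigcap_{i=n}^\infty A_i = A \cap B \] and therefore \[ \limsup_{n \to \infty} A_n = A \cup B, \quad \liminf_{n \to \infty} A_n = A \cap B \] Neither limit changes if finitely many of the events \( A_n \) are altered, which is the defining feature of a tail event.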
Suppose that \( \bs X = (X_1, X_2, \ldots) \) is a sequence of real-valued random variables.
- \(\{X_n \text{ converges as } n \to \infty\}\) is a tail event for \( \bs X \).
- \( \liminf_{n \to \infty} X_n \) is a tail random variable for \( \bs X \).
- \( \limsup_{n \to \infty} X_n \) is a tail random variable for \( \bs X \).
Proof
- The Cauchy criterion for convergence (named for Augustin Cauchy of course) states that \( X_n \) converges as \( n \to \infty \) if and only if for every \( \epsilon > 0 \) there exists \( N \in \N_+ \) (depending on \( \epsilon \)) such that if \(m, \, n \ge N \) then \( \left|X_n - X_m\right| \lt \epsilon \). In this criterion, we can without loss of generality take \( \epsilon \) to be rational, and for a given \( k \in \N_+ \) we can insist that \( m, \, n \ge k \). With these restrictions, the event of convergence is expressed through countable intersections and unions of events of the form \( \left\{\left|X_n - X_m\right| \lt \epsilon\right\} \) with \( m, \, n \ge k \), each of which is in \( \sigma\{X_k, X_{k+1}, \ldots\} \).
- Recall that \( \liminf_{n \to \infty} X_n = \lim_{n \to \infty} \inf\{X_k: k \ge n\} \).
- Similarly, recall that \( \limsup_{n \to \infty} X_n = \lim_{n \to \infty} \sup\{X_k: k \ge n\} \).
The random variable in part (b) may take the value \( -\infty \), and the random variable in (c) may take the value \( \infty \). From parts (b) and (c) together, note that if \( X_n \to X_\infty \) as \( n \to \infty \) on the set of outcomes \( S \), then \( X_\infty \) is a tail random variable for \( \bs X \).
There are a number of zero-one laws in probability. These are theorems that give conditions under which an event will be essentially deterministic; that is, have probability 0 or probability 1. Interestingly, it can sometimes be difficult to determine which of these extremes is actually the case. The following result is the Kolmogorov zero-one law, named for Andrey Kolmogorov. It states that an event in the tail \(\sigma\)-algebra of an independent sequence will have probability 0 or 1.
Suppose that \( \bs X = (X_1, X_2, \ldots) \) is an independent sequence of random variables.
- If \(B\) is a tail event for \( \bs X \) then \(\P(B) = 0\) or \(\P(B) = 1\).
- If \( Y \) is a real-valued tail random variable for \( \bs X \) then \( Y \) is constant with probability 1.
Proof
- By definition \( B \in \sigma\{X_{n+1}, X_{n+2}, \ldots\} \) for each \( n \in \N_+ \), and hence \(\{X_1, X_2, \ldots, X_n, \bs{1}_B\}\) is an independent set of random variables for each \( n \). Since independence of an infinite collection is determined by its finite subcollections, \(\{X_1, X_2, \ldots, \bs{1}_B\}\) is an independent set of random variables. But \( B \in \sigma\{X_1, X_2, \ldots\} \), so it follows that the event \(B\) is independent of itself, that is, \( \P(B) = \P(B \cap B) = [\P(B)]^2 \). Therefore \(\P(B) = 0\) or \(\P(B) = 1\).
- The function \(y \mapsto \P(Y \le y) \) on \( \R \) is the (cumulative) distribution function of \( Y \). This function is clearly increasing. Moreover, simple applications of the continuity theorems show that it is right continuous and that \( \P(Y \le y) \to 0 \) as \( y \to -\infty \) and \( \P(Y \le y) \to 1 \) as \( y \to \infty \). (Explicit proofs are given in the section on distribution functions in the chapter on Distributions.) But since \( Y \) is a tail random variable, \( \{Y \le y\} \) is a tail event and hence \( \P(Y \le y) \in \{0, 1\} \) for each \( y \in \R \). It follows that there exists \( c \in \R \) such that \( \P(Y \le y) = 0 \) for \( y \lt c \) and \( \P(Y \le y) = 1 \) for \( y \ge c \). Hence \( \P(Y = c) = 1 \).
From the Kolmogorov zero-one law and the result above, note that if \((A_1, A_2, \ldots)\) is a sequence of independent events, then \(\limsup_{n \to \infty} A_n\) must have probability 0 or 1. The Borel-Cantelli lemmas give conditions that determine which of these is correct:
Suppose that \( (A_1, A_2, \ldots) \) is a sequence of independent events.
- If \( \sum_{i=1}^\infty \P(A_i) \lt \infty \) then \( \P\left(\limsup_{n \to \infty} A_n\right) = 0 \).
- If \( \sum_{i=1}^\infty \P(A_i) = \infty \) then \( \P\left(\limsup_{n \to \infty} A_n\right) = 1 \).
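The dichotomy can be checked numerically in a simple case. The sketch below is only an illustration under stated assumptions: the function name `prob_some_event` and the two probability sequences \( \P(A_i) = 1 / (i + 1) \) (divergent sum) and \( \P(A_i) = 1 / (i + 1)^2 \) (convergent sum) are choices made here, not part of the text. It uses the fact that for independent events, \( \P\left(\bigcup_{i=n}^m A_i\right) = 1 - \prod_{i=n}^m [1 - \P(A_i)] \), and exact rational arithmetic.

```python
from fractions import Fraction

def prob_some_event(p, n, m):
    """P(A_n or A_{n+1} or ... or A_m) for independent events with P(A_i) = p(i)."""
    prob_none = Fraction(1)
    for i in range(n, m + 1):
        prob_none *= 1 - p(i)  # independence: probabilities of complements multiply
    return 1 - prob_none

def divergent(i):
    """P(A_i) = 1 / (i + 1); the sum over i is infinite."""
    return Fraction(1, i + 1)

def convergent(i):
    """P(A_i) = 1 / (i + 1)^2; the sum over i is finite."""
    return Fraction(1, (i + 1) ** 2)

# Probability that at least one of A_10, ..., A_9999 occurs in each case.
p_div = prob_some_event(divergent, 10, 9999)
p_conv = prob_some_event(convergent, 10, 9999)
```

For these particular sequences the products telescope. In the divergent case \( \P\left(\bigcup_{i=n}^m A_i\right) = 1 - n / (m + 1) \), which increases to 1 as \( m \to \infty \) for every \( n \), consistent with \( \P\left(\limsup_{n \to \infty} A_n\right) = 1 \). In the convergent case the limit as \( m \to \infty \) is \( 1 / (n + 1) \), which decreases to 0 as \( n \to \infty \), consistent with \( \P\left(\limsup_{n \to \infty} A_n\right) = 0 \).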
Another proof of the Kolmogorov zero-one law will be given using the martingale convergence theorem.
Examples and Exercises
As always, be sure to try the computational exercises and proofs yourself before reading the answers and proofs in the text.
Counterexamples
Equal probability certainly does not imply equivalent events.
Consider the simple experiment of tossing a fair coin. The event that the coin lands heads and the event that the coin lands tails have the same probability, but are not equivalent.
Proof
Let \( S \) denote the sample space, and \( H \) the event of heads, so that \( H^c \) is the event of tails. Since the coin is fair, \( \P(H) = \P(H^c) = \frac{1}{2} \). But \( H \bigtriangleup H^c = S\), so \( \P(H \bigtriangleup H^c) = 1 \), so \( H \) and \( H^c \) are as far from equivalent as possible.
Similarly, having the same distribution does not imply that random variables are equivalent.
Consider the experiment of rolling a standard, fair die. Let \( X \) denote the score and \( Y = 7 - X \). Then \( X \) and \( Y \) have the same distribution but are not equivalent.
Proof
Since the die is fair, \( X \) is uniformly distributed on \(S = \{1, 2, 3, 4, 5, 6\} \). Also \( \P(Y = k) = \P(X = 7 - k) = \frac{1}{6} \) for \( k \in S \), so \( Y \) also has the uniform distribution on \( S \). But \( \P(X = Y) = \P\left(X = \frac{7}{2}\right) = 0 \), so \( X \) and \( Y \) are as far from equivalent as possible.
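The computation in this proof can be verified by brute-force enumeration of the six outcomes. This is a minimal sketch; the helper name `distribution` is hypothetical and introduced only for the illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)       # scores of a standard die
weight = Fraction(1, 6)      # fair die: each outcome has probability 1/6

def distribution(f):
    """Probability mass function of the random variable s -> f(s)."""
    pmf = {}
    for s in outcomes:
        pmf[f(s)] = pmf.get(f(s), Fraction(0)) + weight
    return pmf

dist_x = distribution(lambda s: s)        # X is the score
dist_y = distribution(lambda s: 7 - s)    # Y = 7 - X

# P(X = Y): total probability of the outcomes s with s == 7 - s (there are none).
p_equal = sum((weight for s in outcomes if s == 7 - s), Fraction(0))
```

Here `dist_x` and `dist_y` are the same dictionary of probabilities, while `p_equal` is zero, matching the conclusion of the proof.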
Consider the experiment of rolling two standard, fair dice and recording the sequence of scores \( (X, Y) \). Then \( X \) and \( Y \) are independent and have the same distribution, but are not equivalent.
Proof
Since the dice are fair, \( (X, Y) \) has the uniform distribution on \( \{1, 2, 3, 4, 5, 6\}^2 \). Equivalently, \( X \) and \( Y \) are independent, and each has the uniform distribution on \( \{1, 2, 3, 4, 5, 6\} \). But \( \P(X = Y) = \frac{1}{6} \), so \( X \) and \( Y \) are not equivalent.
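The two-dice example can likewise be checked by enumerating the 36 equally likely outcomes. In this sketch the helper name `prob` is hypothetical; it simply adds the probabilities of the outcomes in an event.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
pairs = list(product(faces, repeat=2))   # all 36 outcomes (x, y)
weight = Fraction(1, len(pairs))         # fair dice: uniform on the pairs

def prob(event):
    """Probability of the set of outcomes (x, y) for which event((x, y)) is true."""
    return sum((weight for pair in pairs if event(pair)), Fraction(0))

# Independence: P(X = i, Y = j) = P(X = i) P(Y = j) for all i, j.
independent = all(
    prob(lambda xy: xy == (i, j))
    == prob(lambda xy: xy[0] == i) * prob(lambda xy: xy[1] == j)
    for i in faces
    for j in faces
)

# Same marginal distribution for X and Y.
same_distribution = all(
    prob(lambda xy: xy[0] == i) == prob(lambda xy: xy[1] == i) for i in faces
)

# P(X = Y): the six diagonal outcomes.
p_equal = prob(lambda xy: xy[0] == xy[1])
```

The enumeration confirms that \( X \) and \( Y \) are independent with the same distribution, while \( \P(X = Y) = \frac{1}{6} \lt 1 \).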