4.13: Kernels and Operators
The goal of this section is to study a type of mathematical object that arises naturally in the context of conditional expected value and parametric distributions, and is of fundamental importance in the study of stochastic processes, particularly Markov processes. In a sense, the main object of study in this section is a generalization of a matrix, and the operations are generalizations of matrix operations. If you keep this in mind, this section may seem less abstract.
Basic Theory
Definitions
Recall that a measurable space \( (S, \mathscr S) \) consists of a set \( S \) and a \( \sigma \)-algebra \( \mathscr S \) of subsets of \( S \). If \( \mu \) is a positive measure on \( (S, \mathscr S) \), then \( (S, \mathscr S, \mu) \) is a measure space. The two most important special cases that we have studied frequently are
- Discrete: \(S\) is countable, \(\mathscr S = \mathscr P(S)\) is the collection of all subsets of \(S\), and \( \mu = \# \) is counting measure on \( (S, \mathscr S) \).
- Euclidean: \(S\) is a measurable subset of \(\R^n\) for some \(n \in \N_+\), \(\mathscr S\) is the collection of subsets of \(S\) that are also measurable, and \( \mu = \lambda_n \) is \( n \)-dimensional Lebesgue measure on \( (S, \mathscr S) \).
More generally, \( S \) usually comes with a topology that is locally compact, Hausdorff, with a countable base (LCCB), and \( \mathscr S \) is the Borel \( \sigma \)-algebra, the \( \sigma \)-algebra generated by the topology (the collection of open subsets of \( S \)). The measure \( \mu \) is usually a Borel measure, and so satisfies \( \mu(C) \lt \infty \) if \( C \subseteq S \) is compact. A discrete measure space is of this type, corresponding to the discrete topology. A Euclidean measure space is also of this type, corresponding to the Euclidean topology, if \( S \) is open or closed (which is usually the case). In the discrete case, every function from \( S \) to another measurable space is measurable, and every function from \( S \) to another topological space is continuous, so the measure theory is not really necessary.
Recall also that the measure space \((S, \mathscr S, \mu)\) is \(\sigma\)-finite if there exists a countable collection \(\{A_i: i \in I\} \subseteq \mathscr S\) such that \(\mu(A_i) \lt \infty\) for \(i \in I\) and \(S = \bigcup_{i \in I} A_i\). If \((S, \mathscr S, \mu)\) is a Borel measure space corresponding to an LCCB topology, then it is \(\sigma\)-finite.
If \(f: S \to \R\) is measurable, define \( \| f \| = \sup\{\left|f(x)\right|: x \in S\}\). Of course we may well have \(\|f\| = \infty\). Let \( \mathscr B(S) \) denote the collection of bounded measurable functions \( f: S \to \R \). Under the usual operations of pointwise addition and scalar multiplication, \( \mathscr B(S) \) is a vector space, and \(\| \cdot \|\) is the natural norm on this space, known as the supremum norm. This vector space plays an important role.
In this section, it is sometimes more natural to write integrals with respect to the positive measure \( \mu \) with the differential before the integrand, rather than after. However, rest assured that this is mere notation; the meaning of the integral is the same. So if \( f: S \to \R \) is measurable then we may write the integral of \( f \) with respect to \( \mu \) in operator notation as \[ \mu f = \int_S \mu(dx) f(x) \] assuming, as usual, that the integral exists. This will be the case if \( f \) is nonnegative, although \( \infty \) is a possible value. More generally, the integral exists in \( \R \cup \{-\infty, \infty\} \) if \( \mu f^+ \lt \infty \) or \( \mu f^- \lt \infty\) where \( f^+ \) and \( f^- \) are the positive and negative parts of \( f \). If both are finite, the integral exists in \( \R \) (and \( f \) is integrable with respect to \( \mu \)). If \( \mu \) is a probability measure and we think of \( (S, \mathscr S) \) as the sample space of a random experiment, then we can think of \( f \) as a real-valued random variable, in which case our new notation is not too far from our traditional expected value \( \E(f) \). Our main definition comes next.
Suppose that \( (S, \mathscr S) \) and \( (T, \mathscr T) \) are measurable spaces. A kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \) is a function \( K: S \times \mathscr T \to [0, \infty] \) such that
- \( x \mapsto K(x, A) \) is a measurable function from \(S\) into \([0, \infty]\) for each \( A \in \mathscr T \).
- \( A \mapsto K(x, A) \) is a positive measure on \( \mathscr T \) for each \( x \in S \).
If \( (T, \mathscr T) = (S, \mathscr S) \), then \( K \) is said to be a kernel on \( (S, \mathscr S) \).
There are several classes of kernels that deserve special names.
Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \). Then
- \(K\) is \(\sigma\)-finite if the measure \(K(x, \cdot)\) is \(\sigma\)-finite for every \(x \in S\).
- \( K \) is finite if \( K(x, T) \lt \infty \) for every \( x \in S \).
- \( K \) is bounded if \( K(x, T) \) is bounded in \( x \in S \).
- \( K \) is a probability kernel if \( K(x, T) = 1 \) for every \( x \in S \).
Define \( \|K\| = \sup\{K(x, T): x \in S\} \), so that \(\|K\| \lt \infty\) if \(K\) is a bounded kernel and \(\|K\| = 1\) if \(K\) is a probability kernel.
So a probability kernel is bounded, a bounded kernel is finite, and a finite kernel is \(\sigma\)-finite. The terms stochastic kernel and Markov kernel are also used for probability kernels, and for a probability kernel \( \|K\| = 1 \) of course. The terms are consistent with terms used for measures: \( K \) is a finite kernel if and only if \( K(x, \cdot) \) is a finite measure for each \( x \in S \), and \( K \) is a probability kernel if and only if \( K(x, \cdot) \) is a probability measure for each \( x \in S \). Note that \( \|K\| \) is simply the supremum norm of the function \( x \mapsto K(x, T) \).
A kernel defines two natural integral operators, by operating on the left with measures, and by operating on the right with functions. As usual, we are often a bit casual with the question of existence. Basically in this section, we assume that any integrals mentioned exist.
Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \).
- If \( \mu \) is a positive measure on \( (S, \mathscr S) \), then \( \mu K \) defined as follows is a positive measure on \( (T, \mathscr T) \): \[ \mu K(A) = \int_S \mu(dx) K(x, A), \quad A \in \mathscr T \]
- If \( f: T \to \R \) is measurable, then \( K f: S \to \R \) defined as follows is measurable (assuming that the integrals exist in \( \R \)): \[K f(x) = \int_T K(x, dy) f(y), \quad x \in S\]
Proof
- Clearly \( \mu K(A) \ge 0 \) for \( A \in \mathscr T \). Suppose that \( \{A_j: j \in J\} \) is a countable collection of disjoint sets in \( \mathscr T \) and \( A = \bigcup_{j \in J} A_j \). Then \begin{align*} \mu K(A) & = \int_S \mu(dx) K(x, A) = \int_S \mu(dx) \left(\sum_{j \in J} K(x, A_j) \right) \\ & = \sum_{j \in J} \int_S \mu(dx) K(x, A_j) = \sum_{j \in J} \mu K(A_j) \end{align*} The interchange of sum and integral is justified since the terms are nonnegative.
- The measurability of \( K f \) follows from the measurability of \( f \) and of \( x \mapsto K(x, A) \) for \( A \in \mathscr T \), and from basic properties of the integral.
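When \( S \) and \( T \) are finite, the two operators reduce to finite sums. The following is a minimal sketch under that assumption; the particular matrix `K`, measure `mu`, function `f`, and the helper names are made up for illustration and are not part of the text.

```python
# Finite sketch: S has 2 points, T has 3 points; K[x][y] stands for K(x, {y}).
K = [[1, 2, 1],
     [0, 3, 1]]

def measure_times_kernel(mu, K):
    """(mu K)({y}) = sum over x of mu({x}) * K(x, {y}); the result is a measure on T."""
    return [sum(mu[x] * K[x][y] for x in range(len(K))) for y in range(len(K[0]))]

def kernel_times_function(K, f):
    """(K f)(x) = sum over y of K(x, {y}) * f(y); the result is a function on S."""
    return [sum(K[x][y] * f[y] for y in range(len(f))) for x in range(len(K))]

mu = [2, 1]       # a positive measure on S, given by its point masses
f = [1, -1, 4]    # a real-valued function on T

mu_K = measure_times_kernel(mu, K)
K_f = kernel_times_function(K, f)
print(mu_K)  # [2, 7, 3]
print(K_f)   # [3, 1]
```

Note that the measure operates on the left and produces something indexed by \( T \), while the function operates on the right and produces something indexed by \( S \).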
Thus, a kernel transforms measures on \( (S, \mathscr S) \) into measures on \( (T, \mathscr T) \), and transforms certain measurable functions from \( T \) to \( \R \) into measurable functions from \( S \) to \( \R \). Again, part (b) assumes that \( f \) is integrable with respect to the measure \( K(x, \cdot) \) for every \( x \in S \). In particular, the last statement will hold in the following important special case:
Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \) and that \( f \in \mathscr B(T) \).
- If \( K \) is finite then \(Kf\) is defined and \(\|Kf\| \le \|K\| \|f\|\).
- If \(K\) is bounded then \(Kf \in \mathscr B(S)\).
Proof
- If \( K \) is finite then \[ K \left|f\right|(x) = \int_T K(x, dy) \left|f(y)\right| \le \int_T K(x, dy) \|f\| = \|f\| K(x, T) \lt \infty \quad x \in S \] Hence \( f \) is integrable with respect to \( K(x, \cdot) \) for each \( x \in S \) so \(Kf\) is defined. Continuing with our inequalities, we have \(|K f(x)| \le K |f|(x) \le \|f\| K(x, T) \le \|f\| \|K\|\) so \(\|Kf\| \le \|K\| \|f\|\). Moreover equality holds when \( f = \bs{1}_T \), the constant function 1 on \( T \).
- If \( K \) is bounded then \( \|K\| \lt \infty \) so from (a), \( \|K f \| \lt \infty \).
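The norm inequality can be checked numerically in a finite setting, where \( \|K\| \) is the largest row sum. This is only an illustrative sketch; the matrix and functions are arbitrary choices, not taken from the text.

```python
# Finite sketch: rows indexed by S, columns by T; K[x][y] stands for K(x, {y}).
K = [[1, 2, 1],
     [0, 3, 1]]

def kernel_times_function(K, f):
    """(K f)(x) = sum over y of K(x, {y}) * f(y)."""
    return [sum(k * v for k, v in zip(row, f)) for row in K]

def sup_norm(f):
    """Supremum norm of a function on a finite set."""
    return max(abs(v) for v in f)

def kernel_norm(K):
    """||K|| = sup over x of K(x, T), which is the largest row sum here."""
    return max(sum(row) for row in K)

f = [1, -1, 4]
one_T = [1, 1, 1]   # the constant function 1 on T

bound_holds = sup_norm(kernel_times_function(K, f)) <= kernel_norm(K) * sup_norm(f)
equality_for_one = sup_norm(kernel_times_function(K, one_T)) == kernel_norm(K) * sup_norm(one_T)
print(bound_holds, equality_for_one)  # True True
```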
The identity kernel \( I \) on the measurable space \( (S, \mathscr S) \) is defined by \( I(x, A) = \bs{1}(x \in A) \) for \( x \in S \) and \( A \in \mathscr S \).
Thus, \( I(x, A) = 1 \) if \( x \in A \) and \( I(x, A) = 0 \) if \( x \notin A \). So \( x \mapsto I(x, A) \) is the indicator function of \( A \in \mathscr S \), while \( A \mapsto I(x, A) \) is point mass at \( x \in S \). Clearly the identity kernel is a probability kernel. If we need to indicate the dependence on the particular space, we will add a subscript. The following result justifies the name.
Let \( I \) denote the identity kernel on \( (S, \mathscr S) \).
- If \( \mu \) is a positive measure on \( (S, \mathscr S) \) then \( \mu I = \mu \).
- If \( f: S \to \R \) is measurable, then \( I f = f \).
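Both parts can be seen by unpacking the definitions. The following short derivation is a sketch, using only that \( x \mapsto I(x, A) \) is the indicator function of \( A \) and that \( A \mapsto I(x, A) \) is point mass at \( x \): \begin{align*} \mu I(A) & = \int_S \mu(dx) I(x, A) = \int_S \mu(dx) \bs{1}_A(x) = \mu(A), \quad A \in \mathscr S \\ I f(x) & = \int_S I(x, dy) f(y) = f(x), \quad x \in S \end{align*} The second line holds because integrating \( f \) with respect to point mass at \( x \) simply evaluates \( f \) at \( x \).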
Constructions
We can create a new kernel from two given kernels, by the usual operations of addition and scalar multiplication.
Suppose that \( K \) and \( L\) are kernels from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), and that \( c \in [0, \infty) \). Then \( c K \) and \( K + L \) defined below are also kernels from \( (S, \mathscr S) \) to \( (T, \mathscr T) \).
- \((c K)(x, A) = c K(x, A)\) for \( x \in S \) and \( A \in \mathscr T \).
- \((K + L)(x, A) = K(x, A) + L(x, A)\) for \( x \in S \) and \( A \in \mathscr T \).
If \( K \) and \( L \) are \( \sigma \)-finite (finite) (bounded) then \( c K \) and \( K + L \) are \( \sigma \)-finite (finite) (bounded), respectively.
Proof
These results are simple.
- Since \( x \mapsto K(x, A) \) is measurable for \( A \in \mathscr T \), so is \( x \mapsto c K(x, A) \). Since \( A \mapsto K(x, A) \) is a positive measure on \( (T, \mathscr T) \) for \( x \in S \), so is \( A \mapsto c K(x, A) \) since \( c \ge 0 \).
- Since \( x \mapsto K(x, A) \) and \( x \mapsto L(x, A) \) are measurable for \( A \in \mathscr T \), so is \( x \mapsto K(x, A) + L(x, A) \). Since \( A \mapsto K(x, A) \) and \( A \mapsto L(x, A) \) are positive measures on \( (T, \mathscr T) \) for \( x \in S \), so is \( A \mapsto K(x, A) + L(x, A) \).
A simple corollary of the last result is that if \(a, \, b \in [0, \infty)\) then \(a K + b L\) is a kernel from \((S, \mathscr S)\) to \((T, \mathscr T)\). In particular, if \(K, \, L\) are probability kernels and \(p \in (0, 1)\) then \(p K + (1 - p) L\) is a probability kernel. A more interesting and important way to form a new kernel from two given kernels is via a multiplication operation.
Suppose that \( K \) is a kernel from \( (R, \mathscr R) \) to \( (S, \mathscr S) \) and that \( L \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \). Then \( K L \) defined as follows is a kernel from \( (R, \mathscr R) \) to \( (T, \mathscr T) \): \[ K L(x, A) = \int_S K(x, dy) L(y, A), \quad x \in R, \, A \in \mathscr T \]
- If \(K\) is finite and \(L\) is bounded then \(K L\) is finite.
- If \(K\) and \(L\) are bounded then \(K L\) is bounded.
- If \(K\) and \(L\) are stochastic then \(K L\) is stochastic.
Proof
The measurability of \( x \mapsto (K L)(x, A) \) for \( A \in \mathscr T \) follows from basic properties of the integral. For the second property, fix \( x \in R \). Clearly \( K L(x, A) \ge 0 \) for \( A \in \mathscr T \). Suppose that \( \{A_j: j \in J\} \) is a countable collection of disjoint sets in \( \mathscr T \) and \( A = \bigcup_{j \in J} A_j \). Then \begin{align*} K L(x, A) & = \int_S K(x, dy) L(y, A) = \int_S K(x, dy) \left(\sum_{j \in J} L(y, A_j)\right) \\ & = \sum_{j \in J} \int_S K(x, dy) L(y, A_j) = \sum_{j \in J} K L(x, A_j) \end{align*} The interchange of sum and integral is justified since the terms are nonnegative.
Once again, the identity kernel lives up to its name:
Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \). Then
- \( I_S K = K \)
- \( K I_T = K \)
The next several results show that the operations are associative whenever they make sense.
Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), \( \mu \) is a positive measure on \( \mathscr S \), \( c \in [0, \infty) \), and \( f: T \to \R \) is measurable. Then, assuming that the appropriate integrals exist,
- \( c (\mu K) = (c \mu) K \)
- \( c (K f) = (c K) f \)
- \( (\mu K) f = \mu (K f)\)
Proof
These results follow easily from the definitions.
- The common measure on \( \mathscr T \) is \( c \mu K(A) = c \int_S \mu(dx) K(x, A) \) for \( A \in \mathscr T \).
- The common function from \( S \) to \( \R \) is \( c K f(x) = c \int_T K(x, dy) f(y) \) for \( x \in S \), assuming that the integral exists for \( x \in S \).
- The common real number is \( \mu K f = \int_S \mu(dx) \int_T K(x, dy) f(y) \), assuming that the integrals exist.
Suppose that \( K \) is a kernel from \( (R, \mathscr R) \) to \( (S, \mathscr S) \) and \( L \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \). Suppose also that \( \mu \) is a positive measure on \( (R, \mathscr R) \), \( f: T \to \R \) is measurable, and \( c \in [0, \infty) \). Then, assuming that the appropriate integrals exist,
- \( (\mu K) L = \mu (K L) \)
- \( K ( L f) = (K L) f \)
- \( c (K L) = (c K) L \)
Proof
These results follow easily from the definitions.
- The common measure on \( (T, \mathscr T) \) is \( \mu K L(A) = \int_R \mu(dx) \int_S K(x, dy) L(y, A) \) for \( A \in \mathscr T \).
- The common measurable function from \( R \) to \( \R \) is \( K L f(x) = \int_S K(x, dy) \int_T L(y, dz) f(z) \) for \( x \in R \), assuming that the integral exists for \( x \in R \).
- The common kernel from \( (R, \mathscr R) \) to \( (T, \mathscr T) \) is \( c K L(x, A) = c \int_S K(x, dy) L(y, A) \) for \( x \in R \) and \( A \in \mathscr T \).
Suppose that \( K \) is a kernel from \( (R, \mathscr R) \) to \( (S, \mathscr S) \), \( L \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), and \( M \) is a kernel from \( (T, \mathscr T) \) to \( (U, \mathscr U)\). Then \( (K L) M = K (L M) \).
Proof
This result follows easily from the definitions. The common kernel from \( (R, \mathscr R) \) to \( (U, \mathscr U) \) is \[K L M(x, A) = \int_S K(x, dy) \int_T L(y, dz) M(z, A), \quad x \in R, \, A \in \mathscr U \]
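For finite spaces, the associative laws above are the familiar associativity of matrix multiplication, with measures as row vectors and functions as column vectors. The sketch below checks three of them on arbitrary integer matrices; the matrices and the helper `matmul` are illustrative assumptions only.

```python
def matmul(A, B):
    """Product of list-of-lists matrices A (m x n) and B (n x p)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

K = [[1, 2], [0, 1], [3, 1]]   # kernel from R (3 points) to S (2 points)
L = [[2, 0, 1], [1, 1, 0]]     # kernel from S (2 points) to T (3 points)
M = [[1, 1], [0, 2], [4, 0]]   # kernel from T (3 points) to U (2 points)

mu = [[1, 2, 0]]               # measure on R as a 1 x 3 row vector
f = [[1], [-1], [2]]           # function on T as a 3 x 1 column vector

measure_law = matmul(matmul(mu, K), L) == matmul(mu, matmul(K, L))   # (mu K) L = mu (K L)
function_law = matmul(K, matmul(L, f)) == matmul(matmul(K, L), f)    # K (L f) = (K L) f
kernel_law = matmul(matmul(K, L), M) == matmul(K, matmul(L, M))      # (K L) M = K (L M)
print(measure_law, function_law, kernel_law)  # True True True
```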
The next several results show that the distributive property holds whenever the operations make sense.
Suppose that \( K \) and \( L \) are kernels from \( (R, \mathscr R) \) to \( (S, \mathscr S) \) and that \( M \) and \( N \) are kernels from \( (S, \mathscr S) \) to \( (T, \mathscr T) \). Suppose also that \( \mu \) is a positive measure on \( (R, \mathscr R) \) and that \( f: S \to \R \) is measurable. Then, assuming that the appropriate integrals exist,
- \((K + L) M = K M + L M\)
- \( K (M + N) = K M + K N \)
- \( \mu (K + L) = \mu K + \mu L \)
- \( (K + L) f = K f + L f \)
Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), and that \( \mu \) and \( \nu \) are positive measures on \( (S, \mathscr S) \), and that \( f \) and \( g \) are measurable functions from \( T \) to \( \R \). Then, assuming that the appropriate integrals exist,
- \( (\mu + \nu) K = \mu K + \nu K \)
- \( K(f + g) = K f + K g \)
- \( \mu(f + g) = \mu f + \mu g \)
- \( (\mu + \nu) f = \mu f + \nu f \)
In particular, note that if \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), then the transformation \( \mu \mapsto \mu K \) defined for positive measures on \( (S, \mathscr S)\), and the transformation \( f \mapsto K f \) defined for measurable functions \( f: T \to \R \) (for which \( K f \) exists), are both linear operators. If \( \mu \) is a positive measure on \( (S, \mathscr S) \), then the integral operator \( f \mapsto \mu f \) defined for measurable \( f: S \to \R \) (for which \( \mu f \) exists) is also linear, but of course, we already knew that. Finally, note that the operator \( f \mapsto K f \) is positive: if \( f \ge 0 \) then \( K f \ge 0 \). Here is the important summary of our results when the kernel is bounded.
If \( K \) is a bounded kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), then \( f \mapsto K f \) is a bounded, linear transformation from \( \mathscr B(T) \) to \( \mathscr B(S) \) and \( \|K\| \) is the norm of the transformation.
The commutative property for the product of kernels fails with a passion. If \( K \) and \( L \) are kernels, then depending on the measurable spaces, \( K L \) may be well defined, but not \( L K \). Even if both products are defined, they may be kernels from or to different measurable spaces. Even if both are defined from and to the same measurable spaces, it may well happen that \( K L \neq L K \). Some examples are given below.
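As a quick illustration of the last point, here is a sketch with two kernels on the same two-point space whose products differ. The matrices are an arbitrary choice made for this example.

```python
def matmul(A, B):
    """Product of list-of-lists matrices A (m x n) and B (n x p)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

# Two kernels on the same two-point space, written as matrices.
K = [[1, 1],
     [0, 1]]
L = [[1, 0],
     [1, 1]]

KL = matmul(K, L)
LK = matmul(L, K)
print(KL)        # [[2, 1], [1, 1]]
print(LK)        # [[1, 1], [1, 2]]
print(KL == LK)  # False
```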
If \( K \) is a kernel on \( (S, \mathscr S) \) and \( n \in \N \), we let \( K^n = K K \cdots K \), the \( n \)-fold power of \( K \). By convention, \( K^0 = I \), the identity kernel on \( S \).
Fixed points of the operators associated with a kernel turn out to be very important.
Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \).
- A positive measure \( \mu \) on \( (S, \mathscr S) \) such that \( \mu K = \mu \) is said to be invariant for \( K \).
- A measurable function \( f: T \to \R \) such that \( K f = f \) is said to be invariant for \( K \).
So in the language of linear algebra (or functional analysis), an invariant measure is a left eigenvector of the kernel, while an invariant function is a right eigenvector of the kernel, both corresponding to the eigenvalue 1. By our results above, if \( \mu \) and \( \nu \) are invariant measures and \( c \in [0, \infty) \), then \( \mu + \nu \) and \( c \mu \) are also invariant. Similarly, if \( f \) and \( g \) are invariant functions and \( c \in \R \), then \( f + g \) and \( c f \) are also invariant.
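To make the eigenvector picture concrete, the sketch below checks an invariant measure and an invariant function for a kernel on a two-point space. The stochastic matrix `P` and the measure `mu` are assumptions chosen for illustration; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction

# A probability kernel on a two-point space (each row sums to 1).
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]

def left(mu, K):
    """mu K: operate on the left with a measure (row vector times matrix)."""
    return [sum(mu[x] * K[x][y] for x in range(len(K))) for y in range(len(K[0]))]

def right(K, f):
    """K f: operate on the right with a function (matrix times column vector)."""
    return [sum(K[x][y] * f[y] for y in range(len(f))) for x in range(len(K))]

mu = [1, 2]    # candidate invariant measure (left eigenvector for eigenvalue 1)
one = [1, 1]   # constant function 1 (right eigenvector for eigenvalue 1)

mu_is_invariant = left(mu, P) == mu
scaled_is_invariant = left([3 * m for m in mu], P) == [3 * m for m in mu]
one_is_invariant = right(P, one) == one
print(mu_is_invariant, scaled_is_invariant, one_is_invariant)  # True True True
```

The constant function 1 is invariant for any probability kernel on a space, since each row of the matrix sums to 1.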
Of course we are particularly interested in probability kernels.
Suppose that \( P \) is a probability kernel from \((R, \mathscr R)\) to \( (S, \mathscr S) \) and that \( Q \) is a probability kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \). Suppose also that \( \mu \) is a probability measure on \( (R, \mathscr R) \). Then
- \( P Q \) is a probability kernel from \( (R, \mathscr R) \) to \( (T, \mathscr T) \).
- \( \mu P \) is a probability measure on \( (S, \mathscr S) \).
Proof
- We know that \( P Q \) is a kernel from \( (R, \mathscr R) \) to \( (T, \mathscr T) \). So we just need to note that \[P Q(x, T) = \int_S P(x, dy) Q(y, T) = \int_S P(x, dy) = P(x, S) = 1, \quad x \in R \]
- We know that \( \mu P \) is a positive measure on \( (S, \mathscr S) \). So we just need to note that \[ \mu P(S) = \int_R \mu(dx) P(x, S) = \int_R \mu(dx) = \mu(R) = 1 \]
As a corollary, it follows that if \( P \) is a probability kernel on \( (S, \mathscr S) \), then so is \( P^n \) for \( n \in \N \).
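The corollary is easy to check in the finite case, where \( P^n \) is an ordinary matrix power. The sketch below uses an arbitrary two-state stochastic matrix and hypothetical helper names; it is an illustration, not a proof.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of list-of-lists matrices A (m x n) and B (n x p)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def kernel_power(P, n):
    """Return P^n for a square matrix P, with P^0 equal to the identity."""
    size = len(P)
    result = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, P)
    return result

P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]

# Every row of P^5 should still be a probability distribution.
row_sums = [sum(row) for row in kernel_power(P, 5)]
nonnegative = all(v >= 0 for row in kernel_power(P, 5) for v in row)
```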
The operators associated with a kernel are of fundamental importance, and we can easily recover the kernel from the operators. Suppose that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), and let \( x \in S \) and \( A \in \mathscr T \). Then trivially, \(K \bs{1}_A(x) = K(x, A)\) where as usual, \( \bs{1}_A \) is the indicator function of \( A \). Trivially also \( \delta_x K(A) = K(x, A) \) where \( \delta_x \) is point mass at \( x \).
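For completeness, here is a sketch of why the two identities hold, written out from the definitions of the operators: \begin{align*} K \bs{1}_A(x) & = \int_T K(x, dy) \bs{1}_A(y) = \int_A K(x, dy) = K(x, A) \\ \delta_x K(A) & = \int_S \delta_x(du) K(u, A) = K(x, A) \end{align*} In the second line, integrating the function \( u \mapsto K(u, A) \) with respect to point mass at \( x \) evaluates that function at \( x \).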
Kernel Functions
Usually our measurable spaces are in fact measure spaces, with natural measures associated with the spaces, as in the special cases described in (1). When we start with measure spaces, kernels are usually constructed from density functions in much the same way that positive measures are defined from density functions.
Suppose that \( (S, \mathscr S, \lambda) \) and \( (T, \mathscr T, \mu) \) are measure spaces. As usual, \( S \times T \) is given the product \( \sigma \)-algebra \( \mathscr S \otimes \mathscr T \). If \( k: S \times T \to [0, \infty) \) is measurable, then the function \( K \) defined as follows is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \): \[ K(x, A) = \int_A k(x, y) \mu(dy), \quad x \in S, \, A \in \mathscr T \]
Proof
The measurability of \( x \mapsto K(x, A) = \int_A k(x, y) \mu(dy) \) for \( A \in \mathscr T \) follows from a basic property of the integral. The fact that \( A \mapsto K(x, A) = \int_A k(x, y) \mu(dy) \) is a positive measure on \( \mathscr T \) for \( x \in S \) also follows from a basic property of the integral. In fact, \( y \mapsto k(x, y) \) is the density of this measure with respect to \( \mu \).
Clearly the kernel \( K \) depends on the positive measure \( \mu \) on \( (T, \mathscr T) \) as well as the function \( k \), while the measure \( \lambda \) on \( (S, \mathscr S) \) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Appropriately enough, the function \( k \) is called a kernel density function (with respect to \( \mu \)), or simply a kernel function.
Suppose again that \( (S, \mathscr S, \lambda) \) and \( (T, \mathscr T, \mu) \) are measure spaces. Suppose also \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \) with kernel function \( k \). If \( f: T \to \R \) is measurable, then, assuming that the integrals exist, \[ K f(x) = \int_T k(x, y) f(y) \mu(dy), \quad x \in S \]
Proof
This follows since the function \( y \mapsto k(x, y) \) is the density of the measure \( A \mapsto K(x, A) \) with respect to \( \mu \): \[ K f(x) = \int_T K(x, dy) f(y) = \int_T k(x, y) f(y) \mu(dy), \quad x \in S \]
A kernel function defines an operator on the left with functions on \( S \) in a completely analogous way to the operator on the right above with functions on \( T \).
Suppose again that \( (S, \mathscr S, \lambda) \) and \( (T, \mathscr T, \mu) \) are measure spaces, and that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \) with kernel function \( k \). If \( f: S \to \R \) is measurable, then the function \( f K: T \to \R \) defined as follows is also measurable, assuming that the integrals exist: \[ f K(y) = \int_S \lambda(dx) f(x) k(x, y), \quad y \in T \]
The operator defined above depends on the measure \( \lambda \) on \( (S, \mathscr S) \) as well as the kernel function \( k \), while the measure \( \mu \) on \( (T, \mathscr T) \) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Here is how our new operation on the left with functions relates to our old operation on the left with measures.
Suppose again that \( (S, \mathscr S, \lambda) \) and \( (T, \mathscr T, \mu) \) are measure spaces, and that \( K \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \) with kernel function \( k \). Suppose also that \( f: S \to [0, \infty) \) is measurable, and let \( \rho \) denote the measure on \( (S, \mathscr S) \) that has density \( f \) with respect to \( \lambda \). Then \( f K \) is the density of the measure \( \rho K \) with respect to \( \mu \).
Proof
The main tool, as usual, is an interchange of integrals. For \( B \in \mathscr T \), \begin{align*} \rho K(B) & = \int_S \rho(dx) K(x, B) = \int_S f(x) K(x, B) \lambda(dx) = \int_S f(x) \left[\int_B k(x, y) \mu(dy)\right] \lambda(dx) \\ & = \int_B \left[\int_S f(x) k(x, y) \lambda(dx)\right] \mu(dy) = \int_B f K(y) \mu(dy) \end{align*}
As always, we are particularly interested in stochastic kernels. With a kernel function, we can have doubly stochastic kernels.
Suppose again that \( (S, \mathscr S, \lambda) \) and \( (T, \mathscr T, \mu) \) are measure spaces and that \( k: S \times T \to [0, \infty) \) is measurable. Then \( k \) is a doubly stochastic kernel function if
- \( \int_T k(x, y) \mu(dy) = 1 \) for \( x \in S \)
- \( \int_S \lambda(dx) k(x, y) = 1 \) for \( y \in T \)
Of course, condition (a) simply means that the kernel associated with \( k \) is a stochastic kernel according to our original definition.
The most common and important special case is when the two spaces are the same. Thus, if \( (S, \mathscr S, \lambda) \) is a measure space and \( k : S \times S \to [0, \infty) \) is measurable, then we have an operator \( K \) that operates on the left and on the right with measurable functions \( f: S \to \R \): \begin{align*} f K(y) & = \int_S \lambda(dx) f(x) k(x, y), \quad y \in S \\ K f(x) & = \int_S k(x, y) f(y) \lambda(d y), \quad x \in S \end{align*} If \( f \) is nonnegative and \( \mu \) is the measure on \( (S, \mathscr S) \) with density function \( f \), then \( f K \) is the density function of the measure \( \mu K \) (both with respect to \( \lambda \)).
Suppose again that \( (S, \mathscr S, \lambda) \) is a measure space and \( k : S \times S \to [0, \infty) \) is measurable. Then \( k \) is symmetric if \( k(x, y) = k(y, x) \) for all \( (x, y) \in S^2 \).
Of course, if \( k \) is a symmetric, stochastic kernel function on \( (S, \mathscr S, \lambda) \) then \( k \) is doubly stochastic, but the converse is not true.
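A small counterexample for the converse can be sketched in the discrete setting with counting measure: a cyclic permutation matrix has all row sums and column sums equal to 1, but is not symmetric. The specific matrix below is an illustrative choice.

```python
# Kernel function on S = {0, 1, 2} with counting measure: k[x][y] stands for k(x, y).
# This is the cyclic shift 0 -> 1 -> 2 -> 0.
k = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

n = len(k)
row_sums = [sum(k[x][y] for y in range(n)) for x in range(n)]   # condition (a)
col_sums = [sum(k[x][y] for x in range(n)) for y in range(n)]   # condition (b)
is_symmetric = all(k[x][y] == k[y][x] for x in range(n) for y in range(n))

print(row_sums, col_sums, is_symmetric)  # [1, 1, 1] [1, 1, 1] False
```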
Suppose that \( (R, \mathscr R, \lambda) \), \( (S, \mathscr S, \mu) \), and \( (T, \mathscr T, \rho) \) are measure spaces. Suppose also that \( K \) is a kernel from \( (R, \mathscr R) \) to \( (S, \mathscr S) \) with kernel function \( k \), and that \( L \) is a kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \) with kernel function \( l \). Then the kernel \( K L \) from \( (R, \mathscr R) \) to \( (T, \mathscr T) \) has density \( k l \) given by \[ k l(x, z) = \int_S k(x, y) l(y, z) \mu(dy), \quad (x, z) \in R \times T \]
Proof
Once again, the main tool is an interchange of integrals via Fubini's theorem. Let \( x \in R \) and \( B \in \mathscr T \). Then \begin{align*} K L(x, B) & = \int_S K(x, dy) L(y, B) = \int_S k(x, y) L(y, B) \mu(dy) \\ & = \int_S k(x, y) \left[\int_B l(y, z) \rho(dz) \right] \mu(dy) = \int_B \left[\int_S k(x, y) l(y, z) \mu(dy) \right] \rho(dz) = \int_B k l(x, z) \rho(dz) \end{align*}
Examples and Special Cases
The Discrete Case
In this subsection, we assume that the measure spaces are discrete, as described in (1). Since the \( \sigma \)-algebra (all subsets) and the measure (counting measure) are understood, we don't need to reference them. Recall that integrals with respect to counting measure are sums. Suppose now that \( K \) is a kernel from the discrete space \(S\) to the discrete space \(T\). For \( x \in S \) and \( y \in T \), let \( K(x, y) = K(x, \{y\}) \). Then more generally, \[ K(x, A) = \sum_{y \in A} K(x, y), \quad x \in S, \, A \subseteq T \] The function \( (x, y) \mapsto K(x, y) \) is simply the kernel function of the kernel \( K \), as defined above, but in this case we usually don't bother with using a different symbol for the function as opposed to the kernel. The function \( K \) can be thought of as a matrix, with rows indexed by \( S \) and columns indexed by \( T \) (and so an infinite matrix if \( S \) or \( T \) is countably infinite). With this interpretation, all of the operations defined above can be thought of as matrix operations. If \( f: T \to \R \) and \( f \) is thought of as a column vector indexed by \( T \), then \( K f \) is simply the ordinary product of the matrix \( K \) and the vector \( f \); the product is a column vector indexed by \( S \): \[K f(x) = \sum_{y \in T} K(x, y) f(y), \quad x \in S \] Similarly, if \( f: S \to \R \) and \( f \) is thought of as a row vector indexed by \( S \), then \( f K \) is simply the ordinary product of the vector \( f \) and the matrix \( K \); the product is a row vector indexed by \( T \): \[ f K(y) = \sum_{x \in S} f(x) K(x, y), \quad y \in T \] If \( L \) is another kernel from \( T \) to another discrete space \( U \), then as functions, \( K L \) is simply the matrix product of \( K \) and \( L \): \[ K L(x, z) = \sum_{y \in T} K(x, y) L(y, z), \quad (x, z) \in S \times U \]
Let \( S = \{1, 2, 3\} \) and \( T = \{1, 2, 3, 4\} \). Define the kernel \( K \) from \( S \) to \( T \) by \( K(x, y) = x + y \) for \( (x, y) \in S \times T \). Define the function \( f \) on \( S \) by \( f(x) = x! \) for \( x \in S \), and define the function \( g \) on \( T \) by \( g(y) = y^2\) for \( y \in T \). Compute each of the following using matrix algebra:
- \( f K \)
- \( K g \)
Answer
In matrix form, \[ K = \left[\begin{matrix} 2 & 3 & 4 & 5 \\ 3 & 4 & 5 & 6 \\ 4 & 5 & 6 & 7 \end{matrix} \right], \quad f = \left[\begin{matrix} 1 & 2 & 6 \end{matrix} \right], \quad g = \left[\begin{matrix} 1 \\ 4 \\ 9 \\ 16 \end{matrix} \right]\]
- As a row vector indexed by \( T \), the product is \( f K = \left[\begin{matrix} 32 & 41 & 50 & 59\end{matrix}\right] \)
- As a column vector indexed by \( S \), \[ K g = \left[\begin{matrix} 130 \\ 160 \\ 190 \end{matrix}\right] \]
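The two products in this exercise can be checked with a few lines of code. This is just a verification sketch of the arithmetic, using plain lists for the matrix and vectors.

```python
from math import factorial

S = [1, 2, 3]
T = [1, 2, 3, 4]

K = [[x + y for y in T] for x in S]   # K(x, y) = x + y, rows indexed by S
f = [factorial(x) for x in S]         # f(x) = x!, a row vector indexed by S
g = [y ** 2 for y in T]               # g(y) = y^2, a column vector indexed by T

# f K is a row vector indexed by T; K g is a column vector indexed by S.
f_K = [sum(f[i] * K[i][j] for i in range(len(S))) for j in range(len(T))]
K_g = [sum(K[i][j] * g[j] for j in range(len(T))) for i in range(len(S))]

print(f_K)  # [32, 41, 50, 59]
print(K_g)  # [130, 160, 190]
```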
Let \( R = \{0, 1\} \), \( S = \{a, b\} \), and \( T = \{1, 2, 3\} \). Define the kernel \( K \) from \( R \) to \( S \), the kernel \( L \) from \( S \) to \( S \) and the kernel \( M \) from \( S \) to \( T \) in matrix form as follows: \[ K = \left[\begin{matrix} 1 & 4 \\ 2 & 3\end{matrix}\right], \; L = \left[\begin{matrix} 2 & 2 \\ 1 & 5 \end{matrix}\right], \; M = \left[\begin{matrix} 1 & 0 & 2 \\ 0 & 3 & 1 \end{matrix} \right] \] Compute each of the following kernels, or explain why the operation does not make sense:
- \( K L \)
- \( L K \)
- \( K^2 \)
- \( L^2 \)
- \( K M \)
- \( L M \)
Answer
Note that these are not just abstract matrices, but rather have rows and columns indexed by the appropriate spaces. So the products make sense only when the spaces match appropriately; it's not just a matter of the number of rows and columns.
- \( K L \) is the kernel from \( R \) to \( S \) given by \[ K L = \left[\begin{matrix} 6 & 22 \\ 7 & 19 \end{matrix} \right] \]
- \( L K \) is not defined since the column space \( S \) of \( L \) is not the same as the row space \( R \) of \( K \).
- \( K^2 \) is not defined since the row space \( R \) is not the same as the column space \( S \).
- \( L^2 \) is the kernel from \( S \) to \( S \) given by \[ L^2 = \left[\begin{matrix} 6 & 14 \\ 7 & 27 \end{matrix}\right] \]
- \( K M \) is the kernel from \( R \) to \( T \) given by \[ K M = \left[\begin{matrix} 1 & 12 & 6 \\ 2 & 9 & 7 \end{matrix} \right] \]
- \( L M \) is the kernel from \( S \) to \( T \) given by \[ L M = \left[\begin{matrix} 2 & 6 & 6 \\ 1 & 15 & 7 \end{matrix}\right] \]
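The point about index spaces can be made explicit in code by carrying the row and column spaces along with each matrix. The triple representation and the helper `product` below are hypothetical conventions adopted only for this sketch.

```python
def product(A, B):
    """Product of kernels given as (row_space, col_space, matrix) triples.

    Returns None when the column space of A differs from the row space of B,
    even if the matrix dimensions happen to agree.
    """
    rows_a, cols_a, mat_a = A
    rows_b, cols_b, mat_b = B
    if cols_a != rows_b:
        return None
    mat = [[sum(mat_a[i][k] * mat_b[k][j] for k in range(len(cols_a)))
            for j in range(len(cols_b))] for i in range(len(rows_a))]
    return (rows_a, cols_b, mat)

R, S, T = (0, 1), ("a", "b"), (1, 2, 3)
K = (R, S, [[1, 4], [2, 3]])
L = (S, S, [[2, 2], [1, 5]])
M = (S, T, [[1, 0, 2], [0, 3, 1]])

print(product(K, L)[2])  # [[6, 22], [7, 19]]
print(product(L, K))     # None
print(product(K, K))     # None
print(product(L, L)[2])  # [[6, 14], [7, 27]]
print(product(K, M)[2])  # [[1, 12, 6], [2, 9, 7]]
print(product(L, M)[2])  # [[2, 6, 6], [1, 15, 7]]
```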
Conditional Probability
An important class of probability kernels arises from the distribution of one random variable, conditioned on the value of another random variable. In this subsection, suppose that \( (\Omega, \mathscr{F}, \P) \) is a probability space, and that \( (S, \mathscr S) \) and \( (T, \mathscr T) \) are measurable spaces. Further, suppose that \( X \) and \( Y \) are random variables defined on the probability space, with \( X \) taking values in \( S \) and \( Y \) taking values in \( T \). Informally, \( X \) and \( Y \) are random variables defined on the same underlying random experiment.
The function \( P \) defined as follows is a probability kernel from \( (S, \mathscr S) \) to \( (T, \mathscr T) \), known as the conditional probability kernel of \( Y \) given \( X \). \[ P(x, A) = \P(Y \in A \mid X = x), \quad x \in S, \, A \in \mathscr T \]
Proof
Recall that for \( A \in \mathscr T \), the conditional probability \( \P(Y \in A \mid X) \) is itself a random variable, and is measurable with respect to \( \sigma(X) \). That is, \( \P(Y \in A \mid X) = P(X, A) \) for some measurable function \(x \mapsto P(x, A) \) from \( S \) to \( [0, 1] \). Then, by definition, \( \P(Y \in A \mid X = x) = P(x, A) \). Trivially, of course, \( A \mapsto P(x, A) \) is a probability measure on \( (T, \mathscr T) \) for \( x \in S \).
The operators associated with this kernel have natural interpretations.
Let \( P \) be the conditional probability kernel of \( Y \) given \( X \).
- If \( f: T \to \R \) is measurable, then \( Pf(x) = \E[f(Y) \mid X = x] \) for \( x \in S \) (assuming as usual that the expected value exists).
- If \( \mu \) is the probability distribution of \( X \) then \( \mu P \) is the probability distribution of \( Y \).
Proof
These are basic results that we have already studied, dressed up in new notation.
- Since \( A \mapsto P(x, A) \) is the conditional distribution of \( Y \) given \( X = x \), and \( Y \) takes values in \( T \), \[ \E[f(Y) \mid X = x] = \int_T P(x, dy) f(y) = P f(x) \]
- Let \( A \in \mathscr T \). Conditioning on \( X \) gives \[ \P(Y \in A) = \E[\P(Y \in A \mid X)] = \int_S \mu(dx) \P(Y \in A \mid X = x) = \int_S \mu(dx) P(x, A) = \mu P(A) \]
As in the general discussion above, the measurable spaces \( (S, \mathscr S) \) and \( (T, \mathscr T) \) are usually measure spaces with natural measures attached. So the conditional probability distributions are often given via conditional probability density functions, which then play the role of kernel functions. The next two exercises give examples.
Suppose that \( X \) and \( Y \) are random variables for an experiment, taking values in \( \R \). For \( x \in \R \), the conditional distribution of \( Y \) given \( X = x \) is normal with mean \( x \) and standard deviation 1. Use the notation and operations of this section for the following computations:
- Give the kernel function for the conditional distribution of \( Y \) given \(X\).
- Find \( \E\left(Y^2 \bigm| X = x\right) \).
- Suppose that \( X \) has the standard normal distribution. Find the probability density function of \( Y \).
Answer
- The kernel function (with respect to Lebesgue measure, of course) is \[ p(x, y) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} (y - x)^2}, \quad x, \, y \in \R \]
- Let \( g(y) = y^2 \) for \( y \in \R \). Since the conditional distribution has mean \( x \) and variance 1, \( \E\left(Y^2 \bigm| X = x\right) = P g(x) = 1 + x^2\) for \( x \in \R \).
- The standard normal PDF \( f \) is given by \( f(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} \) for \( x \in \R \). Thus \( Y \) has PDF \( f P \). \[ f P(y) = \int_{-\infty}^\infty f(x) p(x, y) \, dx = \frac{1}{2 \sqrt{\pi}} e^{-\frac{1}{4} y^2}, \quad y \in \R\] This is the PDF of the normal distribution with mean 0 and variance 2.
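The two operator computations in this exercise can be checked numerically. The following is a minimal sketch in Python, not part of the original exercise: the helper names `integrate`, `right_op`, and `left_op` are hypothetical, and the integrals are approximated by a midpoint rule on a truncated range, which is an assumption that works here because the normal density decays so quickly.

```python
import math

def p(x, y):
    """Kernel function: normal density in y with mean x and standard deviation 1."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

def integrate(h, a, b, n=20000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    step = (b - a) / n
    return step * sum(h(a + (i + 0.5) * step) for i in range(n))

def right_op(g, x, half_width=12.0):
    """Approximate P g(x), the integral of p(x, y) g(y) over y (truncated range)."""
    return integrate(lambda y: p(x, y) * g(y), x - half_width, x + half_width)

def left_op(f, y, half_width=12.0):
    """Approximate f P(y), the integral of f(x) p(x, y) over x (truncated range)."""
    return integrate(lambda x: f(x) * p(x, y), -half_width, half_width)

def std_normal(x):
    """Standard normal density, the assumed distribution of X."""
    return p(0.0, x)

x0, y0 = 1.5, 0.7  # arbitrary test points
second_moment = right_op(lambda y: y * y, x0)
density_y = left_op(std_normal, y0)

# Compare with the closed forms 1 + x^2 and the normal PDF with mean 0, variance 2.
print(second_moment, 1 + x0 ** 2)
print(density_y, math.exp(-y0 ** 2 / 4) / (2 * math.sqrt(math.pi)))
```

In each printed pair, the numerical value should agree with the closed form to several decimal places.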
Suppose that \( X \) and \( Y \) are random variables for an experiment, with \( X \) taking values in \( \{a, b, c\} \) and \( Y \) taking values in \( \{1, 2, 3, 4\} \). The kernel function of \( Y \) given \( X \) is as follows: \( P(a, y) = 1/4 \), \( P(b, y) = y / 10 \), and \( P(c, y) = y^2/30 \), each for \( y \in \{1, 2, 3, 4\} \).
- Give the kernel \( P \) in matrix form and verify that it is a probability kernel.
- Find \( f P \) where \( f(a) = f(b) = f(c) = 1/3 \). The result is the density function of \( Y \) given that \( X \) is uniformly distributed.
- Find \( P g \) where \( g(y) = y \) for \( y \in \{1, 2, 3, 4\} \). The resulting function is \( \E(Y \mid X = x) \) for \( x \in \{a, b, c\} \).
Answer
- \( P \) is given in matrix form below. Note that the row sums are 1. \[ P = \left[\begin{matrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ \frac{1}{10} & \frac{2}{10} & \frac{3}{10} & \frac{4}{10} \\ \frac{1}{30} & \frac{4}{30} & \frac{9}{30} & \frac{16}{30} \end{matrix} \right]\]
- In matrix form, \( f = \left[\begin{matrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{matrix} \right]\) and \(f P = \left[\begin{matrix} \frac{23}{180} & \frac{35}{180} & \frac{51}{180} & \frac{71}{180} \end{matrix} \right]\).
- In matrix form, \[ g = \left[\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix} \right], \quad P g = \left[\begin{matrix} \frac{5}{2} \\ 3 \\ \frac{10}{3} \end{matrix} \right]\]
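In the discrete case the kernel operations are ordinary matrix operations, so the answers above can be reproduced with exact rational arithmetic. The sketch below is an illustration in Python using the standard `fractions` module; the variable names (`row_sums`, `fP`, `Pg`) are chosen here for the example and do not come from the text.

```python
from fractions import Fraction as F

values_y = [1, 2, 3, 4]

# Rows correspond to x = a, b, c; columns to y = 1, 2, 3, 4.
P = [
    [F(1, 4) for y in values_y],
    [F(y, 10) for y in values_y],
    [F(y * y, 30) for y in values_y],
]

# A probability kernel has nonnegative entries and row sums equal to 1.
row_sums = [sum(row) for row in P]

# Left operation: f is a row vector (the uniform density of X).
f = [F(1, 3)] * 3
fP = [sum(f[i] * P[i][j] for i in range(3)) for j in range(4)]

# Right operation: g is a column vector (the identity function on the values of Y).
g = [F(y) for y in values_y]
Pg = [sum(P[i][j] * g[j] for j in range(4)) for i in range(3)]

print(row_sums)
print(fP)
print(Pg)
```

Note that `Fraction` reduces automatically, so \( \frac{35}{180} \) is displayed as \( \frac{7}{36} \) and \( \frac{51}{180} \) as \( \frac{17}{60} \).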
Parametric Distributions
A parametric probability distribution also defines a probability kernel in a natural way, with the parameter playing the role of the kernel variable, and the distribution playing the role of the measure. Such distributions are usually defined in terms of a parametric density function which then defines a kernel function, again with the parameter playing the role of the first argument and the variable the role of the second argument. If the parameter is thought of as a given value of another random variable, as in Bayesian analysis, then there is considerable overlap with the previous subsection. In most cases (and in particular in the examples below), the spaces involved are either discrete or Euclidean, as described in (1).
Consider the parametric family of exponential distributions. Let \( f \) denote the identity function on \( (0, \infty) \).
- Give the probability density function as a probability kernel function \( p \) on \( (0, \infty) \).
- Find \( P f \).
- Find \( f P \).
- Find \( p^2 \), the kernel function corresponding to the product kernel \( P^2 \).
Answer
- \( p(r, x) = r e^{-r x} \) for \( r, \, x \in (0, \infty) \).
- For \( r \in (0, \infty) \), \[ P f(r) = \int_0^\infty p(r, x) f(x) \, dx = \int_0^\infty x r e^{-r x} dx = \frac{1}{r} \] This is the mean of the exponential distribution.
- For \( x \in (0, \infty) \), \[ f P(x) = \int_0^\infty f(r) p(r, x) \, dr = \int_0^\infty r^2 e^{-r x} dr = \frac{2}{x^3} \]
- For \( r, \, y \in (0, \infty) \), \[ p^2(r, y) = \int_0^\infty p(r, x) p(x, y) \, dx = \int_0^\infty r x e^{-(r + y) x} dx = \frac{r}{(r + y)^2} \]
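The three integrals for the exponential kernel can be approximated numerically as a sanity check. This is a rough sketch in Python, not the author's method: the helper `integrate` is hypothetical, and it assumes that truncating the integral at an upper limit of 50 is adequate, which holds for the particular test values used here because each integrand decays exponentially.

```python
import math

def p(r, x):
    """Kernel function: exponential density in x with rate parameter r."""
    return r * math.exp(-r * x)

def integrate(h, upper=50.0, n=100000):
    """Midpoint-rule approximation of the integral of h over (0, upper)."""
    step = upper / n
    return step * sum(h((i + 0.5) * step) for i in range(n))

r, x, y = 2.0, 2.0, 1.0  # arbitrary test values

mean_r = integrate(lambda t: p(r, t) * t)       # P f(r), expected 1 / r
fP_x = integrate(lambda s: s * p(s, x))         # f P(x), expected 2 / x^3
p2_ry = integrate(lambda t: p(r, t) * p(t, y))  # p^2(r, y), expected r / (r + y)^2

print(mean_r, 1 / r)
print(fP_x, 2 / x ** 3)
print(p2_ry, r / (r + y) ** 2)
```

Note that \( y \mapsto p^2(r, y) \) integrates to 1 for each \( r \), as it must for the product of two probability kernels, while \( f P \) is not a probability density.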
Consider the parametric family of Poisson distributions. Let \(f \) be the identity function on \(\N \) and let \( g \) be the identity function on \( (0, \infty) \).
- Give the probability density function \( p \) as a probability kernel function from \( (0, \infty) \) to \( \N \).
- Show that \( P f = g \).
- Show that \( g P = f + 1 \), that is, \( g P(n) = n + 1 \) for \( n \in \N \).
Answer
- \( p(r, n) = e^{-r} \frac{r^n}{n!} \) for \( r \in (0, \infty) \) and \( n \in \N \).
- For \( r \in (0, \infty) \), \( P f(r) \) is the mean of the Poisson distribution with parameter \( r \): \[ P f(r) = \sum_{n=0}^\infty p(r, n) f(n) = \sum_{n=0}^\infty n e^{-r} \frac{r^n}{n!} = r \]
- For \( n \in \N \), the integral is a gamma integral: \[ g P(n) = \int_0^\infty g(r) p(r, n) \, dr = \int_0^\infty e^{-r} \frac{r^{n+1}}{n!} dr = \frac{(n + 1)!}{n!} = n + 1 \]
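Both operations on the Poisson kernel are easy to evaluate numerically. The sketch below is an illustration in Python with hypothetical helper names; it assumes that truncating the series at 200 terms and the integral at 80 is adequate for the small test values used. The integral \( \int_0^\infty e^{-r} r^{n+1} \, dr \) is the gamma integral \( \Gamma(n + 2) = (n + 1)! \), so the left operation applied to the identity function evaluates to \( (n + 1)! / n! = n + 1 \).

```python
import math

def p(r, n):
    """Kernel function: Poisson probability of n with parameter r (computed on the log scale)."""
    return math.exp(-r + n * math.log(r) - math.lgamma(n + 1))

def right_op_identity(r, terms=200):
    """Approximate P f(r) = sum over n of p(r, n) * n, truncating the series."""
    return sum(n * p(r, n) for n in range(terms))

def left_op_identity(n, upper=80.0, steps=100000):
    """Approximate g P(n) = integral over r of r * p(r, n), by a midpoint rule on (0, upper)."""
    step = upper / steps
    return step * sum(((i + 0.5) * step) * p((i + 0.5) * step, n) for i in range(steps))

print(right_op_identity(3.5))                               # mean of the Poisson distribution
print([round(left_op_identity(n), 6) for n in range(5)])    # gamma integrals (n + 1)! / n!
```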
Clearly the Poisson distribution has some very special and elegant properties. The next family of distributions also has some very special properties. Compare this exercise with exercise (30).
Consider the family of normal distributions, parameterized by the mean and with variance 1.
- Give the probability density function as a probability kernel function \( p \) on \( \R \).
- Show that \( p \) is symmetric.
- Let \( f \) be the identity function on \( \R \). Show that \( P f = f \) and \( f P = f \).
- For \( n \in \N_+ \), find \( p^n \), the kernel function for the operator \( P^n \).
Answer
- For \( \mu, \, x \in \R \), \[ p(\mu, x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - \mu)^2} \] That is, \( x \mapsto p(\mu, x) \) is the normal probability density function with mean \( \mu \) and variance 1.
- Note that \( p(\mu, x) = p(x, \mu) \) for \( \mu, \, x \in \R \). So \( \mu \mapsto p(\mu, x) \) is the normal probability density function with mean \( x \) and variance 1.
- Since \( f(x) = x \) for \( x \in \R \), this follows from the previous two parts: \( P f(\mu) = \mu \) for \( \mu \in \R \) and \( f P(x) = x \) for \( x \in \R \).
- For \( \mu, \, x \in \R \), \[ p^2(\mu, x) = \int_{-\infty}^\infty p(\mu, t) p(t, x) \, dt = \frac{1}{\sqrt{4 \pi}} e^{-\frac{1}{4}(x - \mu)^2} \] so that \( x \mapsto p^2(\mu, x) \) is the normal PDF with mean \( \mu \) and variance 2. By induction, \[ p^n(\mu, x) = \frac{1}{\sqrt{2 \pi n}} e^{-\frac{1}{2 n}(x - \mu)^2} \] for \( n \in \N_+ \) and \( \mu, \, x \in \R \). Thus \( x \mapsto p^n(\mu, x) \) is the normal PDF with mean \( \mu \) and variance \( n \).
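The closed form for \( p^n \) can be checked against direct numerical composition of kernel functions. The following is a small sketch in Python; the function names `p_n` and `compose` are hypothetical, and the composition integral is approximated by a midpoint rule on \( [-30, 30] \), an assumed truncation that is harmless for the test points used here.

```python
import math

def p_n(n, mu, x):
    """Closed form from the answer: normal density in x with mean mu and variance n."""
    return math.exp(-(x - mu) ** 2 / (2 * n)) / math.sqrt(2 * math.pi * n)

def compose(k1, k2, mu, x, lo=-30.0, hi=30.0, steps=20000):
    """Approximate the product kernel function: integral of k1(mu, t) * k2(t, x) over t."""
    step = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * step
        total += k1(mu, t) * k2(t, x)
    return step * total

def p(mu, x):
    """The basic kernel function, with variance 1."""
    return p_n(1, mu, x)

mu0, x0 = 0.5, 1.5  # arbitrary test points

# p composed with p should match the closed form with n = 2.
p2_numeric = compose(p, p, mu0, x0)

# Inductive step: p^2 composed with p should match the closed form with n = 3.
p3_numeric = compose(lambda m, t: p_n(2, m, t), p, mu0, x0)

print(p2_numeric, p_n(2, mu0, x0))
print(p3_numeric, p_n(3, mu0, x0))
```

The second comparison mirrors the inductive step: composing the variance 2 kernel with the variance 1 kernel gives the variance 3 kernel.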
For each of the following special distributions, express the probability density function as a probability kernel function. Be sure to specify the parameter spaces.
- The general normal distribution on \( \R \).
- The beta distribution on \( (0, 1) \).
- The negative binomial distribution on \( \N \).
Answer
- The normal distribution with mean \( \mu \) and standard deviation \( \sigma \) defines a kernel function \( p \) from \( \R \times (0, \infty) \) to \( \R \) given by \[ p[(\mu, \sigma), x] = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \]
- The beta distribution with left parameter \( a \) and right parameter \( b \) defines a kernel function \( p \) from \( (0, \infty)^2 \) to \( (0, 1) \) given by \[ p[(a, b), x] = \frac{1}{B(a, b)} x^{a - 1} (1 - x)^{b - 1} \] where \( B \) is the beta function.
- The negative binomial distribution with stopping parameter \( k \) and success parameter \( \alpha \) defines a kernel function \( p \) from \( (0, \infty) \times (0, 1) \) to \( \N \) given by \[ p[(k, \alpha), n] = \binom{n + k - 1}{n} \alpha^k (1 - \alpha)^n \]
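A kernel function with a multi-dimensional parameter can be coded as a function of a parameter tuple and a value. The sketch below, in Python, uses the standard forms of the three densities and checks numerically that each one is a probability kernel function, that is, that it integrates or sums to 1 in the second argument. The function names, the test parameters, and the truncation limits are assumptions made for this illustration. For non-integer \( k \), the binomial coefficient is computed through the gamma function as \( \Gamma(n + k) / [\Gamma(k) \, n!] \).

```python
import math

def normal_kernel(params, x):
    """p[(mu, sigma), x]: normal density with mean mu and standard deviation sigma."""
    mu, sigma = params
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def beta_kernel(params, x):
    """p[(a, b), x]: beta density on (0, 1) with left parameter a and right parameter b."""
    a, b = params
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta)

def neg_binomial_kernel(params, n):
    """p[(k, alpha), n]: negative binomial probability of n failures before the k-th success."""
    k, alpha = params
    log_coef = math.lgamma(n + k) - math.lgamma(k) - math.lgamma(n + 1)
    return math.exp(log_coef + k * math.log(alpha) + n * math.log(1 - alpha))

def integrate(h, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    step = (b - a) / steps
    return step * sum(h(a + (i + 0.5) * step) for i in range(steps))

# Each total should be close to 1 for any fixed parameter value.
normal_total = integrate(lambda x: normal_kernel((1.0, 2.0), x), -23.0, 25.0)
beta_total = integrate(lambda x: beta_kernel((2.5, 4.0), x), 0.0, 1.0)
nb_total = sum(neg_binomial_kernel((2.5, 0.4), n) for n in range(400))

print(normal_total, beta_total, nb_total)
```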