Bayesian Machine Learning via Category Theory
Jared Culbertson and Kirk Sturtz
December 6, 2013
Abstract
From the Bayesian perspective, the category of conditional probabilities (a variant of the Kleisli category of the Giry monad, whose objects are measurable spaces and arrows are Markov kernels) gives a nice framework for conceptualization and analysis of many aspects of machine learning. Using categorical methods, we construct models for parametric and nonparametric Bayesian reasoning on function spaces, thus providing a basis for the supervised learning problem. In particular, stochastic processes are arrows to these function spaces which serve as prior probabilities. The resulting inference maps can often be analytically constructed in this symmetric monoidal weakly closed category. We also show how to view general stochastic processes using functor categories and demonstrate the Kalman filter as an archetype for the hidden Markov model.
Keywords: Bayesian machine learning, categorical probability, Bayesian probability
Contents
1 Introduction
2 The Category of Conditional Probabilities
2.1 (Weak) Product Spaces and Joint Distributions
2.2 Constructing a Joint Distribution Given Conditionals
2.3 Constructing Regular Conditionals given a Joint Distribution
3 The Bayesian Paradigm using $\mathcal{P}$
4 Elementary applications of Bayesian probability
5 The Tensor Product
5.1 Graphs of Conditional Probabilities
5.2 A Tensor Product of Conditionals
5.3 Symmetric Monoidal Categories
6 Function Spaces
6.1 Stochastic Processes
6.2 Gaussian Processes
6.3 GPs via Joint Normal Distributions
7 Bayesian Models for Function Estimation
7.1 Nonparametric Models
7.1.1 Noise Free Measurement Model
7.1.2 Gaussian Additive Measurement Noise Model
7.2 Parametric Models
8 Constructing Inference Maps
8.1 The noise free inference map
8.2 The noisy measurement inference map
8.3 The inference map for parametric models
9 Stochastic Processes as Points
9.1 Markov processes via Functor Categories
9.2 Hidden Markov Models
10 Final Remarks
11 Appendix A: Integrals over probability measures
12 Appendix B: The weak closed structure in $\mathcal{P}$
13 References
1 Introduction
Speculation on the utility of categorical methods in machine learning (ML) has been expounded by numerous people, including the denizens of the n-Category Café blog [5] as early as 2007. Our approach to realizing categorical ML is based upon viewing ML from a probabilistic perspective and using categorical Bayesian probability. Several recent texts (e.g., [2, 19]), along with countless research papers on ML, have emphasized the subject from the perspective of Bayesian reasoning. Combining this viewpoint with the recent work [6], which provides a categorical framework for Bayesian probability, we develop a category theoretic perspective on ML. The abstraction provided by category theory serves as a basis not only for an organization of one's thoughts on the subject, but also provides an efficient graphical method for model building, in much the same way that probabilistic graphical modeling (PGM) has for Bayesian network problems.
In this paper, we focus entirely on the supervised learning problem, i.e., the regression or function estimation problem. The general framework applies to any Bayesian machine learning problem, however. For instance, the unsupervised clustering or density estimation problems can be characterized in a similar way by changing the hypothesis space and sampling distribution. For simplicity, we choose to focus on regression and leave the other problems to the industrious reader. For us, then, the Bayesian learning problem is to determine a function $f : X \rightarrow Y$ which takes an input $\mathbf{x} \in X$, such as a feature vector, and associates an output (or class) $f(\mathbf{x})$ with $\mathbf{x}$. Given a measurement $(\mathbf{x}, y)$, or a set of measurements $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ where each $y_i$ is a labeled output (i.e., training data), we interpret this problem as an estimation problem of an unknown function $f$ which lies in $Y^X$, the space of all measurable functions1 from $X$ to $Y$, such that $f(\mathbf{x}_i) \approx y_i$. When $Y$ is a vector space, the space $Y^X$ is also a vector space, and it is infinite dimensional when $X$ is infinite. If we choose to allow all such functions (every function $f \in Y^X$ is a valid model) then the problem is nonparametric. On the other hand, if we only allow functions from some subspace $V \subset Y^X$ of finite dimension $p$, then we have a parametric model characterized by a measurable map $i : \mathbb{R}^p \rightarrow Y^X$. The image of $i$ is then the space of functions which we consider as valid models of the unknown function for the Bayesian estimation problem. Hence, the elements $\mathbf{a} \in \mathbb{R}^p$ completely determine the valid modeling functions $i(\mathbf{a}) \in Y^X$.
Bayesian modeling splits the problem into two aspects: (1) specification of the hypothesis space, which consists of the “valid” functions $f$, and (2) a noisy measurement model such as $y_i = f(\mathbf{x}_i) + \epsilon_i$, where the noise component $\epsilon_i$ is often modeled by a Gaussian distribution. Bayesian reasoning with the hypothesis space taken as $Y^X$ or any subspace $V \subset Y^X$ (finite or infinite dimensional), and with the noisy measurement model determining a sampling distribution, can then be used to efficiently estimate (learn) the function $f$ without overfitting the data.
We cast this whole process into a graphical formulation using category theory, which, like PGM, can in turn be used as a modeling tool itself. In fact, we view the components of these various models, which are just Markov kernels, as interchangeable parts. An important part of solving the ML problem with a Bayesian model consists of choosing the appropriate parts for a given setting. The close relationship between parametric and nonparametric models comes to the forefront in the analysis, with the measurable map $i : \mathbb{R}^p \rightarrow Y^X$ connecting the two types of models. To illustrate this point, suppose we are given a normal distribution $P$ on $\mathbb{R}^p$ as a prior probability on the unknown parameters. Then the push forward measure2 of $P$ by $i$ is a Gaussian process, which is a basic tool in nonparametric modeling. When composed with a noisy measurement model, this provides the whole Bayesian model required for a complete analysis, and an inference map
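To make the parametric-to-nonparametric connection concrete, here is a minimal numerical sketch (our illustration, not from the paper; the polynomial basis, dimensions and unit prior are assumptions) of pushing a normal prior on $\mathbb{R}^p$ through a linear map and reading off the induced Gaussian covariance at finitely many inputs:

```python
import numpy as np

def pushforward_gaussian(mean, cov, Phi):
    """Exact pushforward of N(mean, cov) on R^p under the linear map a -> Phi @ a."""
    return Phi @ mean, Phi @ cov @ Phi.T

# Assumed basis for illustration: polynomial features phi_j(x) = x**j.
xs = np.array([0.0, 0.5, 1.0])     # a finite set of inputs
p = 3
Phi = np.stack([xs**j for j in range(p)], axis=1)   # shape (len(xs), p)

prior_mean = np.zeros(p)           # prior N(0, I) on the parameters
prior_cov = np.eye(p)
m, K = pushforward_gaussian(prior_mean, prior_cov, Phi)
# K is the Gram matrix k(x, x') = sum_j phi_j(x) phi_j(x'):
# the finite-dimensional covariance of the induced Gaussian process.
print(K)
```

With a unit prior covariance the pushforward covariance is exactly the Gram matrix of the features, which is the usual weight-space view of a GP.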
1Recall that a $\sigma$ -algebra $\Sigma_X$ on $X$ is a collection of subsets of $X$ that is closed under complements and countable unions (and hence intersections); the pair $(X, \Sigma_X)$ is called a measurable space and any set $A \in \Sigma_X$ is called a measurable set of $X$ . A measurable function $f : X \rightarrow Y$ is defined by the property that for any measurable set $B$ in the $\sigma$ -algebra of $Y$ , we have that $f^{-1}(B)$ is in the $\sigma$ -algebra of $X$ . For example, all continuous functions are measurable with respect to the Borel $\sigma$ -algebras.
2A measure $\mu$ on a measurable space $(X, \Sigma_X)$ is a nonnegative real-valued function $\mu : \Sigma_X \rightarrow \mathbb{R}_{\geq 0}$ such that $\mu(\emptyset) = 0$ and $\mu(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \mu(A_i)$ for every countable collection $\{A_i\}_{i=1}^{\infty}$ of pairwise disjoint measurable sets. A probability measure is a measure with $\mu(X) = 1$. In this paper, all measures are probability measures and the terminology “distribution” will be synonymous with “probability measure.”

can be analytically constructed.3 Consequently, given any measurement $(\mathbf{x}, y)$, applying the inference map conditioned at $(\mathbf{x}, y)$ yields the updated prior probability, which is another normal distribution on $\mathbb{R}^p$.
The ability to do Bayesian probability involving function spaces relies on the fact that the category of measurable spaces, $\mathcal{Meas}$ , has the structure of a symmetric monoidal closed category (SMCC). Through the evaluation map, this in turn provides the category of conditional probabilities $\mathcal{P}$ with the structure of a symmetric monoidal weakly closed category (SMwCC), which is necessary for modeling stochastic processes as probability measures on function spaces. On the other hand, the ordinary product $X \times Y$ with its product $\sigma$ -algebra is used for the Bayesian aspect of updating joint (and marginal) distributions. From a modeling viewpoint, the SMwCC structure is used for carrying along a parameter space (along with its relationship to the output space through the evaluation map). Thus we can describe training data and measurements as ordered pairs $(\mathbf{x}_i, y_i) \in X \otimes Y$ , where $X$ plays the role of a parameter space.
A few notes on the exposition. In this paper our intended audience consists of (1) the practicing ML engineer with only a passing knowledge of category theory (e.g., knowing about objects, arrows and commutative diagrams), and (2) those knowledgeable of category theory with an interest in how ML can be formulated within this context. For the ML engineer familiar with Markov kernels, we believe that the presentation of $\mathcal{P}$ and its applications can serve as an easier introduction to categorical ideas and methods than many standard approaches. While some terminology will be unfamiliar, the examples should provide an adequate understanding to relate one's knowledge of ML to the categorical perspective. If ML researchers find this categorical perspective useful for further developments or simply for modeling purposes, then this paper will have achieved its goal.
In the categorical framework for Bayesian probability, Bayes' equation is replaced by an integral equation where the integrals are defined over probability measures. The analysis requires these integrals be evaluated on arbitrary measurable sets and this is often possible using the three basic rules provided in Appendix A. Detailed knowledge of measure theory is not necessary outside of understanding these three rules and the basics of $\sigma$ -algebras and measures, which are used extensively for evaluating integrals in this paper. Some proofs require more advanced measure-theoretic ideas, but the proofs can safely be avoided by the unfamiliar reader and are provided for the convenience of those who might be interested in such details.
For the category theorist, we hope the paper makes the fundamental ideas of ML transparent, and conveys our belief that Bayesian probability can be characterized categorically and usefully applied to fields such as ML. We believe the further development of categorical probability can be motivated by such applications and in the final remarks we comment on one such direction that we are pursuing.
These notes are intended to be tutorial in nature, and so contain much more detail than would be reasonable for a standard research paper. As in this introductory section, basic definitions will be given as footnotes, while more important definitions, lemmas and theorems appear in the main text. Although an effort has been made to make the exposition as self-contained as possible, complete self-containment is clearly an unachievable goal. In the presentation, we avoid the use of the terminology of random variables for two reasons. (1) Formally, a random variable is a measurable function $f : X \rightarrow Y$, and a probability measure $P$ on $X$ gives rise to the distribution of the random variable, the push forward measure $f_*(P)$. In practice the random variable $f$ itself is more often than not impossible to characterize functionally (consider the process of flipping a coin), while reference to the random variable using a binomial distribution, or any other distribution, is simply making reference to some probability measure. As a result, in practice the term “random variable” is often not making reference to any measurable function $f$ and pushforward measure at all, but rather is just referring to a probability measure. (2) The term “random variable” has a connotation that, we believe, should be de-emphasized in a Bayesian approach to modeling uncertainty. Thus, while a random variable can be modeled as a push forward probability measure within the framework presented, we feel no need to single them out as having any special relevance beyond the remark already given. In illustrating the application of categorical Bayesian probability we do, however, show how to translate the familiar language of random variables into the unfamiliar categorical framework for the particular case of Gaussian distributions, which are the most important application for ML since Gaussian processes are characterized on finite subsets by Gaussian distributions.

3The inference map need not be unique.
This provides a particularly nice illustration of the non-uniqueness of the sampling distribution and inference map pairs determined by a joint distribution.
Organization. The paper is organized as follows. The theory of Bayesian probability in $\mathcal{P}$ is first addressed and applied to elementary problems on finite spaces, where detailed solutions to inference, prediction and decision problems are provided. If one understands the “how and why” of solving these problems, then the extension to solving problems in ML is a simple step: one uses the same basic paradigm with only the hypothesis space changed to a function space. Nonparametric modeling is presented next, and then the parametric model can be seen as a submodel of the nonparametric model. We then proceed to give a general definition of a stochastic process as a special type of arrow in a functor category $\mathcal{P}^X$; by varying the category $X$ or placing conditions on the projection maps onto subspaces, one obtains the various types of stochastic processes, such as Markov processes or GPs. Finally, we remark on the area where category theory may have the biggest impact on applications for ML: integrating probabilistic models with decision theory into one common framework.
The results presented here, derived from a categorical analysis of the ML problem(s), will come as no surprise to ML professionals. We acknowledge and thank our colleagues, experts in the field, who provided assistance and feedback.

2 The Category of Conditional Probabilities
The development of a categorical basis for probability was initiated by Lawvere [16], and further developed by Giry [14] using monads to characterize the adjunction given in Lawvere’s original work. The Kleisli category of the Giry monad $\mathcal{G}$ is what Lawvere called the category of probabilistic mappings and what we shall refer to as the category of conditional probabilities.4 Further progress was given in the unpublished dissertation of Meng [18], which provides a wealth of information and a basis for thinking about stochastic processes from a categorical viewpoint. While this work does not address the Bayesian perspective, it does provide an alternative “statistical viewpoint” toward solving such problems using generalized metrics. Additional interesting work on this category is presented in a seminar by Voevodsky, in Russian, available in an online video [22]. The extension of categorical probability to the Bayesian viewpoint is given in the paper [6], though Lawvere and Peter Huber were aware of a similar approach in the 1960’s.5 Coecke and Spekkens [4] provide an alternative graphical language for Bayesian reasoning under the assumption of finite spaces, which they refer to as standard probability theory. In such spaces the arrows can be represented by stochastic matrices [13]. More recently, Fong [12] has provided further applications of the category of conditional probabilities to Causal Theories for Bayesian networks.
Much of the material in this section is directly from [6], with some additional explanation where necessary. The category6 of conditional probabilities, which we denote by $\mathcal{P}$, has countably generated7 measurable spaces $(X, \Sigma_X)$ as objects, and an arrow between two such objects
$$X \xrightarrow{\;T\;} Y$$
is a Markov kernel (also called a regular conditional probability) assigning to each element $x \in X$ and each measurable set $B \in \Sigma_Y$ the probability of $B$ given $x$, denoted $T(B \mid x)$. The term “regular” refers to the fact that the function $T$ is conditioned on points rather than measurable sets $A \in \Sigma_X$. When $(X, \Sigma_X)$ is a countable set (either finite or countably infinite) with the discrete $\sigma$-algebra then every singleton $\{x\}$ is measurable and the term “regular” is unnecessary. More precisely, an arrow $T: X \rightarrow Y$ in $\mathcal{P}$ is a function $T: \Sigma_Y \times X \rightarrow [0, 1]$ satisfying
4Monads had not yet been developed at the time of Lawvere’s work. However the adjunction construction he provided was the Giry monad on measurable spaces.
5In a personal communication Lawvere related that he and Peter Huber gave a seminar in Zurich around 1965 on “Bayesian sections.” This refers to the existence of inference maps in the Eilenberg–Moore category of $\mathcal{G}$ -algebras. These inference maps are discussed in Section 3, although we discuss them only in the context of the category $\mathcal{P}$ .
6A category is a collection of (1) objects and (2) morphisms (or arrows) between the objects (including a required identity morphism for each object), along with a prescribed method for associative composition of morphisms.
7A space $(X, \Sigma_X)$ is countably generated if there exists a countable collection of measurable sets $\{A_i\}_{i=1}^{\infty}$ which generates the $\sigma$-algebra $\Sigma_X$.

1. for all $B \in \Sigma_Y$, the function $T(B \mid \cdot): X \rightarrow [0, 1]$ is measurable, and
2. for all $x \in X$, the function $T(\cdot \mid x): \Sigma_Y \rightarrow [0, 1]$ is a perfect probability measure8 on $Y$.
For technical reasons it is necessary that the probability measures in (2) constitute an equiperfect family of probability measures to avoid pathological cases which prevent the existence of the inference maps necessary for Bayesian reasoning.9
The notation $T(B \mid x)$ is chosen as it coincides with the standard notation “$p(H \mid D)$” of conditional probability theory. For an arrow $T: (X, \Sigma_X) \rightarrow (Y, \Sigma_Y)$, we occasionally denote the measurable function $T(B \mid \cdot): X \rightarrow [0, 1]$ by $T_B$ and the probability measure $T(\cdot \mid x): \Sigma_Y \rightarrow [0, 1]$ by $T_x$. Hereafter, for notational brevity we write a measurable space $(X, \Sigma_X)$ simply as $X$ when referring to a generic $\sigma$-algebra $\Sigma_X$.
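On finite discrete spaces an arrow of $\mathcal{P}$ can be represented by a stochastic matrix (cf. [13]). The following sketch (our illustration; the entries are arbitrary) encodes conditions (1) and (2) concretely: on a finite discrete space, condition (1) is automatic, and condition (2) says each row is a probability vector.

```python
import numpy as np

# A kernel T : X -> Y with X = {0,...,m-1}, Y = {0,...,n-1} is a
# row-stochastic matrix with T[x, y] = T({y} | x).

def is_markov_kernel(T, tol=1e-12):
    T = np.asarray(T)
    return bool(np.all(T >= 0) and np.allclose(T.sum(axis=1), 1.0, atol=tol))

def kernel_prob(T, B, x):
    """T(B | x) for a measurable set B, given as a list of points of Y."""
    return float(sum(T[x, y] for y in B))

T = np.array([[0.2, 0.8],
              [0.5, 0.5],
              [1.0, 0.0]])
print(is_markov_kernel(T))        # True: each row is a probability measure T_x
print(kernel_prob(T, [0, 1], 2))  # T(Y | x) = 1.0 for every x
```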
Given two arrows
$$X \xrightarrow{\;T\;} Y \xrightarrow{\;U\;} Z,$$
the composition $U \circ T: \Sigma_Z \times X \rightarrow [0, 1]$ is marginalization over $Y$, defined by
$$(U \circ T)(C \mid x) = \int_{y \in Y} U(C \mid y) \, dT_x.$$
The integral of any real valued measurable function $f: X \rightarrow \mathbb{R}$ with respect to any measure $P$ on $X$ is
$$\mathbb{E}_P[f] = \int_{x \in X} f \, dP,$$
called the $P$-expectation of $f$. Consequently the composite $(U \circ T)(C \mid x)$ is the $T_x$-expectation of $U_C$,
$$(U \circ T)(C \mid x) = \mathbb{E}_{T_x}[U_C].$$
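On finite spaces the composition formula reduces to multiplication of the corresponding stochastic matrices, since the $T_x$-expectation of $U_C$ is a finite sum. A sketch (our illustration, with arbitrary entries):

```python
import numpy as np

# (U o T)(C | x) = sum_{y in Y} U(C | y) T({y} | x): the T_x-expectation
# of the measurable function U_C = U(C | .).  With kernels as
# row-stochastic matrices this is exactly the matrix product T @ U.

T = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.5, 0.0]])    # T : X -> Y  (|X| = 2, |Y| = 3)
U = np.array([[1.0, 0.0],
              [0.25, 0.75],
              [0.5, 0.5]])         # U : Y -> Z  (|Z| = 2)

UoT = T @ U                        # (U o T) : X -> Z
print(UoT)                         # each row is again a probability measure
print(UoT.sum(axis=1))             # row sums are 1
```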
Let $\mathcal{Meas}$ denote the category of measurable spaces, where the objects are measurable spaces $(X, \Sigma_X)$ and the arrows are measurable functions $f: X \rightarrow Y$. Every measurable mapping $f: X \rightarrow Y$ may be regarded as a $\mathcal{P}$ arrow
$$X \xrightarrow{\;\delta_f\;} Y$$
defined by the Dirac (or one point) measure
$$\delta_f(B \mid x) = \begin{cases} 1 & \text{if } f(x) \in B \\ 0 & \text{otherwise.} \end{cases}$$
8A perfect probability measure $P$ on $Y$ is a probability measure such that for any measurable function $f: Y \rightarrow \mathbb{R}$ there exists a real Borel set $E \subset f(Y)$ satisfying $P(f^{-1}(E)) = 1$.
9Specifically, the subsequent Theorem 1 is a constructive procedure which requires perfect probability measures. Corollary 2 then gives the inference map. Without the hypothesis of perfect measures a pathological counterexample can be constructed as in [9, Problem 10.26]. The paper by Faden [11] gives conditions on the existence of conditional probabilities, and this constraint is explained in full detail in [6]. Note that the class of perfect measures is quite broad and includes all probability measures defined on Polish spaces.

The relation between the Dirac measure and the characteristic (indicator) function $\mathbb{1}$ is
$$\delta_f(B \mid x) = \mathbb{1}_B(f(x)) = \mathbb{1}_{f^{-1}(B)}(x),$$
and this property is used ubiquitously in the analysis of integrals.
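On finite spaces $\delta_f$ is the 0/1 stochastic matrix with a single 1 per row, at column $f(x)$. A small sketch (our illustration, with a hypothetical map $f(x) = x \bmod 2$):

```python
import numpy as np

# delta_f(B | x) = 1_B(f(x)): a deterministic kernel.  On finite spaces
# row x has a single 1, in column f(x).

def dirac_kernel(f, m, n):
    D = np.zeros((m, n))
    for x in range(m):
        D[x, f(x)] = 1.0
    return D

D = dirac_kernel(lambda x: x % 2, m=4, n=2)
print(D[:, 1])   # the column for B = {1} is the indicator 1_B(f(x)): [0. 1. 0. 1.]
```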
Taking the measurable mapping $f$ to be the identity map on $X$ gives for each object $X$ the morphism $X \xrightarrow{\delta_{Id_X}} X$ given by
$$\delta_{Id_X}(A \mid x) = \mathbb{1}_A(x),$$
which is the identity morphism for $X$ in $\mathcal{P}$. Using standard notation we denote the identity mapping on any object $X$ by $1_X = \delta_{Id_X}$, or for brevity simply by $1$ if the space $X$ is clear from the context. With these objects and arrows, law of composition, associativity, and identity, standard measure-theoretic arguments show that $\mathcal{P}$ forms a category.
There is a distinguished object in $\mathcal{P}$ that plays an important role in Bayesian probability. For any set $Y$ with the indiscrete $\sigma$-algebra $\Sigma_Y = \{Y, \emptyset\}$, there is a unique arrow from any object $X$ to $Y$, since any arrow $P: X \rightarrow Y$ is completely determined by the fact that $P_x$ must be a probability measure on $Y$. Hence $Y$ is a terminal object, and we denote the unique arrow by $!_X : X \rightarrow Y$. Up to isomorphism, the canonical terminal object is the one-element set, which we denote by $1 = \{\star\}$, with the only possible $\sigma$-algebra. It follows that any arrow $P : 1 \rightarrow X$ from the terminal object to any space $X$ is an (absolute) probability measure on $X$; it is “absolute” because there is no variability (conditioning) possible within the singleton set $1 = \{\star\}$.
Figure 1: The representation of a probability measure in $\mathcal{P}$ .
We refer to any arrow $P: 1 \rightarrow X$ with domain $1$ as either a probability measure or a distribution on $X$. If $X$ is countable then $X$ is isomorphic in $\mathcal{P}$ to a discrete space $\mathbf{m} = \{0, 1, 2, \dots, m-1\}$ with the discrete $\sigma$-algebra, where the integer $m$ corresponds to the number of atoms in the $\sigma$-algebra $\Sigma_X$. Consequently every finite space is, up to isomorphism, just a discrete space, and therefore every distribution $P: 1 \rightarrow X$ is of the form $P = \sum_{i=0}^{m-1} p_i \delta_i$ where $\sum_{i=0}^{m-1} p_i = 1$.
2.1 (Weak) Product Spaces and Joint Distributions
In Bayesian probability, determining the joint distribution on a “product space” is often the problem to be solved. In many applications for which Bayesian reasoning is appropriate, the problem reduces to computing a particular marginal or conditional probability; these can be obtained in a straightforward way if the joint distribution is known. Before proceeding to formulate precisely what the term “product space” means in $\mathcal{P}$, we describe the categorical construct of a finite product space in any category.
Let $\mathcal{C}$ be an arbitrary category and $X, Y \in_{ob} \mathcal{C}$ . We say the product of $X$ and $Y$ exists if there is an object, which we denote by $X \times Y$ , along with two arrows $p_X: X \times Y \rightarrow X$ and $p_Y: X \times Y \rightarrow Y$ in $\mathcal{C}$ such that given any other object $T$ in $\mathcal{C}$ and arrows $f: T \rightarrow X$ and $g: T \rightarrow Y$ there is a unique $\mathcal{C}$ arrow $\langle f, g \rangle: T \rightarrow X \times Y$ that makes the diagram
commute. If the given diagram is a product then we often write the product as a triple $(X \times Y, p_X, p_Y)$ . We must not let the notation deceive us; the object $X \times Y$ could just as well be represented by $P_{X,Y}$ . The important point is that it is an object in $\mathcal{C}$ that we need to specify in order to show that binary products exist. Products are an example of a universal construction in categories. The term “universal” implies that these constructions are unique up to a unique isomorphism. Thus if $(P_{X,Y}, p_X, p_Y)$ and $(Q_{X,Y}, q_X, q_Y)$ are both products for the objects $X$ and $Y$ then there exist unique arrows $\alpha: P_{X,Y} \rightarrow Q_{X,Y}$ and $\beta: Q_{X,Y} \rightarrow P_{X,Y}$ in $\mathcal{C}$ such that $\beta \circ \alpha = 1_{P_{X,Y}}$ and $\alpha \circ \beta = 1_{Q_{X,Y}}$ so that the objects $P_{X,Y}$ and $Q_{X,Y}$ are isomorphic.
If the product of every pair of objects $X$ and $Y$ exists in $\mathcal{C}$ then we say binary products exist in $\mathcal{C}$. The existence of binary products implies the existence of arbitrary finite products in $\mathcal{C}$. So if $\{X_i\}_{i=1}^N$ is a finite set of objects in $\mathcal{C}$ then there is an object which we denote by $\prod_{i=1}^N X_i$ (in general, this need not be the cartesian product) as well as arrows $\{p_{X_j}: \prod_{i=1}^N X_i \rightarrow X_j\}_{j=1}^N$. Then if we are given an arbitrary $T \in_{ob} \mathcal{C}$ and a family of arrows $f_j: T \rightarrow X_j$ in $\mathcal{C}$, there exists a unique $\mathcal{C}$ arrow $\langle f_1, \dots, f_N \rangle$ such that for every integer $j \in \{1, 2, \dots, N\}$ the diagram
commutes. The arrows $p_{X_i}$ defining a product space are often called the projection maps due to the analogy with the cartesian products in the category of sets, $\mathcal{Set}$.

In $\mathcal{Set}$, the product of two sets $X$ and $Y$ is the cartesian product $X \times Y$ consisting of all pairs $(x, y)$ of elements with $x \in X$ and $y \in Y$, along with the two projection mappings $\pi_X: X \times Y \rightarrow X$ sending $(x, y) \mapsto x$ and $\pi_Y: X \times Y \rightarrow Y$ sending $(x, y) \mapsto y$. Given any pair of functions $f: T \rightarrow X$ and $g: T \rightarrow Y$, the function $\langle f, g \rangle: T \rightarrow X \times Y$ sending $t \mapsto (f(t), g(t))$ clearly makes Diagram 2 commute. But it is also the unique such function, because if $\gamma: T \rightarrow X \times Y$ were any other function making the diagram commute then the equations
$$\pi_X \circ \gamma = f \qquad \text{and} \qquad \pi_Y \circ \gamma = g \tag{3}$$
would also be satisfied. But since the function $\gamma$ has codomain $X \times Y$, which consists of ordered pairs $(x, y)$, it follows that for each $t \in T$, $\gamma(t) = (\gamma_1(t), \gamma_2(t))$ for some functions $\gamma_1: T \rightarrow X$ and $\gamma_2: T \rightarrow Y$. Substituting $\gamma = \langle \gamma_1, \gamma_2 \rangle$ into equations (3), it follows that $\gamma_1 = f$ and $\gamma_2 = g$, from which it follows that $\gamma = \langle \gamma_1, \gamma_2 \rangle = \langle f, g \rangle$, thereby proving that there exists at most one such function $T \rightarrow X \times Y$ making the requisite Diagram 2 commute. If the requirement of the uniqueness of the arrow $\langle f, g \rangle$ in the definition of a product is dropped, then we have the definition of a weak product of $X$ and $Y$.
Given the relationship between the categories $\mathcal{P}$ and $\mathcal{Meas}$, it is worthwhile to examine products in $\mathcal{Meas}$. Given $X, Y \in_{ob} \mathcal{Meas}$, the product $X \times Y$ is the cartesian product $X \times Y$ of sets endowed with the smallest $\sigma$-algebra such that the two set projection maps $\pi_X: X \times Y \rightarrow X$ sending $(x, y) \mapsto x$ and $\pi_Y: X \times Y \rightarrow Y$ sending $(x, y) \mapsto y$ are measurable. In other words, we take the smallest $\sigma$-algebra on $X \times Y$ such that for all $A \in \Sigma_X$ and for all $B \in \Sigma_Y$ the preimages $\pi_X^{-1}(A) = A \times Y$ and $\pi_Y^{-1}(B) = X \times B$ are measurable. Since a $\sigma$-algebra requires that the intersection of any two measurable sets is also measurable, it follows that $\pi_X^{-1}(A) \cap \pi_Y^{-1}(B) = A \times B$ must also be measurable. Measurable sets of the form $A \times B$ are called rectangles and generate the collection of all measurable sets defining the $\sigma$-algebra $\Sigma_{X \times Y}$, in the sense that $\Sigma_{X \times Y}$ is equal to the intersection of all $\sigma$-algebras containing the rectangles. When the $\sigma$-algebra on a set is determined by a family of maps $\{p_k: X \times Y \rightarrow Z_k\}_{k \in K}$, where $K$ is some indexing set, such that all of these maps $p_k$ are measurable, we say the $\sigma$-algebra is induced (or generated) by the family of maps $\{p_k\}_{k \in K}$.10 The cartesian product $X \times Y$ with the $\sigma$-algebra induced by the two projection maps $\pi_X$ and $\pi_Y$ is easily verified to be a product of $X$ and $Y$, since given any two measurable maps $f: Z \rightarrow X$ and $g: Z \rightarrow Y$, the map $\langle f, g \rangle: Z \rightarrow X \times Y$ sending $z \mapsto (f(z), g(z))$ is the unique measurable map satisfying the defining property of a product for $(X \times Y, \pi_X, \pi_Y)$.
This $\sigma$ -algebra induced by the projection maps $\pi_X$ and $\pi_Y$ is called the product $\sigma$ -algebra and the use of the notation $X \times Y$ in $\mathcal{Meas}$ will imply the product $\sigma$ -algebra on the set $X \times Y$ .
10The terminology initial is also used in lieu of induced.

Having the product $(X \times Y, \pi_X, \pi_Y)$ in $\mathcal{Meas}$, and the fact that every measurable function $f \in_{ar} \mathcal{Meas}$ determines an arrow $\delta_f \in_{ar} \mathcal{P}$, it is tempting to consider the triple $(X \times Y, \delta_{\pi_X}, \delta_{\pi_Y})$ as a potential product in $\mathcal{P}$. However, this triple fails to be a product space of $X$ and $Y$ in $\mathcal{P}$ because the uniqueness condition fails; given two probability measures $P: 1 \rightarrow X$ and $Q: 1 \rightarrow Y$ there are many joint distributions $J$ making the diagram
commute. In particular, the tensor product measure defined on rectangles by $(P \otimes Q)(A \times B) = P(A)Q(B)$ extends to a joint probability measure on $X \times Y$ by
$$(P \otimes Q)(\varsigma) = \int_{x \in X} Q\left(\Gamma_{\bar{x}}^{-1}(\varsigma)\right) \, dP,$$
or equivalently,
$$(Q \otimes P)(\varsigma) = \int_{y \in Y} P\left(\Gamma_{\bar{y}}^{-1}(\varsigma)\right) \, dQ.$$
Here $\bar{x}: Y \rightarrow X$ is the constant function at $x$ and $\Gamma_{\bar{x}}: Y \rightarrow X \times Y$ is the associated graph function $y \mapsto (x, y)$, with $\bar{y}$ and $\Gamma_{\bar{y}}$ defined similarly. The fact that $Q \otimes P = P \otimes Q$ is Fubini's Theorem; taking a rectangle $\varsigma = A \times B \in \Sigma_{X \times Y}$, the equality of these two measures is immediate since
$$\int_{x \in X} Q\left(\Gamma_{\bar{x}}^{-1}(A \times B)\right) \, dP = \int_{x \in X} \mathbb{1}_A(x) \, Q(B) \, dP = P(A)Q(B) = \int_{y \in Y} \mathbb{1}_B(y) \, P(A) \, dQ = \int_{y \in Y} P\left(\Gamma_{\bar{y}}^{-1}(A \times B)\right) \, dQ.$$
Since the rectangles generate $\Sigma_{X \times Y}$, the equality extends from rectangles to all measurable sets, and Fubini's Theorem follows.
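For finite spaces the tensor product measure and the rectangle computation above can be checked directly; a sketch (our illustration, with arbitrary example weights):

```python
import numpy as np

# On finite spaces (P ⊗ Q)({(x, y)}) = P({x}) Q({y}): the outer product.
P = np.array([0.3, 0.7])           # measure on X
Q = np.array([0.5, 0.25, 0.25])    # measure on Y
J = np.outer(P, Q)                 # P ⊗ Q on X x Y

# On a rectangle A x B the measure is P(A) Q(B).
A, B = [1], [0, 2]
lhs = J[np.ix_(A, B)].sum()        # (P ⊗ Q)(A x B)
rhs = P[A].sum() * Q[B].sum()      # P(A) Q(B)
print(lhs, rhs)                    # both equal P(A) Q(B)

# Fubini on finite spaces: integrating out either factor first agrees.
print(np.allclose(J, np.outer(Q, P).T))   # True
```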
It is clear that in $\mathcal{P}$ the uniqueness condition required in the definition of a product of $X$ and $Y$ will always fail unless at least one of $X$ and $Y$ is a terminal object $1$, and consequently only weak products exist in $\mathcal{P}$. However, it is the nonuniqueness of products in $\mathcal{P}$ that makes this category interesting. Instead of referring to weak products in $\mathcal{P}$ we shall abuse terminology and simply refer to them as products, with the understanding that all products in $\mathcal{P}$ are weak.

2.2 Constructing a Joint Distribution Given Conditionals
We now show how marginals and conditionals can be used to determine joint distributions in $\mathcal{P}$ . Given a conditional probability measure $h: X \rightarrow Y$ and a probability measure $P_X: 1 \rightarrow X$ on $X$ , consider the diagram
where $J_h$ is the uniquely determined joint distribution on the product space $X \times Y$, defined on the rectangles of the product $\sigma$-algebra $\Sigma_{X \times Y}$ by
$$J_h(A \times B) = \int_{x \in A} h(B \mid x) \, dP_X.$$
The marginal of $J_h$ with respect to $Y$ then satisfies $\delta_{\pi_Y} \circ J_h = h \circ P_X$, and the marginal of $J_h$ with respect to $X$ is $P_X$. By a symmetric argument, if we are given a probability measure $P_Y$ and conditional probability $k: Y \rightarrow X$, then we obtain a unique joint distribution $J_k$ on the product space $X \times Y$ given on the rectangles by
$$J_k(A \times B) = \int_{y \in B} k(A \mid y) \, dP_Y.$$
However if we are given $P_X, P_Y, h, k$ as indicated in the diagram
then we have that $J_h = J_k$ if and only if the compatibility condition is satisfied on the rectangles
In the extreme case, suppose we have a conditional $h: X \rightarrow Y$ which factors through the terminal object 1 as
where $!$ represents the unique arrow $X \rightarrow 1$. If we are also given a probability measure $P: 1 \rightarrow X$, then we can calculate the joint distribution determined by $P$ and $h = Q \circ !$ as
so that $J = P \otimes Q$ . In this situation we say that the marginals $P$ and $Q$ are independent. Thus in $\mathcal{P}$ independence corresponds to a special instance of a conditional—one that factors through the terminal object.
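The construction of $J_h$ from a prior and a conditional, its marginals, and the independent special case can all be checked on finite spaces. In the sketch below the spaces, prior, and conditional are illustrative assumptions, not data from the text.

```python
# Finite sketch of the joint distribution J_h determined by a prior P_X and a
# conditional h : X → Y.  All spaces and numbers are illustrative assumptions.
X = ["x1", "x2"]
Y = ["y1", "y2"]
P_X = {"x1": 0.3, "x2": 0.7}
h = {"x1": {"y1": 0.9, "y2": 0.1},   # h(· | x1)
     "x2": {"y1": 0.2, "y2": 0.8}}   # h(· | x2)

# On points, J_h({x} × {y}) = P_X({x}) · h({y} | x).
J = {(x, y): P_X[x] * h[x][y] for x in X for y in Y}

# The marginal on X recovers P_X; the marginal on Y is the composite h ∘ P_X.
marg_X = {x: sum(J[(x, y)] for y in Y) for x in X}
marg_Y = {y: sum(J[(x, y)] for x in X) for y in Y}
assert all(abs(marg_X[x] - P_X[x]) < 1e-12 for x in X)
assert abs(marg_Y["y1"] - (0.3 * 0.9 + 0.7 * 0.2)) < 1e-12

# If h factors through the terminal object, i.e. h(· | x) = Q for every x,
# the joint collapses to the tensor product P_X ⊗ Q: independent marginals.
Q = {"y1": 0.5, "y2": 0.5}
h_const = {x: dict(Q) for x in X}
J_indep = {(x, y): P_X[x] * h_const[x][y] for x in X for y in Y}
assert all(abs(J_indep[(x, y)] - P_X[x] * Q[y]) < 1e-12 for x in X for y in Y)
```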
2.3 Constructing Regular Conditionals given a Joint Distribution
The following result is the theorem from which the inference maps in Bayesian probability theory are constructed. The fact that we require perfect families of probability measures is critical for the construction.
Theorem 1. Let $X$ and $Y$ be countably generated measurable spaces and $(X \times Y, \Sigma_{X \times Y})$ the product in $\text{Meas}$ with projection map $\pi_Y$ . If $J$ is a joint distribution on $X \times Y$ with marginal $P_Y = \delta_{\pi_Y} \circ J$ on $Y$ , then there exists a $\mathcal{P}$ arrow $f$ that makes the diagram
commute and satisfies
Moreover, this $f$ is the unique $\mathcal{P}$ -morphism with these properties, up to a set of $P_Y$ -measure zero.
Proof. Since $\Sigma_X$ and $\Sigma_Y$ are both countably generated, it follows that $\Sigma_{X \times Y}$ is countably generated as well. Let $\mathcal{G}$ be a countable generating set for $\Sigma_{X \times Y}$ and let $\mathcal{A}$ denote the (still countable) algebra it generates. For each $A \in \mathcal{A}$, define a measure $\mu_A$ on $Y$ by
Then $\mu_A$ is absolutely continuous with respect to $P_Y$ and hence we can let $\tilde{f}_A = \frac{d\mu_A}{dP_Y}$, the Radon–Nikodym derivative. For each $A \in \mathcal{A}$ this Radon–Nikodym derivative is unique up to a set of measure zero, say $\hat{A}$. Let $N = \cup_{A \in \mathcal{A}} \hat{A}$ and $E_1 = N^c$. Then $\tilde{f}_A|_{E_1}$ is unique for all $A \in \mathcal{A}$. Note that $\tilde{f}_{X \times Y} = 1$ and $\tilde{f}_\emptyset = 0$ on $E_1$. The condition $\tilde{f}_A \leq 1$ on $E_1$ for all $A \in \mathcal{A}$ then follows.
For all $B \in \Sigma_Y$ and any countable union $\cup_{i=1}^{\infty} A_i$ of disjoint sets of $\mathcal{A}$ we have
with the last equality following from the Monotone Convergence Theorem and the fact that all of the $\tilde{f}_{A_i}$ are nonnegative. From the uniqueness of the Radon–Nikodym derivative it follows
Since there are only countably many finite collections of disjoint sets of $\mathcal{A}$, we can find a set $E \subset E_1$ of $P_Y$-measure one such that the normalized set function $\tilde{f}(y): \mathcal{A} \rightarrow [0, 1]$ is finitely additive for every $y \in E$.
These facts altogether show there exists a set $E \in \Sigma_Y$ with $P_Y$ -measure one where for all $y \in E$ ,
- $0 \leq \tilde{f}_A(y) \leq 1 \quad \forall A \in \mathcal{A}$ ,
- $\tilde{f}_\emptyset(y) = 0$ and $\tilde{f}_{X \times Y}(y) = 1$, and
- for any finite collection $\{A_i\}_{i=1}^n$ of disjoint sets of $\mathcal{A}$ we have $\tilde{f}_{\cup_{i=1}^n A_i}(y) = \sum_{i=1}^n \tilde{f}_{A_i}(y)$.
Thus the set function $\tilde{f}: E \times \mathcal{A} \rightarrow [0, 1]$ is such that $\tilde{f}(y, \cdot)$ is a probability measure on the algebra $\mathcal{A}$ for each $y \in E$. By the Carathéodory extension theorem there exists a unique extension of $\tilde{f}(y, \cdot)$ to a probability measure $\hat{f}(y, \cdot): \Sigma_{X \times Y} \rightarrow [0, 1]$. Now define a set function $f: Y \times \Sigma_{X \times Y} \rightarrow [0, 1]$ by
Since each $A \in \Sigma_{X \times Y}$ can be written as the pointwise limit of an increasing sequence $\{A_n\}_{n=1}^\infty$ of sets $A_n \in \mathcal{A}$, it follows that $f_A = \lim_{n \rightarrow \infty} f_{A_n}$ is measurable. From this we also obtain the desired commutativity of the diagram
□

We can use the result of Theorem 1 to obtain a broader understanding of the situation.
Corollary 2. Let $X$ and $Y$ be countably generated measurable spaces and $J$ a joint distribution on $X \times Y$ with marginal distributions $P_X$ and $P_Y$ on $X$ and $Y$ , respectively. Then there exist $\mathcal{P}$ arrows $f$ and $g$ such that the diagram
commutes and
Proof. From Theorem 1 there exists a $\mathcal{P}$ arrow $Y \xrightarrow{f} X \times Y$ satisfying $J = f \circ P_Y$. Take the composite $\delta_{\pi_X} \circ f$ and note $(\delta_{\pi_X} \circ f)_U(y) = f_y(U \times Y)$, giving
Similarly using a $\mathcal{P}$ arrow $X \xrightarrow{g} X \times Y$ satisfying $J = g \circ P_X$ gives
□
Note that if the joint distribution $J$ is defined by a probability measure $P_X$ and a conditional $h: X \rightarrow Y$ using Diagram 8, then using the above result and notation it follows $h = \delta_{\pi_Y} \circ g$ .
3 The Bayesian Paradigm using $\mathcal{P}$
The categorical paradigm of Bayesian probability can be compactly summarized as follows. Let $D$ and $H$ be measurable spaces which model a data space and a hypothesis space, respectively. For example, $D$ might be a Euclidean space corresponding to some measurements that are being taken and $H$ a parameterization of some decision that needs to be made.
Figure 2: The generic Bayesian model.
The notation $\mathcal{S}$ is used to emphasize that we think of $\mathcal{S}$ as a sampling distribution on $D$. In the context of Bayesian probability the (perfect) probability measure $P_H$ is often called a prior probability or, for brevity, just a prior. Given a prior $P_H$ and sampling distribution $\mathcal{S}$, the joint distribution $J: 1 \rightarrow H \times D$ can be constructed using Definition 9. Using the marginal $P_D = \mathcal{S} \circ P_H$ on $D$, it follows by Corollary 2 that there exists an arrow $f: D \rightarrow H \times D$ satisfying $J = f \circ P_D$. Composing this arrow $f$ with the coordinate projection $\delta_{\pi_H}$ gives an arrow $\mathcal{I} = \delta_{\pi_H} \circ f: D \rightarrow H$, which we refer to as the inference map, and it satisfies
which is called the product rule.
With the above in mind we formally define a Bayesian model to consist of
- (i) two measurable spaces $H$ and $D$ representing hypotheses and data, respectively,
- (ii) a probability measure $P_H$ on the $H$ space called the prior probability,
- (iii) a $\mathcal{P}$ arrow $\mathcal{S}: H \rightarrow D$ called the sampling distribution.
The sampling distribution $\mathcal{S}$ and inference map $\mathcal{I}$ are often written as $P_{D|H}$ and $P_{H|D}$, respectively, although using the notation $P_{\cdot|\cdot}$ for all arrows of the category is redundant, since every arrow is a conditional probability, and it forces the subscripts to do all the work of distinguishing arrows.
Given this model and a measurement $\mu$ , which is often just a point mass on $D$ (i.e., $\mu = \delta_d: 1 \rightarrow D$ ), there is an update procedure that incorporates this measurement and the prior probability. Thus the measurement $\mu$ can itself be viewed as a probability measure on $D$ , and the “posterior” probability measure can be calculated as $\hat{P}_H = \mathcal{I} \circ \mu$ on $H$ provided the measurement $\mu$ is absolutely continuous with respect to $P_D$ , which we write as $\mu \ll P_D$ . Informally, this means that the observed measurement is considered “possible” with respect to prior assumptions.
Let us examine this condition $\mu \ll P_D$ more closely. We know from Theorem 1 that the inference map $\mathcal{I}$ is uniquely determined by $P_H$ and $\mathcal{S}$ only up to a set of $P_D$-measure zero. In general, there is no reason a priori that an arbitrary (perfect) probability measure $\mu: 1 \rightarrow D$ should be absolutely continuous with respect to $P_D$. If $\mu$ is not absolutely continuous with respect to $P_D$, then a different choice of inference map $\mathcal{I}'$ could yield a different posterior probability; that is, we could have $\mathcal{I} \circ \mu \neq \mathcal{I}' \circ \mu$. Thus we make the assumption that measurement probabilities on $D$ are absolutely continuous with respect to the probability $P_D$ on $D$.
In practice this condition often fails. For example, the probability measure $P_D$ may be a normal distribution on $\mathbb{R}$, so that $P_D(\{y\}) = 0$ for every point $y \in \mathbb{R}$. Dirac measurements then do not satisfy $\delta_y \ll P_D$. However, a Dirac measure can be approximated arbitrarily closely by sharply peaked normal distributions, which do satisfy the absolute continuity condition. Thus while the absolute continuity condition may not hold precisely, the error incurred by idealizing the measurement as a Dirac measure is negligible. It is therefore standard to assume that measurements belong to a class of probability measures on $D$ broad enough to approximate actual measurements and known to be absolutely continuous with respect to the prior.
In summary, the Bayesian process works in the following way. Given a prior probability $P_H$ and sampling distribution $\mathcal{S}$ one determines the inference map $\mathcal{I}$ . (For computational purposes the construction of the entire map $\mathcal{I}$ is in general not necessary.) Once a measurement $\mu: 1 \rightarrow D$ is taken, we then calculate the posterior probability by $\mathcal{I} \circ \mu$ . This updating procedure can be characterized by the diagram
where the solid lines indicate arrows given a priori, the dotted line indicates the arrow determined using Theorem 1, and the dashed lines indicate the updating after a measurement. Note that if there is no uncertainty in the measurement, then $\mu = \delta_x$ for some $x \in D$, but in practice there is usually some uncertainty in the measurements themselves. Consequently the posterior probability must be computed as a composite: the posterior probability of an event $A \in \Sigma_H$ given a measurement $\mu$ is $(\mathcal{I} \circ \mu)(A) = \int_D \mathcal{I}_A(x)\, d\mu$.
Following the calculation of the posterior probability, the sampling distribution is then updated, if required. The process can then repeat: using the posterior probability and the updated sampling distribution, the updated joint probability distribution on the product space is determined and the corresponding (updated) inference map constructed (for computational purposes the "entire map" $\mathcal{I}$ need not be determined if the measurements are deterministic). We can then continue to iterate as long as new measurements are received. For some problems, such as the standard urn problem with replacement of balls, the sampling distribution does not change from iteration to iteration, but the inference map is updated since the posterior probability on the hypothesis space changes with each measurement.
Remark 3. Note that for countable spaces $X$ and $Y$ the compatibility condition reduces to the standard Bayes equation, since for any $x \in X$ the singleton $\{x\} \in \Sigma_X$ and similarly any element $y \in Y$ gives $\{y\} \in \Sigma_Y$, so that the joint distribution $J: 1 \rightarrow X \times Y$ on $\{x\} \times \{y\}$ reduces to the equation
which in more familiar notation is the Bayesian equation
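In the countable case the inference map can be computed directly from the joint by this Bayes equation. The following sketch uses a made-up finite prior and sampling distribution (none of the numbers come from the text) and checks the compatibility condition.

```python
# Theorem 1 in the countable case: on finite spaces the inference map is
# computed pointwise by Bayes' rule.  The numbers below are illustrative.
X = ["x1", "x2"]             # hypotheses
Y = ["y1", "y2"]             # data
P_X = {"x1": 0.25, "x2": 0.75}
S = {"x1": {"y1": 0.8, "y2": 0.2},   # sampling distribution S(· | x)
     "x2": {"y1": 0.4, "y2": 0.6}}

J = {(x, y): P_X[x] * S[x][y] for x in X for y in Y}   # joint distribution
P_Y = {y: sum(J[(x, y)] for x in X) for y in Y}        # marginal on Y

def infer(y):
    """The inference map I(y): the regular conditional of J given y."""
    return {x: J[(x, y)] / P_Y[y] for x in X}

# The compatibility condition (product rule):
# I({x} | y) · P_Y({y}) = S({y} | x) · P_X({x}) = J({x} × {y}).
for x in X:
    for y in Y:
        assert abs(infer(y)[x] * P_Y[y] - S[x][y] * P_X[x]) < 1e-12
```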
4 Elementary applications of Bayesian probability
Before proceeding to show how the category $\mathcal{P}$ can be applied to ML where the unknowns are functions, we illustrate its use to solve inference, prediction, and decision processes in the more familiar setting where the unknown parameter(s) are real values. We present two elementary problems illustrating basic model building using categorical diagrams, much like that used in probabilistic graphical models for Bayesian networks, which can serve to clarify the modeling aspect of any probabilistic problem.
To illustrate the inference-sampling distribution relationship and how we make computations in the category $\mathcal{P}$, we consider first an urn problem where we have discrete $\sigma$-algebras. The discreteness condition is not critical, as we will eventually see; it only makes the analysis and computational aspects easier.
Example 4. Million dollar draw.11
Figure: the two urns. Urn 1 contains two blue (B) and two red (R) balls; Urn 2 contains one blue (B) and two red (R) balls.
You are given two draws and if you pull out a red ball you win a million dollars. You are unable to see the two urns so you don't know which urn you are drawing from and the draw is done without replacement. The $\mathcal{P}$ diagram for both inference and calculating sampling distributions is given by
11 This problem is taken from Peter Green's tutorial on Bayesian Inference, which can be viewed at http://videolectures.net/mlss2011\_green\_bayesian.

where the dashed arrows indicate morphisms to be calculated rather than morphisms determined by modeling,
and
The sampling distribution is the binomial distribution given by
Suppose that on our first draw, we draw from one of the urns (which one is unknown) and draw a blue ball. We ask the following questions:
- (Inference) What is the probability that we made the draw from Urn 1 (Urn 2)?
- (Prediction) What is the probability of drawing a red ball on the second draw (from the same urn)?
- (Decision) Given you have drawn a blue ball on the first draw should you switch urns to increase the probability of drawing a red ball?
To solve these problems, we implicitly or explicitly construct the joint distribution $J$ via the standard construction given $P_U$ and the conditional $\mathcal{S}$, and then construct the inference map by requiring that the compatibility condition, i.e., the integral equation
is satisfied. Since our problem is discrete the integral reduces to a sum.
Our first step is to calculate the prior on $B$ which is the composite $P_B = \mathcal{S} \circ P_U$ , from which we calculate
and similarly
To solve the inference problem, we need to compute the values of the inference map $\mathcal{I}$ using equation 17. This amounts to computing the joint distribution on all possible measurable sets,
which reduce to the equations
Substituting values for $\mathcal{S}$, $P_B$, and $P_U$ one determines
which answers question (1). The odds that one drew the blue ball from Urn 1 relative to Urn 2 are $\frac{8}{15}$ , so it is almost twice as likely that one made the draw from the second urn.
The Prediction Problem. Here we implicitly (or explicitly) need to construct the product space $U \times B_1 \times B_2$ where $B_i$ represents the $i^{th}$ drawing of a ball from the same (unknown) urn. To do this we use the basic construction for joint distributions using a regular conditional probability, $\mathcal{S}_2$, which expresses the probability of drawing either a red or a blue ball from the same urn as the first draw. This conditional probability is given by
Now we construct the joint distribution $K$ on the product space $(U \times B_1) \times B_2$
To answer the prediction question we calculate the odds of drawing a red versus a blue ball. Thus
where the right hand side follows from the definition (construction) of the iterated product space $(U \times B_1) \times B_2$ . The computation of the expression 18 yields
Similarly $K(U \times \{b\} \times \{b\}) = \frac{12}{40}$. So the odds are
The Decision Problem. To answer the decision problem we need to consider the conditional probability of switching urns on the second draw, which leads to the conditional
given by
Carrying out the same computation as above we find the joint distribution $\hat{K}$ on the product space $(U \times B_1) \times B_2$ constructed from $J$ and $\hat{S}_2$ yields
which shows that it doesn't matter whether you switch or not - you get the same probability of drawing a red ball.
The probability of drawing a blue ball is
so the odds of drawing a blue ball outweigh the odds of drawing a red ball by the ratio $\frac{12}{11}$ . The odds are against you.
Here is an example illustrating that the regular conditional probabilities (inference or sampling distributions) are defined only up to sets of measure zero.
Example 5. We have a rather bland deck of three cards as shown
Figure: the three cards, front and back. Card 1 is red on both faces, Card 2 has one red face and one green face, and Card 3 is green on both faces.
We shuffle the deck, pull out a card and expose one face which is red.12 The prediction question is
12 This problem is taken from David MacKay's tutorial on Information Theory, which can be viewed at http://videolectures.net/mlss09uk\_mackay\_it/.

What is the probability the other side of the card is red?
To answer this note that this card problem is identical to the urn problem with urns being cards and balls becoming the colored sides of each card. Thus we have an analogous model in $\mathcal{P}$ for this problem. Let
We have the $\mathcal{P}$ diagram
with the sampling distribution given by
The prior on $C$ is $P_C = \frac{1}{3}\delta_1 + \frac{1}{3}\delta_2 + \frac{1}{3}\delta_3$ . From this we can construct the joint distribution on $C \times F$
Using
we find
Now, as in the urn problem, to predict the next draw (flip of the card) it is necessary to add another measurable set $F_2$ and conditional probability $\mathcal{S}_2$, and to construct the product diagram and joint distribution $K$. The twist that now arises is that the conditional probability $\mathcal{S}_2$ is not uniquely defined: what are the values
The answer is that it does not matter what we put down for these values, since they have measure $J(\{1\} \times \{g\}) = 0$. We can still compute the desired quantity of interest, proceeding with arbitrarily chosen values on these point sets of measure zero. Thus we choose
We chose the arbitrary values so that $\mathcal{S}_2$ is a deterministic mapping, which seems appropriate since flipping a given card uniquely determines the color on the other side.
Now we can solve the prediction problem by computing the joint measure values
and
so it is twice as likely to observe a red face upon flipping the card than seeing a green face. Converting the odds of $\frac{r}{g} = \frac{2}{1}$ to a probability gives $Pr(\{r\} \mid \{r\}) = \frac{2}{3}$.
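The card example can also be checked by brute-force enumeration. The sketch below assumes the deck described above (one red/red, one red/green, one green/green card), with the card and the exposed face chosen uniformly; this is consistent with $J(\{1\} \times \{g\}) = 0$.

```python
# Enumerate (probability, shown face, hidden face) for the three-card deck:
# card 1 red/red, card 2 red/green, card 3 green/green (assumed from the text).
from fractions import Fraction

cards = {1: ("r", "r"), 2: ("r", "g"), 3: ("g", "g")}

# Each card is drawn with probability 1/3 and either face is shown with 1/2.
outcomes = []   # (probability, shown, hidden)
for c, (f1, f2) in cards.items():
    outcomes.append((Fraction(1, 6), f1, f2))
    outcomes.append((Fraction(1, 6), f2, f1))

p_shown_red = sum(p for p, shown, _ in outcomes if shown == "r")
p_both_red = sum(p for p, shown, hidden in outcomes if shown == "r" and hidden == "r")
assert p_shown_red == Fraction(1, 2)
assert p_both_red / p_shown_red == Fraction(2, 3)   # Pr({r} | {r}) = 2/3
```

Exact rational arithmetic via `fractions.Fraction` avoids any floating-point comparisons.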
To test one's understanding of the categorical approach to Bayesian probability we suggest the following problem.

Example 6. The Monty Hall Problem. You are a contestant in a game show in which a prize is hidden behind one of three curtains. You will win a prize if you select the correct curtain. After you have picked one curtain but before the curtain is lifted, the emcee lifts one of the other curtains, revealing a goat, and asks if you would like to switch from your current selection to the remaining curtain. How will your chances change if you switch?
There are three components which need to be modeled in this problem:
The prior on $D$ is $P_D = \frac{1}{3}\delta_{d_1} + \frac{1}{3}\delta_{d_2} + \frac{1}{3}\delta_{d_3}$ . Your selection of a curtain, say curtain 1, gives the deterministic measure $P_C = \delta_{C_1}$ . There is a conditional probability from the product space $D \times C$ to $O$
where the conditional probability $\mathcal{S}((i, j), \{k\})$ represents the probability that Monty opens curtain $k$ given that the prize is behind curtain $i$ and you have chosen curtain $j$. If you have chosen curtain 1 then we have the partial data given by
Complete the table, as necessary, to compute the inference conditional $D \times C \xleftarrow{\mathcal{I}} O$, and conclude that if Monty opens either curtain 2 or 3 it is in your best interest to switch curtains.
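One way to check the completed table is to enumerate the joint distribution directly and invert it. The sketch below assumes the standard protocol (Monty never opens your curtain or the prize curtain, choosing uniformly when he has a choice), with your choice fixed at curtain 1.

```python
# Monty Hall via the joint over (prize curtain, opened curtain), choice = 1.
from fractions import Fraction

third, half = Fraction(1, 3), Fraction(1, 2)
# S[(prize, chosen)][opened]: Monty avoids your curtain and the prize curtain.
S = {(1, 1): {2: half, 3: half},
     (2, 1): {3: Fraction(1)},
     (3, 1): {2: Fraction(1)}}

P_prize = {1: third, 2: third, 3: third}   # uniform prior on the prize curtain
J = {(i, k): P_prize[i] * S[(i, 1)].get(k, Fraction(0))
     for i in (1, 2, 3) for k in (2, 3)}

def infer(k):
    """Posterior over the prize curtain given that Monty opened curtain k."""
    z = sum(J[(i, k)] for i in (1, 2, 3))
    return {i: J[(i, k)] / z for i in (1, 2, 3)}

# If Monty opens curtain 3, switching to curtain 2 doubles your chances.
assert infer(3)[1] == Fraction(1, 3)
assert infer(3)[2] == Fraction(2, 3)
```

The symmetric case (Monty opens curtain 2) gives the same 1/3 versus 2/3 split in favor of switching.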
5 The Tensor Product
Given any function $f: X \rightarrow Y$ the graph of $f$ is defined as the set function
By our previous notation $\Gamma_f = \langle Id_X, f \rangle$ . If $g: Y \rightarrow X$ is any function we also refer to the set function
as a graph function.
Any fixed $x \in X$ determines a constant function $\bar{x}: Y \rightarrow X$ sending every $y \in Y$ to $x$. These functions are always measurable and consequently determine "constant" graph functions $\Gamma_{\bar{x}}: Y \rightarrow X \times Y$. Similarly, every fixed $y \in Y$ determines a constant graph function $\Gamma_{\bar{y}}: X \rightarrow X \times Y$. Together, these constant graph functions can be used to define a $\sigma$-algebra on the set $X \times Y$ which is finer (larger) than the product $\sigma$-algebra $\Sigma_{X \times Y}$. Let $X \otimes Y$ denote the set $X \times Y$ endowed with the largest $\sigma$-algebra structure such that all the constant graph functions $\Gamma_{\bar{x}}: Y \rightarrow X \otimes Y$ and $\Gamma_{\bar{y}}: X \rightarrow X \otimes Y$ are measurable. We say this $\sigma$-algebra is coinduced by the maps $\{\Gamma_{\bar{x}}: Y \rightarrow X \times Y\}_{x \in X}$ and $\{\Gamma_{\bar{y}}: X \rightarrow X \times Y\}_{y \in Y}$. Explicitly, this $\sigma$-algebra is given by
where for any function $f: W \rightarrow Z$ ,
This is in contrast to the smallest $\sigma$ -algebra on $X \times Y$ , defined in Section 2.1 so that the two projection maps ${\pi_X: X \times Y \rightarrow X, \pi_Y: X \times Y \rightarrow Y}$ are measurable. Such a $\sigma$ -algebra is said to be induced by the projection maps, or simply referred to as the initial $\sigma$ -algebra.
The following result on coinduced $\sigma$ -algebras is used repeatedly.
Lemma 7. Let the $\sigma$ -algebra of $Y$ be coinduced by a collection of maps ${f_i: X_i \rightarrow Y}_{i \in I}$ . Then any map $g: Y \rightarrow Z$ is measurable if and only if the composition $g \circ f_i$ is measurable for each $i \in I$ .
Proof. Consider the diagram
If $B \in \Sigma_Z$ then, by the definition of the coinduced $\sigma$-algebra, $g^{-1}(B) \in \Sigma_Y$ if and only if $f_i^{-1}(g^{-1}(B)) = (g \circ f_i)^{-1}(B) \in \Sigma_{X_i}$ for every $i \in I$. $\square$
This result is used frequently when $Y$ in the above diagram is replaced by a tensor product space $X \otimes Y$ . For example, using this lemma it follows that the projection maps$\pi_Y: X \otimes Y \rightarrow Y$ and $\pi_X: X \otimes Y \rightarrow X$ are both measurable because the diagrams in Figure 3 commute.
Figure 3: The commutativity of these diagrams, together with the measurability of the constant functions and constant graph functions, implies the projection maps $\pi_X$ and $\pi_Y$ are measurable.
By the measurability of the projection maps and the universal property of the product, it follows that the identity mapping on the set $X \times Y$ yields a measurable function
called the restriction of the $\sigma$ -algebra. In contrast, the identity function $X \times Y \rightarrow X \otimes Y$ is not necessarily measurable. Given any probability measure $P$ on $X \otimes Y$ the restriction mapping induces the pushforward probability measure $\delta_{id} \circ P = P(id^{-1}(\cdot))$ on the product $\sigma$ -algebra.
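Lemma 7 and the notion of a coinduced $\sigma$-algebra can be illustrated concretely on finite spaces; the sets, $\sigma$-algebra, and maps below are small made-up examples.

```python
# Finite illustration of a coinduced σ-algebra and Lemma 7.
from itertools import chain, combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

X1, Y = {0, 1, 2, 3}, {"a", "b", "c"}
# Σ_{X1}: the σ-algebra generated by the partition {{0,1}, {2,3}}.
sigma_X1 = {frozenset(), frozenset({0, 1}), frozenset({2, 3}), frozenset(X1)}
f1 = {0: "a", 1: "a", 2: "b", 3: "c"}       # f1 : X1 → Y

def preimage(f, B):
    return frozenset(x for x in f if f[x] in B)

# The σ-algebra on Y coinduced by f1: exactly those B with f1⁻¹(B) ∈ Σ_{X1}.
sigma_Y = {B for B in powerset(Y) if preimage(f1, B) in sigma_X1}
assert frozenset({"a"}) in sigma_Y
assert frozenset({"b"}) not in sigma_Y      # would split the block {2, 3}

# Lemma 7: g : Y → Z is measurable iff g ∘ f1 is (here Z = {0,1}, Σ_Z = 2^Z).
def measurable(f, sigma_dom, Z_subsets):
    return all(preimage(f, B) in sigma_dom for B in Z_subsets)

Z_subsets = powerset({0, 1})
g = {"a": 0, "b": 1, "c": 1}                # constant on the coinduced blocks
gf1 = {x: g[f1[x]] for x in X1}             # the composite g ∘ f1
assert measurable(g, sigma_Y, Z_subsets) == measurable(gf1, sigma_X1, Z_subsets)
```

Here the coinduced $\sigma$-algebra works out to $\{\emptyset, \{a\}, \{b, c\}, Y\}$, the largest family whose preimages respect the blocks of $\Sigma_{X_1}$.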
5.1 Graphs of Conditional Probabilities
The tensor product of two probability measures $P: 1 \rightarrow X$ and $Q: 1 \rightarrow Y$ was defined in Equations 5 and 6 as the joint distribution on the product $\sigma$ -algebra by either of the expressions
and
which are equivalent on the product $\sigma$-algebra. Here we distinguish a left tensor and a right tensor because these definitions can be extended to the tensor $\sigma$-algebra, where in general the equivalence of the two expressions no longer holds. The definitions extend to conditional probability measures $P: Z \rightarrow X$ and $Q: Z \rightarrow Y$ trivially by conditioning on a point $z \in Z$,
and
which are equivalent on the product $\sigma$ -algebra but not on the tensor $\sigma$ -algebra. However in the special case when $Z = X$ and $P = 1_X$ , then Equations 21 and 22 do coincide on $\Sigma_{X \otimes Y}$ because by Equation 21
while by Equation 22
In this case we denote the common conditional by $\Gamma_Q$ , called the graph of Q by analogy to the graph of a function, and this map gives the commutative diagram in Figure 4.
Figure 4: The tensor product of a conditional with an identity map in $\mathcal{P}$ .
The commutativity of the diagram in Figure 4 follows from
and
5.2 A Tensor Product of Conditionals
Given any conditional $P: Z \rightarrow Y$ in $\mathcal{P}$ we can define a tensor product $1_X \otimes P$ by
which makes the diagram in Figure 5 commute and justifies the notation $1_X \otimes P$ (and explains also why the notation $\Gamma_Q$ for the graph map was used to distinguish it from this map).
Figure 5: The tensor product of conditional $1_X$ and $P$ in $\mathcal{P}$ .
This tensor product $1_X \otimes P$ essentially comes from the diagram
where given a measurable set $\mathcal{A} \in \Sigma_{X \otimes Y}$ one pulls it back under the constant graph function $\Gamma_{\bar{x}}$ and then applies the conditional $P$ to the pair $(\Gamma_{\bar{x}}^{-1}(\mathcal{A}) \mid z)$ .
5.3 Symmetric Monoidal Categories
A category $\mathcal{C}$ is said to be a monoidal category if it possesses the following three properties:
- There is a bifunctor
which is associative up to isomorphism,
where $Id_{\mathcal{C}}$ is the identity functor on $\mathcal{C}$ . Hence for every triple $X, Y, Z$ of objects, there is an isomorphism
which is natural in $X, Y, Z$ . This condition is called the associativity axiom.
- There is an object $I \in \mathcal{C}$ such that for every object $X \in_{ob} \mathcal{C}$ there is a left unit isomorphism
and a right unit isomorphism
These two conditions are called the unity axioms.
- For every quadruple of objects $X, Y, W, Z$ the diagram
commutes. This is called the associativity coherence condition.
If $\mathcal{C}$ is a monoidal category under a bifunctor $\square$ and identity 1, it is denoted $(\mathcal{C}, \square, 1)$. A monoidal category $(\mathcal{C}, \square, 1)$ is symmetric if for every pair of objects $X, Y$ there exists an isomorphism
which is natural in $X$ and $Y$ , and the three diagrams in Figure 6 commute.