Generalizing the No-U-Turn Sampler to Riemannian Manifolds

Michael Betancourt

Applied Statistics Center, Columbia University, New York, NY 10027, USA*

Hamiltonian Monte Carlo provides efficient Markov transitions at the expense of introducing two free parameters: a step size and total integration time. Because the step size controls discretization error it can be readily tuned to achieve certain accuracy criteria, but the total integration time is left unconstrained. Recently Hoffman and Gelman proposed a criterion for tuning the integration time in certain systems with their No U-Turn Sampler, or NUTS. In this paper I investigate the dynamical basis for the success of NUTS and generalize it to Riemannian Manifold Hamiltonian Monte Carlo.

Taking advantage of a natural mapping between forms on a manifold and measures on a probability space, Hamiltonian Monte Carlo (HMC) generates Metropolis proposals by simulating Hamiltonian flow. The autocorrelation of the resulting Markov chain depends on how long the flow is integrated, but there are no conditions on the optimal integration time that minimizes autocorrelation for a general target distribution. When the target distribution is unimodal, however, there is a natural criterion: turning points of the resultant trajectories. The No U-Turn Sampler identifies these turning points for Euclidean manifolds, but the criterion begins to fail when applied to more complex distributions and Riemannian manifolds. Appealing to the geometry of HMC, however, admits a straightforward generalization of the No U-Turn Sampler that is not only amenable to Riemannian manifolds but also isolates the turning points in more complicated, non-convex target distributions.

HAMILTONIAN MONTE CARLO

Given a target density $\pi(\mathbf{q})$ , Hamiltonian Monte Carlo [1–3] yields Metropolis proposals for the extended density,

$\pi(\mathbf{p}, \mathbf{q}) = \exp[-H(\mathbf{p}, \mathbf{q})],$

where the Hamiltonian, $H(\mathbf{p}, \mathbf{q})$ , is defined as

$H(\mathbf{p}, \mathbf{q}) = T(\mathbf{p}, \mathbf{q}) + V(\mathbf{q}),$

with the potential energy, $V(\mathbf{q})$, defined by the target density,

$V(\mathbf{q}) = -\log \pi(\mathbf{q}).$

Up to a few weak constraints, the kinetic energy, $T(\mathbf{p}, \mathbf{q})$, can be tuned to improve the proposals.

In particular, HMC defines the potential as a function on some manifold $\mathcal{M}$ with coordinate functions $\mathbf{q}$, and the Hamiltonian as a scalar function on the cotangent bundle, $T^*\mathcal{M}$, with coordinate functions $\{\mathbf{p}, \mathbf{q}\}$. At a point $S \in T^*\mathcal{M}$ the Hamiltonian is then more formally written as

$H (S) = T (S) + V (R),$

where $R = \pi(S) \in \mathcal{M}$ ¹.

Given the natural symplectic geometry of $T^*\mathcal{M}$, the Hamiltonian defines a vector field, and integrating along the vector field for some time $t$ generates a flow that maps any point, $S_0 \in T^*\mathcal{M}$, to some final point, $S_t \in T^*\mathcal{M}$, that serves as the proposal.

Integrating for only a short time yields highly correlated points and random walk behavior, but trajectories that are integrated for too long may double back on themselves and waste computational resources. In order to optimize the proposal, we need to determine exactly how long to evolve the trajectories in order to maximize the distance between proposals and minimize the autocorrelation of the chain.
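As a rough illustration of the trade-off described above, the following sketch approximates the Hamiltonian flow with a leapfrog integrator and compares a short and a long integration time. The unit mass matrix, the standard normal target, the step size, and the names `leapfrog` and `hamiltonian` are all assumptions made for the example, not part of the original text.

```python
import numpy as np

def leapfrog(q, p, grad_V, step_size, n_steps):
    """Approximate the flow of H = p.p/2 + V(q) (unit mass matrix assumed)."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    q, p = np.array(q, dtype=float), np.array(p, dtype=float)
    p = p - 0.5 * step_size * grad_V(q)        # initial half step in momentum
    for i in range(n_steps):
        q = q + step_size * p                  # full step in position
        if i < n_steps - 1:
            p = p - step_size * grad_V(q)      # full step in momentum
    p = p - 0.5 * step_size * grad_V(q)        # final half step in momentum
    return q, p

def hamiltonian(q, p, V):
    """Total energy for a unit mass matrix."""
    return 0.5 * p @ p + V(q)

# Assumed toy target: standard normal, so V(q) = q.q / 2
V = lambda q: 0.5 * q @ q
grad_V = lambda q: q

q0, p0 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
q_short, p_short = leapfrog(q0, p0, grad_V, 0.01, 10)    # t = 0.1
q_long, p_long = leapfrog(q0, p0, grad_V, 0.01, 314)     # t = 3.14, about half a period

energy_error = abs(hamiltonian(q_long, p_long, V) - hamiltonian(q0, p0, V))
print(np.linalg.norm(q_short - q0), np.linalg.norm(q_long - q0), energy_error)
```

In this toy case the short trajectory barely moves from its starting point, while the longer one reaches roughly the opposite side of the orbit. Integrating further would bring it back toward where it started.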

HARMONIC MOTION

In many applications, a Markov chain is meant to sample from a single mode of the target distribution, which manifests as a well in the potential energy. For small enough values of the Hamiltonian, the trajectories become bound in the neighborhood of the mode and execute harmonic motion.

For example, consider Euclidean Manifold Hamiltonian Monte Carlo (EMHMC),

$H(S) = \frac{1}{2} \sum_{jk} p_j(S) p_k(S) (M^{-1})^{jk} + V(R),$

with a quadratic potential,

$V(R) = \frac{1}{2} \sum_{jk} q^j(R) q^k(R) W_{jk}.$

¹ Here $\pi$ is the natural projection operator $\pi : T^*\mathcal{M} \rightarrow \mathcal{M}$ on the cotangent bundle, not the notation for probability density used above. A point $R \in \mathcal{M}$ is identified by the same position coordinate functions $\mathbf{q}$ in $T^*\mathcal{M}$, so that $\mathbf{q}(R) = \mathbf{q}(S)$. The values of the momentum coordinate functions identify a unique covector on the base manifold, $\mathbf{p}(R) \in T_R^*\mathcal{M}$.

* betanalpha@gmail.com

The trajectories from this Hamiltonian execute simple harmonic motion [4] along the directions $\mathbf{N}_j = \sqrt{\mathbf{M}} \cdot \hat{\mathbf{n}}_j$, where the $\hat{\mathbf{n}}_j$ are the eigenvectors of $\sqrt{\mathbf{M}^{-1}} \cdot \mathbf{W} \cdot \sqrt{\mathbf{M}^{-1}}$. In terms of the coordinate functions²,

$\mathbf{q}(S_t) = A \sum_j \exp[i(\omega_j t + \phi_j + \pi/2)] \mathbf{N}^j \quad (1)$

$\mathbf{p}(S_t) = A \sum_j \exp[i(\omega_j t + \phi_j)] \mathbf{N}_j. \quad (2)$

with the frequencies given by the respective eigenvalues, $\omega_j = \sqrt{\lambda_j}$ , and the amplitude, $A$ , and phases, $\phi_j$ , determined by the initial conditions.

The motion along each direction $\mathbf{N}_j$ turns back on itself after reaching a turning point, where the projection of the momentum vanishes in the process of changing sign. Because the extent of the trajectory is limited by the oscillation with the longest period, $T_j = 2\pi/\omega_j$ , maximizing distance is equivalent to integrating until reaching the turning point of the longest oscillation, or TPOLO for succinctness. For a more general Hamiltonian the motion will be more complex but still harmonic, and the TPOLO still determines the optimal integration time.

Sampling from a Gaussian distribution $\mathcal{N}(\mathbf{0}, \Sigma)$ with

$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},$

and $\rho = 0.95$ , for example, yields the potential

$V(R) = \frac{1}{2} \sum_{jk} q^j(R) q^k(R) (\Sigma^{-1})_{jk}. \quad (3)$

The resulting Hamiltonian dynamics execute simple harmonic motion, with the bulk of the motion turning back on itself at the TPOLOs (Figure 1).

Identifying the slowest turning point during a trajectory, however, is not an easy task.
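For the Gaussian example above, the eigenfrequencies $\omega_j = \sqrt{\lambda_j}$ can be computed directly from $\sqrt{\mathbf{M}^{-1}} \cdot \mathbf{W} \cdot \sqrt{\mathbf{M}^{-1}}$ with $\mathbf{W} = \Sigma^{-1}$. The sketch below does this for the two mass matrices considered in Figure 1. The helper name `eigenfrequencies` is an assumption made for illustration.

```python
import numpy as np

rho = 0.95
Sigma = np.array([[1.0, rho], [rho, 1.0]])
W = np.linalg.inv(Sigma)                      # matrix of the quadratic potential (3)

def eigenfrequencies(M, W):
    """Frequencies sqrt(lambda_j) of sqrt(M^-1) W sqrt(M^-1), in ascending order."""
    # Symmetric inverse square root of M from its eigendecomposition
    vals, vecs = np.linalg.eigh(M)
    M_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    lam = np.linalg.eigvalsh(M_inv_sqrt @ W @ M_inv_sqrt)
    return np.sqrt(lam)

omega_unit = eigenfrequencies(np.eye(2), W)   # M = identity
omega_dense = eigenfrequencies(W, W)          # M = Sigma^{-1}
longest_period = 2.0 * np.pi / omega_unit.min()

print(omega_unit, omega_dense, longest_period)
```

With the unit mass matrix the two frequencies are $(1 \pm \rho)^{-1/2}$, and the slower one sets the longest period. With $\mathbf{M} = \Sigma^{-1}$ both frequencies are equal to one.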

THE NO U-TURN SAMPLER

In HMC, naively terminating a trajectory based on some dynamic criterion sacrifices detailed balance and threatens convergence to the target distribution. The No U-Turn Sampler, or NUTS, admits a dynamic termination criterion that preserves detailed balance by using Hamiltonian dynamics to build a binary tree from which the transition is sampled [5].

The termination criterion itself attempts to identify the longest turning point by taking advantage of the geometry of EMHMC. Here the base manifold, $\mathcal{M}$ , is endowed with a Euclidean geometry where the distance between two points, $\mathbf{r}(R_1, R_2) = \mathbf{q}(R_1) - \mathbf{q}(R_2)$ , is a well-defined vector. Because the momentum is tangential to the trajectory when $\mathbf{M} = \mathbb{I}$ , the contraction of the distance onto the current momentum vanishes³,

$\sum_j p_j(R_t) r^j(R_t, R_0) = 0,$

exactly when the trajectory should have started to turn back on itself, and consequently provides a valid termination criterion.

In practice, the success of the NUTS criterion is not limited to the case of a unit mass matrix. Consider the expansion of the distance as an integral over the trajectory,

$\begin{aligned} \mathbf{r}(R_t, R_0) &= \mathbf{q}(R_t) - \mathbf{q}(R_0) \\ &= \int_0^t d\tau \mathbf{M}^{-1} \mathbf{p}(R_\tau) \\ &= \mathbf{M}^{-1} \int_0^t d\tau \mathbf{p}(R_\tau) \\ &= t \mathbf{M}^{-1} \boldsymbol{\rho}(R_t), \end{aligned}$

where

$\boldsymbol{\rho}(R_t) \equiv \frac{1}{t} \int_0^t d\tau \mathbf{p}(R_\tau).$

In terms of $\boldsymbol{\rho}(R_t)$ , the NUTS criterion becomes

$\begin{aligned} \sum_j p_j(R_t) r^j(R_t, R_0) &= 0 \\ t \sum_{jk} p_j(R_t) \rho_k(R_t) (M^{-1})^{jk} &= 0 \\ \sum_{jk} p_j(R_t) \rho_k(R_t) (M^{-1})^{jk} &= 0 \\ \langle \mathbf{p}(R_t), \boldsymbol{\rho}(R_t) \rangle_{\mathbf{M}^{-1}} &= 0, \end{aligned}$

where the inner product on $\mathcal{M}$ is defined by

$\langle \mathbf{a}, \mathbf{b} \rangle_{\Lambda} \equiv \sum_{jk} a_j b_k \Lambda^{jk}.$
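The chain of equalities above can be checked numerically. In a leapfrog integrator each position update uses a half-step momentum, so the displacement equals the step size times $\mathbf{M}^{-1}$ applied to the sum of those momenta, and the two forms of the criterion agree to floating-point precision. The sketch below assumes a constant step size, an arbitrary non-unit mass matrix, and the hypothetical helper `leapfrog_with_momentum_average`.

```python
import numpy as np

def leapfrog_with_momentum_average(q0, p0, grad_V, M_inv, step_size, n_steps):
    """Leapfrog for H = p.M_inv.p/2 + V(q); also returns the time-averaged momentum."""
    q, p = np.array(q0, dtype=float), np.array(p0, dtype=float)
    p_sum = np.zeros_like(p)
    for _ in range(n_steps):
        p_half = p - 0.5 * step_size * grad_V(q)
        q = q + step_size * M_inv @ p_half
        p = p_half - 0.5 * step_size * grad_V(q)
        p_sum += p_half                       # midpoint sample of p(tau)
    rho = p_sum / n_steps                     # (1/t) * sum of dt * p for constant dt
    return q, p, rho

corr = 0.95
W = np.linalg.inv(np.array([[1.0, corr], [corr, 1.0]]))
M = np.array([[2.0, 0.3], [0.3, 1.0]])       # assumed non-unit mass matrix
M_inv = np.linalg.inv(M)
grad_V = lambda q: W @ q

q0, p0 = np.array([0.5, -0.2]), np.array([1.0, 0.7])
step_size, n_steps = 0.01, 200
q_t, p_t, rho = leapfrog_with_momentum_average(q0, p0, grad_V, M_inv, step_size, n_steps)

t = step_size * n_steps
nuts_distance_form = p_t @ (q_t - q0)         # sum_j p_j r^j
nuts_momentum_form = t * p_t @ M_inv @ rho    # t <p, rho>_{M^-1}
print(nuts_distance_form, nuts_momentum_form)
```

Because the factor $t$ is positive, both forms change sign at the same point along the trajectory.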

² The following uses complex exponential notation with an implicit projection to the real axis assumed. For example, $\exp[i\omega t]$ formally represents $\mathcal{R}(\exp[i\omega t]) = \mathcal{R}(\cos(\omega t) + i \sin(\omega t)) = \cos(\omega t)$ .

³ Here we're taking advantage of the fact that a point $S \in T^*\mathcal{M}$ identifies both a point, $R = \pi(S) \in \mathcal{M}$, and a covector, $\mathbf{p}(R) \in T_R^*\mathcal{M}$, to construct the NUTS criterion on the base manifold, $\mathcal{M}$.

FIG. 1. Given the quadratic potential (3), the Hamiltonian trajectories execute simple harmonic motion with (a) eigenfrequencies $\omega = (1 \pm \rho)^{-\frac{1}{2}}$ for $\mathbf{M} = \mathbb{I}$ and (b) degenerate eigenfrequencies $\omega = 1$ for $\mathbf{M} = \Sigma^{-1}$, with phases and amplitude determined by the initial position and momentum. The trajectories turn back on themselves at the turning points of the longest oscillation, which align with the semi-major axis of the potential.

Now, for simple harmonic motion (2),

$\begin{aligned} \boldsymbol{\rho}(R_t) &= \frac{1}{t} \int_0^t d\tau A \sum_j \exp[i(\omega_j \tau + \phi_j)] \mathbf{N}_j \\ &= A \sum_j \frac{1}{t} \int_0^t d\tau \exp[i(\omega_j \tau + \phi_j)] \mathbf{N}_j \\ &= A \sum_j \frac{\exp[i(\omega_j t + \phi_j)] - \exp[i\phi_j]}{i\omega_j t} \mathbf{N}_j, \end{aligned}$

so that⁴

$\begin{aligned} 0 &= \langle \mathbf{p}(R_t), \rho(R_t) \rangle_{\mathbf{M}^{-1}} \\ &= A^2 \sum_{jk} \frac{1}{i\omega_j t} (\exp[i(\omega_j t + \phi_j)] - \exp[i\phi_j]) \\ &\quad \times \exp[i(\omega_k t + \phi_k)] \langle \mathbf{N}_j, \mathbf{N}_k \rangle_{\mathbf{M}^{-1}} \\ &= A^2 \sum_j \frac{(\exp[i(\omega_j t + \phi_j)] - \exp[i\phi_j]) \exp[i(\omega_j t + \phi_j)]}{i\omega_j t} \\ &= A^2 \sum_j \frac{\exp[i\phi_j]}{i\omega_j t} (\exp[i\omega_j t] - 1) \exp[i(\omega_j t + \phi_j)]. \end{aligned}$

⁴ Note that the $\mathbf{N}_j$ are orthogonal with respect to the metric $\mathbf{M}^{-1}$ .

As the trajectory evolves, higher-frequency components begin to decay as $\omega_j t \gg 1$, isolating the slowest oscillation or at least a set of slow oscillations with degenerate frequencies⁵. In either case the criterion reduces to

$\begin{aligned} 0 &= \langle \mathbf{p}(R_t), \rho(R_t) \rangle_{\mathbf{M}^{-1}} \\ &= \frac{A^2}{i\omega t} (\exp[i\omega t] - 1) \exp[i\omega t] \sum_j \exp[i2\phi_j]. \end{aligned}$

The first term in parentheses vanishes after every complete oscillation; the more interesting zero occurs when

$\exp[i\omega t] \sum_j \exp[i2\phi_j] = 0.$

If the phases are coherent, $\phi_j = \phi \forall j$ , then the zero occurs at the nearest turning point,

$t = \frac{T}{2} \left[ n + \left( 1 - \frac{2\phi}{\pi} \right) \right], \quad n \in \mathbb{Z},$

but if the phases are incoherent then the total momentum is constant and there are no unique turning points along the trajectory. In this case the sum tends towards unity leaving the zero at

⁵ In practice, the transient behavior can lead to premature satisfaction of the criterion; care must be taken when trajectories are terminated quickly.

$t = \frac{T}{2},$

which maximizes the distance by integrating to the opposite point of the orbit.

In the more general case the phase offsets lead to beating, and the criterion vanishes at the average of the turning points,

$t \approx \frac{T}{2} \left[ n + \left( 1 - \frac{2}{\pi} \frac{\sum_j \phi_j}{n} \right) \right], n \in \mathbb{Z}.$

For simple harmonic motion the NUTS criterion identifies the first TPOLO along the trajectory, and hence the optimal integration time, for any EMHMC system (Figures 2 and 3).

Although the NUTS criterion performs well in potentials that induce more complicated harmonic motion, it eventually begins to fail when applied to wells with more intricate geometry. The potential from the banana or twisted Gaussian distribution [6],

$V(R) = \frac{1}{2} \left[ \frac{q_1^2(R)}{\sigma_1^2} + \frac{(q_2(R) + \beta q_1^2(R) - 100\beta)^2}{\sigma_2^2} \right] \quad (4)$

with $\beta = 0.03$ , $\sigma_1 = 0.01$ , and $\sigma_2 = 1$ , for example, yields a non-convex well on which NUTS prematurely terminates (Figure 4).
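For experimentation, the banana potential (4) and its gradient can be written down directly. The sketch below uses the parameter values quoted in the text and checks the analytic gradient against central finite differences. The function names and the test point are assumptions made for illustration.

```python
import numpy as np

BETA, SIGMA_1, SIGMA_2 = 0.03, 0.01, 1.0     # values quoted in the text

def banana_potential(q):
    """Potential (4) of the banana (twisted Gaussian) distribution."""
    q1, q2 = q
    warped = q2 + BETA * q1**2 - 100.0 * BETA
    return 0.5 * (q1**2 / SIGMA_1**2 + warped**2 / SIGMA_2**2)

def banana_gradient(q):
    """Analytic gradient of the banana potential."""
    q1, q2 = q
    warped = q2 + BETA * q1**2 - 100.0 * BETA
    dV_dq1 = q1 / SIGMA_1**2 + 2.0 * BETA * q1 * warped / SIGMA_2**2
    dV_dq2 = warped / SIGMA_2**2
    return np.array([dV_dq1, dV_dq2])

def finite_difference_gradient(f, q, h=1e-6):
    """Central-difference approximation, used only as a sanity check."""
    q = np.asarray(q, dtype=float)
    grad = np.zeros_like(q)
    for j in range(q.size):
        step = np.zeros_like(q)
        step[j] = h
        grad[j] = (f(q + step) - f(q - step)) / (2.0 * h)
    return grad

q_mode = np.array([0.0, 100.0 * BETA])       # bottom of the well
q_test = np.array([0.005, 2.5])              # arbitrary point for the check
print(banana_potential(q_mode), banana_gradient(q_test),
      finite_difference_gradient(banana_potential, q_test))
```

The potential vanishes at $(q_1, q_2) = (0, 100\beta)$, and the quadratic term in $q_1$ bends the well away from that point, which is what makes it non-convex.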

Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) [7]

$H(S) = \frac{1}{2} \sum_{jk} p_j(S) p_k(S) \Lambda^{jk}(R) - \frac{1}{2} \log |\Lambda(R)| + V(R),$

endows the base manifold with a Riemannian geometry that can simplify the motion in such complex potentials, but here the distance $\mathbf{r}$ has no real meaning. Although the trajectories are much smoother in RMHMC than EMHMC, the now ill-defined NUTS criterion still terminates prematurely (Figure 5).
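To make the structure of the RMHMC Hamiltonian concrete, the sketch below evaluates it for a user-supplied inverse metric $\Lambda(R)$. It then checks that a constant $\Lambda = \mathbf{M}^{-1}$ reproduces the EMHMC Hamiltonian up to the constant $-\frac{1}{2} \log |\mathbf{M}^{-1}|$. The function names, the mass matrix, and the quadratic potential are assumptions made for illustration.

```python
import numpy as np

def rmhmc_hamiltonian(q, p, inverse_metric, potential):
    """H = p.Lambda(q).p/2 - log|Lambda(q)|/2 + V(q), with Lambda the inverse metric."""
    Lam = inverse_metric(q)
    sign, logdet = np.linalg.slogdet(Lam)
    if sign <= 0:
        raise ValueError("inverse metric must be positive definite")
    return 0.5 * p @ Lam @ p - 0.5 * logdet + potential(q)

def emhmc_hamiltonian(q, p, M_inv, potential):
    """H = p.M_inv.p/2 + V(q) for a constant mass matrix."""
    return 0.5 * p @ M_inv @ p + potential(q)

M_inv = np.linalg.inv(np.array([[2.0, 0.3], [0.3, 1.0]]))   # assumed mass matrix
potential = lambda q: 0.5 * q @ q                            # assumed toy potential
constant_metric = lambda q: M_inv                            # position-independent Lambda

points = [(np.array([0.1, -0.4]), np.array([1.0, 0.5])),
          (np.array([2.0, 1.0]), np.array([-0.3, 0.8]))]
offsets = [rmhmc_hamiltonian(q, p, constant_metric, potential)
           - emhmc_hamiltonian(q, p, M_inv, potential) for q, p in points]
print(offsets)
```

For a position-dependent metric the log-determinant term varies along the trajectory and contributes to the dynamics, which this constant-metric check does not exercise.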

IDENTIFYING TURNING POINTS ON RIEMANNIAN MANIFOLDS

Although the motivating construction of the NUTS criterion required a measure of distance on the manifold, the intermediate form derived above does not. In particular,

$\langle \mathbf{p}(R_t), \boldsymbol{\rho}(R_t) \rangle_{\Lambda(R_t)} < 0$

with

$\boldsymbol{\rho}(R_t) = \frac{1}{t} \int_0^t d\tau \mathbf{p}(R_\tau),$

is entirely well-defined on a Riemannian manifold, provided that we can add momenta from different points along the trajectory. Fortunately, the symplectic geometry of $T^*\mathcal{M}$ furnishes such a means.

The canonical one-form, $\theta \in T^*(T^*\mathcal{M})$ [4], given in coordinates as

$\theta = \sum_j p_j dq^j,$

is a natural object on the cotangent bundle. Note that $\theta$ has no $dp_j$ components in any coordinate system; it is a horizontal form on $T^*\mathcal{M}$.

Now the canonical one-form can be Lie dragged along the Hamiltonian flow [9],

$\theta_t^* = \theta + \int_0^t d\tau \mathcal{L}_{\vec{H}} \theta,$

where the components of the Lie derivative, $\mathcal{L}_{\vec{H}} \theta$ , are given by

$\begin{aligned} (\mathcal{L}_{\vec{H}} \theta)_j &= \sum_k \left[ \frac{dq^k}{dt} \frac{\partial}{\partial q^k} + \frac{dp_k}{dt} \frac{\partial}{\partial p_k} \right] \theta_j + \theta_k \frac{\partial}{\partial q^j} \frac{dq^k}{dt} \\ &= \sum_k \left[ \frac{dq^k}{dt} \frac{\partial}{\partial q^k} + \frac{dp_k}{dt} \frac{\partial}{\partial p_k} \right] p_j + p_k \frac{\partial}{\partial q^j} \frac{dq^k}{dt} \\ &= \sum_k \frac{dq^k}{dt} \frac{\partial p_j}{\partial q^k} + \frac{dp_k}{dt} \frac{\partial p_j}{\partial p_k} + p_k \frac{d}{dt} \frac{\partial q^k}{\partial q^j} \\ &= \sum_k \frac{dq^k}{dt} 0 + \frac{dp_k}{dt} \delta_j^k + p_k \frac{d}{dt} \delta_j^k \\ &= \sum_k \frac{dq^k}{dt} 0 + \frac{dp_k}{dt} \delta_j^k + p_k 0 \\ &= \frac{dp_j}{dt}. \end{aligned}$

Dragging $\theta$ from the beginning of a trajectory to the end then yields⁶

$\begin{aligned} \theta_{-t}^*(S_t) &= \sum_j \left[ p_j(S_t) + \int_t^0 d\tau \frac{dp_j}{d\tau} \right] dq^j(S_t) \\ &= \sum_j [p_j(S_t) + p_j(S_0) - p_j(S_t)] dq^j(S_t) \\ &= \sum_j p_j(S_0) dq^j(S_t). \end{aligned}$

Because $\theta$ is a horizontal form, $\theta_{-t}^*$ defines a unique covector $\mathbf{p}^*(R_t)$ , with components

$\begin{aligned} p_j^*(R_t) &= (\theta_{-t}^*(S_t))_j \\ &= p_j(R_0), \end{aligned}$

on the base manifold at $R_t = \pi(S_t) \in \mathcal{M}$. This then defines a mapping of the momentum at any initial point $R_0$ along a trajectory to the current point $R_t$ (Figure 6).

⁶ Lie dragging is canonically defined as against the flow of the vector field, from the end of a trajectory to the beginning. Here we're dragging with the flow, hence the minus sign.

Diagonal Mass Matrix

FIG. 2. When applied to the quadratic potential (3) with $\mathbf{M} = \mathbb{I}$ (Figure 1a), the NUTS criterion terminates the trajectory at the first TPOLO, as desired.

Dense Mass Matrix

FIG. 3. The NUTS criterion successfully identifies the first TPOLO even for more sophisticated choices of the mass matrix, such as the quadratic potential (3) with $\mathbf{M} = \Sigma^{-1}$ (Figure 1b).

Banana (EMHMC)

FIG. 4. On more complicated potentials, such as that arising from the banana distribution (4), the NUTS criterion terminates prematurely even when ignoring the transient behavior.

Banana (RMHMC w/ SoftAbs Metric)

FIG. 5. RMHMC trajectories, here using the SoftAbs metric [8] with $\alpha = 1$, yield much smoother trajectories in the banana potential (4), but the ill-defined NUTS criterion still terminates prematurely.

We can now define $\rho(R_t)$ for any RMHMC system as

$\rho(R_t) = \frac{1}{t} \int_0^t d\tau \mathbf{p}^*(R_\tau),$

yielding the generalized NUTS criterion,

$\langle \mathbf{p}(R_t), \rho(R_t) \rangle_{\Lambda(R_t)} < 0, \quad (5)$

amenable to HMC on Riemannian manifolds.

In practice, the integral can be approximated by aggregating the momenta along discrete time intervals, $t_k - t_{k-1} = \delta t_k$ ,

$\rho = \frac{1}{\sum_k \delta t_k} \sum_k \delta t_k \, \mathbf{p}^*(R_{t_k}),$

such as those naturally introduced in any discrete integrator.

Rigorously defined, this generalized criterion identifies the TPOLO even in warped distributions such as the banana (Figure 7).
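A minimal sketch of how criterion (5) might be evaluated inside a discrete integrator is given below. Because the dragged momenta keep their coordinate components, $\boldsymbol{\rho}$ is accumulated as a time-weighted running average of momentum components and contracted with the inverse metric at the current point. To stay short, the sketch assumes a constant inverse metric, so an ordinary leapfrog suffices, and a one-dimensional unit oscillator. A position-dependent metric would need a generalized (implicit) integrator and $\Lambda$ re-evaluated at each $R_t$. All names here are hypothetical.

```python
import numpy as np

def generalized_nuts_criterion(p_t, rho, inverse_metric_t):
    """True once <p(R_t), rho(R_t)>_Lambda(R_t) < 0, i.e. the trajectory should stop."""
    return p_t @ inverse_metric_t @ rho < 0.0

def integrate_until_turn(q0, p0, grad_V, inverse_metric, step_size, max_steps):
    """Leapfrog with a constant inverse metric, stopping when criterion (5) fires."""
    q, p = np.array(q0, dtype=float), np.array(p0, dtype=float)
    weighted_sum = np.zeros_like(p)
    elapsed = 0.0
    for _ in range(max_steps):
        p_half = p - 0.5 * step_size * grad_V(q)
        q = q + step_size * inverse_metric @ p_half
        p = p_half - 0.5 * step_size * grad_V(q)
        weighted_sum += step_size * p_half    # dt_k * p*(R_{t_k})
        elapsed += step_size
        rho = weighted_sum / elapsed
        if generalized_nuts_criterion(p, rho, inverse_metric):
            break
    return q, p, elapsed

# Assumed test case: unit oscillator, V(q) = q^2 / 2, started at the bottom of the well
identity = np.eye(1)
grad_V = lambda q: q
q_stop, p_stop, t_stop = integrate_until_turn(
    np.array([0.0]), np.array([1.0]), grad_V, identity, 0.01, 10_000)
print(t_stop, q_stop, p_stop)
```

In this toy case the exact motion is $q = \sin t$, $p = \cos t$, so the momentum first changes sign at $t = \pi/2$, and the sketch stops within one step of that turning point.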

CONCLUSION

By taking advantage of both symplectic and Riemannian geometry, NUTS is readily generalized to RMHMC. When applied to bound trajectories, this criterion identifies the turning points of the longest oscillation and terminates the motion before it begins to double back on itself and waste computation.

This generalization is readily adapted into the NUTS framework and is currently being implemented in Stan [10].

ACKNOWLEDGEMENTS

I thank Andrew Gelman and the Stan development team for many engaging conversations, and Tarun Chitra for helpful discussion.

[1] S. Duane, A. Kennedy, B. J. Pendleton, and D. Roweth, Physics Letters B 195, 216 (1987).
[2] R. Neal, MCMC using Hamiltonian dynamics, in Handbook of Markov Chain Monte Carlo, edited by S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, CRC Press, New York, 2011.
[3] M. J. Betancourt and L. C. Stein, (2011), 1112.4118.
[4] J. V. José and E. J. Saletan, Classical Dynamics: A Contemporary Approach (Cambridge University Press, New York, 1998).
[5] M. D. Hoffman and A. Gelman, ArXiv e-prints (2011), 1111.4246.
[6] H. Haario, E. Saksman, and J. Tamminen, Computational Statistics 14, 375 (1999).
[7] M. Girolami and B. Calderhead, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 123 (2011).
[8] M. J. Betancourt, ArXiv e-prints (2012), 1212.4693.
[9] B. Schutz, Geometrical Methods of Mathematical Physics (Cambridge University Press, New York, 1980).
[10] Stan Development Team, Stan: A C++ library for probability and sampling, version 1.0, 2012.

[Figure 6 diagram: two curved trajectories. On the left, in $T^*\mathcal{M}$, the path runs from $\theta(S_0)$ to $\theta_{-t}^*(S_t) = \theta(S_0)$. On the right, in $\mathcal{M}$, the path runs from $p(R_0)$ to $p^*(R_t) = p(R_0)$. An arrow labeled $\pi$ points from $T^*\mathcal{M}$ to $\mathcal{M}$, representing the projection from the cotangent bundle to the base manifold.]

FIG. 6. The Hamiltonian flow allows us to Lie drag the canonical one form, $\theta$ , from any point $S_0$ to some final point $S_t$ along the trajectory. This defines a unique momentum on the base manifold, $\mathcal{M}$ , providing a map of the momentum at $R_0 = \pi(S_0)$ to $R_t = \pi(S_t)$ , where it can be compared with the current momentum.

FIG. 7. Only with a well-defined criterion (5) do RMHMC trajectories, here using the SoftAbs metric [8] with $\alpha = 1$ , terminate at the TPOLO of the banana distribution. Note that the transient behavior persists, and care must still be taken at the beginning of a trajectory.
