On Deep Multi-View Representation Learning: Objectives and Optimization

Weiran Wang

Toyota Technological Institute at Chicago
Chicago, IL 60637, USA

WEIRANWANG@TTIC.EDU

Raman Arora

Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218, USA

ARORA@CS.JHU.EDU

Karen Livescu

Toyota Technological Institute at Chicago
Chicago, IL 60637, USA

KLIVESCU@TTIC.EDU

Jeff Bilmes

Department of Electrical Engineering
University of Washington, Seattle
Seattle, WA 98195, USA

BILMES@EE.WASHINGTON.EDU

Editor:

Abstract

We consider learning representations (features) in the setting in which we have access to multiple unlabeled views of the data for learning while only one view is available for downstream tasks. Previous work on this problem has proposed several techniques based on deep neural networks, typically involving either autoencoder-like networks with a reconstruction objective or paired feed-forward networks with a batch-style correlation-based objective. We analyze several techniques based on prior work, as well as new variants, and compare them empirically on image, speech, and text tasks. We find an advantage for correlation-based representation learning, while the best results on most tasks are obtained with our new variant, deep canonically correlated autoencoders (DCCAE). We also explore a stochastic optimization procedure for minibatch correlation-based objectives and discuss the time/performance trade-offs for kernel-based and neural network-based implementations.

1. Introduction

In many applications, we have access to multiple “views” of data at training time while only one view is available at test time, or for a downstream task. The views can be multiple measurement modalities, such as simultaneously recorded audio + video (Kidron et al., 2005; Chaudhuri et al., 2009), audio + articulation (Arora and Livescu, 2013; Wang et al., 2015a), images + text (Hardoon et al., 2004; Socher and Li, 2010; Hodosh et al., 2013; Yan and Mikolajczyk, 2015), or parallel text in two languages (Vinokourov et al., 2003; Haghighi et al., 2008; Chandar et al., 2014; Faruqui and Dyer, 2014; Lu et al., 2015), but may also be different information extracted from the same source, such as words + context (Pennington et al., 2014) or document text + text of inbound hyperlinks (Bickel and Scheffer, 2004). The presence of multiple information sources presents an opportunity to learn better representations (features) by analyzing the views simultaneously. Typical approaches are based on learning a feature transformation of the “primary” view (the one available at test time) that captures useful information from the second view using a paired two-view training set. Under certain assumptions, theoretical results exist showing the advantages of multi-view techniques for downstream tasks (Kakade and Foster, 2007; Foster et al., 2009; Chaudhuri et al., 2009). Experimentally, prior work has shown the benefit of multi-view methods on tasks such as retrieval (Vinokourov et al., 2003; Hardoon et al., 2004; Socher and Li, 2010; Hodosh et al., 2013), clustering (Blaschko and Lampert, 2008; Chaudhuri et al., 2009), and classification/recognition (Dhillon et al., 2011; Arora and Livescu, 2013; Ngiam et al., 2011).

Recent work has introduced several approaches for multi-view representation learning based on deep neural networks (DNNs), using two main training criteria (objectives). One type of objective is based on deep autoencoders (Hinton and Salakhutdinov, 2006), where the objective is to learn a compact representation that best reconstructs the inputs. In the multi-view learning scenario, it is natural to use an encoder to extract the shared representation from the primary view, and use different decoders to reconstruct each view’s input features from the shared representation. This approach has been shown to be effective for speech and vision tasks (Ngiam et al., 2011).

The second main DNN-based multi-view approach is based on deep extensions of canonical correlation analysis (CCA, Hotelling, 1936), which learns features in two views that are maximally correlated. The CCA objective has been studied extensively and has a number of useful properties and interpretations (Borga, 2001; Bach and Jordan, 2002, 2005; Chechik et al., 2005), and the optimal linear projection mappings can be obtained by solving an eigenvalue system of a matrix whose dimensions equal the input dimensionalities. To overcome the limited expressive power of linear mappings, a nonlinear extension—kernel canonical correlation analysis (KCCA)—has also been proposed (Lai and Fyfe, 2000; Akaho, 2001; Melzer et al., 2001; Bach and Jordan, 2002; Hardoon et al., 2004). CCA and KCCA have long been the workhorse for multi-view feature learning and dimensionality reduction (Vinokourov et al., 2003; Kakade and Foster, 2007; Socher and Li, 2010; Dhillon et al., 2011). Several alternative nonlinear CCA-like approaches based on neural networks have also been proposed (Lai and Fyfe, 1999; Hsieh, 2000), but the full DNN extension of CCA, termed deep CCA (DCCA, Andrew et al., 2013), has been developed only recently. Compared to kernel methods, DNNs are more scalable to large amounts of training data and also have the potential advantage that the nonlinear mapping can be learned, unlike with a fixed kernel approach.

The contributions of this paper are as follows. We present a head-to-head comparison of several DNN-based approaches, along with linear and kernel CCA, in the unsupervised multi-view feature learning setting where the second view is not available at test time. We compare approaches based on prior work, as well as developing and comparing new variants. Empirically, we find that CCA-based approaches tend to outperform unconstrained reconstruction-based approaches. One of the new methods we propose, a DNN-based model combining CCA and autoencoder-based terms, is the consistent winner across several tasks. Finally, we study a stochastic optimization approach for deep CCA, both empirically and theoretically. To facilitate future work, we have released our implementations and a new benchmark dataset of simulated two-view data based on MNIST.1

An early version of this work appeared in (Wang et al., 2015b). This paper expands on that work with a fuller discussion of related work (Sections 3.2 and 3.3), additional experiments, and theoretical analysis. In particular, we include additional experiments on word similarity tasks with multilingual word embedding learning (Section 4.3); extensive empirical comparisons between batch and stochastic optimization methods for DCCA; and comparisons between DCCA and two popular low-rank approximate KCCA methods, demonstrating their computational trade-offs (Section 4.4). We also analyze the stochastic optimization approach for DCCA theoretically in Appendix A.

  1. Data and implementations can be found at http://ttic.uchicago.edu/~wwang5/dccae.html.

Figure 1: Schematic diagram of DNN-based multi-view representation learning approaches.

2. DNN-based multi-view feature learning

In this section, we discuss several existing and new multi-view learning approaches based on deep feed-forward neural networks, including their objective functions and optimization procedures. Schematic diagrams summarizing the methods are given in Fig. 1.

Notation In the multi-view feature learning scenario, we have access to paired observations from two views, denoted $(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N)$, where $N$ is the sample size and $\mathbf{x}_i \in \mathbb{R}^{D_x}$ and $\mathbf{y}_i \in \mathbb{R}^{D_y}$ for $i = 1, \dots, N$. We also denote the data matrices for each view $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]$ and $\mathbf{Y} = [\mathbf{y}_1, \dots, \mathbf{y}_N]$. We use bold-face letters, e.g. $\mathbf{f}$, to denote multidimensional mappings implemented by DNNs. A DNN $\mathbf{f}$ of depth $K_f$ implements a nested mapping of the form $\mathbf{f}(\mathbf{x}) = \mathbf{f}_{K_f}(\dots \mathbf{f}_1(\mathbf{x}; \mathbf{W}_1) \dots; \mathbf{W}_{K_f})$, where $\mathbf{W}_j$ are the weight parameters2 of layer $j$, $j = 1, \dots, K_f$, and $\mathbf{f}_j$ is the mapping of layer $j$, which takes the form of a linear mapping followed by an element-wise activation: $\mathbf{f}_j(\mathbf{t}) = s(\mathbf{W}_j^\top \mathbf{t})$; typical choices for $s$ include sigmoid, tanh, and ReLU. The collection of learnable parameters in model $\mathbf{f}$ is denoted $\mathbf{W}_f$, e.g. $\mathbf{W}_f = \{\mathbf{W}_1, \dots, \mathbf{W}_{K_f}\}$ in the DNN case. We write the $\mathbf{f}$-projected (view 1) data matrix as $\mathbf{f}(\mathbf{X}) = [\mathbf{f}(\mathbf{x}_1), \dots, \mathbf{f}(\mathbf{x}_N)]$. The dimensionality of the projection (feature vector), which is equal to the number of output units in the DNN case, is denoted $L$.
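As an illustration of this notation, the nested mapping can be sketched in a few lines of numpy; the tanh activation and the layer sizes below are illustrative choices only, not tied to any model in this paper:

```python
import numpy as np

def layer(W, t):
    """One layer f_j(t) = s(W_j^T t), with tanh as the activation s."""
    return np.tanh(W.T @ t)

def dnn(weights, x):
    """Nested mapping f(x) = f_{K_f}(... f_1(x; W_1) ...; W_{K_f})."""
    t = x
    for W in weights:  # weights plays the role of W_f = {W_1, ..., W_{K_f}}
        t = layer(W, t)
    return t

rng = np.random.default_rng(0)
# Toy sizes: input dimension D_x = 5, one hidden layer of 4 units, L = 3 outputs.
Wf = [rng.normal(size=(5, 4)), rng.normal(size=(4, 3))]
x = rng.normal(size=5)
fx = dnn(Wf, x)  # feature vector of dimension L = 3
```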

2.1 Split autoencoders (SplitAE)

Ngiam et al. (2011) propose to extract shared representations by learning to reconstruct both views from the one view that is available at test time. In this approach, the feature extraction network $\mathbf{f}$ is shared while the reconstruction networks $\mathbf{p}$ and $\mathbf{q}$ are separate for each view. We refer to this model as a split autoencoder (SplitAE), shown schematically in Fig. 1 (a). The objective of this model is

the sum of reconstruction errors for the two views (we omit the $\ell_2$ weight decay term for all models in this section):

$$\min_{\mathbf{W}_f, \mathbf{W}_p, \mathbf{W}_q} \; \frac{1}{N} \sum_{i=1}^N \left( \|\mathbf{x}_i - \mathbf{p}(\mathbf{f}(\mathbf{x}_i))\|^2 + \|\mathbf{y}_i - \mathbf{q}(\mathbf{f}(\mathbf{x}_i))\|^2 \right).$$

  2. Biases can be introduced by appending an extra 1 to the input.

The intuition for this model is that the shared representation can be extracted from a single view, and can be used to reconstruct all views.3 The autoencoder loss is the empirical expectation of the loss incurred at each training sample, and thus stochastic gradient descent (SGD) can be used to optimize the objective efficiently with the gradients estimated on small minibatches of samples.
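A minimal numpy sketch of the SplitAE objective, using single-layer stand-ins for the networks and made-up toy data (the actual models are deeper nonlinear DNNs); it also shows why the loss decomposes over samples, which is what makes SGD applicable:

```python
import numpy as np

rng = np.random.default_rng(0)
Dx, Dy, L, N = 8, 6, 4, 100   # toy dimensions, chosen for illustration only

# One-layer stand-ins for the networks: f (shared encoder with tanh),
# p and q (linear decoders for each view).
Wf = rng.normal(size=(Dx, L))
Wp = rng.normal(size=(L, Dx))
Wq = rng.normal(size=(L, Dy))

def splitae_loss(X, Y):
    """Average over samples of the summed reconstruction errors of both views,
    where both reconstructions use features extracted from view 1 only."""
    H = np.tanh(X @ Wf)    # shared representation f(x_i), one row per sample
    rec_x = H @ Wp         # p(f(x_i))
    rec_y = H @ Wq         # q(f(x_i))
    return np.mean(np.sum((X - rec_x) ** 2, axis=1)
                   + np.sum((Y - rec_y) ** 2, axis=1))

X = rng.normal(size=(N, Dx))
Y = rng.normal(size=(N, Dy))
loss = splitae_loss(X, Y)

# Because the loss is an empirical mean of per-sample losses, minibatch
# gradient estimates are unbiased: the average of the losses over two
# disjoint halves of the data equals the full-batch loss.
```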

2.2 Deep canonical correlation analysis (DCCA)

Andrew et al. (2013) propose a DNN extension of CCA termed deep CCA (DCCA; see Fig. 1 (b)). In DCCA, two DNNs $\mathbf{f}$ and $\mathbf{g}$ are used to extract nonlinear features for each view and the canonical correlation between the extracted features $\mathbf{f}(\mathbf{X})$ and $\mathbf{g}(\mathbf{Y})$ is maximized:

$$\begin{aligned} \max_{\mathbf{W}_f, \mathbf{W}_g, \mathbf{U}, \mathbf{V}} \quad & \frac{1}{N} \operatorname{tr} \left( \mathbf{U}^\top \mathbf{f}(\mathbf{X}) \mathbf{g}(\mathbf{Y})^\top \mathbf{V} \right) \\ \text{s.t.} \quad & \mathbf{U}^\top \left( \frac{1}{N} \mathbf{f}(\mathbf{X}) \mathbf{f}(\mathbf{X})^\top + r_x \mathbf{I} \right) \mathbf{U} = \mathbf{I}, \\ & \mathbf{V}^\top \left( \frac{1}{N} \mathbf{g}(\mathbf{Y}) \mathbf{g}(\mathbf{Y})^\top + r_y \mathbf{I} \right) \mathbf{V} = \mathbf{I}, \\ & \mathbf{u}_i^\top \mathbf{f}(\mathbf{X}) \mathbf{g}(\mathbf{Y})^\top \mathbf{v}_j = 0 \quad \text{for } i \neq j, \end{aligned} \tag{1}$$

where $\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_L]$ and $\mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_L]$ are the CCA directions that project the DNN outputs and $(r_x, r_y) > 0$ are regularization parameters added to the diagonal of the sample auto-covariance matrices (Bie and Moor, 2003; Hardoon et al., 2004). In DCCA, $\mathbf{U}^\top \mathbf{f}(\cdot)$ is the final projection mapping used for downstream tasks.4 One intuition for CCA-based objectives is that, while it may be difficult to accurately reconstruct one view from the other view, it may be easier, and perhaps sufficient, to learn a predictor of a function (or subspace) of the second view. In addition, it should be helpful for the learned dimensions within each view to be uncorrelated so that they provide complementary information.
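For fixed network outputs, the inner maximization of (1) over $(\mathbf{U}, \mathbf{V})$ has a closed form: the optimal total correlation is the sum of the top $L$ singular values of the whitened cross-covariance matrix. A small numpy sketch of this computation on made-up data (with explicit mean-centering, which the matrix notation above leaves implicit):

```python
import numpy as np

def total_correlation(FX, GY, L, rx=1e-4, ry=1e-4):
    """Value of the DCCA objective (1) after maximizing over U, V:
    sum of the top-L singular values of T = S11^{-1/2} S12 S22^{-1/2}."""
    FX = FX - FX.mean(axis=1, keepdims=True)   # columns are samples
    GY = GY - GY.mean(axis=1, keepdims=True)
    N = FX.shape[1]
    S11 = FX @ FX.T / N + rx * np.eye(FX.shape[0])
    S22 = GY @ GY.T / N + ry * np.eye(GY.shape[0])
    S12 = FX @ GY.T / N

    def inv_sqrt(S):  # inverse matrix square root of a PSD matrix
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False)[:L].sum()

rng = np.random.default_rng(0)
GY = rng.normal(size=(4, 500))                 # stand-in for g(Y)
FX = GY[:3] + 0.5 * rng.normal(size=(3, 500))  # f(X), correlated with g(Y)
tc = total_correlation(FX, GY, L=3)            # between 0 and L
```

Each canonical correlation lies in $[0, 1]$, so the returned value is at most $L$; strongly correlated views give values near $L$, while independent views give values near zero.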

Optimization The DCCA objective couples all training samples through the whitening constraints (i.e., it is a fully batch objective), so stochastic gradient descent (SGD) cannot be applied in a standard way. However, it has been observed that DCCA can still be optimized effectively as long as the gradient is estimated using a sufficiently large minibatch (Wang et al., 2015a, with the gradient formulas given as in Andrew et al., 2013). Intuitively, this approach works because a large minibatch of samples contains sufficient information for estimating the covariance matrices. We provide an empirical analysis of the optimization of DCCA (in comparison with KCCA) in Section 4.4, and a theoretical analysis in Appendix A.

  3. The authors also propose a bimodal deep autoencoder combining DNN-transformed features from both views; this model is more natural for the multimodal fusion setting, where both views are available at test time, but can also be used in our multi-view setting (see Ngiam et al., 2011). Empirically, however, Ngiam et al. (2011) report that SplitAE tends to work better in the multi-view setting than bimodal autoencoders.

  4. In principle there is no need for the final linear projection; we could define DCCA such that the correlation objective and constraints are imposed on the final nonlinear layer of the two DNNs. The final linear projection is equivalent to constraining the final DNN layer to be linear. While this is in principle not needed, it is crucial for practical algorithmic implementations such as ours, and matches the original formulation of DCCA (Andrew et al., 2013).

2.3 Deep canonically correlated autoencoders (DCCAE)

Inspired by both CCA and reconstruction-based objectives, we propose a new model that consists of two autoencoders and optimizes the combination of canonical correlation between the learned “bottleneck” representations and the reconstruction errors of the autoencoders. In other words, we optimize the following objective

$$\begin{aligned} \min_{\mathbf{W}_f, \mathbf{W}_g, \mathbf{W}_p, \mathbf{W}_q, \mathbf{U}, \mathbf{V}} \quad & -\frac{1}{N} \operatorname{tr} \left( \mathbf{U}^\top \mathbf{f}(\mathbf{X}) \mathbf{g}(\mathbf{Y})^\top \mathbf{V} \right) \\ & + \frac{\lambda}{N} \sum_{i=1}^N \left( \|\mathbf{x}_i - \mathbf{p}(\mathbf{f}(\mathbf{x}_i))\|^2 + \|\mathbf{y}_i - \mathbf{q}(\mathbf{g}(\mathbf{y}_i))\|^2 \right) \\ \text{s.t.} \quad & \text{the same constraints as in (1),} \end{aligned} \tag{2}$$

where $\lambda > 0$ is a trade-off parameter. Alternatively, this approach can be seen as adding an autoencoder regularization term to DCCA. We call this approach deep canonically correlated autoencoders (DCCAE). Fig. 1 (c) shows a schematic representation of the approach.

Optimization We apply stochastic optimization to the DCCAE objective. Note that obtaining good stochastic estimates of the gradient for the correlation and autoencoder terms may require different minibatch sizes. We tune the minibatch size for each term separately (by training DCCA and an autoencoder, respectively) and select them using a validation set. The stochastic gradient is then the sum of the gradient of the DCCA term (usually estimated using large minibatches) and the gradient of the autoencoder term (usually estimated using small minibatches).

Interpretations CCA maximizes the mutual information between the projected views for certain distributions (Borga, 2001), while training an autoencoder to minimize reconstruction error amounts to maximizing a lower bound on the mutual information between inputs and learned features (Vincent et al., 2010). The DCCAE objective thus offers a trade-off between the information captured in the (input, feature) mapping within each view on the one hand, and the information in the (feature, feature) relationship across views on the other.

2.4 Correlated autoencoders (CorrAE)

In the next approach, we remove the uncorrelatedness constraints from the DCCAE objective, leaving only the sum of scalar correlations between pairs of learned dimensions and the reconstruction error term. This approach is intended to test how important the original CCA constraints are. We call this model correlated autoencoders (CorrAE), also represented by Fig. 1 (c). Its objective can be written as

$$\begin{aligned} \min_{\mathbf{W}_f, \mathbf{W}_g, \mathbf{W}_p, \mathbf{W}_q, \mathbf{U}, \mathbf{V}} \quad & -\frac{1}{N} \operatorname{tr} \left( \mathbf{U}^\top \mathbf{f}(\mathbf{X}) \mathbf{g}(\mathbf{Y})^\top \mathbf{V} \right) \\ & + \frac{\lambda}{N} \sum_{i=1}^N \left( \|\mathbf{x}_i - \mathbf{p}(\mathbf{f}(\mathbf{x}_i))\|^2 + \|\mathbf{y}_i - \mathbf{q}(\mathbf{g}(\mathbf{y}_i))\|^2 \right) \\ \text{s.t.} \quad & \mathbf{u}_i^\top \mathbf{f}(\mathbf{X}) \mathbf{f}(\mathbf{X})^\top \mathbf{u}_i = \mathbf{v}_i^\top \mathbf{g}(\mathbf{Y}) \mathbf{g}(\mathbf{Y})^\top \mathbf{v}_i = N, \quad 1 \leq i \leq L, \end{aligned} \tag{3}$$

where $\lambda > 0$ is a trade-off parameter. The constraint set in (3) is clearly a relaxed version of that of (2); later we demonstrate that this difference results in a large performance gap. We optimize CorrAE with the same strategy used for DCCAE.

CorrAE is similar to the model of Chandar et al. (2014, 2015), who try to learn vectorial word representations using parallel corpora from two languages. They use a DNN in each view (language) to predict a bag-of-words representation of the input sentences, or that of the paired sentences from the other view, while encouraging the learned bottleneck layer representations to be highly correlated.

2.5 Minimum-distance autoencoders (DistAE)

The CCA objective can be seen as minimizing the distance between the learned projections of the two views, while satisfying the whitening constraints for the projections (Hardoon et al., 2004). The constraints complicate the optimization of CCA-based objectives, as pointed out above. This observation motivates us to consider additional objectives that decompose into sums over training examples, while maintaining the intuition of the CCA objective as a reconstruction error between two mappings. Here we consider two variants that we refer to as minimum-distance autoencoders (DistAE).

The first variant DistAE-1 optimizes the following objective:

$$\begin{aligned} \min_{\mathbf{W}_f, \mathbf{W}_g, \mathbf{W}_p, \mathbf{W}_q} \quad & \frac{1}{N} \sum_{i=1}^N \frac{\|\mathbf{f}(\mathbf{x}_i) - \mathbf{g}(\mathbf{y}_i)\|^2}{\|\mathbf{f}(\mathbf{x}_i)\|^2 + \|\mathbf{g}(\mathbf{y}_i)\|^2} \\ & + \frac{\lambda}{N} \sum_{i=1}^N \left( \|\mathbf{x}_i - \mathbf{p}(\mathbf{f}(\mathbf{x}_i))\|^2 + \|\mathbf{y}_i - \mathbf{q}(\mathbf{g}(\mathbf{y}_i))\|^2 \right) \end{aligned} \tag{4}$$

which is a weighted combination of reconstruction errors of two autoencoders and the average discrepancy between the projected sample pairs. The denominator of the discrepancy term is used to keep the optimization from improving the objective by simply scaling down the projections (although they can never become identically zero due to the reconstruction terms). This objective is unconstrained and is the empirical average of the loss incurred at each training sample, so normal SGD applies using small (or any size) minibatches.
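The discrepancy term in (4) is scale-invariant, which is exactly what blocks the degenerate shrink-to-zero solution; a quick numpy check of these properties (toy data, and `distae1_dist` is a hypothetical helper name for illustration):

```python
import numpy as np

def distae1_dist(Fx, Gy):
    """Average normalized discrepancy term of eq (4); rows are samples.
    The per-sample ratio lies in [0, 2] and is invariant to scaling
    both projections by the same constant."""
    num = np.sum((Fx - Gy) ** 2, axis=1)
    den = np.sum(Fx ** 2, axis=1) + np.sum(Gy ** 2, axis=1)
    return np.mean(num / den)

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 4))   # stand-in for projected view-1 samples
G = rng.normal(size=(50, 4))   # stand-in for projected view-2 samples
d = distae1_dist(F, G)
```

In particular, identical projections give 0, exactly opposite projections give the maximum value 2, and jointly rescaling both views leaves the term unchanged.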

The second variant DistAE-2 optimizes a somewhat different objective:

$$\begin{aligned} \min_{\mathbf{W}_f, \mathbf{W}_g, \mathbf{W}_p, \mathbf{W}_q, \mathbf{A}, \mathbf{b}} \quad & \frac{1}{N} \sum_{i=1}^N \|\mathbf{f}(\mathbf{x}_i) - \mathbf{A} \mathbf{g}(\mathbf{y}_i) - \mathbf{b}\|^2 \\ & + \frac{\lambda}{N} \sum_{i=1}^N \left( \|\mathbf{x}_i - \mathbf{p}(\mathbf{f}(\mathbf{x}_i))\|^2 + \|\mathbf{y}_i - \mathbf{q}(\mathbf{g}(\mathbf{y}_i))\|^2 \right) \end{aligned} \tag{5}$$

where $\mathbf{A} \in \mathbb{R}^{L \times L}$ and $\mathbf{b} \in \mathbb{R}^L$. The underlying intuition is that the representation of the primary view can be linearly predicted from the representation of the other view. This is motivated by the fact that when $\mathbf{g}(\mathbf{y})$ and $\mathbf{f}(\mathbf{x})$ are perfectly linearly correlated, there exists an affine transformation mapping one to the other. This approach thus alleviates the burden on $\mathbf{g}(\mathbf{y})$ of being simultaneously predictive of the output and close to $\mathbf{f}(\mathbf{x})$ by itself.

3. Related work

Here we focus on related work on multi-view feature learning using neural networks and the kernel extension of CCA.

3.1 Neural network feature extraction using CCA-like objectives

There have been several approaches to multi-view representation learning using neural networks with an objective similar to that of CCA. Under the assumption that the two views share a common cause (e.g., depth is the common cause for adjacent patches of images), Becker and Hinton (1992) propose to maximize a sample-based estimate of the mutual information between the outputs of neural networks for the two views. In this work, the one-dimensional output of each network can be considered an “internal teaching signal” for the other. It is not straightforward to extend their sample-based estimator of mutual information to higher dimensions, whereas the CCA objective remains closely related to the mutual information between the views (under jointly multivariate Gaussian inputs; see Borga, 2001).

Lai and Fyfe (1999) propose to optimize the correlation (rather than canonical correlation) between the outputs of networks for each view, subject to scale constraints on each output dimension. Instead of directly solving this constrained formulation, the authors apply Lagrangian relaxation and solve the resulting unconstrained objective using SGD. Note, however, that their objective is different from that of CCA, as there are no constraints that the learned dimensions within each view be uncorrelated. Hsieh (2000) proposes a neural network-based model involving three modules: one module for extracting a pair of maximally correlated one-dimensional features for the two views; and a second and third module for reconstructing the original inputs of the two views from the learned features. In this model, the feature dimensions can be learned one after another, each learned using as input the reconstruction residual from previous dimensions. This approach is intuitively similar to CorrAE and DCCA, but the three modules are each trained separately, so there is no unified objective.

Kim et al. (2012) propose an algorithm that first uses deep belief networks and the autoencoder objective to extract features for two languages independently, and then applies linear CCA to the learned features (activations at the bottleneck layer of the autoencoders) to learn the final representation. In this two-step approach, the DNN weight parameters are not updated to optimize the CCA objective.

There has also been work on multi-view feature learning using deep Boltzmann machines (Srivastava and Salakhutdinov, 2014; Sohn et al., 2014). The models in this work stack several layers of restricted Boltzmann machines (RBMs) to represent each view, with an additional top layer that provides the joint representation. These are probabilistic graphical models, for which the maximum likelihood objective is intractable and the training procedures are more complex. Although probabilistic models have some advantages (e.g., dealing with missing values and generating samples in a natural way), DNN-based models have the advantages of a tractable objective and efficient training.

3.2 Kernel CCA

Formulation (1) encompasses several variants. Obviously, (1) reduces to linear CCA when $\mathbf{f}$ and $\mathbf{g}$ are identity mappings. One popular nonlinear approach is kernel CCA (KCCA, Lai and Fyfe, 2000; Akaho, 2001; Melzer et al., 2001; Bach and Jordan, 2002; Hardoon et al., 2004), corresponding to choosing $\mathbf{f}(\mathbf{x})$ and $\mathbf{g}(\mathbf{y})$ to be feature maps induced by positive-definite kernels $k_x(\cdot, \cdot)$ and $k_y(\cdot, \cdot)$, respectively (e.g., the Gaussian RBF kernel $k(\mathbf{a}, \mathbf{b}) = e^{-\|\mathbf{a}-\mathbf{b}\|^2/(2s^2)}$, where $s$ is the kernel width). Following the derivation of Bach and Jordan (2002, Section 3.2.1), it suffices to consider projection mappings of the form $\tilde{\mathbf{f}}(\mathbf{x}) = \sum_{i=1}^N \alpha_i k_x(\mathbf{x}, \mathbf{x}_i)$ and $\tilde{\mathbf{g}}(\mathbf{y}) = \sum_{i=1}^N \beta_i k_y(\mathbf{y}, \mathbf{y}_i)$, where $\alpha_i, \beta_i \in \mathbb{R}^L$, $i = 1, \dots, N$.

Denote by $\mathbf{K}_x$ the $N \times N$ kernel matrix for view 1, i.e., $(\mathbf{K}_x)_{ij} = k_x(\mathbf{x}_i, \mathbf{x}_j)$, and similarly denote by $\mathbf{K}_y$ the kernel matrix for view 2. Then (1) can be written as a problem in the coefficient matrices $\mathbf{A} = [\alpha_1, \dots, \alpha_L]^\top \in \mathbb{R}^{N \times L}$ and $\mathbf{B} = [\beta_1, \dots, \beta_L]^\top \in \mathbb{R}^{N \times L}$:

$$\begin{aligned} \max_{\mathbf{A}, \mathbf{B}} \quad & \frac{1}{N} \operatorname{tr} \left( \mathbf{A}^\top \mathbf{K}_x \mathbf{K}_y \mathbf{B} \right) \\ \text{s.t.} \quad & \mathbf{A}^\top \left( \frac{1}{N} \mathbf{K}_x^2 + r_x \mathbf{K}_x \right) \mathbf{A} = \mathbf{B}^\top \left( \frac{1}{N} \mathbf{K}_y^2 + r_y \mathbf{K}_y \right) \mathbf{B} = \mathbf{I}, \\ & \alpha_i^\top \mathbf{K}_x \mathbf{K}_y \beta_j = 0 \quad \text{for } i \neq j. \end{aligned} \tag{6}$$

Therefore we can conveniently work with the kernel (Gram) matrices instead of the possibly infinite-dimensional reproducing kernel Hilbert space (RKHS), and optimize directly over the coefficients. Following a derivation similar to that of CCA, one can show that the optimal solution $(\mathbf{A}^*, \mathbf{B}^*)$ satisfies

$$(\mathbf{K}_x + N r_x \mathbf{I})^{-1} \mathbf{K}_y (\mathbf{K}_y + N r_y \mathbf{I})^{-1} \mathbf{K}_x \mathbf{A}^* = \mathbf{A}^* \Sigma^2, \tag{7}$$

$$(\mathbf{K}_y + N r_y \mathbf{I})^{-1} \mathbf{K}_x (\mathbf{K}_x + N r_x \mathbf{I})^{-1} \mathbf{K}_y \mathbf{B}^* = \mathbf{B}^* \Sigma^2, \tag{8}$$

where $\Sigma$ is a diagonal matrix containing the leading correlation coefficients. Thus the optimal projection can be obtained by solving an eigenvalue problem of size $N \times N$ .
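For small $N$, (7) can be solved directly. The numpy sketch below uses toy data and uncentered Gram matrices for brevity (in practice the kernel matrices are centered first); this direct $\mathcal{O}(N^3)$ route is exactly what becomes infeasible at scale:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 60, 2
rx = ry = 0.1

X = rng.normal(size=(N, 3))
Y = X[:, :2] + 0.1 * rng.normal(size=(N, 2))   # view 2 correlated with view 1

def rbf_gram(Z, s=1.0):
    """Gaussian RBF Gram matrix K_ij = exp(-||z_i - z_j||^2 / (2 s^2))."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * s ** 2))

Kx, Ky = rbf_gram(X), rbf_gram(Y)

# Left-hand-side matrix of (7); its top eigenvectors give A*, and its
# eigenvalues are the squared (regularized) canonical correlations.
M = np.linalg.solve(Kx + N * rx * np.eye(N),
                    Ky @ np.linalg.solve(Ky + N * ry * np.eye(N), Kx))
evals, evecs = np.linalg.eig(M)
order = np.argsort(-evals.real)
corrs = np.sqrt(np.clip(evals.real[order][:L], 0.0, 1.0))
A = evecs.real[:, order[:L]]                   # view-1 coefficients alpha_i
```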

We can make a few observations about the KCCA method. First, the non-zero regularization parameters $(r_x, r_y) > 0$ are needed to avoid trivial solutions that achieve perfect correlation. Second, in KCCA the mappings themselves are not optimized, except that the kernel parameters are usually chosen by cross-validation. Third, exact KCCA is computationally challenging for large data sets, as it requires an eigendecomposition of an $N \times N$ matrix, which is expensive both in memory (storing the kernel matrices) and in time (solving the $N \times N$ eigenvalue systems naively costs $\mathcal{O}(N^3)$).

Various kernel approximation techniques have been proposed to scale up kernel machines. Two widely used techniques are random Fourier features (Lopez-Paz et al., 2014) and the Nyström approximation (Williams and Seeger, 2001). In random Fourier features, we randomly sample $M$ vectors of dimension $D_x$ (respectively $D_y$) from a Gaussian distribution and map the original inputs to $\mathbb{R}^M$ by computing dot products with the random samples followed by an element-wise cosine. That is, $\mathbf{f}(\mathbf{x}) = [\cos(\mathbf{w}_1^\top \mathbf{x} + b_1), \dots, \cos(\mathbf{w}_M^\top \mathbf{x} + b_M)]$, where $\mathbf{w}_i$ is sampled from $\mathcal{N}(\mathbf{0}, \mathbf{I}/s^2)$ ($s$ is the width of the Gaussian kernel we wish to approximate) and $b_i$ is sampled uniformly from $[0, 2\pi]$. Inner products between transformed samples then approximate kernel similarities between the original inputs. In the Nyström approximation, we randomly select $M$ training samples $\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_M$ and construct the $M \times M$ kernel matrix $\tilde{\mathbf{K}}_x$ based on these samples, i.e., $(\tilde{\mathbf{K}}_x)_{ij} = k_x(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)$. We compute the eigenvalue decomposition $\tilde{\mathbf{K}}_x = \tilde{\mathbf{R}}\tilde{\mathbf{\Lambda}}\tilde{\mathbf{R}}^\top$; the $N \times N$ kernel matrix for the entire training set can then be approximated as $\mathbf{K}_x \approx \mathbf{C}\tilde{\mathbf{K}}_x^{-1}\mathbf{C}^\top$, where $\mathbf{C}$ contains the columns of $\mathbf{K}_x$ corresponding to the selected subset, i.e., $\mathbf{C}_{ij} = k_x(\mathbf{x}_i, \tilde{\mathbf{x}}_j)$. This means $\mathbf{K}_x \approx \left(\mathbf{C}\tilde{\mathbf{R}}\tilde{\mathbf{\Lambda}}^{-1/2}\right) \left(\mathbf{C}\tilde{\mathbf{R}}\tilde{\mathbf{\Lambda}}^{-1/2}\right)^\top$, so we can use the $M \times N$ matrix $\mathbf{F} = \left(\mathbf{C}\tilde{\mathbf{R}}\tilde{\mathbf{\Lambda}}^{-1/2}\right)^\top$ as a new feature representation for view 1, in which inner products between samples approximate kernel similarities.
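Both approximations take only a few lines of numpy. The sketch below is for a single view with toy data; it includes the usual $\sqrt{2/M}$ scaling for random Fourier features (a detail omitted in the description above), and stores the Nyström features with one row per sample, i.e., the transpose of the $M \times N$ matrix in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, s = 200, 5, 50, 1.0       # toy sizes; M is the approximation rank
X = rng.normal(size=(N, D))

# Random Fourier features for the Gaussian kernel of width s:
# inner products of rows of Phi approximate kernel values.
W = rng.normal(scale=1.0 / s, size=(D, M))
b = rng.uniform(0.0, 2.0 * np.pi, size=M)
Phi = np.sqrt(2.0 / M) * np.cos(X @ W + b)            # N x M

# Nystrom approximation: M landmark samples, eigendecomposition of their
# Gram matrix, then extension to all N samples.
def rbf(A, B, s=1.0):
    sq = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-sq / (2.0 * s ** 2))

idx = rng.choice(N, size=M, replace=False)
K_mm = rbf(X[idx], X[idx], s)                         # landmark Gram matrix
lam, R = np.linalg.eigh(K_mm)
lam = np.clip(lam, 1e-10, None)                       # guard tiny eigenvalues
C = rbf(X, X[idx], s)                                 # columns of K_x at landmarks
F = C @ R @ np.diag(lam ** -0.5)                      # N x M; F F^T ~ K_x
```

On the landmark subset the Nyström reconstruction is exact: restricted to those rows, `F F^T` recovers the landmark Gram matrix.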

Both techniques produce rank-$M$ approximations of the kernel matrices with computational complexity $\mathcal{O}(M^3 + M^2 N)$; random Fourier features are data-independent and more efficient to generate, while Nyström tends to work better. Other approximation techniques, such as incomplete Cholesky decomposition (Bach and Jordan, 2002), partial Gram-Schmidt (Hardoon et al., 2004), and incremental SVD (Arora and Livescu, 2012), have also been proposed and applied to KCCA. However, for very large training sets, such as the ones in some of our tasks below, it remains difficult and costly to approximate KCCA well. Although iterative algorithms for very large CCA problems have recently been introduced (Lu and Foster, 2014), they are aimed at sparse matrices and do not have a natural out-of-sample extension.

3.3 Other related models

CCA is related to metric learning in a broad sense. In metric learning, the task is to learn a metric in the input space (or equivalently a projection mapping) such that the learned distances (or equivalently Euclidean distances in the projected space) between “similar” samples are small while distances between “dissimilar” samples are large. Metric learning often uses side information in the form of pairs of similar/dissimilar samples, which may be known a priori or derived from class labels (Xing et al., 2003; Shental et al., 2002; Bar-Hillel et al., 2005; Tsang and Kwok, 2003; Schultz and Joachims, 2004; Goldberger et al., 2005; Hoi et al., 2006; Globerson and Roweis, 2006; Davis et al., 2007; Weinberger and Saul, 2009).

Note that we can equivalently write the CCA objective as follows (obtained by replacing max with min and adding $1/2$ times the left-hand sides of the whitening constraints), where we abbreviate $\mathbf{F} = \mathbf{f}(\mathbf{X})$ and $\mathbf{G} = \mathbf{g}(\mathbf{Y})$:

$$\begin{aligned} \min_{\mathbf{U}, \mathbf{V}} \quad & \frac{1}{N} \left\| \mathbf{U}^\top \mathbf{F} - \mathbf{V}^\top \mathbf{G} \right\|_F^2 + \frac{r_x}{2} \|\mathbf{U}\|_F^2 + \frac{r_y}{2} \|\mathbf{V}\|_F^2 \\ \text{s.t.} \quad & \text{the same constraints as in (1).} \end{aligned} \tag{9}$$

In view of the above formulation, pairs of co-occurring two-view samples are mapped to nearby locations in the projected space and are thus treated as “similar” in CCA, while the whitening constraints set the scales of the projections so that the data set does not collapse to a constant and so that the projection dimensions are uncorrelated. Unlike the typical metric learning setting, the two-view data in CCA may come from different domains/modalities, and thus each view has its own projection mapping. Also, CCA uses no information regarding “dissimilar” pairs of two-view data. In this sense the CCA setting is more similar to that of Shental et al. (2002) and Bar-Hillel et al. (2005), which use only side information regarding groups of similar samples (“chunklets”) for single-view data.
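The distance form of the objective can be checked numerically (taking $r_x = r_y = 0$ for simplicity): for $(\mathbf{U}, \mathbf{V})$ satisfying the whitening constraints, the squared distance equals $2L$ minus twice the total correlation. A numpy sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
Lp, N = 3, 400
F = rng.normal(size=(Lp, N))
G = F + 0.5 * rng.normal(size=(Lp, N))   # correlated views (toy data)
F = F - F.mean(axis=1, keepdims=True)
G = G - G.mean(axis=1, keepdims=True)

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric PD matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

# Solve CCA with r_x = r_y = 0: whiten each view, then SVD of the
# whitened cross-covariance matrix.
Wx, Wy = inv_sqrt(F @ F.T / N), inv_sqrt(G @ G.T / N)
P, sig, Qt = np.linalg.svd(Wx @ (F @ G.T / N) @ Wy)
U, V = Wx @ P, Wy @ Qt.T                 # satisfy the whitening constraints

corr = sig.sum()                         # value of the CCA objective
dist = np.sum((U.T @ F - V.T @ G) ** 2) / N   # distance form, eq (9)
```

Under the constraints, each projected view has $\operatorname{tr}(\mathbf{U}^\top \mathbf{F}\mathbf{F}^\top \mathbf{U})/N = L$, so expanding the square gives `dist == 2 * L - 2 * corr`: minimizing the distance and maximizing the correlation coincide.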

For multi-view data, Globerson et al. (2007) propose an algorithm for learning Euclidean embeddings by defining a joint or conditional distribution of the views based on Euclidean distance in the embedding space and maximizing the data likelihood. The objective they minimize is the weighted mean of squared distances between embeddings of co-occurring pairs, plus a regularization term depending on the partition function of the defined distribution. This model differs from CCA in the global constraints/regularization used.

Recently, there has been increasing interest in learning (multi-view) representations using contrastive losses, which aim to enforce that distances between dissimilar pairs are larger than distances between similar pairs by some margin (Hermann and Blunsom, 2014; Huang et al., 2013). When there is no ground-truth information regarding similar/dissimilar pairs (as in our setting), random sampling is typically used to generate negative pairs. The sampling of negative pairs can be harmful when the probability of mistakenly sampling a similar pair is relatively high. In future work it would be interesting to compare contrastive losses with the CCA objective when only similar pairs are given, and to consider the effects of such sampling under a variety of data distributions.

Finally, CCA has a connection with the information bottleneck method (Tishby et al., 1999). Indeed, in the case of Gaussian variables, the information bottleneck method finds the same subspace as CCA (Chechik et al., 2005).

4. Experiments

We first demonstrate the proposed algorithms and related work on several multi-view feature learning tasks (Sections 4.1–4.3). In our setting, the second view is not available during test time, so we try to learn a feature transformation of the first/primary view that captures useful information from the second view using a paired two-view training set. Then, in Section 4.4, we explore the stochastic optimization procedure for the DCCA objective (1).

We focus on several downstream tasks including noisy digit image classification, speech recognition, and word pair semantic similarity. On these tasks, we compare the following methods in the multi-view learning setting:

  • DNN-based models, including SplitAE, CorrAE, DCCA, DCCAE, and DistAE.
  • Linear CCA (CCA), corresponding to DCCA with only a linear network with no hidden layers for both views.
  • Kernel CCA approximations. Exact KCCA is computationally infeasible for our tasks given the large training set sizes; we instead implement two kernel approximation techniques, using Gaussian RBF kernels. The first, denoted FKCCA, uses random Fourier features (Lopez-Paz et al., 2014), and the second, denoted NKCCA, uses the Nyström approximation (Williams and Seeger, 2001). As described in Section 3.2, in both FKCCA and NKCCA we transform the original inputs to an $M$ -dimensional feature space in which inner products between samples approximate the kernel similarities (Yang et al., 2012), and then apply linear CCA to the transformed inputs to obtain the approximate KCCA solution.
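The random-feature construction underlying FKCCA can be sketched as follows (an illustrative NumPy version of the Rahimi-Recht map; the function name and seed handling are our own):

```python
import numpy as np

def rff_features(X, M, width, seed=0):
    """Random Fourier features for the Gaussian RBF kernel
    k(x, x') = exp(-||x - x'||^2 / (2 * width^2)).
    Maps X of shape (N, d) to (N, M); inner products of the mapped
    samples approximate the kernel similarities."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], M)) / width  # samples from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)         # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)
```

FKCCA then runs linear CCA on `rff_features` of each view, with a separate kernel width per view.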

4.1 Noisy MNIST digits

In this task, we generate two-view data using the MNIST dataset (LeCun et al., 1998), which consists of $28 \times 28$ grayscale digit images, with $60K/10K$ images for training/testing. We generate a more challenging version of the dataset as follows (see Fig. 2 for examples). We first rescale the pixel values to $[0, 1]$ (by dividing the original values in $[0, 255]$ by 255). We then randomly rotate the images at angles uniformly sampled from $[-\pi/4, \pi/4]$ and use the resulting images as view 1 inputs. For each view 1 image, we randomly select an image of the same identity (0-9) from the original dataset, add independent random noise uniformly sampled from $[0, 1]$ to each pixel, and truncate the final pixel values to $[0, 1]$ to obtain the corresponding view 2 sample. The original training set is further split into training/tuning sets of size $50K/10K$ .

Figure 2: Selection of view 1 images (left) and their corresponding view 2 images (right) from our noisy MNIST dataset.
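The two-view construction can be sketched as follows (an illustrative implementation; we use `scipy.ndimage.rotate` as a stand-in for the rotation, and the interpolation details are our assumption rather than a specification from the experiments):

```python
import numpy as np
from scipy.ndimage import rotate

def make_two_views(images, labels, rng):
    """images: (N, 28, 28) uint8 arrays in [0, 255]; labels: (N,) digit ids.
    Returns (view1, view2) following the construction described above."""
    X = images.astype(np.float64) / 255.0                 # rescale to [0, 1]
    N = X.shape[0]
    angles = rng.uniform(-45.0, 45.0, size=N)             # [-pi/4, pi/4], in degrees
    view1 = np.stack([rotate(x, a, reshape=False, order=1)
                      for x, a in zip(X, angles)])
    view2 = np.empty_like(X)
    for i in range(N):
        j = rng.choice(np.flatnonzero(labels == labels[i]))  # same digit identity
        view2[i] = np.clip(X[j] + rng.uniform(0.0, 1.0, size=(28, 28)), 0.0, 1.0)
    return view1, view2
```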

Since, given the digit identity, observing a view 2 image does not provide any information about the corresponding view 1 image, a good multi-view learning algorithm should be able to extract features that disregard the noise. We measure the class separation in the learned feature spaces via both clustering and classification performance. First, we cluster the projected view 1 inputs into 10 clusters and evaluate how well the clusters agree with the ground-truth labels. We use spectral clustering (Ng et al., 2002) so as to account for possibly non-convex cluster shapes. Specifically, we first build a $k$ -nearest-neighbor graph on the projected view 1 tuning/test samples with a binary weighting scheme (edges connecting neighboring samples have a constant weight of 1), then embed these samples in $\mathbb{R}^{10}$ using eigenvectors of the normalized graph Laplacian, and finally run $K$ -means in the embedding to obtain a hard partition of the samples. In the last step, $K$ -means is run 20 times with random initialization and the run with the best $K$ -means objective is used. The neighborhood size $k$ of the graph is selected from $\{5, 10, 20, 30, 50\}$ using the tuning set. We measure clustering performance with two criteria, clustering accuracy (AC) and normalized mutual information (NMI) (Xu et al., 2003; Manning et al., 2008). The AC is defined as

$$AC = \frac{\sum_{i=1}^N \delta(s_i, \text{map}(r_i))}{N}, \quad (10)$$

where $s_i$ is the ground truth label of sample $\mathbf{x}_i$ , $r_i$ is the cluster label of $\mathbf{x}_i$ , and $\text{map}(r_i)$ is an optimal permutation mapping between cluster labels and ground truth labels, obtained by solving a linear assignment problem with the Hungarian algorithm (Munkres, 1957). The NMI considers the joint probability distribution over the ground truth label set $C$ and cluster label set $C'$ , and is defined by the following set of equations:

$$NMI(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))} \quad (11)$$

where
$$MI(C, C') = \sum_{c_i \in C} \sum_{c'_j \in C'} p(c_i, c'_j) \log_2 \frac{p(c_i, c'_j)}{p(c_i)p(c'_j)} \quad (12)$$

$$H(C) = - \sum_{c_i \in C} p(c_i) \log_2 p(c_i), \quad H(C') = - \sum_{c'_j \in C'} p(c'_j) \log_2 p(c'_j), \quad (13)$$

where $p(c_i)$ is interpreted as the probability of a sample having label $c_i$ and $p(c_i, c'_j)$ the probability of a sample having label $c_i$ while being assigned to cluster $c'_j$ (all of which can be computed by counting the samples in the joint partition of $C$ and $C'$ ). Larger values of these criteria (with an upper bound of 1) indicate better agreement between the clustering and ground-truth labeling.
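The two criteria can be computed directly from the definitions above (an illustrative implementation; we use `scipy.optimize.linear_sum_assignment` for the Hungarian algorithm, and the function names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true, pred):
    """AC of Eq. (10): the optimal cluster-to-class map is found by
    solving a linear assignment problem (Hungarian algorithm)."""
    classes, clusters = np.unique(true), np.unique(pred)
    C = np.array([[np.sum((pred == r) & (true == s)) for s in classes]
                  for r in clusters], dtype=float)
    row, col = linear_sum_assignment(-C)      # negate counts to maximize matches
    return C[row, col].sum() / true.size

def nmi(true, pred):
    """NMI of Eqs. (11)-(13), computed from the joint label distribution."""
    N = true.size
    P = np.array([[np.sum((true == c) & (pred == k)) / N
                   for k in np.unique(pred)] for c in np.unique(true)])
    pc, pk = P.sum(axis=1), P.sum(axis=0)     # marginals p(c_i), p(c'_j)
    nz = P > 0
    mi = np.sum(P[nz] * np.log2(P[nz] / np.outer(pc, pk)[nz]))
    ent = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return mi / max(ent(pc), ent(pk))
```

Both criteria equal 1 exactly when the clustering matches the ground truth up to a relabeling of the clusters.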

Each algorithm has hyperparameters that are selected using the tuning set. The final dimensionality $L$ is selected from $\{5, 10, 20, 30, 50\}$ . For CCA, the regularization parameters $r_x$ and $r_y$ are selected via grid search. For the KCCAs, we fix both $r_x$ and $r_y$ at a small positive value of $10^{-4}$ (Lopez-Paz et al. (2014) suggest that FKCCA is robust to the choice of $r_x, r_y$ ), and do grid search for the Gaussian kernel width over $\{2, 3, 4, 5, 6, 8, 10, 15, 20\}$ for view 1 and $\{2.5, 5, 7.5, 10, 15, 20, 30\}$ for view 2 at rank $M = 5,000$ , and then test with $M = 20,000$ . For DNN-based models, the feature mappings $(\mathbf{f}, \mathbf{g})$ are implemented by networks of 3 hidden layers, each of 1024 sigmoid units, and a linear output layer of $L$ units; the reconstruction mappings $(\mathbf{p}, \mathbf{q})$ are implemented by networks of 3 hidden layers, each of 1024 sigmoid units, and an output layer of 784 sigmoid units. We fix $r_x = r_y = 10^{-4}$ for DCCA and DCCAE. For SplitAE/CorrAE/DCCAE/DistAE, we select the trade-off parameter $\lambda$ via grid search over $\{0.001, 0.01, 0.1, 1, 10\}$ , allowing a trade-off between the correlation term (with value in $[0, L]$ ) and the reconstruction term (varying roughly in the range $[60, 110]$ , as the reconstruction error for the second view is always large). The networks $(\mathbf{f}, \mathbf{p})$ are pre-trained in a layerwise manner using restricted Boltzmann machines (Hinton and Salakhutdinov, 2006), and similarly for $(\mathbf{g}, \mathbf{q})$ with inputs from the corresponding view.

For DNN-based models, we use SGD for optimization with minibatch size, learning rate and momentum tuned on the tuning set (more on this in Section 4.4). A small weight decay parameter of $10^{-4}$ is used for all layers. We monitor the objective on the tuning set for early stopping. For each algorithm, we select the model with the best AC on the tuning set, and report its results on the test set. The AC and NMI results (in percent) for each algorithm are given in Table 1. As a baseline, we also cluster the original 784-dimensional view 1 images.

Next, we also measure the quality of the projections via classification experiments. If the learned features are clustered well into classes, then one might expect that a simple linear classifier can achieve high accuracy on these projections. We train one-versus-one linear SVMs (Chang and Lin, 2011) on the projected training set (now using the ground truth labels) and test on the projected test set, using the projected tuning set to select the SVM hyperparameter (the penalty parameter for the hinge loss). Test error rates on the optimal embedding of each algorithm (with highest AC) are provided in Table 1 (last column). These error rates agree with the clustering results. Multi-view feature learning makes classification much easier on this task: instead of using a heavily nonlinear classifier on the original inputs, a very simple linear classifier that can be trained efficiently on low-dimensional projections already achieves high accuracy.

Figure 3: $t$ -SNE embedding of the projected test set of noisy MNIST digits by different algorithms. Each sample is denoted by a marker located at its embedding coordinates and color-coded by its identity, except in (a') where the actual input images are shown for samples of classes '1' and '3'. Neither the feature learning nor $t$ -SNE is aware of the class information. We run the $t$ -SNE implementation of Vladymyrov and Carreira-Perpiñán (2012) on projected test images for 300 iterations with a perplexity of 20.

Figure 4: $t$ -SNE embedding of the projected test set of noisy MNIST digits by different algorithms. Each sample is denoted by a marker located at its embedding coordinates and color-coded by its identity. Neither the feature learning algorithms nor $t$ -SNE is aware of the class information.

Table 1: Performance of several representation learning methods on the noisy MNIST digits test set. Performance measures are clustering accuracy (AC), normalized mutual information (NMI) of clustering, and classification error rate of a linear SVM on the projections. The selected feature dimensionality $L$ is given in parentheses. Results are averaged over 5 random seeds.

Method AC (%) NMI (%) Error (%)
Baseline 47.0 50.6 13.1
CCA (L = 10) 72.9 56.0 19.6
SplitAE (L = 10) 64.0 69.0 11.9
CorrAE (L = 10) 65.5 67.2 12.9
DistAE-1 (L = 20) 53.5 60.2 16.0
DistAE-2 (L = 10) 62.6 65.6 15.2
FKCCA (L = 10) 94.7 87.3 5.1
NKCCA (L = 10) 95.1 88.3 4.5
DCCA (L = 10) 97.0 92.0 2.9
DCCAE (L = 10) 97.5 93.4 2.2

All of the multi-view feature learning algorithms achieve some improvement over the baseline. The nonlinear CCA algorithms all perform similarly, and significantly better than SplitAE, CorrAE, and DistAE. We also qualitatively investigate the features by embedding the projected features in 2D using $t$ -SNE (van der Maaten and Hinton, 2008); the resulting visualizations are given in Figures 3 and 4. Overall, the visual class separation qualitatively agrees with the relative clustering and classification performance in Table 1.

In the embedding of input images (Figure 3 (a)), samples of each digit form an approximately one-dimensional, stripe-shaped manifold, and the degree of freedom along each manifold corresponds roughly to the variation in rotation angle (see Figure 3 (a')). This degree of freedom does not change the identity of the image, which is common to both views. Projections by SplitAE/CorrAE/DistAE do achieve somewhat better separation for some classes, but the unwanted rotation variation is still prominent in the embeddings. On the other hand, without using any label information and with only paired noisy images, the nonlinear CCA algorithms manage to map digits of the same identity to similar locations while suppressing the rotational variation and separating images of different identities. Linear CCA also approximates the same behavior, but fails to separate the classes, presumably because the input variations are too complex to be captured by only linear mappings. Overall, DCCAE gives the cleanest embedding, with different digits pushed far apart and good separation achieved.

The different behavior of the CCA-based methods from SplitAE/CorrAE/DistAE suggests two things. First, when the inputs are noisy, reconstructing the inputs faithfully may introduce unwanted degrees of freedom into the features (DCCAE tends to select a relatively small trade-off parameter $\lambda = 10^{-3}$ or $10^{-2}$ ), further supporting that it is not necessary to fully minimize the reconstruction error. We show the clustering accuracy of DCCAE at $L = 10$ for different $\lambda$ values in Figure 5. Second, the hard CCA constraints, which enforce uncorrelatedness between different feature dimensions, appear essential to the success of the CCA-based methods; these constraints are the difference between DCCAE and CorrAE. However, the constraints alone, without the multi-view objective, seem to be insufficient. To see this, we also visualize a 10-dimensional locally linear embedding (LLE, Roweis and Saul, 2000) of the test images in Fig. 3 (b). LLE satisfies the same uncorrelatedness constraints as the CCA-based methods, but without access to the second view, it does not separate the classes as well.

Figure 5: Clustering accuracies (%) of DCCAE for different $\lambda$ values at $L = 10$ , with other hyperparameters set to their optimal values. Each point gives the mean accuracy obtained with 5 random seeds. $\lambda = 0.0005$ gives the best performance, but only by a very slight margin.

4.2 Acoustic-articulatory data for speech recognition

We next experiment with the Wisconsin X-Ray Microbeam (XRMB) corpus (Westbury, 1994) of simultaneously recorded speech and articulatory measurements from 47 American English speakers. Multi-view feature learning via CCA/KCCA/DCCA has previously been shown to improve phonetic recognition performance when tested on audio alone (Arora and Livescu, 2013; Wang et al., 2015a).

We follow the setup of Arora and Livescu (2013) and use the learned features for speaker-independent phonetic recognition. As in Arora and Livescu (2013), the inputs to multi-view feature learning are acoustic features (39D features consisting of mel frequency cepstral coefficients (MFCCs) and their first and second derivatives) and articulatory features (horizontal/vertical displacements of 8 pellets attached to different parts of the vocal tract), each concatenated over a 7-frame window around each frame, giving 273D acoustic inputs and 112D articulatory inputs for each view. We split the XRMB speakers into disjoint sets of 35/8/2/2 speakers for feature learning/recognizer training/tuning/testing. The 35 speakers for feature learning are fixed; the remaining 12 are used in a 6-fold experiment (recognizer training on 8 speakers, tuning on 2 speakers, and testing on the remaining 2 speakers). Each speaker has roughly 50K frames, giving 1.43M multi-view training frames; this is a much larger training set than those used in previous work on this dataset. We remove the per-speaker mean and variance of the articulatory measurements for each training speaker. All of the learned feature types are used in a tandem approach (Hermansky et al., 2000), i.e., they are appended to the original 39D features and used in a standard hidden Markov model (HMM)-based recognizer with Gaussian mixture model observation distributions. The baseline is the recognition performance using the original MFCC features. The recognizer has one 3-state left-to-right HMM per phone, using the same language model as in Arora and Livescu (2013).
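The window concatenation used to build the inputs can be sketched as follows (an illustrative implementation; the edge handling, repeating boundary frames, is our assumption rather than a detail specified in the text):

```python
import numpy as np

def stack_context(frames, context=3):
    """Concatenate each frame with `context` frames on each side
    (a 7-frame window for context=3). frames: (T, d) -> (T, (2*context+1)*d).
    Boundary frames are repeated for padding."""
    T = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode='edge')
    return np.concatenate([padded[i:i + T] for i in range(2 * context + 1)], axis=1)
```

With `context=3`, 39D acoustic frames become 39 x 7 = 273D inputs and 16D articulatory frames become 112D inputs, matching the dimensions above.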

For each fold, we select the best hyperparameters based on recognition accuracy on the tuning speakers, and use the corresponding learned model for the test speakers. As before, models based on neural networks are trained via SGD, with the optimization parameters tuned by grid search. Here we do not use pre-training for weight initialization. A small weight decay parameter of $5 \times 10^{-4}$ is used for all layers. For each algorithm, the feature dimensionality $L$ is tuned over $\{30, 50, 70\}$ . For DNN-based models, we use hidden layers of 1500 ReLUs. For DCCA, we vary the network depth (up to 3 nonlinear hidden layers) of each view. In the best DCCA architecture, $\mathbf{f}$ has 3 ReLU layers of 1500 units followed by a linear output layer, while $\mathbf{g}$ has only a linear output layer.

For CorrAE/DistAE/DCCAE, we use the same architecture of DCCA for the encoders, and we set the decoders to have symmetric architectures to the encoders. For this domain, we find that the best choice of architecture for the encoders/decoders for View 2 is linear while for View 1 it is typically three layers deep. For SplitAE, the encoder $\mathbf{f}$ is similarly deep and the View 1 decoder $\mathbf{p}$ has the symmetric architecture, while its View 2 decoder $\mathbf{q}$ was set to linear to match the best choice for the other methods. We fix $(r_x, r_y)$ to small values as before. The trade-off parameter $\lambda$ is tuned for each algorithm by grid search.

For FKCCA, we find it important to use a large number of random features $M$ to get a competitive result, consistent with the findings of Huang et al. (2014) on using random Fourier features for speech data. We tune kernel widths at $M = 5,000$ with FKCCA, and test FKCCA with $M = 30,000$ (the largest $M$ for which we could obtain an exact SVD solution on a workstation with 32G main memory); we are not able to obtain results for NKCCA with $M = 30,000$ in 48 hours with our implementation, so we report its test performance at $M = 20,000$ with the optimal FKCCA hyperparameters. Notice that FKCCA has about 14.6 million parameters (the random Gaussian samples plus the projection matrices from random Fourier features to the $L$ -dimensional KCCA features), which is more than the number of weight parameters in the largest DCCA model, so it is slower than DCCA at test time (the cost of computing test features is linear in the number of parameters for both KCCA and DNNs).

Phone error rates (PERs) obtained by the different feature learning algorithms are given in Table 2. We see the same pattern as on MNIST: nonlinear CCA-based algorithms outperform SplitAE/CorrAE/DistAE. Since the recognizer is now a nonlinear mapping (an HMM), the performance of the linear CCA features is highly competitive. Again, DCCAE tends to select a relatively small $\lambda$ , indicating that the canonical correlation term is more important.

Table 2: Means and standard deviations of PERs over 6 folds obtained by each algorithm on the XRMB test speakers.

Method Mean (std) PER (%)
Baseline 34.8 (4.5)
CCA 26.7 (5.0)
SplitAE 29.0 (4.7)
CorrAE 30.6 (4.8)
DistAE-1 33.2 (4.7)
DistAE-2 32.7 (4.9)
FKCCA 26.0 (4.4)
NKCCA 26.6 (4.2)
DCCA 24.8 (4.4)
DCCAE 24.5 (3.9)

Table 3: Spearman’s correlation ( $\rho$ ) for word/bigram similarities.

Method WS-353 WS-SIM WS-REL RG-65 MC-30 MTurk-287 AN VN
Baseline 67.0 73.0 63.1 73.4 78.5 52.0 45.0 39.1
CCA 68.4 71.9 64.7 76.6 83.4 58.7 46.6 43.2
SplitAE 63.5 68.0 58.0 75.9 85.9 62.9 46.6 44.6
CorrAE 63.5 69.0 57.5 73.7 81.5 61.6 43.4 42.0
DistAE-1 64.5 70.7 58.2 75.7 84.2 61.6 45.3 39.4
DistAE-2 64.9 70.4 61.9 75.3 82.5 62.5 42.4 40.0
FKCCA 68.7 70.5 62.5 70.3 80.8 60.8 46.4 42.9
NKCCA 69.9 74.9 65.7 80.7 85.8 57.5 48.3 40.9
DCCA 69.7 75.4 63.4 79.4 84.3 65.0 48.5 42.5
DCCAE 69.7 75.4 64.4 79.4 84.7 65.0 49.1 43.2

4.3 Multilingual data for word embeddings

In this task, we learn a vectorial representation of English words from pairs of English-German word embeddings for improved semantic similarity. We follow the setup of Faruqui and Dyer (2014) and use as inputs 640-dimensional monolingual word vectors trained via latent semantic analysis on the WMT 2011 monolingual news corpora and use the same 36K English-German word pairs for multi-view learning. The learned mappings are applied to the original English word embeddings (180K words) and the projections are used for evaluation.

We evaluate the learned features on two groups of tasks. The first group consists of the word similarity tasks from Faruqui and Dyer (2014): WS-353 and its two splits WS-SIM and WS-REL, RG-65, MC-30, and MTurk-287. The second group of tasks uses the adjective-noun (AN) and verb-object (VN) subsets of the bigram similarity dataset of Mitchell and Lapata (2010), with tuning and test splits (of size 649/1972) for each subset (we exclude the noun-noun subset, as we find that the NN human annotations often reflect "topical" rather than "functional" similarity). We simply add the projections of the two words in each bigram to obtain an $L$ -dimensional representation of the bigram, as done in prior work (Blacoe and Lapata, 2012). We compute the cosine similarity between the two vectors of each bigram pair, order the pairs by similarity, and report the Spearman's correlation ( $\rho$ ) between the model's ranking and the human rankings.

Figure 6: Learning curves (total canonical correlation vs. training time) of DCCA on the XRMB 'JW11' tuning set. The maximum correlation is equal to the dimensionality (112). Each marker corresponds to one epoch (one pass over the training data) for STO or NOI, or one iteration for L-BFGS. "STO $n$ " denotes stochastic optimization with minibatch size $n$ .
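The bigram evaluation can be sketched as follows (illustrative; `emb`, the pair format, and the function names are ours, and `scipy.stats.spearmanr` is one way to compute the rank correlation against the human ratings):

```python
import numpy as np
from scipy.stats import spearmanr

def bigram_scores(emb, pairs):
    """emb: dict mapping word -> projected L-dim vector;
    pairs: list of ((w1, w2), (w3, w4)) bigram pairs.
    Each bigram is the sum of its two word projections;
    pairs are scored by cosine similarity."""
    def vec(w1, w2):
        return emb[w1] + emb[w2]
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return [cosine(vec(*b1), vec(*b2)) for b1, b2 in pairs]

def spearman_rho(model_scores, human_scores):
    """Rank correlation between model scores and human ratings."""
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```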

We tune the feature dimensionality $L$ over $\{128, 384\}$ ; other hyperparameters are tuned as in the previous experiments. DNN-based models use ReLU hidden layers of width 1,280. A small weight decay parameter of $10^{-4}$ is used for all layers. We use two ReLU hidden layers for the encoders ( $\mathbf{f}$ and $\mathbf{g}$ ), and try both linear networks and nonlinear networks with two hidden layers for the decoders ( $\mathbf{p}$ and $\mathbf{q}$ ). FKCCA/NKCCA are tested with $M = 20,000$ using kernel widths tuned at $M = 4,000$ . We fix $r_x = r_y = 10^{-4}$ for the nonlinear CCAs and tune them over $\{10^{-6}, 10^{-4}, 10^{-2}, 1, 10^2\}$ for CCA.

For the word similarity tasks, we simply report the highest Spearman's correlation obtained by each algorithm. For the bigram similarity tasks, we select for each algorithm the model with the highest Spearman's correlation on the 649 tuning bigram pairs, and report its performance on the 1972 test pairs. The results are given in Table 3. Unlike on MNIST and XRMB, it is important here for the features to reconstruct the input monolingual word embeddings well, as can be seen from the superior performance of SplitAE over FKCCA/NKCCA/DCCA. This implies that there is useful information in the original inputs that is not correlated across views. However, DCCAE still performs the best on the AN task, in this case using a relatively large $\lambda = 0.1$ .

4.4 Empirical analysis of DCCA optimization

We now explore the issue of stochastic optimization for DCCA, as discussed in Section 2.2. We use the same XRMB dataset as in the acoustic-articulatory experiment of Section 4.2. In this experiment, we select utterances from a single speaker, 'JW11', and divide them into training/tuning/test splits of roughly 30K/11K/9K pairs of acoustic and articulatory frames.5 Since the sole purpose of this experiment is to study the optimization of nonlinear CCA algorithms rather than the usefulness of the features, we do not carry out any downstream tasks or careful model selection for that purpose (e.g., a search over the feature dimensionality $L$ ).

We now consider the effect of our stochastic optimization procedure for DCCA, denoted STO below, and demonstrate the importance of the minibatch size. We use a 3-layer architecture where the acoustic and articulatory networks have two hidden layers of 1800 and 1200 rectified linear units (ReLUs) respectively, and the output dimensionality (and therefore the maximum possible total canonical correlation over dimensions) is $L = 112$ . We use a small weight decay $\gamma = 10^{-4}$ , and do grid search over several hyperparameters: $r_x, r_y \in \{10^{-4}, 10^{-2}, 1, 10^2\}$ , constant learning rate in $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ , fixed momentum in $\{0, 0.5, 0.9, 0.95, 0.99\}$ , and minibatch size in $\{100, 200, 300, 400, 500, 750, 1000\}$ . After learning the projection mappings on the training set, we apply them to the tuning/test sets and measure the canonical correlation between views. Figure 6 shows the learning curves on the tuning set for different minibatch sizes, each using the optimal values of the other hyperparameters. It is clear that for small minibatches (100, 200), the objective quickly plateaus at a low value, whereas for large enough minibatch sizes there is always a steep increase at the beginning, a known advantage of stochastic first-order algorithms (Bottou and Bousquet, 2008), and a wide range of learning rates/momenta give very similar results. The reason for this behavior is that the stochastic estimate of the DCCA objective becomes more accurate as the minibatch size $n$ increases. We provide a theoretical analysis of the error between the true objective and its stochastic estimate in Appendix A.

Recently, Wang et al. (2015c) have proposed a nonlinear orthogonal iterations (NOI) algorithm for the DCCA objective which extends the alternating least squares procedure (Golub and Zha, 1995; Lu and Foster, 2014) and its stochastic version (Ma et al., 2015) for CCA. Each iteration of the NOI algorithm adaptively estimates the covariance matrices of the projections of each view, whitens the projections of a minibatch using the estimated covariance matrices, and takes a gradient step over the DNN weight parameters of the nonlinear least squares problems of regressing each view's input against the whitened projection of the other view for the minibatch. The advantage of NOI is that it performs well with smaller minibatch sizes and thus reduces memory consumption. Wang et al. (2015c) have shown that NOI can achieve the same objective value as STO using smaller minibatches. For the problems considered in this paper, however, each epoch of NOI takes longer than one epoch of STO, due to the whitening operations at each iteration (involving eigenvalue decompositions of $L \times L$ covariance matrices); the smaller the minibatch size used in NOI, the more frequently such operations are run. We report the learning curve of NOI with minibatch size $n = 100$ , with all other hyperparameters tuned similarly to STO. NOI can eventually reach the same objective as STO given more training time. Ultimately, the choice of optimization algorithm will depend on the task-specific data and time/memory constraints.

5. Our split of the data is the same as the one used by Andrew et al. (2013). We note that Lopez-Paz et al. (2014) used the same speaker in their experiments, but they randomly shuffled all 50K frames before creating the splits. We suspect that DCCA (as well as their own algorithm) was under-tuned in their experiments. We ran experiments on a randomly shuffled dataset with careful tuning of the kernel widths for FKCCA/NKCCA (using rank $M = 6000$ ) and obtained canonical correlations of 99.2/105.6/107.6 for FKCCA/NKCCA/DCCA, which are better than those reported in Lopez-Paz et al. (2014).

Figure 7: Total canonical correlation achieved by approximate KCCAs on the XRMB 'JW11' tuning set for different ranks $M$ .

Table 4: Total canonical correlation achieved by each algorithm on the XRMB 'JW11' test set. The maximum achievable value here is 112.

Method CCA FKCCA NKCCA DCCA (L-BFGS) DCCA (stochastic)
Canon. corr. 23.4 68.2 74.3 83.7 89.6

Finally, we also train the same model using batch L-BFGS optimization6 with the same random initial weight parameters, and tune $(r_x, r_y)$ on the same grid. While L-BFGS does well on the training set, its performance on the tuning/test sets is usually worse than that of stochastic optimization with reasonable hyperparameters.

We also train FKCCA and NKCCA on this dataset. We tune $(r_x, r_y)$ on the same grid as for DCCA, tune the kernel width for each view, and vary the approximation rank $M$ from 1000 to 6000 for each method. We plot the best total canonical correlation achieved by each algorithm on the tuning set as a function of $M$ in Figure 7. Clearly both algorithms require relatively large $M$ to perform well on this task, but the returns from further increasing $M$ are diminishing. NKCCA achieves better results than FKCCA, although forming its low-rank approximation also becomes more costly as $M$ increases. We plot the total canonical correlation achieved by each algorithm at different running times with various hyperparameters in Figure 8.

6. We use the L-BFGS implementation of Schmidt (2012), which includes a good line-search procedure.

Figure 8: Total canonical correlation vs. running time (hours) achieved by approximate KCCAs (varying the rank $M$ and kernel widths), by DCCA with L-BFGS (varying $(r_x, r_y)$ ), and by DCCA with stochastic optimization (varying $(r_x, r_y)$ , minibatch size, learning rate, and momentum, run for up to 100 epochs) on the XRMB 'JW11' tuning set. To avoid clutter, we show FKCCA/NKCCA results for the best 300 hyperparameter combinations, and DCCA results for the best 100 hyperparameter combinations. Since both L-BFGS and stochastic optimization are iterative algorithms, we also show the correlation obtained at different numbers of epochs.

Finally, we select for each algorithm the best model on the tuning set and give the corresponding total canonical correlation on the test set in Table 4.

5. Conclusion

We have explored several approaches in the space of DNN-based multi-view representation learning. We have found that on several tasks, CCA-based models outperform autoencoder-based models (SplitAE) and models based on between-view squared distance (DistAE) or correlation (CorrAE) instead of canonical correlation. The best overall performer is a new DCCA extension, deep canonically correlated autoencoders (DCCAE). We have studied these objectives in the context of DNNs, but we expect that the same trends should apply also to other network architectures such as convolutional (LeCun et al., 1998) and recurrent (Elman, 1990; Hochreiter and Schmidhuber, 1997) networks, and this is one direction for future work.

In light of the empirical results, it is interesting to consider again the main features of each type of objective and the corresponding constraints. Autoencoder-based approaches are based on the idea that the learned features should be able to accurately reconstruct the inputs (in the case of multi-view learning, the inputs in both views). The CCA objective, on the other hand, focuses on how well each view's representation predicts the other's, ignoring the ability to reconstruct each view. CCA is expected to perform well for clustering and classification when the two views are uncorrelated given the class label (Chaudhuri et al., 2009). The noisy MNIST dataset used here simulates exactly this scenario, and indeed this is the task where deep CCA outperforms the other objectives by the largest margins. Even in the other tasks, however, there usually seems to be only a small advantage to being able to reconstruct the inputs faithfully.

The constraints in the various methods also have an important effect. The performance difference between DCCA and CorrAE demonstrates that uncorrelatedness between learned dimensions is important. On the other hand, the stronger DCCA constraint may still not be sufficiently strong; an even better constraint may be to require the learned dimensions to be independent (or approximately so), and this is an interesting avenue for future work.

We believe the applicability of the DCCA objective goes beyond the unsupervised feature learning setting. For example, it can be used as a data-dependent regularizer in supervised or semi-supervised learning settings where we have some labeled data as well as multi-view observations. The usefulness of CCA in such settings has previously been analyzed theoretically (Kakade and Foster, 2007; McWilliams et al., 2013) and has begun to be explored experimentally (Arora and Livescu, 2014).

Appendix A. Analysis of stochastic optimization for DCCA

Our SGD-like optimization for DCCA works as follows. We randomly pick a minibatch of $n$ training pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^n$, feed them forward through the networks $\mathbf{f}$ and $\mathbf{g}$ to obtain outputs $\{(\mathbf{f}_i, \mathbf{g}_i)\}_{i=1}^n$ where $\mathbf{f}_i = \mathbf{f}(\mathbf{x}_i) \in \mathbb{R}^{d_x}$ and $\mathbf{g}_i = \mathbf{g}(\mathbf{y}_i) \in \mathbb{R}^{d_y}$, and estimate the covariances of the network outputs based on these samples, namely $\hat{\Sigma}_{xx} = \frac{1}{n} \sum_{i=1}^n \mathbf{f}_i \mathbf{f}_i^\top + r_x \mathbf{I}$, $\hat{\Sigma}_{yy} = \frac{1}{n} \sum_{i=1}^n \mathbf{g}_i \mathbf{g}_i^\top + r_y \mathbf{I}$, and $\hat{\Sigma}_{xy} = \frac{1}{n} \sum_{i=1}^n \mathbf{f}_i \mathbf{g}_i^\top$. We finally compute a “partial” objective, which is the sum of the top $L$ singular values of $\hat{\mathbf{T}} = \hat{\Sigma}_{xx}^{-1/2} \hat{\Sigma}_{xy} \hat{\Sigma}_{yy}^{-1/2}$, together with its gradient with respect to the network outputs and the weights at each layer of each network (see Section 3 of Andrew et al., 2013 for the gradient formulas).

We denote by $\Theta^{(n)}$ the “partial” objective based on the $n$ samples, i.e.

$$\Theta^{(n)} = \sum_{k=1}^L \sigma_k(\hat{\mathbf{T}}) = \sum_{k=1}^L \sigma_k\left(\hat{\Sigma}_{xx}^{-1/2} \hat{\Sigma}_{xy} \hat{\Sigma}_{yy}^{-1/2}\right).$$
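A minimal NumPy sketch of this minibatch computation may help make it concrete (the function name, the assumption of centered outputs, and the default regularizers are ours; the gradients, which in practice come from the formulas of Andrew et al., 2013 or automatic differentiation, are omitted):

```python
import numpy as np

def partial_dcca_objective(F, G, L, r_x=1e-4, r_y=1e-4):
    """Sum of the top-L singular values of T-hat computed from a minibatch.

    F: (n, d_x) array of view-1 network outputs f_i (assumed centered).
    G: (n, d_y) array of view-2 network outputs g_i (assumed centered).
    """
    n = F.shape[0]
    Sxx = F.T @ F / n + r_x * np.eye(F.shape[1])   # \hat{Sigma}_xx
    Syy = G.T @ G / n + r_y * np.eye(G.shape[1])   # \hat{Sigma}_yy
    Sxy = F.T @ G / n                              # \hat{Sigma}_xy

    # Inverse matrix square roots via eigendecomposition (both are SPD).
    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)        # \hat{T}
    sv = np.linalg.svd(T, compute_uv=False)        # singular values, descending
    return np.sum(sv[:L])
```

The returned value lies between 0 and $L$, since each singular value of $\hat{\mathbf{T}}$ is at most 1 (they are regularized canonical correlations of the minibatch).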

Recall that our true objective function is

$$\Theta = \sum_{k=1}^L \sigma_k(\mathbf{T}) = \sum_{k=1}^L \sigma_k\left(\Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}\right),$$

where $\Sigma_{xx} = \frac{1}{N} \sum_{i=1}^N \mathbf{f}_i \mathbf{f}_i^\top + r_x \mathbf{I}$, $\Sigma_{yy} = \frac{1}{N} \sum_{i=1}^N \mathbf{g}_i \mathbf{g}_i^\top + r_y \mathbf{I}$, and $\Sigma_{xy} = \frac{1}{N} \sum_{i=1}^N \mathbf{f}_i \mathbf{g}_i^\top$ are computed over the entire training set of $N$ pairs.

An important observation is that

$$\mathbb{E}\, \Theta^{(n)} \neq \Theta,$$

where the expectation is taken over the random selection of the $n$ training samples (all expectations in this section are taken over this random sampling process). In other words, the “partial” objective we compute is not, in expectation, equal to the true objective we want to optimize, and thus the naive implementation is not really a stochastic gradient algorithm (which requires the gradient estimate to be unbiased). The reason we cannot in general have $\mathbb{E}\, \Theta^{(n)} = \Theta$ for the naive implementation is that our objective involves three nonlinear operations: the sum of singular values, the product of three matrices, and the inverse square roots of the auto-covariance matrices. These nonlinearities keep us from moving the expectation inside, even though we do have $\mathbb{E}\, \hat{\Sigma}_{xx} = \Sigma_{xx}$, $\mathbb{E}\, \hat{\Sigma}_{yy} = \Sigma_{yy}$, and $\mathbb{E}\, \hat{\Sigma}_{xy} = \Sigma_{xy}$.

In the following, we try to bound the error between $\mathbb{E} \Theta^{(n)}$ and $\Theta$ for a slightly modified version of the naive implementation. Specifically, we make the two assumptions below which make the analysis easier.

  • A1: The samples used for estimating $\Sigma_{xx}$, $\Sigma_{yy}$ and $\Sigma_{xy}$ are chosen independently from each other, each by sampling examples (for $\Sigma_{xx}$ and $\Sigma_{yy}$) or example pairs (for $\Sigma_{xy}$) uniformly at random with replacement from the training set. In practice, we could randomly pick $n$ samples of $\mathbf{f}(\mathbf{x})$ for computing $\hat{\Sigma}_{xx}$, another $n$ samples of $\mathbf{g}(\mathbf{y})$ for computing $\hat{\Sigma}_{yy}$, and $n$ pairs of samples of $(\mathbf{f}(\mathbf{x}), \mathbf{g}(\mathbf{y}))$ for computing $\hat{\Sigma}_{xy}$; the computational cost of this modified procedure is twice that of the original naive implementation. This simple modification allows the expectation of matrix products to factorize.
  • A2: Additionally, we assume an upper bound on the magnitude of the neural network outputs, namely $\max(\|\mathbf{f}_i\|^2, \|\mathbf{g}_i\|^2) \leq B$ for all $i$. This holds when nonlinear activations with a bounded range are used; e.g., with logistic sigmoid or hyperbolic tangent activations, the upper bound $B$ can be set to the output dimensionality (because the squared activation of each output unit is bounded by 1). It also holds when the inputs themselves are bounded and the functions $\mathbf{f}(\mathbf{x})$ and $\mathbf{g}(\mathbf{y})$ are Lipschitz.
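A minimal sketch of the modified sampling scheme in A1 (the function name and array sizes are ours): three independent draws with replacement, one per covariance estimate, so that expectations of products factorize.

```python
import numpy as np

def independent_covariance_estimates(F, G, n, rng, r_x=1e-4, r_y=1e-4):
    """Covariance estimates under assumption A1: three independent draws
    (with replacement) are used for Sxx, Syy and Sxy. Illustrative sketch.

    F: (N, d_x) view-1 outputs; G: (N, d_y) view-2 outputs (paired rows).
    """
    N = F.shape[0]
    ix = rng.choice(N, n, replace=True)    # samples for Sigma_xx
    iy = rng.choice(N, n, replace=True)    # samples for Sigma_yy
    ixy = rng.choice(N, n, replace=True)   # paired samples for Sigma_xy
    Sxx = F[ix].T @ F[ix] / n + r_x * np.eye(F.shape[1])
    Syy = G[iy].T @ G[iy] / n + r_y * np.eye(G.shape[1])
    Sxy = F[ixy].T @ G[ixy] / n            # uses the pairing of rows
    return Sxx, Syy, Sxy
```

Drawing the three index sets independently is exactly what costs twice as much as the naive single-minibatch implementation, but it is what makes $\mathbb{E}[\hat{\Sigma}_{xx}^{-1} \mathbf{f}_i \mathbf{g}_i^\top]$ factorize in the analysis below.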

Our analysis is based on the following formulation of the linear CCA solution (Borga, 2001). Notice that the optimal $(\mathbf{U}, \mathbf{V})$ for the CCA objective (1) satisfies

$$\begin{bmatrix} \mathbf{0} & \mathbf{T} \\ \mathbf{T}^\top & \mathbf{0} \end{bmatrix} \begin{bmatrix} \Sigma_{xx}^{1/2} \mathbf{U} \\ \Sigma_{yy}^{1/2} \mathbf{V} \end{bmatrix} = \begin{bmatrix} \Sigma_{xx}^{1/2} \mathbf{U} \\ \Sigma_{yy}^{1/2} \mathbf{V} \end{bmatrix} \Sigma, \quad (14)$$

where $\Sigma$ contains the top $L$ singular values of $\mathbf{T}$ on the diagonal. This is easy to verify as $\Sigma_{xx}^{1/2} \mathbf{U}$ and $\Sigma_{yy}^{1/2} \mathbf{V}$ contain the top left/right singular vectors of $\mathbf{T}$ (see, e.g., Section 2 of Andrew et al., 2013).

Multiplying both sides of (14) by $\begin{bmatrix} \Sigma_{xx}^{-1/2} & \mathbf{0} \\ \mathbf{0} & \Sigma_{yy}^{-1/2} \end{bmatrix}$ gives

$$\begin{bmatrix} \Sigma_{xx}^{-1} & \mathbf{0} \\ \mathbf{0} & \Sigma_{yy}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \Sigma_{xy} \\ \Sigma_{xy}^\top & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{U} \\ \mathbf{V} \end{bmatrix} = \begin{bmatrix} \mathbf{U} \\ \mathbf{V} \end{bmatrix} \Sigma,$$

which implies that the columns of $\begin{bmatrix} \mathbf{U} \\ \mathbf{V} \end{bmatrix}$ correspond to the top $L$ eigenvectors of the matrix $\mathbf{A}^{-1} \mathbf{B}$, where

$$\mathbf{A} = \begin{bmatrix} \Sigma_{xx} & \mathbf{0} \\ \mathbf{0} & \Sigma_{yy} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{0} & \Sigma_{xy} \\ \Sigma_{xy}^\top & \mathbf{0} \end{bmatrix}.$$

(It is straightforward to check that the eigenvalues of $\mathbf{A}^{-1}\mathbf{B}$ are $\pm\sigma_1(\mathbf{T}), \pm\sigma_2(\mathbf{T}), \dots$)
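This eigenvalue characterization is easy to verify numerically. The sketch below (sizes and synthetic data are our arbitrary assumptions) checks that the eigenvalues of $\mathbf{A}^{-1}\mathbf{B}$ are exactly $\pm\sigma_k(\mathbf{T})$ together with zeros.

```python
import numpy as np

rng = np.random.default_rng(1)
dx, dy, n = 4, 3, 200
F = rng.standard_normal((n, dx))
G = F[:, :dy] + 0.3 * rng.standard_normal((n, dy))   # correlated second view

Sxx = F.T @ F / n + 1e-3 * np.eye(dx)
Syy = G.T @ G / n + 1e-3 * np.eye(dy)
Sxy = F.T @ G / n

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
sv = np.linalg.svd(T, compute_uv=False)   # sigma_1 >= sigma_2 >= ... (dy values)

A = np.block([[Sxx, np.zeros((dx, dy))], [np.zeros((dy, dx)), Syy]])
B = np.block([[np.zeros((dx, dx)), Sxy], [Sxy.T, np.zeros((dy, dy))]])
# A^{-1}B is similar to the symmetric A^{-1/2} B A^{-1/2}, so eigenvalues are real.
eig = np.sort(np.linalg.eigvals(np.linalg.inv(A) @ B).real)
```

Here `eig` contains $-\sigma_1, \dots, -\sigma_{d_y}$, then $d_x - d_y$ zeros, then $\sigma_{d_y}, \dots, \sigma_1$.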

When using minibatches, the estimate of the objective is of course computed based on the randomly chosen subset of samples, and is the sum of the top eigenvalues of the matrix $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}$ , where

$$\hat{\mathbf{A}} = \begin{bmatrix} \hat{\Sigma}_{xx} & \mathbf{0} \\ \mathbf{0} & \hat{\Sigma}_{yy} \end{bmatrix}, \quad \hat{\mathbf{B}} = \begin{bmatrix} \mathbf{0} & \hat{\Sigma}_{xy} \\ \hat{\Sigma}_{xy}^\top & \mathbf{0} \end{bmatrix}.$$

We now try to bound the expected error between $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}$ and $\mathbf{A}^{-1}\mathbf{B}$ measured in spectral norm. An important tool in our analysis is the matrix Bernstein inequality below.

Lemma 1 (Matrix Bernstein, Theorem 1.6.2 of Tropp, 2012) Let $\mathbf{A}_1, \dots, \mathbf{A}_n$ be independent random matrices with common dimension $d_1 \times d_2$ . Assume that each matrix has bounded deviation from its mean:

$$\|\mathbf{A}_k - \mathbb{E}\, \mathbf{A}_k\| \leq R \quad \forall k = 1, \dots, n.$$

Form the sum $\mathbf{B} = \sum_{k=1}^n \mathbf{A}_k$ , and introduce a variance parameter

$$\sigma^2 = \max \left\{ \left\| \mathbb{E} \left[ (\mathbf{B} - \mathbb{E} \mathbf{B})(\mathbf{B} - \mathbb{E} \mathbf{B})^\top \right] \right\|, \left\| \mathbb{E} \left[ (\mathbf{B} - \mathbb{E} \mathbf{B})^\top (\mathbf{B} - \mathbb{E} \mathbf{B}) \right] \right\| \right\}.$$

Then

$$\mathbb{E} \|\mathbf{B} - \mathbb{E} \mathbf{B}\| \leq \sqrt{2\sigma^2 \log(d_1 + d_2)} + \frac{1}{3} R \log(d_1 + d_2).$$
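As a quick sanity check of the lemma (our own illustration, not part of the proof), consider random sign flips of a fixed matrix $\mathbf{M}$ with spectral norm 1; a Monte Carlo estimate of $\mathbb{E}\|\mathbf{B} - \mathbb{E}\mathbf{B}\|$ indeed stays below the stated bound.

```python
import numpy as np

# A_k = eps_k * M with Rademacher signs eps_k, so E[A_k] = 0, R = ||M|| = 1,
# and sigma^2 = || sum_k E[A_k A_k^T] || = n * ||M M^T|| = n.
rng = np.random.default_rng(0)
d1, d2, n = 5, 3, 100
M = rng.standard_normal((d1, d2))
M /= np.linalg.norm(M, 2)                  # spectral norm ||M|| = 1
R = 1.0
sigma2 = n * max(np.linalg.norm(M @ M.T, 2), np.linalg.norm(M.T @ M, 2))
bound = np.sqrt(2 * sigma2 * np.log(d1 + d2)) + R * np.log(d1 + d2) / 3

# Monte Carlo estimate of E||B - EB|| = E|eps_1 + ... + eps_n| * ||M||.
trials = [abs(rng.choice([-1.0, 1.0], n).sum()) for _ in range(2000)]
mc_norm = np.mean(trials)
```

In this rank-one example the expected norm is roughly $\sqrt{2n/\pi}$, well within the bound, which scales as $\sqrt{2n\log(d_1+d_2)}$.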

The following result and its proof are similar to those of Lopez-Paz et al. (2014, Theorem 4).

Theorem 1 Assume that A1 and A2 hold for our stochastic estimate of the true DCCA objective, and assume that the eigenvalues of the auto-covariance matrices $\hat{\Sigma}_{xx}$ and $\hat{\Sigma}_{yy}$ are lower bounded by $\gamma_x > 0$ and $\gamma_y > 0$ respectively, i.e.,

$$\hat{\Sigma}_{xx} \succeq \gamma_x \mathbf{I}, \quad \hat{\Sigma}_{yy} \succeq \gamma_y \mathbf{I}.$$

Then we have the following bound for the expected error:

$$\mathbb{E} \left\| \hat{\mathbf{A}}^{-1}\hat{\mathbf{B}} - \mathbf{A}^{-1}\mathbf{B} \right\| \leq \max \{e_1, e_2\}, \quad (15)$$

where the expectation is taken over random selection of training samples as described in A1, and

$$e_1 = \frac{B}{\gamma_x^2} \left( \sqrt{\frac{2B^2 \log(2d_x)}{n}} + \frac{2B \log(2d_x)}{3n} \right) + \frac{1}{\gamma_x} \left( \sqrt{\frac{2B^2 \log(d_x + d_y)}{n}} + \frac{2B \log(d_x + d_y)}{3n} \right),$$

$$e_2 = \frac{B}{\gamma_y^2} \left( \sqrt{\frac{2B^2 \log(2d_y)}{n}} + \frac{2B \log(2d_y)}{3n} \right) + \frac{1}{\gamma_y} \left( \sqrt{\frac{2B^2 \log(d_x + d_y)}{n}} + \frac{2B \log(d_x + d_y)}{3n} \right).$$

Proof Since $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}} - \mathbf{A}^{-1}\mathbf{B}$ is block anti-diagonal, its spectral norm is bounded by the maximum of the spectral norms of the individual blocks:

$$\mathbb{E} \left\| \hat{\mathbf{A}}^{-1}\hat{\mathbf{B}} - \mathbf{A}^{-1}\mathbf{B} \right\| \leq \max \left\{ \mathbb{E} \left\| \hat{\Sigma}_{xx}^{-1} \hat{\Sigma}_{xy} - \Sigma_{xx}^{-1} \Sigma_{xy} \right\|, \mathbb{E} \left\| \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{xy}^\top - \Sigma_{yy}^{-1} \Sigma_{xy}^\top \right\| \right\}.$$

We now focus on analyzing the first term in the above maximum as the other term follows analogously. Define the individual error terms

$$\mathbf{E}_i = \frac{1}{n} \left( \hat{\Sigma}_{xx}^{-1} \mathbf{f}_i \mathbf{g}_i^\top - \Sigma_{xx}^{-1} \Sigma_{xy} \right), \quad \mathbf{E} = \sum_{i=1}^n \mathbf{E}_i.$$

Thus our goal is to bound $\left\| \hat{\Sigma}_{xx}^{-1} \hat{\Sigma}_{xy} - \Sigma_{xx}^{-1} \Sigma_{xy} \right\| = \|\mathbf{E}\|$.

Notice that the matrices $\Sigma_{xx}$ and $\Sigma_{xy}$ are not random. Due to assumption A1, the samples used for estimating $\hat{\Sigma}_{xx}$ and $\hat{\Sigma}_{xy}$ are selected independently, so the expectation of $\hat{\Sigma}_{xx}^{-1} \mathbf{f}_i \mathbf{g}_i^\top$ factorizes. Therefore we have

$$\mathbb{E}\, \mathbf{E}_i = \frac{1}{n} \left( \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \Sigma_{xy} - \Sigma_{xx}^{-1} \Sigma_{xy} \right),$$

and the deviation of the individual error matrices from their expectation is

$$\mathbf{Z}_i := \mathbf{E}_i - \mathbb{E}\, \mathbf{E}_i = \frac{1}{n} \left( \hat{\Sigma}_{xx}^{-1} \mathbf{f}_i \mathbf{g}_i^\top - \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \Sigma_{xy} \right).$$

We would like to bound $\left\| \sum_{i=1}^n \mathbf{Z}_i \right\|$ and $\|\mathbb{E}\, \mathbf{E}_i\|$ separately using the matrix Bernstein inequality of Lemma 1, and then combine them to bound $e_1$.

Bounding $\left\| \sum_{i=1}^n \mathbf{Z}_i \right\|$. Notice that $\mathbb{E} [\mathbf{Z}_i] = \mathbf{0}$, and we can bound its norm (denoted $R$):

$$\begin{aligned} R := \|\mathbf{Z}_i\| &\leq \frac{1}{n} \left( \left\| \hat{\Sigma}_{xx}^{-1} \mathbf{f}_i \mathbf{g}_i^\top \right\| + \left\| \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \Sigma_{xy} \right\| \right) \\ &\leq \frac{1}{n} \left( \left\| \hat{\Sigma}_{xx}^{-1} \right\| \left\| \mathbf{f}_i \mathbf{g}_i^\top \right\| + \mathbb{E} \left[ \left\| \hat{\Sigma}_{xx}^{-1} \right\| \right] \|\Sigma_{xy}\| \right) \\ &\leq \frac{1}{n} \left( \frac{1}{\gamma_x} \left\| \mathbf{f}_i \mathbf{g}_i^\top \right\| + \frac{1}{\gamma_x} \max_i \left\| \mathbf{f}_i \mathbf{g}_i^\top \right\| \right) \\ &\leq \frac{2}{n\gamma_x} \max_i \left\| \mathbf{f}_i \mathbf{g}_i^\top \right\| \leq \frac{2}{n\gamma_x} \max_i \|\mathbf{f}_i\| \|\mathbf{g}_i\| \leq \frac{2B}{n\gamma_x}, \end{aligned}$$

where we have used the triangle inequality in the first inequality, and Jensen's inequality for the second and third inequalities (since norms are convex functions). To apply the matrix Bernstein inequality, we still need to bound the variance, which is defined as

$$\begin{aligned} \sigma^2 &:= \max \left\{ \left\| \mathbb{E} \left[ \left( \sum_{i=1}^n \mathbf{Z}_i \right) \left( \sum_{i=1}^n \mathbf{Z}_i \right)^\top \right] \right\|, \left\| \mathbb{E} \left[ \left( \sum_{i=1}^n \mathbf{Z}_i \right)^\top \left( \sum_{i=1}^n \mathbf{Z}_i \right) \right] \right\| \right\} \\ &= \max \left\{ \left\| \sum_{i=1}^n \mathbb{E} [\mathbf{Z}_i \mathbf{Z}_i^\top] \right\|, \left\| \sum_{i=1}^n \mathbb{E} [\mathbf{Z}_i^\top \mathbf{Z}_i] \right\| \right\}, \end{aligned}$$

where we have used the fact that the $\mathbf{Z}_i$'s are independent and have mean zero in the second equality.

Let us consider an individual term in the summand of the second term:

$$\begin{aligned} \mathbf{Z}_i^\top \mathbf{Z}_i &= \frac{1}{n^2} \left( \mathbf{g}_i \mathbf{f}_i^\top \hat{\Sigma}_{xx}^{-2} \mathbf{f}_i \mathbf{g}_i^\top - \mathbf{g}_i \mathbf{f}_i^\top \hat{\Sigma}_{xx}^{-1} \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \Sigma_{xy} \right. \\ &\quad \left. - \Sigma_{xy}^\top \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \hat{\Sigma}_{xx}^{-1} \mathbf{f}_i \mathbf{g}_i^\top + \Sigma_{xy}^\top \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \Sigma_{xy} \right). \end{aligned}$$

Taking expectations we see that

$$\mathbb{E} \left[ \mathbf{Z}_i^\top \mathbf{Z}_i \right] = \frac{1}{n^2} \left( \mathbb{E} \left[ \mathbf{g}_i \mathbf{f}_i^\top \hat{\Sigma}_{xx}^{-2} \mathbf{f}_i \mathbf{g}_i^\top \right] - \Sigma_{xy}^\top \left( \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \right)^2 \Sigma_{xy} \right),$$

where we have again used assumption A1. Notice that $\Sigma_{xy}^\top \left( \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \right)^2 \Sigma_{xy}$ is positive semi-definite, so subtracting it only decreases the spectral norm. We now take norms on both sides of the above equality and obtain

$$\begin{aligned} \left\| \mathbb{E} \left[ \mathbf{Z}_i^\top \mathbf{Z}_i \right] \right\| &\leq \frac{1}{n^2} \left\| \mathbb{E} \left[ \mathbf{g}_i \mathbf{f}_i^\top \hat{\Sigma}_{xx}^{-2} \mathbf{f}_i \mathbf{g}_i^\top \right] \right\| \leq \frac{1}{n^2} \mathbb{E} \left[ \left\| \mathbf{g}_i \mathbf{f}_i^\top \hat{\Sigma}_{xx}^{-2} \mathbf{f}_i \mathbf{g}_i^\top \right\| \right] \\ &\leq \frac{1}{n^2} \mathbb{E} \left[ \left\| \mathbf{g}_i \mathbf{f}_i^\top \right\| \left\| \hat{\Sigma}_{xx}^{-2} \right\| \left\| \mathbf{f}_i \mathbf{g}_i^\top \right\| \right] \leq \frac{B^2}{n^2 \gamma_x^2}, \end{aligned}$$

where Jensen's inequality is used in the second inequality. A similar argument shows that

$$\left\| \mathbb{E} \left[ \mathbf{Z}_i \mathbf{Z}_i^\top \right] \right\| \leq \frac{1}{n^2} \left\| \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} (\mathbf{f}_i \mathbf{g}_i^\top \mathbf{g}_i \mathbf{f}_i^\top) \hat{\Sigma}_{xx}^{-1} \right] \right\| \leq \frac{B^2}{n^2 \gamma_x^2}.$$

An invocation of the triangle inequality on the definition of $\sigma^2$ along with the above two bounds gives

$$\sigma^2 \leq \frac{B^2}{n \gamma_x^2}.$$

We may now appeal to the matrix Bernstein inequality on the $d_x \times d_y$ matrices $\{\mathbf{Z}_i\}_{i=1}^n$ to obtain the bound

$$\begin{aligned} \mathbb{E} \left\| \sum_{i=1}^n \mathbf{Z}_i \right\| &\leq \sqrt{2\sigma^2 \log(d_x + d_y)} + \frac{1}{3} R \log(d_x + d_y) \\ &\leq \frac{1}{\gamma_x} \left( \sqrt{\frac{2B^2 \log(d_x + d_y)}{n}} + \frac{2B \log(d_x + d_y)}{3n} \right). \end{aligned}$$

Bounding $\|\mathbb{E}\, \mathbf{E}_i\|$. Notice that

$$\hat{\Sigma}_{xx}^{-1} - \Sigma_{xx}^{-1} = \Sigma_{xx}^{-1} (\Sigma_{xx} - \hat{\Sigma}_{xx}) \hat{\Sigma}_{xx}^{-1},$$

so that

$$\begin{aligned} \|\mathbb{E}\, \mathbf{E}_i\| &= \frac{1}{n} \left\| \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} \right] \Sigma_{xy} - \Sigma_{xx}^{-1} \Sigma_{xy} \right\| \leq \frac{1}{n} \left\| \mathbb{E} \left[ \hat{\Sigma}_{xx}^{-1} - \Sigma_{xx}^{-1} \right] \right\| \|\Sigma_{xy}\| \\ &\leq \frac{1}{n} \mathbb{E} \left[ \left\| \hat{\Sigma}_{xx}^{-1} - \Sigma_{xx}^{-1} \right\| \right] \|\Sigma_{xy}\| \leq \frac{B}{n\gamma_x^2} \mathbb{E} \left\| \hat{\Sigma}_{xx} - \Sigma_{xx} \right\|. \end{aligned}$$

Using the same matrix Bernstein technique (now applied to $\sum_{i=1}^n \mathbf{Z}'_i$ with $\mathbf{Z}'_i = \frac{1}{n} (\mathbf{f}_i \mathbf{f}_i^\top - \Sigma_{xx})$), we have

$$\mathbb{E} \left\| \hat{\Sigma}_{xx} - \Sigma_{xx} \right\| \leq \sqrt{\frac{2B^2 \log(2d_x)}{n}} + \frac{2B \log(2d_x)}{3n}.$$

We are now ready to bound the quantity of interest as

$$\begin{aligned} e_1 &= \mathbb{E} \left\| \hat{\Sigma}_{xx}^{-1} \hat{\Sigma}_{xy} - \Sigma_{xx}^{-1} \Sigma_{xy} \right\| = \mathbb{E} \left\| \sum_{i=1}^n \mathbf{E}_i \right\| = \mathbb{E} \left\| \sum_{i=1}^n \mathbb{E}\, \mathbf{E}_i + \sum_{i=1}^n \mathbf{Z}_i \right\| \\ &\leq \left\| \sum_{i=1}^n \mathbb{E}\, \mathbf{E}_i \right\| + \mathbb{E} \left\| \sum_{i=1}^n \mathbf{Z}_i \right\| \\ &\leq \frac{B}{\gamma_x^2} \left( \sqrt{\frac{2B^2 \log(2d_x)}{n}} + \frac{2B \log(2d_x)}{3n} \right) + \frac{1}{\gamma_x} \left( \sqrt{\frac{2B^2 \log(d_x + d_y)}{n}} + \frac{2B \log(d_x + d_y)}{3n} \right). \end{aligned}$$

The bound for $e_2 = \mathbb{E} \left\| \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{xy}^\top - \Sigma_{yy}^{-1} \Sigma_{xy}^\top \right\|$ is completely analogous. ■

Assuming that $d_x = d_y = d$ in our algorithm, and letting $\gamma = \min(\gamma_x, \gamma_y)$, the error in spectral norm satisfies

$$\mathbb{E} \left\| \hat{\mathbf{A}}^{-1} \hat{\mathbf{B}} - \mathbf{A}^{-1} \mathbf{B} \right\| \leq \left( \frac{B^2}{\gamma^2} + \frac{B}{\gamma} \right) \left( \sqrt{\frac{2 \log(2d)}{n}} + \frac{2 \log(2d)}{3n} \right).$$
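The decay of the estimation error with the minibatch size $n$ is easy to observe empirically. The sketch below (our own illustration; the data, sizes, regularizer $r$, and function names are assumptions) measures the mean spectral-norm error of the minibatch estimate $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}$ against the full-data $\mathbf{A}^{-1}\mathbf{B}$ for two minibatch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 5000, 4, 0.1
Z = rng.standard_normal((N, d))
# tanh keeps the outputs bounded, consistent with assumption A2.
F = np.tanh(Z + 0.3 * rng.standard_normal((N, d)))
G = np.tanh(Z + 0.3 * rng.standard_normal((N, d)))

def AinvB(F, G):
    """Build A^{-1} B from the (regularized) covariance estimates."""
    m = F.shape[0]
    Sxx = F.T @ F / m + r * np.eye(d)
    Syy = G.T @ G / m + r * np.eye(d)
    Sxy = F.T @ G / m
    A = np.block([[Sxx, np.zeros((d, d))], [np.zeros((d, d)), Syy]])
    B = np.block([[np.zeros((d, d)), Sxy], [Sxy.T, np.zeros((d, d))]])
    return np.linalg.inv(A) @ B

full = AinvB(F, G)   # computed on all N samples

def mean_err(n, trials=200):
    """Average spectral-norm error over random minibatches of size n."""
    errs = []
    for _ in range(trials):
        idx = rng.choice(N, n, replace=True)
        errs.append(np.linalg.norm(AinvB(F[idx], G[idx]) - full, 2))
    return np.mean(errs)

err_small, err_large = mean_err(25), mean_err(400)
```

Consistent with the $O(1/\sqrt{n})$ leading term of the bound, the larger minibatch yields a noticeably smaller average error.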

As expected, the error decreases with larger minibatch size $n$. We also observe that the error decreases as $(\gamma_x, \gamma_y)$ increase. Notice that from the definitions of $\hat{\Sigma}_{xx}$ and $\hat{\Sigma}_{yy}$ we are guaranteed that $\gamma_x \geq r_x$ and $\gamma_y \geq r_y$. This means we can increase the regularization constants to improve the estimation error, which is to be expected as larger $(r_x, r_y)$ render the auto-covariance matrices less relevant for estimating $\mathbf{T}$. In practice, as long as one uses a minibatch size $n > d$, the auto-covariance matrices are typically non-singular and $(r_x, r_y)$ are underestimates of $(\gamma_x, \gamma_y)$.

Acknowledgments

The authors would like to thank Louis Goldstein for providing phonetic alignments for the data used in the recognition experiment; Manaal Faruqui, Chris Dyer, Ang Lu, Mohit Bansal, and Kevin Gimpel for sharing resources for the multi-lingual embedding experiments; Nati Srebro for input on the stochastic optimization of DCCA; and Geoff Hinton for helpful discussion on the CCA objective.

References

Shotaro Akaho. A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society (IMPS2001). Springer-Verlag, 2001.

Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In Proc. of the 30th Int. Conf. Machine Learning (ICML 2013), pages 1247–1255, 2013.

Raman Arora and Karen Livescu. Kernel CCA for multi-view learning of acoustic features using articulatory measurements. In Symposium on Machine Learning in Speech and Language Processing (MLSLP), 2012.

Raman Arora and Karen Livescu. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In Proc. of the IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'13), 2013.

Raman Arora and Karen Livescu. Multi-view learning with supervision for transformed bottleneck features. In Proc. of the IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'14), 2014.

Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Francis R. Bach and Michael I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Dept. of Statistics, University of California, Berkeley, 2005.

Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6:937–965, 2005.

Suzanna Becker and Geoffrey E. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161–163, 1992.

Steffen Bickel and Tobias Scheffer. Multi-view clustering. In Proc. of the 4th IEEE Int. Conf. Data Mining (ICDM'04), pages 19–26, 2004.

Tijl De Bie and Bart De Moor. On the regularization of canonical correlation analysis. Int. Sympos. ICA and BSS, 2003.

William Blacoe and Mirella Lapata. A comparison of vector-based representations for semantic composition. In Conference on Empirical Methods in Natural Language Processing, pages 546–556, 2012.

Mathew B. Blaschko and Christoph H. Lampert. Correlational spectral clustering. In Proc. of the 2008 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR'08), pages 1–8, 2008.

Magnus Borga. Canonical correlation: A tutorial. Available at http://www.imt.liu.se/magnus/cca/tutorial/tutorial.pdf, 2001.

Leon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), volume 20, pages 161–168, 2008.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M. Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems (NIPS), pages 1853–1861, 2014.

Sarath Chandar, Mitesh M. Khapra, Hugo Larochelle, and Balaraman Ravindran. Correlational neural networks. arXiv:1504.07225 [cs.CL], April 27 2015.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems and Technology, 2(3):27, 2011.

Kamalika Chaudhuri, Sham M. Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In Proc. of the 26th Int. Conf. Machine Learning (ICML'09), pages 129–136, 2009.

Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:165–188, 2005.

Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proc. of the 24th Int. Conf. Machine Learning (ICML'07), pages 209–216, 2007.

Paramveer Dhillon, Dean Foster, and Lyle Ungar. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems (NIPS), volume 24, pages 199–207, 2011.

Jeffrey L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In Proc. EACL, 2014.

Dean P. Foster, Rie Johnson, Sham M. Kakade, and Tong Zhang. Multi-view dimensionality reduction via canonical correlation analysis. Technical report, 2009.

A. Globerson and S. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems (NIPS), volume 18, pages 451–458. MIT Press, Cambridge, MA, 2006.

Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265–2295, 2007.
