In our new paper, we tackle the problem of density estimation using an exponential family distribution with rich structure. The authors are me*, Dougal Sutherland*, Heiko Strathmann, and Arthur Gretton.

## TL;DR

1. We extend the kernel exponential family of distributions using a deep kernel (a kernel on top of network features);
2. The deep kernel is trained by a two-stage, cross-validation-like training procedure;
3. The trained kernel is location-variant and adapts to the local geometry of the data;
4. We find that DKEF is able to capture the shape of distributions well.

## Introduction

The goal is to fit a (non-)parametric distribution to observed data ${x}$ drawn from an unknown distribution $p_0(x)$. Recall that an exponential family distribution can be written in the following form:

$$p_\theta(x) = \frac{q_0(x)}{Z(\theta)} \exp\big(\theta \cdot T(x)\big),$$

where $T(x)$ is the sufficient statistic, $\theta$ are the natural parameters, $q_0$ is a base measure, and $Z(\theta)$ is the normalizer.

Many well-known distributions (Gaussian, gamma, etc.) can be written in this form, but we want a much more general parameterization. In our work, we take $\theta \cdot T(x) = f(x) = \langle f, k(x,\cdot) \rangle_{\mathcal{H}}$, where $f$ is a function in the reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k$, giving the kernel exponential family (Sriperumbudur et al. 2017). In short, we want to fit the term inside the $\exp$ using a flexible function.

Learning such a flexible distribution using maximum likelihood (minimizing the KL divergence) is challenging in general, as one would need to compute the normalizer $Z$. An alternative objective is the score matching loss (Hyvärinen 2005), which differs from the Fisher divergence by a constant that depends only on the data distribution (not on $\theta$). Thus, by minimizing the score matching loss, we are minimizing the Fisher divergence between $p_\theta$ and $p_0$. The sample version of the score matching loss is

$$\label{eq:score} J(\theta) = \sum_i\sum_d\left[\partial_d^2\log p_\theta(x_i) + \frac{1}{2}\big(\partial_d\log p_\theta(x_i)\big)^2\right]$$ The advantage of using score matching is that we can ignore the normalizer when optimizing the natural parameters: $\log Z(\theta)$ is constant in $x$, so it drops out of the derivatives above.
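As a toy illustration (my example, not from the paper): for a zero-mean Gaussian model $p_\sigma(x) \propto \exp(-x^2/(2\sigma^2))$, both derivatives in $J$ have closed forms, and minimizing $J$ over $\sigma^2$ recovers the sample variance $\frac{1}{n}\sum_i x_i^2$ without ever touching the normalizer:

```python
import numpy as np

def score_matching_loss(x, sigma2):
    """Score-matching loss J for a zero-mean Gaussian model.

    For p_sigma(x) ∝ exp(-x^2 / (2 sigma^2)):
      d/dx     log p = -x / sigma^2
      d^2/dx^2 log p = -1 / sigma^2
    Note the normalizer never appears.
    """
    second = -1.0 / sigma2              # second derivative, same for every x_i
    first = -x / sigma2                 # first derivative at each x_i
    return np.sum(second + 0.5 * first**2)

rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=10_000)  # data drawn from N(0, 4)

# Minimize J over a grid of candidate variances.
grid = np.linspace(0.5, 10.0, 2_000)
losses = [score_matching_loss(x, s2) for s2 in grid]
s2_hat = grid[int(np.argmin(losses))]

# The minimizer matches the closed-form answer mean(x^2).
print(s2_hat, np.mean(x**2))
```

Setting $\partial J/\partial(1/\sigma^2) = 0$ gives $\hat\sigma^2 = \frac{1}{n}\sum_i x_i^2$ analytically, which is what the grid search finds.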

In order to use this approach to fit complex distributions, the kernel needs to adapt to the shape of the data in different regions. We show in Figure 1 of our paper that if the kernel has only one length-scale, even a simple mixture of a narrow and a broad Gaussian can be challenging, whereas a location-variant kernel is able to capture both scales and gives a much better fit.

## Deep kernel exponential family (DKEF)

In our work, we construct a flexible kernel by composing a Gaussian kernel with a deep neural network $\phi_w$:

$$k_w(x, y) = \exp\left(-\frac{1}{2\sigma^2}\,\|\phi_w(x) - \phi_w(y)\|^2\right),$$

or a mixture of such deep kernels. We model the energy function as a weighted sum of kernels centered at a set of inducing points $z$:

$$f(x) = \sum_i \alpha_i\, k_w(x, z_i).$$

The network $\phi_w(x)$ is fully connected with softplus nonlinearities. Using softplus ensures that

• the energy is twice differentiable, so the score-matching loss is well defined;
• when combined with a Gaussian $q_0$, the resulting distribution is always normalizable (Proposition 2 in our paper).
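A minimal numpy sketch of this construction (with illustrative layer sizes and random, untrained parameters, not the paper's architecture or trained values): softplus features, a Gaussian kernel on those features, and an energy that is a weighted sum over inducing points.

```python
import numpy as np

def softplus(a):
    # Numerically stable softplus; smooth and twice differentiable, unlike ReLU.
    return np.logaddexp(0.0, a)

def phi(x, W1, b1, W2, b2):
    """Toy fully connected feature extractor with softplus nonlinearities."""
    h = softplus(x @ W1 + b1)
    return softplus(h @ W2 + b2)

def deep_kernel(x, y, params, sigma):
    """Gaussian kernel on network features: exp(-||phi(x) - phi(y)||^2 / (2 sigma^2))."""
    fx, fy = phi(x, *params), phi(y, *params)
    sq = np.sum((fx[:, None, :] - fy[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def energy(x, z, alpha, params, sigma):
    """f(x) = sum_i alpha_i k(x, z_i): kernels centered at inducing points z."""
    return deep_kernel(x, z, params, sigma) @ alpha

rng = np.random.default_rng(0)
d, h, m = 2, 16, 5                       # input dim, hidden width, num inducing points
params = (rng.normal(size=(d, h)), np.zeros(h),
          rng.normal(size=(h, h)), np.zeros(h))
z = rng.normal(size=(m, d))              # inducing points
alpha = rng.normal(size=m)               # linear coefficients

x = rng.normal(size=(3, d))
f = energy(x, z, alpha, params, sigma=1.0)
print(f.shape)                           # one scalar energy per input point
```

Because the features pass through the network, the effective length-scale of the kernel varies with location in input space, which is what lets the model adapt to local geometry.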

## Two-stage training (Section 3)

So far, the free parameters are the network weights $w$, the coefficients $\alpha$, the bandwidth $\sigma$, and the inducing points $z$. Naive gradient descent on the score objective will always overfit to the training data, as the score can be made arbitrarily good by moving $z$ towards a data point and letting $\sigma$ go to zero. Stochastic gradient descent might help, but would produce very unstable updates.

To train the model, we employ a two-stage training procedure, motivated by cross-validation, which helps avoid overfitting and also improves the quality of the fit. Consider a simpler scenario where we use a plain Gaussian kernel with bandwidth $\sigma$ as our $k$ (no deep network) to minimize some loss (regression, classification, score matching, etc.). To find the optimal $\sigma$ (and any additional regularizers $\lambda$), one would first split the data into two sets $\mathcal{D}_1$ and $\mathcal{D}_2$, use the closed-form solution of the corresponding loss to find the linear coefficients $\alpha$ on the training set $\mathcal{D}_1$, and then evaluate those $\alpha$ on the validation set $\mathcal{D}_2$. The optimal parameters are then found by choosing the $\sigma$ that yields the minimum validation loss.
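Here is that simpler scenario made concrete, using kernel ridge regression as the loss (an illustrative analogue of the recipe, not the paper's score-matching solver): stage one computes the closed-form $\alpha$ on $\mathcal{D}_1$, stage two scores each $(\sigma, \lambda)$ pair on $\mathcal{D}_2$.

```python
import numpy as np

def gauss_kernel(X, Y, sigma):
    sq = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def fit_alpha(X1, y1, sigma, lam):
    """Stage 1: closed-form ridge solution alpha = (K + lam I)^{-1} y on D1."""
    K = gauss_kernel(X1, X1, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X1)), y1)

def val_loss(X1, alpha, X2, y2, sigma):
    """Stage 2: mean squared error of the fitted predictor on held-out D2."""
    pred = gauss_kernel(X2, X1, sigma) @ alpha
    return np.mean((pred - y2)**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
X1, y1, X2, y2 = X[:100], y[:100], X[100:], y[100:]   # D1 (train), D2 (validation)

# Choose (sigma, lambda) by minimizing the validation loss.
best = min(
    ((sigma, lam) for sigma in [0.1, 0.5, 1.0, 2.0] for lam in [1e-3, 1e-1, 1.0]),
    key=lambda p: val_loss(X1, fit_alpha(X1, y1, *p), X2, y2, p[0]),
)
print(best)
```

In DKEF the same split plays out with the score-matching loss: the closed-form stage fixes $\alpha$, and the gradient steps on the validation loss update $w$, $\sigma$, and $z$, so no parameter can cheat by collapsing onto the training points.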

### Failures on distributions with disjoint supports (Section 3.1)

Normalizing flows show an interesting type of failure with disjoint supports. On a mixture of Gaussians (MoG), there is usually a “bridge” connecting the two components, and on a mixture of rings (MoR), the problem is even more severe. Also note that even a single ring seems to be difficult for the normalizing flows we experimented with.

On the other hand, we find that score matching fails to fit the correct mixing proportions when the data are supported on (almost) disjoint regions; Figure 2 of our paper shows this failure mode on MoG and MoR. One solution is to first discover the disjoint subsets of the data by clustering and then fit a mixture model. Since the problem only occurs with disjoint supports, clustering algorithms should have little difficulty finding the different components. We illustrate the effectiveness of this workaround in the rightmost column of Figure 2 in our paper.
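A sketch of that workaround (a generic clustering step on synthetic blobs; my illustration, not the paper's code): cluster first, read the mixing proportions off the cluster sizes, then fit each component separately.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means with farthest-point init, enough for well separated supports."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])   # next center: farthest from chosen ones
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmin(np.sum((X[:, None] - centers[None])**2, axis=-1), axis=1)
        centers = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Two well separated blobs with unequal mixing proportions (70% / 30%).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-5, 0.5, size=(700, 2)),
                    rng.normal(+5, 0.5, size=(300, 2))])

labels = kmeans(X, k=2)
# Mixing proportions come from cluster sizes, not from score matching,
# so the disjoint-support failure mode never arises.
props = np.bincount(labels) / len(X)
print(np.sort(props))                     # close to [0.3, 0.7]
# Each component would then be fit by its own (D)KEF model on X[labels == j].
```

The density model only ever sees one connected component at a time, which is exactly the regime where score matching behaves well.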

• Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. In: JMLR 6, pp. 695–709.
• Strathmann, H., D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton (2015). Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In: NIPS. arXiv: 1506.02564.
• Sriperumbudur, B., K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar (2017). Density estimation in infinite dimensional exponential families. In: JMLR 18.1, pp. 1830–1888. arXiv: 1312.3516
• Sutherland, D. J., H. Strathmann, M. Arbel, and A. Gretton (2018). Efficient and principled score estimation with Nyström kernel exponential families. In: AISTATS. arXiv: 1705.08360
• Arbel, M. and A. Gretton (2018). Kernel Conditional Exponential Family. In: AISTATS. arXiv: 1711.05363