# Learning deep kernels for exponential family distributions

In our new paper, we tackle the problem of density estimation using an exponential family distribution with rich structure. The authors are me*, Dougal Sutherland*, Heiko Strathmann and Arthur Gretton

## TL;DR

- We extended the kernel exponential family distributions using a deep kernel (kernel on top of network features);
- The deep kernel is trained by a cross-valiation like two-stage training procedure;
- The trained kernel is location-variant and adapts to local geometry of data;
- We find that DKEF is able to capture the
*shape*of distributions well.

## Introduction

The goal is to fit a (non-)parametric distribution to observed data ${x}$ drawn from unknown distribution $p_0(x)$. We know that an exponentialy family distribution can be written in the following form.

Many well known distributions (Gaussian, Gamma, etc.) can usually be written in this form, but we want to using a much more general parameterization. In our work, we take $\theta \cdot T(x) = f(x) = \langle f, k(x,\cdot) \rangle_{\mathcal{H}}$ where $f$ is a function in the reproducing kernel Hubert space $\mathcal{H}$ with kernel $k$, giving kernel exponential family (Sriperumbudur et al. 2017). In short, we want to fit term inside the $\exp$ using a flexible function.

Learning such a flexible distribution in general using maximum likelihood (minimizing K-L divergence) is challenging as one would need to compute the normalizer $Z$. However, an alternative objective other is the score matching (Hyvärinen 2005) loss, which differs from the Fisher divergence by a constant that depends only on the data distribution (not $\theta$). Thus, by minimizing the score matching loss, we are minimizing the Fisher divergence between $p_\theta$ and $p_0$. The sample-version score matching loss is

\begin{equation} \label{eq:score} J(\theta) = \sum_i\sum_d\partial_d^2\log p_\theta(x_i) + \frac{1}{2}(\partial_d\log p_\theta(x_i))^2 \end{equation} The advantage for using score matching is that we can ignore the normalizer when optimizing the natural parameters.

In order to use this approach to fit complex distributions, the kernel needs to adapt to the shapes at different parts of data location. We show in Figure 1 of our paper that if the kernel has only one length-scale, even a simple mixture of a narrow and a broad Gaussians can be challenging, whereas a location-variant kernel is able to capture the two scales and gives a much better fit.

## Deep kernel exponential family (DKEF)

In our work, we construct a flexible kernel using a deep neural network $\phi_w$ on top a Gaussian kernel

or a mixture of such deep kernels above. We model the energy function as weighted sum of kernels centered at a set of inducing points $z$

The network $\phi(x)$ is fully connected with softplus nonlinearity. Using softplus ensures

- the energy is twice-differentiable and the score-matching loss is well defined;
- when combined with a Gaussian $q_0$, the resulting distribution is
*always*normalizable (Proposition 2 in our paper.

## Two-stage training (Section 3)

So far, the free parameters are the weights in the network $w$, $\alpha$, $\sigma$ and inducing points $z$. Naive gradient descent on the score objective will always overfit to the training data as the score can be made arbitrarily good by moving $z$ towards a data point and making $\sigma$ go to zeros. Stochastic gradient descient might help, but would produce very unstable updates.

To train the model, we employ a two-stage training procedure which helps avoid overfitting and also enhances the quality of the fit, motivated by cross-validation. Consider a simpler scenario where we just use a simple Gaussian kernel with bandwidth $\sigma$ as our $k$ (so no deep networks) to minimize some loss (regression, classification, score maching, etc.). To find the optimal $\sigma$ (and any additional regularizers $\lambda$), one would first split the data into two sets $\mathcal{D}_1$ and $\mathcal{D}_2$, use the closed-form solution of the corresponding loss to find the linear coefficients $\alpha$ using a set of (training) data $\mathcal{D}_1$, and then test the performance of the $\alpha$ on another set of (validation) data $\mathcal{D}_2$. The optimal parameter is then found by choosing the $\sigma$ that yeilds the minimum validation loss.

In the above procedure, one can think of $\alpha$ as not being a paramter, but rather a function $\alpha(\sigma, \mathcal{D}_1)$ that is inserted into the model definition to compute the validation loss on $\mathcal{D}_2$. What we really care is then just the validation loss and we’d like to minimize it w.r.t. $\sigma$, which can be done using gradient descent facilitated by automatic differentation through the closed-form solution of $\alpha$. Extending this approach to our model, we optimize not just $\sigma$ but also $w$ and $z$ (and regularizers that we avoid here$) using the split-training procedure.

It is worth noting that this two-stage training only possible given that we have closed-form solution of $\alpha$.

## Experiments and evaluation

We first fit our model on a variety of synthetic distributions and compared with likelihood based normalizing-flows.
Since we are using a different loss (score matching loss) than the others,
different performances may be expected for different criterion, we thus compared different models using
log likelihood and Fisher divergence. The results are in Figure 2 of the paper. In general, our model is able to fit the general *shape* of the distributions much better than likelihood based methods.

On real datasets, we compared models using finite-set Stein discrepency (Jitkrittum 2018), which compares the fit by comparing the gradient of the models evaluated at a set of random points. We find that, while our model performs favorably on the gradient-based FSSD measure, the normalizing-flows are better in likelihoods, especially for larger datasets (Figure 3 in paper).

## Other contributions

### Log-likelihood bias bounds (Section 3.2, Proposition 1)

To evaluate the log-likelihoods on our model, we need to compute the (log) normalizing constant $Z(\theta)$. Due to Jensen inequality, the estiamted normalizer is biased downwards, and thus our estimate of the log likelihood is overestimated. We are able to find a bound on this bias, and we see that on most datasets, this bound is relatively small.

### Failures on distributions with disjoint supports (Section 3.1)

The normalizing flows shows an interesting type of failure with disjoint supports. On MoG, there is usually a “bridge” connecting the two components, and on MoR, the problems is even more severe. Also note that a single ring seems to be difficult for the normalizing flows we experimented on.

On the other hand, we find that score matching fails to fit the correct mixing proportions if
the data are supported on (almost) disjoint regions. If you saw Figure 2 of the our paper for
MoG and MoR, you will notice this failure mode. One solution
would be to first discover these disjoint subsets of the data by clustering and fit a mixture model.
Since the problem only occurs on *disjoint* supports, clustering algorithms should have little difficulty
in finding these different supports. We illustrate the effectiveness of this workaroud in the rightmost
column of Figure 2 in our paper.

## Related work

- Strathmann, H., D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton (2015).
*Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families.*In: NIPS. arXiv: 1506.02564. - Sriperumbudur, B., K. Fukumizu, A. Gretton, A.
Hyvärinen, and R. Kumar (2017).
*Density estimation in infinite dimensional exponential families.*In: JMLR 18.1, pp. 1830–1888. arXiv: 1312.3516 - Sutherland, D. J., H. Strathmann, M. Arbel, and A. Gretton (2018).
*Efficient and principled score estimation with Nystr¨om kernel exponential families.*In: AISTATS. arXiv: 1705.08360 - Arbel, M. and A. Gretton (2018).
*Kernel Conditional Exponential Family.*In: AISTATS. arXiv: 1711.05363