NICE: Non-linear Independent Components Estimation

Accept info: ICLR 2015 Workshop
Authors: Laurent Dinh, David Krueger, Yoshua Bengio
Affiliation: Département d’informatique et de recherche opérationnelle, Université de Montréal
Links: arXiv, GitHub
TLDR: maximize exact log-likelihood via a change of variables, using a carefully designed invertible transformation with a tractable Jacobian.

1. Intuition & Motivation

Maximizing a lower-bound criterion might yield a solution that is suboptimal with respect to the true log-likelihood.
→ Let’s maximize the exact log-likelihood instead.

How can we maximize the exact log-likelihood?
Two approaches exist: autoregressive modeling and the change-of-variables formula.

Autoregressive modeling makes the ancestral sampling procedure computationally expensive and unparallelizable for generative tasks on high-dimensional data.
Thus, let’s use the change-of-variables formula.

2. Learning Bijective Transformations of Continuous Probabilities

2.1. Approach Overview

Train an encoder \(f\) using the NICE criterion.
Use the decoder \(f^{-1}\) to generate samples.

2.2. Non-linear Independent Components Estimation (NICE)

Maximum likelihood via the change-of-variables formula.
Learn an invertible non-linear transformation \(f\) that maps the data distribution onto a simple distribution.
Assume that the prior distribution \(p_H(h)\) is factorial.

\(h = f(x)\)
\(p_X(x) \ dx = p_H(h) \ dh\)
\(p_X(x) = p_H(f(x)) \left | \mathrm{det} (\frac{\partial f(x)}{\partial x}) \right |\)
\(\mathrm{log} (p_X(x)) = \mathrm{log}(p_H(f(x))) + \mathrm{log} (\left | \mathrm{det} (\frac{\partial f(x)}{\partial x}) \right |)\)
\(\mathrm{log} (p_X(x)) = \sum_{d=1}^{D} \mathrm{log}(p_{H_d}(f_d(x))) + \mathrm{log} (\left | \mathrm{det} (\frac{\partial f(x)}{\partial x}) \right |)\)

Prior term (\(\mathrm{log}(p_H(f(x)))\)): encourages the transformed data \(f(x)\) to match the prior distribution \(p_H\).
Entropy (log-det) term (\(\mathrm{log} (\left | \mathrm{det} (\frac{\partial f(x)}{\partial x}) \right |)\)): compensates for the local volume change induced by \(f\), so that \(p_X\) remains a properly normalized density.
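
As a concrete illustration, here is a minimal NumPy sketch of this criterion, assuming a factorized standard Gaussian prior; `f` and `log_det_jacobian` are hypothetical placeholders for the learned transformation and its log-Jacobian-determinant:

```python
import numpy as np

def standard_gaussian_log_prob(h):
    """Factorial prior: sum of independent per-dimension log-densities."""
    return -0.5 * np.sum(h ** 2 + np.log(2 * np.pi), axis=-1)

def change_of_variables_log_prob(x, f, log_det_jacobian):
    """log p_X(x) = log p_H(f(x)) + log|det(df/dx)|."""
    return standard_gaussian_log_prob(f(x)) + log_det_jacobian(x)
```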

2.3. Architecture

How can we design \(f\) so that computing the determinant of its Jacobian and computing its inverse are both trivial?
To keep the design flexible, use a bijective transformation with a triangular Jacobian: the determinant of a triangular matrix is simply the product of its diagonal entries.

Coupling Layer

Keywords: partition \(x\), function \(m\), coupling law \(g\).

\(y_{I_1} = x_{I_1}\)
\(y_{I_2} = g(x_{I_2}; m(x_{I_1}))\)
\(\frac{\partial y}{\partial x} = \begin{bmatrix} I_d & 0 \\ \frac{\partial y_{I_2}}{\partial x_{I_1}} & \frac{\partial y_{I_2}}{\partial x_{I_2}} \\ \end{bmatrix}\)

Jacobian: \(\mathrm{det} \frac{\partial y}{\partial x} = \mathrm{det} \frac{\partial y_{I_2}}{\partial x_{I_2}}\)
Inverse: \(x_{I_1} = y_{I_1}, x_{I_2} = g^{-1}(y_{I_2}; m(y_{I_1}))\)

Additive coupling law: \(g(a;b) = a+b\)
Multiplicative coupling law: \(g(a;b) = a \odot b\)
Affine coupling law: \(g(a;b) = a \odot b_1 + b_2\)

For numerical stability, use the additive coupling law.
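
A minimal NumPy sketch of an additive coupling layer follows. The coupling function \(m\) can be arbitrarily complex and need not be invertible; a fixed random tanh map is used here as a stand-in for a learned MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_coupling_fn(d_in, d_out):
    """Stand-in for m: any function of x_{I_1} works; it need not be invertible."""
    W, b = rng.normal(size=(d_in, d_out)), rng.normal(size=d_out)
    return lambda a: np.tanh(a @ W + b)

def additive_coupling_forward(x, idx1, idx2, m):
    """y_{I_1} = x_{I_1}, y_{I_2} = x_{I_2} + m(x_{I_1}); log|det J| = 0."""
    y = x.copy()
    y[..., idx2] = x[..., idx2] + m(x[..., idx1])
    return y

def additive_coupling_inverse(y, idx1, idx2, m):
    """x_{I_1} = y_{I_1}, x_{I_2} = y_{I_2} - m(y_{I_1})."""
    x = y.copy()
    x[..., idx2] = y[..., idx2] - m(y[..., idx1])
    return x
```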

Combining Coupling Layers

Compose several coupling layers to obtain a more complex layered transformation.
Since a coupling layer leaves part of its input unchanged, exchange the role of the two subsets in the partition in alternating layers, so that the composition of two coupling layers modifies every dimension.
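
A minimal sketch of such a composition, reusing the additive coupling functions from the previous sketch; the forward pass plays the role of \(f\) and the reversed pass plays the role of \(f^{-1}\):

```python
# Two additive coupling layers with swapped partitions (reusing the sketch above).
D = 4
idx_a, idx_b = np.arange(D // 2), np.arange(D // 2, D)
layers = [
    (idx_a, idx_b, make_coupling_fn(D // 2, D // 2)),  # layer 1 updates the second half
    (idx_b, idx_a, make_coupling_fn(D // 2, D // 2)),  # layer 2 updates the first half
]

def forward(x):          # plays the role of f (encoder)
    for idx1, idx2, m in layers:
        x = additive_coupling_forward(x, idx1, idx2, m)
    return x

def inverse(h):          # plays the role of f^{-1} (decoder)
    for idx1, idx2, m in reversed(layers):
        h = additive_coupling_inverse(h, idx1, idx2, m)
    return h
```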

Rescaling

Since an additive coupling layer has a unit Jacobian determinant, include a diagonal scaling matrix \(S\) as the top layer.
This allows the model to give more weight to some dimensions and less to others.

\[\mathrm{log} (p_X(x)) = \sum_{d=1}^{D} [\mathrm{log}(p_{H_d}(f_d(x))) + \mathrm{log} (|S_{dd}|)]\]

The scaling factors are related to the eigenspectrum of PCA.

  • If \(S_{dd}\) is large: the volume along dimension \(d\) has to be expanded to match the prior distribution, so the variance of dimension \(d\) is small, dimension \(d\) carries less information, and the corresponding eigenvalue is small.
  • If \(S_{dd}\) is small: the volume along dimension \(d\) has to be contracted to match the prior distribution, so the variance of dimension \(d\) is large, dimension \(d\) carries more information, and the corresponding eigenvalue is large.
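
Putting the pieces together, here is a minimal sketch of the NICE log-likelihood and sampling with the diagonal rescaling layer on top of the coupling stack from the previous sketches (standard Gaussian prior assumed); since additive coupling layers are volume-preserving, the total log-Jacobian-determinant reduces to \(\sum_d \mathrm{log}|S_{dd}|\):

```python
# Diagonal rescaling on top of the coupling stack from the previous sketch.
log_s = np.zeros(D)  # learnable log-scales, so S_dd = exp(log_s[d]) > 0

def nice_log_likelihood(x):
    h = forward(x) * np.exp(log_s)                                  # h_d = S_dd * f_d(x)
    log_prior = -0.5 * np.sum(h ** 2 + np.log(2 * np.pi), axis=-1)  # factorial Gaussian prior
    return log_prior + np.sum(log_s)                                # + sum_d log|S_dd|

def sample(rng=np.random.default_rng(1)):
    h = rng.normal(size=D)              # draw from the factorial prior
    return inverse(h / np.exp(log_s))   # undo the scaling, then invert the coupling stack
```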

2.4. Inpainting

\(x = (x_{obs}, x_{miss})\)
\(i\): iteration
\(\alpha_i = \frac{10}{100 + i}\): step size
\(\epsilon \sim \mathcal{N}(0, I)\)

Update \(x_{miss}\) by stochastic gradient ascent on the log-likelihood: \(x_{miss, \ i+1} = x_{miss, \ i} + \alpha_i \left( \frac{\partial \mathrm{log}(p_X((x_{obs}, x_{miss, \ i})))}{\partial x_{miss, \ i}} + \epsilon \right)\)
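
A minimal sketch of this procedure, where `grad_log_px_wrt_missing` is a hypothetical callable (e.g., obtained from an autodiff framework) returning \(\frac{\partial \mathrm{log}(p_X((x_{obs}, x_{miss})))}{\partial x_{miss}}\):

```python
import numpy as np

def inpaint(x_obs, x_miss_init, grad_log_px_wrt_missing, n_iters=1000,
            rng=np.random.default_rng(0)):
    """Stochastic-gradient inpainting sketch.

    grad_log_px_wrt_missing(x_obs, x_miss) is a hypothetical callable returning
    d log p_X((x_obs, x_miss)) / d x_miss.
    """
    x_miss = x_miss_init.copy()
    for i in range(n_iters):
        alpha = 10.0 / (100.0 + i)              # decaying step size alpha_i
        eps = rng.normal(size=x_miss.shape)     # epsilon ~ N(0, I)
        x_miss = x_miss + alpha * (grad_log_px_wrt_missing(x_obs, x_miss) + eps)
    return x_miss
```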