Generative Adversarial Networks

Accept info: NIPS 2014
Authors: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Affiliation: Département d’informatique et de recherche opérationnelle, Université de Montréal
Links: arXiv, OpenReview, GitHub
TLDR: Train generative models through adversarial training, without requiring explicit likelihood estimation.

1. Intuition & Motivation

Problem: Training deep generative models is challenging because maximum likelihood estimation typically requires approximating intractable probabilistic computations.

Intuition: Deep learning has achieved remarkable success in training discriminative models.
→ Instead of directly maximizing the likelihood, why not train a discriminative model to determine whether a sample comes from the generator’s distribution or the data distribution?

2. Adversarial Nets

2.1. Approach Overview

\[\underset{G}{\mathrm{min}} \ \underset{D}{\mathrm{max}} \ V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} [\mathrm{log} D(x)] + \mathbb{E}_{z \sim p_{z}(z)} [\mathrm{log}(1 - D(G(z)))]\]

D: discriminator
G: generator

Train D to maximize the probability of assigning the correct label to both training examples and samples from G.
Train G to minimize \(\mathrm{log}(1 - D(G(z)))\), which encourages G to generate samples that D cannot distinguish from real data.
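To make the objective concrete, here is a minimal PyTorch sketch of how the two expectations in \(V(D, G)\) are estimated from minibatches. The MLP architectures, dimensions, and batch size are illustrative assumptions, not the paper’s setup.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: any differentiable G: R^16 -> R^2 and
# D: R^2 -> (0, 1) would do.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

x_real = torch.randn(32, 2)   # stand-in for a minibatch from p_data
z = torch.randn(32, 16)       # minibatch from the noise prior p_z
x_fake = G(z)

# Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
v = torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake)).mean()

# D ascends V (so its loss is -V); G descends the second term only.
loss_D = -v
loss_G = torch.log(1 - D(x_fake)).mean()
```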

2.2. Theoretical Result 1: Global Optimality of \(p_g = p_{\mathrm{data}}\)

\(V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} D(x)] + \mathbb{E}_{z \sim p_{z}} [\mathrm{log}(1 - D(G(z)))]\)
\(= \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} D(x)] + \mathbb{E}_{x \sim p_{g}} [\mathrm{log}(1 - D(x))]\)
\(= \int p_{\mathrm{data}}(x) \ \mathrm{log} (D(x)) \ dx + \int p_{g}(x) \ \mathrm{log} (1 - D(x)) \ dx\)
\(= \int p_{\mathrm{data}}(x) \ \mathrm{log} (D(x)) + p_{g}(x) \ \mathrm{log} (1 - D(x)) \ dx\)

Since \(0 \leq D(x) \leq 1\): for any \((a, b) \neq (0, 0)\) with \(a, b \geq 0\), the function \(y \mapsto a \ \mathrm{log}(y) + b \ \mathrm{log}(1 - y)\) achieves its maximum on \([0, 1]\) at \(y = \frac{a}{a + b}\). Applying this pointwise with \(a = p_{\mathrm{data}}(x)\) and \(b = p_g(x)\), \(V(G, D)\) achieves its maximum at \(D_{G}^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}\), where \(D_{G}^{*}(x)\) is the optimal discriminator.
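A quick numerical sanity check of this pointwise maximization; the values of \(a\) and \(b\) below are arbitrary.

```python
import numpy as np

# f(y) = a*log(y) + b*log(1 - y) with a = p_data(x), b = p_g(x) fixed;
# its maximizer on [0, 1] should be y* = a / (a + b).
a, b = 0.7, 0.2
y = np.linspace(1e-6, 1 - 1e-6, 100001)
f = a * np.log(y) + b * np.log(1 - y)
print(y[np.argmax(f)])  # ~0.7778
print(a / (a + b))      # 0.7777...
```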

\(C(G) = \underset{D}{\mathrm{max}} \ V(G, D)\)
\(= \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} D_{G}^{*}(x)] + \mathbb{E}_{x \sim p_{g}} [\mathrm{log}(1 - D_{G}^{*}(x))]\)
\(= \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}] + \mathbb{E}_{x \sim p_{g}} [\mathrm{log}\frac{p_{g}(x)}{p_{\mathrm{data}}(x) + p_g(x)}]\)
\(= \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} \frac{p_{\mathrm{data}}(x)}{(p_{\mathrm{data}}(x) + p_g(x))/2}] + \mathbb{E}_{x \sim p_{g}} [\mathrm{log}\frac{p_{g}(x)}{(p_{\mathrm{data}}(x) + p_g(x))/2}] - \mathrm{log} 4\)
\(= KL \left ( p_{\mathrm{data}}(x) \parallel \frac{p_{\mathrm{data}}(x) + p_g(x)}{2} \right ) + KL \left ( p_{g}(x) \parallel \frac{p_{\mathrm{data}}(x) + p_g(x)}{2} \right ) - \mathrm{log} 4\)
\(= 2 \cdot JSD \left ( p_{\mathrm{data}}(x) \parallel p_{g}(x) \right ) - \mathrm{log} 4\)

Since the Jensen–Shannon divergence is non-negative and zero if and only if the two distributions are equal, the global minimum of \(C(G)\) is \(-\mathrm{log} \ 4\), achieved if and only if \(p_g = p_{\mathrm{data}}\).
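This identity is easy to verify numerically on a discrete support; the distributions below are arbitrary.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def c_of_g(p_data, p_g):
    # C(G) evaluated at the optimal discriminator D* = p_data / (p_data + p_g)
    d_star = p_data / (p_data + p_g)
    return np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.5, 0.3])
m = (p_data + p_g) / 2
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(np.isclose(c_of_g(p_data, p_g), 2 * jsd - np.log(4)))  # True
print(np.isclose(c_of_g(p_data, p_data), -np.log(4)))        # True: minimum at p_g = p_data
```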

2.3. Adversarial Nets in Practice

[Algorithm 1: minibatch stochastic gradient descent training of generative adversarial nets, alternating \(k\) discriminator updates with one generator update.]

In practice, we must implement the two-player minimax game using an iterative, numerical approach.
Optimizing \(D\) to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting.
Instead, we alternate between \(k\) steps of optimizing \(D\) and one step of optimizing \(G\).
This results in \(D\) being maintained near its optimal solution, so long as \(G\) changes slowly enough.
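A minimal sketch of this alternating scheme; the networks, data distribution, optimizer, and hyperparameters are illustrative assumptions (the paper uses \(k = 1\), the least expensive option, in its experiments).

```python
import torch
import torch.nn as nn

# Illustrative stand-ins, not the paper's architectures.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_D = torch.optim.SGD(D.parameters(), lr=0.01)
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)

def sample_data(n):
    return torch.randn(n, 2) + 3.0  # placeholder for samples from p_data

k, batch = 1, 64
for step in range(1000):
    for _ in range(k):  # k steps ascending V with respect to D
        x = sample_data(batch)
        z = torch.randn(batch, 16)
        loss_D = -(torch.log(D(x)).mean()
                   + torch.log(1 - D(G(z).detach())).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    z = torch.randn(batch, 16)  # one step descending V with respect to G
    loss_G = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```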

Early in learning, when \(G\) is poor, \(D\) can reject samples with high confidence because they are clearly different from the training data.
As a result, \(\mathrm{log}(1 - D(G(z)))\) saturates and may not provide sufficient gradient for \(G\) to learn well.
To address this, rather than training \(G\) to minimize \(\mathrm{log}(1 - D(G(z)))\), train \(G\) to maximize \(\mathrm{log} D(G(z))\).
This results in the same fixed point of the dynamics of \(G\) and \(D\), but provides much stronger gradients early in learning.
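The difference is easy to see through the discriminator’s output nonlinearity (assuming, as in the sketches above, that \(D\) ends in a sigmoid). Writing \(a\) for the logit of a generated sample, so \(d = D(G(z)) = \sigma(a)\), the saturating loss \(\mathrm{log}(1 - d)\) has derivative \(-d\) with respect to \(a\), while the non-saturating loss \(-\mathrm{log} \ d\) has derivative \(d - 1\); early in training \(d \approx 0\), so the former vanishes while the latter stays near \(-1\):

```python
import numpy as np

# Let a be the discriminator's logit for a generated sample, d = sigmoid(a).
# Gradients w.r.t. a (what backprop passes on to G):
#   saturating     L = log(1 - d):  dL/da = -d     -> 0  when d ~ 0
#   non-saturating L = -log(d):     dL/da = d - 1  -> -1 when d ~ 0
d = np.array([0.001, 0.01, 0.1, 0.5, 0.9])
print(-d)     # [-0.001 -0.01 -0.1 -0.5 -0.9]: vanishing early signal
print(d - 1)  # [-0.999 -0.99 -0.9 -0.5 -0.1]: strong early signal
```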

2.4. Theoretical Result 2: Convergence of Algorithm 1

Proof overview: \(C(G)\) is convex in \(p_g\), and a subgradient can be obtained at any point.

Proof 1: \(C(G)\) is convex.

\(V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} D(x)] + \mathbb{E}_{z \sim p_{z}} [\mathrm{log}(1 - D(G(z)))]\)
\(= \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} D(x)] + \mathbb{E}_{x \sim p_{g}} [\mathrm{log}(1 - D(x))]\)

Define \(U(p_g, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} D(x)] + \mathbb{E}_{x \sim p_{g}} [\mathrm{log}(1 - D(x))]\), then \(V(G, D) = U(p_g, D)\).

\(U(p_g, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}} [\mathrm{log} D(x)] + \mathbb{E}_{x \sim p_{g}} [\mathrm{log}(1 - D(x))]\)
\(= \int p_{\mathrm{data}}(x) \ \mathrm{log} (D(x)) + p_{g}(x) \ \mathrm{log} (1 - D(x)) \ dx\)

\(U(\lambda p_1 + (1 - \lambda) p_2, D) = \int p_{\mathrm{data}}(x) \ \mathrm{log} (D(x)) + (\lambda p_1(x) + (1 - \lambda) p_2(x)) \ \mathrm{log} (1 - D(x)) \ dx\)
\(= \int p_{\mathrm{data}}(x) \ \mathrm{log} (D(x)) + \lambda p_1(x) \ \mathrm{log} (1 - D(x)) + (1 - \lambda) p_2(x) \ \mathrm{log} (1 - D(x)) \ dx\)
\(= \lambda U(p_1, D) + (1 - \lambda) U(p_2, D)\)
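A quick numerical check of this linearity on a discrete support; the distributions and discriminator values below are arbitrary.

```python
import numpy as np

def U(p_g, d, p_data):
    # U(p_g, D) on a discrete support; d gives D(x) pointwise in (0, 1).
    return np.sum(p_data * np.log(d) + p_g * np.log(1 - d))

p_data = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.6, 0.1, 0.3])
d = np.array([0.8, 0.4, 0.3])  # an arbitrary discriminator
lam = 0.3

lhs = U(lam * p1 + (1 - lam) * p2, d, p_data)
rhs = lam * U(p1, d, p_data) + (1 - lam) * U(p2, d, p_data)
print(np.isclose(lhs, rhs))  # True: U is linear in p_g
```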

\(U(p_g, D)\) is therefore linear, and in particular convex, in \(p_g\).
Since \(U(p_g, D)\) is convex in \(p_g\) for every fixed \(D\), the pointwise supremum \(\underset{D}{\mathrm{sup}} \ U(p_g, D)\) is also convex in \(p_g\).
(assuming that \(k\) steps of training bring \(D\) to its optimum given \(G\))

\[C(G) = \underset{D}{\mathrm{max}} \ V(G, D) = \underset{D}{\mathrm{sup}} \ U(p_g, D)\]

Thus, \(C(G)\) is convex in \(p_g\).

Proof 2: a subgradient can be obtained at any point.

Subderivatives of a supremum of convex functions: if \(f(x) = \underset{\alpha}{\mathrm{sup}} \ f_{\alpha}(x)\) with each \(f_{\alpha}\) convex, then the derivative of the \(f_{\alpha}\) attaining the supremum at \(x\) is a subderivative of \(f\) at \(x\). Hence computing the gradient of \(U(p_g, D)\) at the optimal \(D\) yields a subgradient of \(C(G)\), and because \(C(G)\) is convex (though possibly non-differentiable), gradient descent using any subgradient converges to the global minimum under mild conditions.
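A small demonstration of this principle on a toy pointwise maximum of two convex functions; the functions, step sizes, and starting point are arbitrary.

```python
# f(x) = max(f1(x), f2(x)) is convex but non-differentiable where f1 = f2.
# The gradient of whichever branch attains the max is a valid subgradient,
# mirroring the "take the gradient at the optimal D" argument for C(G).
f1, g1 = lambda x: (x - 1) ** 2, lambda x: 2 * (x - 1)
f2, g2 = lambda x: (x + 1) ** 2, lambda x: 2 * (x + 1)

x = 5.0
for t in range(200):
    g = g1(x) if f1(x) >= f2(x) else g2(x)  # subgradient of the pointwise max
    x -= 0.5 / (t + 1) * g                  # diminishing step size
print(x)  # ~0.0, the global minimizer of f
```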