Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Accept info: ICML 2015
Authors: Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
Affiliations: Stanford University; University of California, Berkeley
Links: arXiv, GitHub
TLDR: Tractable & flexible probabilistic model by learning the reverse of a forward diffusion process.
1. Intuition & Motivation
How can we build a probabilistic model that is both tractable and flexible?
Intuition: quasi-static processes from non-equilibrium statistical physics
Kolmogorov forward and backward equations show that for many forward diffusion processes, the reverse diffusion processes can be described using the same functional form.
Each step in the diffusion chain has an analytically evaluable probability.
→ the full chain can also be analytically evaluated
→ tractable
Model the reverse process with a neural network.
→ flexible
2. Diffusion Probabilistic Models
2.1. Approach Overview
Use a Markov chain to gradually convert one distribution into another.
- Forward process (inference): data distribution → simple distribution
  - slowly destroy structure in the data distribution through an iterative forward diffusion process
  - restrict the forward process to a simple functional form
- Reverse process (generation): simple distribution → data distribution
  - learn a reverse diffusion process that restores structure in the data
  - each reverse step only needs to estimate a small perturbation to the diffusion process
  - the reverse process has the same functional form as the forward process
2.2. Trajectory
Forward Trajectory
Data distribution: \(q(\mathbf{x}^{(0)})\)
Simple distribution (analytically tractable): \(\pi (\mathbf{y})\)
Diffusion rate: \(\beta\)
Markov diffusion kernel: \(T_{\pi} (\mathbf{y} \mid \mathbf{y'}; \beta )\)
\(\pi (\mathbf{y}) = \int \pi (\mathbf{y'}) \ T_{\pi} (\mathbf{y} \mid \mathbf{y'}; \beta ) \ d\mathbf{y'}\)
\(q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}) = T_{\pi}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}; \beta_t)\)
\(q(\mathbf{x}^{(0 \cdots T)}) = q(\mathbf{x}^{(0)}) \prod_{t=1}^{T} q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})\)
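As a concrete illustration, here is a minimal NumPy sketch of sampling the forward trajectory, assuming the Gaussian kernel from the paper's Gaussian case, \(T_{\pi}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}; \beta_t) = \mathcal{N}(\mathbf{x}^{(t)}; \mathbf{x}^{(t-1)} \sqrt{1 - \beta_t}, \ \mathbf{I} \beta_t)\); the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def forward_trajectory(x0, betas, rng=None):
    """Sample x^(1), ..., x^(T) from the Gaussian forward diffusion.

    Assumes the Gaussian kernel
        q(x^(t) | x^(t-1)) = N(x^(t); sqrt(1 - beta_t) * x^(t-1), beta_t * I),
    which gradually destroys structure and drives the chain toward N(0, I).
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = [np.asarray(x0, dtype=float)]
    for beta_t in betas:
        noise = rng.standard_normal(xs[-1].shape)
        xs.append(np.sqrt(1.0 - beta_t) * xs[-1] + np.sqrt(beta_t) * noise)
    return xs  # [x^(0), x^(1), ..., x^(T)]

# Example: T = 1000 steps with a constant diffusion rate.
T = 1000
x0 = np.random.default_rng(0).standard_normal(2)   # a toy 2-D "data point"
trajectory = forward_trajectory(x0, betas=np.full(T, 1.0 / T))
```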
Reverse Trajectory
\(p(\mathbf{x}^{(T)}) = \pi(\mathbf{x}^{(T)})\)
\(p(\mathbf{x}^{(0 \cdots T)}) = p(\mathbf{x}^{(T)}) \prod_{t=1}^{T} p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)})\)
For continuous diffusion (the limit of small step size \(\beta\)), the reversal of the diffusion process has the same functional form as the forward process.
If \(q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})\) is a Gaussian distribution, then \(p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)})\) will also be a Gaussian distribution.
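For example, in the paper's Gaussian case the forward kernel is fixed (up to the schedule \(\beta_t\)), while the reverse kernel's mean and covariance are produced by learned functions (e.g. neural networks) \(f_{\mu}\) and \(f_{\Sigma}\):
\(q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}) = \mathcal{N}(\mathbf{x}^{(t)}; \ \mathbf{x}^{(t-1)} \sqrt{1 - \beta_t}, \ \mathbf{I} \beta_t)\)
\(p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) = \mathcal{N}(\mathbf{x}^{(t-1)}; \ f_{\mu}(\mathbf{x}^{(t)}, t), \ f_{\Sigma}(\mathbf{x}^{(t)}, t))\)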
2.3. Model Probability
\(p(\mathbf{x}^{(0)}) = \int p(\mathbf{x}^{(0 \cdots T)}) \ d\mathbf{x}^{(1 \cdots T)}\)
\(= \int p(\mathbf{x}^{(0 \cdots T)}) \ \frac{q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)})}{q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)})} \ d\mathbf{x}^{(1 \cdots T)}\)
\(= \int q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)}) \ \frac{p(\mathbf{x}^{(0 \cdots T)})}{q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)})} \ d\mathbf{x}^{(1 \cdots T)}\)
\(= \int q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)}) \ p(\mathbf{x}^{(T)}) \prod_{t=1}^{T} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \ d\mathbf{x}^{(1 \cdots T)}\)
Expected log-likelihood under data distribution
\(L = \mathbb{E}_{\mathbf{x}^{(0)} \sim q(\mathbf{x}^{(0)})} [\mathrm{log} \ p(\mathbf{x}^{(0)})]\)
\(= \int q(\mathbf{x}^{(0)}) \ \mathrm{log} \ p(\mathbf{x}^{(0)}) \ d \mathbf{x}^{(0)}\)
\(= \int q(\mathbf{x}^{(0)}) \ \mathrm{log} \left ( \int q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)}) \ p(\mathbf{x}^{(T)}) \prod_{t=1}^{T} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \ d\mathbf{x}^{(1 \cdots T)} \right ) \ d \mathbf{x}^{(0)}\)
(by Jensen's inequality, moving the log inside the expectation over \(q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)})\))
\(\geq \int q(\mathbf{x}^{(0)}) \ q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)}) \ \mathrm{log} \left ( p(\mathbf{x}^{(T)}) \prod_{t=1}^{T} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(1 \cdots T)} \ d \mathbf{x}^{(0)}\)
\(= \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \left ( p(\mathbf{x}^{(T)}) \prod_{t=1}^{T} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \mathrm{log} \ p(\mathbf{x}^{(T)}) + \sum_{t=1}^{T} \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= \int q(\mathbf{x}^{(T)}) \ \mathrm{log} \ p(\mathbf{x}^{(T)}) \ d\mathbf{x}^{(T)} + \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \sum_{t=1}^{T} \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= \int q(\mathbf{x}^{(T)}) \ \mathrm{log} \ \pi(\mathbf{x}^{(T)}) \ d\mathbf{x}^{(T)} + \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \sum_{t=1}^{T} \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= K\)
The first term reduces to the negative entropy of \(p(\mathbf{X}^{(T)})\)
\(K = -H(q(\mathbf{X}^{(T)}), \pi(\mathbf{X}^{(T)})) + \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \sum_{t=1}^{T} \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= -H(q(\mathbf{X}^{(T)})) - D_{KL}(q(\mathbf{X}^{(T)}) \parallel \pi(\mathbf{X}^{(T)})) + \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \sum_{t=1}^{T} \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
(after many diffusion steps \(q(\mathbf{x}^{(T)}) \approx \pi(\mathbf{x}^{(T)})\), so the KL term is negligible and \(H(q(\mathbf{X}^{(T)})) \approx H(\pi(\mathbf{X}^{(T)})) = H(p(\mathbf{X}^{(T)}))\))
\(= -H(q(\mathbf{X}^{(T)})) + \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \sum_{t=1}^{T} \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= -H(p(\mathbf{X}^{(T)})) + \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \sum_{t=1}^{T} \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) + \sum_{t=1}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} d\mathbf{x}^{(0 \cdots T)}\)
Remove the edge effect at \(t=0\)
To avoid edge effects, set the final step of the reverse trajectory to be identical to the corresponding forward diffusion step.
\(p(\mathbf{x}^{(0)} \mid \mathbf{x}^{(1)}) = q(\mathbf{x}^{(1)} \mid \mathbf{x}^{(0)}) \frac{\pi(\mathbf{x}^{(0)})}{\pi(\mathbf{x}^{(1)})} = T_{\pi}(\mathbf{x}^{(0)} \mid \mathbf{x}^{(1)}; \beta_1)\)
\(K = -H_p(\mathbf{X}^{(T)}) + \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(0)} \mid \mathbf{x}^{(1)} )}{q(\mathbf{x}^{(1)} \mid \mathbf{x}^{(0)})} d\mathbf{x}^{(0 \cdots T)} + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) + \int q(\mathbf{x}^{(0, 1)}) \ \mathrm{log} \frac{q(\mathbf{x}^{(1)} \mid \mathbf{x}^{(0)} ) \ \pi(\mathbf{x}^{(0)})}{q(\mathbf{x}^{(1)} \mid \mathbf{x}^{(0)}) \ \pi(\mathbf{x}^{(1)})} d\mathbf{x}^{(0, 1)} + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) + \int q(\mathbf{x}^{(0)}) \ \mathrm{log} \pi(\mathbf{x}^{(0)}) \ d\mathbf{x}^{(0)} - \int q(\mathbf{x}^{(1)}) \ \mathrm{log} \pi(\mathbf{x}^{(1)}) \ d\mathbf{x}^{(1)} + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H(q(\mathbf{x}^{(0)}), \pi(\mathbf{x}^{(0)})) + H(q(\mathbf{x}^{(1)}), \pi(\mathbf{x}^{(1)})) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} d\mathbf{x}^{(0 \cdots T)}\)
(since \(\beta_1\) is small, \(q(\mathbf{x}^{(1)}) \approx q(\mathbf{x}^{(0)})\), so the two cross-entropy terms cancel)
\(= -H_p(\mathbf{X}^{(T)}) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} d\mathbf{x}^{(0 \cdots T)}\)
Rewrite in terms of the forward posterior \(q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)})\), using the Markov property \(q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}) = q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}, \mathbf{x}^{(0)})\) and then Bayes' rule
\(K = -H_p(\mathbf{X}^{(T)}) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}, \mathbf{x}^{(0)})} d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) \ q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(0)})}{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \ q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(0)})} d\mathbf{x}^{(0 \cdots T)}\)
Rewrite in terms of KL divergences and entropies
\(K = -H_p(\mathbf{X}^{(T)}) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \left ( \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } + \mathrm{log} \frac{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(0)})}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(0)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) + \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{q(\mathbf{x}^{(1)} \mid \mathbf{x}^{(0)})}{q(\mathbf{x}^{(T)} \mid \mathbf{x}^{(0)})} d\mathbf{x}^{(0 \cdots T)} + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) + \int q(\mathbf{x}^{(0, 1)}) \ \mathrm{log} q(\mathbf{x}^{(1)} \mid \mathbf{x}^{(0)}) d\mathbf{x}^{(0, 1)} - \int q(\mathbf{x}^{(0, T)}) \ \mathrm{log} q(\mathbf{x}^{(T)} \mid \mathbf{x}^{(0)}) d\mathbf{x}^{(0, T)} + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H(q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)})) + H(q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)})) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H_q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)}) + H_q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)}) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H_q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)}) + H_q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)}) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(0, t-1, t)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } d\mathbf{x}^{(0, t-1, t)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H_q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)}) + H_q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)}) + \sum_{t=2}^{T} \int q(\mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \ q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } d\mathbf{x}^{(0)} d\mathbf{x}^{(t-1)} d\mathbf{x}^{(t)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H_q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)}) + H_q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)}) + \sum_{t=2}^{T} \int d\mathbf{x}^{(0)} d\mathbf{x}^{(t)} q(\mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \int q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \ \mathrm{log} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) }{q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) } d\mathbf{x}^{(t-1)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H_q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)}) + H_q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)}) - \sum_{t=2}^{T} \int d\mathbf{x}^{(0)} d\mathbf{x}^{(t)} q(\mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \ D_{KL}(q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \parallel p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}))\)
\(= -H_p(\mathbf{X}^{(T)}) - H_q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)}) + H_q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)}) - \sum_{t=2}^{T} \int q(\mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \ D_{KL}(q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \parallel p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)})) \ d\mathbf{x}^{(0)} d\mathbf{x}^{(t)}\)
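In the Gaussian case both \(q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)})\) and \(p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)})\) are Gaussian, so every KL term in this bound has a closed form. A minimal NumPy sketch for diagonal covariances (names are illustrative, not from the paper's code):

```python
import numpy as np

def gaussian_kl_diag(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions.

    Closed form for diagonal Gaussians:
        0.5 * sum( log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1 )
    """
    mu_q, var_q, mu_p, var_p = map(np.asarray, (mu_q, var_q, mu_p, var_p))
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# One Monte Carlo term of the bound: plug in the moments of
# q(x^(t-1) | x^(t), x^(0)) as (mu_q, var_q) and of p(x^(t-1) | x^(t)) as
# (mu_p, var_p) for a sampled pair (x^(0), x^(t)).
```

Evaluating \(K\) then reduces to averaging such KL terms over forward-process samples \((\mathbf{x}^{(0)}, \mathbf{x}^{(t)})\) and time steps, plus the analytic entropy terms above.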
2.4. Training
\(L = \mathbb{E}_{\mathbf{x}^{(0)} \sim q(\mathbf{x}^{(0)})} [\mathrm{log} \ p(\mathbf{x}^{(0)})]\)
\(L \geq K\)
\(K = \int q(\mathbf{x}^{(0 \cdots T)}) \ \mathrm{log} \left ( p(\mathbf{x}^{(T)}) \prod_{t=1}^{T} \frac{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)} )}{q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)})} \right ) d\mathbf{x}^{(0 \cdots T)}\)
\(= -H_p(\mathbf{X}^{(T)}) - H_q(\mathbf{X}^{(1)} \mid \mathbf{X}^{(0)}) + H_q(\mathbf{X}^{(T)} \mid \mathbf{X}^{(0)}) - \sum_{t=2}^{T} \int q(\mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \ D_{KL}(q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \parallel p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)})) \ d\mathbf{x}^{(0)} d\mathbf{x}^{(t)}\)
Training amounts to finding the reverse Markov transitions that maximize the lower bound \(K\).
\(\hat{p}(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}) = \underset{p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)})}{\mathrm{argmax}} K\)
For Gaussian diffusion, train the means and covariances of the sequence of reverse Gaussian kernels, as well as the forward diffusion schedule \(\beta_{2 \cdots T}\).
(\(\beta_1\) is fixed to a small constant to prevent overfitting)
To compute gradients with respect to \(\beta\), samples from \(q(\mathbf{x}^{(1 \cdots T)} \mid \mathbf{x}^{(0)})\) are drawn using “frozen noise”: the noise is treated as an additional auxiliary variable and held fixed while differentiating.
(the same idea as the reparameterization trick used in VAEs)
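A minimal PyTorch-style sketch of this frozen-noise reparameterization, assuming the Gaussian forward kernel above; how \(\beta\) is parameterized and constrained is glossed over, so treat the names and setup as illustrative:

```python
import torch

def frozen_noise_trajectory(x0, betas, eps):
    """Reparameterized sample of x^(1..T) given x^(0) and pre-drawn ("frozen") noise.

    x^(t) = sqrt(1 - beta_t) * x^(t-1) + sqrt(beta_t) * eps_t, with eps held fixed,
    so the whole trajectory is a differentiable function of the schedule betas.
    """
    xs = [x0]
    for t in range(len(betas)):
        xs.append(torch.sqrt(1.0 - betas[t]) * xs[-1] + torch.sqrt(betas[t]) * eps[t])
    return xs

T, dim = 100, 2
x0 = torch.randn(dim)
betas = torch.full((T,), 1.0 / T, requires_grad=True)  # learnable schedule (sketch)
eps = torch.randn(T, dim)                              # frozen auxiliary noise
traj = frozen_noise_trajectory(x0, betas, eps)
traj[-1].sum().backward()                              # gradients flow back to betas
```

In practice the schedule would be constrained to \((0, 1)\) (e.g. via a sigmoid); that detail is omitted in this sketch.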
2.5. Multiplying Distributions & Computing Posteriors
TL;DR: under mild assumptions on \(r(\mathbf{x}^{(t)})\), the perturbed reverse kernel \(\tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)})\) keeps the same functional form as \(p(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)})\), differing only by a small perturbation (a shifted mean in the Gaussian case).
\(p(\mathbf{x}^{(0)})\): model distribution
\(r(\mathbf{x}^{(0)})\): second distribution or bounded positive function
\(\tilde{p}(\mathbf{x}^{(0)}) \propto p(\mathbf{x}^{(0)}) \ r(\mathbf{x}^{(0)})\): new distribution
\(\tilde{p}(\mathbf{x}^{(t)}) = \frac{1}{\tilde{Z}_t} p(\mathbf{x}^{(t)}) \ r(\mathbf{x}^{(t)})\)
\(p(\mathbf{x}^{(t)}) = \int p(\mathbf{x}^{(t+1)}) \ p(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) \ d\mathbf{x}^{(t+1)}\)
Require that the perturbed Markov kernel \(\tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)})\) satisfies the analogous marginalization condition with respect to the perturbed distributions.
\(\tilde{p}(\mathbf{x}^{(t)}) = \int \tilde{p}(\mathbf{x}^{(t+1)}) \ \tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) \ d\mathbf{x}^{(t+1)}\)
\(\frac{1}{\tilde{Z}_t} p(\mathbf{x}^{(t)}) \ r(\mathbf{x}^{(t)}) = \int \frac{1}{\tilde{Z}_{t+1}} p(\mathbf{x}^{(t+1)}) \ r(\mathbf{x}^{(t+1)}) \ \tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) \ d\mathbf{x}^{(t+1)}\)
\(p(\mathbf{x}^{(t)}) = \int \frac{\tilde{Z}_t \ r(\mathbf{x}^{(t+1)})}{\tilde{Z}_{t+1} \ r(\mathbf{x}^{(t)})} p(\mathbf{x}^{(t+1)}) \ \tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) \ d\mathbf{x}^{(t+1)}\)
\(p(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) = \frac{\tilde{Z}_t \ r(\mathbf{x}^{(t+1)})}{\tilde{Z}_{t+1} \ r(\mathbf{x}^{(t)})} \ \tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)})\)
\(\tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) = \frac{\tilde{Z}_{t+1} \ r(\mathbf{x}^{(t)})}{\tilde{Z}_{t} \ r(\mathbf{x}^{(t+1)})} \ p(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)})\)
Assume that \(r(\mathbf{x}^{(t)})\) is sufficiently smooth.
Then \(p(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)})\) has small variance relative to the scale over which \(r(\mathbf{x}^{(t)})\) varies, so \(\frac{r(\mathbf{x}^{(t)})}{r(\mathbf{x}^{(t+1)})}\) can be treated as a small perturbation to the kernel.
A small perturbation to a Gaussian affects the mean, but not the normalization constant.
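Concretely, for a Gaussian reverse kernel \(p(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) = \mathcal{N}(\mathbf{x}^{(t)}; \boldsymbol{\mu}, \boldsymbol{\Sigma})\), this small-perturbation argument suggests (a sketch of the approximation, not an exact identity) that multiplying by \(r\) only shifts the mean:
\(\tilde{p}(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t+1)}) \approx \mathcal{N} \left( \mathbf{x}^{(t)}; \ \boldsymbol{\mu} + \boldsymbol{\Sigma} \ \nabla_{\mathbf{x}} \mathrm{log} \ r(\mathbf{x}) \big|_{\mathbf{x} = \boldsymbol{\mu}}, \ \boldsymbol{\Sigma} \right)\)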
2.6. Entropy of Reverse Process
Since the forward process is known, we can derive upper and lower bounds on the conditional entropy of each step in the reverse trajectory.
\(H_q (\mathbf{X}^{(t-1)}, \mathbf{X}^{(t)}) = H_q (\mathbf{X}^{(t)}, \mathbf{X}^{(t-1)})\)
\(H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(t)}) + H_q (\mathbf{X}^{(t)}) = H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)}) + H_q (\mathbf{X}^{(t-1)})\)
\(H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(t)}) = H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)}) + H_q (\mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(t)})\)
Upper bound
\(H_q (\mathbf{X}^{(t-1)}) \leq H_q (\mathbf{X}^{(t)})\)
\(H_q (\mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(t)}) \leq 0\)
\(H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(t)}) - H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)}) \leq 0\)
\(H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(t)}) \leq H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)})\)
Lower bound
\(H_q (\mathbf{X}^{(0)} \mid \mathbf{X}^{(t)}) \geq H_q (\mathbf{X}^{(0)} \mid \mathbf{X}^{(t-1)})\)
(by the data processing inequality for the Markov chain \(\mathbf{X}^{(0)} \rightarrow \mathbf{X}^{(t-1)} \rightarrow \mathbf{X}^{(t)}\): conditioning on the noisier \(\mathbf{X}^{(t)}\) cannot tell us more about \(\mathbf{X}^{(0)}\))
\(0 \geq H_q (\mathbf{X}^{(0)} \mid \mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(0)} \mid \mathbf{X}^{(t)})\)
\(H_q (\mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(t)}) \geq H_q (\mathbf{X}^{(0)} \mid \mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(0)} \mid \mathbf{X}^{(t)}) + H_q (\mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(t)})\)
\(H_q (\mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(t)}) \geq H_q (\mathbf{X}^{(0)}, \mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(0)}, \mathbf{X}^{(t)})\)
\(H_q (\mathbf{X}^{(t-1)}) - H_q (\mathbf{X}^{(t)}) \geq H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(0)}) - H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(0)})\)
\(H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(t)}) - H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)}) \geq H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(0)}) - H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(0)})\)
\(H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(t)}) \geq H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)}) + H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(0)}) - H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(0)})\)
Upper and lower bounds
\(H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)}) + H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(0)}) - H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(0)}) \leq H_q (\mathbf{X}^{(t-1)} \mid \mathbf{X}^{(t)}) \leq H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)})\)
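For the Gaussian forward kernel both sides of this bound can be evaluated analytically. For \(D\)-dimensional data, and assuming the kernel \(\mathcal{N}(\mathbf{x}^{(t)}; \mathbf{x}^{(t-1)} \sqrt{1 - \beta_t}, \ \mathbf{I} \beta_t)\) with its Gaussian conditional \(q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(0)})\), a sketch of the needed entropies is:
\(H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(t-1)}) = \frac{D}{2} \mathrm{log} \left( 2 \pi e \beta_t \right)\)
\(H_q (\mathbf{X}^{(t)} \mid \mathbf{X}^{(0)}) = \frac{D}{2} \mathrm{log} \left( 2 \pi e \left( 1 - \prod_{s=1}^{t} (1 - \beta_s) \right) \right)\)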