Adding Conditional Control to Text-to-Image Diffusion Models

Accept info: ICCV 2023 Oral
Authors: Lvmin Zhang, Anyi Rao, Maneesh Agrawala
Affiliation: Stanford University
Links: arXiv, GitHub
Task: image-based conditional image generation
TLDR: Spatial conditioning via fine-tuning a trainable copy of a pre-trained text-to-image diffusion model's encoding layers, connected to the frozen model with zero convolutions.

1. Intuition & Motivation

  • Goal: generate images with fine-grained spatial control

Since expressing complex layouts, poses, shapes, and forms precisely through text prompts alone is difficult, text-to-image models struggle with fine-grained spatial control.
Then, how can we enable fine-grained spatial control?
Previous works have attempted to achieve this by incorporating additional images as conditioning inputs.

The simplest approach is to fine-tune a pre-trained T2I model on a dataset of condition-image pairs.
However, direct fine-tuning can lead to several issues, such as overfitting, mode collapse, and catastrophic forgetting.
A common approach to mitigate these issues is to restrict the number or rank of trainable parameters. Nevertheless, parameter-efficient methods such as LoRA are insufficient to handle in-the-wild conditioning images with complex shapes and diverse high-level semantics.
Thus, the authors aimed to design a deeper, more customized neural architecture.

Then, how can we design a deep model and efficiently fine-tune it while avoiding the issues above?
To preserve image quality, freezing the pre-trained T2I model parameters is a reasonable choice.
To effectively learn diverse conditional controls, the authors use a trainable copy of the T2I model's encoding layers.
Furthermore, to protect the trainable copy from being damaged by harmful noise at the beginning of training, the authors use zero convolution layers to connect the trainable copy to the frozen original T2I model.

2. ControlNet

(Figure 3: ControlNet applied to Stable Diffusion's U-Net)

2.1. Approach overview

  • Goal: conditional image generation in-the-wild via fine-tuning pre-trained T2I model
  • Core method: freeze pre-trained T2I model + trainable copy of T2I model + zero convolution
  • Inference techniques: classifier-free guidance resolution weighting, composing multiple ControlNets

2.2. Zero convolution

(Figure: ControlNet block with zero convolutions)
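
Concretely, let \(\mathcal{F}(\cdot\,; \Theta)\) be a frozen pre-trained block, \(\Theta_c\) the parameters of its trainable copy, and \(\mathcal{Z}(\cdot\,; \cdot)\) a 1×1 convolution whose weight and bias are both initialized to zero. With input feature \(x\) and conditioning vector \(c\), a ControlNet block outputs (notation as in the paper)
\[
y_c = \mathcal{F}(x; \Theta) + \mathcal{Z}\big(\mathcal{F}\big(x + \mathcal{Z}(c; \Theta_{z1}); \Theta_c\big); \Theta_{z2}\big).
\]
Because both zero convolutions output zero at initialization, \(y_c = \mathcal{F}(x; \Theta)\) before training starts, so attaching ControlNet initially leaves the frozen model's behavior unchanged.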

  • Gradient calculation for zero convolution
    Because the weight and bias are initialized to zero, the gradient with respect to the input feature is zero, but the gradients with respect to the weight and bias depend on the input feature and are generally non-zero.
    As a result, the weight and bias are updated to non-zero values in the first gradient descent iteration (as long as the input feature is non-zero).
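
A minimal PyTorch sketch (illustrative, not the official implementation) makes this concrete: a zero-initialized 1×1 convolution blocks gradients into its input feature, while its own weight and bias still receive non-zero gradients.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the official implementation): a 1x1 "zero convolution"
# with weight and bias initialized to zero, as in ControlNet.
zero_conv = nn.Conv2d(4, 4, kernel_size=1)
nn.init.zeros_(zero_conv.weight)
nn.init.zeros_(zero_conv.bias)

x = torch.randn(1, 4, 8, 8, requires_grad=True)  # feature coming from the trainable copy
out = zero_conv(x)                               # all zeros at initialization
out.sum().backward()                             # any scalar loss works for illustration

print(x.grad.abs().max())                 # 0: no gradient flows into the feature
print(zero_conv.weight.grad.abs().max())  # non-zero: depends on x, so the weight moves off zero
print(zero_conv.bias.grad.abs().max())    # non-zero: so the bias moves off zero as well
```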

2.3. Classifier-free guidance resolution weighting (CFG-RW)

How can we apply classifier-free guidance (CFG) in ControlNet?
Adding the conditioning image to both the conditional and the unconditional branch removes the guidance effect, while adding it only to the conditional branch can make the guidance overly strong, especially when no prompt is given.
The solution is to add the conditioning image only to the conditional branch and to multiply each connection between Stable Diffusion and ControlNet by a weight based on the resolution of the block.
(\(w_i = 64 / h_i\) where \(h_i\) is the size of the i-th block, e.g. \(h_1 = 8, h_2 = 16, \cdots, h_{13} = 64\))
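
A hedged sketch of how these weights could be applied; the block sizes, tensor shapes, and names below are illustrative assumptions, not the official implementation.

```python
import torch

# Illustrative block sizes h_i of the connections into Stable Diffusion
# (assumed values; the paper only gives h_1 = 8, h_2 = 16, ..., h_13 = 64).
block_sizes = [8, 16, 32, 64]
# Stand-ins for ControlNet outputs at each connected block.
residuals = [torch.randn(1, 320, h, h) for h in block_sizes]

# CFG-RW: scale each connection by w_i = 64 / h_i before adding it to the
# conditional branch only (the unconditional branch gets no ControlNet signal).
weighted = [(64.0 / h) * r for h, r in zip(block_sizes, residuals)]

def cfg(eps_c, eps_uc, scale=7.5):
    # Standard classifier-free guidance combination; eps_c is the noise
    # prediction that already includes the weighted ControlNet residuals.
    return eps_uc + scale * (eps_c - eps_uc)
```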

2.4. Composing multiple ControlNets

To apply multiple conditioning images to a single instance of Stable Diffusion, directly add the outputs of the corresponding ControlNets to the Stable Diffusion model.
(no extra weighting or linear interpolation)
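
A minimal sketch of this composition (function and variable names are illustrative, not from the official code):

```python
# Compose multiple ControlNets by summing their per-block outputs directly,
# with no extra weighting or linear interpolation.
def compose_controls(per_controlnet_residuals):
    # per_controlnet_residuals: one list of per-block tensors for each ControlNet,
    # e.g. [pose_residuals, depth_residuals], all with matching block shapes.
    return [sum(blocks) for blocks in zip(*per_controlnet_residuals)]
```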

3. Experiments

3.1. Main results

  • Tasks
    Canny Edge
    Hough Line
    HED Boundary
    User Scribble
    Human Pose
    Semantic Segmentation
    Depth
    Normal Maps
    Cartoon Line Drawing
  • Metrics
    Quality: FID, CLIP-aes, user study
    Fidelity: semantic segmentation label reconstruction, CLIP-T, user study
  • Implementation details
    Convert the input conditioning image from an input size of 512 x 512 into a 64 x 64 feature-space vector
    (train a tiny network to encode the image-space condition into a feature-space conditioning vector; a sketch is given after this list)
    Randomly replace 50% of the text prompts with empty strings to increase ControlNet's ability to directly recognize the semantics of the input conditioning images as a replacement for the prompt
  • Ablations
    Method (architecture design, CFG-RW)
    Training dataset size
    Sudden convergence phenomenon
    (model does not gradually learn the control conditions but abruptly succeeds in following the input conditioning image)
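
As noted under implementation details, here is a minimal sketch of the tiny conditioning encoder. The layer count, channel widths, and output width are illustrative assumptions, not the official configuration: a small stack of strided convolutions maps the 512 x 512 conditioning image to a 64 x 64 feature map matching Stable Diffusion's latent resolution.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Toy conditioning encoder: 512x512 image-space condition -> 64x64 features."""
    def __init__(self, in_channels=3, out_channels=320):  # 320 assumes SD's first U-Net width
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 4, stride=2, padding=1), nn.ReLU(),  # 512 -> 256
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),           # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 128 -> 64
            nn.Conv2d(64, out_channels, 3, stride=1, padding=1),             # project to U-Net width
        )

    def forward(self, cond_image):
        return self.net(cond_image)

# e.g. ConditionEncoder()(torch.randn(1, 3, 512, 512)).shape -> (1, 320, 64, 64)
```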

3.2. Figures

Sudden convergence phenomenon (Figure 21)
CFG-RW (Figure 5)
Composition of multiple conditions (Figure 6)
Stable Diffusion + ControlNet without prompts (Figure 7)
Architecture ablation (Figure 8)
Influence of training dataset size (Figure 22)

3.3. Limitations

  1. Difficult to remove the semantics of the input image (Figure 28)
    When the semantics of the input image are mistakenly recognized, the negative effects seem difficult to eliminate, even if a strong prompt is provided.