Adding Conditional Control to Text-to-Image Diffusion Models

Accept info: ICCV 2023 Oral
Authors: Lvmin Zhang, Anyi Rao, Maneesh Agrawala
Affiliation: Stanford University
Links: arXiv, GitHub
Task: image-based conditional image generation
TLDR: Spatial conditioning via fine-tuning a trainable copy of a pre-trained text-to-image diffusion model's encoding layers, connected to the frozen model with zero convolutions.

1. Intuition & Motivation

  • Goal: generate images with fine-grained spatial control

Since expressing complex layouts, poses, shapes, and forms precisely through text prompts alone is difficult, text-to-image models struggle with fine-grained spatial control.
Then, how can we enable fine-grained spatial control?
Previous works have attempted to achieve this by incorporating additional images as conditioning inputs.

The simplest approach is to fine-tune a pre-trained T2I model on a dataset of condition-image pairs.
However, direct fine-tuning can lead to several issues, such as overfitting, mode collapse, and catastrophic forgetting.
A common approach to mitigate these issues is to restrict the number or rank of trainable parameters. Nevertheless, parameter-efficient methods such as LoRA are insufficient to handle in-the-wild conditioning images with complex shapes and diverse high-level semantics.
Thus, the authors aimed to design a deeper, more customized neural architecture.

Then, how can we design a deep model and efficiently fine-tune it while avoiding the issues above?
To preserve image quality, freezing the pre-trained T2I model parameters is a reasonable choice.
To effectively learn diverse conditional controls, the authors use a trainable copy of the T2I model's encoding layers.
Furthermore, to protect the trainable copy from being damaged by harmful noise at the beginning of training, the authors use zero convolution layers to connect the trainable copy to the frozen original T2I model.

2. ControlNet

(Figure 3: ControlNet applied to Stable Diffusion's U-Net)

2.1. Approach overview

  • Goal: conditional image generation in-the-wild via fine-tuning pre-trained T2I model
  • Core method: freeze pre-trained T2I model + trainable copy of T2I model + zero convolution
  • Inference techniques: classifier-free guidance resolution weighting, composing multiple ControlNets

2.2. Zero convolution

(Figure: ControlNet block with zero convolutions)
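
Concretely, let \(\mathcal{F}(\cdot\,; \Theta)\) be a frozen pre-trained block, \(\Theta_c\) the parameters of its trainable copy, and \(\mathcal{Z}(\cdot\,; \cdot)\) a 1×1 convolution whose weight and bias are both initialized to zero. With input feature \(x\) and conditioning vector \(c\), a ControlNet block outputs (notation as in the paper)
\[
y_c = \mathcal{F}(x; \Theta) + \mathcal{Z}\big(\mathcal{F}\big(x + \mathcal{Z}(c; \Theta_{z1}); \Theta_c\big); \Theta_{z2}\big).
\]
Because both zero convolutions output zero at initialization, \(y_c = \mathcal{F}(x; \Theta)\) before training starts, so attaching ControlNet initially leaves the frozen model's behavior unchanged.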

  • Gradient calculation for zero convolution
    Because the weight and bias are initialized to zero, the gradient with respect to the input feature is zero, but the gradients with respect to the weight and bias depend on the input feature and are generally non-zero.
    As a result, the weight and bias are updated to non-zero values in the first gradient descent iteration (as long as the input feature is non-zero).
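
A minimal PyTorch sketch (illustrative, not the official implementation) makes this concrete: a zero-initialized 1×1 convolution blocks gradients into its input feature, while its own weight and bias still receive non-zero gradients.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the official implementation): a 1x1 "zero convolution"
# with weight and bias initialized to zero, as in ControlNet.
zero_conv = nn.Conv2d(4, 4, kernel_size=1)
nn.init.zeros_(zero_conv.weight)
nn.init.zeros_(zero_conv.bias)

x = torch.randn(1, 4, 8, 8, requires_grad=True)  # feature coming from the trainable copy
out = zero_conv(x)                               # all zeros at initialization
out.sum().backward()                             # any scalar loss works for illustration

print(x.grad.abs().max())                 # 0: no gradient flows into the feature
print(zero_conv.weight.grad.abs().max())  # non-zero: depends on x, so the weight moves off zero
print(zero_conv.bias.grad.abs().max())    # non-zero: so the bias moves off zero as well
```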

2.3. Classifier-free guidance resolution weighting (CFG-RW)

How can we apply classifier-free guidance (CFG) in ControlNet?
Adding the conditioning image to both the conditional and the unconditional branch removes the guidance effect, while adding it only to the conditional branch can make the guidance overly strong, especially when no prompt is given.
The solution is to add the conditioning image only to the conditional branch and to multiply each connection between Stable Diffusion and ControlNet by a weight based on the resolution of the block.
(\(w_i = 64 / h_i\) where \(h_i\) is the size of the i-th block, e.g. \(h_1 = 8, h_2 = 16, \cdots, h_{13} = 64\))
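
A hedged sketch of how these weights could be applied; the block sizes, tensor shapes, and names below are illustrative assumptions, not the official implementation.

```python
import torch

# Illustrative block sizes h_i of the connections into Stable Diffusion
# (assumed values; the paper only gives h_1 = 8, h_2 = 16, ..., h_13 = 64).
block_sizes = [8, 16, 32, 64]
# Stand-ins for ControlNet outputs at each connected block.
residuals = [torch.randn(1, 320, h, h) for h in block_sizes]

# CFG-RW: scale each connection by w_i = 64 / h_i before adding it to the
# conditional branch only (the unconditional branch gets no ControlNet signal).
weighted = [(64.0 / h) * r for h, r in zip(block_sizes, residuals)]

def cfg(eps_c, eps_uc, scale=7.5):
    # Standard classifier-free guidance combination; eps_c is the noise
    # prediction that already includes the weighted ControlNet residuals.
    return eps_uc + scale * (eps_c - eps_uc)
```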

2.4. Composing multiple ControlNets

To apply multiple conditioning images to a single instance of Stable Diffusion, directly add the outputs of the corresponding ControlNets to the Stable Diffusion model.
(no extra weighting or linear interpolation)
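
A minimal sketch of this composition (function and variable names are illustrative, not from the official code):

```python
# Compose multiple ControlNets by summing their per-block outputs directly,
# with no extra weighting or linear interpolation.
def compose_controls(per_controlnet_residuals):
    # per_controlnet_residuals: one list of per-block tensors for each ControlNet,
    # e.g. [pose_residuals, depth_residuals], all with matching block shapes.
    return [sum(blocks) for blocks in zip(*per_controlnet_residuals)]
```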

3. Experiments

3.1. Main results

  • Tasks
    Canny Edge
    Hough Line
    HED Boundary
    User Scribble
    Human Pose
    Semantic Segmentation
    Depth
    Normal Maps
    Cartoon Line Drawing
  • Metrics
    Quality: FID, CLIP-aes, user study
    Fidelity: semantic segmentation label reconstruction, CLIP-T, user study
  • Implementation details
    Convert the input conditioning image from an input size of 512 x 512 into a 64 x 64 feature-space vector
    (train a tiny network to encode the image-space condition into a feature-space conditioning vector; a sketch is given after this list)
    Randomly replace 50% of the text prompts with empty strings to increase ControlNet's ability to directly recognize the semantics of the input conditioning images as a replacement for the prompt
  • Ablations
    Method (architecture design, CFG-RW)
    Training dataset size
    Sudden convergence phenomenon
    (model does not gradually learn the control conditions but abruptly succeeds in following the input conditioning image)
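
As noted under implementation details, here is a minimal sketch of the tiny conditioning encoder. The layer count, channel widths, and output width are illustrative assumptions, not the official configuration: a small stack of strided convolutions maps the 512 x 512 conditioning image to a 64 x 64 feature map matching Stable Diffusion's latent resolution.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Toy conditioning encoder: 512x512 image-space condition -> 64x64 features."""
    def __init__(self, in_channels=3, out_channels=320):  # 320 assumes SD's first U-Net width
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 4, stride=2, padding=1), nn.ReLU(),  # 512 -> 256
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),           # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 128 -> 64
            nn.Conv2d(64, out_channels, 3, stride=1, padding=1),             # project to U-Net width
        )

    def forward(self, cond_image):
        return self.net(cond_image)

# e.g. ConditionEncoder()(torch.randn(1, 3, 512, 512)).shape -> (1, 320, 64, 64)
```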

3.2. Figures

Sudden convergence phenomenon (Figure 21)
CFG-RW (Figure 5)
Composition of multiple conditions (Figure 6)
Stable Diffusion + ControlNet without prompts (Figure 7)
Architecture ablation (Figure 8)
Influence of training dataset size (Figure 22)

3.3. Limitations

  1. Difficult to remove the semantics of the input image (Figure 28)
    When the semantics of the input image are mistakenly recognized, the negative effects seem difficult to eliminate, even if a strong prompt is provided.