DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Accept info: CVPR 2023 Award Candidate
Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman
Affiliation: Google Research, Boston University
Links: arXiv, project page, GitHub
Task: subject-driven image generation
TLDR: Personalization of T2I diffusion models via fine-tuning with rare tokens and class-specific prior preservation loss.
1. Intuition & Motivation
- Goal: given only a few (typically 3-5) casually captured images of a specific subject and no textual description, generate novel photorealistic images of that subject
(ex. recontextualization, accessorization, property modification)
Recently developed large text-to-image diffusion models have shown unprecedented capabilities, enabling high-quality and diverse synthesis of images from a text prompt written in natural language.
While the synthesis capabilities of these models are unprecedented, they lack the ability to mimic the appearance of subjects in a given reference set and to synthesize novel renditions of those same subjects in different contexts.
Even the most detailed textual description of an object may yield instances with different appearances.
Thus, it is natural to infer that the expressiveness of their output domain is limited.
To bind new words to specific subjects, the simplest approach is to fine-tune a pre-trained, diffusion-based text-to-image model.
2. DreamBooth
2.1. Approach overview
- Goal: implant a new (unique identifier, subject) pair into the diffusion model’s dictionary
- Core method: fine-tune pre-trained T2I diffusion models
- Prompt design: a [identifier] [class noun]
  - [identifier]: unique identifier linked to the subject
  - [class noun]: coarse class descriptor of the subject
- Class-specific prior preservation loss
  - to mitigate language drift and reduced output diversity
2.2. Prompt design
Prompt: a [identifier] [class noun]
To bypass the overhead of writing detailed image descriptions, use a simple prompt.
To leverage the model's prior over the specific class and entangle it with the embedding of the subject's unique identifier, use a coarse class descriptor of the subject as the [class noun].
How should we design rare-token identifiers?
Existing English words already carry their original meanings, and thus strong priors.
Thus, an identifier should have a weak prior in both the language model and the diffusion model.
The paper's approach: perform a rare-token lookup in the vocabulary to obtain a sequence of rare-token identifiers.
The sequence can be of variable length \(k\), and relatively short sequences of \(k = \left\{ 1, 2, 3\right\}\) work well.
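A rough sketch of such a rare-token lookup, assuming the Hugging Face `transformers` T5 tokenizer (all T5 checkpoints share one vocabulary); the id range and length cutoff below are illustrative choices in the spirit of the paper's lookup, not an exact reproduction of its procedure.

```python
import random

from transformers import T5Tokenizer

# Load the T5 sentencepiece vocabulary (shared across T5 model sizes).
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Heuristic: scan a mid-range of token ids and keep short subword pieces
# (three or fewer characters) as candidate rare-token identifiers.
candidates = []
for idx in range(5000, 10000):
    tok = tokenizer.convert_ids_to_tokens(idx).replace("\u2581", "")  # strip SP marker
    if 1 <= len(tok) <= 3:
        candidates.append(tok)

identifier = random.choice(candidates)
prompt = f"a {identifier} dog"  # "a [identifier] [class noun]"
print(prompt)
```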
2.3. Class-specific prior preservation loss
Empirically, fine-tuning all layers of the model achieves the best subject fidelity.
However, this causes two problems.
- Language drift
  - the model slowly forgets how to generate the subject's class
- Reduced output diversity
  - fine-tuning on a small set of images reduces the amount of variability (ex. pose, view)
To mitigate these two issues, the authors propose a class-specific prior preservation loss (the second term of the equation below).
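Reproduced from the paper (up to minor notational differences), the full fine-tuning objective is

\[
\mathbb{E}_{\mathbf{x}, \mathbf{c}, \boldsymbol{\epsilon}, \boldsymbol{\epsilon}', t}\Big[ w_t \lVert \hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \mathbf{c}) - \mathbf{x} \rVert_2^2 + \lambda\, w_{t'} \lVert \hat{\mathbf{x}}_\theta(\alpha_{t'} \mathbf{x}_{\mathrm{pr}} + \sigma_{t'} \boldsymbol{\epsilon}', \mathbf{c}_{\mathrm{pr}}) - \mathbf{x}_{\mathrm{pr}} \rVert_2^2 \Big],
\]

where \(\mathbf{x}\) are the subject images with conditioning \(\mathbf{c}\) ("a [identifier] [class noun]"), \(\mathbf{x}_{\mathrm{pr}}\) are images generated by the frozen pre-trained model from the class conditioning \(\mathbf{c}_{\mathrm{pr}}\) ("a [class noun]"), and \(\lambda\) controls the relative weight of the prior-preservation term.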
The core idea of this loss is to also fine-tune the model on images of the subject's class, generated by the frozen pre-trained model itself.
By training on its own generated samples, the fine-tuned model retains the original prior over the subject's class, which prevents both language drift and reduced output diversity.
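A minimal sketch of one fine-tuning step with this loss, written in the common \(\epsilon\)-prediction form rather than the paper's \(\hat{\mathbf{x}}\)-prediction notation. The scheduler interface follows Hugging Face `diffusers`; `unet`, `noise_scheduler`, and the batch keys are placeholder names, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dreambooth_step(unet, noise_scheduler, batch, lambda_prior=1.0):
    """One training step: subject loss + class-specific prior preservation loss."""
    # First half of the batch: subject latents + "a [identifier] [class noun]" embeddings.
    # Second half: class latents generated by the frozen model + "a [class noun]" embeddings.
    latents = torch.cat([batch["subject_latents"], batch["class_latents"]])
    text_emb = torch.cat([batch["subject_text_emb"], batch["class_text_emb"]])

    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # Predict the added noise, conditioned on the text embeddings.
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample

    pred_subject, pred_prior = pred.chunk(2)
    noise_subject, noise_prior = noise.chunk(2)

    loss_subject = F.mse_loss(pred_subject, noise_subject)
    loss_prior = F.mse_loss(pred_prior, noise_prior)
    return loss_subject + lambda_prior * loss_prior  # lambda = 1 in the paper
```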
3. Experiments
3.1. Main results
- Dataset
  - 30 subjects in total, 25 prompts per subject
  - Objects (21): recontextualization (20), property modification (5)
  - Live subjects/pets (9): recontextualization (10), accessorization (10), property modification (5)
- Evaluation
  - Generate 4 images per subject and per prompt, 3,000 images in total
- Metrics
  - Subject fidelity: CLIP-I, DINO, user study
  - Prompt fidelity: CLIP-T, user study
  - Image diversity: LPIPS
- Implementation details
  - Train for ~1,000 iterations
  - Use relative weight \(\lambda = 1\) for the prior preservation loss
  - Use ViT-S/16 DINO for the DINO metric
  - For Stable Diffusion, fine-tune the U-Net (and possibly the text encoder)
  - Generate ~1,000 images with the text prompt "a [class noun]" for the class-specific prior preservation loss (see the sketch after this list)
- Ablations
  - Method (prior preservation loss, class prior)
  - Effect of training images
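As referenced above, a hedged sketch of generating the class-prior images with Hugging Face `diffusers`. The Stable Diffusion checkpoint id, batch size, and output paths are illustrative assumptions; the paper also applies the method to Imagen.

```python
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_prompt = "a dog"  # "a [class noun]" for the subject's class
num_class_images, batch_size = 1000, 4
out_dir = Path("class_images")
out_dir.mkdir(exist_ok=True)

# Sample class images from the pre-trained (not yet fine-tuned) model; these
# serve as targets for the prior preservation term during fine-tuning.
for start in range(0, num_class_images, batch_size):
    images = pipe([class_prompt] * batch_size).images
    for offset, image in enumerate(images):
        image.save(out_dir / f"{start + offset:04d}.png")
```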
3.2. Applications
Recontextualization
Art renditions
Expression manipulation
Novel view synthesis
Accessorization
Property modification
Comic book generation
3.3. Limitations
- Incorrect context synthesis (Figure 9 (a))
  - Possible reasons: weak prior of the context, difficulty in generating both the subject and the specified concept together
- Context-appearance entanglement (Figure 9 (b))
- Overfitting to real images (Figure 9 (c))
  - Overfitting is observed when the prompt is similar to the original setting
- Dependency on the base model
  - For rare subjects, the model is unable to support as many subject variations
  - Variability in the fidelity of the subject
  - Hallucinated subject features
4. Appendix
4.1. Subject fidelity metrics
CLIP is not constructed to distinguish between different subjects that could have highly similar text descriptions.
In contrast, DINO is trained in a self-supervised manner to distinguish different images from each other, modulo data augmentations.
Thus, the DINO metric is superior to CLIP-I in terms of measuring subject fidelity.
To test this quantitatively, the authors compute correlations between DINO/CLIP-I scores and normalized human preference scores.
Pearson correlation coefficient: DINO (0.32) > CLIP-I (0.27)
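For concreteness, a sketch of the DINO subject-fidelity score described above: the average pairwise cosine similarity between ViT-S/16 DINO embeddings of generated and real subject images. Loading the backbone via `torch.hub` and the preprocessing below are assumptions, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# ViT-S/16 DINO backbone from the official release; the forward pass returns
# the [CLS] embedding for each image.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return F.normalize(model(batch), dim=-1)  # L2-normalized embeddings

def dino_score(generated_paths, real_paths):
    gen, real = embed(generated_paths), embed(real_paths)
    return (gen @ real.T).mean().item()  # mean pairwise cosine similarity
```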