DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Accept info: CVPR 2023 Award Candidate
Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman
Affiliation: Google Research, Boston University
Links: arXiv, project page, GitHub
Task: subject-driven image generation
TLDR: Personalization of T2I diffusion models via fine-tuning with rare tokens and class-specific prior preservation loss.
1. Intuition & Motivation
- Goal: given only a few (typically 3-5) casually captured images of a specific subject and no textual description, generate novel photorealistic images of that subject
(ex. recontextualization, accessorization, property modification)
Recently developed large text-to-image diffusion models have shown unprecedented capabilities, enabling high-quality and diverse synthesis of images from a text prompt written in natural language.
While the synthesis capabilities of these models are unprecedented, they lack the ability to mimic the appearance of subjects in a given reference set and to synthesize novel renditions of those same subjects in different contexts.
Even the most detailed textual description of an object may yield instances with different appearances.
Thus, it is natural to infer that the expressiveness of their output domain is limited.
To bind new words to specific subjects, the simplest approach is to fine-tune a pre-trained, diffusion-based text-to-image model.
2. DreamBooth
2.1. Approach overview
- Goal: implant a new (unique identifier, subject) pair into the diffusion model’s dictionary
- Core method: fine-tune pre-trained T2I diffusion models
- Prompt design: a [identifier] [class noun]
  - [identifier]: unique identifier linked to the subject
  - [class noun]: coarse class descriptor of the subject
- Class-specific prior preservation loss
  - to mitigate language drift and reduced output diversity
2.2. Prompt design
Prompt: a [identifier] [class noun]
To bypass the overhead of writing detailed image descriptions, use a simple prompt.
To leverage the model's prior over the specific class and entangle it with the embedding of the subject's unique identifier, use a coarse class descriptor of the subject as the [class noun].
How should we design rare-token identifiers?
Existing English words already carry their original meanings, and thus strong priors.
Thus, an identifier should have a weak prior in both the language model and the diffusion model.
The paper's approach: perform a rare-token lookup in the vocabulary to obtain a sequence of rare-token identifiers.
The sequence can be of variable length \(k\), and relatively short sequences of \(k = \left\{ 1, 2, 3\right\}\) work well.
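A rough sketch of such a rare-token lookup, assuming the Hugging Face `transformers` T5 tokenizer (all T5 checkpoints share one vocabulary); the id range and length cutoff below are illustrative choices in the spirit of the paper's lookup, not an exact reproduction of its procedure.

```python
import random

from transformers import T5Tokenizer

# Load the T5 sentencepiece vocabulary (shared across T5 model sizes).
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Heuristic: scan a mid-range of token ids and keep short subword pieces
# (three or fewer characters) as candidate rare-token identifiers.
candidates = []
for idx in range(5000, 10000):
    tok = tokenizer.convert_ids_to_tokens(idx).replace("\u2581", "")  # strip SP marker
    if 1 <= len(tok) <= 3:
        candidates.append(tok)

identifier = random.choice(candidates)
prompt = f"a {identifier} dog"  # "a [identifier] [class noun]"
print(prompt)
```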
2.3. Class-specific prior preservation loss
Empirically, fine-tuning all layers of the model achieves the best subject fidelity.
However, this causes two problems.
- Language drift
  - the model slowly forgets how to generate the subject's class
- Reduced output diversity
  - fine-tuning on a small set of images reduces the amount of variability (ex. pose, view)
To mitigate these two issues, the authors propose a class-specific prior preservation loss (the second term of the equation below).
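Reproduced from the paper (up to minor notational differences), the full fine-tuning objective is

\[
\mathbb{E}_{\mathbf{x}, \mathbf{c}, \boldsymbol{\epsilon}, \boldsymbol{\epsilon}', t}\Big[ w_t \lVert \hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \mathbf{c}) - \mathbf{x} \rVert_2^2 + \lambda\, w_{t'} \lVert \hat{\mathbf{x}}_\theta(\alpha_{t'} \mathbf{x}_{\mathrm{pr}} + \sigma_{t'} \boldsymbol{\epsilon}', \mathbf{c}_{\mathrm{pr}}) - \mathbf{x}_{\mathrm{pr}} \rVert_2^2 \Big],
\]

where \(\mathbf{x}\) are the subject images with conditioning \(\mathbf{c}\) ("a [identifier] [class noun]"), \(\mathbf{x}_{\mathrm{pr}}\) are images generated by the frozen pre-trained model from the class conditioning \(\mathbf{c}_{\mathrm{pr}}\) ("a [class noun]"), and \(\lambda\) controls the relative weight of the prior-preservation term.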
The core idea of this loss is to also fine-tune the model on images of the subject's class, generated by the frozen pre-trained model itself.
By training on its own generated samples, the fine-tuned model retains the original prior over the subject's class, which prevents both language drift and reduced output diversity.
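A minimal sketch of one fine-tuning step with this loss, written in the common \(\epsilon\)-prediction form rather than the paper's \(\hat{\mathbf{x}}\)-prediction notation. The scheduler interface follows Hugging Face `diffusers`; `unet`, `noise_scheduler`, and the batch keys are placeholder names, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dreambooth_step(unet, noise_scheduler, batch, lambda_prior=1.0):
    """One training step: subject loss + class-specific prior preservation loss."""
    # First half of the batch: subject latents + "a [identifier] [class noun]" embeddings.
    # Second half: class latents generated by the frozen model + "a [class noun]" embeddings.
    latents = torch.cat([batch["subject_latents"], batch["class_latents"]])
    text_emb = torch.cat([batch["subject_text_emb"], batch["class_text_emb"]])

    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # Predict the added noise, conditioned on the text embeddings.
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample

    pred_subject, pred_prior = pred.chunk(2)
    noise_subject, noise_prior = noise.chunk(2)

    loss_subject = F.mse_loss(pred_subject, noise_subject)
    loss_prior = F.mse_loss(pred_prior, noise_prior)
    return loss_subject + lambda_prior * loss_prior  # lambda = 1 in the paper
```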
3. Experiments
3.1. Main results
- Dataset
  - 30 subjects in total, 25 prompts per subject
  - Objects (21): recontextualization (20), property modification (5)
  - Live subjects/pets (9): recontextualization (10), accessorization (10), property modification (5)
- Evaluation
  - Generate 4 images per subject and per prompt, 3,000 images in total
- Metrics
  - Subject fidelity: CLIP-I, DINO, user study
  - Prompt fidelity: CLIP-T, user study
  - Image diversity: LPIPS
- Implementation details
  - Train for ~1,000 iterations
  - Use relative weight \(\lambda = 1\) for the prior preservation loss
  - Use ViT-S/16 DINO for the DINO metric
  - For Stable Diffusion, fine-tune the U-Net (and possibly the text encoder)
  - Generate ~1,000 images with the text prompt "a [class noun]" for the class-specific prior preservation loss (see the sketch after this list)
- Ablations
  - Method (prior preservation loss, class prior)
  - Effect of training images
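As referenced above, a hedged sketch of generating the class-prior images with Hugging Face `diffusers`. The Stable Diffusion checkpoint id, batch size, and output paths are illustrative assumptions; the paper also applies the method to Imagen.

```python
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_prompt = "a dog"  # "a [class noun]" for the subject's class
num_class_images, batch_size = 1000, 4
out_dir = Path("class_images")
out_dir.mkdir(exist_ok=True)

# Sample class images from the pre-trained (not yet fine-tuned) model; these
# serve as targets for the prior preservation term during fine-tuning.
for start in range(0, num_class_images, batch_size):
    images = pipe([class_prompt] * batch_size).images
    for offset, image in enumerate(images):
        image.save(out_dir / f"{start + offset:04d}.png")
```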
3.2. Applications
Recontextualization
Art renditions
Expression manipulation
Novel view synthesis
Accessorization
Property modification
Comic book generation
3.3. Limitations
- Incorrect context synthesis (Figure 9 (a))
  - Possible reasons: weak prior of the context, difficulty in generating both the subject and the specified concept together
- Context-appearance entanglement (Figure 9 (b))
- Overfitting to real images (Figure 9 (c))
  - Overfitting is observed when the prompt is similar to the original setting
- Dependency on the base model
  - For rare subjects, the model is unable to support as many subject variations
  - Variability in the fidelity of the subject
  - Hallucinated subject features
4. Appendix
4.1. Subject fidelity metrics
CLIP is not constructed to distinguish between different subjects that could have highly similar text descriptions.
In contrast, DINO is trained in a self-supervised manner to distinguish different images from each other, modulo data augmentations.
Thus, the DINO metric is superior to CLIP-I in terms of measuring subject fidelity.
To test this quantitatively, the authors compute correlations between DINO/CLIP-I scores and normalized human preference scores.
Pearson correlation coefficient: DINO (0.32) > CLIP-I (0.27)
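For concreteness, a sketch of the DINO subject-fidelity score described above: the average pairwise cosine similarity between ViT-S/16 DINO embeddings of generated and real subject images. Loading the backbone via `torch.hub` and the preprocessing below are assumptions, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# ViT-S/16 DINO backbone from the official release; the forward pass returns
# the [CLS] embedding for each image.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return F.normalize(model(batch), dim=-1)  # L2-normalized embeddings

def dino_score(generated_paths, real_paths):
    gen, real = embed(generated_paths), embed(real_paths)
    return (gen @ real.T).mean().item()  # mean pairwise cosine similarity
```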