An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Accept info: ICLR 2023 Spotlight
Authors: Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or
Affiliation: Tel-Aviv University, NVIDIA
Links: arXiv, OpenReview, project page, GitHub
Task: personalized text-to-image generation
TLDR: Personalized text-to-image generation via optimizing only a single word embedding.
1. Intuition & Motivation
- Goal: language-guided generation of new, user-specific concepts
(e.g., text-guided personalized generation, style transfer, concept compositions, bias reduction)
Recently, large-scale text-to-image models have demonstrated an unprecedented capability to reason over natural language descriptions.
However, generating a desired target, such as a user-specific concept, through text alone is quite difficult (see Figure 3).
To overcome this challenge, it is natural to train the T2I model to learn new concepts.
The three most common approaches are:
- Re-training the model with an expanded dataset, which is prohibitively expensive.
- Fine-tuning on a few examples, which typically leads to catastrophic forgetting.
- Training an adapter or external mapper, though prior works face difficulties such as accessing the newly learned concepts with free-form prompts.
Since re-training or fine-tuning the T2I model has these limitations, the authors frame the task as an inversion: the new concepts are inverted into pseudo-words within the textual embedding space of a frozen, pre-trained text-to-image model.
2. Textual Inversion
2.1. Approach overview
- Goal: find pseudo-words that encode new, user-specified concepts
- Core method: find the pseudo-words through a visual reconstruction objective (a setup sketch follows below)
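As a rough illustration (not the authors' LDM code), the snippet below shows how a placeholder pseudo-word can be registered in a CLIP-style text encoder, with its embedding initialized from a coarse descriptor; the token name `<my-concept>`, the initializer word, and the use of the Stable Diffusion text encoder are all assumptions.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumed components: a CLIP text encoder of the kind used by Stable Diffusion
# (the paper itself inverts into LDM's text-embedding space).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a placeholder pseudo-word and give it its own embedding row.
placeholder = "<my-concept>"   # hypothetical pseudo-word S*
initializer = "cat"            # coarse single-word descriptor of the object
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))

token_embeds = text_encoder.get_input_embeddings().weight
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder)
init_id = tokenizer.convert_tokens_to_ids(initializer)
with torch.no_grad():
    token_embeds[placeholder_id] = token_embeds[init_id].clone()

# Everything stays frozen except the embedding table; during training, only the
# placeholder row (v_*) is allowed to change (see the loop sketch in 2.2).
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().requires_grad_(True)
```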
2.2. Objective
Find \(v_*\) through direct optimization, by minimizing the LDM denoising loss:
\[v_* = \textrm{argmin}_{v} \, \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t} \left[ \left\| \epsilon - \epsilon_{\theta}(z_t, t, c_{\theta}(y)) \right\|_2^2 \right]\]
Re-using the same training scheme as the original LDM model motivates the learned embedding to capture fine visual details unique to the concept.
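Continuing the setup sketch above, a minimal training loop might look as follows, assuming diffusers-style `vae`, `unet`, and `noise_scheduler` modules and a small `train_dataloader` of (image, tokenized prompt) pairs; hyperparameters are illustrative, and only the placeholder row of the embedding table is effectively updated.

```python
import torch
import torch.nn.functional as F

# Assumed frozen modules: vae (the encoder E), unet (eps_theta), a DDPM-style
# noise_scheduler, and a dataloader whose prompts all contain the placeholder.
optimizer = torch.optim.AdamW([text_encoder.get_input_embeddings().weight], lr=5e-3)
orig_embeds = text_encoder.get_input_embeddings().weight.detach().clone()

for images, input_ids in train_dataloader:
    # z ~ E(x): encode the image into the latent space (SD's scaling factor).
    latents = vae.encode(images).latent_dist.sample() * 0.18215

    # Sample noise eps and a timestep t, then form the noised latent z_t.
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # c_theta(y): text conditioning from the prompt containing the pseudo-word.
    cond = text_encoder(input_ids)[0]

    # LDM loss: || eps - eps_theta(z_t, t, c_theta(y)) ||_2^2
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
    loss = F.mse_loss(noise_pred, noise)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Restore all embedding rows except v_*, so only the pseudo-word is learned.
    with torch.no_grad():
        keep = torch.arange(len(tokenizer)) != placeholder_id
        text_encoder.get_input_embeddings().weight[keep] = orig_embeds[keep]
```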
3. Experiments
3.1. Qualitative results
Results are partially curated: for each prompt, 16 candidates are generated and the best one is manually selected (see the sampling sketch after the list below).
- Text-guided synthesis (Figure 4)
- Style transfer (Figure 6)
- Concept compositions (Figure 7)
- Bias reduction (Figure 8)
- Downstream applications (Figure 9)
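The paper's qualitative results use LDM; as a hedged illustration of the same curation recipe, the sketch below samples 16 candidates per prompt with the Hugging Face diffusers Stable Diffusion pipeline (the checkpoint path and the `<my-concept>` token are placeholders, not the authors' setup).

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Hypothetical setup: load a pipeline plus a learned pseudo-word embedding.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.load_textual_inversion("learned_embeds.bin", token="<my-concept>")

# Generate 16 candidates for one prompt; the best image is then picked by hand.
prompt = "an oil painting of <my-concept>"
images = pipe(prompt, num_images_per_prompt=16, num_inference_steps=50).images
```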
3.2. Quantitative analysis
- Metrics (see the sketch after this list)
Reconstruction (ability to replicate the target concept): CLIP-I, user study
Editability (ability to modify the concept using textual prompts): CLIP-T, user study
- Evaluation
Generate 64 samples using 50 DDIM steps per prompt
- Implementation details
Use LDM as the base model
5,000 optimization steps
Word embeddings were initialized with the embedding of a single-word coarse descriptor of the object
- Ablations
Inversion method
Training dataset (training set size, training image diversity, training prompts)
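As a hedged sketch of how CLIP-I- and CLIP-T-style scores can be computed with an off-the-shelf CLIP model (the paper's exact prompt handling and averaging protocol may differ):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_embeds(pil_images):
    inputs = processor(images=pil_images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def text_embeds(prompts):
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def clip_i(generated, references):
    # Reconstruction: mean pairwise cosine similarity between generated
    # samples and the concept's training images.
    return (image_embeds(generated) @ image_embeds(references).T).mean().item()

def clip_t(generated, prompts_without_placeholder):
    # Editability: mean cosine similarity between each generated image and
    # its prompt (with the pseudo-word omitted).
    g, t = image_embeds(generated), text_embeds(prompts_without_placeholder)
    return (g * t).sum(dim=-1).mean().item()
```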
3.3. Limitations
- Typical failure cases: difficult relational prompts
- Learning a single concept requires roughly two hours.
- Since CLIP-based similarity is less sensitive to shape preservation, CLIP-I alone is not a fully reliable reconstruction metric.
- Textual Inversion may still struggle with learning precise shapes, instead incorporating the semantic essence of a concept.
- In contrast to the baseline LDM model, inverted Stable Diffusion embeddings tend to dominate the prompt and become more difficult to integrate into new, simple prompts.