An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Accept info: ICLR 2023 Spotlight
Authors: Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or
Affiliation: Tel-Aviv University, NVIDIA
Links: arXiv, OpenReview, project page, GitHub
Task: personalized text-to-image generation
TLDR: Personalized text-to-image generation via optimizing only a single word embedding.

1. Intuition & Motivation

  • Goal: language-guided generation of new, user-specific concepts
    (e.g., text-guided personalized generation, style transfer, concept compositions, bias reduction)

[Figure 3]

Recently, large-scale text-to-image models have demonstrated an unprecedented capability to reason over natural language descriptions.
However, generating a desired target, such as a user-specific concept, through text alone is quite difficult (see Figure 3).

To overcome this challenge, a natural solution is to teach the T2I model the new concepts.
The three most common approaches are:

  1. Re-training the model with an expanded dataset, which is prohibitively expensive.
  2. Fine-tuning on a few examples, which typically leads to catastrophic forgetting.
  3. Training an adapter module, though previous works face difficulties such as accessing newly learned concepts.

Since training the T2I model has these limitations, the authors instead frame the task as an inversion problem: inverting the concept into a new pseudo-word within the textual embedding space of a pre-trained text-to-image model.

2. Textual Inversion

[Figure 2]

2.1. Approach overview

  • Goal: find pseudo-words that encode new, user-specified concepts
  • Core method: find pseudo-words through a visual reconstruction objective
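
A minimal PyTorch sketch of this mechanism is given below. It is an illustration, not the authors' code: a new placeholder token S* is added to the vocabulary, and only its embedding vector \(v_*\) is optimized while the text encoder and diffusion model stay frozen. The names `embedding_table`, `token_ids`, and `placeholder_id` are hypothetical stand-ins for the frozen input-embedding layer and a tokenized prompt containing S*.

```python
import torch
import torch.nn as nn

def add_pseudo_word(embedding_table: nn.Embedding, init_token_id: int) -> nn.Parameter:
    """Create the trainable embedding v*, initialized from the row of a coarse
    descriptor of the concept (e.g. the embedding of "sculpture" or "cat")."""
    init_vec = embedding_table.weight[init_token_id].detach().clone()
    return nn.Parameter(init_vec)

def embed_prompt(embedding_table: nn.Embedding, token_ids: torch.Tensor,
                 placeholder_id: int, v_star: nn.Parameter) -> torch.Tensor:
    """Embed a tokenized prompt, substituting v* at every position of the
    placeholder token S*; every other row of the table stays frozen."""
    embeds = embedding_table(token_ids)                        # (B, T, D)
    mask = (token_ids == placeholder_id).unsqueeze(-1)         # (B, T, 1)
    return torch.where(mask, v_star.to(embeds.dtype), embeds)  # only v* receives gradients
```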

2.2. Objective

\[v_* = \textrm{argmin}_{v} \mathbb{E}_{z \sim \mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0,1), t} \left[ \left\| \epsilon - \epsilon_{\theta}(z_t, t, c_{\theta}(y)) \right\|_2^2 \right]\]

\(v_*\) is found through direct optimization, by minimizing the LDM denoising loss over a small set (3-5) of images depicting the concept.
Re-using the same training objective as the original LDM model encourages the learned embedding to capture fine visual details unique to the concept.
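
Below is a minimal sketch of the resulting training loop, reusing `embed_prompt` from the sketch in Section 2.1. It is illustrative rather than the official script: `unet`, `text_encoder`, and `noise_scheduler` stand in for the frozen LDM components, `concept_latents` for the pre-encoded VAE latents of the 3-5 concept images, and `prompt_ids` for tokenized neutral templates such as "a photo of S*"; hyperparameters are indicative only.

```python
import torch
import torch.nn.functional as F

def train_pseudo_word(v_star, unet, text_encoder, embedding_table,
                      concept_latents, prompt_ids, placeholder_id,
                      noise_scheduler, num_steps=5000, lr=5e-3):
    """Optimize only v_star under the frozen LDM denoising objective."""
    optimizer = torch.optim.AdamW([v_star], lr=lr)  # v_star is the only trained parameter
    for _ in range(num_steps):
        # Pick a random concept image (pre-encoded latent) and prompt template.
        z0 = concept_latents[torch.randint(len(concept_latents), (1,))]
        ids = prompt_ids[torch.randint(len(prompt_ids), (1,))]

        # Forward diffusion: noise the clean latent at a random timestep t.
        t = torch.randint(noise_scheduler.num_timesteps, (1,))
        noise = torch.randn_like(z0)
        zt = noise_scheduler.add_noise(z0, noise, t)

        # Conditioning: the frozen text encoder sees embeddings with v_star swapped in.
        cond = text_encoder(embed_prompt(embedding_table, ids, placeholder_id, v_star))

        # Standard LDM loss; gradients flow only into v_star.
        loss = F.mse_loss(unet(zt, t, cond), noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return v_star
```

At inference time, the learned \(v_*\) is substituted in the same way, so the pseudo-word can be composed into arbitrary prompts (e.g. "an oil painting of S*").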

3. Experiments

3.1. Qualitative results

Results are partially curated: for each prompt, 16 candidates are generated and the best result is manually selected.

Text-guided synthesis (Figure 4)
Style transfer (Figure 6)
Concept compositions (Figure 7)
Bias reduction (Figure 8)
Downstream applications (Figure 9)

3.2. Quantitative analysis

  • Metrics (see the sketch after this list)
    Reconstruction (ability to replicate the target concept): CLIP-I, user study
    Editability (ability to modify the concepts using textual prompts): CLIP-T, user study
  • Evaluation
    Generate 64 samples per prompt using 50 DDIM steps
  • Implementation details
    Use LDM
    5,000 optimization steps
    Word embeddings were initialized with the embeddings of a single-word coarse descriptor of the object
  • Ablations
    Inversion method
    Training dataset (training set size, training image diversity, training prompts)
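
A rough sketch of the two CLIP-based metrics is given below. The function `clip_image_embed` and the precomputed `prompt_embedding` are assumed to come from a pretrained CLIP model; the paper's exact protocol (e.g. how the placeholder is handled in the prompt before encoding) may differ in details.

```python
import torch.nn.functional as F

def clip_i(generated_images, concept_images, clip_image_embed):
    """Reconstruction: mean pairwise cosine similarity between CLIP embeddings
    of generated samples and of the real concept images."""
    g = F.normalize(clip_image_embed(generated_images), dim=-1)  # (N, D)
    r = F.normalize(clip_image_embed(concept_images), dim=-1)    # (M, D)
    return (g @ r.t()).mean()

def clip_t(generated_images, prompt_embedding, clip_image_embed):
    """Editability: mean cosine similarity between generated samples and the
    CLIP embedding of the prompt text (with the pseudo-word omitted)."""
    g = F.normalize(clip_image_embed(generated_images), dim=-1)  # (N, D)
    p = F.normalize(prompt_embedding, dim=-1)                    # (D,)
    return (g @ p).mean()
```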

3.3. Limitations

  1. Typical failure cases: difficult relational prompts
  2. Learning a single concept requires roughly two hours.
  3. Since CLIP is less sensitive to shape preservation, CLIP-I is not a fully reliable reconstruction metric.
  4. Textual Inversion may still struggle with learning precise shapes, instead incorporating the semantic essence of a concept.
  5. In contrast to the baseline LDM model, embeddings inverted with Stable Diffusion tend to dominate the prompt and are more difficult to integrate into new, simple prompts.