Prompt-to-Prompt Image Editing with Cross Attention Control
Accept info: ICLR 2023 Spotlight
Authors: Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or
Affiliation: Google Research, Tel-Aviv University
Links: arXiv, OpenReview, project page, GitHub
Task: text-driven image editing
TLDR: Text-driven image editing via injecting the cross-attention maps of a pre-trained T2I diffusion model.
1. Intuition & Motivation
Text-driven image editing is challenging because even a small modification of the text prompt often leads to a completely different output image.
Previous works mitigate this issue by using a spatial mask to restrict where edits occur.
However, these methods ignore the original structure and content within the masked region.
A natural question follows: how can we edit an image while preserving its original structure and content?
To do so, it is natural to seek approaches that implicitly determine where to apply edits, without requiring a user-provided mask.
Key observation: spatial layout and geometry of the generated image depend on the cross-attention maps.
Interestingly, the structure of the image is already determined in the early steps of the diffusion process.
This suggests an idea: edit images by injecting the cross-attention maps obtained with the original prompt into the generation driven by the modified prompt.
2. Prompt-to-Prompt
2.1. Approach overview
- Goal: editing the input image guided only by the edited text prompt
- Core method: during generation with the edited prompt, use the cross-attention maps of the original prompt rather than those of the edited prompt
2.2. Cross-attention in text-conditioned diffusion models
Deep spatial features of the noisy image \(\phi(z_t)\) are projected to a query matrix \(Q = l_{Q}(\phi(z_t))\).
Text embedding \(\psi(P)\) is projected to a key matrix \(K = l_{K}(\psi(P))\) and a value matrix \(V = l_{V}(\psi(P))\).
(\(l_{Q}, l_{K}, l_{V}\) are learned linear projections)
The cross-attention maps are then \(M = \mathrm{Softmax}(\frac{QK^T}{\sqrt{d}})\).
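The computation above can be sketched in a few lines of NumPy (a minimal sketch with random matrices standing in for the learned projections \(l_Q, l_K\); function names are my own, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_maps(phi_z, psi_p, W_q, W_k):
    """Compute M = softmax(Q K^T / sqrt(d)).

    phi_z: (n_pixels, c)  spatial features of the noisy latent
    psi_p: (n_tokens, e)  text embedding of the prompt
    W_q: (c, d), W_k: (e, d)  learned linear projections l_Q, l_K
    """
    Q = phi_z @ W_q                        # (n_pixels, d)
    K = psi_p @ W_k                        # (n_tokens, d)
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d))   # (n_pixels, n_tokens)
```

Each row of \(M\) is a distribution over prompt tokens for one spatial location, which is why the maps capture layout.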
2.3. Controlling the cross-attention
Inject the cross-attention maps \(M\) that were obtained from the generation with the original prompt \(P\), into a second generation with the modified prompt \(P^*\).
(attention injection is applied only over the tokens common to both prompts)
This allows the synthesis of an edited image \(I^*\) that is not only manipulated according to the edited prompt, but also preserves the structure of the input image \(I\).
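The dual-generation procedure can be illustrated with a toy sketch: two diffusions run in parallel from the same seed, and at each step the edited branch's cross-attention maps are overridden by an edit function of the source maps. The stub denoiser and all names here are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def toy_denoise(z, prompt, t, M_override=None):
    """Stub denoiser: computes toy cross-attention maps M over the prompt
    tokens and uses them (or an injected override) to update the latent."""
    scores = np.outer(z, prompt)                     # (n_pixels, n_tokens)
    M = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    if M_override is not None:
        M = M_override                               # cross-attention injection
    z_next = z + 0.1 * (M @ prompt - z)              # toy denoising update
    return z_next, M

def generate_pair(z_T, prompt, prompt_star, edit_fn, T):
    """Run two generations in parallel from the same seed z_T,
    overriding the edited branch's maps via edit_fn(M, M_star, t)."""
    z, z_star = z_T.copy(), z_T.copy()
    for t in range(T, 0, -1):
        _, M_star = toy_denoise(z_star, prompt_star, t)   # target maps M*
        z, M = toy_denoise(z, prompt, t)                  # source step, maps M
        z_star, _ = toy_denoise(z_star, prompt_star, t,
                                M_override=edit_fn(M, M_star, t))
    return z, z_star
```

With identical prompts and full injection, the two branches stay identical, which is the sanity check that the shared maps fully determine the toy layout.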
2.4. Edit function per application
- Word swap
Attention injection may over-constrain the geometry, especially when the edit requires a large structural modification.
To address this, a softer attention constraint is used, limiting the number of steps in which injection is applied:
\(Edit(M_t, M_t^*, t) := \left\{\begin{matrix} M_t^* & t < \tau \\ M_t & t \geq \tau \\ \end{matrix}\right.\)
(\(\tau\): timestep parameter that determines until which step the injection is applied)
- Adding a new phrase
To preserve common details, apply the attention injection only over the common tokens from both prompts.
\((Edit(M_t, M_t^*, t))_{i, j} := \left\{\begin{matrix} (M_t^*)_{i, j} & A(j) = None \\ (M_t)_{i, A(j)} & A(j) \neq None \\ \end{matrix}\right.\)
(the alignment function \(A\) receives a token index from the target prompt \(P^*\) and outputs the corresponding token index in \(P\), or \(None\) if there is no match)
- Attention re-weighting
Strengthens or weakens the extent to which each token affects the resulting image.
\((Edit(M_t, M_t^*, t))_{i, j} := \left\{\begin{matrix} c * (M_t)_{i, j} & j = j^* \\ (M_t)_{i, j} & j \neq j^* \\ \end{matrix}\right.\)
(scale the attention map of the assigned token \(j^*\) with parameter \(c \in [-2, 2]\), resulting in a stronger/weaker effect)
- Real image editing
Given a real image \(x_0\), use DDIM inversion, which performs the diffusion process in the reverse direction to recover an initial noise latent that reconstructs the image.
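The three edit functions above, together with a single DDIM inversion step, can be sketched as follows (a minimal NumPy sketch; function names are my own, and `eps` stands for the model's noise prediction \(\epsilon_\theta(x_t, t)\), passed in so the step stays model-agnostic):

```python
import numpy as np

def edit_word_swap(M_t, M_star_t, t, tau):
    """Word swap: inject the source maps M_t while t >= tau,
    then let the target maps M_star_t take over for t < tau."""
    return M_star_t if t < tau else M_t

def edit_add_phrase(M_t, M_star_t, alignment):
    """Adding a new phrase: for each target token j, copy the source
    column A(j) when a match exists, otherwise keep the target column.
    alignment[j] is the matching source token index or None."""
    out = M_star_t.copy()
    for j, a in enumerate(alignment):
        if a is not None:
            out[:, j] = M_t[:, a]
    return out

def edit_reweight(M_t, j_star, c):
    """Attention re-weighting: scale the column of token j* by c in [-2, 2]."""
    out = M_t.copy()
    out[:, j_star] = c * out[:, j_star]
    return out

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM step run in the reverse (noising) direction,
    reusing the noise prediction eps = eps_theta(x_t, t) for step t+1."""
    x0_pred = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_next) * x0_pred + np.sqrt(1 - alpha_bar_next) * eps
```

If the noise prediction were exact, the inversion step would map \(x_t\) onto the same deterministic DDIM trajectory at the next (noisier) timestep; in practice the approximation error accumulates, which is the inversion inaccuracy discussed in the limitations below.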
3. Experiments
3.1. Main results
- No quantitative results, only qualitative.
- Ablation study of the method (attention injection)
3.2. Applications
Text-Only Localized Editing
Global Editing
Fader Control
Real Image Editing
Ablation - Attention Injection
3.3. Limitations
- Inaccurate DDIM inversion (Figure 11): the inversion is not sufficiently accurate in many cases.
- Cannot handle complicated structural modifications: the method cannot spatially move existing objects across the image (e.g., editing an image of a sitting dog into a standing dog).