Prompt-to-Prompt Image Editing with Cross Attention Control
Accept info: ICLR 2023 Spotlight
Authors: Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or
Affiliation: Google Research, Tel-Aviv University
Links: arXiv, OpenReview, project page, GitHub
Task: text-driven image editing
TLDR: Text-driven image editing via injecting the cross-attention maps of a pre-trained T2I diffusion model.
1. Intuition & Motivation
Text-driven image editing is challenging because even a small modification of the text prompt often leads to a completely different output image.
Previous works mitigate this issue by using a spatial mask to restrict where edits occur.
However, these methods ignore the original structure and content within the masked region.
A natural question follows: how can we edit an image while preserving its original structure and content?
To do so, it is natural to seek approaches that implicitly determine where to apply edits, without requiring a user-provided mask.
Key observation: spatial layout and geometry of the generated image depend on the cross-attention maps.
Interestingly, the structure of the image is already determined in the early steps of the diffusion process.
This suggests an idea: edit images by injecting the cross-attention maps obtained with the original prompt into the generation driven by the modified prompt.
2. Prompt-to-Prompt
2.1. Approach overview
- Goal: editing the input image guided only by the edited text prompt
- Core method: during generation with the edited prompt, use the cross-attention maps of the original prompt rather than those of the edited prompt
2.2. Cross-attention in text-conditioned diffusion models
Deep spatial features of the noisy image \(\phi(z_t)\) are projected to a query matrix \(Q = l_{Q}(\phi(z_t))\).
Text embedding \(\psi(P)\) is projected to a key matrix \(K = l_{K}(\psi(P))\) and a value matrix \(V = l_{V}(\psi(P))\).
(\(l_{Q}, l_{K}, l_{V}\) are learned linear projections)
The cross-attention maps are then \(M = \mathrm{Softmax}(\frac{QK^T}{\sqrt{d}})\).
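The computation above can be sketched in a few lines of NumPy (a minimal sketch with random matrices standing in for the learned projections \(l_Q, l_K\); function names are my own, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_maps(phi_z, psi_p, W_q, W_k):
    """Compute M = softmax(Q K^T / sqrt(d)).

    phi_z: (n_pixels, c)  spatial features of the noisy latent
    psi_p: (n_tokens, e)  text embedding of the prompt
    W_q: (c, d), W_k: (e, d)  learned linear projections l_Q, l_K
    """
    Q = phi_z @ W_q                        # (n_pixels, d)
    K = psi_p @ W_k                        # (n_tokens, d)
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d))   # (n_pixels, n_tokens)
```

Each row of \(M\) is a distribution over prompt tokens for one spatial location, which is why the maps capture layout.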
2.3. Controlling the cross-attention
Inject the cross-attention maps \(M\) that were obtained from the generation with the original prompt \(P\), into a second generation with the modified prompt \(P^*\).
(attention injection is applied only over the tokens common to both prompts)
This allows the synthesis of an edited image \(I^*\) that is not only manipulated according to the edited prompt, but also preserves the structure of the input image \(I\).
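The dual-generation procedure can be illustrated with a toy sketch: two diffusions run in parallel from the same seed, and at each step the edited branch's cross-attention maps are overridden by an edit function of the source maps. The stub denoiser and all names here are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def toy_denoise(z, prompt, t, M_override=None):
    """Stub denoiser: computes toy cross-attention maps M over the prompt
    tokens and uses them (or an injected override) to update the latent."""
    scores = np.outer(z, prompt)                     # (n_pixels, n_tokens)
    M = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    if M_override is not None:
        M = M_override                               # cross-attention injection
    z_next = z + 0.1 * (M @ prompt - z)              # toy denoising update
    return z_next, M

def generate_pair(z_T, prompt, prompt_star, edit_fn, T):
    """Run two generations in parallel from the same seed z_T,
    overriding the edited branch's maps via edit_fn(M, M_star, t)."""
    z, z_star = z_T.copy(), z_T.copy()
    for t in range(T, 0, -1):
        _, M_star = toy_denoise(z_star, prompt_star, t)   # target maps M*
        z, M = toy_denoise(z, prompt, t)                  # source step, maps M
        z_star, _ = toy_denoise(z_star, prompt_star, t,
                                M_override=edit_fn(M, M_star, t))
    return z, z_star
```

With identical prompts and full injection, the two branches stay identical, which is the sanity check that the shared maps fully determine the toy layout.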
2.4. Edit function per application
- Word swap
Attention injection may over-constrain the geometry, especially when the edit requires a large structural modification.
To address this, a softer attention constraint is used, limiting the number of steps in which injection is applied:
\(Edit(M_t, M_t^*, t) := \left\{\begin{matrix} M_t^* & t < \tau \\ M_t & t \geq \tau \\ \end{matrix}\right.\)
(\(\tau\): timestep parameter that determines until which step the injection is applied)
- Adding a new phrase
To preserve common details, apply the attention injection only over the common tokens from both prompts.
\((Edit(M_t, M_t^*, t))_{i, j} := \left\{\begin{matrix} (M_t^*)_{i, j} & A(j) = None \\ (M_t)_{i, A(j)} & A(j) \neq None \\ \end{matrix}\right.\)
(the alignment function \(A\) receives a token index from the target prompt \(P^*\) and outputs the corresponding token index in \(P\), or \(None\) if there is no match)
- Attention re-weighting
Strengthens or weakens the extent to which each token affects the resulting image.
\((Edit(M_t, M_t^*, t))_{i, j} := \left\{\begin{matrix} c * (M_t)_{i, j} & j = j^* \\ (M_t)_{i, j} & j \neq j^* \\ \end{matrix}\right.\)
(scale the attention map of the assigned token \(j^*\) with parameter \(c \in [-2, 2]\), resulting in a stronger/weaker effect)
- Real image editing
Given a real image \(x_0\), use DDIM inversion, which performs the diffusion process in the reverse direction to recover an initial noise latent that reconstructs the image.
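The three edit functions above, together with a single DDIM inversion step, can be sketched as follows (a minimal NumPy sketch; function names are my own, and `eps` stands for the model's noise prediction \(\epsilon_\theta(x_t, t)\), passed in so the step stays model-agnostic):

```python
import numpy as np

def edit_word_swap(M_t, M_star_t, t, tau):
    """Word swap: inject the source maps M_t while t >= tau,
    then let the target maps M_star_t take over for t < tau."""
    return M_star_t if t < tau else M_t

def edit_add_phrase(M_t, M_star_t, alignment):
    """Adding a new phrase: for each target token j, copy the source
    column A(j) when a match exists, otherwise keep the target column.
    alignment[j] is the matching source token index or None."""
    out = M_star_t.copy()
    for j, a in enumerate(alignment):
        if a is not None:
            out[:, j] = M_t[:, a]
    return out

def edit_reweight(M_t, j_star, c):
    """Attention re-weighting: scale the column of token j* by c in [-2, 2]."""
    out = M_t.copy()
    out[:, j_star] = c * out[:, j_star]
    return out

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM step run in the reverse (noising) direction,
    reusing the noise prediction eps = eps_theta(x_t, t) for step t+1."""
    x0_pred = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_next) * x0_pred + np.sqrt(1 - alpha_bar_next) * eps
```

If the noise prediction were exact, the inversion step would map \(x_t\) onto the same deterministic DDIM trajectory at the next (noisier) timestep; in practice the approximation error accumulates, which is the inversion inaccuracy discussed in the limitations below.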
3. Experiments
3.1. Main results
- No quantitative results, only qualitative.
- Ablation study of the method (attention injection)
3.2. Applications
Text-Only Localized Editing
Global Editing
Fader Control
Real Image Editing
Ablation - Attention Injection
3.3. Limitations
- Inaccurate DDIM inversion (Figure 11): the inversion is not sufficiently accurate in many cases.
- Cannot handle complicated structural modifications: the method cannot spatially move existing objects across the image (e.g., editing an image of a sitting dog into a standing dog).