AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
Accept info: ICLR 2024
Authors: Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, Jiming Chen
Affiliation: Zhejiang University, Singapore Management University, Harvard University
Links: arXiv, OpenReview, GitHub
Task: zero-shot anomaly detection (ZSAD)
TLDR: Learning object-agnostic prompts is key to ZSAD.
1. Intuition & Motivation
Zero-shot anomaly detection (ZSAD): detect anomalies without any training sample in a target dataset
For ZSAD, previous works exploit CLIP's generalizability by using specialized object-aware text prompts.
For example, WinCLIP uses a large number of hand-crafted text prompts for ZSAD.
However, current prompting approaches fail to capture abnormality, as shown in Figure 1 (c), (d), and (e).
Why?
Since CLIP is pre-trained to align with the class semantics of foreground objects, CLIP with object-aware prompts also focuses on foreground semantics rather than on abnormality/normality in the images.
A natural question follows: do we need foreground object semantics in ZSAD?
Even though the foreground object semantics can be completely different, anomaly patterns remain quite similar.
Thus, it is natural to learn object-agnostic text prompts, not object-aware text prompts.
Then how can we keep CLIP from focusing on object semantics?
The simplest approach would be to exclude the object semantics from the text prompt templates
(e.g., "A photo of a damaged [class]" → "A photo of a damaged [object]"), as illustrated in the sketch below.
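As a rough illustration, here is what this string-level change looks like. The prompt strings below are hand-written examples for clarity; AnomalyCLIP actually learns the template tokens, as described in Section 2.2.

```python
# Minimal illustration only: hand-written prompt strings.
class_names = ["bottle", "carpet", "transistor"]

# Object-aware (e.g., WinCLIP-style): one prompt pair per class name.
object_aware = {
    cls: {
        "normal": f"a photo of a {cls}",
        "abnormal": f"a photo of a damaged {cls}",
    }
    for cls in class_names
}

# Object-agnostic: the class name is replaced by the generic word "object",
# so a single prompt pair is shared across every class and dataset.
object_agnostic = {
    "normal": "a photo of an object",
    "abnormal": "a photo of a damaged object",
}

print(object_aware["bottle"]["abnormal"])  # a photo of a damaged bottle
print(object_agnostic["abnormal"])         # a photo of a damaged object
```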
2. AnomalyCLIP: object-agnostic prompt learning
2.1. Approach overview
- Core method: object-agnostic text prompt templates
- Training objective: cross-entropy, focal, and dice losses
  - image-level (global): cross-entropy loss
  - pixel-level (local): focal loss, dice loss
- Other methods: text prompt tuning, DPAM
  - text prompt tuning: learnable token embeddings in the text encoder
  - DPAM: replace Q-K self-attention with diagonally prominent attention (e.g., V-V self-attention)
- Only the text prompt templates and token embeddings are trained
2.2. Object-agnostic text prompt design
- Object-aware text prompt templates (previous works)
  - text embeddings of normality: [V_1][V_2]...[V_E][cls]
  - text embeddings of abnormality: [W_1][W_2]...[W_E][damaged][cls]
- Object-agnostic text prompt templates (this work)
  - text embeddings of normality (\(g_n\)): [V_1][V_2]...[V_E][object]
  - text embeddings of abnormality (\(g_a\)): [W_1][W_2]...[W_E][damaged][object]
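A minimal PyTorch sketch of how these learnable, object-agnostic templates could be parameterized. The class name, dimensions, and initialization are illustrative assumptions; in the actual method, the resulting token sequences are passed through CLIP's text encoder to obtain \(g_n\) and \(g_a\).

```python
import torch
import torch.nn as nn

class ObjectAgnosticPrompts(nn.Module):
    """Illustrative parameterization of object-agnostic prompt templates.

    E learnable context embeddings are shared across all objects; the only
    fixed parts are the (frozen) word embeddings of "damaged" and "object".
    Dimensions and initialization are assumptions, not the paper's exact setup.
    """

    def __init__(self, embed_dim=768, n_ctx=12):
        super().__init__()
        # [V_1]...[V_E] and [W_1]...[W_E]: learnable context token embeddings
        self.normal_ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.abnormal_ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Stand-ins for the frozen word embeddings of "damaged" and "object";
        # in practice they would be looked up from CLIP's token embedding table.
        self.register_buffer("damaged_emb", torch.randn(1, embed_dim))
        self.register_buffer("object_emb", torch.randn(1, embed_dim))

    def forward(self):
        # g_n template: [V_1]...[V_E][object]
        g_n = torch.cat([self.normal_ctx, self.object_emb], dim=0)
        # g_a template: [W_1]...[W_E][damaged][object]
        g_a = torch.cat([self.abnormal_ctx, self.damaged_emb, self.object_emb], dim=0)
        # Both sequences would then be encoded by CLIP's text encoder.
        return g_n, g_a

prompts = ObjectAgnosticPrompts()
g_n, g_a = prompts()
print(g_n.shape, g_a.shape)  # torch.Size([13, 768]) torch.Size([14, 768])
```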
2.3. Refinement of the textual space
Use text prompt tuning to refine the original textual space of CLIP.
Following previous works (VPT, MaPLe), additional learnable token embeddings are added into the text encoder.
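A hedged sketch of what VPT/MaPLe-style deep prompt tuning looks like at a single text-encoder layer; the function name, shapes, and number of tokens are assumptions, not CLIP's actual code.

```python
import torch
import torch.nn as nn

def replace_with_learnable_tokens(layer_input, learnable_tokens):
    """Replace the first tokens of a text-encoder layer input with learnable ones.

    This mimics VPT/MaPLe-style deep prompt tuning: the frozen encoder keeps its
    weights, and only these small per-layer token embeddings receive gradients.
    Shapes follow CLIP's (seq_len, batch, dim) convention; names are assumptions.
    """
    n, batch = learnable_tokens.size(0), layer_input.size(1)
    tokens = learnable_tokens.unsqueeze(1).expand(n, batch, -1)
    return torch.cat([tokens, layer_input[n:]], dim=0)

# Example: 4 learnable tokens of width 512 injected at one layer
layer_tokens = nn.Parameter(torch.randn(4, 512) * 0.02)
x = torch.randn(77, 2, 512)                       # (seq_len, batch, dim)
x = replace_with_learnable_tokens(x, layer_tokens)
print(x.shape)                                    # torch.Size([77, 2, 512])
```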
2.4. Refinement of the local visual space
The attention map in the visual encoder tends to focus on specific tokens (see Figure 1 (b)).
These tokens disrupt the local visual semantics, hindering the effective learning of the fine-grained abnormality.
Authors empirically find that a Diagonally Prominent Attention Map (DPAM) helps reduce the disturbance from other tokens, leading to improved local visual semantics.
Thus, AnomalyCLIP replaces the original Q-K self-attention in the visual encoder with a diagonally prominent alternative (e.g., Q-Q, K-K, or V-V self-attention), sketched below.
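A minimal sketch of V-V self-attention as one DPAM instantiation. The single-head, bias-free formulation and the names used here are simplifying assumptions rather than the paper's exact implementation.

```python
import torch

def dpam_vv_attention(x, w_v, scale):
    """Sketch of DPAM via V-V self-attention (single head, no biases).

    Standard attention uses softmax(Q K^T / sqrt(d)) V; here the attention map
    is computed from V alone as softmax(V V^T * scale) V. This map is diagonally
    prominent: each token attends mostly to itself, preserving local visual
    semantics instead of being pulled toward a few dominant tokens.
    """
    v = x @ w_v                                       # (n_tokens, dim)
    attn = torch.softmax((v @ v.T) * scale, dim=-1)   # diagonally prominent map
    return attn @ v

# Example usage with random patch tokens and projection weights
dim, n_tokens = 64, 10
x = torch.randn(n_tokens, dim)
w_v = torch.randn(dim, dim) / dim ** 0.5
out = dpam_vv_attention(x, w_v, scale=dim ** -0.5)
print(out.shape)  # torch.Size([10, 64])
```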
2.5. Training
Objective: glocal loss (global: cross-entropy loss; local: focal and dice losses)
Integrate intermediate layers to provide more local visual details.
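A sketch of what such a glocal objective could look like in PyTorch; the exact focal/dice formulations and the equal weighting of the terms are assumptions.

```python
import torch
import torch.nn.functional as F

def glocal_loss(image_logits, image_labels, pixel_probs, pixel_masks,
                gamma=2.0, eps=1e-6):
    """Sketch of a 'glocal' objective: global cross-entropy + local focal/dice.

    - Global term: cross-entropy on image-level normal/abnormal logits.
    - Local terms: focal and dice losses on pixel-level anomaly probability maps.
    """
    # Global (image-level): standard cross-entropy
    global_loss = F.cross_entropy(image_logits, image_labels)

    # Local (pixel-level): focal loss on the anomaly probability map
    p_t = pixel_probs * pixel_masks + (1 - pixel_probs) * (1 - pixel_masks)
    focal = (-(1 - p_t) ** gamma * torch.log(p_t + eps)).mean()

    # Local (pixel-level): dice loss encouraging overlap with the ground-truth mask
    inter = (pixel_probs * pixel_masks).sum()
    dice = 1 - (2 * inter + eps) / (pixel_probs.sum() + pixel_masks.sum() + eps)

    return global_loss + focal + dice

# Example: batch of 2 images, 2 classes (normal/abnormal), 32x32 anomaly maps
logits = torch.randn(2, 2)
labels = torch.tensor([0, 1])
probs = torch.rand(2, 32, 32)
masks = (torch.rand(2, 32, 32) > 0.9).float()
print(glocal_loss(logits, labels, probs, masks))
```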
2.6. Inference
- Image-level anomaly score: \(P(g_a, f_i)\)
\(P(\cdot, \cdot)\): similarity score used in CLIP
\(g_a\): learned abnormality text embeddings
\(f_i\): global visual embedding
- Pixel-level prediction
Merge the segmentation maps from all selected intermediate layers with interpolation and a smoothing operation
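A sketch of the inference procedure under simplifying assumptions: the temperature, the equal averaging across layers, and average pooling as the smoothing step are all guesses rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def anomaly_scores(global_feat, patch_feats_per_layer, g_n, g_a,
                   out_size=(224, 224), temperature=0.07):
    """Sketch of inference: image-level score and pixel-level anomaly map.

    - Image level: softmax similarity between the global visual embedding and
      the (normal, abnormal) text embeddings; the abnormal probability plays
      the role of P(g_a, f_i).
    - Pixel level: per-layer patch/text similarities are upsampled to the
      image size, averaged, and lightly smoothed (average pooling here stands
      in for the smoothing operation).
    """
    text = F.normalize(torch.stack([g_n, g_a]), dim=-1)              # (2, dim)
    f_i = F.normalize(global_feat, dim=-1)                           # (dim,)
    image_score = torch.softmax(f_i @ text.T / temperature, dim=-1)[1]

    maps = []
    for patches in patch_feats_per_layer:                            # (h, w, dim)
        p = F.normalize(patches, dim=-1)
        sim = torch.softmax(p @ text.T / temperature, dim=-1)[..., 1]
        maps.append(F.interpolate(sim[None, None], size=out_size,
                                  mode="bilinear", align_corners=False))
    pixel_map = torch.stack(maps).mean(0)                            # (1, 1, H, W)
    pixel_map = F.avg_pool2d(pixel_map, kernel_size=5, stride=1, padding=2)[0, 0]
    return image_score, pixel_map

# Example with random features: two intermediate layers of 14x14 patch tokens
g_n, g_a = torch.randn(768), torch.randn(768)
score, amap = anomaly_scores(torch.randn(768),
                             [torch.randn(14, 14, 768) for _ in range(2)],
                             g_n, g_a)
print(score.item(), amap.shape)   # a scalar in (0, 1) and torch.Size([224, 224])
```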
3. Experiments
3.1. Main results
- Datasets
  - 17 benchmark datasets (7 industrial, 10 medical)
- Metrics
  - Area Under the Receiver Operating Characteristic Curve (AUROC)
  - Average Precision (AP) for anomaly detection
  - AUPRO for anomaly segmentation
- Implementation details
  - Use CLIP ViT-L/14@336
  - Replace Q-K self-attention with V-V self-attention
  - Fine-tune on the MVTec AD test set and evaluate on the other datasets (for evaluating MVTec AD, fine-tune on the VisA test set instead)
- Ablations
  - Method (DPAM, object-agnostic text prompts, learnable tokens in the text encoder, multi-layer visual encoder features)
  - Loss (glocal, local, global)
  - DPAM strategy (V-V, Q-Q, K-K)