AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
Accept info: ICLR 2024
Authors: Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, Jiming Chen
Affiliation: Zhejiang University, Singapore Management University, Harvard University
Links: arXiv, OpenReview, GitHub
Task: zero-shot anomaly detection (ZSAD)
TLDR: learning object-agnostic prompts is the key to ZSAD.
1. Intuition & Motivation
Zero-shot anomaly detection (ZSAD): detect anomalies without any training sample in a target dataset
For ZSAD, previous works exploit CLIP's generalizability by using specialized object-aware text prompts.
For example, WinCLIP uses a large number of hand-crafted text prompts for ZSAD.
However, current prompting approaches fail to capture the abnormality, as shown in Figure 1 (c), (d), (e).
Why?
Since CLIP is pre-trained to align with the class semantics of foreground objects, CLIP with object-aware prompts also focuses on foreground object semantics rather than the abnormality/normality in the images.
A natural question follows: do we even need foreground object semantics in ZSAD?
Even though the foreground object semantics can be completely different, anomaly patterns remain quite similar.
Thus, it is natural to learn object-agnostic text prompts, not object-aware text prompts.
Then how can we make CLIP not focus on object semantics?
The simplest approach is to exclude the object semantics from the text prompt templates.
(ex. A photo of a damaged [class] → A photo of a damaged [object])
2. AnomalyCLIP: object-agnostic prompt learning
2.1. Approach overview
- Core method: object-agnostic text prompt templates
- Training objective: cross-entropy, focal, dice loss
  - image-level (global): cross-entropy loss
  - pixel-level (local): focal loss, dice loss
- Other methods: text prompt tuning, DPAM
  - text prompt tuning: learnable token embeddings in the text encoder
  - DPAM: replace Q-K self-attention with diagonally prominent attention (ex. V-V self-attention)
- Only train the text prompt templates and the added token embeddings
2.2. Object-agnostic text prompt design
- Object-aware text prompt templates (previous works)
  - text embeddings of normality: [V_1][V_2]...[V_E][cls]
  - text embeddings of abnormality: [W_1][W_2]...[W_E][damaged][cls]
- Object-agnostic text prompt templates (AnomalyCLIP; sketched in code below)
  - text embeddings of normality (\(g_n\)): [V_1][V_2]...[V_E][object]
  - text embeddings of abnormality (\(g_a\)): [W_1][W_2]...[W_E][damaged][object]
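A minimal PyTorch sketch of how these object-agnostic templates could be parameterized (the class name, context length, and embedding dimension are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class ObjectAgnosticPrompts(nn.Module):
    """Learnable object-agnostic prompt templates (illustrative sketch).

    Normality:   [V_1]...[V_E][object]
    Abnormality: [W_1]...[W_E][damaged][object]
    Only the context tokens V_i / W_i are trained; the suffix uses the
    generic word "object", so no class-specific semantics are injected.
    """
    def __init__(self, suffix_normal, suffix_abnormal, ctx_len=12, dim=768):
        super().__init__()
        # Learnable context token embeddings (V_1..V_E and W_1..W_E)
        self.ctx_normal = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        self.ctx_abnormal = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        # Frozen token embeddings of the fixed suffixes, e.g. "object" and
        # "damaged object", precomputed with CLIP's token embedding layer.
        self.register_buffer("suffix_normal", suffix_normal)
        self.register_buffer("suffix_abnormal", suffix_abnormal)

    def forward(self):
        # Concatenate learnable context with the class-agnostic suffix;
        # the result is passed through CLIP's (frozen) text encoder to
        # obtain the text embeddings g_n and g_a.
        prompt_n = torch.cat([self.ctx_normal, self.suffix_normal], dim=0)
        prompt_a = torch.cat([self.ctx_abnormal, self.suffix_abnormal], dim=0)
        return prompt_n, prompt_a
```

Only the context vectors receive gradients; the suffix embeddings and the CLIP text encoder stay frozen, matching the design above.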
2.3. Refinement of the textual space
Use text prompt tuning to refine the original textual space of CLIP.
Following previous works (VPT, MaPLe), additional learnable token embeddings are added into the CLIP text encoder (a sketch follows below).
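A rough sketch of this kind of layer-wise prompt tuning in the VPT/MaPLe style (the wrapper interface, number of prompt tokens, and tensor layout are assumptions):

```python
import torch
import torch.nn as nn

class PromptedTextBlock(nn.Module):
    """Wraps a frozen text-encoder block and injects learnable token
    embeddings at fixed positions before the block runs (a sketch of
    VPT/MaPLe-style deep prompt tuning, not the authors' exact code).
    """
    def __init__(self, frozen_block, num_prompts=4, dim=768, start=1):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():   # keep CLIP weights frozen
            p.requires_grad_(False)
        self.start = start                  # keep position 0 (e.g. BOS) intact
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, x):
        # x: (batch, seq_len, dim). Overwrite a fixed slice of token
        # positions with learnable embeddings, keeping the sequence length
        # unchanged so the frozen block's attention mask remains valid.
        n = self.prompts.shape[0]
        x = x.clone()
        x[:, self.start:self.start + n] = self.prompts
        return self.block(x)
```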
2.4. Refinement of the local visual space
The attention map in the visual encoder focuses on specific tokens (ex. Figure 1 (b)).
These tokens disrupt the local visual semantics, hindering the effective learning of fine-grained abnormality.
Authors empirically find that a Diagonally Prominent Attention Map (DPAM) helps reduce the disturbance from other tokens, leading to improved local visual semantics.
Thus, replace the original Q-K attention in the visual encoder with diagonally prominent attention.
(ex. Q-Q, K-K, V-V self-attention)
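A minimal sketch of what such a replacement could look like inside a ViT attention block (the module interface and shapes are assumptions; the idea is simply to swap the similarity used to build the attention map):

```python
import torch
import torch.nn as nn

class VVAttention(nn.Module):
    """Self-attention whose attention map is computed from V-V similarity
    instead of Q-K, yielding a diagonally prominent map (DPAM-style
    replacement; a sketch, not the authors' exact code)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, head_dim)
        # Diagonally prominent attention: similarity of V with itself
        # (q and k are computed but unused in this V-V variant).
        attn = (v @ v.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```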
2.5. Training and inference
Training
Objective: glocal loss (global + local)
Global loss: cross-entropy loss
Local loss: focal loss, dice loss
Integrate intermediate layers of the visual encoder to provide more local visual details. (A sketch of the combined objective is given below.)
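A hedged sketch of how such a glocal objective could be assembled (the focal/dice formulations and the equal weighting are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Binary focal loss on per-pixel anomaly probabilities (sketch)."""
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = target * pred + (1 - target) * (1 - pred)
    return (alpha * (1 - p_t) ** gamma * bce).mean()

def dice_loss(pred, target, eps=1.0):
    """Dice loss between a predicted anomaly map and the ground-truth mask."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def glocal_loss(image_logits, image_labels, anomaly_maps, masks):
    """Global (image-level CE) + local (pixel-level focal + dice) objective.

    image_logits: (B, 2) similarity logits w.r.t. [g_n, g_a]
    image_labels: (B,) with 0 = normal, 1 = anomalous
    anomaly_maps: list of (B, H, W) per-layer anomaly probability maps
    masks:        (B, H, W) ground-truth anomaly masks
    """
    loss_global = F.cross_entropy(image_logits, image_labels)
    loss_local = sum(focal_loss(m, masks) + dice_loss(m, masks)
                     for m in anomaly_maps)
    return loss_global + loss_local
```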
Inference
- Image-level anomaly score: \(P(g_a, f_i)\)
\(P(\cdot, \cdot)\): similarity score used in CLIP
\(g_a\): learned abnormality text embeddings
\(f_i\): global visual embedding
- Pixel-level prediction: merge the segmentations from all selected intermediate layers with interpolation and smoothing operations (see the sketch below)
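A sketch of the inference path under these definitions (L2-normalized embeddings, a square patch grid, and the temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

def anomaly_scores(g_n, g_a, f_global, patch_feats, image_size):
    """Image- and pixel-level anomaly scoring (illustrative sketch).

    Assumes all embeddings are L2-normalized so dot products are cosine
    similarities.
      g_n, g_a:    learned normality/abnormality text embeddings, shape (D,)
      f_global:    global visual embedding of the image, shape (D,)
      patch_feats: list of per-layer patch embeddings, each (N_patches, D)
      image_size:  (H, W) of the input image
    """
    # Image-level score P(g_a, f_i): CLIP-style softmax over similarities
    # (temperature 0.07 is an assumption, not the paper's exact value).
    sims = torch.stack([f_global @ g_n, f_global @ g_a]) / 0.07
    image_score = sims.softmax(dim=0)[1]

    # Pixel-level map: per-patch abnormality probability for each selected
    # layer, upsampled to image resolution and averaged across layers.
    maps = []
    for feats in patch_feats:
        probs = torch.stack([feats @ g_n, feats @ g_a], dim=-1)
        probs = (probs / 0.07).softmax(dim=-1)[..., 1]   # (N_patches,)
        side = int(probs.numel() ** 0.5)                 # assume square grid
        m = probs.reshape(1, 1, side, side)
        maps.append(F.interpolate(m, size=image_size, mode="bilinear",
                                  align_corners=False))
    anomaly_map = torch.stack(maps).mean(dim=0)[0, 0]    # (H, W)
    # A Gaussian smoothing step would typically follow here (omitted).
    return image_score, anomaly_map
```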
3. Experiments
3.1. Experiment setup
- Datasets
  - 17 benchmark datasets (7 industrial, 10 medical)
- Metrics
  - Area Under the Receiver Operating Characteristic Curve (AUROC)
  - Average Precision (AP) for anomaly detection
  - AUPRO for anomaly segmentation
- Implementation details
  - use CLIP ViT-L/14@336
  - replace Q-K self-attention with V-V self-attention
  - fine-tune on the MVTec AD test set, evaluate on the other datasets (for MVTec AD, fine-tune on the VisA test set)
3.2. Main results
Table 1: AnomalyCLIP achieves superior ZSAD performance across the datasets.
Table 2: AnomalyCLIP obtains promising ZSAD performance on various medical image datasets, even though it is tuned on a defect detection dataset.
Table 3: even when AnomalyCLIP is fine-tuned on medical image data (ColonDB), its performance varies depending on the dataset (performance degrades on COVID-19, ISIC, TN3K).
3.3. Ablation study
Module ablation
\(T_1\): DPAM
\(T_2\): object-agnostic text prompts
\(T_3\): learnable tokens in text encoders
\(T_4\): multi-layer visual encoder features
Context optimization: object-agnostic prompt learning is compared with object-aware prompt learning
DPAM strategy ablation
Compared to V-V self-attention,
Q-Q self-attention performs similarly on pixel-level anomalies, but degrades on image-level anomalies.
K-K self-attention performs similarly on image-level anomalies, but degrades on pixel-level anomalies.
Why is V-V self-attention better than Q-Q or K-K self-attention?
Since the Q-K attention is computed from Q and K, Q-Q and K-K still produce large attention scores on specific tokens.
In contrast, V does not participate in computing the Q-K attention, so V-V self-attention reduces the unexpected bias toward specific tokens.