AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

Accept info: ICLR 2024
Authors: Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, Jiming Chen
Affiliation: Zhejiang University, Singapore Management University, Harvard University
Links: arXiv, OpenReview, GitHub
Task: zero-shot anomaly detection (ZSAD)
TLDR: learning object-agnostic prompts is key to ZSAD.

1. Intuition & Motivation

Zero-shot anomaly detection (ZSAD): detect anomalies without any training samples from the target dataset

For ZSAD, previous works exploit CLIP's generalizability by using specialized object-aware text prompts.
For example, WinCLIP uses a large number of hand-crafted text prompts for ZSAD.

Figure1_comparison

However, the current prompting approaches fail to capture the abnormality, as shown in Figure 1 (c), (d), (e).
Why?
Since CLIP is pre-trained to align with the class semantics of foreground objects, CLIP with object-aware prompts also focuses on the foreground object semantics rather than the abnormality/normality in the images.
A natural question follows: do we need foreground object semantics in ZSAD?

Even though the foreground object semantics can be completely different, anomaly patterns remain quite similar.
Thus, it is natural to learn object-agnostic text prompts, not object-aware text prompts.

Then how can we make CLIP not focus on object semantics?
The simplest approach would be to exclude the object semantics from the text prompt templates.
(ex. A photo of a damaged [class] → A photo of a damaged [object])

2. AnomalyCLIP: object-agnostic prompt learning

Figure2_overview

2.1. Approach overview

  • Core method: object-agnostic text prompt templates
  • Training objective: cross-entropy, focal, dice loss
    image-level (global): cross-entropy loss
    pixel-level (local): focal loss, dice loss
  • Other methods: text prompt tuning, DPAM
    text prompt tuning: learnable token embeddings in text encoder
    DPAM: replace Q-K self-attention with diagonally prominent attention (ex. V-V self-attention)
  • Only train text prompt templates and token embeddings

2.2. Object-agnostic text prompt design

  • Object-aware text prompt templates (previous works)
    text embeddings of normality: [V_1][V_2]...[V_E][cls]
    text embeddings of abnormality: [W_1][W_2]...[W_E][damaged][cls]

  • Object-agnostic text prompt templates (our work)
    text embeddings of normality (\(g_n\)): [V_1][V_2]...[V_E][object]
    text embeddings of abnormality (\(g_a\)): [W_1][W_2]...[W_E][damaged][object]
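
A minimal sketch of how the object-agnostic templates could be realized as learnable embeddings (PyTorch; not the authors' code — `n_ctx`, `ctx_normal`, and the placeholder word embeddings are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ObjectAgnosticPrompts(nn.Module):
    """Sketch: learnable context tokens shared across all object classes.

    Normality prompt:   [V_1]...[V_E][object]
    Abnormality prompt: [W_1]...[W_E][damaged][object]
    'object' replaces any class name, so a single prompt pair covers
    every object category.
    """

    def __init__(self, n_ctx: int = 12, dim: int = 768):
        super().__init__()
        # learnable context embeddings V_1..V_E and W_1..W_E
        self.ctx_normal = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.ctx_abnormal = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # frozen word embeddings for the fixed words (random placeholders here;
        # in practice they would come from CLIP's token embedding table)
        self.register_buffer("emb_object", torch.randn(1, dim))
        self.register_buffer("emb_damaged", torch.randn(1, dim))

    def forward(self):
        prompt_n = torch.cat([self.ctx_normal, self.emb_object], dim=0)
        prompt_a = torch.cat([self.ctx_abnormal, self.emb_damaged, self.emb_object], dim=0)
        return prompt_n, prompt_a
```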

2.3. Refinement of the textual space

Use text prompt tuning to refine the original textual space of CLIP.
Following previous works (VPT, MaPLe), additional learnable token embeddings are added into the text encoder.
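
A rough sketch of this kind of deep text prompt tuning (VPT/MaPLe-style), assuming the text encoder is a stack of frozen transformer blocks that accept `(seq_len, batch, dim)` tensors; the class and interface names are illustrative, not AnomalyCLIP's actual implementation:

```python
import torch
import torch.nn as nn

class DeepTextPromptTuning(nn.Module):
    """Sketch: insert extra learnable tokens into each text-encoder layer.
    Only these tokens are trained; the CLIP text encoder stays frozen."""

    def __init__(self, layers: nn.ModuleList, n_tokens: int = 4, dim: int = 768):
        super().__init__()
        self.layers = layers  # frozen CLIP text transformer blocks
        self.deep_prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(n_tokens, dim) * 0.02) for _ in layers]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, batch, dim) token embeddings of the prompt template
        for i, (layer, prompt) in enumerate(zip(self.layers, self.deep_prompts)):
            p = prompt.unsqueeze(1).expand(-1, x.size(1), -1)
            if i == 0:
                x = torch.cat([p, x], dim=0)               # prepend learnable tokens
            else:
                x = torch.cat([p, x[p.size(0):]], dim=0)   # refresh them at this depth
            x = layer(x)
        return x
```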

2.4. Refinement of the local visual space

Figure3_attention_vis

The attention map in the visual encoder focuses on specific tokens.
(ex. Figure 1 (b))
These tokens disrupt the local visual semantics, hindering the effective learning of fine-grained abnormality.

Authors empirically find that a Diagonally Prominent Attention Map (DPAM) helps reduce the disturbance from other tokens, leading to improved local visual semantics.
Thus, replace the original Q-K attention in the visual encoder with diagonally prominent attention.
(ex. Q-Q, K-K, V-V self-attention)
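
A minimal sketch of DPAM with the V-V variant, assuming a single fused QKV projection; shapes and weight names are illustrative, not CLIP's actual module layout:

```python
import torch

def dpam_vv_attention(x, w_qkv, w_out, n_heads=8):
    """Sketch: diagonally prominent attention via V-V self-attention.

    Instead of softmax(Q K^T), similarity is computed between the value
    projections themselves, which keeps the attention map close to diagonal
    and preserves local visual semantics.
    """
    b, t, d = x.shape                          # x: (batch, tokens, dim)
    qkv = x @ w_qkv.t()                        # w_qkv: (3*dim, dim)
    q, k, v = qkv.chunk(3, dim=-1)             # q, k are unused in the V-V variant
    hd = d // n_heads
    v = v.view(b, t, n_heads, hd).transpose(1, 2)                      # (b, h, t, hd)
    attn = torch.softmax(v @ v.transpose(-2, -1) / hd ** 0.5, dim=-1)  # V-V, not Q-K
    out = (attn @ v).transpose(1, 2).reshape(b, t, d)
    return out @ w_out.t()                     # w_out: (dim, dim)
```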

2.5. Training and inference

Training

Equation2_loss

Objective: glocal loss (global + local)
Global loss: cross-entropy loss
Local loss: focal loss, dice loss

Integrate intermediate layers to provide more local visual details.
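
A hedged sketch of such a glocal objective, assuming per-image logits, a ground-truth anomaly mask, and one anomaly map per selected intermediate layer; the loss weighting and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def glocal_loss(image_logits, labels, anomaly_maps, masks, gamma=2.0, eps=1.0):
    """Sketch: cross-entropy on image-level logits (global) plus
    focal + dice losses on pixel-level anomaly maps (local)."""
    # global: image-level normal/abnormal classification
    loss = F.cross_entropy(image_logits, labels)        # image_logits: (B, 2)

    # local: pixel-level segmentation against the ground-truth masks
    for amap in anomaly_maps:                           # amap, masks: (B, H, W) in [0, 1]
        p = amap.clamp(1e-6, 1 - 1e-6)
        # focal loss: down-weight easy pixels
        focal = -(masks * (1 - p) ** gamma * p.log()
                  + (1 - masks) * p ** gamma * (1 - p).log()).mean()
        # dice loss: overlap between predicted and true anomalous regions
        inter = (p * masks).sum()
        dice = 1 - (2 * inter + eps) / (p.sum() + masks.sum() + eps)
        loss = loss + focal + dice
    return loss
```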

Inference

  • Image-level anomaly score: \(P(g_a, f_i)\)
    \(P(\cdot, \cdot)\): similarity score used in CLIP
    \(g_a\): learned abnormality text embeddings
    \(f_i\): global visual embedding

Inference_mask

  • Pixel-level prediction: merge the segmentation maps from all selected intermediate layers with interpolation and a smoothing operation
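
A minimal inference sketch under the same notation, assuming a global visual embedding, per-layer patch features, and the learned text embeddings \(g_n\), \(g_a\); the temperature and the omitted Gaussian smoothing are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def anomaly_inference(f_global, patch_feats, g_n, g_a, image_size, tau=0.07):
    """Sketch: image-level score P(g_a, f_i) via CLIP-style softmax similarity,
    pixel-level map by merging per-layer patch similarities (smoothing omitted)."""
    f = F.normalize(f_global, dim=-1)                     # (B, D)
    text = F.normalize(torch.stack([g_n, g_a]), dim=-1)   # (2, D)
    image_score = (f @ text.t() / tau).softmax(dim=-1)[:, 1]

    maps = []
    for feats in patch_feats:                             # feats: (B, HW, D)
        sim = (F.normalize(feats, dim=-1) @ text.t() / tau).softmax(dim=-1)[..., 1]
        hw = int(sim.shape[1] ** 0.5)
        amap = sim.view(-1, 1, hw, hw)
        maps.append(F.interpolate(amap, size=image_size,
                                  mode="bilinear", align_corners=False))
    pixel_map = torch.stack(maps).mean(0).squeeze(1)      # (B, H, W)
    return image_score, pixel_map
```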

3. Experiments

3.1. Experiment setup

  • Datasets
    17 benchmark datasets (7 industrial, 10 medical)
  • Metrics (a minimal computation sketch follows this list)
    Area Under the Receiver Operating Characteristic Curve (AUROC)
    Average Precision (AP) for anomaly detection
    AUPRO for anomaly segmentation
  • Implementation details
    use CLIP ViT-L/14@336
    replace Q-K self-attention with V-V self-attention
    fine-tune on the MVTec AD test set, evaluate on the other datasets
    (for evaluation on MVTec AD, fine-tune on the VisA test set)
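
For reference, image-level AUROC and AP can be computed with scikit-learn as below (a generic sketch, not the authors' evaluation code; AUPRO is omitted since it requires per-region connected-component analysis of the ground-truth masks):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def image_level_metrics(labels, scores):
    """Sketch: image-level AUROC and AP from anomaly scores."""
    labels = np.asarray(labels)   # 1 = anomalous, 0 = normal
    scores = np.asarray(scores)   # higher = more anomalous
    return {
        "AUROC": roc_auc_score(labels, scores),
        "AP": average_precision_score(labels, scores),
    }

# usage: image_level_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```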

3.2. Main results

Table1_result

Table 1: AnomalyCLIP achieves superior ZSAD performance across the datasets.

Table2_result

Table 2: AnomalyCLIP obtains promising ZSAD performance on various medical image datasets, even though it is tuned using an industrial defect detection dataset.

Table3_result

Table 3: even when AnomalyCLIP is fine-tuned on medical image data (ColonDB), its performance varies depending on the dataset.
(performance degrades on COVID-19, ISIC, TN3K)

3.3. Ablation study

Module ablation

Table4_module

\(T_1\): DPAM
\(T_2\): object-agnostic text prompts
\(T_3\): learnable tokens in text encoders
\(T_4\): multi-layer visual encoder features

Context optimization

Table5_context

DPAM strategy ablation

Figure6_dpam

Compared to V-V self-attention,
Q-Q self-attention performs similarly on pixel-level anomalies but degrades on image-level anomalies.
K-K self-attention performs similarly on image-level anomalies but degrades on pixel-level anomalies.

Why is V-V self-attention better than Q-Q or K-K self-attention?
Since Q-K attention is computed from Q and K, the Q-Q and K-K variants still produce large attention scores on specific tokens.
In contrast, V does not participate in computing the Q-K attention, so V-V attention reduces the unexpected bias toward specific tokens.