## Description
Visual anomaly detection is critical in industrial manufacturing, but traditional methods often rely on extensive normal datasets and custom models, limiting scalability. Recent advances in large-scale visual-language models have significantly improved zero-/few-shot anomaly detection. However, these approaches may not fully exploit hierarchical features, potentially missing nuanced details. We introduce a window self-attention mechanism based on the CLIP model, combined with learnable prompts, to process multi-level features within a Soldier-Officer Window self-Attention (SOWA) framework. Our method has been tested on five benchmark datasets, where it leads on 18 out of 20 metrics compared with existing state-of-the-art techniques.
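The core idea is to adapt frozen hierarchical CLIP features with lightweight window self-attention adapters. The sketch below is a minimal, hypothetical illustration of that pattern, not the repository's actual implementation; the class name, window size, head count, and mean fusion are assumptions for illustration.

```python
import torch
import torch.nn as nn


class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatial windows.

    A minimal sketch of the window-attention adapter idea; the real
    SOWA modules differ in detail.
    """

    def __init__(self, dim: int, window: int = 4, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map from one frozen CLIP encoder stage.
        b, h, w, c = x.shape
        ws = self.window
        # Partition into (ws x ws) windows -> (B * num_windows, ws*ws, C).
        x = x.view(b, h // ws, ws, w // ws, ws, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        q = self.norm(x)
        y, _ = self.attn(q, q, q)
        x = x + y  # residual: the frozen features pass through unchanged
        # Reverse the window partition back to (B, H, W, C).
        x = x.view(b, h // ws, w // ws, ws, ws, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)


if __name__ == "__main__":
    # Hypothetical usage on four hierarchical feature maps (H1..H4).
    feats = [torch.randn(2, 16, 16, 768) for _ in range(4)]  # dummy stages
    adapters = nn.ModuleList(WindowSelfAttention(768) for _ in feats)
    fused = torch.stack([a(f) for a, f in zip(adapters, feats)]).mean(0)
    print(fused.shape)  # torch.Size([2, 16, 16, 768])
```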
## Installation

### Pip
```bash
# clone project
git clone https://github.com/huzongxiang/sowa
cd sowa

# [OPTIONAL] create conda environment
conda create -n sowa python=3.9
conda activate sowa

# install pytorch according to the instructions at
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt
```
### Conda
```bash
# clone project
git clone https://github.com/huzongxiang/sowa
cd sowa

# create conda environment and install dependencies
conda env create -f environment.yaml -n sowa

# activate conda environment
conda activate sowa
```
## How to run

Train a model with the default configuration:
```bash
# train on CPU
python src/train.py trainer=cpu data=sowa_visa model=sowa_hfwa

# train on GPU
python src/train.py trainer=gpu data=sowa_visa model=sowa_hfwa
```
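Since the entry point takes Hydra-style overrides, the configuration can also be composed programmatically. The snippet below is a hedged sketch: the `configs/` directory and `train` root config name are assumptions based on the common Lightning-Hydra template layout and may differ in this repository.

```python
from hydra import initialize, compose
from omegaconf import OmegaConf

# Compose the training config the same way `src/train.py` would.
# `config_path` and `config_name` are assumed; adjust to the repo layout.
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="train",
        overrides=["trainer=gpu", "data=sowa_visa", "model=sowa_hfwa"],
    )

print(OmegaConf.to_yaml(cfg))  # inspect the fully resolved configuration
```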
## Results
Comparison with few-shot (K=4) anomaly detection methods on the MVTec-AD, Visa, BTAD, DAGM, and DTD-Synthetic datasets.
| Metric | Dataset | WinCLIP | April-GAN | Ours |
|---|---|---|---|---|
| AC AUROC | MVTec-AD | 95.2±1.3 | 92.8±0.2 | 96.8±0.3 |
| | Visa | 87.3±1.8 | 92.6±0.4 | 92.9±0.2 |
| | BTAD | 87.0±0.2 | 92.1±0.2 | 94.8±0.2 |
| | DAGM | 93.8±0.2 | 96.2±1.1 | 98.9±0.3 |
| | DTD-Synthetic | 98.1±0.2 | 98.5±0.1 | 99.1±0.0 |
| AC AP | MVTec-AD | 97.3±0.6 | 96.3±0.1 | 98.3±0.3 |
| | Visa | 88.8±1.8 | 94.5±0.3 | 94.5±0.2 |
| | BTAD | 86.8±0.0 | 95.2±0.5 | 95.5±0.7 |
| | DAGM | 83.8±1.1 | 86.7±4.5 | 95.2±1.7 |
| | DTD-Synthetic | 99.1±0.1 | 99.4±0.0 | 99.6±0.0 |
| AS AUROC | MVTec-AD | 96.2±0.3 | 95.9±0.0 | 95.7±0.1 |
| | Visa | 97.2±0.2 | 96.2±0.0 | 97.1±0.0 |
| | BTAD | 95.8±0.0 | 94.4±0.1 | 97.1±0.0 |
| | DAGM | 93.8±0.1 | 88.9±0.4 | 96.9±0.0 |
| | DTD-Synthetic | 96.8±0.2 | 96.7±0.0 | 98.7±0.0 |
| AS AUPRO | MVTec-AD | 89.0±0.8 | 91.8±0.1 | 92.4±0.2 |
| | Visa | 87.6±0.9 | 90.2±0.1 | 91.4±0.0 |
| | BTAD | 66.6±0.2 | 78.2±0.1 | 81.2±0.2 |
| | DAGM | 82.4±0.3 | 77.8±0.9 | 94.4±0.1 |
| | DTD-Synthetic | 90.1±0.5 | 92.2±0.0 | 96.6±0.1 |
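For reference, AC (anomaly classification) metrics are computed from image-level scores, while AS (anomaly segmentation) metrics are computed from pixel-level anomaly maps; AUPRO additionally averages the overlap per connected defect region. A minimal sketch of the image-level metrics using scikit-learn (an assumed dependency here, with dummy data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Dummy image-level ground truth (1 = anomalous) and predicted scores.
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.10, 0.35, 0.80, 0.62, 0.91])

print("AC AUROC:", roc_auc_score(labels, scores))
print("AC AP:   ", average_precision_score(labels, scores))

# AS AUROC is the same computation applied to flattened pixel data,
# e.g. roc_auc_score(mask.ravel(), anomaly_map.ravel()).
```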
Performance comparison on the MVTec-AD and Visa datasets.
| Method | Source | MVTec-AD AC AUROC | MVTec-AD AS AUROC | MVTec-AD AS PRO | Visa AC AUROC | Visa AS AUROC | Visa AS PRO |
|---|---|---|---|---|---|---|---|
| SPADE | arXiv 2020 | 84.8±2.5 | 92.7±0.3 | 87.0±0.5 | 81.7±3.4 | 96.6±0.3 | 87.3±0.8 |
| PaDiM | ICPR 2021 | 80.4±2.4 | 92.6±0.7 | 81.3±1.9 | 72.8±2.9 | 93.2±0.5 | 72.6±1.9 |
| PatchCore | CVPR 2022 | 88.8±2.6 | 94.3±0.5 | 84.3±1.6 | 85.3±2.1 | 96.8±0.3 | 84.9±1.4 |
| WinCLIP | CVPR 2023 | 95.2±1.3 | 96.2±0.3 | 89.0±0.8 | 87.3±1.8 | 97.2±0.2 | 87.6±0.9 |
| April-GAN | CVPR 2023 VAND workshop | 92.8±0.2 | 95.9±0.0 | 91.8±0.1 | 92.6±0.4 | 96.2±0.0 | 90.2±0.1 |
| PromptAD | CVPR 2024 | 96.6±0.9 | 96.5±0.2 | - | 89.1±1.7 | 97.4±0.3 | - |
| InCTRL | CVPR 2024 | 94.5±1.8 | - | - | 87.7±1.9 | - | - |
| SOWA | Ours | 96.8±0.3 | 95.7±0.1 | 92.4±0.2 | 92.9±0.2 | 97.1±0.0 | 91.4±0.0 |
Comparison with few-shot anomaly detection methods on the MVTec-AD, Visa, BTAD, DAGM, and DTD-Synthetic datasets.
## Visualization
Visualization results under the few-shot setting (K=4).
## Mechanism
Hierarchical results on the MVTec-AD dataset: a set of images showing the model's actual outputs and illustrating how the different layers (H1 to H4) process different feature modes. Each row is a different sample; the columns show the original image, the segmentation mask, the heatmap, the feature outputs from H1 to H4, and their fusion.

## Inference Speed
Inference performance comparison of different methods on a single NVIDIA RTX 3070 (8 GB) GPU.
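For reproducing such measurements, a common pattern is to warm the model up and synchronize the GPU around a timed loop. The sketch below is a generic latency harness; the model and input shape are placeholders, not the SOWA pipeline.

```python
import time

import torch


def measure_latency(model: torch.nn.Module, x: torch.Tensor,
                    iters: int = 100) -> float:
    """Return the mean forward-pass latency in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):               # warm-up iterations
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()      # flush pending kernels first
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3


# Placeholder model and input; substitute the actual detector and size.
device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Conv2d(3, 8, 3).to(device)
x = torch.randn(1, 3, 224, 224, device=device)
print(f"{measure_latency(net, x):.2f} ms")
```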
## Citation
Please cite the following paper if this work helps your project:
```bibtex
@article{hu2024sowa,
  title={SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection},
  author={Hu, Zongxiang and Zhang, Zhaosheng},
  journal={arXiv preprint arXiv:2407.03634},
  year={2024}
}
```
