jiang-cc committed · Commit 2ae8982 · verified · 1 Parent(s): 366408f

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +20 -41

README.md CHANGED
@@ -4,8 +4,6 @@ license: apache-2.0
 tags:
 - anomaly-detection
 - vision-language-model
-- industrial-inspection
-- comparison-aware
 - qwen2.5-vl
 pipeline_tag: image-text-to-text
 language:
@@ -14,25 +12,17 @@ base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
-# AD-Copilot: Comparison-Aware Anomaly Detection with Vision-Language Models
 
-AD-Copilot extends Qwen2.5-VL-7B with a novel **comparison-aware visual encoder** that generates
-special comparison tokens capturing differences between a reference image and a test image,
-achieving **state-of-the-art results** on industrial anomaly detection benchmarks.
 
-## Key Innovation
-
-- **ADCopilotCompareVisualEncoder**: Bidirectional cross-attention mechanism that compares reference and test images
-- **100 comparison tokens** per image pair, injected into the language model sequence
-- Achieves **78.74% accuracy** on OmniDiff benchmark (vs. 72.19% for base Qwen2.5-VL)
 
-## Links
 
-| Resource | Link |
-|----------|------|
-| **Paper** | [arXiv:2603.13779](https://arxiv.org/abs/2603.13779v1) |
-| **Code** | [GitHub](https://github.com/jam-cc/AD-Copilot) |
-| **Demo** | [HuggingFace Space](https://huggingface.co/spaces/jiang-cc/AD-Copilot) |
 
 ## Quick Start
 
@@ -43,40 +33,30 @@ from qwen_vl_utils import process_vision_info
 
 model = AutoModelForVision2Seq.from_pretrained(
     "jiang-cc/AD-Copilot",
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True,
 )
 processor = AutoProcessor.from_pretrained(
     "jiang-cc/AD-Copilot",
-    min_pixels=64 * 28 * 28,
-    max_pixels=1280 * 28 * 28,
-    trust_remote_code=True,
 )
 
-messages = [
-    {
-        "role": "user",
-        "content": [
-            {"type": "image", "image": "path/to/good_image.png"},
-            {"type": "image", "image": "path/to/test_image.png"},
-            {"type": "text", "text": "The first image is good. Is there any anomaly in the second image? A.yes, B.no. Please answer the letter only."},
-        ],
-    }
-]
 
 text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-image_inputs, video_inputs = process_vision_info(messages)
 inputs = processor(text=[text], images=[image_inputs], return_tensors="pt").to(model.device)
 
 with torch.inference_mode():
-    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
-
-    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
 print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
 ```
 
-## Benchmark Results (OmniDiff)
 
 | Model | Visited IAD | Avg ACC |
 |-------|-------------|---------|
@@ -87,10 +67,9 @@ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
 
 ## Architecture
 
-- **Base Model**: Qwen2.5-VL-7B-Instruct (28 layers, 3584 hidden size)
-- **Vision Encoder**: Qwen2.5-VL ViT (32 layers, 1280 hidden size)
 - **Comparison Encoder**: Bidirectional cross-attention + query decoder (100 tokens)
-- **Parameters**: ~8B total
 - **Dtype**: bfloat16
 
 ## Citation
 
 tags:
 - anomaly-detection
 - vision-language-model
 - qwen2.5-vl
 pipeline_tag: image-text-to-text
 language:
 
 - Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
+# AD-Copilot
 
+Comparison-aware anomaly detection with vision-language models. Extends Qwen2.5-VL-7B with a novel **comparison-aware visual encoder**, achieving **78.74%** on the OmniDiff benchmark.
 
+[Paper](https://arxiv.org/abs/2603.13779v1) | [Code](https://github.com/jam-cc/AD-Copilot) | [Demo](https://huggingface.co/spaces/jiang-cc/AD-Copilot)
 
+## Key Innovation
 
+- **ADCopilotCompareVisualEncoder**: Bidirectional cross-attention comparing reference and test images
+- **100 comparison tokens** per image pair, injected into the language model
+- State-of-the-art on industrial anomaly detection benchmarks
 
 
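The bidirectional cross-attention idea above can be illustrated at a shape level: reference-image features attend to test-image features and vice versa, and the two directions are fused. A minimal single-head NumPy sketch (names, dimensions, and the fusion step are illustrative assumptions, not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats):
    # Single-head scaled dot-product attention: queries come from one image,
    # keys/values from the other (learned projections omitted for brevity).
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (Nq, Nk)
    return softmax(scores) @ kv_feats            # (Nq, d)

rng = np.random.default_rng(0)
ref = rng.standard_normal((256, 64))    # reference-image patch features
test = rng.standard_normal((256, 64))   # test-image patch features

# Bidirectional: each side attends to the other, then the results are fused.
ref_to_test = cross_attend(ref, test)
test_to_ref = cross_attend(test, ref)
fused = np.concatenate([ref_to_test, test_to_ref], axis=0)
print(fused.shape)  # (512, 64)
```

Attending in both directions lets differences surface symmetrically: a defect present only in the test image perturbs both attention maps, whereas one-directional attention would only see it from one side.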
 ## Quick Start
 
 
33
 
34
  model = AutoModelForVision2Seq.from_pretrained(
35
  "jiang-cc/AD-Copilot",
36
+ torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
 
 
37
  )
38
  processor = AutoProcessor.from_pretrained(
39
  "jiang-cc/AD-Copilot",
40
+ min_pixels=64*28*28, max_pixels=1280*28*28, trust_remote_code=True,
 
 
41
  )
42
 
43
+ messages = [{"role": "user", "content": [
44
+ {"type": "image", "image": "good.png"},
45
+ {"type": "image", "image": "test.png"},
46
+ {"type": "text", "text": "The first image is good. Is there any anomaly in the second image? A.yes, B.no."},
47
+ ]}]
 
 
 
 
 
48
 
49
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
50
+ image_inputs, _ = process_vision_info(messages)
51
  inputs = processor(text=[text], images=[image_inputs], return_tensors="pt").to(model.device)
52
 
53
  with torch.inference_mode():
54
+ ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
55
+ trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, ids)]
 
56
  print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
57
  ```
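The prompt constrains the reply to a multiple-choice letter, so downstream code typically maps the decoded string to a binary anomaly label. A small helper (our naming, not part of the repository; free-form replies would need extra handling):

```python
def parse_anomaly_answer(decoded: str) -> bool:
    """Map the model's 'A'/'B' reply to True (anomaly) / False (normal)."""
    reply = decoded.strip().upper()
    if reply.startswith("A"):
        return True   # "A.yes" -> anomaly present
    if reply.startswith("B"):
        return False  # "B.no"  -> no anomaly
    raise ValueError(f"Unexpected answer: {decoded!r}")

print(parse_anomaly_answer("A"))       # True
print(parse_anomaly_answer(" b.no "))  # False
```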
 
+## Results (OmniDiff Benchmark)
 
 | Model | Visited IAD | Avg ACC |
 |-------|-------------|---------|
 
 ## Architecture
 
+- **Base**: Qwen2.5-VL-7B-Instruct (28 layers, 3584 hidden)
+- **Vision**: Qwen2.5-VL ViT (32 layers, 1280 hidden)
 - **Comparison Encoder**: Bidirectional cross-attention + query decoder (100 tokens)
 - **Dtype**: bfloat16
 
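At a shape level, the query-decoder stage listed above has to bridge the two hidden sizes: 100 learned queries pool the fused vision features (1280-dim) and a projection maps them into the LM hidden size (3584) for injection. A NumPy sketch using the dimensions from the list (the pooling and projection here are illustrative assumptions, not the released code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vision_dim, lm_dim, n_queries = 1280, 3584, 100

comparison_feats = rng.standard_normal((512, vision_dim))  # fused ref/test features
queries = rng.standard_normal((n_queries, vision_dim))     # 100 learned query tokens

# Query decoder: each query cross-attends over all comparison features...
attn = softmax(queries @ comparison_feats.T / np.sqrt(vision_dim))  # (100, 512)
pooled = attn @ comparison_feats                                    # (100, 1280)

# ...then a learned projection lifts the tokens to the LM hidden size,
# ready to be concatenated into the language-model input sequence.
proj = rng.standard_normal((vision_dim, lm_dim)) / np.sqrt(vision_dim)
comparison_tokens = pooled @ proj
print(comparison_tokens.shape)  # (100, 3584)
```

Fixing the token budget at 100 keeps the injected sequence length constant regardless of image resolution, so the LM's context cost for the comparison signal is predictable.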
 ## Citation