jiang-cc/AD-Copilot

@@ -4,8 +4,6 @@ license: apache-2.0
 tags:
   - anomaly-detection
   - vision-language-model
-  - industrial-inspection
-  - comparison-aware
   - qwen2.5-vl
 pipeline_tag: image-text-to-text
 language:
@@ -14,25 +12,17 @@ base_model:
   - Qwen/Qwen2.5-VL-7B-Instruct
 ---
-# AD-Copilot: Comparison-Aware Anomaly Detection with Vision-Language Models
-AD-Copilot extends Qwen2.5-VL-7B with a novel **comparison-aware visual encoder** that generates
-special comparison tokens capturing differences between a reference image and a test image,
-achieving **state-of-the-art results** on industrial anomaly detection benchmarks.
-## Key Innovation
-- **ADCopilotCompareVisualEncoder**: Bidirectional cross-attention mechanism that compares reference and test images
-- **100 comparison tokens** per image pair, injected into the language model sequence
-- Achieves **78.74% accuracy** on OmniDiff benchmark (vs. 72.19% for base Qwen2.5-VL)
-## Links
-| Resource | Link |
-|----------|------|
-| **Paper** | [arXiv:2603.13779](https://arxiv.org/abs/2603.13779v1) |
-| **Code** | [GitHub](https://github.com/jam-cc/AD-Copilot) |
-| **Demo** | [HuggingFace Space](https://huggingface.co/spaces/jiang-cc/AD-Copilot) |
 ## Quick Start
@@ -43,40 +33,30 @@ from qwen_vl_utils import process_vision_info
 model = AutoModelForVision2Seq.from_pretrained(
     "jiang-cc/AD-Copilot",
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True,
 )
 processor = AutoProcessor.from_pretrained(
     "jiang-cc/AD-Copilot",
-    min_pixels=64 * 28 * 28,
-    max_pixels=1280 * 28 * 28,
-    trust_remote_code=True,
 )
-messages = [
-    {
-        "role": "user",
-        "content": [
-            {"type": "image", "image": "path/to/good_image.png"},
-            {"type": "image", "image": "path/to/test_image.png"},
-            {"type": "text", "text": "The first image is good. Is there any anomaly in the second image? A.yes, B.no. Please answer the letter only."},
-        ],
-    }
-]
 text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-image_inputs, video_inputs = process_vision_info(messages)
 inputs = processor(text=[text], images=[image_inputs], return_tensors="pt").to(model.device)
 with torch.inference_mode():
-    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
-trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
 print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
 ```
-## Benchmark Results (OmniDiff)
 | Model | Visited IAD | Avg ACC |
 |-------|-------------|---------|
@@ -87,10 +67,9 @@ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
 ## Architecture
-- **Base Model**: Qwen2.5-VL-7B-Instruct (28 layers, 3584 hidden size)
-- **Vision Encoder**: Qwen2.5-VL ViT (32 layers, 1280 hidden size)
 - **Comparison Encoder**: Bidirectional cross-attention + query decoder (100 tokens)
-- **Parameters**: ~8B total
 - **Dtype**: bfloat16
 ## Citation

 tags:
   - anomaly-detection
   - vision-language-model
   - qwen2.5-vl
 pipeline_tag: image-text-to-text
 language:
   - Qwen/Qwen2.5-VL-7B-Instruct
 ---
+# AD-Copilot
+Comparison-aware anomaly detection with vision-language models. AD-Copilot extends Qwen2.5-VL-7B with a **comparison-aware visual encoder** and reaches **78.74% accuracy** on the OmniDiff benchmark (vs. 72.19% for base Qwen2.5-VL).
+[Paper](https://arxiv.org/abs/2603.13779v1) | [Code](https://github.com/jam-cc/AD-Copilot) | [Demo](https://huggingface.co/spaces/jiang-cc/AD-Copilot)
+## Key Innovation
+- **ADCopilotCompareVisualEncoder**: Bidirectional cross-attention comparing reference and test images
+- **100 comparison tokens** per image pair injected into the language model
+- State-of-the-art on industrial anomaly detection benchmarks
 ## Quick Start
 model = AutoModelForVision2Seq.from_pretrained(
     "jiang-cc/AD-Copilot",
+    torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
 )
 processor = AutoProcessor.from_pretrained(
     "jiang-cc/AD-Copilot",
+    min_pixels=64*28*28, max_pixels=1280*28*28, trust_remote_code=True,
 )
+messages = [{"role": "user", "content": [
+    {"type": "image", "image": "good.png"},
+    {"type": "image", "image": "test.png"},
+    {"type": "text", "text": "The first image is good. Is there any anomaly in the second image? A.yes, B.no."},
+]}]
 text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+image_inputs, _ = process_vision_info(messages)
 inputs = processor(text=[text], images=[image_inputs], return_tensors="pt").to(model.device)
 with torch.inference_mode():
+    ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, ids)]
 print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
 ```
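Two details of the Quick Start snippet may be easier to see in isolation: what the `min_pixels` / `max_pixels` bounds mean in terms of visual tokens, and why the generated IDs are sliced before decoding. The sketch below is illustrative only. It assumes one visual token per 28×28 pixel block, which is how the Qwen2.5-VL processor is commonly described. The helper names `token_budget` and `trim_prompt` are hypothetical and not part of this repository.

```python
# Assumption: one visual token per 28x28 pixel block (14-pixel patches merged 2x2).
PIXELS_PER_TOKEN = 28 * 28


def token_budget(min_pixels: int, max_pixels: int) -> tuple[int, int]:
    """Return the approximate (min, max) number of visual tokens per image."""
    return min_pixels // PIXELS_PER_TOKEN, max_pixels // PIXELS_PER_TOKEN


def trim_prompt(input_ids: list[list[int]], output_ids: list[list[int]]) -> list[list[int]]:
    """Drop the echoed prompt tokens so only newly generated tokens remain."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


if __name__ == "__main__":
    # The bounds used in the Quick Start call to AutoProcessor.from_pretrained.
    print(token_budget(64 * 28 * 28, 1280 * 28 * 28))  # (64, 1280)
    # Toy token IDs (hypothetical values, not real vocabulary entries).
    print(trim_prompt([[1, 2, 3]], [[1, 2, 3, 42, 7]]))  # [[42, 7]]
```

Under that assumption, each image is resized to occupy roughly 64 to 1280 visual tokens, and the slice `o[len(i):]` keeps only the answer because `generate` returns the prompt followed by the continuation.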
+## Results (OmniDiff Benchmark)
 | Model | Visited IAD | Avg ACC |
 |-------|-------------|---------|
 ## Architecture
+- **Base**: Qwen2.5-VL-7B-Instruct (28 layers, 3584 hidden)
+- **Vision**: Qwen2.5-VL ViT (32 layers, 1280 hidden)
 - **Comparison Encoder**: Bidirectional cross-attention + query decoder (100 tokens)
 - **Dtype**: bfloat16
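To make the "bidirectional cross-attention + query decoder" description more concrete, here is a toy, dependency-free sketch of the general pattern. It is not the actual `ADCopilotCompareVisualEncoder`: it has no learned projections, no multiple heads, and no normalization. The feature dimension, patch counts, and the function names `attend` and `compare` are assumptions made for illustration. Only the count of 100 output tokens comes from the model card.

```python
import math
import random

Vector = list[float]
Matrix = list[Vector]  # one row per token


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def compare(reference: Matrix, test: Matrix, queries: Matrix) -> Matrix:
    """Produce one comparison token per query from a reference/test pair."""
    # Bidirectional cross-attention: each image's tokens attend to the other's.
    ref_ctx = add(reference, attend(reference, test, test))
    test_ctx = add(test, attend(test, reference, reference))
    # Query decoder: a fixed set of queries reads from both contextualized images.
    memory = ref_ctx + test_ctx
    return attend(queries, memory, memory)


def random_tokens(rng: random.Random, count: int, dim: int) -> Matrix:
    """Hypothetical stand-in for patch features or learned queries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_patches, n_queries = 8, 16, 100  # dim and n_patches are toy values
    tokens = compare(random_tokens(rng, n_patches, dim),
                     random_tokens(rng, n_patches, dim),
                     random_tokens(rng, n_queries, dim))
    print(len(tokens), len(tokens[0]))  # 100 8
```

The point of the pattern is that the number of output tokens depends only on the number of queries, so an image pair of any resolution yields a fixed-length block that can be spliced into the language model's input sequence.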
 ## Citation