# DGX Sentinel: Document Verifier (QLoRA Adapter)

A fine-tuned QLoRA adapter for Qwen2.5-VL-3B-Instruct that extracts fields from multi-document packets (paystubs, utility bills, ID cards).

**This is a LoRA adapter only** - it requires the base model [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) to function.
## Model Description
- Base Model: Qwen/Qwen2.5-VL-3B-Instruct
- Fine-tuning Method: QLoRA (4-bit quantization + LoRA)
- Trainable Parameters: 23.5M (0.62% of base model)
- Training Data: 100 synthetic document sets with 4 documents each
- Training Hardware: NVIDIA DGX Spark GB10 GPU (CUDA 13.0)
- Training Time: 35 minutes
- Final Training Loss: 0.0709
## Performance (Hybrid System)

The model is designed to work within a hybrid verification pipeline that combines ML extraction with rule-based verification:
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Extraction Accuracy | ≥85% | 99.2% | ✅ |
| Cross-Document F1 | ≥75% | 100% | ✅ |
| Hallucination Rate | ≤8% | 0% | ✅ |
**Key Insight:** The model excels at field extraction (99.2%) but requires rule-based verification for cross-document inconsistency detection. The hybrid approach achieves 100% F1 on test data.
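To illustrate the rule-based half of the pipeline, a minimal cross-document consistency check could compare the values one field takes across documents with a fuzzy string match. This is a sketch, not the shipped rule engine; the function name and the 0.95 similarity threshold are illustrative choices.

```python
from difflib import SequenceMatcher

def fields_consistent(values, threshold=0.95):
    """Return True if every extracted value fuzzy-matches the first one."""
    reference = values[0].strip().lower()
    for value in values[1:]:
        ratio = SequenceMatcher(None, reference, value.strip().lower()).ratio()
        if ratio < threshold:
            return False
    return True

# Values of one field ("full_name") extracted from each of the 4 documents
names = ["John Smith", "John Smith", "JOHN SMITH", "Jon Smith"]
print(fields_consistent(names))  # False: "Jon Smith" falls below the threshold
```

Case differences are normalized away, so only genuine spelling discrepancies (like the dropped "h" above) are flagged as potential inconsistencies.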
## Usage

### Installation

```bash
pip install transformers peft bitsandbytes accelerate pillow qwen-vl-utils
```

(`bitsandbytes` is required for 4-bit quantization and `accelerate` for `device_map="auto"`.)
### Loading the Model

```python
import torch
from peft import PeftModel
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2_5_VLForConditionalGeneration,
)

# Load base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load the fine-tuned adapter on top of the quantized base model
model = PeftModel.from_pretrained(model, "kunikhanna/dgx-sentinel-qwen-3b")

# Load the processor (tokenizer + image preprocessor)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
```

Note that Qwen2.5-VL uses the `Qwen2_5_VLForConditionalGeneration` class (not `Qwen2VLForConditionalGeneration`, which belongs to the older Qwen2-VL).
### Inference Example

```python
from PIL import Image
from qwen_vl_utils import process_vision_info

# Load your document images
paystub_1 = Image.open("paystub_1.jpg")
paystub_2 = Image.open("paystub_2.jpg")
utility_bill = Image.open("utility_bill.jpg")
id_card = Image.open("id_card.jpg")

# Prepare prompt
prompt = """Extract the following fields from these 4 documents:
1. Full Name
2. Monthly Gross Income
3. Residential Address
4. Issue Date
Output as JSON."""

# Create a single user message containing all 4 images plus the prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": paystub_1},
            {"type": "image", "image": paystub_2},
            {"type": "image", "image": utility_bill},
            {"type": "image", "image": id_card},
            {"type": "text", "text": prompt},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)

# Trim the prompt tokens so only the newly generated answer is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(output_text)
# Example output: {"full_name": "John Smith", "monthly_income": "5000", ...}
```
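Generated text may wrap the JSON in markdown fences or surround it with extra tokens. A small post-processing helper (not part of the released code; the function name is illustrative) can make parsing robust:

```python
import json
import re

def parse_model_json(text):
    """Extract the first JSON object from model output, tolerating ```json fences."""
    # Strip markdown code fences if present
    text = re.sub(r"```(?:json)?", "", text)
    # Grab the outermost {...} span, allowing newlines inside
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

raw = '```json\n{"full_name": "John Smith", "monthly_income": "5000"}\n```'
print(parse_model_json(raw)["full_name"])  # John Smith
```

Raising on missing JSON (rather than returning an empty dict) makes extraction failures visible to the downstream verification stage.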
## Architecture

### LoRA Configuration

```json
{
  "r": 16,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "target_modules": [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "vision_proj"
  ],
  "peft_type": "LORA"
}
```
**Note:** `vision_proj` is included in the target modules for better multi-document visual understanding.
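The JSON above maps onto a `peft` `LoraConfig` roughly as follows. This is a sketch of the setup, not the exact training script; in particular, the `task_type` value is an assumption (check it against your `peft` version).

```python
from peft import LoraConfig, get_peft_model

# LoRA configuration matching this adapter's shipped settings
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "vision_proj",
    ],
    task_type="CAUSAL_LM",  # assumption, not confirmed by the adapter config
)

# `base_model` is the 4-bit Qwen2.5-VL model loaded as in the Usage section
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()  # should report ~23.5M trainable params
```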
## Training Details

### Dataset
- Training Split: 100 synthetic document sets
- Validation Split: 30 synthetic document sets
- Test Split: 20 synthetic document sets
- Document Types: Paystub (2x), Utility Bill (1x), ID Card (1x)
- Inconsistency Rate: 32.7% of sets contain intentional discrepancies
### Training Hyperparameters

```yaml
learning_rate: 2e-4
num_epochs: 2
batch_size: 1
gradient_accumulation_steps: 4
warmup_steps: 10
weight_decay: 0.01
optimizer: adamw_torch
lr_scheduler: cosine
max_seq_length: 8192
```
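These hyperparameters translate into Hugging Face `TrainingArguments` along these lines. This is a sketch rather than the exact training script: the output path is a placeholder, and `max_seq_length` is typically passed to the trainer (e.g. `trl`'s `SFTTrainer`) rather than to `TrainingArguments`.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./dgx-sentinel-qwen-3b",  # placeholder path
    learning_rate=2e-4,
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4
    warmup_steps=10,
    weight_decay=0.01,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    bf16=True,
)
```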
### Hardware
- GPU: NVIDIA DGX Spark GB10
- CUDA: 13.0.1
- PyTorch: 2.10.0+cu130
- Memory: ~11GB VRAM during training
## Limitations

- **Comparison Logic:** The model extracts fields accurately (99.2%) but does NOT perform cross-document verification autonomously. For production use, combine it with rule-based verification.
- **Synthetic Data:** Trained exclusively on programmatically generated documents; real-world performance may vary on actual scanned documents.
- **Document Types:** Optimized for paystubs, utility bills, and ID cards. Other document types require further fine-tuning.
- **Language:** English-only training data.
## Full System
For the complete verification pipeline with rule-based verification:
- Demo: Streamlit UI included in repository
- Rule Engine: 400+ lines of fuzzy matching, normalization, validation
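As a taste of what the rule engine's normalization step does, a minimal address normalizer might lowercase, strip punctuation, and expand common abbreviations before comparison. The abbreviation table and function name here are hypothetical, far smaller than the 400+ line engine.

```python
import re

# Hypothetical abbreviation table in the spirit of the full rule engine
ABBREVIATIONS = {"st": "street", "ave": "avenue", "apt": "apartment", "rd": "road"}

def normalize_address(address):
    """Lowercase, strip punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

a = normalize_address("123 Main St., Apt 4B")
b = normalize_address("123 MAIN STREET APARTMENT 4B")
print(a == b)  # True: both normalize to "123 main street apartment 4b"
```

Normalizing before fuzzy matching keeps superficial formatting differences (case, punctuation, abbreviations) from being flagged as cross-document inconsistencies.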
## Citation

```bibtex
@software{dgx_sentinel_2026,
  title={DGX Sentinel: Document Verification System},
  author={Kunal Khanna},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kunikhanna/dgx-sentinel-qwen-3b}
}
```
## License

Apache 2.0, following the base model's license.
## Acknowledgments
- Base Model: Qwen Team for Qwen2.5-VL-3B-Instruct
- Training Framework: Hugging Face PEFT and Transformers
- Hardware: NVIDIA DGX Platform