DGX Sentinel: Document Verifier (QLoRA Adapter)

Fine-tuned QLoRA adapter for Qwen2.5-VL-3B-Instruct to extract fields from multi-document packets (paystubs, utility bills, ID cards).

This is a LoRA adapter only - it requires the base model Qwen/Qwen2.5-VL-3B-Instruct to function.

🎯 Model Description

  • Base Model: Qwen/Qwen2.5-VL-3B-Instruct
  • Fine-tuning Method: QLoRA (4-bit quantization + LoRA)
  • Trainable Parameters: 23.5M (0.62% of the base model's parameters)
  • Training Data: 100 synthetic document sets with 4 documents each
  • Training Hardware: NVIDIA DGX Spark GB10 GPU (CUDA 13.0)
  • Training Time: 35 minutes
  • Final Training Loss: 0.0709

πŸ“Š Performance (Hybrid System)

The model is designed to work in a hybrid verification pipeline that combines ML extraction with rule-based verification:

Metric                 Target    Achieved   Status
Extraction Accuracy    ≥85%      99.2%      ✅
Cross-Document F1      ≥75%      100%       ✅
Hallucination Rate     ≤8%       0%         ✅

Key Insight: The model excels at field extraction (99.2%) but requires rule-based verification for cross-document inconsistency detection. The hybrid approach achieves 100% F1 on test data.
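
The rule-based half of the hybrid pipeline can be sketched with stdlib fuzzy matching. This is a minimal illustration only, not the repository's rule engine; the field names, normalization, and 0.85 threshold below are assumptions:

```python
from difflib import SequenceMatcher

def fields_consistent(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy-match two extracted field values after light normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

def cross_check(extractions: list[dict], field: str) -> bool:
    """True if the field agrees across every document in the packet."""
    values = [doc[field] for doc in extractions if field in doc]
    return all(fields_consistent(values[0], v) for v in values[1:])

packet = [
    {"full_name": "John Smith", "address": "123 Main St"},
    {"full_name": "John  Smith"},        # extra whitespace: still consistent
    {"full_name": "Jane Smith"},         # discrepancy: flagged
]
print(cross_check(packet, "full_name"))  # False — the packet is inconsistent
```

The model supplies the per-document extractions; a check like this decides whether the packet is internally consistent.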

πŸš€ Usage

Installation

pip install transformers peft pillow qwen-vl-utils

Loading the Model

from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch

# Load base model with 4-bit quantization
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto"
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(model, "kunikhanna/dgx-sentinel-qwen-3b")

# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

Inference Example

from qwen_vl_utils import process_vision_info
from PIL import Image

# Load your document images
paystub_1 = Image.open("paystub_1.jpg")
paystub_2 = Image.open("paystub_2.jpg")
utility_bill = Image.open("utility_bill.jpg")
id_card = Image.open("id_card.jpg")

# Prepare prompt
prompt = """Extract the following fields from these 4 documents:
1. Full Name
2. Monthly Gross Income
3. Residential Address
4. Issue Date

Output as JSON."""

# Create messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": paystub_1},
            {"type": "image", "image": paystub_2},
            {"type": "image", "image": utility_bill},
            {"type": "image", "image": id_card},
            {"type": "text", "text": prompt}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(
    output_ids, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)[0]

print(output_text)
# Output: {"full_name": "John Smith", "monthly_income": "5000", ...}
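
The generated text is not guaranteed to be bare JSON (models sometimes wrap it in a Markdown code fence), so a tolerant parser is useful in practice. This helper is an illustrative sketch, not part of the released code:

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Extract the first JSON object from model output, tolerating code fences."""
    # Strip ```json ... ``` fences if present
    text = re.sub(r"```(?:json)?", "", text)
    # Take the outermost {...} span and parse it
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])

fields = parse_model_json('```json\n{"full_name": "John Smith", "monthly_income": "5000"}\n```')
print(fields["full_name"])  # John Smith
```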

πŸ—οΈ Architecture

LoRA Configuration

{
  "r": 16,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "target_modules": [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "vision_proj"
  ],
  "peft_type": "LORA"
}

Note: Includes vision_proj in target modules for better multi-document visual understanding.
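
As a rough sanity check on this configuration: with lora_alpha equal to r, the LoRA scaling factor alpha/r is 1.0, and each adapted linear layer of shape d_out × d_in gains r·(d_in + d_out) trainable parameters. The hidden sizes below are illustrative, not the model's actual shapes:

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer:
    an r x d_in matrix A plus a d_out x r matrix B."""
    return r * d_in + d_out * r

# Example: a square 2048-dim projection with r = 16
print(lora_params(2048, 2048))  # 65536 extra trainable parameters

# Scaling factor applied to the LoRA update (alpha / r)
r, lora_alpha = 16, 16
print(lora_alpha / r)  # 1.0
```

Summed over all target modules, these per-layer additions account for the 23.5M trainable parameters reported above.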

πŸ“š Training Details

Dataset

  • Training Split: 100 synthetic document sets
  • Validation Split: 30 synthetic document sets
  • Test Split: 20 synthetic document sets
  • Document Types: Paystub (2x), Utility Bill (1x), ID Card (1x)
  • Inconsistency Rate: 32.7% of sets contain intentional discrepancies

Training Hyperparameters

learning_rate: 2e-4
num_epochs: 2
batch_size: 1
gradient_accumulation_steps: 4
warmup_steps: 10
weight_decay: 0.01
optimizer: adamw_torch
lr_scheduler: cosine
max_seq_length: 8192
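
Two notes on these settings: the effective batch size is batch_size × gradient_accumulation_steps = 4, and the schedule is linear warmup for 10 steps followed by cosine decay. The decay can be sketched as follows; the total step count here is illustrative:

```python
import math

def lr_at(step: int, base_lr: float = 2e-4, warmup: int = 10, total: int = 100) -> float:
    """Linear warmup followed by cosine decay to zero (a common 'cosine' schedule)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(5))    # mid-warmup: 1e-4
print(lr_at(10))   # peak: 2e-4
print(lr_at(100))  # end of training: 0.0
```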

Hardware

  • GPU: NVIDIA DGX Spark GB10
  • CUDA: 13.0.1
  • PyTorch: 2.10.0+cu130
  • Memory: ~11GB VRAM during training

πŸ” Limitations

  1. Comparison Logic: The model extracts fields accurately (99.2%) but does NOT perform cross-document verification autonomously. For production use, combine with rule-based verification.

  2. Synthetic Data: Trained exclusively on programmatically generated documents. Real-world performance may vary with actual scanned documents.

  3. Document Types: Optimized for paystubs, utility bills, and ID cards. Other document types require fine-tuning.

  4. Language: English-only training data.

πŸ“¦ Full System

The complete pipeline pairs this adapter with the rule-based components:

  • Demo: Streamlit UI included in the repository
  • Rule Engine: 400+ lines of fuzzy matching, normalization, and validation logic
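
The kind of normalization such a rule engine performs can be illustrated with a small address normalizer. The abbreviation table below is an assumption for illustration, not the repository's actual rules:

```python
import re

# Illustrative abbreviation table (a real rule engine would carry many more entries)
_ABBREV = {"st": "street", "ave": "avenue", "rd": "road", "apt": "apartment"}

def normalize_address(addr: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", addr.lower()).split()
    return " ".join(_ABBREV.get(t, t) for t in tokens)

a = normalize_address("123 Main St., Apt 4")
b = normalize_address("123 main street apartment 4")
print(a == b)  # True — the two spellings match after normalization
```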

πŸ“„ Citation

@software{dgx_sentinel_2026,
  title={DGX Sentinel: Document Verification System},
  author={Kunal Khanna},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kunikhanna/dgx-sentinel-qwen-3b}
}

πŸ“œ License

Apache 2.0 - Following the base model's license.

πŸ™ Acknowledgments

  • Base Model: Qwen Team for Qwen2.5-VL-3B-Instruct
  • Training Framework: Hugging Face PEFT and Transformers
  • Hardware: NVIDIA DGX Platform