DGX Sentinel: Document Verifier (QLoRA Adapter)

Fine-tuned QLoRA adapter for Qwen2.5-VL-3B-Instruct to extract fields from multi-document packets (paystubs, utility bills, ID cards).

This is a LoRA adapter only - it requires the base model Qwen/Qwen2.5-VL-3B-Instruct to function.

🎯 Model Description

  • Base Model: Qwen/Qwen2.5-VL-3B-Instruct
  • Fine-tuning Method: QLoRA (4-bit quantization + LoRA)
  • Trainable Parameters: 23.5M (0.62% of the base model's parameters)
  • Training Data: 100 synthetic document sets with 4 documents each
  • Training Hardware: NVIDIA DGX Spark GB10 GPU (CUDA 13.0)
  • Training Time: 35 minutes
  • Final Training Loss: 0.0709

πŸ“Š Performance (Hybrid System)

The model is designed to work in a hybrid verification pipeline that combines ML extraction with rule-based verification:

Metric                 Target    Achieved   Status
Extraction Accuracy    ≥85%      99.2%      ✅
Cross-Document F1      ≥75%      100%       ✅
Hallucination Rate     ≤8%       0%         ✅

Key Insight: The model excels at field extraction (99.2%) but requires rule-based verification for cross-document inconsistency detection. The hybrid approach achieves 100% F1 on test data.
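
The rule-based half of the hybrid pipeline can be sketched with stdlib fuzzy matching. This is a minimal illustration only, not the repository's rule engine; the field names, normalization, and 0.85 threshold below are assumptions:

```python
from difflib import SequenceMatcher

def fields_consistent(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy-match two extracted field values after light normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

def cross_check(extractions: list[dict], field: str) -> bool:
    """True if the field agrees across every document in the packet."""
    values = [doc[field] for doc in extractions if field in doc]
    return all(fields_consistent(values[0], v) for v in values[1:])

packet = [
    {"full_name": "John Smith", "address": "123 Main St"},
    {"full_name": "John  Smith"},        # extra whitespace: still consistent
    {"full_name": "Jane Smith"},         # discrepancy: flagged
]
print(cross_check(packet, "full_name"))  # False — the packet is inconsistent
```

The model supplies the per-document extractions; a check like this decides whether the packet is internally consistent.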

πŸš€ Usage

Installation

pip install transformers peft pillow qwen-vl-utils

Loading the Model

from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch

# Load base model with 4-bit quantization
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto"
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(model, "kunikhanna/dgx-sentinel-qwen-3b")

# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

Inference Example

from qwen_vl_utils import process_vision_info
from PIL import Image

# Load your document images
paystub_1 = Image.open("paystub_1.jpg")
paystub_2 = Image.open("paystub_2.jpg")
utility_bill = Image.open("utility_bill.jpg")
id_card = Image.open("id_card.jpg")

# Prepare prompt
prompt = """Extract the following fields from these 4 documents:
1. Full Name
2. Monthly Gross Income
3. Residential Address
4. Issue Date

Output as JSON."""

# Create messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": paystub_1},
            {"type": "image", "image": paystub_2},
            {"type": "image", "image": utility_bill},
            {"type": "image", "image": id_card},
            {"type": "text", "text": prompt}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(
    output_ids, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)[0]

print(output_text)
# Output: {"full_name": "John Smith", "monthly_income": "5000", ...}
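
The generated text is not guaranteed to be bare JSON (models sometimes wrap it in a Markdown code fence), so a tolerant parser is useful in practice. This helper is an illustrative sketch, not part of the released code:

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Extract the first JSON object from model output, tolerating code fences."""
    # Strip ```json ... ``` fences if present
    text = re.sub(r"```(?:json)?", "", text)
    # Take the outermost {...} span and parse it
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])

fields = parse_model_json('```json\n{"full_name": "John Smith", "monthly_income": "5000"}\n```')
print(fields["full_name"])  # John Smith
```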

πŸ—οΈ Architecture

LoRA Configuration

{
  "r": 16,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "target_modules": [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "vision_proj"
  ],
  "peft_type": "LORA"
}

Note: Includes vision_proj in target modules for better multi-document visual understanding.
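
As a rough sanity check on this configuration: with lora_alpha equal to r, the LoRA scaling factor alpha/r is 1.0, and each adapted linear layer of shape d_out × d_in gains r·(d_in + d_out) trainable parameters. The hidden sizes below are illustrative, not the model's actual shapes:

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer:
    an r x d_in matrix A plus a d_out x r matrix B."""
    return r * d_in + d_out * r

# Example: a square 2048-dim projection with r = 16
print(lora_params(2048, 2048))  # 65536 extra trainable parameters

# Scaling factor applied to the LoRA update (alpha / r)
r, lora_alpha = 16, 16
print(lora_alpha / r)  # 1.0
```

Summed over all target modules, these per-layer additions account for the 23.5M trainable parameters reported above.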

πŸ“š Training Details

Dataset

  • Training Split: 100 synthetic document sets
  • Validation Split: 30 synthetic document sets
  • Test Split: 20 synthetic document sets
  • Document Types: Paystub (2x), Utility Bill (1x), ID Card (1x)
  • Inconsistency Rate: 32.7% of sets contain intentional discrepancies

Training Hyperparameters

learning_rate: 2e-4
num_epochs: 2
batch_size: 1
gradient_accumulation_steps: 4
warmup_steps: 10
weight_decay: 0.01
optimizer: adamw_torch
lr_scheduler: cosine
max_seq_length: 8192
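
Two notes on these settings: the effective batch size is batch_size × gradient_accumulation_steps = 4, and the schedule is linear warmup for 10 steps followed by cosine decay. The decay can be sketched as follows; the total step count here is illustrative:

```python
import math

def lr_at(step: int, base_lr: float = 2e-4, warmup: int = 10, total: int = 100) -> float:
    """Linear warmup followed by cosine decay to zero (a common 'cosine' schedule)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(5))    # mid-warmup: 1e-4
print(lr_at(10))   # peak: 2e-4
print(lr_at(100))  # end of training: 0.0
```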

Hardware

  • GPU: NVIDIA DGX Spark GB10
  • CUDA: 13.0.1
  • PyTorch: 2.10.0+cu130
  • Memory: ~11GB VRAM during training

πŸ” Limitations

  1. Comparison Logic: The model extracts fields accurately (99.2%) but does NOT perform cross-document verification autonomously. For production use, combine with rule-based verification.

  2. Synthetic Data: Trained exclusively on programmatically generated documents. Real-world performance may vary with actual scanned documents.

  3. Document Types: Optimized for paystubs, utility bills, and ID cards. Other document types require fine-tuning.

  4. Language: English-only training data.

πŸ“¦ Full System

The complete pipeline pairs this adapter with the rule-based components:

  • Demo: Streamlit UI included in the repository
  • Rule Engine: 400+ lines of fuzzy matching, normalization, and validation logic
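
The kind of normalization such a rule engine performs can be illustrated with a small address normalizer. The abbreviation table below is an assumption for illustration, not the repository's actual rules:

```python
import re

# Illustrative abbreviation table (a real rule engine would carry many more entries)
_ABBREV = {"st": "street", "ave": "avenue", "rd": "road", "apt": "apartment"}

def normalize_address(addr: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", addr.lower()).split()
    return " ".join(_ABBREV.get(t, t) for t in tokens)

a = normalize_address("123 Main St., Apt 4")
b = normalize_address("123 main street apartment 4")
print(a == b)  # True — the two spellings match after normalization
```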

πŸ“„ Citation

@software{dgx_sentinel_2026,
  title={DGX Sentinel: Document Verification System},
  author={Kunal Khanna},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kunikhanna/dgx-sentinel-qwen-3b}
}

πŸ“œ License

Apache 2.0 - Following the base model's license.

πŸ™ Acknowledgments

  • Base Model: Qwen Team for Qwen2.5-VL-3B-Instruct
  • Training Framework: Hugging Face PEFT and Transformers
  • Hardware: NVIDIA DGX Platform