⚡ v4.3: Performance optimizations — ONNX INT8, BGE embedder, batched classification, thread control

4 Performance Optimizations for v4.3

1. ⚡ ONNX + INT8 Quantization Support (2-4x faster inference)

  • New ml/export_onnx_v2.py — full pipeline: merge LoRA → ONNX export → dynamic INT8 quantization (sketched below)
  • app.py now tries the quantized ONNX model first (via the ONNX_MODEL_PATH env var or the Hub repo gaurv007/clauseguard-onnx-int8) and falls back to the PyTorch PEFT model
  • Added optimum[onnxruntime] to requirements.txt
  • Expected: 2-4x faster clause classification, ~75% smaller model file
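
For reference, here is a minimal sketch of what such a merge → export → quantize pipeline looks like with optimum. The adapter repo name and the avx2 quantization config are placeholders/assumptions, not necessarily what ml/export_onnx_v2.py actually uses.

```python
# Sketch: merge LoRA adapter -> export to ONNX -> dynamic INT8 quantization.
# "gaurv007/clauseguard-lora" and the avx2 config are illustrative assumptions.
from peft import AutoPeftModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# 1. Merge the LoRA adapter into the base classifier
merged = AutoPeftModelForSequenceClassification.from_pretrained(
    "gaurv007/clauseguard-lora"
).merge_and_unload()
merged.save_pretrained("merged_model")

# 2. Export the merged PyTorch model to ONNX
ort_model = ORTModelForSequenceClassification.from_pretrained("merged_model", export=True)
ort_model.save_pretrained("onnx_legalbert")

# 3. Apply dynamic INT8 quantization (activations quantized on the fly at runtime)
quantizer = ORTQuantizer.from_pretrained("onnx_legalbert")
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx_legalbert_int8", quantization_config=qconfig)
```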

2. 🎯 Upgraded Embedder: BGE-small-en-v1.5 (+21% retrieval accuracy)

  • Replaced all-MiniLM-L6-v2 (MTEB retrieval ~42.7) with BAAI/bge-small-en-v1.5 (MTEB retrieval ~51.7)
  • Same 384-dim output, same inference latency
  • Added the BGE query instruction prefix for asymmetric retrieval in the chatbot (see the sketch below)
  • Updated in both chatbot.py and compare.py
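
For illustration, a minimal sketch of asymmetric retrieval with the BGE query prefix; the chunk texts and scoring here are toy stand-ins for what retrieve_chunks() does over the parsed contract.

```python
# Sketch of BGE asymmetric retrieval: prefix the query, not the passages.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Instruction prefix recommended for bge-*-en-v1.5 short-query -> passage retrieval
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

chunks = [
    "Either party may terminate this Agreement upon thirty (30) days written notice.",
    "The Vendor shall indemnify the Client against third-party IP claims.",
]
chunk_embs = embedder.encode(chunks, normalize_embeddings=True)

query = "What is the termination notice period?"
query_emb = embedder.encode(BGE_QUERY_PREFIX + query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, chunk_embs)[0]
best = int(scores.argmax())
print(round(float(scores[best]), 3), chunks[best])
```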

3. 🚀 Batched Clause Classification (2-3x throughput)

  • New classify_cuad_batch() processes clauses in batched forward passes of up to 8 at a time (sketched below)
  • Replaces sequential classify_cuad() loop in analyze_contract()
  • Includes cache-aware batching: checks prediction cache first, only processes uncached clauses
  • Graceful fallback to regex on any batch error
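
A simplified, self-contained sketch of the cache-aware batching idea; the cache, label list, and error handling are stand-ins for the real implementation in app.py.

```python
# Sketch: serve cached predictions first, run the rest through batched forward passes.
import torch

_PRED_CACHE: dict = {}  # clause text -> predicted CUAD label (illustrative cache)

def classify_cuad_batch(clauses, tokenizer, model, labels, batch_size=8):
    results = {c: _PRED_CACHE[c] for c in clauses if c in _PRED_CACHE}
    uncached = [c for c in clauses if c not in results]

    for i in range(0, len(uncached), batch_size):
        batch = uncached[i : i + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits  # one forward pass for the whole batch
        for text, idx in zip(batch, logits.argmax(dim=-1).tolist()):
            results[text] = _PRED_CACHE[text] = labels[idx]

    return [results[c] for c in clauses]
```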

4. 🧵 CPU Thread Control

  • torch.set_num_threads(2) + torch.set_num_interop_threads(1) at import time (see the snippet below)
  • Prevents CPU thrashing when multiple Gradio users hit the Space simultaneously
  • Matched to HF Spaces CPU-basic (2 vCPUs)
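
Concretely, the two calls run once at module import time, before any inference:

```python
import torch

# Match HF Spaces CPU-basic (2 vCPUs): 2 intra-op threads, no inter-op parallelism.
torch.set_num_threads(2)
torch.set_num_interop_threads(1)
```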

Files Changed

| File | Changes |
| --- | --- |
| app.py | ONNX model loading (tries ONNX first, falls back to PyTorch), batched classify_cuad_batch(), torch.set_num_threads(2), optimum import |
| chatbot.py | Embedder → BAAI/bge-small-en-v1.5; BGE query instruction prefix in retrieve_chunks() |
| compare.py | Embedder → BAAI/bge-small-en-v1.5 |
| requirements.txt | Added optimum[onnxruntime]>=1.23.0 |
| README.md | v4.3 changelog, updated models table |
| ml/export_onnx_v2.py | NEW: full ONNX export + INT8 quantization pipeline for the CUAD classifier |

How to use ONNX acceleration

Option A: Export yourself (recommended)

```bash
cd ml/
pip install "optimum[onnxruntime]" peft
python export_onnx_v2.py
# Pushes the quantized model to gaurv007/clauseguard-onnx-int8
```

Option B: Set env var in Space settings

```bash
ONNX_MODEL_PATH=./onnx_legalbert_int8
# or
ONNX_HUB_MODEL_ID=gaurv007/clauseguard-onnx-int8
```

The app automatically tries ONNX first and falls back to PyTorch if it is unavailable, as sketched below.
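
A rough sketch of that loading order; env var handling is simplified and the PyTorch fallback checkpoint name is a placeholder, not the actual repo app.py uses.

```python
import os

def load_clause_classifier():
    """Try the quantized ONNX model first; fall back to the PyTorch PEFT checkpoint."""
    onnx_source = os.getenv("ONNX_MODEL_PATH") or os.getenv(
        "ONNX_HUB_MODEL_ID", "gaurv007/clauseguard-onnx-int8"
    )
    try:
        from optimum.onnxruntime import ORTModelForSequenceClassification
        return ORTModelForSequenceClassification.from_pretrained(onnx_source), "onnx-int8"
    except Exception:
        from peft import AutoPeftModelForSequenceClassification
        # Placeholder repo name; app.py loads its existing PEFT checkpoint here.
        return (
            AutoPeftModelForSequenceClassification.from_pretrained("gaurv007/clauseguard-peft"),
            "pytorch-peft",
        )
```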

Expected performance improvements

| Optimization | Speedup | RAM saved |
| --- | --- | --- |
| ONNX INT8 quantization | 2-4x per inference | ~75% smaller model file |
| Batched classification (8 clauses/pass) | 2-3x throughput | - |
| BGE embedder | Same speed, +21% retrieval accuracy | - |
| Thread control | Prevents thrashing under concurrent load | - |
| Combined | ~4-8x faster on a typical 50-clause contract | - |