⚡ v4.3: Performance optimizations — ONNX INT8, BGE embedder, batched classification, thread control
#4 · opened by gaurv007
4 Performance Optimizations for v4.3
1. ⚡ ONNX + INT8 Quantization Support (2-4x faster inference)
- New `ml/export_onnx_v2.py` — full pipeline: merge LoRA → ONNX export → dynamic INT8 quantization (see the sketch after this list)
- `app.py` now tries the ONNX quantized model FIRST (via the `ONNX_MODEL_PATH` env var or Hub `gaurv007/clauseguard-onnx-int8`), and falls back to PyTorch PEFT
- Added `optimum[onnxruntime]` to `requirements.txt`
- Expected: 2-4x faster clause classification, ~75% smaller model file
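For reference, here is a minimal sketch of that merge → export → quantize pipeline using the standard `optimum.onnxruntime` API. The base checkpoint name, adapter path, and label count are placeholders, not the repo's actual values.

```python
# Sketch: merge LoRA -> export to ONNX -> dynamic INT8 quantization.
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

BASE_MODEL = "nlpaueb/legal-bert-base-uncased"   # assumed base checkpoint
ADAPTER_DIR = "./cuad_lora_adapter"              # placeholder LoRA adapter path
MERGED_DIR = "./merged_fp32"
ONNX_DIR = "./onnx_legalbert_int8"

# 1. Merge the LoRA weights into the base model so a plain ONNX export works
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=41)                   # placeholder; must match the fine-tuned head
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()
merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(MERGED_DIR)

# 2. Export the merged model to ONNX
ort_model = ORTModelForSequenceClassification.from_pretrained(MERGED_DIR, export=True)
ort_model.save_pretrained(ONNX_DIR)

# 3. Dynamic INT8 quantization (weights quantized offline, activations at runtime)
quantizer = ORTQuantizer.from_pretrained(ONNX_DIR)
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir=ONNX_DIR, quantization_config=qconfig)
# Writes ONNX_DIR/model_quantized.onnx, roughly 4x smaller than the FP32 weights
```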
2. 🎯 Upgraded Embedder: BGE-small-en-v1.5 (+21% retrieval accuracy)
- Replaced `all-MiniLM-L6-v2` (MTEB retrieval ~42.7) with `BAAI/bge-small-en-v1.5` (MTEB retrieval ~51.7)
- Same 384-dim output, same inference latency
- Added the BGE query instruction prefix for asymmetric retrieval in the chatbot (see the sketch after this list)
- Updated in both `chatbot.py` and `compare.py`
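This is roughly how the asymmetric prefix is applied; the prefix string is the one the BGE model card documents, while the variable names and sample data are illustrative rather than the actual `retrieve_chunks()` internals.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Drop-in replacement: same 384-dim vectors as all-MiniLM-L6-v2
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

# BGE retrieval is asymmetric: only the query gets the instruction prefix, passages do not
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

chunks = [
    "Either party may terminate this Agreement upon 30 days' written notice.",
    "The Receiving Party shall keep Confidential Information in strict confidence.",
]
question = "How can the contract be terminated?"

chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
query_vec = embedder.encode(BGE_QUERY_PREFIX + question, normalize_embeddings=True)

scores = chunk_vecs @ query_vec            # cosine similarity (vectors are normalized)
best_chunk = chunks[int(np.argmax(scores))]  # top-1 chunk handed to the chatbot
```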
3. 🚀 Batched Clause Classification (2-3x throughput)
- New `classify_cuad_batch()` processes all clauses in a single batched forward pass (batch_size=8); a sketch follows this list
- Replaces the sequential `classify_cuad()` loop in `analyze_contract()`
- Cache-aware batching: checks the prediction cache first, only processes uncached clauses
- Graceful fallback to regex on any batch error
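A hypothetical sketch of that cache-aware batching; the cache shape and `classify_with_regex()` are stand-ins for the app's actual helpers, not its real code.

```python
import torch

BATCH_SIZE = 8
_pred_cache: dict[str, str] = {}   # clause text -> cached label (assumed cache shape)

def classify_cuad_batch(clauses, tokenizer, model, id2label):
    """Classify many clauses per forward pass instead of one at a time."""
    results = {c: _pred_cache[c] for c in clauses if c in _pred_cache}  # cache hits
    todo = [c for c in clauses if c not in results]                      # uncached only
    try:
        for i in range(0, len(todo), BATCH_SIZE):
            batch = todo[i:i + BATCH_SIZE]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=512, return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits          # one pass for up to 8 clauses
            for clause, idx in zip(batch, logits.argmax(dim=-1).tolist()):
                results[clause] = _pred_cache[clause] = id2label[idx]
    except Exception:
        # Graceful fallback: any clause not yet classified goes through regex rules
        for clause in todo:
            results.setdefault(clause, classify_with_regex(clause))  # assumed helper
    return [results[c] for c in clauses]
```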
4. 🧵 CPU Thread Control
- `torch.set_num_threads(2)` + `torch.set_num_interop_threads(1)` at import time (snippet below)
- Prevents CPU thrashing when multiple Gradio users hit the Space simultaneously
- Matched to HF Spaces CPU-basic (2 vCPUs)
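At module import time this is just:

```python
import torch

# Match the Space's CPU-basic hardware (2 vCPUs); must run before any inference
torch.set_num_threads(2)           # intra-op parallelism (per-op thread pool)
torch.set_num_interop_threads(1)   # inter-op parallelism across independent ops
```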
Files Changed
| File | Changes |
|---|---|
| `app.py` | ONNX model loading (tries ONNX first → falls back to PyTorch), batched `classify_cuad_batch()`, `torch.set_num_threads(2)`, optimum import |
| `chatbot.py` | Embedder → `BAAI/bge-small-en-v1.5`, BGE query instruction prefix in `retrieve_chunks()` |
| `compare.py` | Embedder → `BAAI/bge-small-en-v1.5` |
| `requirements.txt` | Added `optimum[onnxruntime]>=1.23.0` |
| `README.md` | v4.3 changelog, updated models table |
| `ml/export_onnx_v2.py` | NEW — full ONNX export + INT8 quantization pipeline for the CUAD classifier |
How to use ONNX acceleration
Option A: Export yourself (recommended)

```bash
cd ml/
pip install "optimum[onnxruntime]" peft
python export_onnx_v2.py
# Pushes quantized model to gaurv007/clauseguard-onnx-int8
```
Option B: Set an env var in the Space settings

```
ONNX_MODEL_PATH=./onnx_legalbert_int8
# or
ONNX_HUB_MODEL_ID=gaurv007/clauseguard-onnx-int8
```
The app will automatically try ONNX first and fall back to PyTorch if it is unavailable.
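In outline, the load order looks like the sketch below. The env var names and ONNX Hub repo come from this PR; the quantized file name, the PEFT fallback repo id, and the helper shape are assumptions.

```python
import os

def load_classifier():
    """Try the quantized ONNX model first, fall back to the PyTorch PEFT model."""
    onnx_src = (os.getenv("ONNX_MODEL_PATH")
                or os.getenv("ONNX_HUB_MODEL_ID", "gaurv007/clauseguard-onnx-int8"))
    try:
        from optimum.onnxruntime import ORTModelForSequenceClassification
        # model_quantized.onnx is optimum's default output name; adjust if exported differently
        model = ORTModelForSequenceClassification.from_pretrained(
            onnx_src, file_name="model_quantized.onnx")
        return model, "onnx-int8"
    except Exception:
        from peft import AutoPeftModelForSequenceClassification
        # Placeholder repo id for the original LoRA checkpoint
        model = AutoPeftModelForSequenceClassification.from_pretrained(
            "gaurv007/clauseguard-cuad-lora")
        return model, "pytorch-peft"
```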
Expected performance improvements
| Optimization | Speedup | RAM saved |
|---|---|---|
| ONNX INT8 quantization | 2-4x per inference | ~75% model size |
| Batched classification (8 clauses/pass) | 2-3x throughput | - |
| BGE embedder | Same speed, +21% retrieval accuracy | - |
| Thread control | Prevents thrashing under concurrent load | - |
| Combined | ~4-8x faster on typical 50-clause contract | - |
gaurv007 changed pull request status to open
gaurv007 changed pull request status to merged