---
title: ClauseGuard
emoji: 🛡️
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: "5.23.0"
python_version: "3.12"
app_file: app.py
pinned: false
---
# πŸ›‘οΈ ClauseGuard v4.3 β€” World's Best Open-Source Legal Contract Analysis
**ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.
## 🆕 What's New in v4.3
| Feature | Description |
|---------|-------------|
| **⚡ ONNX + INT8 Quantization** | CUAD classifier now supports ONNX Runtime with dynamic INT8 quantization for **2-4x faster inference on CPU**. New `ml/export_onnx_v2.py` handles the full merge→export→quantize pipeline. |
| **🎯 Better Embeddings** | Upgraded from `all-MiniLM-L6-v2` to `BAAI/bge-small-en-v1.5`: **+21% retrieval accuracy** on MTEB benchmarks, same 384-dim, same latency. Includes query instruction prefix for asymmetric retrieval. |
| **🚀 Batched Classification** | All clauses classified in a single batched forward pass (batch_size=8) instead of one-by-one, a **2-3x throughput improvement**. |
| **🧵 CPU Thread Control** | `torch.set_num_threads(2)` prevents CPU thrashing under concurrent Gradio requests |
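The batched-classification change boils down to chunking clauses into fixed-size batches before each forward pass. A minimal sketch, where `classify_batch` is a hypothetical stand-in for the real batched CUAD classifier call (tokenize, forward pass, argmax):

```python
from typing import Callable, List

def classify_in_batches(
    clauses: List[str],
    classify_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Run the classifier over clauses in fixed-size batches instead of
    one clause per forward pass."""
    labels: List[str] = []
    for start in range(0, len(clauses), batch_size):
        labels.extend(classify_batch(clauses[start:start + batch_size]))
    return labels

# Toy stand-in classifier: labels a clause by its length.
fake = lambda batch: ["LONG" if len(c) > 20 else "SHORT" for c in batch]
print(classify_in_batches(["a" * 30, "b" * 5, "c" * 25], fake))
# → ['LONG', 'SHORT', 'LONG']
```

The throughput win comes from amortizing tokenization and model overhead across the batch; batch_size=8 keeps peak memory modest on CPU.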
### Previous: v4.2
| Feature | Description |
|---------|-------------|
| **🔧 NLI Fix** | Fixed contradiction detection: now uses `CrossEncoder.predict()` instead of broken `pipeline("text-classification")` dict input. Contradictions actually work now. |
| **🔒 Thread Safety** | `BoundedCache` now uses `threading.RLock` to prevent race conditions under concurrent Gradio requests |
| **⚡ Pre-compiled Regex** | All regex patterns (clause classification, obligations, compliance negation) pre-compiled at module level, eliminating thousands of redundant compilations |
| **🔗 Extension Fix** | Chrome extension risk formula now matches backend (diminishing returns, not normalized by doc length). Fixed API_BASE URL. |
| **🏷️ Label Coverage** | Added missing regex-only labels (Indemnification, Confidentiality, Force Majeure, Penalties) to RISK_MAP and DESC_MAP |
| **🛡️ Security** | API CORS localhost origins now require explicit opt-in via `CORS_ALLOW_LOCALHOST=true` env var |
### Previous: v4.0
| Feature | Description |
|---------|-------------|
| **πŸ” OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) |
| **πŸ’¬ Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation |
| **✏️ Clause Redlining** | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses |
## ✨ Core Features
### Analysis Engine
| Feature | Description |
|---------|-------------|
| **41 CUAD Clause Categories** | Full taxonomy: Document Name, Parties, Governing Law, Indemnification, Termination, Non-Compete, IP Ownership, Audit Rights, Force Majeure, and more |
| **4-Tier Risk Scoring** | Critical 🔴 / High 🟠 / Medium 🟡 / Low 🟢 with visual risk matrix |
| **Legal NER** | Extracts parties, dates, monetary values ($), jurisdictions, defined terms, and party roles |
| **NLI Contradiction Detection** | Identifies conflicting clauses (e.g., uncapped + capped liability) and missing critical provisions |
| **Obligation Tracker** | Categorizes action items: monetary 💰, compliance ⚖️, reporting 📊, delivery 📦, termination 🛑 |
| **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
| **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
| **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations |
| **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
| **OCR Support** | Process scanned PDFs with docTR OCR engine |
### Document Support
- **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned)
- **DOCX/DOC** parsing via `python-docx`
- **TXT / Markdown** direct text input
### UI/UX
- **3-Panel Professional Layout**: Upload sidebar + Main analysis + Summary dashboard
- **Document Viewer**: Inline entity highlights (colored annotations)
- **Clause Cards**: Expandable risk-badged cards with confidence scores
- **Redlining Tab**: Side-by-side original vs suggested safer alternatives
- **Q&A Chat Tab**: Conversational interface to ask questions about the contract
- **Export Reports**: JSON (structured) and CSV (tabular) downloads
- **Color-Coded Risk Badges**: Instant visual triage
## 🧠 Models & Architecture
| Component | Technology |
|-----------|------------|
| Clause Classification | `Mokshith31/legalbert-contract-clause-classification`: LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on the CUAD 41-class taxonomy |
| Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types |
| NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) |
| Embeddings | `BAAI/bge-small-en-v1.5` (384-dim, RAG retrieval; +21% over MiniLM) |
| LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) |
| OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned PDF text extraction |
| Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
| Comparison | Semantic similarity with sentence embeddings + string matching fallback |
| Obligations | Regex pattern matching across 5 obligation categories |
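The obligation tracker (last row above) is regex pattern matching with patterns compiled once at module level, per the v4.2 fix. A sketch with illustrative categories and keywords; ClauseGuard's real pattern lists differ:

```python
import re

# Compiled once at import time, not per call. Categories and keywords
# here are illustrative placeholders, not ClauseGuard's actual lists.
OBLIGATION_PATTERNS = {
    "monetary": re.compile(r"\bshall pay\b|\bfee of\b|\bpenalty\b", re.I),
    "reporting": re.compile(r"\bshall report\b|\bnotify\b|\bprovide notice\b", re.I),
    "delivery": re.compile(r"\bshall deliver\b|\bdelivery date\b", re.I),
}

def tag_obligations(sentence: str) -> list:
    """Return every obligation category whose pattern matches the sentence."""
    return [cat for cat, pat in OBLIGATION_PATTERNS.items() if pat.search(sentence)]

print(tag_obligations("Licensee shall pay the fee of $500 and notify Licensor."))
# → ['monetary', 'reporting']
```

Module-level compilation is what the v4.2 pre-compiled regex fix buys: each pattern is compiled once, not once per clause per request.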
## πŸ” OCR Architecture (Smart PDF Router)
```
PDF uploaded
     ↓
[detect_if_scanned] - does pdfplumber extract >50 chars/page?
     ↓ yes                 ↓ no
Native PDF            Scanned PDF
     ↓                     ↓
pdfplumber            docTR OCR (CPU)
     ↓                     ↓
     Contract text → existing analysis pipeline
```
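The routing decision above can be sketched as a simple threshold check. The >50 chars/page cutoff comes from the diagram; the function name and the choice to average across pages are assumptions:

```python
def looks_scanned(pages_text, min_chars_per_page: int = 50) -> bool:
    """Heuristic router: if pdfplumber recovers at most ~50 characters per
    page on average, the PDF is likely a scan and should go to docTR OCR.
    `pages_text` is the list of per-page extracted strings (may hold None)."""
    total = sum(len(t or "") for t in pages_text)
    return total / max(len(pages_text), 1) <= min_chars_per_page

print(looks_scanned(["", None, ""]))                      # → True  (route to docTR)
print(looks_scanned(["This Agreement is made..." * 10]))  # → False (native text layer)
```

A text-layer check like this is cheap because pdfplumber has to open the PDF anyway; OCR only runs when extraction comes back essentially empty.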
## 💬 Q&A Chatbot Architecture (RAG)
```
User asks question about their contract
↓
[1] Embed question with BAAI/bge-small-en-v1.5 (query instruction prefix)
↓
[2] Retrieve top-5 most relevant chunks from contract
↓
[3] Build prompt:
    - System: ClauseGuard analysis results (clauses, entities, risk scores)
    - Context: retrieved contract chunks (≤2.5K tokens)
    - User question
↓
[4] Stream response from Qwen2.5-7B via HF Inference API
```
**Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt, NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.
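The retrieval step can be sketched with toy 2-dimensional vectors standing in for bge-small-en-v1.5's 384-dim embeddings. The query prefix follows BGE's asymmetric-retrieval convention (queries get the instruction, passages do not):

```python
import math

# BGE convention: prepend this instruction to queries, not to passages.
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(query_vec, cv[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Toy vectors standing in for real 384-dim contract-chunk embeddings.
chunks = ["termination clause", "payment terms", "governing law"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k_chunks([1.0, 0.1], vecs, chunks, k=2))
# → ['termination clause', 'governing law']
```

In the real pipeline the embedder would be called as `model.encode(BGE_QUERY_PREFIX + question)` for the query and `model.encode(chunk)` for each chunk; the ranking logic is the same.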
## ✏️ Clause Redlining Architecture (3-Tier)
| Tier | Method | Speed | Hallucination Risk |
|------|--------|-------|--------------------|
| **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
| **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero |
| **3. LLM Refinement** | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) |
Anti-hallucination guardrails:
- **Template anchor:** LLM can only refine, not generate from scratch
- **Legal citation:** Every suggestion includes legal basis and consumer standard
- **Disclaimer:** Clear "Not legal advice" warning
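The tiered fallback can be sketched as a dispatcher. The template texts, keyword keys, and `llm_refine` hook below are hypothetical placeholders, not ClauseGuard's real template library:

```python
# Illustrative templates; the real library has 18+ entries based on FTC/EU/CFPB standards.
TEMPLATES = {
    "uncapped liability": "Liability is capped at fees paid in the 12 months preceding the claim.",
    "auto-renewal": "Renewal requires express written consent at least 30 days before expiry.",
}

def redline(clause: str, risk: str, llm_refine=None) -> str:
    """Tier 1/2: keyword-match the clause to a pre-written safe template.
    Tier 3: for CRITICAL/HIGH clauses, let an LLM adapt the matched template
    (template-anchored, so it refines rather than generates from scratch)."""
    text = clause.lower()
    for key, safe_alt in TEMPLATES.items():
        if key in text:
            if risk in ("CRITICAL", "HIGH") and llm_refine is not None:
                return llm_refine(template=safe_alt, clause=clause)
            return safe_alt
    return "No template match; flag for manual review."

print(redline("Vendor has uncapped liability for all claims.", "MEDIUM"))
```

The anti-hallucination property falls out of the structure: the LLM path only ever receives an existing template to adapt, never a blank slate.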
## 📊 Risk Scoring Methodology
Risk scores combine clause detection with weighted severity:
- **CRITICAL**: 40 pts (Uncapped Liability, Arbitration, IP Assignment, etc.)
- **HIGH**: 20 pts (Non-Compete, Exclusivity, Unilateral Change, etc.)
- **MEDIUM**: 10 pts (Governing Law, Jurisdiction, etc.)
- **LOW**: 3 pts (Document Name, Dates, etc.)
Final score normalized to 0-100 with letter grades:
- A (0-14): Low risk
- B (15-29): Moderate risk
- C (30-49): Elevated risk
- D (50-69): High risk
- F (70+): Critical risk
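The weights and grade cutoffs above can be wired together as follows. The saturating curve used here is only one plausible diminishing-returns formula (the v4.2 notes say the backend uses diminishing returns rather than length normalization, but not the exact math):

```python
WEIGHTS = {"CRITICAL": 40, "HIGH": 20, "MEDIUM": 10, "LOW": 3}

def risk_score(detected):
    """Sum severity points for detected clauses, then squash into 0-100
    with a saturating curve (assumed shape, not ClauseGuard's exact formula)."""
    raw = sum(WEIGHTS[sev] for sev in detected)
    return round(100 * raw / (raw + 60), 1)  # diminishing returns, never exceeds 100

def risk_grade(score: float) -> str:
    """Map a 0-100 score to the letter grades listed above."""
    for cutoff, grade in ((70, "F"), (50, "D"), (30, "C"), (15, "B")):
        if score >= cutoff:
            return grade
    return "A"

score = risk_score(["CRITICAL", "HIGH", "MEDIUM"])
print(score, risk_grade(score))  # → 53.8 D
```

Note the deliberate property: each additional risky clause raises the score by less than the last, so a long contract with many medium findings does not automatically outrank a short one with a single critical clause.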
## 🚀 Usage
1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
   - 💡 Scanned PDFs are automatically processed with OCR
2. Click **Analyze Contract**
3. View results across tabs:
- **Document**: Full text with inline entity highlights
- **Clauses**: Detected clauses with risk badges
- **Entities**: Extracted parties, dates, money, jurisdictions
- **Contradictions**: Conflicting clauses and missing provisions
- **Obligations**: Action items categorized by type
- **Compliance**: Regulatory framework checks
- **Redlining**: ✏️ Safer clause alternatives with legal citations
4. **Export** JSON/CSV reports
5. Switch to the **💬 Contract Q&A** tab to ask questions about your contract
## 🔀 Compare Contracts
Switch to the **Compare Contracts** tab to:
- Upload or paste two contracts side-by-side
- See clause-level diffs (added, removed, modified)
- Get an alignment score and risk delta
## ⚠️ Disclaimer
*Not legal advice. ClauseGuard is an AI-powered analysis tool for informational purposes only. Always consult a qualified attorney for legal decisions. The tool may miss nuances and should be used as a preliminary screening aid, not a substitute for professional legal review.*
## 🔗 Links
- [ClauseGuard Space](https://huggingface.co/spaces/gaurv007/ClauseGuard)
- [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
- [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
- [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
- [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [docTR OCR](https://github.com/mindee/doctr)
- [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)
---
*Built with ❀️ using Gradio, Hugging Face Transformers, and Legal-BERT. Open source and free for all.*