Spaces:
Running
title: ClauseGuard
emoji: π‘οΈ
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 5.23.0
python_version: '3.12'
app_file: app.py
pinned: false
π‘οΈ ClauseGuard v4.3 β World's Best Open-Source Legal Contract Analysis
ClauseGuard is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.
π What's New in v4.3
| Feature | Description |
|---|---|
| β‘ ONNX + INT8 Quantization | CUAD classifier now supports ONNX Runtime with dynamic INT8 quantization β 2-4x faster inference on CPU. New ml/export_onnx_v2.py handles the full mergeβexportβquantize pipeline. |
| π― Better Embeddings | Upgraded from all-MiniLM-L6-v2 to BAAI/bge-small-en-v1.5 β +21% retrieval accuracy on MTEB benchmarks, same 384-dim, same latency. Includes query instruction prefix for asymmetric retrieval. |
| π Batched Classification | All clauses classified in a single batched forward pass (batch_size=8) instead of one-by-one β 2-3x throughput improvement. |
| π§΅ CPU Thread Control | torch.set_num_threads(2) prevents CPU thrashing under concurrent Gradio requests |
Previous: v4.2
| Feature | Description |
|---|---|
| π§ NLI Fix | Fixed contradiction detection β now uses CrossEncoder.predict() instead of broken pipeline("text-classification") dict input. Contradictions actually work now. |
| π Thread Safety | BoundedCache now uses threading.RLock to prevent race conditions under concurrent Gradio requests |
| β‘ Pre-compiled Regex | All regex patterns (clause classification, obligations, compliance negation) pre-compiled at module level β eliminates thousands of redundant compilations |
| π Extension Fix | Chrome extension risk formula now matches backend (diminishing returns, not normalized by doc length). Fixed API_BASE URL. |
| π·οΈ Label Coverage | Added missing regex-only labels (Indemnification, Confidentiality, Force Majeure, Penalties) to RISK_MAP and DESC_MAP |
| π‘οΈ Security | API CORS localhost origins now require explicit opt-in via CORS_ALLOW_LOCALHOST=true env var |
Previous: v4.0
| Feature | Description |
|---|---|
| π OCR for Scanned PDFs | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) |
| π¬ Contract Q&A Chatbot | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation |
| βοΈ Clause Redlining | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses |
β¨ Core Features
Analysis Engine
| Feature | Description |
|---|---|
| 41 CUAD Clause Categories | Full taxonomy: Document Name, Parties, Governing Law, Indemnification, Termination, Non-Compete, IP Ownership, Audit Rights, Force Majeure, and more |
| 4-Tier Risk Scoring | Critical π΄ / High π / Medium π‘ / Low π’ with visual risk matrix |
| Legal NER | Extracts parties, dates, monetary values ($), jurisdictions, defined terms, and party roles |
| NLI Contradiction Detection | Identifies conflicting clauses (e.g., uncapped + capped liability) and missing critical provisions |
| Obligation Tracker | Categorizes action items: monetary π°, compliance βοΈ, reporting π, delivery π¦, termination π |
| Compliance Checker | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
| Contract Comparison | Side-by-side diff between two contracts with alignment scoring |
| Clause Redlining | Suggests safer alternatives for risky clauses with legal citations |
| Q&A Chatbot | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
| OCR Support | Process scanned PDFs with docTR OCR engine |
Document Support
- PDF parsing via
pdfplumber(native) +docTROCR (scanned) - DOCX/DOC parsing via
python-docx - TXT / Markdown direct text input
UI/UX
- 3-Panel Professional Layout β Upload sidebar + Main analysis + Summary dashboard
- Document Viewer β Inline entity highlights (colored annotations)
- Clause Cards β Expandable risk-badged cards with confidence scores
- Redlining Tab β Side-by-side original vs suggested safer alternatives
- Q&A Chat Tab β Conversational interface to ask questions about the contract
- Export Reports β JSON (structured) and CSV (tabular) downloads
- Color-Coded Risk Badges β Instant visual triage
π§ Models & Architecture
| Component | Technology |
|---|---|
| Clause Classification | Mokshith31/legalbert-contract-clause-classification β LoRA adapter on nlpaueb/legal-bert-base-uncased, fine-tuned on CUAD 41-class taxonomy |
| Legal NER | matterstack/legal-bert-ner (ML) with regex fallback for 7 entity types |
| NLI | cross-encoder/nli-deberta-v3-base (semantic contradiction detection) |
| Embeddings | BAAI/bge-small-en-v1.5 (384-dim, RAG retrieval β +21% over MiniLM) |
| LLM | Qwen/Qwen2.5-7B-Instruct via HF Inference API (chatbot + redlining) |
| OCR | docTR (fast_base + crnn_vgg16_bn) for scanned PDF text extraction |
| Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
| Comparison | Semantic similarity with sentence embeddings + string matching fallback |
| Obligations | Regex pattern matching across 5 obligation categories |
π OCR Architecture (Smart PDF Router)
PDF uploaded
β
[detect_if_scanned] β pdfplumber extracts >50 chars/page?
β β
Native PDF Scanned PDF
β β
pdfplumber docTR OCR (CPU)
β β
Contract text β existing analysis pipeline
π¬ Q&A Chatbot Architecture (RAG)
User asks question about their contract
β
[1] Embed question with all-MiniLM-L6-v2
β
[2] Retrieve top-5 most relevant chunks from contract
β
[3] Build prompt:
- System: ClauseGuard analysis results (clauses, entities, risk scores)
- Context: Retrieved contract chunks (β€2.5K tokens)
- User question
β
[4] Stream response from Qwen2.5-7B via HF Inference API
Key design: Analyzed data (clauses, entities, risk scores) goes in the system prompt β NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.
βοΈ Clause Redlining Architecture (3-Tier)
| Tier | Method | Speed | Hallucination Risk |
|---|---|---|---|
| 1. Template Lookup | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
| 2. Keyword Matching | Match clause text to relevant templates via legal keywords | Instant | Zero |
| 3. LLM Refinement | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) |
Anti-hallucination guardrails:
- Template anchor: LLM can only refine, not generate from scratch
- Legal citation: Every suggestion includes legal basis and consumer standard
- Disclaimer: Clear "Not legal advice" warning
π Risk Scoring Methodology
Risk scores combine clause detection with weighted severity:
- CRITICAL: 40 pts (Uncapped Liability, Arbitration, IP Assignment, etc.)
- HIGH: 20 pts (Non-Compete, Exclusivity, Unilateral Change, etc.)
- MEDIUM: 10 pts (Governing Law, Jurisdiction, etc.)
- LOW: 3 pts (Document Name, Dates, etc.)
Final score normalized to 0-100 with letter grades:
- A (0-14): Low risk
- B (15-29): Moderate risk
- C (30-49): Elevated risk
- D (50-69): High risk
- F (70+): Critical risk
π Usage
- Upload a contract (PDF, DOCX, or TXT) or paste text directly
- π‘ Scanned PDFs are automatically processed with OCR
- Click Analyze Contract
- View results across tabs:
- Document: Full text with inline entity highlights
- Clauses: Detected clauses with risk badges
- Entities: Extracted parties, dates, money, jurisdictions
- Contradictions: Conflicting clauses and missing provisions
- Obligations: Action items categorized by type
- Compliance: Regulatory framework checks
- Redlining: βοΈ Safer clause alternatives with legal citations
- Export JSON/CSV reports
- Switch to π¬ Contract Q&A tab to ask questions about your contract
π Compare Contracts
Switch to the Compare Contracts tab to:
- Upload or paste two contracts side-by-side
- See clause-level diffs (added, removed, modified)
- Get an alignment score and risk delta
β οΈ Disclaimer
Not legal advice. ClauseGuard is an AI-powered analysis tool for informational purposes only. Always consult a qualified attorney for legal decisions. The tool may miss nuances and should be used as a preliminary screening aid, not a substitute for professional legal review.
π Links
- ClauseGuard Space
- Clause Classifier Model
- Legal-BERT Base
- CUAD Dataset
- Qwen2.5-7B (Chatbot LLM)
- docTR OCR
- CUAD Paper (arXiv:2103.06268)
Built with β€οΈ using Gradio, Hugging Face Transformers, and Legal-BERT. Open source and free for all.