Spaces:
Running
Running
| title: ClauseGuard | |
| emoji: π‘οΈ | |
| colorFrom: gray | |
| colorTo: gray | |
| sdk: gradio | |
| sdk_version: "5.23.0" | |
| python_version: "3.12" | |
| app_file: app.py | |
| pinned: false | |
| # π‘οΈ ClauseGuard v4.3 β World's Best Open-Source Legal Contract Analysis | |
| **ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs. | |
| ## π What's New in v4.3 | |
| | Feature | Description | | |
| |---------|-------------| | |
| | **β‘ ONNX + INT8 Quantization** | CUAD classifier now supports ONNX Runtime with dynamic INT8 quantization β **2-4x faster inference on CPU**. New `ml/export_onnx_v2.py` handles the full mergeβexportβquantize pipeline. | | |
| | **π― Better Embeddings** | Upgraded from `all-MiniLM-L6-v2` to `BAAI/bge-small-en-v1.5` β **+21% retrieval accuracy** on MTEB benchmarks, same 384-dim, same latency. Includes query instruction prefix for asymmetric retrieval. | | |
| | **π Batched Classification** | All clauses classified in a single batched forward pass (batch_size=8) instead of one-by-one β **2-3x throughput improvement**. | | |
| | **π§΅ CPU Thread Control** | `torch.set_num_threads(2)` prevents CPU thrashing under concurrent Gradio requests | | |
| ### Previous: v4.2 | |
| | Feature | Description | | |
| |---------|-------------| | |
| | **π§ NLI Fix** | Fixed contradiction detection β now uses `CrossEncoder.predict()` instead of broken `pipeline("text-classification")` dict input. Contradictions actually work now. | | |
| | **π Thread Safety** | `BoundedCache` now uses `threading.RLock` to prevent race conditions under concurrent Gradio requests | | |
| | **β‘ Pre-compiled Regex** | All regex patterns (clause classification, obligations, compliance negation) pre-compiled at module level β eliminates thousands of redundant compilations | | |
| | **π Extension Fix** | Chrome extension risk formula now matches backend (diminishing returns, not normalized by doc length). Fixed API_BASE URL. | | |
| | **π·οΈ Label Coverage** | Added missing regex-only labels (Indemnification, Confidentiality, Force Majeure, Penalties) to RISK_MAP and DESC_MAP | | |
| | **π‘οΈ Security** | API CORS localhost origins now require explicit opt-in via `CORS_ALLOW_LOCALHOST=true` env var | | |
| ### Previous: v4.0 | |
| | Feature | Description | | |
| |---------|-------------| | |
| | **π OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) | | |
| | **π¬ Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation | | |
| | **βοΈ Clause Redlining** | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses | | |
| ## β¨ Core Features | |
| ### Analysis Engine | |
| | Feature | Description | | |
| |---------|-------------| | |
| | **41 CUAD Clause Categories** | Full taxonomy: Document Name, Parties, Governing Law, Indemnification, Termination, Non-Compete, IP Ownership, Audit Rights, Force Majeure, and more | | |
| | **4-Tier Risk Scoring** | Critical π΄ / High π / Medium π‘ / Low π’ with visual risk matrix | | |
| | **Legal NER** | Extracts parties, dates, monetary values ($), jurisdictions, defined terms, and party roles | | |
| | **NLI Contradiction Detection** | Identifies conflicting clauses (e.g., uncapped + capped liability) and missing critical provisions | | |
| | **Obligation Tracker** | Categorizes action items: monetary π°, compliance βοΈ, reporting π, delivery π¦, termination π | | |
| | **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements | | |
| | **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring | | |
| | **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations | | |
| | **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) | | |
| | **OCR Support** | Process scanned PDFs with docTR OCR engine | | |
| ### Document Support | |
| - **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned) | |
| - **DOCX/DOC** parsing via `python-docx` | |
| - **TXT / Markdown** direct text input | |
| ### UI/UX | |
| - **3-Panel Professional Layout** β Upload sidebar + Main analysis + Summary dashboard | |
| - **Document Viewer** β Inline entity highlights (colored annotations) | |
| - **Clause Cards** β Expandable risk-badged cards with confidence scores | |
| - **Redlining Tab** β Side-by-side original vs suggested safer alternatives | |
| - **Q&A Chat Tab** β Conversational interface to ask questions about the contract | |
| - **Export Reports** β JSON (structured) and CSV (tabular) downloads | |
| - **Color-Coded Risk Badges** β Instant visual triage | |
| ## π§ Models & Architecture | |
| | Component | Technology | | |
| |-----------|------------| | |
| | Clause Classification | `Mokshith31/legalbert-contract-clause-classification` β LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy | | |
| | Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types | | |
| | NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) | | |
| | Embeddings | `BAAI/bge-small-en-v1.5` (384-dim, RAG retrieval β +21% over MiniLM) | | |
| | LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) | | |
| | OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned PDF text extraction | | |
| | Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA | | |
| | Comparison | Semantic similarity with sentence embeddings + string matching fallback | | |
| | Obligations | Regex pattern matching across 5 obligation categories | | |
| ## π OCR Architecture (Smart PDF Router) | |
| ``` | |
| PDF uploaded | |
| β | |
| [detect_if_scanned] β pdfplumber extracts >50 chars/page? | |
| β β | |
| Native PDF Scanned PDF | |
| β β | |
| pdfplumber docTR OCR (CPU) | |
| β β | |
| Contract text β existing analysis pipeline | |
| ``` | |
| ## π¬ Q&A Chatbot Architecture (RAG) | |
| ``` | |
| User asks question about their contract | |
| β | |
| [1] Embed question with all-MiniLM-L6-v2 | |
| β | |
| [2] Retrieve top-5 most relevant chunks from contract | |
| β | |
| [3] Build prompt: | |
| - System: ClauseGuard analysis results (clauses, entities, risk scores) | |
| - Context: Retrieved contract chunks (β€2.5K tokens) | |
| - User question | |
| β | |
| [4] Stream response from Qwen2.5-7B via HF Inference API | |
| ``` | |
| **Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt β NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence. | |
| ## βοΈ Clause Redlining Architecture (3-Tier) | |
| | Tier | Method | Speed | Hallucination Risk | | |
| |------|--------|-------|--------------------| | |
| | **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero | | |
| | **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero | | |
| | **3. LLM Refinement** | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) | | |
| Anti-hallucination guardrails: | |
| - **Template anchor:** LLM can only refine, not generate from scratch | |
| - **Legal citation:** Every suggestion includes legal basis and consumer standard | |
| - **Disclaimer:** Clear "Not legal advice" warning | |
| ## π Risk Scoring Methodology | |
| Risk scores combine clause detection with weighted severity: | |
| - **CRITICAL**: 40 pts (Uncapped Liability, Arbitration, IP Assignment, etc.) | |
| - **HIGH**: 20 pts (Non-Compete, Exclusivity, Unilateral Change, etc.) | |
| - **MEDIUM**: 10 pts (Governing Law, Jurisdiction, etc.) | |
| - **LOW**: 3 pts (Document Name, Dates, etc.) | |
| Final score normalized to 0-100 with letter grades: | |
| - A (0-14): Low risk | |
| - B (15-29): Moderate risk | |
| - C (30-49): Elevated risk | |
| - D (50-69): High risk | |
| - F (70+): Critical risk | |
| ## π Usage | |
| 1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly | |
| - π‘ Scanned PDFs are automatically processed with OCR | |
| 2. Click **Analyze Contract** | |
| 3. View results across tabs: | |
| - **Document**: Full text with inline entity highlights | |
| - **Clauses**: Detected clauses with risk badges | |
| - **Entities**: Extracted parties, dates, money, jurisdictions | |
| - **Contradictions**: Conflicting clauses and missing provisions | |
| - **Obligations**: Action items categorized by type | |
| - **Compliance**: Regulatory framework checks | |
| - **Redlining**: βοΈ Safer clause alternatives with legal citations | |
| 4. **Export** JSON/CSV reports | |
| 5. Switch to **π¬ Contract Q&A** tab to ask questions about your contract | |
| ## π Compare Contracts | |
| Switch to the **Compare Contracts** tab to: | |
| - Upload or paste two contracts side-by-side | |
| - See clause-level diffs (added, removed, modified) | |
| - Get an alignment score and risk delta | |
| ## β οΈ Disclaimer | |
| *Not legal advice. ClauseGuard is an AI-powered analysis tool for informational purposes only. Always consult a qualified attorney for legal decisions. The tool may miss nuances and should be used as a preliminary screening aid, not a substitute for professional legal review.* | |
| ## π Links | |
| - [ClauseGuard Space](https://huggingface.co/spaces/gaurv007/ClauseGuard) | |
| - [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification) | |
| - [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased) | |
| - [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa) | |
| - [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | |
| - [docTR OCR](https://github.com/mindee/doctr) | |
| - [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268) | |
| --- | |
| *Built with β€οΈ using Gradio, Hugging Face Transformers, and Legal-BERT. Open source and free for all.* | |