Spaces:
Sleeping
Sleeping
File size: 8,188 Bytes
85cf385 94c4c90 85cf385 e8d10a0 28c983c e8d10a0 28c983c e8d10a0 d3099a5 e8d10a0 d3099a5 28c983c e8d10a0 d3099a5 28c983c d3099a5 e8d10a0 d3099a5 28c983c d3099a5 28c983c d3099a5 28c983c d3099a5 28c983c d3099a5 e8d10a0 d3099a5 e8d10a0 d3099a5 28c983c e8d10a0 d3099a5 28c983c d3099a5 28c983c d3099a5 e8d10a0 d3099a5 e8d10a0 d3099a5 28c983c d3099a5 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 | ---
title: ClauseGuard
emoji: π‘οΈ
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: "5.23.0"
python_version: "3.12"
app_file: app.py
pinned: false
---
# π‘οΈ ClauseGuard v4.0 β World's Best Open-Source Legal Contract Analysis
**ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.
## π What's New in v4.0
| Feature | Description |
|---------|-------------|
| **π OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) |
| **π¬ Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation |
| **βοΈ Clause Redlining** | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses |
## β¨ Core Features
### Analysis Engine
| Feature | Description |
|---------|-------------|
| **41 CUAD Clause Categories** | Full taxonomy: Document Name, Parties, Governing Law, Indemnification, Termination, Non-Compete, IP Ownership, Audit Rights, Force Majeure, and more |
| **4-Tier Risk Scoring** | Critical π΄ / High π / Medium π‘ / Low π’ with visual risk matrix |
| **Legal NER** | Extracts parties, dates, monetary values ($), jurisdictions, defined terms, and party roles |
| **NLI Contradiction Detection** | Identifies conflicting clauses (e.g., uncapped + capped liability) and missing critical provisions |
| **Obligation Tracker** | Categorizes action items: monetary π°, compliance βοΈ, reporting π, delivery π¦, termination π |
| **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
| **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
| **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations |
| **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
| **OCR Support** | Process scanned PDFs with docTR OCR engine |
### Document Support
- **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned)
- **DOCX/DOC** parsing via `python-docx`
- **TXT / Markdown** direct text input
### UI/UX
- **3-Panel Professional Layout** β Upload sidebar + Main analysis + Summary dashboard
- **Document Viewer** β Inline entity highlights (colored annotations)
- **Clause Cards** β Expandable risk-badged cards with confidence scores
- **Redlining Tab** β Side-by-side original vs suggested safer alternatives
- **Q&A Chat Tab** β Conversational interface to ask questions about the contract
- **Export Reports** β JSON (structured) and CSV (tabular) downloads
- **Color-Coded Risk Badges** β Instant visual triage
## π§ Models & Architecture
| Component | Technology |
|-----------|------------|
| Clause Classification | `Mokshith31/legalbert-contract-clause-classification` β LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
| Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types |
| NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) |
| Embeddings | `sentence-transformers/all-MiniLM-L6-v2` (384-dim, RAG retrieval) |
| LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) |
| OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned PDF text extraction |
| Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
| Comparison | Semantic similarity with sentence embeddings + string matching fallback |
| Obligations | Regex pattern matching across 5 obligation categories |
## π OCR Architecture (Smart PDF Router)
```
PDF uploaded
β
[detect_if_scanned] β pdfplumber extracts >50 chars/page?
β β
Native PDF Scanned PDF
β β
pdfplumber docTR OCR (CPU)
β β
Contract text β existing analysis pipeline
```
## π¬ Q&A Chatbot Architecture (RAG)
```
User asks question about their contract
β
[1] Embed question with all-MiniLM-L6-v2
β
[2] Retrieve top-5 most relevant chunks from contract
β
[3] Build prompt:
- System: ClauseGuard analysis results (clauses, entities, risk scores)
- Context: Retrieved contract chunks (β€2.5K tokens)
- User question
β
[4] Stream response from Qwen2.5-7B via HF Inference API
```
**Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt β NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.
## βοΈ Clause Redlining Architecture (3-Tier)
| Tier | Method | Speed | Hallucination Risk |
|------|--------|-------|--------------------|
| **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
| **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero |
| **3. LLM Refinement** | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) |
Anti-hallucination guardrails:
- **Template anchor:** LLM can only refine, not generate from scratch
- **Legal citation:** Every suggestion includes legal basis and consumer standard
- **Disclaimer:** Clear "Not legal advice" warning
## π Risk Scoring Methodology
Risk scores combine clause detection with weighted severity:
- **CRITICAL**: 40 pts (Uncapped Liability, Arbitration, IP Assignment, etc.)
- **HIGH**: 20 pts (Non-Compete, Exclusivity, Unilateral Change, etc.)
- **MEDIUM**: 10 pts (Governing Law, Jurisdiction, etc.)
- **LOW**: 3 pts (Document Name, Dates, etc.)
Final score normalized to 0-100 with letter grades:
- A (0-14): Low risk
- B (15-29): Moderate risk
- C (30-49): Elevated risk
- D (50-69): High risk
- F (70+): Critical risk
## π Usage
1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
- π‘ Scanned PDFs are automatically processed with OCR
2. Click **Analyze Contract**
3. View results across tabs:
- **Document**: Full text with inline entity highlights
- **Clauses**: Detected clauses with risk badges
- **Entities**: Extracted parties, dates, money, jurisdictions
- **Contradictions**: Conflicting clauses and missing provisions
- **Obligations**: Action items categorized by type
- **Compliance**: Regulatory framework checks
- **Redlining**: βοΈ Safer clause alternatives with legal citations
4. **Export** JSON/CSV reports
5. Switch to **π¬ Contract Q&A** tab to ask questions about your contract
## π Compare Contracts
Switch to the **Compare Contracts** tab to:
- Upload or paste two contracts side-by-side
- See clause-level diffs (added, removed, modified)
- Get an alignment score and risk delta
## β οΈ Disclaimer
*Not legal advice. ClauseGuard is an AI-powered analysis tool for informational purposes only. Always consult a qualified attorney for legal decisions. The tool may miss nuances and should be used as a preliminary screening aid, not a substitute for professional legal review.*
## π Links
- [ClauseGuard Space](https://huggingface.co/spaces/gaurv007/ClauseGuard)
- [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
- [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
- [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
- [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [docTR OCR](https://github.com/mindee/doctr)
- [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)
---
*Built with β€οΈ using Gradio, Hugging Face Transformers, and Legal-BERT. Open source and free for all.*
|