Spaces:
Sleeping
Sleeping
v4.0: Update README with OCR, Chatbot, and Redlining docs
Browse files
README.md
CHANGED
|
@@ -10,9 +10,17 @@ app_file: app.py
|
|
| 10 |
pinned: false
|
| 11 |
---
|
| 12 |
|
| 13 |
-
# π‘οΈ ClauseGuard β World's Best Open-Source Legal Contract Analysis
|
| 14 |
|
| 15 |
-
**ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
|
| 17 |
## β¨ Core Features
|
| 18 |
|
|
@@ -26,9 +34,12 @@ pinned: false
|
|
| 26 |
| **Obligation Tracker** | Categorizes action items: monetary π°, compliance βοΈ, reporting π, delivery π¦, termination π |
|
| 27 |
| **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
|
| 28 |
| **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
|
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
### Document Support
|
| 31 |
-
- **PDF** parsing via `pdfplumber`
|
| 32 |
- **DOCX/DOC** parsing via `python-docx`
|
| 33 |
- **TXT / Markdown** direct text input
|
| 34 |
|
|
@@ -36,6 +47,8 @@ pinned: false
|
|
| 36 |
- **3-Panel Professional Layout** β Upload sidebar + Main analysis + Summary dashboard
|
| 37 |
- **Document Viewer** β Inline entity highlights (colored annotations)
|
| 38 |
- **Clause Cards** β Expandable risk-badged cards with confidence scores
|
|
|
|
|
|
|
| 39 |
- **Export Reports** β JSON (structured) and CSV (tabular) downloads
|
| 40 |
- **Color-Coded Risk Badges** β Instant visual triage
|
| 41 |
|
|
@@ -44,12 +57,61 @@ pinned: false
|
|
| 44 |
| Component | Technology |
|
| 45 |
|-----------|------------|
|
| 46 |
| Clause Classification | `Mokshith31/legalbert-contract-clause-classification` β LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
|
| 47 |
-
| NER |
|
| 48 |
-
| NLI |
|
|
|
|
|
|
|
|
|
|
| 49 |
| Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
|
| 50 |
-
| Comparison |
|
| 51 |
| Obligations | Regex pattern matching across 5 obligation categories |
|
| 52 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
## π Risk Scoring Methodology
|
| 54 |
|
| 55 |
Risk scores combine clause detection with weighted severity:
|
|
@@ -65,16 +127,10 @@ Final score normalized to 0-100 with letter grades:
|
|
| 65 |
- D (50-69): High risk
|
| 66 |
- F (70+): Critical risk
|
| 67 |
|
| 68 |
-
## π Datasets & Research
|
| 69 |
-
|
| 70 |
-
- [CUAD](https://huggingface.co/datasets/theatticusproject/cuad-qa) β 510 contracts, 13K annotations, 41 clause categories
|
| 71 |
-
- [LegalBench](https://huggingface.co/datasets/nguha/legalbench) β 322 legal reasoning tasks
|
| 72 |
-
- [LexGLUE](https://huggingface.co/datasets/coastalcph/lex_glue) β Unfair Terms of Service classification
|
| 73 |
-
- Paper: [CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review](https://arxiv.org/abs/2103.06268) (Hendrycks et al., 2021)
|
| 74 |
-
|
| 75 |
## π Usage
|
| 76 |
|
| 77 |
1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
|
|
|
|
| 78 |
2. Click **Analyze Contract**
|
| 79 |
3. View results across tabs:
|
| 80 |
- **Document**: Full text with inline entity highlights
|
|
@@ -83,7 +139,9 @@ Final score normalized to 0-100 with letter grades:
|
|
| 83 |
- **Contradictions**: Conflicting clauses and missing provisions
|
| 84 |
- **Obligations**: Action items categorized by type
|
| 85 |
- **Compliance**: Regulatory framework checks
|
|
|
|
| 86 |
4. **Export** JSON/CSV reports
|
|
|
|
| 87 |
|
| 88 |
## π Compare Contracts
|
| 89 |
|
|
@@ -91,7 +149,6 @@ Switch to the **Compare Contracts** tab to:
|
|
| 91 |
- Upload or paste two contracts side-by-side
|
| 92 |
- See clause-level diffs (added, removed, modified)
|
| 93 |
- Get an alignment score and risk delta
|
| 94 |
-
- View raw JSON comparison data
|
| 95 |
|
| 96 |
## β οΈ Disclaimer
|
| 97 |
|
|
@@ -103,6 +160,8 @@ Switch to the **Compare Contracts** tab to:
|
|
| 103 |
- [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
|
| 104 |
- [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
|
| 105 |
- [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
|
|
|
|
|
|
|
| 106 |
- [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)
|
| 107 |
|
| 108 |
---
|
|
|
|
| 10 |
pinned: false
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# π‘οΈ ClauseGuard v4.0 β World's Best Open-Source Legal Contract Analysis
|
| 14 |
|
| 15 |
+
**ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.
|
| 16 |
+
|
| 17 |
+
## π What's New in v4.0
|
| 18 |
+
|
| 19 |
+
| Feature | Description |
|
| 20 |
+
|---------|-------------|
|
| 21 |
+
| **π OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) |
|
| 22 |
+
| **π¬ Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation |
|
| 23 |
+
| **βοΈ Clause Redlining** | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses |
|
| 24 |
|
| 25 |
## β¨ Core Features
|
| 26 |
|
|
|
|
| 34 |
| **Obligation Tracker** | Categorizes action items: monetary π°, compliance βοΈ, reporting π, delivery π¦, termination π |
|
| 35 |
| **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
|
| 36 |
| **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
|
| 37 |
+
| **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations |
|
| 38 |
+
| **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
|
| 39 |
+
| **OCR Support** | Process scanned PDFs with docTR OCR engine |
|
| 40 |
|
| 41 |
### Document Support
|
| 42 |
+
- **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned)
|
| 43 |
- **DOCX/DOC** parsing via `python-docx`
|
| 44 |
- **TXT / Markdown** direct text input
|
| 45 |
|
|
|
|
| 47 |
- **3-Panel Professional Layout** β Upload sidebar + Main analysis + Summary dashboard
|
| 48 |
- **Document Viewer** β Inline entity highlights (colored annotations)
|
| 49 |
- **Clause Cards** β Expandable risk-badged cards with confidence scores
|
| 50 |
+
- **Redlining Tab** β Side-by-side original vs suggested safer alternatives
|
| 51 |
+
- **Q&A Chat Tab** β Conversational interface to ask questions about the contract
|
| 52 |
- **Export Reports** β JSON (structured) and CSV (tabular) downloads
|
| 53 |
- **Color-Coded Risk Badges** β Instant visual triage
|
| 54 |
|
|
|
|
| 57 |
| Component | Technology |
|
| 58 |
|-----------|------------|
|
| 59 |
| Clause Classification | `Mokshith31/legalbert-contract-clause-classification` β LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
|
| 60 |
+
| Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types |
|
| 61 |
+
| NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) |
|
| 62 |
+
| Embeddings | `sentence-transformers/all-MiniLM-L6-v2` (384-dim, RAG retrieval) |
|
| 63 |
+
| LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) |
|
| 64 |
+
| OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned PDF text extraction |
|
| 65 |
| Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
|
| 66 |
+
| Comparison | Semantic similarity with sentence embeddings + string matching fallback |
|
| 67 |
| Obligations | Regex pattern matching across 5 obligation categories |
|
| 68 |
|
| 69 |
+
## π OCR Architecture (Smart PDF Router)
|
| 70 |
+
|
| 71 |
+
```
|
| 72 |
+
PDF uploaded
|
| 73 |
+
β
|
| 74 |
+
[detect_if_scanned] β pdfplumber extracts >50 chars/page?
|
| 75 |
+
β β
|
| 76 |
+
Native PDF Scanned PDF
|
| 77 |
+
β β
|
| 78 |
+
pdfplumber docTR OCR (CPU)
|
| 79 |
+
β β
|
| 80 |
+
Contract text β existing analysis pipeline
|
| 81 |
+
```
|
| 82 |
+
|
| 83 |
+
## π¬ Q&A Chatbot Architecture (RAG)
|
| 84 |
+
|
| 85 |
+
```
|
| 86 |
+
User asks question about their contract
|
| 87 |
+
β
|
| 88 |
+
[1] Embed question with all-MiniLM-L6-v2
|
| 89 |
+
β
|
| 90 |
+
[2] Retrieve top-5 most relevant chunks from contract
|
| 91 |
+
β
|
| 92 |
+
[3] Build prompt:
|
| 93 |
+
- System: ClauseGuard analysis results (clauses, entities, risk scores)
|
| 94 |
+
- Context: Retrieved contract chunks (β€2.5K tokens)
|
| 95 |
+
- User question
|
| 96 |
+
β
|
| 97 |
+
[4] Stream response from Qwen2.5-7B via HF Inference API
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
+
**Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt β NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.
|
| 101 |
+
|
| 102 |
+
## βοΈ Clause Redlining Architecture (3-Tier)
|
| 103 |
+
|
| 104 |
+
| Tier | Method | Speed | Hallucination Risk |
|
| 105 |
+
|------|--------|-------|--------------------|
|
| 106 |
+
| **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
|
| 107 |
+
| **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero |
|
| 108 |
+
| **3. LLM Refinement** | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) |
|
| 109 |
+
|
| 110 |
+
Anti-hallucination guardrails:
|
| 111 |
+
- **Template anchor:** LLM can only refine, not generate from scratch
|
| 112 |
+
- **Legal citation:** Every suggestion includes legal basis and consumer standard
|
| 113 |
+
- **Disclaimer:** Clear "Not legal advice" warning
|
| 114 |
+
|
| 115 |
## π Risk Scoring Methodology
|
| 116 |
|
| 117 |
Risk scores combine clause detection with weighted severity:
|
|
|
|
| 127 |
- D (50-69): High risk
|
| 128 |
- F (70+): Critical risk
|
| 129 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 130 |
## π Usage
|
| 131 |
|
| 132 |
1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
|
| 133 |
+
- π‘ Scanned PDFs are automatically processed with OCR
|
| 134 |
2. Click **Analyze Contract**
|
| 135 |
3. View results across tabs:
|
| 136 |
- **Document**: Full text with inline entity highlights
|
|
|
|
| 139 |
- **Contradictions**: Conflicting clauses and missing provisions
|
| 140 |
- **Obligations**: Action items categorized by type
|
| 141 |
- **Compliance**: Regulatory framework checks
|
| 142 |
+
- **Redlining**: βοΈ Safer clause alternatives with legal citations
|
| 143 |
4. **Export** JSON/CSV reports
|
| 144 |
+
5. Switch to **π¬ Contract Q&A** tab to ask questions about your contract
|
| 145 |
|
| 146 |
## π Compare Contracts
|
| 147 |
|
|
|
|
| 149 |
- Upload or paste two contracts side-by-side
|
| 150 |
- See clause-level diffs (added, removed, modified)
|
| 151 |
- Get an alignment score and risk delta
|
|
|
|
| 152 |
|
| 153 |
## β οΈ Disclaimer
|
| 154 |
|
|
|
|
| 160 |
- [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
|
| 161 |
- [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
|
| 162 |
- [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
|
| 163 |
+
- [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
|
| 164 |
+
- [docTR OCR](https://github.com/mindee/doctr)
|
| 165 |
- [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)
|
| 166 |
|
| 167 |
---
|