File size: 10,023 Bytes
85cf385
 
 
94c4c90
 
85cf385
 
 
 
 
 
e8d10a0
f4b6528
e8d10a0
28c983c
 
f4b6528
 
 
 
 
 
 
 
 
 
f4ccb3e
 
 
 
 
 
 
 
 
 
 
28c983c
 
 
 
 
 
e8d10a0
d3099a5
e8d10a0
d3099a5
 
 
 
 
 
 
 
 
 
28c983c
 
 
e8d10a0
d3099a5
28c983c
d3099a5
 
e8d10a0
 
d3099a5
 
 
28c983c
 
d3099a5
 
 
 
 
 
 
 
28c983c
 
f4b6528
28c983c
 
d3099a5
28c983c
d3099a5
 
28c983c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d3099a5
 
 
 
 
 
 
e8d10a0
d3099a5
 
 
 
 
 
 
e8d10a0
 
d3099a5
28c983c
e8d10a0
d3099a5
 
 
 
 
 
 
28c983c
d3099a5
28c983c
d3099a5
 
 
 
 
 
 
e8d10a0
 
 
d3099a5
e8d10a0
 
 
d3099a5
 
 
 
28c983c
 
d3099a5
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
---
title: ClauseGuard
emoji: πŸ›‘οΈ
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: "5.23.0"
python_version: "3.12"
app_file: app.py
pinned: false
---

# πŸ›‘οΈ ClauseGuard v4.3 β€” World's Best Open-Source Legal Contract Analysis

**ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.

## πŸ†• What's New in v4.3

| Feature | Description |
|---------|-------------|
| **⚑ ONNX + INT8 Quantization** | CUAD classifier now supports ONNX Runtime with dynamic INT8 quantization β€” **2-4x faster inference on CPU**. New `ml/export_onnx_v2.py` handles the full mergeβ†’exportβ†’quantize pipeline. |
| **🎯 Better Embeddings** | Upgraded from `all-MiniLM-L6-v2` to `BAAI/bge-small-en-v1.5` β€” **+21% retrieval accuracy** on MTEB benchmarks, same 384-dim, same latency. Includes query instruction prefix for asymmetric retrieval. |
| **πŸš€ Batched Classification** | All clauses classified in a single batched forward pass (batch_size=8) instead of one-by-one β€” **2-3x throughput improvement**. |
| **🧡 CPU Thread Control** | `torch.set_num_threads(2)` prevents CPU thrashing under concurrent Gradio requests |

### Previous: v4.2

| Feature | Description |
|---------|-------------|
| **πŸ”§ NLI Fix** | Fixed contradiction detection β€” now uses `CrossEncoder.predict()` instead of broken `pipeline("text-classification")` dict input. Contradictions actually work now. |
| **πŸ”’ Thread Safety** | `BoundedCache` now uses `threading.RLock` to prevent race conditions under concurrent Gradio requests |
| **⚑ Pre-compiled Regex** | All regex patterns (clause classification, obligations, compliance negation) pre-compiled at module level β€” eliminates thousands of redundant compilations |
| **πŸ”— Extension Fix** | Chrome extension risk formula now matches backend (diminishing returns, not normalized by doc length). Fixed API_BASE URL. |
| **🏷️ Label Coverage** | Added missing regex-only labels (Indemnification, Confidentiality, Force Majeure, Penalties) to RISK_MAP and DESC_MAP |
| **πŸ›‘οΈ Security** | API CORS localhost origins now require explicit opt-in via `CORS_ALLOW_LOCALHOST=true` env var |

### Previous: v4.0

| Feature | Description |
|---------|-------------|
| **πŸ” OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) |
| **πŸ’¬ Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation |
| **✏️ Clause Redlining** | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses |

## ✨ Core Features

### Analysis Engine
| Feature | Description |
|---------|-------------|
| **41 CUAD Clause Categories** | Full taxonomy: Document Name, Parties, Governing Law, Indemnification, Termination, Non-Compete, IP Ownership, Audit Rights, Force Majeure, and more |
| **4-Tier Risk Scoring** | Critical πŸ”΄ / High 🟠 / Medium 🟑 / Low 🟒 with visual risk matrix |
| **Legal NER** | Extracts parties, dates, monetary values ($), jurisdictions, defined terms, and party roles |
| **NLI Contradiction Detection** | Identifies conflicting clauses (e.g., uncapped + capped liability) and missing critical provisions |
| **Obligation Tracker** | Categorizes action items: monetary πŸ’°, compliance βš–οΈ, reporting πŸ“Š, delivery πŸ“¦, termination πŸ›‘ |
| **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
| **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
| **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations |
| **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
| **OCR Support** | Process scanned PDFs with docTR OCR engine |

### Document Support
- **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned)
- **DOCX/DOC** parsing via `python-docx`
- **TXT / Markdown** direct text input

### UI/UX
- **3-Panel Professional Layout** β€” Upload sidebar + Main analysis + Summary dashboard
- **Document Viewer** β€” Inline entity highlights (colored annotations)
- **Clause Cards** β€” Expandable risk-badged cards with confidence scores
- **Redlining Tab** β€” Side-by-side original vs suggested safer alternatives
- **Q&A Chat Tab** β€” Conversational interface to ask questions about the contract
- **Export Reports** β€” JSON (structured) and CSV (tabular) downloads
- **Color-Coded Risk Badges** β€” Instant visual triage

## 🧠 Models & Architecture

| Component | Technology |
|-----------|------------|
| Clause Classification | `Mokshith31/legalbert-contract-clause-classification` β€” LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
| Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types |
| NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) |
| Embeddings | `BAAI/bge-small-en-v1.5` (384-dim, RAG retrieval β€” +21% over MiniLM) |
| LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) |
| OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned PDF text extraction |
| Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
| Comparison | Semantic similarity with sentence embeddings + string matching fallback |
| Obligations | Regex pattern matching across 5 obligation categories |

## πŸ” OCR Architecture (Smart PDF Router)

```
PDF uploaded
    ↓
[detect_if_scanned] β€” pdfplumber extracts >50 chars/page?
    ↓                           ↓
  Native PDF               Scanned PDF
    ↓                           ↓
  pdfplumber              docTR OCR (CPU)
    ↓                           ↓
  Contract text β†’ existing analysis pipeline
```

## πŸ’¬ Q&A Chatbot Architecture (RAG)

```
User asks question about their contract
        ↓
[1] Embed question with all-MiniLM-L6-v2
        ↓
[2] Retrieve top-5 most relevant chunks from contract
        ↓
[3] Build prompt:
    - System: ClauseGuard analysis results (clauses, entities, risk scores)
    - Context: Retrieved contract chunks (≀2.5K tokens)
    - User question
        ↓
[4] Stream response from Qwen2.5-7B via HF Inference API
```

**Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt β€” NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.

## ✏️ Clause Redlining Architecture (3-Tier)

| Tier | Method | Speed | Hallucination Risk |
|------|--------|-------|--------------------|
| **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
| **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero |
| **3. LLM Refinement** | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) |

Anti-hallucination guardrails:
- **Template anchor:** LLM can only refine, not generate from scratch
- **Legal citation:** Every suggestion includes legal basis and consumer standard
- **Disclaimer:** Clear "Not legal advice" warning

## πŸ“Š Risk Scoring Methodology

Risk scores combine clause detection with weighted severity:
- **CRITICAL**: 40 pts (Uncapped Liability, Arbitration, IP Assignment, etc.)
- **HIGH**: 20 pts (Non-Compete, Exclusivity, Unilateral Change, etc.)
- **MEDIUM**: 10 pts (Governing Law, Jurisdiction, etc.)
- **LOW**: 3 pts (Document Name, Dates, etc.)

Final score normalized to 0-100 with letter grades:
- A (0-14): Low risk
- B (15-29): Moderate risk
- C (30-49): Elevated risk
- D (50-69): High risk
- F (70+): Critical risk

## πŸš€ Usage

1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
   - πŸ’‘ Scanned PDFs are automatically processed with OCR
2. Click **Analyze Contract**
3. View results across tabs:
   - **Document**: Full text with inline entity highlights
   - **Clauses**: Detected clauses with risk badges
   - **Entities**: Extracted parties, dates, money, jurisdictions
   - **Contradictions**: Conflicting clauses and missing provisions
   - **Obligations**: Action items categorized by type
   - **Compliance**: Regulatory framework checks
   - **Redlining**: ✏️ Safer clause alternatives with legal citations
4. **Export** JSON/CSV reports
5. Switch to **πŸ’¬ Contract Q&A** tab to ask questions about your contract

## πŸ”€ Compare Contracts

Switch to the **Compare Contracts** tab to:
- Upload or paste two contracts side-by-side
- See clause-level diffs (added, removed, modified)
- Get an alignment score and risk delta

## ⚠️ Disclaimer

*Not legal advice. ClauseGuard is an AI-powered analysis tool for informational purposes only. Always consult a qualified attorney for legal decisions. The tool may miss nuances and should be used as a preliminary screening aid, not a substitute for professional legal review.*

## πŸ”— Links

- [ClauseGuard Space](https://huggingface.co/spaces/gaurv007/ClauseGuard)
- [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
- [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
- [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
- [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [docTR OCR](https://github.com/mindee/doctr)
- [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)

---

*Built with ❀️ using Gradio, Hugging Face Transformers, and Legal-BERT. Open source and free for all.*