---
title: ClauseGuard
emoji: 🛡️
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: "5.23.0"
python_version: "3.12"
app_file: app.py
pinned: false
---
# πŸ›‘οΈ ClauseGuard v4.3 β€” World's Best Open-Source Legal Contract Analysis
**ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.
## 🆕 What's New in v4.3
| Feature | Description |
|---------|-------------|
| **⚡ ONNX + INT8 Quantization** | CUAD classifier now supports ONNX Runtime with dynamic INT8 quantization for **2-4x faster inference on CPU**. New `ml/export_onnx_v2.py` handles the full merge→export→quantize pipeline. |
| **🎯 Better Embeddings** | Upgraded from `all-MiniLM-L6-v2` to `BAAI/bge-small-en-v1.5`: **+21% retrieval accuracy** on MTEB benchmarks, same 384-dim, same latency. Includes query instruction prefix for asymmetric retrieval. |
| **🚀 Batched Classification** | All clauses classified in a single batched forward pass (batch_size=8) instead of one-by-one, a **2-3x throughput improvement**. |
| **🧵 CPU Thread Control** | `torch.set_num_threads(2)` prevents CPU thrashing under concurrent Gradio requests |
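The batched-classification change boils down to chunking clauses into fixed-size batches before each forward pass. A minimal sketch, where `classify_batch` is a hypothetical stand-in for the real batched CUAD classifier call (tokenize, forward pass, argmax):

```python
from typing import Callable, List

def classify_in_batches(
    clauses: List[str],
    classify_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Run the classifier over clauses in fixed-size batches instead of
    one clause per forward pass."""
    labels: List[str] = []
    for start in range(0, len(clauses), batch_size):
        labels.extend(classify_batch(clauses[start:start + batch_size]))
    return labels

# Toy stand-in classifier: labels a clause by its length.
fake = lambda batch: ["LONG" if len(c) > 20 else "SHORT" for c in batch]
print(classify_in_batches(["a" * 30, "b" * 5, "c" * 25], fake))
# → ['LONG', 'SHORT', 'LONG']
```

The throughput win comes from amortizing tokenization and model overhead across the batch; batch_size=8 keeps peak memory modest on CPU.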
### Previous: v4.2
| Feature | Description |
|---------|-------------|
| **🔧 NLI Fix** | Fixed contradiction detection: now uses `CrossEncoder.predict()` instead of broken `pipeline("text-classification")` dict input. Contradictions actually work now. |
| **🔒 Thread Safety** | `BoundedCache` now uses `threading.RLock` to prevent race conditions under concurrent Gradio requests |
| **⚡ Pre-compiled Regex** | All regex patterns (clause classification, obligations, compliance negation) pre-compiled at module level, eliminating thousands of redundant compilations |
| **🔗 Extension Fix** | Chrome extension risk formula now matches backend (diminishing returns, not normalized by doc length). Fixed API_BASE URL. |
| **🏷️ Label Coverage** | Added missing regex-only labels (Indemnification, Confidentiality, Force Majeure, Penalties) to RISK_MAP and DESC_MAP |
| **🛡️ Security** | API CORS localhost origins now require explicit opt-in via `CORS_ALLOW_LOCALHOST=true` env var |
### Previous: v4.0
| Feature | Description |
|---------|-------------|
| **πŸ” OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) |
| **πŸ’¬ Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation |
| **✏️ Clause Redlining** | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses |
## ✨ Core Features
### Analysis Engine
| Feature | Description |
|---------|-------------|
| **41 CUAD Clause Categories** | Full taxonomy: Document Name, Parties, Governing Law, Indemnification, Termination, Non-Compete, IP Ownership, Audit Rights, Force Majeure, and more |
| **4-Tier Risk Scoring** | Critical 🔴 / High 🟠 / Medium 🟡 / Low 🟢 with visual risk matrix |
| **Legal NER** | Extracts parties, dates, monetary values ($), jurisdictions, defined terms, and party roles |
| **NLI Contradiction Detection** | Identifies conflicting clauses (e.g., uncapped + capped liability) and missing critical provisions |
| **Obligation Tracker** | Categorizes action items: monetary 💰, compliance ⚖️, reporting 📊, delivery 📦, termination 🛑 |
| **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
| **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
| **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations |
| **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
| **OCR Support** | Process scanned PDFs with docTR OCR engine |
### Document Support
- **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned)
- **DOCX/DOC** parsing via `python-docx`
- **TXT / Markdown** direct text input
### UI/UX
- **3-Panel Professional Layout**: Upload sidebar + Main analysis + Summary dashboard
- **Document Viewer**: Inline entity highlights (colored annotations)
- **Clause Cards**: Expandable risk-badged cards with confidence scores
- **Redlining Tab**: Side-by-side original vs suggested safer alternatives
- **Q&A Chat Tab**: Conversational interface to ask questions about the contract
- **Export Reports**: JSON (structured) and CSV (tabular) downloads
- **Color-Coded Risk Badges**: Instant visual triage
## 🧠 Models & Architecture
| Component | Technology |
|-----------|------------|
| Clause Classification | `Mokshith31/legalbert-contract-clause-classification`: LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on the CUAD 41-class taxonomy |
| Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types |
| NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) |
| Embeddings | `BAAI/bge-small-en-v1.5` (384-dim, RAG retrieval; +21% over MiniLM) |
| LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) |
| OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned PDF text extraction |
| Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
| Comparison | Semantic similarity with sentence embeddings + string matching fallback |
| Obligations | Regex pattern matching across 5 obligation categories |
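The obligation tracker (last row above) is regex pattern matching with patterns compiled once at module level, per the v4.2 fix. A sketch with illustrative categories and keywords; ClauseGuard's real pattern lists differ:

```python
import re

# Compiled once at import time, not per call. Categories and keywords
# here are illustrative placeholders, not ClauseGuard's actual lists.
OBLIGATION_PATTERNS = {
    "monetary": re.compile(r"\bshall pay\b|\bfee of\b|\bpenalty\b", re.I),
    "reporting": re.compile(r"\bshall report\b|\bnotify\b|\bprovide notice\b", re.I),
    "delivery": re.compile(r"\bshall deliver\b|\bdelivery date\b", re.I),
}

def tag_obligations(sentence: str) -> list:
    """Return every obligation category whose pattern matches the sentence."""
    return [cat for cat, pat in OBLIGATION_PATTERNS.items() if pat.search(sentence)]

print(tag_obligations("Licensee shall pay the fee of $500 and notify Licensor."))
# → ['monetary', 'reporting']
```

Module-level compilation is what the v4.2 pre-compiled regex fix buys: each pattern is compiled once, not once per clause per request.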
## πŸ” OCR Architecture (Smart PDF Router)
```
PDF uploaded
     ↓
[detect_if_scanned] - does pdfplumber extract >50 chars/page?
     ↓ yes                 ↓ no
Native PDF            Scanned PDF
     ↓                     ↓
pdfplumber            docTR OCR (CPU)
     ↓                     ↓
     Contract text → existing analysis pipeline
```
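The routing decision above can be sketched as a simple threshold check. The >50 chars/page cutoff comes from the diagram; the function name and the choice to average across pages are assumptions:

```python
def looks_scanned(pages_text, min_chars_per_page: int = 50) -> bool:
    """Heuristic router: if pdfplumber recovers at most ~50 characters per
    page on average, the PDF is likely a scan and should go to docTR OCR.
    `pages_text` is the list of per-page extracted strings (may hold None)."""
    total = sum(len(t or "") for t in pages_text)
    return total / max(len(pages_text), 1) <= min_chars_per_page

print(looks_scanned(["", None, ""]))                      # → True  (route to docTR)
print(looks_scanned(["This Agreement is made..." * 10]))  # → False (native text layer)
```

A text-layer check like this is cheap because pdfplumber has to open the PDF anyway; OCR only runs when extraction comes back essentially empty.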
## 💬 Q&A Chatbot Architecture (RAG)
```
User asks question about their contract
↓
[1] Embed question with BAAI/bge-small-en-v1.5 (query instruction prefix)
↓
[2] Retrieve top-5 most relevant chunks from contract
↓
[3] Build prompt:
    - System: ClauseGuard analysis results (clauses, entities, risk scores)
    - Context: retrieved contract chunks (≤2.5K tokens)
    - User question
↓
[4] Stream response from Qwen2.5-7B via HF Inference API
```
**Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt, NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.
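The retrieval step can be sketched with toy 2-dimensional vectors standing in for bge-small-en-v1.5's 384-dim embeddings. The query prefix follows BGE's asymmetric-retrieval convention (queries get the instruction, passages do not):

```python
import math

# BGE convention: prepend this instruction to queries, not to passages.
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(query_vec, cv[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Toy vectors standing in for real 384-dim contract-chunk embeddings.
chunks = ["termination clause", "payment terms", "governing law"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k_chunks([1.0, 0.1], vecs, chunks, k=2))
# → ['termination clause', 'governing law']
```

In the real pipeline the embedder would be called as `model.encode(BGE_QUERY_PREFIX + question)` for the query and `model.encode(chunk)` for each chunk; the ranking logic is the same.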
## ✏️ Clause Redlining Architecture (3-Tier)
| Tier | Method | Speed | Hallucination Risk |
|------|--------|-------|--------------------|
| **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
| **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero |
| **3. LLM Refinement** | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) |
Anti-hallucination guardrails:
- **Template anchor:** LLM can only refine, not generate from scratch
- **Legal citation:** Every suggestion includes legal basis and consumer standard
- **Disclaimer:** Clear "Not legal advice" warning
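The tiered fallback can be sketched as a dispatcher. The template texts, keyword keys, and `llm_refine` hook below are hypothetical placeholders, not ClauseGuard's real template library:

```python
# Illustrative templates; the real library has 18+ entries based on FTC/EU/CFPB standards.
TEMPLATES = {
    "uncapped liability": "Liability is capped at fees paid in the 12 months preceding the claim.",
    "auto-renewal": "Renewal requires express written consent at least 30 days before expiry.",
}

def redline(clause: str, risk: str, llm_refine=None) -> str:
    """Tier 1/2: keyword-match the clause to a pre-written safe template.
    Tier 3: for CRITICAL/HIGH clauses, let an LLM adapt the matched template
    (template-anchored, so it refines rather than generates from scratch)."""
    text = clause.lower()
    for key, safe_alt in TEMPLATES.items():
        if key in text:
            if risk in ("CRITICAL", "HIGH") and llm_refine is not None:
                return llm_refine(template=safe_alt, clause=clause)
            return safe_alt
    return "No template match; flag for manual review."

print(redline("Vendor has uncapped liability for all claims.", "MEDIUM"))
```

The anti-hallucination property falls out of the structure: the LLM path only ever receives an existing template to adapt, never a blank slate.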
## 📊 Risk Scoring Methodology
Risk scores combine clause detection with weighted severity:
- **CRITICAL**: 40 pts (Uncapped Liability, Arbitration, IP Assignment, etc.)
- **HIGH**: 20 pts (Non-Compete, Exclusivity, Unilateral Change, etc.)
- **MEDIUM**: 10 pts (Governing Law, Jurisdiction, etc.)
- **LOW**: 3 pts (Document Name, Dates, etc.)
Final score normalized to 0-100 with letter grades:
- A (0-14): Low risk
- B (15-29): Moderate risk
- C (30-49): Elevated risk
- D (50-69): High risk
- F (70+): Critical risk
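The weights and grade cutoffs above can be wired together as follows. The saturating curve used here is only one plausible diminishing-returns formula (the v4.2 notes say the backend uses diminishing returns rather than length normalization, but not the exact math):

```python
WEIGHTS = {"CRITICAL": 40, "HIGH": 20, "MEDIUM": 10, "LOW": 3}

def risk_score(detected):
    """Sum severity points for detected clauses, then squash into 0-100
    with a saturating curve (assumed shape, not ClauseGuard's exact formula)."""
    raw = sum(WEIGHTS[sev] for sev in detected)
    return round(100 * raw / (raw + 60), 1)  # diminishing returns, never exceeds 100

def risk_grade(score: float) -> str:
    """Map a 0-100 score to the letter grades listed above."""
    for cutoff, grade in ((70, "F"), (50, "D"), (30, "C"), (15, "B")):
        if score >= cutoff:
            return grade
    return "A"

score = risk_score(["CRITICAL", "HIGH", "MEDIUM"])
print(score, risk_grade(score))  # → 53.8 D
```

Note the deliberate property: each additional risky clause raises the score by less than the last, so a long contract with many medium findings does not automatically outrank a short one with a single critical clause.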
## 🚀 Usage
1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
   - 💡 Scanned PDFs are automatically processed with OCR
2. Click **Analyze Contract**
3. View results across tabs:
- **Document**: Full text with inline entity highlights
- **Clauses**: Detected clauses with risk badges
- **Entities**: Extracted parties, dates, money, jurisdictions
- **Contradictions**: Conflicting clauses and missing provisions
- **Obligations**: Action items categorized by type
- **Compliance**: Regulatory framework checks
- **Redlining**: ✏️ Safer clause alternatives with legal citations
4. **Export** JSON/CSV reports
5. Switch to the **💬 Contract Q&A** tab to ask questions about your contract
## 🔀 Compare Contracts
Switch to the **Compare Contracts** tab to:
- Upload or paste two contracts side-by-side
- See clause-level diffs (added, removed, modified)
- Get an alignment score and risk delta
## ⚠️ Disclaimer
*Not legal advice. ClauseGuard is an AI-powered analysis tool for informational purposes only. Always consult a qualified attorney for legal decisions. The tool may miss nuances and should be used as a preliminary screening aid, not a substitute for professional legal review.*
## 🔗 Links
- [ClauseGuard Space](https://huggingface.co/spaces/gaurv007/ClauseGuard)
- [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
- [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
- [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
- [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [docTR OCR](https://github.com/mindee/doctr)
- [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)
---
*Built with ❀️ using Gradio, Hugging Face Transformers, and Legal-BERT. Open source and free for all.*