ClauseGuard / README.md
gaurv007's picture
⚑ v4.3: Performance optimizations β€” ONNX INT8, BGE embedder, batched classification, thread control (#4)
f4b6528
|
raw
history blame
10 kB
metadata
title: ClauseGuard
emoji: πŸ›‘οΈ
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 5.23.0
python_version: '3.12'
app_file: app.py
pinned: false

πŸ›‘οΈ ClauseGuard v4.3 β€” World's Best Open-Source Legal Contract Analysis

ClauseGuard is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.

πŸ†• What's New in v4.3

Feature Description
⚑ ONNX + INT8 Quantization CUAD classifier now supports ONNX Runtime with dynamic INT8 quantization β€” 2-4x faster inference on CPU. New ml/export_onnx_v2.py handles the full mergeβ†’exportβ†’quantize pipeline.
🎯 Better Embeddings Upgraded from all-MiniLM-L6-v2 to BAAI/bge-small-en-v1.5 β€” +21% retrieval accuracy on MTEB benchmarks, same 384-dim, same latency. Includes query instruction prefix for asymmetric retrieval.
πŸš€ Batched Classification All clauses classified in a single batched forward pass (batch_size=8) instead of one-by-one β€” 2-3x throughput improvement.
🧡 CPU Thread Control torch.set_num_threads(2) prevents CPU thrashing under concurrent Gradio requests

Previous: v4.2

Feature Description
πŸ”§ NLI Fix Fixed contradiction detection β€” now uses CrossEncoder.predict() instead of broken pipeline("text-classification") dict input. Contradictions actually work now.
πŸ”’ Thread Safety BoundedCache now uses threading.RLock to prevent race conditions under concurrent Gradio requests
⚑ Pre-compiled Regex All regex patterns (clause classification, obligations, compliance negation) pre-compiled at module level β€” eliminates thousands of redundant compilations
πŸ”— Extension Fix Chrome extension risk formula now matches backend (diminishing returns, not normalized by doc length). Fixed API_BASE URL.
🏷️ Label Coverage Added missing regex-only labels (Indemnification, Confidentiality, Force Majeure, Penalties) to RISK_MAP and DESC_MAP
πŸ›‘οΈ Security API CORS localhost origins now require explicit opt-in via CORS_ALLOW_LOCALHOST=true env var

Previous: v4.0

Feature Description
πŸ” OCR for Scanned PDFs Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models)
πŸ’¬ Contract Q&A Chatbot RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation
✏️ Clause Redlining 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses

✨ Core Features

Analysis Engine

Feature Description
41 CUAD Clause Categories Full taxonomy: Document Name, Parties, Governing Law, Indemnification, Termination, Non-Compete, IP Ownership, Audit Rights, Force Majeure, and more
4-Tier Risk Scoring Critical πŸ”΄ / High 🟠 / Medium 🟑 / Low 🟒 with visual risk matrix
Legal NER Extracts parties, dates, monetary values ($), jurisdictions, defined terms, and party roles
NLI Contradiction Detection Identifies conflicting clauses (e.g., uncapped + capped liability) and missing critical provisions
Obligation Tracker Categorizes action items: monetary πŸ’°, compliance βš–οΈ, reporting πŸ“Š, delivery πŸ“¦, termination πŸ›‘
Compliance Checker Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements
Contract Comparison Side-by-side diff between two contracts with alignment scoring
Clause Redlining Suggests safer alternatives for risky clauses with legal citations
Q&A Chatbot Ask questions about your contract using RAG (Retrieval-Augmented Generation)
OCR Support Process scanned PDFs with docTR OCR engine

Document Support

  • PDF parsing via pdfplumber (native) + docTR OCR (scanned)
  • DOCX/DOC parsing via python-docx
  • TXT / Markdown direct text input

UI/UX

  • 3-Panel Professional Layout β€” Upload sidebar + Main analysis + Summary dashboard
  • Document Viewer β€” Inline entity highlights (colored annotations)
  • Clause Cards β€” Expandable risk-badged cards with confidence scores
  • Redlining Tab β€” Side-by-side original vs suggested safer alternatives
  • Q&A Chat Tab β€” Conversational interface to ask questions about the contract
  • Export Reports β€” JSON (structured) and CSV (tabular) downloads
  • Color-Coded Risk Badges β€” Instant visual triage

🧠 Models & Architecture

Component Technology
Clause Classification Mokshith31/legalbert-contract-clause-classification β€” LoRA adapter on nlpaueb/legal-bert-base-uncased, fine-tuned on CUAD 41-class taxonomy
Legal NER matterstack/legal-bert-ner (ML) with regex fallback for 7 entity types
NLI cross-encoder/nli-deberta-v3-base (semantic contradiction detection)
Embeddings BAAI/bge-small-en-v1.5 (384-dim, RAG retrieval β€” +21% over MiniLM)
LLM Qwen/Qwen2.5-7B-Instruct via HF Inference API (chatbot + redlining)
OCR docTR (fast_base + crnn_vgg16_bn) for scanned PDF text extraction
Compliance Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA
Comparison Semantic similarity with sentence embeddings + string matching fallback
Obligations Regex pattern matching across 5 obligation categories

πŸ” OCR Architecture (Smart PDF Router)

PDF uploaded
    ↓
[detect_if_scanned] β€” pdfplumber extracts >50 chars/page?
    ↓                           ↓
  Native PDF               Scanned PDF
    ↓                           ↓
  pdfplumber              docTR OCR (CPU)
    ↓                           ↓
  Contract text β†’ existing analysis pipeline

πŸ’¬ Q&A Chatbot Architecture (RAG)

User asks question about their contract
        ↓
[1] Embed question with all-MiniLM-L6-v2
        ↓
[2] Retrieve top-5 most relevant chunks from contract
        ↓
[3] Build prompt:
    - System: ClauseGuard analysis results (clauses, entities, risk scores)
    - Context: Retrieved contract chunks (≀2.5K tokens)
    - User question
        ↓
[4] Stream response from Qwen2.5-7B via HF Inference API

Key design: Analyzed data (clauses, entities, risk scores) goes in the system prompt β€” NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.

✏️ Clause Redlining Architecture (3-Tier)

Tier Method Speed Hallucination Risk
1. Template Lookup 18+ pre-written safe alternatives based on FTC/EU/CFPB standards Instant Zero
2. Keyword Matching Match clause text to relevant templates via legal keywords Instant Zero
3. LLM Refinement Qwen2.5-7B adapts template to specific clause context ~3-5s Low (template-anchored)

Anti-hallucination guardrails:

  • Template anchor: LLM can only refine, not generate from scratch
  • Legal citation: Every suggestion includes legal basis and consumer standard
  • Disclaimer: Clear "Not legal advice" warning

πŸ“Š Risk Scoring Methodology

Risk scores combine clause detection with weighted severity:

  • CRITICAL: 40 pts (Uncapped Liability, Arbitration, IP Assignment, etc.)
  • HIGH: 20 pts (Non-Compete, Exclusivity, Unilateral Change, etc.)
  • MEDIUM: 10 pts (Governing Law, Jurisdiction, etc.)
  • LOW: 3 pts (Document Name, Dates, etc.)

Final score normalized to 0-100 with letter grades:

  • A (0-14): Low risk
  • B (15-29): Moderate risk
  • C (30-49): Elevated risk
  • D (50-69): High risk
  • F (70+): Critical risk

πŸš€ Usage

  1. Upload a contract (PDF, DOCX, or TXT) or paste text directly
    • πŸ’‘ Scanned PDFs are automatically processed with OCR
  2. Click Analyze Contract
  3. View results across tabs:
    • Document: Full text with inline entity highlights
    • Clauses: Detected clauses with risk badges
    • Entities: Extracted parties, dates, money, jurisdictions
    • Contradictions: Conflicting clauses and missing provisions
    • Obligations: Action items categorized by type
    • Compliance: Regulatory framework checks
    • Redlining: ✏️ Safer clause alternatives with legal citations
  4. Export JSON/CSV reports
  5. Switch to πŸ’¬ Contract Q&A tab to ask questions about your contract

πŸ”€ Compare Contracts

Switch to the Compare Contracts tab to:

  • Upload or paste two contracts side-by-side
  • See clause-level diffs (added, removed, modified)
  • Get an alignment score and risk delta

⚠️ Disclaimer

Not legal advice. ClauseGuard is an AI-powered analysis tool for informational purposes only. Always consult a qualified attorney for legal decisions. The tool may miss nuances and should be used as a preliminary screening aid, not a substitute for professional legal review.

πŸ”— Links


Built with ❀️ using Gradio, Hugging Face Transformers, and Legal-BERT. Open source and free for all.