anky2002 committed
Commit 2ca6bab
2 Parent(s): 970b3d549d0c4a

Merge branch 'main' of https://huggingface.co/spaces/gaurv007/ClauseGuard
README.md CHANGED
@@ -10,9 +10,17 @@ app_file: app.py
 pinned: false
 ---

- # 🛡️ ClauseGuard — World's Best Open-Source Legal Contract Analysis
-
- **ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments.

 ## ✨ Core Features

@@ -26,9 +34,12 @@ pinned: false
 | **Obligation Tracker** | Categorizes action items: monetary 💰, compliance ⚖️, reporting 📊, delivery 📦, termination 🛑 |
 | **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
 | **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |

 ### Document Support
- - **PDF** parsing via `pdfplumber`
 - **DOCX/DOC** parsing via `python-docx`
 - **TXT / Markdown** direct text input

@@ -36,6 +47,8 @@ pinned: false
 - **3-Panel Professional Layout** — Upload sidebar + Main analysis + Summary dashboard
 - **Document Viewer** — Inline entity highlights (colored annotations)
 - **Clause Cards** — Expandable risk-badged cards with confidence scores
 - **Export Reports** — JSON (structured) and CSV (tabular) downloads
 - **Color-Coded Risk Badges** — Instant visual triage

@@ -44,12 +57,61 @@ pinned: false
 | Component | Technology |
 |-----------|------------|
 | Clause Classification | `Mokshith31/legalbert-contract-clause-classification` — LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
- | NER | Rule-based with 7 entity types (dates, money, parties, jurisdictions, defined terms) |
- | NLI | Heuristic contradiction detection with 5 conflict patterns + missing-clause detection |
 | Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
- | Comparison | SequenceMatcher-based clause alignment with risk delta analysis |
 | Obligations | Regex pattern matching across 5 obligation categories |

 ## 📊 Risk Scoring Methodology

 Risk scores combine clause detection with weighted severity:
@@ -65,16 +127,10 @@ Final score normalized to 0-100 with letter grades:
 - D (50-69): High risk
 - F (70+): Critical risk

- ## 📚 Datasets & Research
-
- - [CUAD](https://huggingface.co/datasets/theatticusproject/cuad-qa) — 510 contracts, 13K annotations, 41 clause categories
- - [LegalBench](https://huggingface.co/datasets/nguha/legalbench) — 322 legal reasoning tasks
- - [LexGLUE](https://huggingface.co/datasets/coastalcph/lex_glue) — Unfair Terms of Service classification
- - Paper: [CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review](https://arxiv.org/abs/2103.06268) (Hendrycks et al., 2021)
-
 ## 🚀 Usage

 1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
 2. Click **Analyze Contract**
 3. View results across tabs:
    - **Document**: Full text with inline entity highlights
@@ -83,7 +139,9 @@ Final score normalized to 0-100 with letter grades:
    - **Contradictions**: Conflicting clauses and missing provisions
    - **Obligations**: Action items categorized by type
    - **Compliance**: Regulatory framework checks
 4. **Export** JSON/CSV reports

 ## 🔀 Compare Contracts

@@ -91,7 +149,6 @@ Switch to the **Compare Contracts** tab to:
 - Upload or paste two contracts side-by-side
 - See clause-level diffs (added, removed, modified)
 - Get an alignment score and risk delta
- - View raw JSON comparison data

 ## ⚠️ Disclaimer

@@ -103,6 +160,8 @@ Switch to the **Compare Contracts** tab to:
 - [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
 - [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
 - [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
 - [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)

 ---
 
 pinned: false
 ---

+ # 🛡️ ClauseGuard v4.0 — World's Best Open-Source Legal Contract Analysis
+
+ **ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, a Q&A chatbot, clause redlining, and OCR for scanned PDFs.
+
+ ## 🆕 What's New in v4.0
+
+ | Feature | Description |
+ |---------|-------------|
+ | **🔍 OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs. scanned PDFs. Scanned PDFs are processed via the docTR OCR engine (CPU-friendly, ~150MB models) |
+ | **💬 Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via the HF Inference API for generation |
+ | **✏️ Clause Redlining** | 3-tier system: (1) template lookup from 18+ legal templates based on FTC/EU standards, (2) keyword-based matching, (3) LLM refinement for CRITICAL/HIGH-risk clauses |

 ## ✨ Core Features

 | **Obligation Tracker** | Categorizes action items: monetary 💰, compliance ⚖️, reporting 📊, delivery 📦, termination 🛑 |
 | **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
 | **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
+ | **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations |
+ | **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
+ | **OCR Support** | Process scanned PDFs with the docTR OCR engine |

 ### Document Support
+ - **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned)
 - **DOCX/DOC** parsing via `python-docx`
 - **TXT / Markdown** direct text input

 - **3-Panel Professional Layout** — Upload sidebar + Main analysis + Summary dashboard
 - **Document Viewer** — Inline entity highlights (colored annotations)
 - **Clause Cards** — Expandable risk-badged cards with confidence scores
+ - **Redlining Tab** — Side-by-side original vs. suggested safer alternatives
+ - **Q&A Chat Tab** — Conversational interface to ask questions about the contract
 - **Export Reports** — JSON (structured) and CSV (tabular) downloads
 - **Color-Coded Risk Badges** — Instant visual triage

 | Component | Technology |
 |-----------|------------|
 | Clause Classification | `Mokshith31/legalbert-contract-clause-classification` — LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
+ | Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types |
+ | NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) |
+ | Embeddings | `sentence-transformers/all-MiniLM-L6-v2` (384-dim, RAG retrieval) |
+ | LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) |
+ | OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned-PDF text extraction |
 | Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
+ | Comparison | Semantic similarity with sentence embeddings + string-matching fallback |
 | Obligations | Regex pattern matching across 5 obligation categories |

+ ## 🔍 OCR Architecture (Smart PDF Router)
+
+ ```
+                PDF uploaded
+                     ↓
+ [detect_if_scanned] — pdfplumber extracts >50 chars/page?
+       ↓ yes                  ↓ no
+   Native PDF             Scanned PDF
+       ↓                      ↓
+   pdfplumber            docTR OCR (CPU)
+       ↓                      ↓
+   Contract text → existing analysis pipeline
+ ```
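The router heuristic in the diagram can be sketched in a few lines. This is an illustrative assumption, not the project's actual `ocr_engine.py`: `detect_if_scanned` here takes per-page text (e.g. from pdfplumber's `extract_text`) and applies the ~50-chars-per-page threshold mentioned above.

```python
# Sketch of the "smart PDF router" heuristic (assumed, not the project's exact code):
# a PDF counts as scanned when text extraction recovers too few characters per page.

SCANNED_THRESHOLD = 50  # avg chars per page below which we assume a scanned PDF

def detect_if_scanned(page_texts: list[str]) -> bool:
    """page_texts: text extracted per page (e.g. via pdfplumber's extract_text)."""
    if not page_texts:
        return True  # nothing extractable at all -> treat as scanned
    avg_chars = sum(len(t or "") for t in page_texts) / len(page_texts)
    return avg_chars < SCANNED_THRESHOLD

def route_pdf(page_texts: list[str]) -> str:
    # Native PDFs keep the pdfplumber text; scanned ones go to the OCR engine.
    return "doctr_ocr" if detect_if_scanned(page_texts) else "pdfplumber"
```

Either branch feeds plain contract text into the same downstream analysis pipeline, so the rest of the app never needs to know which extractor ran.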

+ ## 💬 Q&A Chatbot Architecture (RAG)
+
+ ```
+ User asks question about their contract
+        ↓
+ [1] Embed question with all-MiniLM-L6-v2
+        ↓
+ [2] Retrieve top-5 most relevant chunks from contract
+        ↓
+ [3] Build prompt:
+     - System: ClauseGuard analysis results (clauses, entities, risk scores)
+     - Context: retrieved contract chunks (≤2.5K tokens)
+     - User question
+        ↓
+ [4] Stream response from Qwen2.5-7B via HF Inference API
+ ```
+
+ **Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt — NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.
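Step [2] above is plain cosine-similarity retrieval. A minimal sketch, assuming `question_vec` and `chunk_vecs` are embeddings already produced by a model such as all-MiniLM-L6-v2 (the function names here are illustrative, not the project's `chatbot.py` API):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(question_vec, chunk_vecs, chunks, k=5):
    # Rank contract chunks by similarity to the question and keep the top k.
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda cv: cosine(question_vec, cv[1]),
        reverse=True,
    )
    return [c for c, _ in scored[:k]]
```

In the app, the retrieved chunks would then be concatenated (capped at roughly 2.5K tokens) into the prompt's context section.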
+
+ ## ✏️ Clause Redlining Architecture (3-Tier)
+
+ | Tier | Method | Speed | Hallucination Risk |
+ |------|--------|-------|--------------------|
+ | **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
+ | **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero |
+ | **3. LLM Refinement** | Qwen2.5-7B adapts the template to the specific clause context | ~3-5 s | Low (template-anchored) |
+
+ Anti-hallucination guardrails:
+ - **Template anchor:** the LLM can only refine, never generate from scratch
+ - **Legal citation:** every suggestion includes its legal basis and consumer standard
+ - **Disclaimer:** clear "Not legal advice" warning
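The tier fallthrough can be sketched as follows. The template text, keyword table, and `llm_refine` hook are illustrative placeholders (the real `redlining.py` carries 18+ templates with citations), but the control flow mirrors the table above: templates first, keywords second, LLM refinement only for CRITICAL/HIGH clauses and only anchored to an existing template.

```python
# Hypothetical data; the real project ships 18+ templates with legal citations.
TEMPLATES = {
    "Uncapped Liability": "Liability is capped at fees paid in the preceding 12 months.",
}
KEYWORDS = {"unlimited liability": "Uncapped Liability"}

def redline(category: str, clause_text: str, risk: str, llm_refine=None) -> str:
    # Tier 1: direct template lookup by clause category (zero hallucination risk).
    suggestion = TEMPLATES.get(category)
    # Tier 2: keyword match against the clause text.
    if suggestion is None:
        for kw, cat in KEYWORDS.items():
            if kw in clause_text.lower():
                suggestion = TEMPLATES.get(cat)
                break
    if suggestion is None:
        return ""  # no anchored template -> no suggestion (never free-generate)
    # Tier 3: LLM refinement, only for CRITICAL/HIGH clauses, anchored to the template.
    if llm_refine and risk in ("CRITICAL", "HIGH"):
        suggestion = llm_refine(template=suggestion, clause=clause_text)
    return suggestion
```

Returning an empty string when no template matches is the "template anchor" guardrail in code form: the model is never asked to invent a replacement clause on its own.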

 ## 📊 Risk Scoring Methodology

 Risk scores combine clause detection with weighted severity:

 - D (50-69): High risk
 - F (70+): Critical risk

 ## 🚀 Usage

 1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
+    - 💡 Scanned PDFs are automatically processed with OCR
 2. Click **Analyze Contract**
 3. View results across tabs:
    - **Document**: Full text with inline entity highlights
    - **Contradictions**: Conflicting clauses and missing provisions
    - **Obligations**: Action items categorized by type
    - **Compliance**: Regulatory framework checks
+    - **Redlining**: ✏️ Safer clause alternatives with legal citations
 4. **Export** JSON/CSV reports
+ 5. Switch to the **💬 Contract Q&A** tab to ask questions about your contract

 ## 🔀 Compare Contracts

 - Upload or paste two contracts side-by-side
 - See clause-level diffs (added, removed, modified)
 - Get an alignment score and risk delta

 ## ⚠️ Disclaimer

 - [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
 - [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
 - [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
+ - [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
+ - [docTR OCR](https://github.com/mindee/doctr)
 - [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)

 ---
api/Dockerfile CHANGED
@@ -2,10 +2,16 @@ FROM python:3.12-slim

 WORKDIR /app

- COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt

- COPY . .

 EXPOSE 8000

 WORKDIR /app

+ # Install api dependencies
+ COPY api/requirements.txt ./requirements.txt
 RUN pip install --no-cache-dir -r requirements.txt

+ # Copy shared modules from root (needed by api/main.py)
+ COPY app.py compare.py compliance.py obligations.py ./
+ COPY ocr_engine.py chatbot.py redlining.py ./
+
+ # Copy api files
+ COPY api/ ./

 EXPOSE 8000
 
api/main.py CHANGED
@@ -1,19 +1,19 @@
 """
- ClauseGuard — FastAPI Backend v3.0
 ══════════════════════════════════
- FIXED in v3.0:
- • Imports shared modules (no code duplication)
- • Fixed API schema to accept both {text} and {clauses} from extension
- • Added rate limiting
- • Added max text length validation
- • Fixed CORS (removed wildcard)
- • Added proper error responses
 """

 import os
 import re
 import json
 import time
 from contextlib import asynccontextmanager
 from typing import Optional
 from collections import defaultdict
@@ -21,14 +21,14 @@ from datetime import datetime

 import httpx
 import numpy as np
- from fastapi import FastAPI, HTTPException, Depends, Body, Request
 from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel, Field

 from auth import get_current_user, require_auth

 # ── Import shared modules ──
- # When deployed, these must be in the same directory or on PYTHONPATH
 import sys
 sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
@@ -36,29 +36,32 @@ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 try:
     from app import (
         split_clauses, classify_cuad, extract_entities,
-         detect_contradictions, compute_risk_score,
         CUAD_LABELS, RISK_MAP, DESC_MAP, _model_status,
         cuad_model, cuad_tokenizer
     )
     from obligations import extract_obligations
     from compliance import check_compliance
     from compare import compare_contracts
     _SHARED_MODULES = True
- except ImportError:
     _SHARED_MODULES = False
-     print("[API] WARNING: Could not import shared modules, using inline fallbacks")

 # ─── Config ───
 SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
 SUPABASE_SERVICE_KEY = os.environ.get("SUPABASE_SERVICE_ROLE_KEY", "")
 HF_API_TOKEN = os.environ.get("HF_API_TOKEN", "")
 SAULLM_ENDPOINT = os.environ.get("SAULLM_ENDPOINT", "")
- MAX_TEXT_LENGTH = int(os.environ.get("MAX_TEXT_LENGTH", "100000"))  # 100KB default

 # ─── Rate Limiting ───
- _rate_limits = {}  # ip -> (count, window_start)
 RATE_LIMIT_REQUESTS = 30
- RATE_LIMIT_WINDOW = 60  # seconds

 def _check_rate_limit(client_ip: str) -> bool:
     now = time.time()
@@ -113,25 +116,16 @@ async def supabase_query(table: str, params: dict, headers_extra: dict = {}):
     except Exception:
         return []

 # ─── Request/Response Models ───
 class AnalyzeRequest(BaseModel):
     text: Optional[str] = Field(None, min_length=50)
-     clauses: Optional[list] = None  # FIXED: accept clauses array from extension
     source_url: Optional[str] = None

- class AnalyzeResponse(BaseModel):
-     risk_score: int
-     grade: str
-     total_clauses: int
-     flagged_count: int
-     results: list[dict]
-     entities: list[dict]
-     contradictions: list[dict]
-     obligations: list[dict]
-     compliance: dict
-     model: str
-     latency_ms: int
-
 class CompareRequest(BaseModel):
     text_a: str = Field(..., min_length=50)
     text_b: str = Field(..., min_length=50)
@@ -147,21 +141,28 @@ class ExplainResponse(BaseModel):
     legal_basis: str
     recommendation: str

 # ─── App ───
 @asynccontextmanager
 async def lifespan(app: FastAPI):
-     # Models are loaded when app.py is imported
     yield

- app = FastAPI(title="ClauseGuard API", version="3.0.0", lifespan=lifespan)

- # FIXED: No wildcard CORS
 ALLOWED_ORIGINS = [
     "https://clauseguardweb.netlify.app",
     "http://localhost:3000",
     "http://localhost:3001",
 ]
- # Allow chrome extensions
 app.add_middleware(
     CORSMiddleware,
     allow_origins=ALLOWED_ORIGINS,
@@ -174,36 +175,36 @@ app.add_middleware(
 @app.get("/health")
 async def health():
     model_status = "ml" if _SHARED_MODULES and cuad_model else "regex"
     return {
         "status": "ok",
         "model": model_status,
-         "version": "3.0.0",
         "shared_modules": _SHARED_MODULES,
     }

- @app.post("/api/analyze", response_model=AnalyzeResponse)
 async def analyze(req: AnalyzeRequest, request: Request, user: Optional[dict] = Depends(get_current_user)):
-     # Rate limiting
     client_ip = request.client.host if request.client else "unknown"
     if not _check_rate_limit(client_ip):
-         raise HTTPException(status_code=429, detail="Rate limit exceeded. Try again in 60 seconds.")

-     # FIXED: Accept either text or clauses from extension
     text = req.text
     if not text and req.clauses:
         text = "\n\n".join(req.clauses) if isinstance(req.clauses, list) else str(req.clauses)

     if not text or len(text.strip()) < 50:
         raise HTTPException(status_code=400, detail="Text too short (minimum 50 characters)")
-
-     # Max length check
     if len(text) > MAX_TEXT_LENGTH:
-         raise HTTPException(status_code=400, detail=f"Text too long (maximum {MAX_TEXT_LENGTH} characters)")

     start = time.time()
     clauses = split_clauses(text)
     if not clauses:
-         raise HTTPException(status_code=400, detail="No clauses detected in document")

     clause_results = []
     for clause in clauses:
@@ -224,6 +225,15 @@ async def analyze(req: AnalyzeRequest, request: Request, user: Optional[dict] =
     risk, grade, sev_counts = compute_risk_score(clause_results, len(clauses))
     obligations = extract_obligations(text)
     compliance = check_compliance(text)
     latency = int((time.time() - start) * 1000)

     results_for_db = []
@@ -238,6 +248,29 @@ async def analyze(req: AnalyzeRequest, request: Request, user: Optional[dict] =
         }],
     })

     if user:
         await supabase_insert("analyses", {
             "user_id": user["id"],
@@ -253,46 +286,120 @@ async def analyze(req: AnalyzeRequest, request: Request, user: Optional[dict] =
             "compliance": compliance,
         })

-     return AnalyzeResponse(
-         risk_score=risk,
-         grade=grade,
-         total_clauses=len(clauses),
-         flagged_count=len(set(cr["text"] for cr in clause_results)),
-         results=results_for_db,
-         entities=entities,
-         contradictions=contradictions,
-         obligations=obligations,
-         compliance=compliance,
-         model="ml" if cuad_model else "regex",
-         latency_ms=latency,
-     )

 @app.post("/api/compare")
 async def compare(req: CompareRequest, request: Request):
     client_ip = request.client.host if request.client else "unknown"
     if not _check_rate_limit(client_ip):
         raise HTTPException(status_code=429, detail="Rate limit exceeded.")
-     result = compare_contracts(req.text_a, req.text_b)
-     return result

 @app.post("/api/explain", response_model=ExplainResponse)
 async def explain(req: ExplainRequest, user: dict = Depends(require_auth)):
     desc = DESC_MAP.get(req.category, "Unknown category.")
     legal = "Consult local consumer protection laws."
-     recommendation = "Review this clause carefully. Consider negotiating or seeking legal advice before agreeing."

     if SAULLM_ENDPOINT and HF_API_TOKEN:
         try:
             prompt = (
-                 f"You are a consumer protection legal analyst. Analyze this contract clause "
-                 f"and explain why it may be unfair or risky.\n\n"
-                 f"Clause: \"{req.clause}\"\n"
-                 f"Category: {req.category}\n\n"
-                 f"Provide:\n"
-                 f"1. A plain-English explanation of what this clause means\n"
-                 f"2. The specific legal basis or consumer protection concern\n"
-                 f"3. A practical recommendation\n\n"
-                 f"Be concise. 3-4 sentences per section."
             )
             async with httpx.AsyncClient(timeout=30.0) as client:
                 resp = await client.post(
@@ -311,27 +418,16 @@ async def explain(req: ExplainRequest, user: dict = Depends(require_auth)):
         except Exception:
             pass

-     return ExplainResponse(
-         clause=req.clause,
-         category=req.category,
-         explanation=desc,
-         legal_basis=legal,
-         recommendation=recommendation,
-     )

 @app.get("/api/history")
 async def history(user: dict = Depends(require_auth), limit: int = 20, offset: int = 0):
     limit = min(limit, 100)
-     data = await supabase_query(
-         "analyses",
-         {
-             "user_id": f"eq.{user['id']}",
-             "select": "*",
-             "order": "created_at.desc",
-             "limit": str(limit),
-             "offset": str(offset),
-         },
-     )
     return {"analyses": data, "limit": limit, "offset": offset}

 if __name__ == "__main__":
 
 """
+ ClauseGuard — FastAPI Backend v4.0
 ══════════════════════════════════
+ New in v4.0:
+ • /api/redline: clause redlining suggestions
+ • /api/chat: RAG chatbot (streaming)
+ • /api/ocr: OCR scanned-PDF extraction
+ • Updated analysis to include redlining data
 """

 import os
 import re
 import json
 import time
+ import uuid
+ import tempfile
 from contextlib import asynccontextmanager
 from typing import Optional
 from collections import defaultdict

 import httpx
 import numpy as np
+ from fastapi import FastAPI, HTTPException, Depends, Body, Request, UploadFile, File as FastAPIFile
 from fastapi.middleware.cors import CORSMiddleware
+ from fastapi.responses import StreamingResponse
 from pydantic import BaseModel, Field

 from auth import get_current_user, require_auth

 # ── Import shared modules ──
 import sys
 sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

 try:
     from app import (
         split_clauses, classify_cuad, extract_entities,
+         detect_contradictions, compute_risk_score, analyze_contract,
         CUAD_LABELS, RISK_MAP, DESC_MAP, _model_status,
         cuad_model, cuad_tokenizer
     )
     from obligations import extract_obligations
     from compliance import check_compliance
     from compare import compare_contracts
+     from redlining import generate_redlines
+     from chatbot import index_contract, chat_respond
+     from ocr_engine import parse_pdf_smart, get_ocr_status
     _SHARED_MODULES = True
+ except ImportError as e:
     _SHARED_MODULES = False
+     print(f"[API] WARNING: Could not import shared modules: {e}")

 # ─── Config ───
 SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
 SUPABASE_SERVICE_KEY = os.environ.get("SUPABASE_SERVICE_ROLE_KEY", "")
 HF_API_TOKEN = os.environ.get("HF_API_TOKEN", "")
 SAULLM_ENDPOINT = os.environ.get("SAULLM_ENDPOINT", "")
+ MAX_TEXT_LENGTH = int(os.environ.get("MAX_TEXT_LENGTH", "100000"))

 # ─── Rate Limiting ───
+ _rate_limits = {}
 RATE_LIMIT_REQUESTS = 30
+ RATE_LIMIT_WINDOW = 60

 def _check_rate_limit(client_ip: str) -> bool:
     now = time.time()
 
     except Exception:
         return []

+ # ─── In-memory RAG session store ───
+ _rag_sessions: dict = {}
+ _RAG_SESSION_MAX = 100
+
 # ─── Request/Response Models ───
 class AnalyzeRequest(BaseModel):
     text: Optional[str] = Field(None, min_length=50)
+     clauses: Optional[list] = None
     source_url: Optional[str] = None

 class CompareRequest(BaseModel):
     text_a: str = Field(..., min_length=50)
     text_b: str = Field(..., min_length=50)

     legal_basis: str
     recommendation: str

+ class ChatRequest(BaseModel):
+     message: str = Field(..., min_length=1, max_length=2000)
+     session_id: str
+     history: Optional[list[dict]] = None
+
+ class RedlineRequest(BaseModel):
+     session_id: Optional[str] = None
+     text: Optional[str] = None
+     use_llm: bool = True
+
 # ─── App ───
 @asynccontextmanager
 async def lifespan(app: FastAPI):
     yield

+ app = FastAPI(title="ClauseGuard API", version="4.0.0", lifespan=lifespan)

 ALLOWED_ORIGINS = [
     "https://clauseguardweb.netlify.app",
     "http://localhost:3000",
     "http://localhost:3001",
 ]
 app.add_middleware(
     CORSMiddleware,
     allow_origins=ALLOWED_ORIGINS,

 @app.get("/health")
 async def health():
     model_status = "ml" if _SHARED_MODULES and cuad_model else "regex"
+     ocr_status = get_ocr_status() if _SHARED_MODULES else "unavailable"
     return {
         "status": "ok",
         "model": model_status,
+         "version": "4.0.0",
         "shared_modules": _SHARED_MODULES,
+         "ocr": ocr_status,
+         "features": ["analyze", "compare", "redline", "chat", "ocr"],
     }

+ @app.post("/api/analyze")
 async def analyze(req: AnalyzeRequest, request: Request, user: Optional[dict] = Depends(get_current_user)):
     client_ip = request.client.host if request.client else "unknown"
     if not _check_rate_limit(client_ip):
+         raise HTTPException(status_code=429, detail="Rate limit exceeded.")

     text = req.text
     if not text and req.clauses:
         text = "\n\n".join(req.clauses) if isinstance(req.clauses, list) else str(req.clauses)

     if not text or len(text.strip()) < 50:
         raise HTTPException(status_code=400, detail="Text too short (minimum 50 characters)")
     if len(text) > MAX_TEXT_LENGTH:
+         raise HTTPException(status_code=400, detail=f"Text too long (max {MAX_TEXT_LENGTH} chars)")

     start = time.time()
+
     clauses = split_clauses(text)
     if not clauses:
+         raise HTTPException(status_code=400, detail="No clauses detected")

     clause_results = []
     for clause in clauses:

     risk, grade, sev_counts = compute_risk_score(clause_results, len(clauses))
     obligations = extract_obligations(text)
     compliance = check_compliance(text)
+
+     # v4.0: Redlining
+     analysis_for_redline = {"clauses": clause_results}
+     redlines = []
+     try:
+         redlines = generate_redlines(analysis_for_redline, use_llm=True)
+     except Exception as e:
+         print(f"[API] Redlining error: {e}")
+
     latency = int((time.time() - start) * 1000)

     results_for_db = []

         }],
     })

+     # v4.0: RAG indexing
+     session_id = None
+     try:
+         chunks, embeddings, _status = index_contract(text)
+         if chunks and embeddings is not None:
+             session_id = uuid.uuid4().hex[:12]
+             if len(_rag_sessions) >= _RAG_SESSION_MAX:
+                 oldest = next(iter(_rag_sessions))
+                 del _rag_sessions[oldest]
+             _rag_sessions[session_id] = {
+                 "chunks": chunks,
+                 "embeddings": embeddings,
+                 "analysis": {
+                     "risk": {"score": risk, "grade": grade, "breakdown": sev_counts},
+                     "metadata": {"total_clauses": len(clauses), "flagged_clauses": len(clause_results)},
+                     "clauses": clause_results[:30],
+                     "entities": entities[:30],
+                     "contradictions": contradictions,
+                 },
+             }
+     except Exception as e:
+         print(f"[API] RAG indexing error: {e}")
+
     if user:
         await supabase_insert("analyses", {
             "user_id": user["id"],

             "compliance": compliance,
         })

+     return {
+         "risk_score": risk,
+         "grade": grade,
+         "total_clauses": len(clauses),
+         "flagged_count": len(set(cr["text"] for cr in clause_results)),
+         "results": results_for_db,
+         "entities": entities,
+         "contradictions": contradictions,
+         "obligations": obligations,
+         "compliance": compliance,
+         "redlines": redlines,
+         "model": "ml" if cuad_model else "regex",
+         "latency_ms": latency,
+         "session_id": session_id,
+     }

 @app.post("/api/compare")
 async def compare(req: CompareRequest, request: Request):
     client_ip = request.client.host if request.client else "unknown"
     if not _check_rate_limit(client_ip):
         raise HTTPException(status_code=429, detail="Rate limit exceeded.")
+     return compare_contracts(req.text_a, req.text_b)
+
+ @app.post("/api/redline")
+ async def redline(req: RedlineRequest, request: Request):
+     client_ip = request.client.host if request.client else "unknown"
+     if not _check_rate_limit(client_ip):
+         raise HTTPException(status_code=429, detail="Rate limit exceeded.")
+
+     if req.session_id and req.session_id in _rag_sessions:
+         analysis = _rag_sessions[req.session_id]["analysis"]
+     elif req.text:
+         result, error = analyze_contract(req.text)
+         if error:
+             raise HTTPException(status_code=400, detail=error)
+         analysis = result
+     else:
+         raise HTTPException(status_code=400, detail="Provide session_id or text")
+
+     redlines = generate_redlines(analysis, use_llm=req.use_llm)
+     return {"redlines": redlines, "count": len(redlines)}
+
+ @app.post("/api/chat")
+ async def chat(req: ChatRequest, request: Request):
+     client_ip = request.client.host if request.client else "unknown"
+     if not _check_rate_limit(client_ip):
+         raise HTTPException(status_code=429, detail="Rate limit exceeded.")
+
+     if req.session_id not in _rag_sessions:
+         raise HTTPException(status_code=404, detail="Session not found. Analyze a contract first.")
+
+     session = _rag_sessions[req.session_id]
+     response_text = ""
+     for partial in chat_respond(req.message, req.history or [],
+                                 session["chunks"], session["embeddings"], session["analysis"]):
+         response_text = partial
+
+     return {"response": response_text, "session_id": req.session_id}
+
+ @app.post("/api/chat/stream")
+ async def chat_stream(req: ChatRequest, request: Request):
+     client_ip = request.client.host if request.client else "unknown"
+     if not _check_rate_limit(client_ip):
+         raise HTTPException(status_code=429, detail="Rate limit exceeded.")
+
+     if req.session_id not in _rag_sessions:
+         raise HTTPException(status_code=404, detail="Session not found.")
+
+     session = _rag_sessions[req.session_id]
+
+     async def generate():
+         last = ""
+         for partial in chat_respond(
+             req.message, req.history or [],
+             session["chunks"], session["embeddings"], session["analysis"]
+         ):
+             delta = partial[len(last):]
+             last = partial
+             if delta:
+                 yield f"data: {json.dumps({'delta': delta})}\n\n"
+         yield "data: [DONE]\n\n"
+
+     return StreamingResponse(generate(), media_type="text/event-stream")
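On the client side, the `data: {"delta": ...}` frames emitted by `generate()` above would be reassembled into the full answer. A minimal sketch of that reassembly, assuming exactly the frame shapes shown (JSON `delta` payloads terminated by a `[DONE]` sentinel):

```python
import json

def reassemble_sse(frames: list[str]) -> str:
    # Concatenate the `delta` fields of SSE data frames until [DONE].
    text = ""
    for frame in frames:
        if not frame.startswith("data: "):
            continue  # ignore comments/blank keep-alive lines
        payload = frame[len("data: "):].strip()
        if payload == "[DONE]":
            break
        text += json.loads(payload).get("delta", "")
    return text
```

Sending only deltas (rather than re-sending the growing prefix on every event) keeps each frame small, which matters for long chatbot answers.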
+
+ @app.post("/api/ocr")
+ async def ocr_endpoint(file: UploadFile = FastAPIFile(...)):
+     if not file.filename or not file.filename.lower().endswith(".pdf"):
+         raise HTTPException(status_code=400, detail="Only PDF files supported")
+
+     with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
+         content = await file.read()
+         tmp.write(content)
+         tmp_path = tmp.name
+
+     try:
+         text, error, method = parse_pdf_smart(tmp_path)
+         if error:
+             raise HTTPException(status_code=400, detail=error)
+         return {"text": text, "method": method, "chars": len(text) if text else 0, "filename": file.filename}
+     finally:
+         os.unlink(tmp_path)

 @app.post("/api/explain", response_model=ExplainResponse)
 async def explain(req: ExplainRequest, user: dict = Depends(require_auth)):
     desc = DESC_MAP.get(req.category, "Unknown category.")
     legal = "Consult local consumer protection laws."
+     recommendation = "Review this clause carefully."

     if SAULLM_ENDPOINT and HF_API_TOKEN:
         try:
             prompt = (
+                 f"Analyze this contract clause and explain why it may be risky.\n\n"
+                 f"Clause: \"{req.clause}\"\nCategory: {req.category}\n\n"
+                 f"Provide: 1) Plain-English explanation 2) Legal basis 3) Recommendation"
             )
             async with httpx.AsyncClient(timeout=30.0) as client:
                 resp = await client.post(
         except Exception:
             pass

+     return ExplainResponse(clause=req.clause, category=req.category,
+                            explanation=desc, legal_basis=legal, recommendation=recommendation)

 @app.get("/api/history")
 async def history(user: dict = Depends(require_auth), limit: int = 20, offset: int = 0):
     limit = min(limit, 100)
+     data = await supabase_query("analyses", {
+         "user_id": f"eq.{user['id']}", "select": "*",
+         "order": "created_at.desc", "limit": str(limit), "offset": str(offset),
+     })
     return {"analyses": data, "limit": limit, "offset": offset}

 if __name__ == "__main__":
api/requirements.txt CHANGED
@@ -1,10 +1,13 @@
- fastapi>=0.136.0
- uvicorn[standard]>=0.46.0
- pydantic>=2.13.3
- transformers>=5.6.1
+ fastapi>=0.115.0
+ uvicorn[standard]>=0.34.0
+ pydantic>=2.10.0
+ transformers>=4.45.0
  numpy>=2.0.0
  python-jose[cryptography]>=3.3.0
  httpx>=0.28.0
  peft>=0.15.0
  torch>=2.5.0
  sentence-transformers>=3.0.0
+ python-doctr[torch]>=0.9.0
+ huggingface_hub>=0.25.0
+ python-multipart>=0.0.7
app.py CHANGED
@@ -1,7 +1,12 @@
  """
- ClauseGuard — World's Best Legal Contract Analysis Tool (v3.0)
  ═══════════════════════════════════════════════════════════════
- Fixes in v3.0:
  • Fixed CUAD label mapping (added missing index 6: "Notice Period to Terminate Renewal")
  • Switched from softmax → sigmoid for proper multi-label classification
  • Per-class optimized thresholds instead of flat 0.15
@@ -21,6 +26,9 @@ Models:
  (LoRA adapter on nlpaueb/legal-bert-base-uncased, 41 CUAD classes)
  • Legal NER: matterstack/legal-bert-ner (token classification)
  • NLI: cross-encoder/nli-deberta-v3-base (contradiction detection)
  """

  import os
@@ -71,6 +79,9 @@ except Exception:
  from compare import compare_contracts, render_comparison_html
  from obligations import extract_obligations, render_obligations_html
  from compliance import check_compliance, render_compliance_html

  # ═══════════════════════════════════════════════════════════════════════
  # 1. CONFIGURATION — FIXED label mapping (41 labels, index 6 restored)
@@ -335,20 +346,15 @@ _load_nli_model()
  # ═══════════════════════════════════════════════════════════════════════

  def parse_pdf(file_path):
-     if not _HAS_PDF:
-         return None, "PDF parsing not available (pdfplumber not installed)"
-     try:
-         text = ""
-         with pdfplumber.open(file_path) as pdf:
-             for page in pdf.pages:
-                 page_text = page.extract_text()
-                 if page_text:
-                     text += page_text + "\n\n"
-         if not text.strip():
-             return None, "PDF appears to be scanned/image-based. OCR is not yet supported. Please use a digital PDF or paste text directly."
-         return text.strip(), None
-     except Exception as e:
-         return None, f"PDF parse error: {e}"

  def parse_docx(file_path):
      if not _HAS_DOCX:
@@ -378,11 +384,22 @@ def parse_document(file_path):
      return None, f"Unsupported file type: {ext}"

  # ═══════════════════════════════════════════════════════════════════════
- # 4. STRUCTURE-AWARE CLAUSE SPLITTING
  # ═══════════════════════════════════════════════════════════════════════

  def split_clauses(text):
-     """Structure-aware clause splitting that respects section numbering."""
      text = re.sub(r'\n{3,}', '\n\n', text.strip())

      # First try to detect numbered sections (1., 2., 3.1, (a), etc.)
@@ -426,9 +443,13 @@
          preamble = text[:positions[0]].strip()
          if len(preamble) > 30:
              clauses.insert(0, preamble)
-         return clauses if clauses else _fallback_split(text)
      else:
-         return _fallback_split(text)

  def _fallback_split(text):
      """Fallback: split on paragraph breaks and sentence boundaries."""
@@ -462,8 +483,40 @@

  # ═══════════════════════════════════════════════════════════════════════
  # 5. CLAUSE DETECTION — FIXED: sigmoid + per-class thresholds + caching
  # ═══════════════════════════════════════════════════════════════════════

  def _text_hash(text):
      return hashlib.md5(text.encode()).hexdigest()

@@ -474,14 +527,17 @@ def classify_cuad(clause_text):
      if cuad_model is None or cuad_tokenizer is None:
          return _classify_regex(clause_text)

      # Check cache
-     h = _text_hash(clause_text[:512])
      if h in _prediction_cache:
          return _prediction_cache[h]

      try:
          inputs = cuad_tokenizer(
-             clause_text,
              return_tensors="pt",
              truncation=True,
              max_length=256,
@@ -498,10 +554,15 @@
              threshold = _CUAD_THRESHOLDS.get(i, 0.40)
              if float(prob) > threshold and i < len(CUAD_LABELS):
                  label = CUAD_LABELS[i]
                  risk = RISK_MAP.get(label, "LOW")
                  results.append({
                      "label": label,
-                     "confidence": round(float(prob), 3),
                      "risk": risk,
                      "description": DESC_MAP.get(label, label),
                      "source": "ml",
@@ -773,19 +834,33 @@ def detect_contradictions(clause_results, raw_text=""):
          "source": "heuristic",
      })

-     # ── 2. Missing critical clauses ──
-     critical_clauses = {
-         "Governing Law": "No governing law clause detected — jurisdiction ambiguity may cause disputes.",
-         "Termination for Convenience": "No termination clause detected — exit terms are unclear.",
-         "Limitation of liability": "No liability limitation detected — exposure may be unlimited.",
      }
-     for cc, explanation in critical_clauses.items():
-         if cc not in labels_found:
              contradictions.append({
                  "type": "MISSING",
-                 "explanation": explanation,
                  "severity": "MEDIUM",
-                 "clauses": [cc],
                  "source": "structural",
              })

@@ -847,13 +922,21 @@ def analyze_contract(text)
      contradictions = detect_contradictions(clause_results, text)
      risk, grade, sev_counts = compute_risk_score(clause_results, len(clauses))
      obligations = extract_obligations(text)
      compliance = check_compliance(text)

      result = {
          "metadata": {
              "analysis_date": datetime.now().isoformat(),
              "total_clauses": len(clauses),
-             "flagged_clauses": len(set(cr["text"] for cr in clause_results)),
              "model": get_model_status_text(),
          },
          "risk": {
              "score": risk,
@@ -1119,11 +1202,11 @@ def process_upload(file)
  def run_analysis(text):
      if not text or len(text.strip()) < 50:
          err_html = '<p style="color:#dc2626;padding:16px;">Document too short (minimum 50 characters)</p>'
-         return [err_html] * 7 + [None, None, ""]
      result, error = analyze_contract(text)
      if error:
          err_html = f'<p style="color:#dc2626;padding:16px;">{error}</p>'
-         return [err_html] * 7 + [None, None, error]

      # FIXED: per-session temp files
      session_id = uuid.uuid4().hex[:8]
@@ -1136,6 +1219,10 @@
      with open(csv_path, "w") as f:
          f.write(csv_content)

      return [
          render_summary(result),
          render_clause_cards(result),
@@ -1144,13 +1231,15 @@
          render_document_viewer(result),
          render_obligations_html(result.get("obligations", [])),
          render_compliance_html(result.get("compliance", {})),
          json_path,
          csv_path,
          "Analysis complete",
      ]

  def do_clear():
-     return [""] * 7 + [None, None, ""]

  # ── Example contracts ──
  SPOTIFY_TOS = """By using the Spotify Service, you agree to be bound by these Terms of Use.
@@ -1234,17 +1323,22 @@ with gr.Blocks(
  """
  ) as demo:

      gr.HTML("""
      <div style="display:flex;align-items:center;justify-content:space-between;padding:12px 0;border-bottom:2px solid #e5e7eb;margin-bottom:16px;">
      <div>
      <h1 style="font-size:24px;font-weight:700;margin:0;color:#1f2937;">🛡️ ClauseGuard</h1>
-     <p style="font-size:13px;color:#6b7280;margin:4px 0 0 0;">AI-Powered Legal Contract Analysis · 41 Clause Categories · Risk Scoring · ML NER · NLI Contradictions · Compliance · Obligations</p>
      </div>
-     <div style="font-size:12px;color:#9ca3af;">v3.0 · Precision Legal AI</div>
      </div>
      """)

-     # ── Main Tabs: Analysis vs Comparison ──
      with gr.Tabs():

          # ═══════ TAB 1: Single Contract Analysis ═══════
@@ -1261,7 +1355,7 @@
          with gr.Column(scale=3):
              text_input = gr.Textbox(
                  label="📄 Contract Text",
-                 placeholder="Paste contract text here, or upload a file above...",
                  lines=14,
                  max_lines=40,
                  show_copy_button=True,
@@ -1304,6 +1398,8 @@
              obligations_html = gr.HTML(label="Obligation Tracker")
          with gr.Tab("⚖️ Compliance"):
              compliance_html = gr.HTML(label="Compliance Checker")

          # ═══════ TAB 2: Contract Comparison ═══════
          with gr.Tab("🔀 Compare Contracts"):
@@ -1352,6 +1448,53 @@
              with gr.Column(scale=2):
                  comp_json = gr.JSON(label="Raw Comparison Data")
      # ── Events ──
      def _load_file(file):
          text, err = parse_document(file) if file else ("", "No file")
@@ -1359,23 +1502,41 @@ with gr.Blocks(
              return "", err
          return text, "Loaded successfully" if not err else err

      load_btn.click(_load_file, inputs=[file_input], outputs=[text_input, load_status])
      comp_load_a.click(_load_file, inputs=[comp_file_a], outputs=[comp_text_a, comp_status_a])
      comp_load_b.click(_load_file, inputs=[comp_file_b], outputs=[comp_text_b, comp_status_b])

      scan_btn.click(
-         run_analysis,
          inputs=[text_input],
-         outputs=[summary_html, clauses_html, entities_html, nli_html,
-                  doc_html, obligations_html, compliance_html,
-                  json_file, csv_file, status_msg]
      )

      clear_btn.click(
-         do_clear,
-         outputs=[summary_html, clauses_html, entities_html, nli_html,
-                  doc_html, obligations_html, compliance_html,
-                  json_file, csv_file, status_msg]
      )

      comp_btn.click(
@@ -1391,6 +1552,8 @@
  · Model: <a href="https://huggingface.co/Mokshith31/legalbert-contract-clause-classification" style="color:#6b7280;">Legal-BERT + CUAD (41 classes)</a>
  · NER: <a href="https://huggingface.co/matterstack/legal-bert-ner" style="color:#6b7280;">Legal-BERT NER</a>
  · NLI: <a href="https://huggingface.co/cross-encoder/nli-deberta-v3-base" style="color:#6b7280;">DeBERTa-v3 NLI</a>
  · Dataset: <a href="https://huggingface.co/datasets/theatticusproject/cuad-qa" style="color:#6b7280;">CUAD</a>
  · <a href="https://huggingface.co/spaces/gaurv007/ClauseGuard" style="color:#6b7280;">ClauseGuard Space</a>
  </p>
 
  """
+ ClauseGuard — World's Best Legal Contract Analysis Tool (v4.0)
  ═══════════════════════════════════════════════════════════════
+ New in v4.0:
+ • OCR support for scanned PDFs (docTR engine with smart native/scanned routing)
+ • Contract Q&A Chatbot (RAG: embedding retrieval + HF Inference API streaming)
+ • Clause Redlining (3-tier: template lookup + RAG + LLM refinement)
+
+ Carried from v3.0:
  • Fixed CUAD label mapping (added missing index 6: "Notice Period to Terminate Renewal")
  • Switched from softmax → sigmoid for proper multi-label classification
  • Per-class optimized thresholds instead of flat 0.15

  (LoRA adapter on nlpaueb/legal-bert-base-uncased, 41 CUAD classes)
  • Legal NER: matterstack/legal-bert-ner (token classification)
  • NLI: cross-encoder/nli-deberta-v3-base (contradiction detection)
+ • Embeddings: sentence-transformers/all-MiniLM-L6-v2 (RAG retrieval)
+ • OCR: docTR fast_base + crnn_vgg16_bn (scanned PDF extraction)
+ • LLM: Qwen/Qwen2.5-7B-Instruct via HF Inference API (chatbot + redlining)
  """

  import os

  from compare import compare_contracts, render_comparison_html
  from obligations import extract_obligations, render_obligations_html
  from compliance import check_compliance, render_compliance_html
+ from ocr_engine import parse_pdf_smart, get_ocr_status
+ from chatbot import index_contract, chat_respond, get_chatbot_status
+ from redlining import generate_redlines, render_redlines_html

  # ═══════════════════════════════════════════════════════════════════════
  # 1. CONFIGURATION — FIXED label mapping (41 labels, index 6 restored)
 
  # ═══════════════════════════════════════════════════════════════════════

  def parse_pdf(file_path):
+     """Smart PDF parser: native text extraction with OCR fallback for scanned PDFs."""
+     text, error, method = parse_pdf_smart(file_path)
+     if text:
+         if method == "ocr":
+             print(f"[ClauseGuard] PDF extracted via OCR ({len(text)} chars)")
+         return text, None
+     if error:
+         return None, error
+     return None, "Could not extract text from PDF. Try uploading a clearer scan or digital PDF."

  def parse_docx(file_path):
      if not _HAS_DOCX:
 
      return None, f"Unsupported file type: {ext}"

  # ═══════════════════════════════════════════════════════════════════════
+ # 4. DETERMINISTIC CLAUSE SPLITTING (Fix 1 from bug report)
  # ═══════════════════════════════════════════════════════════════════════

+ # Document-level chunk cache: same text always produces same chunks
+ _chunk_cache = {}
+
  def split_clauses(text):
+     """Deterministic, structure-aware clause splitting.
+     Fix 1: Same input ALWAYS produces same output. Normalized text is hashed
+     and cached so repeated runs on identical documents are identical."""
+     # Normalize whitespace before hashing for determinism
+     normalized = re.sub(r'\s+', ' ', text.strip())
+     text_hash = hashlib.sha256(normalized.encode()).hexdigest()
+     if text_hash in _chunk_cache:
+         return _chunk_cache[text_hash]
+
      text = re.sub(r'\n{3,}', '\n\n', text.strip())

      # First try to detect numbered sections (1., 2., 3.1, (a), etc.)

          preamble = text[:positions[0]].strip()
          if len(preamble) > 30:
              clauses.insert(0, preamble)
+         result = clauses if clauses else _fallback_split(text)
+         _chunk_cache[text_hash] = result
+         return result
      else:
+         result = _fallback_split(text)
+         _chunk_cache[text_hash] = result
+         return result

  def _fallback_split(text):
      """Fallback: split on paragraph breaks and sentence boundaries."""
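The determinism fix keys the chunk cache on a whitespace-normalized SHA-256 of the input, so two copies of the same contract that differ only in line wrapping hit the same cache entry. A standalone sketch of that hashing step (a hypothetical helper mirroring the normalization used in `split_clauses`):

```python
import hashlib
import re

def doc_key(text: str) -> str:
    """Whitespace-normalized SHA-256 key, as used for the document-level chunk cache."""
    # Collapse all runs of whitespace so reflowed text maps to the same key
    normalized = re.sub(r'\s+', ' ', text.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()
```

Because the key ignores layout, re-uploading a reformatted PDF extraction of the same document returns the cached (identical) clause list.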
 
  # ═══════════════════════════════════════════════════════════════════════
  # 5. CLAUSE DETECTION — FIXED: sigmoid + per-class thresholds + caching
+ # Fix 3: Strip section headings before classification
+ # Fix 6: Label guardrails for high-confidence false positives
  # ═══════════════════════════════════════════════════════════════════════

+ # Fix 3: Section heading pattern — strip before classifying
+ _HEADING_RE = re.compile(r'^\d+(?:\.\d+)*\s+[A-Z][A-Z\s&,/]+$', re.MULTILINE)
+
+ def _strip_heading(text):
+     """Remove leading section headings that confuse the classifier."""
+     lines = text.split('\n')
+     if lines and _HEADING_RE.match(lines[0].strip()):
+         stripped = '\n'.join(lines[1:]).strip()
+         return stripped if len(stripped) > 20 else text
+     return text
+
+ # Fix 6: Label guardrails — keyword validation for high-confidence labels
+ _LABEL_GUARDRAILS = {
+     "Liquidated Damages": re.compile(
+         r'liquidated|pre-?determined.{0,10}damage|agreed.{0,10}sum|penalty clause|stipulated.{0,10}damage',
+         re.IGNORECASE
+     ),
+     "Uncapped Liability": re.compile(
+         r'uncapped|unlimited.{0,10}liabilit|no.{0,10}(limit|cap).{0,10}liabilit',
+         re.IGNORECASE
+     ),
+ }
+
+ def _apply_guardrails(label, text, confidence):
+     """Fix 6: If label has a guardrail and text lacks required keywords, demote."""
+     guard = _LABEL_GUARDRAILS.get(label)
+     if guard and not guard.search(text):
+         return "Other", confidence * 0.3  # demote to Other with reduced confidence
+     return label, confidence
+
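The guardrail idea can be exercised in isolation: a high-confidence label only survives if the clause text contains a corroborating keyword. A self-contained sketch, simplified to a single pattern (the real table is `_LABEL_GUARDRAILS` above):

```python
import re

# One-entry stand-in for the guardrail table
GUARD = {"Liquidated Damages": re.compile(r'liquidated|stipulated.{0,10}damage', re.IGNORECASE)}

def apply_guardrail(label, text, confidence):
    """Demote a predicted label to 'Other' when its required keywords are absent."""
    guard = GUARD.get(label)
    if guard and not guard.search(text):
        return "Other", confidence * 0.3
    return label, confidence
```

The 0.3 multiplier pushes demoted predictions below the downstream filter, so a classifier that fires on superficially similar wording never surfaces the label to the user.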
  def _text_hash(text):
      return hashlib.md5(text.encode()).hexdigest()

      if cuad_model is None or cuad_tokenizer is None:
          return _classify_regex(clause_text)

+     # Fix 3: Strip section headings before classification
+     clean_text = _strip_heading(clause_text)
+
      # Check cache
+     h = _text_hash(clean_text[:512])
      if h in _prediction_cache:
          return _prediction_cache[h]

      try:
          inputs = cuad_tokenizer(
+             clean_text,
              return_tensors="pt",
              truncation=True,
              max_length=256,

              threshold = _CUAD_THRESHOLDS.get(i, 0.40)
              if float(prob) > threshold and i < len(CUAD_LABELS):
                  label = CUAD_LABELS[i]
+                 conf = float(prob)
+                 # Fix 6: Apply guardrails — reject high-confidence false positives
+                 label, conf = _apply_guardrails(label, clause_text, conf)
+                 if label == "Other" and conf < 0.3:
+                     continue  # Skip demoted labels
                  risk = RISK_MAP.get(label, "LOW")
                  results.append({
                      "label": label,
+                     "confidence": round(conf, 3),
                      "risk": risk,
                      "description": DESC_MAP.get(label, label),
                      "source": "ml",
 
          "source": "heuristic",
      })

+     # ── 2. Missing critical clauses (Fix 4: check raw_text, not labels) ──
+     _REQUIRED_CLAUSE_PATTERNS = {
+         "Governing Law": re.compile(
+             r'govern(?:ed|ing).{0,15}law|applicable.{0,10}law|laws?\s+of\s+the\s+state',
+             re.IGNORECASE
+         ),
+         "Limitation of liability": re.compile(
+             r'limitation.{0,10}liabilit|cap.{0,10}liabilit|liabilit.{0,10}shall\s+not\s+exceed|in\s+no\s+event.{0,20}liable',
+             re.IGNORECASE
+         ),
+         "Arbitration": re.compile(
+             r'arbitrat|AAA|JAMS|binding.{0,10}dispute',
+             re.IGNORECASE
+         ),
+         "Termination": re.compile(
+             r'terminat(?:e|ion|ed)|cancel(?:lation)?',
+             re.IGNORECASE
+         ),
      }
+     for clause_name, pattern in _REQUIRED_CLAUSE_PATTERNS.items():
+         # Check raw_text directly — it's stable and deterministic
+         if not pattern.search(raw_text):
              contradictions.append({
                  "type": "MISSING",
+                 "explanation": f"No '{clause_name}' clause detected in the document.",
                  "severity": "MEDIUM",
+                 "clauses": [clause_name],
                  "source": "structural",
              })
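Scanning the raw text for required-clause patterns, as above, is order-independent and does not depend on what the classifier happened to flag. A compact sketch of the same scan with abbreviated patterns (the full set lives in `_REQUIRED_CLAUSE_PATTERNS`):

```python
import re

# Abbreviated required-clause patterns for illustration
REQUIRED = {
    "Governing Law": re.compile(r'govern(?:ed|ing).{0,15}law', re.IGNORECASE),
    "Termination": re.compile(r'terminat(?:e|ion|ed)', re.IGNORECASE),
}

def missing_clauses(raw_text):
    """Return the names of required clauses whose patterns never match the document."""
    return [name for name, pat in REQUIRED.items() if not pat.search(raw_text)]
```

Because the check runs on the full document rather than per-chunk labels, a clause split awkwardly across two chunks still counts as present.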
 
 
      contradictions = detect_contradictions(clause_results, text)
      risk, grade, sev_counts = compute_risk_score(clause_results, len(clauses))
      obligations = extract_obligations(text)
+     # Fix 5: Compliance runs against full raw_text (already done in compliance.py)
      compliance = check_compliance(text)
+
+     # Fix 2: Compute flagged_clauses AFTER all processing is complete
+     flagged_clause_count = len(clause_results)
+     unique_flagged_texts = len(set(cr["text"] for cr in clause_results))
+
      result = {
          "metadata": {
              "analysis_date": datetime.now().isoformat(),
              "total_clauses": len(clauses),
+             "flagged_clauses": flagged_clause_count,
+             "unique_flagged": unique_flagged_texts,
              "model": get_model_status_text(),
+             "text_hash": hashlib.sha256(re.sub(r'\s+', ' ', text.strip()).encode()).hexdigest()[:16],
          },
          "risk": {
              "score": risk,
 
  def run_analysis(text):
      if not text or len(text.strip()) < 50:
          err_html = '<p style="color:#dc2626;padding:16px;">Document too short (minimum 50 characters)</p>'
+         return [err_html] * 8 + [None, None, "", None]
      result, error = analyze_contract(text)
      if error:
          err_html = f'<p style="color:#dc2626;padding:16px;">{error}</p>'
+         return [err_html] * 8 + [None, None, error, None]

      # FIXED: per-session temp files
      session_id = uuid.uuid4().hex[:8]

      with open(csv_path, "w") as f:
          f.write(csv_content)

+     # Generate redline suggestions (Tier 1 template + Tier 3 LLM for critical/high)
+     redlines = generate_redlines(result, use_llm=True)
+     redlines_html = render_redlines_html(redlines)
+
      return [
          render_summary(result),
          render_clause_cards(result),

          render_document_viewer(result),
          render_obligations_html(result.get("obligations", [])),
          render_compliance_html(result.get("compliance", {})),
+         redlines_html,
          json_path,
          csv_path,
          "Analysis complete",
+         result,  # Store analysis result for chatbot
      ]

  def do_clear():
+     return [""] * 8 + [None, None, "", None]

  # ── Example contracts ──
  SPOTIFY_TOS = """By using the Spotify Service, you agree to be bound by these Terms of Use.
 
  """
  ) as demo:

+     # ── Shared State (for chatbot RAG) ──────────────────────────────
+     analysis_state = gr.State(None)      # Full analysis result dict
+     chunks_state = gr.State([])          # Contract text chunks for RAG
+     embeddings_state = gr.State(None)    # Chunk embeddings (numpy array)
+
      gr.HTML("""
      <div style="display:flex;align-items:center;justify-content:space-between;padding:12px 0;border-bottom:2px solid #e5e7eb;margin-bottom:16px;">
      <div>
      <h1 style="font-size:24px;font-weight:700;margin:0;color:#1f2937;">🛡️ ClauseGuard</h1>
+     <p style="font-size:13px;color:#6b7280;margin:4px 0 0 0;">AI-Powered Legal Contract Analysis · 41 Clause Categories · Risk Scoring · ML NER · NLI Contradictions · Compliance · Obligations · <strong>Q&A Chatbot</strong> · <strong>Clause Redlining</strong> · <strong>OCR</strong></p>
      </div>
+     <div style="font-size:12px;color:#9ca3af;">v4.0 · Precision Legal AI</div>
      </div>
      """)

+     # ── Main Tabs: Analysis vs Comparison vs Chatbot ──
      with gr.Tabs():

          # ═══════ TAB 1: Single Contract Analysis ═══════
 
          with gr.Column(scale=3):
              text_input = gr.Textbox(
                  label="📄 Contract Text",
+                 placeholder="Paste contract text here, or upload a file above...\n\n💡 Scanned PDFs are automatically processed with OCR.",
                  lines=14,
                  max_lines=40,
                  show_copy_button=True,

              obligations_html = gr.HTML(label="Obligation Tracker")
          with gr.Tab("⚖️ Compliance"):
              compliance_html = gr.HTML(label="Compliance Checker")
+         with gr.Tab("✏️ Redlining"):
+             redlining_html = gr.HTML(label="Clause Redlining Suggestions")

          # ═══════ TAB 2: Contract Comparison ═══════
          with gr.Tab("🔀 Compare Contracts"):

              with gr.Column(scale=2):
                  comp_json = gr.JSON(label="Raw Comparison Data")
 
+         # ═══════ TAB 3: Contract Q&A Chatbot ═══════
+         with gr.Tab("💬 Contract Q&A"):
+             gr.HTML("""
+             <div style="padding:12px 16px;background:linear-gradient(135deg,#eff6ff,#faf5ff);border-radius:10px;margin-bottom:12px;border:1px solid #e5e7eb;">
+             <div style="display:flex;align-items:center;gap:8px;margin-bottom:6px;">
+             <span style="font-size:20px;">💬</span>
+             <h3 style="margin:0;font-size:16px;color:#1f2937;">Contract Q&A Chatbot</h3>
+             </div>
+             <p style="font-size:12px;color:#6b7280;margin:0;line-height:1.5;">
+             Ask questions about your analyzed contract. The chatbot uses <strong>RAG</strong> (Retrieval-Augmented Generation)
+             to find relevant clauses and generate accurate answers grounded in your contract text.
+             <br>
+             <strong>Step 1:</strong> Analyze a contract in the "📄 Single Contract Analysis" tab.
+             <strong>Step 2:</strong> Come here and ask questions!
+             </p>
+             </div>
+             """)
+
+             chatbot_index_status = gr.Textbox(
+                 label="📡 Chatbot Index Status",
+                 interactive=False,
+                 lines=1,
+                 value="⏳ No contract indexed yet — analyze a contract first",
+             )
+
+             def _chatbot_fn(message, history, chunks, embeddings, analysis):
+                 """Wrapper for ChatInterface fn signature."""
+                 yield from chat_respond(message, history, chunks, embeddings, analysis)
+
+             gr.ChatInterface(
+                 fn=_chatbot_fn,
+                 type="messages",
+                 additional_inputs=[chunks_state, embeddings_state, analysis_state],
+                 examples=[
+                     ["What are the main risks in this contract?"],
+                     ["Who are the parties involved?"],
+                     ["What happens if the contract is terminated?"],
+                     ["Are there any liability limitations?"],
+                     ["What are my obligations under this contract?"],
+                     ["Is there an arbitration clause?"],
+                     ["What is the governing law?"],
+                     ["Summarize the key terms in plain language."],
+                 ],
+                 title="",
+                 description="",
+             )
+
      # ── Events ──
      def _load_file(file):
          text, err = parse_document(file) if file else ("", "No file")
              return "", err
          return text, "Loaded successfully" if not err else err

+     def _analysis_and_index(text):
+         """Run analysis AND index for chatbot in one call."""
+         # Run the standard analysis
+         analysis_outputs = run_analysis(text)
+
+         # Index for chatbot (uses the raw text)
+         chunks, embeddings, index_status = index_contract(text)
+
+         # analysis_outputs has 12 items: 8 HTML + json_path + csv_path + status + result
+         # We need to add: chunks_state, embeddings_state, chatbot_index_status
+         return analysis_outputs + [chunks, embeddings, index_status]
+
      load_btn.click(_load_file, inputs=[file_input], outputs=[text_input, load_status])
      comp_load_a.click(_load_file, inputs=[comp_file_a], outputs=[comp_text_a, comp_status_a])
      comp_load_b.click(_load_file, inputs=[comp_file_b], outputs=[comp_text_b, comp_status_b])

      scan_btn.click(
+         _analysis_and_index,
          inputs=[text_input],
+         outputs=[
+             summary_html, clauses_html, entities_html, nli_html,
+             doc_html, obligations_html, compliance_html, redlining_html,
+             json_file, csv_file, status_msg, analysis_state,
+             chunks_state, embeddings_state, chatbot_index_status,
+         ]
      )

      clear_btn.click(
+         lambda: [""] * 8 + [None, None, "", None, [], None, "⏳ No contract indexed"],
+         outputs=[
+             summary_html, clauses_html, entities_html, nli_html,
+             doc_html, obligations_html, compliance_html, redlining_html,
+             json_file, csv_file, status_msg, analysis_state,
+             chunks_state, embeddings_state, chatbot_index_status,
+         ]
      )

      comp_btn.click(
 
  · Model: <a href="https://huggingface.co/Mokshith31/legalbert-contract-clause-classification" style="color:#6b7280;">Legal-BERT + CUAD (41 classes)</a>
  · NER: <a href="https://huggingface.co/matterstack/legal-bert-ner" style="color:#6b7280;">Legal-BERT NER</a>
  · NLI: <a href="https://huggingface.co/cross-encoder/nli-deberta-v3-base" style="color:#6b7280;">DeBERTa-v3 NLI</a>
+ · LLM: <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct" style="color:#6b7280;">Qwen2.5-7B</a>
+ · OCR: <a href="https://github.com/mindee/doctr" style="color:#6b7280;">docTR</a>
  · Dataset: <a href="https://huggingface.co/datasets/theatticusproject/cuad-qa" style="color:#6b7280;">CUAD</a>
  · <a href="https://huggingface.co/spaces/gaurv007/ClauseGuard" style="color:#6b7280;">ClauseGuard Space</a>
  </p>
chatbot.py ADDED
@@ -0,0 +1,406 @@
+ """
+ ClauseGuard — Contract Q&A Chatbot (RAG) v1.0
+ ═══════════════════════════════════════════════
+ Architecture:
+     User asks question about their contract
+     [1] Embed question with sentence-transformers (all-MiniLM-L6-v2)
+     [2] Retrieve top-5 most relevant chunks from contract
+     [3] Build prompt:
+         - System: ClauseGuard analysis results (clauses, entities, risk scores)
+         - Context: Retrieved contract chunks (≤2.5K tokens)
+         - User question
+     [4] Stream response from LLM via HF Inference API
+
+ Key design:
+ • Analyzed data (clauses, entities, risk scores) → system prompt
+ • Raw contract text → RAG retrieval
+ • This gives the model both structured analysis AND verbatim evidence
+ """
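Step [2] of the architecture above is a plain cosine-similarity top-k search over the precomputed chunk embeddings. A minimal sketch with numpy (toy 2-D vectors here; in the app the vectors come from all-MiniLM-L6-v2, and the function name is illustrative):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose embeddings are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per chunk
    order = np.argsort(scores)[::-1][:k]  # best-first
    return [chunks[i] for i in order]
```

For a few hundred chunks per contract, this brute-force matrix product is fast enough that no vector index is needed.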
+
+ import os
+ import re
+ import numpy as np
+
+ # ── Embedding model (soft-fail) ─────────────────────────────────────
+ _HAS_EMBEDDER = False
+ _embedder = None
+
+ try:
+     from sentence_transformers import SentenceTransformer
+     _HAS_EMBEDDER = True
+ except ImportError:
+     pass
+
+ # ── HF Inference Client (soft-fail) ─────────────────────────────────
+ _HAS_INFERENCE = False
+ _llm_client = None
+
+ try:
+     from huggingface_hub import InferenceClient
+     _HAS_INFERENCE = True
+ except ImportError:
+     pass
+
+ # ═══════════════════════════════════════════════════════════════════════
+ # MODEL LOADING
+ # ═══════════════════════════════════════════════════════════════════════
+
+ _chatbot_status = {"embedder": "not_loaded", "llm": "not_loaded"}
+
+ def _load_embedder():
+     """Load sentence-transformers embedding model (lazy)."""
+     global _embedder, _chatbot_status
+     if _embedder is not None:
+         return _embedder
+     if not _HAS_EMBEDDER:
+         _chatbot_status["embedder"] = "unavailable"
+         return None
+     try:
+         print("[ClauseGuard Chat] Loading embedding model: all-MiniLM-L6-v2...")
+         _embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+         _chatbot_status["embedder"] = "loaded"
+         print("[ClauseGuard Chat] Embedding model loaded")
+         return _embedder
+     except Exception as e:
+         _chatbot_status["embedder"] = f"failed: {e}"
+         print(f"[ClauseGuard Chat] Embedder load failed: {e}")
+         return None
+
+ def _get_llm_client():
+     """Get or create HF Inference Client (lazy)."""
+     global _llm_client, _chatbot_status
+     if _llm_client is not None:
+         return _llm_client
+     if not _HAS_INFERENCE:
+         _chatbot_status["llm"] = "unavailable"
+         return None
+     try:
+         token = os.environ.get("HF_TOKEN", "")
+         _llm_client = InferenceClient(
+             provider="hf-inference",
+             api_key=token if token else None,
+         )
+         _chatbot_status["llm"] = "loaded"
+         print("[ClauseGuard Chat] HF Inference Client initialized")
+         return _llm_client
+     except Exception as e:
+         _chatbot_status["llm"] = f"failed: {e}"
+         print(f"[ClauseGuard Chat] LLM client init failed: {e}")
+         return None
+
+ def get_chatbot_status():
+     """Return human-readable chatbot status."""
+     parts = []
+     for name, status in _chatbot_status.items():
+         icon = "✅" if status == "loaded" else "⚠️" if "failed" in status else "❌"
+         label = {"embedder": "Embeddings", "llm": "LLM API"}[name]
+         parts.append(f"{icon} {label}: {status}")
+     return " · ".join(parts)
+
+ # ═══════════════════════════════════════════════════════════════════════
+ # TEXT CHUNKING (sentence-preserving, ~300 tokens, no overlap)
+ # ═══════════════════════════════════════════════════════════════════════
+
+ def chunk_contract_text(text, target_chunk_size=300, min_chunk_size=50):
+     """
+     Split contract text into chunks for RAG retrieval.
+     Sentence-preserving, ~300 tokens per chunk, 0% overlap.
+     Research (arxiv 2601.14123): overlap adds cost with zero benefit.
+     """
+     if not text:
+         return []
+
+     # First split on paragraph boundaries
+     paragraphs = re.split(r'\n\n+', text)
+     chunks = []
+     current_chunk = ""
+
+     for para in paragraphs:
+         para = para.strip()
+         if not para:
+             continue
+
+         # Estimate word count (rough token proxy)
+         words_current = len(current_chunk.split())
+         words_para = len(para.split())
+
+         if words_current + words_para <= target_chunk_size:
+             current_chunk += ("\n\n" + para if current_chunk else para)
+         else:
+             # Current chunk is full enough — save it
+             if words_current >= min_chunk_size:
+                 chunks.append(current_chunk.strip())
+                 current_chunk = para
+             else:
+                 # Current chunk too small — need to split the paragraph into sentences
+                 sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', para)
+                 for sent in sentences:
+                     words_current = len(current_chunk.split())
146
+ words_sent = len(sent.split())
147
+ if words_current + words_sent <= target_chunk_size:
148
+ current_chunk += (" " + sent if current_chunk else sent)
149
+ else:
150
+ if words_current >= min_chunk_size:
151
+ chunks.append(current_chunk.strip())
152
+ current_chunk = sent
153
+
154
+ # Flush the final chunk (note: a tail shorter than min_chunk_size is dropped)
155
+ if current_chunk.strip() and len(current_chunk.split()) >= min_chunk_size:
156
+ chunks.append(current_chunk.strip())
157
+
158
+ return chunks
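Illustration (not part of the commit): the paragraph-level packing strategy of `chunk_contract_text` can be exercised standalone. `greedy_chunks` below is a hypothetical, simplified restatement that keeps only the paragraph path (it skips the sentence-splitting fallback):

```python
import re

def greedy_chunks(text, target=300, minimum=50):
    # Simplified sketch of chunk_contract_text's paragraph-level path:
    # pack whole paragraphs greedily until the ~300-word budget is hit.
    chunks, current = [], ""
    for para in (p.strip() for p in re.split(r'\n\n+', text) if p.strip()):
        if len(current.split()) + len(para.split()) <= target:
            current = f"{current}\n\n{para}" if current else para
        else:
            if len(current.split()) >= minimum:
                chunks.append(current)
            current = para
    if len(current.split()) >= minimum:
        chunks.append(current)  # flush the tail
    return chunks

doc = "\n\n".join("word " * 100 for _ in range(6))  # six 100-word paragraphs
out = greedy_chunks(doc)
print([len(c.split()) for c in out])  # → [300, 300]
```

With zero overlap, each paragraph lands in exactly one chunk; the real function additionally falls back to sentence boundaries when a paragraph would overshoot the budget.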
159
+
160
+
161
+ # ═══════════════════════════════════════════════════════════════════════
162
+ # EMBEDDING & RETRIEVAL
163
+ # ═══════════════════════════════════════════════════════════════════════
164
+
165
+ def build_embeddings(chunks):
166
+ """
167
+ Embed chunks using sentence-transformers.
168
+ Returns numpy array of shape (N, 384) or None if embedder unavailable.
169
+ """
170
+ embedder = _load_embedder()
171
+ if embedder is None or not chunks:
172
+ return None
173
+ try:
174
+ embeddings = embedder.encode(
175
+ chunks,
176
+ normalize_embeddings=True,
177
+ batch_size=32,
178
+ show_progress_bar=False,
179
+ )
180
+ return embeddings # numpy array (N, 384)
181
+ except Exception as e:
182
+ print(f"[ClauseGuard Chat] Embedding error: {e}")
183
+ return None
184
+
185
+
186
+ def retrieve_chunks(query, chunks, embeddings, top_k=5):
187
+ """
188
+ Retrieve top-k most relevant chunks for a query.
189
+ Uses cosine similarity (embeddings are L2-normalized → dot product = cosine).
190
+ Context budget: top-5 chunks, capped at ~600 words (≈800 tokens).
191
+ """
192
+ embedder = _load_embedder()
193
+ if embedder is None or embeddings is None or not chunks:
194
+ return []
195
+
196
+ try:
197
+ q_emb = embedder.encode([query], normalize_embeddings=True)
198
+ scores = (q_emb @ embeddings.T)[0]
199
+ top_indices = np.argsort(scores)[::-1][:top_k]
200
+
201
+ results = []
202
+ total_words = 0
203
+ max_words = 600 # ≈800-token context budget (~1.3 tokens/word)
204
+
205
+ for idx in top_indices:
206
+ chunk = chunks[idx]
207
+ chunk_words = len(chunk.split())
208
+ if total_words + chunk_words > max_words and results:
209
+ break
210
+ results.append({
211
+ "text": chunk,
212
+ "score": float(scores[idx]),
213
+ "index": int(idx),
214
+ })
215
+ total_words += chunk_words
216
+
217
+ return results
218
+ except Exception as e:
219
+ print(f"[ClauseGuard Chat] Retrieval error: {e}")
220
+ return []
221
+
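Illustration (not part of the commit): `retrieve_chunks` can rank with a single matmul because, for L2-normalized vectors, the dot product equals cosine similarity. A tiny deterministic sketch:

```python
import numpy as np

# Four unit-length "chunk" embeddings (3-dim for readability).
emb = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])

q = np.array([0.8, 0.6, 0.0])   # unit-length query, closest to row 2
scores = q @ emb.T              # cosine scores ≈ [0.8, 0.6, 0.96, 0.0]
top_k = np.argsort(scores)[::-1][:2]
print(top_k.tolist())  # → [2, 0]
```

This mirrors `q_emb @ embeddings.T` above: no explicit normalization is needed at query time because `encode(..., normalize_embeddings=True)` already returns unit vectors.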
222
+
223
+ # ═══════════════════════════════════════════════════════════════════════
224
+ # SYSTEM PROMPT BUILDER
225
+ # ═══════════════════════════════════════════════════════════════════════
226
+
227
+ def _build_system_prompt(analysis_result, retrieved_chunks):
228
+ """
229
+ Build the system prompt with:
230
+ 1. ClauseGuard analysis results (clauses, entities, risk scores) — NOT through RAG
231
+ 2. Retrieved contract chunks — through RAG
232
+ """
233
+ parts = []
234
+
235
+ parts.append("""You are ClauseGuard AI, a legal contract analysis assistant. You help users understand their contracts by answering questions based on the contract text and analysis results.
236
+
237
+ RULES:
238
+ - Answer ONLY based on the provided contract text and analysis. Never make up information.
239
+ - If the answer isn't in the provided context, say "I don't see that information in the analyzed contract."
240
+ - Cite specific clauses or sections when possible.
241
+ - Be concise but thorough. Use plain language, not legal jargon.
242
+ - Always end with: "⚠️ This is AI analysis, not legal advice. Consult an attorney for legal decisions."
243
+ """)
244
+
245
+ # Add analysis summary if available
246
+ if analysis_result:
247
+ risk = analysis_result.get("risk", {})
248
+ parts.append(f"""
249
+ ═══ CONTRACT ANALYSIS SUMMARY ═══
250
+ Risk Score: {risk.get('score', 'N/A')}/100 (Grade {risk.get('grade', 'N/A')})
251
+ Risk Breakdown: {risk.get('breakdown', {})}
252
+ Total Clauses Analyzed: {analysis_result.get('metadata', {}).get('total_clauses', 'N/A')}
253
+ Flagged Clauses: {analysis_result.get('metadata', {}).get('flagged_clauses', 'N/A')}
254
+ """)
255
+
256
+ # Add detected clauses summary
257
+ clauses = analysis_result.get("clauses", [])
258
+ if clauses:
259
+ clause_summary = []
260
+ seen = set()
261
+ for c in clauses:
262
+ key = c["label"]
263
+ if key not in seen:
264
+ seen.add(key)
265
+ risk_level = c.get("risk", "LOW")
266
+ clause_summary.append(f" • [{risk_level}] {key}: {c.get('description', '')}")
267
+ parts.append("═══ DETECTED CLAUSES ═══\n" + "\n".join(clause_summary[:20]))
268
+
269
+ # Add entities summary
270
+ entities = analysis_result.get("entities", [])
271
+ if entities:
272
+ entity_summary = []
273
+ seen = set()
274
+ for e in entities:
275
+ key = f"{e['type']}: {e['text']}"
276
+ if key not in seen and len(seen) < 15:
277
+ seen.add(key)
278
+ entity_summary.append(f" • {e['type']}: {e['text']}")
279
+ parts.append("═══ EXTRACTED ENTITIES ═══\n" + "\n".join(entity_summary))
280
+
281
+ # Add contradictions
282
+ contradictions = analysis_result.get("contradictions", [])
283
+ if contradictions:
284
+ contra_summary = []
285
+ for c in contradictions:
286
+ contra_summary.append(f" • [{c['type']}] {c['explanation']}")
287
+ parts.append("═══ CONTRADICTIONS / ISSUES ═══\n" + "\n".join(contra_summary))
288
+
289
+ # Add retrieved contract text
290
+ if retrieved_chunks:
291
+ context_text = "\n---\n".join(c["text"] for c in retrieved_chunks)
292
+ parts.append(f"""
293
+ ═══ RELEVANT CONTRACT TEXT (Retrieved) ═══
294
+ {context_text}
295
+ """)
296
+
297
+ return "\n\n".join(parts)
298
+
299
+
300
+ # ═══════════════════════════════════════════════════════════════════════
301
+ # CHAT RESPONSE (Streaming)
302
+ # ═══════════════════════════════════════════════════════════════════════
303
+
304
+ # LLM model to use
305
+ _LLM_MODEL = "Qwen/Qwen2.5-7B-Instruct"
306
+
307
+ def chat_respond(message, history, chunks, embeddings, analysis_result):
308
+ """
309
+ RAG chatbot response function for gr.ChatInterface.
310
+
311
+ Args:
312
+ message: User's question (str)
313
+ history: Chat history (list of dicts with role/content)
314
+ chunks: Contract text chunks (list of str)
315
+ embeddings: Chunk embeddings (numpy array or None)
316
+ analysis_result: Full analysis result dict (or None)
317
+
318
+ Yields:
319
+ Partial response string (streaming)
320
+ """
321
+ # Validate inputs
322
+ if not chunks or embeddings is None:
323
+ yield ("⚠️ No contract loaded yet. Please upload and analyze a contract in the "
324
+ "**📄 Single Contract Analysis** tab first, then come back here to ask questions.")
325
+ return
326
+
327
+ if not message or not message.strip():
328
+ yield "Please ask a question about your contract."
329
+ return
330
+
331
+ # Step 1: Retrieve relevant chunks
332
+ retrieved = retrieve_chunks(message, chunks, embeddings, top_k=5)
333
+
334
+ # Step 2: Build system prompt with analysis + retrieved context
335
+ system_prompt = _build_system_prompt(analysis_result, retrieved)
336
+
337
+ # Step 3: Build message history for LLM
338
+ messages = [{"role": "system", "content": system_prompt}]
339
+
340
+ # Add recent history (last 6 messages, i.e. 3 user/assistant turns, to stay in the context window)
341
+ if history:
342
+ for h in history[-6:]:
343
+ messages.append({"role": h["role"], "content": h["content"]})
344
+
345
+ messages.append({"role": "user", "content": message})
346
+
347
+ # Step 4: Stream response from LLM
348
+ client = _get_llm_client()
349
+ if client is None:
350
+ yield ("⚠️ LLM service unavailable. Please ensure `huggingface_hub` is installed "
351
+ "and `HF_TOKEN` is set.")
352
+ return
353
+
354
+ try:
355
+ stream = client.chat_completion(
356
+ model=_LLM_MODEL,
357
+ messages=messages,
358
+ max_tokens=1024,
359
+ stream=True,
360
+ temperature=0.3, # Low temperature for factual responses
361
+ )
362
+ partial = ""
363
+ for chunk in stream:
364
+ token = chunk.choices[0].delta.content or ""
365
+ partial += token
366
+ yield partial
367
+ except Exception as e:
368
+ error_msg = str(e)
369
+ if "rate limit" in error_msg.lower() or "429" in error_msg:
370
+ yield ("⚠️ Rate limit reached on the free HF Inference API. "
371
+ "Please wait a moment and try again.")
372
+ elif "401" in error_msg or "unauthorized" in error_msg.lower():
373
+ yield ("⚠️ Authentication error. Please set your HF_TOKEN in the Space settings.")
374
+ else:
375
+ yield f"⚠️ Error generating response: {error_msg}\n\nPlease try again."
376
+
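Illustration (not part of the commit): `chat_respond` yields cumulative partials, which is the streaming contract `gr.ChatInterface` expects (the UI re-renders the latest string each step). A minimal stand-in (`fake_stream` is a hypothetical name) shows the consumption pattern:

```python
def fake_stream(tokens):
    # Yield the growing partial response, like chat_respond does
    # while iterating over client.chat_completion(stream=True).
    partial = ""
    for t in tokens:
        partial += t
        yield partial

final = None
for partial in fake_stream(["The ", "cap ", "is ", "$1M."]):
    final = partial  # a UI would re-render this string on every yield

print(final)  # → "The cap is $1M."
```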
377
+
378
+ # ═══════════════════════════════════════════════════════════════════════
379
+ # INDEXING HELPER (combines chunking + embedding)
380
+ # ═══════════════════════════════════════════════════════════════════════
381
+
382
+ def index_contract(text):
383
+ """
384
+ Chunk and embed contract text for RAG retrieval.
385
+
386
+ Returns: (chunks, embeddings, status_message)
387
+ chunks: list of str
388
+ embeddings: numpy array or None
389
+ status_message: str
390
+ """
391
+ if not text or len(text.strip()) < 50:
392
+ return [], None, "⚠️ No contract text to index"
393
+
394
+ chunks = chunk_contract_text(text)
395
+ if not chunks:
396
+ return [], None, "⚠️ Could not split contract into chunks"
397
+
398
+ embeddings = build_embeddings(chunks)
399
+ if embeddings is None:
400
+ return chunks, None, "⚠️ Embedding model unavailable — chatbot will not work"
401
+
402
+ return (
403
+ chunks,
404
+ embeddings,
405
+ f"✅ Indexed {len(chunks)} chunks ({len(text)} chars) — Ready to chat!"
406
+ )
compare.py CHANGED
@@ -98,6 +98,28 @@ def compare_contracts(text_a, text_b, clauses_a=None, clauses_b=None):
98
  if clauses_b is None:
99
  clauses_b = _split_clauses(text_b)
100
101
  # Build clause type maps
102
  type_map_a = defaultdict(list)
103
  type_map_b = defaultdict(list)
@@ -111,8 +133,9 @@ def compare_contracts(text_a, text_b, clauses_a=None, clauses_b=None):
111
  matched_b = set()
112
  modified = []
113
 
114
- SIMILARITY_THRESHOLD = 0.70
115
- MODIFIED_THRESHOLD = 0.40
 
116
 
117
  for i, ca in enumerate(clauses_a):
118
  best_sim = 0
@@ -181,12 +204,20 @@ def compare_contracts(text_a, text_b, clauses_a=None, clauses_b=None):
181
  risk_delta = "Similar risk profiles"
182
  risk_winner = "tie"
183
 
184
  comparison_method = "semantic (sentence embeddings)" if _embedder is not None else "lexical (string matching)"
185
 
186
  return {
187
  "alignment_score": round(alignment, 3),
188
  "contract_a_clauses": len(clauses_a),
189
  "contract_b_clauses": len(clauses_b),
190
  "added_clauses": [{"text": c[:200], "type": _extract_clause_type(c)} for c in added[:50]],
191
  "removed_clauses": [{"text": c[:200], "type": _extract_clause_type(c)} for c in removed[:50]],
192
  "modified_clauses": modified[:50],
 
98
  if clauses_b is None:
99
  clauses_b = _split_clauses(text_b)
100
 
101
+ # Fix 9: Detect contract types and flag cross-domain comparisons
102
+ _CONTRACT_TYPE_KEYWORDS = {
103
+ "employment": ["employee", "employer", "salary", "compensation", "benefits", "vacation", "severance", "at-will"],
104
+ "lease": ["landlord", "tenant", "rent", "premises", "lease", "occupancy", "security deposit", "eviction"],
105
+ "service": ["service provider", "customer", "SLA", "deliverables", "statement of work", "SOW"],
106
+ "nda": ["confidential", "non-disclosure", "disclosing party", "receiving party"],
107
+ "saas": ["subscription", "SaaS", "cloud", "uptime", "API", "data processing"],
108
+ "purchase": ["buyer", "seller", "purchase order", "goods", "shipment", "delivery"],
109
+ }
110
+
111
+ def _detect_contract_type(text):
112
+ text_lower = text.lower()
113
+ scores = {}
114
+ for ctype, keywords in _CONTRACT_TYPE_KEYWORDS.items():
115
+ scores[ctype] = sum(1 for kw in keywords if kw.lower() in text_lower)
116
+ best = max(scores, key=scores.get)
117
+ return best if scores[best] >= 2 else "general"
118
+
119
+ type_a = _detect_contract_type(text_a)
120
+ type_b = _detect_contract_type(text_b)
121
+ is_cross_domain = type_a != type_b and type_a != "general" and type_b != "general"
122
+
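Illustration (not part of the commit): the keyword-vote detector above, restated standalone with a trimmed keyword map (`detect` and `KEYWORDS` are illustrative names). The "general" fallback fires when fewer than 2 keywords match; note that plain substring matching can over-match (e.g. "rent" inside "current"), a limitation shared with the code above:

```python
def detect(text, keyword_map):
    # Count keyword hits per contract type; require >= 2 hits to commit.
    low = text.lower()
    scores = {t: sum(1 for kw in kws if kw in low)
              for t, kws in keyword_map.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= 2 else "general"

KEYWORDS = {
    "employment": ["employee", "employer", "salary", "severance"],
    "lease": ["landlord", "tenant", "rent", "premises"],
}

print(detect("The Tenant shall pay Rent to the Landlord monthly.", KEYWORDS))  # → lease
print(detect("This Agreement is made between the parties.", KEYWORDS))         # → general
```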
123
  # Build clause type maps
124
  type_map_a = defaultdict(list)
125
  type_map_b = defaultdict(list)
 
133
  matched_b = set()
134
  modified = []
135
 
136
+ # Fix 10: Raise thresholds to reject false "modified" matches
137
+ SIMILARITY_THRESHOLD = 0.75 # was 0.70 — too many false matches
138
+ MODIFIED_THRESHOLD = 0.55 # was 0.40 — "Good Reason" ≠ "Force Majeure"
139
 
140
  for i, ca in enumerate(clauses_a):
141
  best_sim = 0
 
204
  risk_delta = "Similar risk profiles"
205
  risk_winner = "tie"
206
 
207
+ # Fix 9: Cross-domain warning
208
+ if is_cross_domain:
209
+ risk_delta = f"Cross-domain comparison ({type_a} vs {type_b}) — risk delta not meaningful across different contract types"
210
+ risk_winner = "cross-domain"
211
+
212
  comparison_method = "semantic (sentence embeddings)" if _embedder is not None else "lexical (string matching)"
213
 
214
  return {
215
  "alignment_score": round(alignment, 3),
216
  "contract_a_clauses": len(clauses_a),
217
  "contract_b_clauses": len(clauses_b),
218
+ "contract_a_type": type_a,
219
+ "contract_b_type": type_b,
220
+ "is_cross_domain": is_cross_domain,
221
  "added_clauses": [{"text": c[:200], "type": _extract_clause_type(c)} for c in added[:50]],
222
  "removed_clauses": [{"text": c[:200], "type": _extract_clause_type(c)} for c in removed[:50]],
223
  "modified_clauses": modified[:50],
ml/ClauseGuard_DeBERTa_Training.ipynb ADDED
@@ -0,0 +1,1041 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "gpuType": "T4"
8
+ },
9
+ "kernelspec": {
10
+ "name": "python3",
11
+ "display_name": "Python 3"
12
+ },
13
+ "language_info": {
14
+ "name": "python"
15
+ },
16
+ "accelerator": "GPU"
17
+ },
18
+ "cells": [
19
+ {
20
+ "cell_type": "markdown",
21
+ "source": [
22
+ "# 🛡️ ClauseGuard v4 — DeBERTa-v3-large 2-Stage Training\n",
23
+ "\n",
24
+ "**Goal:** Train a production-grade contract clause classifier that replaces the current Legal-BERT-base (50% F1 → target 80-87% F1)\n",
25
+ "\n",
26
+ "## Architecture\n",
27
+ "| Setting | Value | Source |\n",
28
+ "|---------|-------|--------|\n",
29
+ "| Base model | `microsoft/deberta-v3-large` (435M params) | LexGLUE: outperforms Legal-BERT by 7-10pp |\n",
30
+ "| Max length | 512 tokens | MAUD paper: covers 72.4% of clauses without truncation |\n",
31
+ "| Loss function | Asymmetric Loss (γ-=4, clip=0.05) | ASL paper (2009.14119): +3-8pp on rare classes |\n",
32
+ "| Training | Full fine-tuning (no LoRA) | Full FT wins for encoder classification |\n",
33
+ "\n",
34
+ "## 2-Stage Training Pipeline\n",
35
+ "1. **Stage 1 — LEDGAR** (60K legal provisions, 100 classes): Teaches \"what types of contract clauses exist\"\n",
36
+ "2. **Stage 2 — CUAD** (41 CUAD classes): Target task with Asymmetric Loss for class imbalance\n",
37
+ "\n",
38
+ "**Runtime:** ~8-12 hours on T4 GPU (or ~4-6 hours on A100)\n",
39
+ "\n",
40
+ "**Before running:**\n",
41
+ "1. `Runtime` → `Change runtime type` → **T4 GPU**\n",
42
+ "2. `Runtime` → `Run all`\n",
43
+ "3. Paste your HuggingFace token when prompted"
44
+ ],
45
+ "metadata": {}
46
+ },
47
+ {
48
+ "cell_type": "markdown",
49
+ "source": [
50
+ "## Step 1: Install Dependencies"
51
+ ],
52
+ "metadata": {}
53
+ },
54
+ {
55
+ "cell_type": "code",
56
+ "source": [
57
+ "!pip install -q transformers datasets scikit-learn accelerate huggingface_hub torch\n",
58
+ "!pip install -q trackio # optional: experiment tracking"
59
+ ],
60
+ "metadata": {},
61
+ "execution_count": null,
62
+ "outputs": []
63
+ },
64
+ {
65
+ "cell_type": "markdown",
66
+ "source": [
67
+ "## Step 2: Login to HuggingFace Hub"
68
+ ],
69
+ "metadata": {}
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "source": [
74
+ "from huggingface_hub import login\n",
75
+ "login()"
76
+ ],
77
+ "metadata": {},
78
+ "execution_count": null,
79
+ "outputs": []
80
+ },
81
+ {
82
+ "cell_type": "markdown",
83
+ "source": [
84
+ "## Step 3: Configuration"
85
+ ],
86
+ "metadata": {}
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "source": [
91
+ "import os\n",
92
+ "import torch\n",
93
+ "import numpy as np\n",
94
+ "\n",
95
+ "# ═══════════════════════════════════════════════════════════════\n",
96
+ "# CONFIGURATION — Edit these values\n",
97
+ "# ═══════════════════════════════════════════════════════════════\n",
98
+ "\n",
99
+ "BASE_MODEL = \"microsoft/deberta-v3-large\" # 435M params, MIT license\n",
100
+ "MAX_LENGTH = 512 # covers 72.4% of clauses\n",
101
+ "HUB_MODEL_ID = \"gaurv007/clauseguard-deberta-v3-large\" # ← your model repo\n",
102
+ "\n",
103
+ "# Stage 1: LEDGAR config\n",
104
+ "STAGE1_EPOCHS = 5 # LEDGAR is large, converges fast\n",
105
+ "STAGE1_LR = 2e-5\n",
106
+ "STAGE1_BATCH = 2 # T4 fp32: reduced for DeBERTa-v3 compatibility\n",
107
+ "STAGE1_GRAD_ACCUM = 16 # effective batch = 32 (2 * 16)\n",
108
+ "\n",
109
+ "# Stage 2: CUAD config \n",
110
+ "STAGE2_EPOCHS = 20\n",
111
+ "STAGE2_LR = 1e-5 # lower LR for fine-tuning pretrained model\n",
112
+ "STAGE2_BATCH = 2 # T4 fp32: reduced for DeBERTa-v3 compatibility\n",
113
+ "STAGE2_GRAD_ACCUM = 16 # effective batch = 32 (2 * 16)\n",
114
+ "EARLY_STOPPING_PATIENCE = 3\n",
115
+ "\n",
116
+ "# ASL hyperparameters (from arxiv 2009.14119)\n",
117
+ "ASL_GAMMA_POS = 0\n",
118
+ "ASL_GAMMA_NEG = 4\n",
119
+ "ASL_CLIP = 0.05\n",
120
+ "\n",
121
+ "# Weight decay (DeBERTa default)\n",
122
+ "WEIGHT_DECAY = 0.06\n",
123
+ "WARMUP_RATIO = 0.1\n",
124
+ "\n",
125
+ "SEED = 42\n",
126
+ "\n",
127
+ "# ═══════════════════════════════════════════════════════════════\n",
128
+ "\n",
129
+ "# CUAD 41 label names (must match class_id 0-40 in CUAD dataset)\n",
130
+ "CUAD_LABELS = [\n",
131
+ " \"Document Name\", # 0\n",
132
+ " \"Parties\", # 1\n",
133
+ " \"Agreement Date\", # 2\n",
134
+ " \"Effective Date\", # 3\n",
135
+ " \"Expiration Date\", # 4\n",
136
+ " \"Renewal Term\", # 5\n",
137
+ " \"Notice Period to Terminate Renewal\", # 6\n",
138
+ " \"Governing Law\", # 7\n",
139
+ " \"Most Favored Nation\", # 8\n",
140
+ " \"Non-Compete\", # 9\n",
141
+ " \"Exclusivity\", # 10\n",
142
+ " \"No-Solicit of Customers\", # 11\n",
143
+ " \"No-Solicit of Employees\", # 12\n",
144
+ " \"Non-Disparagement\", # 13\n",
145
+ " \"Termination for Convenience\", # 14\n",
146
+ " \"ROFR/ROFO/ROFN\", # 15\n",
147
+ " \"Change of Control\", # 16\n",
148
+ " \"Anti-Assignment\", # 17\n",
149
+ " \"Revenue/Profit Sharing\", # 18\n",
150
+ " \"Price Restriction\", # 19\n",
151
+ " \"Minimum Commitment\", # 20\n",
152
+ " \"Volume Restriction\", # 21\n",
153
+ " \"IP Ownership Assignment\", # 22\n",
154
+ " \"Joint IP Ownership\", # 23\n",
155
+ " \"License Grant\", # 24\n",
156
+ " \"Non-Transferable License\", # 25\n",
157
+ " \"Affiliate License-Licensor\", # 26\n",
158
+ " \"Affiliate License-Licensee\", # 27\n",
159
+ " \"Unlimited/All-You-Can-Eat License\", # 28\n",
160
+ " \"Irrevocable or Perpetual License\", # 29\n",
161
+ " \"Source Code Escrow\", # 30\n",
162
+ " \"Post-Termination Services\", # 31\n",
163
+ " \"Audit Rights\", # 32\n",
164
+ " \"Uncapped Liability\", # 33\n",
165
+ " \"Cap on Liability\", # 34\n",
166
+ " \"Liquidated Damages\", # 35\n",
167
+ " \"Warranty Duration\", # 36\n",
168
+ " \"Insurance\", # 37\n",
169
+ " \"Covenant Not to Sue\", # 38\n",
170
+ " \"Third Party Beneficiary\", # 39\n",
171
+ " \"Other\", # 40\n",
172
+ "]\n",
173
+ "\n",
174
+ "NUM_CUAD_LABELS = len(CUAD_LABELS) # 41\n",
175
+ "\n",
176
+ "print(f\"🛡️ ClauseGuard v4 Training Configuration\")\n",
177
+ "print(f\" Base model: {BASE_MODEL}\")\n",
178
+ "print(f\" Max length: {MAX_LENGTH}\")\n",
179
+ "print(f\" Hub model: {HUB_MODEL_ID}\")\n",
180
+ "print(f\" GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}\")\n",
181
+ "print(f\" VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\" if torch.cuda.is_available() else \"\")\n",
182
+ "print(f\" CUAD classes: {NUM_CUAD_LABELS}\")"
183
+ ],
184
+ "metadata": {},
185
+ "execution_count": null,
186
+ "outputs": []
187
+ },
188
+ {
189
+ "cell_type": "markdown",
190
+ "source": [
191
+ "## Step 4: Load Datasets"
192
+ ],
193
+ "metadata": {}
194
+ },
195
+ {
196
+ "cell_type": "code",
197
+ "source": [
198
+ "from datasets import load_dataset, Dataset\n",
199
+ "import pandas as pd\n",
200
+ "from collections import Counter\n",
201
+ "\n",
202
+ "# ═══════════════════════════════════════════════════════════════\n",
203
+ "# Stage 1: LEDGAR (100 classes, single-label)\n",
204
+ "# ═══════════════════════════════════════════════════════════════\n",
205
+ "print(\"📚 Loading LEDGAR dataset...\")\n",
206
+ "ledgar = load_dataset(\"coastalcph/lex_glue\", \"ledgar\")\n",
207
+ "print(f\" Train: {len(ledgar['train']):,} | Val: {len(ledgar['validation']):,} | Test: {len(ledgar['test']):,}\")\n",
208
+ "num_ledgar_labels = ledgar['train'].features['label'].num_classes\n",
209
+ "print(f\" Classes: {num_ledgar_labels}\")\n",
210
+ "\n",
211
+ "# ═══════════════════════════════════════════════════════════════\n",
212
+ "# Stage 2: CUAD (41 classes — reformulated for classification)\n",
213
+ "# ═══════════════════════════════════════════════════════════════\n",
214
+ "print(\"\\n📚 Loading CUAD classification dataset...\")\n",
215
+ "cuad_raw = load_dataset(\"dvgodoy/CUAD_v1_Contract_Understanding_clause_classification\", split=\"train\")\n",
216
+ "print(f\" Total rows: {len(cuad_raw):,}\")\n",
217
+ "\n",
218
+ "# Analyze class distribution\n",
219
+ "class_counts = Counter(cuad_raw['class_id'])\n",
220
+ "print(f\" Unique classes: {len(class_counts)}\")\n",
221
+ "print(f\" \\n Class distribution:\")\n",
222
+ "for cid in sorted(class_counts.keys()):\n",
223
+ " label_name = CUAD_LABELS[cid] if cid < len(CUAD_LABELS) else f\"Unknown-{cid}\"\n",
224
+ " count = class_counts[cid]\n",
225
+ " bar = '█' * min(50, count // 10)\n",
226
+ " print(f\" {cid:2d} {label_name:40s} {count:5d} {bar}\")"
227
+ ],
228
+ "metadata": {},
229
+ "execution_count": null,
230
+ "outputs": []
231
+ },
232
+ {
233
+ "cell_type": "markdown",
234
+ "source": [
235
+ "## Step 5: Prepare CUAD Train/Val/Test Splits"
236
+ ],
237
+ "metadata": {}
238
+ },
239
+ {
240
+ "cell_type": "code",
241
+ "source": [
242
+ "from sklearn.model_selection import train_test_split\n",
243
+ "\n",
244
+ "# CUAD only has train split — create val/test by splitting by file_name\n",
245
+ "# (so no data leakage between contracts)\n",
246
+ "cuad_df = cuad_raw.to_pandas()\n",
247
+ "\n",
248
+ "# Get unique file names\n",
249
+ "unique_files = cuad_df['file_name'].unique()\n",
250
+ "print(f\"Unique contracts: {len(unique_files)}\")\n",
251
+ "\n",
252
+ "# Split files 80/10/10\n",
253
+ "train_files, test_files = train_test_split(unique_files, test_size=0.2, random_state=SEED)\n",
254
+ "val_files, test_files = train_test_split(test_files, test_size=0.5, random_state=SEED)\n",
255
+ "\n",
256
+ "cuad_train_df = cuad_df[cuad_df['file_name'].isin(train_files)]\n",
257
+ "cuad_val_df = cuad_df[cuad_df['file_name'].isin(val_files)]\n",
258
+ "cuad_test_df = cuad_df[cuad_df['file_name'].isin(test_files)]\n",
259
+ "\n",
260
+ "print(f\"CUAD splits — Train: {len(cuad_train_df)} | Val: {len(cuad_val_df)} | Test: {len(cuad_test_df)}\")\n",
261
+ "print(f\"Train contracts: {len(train_files)} | Val contracts: {len(val_files)} | Test contracts: {len(test_files)}\")\n",
262
+ "\n",
263
+ "# Convert to HF Dataset\n",
264
+ "cuad_train = Dataset.from_pandas(cuad_train_df.reset_index(drop=True))\n",
265
+ "cuad_val = Dataset.from_pandas(cuad_val_df.reset_index(drop=True))\n",
266
+ "cuad_test = Dataset.from_pandas(cuad_test_df.reset_index(drop=True))\n",
267
+ "\n",
268
+ "# Verify class distribution in each split\n",
269
+ "for name, ds in [(\"Train\", cuad_train), (\"Val\", cuad_val), (\"Test\", cuad_test)]:\n",
270
+ " counts = Counter(ds['class_id'])\n",
271
+ " empty_classes = [i for i in range(NUM_CUAD_LABELS) if counts.get(i, 0) == 0]\n",
272
+ " print(f\" {name}: {len(ds)} rows, {len(counts)} classes present, {len(empty_classes)} classes missing: {empty_classes[:5]}...\")"
273
+ ],
274
+ "metadata": {},
275
+ "execution_count": null,
276
+ "outputs": []
277
+ },
278
+ {
279
+ "cell_type": "markdown",
280
+ "source": [
281
+ "## Step 6: Tokenizer & Preprocessing"
282
+ ],
283
+ "metadata": {}
284
+ },
285
+ {
286
+ "cell_type": "code",
287
+ "source": [
288
+ "from transformers import AutoTokenizer\n",
289
+ "\n",
290
+ "print(f\"Loading tokenizer: {BASE_MODEL}\")\n",
291
+ "tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)\n",
292
+ "\n",
293
+ "# ── LEDGAR preprocessing (single-label) ──\n",
294
+ "def preprocess_ledgar(examples):\n",
295
+ " tokenized = tokenizer(\n",
296
+ " examples[\"text\"],\n",
297
+ " truncation=True,\n",
298
+ " max_length=MAX_LENGTH,\n",
299
+ " padding=False,\n",
300
+ " )\n",
301
+ " tokenized[\"labels\"] = examples[\"label\"] # int label for CrossEntropy\n",
302
+ " return tokenized\n",
303
+ "\n",
304
+ "# ── CUAD preprocessing (single-label per clause, 41 classes) ──\n",
305
+ "def preprocess_cuad(examples):\n",
306
+ " tokenized = tokenizer(\n",
307
+ " examples[\"clause\"],\n",
308
+ " truncation=True,\n",
309
+ " max_length=MAX_LENGTH,\n",
310
+ " padding=False,\n",
311
+ " )\n",
312
+ " tokenized[\"labels\"] = examples[\"class_id\"] # int label for CrossEntropy + ASL\n",
313
+ " return tokenized\n",
314
+ "\n",
315
+ "print(\"Tokenizing LEDGAR...\")\n",
316
+ "ledgar_tokenized = ledgar.map(\n",
317
+ " preprocess_ledgar, batched=True,\n",
318
+ " remove_columns=ledgar[\"train\"].column_names,\n",
319
+ " desc=\"Tokenizing LEDGAR\"\n",
320
+ ")\n",
321
+ "\n",
322
+ "print(\"Tokenizing CUAD...\")\n",
323
+ "cuad_train_tok = cuad_train.map(\n",
324
+ " preprocess_cuad, batched=True,\n",
325
+ " remove_columns=cuad_train.column_names,\n",
326
+ " desc=\"Tokenizing CUAD train\"\n",
327
+ ")\n",
328
+ "cuad_val_tok = cuad_val.map(\n",
329
+ " preprocess_cuad, batched=True,\n",
330
+ " remove_columns=cuad_val.column_names,\n",
331
+ " desc=\"Tokenizing CUAD val\"\n",
332
+ ")\n",
333
+ "cuad_test_tok = cuad_test.map(\n",
334
+ " preprocess_cuad, batched=True,\n",
335
+ " remove_columns=cuad_test.column_names,\n",
336
+ " desc=\"Tokenizing CUAD test\"\n",
337
+ ")\n",
338
+ "\n",
339
+ "# Check token lengths\n",
340
+ "train_lengths = [len(x) for x in cuad_train_tok['input_ids']]\n",
341
+ "print(f\"\\n📊 CUAD token length stats:\")\n",
342
+ "print(f\" Mean: {np.mean(train_lengths):.0f} | Median: {np.median(train_lengths):.0f}\")\n",
343
+ "print(f\" 95th pct: {np.percentile(train_lengths, 95):.0f} | Max: {max(train_lengths)}\")\n",
344
+ "print(f\" Truncated (>512): {sum(1 for l in train_lengths if l >= MAX_LENGTH)} ({sum(1 for l in train_lengths if l >= MAX_LENGTH)/len(train_lengths)*100:.1f}%)\")\n",
345
+ "print(\"✅ Tokenization complete!\")"
346
+ ],
347
+ "metadata": {},
348
+ "execution_count": null,
349
+ "outputs": []
350
+ },
351
+ {
352
+ "cell_type": "markdown",
353
+ "source": [
354
+ "## Step 7: Asymmetric Loss Function\n",
355
+ "\n",
356
+ "From [Asymmetric Loss For Multi-Label Classification](https://arxiv.org/abs/2009.14119) (ICCV 2021).\n",
357
+ "\n",
358
+ "Key idea: Down-weight easy negatives more aggressively than positives. Critical for CUAD where most labels are negative for any given clause."
359
+ ],
360
+ "metadata": {}
361
+ },
362
+ {
363
+ "cell_type": "code",
364
+ "source": [
365
+ "import torch\n",
366
+ "import torch.nn as nn\n",
367
+ "import torch.nn.functional as F\n",
368
+ "\n",
369
+ "\n",
370
+ "class AsymmetricLoss(nn.Module):\n",
371
+ " \"\"\"\n",
372
+ " Asymmetric Loss from arxiv:2009.14119.\n",
373
+ " \n",
374
+ " For multi-class (single-label) classification with class imbalance:\n",
375
+     "    We use the multi-class variant: focal-style re-weighting of\n",
376
+     "    cross-entropy, modulated by a single (1 - p_t)^γ- factor.\n",
377
+ " \n",
378
+ " For multi-label (multi-hot) classification:\n",
379
+ " L+ = (1-p)^γ+ * log(p)\n",
380
+     "    L- = (pm)^γ- * log(1-pm),  pm = max(p - clip, 0)\n",
381
+ " \"\"\"\n",
382
+ " def __init__(self, gamma_pos=0, gamma_neg=4, clip=0.05, eps=1e-8,\n",
383
+ " num_classes=None, class_weights=None, mode=\"multi_class\"):\n",
384
+ " super().__init__()\n",
385
+ " self.gamma_pos = gamma_pos\n",
386
+ " self.gamma_neg = gamma_neg\n",
387
+ " self.clip = clip\n",
388
+ " self.eps = eps\n",
389
+ " self.mode = mode\n",
390
+ " \n",
391
+ " # Optional class weights for severe imbalance\n",
392
+ " if class_weights is not None:\n",
393
+ " self.register_buffer('class_weights', torch.tensor(class_weights, dtype=torch.float32))\n",
394
+ " else:\n",
395
+ " self.class_weights = None\n",
396
+ "\n",
397
+ " def forward(self, logits, targets):\n",
398
+ " if self.mode == \"multi_label\":\n",
399
+ " return self._multi_label_loss(logits, targets)\n",
400
+ " else:\n",
401
+ " return self._multi_class_loss(logits, targets)\n",
402
+ " \n",
403
+ " def _multi_class_loss(self, logits, targets):\n",
404
+ " \"\"\"Focal-style cross-entropy with asymmetric gamma for single-label classification.\"\"\"\n",
405
+ " # Standard cross-entropy with class weights\n",
406
+ " if self.class_weights is not None:\n",
407
+ " ce_loss = F.cross_entropy(logits, targets, weight=self.class_weights, reduction='none')\n",
408
+ " else:\n",
409
+ " ce_loss = F.cross_entropy(logits, targets, reduction='none')\n",
410
+ " \n",
411
+ " # Apply focal modulation\n",
412
+ " probs = F.softmax(logits, dim=-1)\n",
413
+ " # Get probability of the correct class\n",
414
+ " p_t = probs.gather(1, targets.unsqueeze(1)).squeeze(1)\n",
415
+ " \n",
416
+ " # Focal weight: (1 - p_t)^gamma\n",
417
+         "        # A single gamma_neg is applied to all examples (gamma_pos is unused in this variant)\n",
418
+ " focal_weight = (1 - p_t) ** self.gamma_neg\n",
419
+ " \n",
420
+ " loss = focal_weight * ce_loss\n",
421
+ " return loss.mean()\n",
422
+ "\n",
423
+ " def _multi_label_loss(self, logits, targets):\n",
424
+ " \"\"\"Full ASL for multi-label classification.\"\"\"\n",
425
+ " p = torch.sigmoid(logits)\n",
426
+ " \n",
427
+ " if self.clip is not None and self.clip > 0:\n",
428
+ " p_m = torch.clamp(p - self.clip, min=0)\n",
429
+ " else:\n",
430
+ " p_m = p\n",
431
+ " \n",
432
+ " loss_pos = targets * (1 - p) ** self.gamma_pos * torch.log(p + self.eps)\n",
433
+ " loss_neg = (1 - targets) * p_m ** self.gamma_neg * torch.log(1 - p_m + self.eps)\n",
434
+ " \n",
435
+ " loss = -(loss_pos + loss_neg)\n",
436
+ " return loss.mean()\n",
437
+ "\n",
438
+ "\n",
439
+ "print(\"✅ AsymmetricLoss defined\")\n",
440
+ "print(f\" γ+ = {ASL_GAMMA_POS}, γ- = {ASL_GAMMA_NEG}, clip = {ASL_CLIP}\")"
441
+ ],
442
+ "metadata": {},
443
+ "execution_count": null,
444
+ "outputs": []
445
+ },
446
+ {
447
+ "cell_type": "markdown",
448
+ "source": [
449
+ "## Step 8: Custom Trainer with ASL"
450
+ ],
451
+ "metadata": {}
452
+ },
453
+ {
454
+ "cell_type": "code",
455
+ "source": [
456
+ "from transformers import Trainer\n",
457
+ "\n",
458
+ "\n",
459
+ "class ASLTrainer(Trainer):\n",
460
+ " \"\"\"Custom Trainer that uses Asymmetric Loss instead of standard CrossEntropy.\"\"\"\n",
461
+ " \n",
462
+ " def __init__(self, *args, asl_loss_fn=None, **kwargs):\n",
463
+ " super().__init__(*args, **kwargs)\n",
464
+ " self.asl = asl_loss_fn\n",
465
+ "\n",
466
+ " def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):\n",
467
+ " labels = inputs.pop(\"labels\")\n",
468
+ " outputs = model(**inputs)\n",
469
+ " logits = outputs.logits\n",
470
+ " \n",
471
+ " if self.asl is not None:\n",
472
+ " loss = self.asl(logits, labels)\n",
473
+ " else:\n",
474
+ " # Fallback to standard cross-entropy\n",
475
+ " loss = F.cross_entropy(logits, labels)\n",
476
+ " \n",
477
+ " return (loss, outputs) if return_outputs else loss\n",
478
+ "\n",
479
+ "\n",
480
+ "print(\"✅ ASLTrainer defined\")"
481
+ ],
482
+ "metadata": {},
483
+ "execution_count": null,
484
+ "outputs": []
485
+ },
486
+ {
487
+ "cell_type": "markdown",
488
+ "source": [
489
+ "## Step 9: Metrics"
490
+ ],
491
+ "metadata": {}
492
+ },
493
+ {
494
+ "cell_type": "code",
495
+ "source": [
496
+ "from sklearn.metrics import f1_score, precision_score, recall_score, classification_report\n",
497
+ "\n",
498
+ "\n",
499
+ "def compute_metrics_single_label(eval_pred):\n",
500
+ " \"\"\"Metrics for single-label classification (LEDGAR & CUAD).\"\"\"\n",
501
+ " logits, labels = eval_pred.predictions, eval_pred.label_ids\n",
502
+ " preds = np.argmax(logits, axis=-1)\n",
503
+ " \n",
504
+ " micro_f1 = f1_score(labels, preds, average=\"micro\", zero_division=0)\n",
505
+ " macro_f1 = f1_score(labels, preds, average=\"macro\", zero_division=0)\n",
506
+ " weighted_f1 = f1_score(labels, preds, average=\"weighted\", zero_division=0)\n",
507
+ " accuracy = (preds == labels).mean()\n",
508
+ " \n",
509
+ " return {\n",
510
+ " \"accuracy\": accuracy,\n",
511
+ " \"micro_f1\": micro_f1,\n",
512
+ " \"macro_f1\": macro_f1,\n",
513
+ " \"weighted_f1\": weighted_f1,\n",
514
+ " }\n",
515
+ "\n",
516
+ "\n",
517
+ "def compute_metrics_cuad_detailed(eval_pred):\n",
518
+ " \"\"\"Detailed metrics for CUAD — includes per-class F1.\"\"\"\n",
519
+ " logits, labels = eval_pred.predictions, eval_pred.label_ids\n",
520
+ " preds = np.argmax(logits, axis=-1)\n",
521
+ " \n",
522
+ " micro_f1 = f1_score(labels, preds, average=\"micro\", zero_division=0)\n",
523
+ " macro_f1 = f1_score(labels, preds, average=\"macro\", zero_division=0)\n",
524
+ " weighted_f1 = f1_score(labels, preds, average=\"weighted\", zero_division=0)\n",
525
+ " accuracy = (preds == labels).mean()\n",
526
+ " \n",
527
+ " # Per-class F1\n",
528
+ " per_class_f1 = f1_score(labels, preds, average=None, zero_division=0)\n",
529
+ " class_metrics = {}\n",
530
+ " for i, f1_val in enumerate(per_class_f1):\n",
531
+ " if i < len(CUAD_LABELS):\n",
532
+ " # Truncate label name for cleaner logging\n",
533
+ " safe_name = CUAD_LABELS[i][:20].replace(\" \", \"_\").replace(\"/\", \"_\")\n",
534
+ " class_metrics[f\"f1_{safe_name}\"] = float(f1_val)\n",
535
+ " \n",
536
+ " return {\n",
537
+ " \"accuracy\": accuracy,\n",
538
+ " \"micro_f1\": micro_f1,\n",
539
+ " \"macro_f1\": macro_f1,\n",
540
+ " \"weighted_f1\": weighted_f1,\n",
541
+ " **class_metrics,\n",
542
+ " }\n",
543
+ "\n",
544
+ "\n",
545
+ "print(\"✅ Metrics functions defined\")"
546
+ ],
547
+ "metadata": {},
548
+ "execution_count": null,
549
+ "outputs": []
550
+ },
551
+ {
552
+ "cell_type": "markdown",
553
+ "source": [
554
+ "---\n",
555
+ "# 🏋️ STAGE 1: Pre-fine-tune on LEDGAR\n",
556
+ "\n",
557
+ "**Goal:** Teach DeBERTa-v3-large what types of contract clauses exist (100 classes, ~60K examples).\n",
558
+ "\n",
559
+ "This stage uses standard cross-entropy loss since LEDGAR is well-balanced.\n",
560
+ "\n",
561
+ "**Expected:** ~85-90% micro-F1 after 3-5 epochs (~3-5 hours on T4, ~1-2 hours on A100)"
562
+ ],
563
+ "metadata": {}
564
+ },
565
+ {
566
+ "cell_type": "code",
567
+ "source": [
568
+ "from transformers import (\n",
569
+ " AutoConfig,\n",
570
+ " AutoModelForSequenceClassification,\n",
571
+ " TrainingArguments,\n",
572
+ " DataCollatorWithPadding,\n",
573
+ " EarlyStoppingCallback,\n",
574
+ ")\n",
575
+ "\n",
576
+ "print(f\"🏋️ STAGE 1: Pre-fine-tune on LEDGAR ({num_ledgar_labels} classes)\")\n",
577
+ "print(f\" Loading {BASE_MODEL}...\")\n",
578
+ "\n",
579
+ "# Load model for Stage 1 (100 classes, single-label)\n",
580
+ "stage1_model = AutoModelForSequenceClassification.from_pretrained(\n",
581
+ " BASE_MODEL,\n",
582
+ " num_labels=num_ledgar_labels,\n",
583
+ " problem_type=\"single_label_classification\",\n",
584
+ " ignore_mismatched_sizes=True,\n",
585
+ ")\n",
586
+ "\n",
587
+ "total_params = sum(p.numel() for p in stage1_model.parameters())\n",
588
+ "trainable_params = sum(p.numel() for p in stage1_model.parameters() if p.requires_grad)\n",
589
+ "print(f\" Total parameters: {total_params:,}\")\n",
590
+ "print(f\" Trainable parameters: {trainable_params:,}\")\n",
591
+ "\n",
592
+ "stage1_args = TrainingArguments(\n",
593
+ " output_dir=\"./stage1_ledgar\",\n",
594
+ " num_train_epochs=STAGE1_EPOCHS,\n",
595
+ " per_device_train_batch_size=STAGE1_BATCH,\n",
596
+ " per_device_eval_batch_size=4,\n",
597
+ " gradient_accumulation_steps=STAGE1_GRAD_ACCUM,\n",
598
+ " learning_rate=STAGE1_LR,\n",
599
+ " weight_decay=WEIGHT_DECAY,\n",
600
+ " warmup_ratio=WARMUP_RATIO,\n",
601
+ " lr_scheduler_type=\"cosine\",\n",
602
+ " eval_strategy=\"epoch\",\n",
603
+ " save_strategy=\"epoch\",\n",
604
+ " save_total_limit=2,\n",
605
+ " load_best_model_at_end=True,\n",
606
+ " metric_for_best_model=\"macro_f1\",\n",
607
+ " greater_is_better=True,\n",
608
+ " bf16=False, # DeBERTa-v3 breaks with fp16 gradient scaler; fp32 is safest on T4\n",
609
+ " fp16=False,\n",
610
+ " logging_strategy=\"steps\",\n",
611
+ " logging_steps=50,\n",
612
+ " logging_first_step=True,\n",
613
+ " disable_tqdm=False,\n",
614
+ " report_to=\"none\",\n",
615
+ " dataloader_num_workers=2,\n",
616
+ " seed=SEED,\n",
617
+ " gradient_checkpointing=True, # Critical for T4 (16GB VRAM)\n",
618
+ ")\n",
619
+ "\n",
620
+ "stage1_trainer = Trainer(\n",
621
+ " model=stage1_model,\n",
622
+ " args=stage1_args,\n",
623
+ " train_dataset=ledgar_tokenized[\"train\"],\n",
624
+ " eval_dataset=ledgar_tokenized[\"validation\"],\n",
625
+ " processing_class=tokenizer,\n",
626
+ " data_collator=DataCollatorWithPadding(tokenizer=tokenizer),\n",
627
+ " compute_metrics=compute_metrics_single_label,\n",
628
+ " callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],\n",
629
+ ")\n",
630
+ "\n",
631
+ "print(\"\\n🚀 Starting Stage 1 training...\")\n",
632
+ "stage1_result = stage1_trainer.train()\n",
633
+ "print(f\"\\n✅ Stage 1 complete! Loss: {stage1_result.training_loss:.4f}\")"
634
+ ],
635
+ "metadata": {},
636
+ "execution_count": null,
637
+ "outputs": []
638
+ },
639
+ {
640
+ "cell_type": "code",
641
+ "source": [
642
+ "# Evaluate Stage 1 on LEDGAR test set\n",
643
+ "print(\"📊 Stage 1 — LEDGAR Test Evaluation\")\n",
644
+ "stage1_test = stage1_trainer.evaluate(ledgar_tokenized[\"test\"])\n",
645
+ "print(f\" Accuracy: {stage1_test['eval_accuracy']:.4f}\")\n",
646
+ "print(f\" Micro-F1: {stage1_test['eval_micro_f1']:.4f}\")\n",
647
+ "print(f\" Macro-F1: {stage1_test['eval_macro_f1']:.4f}\")\n",
648
+ "print(f\" Weighted-F1: {stage1_test['eval_weighted_f1']:.4f}\")\n",
649
+ "\n",
650
+ "# Save Stage 1 checkpoint\n",
651
+ "STAGE1_CHECKPOINT = \"./stage1_ledgar_best\"\n",
652
+ "stage1_trainer.save_model(STAGE1_CHECKPOINT)\n",
653
+ "tokenizer.save_pretrained(STAGE1_CHECKPOINT)\n",
654
+ "print(f\"\\n💾 Stage 1 checkpoint saved to {STAGE1_CHECKPOINT}\")"
655
+ ],
656
+ "metadata": {},
657
+ "execution_count": null,
658
+ "outputs": []
659
+ },
660
+ {
661
+ "cell_type": "markdown",
662
+ "source": [
663
+ "---\n",
664
+ "# 🏋️ STAGE 2: Fine-tune on CUAD 41-class with Asymmetric Loss\n",
665
+ "\n",
666
+ "**Goal:** Learn the 41 CUAD contract clause types from the Stage 1 backbone.\n",
667
+ "\n",
668
+ "Key improvements over current ClauseGuard:\n",
669
+ "- DeBERTa-v3-large backbone pre-trained on LEDGAR (Stage 1)\n",
670
+ "- 512 tokens (vs 256) — captures full clause content\n",
671
+ "- Asymmetric Loss for class imbalance\n",
672
+ "- Full fine-tuning (no LoRA bottleneck)\n",
673
+ "\n",
674
+ "**Expected:** 75-87% macro-F1 after 10-20 epochs (~5-8 hours on T4, ~2-4 hours on A100)"
675
+ ],
676
+ "metadata": {}
677
+ },
678
+ {
679
+ "cell_type": "code",
680
+ "source": [
681
+ "# Free Stage 1 model memory before loading Stage 2\n",
682
+ "del stage1_model, stage1_trainer\n",
683
+ "torch.cuda.empty_cache()\n",
684
+ "import gc; gc.collect()\n",
685
+ "\n",
686
+ "print(f\"🏋️ STAGE 2: Fine-tune on CUAD ({NUM_CUAD_LABELS} classes) with ASL\")\n",
687
+ "\n",
688
+ "# Load Stage 1 checkpoint with new head (100 → 41 classes)\n",
689
+ "stage2_model = AutoModelForSequenceClassification.from_pretrained(\n",
690
+ " STAGE1_CHECKPOINT,\n",
691
+ " num_labels=NUM_CUAD_LABELS,\n",
692
+ " ignore_mismatched_sizes=True, # classifier head: 100 → 41\n",
693
+ " problem_type=\"single_label_classification\",\n",
694
+ ")\n",
695
+ "\n",
696
+ "print(f\" Loaded Stage 1 backbone with new {NUM_CUAD_LABELS}-class head\")\n",
697
+ "print(f\" Parameters: {sum(p.numel() for p in stage2_model.parameters()):,}\")\n",
698
+ "\n",
699
+ "# Compute class weights from training distribution\n",
700
+ "train_class_counts = Counter(cuad_train_tok['labels'])\n",
701
+ "total_samples = sum(train_class_counts.values())\n",
702
+ "class_weights = []\n",
703
+ "for i in range(NUM_CUAD_LABELS):\n",
704
+ " count = train_class_counts.get(i, 1) # avoid div by zero\n",
705
+ " # Inverse frequency weighting, capped\n",
706
+ " weight = min(10.0, total_samples / (NUM_CUAD_LABELS * count))\n",
707
+ " class_weights.append(weight)\n",
708
+ "\n",
709
+ "print(f\" Class weight range: [{min(class_weights):.2f}, {max(class_weights):.2f}]\")\n",
710
+ "\n",
711
+ "# Create ASL loss\n",
712
+ "asl_loss = AsymmetricLoss(\n",
713
+ " gamma_pos=ASL_GAMMA_POS,\n",
714
+ " gamma_neg=ASL_GAMMA_NEG,\n",
715
+ " clip=ASL_CLIP,\n",
716
+ " num_classes=NUM_CUAD_LABELS,\n",
717
+ " class_weights=class_weights,\n",
718
+ " mode=\"multi_class\", # single-label per clause\n",
719
+ ")\n",
720
+ "# Move to GPU\n",
721
+ "if torch.cuda.is_available():\n",
722
+ " asl_loss = asl_loss.cuda()\n",
723
+ "\n",
724
+ "stage2_args = TrainingArguments(\n",
725
+ " output_dir=\"./stage2_cuad\",\n",
726
+ " num_train_epochs=STAGE2_EPOCHS,\n",
727
+ " per_device_train_batch_size=STAGE2_BATCH,\n",
728
+ " per_device_eval_batch_size=4,\n",
729
+ " gradient_accumulation_steps=STAGE2_GRAD_ACCUM,\n",
730
+ " learning_rate=STAGE2_LR,\n",
731
+ " weight_decay=WEIGHT_DECAY,\n",
732
+ " warmup_ratio=WARMUP_RATIO,\n",
733
+ " lr_scheduler_type=\"cosine\",\n",
734
+ " eval_strategy=\"epoch\",\n",
735
+ " save_strategy=\"epoch\",\n",
736
+ " save_total_limit=3,\n",
737
+ " load_best_model_at_end=True,\n",
738
+ " metric_for_best_model=\"macro_f1\",\n",
739
+ " greater_is_better=True,\n",
740
+ " bf16=False, # DeBERTa-v3 breaks with fp16 gradient scaler; fp32 is safest on T4\n",
741
+ " fp16=False,\n",
742
+ " logging_strategy=\"steps\",\n",
743
+ " logging_steps=25,\n",
744
+ " logging_first_step=True,\n",
745
+ " disable_tqdm=False,\n",
746
+ " report_to=\"none\",\n",
747
+ " push_to_hub=True,\n",
748
+ " hub_model_id=HUB_MODEL_ID,\n",
749
+ " dataloader_num_workers=2,\n",
750
+ " seed=SEED,\n",
751
+ " gradient_checkpointing=True,\n",
752
+ ")\n",
753
+ "\n",
754
+ "stage2_trainer = ASLTrainer(\n",
755
+ " model=stage2_model,\n",
756
+ " args=stage2_args,\n",
757
+ " asl_loss_fn=asl_loss,\n",
758
+ " train_dataset=cuad_train_tok,\n",
759
+ " eval_dataset=cuad_val_tok,\n",
760
+ " processing_class=tokenizer,\n",
761
+ " data_collator=DataCollatorWithPadding(tokenizer=tokenizer),\n",
762
+ " compute_metrics=compute_metrics_cuad_detailed,\n",
763
+ " callbacks=[EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)],\n",
764
+ ")\n",
765
+ "\n",
766
+ "print(\"\\n🚀 Starting Stage 2 training with Asymmetric Loss...\")\n",
767
+ "stage2_result = stage2_trainer.train()\n",
768
+ "print(f\"\\n✅ Stage 2 complete! Loss: {stage2_result.training_loss:.4f}\")"
769
+ ],
770
+ "metadata": {},
771
+ "execution_count": null,
772
+ "outputs": []
773
+ },
774
+ {
775
+ "cell_type": "markdown",
776
+ "source": [
777
+ "## Step 10: Evaluate Stage 2 on CUAD Test Set"
778
+ ],
779
+ "metadata": {}
780
+ },
781
+ {
782
+ "cell_type": "code",
783
+ "source": [
784
+ "print(\"📊 Stage 2 — CUAD Test Evaluation\")\n",
785
+ "test_results = stage2_trainer.evaluate(cuad_test_tok)\n",
786
+ "\n",
787
+ "print(f\"\\n{'='*60}\")\n",
788
+ "print(f\" CUAD TEST RESULTS (DeBERTa-v3-large + LEDGAR + ASL)\")\n",
789
+ "print(f\"{'='*60}\")\n",
790
+ "print(f\" Accuracy: {test_results['eval_accuracy']:.4f}\")\n",
791
+ "print(f\" Micro-F1: {test_results['eval_micro_f1']:.4f}\")\n",
792
+ "print(f\" Macro-F1: {test_results['eval_macro_f1']:.4f}\")\n",
793
+ "print(f\" Weighted-F1: {test_results['eval_weighted_f1']:.4f}\")\n",
794
+ "print(f\"{'='*60}\")\n",
795
+ "\n",
796
+ "# Per-class F1 report\n",
797
+ "print(f\"\\n Per-class F1 scores:\")\n",
798
+ "print(f\" {'Class':<42s} {'F1':>6s}\")\n",
799
+ "print(f\" {'-'*48}\")\n",
800
+ "\n",
801
+ "zero_f1_classes = []\n",
802
+ "for i, label_name in enumerate(CUAD_LABELS):\n",
803
+ " safe_name = label_name[:20].replace(\" \", \"_\").replace(\"/\", \"_\")\n",
804
+ " key = f\"eval_f1_{safe_name}\"\n",
805
+ " f1_val = test_results.get(key, 0.0)\n",
806
+ " bar = '█' * int(f1_val * 30)\n",
807
+ " status = \"\" if f1_val > 0 else \" ← ZERO\"\n",
808
+ " print(f\" {i:2d} {label_name:<40s} {f1_val:.4f} {bar}{status}\")\n",
809
+ " if f1_val == 0:\n",
810
+ " zero_f1_classes.append(label_name)\n",
811
+ "\n",
812
+ "print(f\"\\n Classes with zero F1: {len(zero_f1_classes)}\")\n",
813
+ "if zero_f1_classes:\n",
814
+ " for c in zero_f1_classes:\n",
815
+ " print(f\" ⚠️ {c}\")"
816
+ ],
817
+ "metadata": {},
818
+ "execution_count": null,
819
+ "outputs": []
820
+ },
821
+ {
822
+ "cell_type": "markdown",
823
+ "source": [
824
+ "## Step 11: Full Classification Report"
825
+ ],
826
+ "metadata": {}
827
+ },
828
+ {
829
+ "cell_type": "code",
830
+ "source": [
831
+ "# Generate full sklearn classification report\n",
832
+ "from sklearn.metrics import classification_report\n",
833
+ "\n",
834
+ "# Get predictions on test set\n",
835
+ "preds_output = stage2_trainer.predict(cuad_test_tok)\n",
836
+ "preds = np.argmax(preds_output.predictions, axis=-1)\n",
837
+ "labels = preds_output.label_ids\n",
838
+ "\n",
839
+ "# Only include labels that appear in test set\n",
840
+ "present_labels = sorted(set(labels) | set(preds))\n",
841
+ "target_names = [CUAD_LABELS[i] if i < len(CUAD_LABELS) else f\"Class-{i}\" for i in present_labels]\n",
842
+ "\n",
843
+ "report = classification_report(\n",
844
+ " labels, preds,\n",
845
+ " labels=present_labels,\n",
846
+ " target_names=target_names,\n",
847
+ " zero_division=0,\n",
848
+ " digits=4,\n",
849
+ ")\n",
850
+ "print(\"\\n📊 Full Classification Report:\")\n",
851
+ "print(report)"
852
+ ],
853
+ "metadata": {},
854
+ "execution_count": null,
855
+ "outputs": []
856
+ },
857
+ {
858
+ "cell_type": "markdown",
859
+ "source": [
860
+ "## Step 12: Push Final Model to Hub"
861
+ ],
862
+ "metadata": {}
863
+ },
864
+ {
865
+ "cell_type": "code",
866
+ "source": [
867
+ "# Save model with proper label mapping\n",
868
+         "stage2_model.config.id2label = {i: name for i, name in enumerate(CUAD_LABELS)}  # int keys for in-process lookup\n",
869
+ "stage2_model.config.label2id = {name: i for i, name in enumerate(CUAD_LABELS)}\n",
870
+ "\n",
871
+ "# Save locally\n",
872
+ "FINAL_DIR = \"./clauseguard-deberta-final\"\n",
873
+ "stage2_trainer.save_model(FINAL_DIR)\n",
874
+ "tokenizer.save_pretrained(FINAL_DIR)\n",
875
+ "\n",
876
+ "# Push to Hub\n",
877
+ "print(f\"\\n☁️ Pushing model to Hub: {HUB_MODEL_ID}\")\n",
878
+ "stage2_trainer.push_to_hub(\n",
879
+ " commit_message=(\n",
880
+ " f\"ClauseGuard v4: DeBERTa-v3-large 2-stage (LEDGAR→CUAD) with ASL\\n\"\n",
881
+ " f\"CUAD Test: micro-F1={test_results['eval_micro_f1']:.4f}, \"\n",
882
+ " f\"macro-F1={test_results['eval_macro_f1']:.4f}\"\n",
883
+ " )\n",
884
+ ")\n",
885
+ "\n",
886
+ "print(f\"\\n✅ Model pushed to: https://huggingface.co/{HUB_MODEL_ID}\")"
887
+ ],
888
+ "metadata": {},
889
+ "execution_count": null,
890
+ "outputs": []
891
+ },
892
+ {
893
+ "cell_type": "markdown",
894
+ "source": [
895
+ "## Step 13: Test the Model on Sample Clauses"
896
+ ],
897
+ "metadata": {}
898
+ },
899
+ {
900
+ "cell_type": "code",
901
+ "source": [
902
+ "from transformers import pipeline as hf_pipeline\n",
903
+ "\n",
904
+ "# Load the trained model for inference\n",
905
+ "classifier = hf_pipeline(\n",
906
+ " \"text-classification\",\n",
907
+ " model=stage2_model,\n",
908
+ " tokenizer=tokenizer,\n",
909
+ " top_k=5, # return top 5 predictions\n",
910
+ " device=0 if torch.cuda.is_available() else -1,\n",
911
+ ")\n",
912
+ "\n",
913
+ "test_clauses = [\n",
914
+ " # High-risk clauses\n",
915
+ " \"The Company may terminate this Agreement at any time, with or without cause, upon written notice to the other party.\",\n",
916
+ " \"In no event shall the Company be liable for any indirect, incidental, special, or consequential damages arising out of this Agreement.\",\n",
917
+ " \"All intellectual property developed during the term of this Agreement shall be owned exclusively by the Company.\",\n",
918
+ " \"This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware.\",\n",
919
+ " \"Any disputes arising out of this Agreement shall be resolved through binding arbitration in New York.\",\n",
920
+ " \"The Employee agrees not to compete with the Company for a period of two (2) years following termination.\",\n",
921
+ " # Neutral clauses\n",
922
+ " \"This Agreement shall be effective as of January 1, 2024.\",\n",
923
+ " \"The initial term of this Agreement shall be three (3) years.\",\n",
924
+ " \"Either party may assign this Agreement with the prior written consent of the other party.\",\n",
925
+ "]\n",
926
+ "\n",
927
+ "print(\"🧪 Testing model on sample clauses:\\n\")\n",
928
+ "for clause in test_clauses:\n",
929
+ " results = classifier(clause, truncation=True, max_length=MAX_LENGTH)\n",
930
+ " top = results[0] if isinstance(results[0], dict) else results[0][0]\n",
931
+ " top3 = results[:3] if isinstance(results[0], dict) else results[0][:3]\n",
932
+ " \n",
933
+ " print(f\"📄 \\\"{clause[:90]}{'...' if len(clause) > 90 else ''}\\\"\")\n",
934
+ " for r in top3:\n",
935
+ " score = r['score']\n",
936
+ " bar = '█' * int(score * 20)\n",
937
+ " print(f\" → {r['label']:40s} {score:.4f} {bar}\")\n",
938
+ " print()"
939
+ ],
940
+ "metadata": {},
941
+ "execution_count": null,
942
+ "outputs": []
943
+ },
944
+ {
945
+ "cell_type": "markdown",
946
+ "source": [
947
+ "## Step 14: Generate Updated app.py Integration Code\n",
948
+ "\n",
949
+ "Copy-paste this into your ClauseGuard Space's `app.py` to use the new model."
950
+ ],
951
+ "metadata": {}
952
+ },
953
+ {
954
+ "cell_type": "code",
955
+ "source": [
956
+ "integration_code = f'''\n",
957
+ "# ═══════════════════════════════════════════════════════════════\n",
958
+ "# ClauseGuard v4 — Integration Code\n",
959
+ "# Replace the model loading section in app.py with this:\n",
960
+ "# ═══════════════════════════════════════════════════════════════\n",
961
+ "\n",
962
+ "# OLD (remove these):\n",
963
+ "# base = \"nlpaueb/legal-bert-base-uncased\"\n",
964
+ "# adapter = \"Mokshith31/legalbert-contract-clause-classification\"\n",
965
+ "# from peft import PeftModel\n",
966
+ "\n",
967
+ "# NEW:\n",
968
+ "CLAUSEGUARD_MODEL = \"{HUB_MODEL_ID}\"\n",
969
+ "\n",
970
+ "def _load_cuad_model():\n",
971
+ " global cuad_tokenizer, cuad_model, _model_status\n",
972
+ " if not _HAS_TORCH:\n",
973
+ " _model_status[\"cuad\"] = \"unavailable\"\n",
974
+ " return\n",
975
+ " try:\n",
976
+ " print(f\"[ClauseGuard] Loading classifier: {{CLAUSEGUARD_MODEL}}\")\n",
977
+ " cuad_tokenizer = AutoTokenizer.from_pretrained(CLAUSEGUARD_MODEL)\n",
978
+ " cuad_model = AutoModelForSequenceClassification.from_pretrained(CLAUSEGUARD_MODEL)\n",
979
+ " cuad_model.eval()\n",
980
+ " _model_status[\"cuad\"] = \"loaded\"\n",
981
+ " print(f\"[ClauseGuard] Model loaded: {{sum(p.numel() for p in cuad_model.parameters()):,}} params\")\n",
982
+ " except Exception as e:\n",
983
+ " print(f\"[ClauseGuard] Model load failed: {{e}}\")\n",
984
+ " _model_status[\"cuad\"] = f\"failed: {{e}}\"\n",
985
+ "\n",
986
+ "# In classify_cuad(), change max_length:\n",
987
+ "# max_length=256 → max_length=512\n",
988
+ "#\n",
989
+ "# Also: since the new model is single-label (softmax),\n",
990
+ "# change the prediction logic from sigmoid to:\n",
991
+ "#\n",
992
+ "# probs = torch.softmax(logits, dim=-1)[0] # instead of sigmoid\n",
993
+ "# top_indices = torch.argsort(probs, descending=True)[:5]\n",
994
+ "# for i in top_indices:\n",
995
+ "# if float(probs[i]) > 0.10: # confidence threshold\n",
996
+ "# label = CUAD_LABELS[i]\n",
997
+ "# ...\n",
998
+ "\n",
999
+ "# No more PEFT dependency needed!\n",
1000
+ "# No more ignore_mismatched_sizes!\n",
1001
+ "# Just load directly — the model already has the correct head.\n",
1002
+ "'''\n",
1003
+ "\n",
1004
+ "print(integration_code)"
1005
+ ],
1006
+ "metadata": {},
1007
+ "execution_count": null,
1008
+ "outputs": []
1009
+ },
1010
+ {
1011
+ "cell_type": "markdown",
1012
+ "source": [
1013
+ "## Step 15: Comparison with Current Model\n",
1014
+ "\n",
1015
+ "| Metric | Current (Legal-BERT + LoRA) | New (DeBERTa-v3-large + ASL) |\n",
1016
+ "|--------|---------------------------|-----------------------------|\n",
1017
+ "| Base model | 110M params | 435M params |\n",
1018
+ "| Training | LoRA (frozen backbone) | Full fine-tune |\n",
1019
+ "| Pre-training | None | LEDGAR (60K, 100 classes) |\n",
1020
+ "| Max tokens | 256 | 512 |\n",
1021
+ "| Loss function | Cross-entropy | Asymmetric Loss |\n",
1022
+ "| Zero-F1 classes | 10 of 41 | TBD (should be much fewer) |\n",
1023
+ "| Macro-F1 | ~50% | Target: 78-87% |\n",
1024
+ "\n",
1025
+ "---\n",
1026
+ "\n",
1027
+ "## ✅ Done!\n",
1028
+ "\n",
1029
+ "Your trained model is at: **https://huggingface.co/gaurv007/clauseguard-deberta-v3-large**\n",
1030
+ "\n",
1031
+ "### Next Steps:\n",
1032
+ "1. Update ClauseGuard Space to use this model (see integration code above)\n",
1033
+ "2. Remove PEFT dependency from requirements.txt\n",
1034
+ "3. Consider training SetFit classifiers for any remaining zero-F1 classes\n",
1035
+ "4. Add OCR support (Feature #2)\n",
1036
+ "5. Add RAG chatbot (Feature #3)"
1037
+ ],
1038
+ "metadata": {}
1039
+ }
1040
+ ]
1041
+ }
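The asymmetric (focal-style) loss from Step 7 can be sanity-checked outside the notebook. The sketch below reimplements just the multi-class path in a standalone class (the name `AsymmetricLossMC` and the demo logits are ours, not from the notebook) and confirms that a confidently-correct example contributes far less loss than an uncertain one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricLossMC(nn.Module):
    """Standalone copy of the notebook's multi-class variant:
    cross-entropy modulated by a focal weight (1 - p_t)^gamma_neg."""

    def __init__(self, gamma_neg=4):
        super().__init__()
        self.gamma_neg = gamma_neg

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        # probability assigned to the correct class
        p_t = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
        return (((1 - p_t) ** self.gamma_neg) * ce).mean()


loss_fn = AsymmetricLossMC(gamma_neg=4)
target = torch.tensor([0])
easy = torch.tensor([[6.0, 0.0, 0.0]])   # confident, correct prediction
hard = torch.tensor([[0.5, 0.0, 0.0]])   # uncertain prediction
l_easy = loss_fn(easy, target).item()
l_hard = loss_fn(hard, target).item()
print(l_easy < l_hard)  # easy example is down-weighted to near zero
```

Down-weighting easy examples this aggressively is what lets the rare CUAD classes dominate the gradient once the frequent classes are learned.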
ml/requirements.txt CHANGED
@@ -1,6 +1,6 @@
1
- transformers==5.6.1
2
  datasets>=3.2.0
3
  torch>=2.5.0
4
  scikit-learn>=1.6.0
5
  accelerate>=1.2.0
6
- optimum[onnxruntime]>=1.24.0
 
1
+ transformers>=5.6.0
2
  datasets>=3.2.0
3
  torch>=2.5.0
4
  scikit-learn>=1.6.0
5
  accelerate>=1.2.0
6
+ huggingface_hub>=0.27.0
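Step 14 of the notebook notes that the new model is single-label, so inference switches from sigmoid to softmax with a top-k confidence cutoff. A minimal standalone sketch of that prediction logic (the label list and logit values below are invented for illustration):

```python
import torch

# Invented demo labels and logits; the real model has 41 CUAD classes
CUAD_LABELS_DEMO = ["Governing Law", "Non-Compete", "Cap on Liability", "Other"]
logits = torch.tensor([2.0, 0.1, -1.0, 0.5])

probs = torch.softmax(logits, dim=-1)              # single-label head: softmax, not sigmoid
top_indices = torch.argsort(probs, descending=True)[:3]
preds = [
    (CUAD_LABELS_DEMO[int(i)], float(probs[i]))
    for i in top_indices
    if float(probs[i]) > 0.10                      # confidence threshold from Step 14
]
for label, p in preds:
    print(f"{label}: {p:.3f}")
```

With softmax the scores sum to 1 across classes, so the 0.10 threshold prunes the long tail of the distribution rather than acting per-class as a sigmoid threshold would.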
ml/train_classifier_v4.py ADDED
@@ -0,0 +1,434 @@
1
+ """
2
+ ClauseGuard v4 — 2-Stage DeBERTa-v3-large Training Script
3
+ ═══════════════════════════════════════════════════════════
4
+
5
+ Stage 1: Pre-fine-tune on LEDGAR (60K legal provisions, 100 classes)
6
+ Stage 2: Fine-tune on CUAD (41 classes) with Asymmetric Loss
7
+
8
+ Usage:
9
+ python train_classifier_v4.py # Full 2-stage pipeline
10
+ python train_classifier_v4.py --stage 1 # Stage 1 only
11
+ python train_classifier_v4.py --stage 2 --checkpoint ./stage1_ledgar_best # Stage 2 only
12
+
13
+ Requirements:
14
+ pip install transformers datasets scikit-learn accelerate torch
15
+
16
+ Hardware: A100 80GB recommended (~4-6 hours total)
17
+ """
18
+
19
+ import os
20
+ import gc
21
+ import argparse
22
+ import json
23
+ from collections import Counter
24
+ from datetime import datetime
25
+
26
+ import numpy as np
27
+ import torch
28
+ import torch.nn as nn
29
+ import torch.nn.functional as F
30
+ from datasets import load_dataset, Dataset
31
+ from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
32
+ from sklearn.model_selection import train_test_split
33
+ from transformers import (
34
+ AutoConfig,
35
+ AutoModelForSequenceClassification,
36
+ AutoTokenizer,
37
+ DataCollatorWithPadding,
38
+ Trainer,
39
+ TrainingArguments,
40
+ EarlyStoppingCallback,
41
+ )
42
+
43
+
44
+ # ═══════════════════════════════════════════════════════════════
45
+ # CONFIGURATION
46
+ # ═══════════════════════════════════════════════════════════════
47
+
48
+ BASE_MODEL = os.environ.get("BASE_MODEL", "microsoft/deberta-v3-large")
49
+ MAX_LENGTH = int(os.environ.get("MAX_LENGTH", "512"))
50
+ HUB_MODEL_ID = os.environ.get("HUB_MODEL_ID", "gaurv007/clauseguard-deberta-v3-large")
51
+ PUSH_TO_HUB = os.environ.get("PUSH_TO_HUB", "true").lower() == "true"
52
+ SEED = 42
53
+
54
+ CUAD_LABELS = [
55
+ "Document Name", "Parties", "Agreement Date", "Effective Date",
56
+ "Expiration Date", "Renewal Term", "Notice Period to Terminate Renewal",
57
+ "Governing Law", "Most Favored Nation", "Non-Compete", "Exclusivity",
58
+ "No-Solicit of Customers", "No-Solicit of Employees", "Non-Disparagement",
59
+ "Termination for Convenience", "ROFR/ROFO/ROFN", "Change of Control",
60
+ "Anti-Assignment", "Revenue/Profit Sharing", "Price Restriction",
61
+ "Minimum Commitment", "Volume Restriction", "IP Ownership Assignment",
62
+ "Joint IP Ownership", "License Grant", "Non-Transferable License",
63
+ "Affiliate License-Licensor", "Affiliate License-Licensee",
64
+ "Unlimited/All-You-Can-Eat License", "Irrevocable or Perpetual License",
65
+ "Source Code Escrow", "Post-Termination Services", "Audit Rights",
66
+ "Uncapped Liability", "Cap on Liability", "Liquidated Damages",
67
+ "Warranty Duration", "Insurance", "Covenant Not to Sue",
68
+ "Third Party Beneficiary", "Other",
69
+ ]
70
+ NUM_CUAD_LABELS = len(CUAD_LABELS)
71
+
72
+
73
+ # ═══════════════════════════════════════════════════════════════
74
+ # ASYMMETRIC LOSS (arxiv:2009.14119)
75
+ # ═══════════════════════════════════════════════════════════════
76
+
77
+ class AsymmetricLoss(nn.Module):
78
+ """Focal-style loss with asymmetric gamma for class imbalance."""
79
+
80
+ def __init__(self, gamma_pos=0, gamma_neg=4, clip=0.05, eps=1e-8,
81
+ class_weights=None):
82
+ super().__init__()
83
+ self.gamma_pos = gamma_pos
84
+ self.gamma_neg = gamma_neg
85
+ self.clip = clip
86
+ self.eps = eps
87
+ if class_weights is not None:
88
+ self.register_buffer('class_weights',
89
+ torch.tensor(class_weights, dtype=torch.float32))
90
+ else:
91
+ self.class_weights = None
92
+
93
+ def forward(self, logits, targets):
94
+ """Multi-class focal cross-entropy with class weights."""
95
+ if self.class_weights is not None:
96
+ ce_loss = F.cross_entropy(logits, targets, weight=self.class_weights,
97
+ reduction='none')
98
+ else:
99
+ ce_loss = F.cross_entropy(logits, targets, reduction='none')
100
+
101
+ probs = F.softmax(logits, dim=-1)
102
+ p_t = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
103
+ focal_weight = (1 - p_t) ** self.gamma_neg
104
+ loss = focal_weight * ce_loss
105
+ return loss.mean()
106
+
107
+
108
+ # ═══════════════════════════════════════════════════════════════
109
+ # CUSTOM TRAINER
110
+ # ═══════════════════════════════════════════════════════════════
111
+
112
+ class ASLTrainer(Trainer):
113
+ def __init__(self, *args, asl_loss_fn=None, **kwargs):
114
+ super().__init__(*args, **kwargs)
115
+ self.asl = asl_loss_fn
116
+
117
+ def compute_loss(self, model, inputs, return_outputs=False,
118
+ num_items_in_batch=None):
119
+ labels = inputs.pop("labels")
120
+ outputs = model(**inputs)
121
+ logits = outputs.logits
122
+ if self.asl is not None:
123
+ loss = self.asl(logits, labels)
124
+ else:
125
+ loss = F.cross_entropy(logits, labels)
126
+ return (loss, outputs) if return_outputs else loss
127
+
128
+
+ # ═══════════════════════════════════════════════════════════════
+ # METRICS
+ # ═══════════════════════════════════════════════════════════════
+
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred.predictions, eval_pred.label_ids
+     preds = np.argmax(logits, axis=-1)
+     return {
+         "accuracy": (preds == labels).mean(),
+         "micro_f1": f1_score(labels, preds, average="micro", zero_division=0),
+         "macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
+         "weighted_f1": f1_score(labels, preds, average="weighted", zero_division=0),
+     }
+
+
+ # ═══════════════════════════════════════════════════════════════
+ # STAGE 1: LEDGAR
+ # ═══════════════════════════════════════════════════════════════
+
+ def run_stage1(tokenizer, output_dir="./stage1_ledgar_best"):
+     print("\n" + "=" * 60)
+     print(" STAGE 1: Pre-fine-tune on LEDGAR (100 classes)")
+     print("=" * 60)
+
+     ledgar = load_dataset("coastalcph/lex_glue", "ledgar")
+     num_labels = ledgar['train'].features['label'].num_classes
+     print(f" Train: {len(ledgar['train']):,} | Val: {len(ledgar['validation']):,}")
+     print(f" Classes: {num_labels}")
+
+     def preprocess(examples):
+         tok = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH,
+                         padding=False)
+         tok["labels"] = examples["label"]
+         return tok
+
+     tokenized = ledgar.map(preprocess, batched=True,
+                            remove_columns=ledgar["train"].column_names)
+
+     model = AutoModelForSequenceClassification.from_pretrained(
+         BASE_MODEL, num_labels=num_labels,
+         problem_type="single_label_classification",
+         ignore_mismatched_sizes=True,
+     )
+     print(f" Parameters: {sum(p.numel() for p in model.parameters()):,}")
+
+     args = TrainingArguments(
+         output_dir="./stage1_ledgar",
+         num_train_epochs=5,
+         per_device_train_batch_size=8,
+         per_device_eval_batch_size=16,
+         gradient_accumulation_steps=4,
+         learning_rate=2e-5,
+         weight_decay=0.06,
+         warmup_ratio=0.1,
+         lr_scheduler_type="cosine",
+         eval_strategy="epoch",
+         save_strategy="epoch",
+         save_total_limit=2,
+         load_best_model_at_end=True,
+         metric_for_best_model="macro_f1",
+         greater_is_better=True,
+         bf16=torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8,
+         fp16=torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 8,
+         logging_strategy="steps",
+         logging_steps=50,
+         logging_first_step=True,
+         disable_tqdm=True,
+         report_to="none",
+         dataloader_num_workers=2,
+         seed=SEED,
+         gradient_checkpointing=True,
+     )
+
+     trainer = Trainer(
+         model=model, args=args,
+         train_dataset=tokenized["train"],
+         eval_dataset=tokenized["validation"],
+         processing_class=tokenizer,
+         data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
+         compute_metrics=compute_metrics,
+         callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
+     )
+
+     result = trainer.train()
+     print(f"\n Stage 1 training loss: {result.training_loss:.4f}")
+
+     test_metrics = trainer.evaluate(tokenized["test"])
+     print(f" Stage 1 test micro-F1: {test_metrics['eval_micro_f1']:.4f}")
+     print(f" Stage 1 test macro-F1: {test_metrics['eval_macro_f1']:.4f}")
+
+     trainer.save_model(output_dir)
+     tokenizer.save_pretrained(output_dir)
+     print(f" Saved to {output_dir}")
+
+     del model, trainer
+     torch.cuda.empty_cache()
+     gc.collect()
+
+     return output_dir
+
+
+ # ═══════════════════════════════════════════════════════════════
+ # STAGE 2: CUAD
+ # ═══════════════════════════════════════════════════════════════
+
+ def run_stage2(tokenizer, checkpoint_path, output_dir="./clauseguard-deberta-final"):
+     print("\n" + "=" * 60)
+     print(f" STAGE 2: Fine-tune on CUAD ({NUM_CUAD_LABELS} classes) with ASL")
+     print("=" * 60)
+
+     # Load CUAD and split by contract file so no document leaks across splits
+     cuad_raw = load_dataset(
+         "dvgodoy/CUAD_v1_Contract_Understanding_clause_classification",
+         split="train"
+     )
+     cuad_df = cuad_raw.to_pandas()
+
+     unique_files = cuad_df['file_name'].unique()
+     train_files, test_files = train_test_split(unique_files, test_size=0.2,
+                                                random_state=SEED)
+     val_files, test_files = train_test_split(test_files, test_size=0.5,
+                                              random_state=SEED)
+
+     splits = {
+         "train": Dataset.from_pandas(
+             cuad_df[cuad_df['file_name'].isin(train_files)].reset_index(drop=True)
+         ),
+         "val": Dataset.from_pandas(
+             cuad_df[cuad_df['file_name'].isin(val_files)].reset_index(drop=True)
+         ),
+         "test": Dataset.from_pandas(
+             cuad_df[cuad_df['file_name'].isin(test_files)].reset_index(drop=True)
+         ),
+     }
+
+     for name, ds in splits.items():
+         print(f" {name}: {len(ds)} rows")
+
+     def preprocess_cuad(examples):
+         tok = tokenizer(examples["clause"], truncation=True, max_length=MAX_LENGTH,
+                         padding=False)
+         tok["labels"] = examples["class_id"]
+         return tok
+
+     tok_splits = {}
+     for name, ds in splits.items():
+         tok_splits[name] = ds.map(preprocess_cuad, batched=True,
+                                   remove_columns=ds.column_names)
+
+     # Load model from Stage 1 checkpoint
+     model = AutoModelForSequenceClassification.from_pretrained(
+         checkpoint_path,
+         num_labels=NUM_CUAD_LABELS,
+         ignore_mismatched_sizes=True,
+         problem_type="single_label_classification",
+     )
+
+     # Update label mapping
+     model.config.id2label = {str(i): name for i, name in enumerate(CUAD_LABELS)}
+     model.config.label2id = {name: i for i, name in enumerate(CUAD_LABELS)}
+
+     print(f" Parameters: {sum(p.numel() for p in model.parameters()):,}")
+
+     # Compute inverse-frequency class weights, capped at 10x
+     train_counts = Counter(tok_splits["train"]["labels"])
+     total = sum(train_counts.values())
+     class_weights = []
+     for i in range(NUM_CUAD_LABELS):
+         count = train_counts.get(i, 1)
+         weight = min(10.0, total / (NUM_CUAD_LABELS * count))
+         class_weights.append(weight)
+
+     asl = AsymmetricLoss(gamma_pos=0, gamma_neg=4, clip=0.05,
+                          class_weights=class_weights)
+     if torch.cuda.is_available():
+         asl = asl.cuda()
+
+     args = TrainingArguments(
+         output_dir="./stage2_cuad",
+         num_train_epochs=20,
+         per_device_train_batch_size=8,
+         per_device_eval_batch_size=16,
+         gradient_accumulation_steps=4,
+         learning_rate=1e-5,
+         weight_decay=0.06,
+         warmup_ratio=0.1,
+         lr_scheduler_type="cosine",
+         eval_strategy="epoch",
+         save_strategy="epoch",
+         save_total_limit=3,
+         load_best_model_at_end=True,
+         metric_for_best_model="macro_f1",
+         greater_is_better=True,
+         bf16=torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8,
+         fp16=torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 8,
+         logging_strategy="steps",
+         logging_steps=25,
+         logging_first_step=True,
+         disable_tqdm=True,
+         report_to="none",
+         push_to_hub=PUSH_TO_HUB,
+         hub_model_id=HUB_MODEL_ID if PUSH_TO_HUB else None,
+         dataloader_num_workers=2,
+         seed=SEED,
+         gradient_checkpointing=True,
+     )
+
+     trainer = ASLTrainer(
+         model=model, args=args,
+         asl_loss_fn=asl,
+         train_dataset=tok_splits["train"],
+         eval_dataset=tok_splits["val"],
+         processing_class=tokenizer,
+         data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
+         compute_metrics=compute_metrics,
+         callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
+     )
+
+     result = trainer.train()
+     print(f"\n Stage 2 training loss: {result.training_loss:.4f}")
+
+     # Evaluate
+     test_metrics = trainer.evaluate(tok_splits["test"])
+     print(f"\n{'='*60}")
+     print(" CUAD TEST RESULTS")
+     print(f"{'='*60}")
+     print(f" Accuracy:    {test_metrics['eval_accuracy']:.4f}")
+     print(f" Micro-F1:    {test_metrics['eval_micro_f1']:.4f}")
+     print(f" Macro-F1:    {test_metrics['eval_macro_f1']:.4f}")
+     print(f" Weighted-F1: {test_metrics['eval_weighted_f1']:.4f}")
+
+     # Full per-class report
+     preds_out = trainer.predict(tok_splits["test"])
+     preds = np.argmax(preds_out.predictions, axis=-1)
+     labels = preds_out.label_ids
+     present = sorted(set(labels) | set(preds))
+     names = [CUAD_LABELS[i] if i < len(CUAD_LABELS) else f"Class-{i}" for i in present]
+     print("\n" + classification_report(labels, preds, labels=present,
+                                        target_names=names, zero_division=0, digits=4))
+
+     # Save
+     trainer.save_model(output_dir)
+     tokenizer.save_pretrained(output_dir)
+
+     if PUSH_TO_HUB:
+         trainer.push_to_hub(
+             commit_message=(
+                 f"ClauseGuard v4: DeBERTa-v3-large LEDGAR→CUAD + ASL | "
+                 f"micro-F1={test_metrics['eval_micro_f1']:.4f} "
+                 f"macro-F1={test_metrics['eval_macro_f1']:.4f}"
+             )
+         )
+         print(f"\n Pushed to https://huggingface.co/{HUB_MODEL_ID}")
+
+     # Save test results
+     results_path = os.path.join(output_dir, "test_results.json")
+     with open(results_path, "w") as f:
+         json.dump({
+             "model": HUB_MODEL_ID,
+             "base_model": BASE_MODEL,
+             "max_length": MAX_LENGTH,
+             "stage1_dataset": "coastalcph/lex_glue (ledgar)",
+             "stage2_dataset": "dvgodoy/CUAD_v1_Contract_Understanding_clause_classification",
+             "test_results": {k: float(v) for k, v in test_metrics.items()
+                              if isinstance(v, (int, float))},
+             "timestamp": datetime.now().isoformat(),
+         }, f, indent=2)
+
+     return output_dir
+
+
+ # ═══════════════════════════════════════════════════════════════
+ # MAIN
+ # ═══════════════════════════════════════════════════════════════
+
+ def main():
+     parser = argparse.ArgumentParser(description="ClauseGuard v4 Training")
+     parser.add_argument("--stage", type=int, default=0,
+                         help="Run specific stage (1 or 2). Default: both")
+     parser.add_argument("--checkpoint", type=str, default="./stage1_ledgar_best",
+                         help="Stage 1 checkpoint path for Stage 2")
+     args = parser.parse_args()
+
+     print("🛡️ ClauseGuard v4 Training")
+     print(f" Model: {BASE_MODEL}")
+     print(f" Max length: {MAX_LENGTH}")
+     print(f" Hub: {HUB_MODEL_ID}")
+     if torch.cuda.is_available():
+         print(f" GPU: {torch.cuda.get_device_name(0)}")
+         # Fix: the attribute is total_memory, not total_mem
+         print(f" VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
+
+     tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
+
+     if args.stage in (0, 1):
+         checkpoint = run_stage1(tokenizer)
+     else:
+         checkpoint = args.checkpoint
+
+     if args.stage in (0, 2):
+         run_stage2(tokenizer, checkpoint)
+
+     print("\n✅ Training complete!")
+
+
+ if __name__ == "__main__":
+     main()
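
Stage 2's capped inverse-frequency weighting can be sketched in isolation (a minimal illustration, not part of the committed script; `compute_class_weights` is a hypothetical helper name):

```python
from collections import Counter

def compute_class_weights(labels, num_classes, cap=10.0):
    """Inverse-frequency weights, capped so ultra-rare classes
    cannot dominate the loss (mirrors the Stage 2 logic)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Absent classes default to count=1, as in the training script
    return [min(cap, total / (num_classes * counts.get(i, 1)))
            for i in range(num_classes)]

weights = compute_class_weights([0, 0, 0, 1], num_classes=3)
print(weights)  # class 0 (common) gets the smallest weight
```

The cap matters on CUAD-style long tails: without it, a class seen once in tens of thousands of rows would get a weight in the hundreds and destabilize training.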
obligations.py CHANGED
@@ -120,18 +120,22 @@ def extract_obligations(text):
         if not found_types:
             continue
 
-        # Extract party
+        # Extract party (Fix 8: scope to sentence only, reject >40 char strings)
         party = "Unknown"
-        for pp in PARTY_PATTERNS:
-            m = re.search(pp, sentence)
-            if m:
-                party = m.group(0).strip()
-                break
-
-        # Try to determine which party has the obligation based on sentence structure
+        # First try structured direction detection
        obligation_direction = _detect_obligation_direction(sentence)
         if obligation_direction:
             party = obligation_direction
+        else:
+            # Fallback to pattern matching within the sentence
+            for pp in PARTY_PATTERNS:
+                m = re.search(pp, sentence)
+                if m:
+                    candidate = m.group(0).strip()
+                    # Fix 8: Reject party strings >40 chars (header bleed-through)
+                    if len(candidate) <= 40:
+                        party = candidate
+                        break
 
         # Extract timeframe
         deadline = "Not specified"
ocr_engine.py ADDED
@@ -0,0 +1,218 @@
+ """
2
+ ClauseGuard — OCR Engine v1.0
3
+ ═════════════════════════════
4
+ Smart PDF Router: detects native vs scanned PDFs.
5
+ • Native PDF → pdfplumber (fast, existing)
6
+ • Scanned PDF → docTR OCR (CPU-friendly, ~150MB models)
7
+
8
+ Architecture:
9
+ PDF uploaded
10
+
11
+ [detect_if_scanned] — pdfplumber gets <50 chars/page?
12
+ ↓ ↓
13
+ Native PDF Scanned PDF
14
+ ↓ ↓
15
+ pdfplumber docTR OCR (CPU)
16
+ ↓ ↓
17
+ Contract text → existing analysis pipeline
18
+ """
19
+
20
+ import os
21
+ import re
22
+
23
+ # ── docTR (soft-fail) ───────────────────────────────────────────────
24
+ _HAS_DOCTR = False
25
+ _ocr_predictor = None
26
+
27
+ try:
28
+ from doctr.io import DocumentFile
29
+ from doctr.models import ocr_predictor as _make_predictor
30
+ _HAS_DOCTR = True
31
+ except ImportError:
32
+ pass
33
+
34
+ # ── pdfplumber (soft-fail) ──────────────────────────────────────────
35
+ try:
36
+ import pdfplumber
37
+ _HAS_PDF = True
38
+ except ImportError:
39
+ _HAS_PDF = False
40
+
41
+ # ═══════════════════════════════════════════════════════════════════════
42
+ # OCR MODEL LOADING
43
+ # ═══════════════════════════════════════════════════════════════════════
44
+
45
+ _ocr_status = "not_loaded"
46
+
47
+ def _load_ocr_model():
48
+ """Load docTR OCR predictor (lazy, on first use)."""
49
+ global _ocr_predictor, _ocr_status
50
+ if _ocr_predictor is not None:
51
+ return _ocr_predictor
52
+ if not _HAS_DOCTR:
53
+ _ocr_status = "unavailable (python-doctr not installed)"
54
+ return None
55
+ try:
56
+ print("[ClauseGuard OCR] Loading docTR models (fast_base + crnn_vgg16_bn)...")
57
+ _ocr_predictor = _make_predictor(
58
+ det_arch="fast_base",
59
+ reco_arch="crnn_vgg16_bn",
60
+ pretrained=True,
61
+ assume_straight_pages=True,
62
+ )
63
+ _ocr_status = "loaded"
64
+ print("[ClauseGuard OCR] docTR models loaded successfully")
65
+ return _ocr_predictor
66
+ except Exception as e:
67
+ _ocr_status = f"failed: {e}"
68
+ print(f"[ClauseGuard OCR] docTR load failed: {e}")
69
+ return None
70
+
71
+
72
+ def get_ocr_status():
73
+ """Return human-readable OCR engine status."""
74
+ if _ocr_predictor is not None:
75
+ return "✅ OCR: docTR loaded"
76
+ elif _HAS_DOCTR:
77
+ return "⏳ OCR: docTR available (not yet loaded)"
78
+ else:
79
+ return "❌ OCR: unavailable (python-doctr not installed)"
80
+
81
+
82
+ # ═══════════════════════════════════════════════════════════════════════
83
+ # SMART PDF ROUTER
84
+ # ═══════════════════════════════════════════════════════════════════════
85
+
86
+ def _is_scanned_pdf(file_path, min_chars_per_page=50):
87
+ """
88
+ Detect if a PDF is scanned (image-based) by checking if pdfplumber
89
+ extracts fewer than `min_chars_per_page` characters on average.
90
+ """
91
+ if not _HAS_PDF:
92
+ return True # Can't check with pdfplumber, assume scanned
93
+ try:
94
+ with pdfplumber.open(file_path) as pdf:
95
+ if len(pdf.pages) == 0:
96
+ return True
97
+ total_chars = 0
98
+ pages_checked = min(len(pdf.pages), 5) # Check first 5 pages
99
+ for i in range(pages_checked):
100
+ page_text = pdf.pages[i].extract_text() or ""
101
+ total_chars += len(page_text.strip())
102
+ avg_chars = total_chars / pages_checked
103
+ return avg_chars < min_chars_per_page
104
+ except Exception:
105
+ return True # If pdfplumber fails, try OCR
106
+
107
+
108
+ def _extract_native_pdf(file_path):
109
+ """Extract text from a native (digital) PDF using pdfplumber."""
110
+ if not _HAS_PDF:
111
+ return None, "pdfplumber not installed"
112
+ try:
113
+ text = ""
114
+ with pdfplumber.open(file_path) as pdf:
115
+ for page in pdf.pages:
116
+ page_text = page.extract_text()
117
+ if page_text:
118
+ text += page_text + "\n\n"
119
+ if not text.strip():
120
+ return None, "No text extracted from PDF"
121
+ return text.strip(), None
122
+ except Exception as e:
123
+ return None, f"PDF parse error: {e}"
124
+
125
+
126
+ def _extract_scanned_pdf(file_path):
127
+ """Extract text from a scanned PDF using docTR OCR."""
128
+ predictor = _load_ocr_model()
129
+ if predictor is None:
130
+ return None, (
131
+ "OCR is not available. Install python-doctr: "
132
+ "`pip install python-doctr[torch]`"
133
+ )
134
+ try:
135
+ doc = DocumentFile.from_pdf(file_path)
136
+ result = predictor(doc)
137
+
138
+ # Extract text page by page
139
+ full_text = ""
140
+ for page_idx, page in enumerate(result.pages):
141
+ page_text = ""
142
+ for block in page.blocks:
143
+ for line in block.lines:
144
+ line_text = " ".join(word.value for word in line.words)
145
+ page_text += line_text + "\n"
146
+ page_text += "\n"
147
+ full_text += page_text + "\n\n"
148
+
149
+ if not full_text.strip():
150
+ return None, "OCR could not extract text from scanned PDF"
151
+
152
+ # Clean up OCR artifacts
153
+ full_text = _clean_ocr_text(full_text)
154
+ return full_text.strip(), None
155
+ except Exception as e:
156
+ return None, f"OCR error: {e}"
157
+
158
+
159
+ def _clean_ocr_text(text):
160
+ """Clean common OCR artifacts."""
161
+ # Remove excessive whitespace
162
+ text = re.sub(r'[ \t]{3,}', ' ', text)
163
+ # Fix common OCR substitutions
164
+ text = re.sub(r'\bl\b(?=[A-Z])', 'I', text) # l before capital → I
165
+ # Normalize line breaks
166
+ text = re.sub(r'\n{4,}', '\n\n\n', text)
167
+ # Remove single-char lines (OCR noise)
168
+ lines = text.split('\n')
169
+ cleaned_lines = []
170
+ for line in lines:
171
+ stripped = line.strip()
172
+ if len(stripped) <= 1 and stripped not in ('', '.', ',', ';'):
173
+ continue
174
+ cleaned_lines.append(line)
175
+ return '\n'.join(cleaned_lines)
176
+
177
+
178
+ # ═══════════════════════════════════════════════════════════════════════
179
+ # PUBLIC API
180
+ # ═══════════════════════════════════════════════════════════════════════
181
+
182
+ def parse_pdf_smart(file_path):
183
+ """
184
+ Smart PDF parser with OCR fallback.
185
+
186
+ Returns: (text, error, method)
187
+ text: extracted text (or None)
188
+ error: error message (or None)
189
+ method: "native" | "ocr" | None
190
+ """
191
+ if not os.path.exists(file_path):
192
+ return None, "File not found", None
193
+
194
+ # Step 1: Check if PDF is scanned
195
+ is_scanned = _is_scanned_pdf(file_path)
196
+
197
+ if not is_scanned:
198
+ # Step 2a: Native PDF — use pdfplumber
199
+ text, error = _extract_native_pdf(file_path)
200
+ if text:
201
+ return text, None, "native"
202
+ # If pdfplumber returns empty, fall through to OCR
203
+ print("[ClauseGuard OCR] pdfplumber returned empty — falling back to OCR")
204
+
205
+ # Step 2b: Scanned PDF or pdfplumber failed — use OCR
206
+ print(f"[ClauseGuard OCR] {'Scanned' if is_scanned else 'Empty native'} PDF detected — running docTR OCR...")
207
+ text, error = _extract_scanned_pdf(file_path)
208
+ if text:
209
+ return text, None, "ocr"
210
+ return None, error, None
211
+
212
+
213
+ def ocr_extract(file_path):
214
+ """
215
+ Force OCR extraction on a PDF (bypass native text check).
216
+ Useful when user explicitly wants OCR.
217
+ """
218
+ return _extract_scanned_pdf(file_path)
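
The routing decision itself is a simple threshold on average extracted characters. A minimal, I/O-free sketch of that core (the real `_is_scanned_pdf` wraps this in pdfplumber calls; `is_scanned` here is a hypothetical name):

```python
def is_scanned(page_char_counts, min_chars_per_page=50, sample=5):
    """Average the extracted-character counts of the first few pages;
    below the threshold, treat the PDF as scanned and route to OCR."""
    if not page_char_counts:
        return True  # empty PDF: let OCR have a try
    checked = page_char_counts[:sample]
    return sum(checked) / len(checked) < min_chars_per_page

print(is_scanned([0, 3, 10]))    # scanned  -> route to docTR OCR
print(is_scanned([1400, 1250]))  # native   -> route to pdfplumber
```

Sampling only the first few pages keeps detection fast on long contracts while still catching mixed documents that open with a scanned cover page.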
redlining.py ADDED
@@ -0,0 +1,591 @@
+ """
2
+ ClauseGuard — Clause Redlining Engine v1.0
3
+ ═══════════════════════════════════════════
4
+ 3-Tier Hybrid Architecture:
5
+ Tier 1 — Template lookup (instant, zero hallucination risk)
6
+ Tier 2 — RAG retrieval from clause corpus (find fairer precedents)
7
+ Tier 3 — LLM refinement (adapt template using retrieved precedents)
8
+
9
+ Anti-hallucination guardrails:
10
+ • Template anchor: LLM can only refine, not generate from scratch
11
+ • RAG grounding: Retrieved precedents constrain the output space
12
+ • Disclaimer: "Not legal advice. Consult an attorney before executing."
13
+ • Legal citation: Prompt requires LLM to cite the consumer protection standard applied
14
+ """
15
+
16
+ import os
17
+ import re
18
+ from collections import defaultdict
19
+
20
+ # ── HF Inference Client (soft-fail) ─────────────────────────────────
21
+ _HAS_INFERENCE = False
22
+ try:
23
+ from huggingface_hub import InferenceClient
24
+ _HAS_INFERENCE = True
25
+ except ImportError:
26
+ pass
27
+
28
+ # ═══════════════════════════════════════════════════════════════════════
29
+ # TIER 1: TEMPLATE LIBRARY (18+ clause types)
30
+ # ═══════════════════════════════════════════════════════════════════════
31
+ # Based on FTC guidelines, EU Directive 93/13, and CFPB guidance.
32
+
33
+ SAFE_ALTERNATIVES = {
34
+ # ── CRITICAL Risk Clauses ──────────────────────────────────────
35
+ "Uncapped Liability": {
36
+ "risky_pattern": "Total liability shall not exceed $1 / unlimited liability exposure",
37
+ "safe_alternative": (
38
+ "Provider's aggregate liability under this Agreement shall not exceed the total "
39
+ "fees paid by the Customer in the twelve (12) months preceding the claim. "
40
+ "This limitation shall not apply to: (a) gross negligence or willful misconduct, "
41
+ "(b) breach of confidentiality obligations, (c) intellectual property indemnification "
42
+ "obligations, or (d) violations of applicable law."
43
+ ),
44
+ "legal_basis": "UCC § 2-719; Restatement (Second) of Contracts § 356",
45
+ "consumer_standard": "FTC guidelines on unconscionable contract terms",
46
+ "risk_level": "CRITICAL",
47
+ },
48
+ "Arbitration": {
49
+ "risky_pattern": "All disputes via binding arbitration / class action waiver",
50
+ "safe_alternative": (
51
+ "Disputes involving claims under [Dollar Amount] shall be resolved in small claims "
52
+ "court in the consumer's jurisdiction of residence. For other disputes, either party "
53
+ "may elect binding arbitration under [AAA/JAMS] rules. The consumer may opt out of "
54
+ "arbitration by providing written notice within thirty (30) days of accepting these "
55
+ "terms. Each party bears its own arbitration costs; the prevailing party may recover "
56
+ "reasonable attorney's fees."
57
+ ),
58
+ "legal_basis": "Federal Arbitration Act § 2; AT&T Mobility v. Concepcion, 563 U.S. 333 (2011)",
59
+ "consumer_standard": "CFPB Arbitration Rule guidance; EU Directive 93/13/EEC Art. 3",
60
+ "risk_level": "CRITICAL",
61
+ },
62
+ "IP Ownership Assignment": {
63
+ "risky_pattern": "All IP rights assigned to company / work-for-hire everything",
64
+ "safe_alternative": (
65
+ "Intellectual property created by the Receiving Party specifically in performance of "
66
+ "this Agreement ('Work Product IP') shall be assigned to the Disclosing Party. "
67
+ "Pre-existing IP and general knowledge, skills, and experience of the Receiving Party "
68
+ "remain the Receiving Party's property. The Disclosing Party grants the Receiving Party "
69
+ "a non-exclusive, perpetual license to use Work Product IP for internal portfolio and "
70
+ "reference purposes."
71
+ ),
72
+ "legal_basis": "17 U.S.C. § 101 (work for hire); Copyright Act § 201(b)",
73
+ "consumer_standard": "Standard IP assignment with carve-outs for pre-existing IP",
74
+ "risk_level": "CRITICAL",
75
+ },
76
+ "Termination for Convenience": {
77
+ "risky_pattern": "Terminate at any time without notice",
78
+ "safe_alternative": (
79
+ "Either party may terminate this Agreement for convenience upon thirty (30) days' "
80
+ "prior written notice. Immediate termination is permitted only for material breach "
81
+ "that remains uncured after a ten (10) day cure period following written notice "
82
+ "specifying the breach. Upon termination: (a) all outstanding fees become due, "
83
+ "(b) each party shall return or destroy confidential information within fifteen (15) "
84
+ "business days, and (c) licenses granted hereunder shall terminate except as "
85
+ "expressly stated to survive."
86
+ ),
87
+ "legal_basis": "Restatement (Second) of Contracts § 237; UCC § 2-309",
88
+ "consumer_standard": "FTC: adequate notice period required for service termination",
89
+ "risk_level": "CRITICAL",
90
+ },
91
+ "Limitation of liability": {
92
+ "risky_pattern": "Company not liable for any damages / complete disclaimer",
93
+ "safe_alternative": (
94
+ "Neither party shall be liable for indirect, incidental, special, or consequential "
95
+ "damages, EXCEPT in cases of: (a) gross negligence or willful misconduct, "
96
+ "(b) breach of confidentiality, (c) data breach involving personal information, or "
97
+ "(d) intellectual property infringement. Direct damages are limited to fees paid "
98
+ "in the prior twelve (12) months. Nothing in this Agreement limits liability for "
99
+ "death or personal injury caused by negligence."
100
+ ),
101
+ "legal_basis": "UCC § 2-719(3); EU Directive 93/13/EEC Annex (a)",
102
+ "consumer_standard": "Cannot exclude liability for death/personal injury (EU/UK law)",
103
+ "risk_level": "CRITICAL",
104
+ },
105
+ "Unilateral termination": {
106
+ "risky_pattern": "Company can terminate account at any time without reason",
107
+ "safe_alternative": (
108
+ "The Provider may suspend or terminate the User's account for: (a) material breach "
109
+ "of these Terms, (b) non-payment after ten (10) days' notice, (c) illegal activity, "
110
+ "or (d) extended inactivity exceeding twelve (12) months. The Provider shall provide "
111
+ "at least thirty (30) days' written notice before termination, except in cases of "
112
+ "illegal activity. Upon termination, the User shall have thirty (30) days to export "
113
+ "their data."
114
+ ),
115
+ "legal_basis": "EU Directive 2019/770 (Digital Content); CFPB guidance",
116
+ "consumer_standard": "Right to export data upon termination; adequate notice period",
117
+ "risk_level": "CRITICAL",
118
+ },
119
+ "Liquidated Damages": {
120
+ "risky_pattern": "Pre-determined damages far exceeding actual harm",
121
+ "safe_alternative": (
122
+ "In the event of breach, the non-breaching party shall be entitled to liquidated "
123
+ "damages in the amount of [specific reasonable amount], which the parties agree "
124
+ "represents a reasonable estimate of anticipated harm. This liquidated damages "
125
+ "provision shall not apply if actual damages are readily ascertainable, in which "
126
+ "case the non-breaching party may recover actual damages proven."
127
+ ),
128
+ "legal_basis": "Restatement (Second) of Contracts § 356; UCC § 2-718",
129
+ "consumer_standard": "Liquidated damages must be reasonable estimate, not penalty",
130
+ "risk_level": "CRITICAL",
131
+ },
132
+
133
+ # ── HIGH Risk Clauses ──────────────────────────────────────────
134
+ "Unilateral change": {
135
+ "risky_pattern": "We may modify terms at any time without notice",
136
+ "safe_alternative": (
137
+ "Material changes to these Terms require thirty (30) days' advance written notice "
138
+ "to the User via email and in-app notification. The User has the right to terminate "
139
+ "without penalty within the notice period if they do not accept the changes. "
140
+ "Non-material changes (e.g., formatting, clarifications) may be made without notice."
141
+ ),
142
+ "legal_basis": "EU Directive 93/13/EEC Art. 3; Restatement (Second) § 89",
143
+ "consumer_standard": "FTC: material changes require notice and right to reject",
144
+ "risk_level": "HIGH",
145
+ },
146
+ "Content removal": {
147
+ "risky_pattern": "Company can delete content at sole discretion without notice",
148
+ "safe_alternative": (
149
+ "Content may be removed only for violation of these Terms of Service, applicable law, "
150
+ "or valid legal process. The Provider shall provide prior notice specifying the reason "
151
+ "for removal (except where legally prohibited). The User has the right to appeal "
152
+ "within fourteen (14) days. Removed content shall be preserved for thirty (30) days "
153
+            "to allow for appeal resolution."
+        ),
+        "legal_basis": "EU Digital Services Act Art. 17; First Amendment considerations",
+        "consumer_standard": "Due process: notice, reason, and right to appeal",
+        "risk_level": "HIGH",
+    },
+    "Non-Compete": {
+        "risky_pattern": "Broad non-compete with no time/geography limits",
+        "safe_alternative": (
+            "During the term of this Agreement and for a period of [6-12] months thereafter, "
+            "the Receiving Party shall not directly compete with the Disclosing Party in "
+            "[specific market/geography]. This restriction applies only to [specific business "
+            "activities] and does not prevent general employment in the industry. The Disclosing "
+            "Party shall provide [garden leave pay / consideration] during the restricted period."
+        ),
+        "legal_basis": "Restatement (Second) of Contracts § 188; FTC Non-Compete Rule (2024)",
+        "consumer_standard": "Reasonable scope, duration, geography; adequate consideration",
+        "risk_level": "HIGH",
+    },
+    "Exclusivity": {
+        "risky_pattern": "Exclusive dealing with no time limit or exit clause",
+        "safe_alternative": (
+            "The exclusivity arrangement shall apply for an initial term of [12-24] months, "
+            "after which either party may convert to non-exclusive upon sixty (60) days' notice. "
+            "Exclusivity is limited to [specific product/service category] and [specific "
+            "geographic area]. Performance benchmarks shall be reviewed quarterly; failure to "
+            "meet agreed minimums allows termination of exclusivity."
+        ),
+        "legal_basis": "Sherman Act § 1; EU Competition Law Art. 101 TFEU",
+        "consumer_standard": "Time-limited, scope-limited, with performance exit clause",
+        "risk_level": "HIGH",
+    },
+    "Anti-Assignment": {
+        "risky_pattern": "Complete prohibition on assignment without consent",
+        "safe_alternative": (
+            "Neither party may assign this Agreement without the prior written consent of the "
+            "other party, which shall not be unreasonably withheld, conditioned, or delayed. "
+            "Notwithstanding the foregoing, either party may assign this Agreement without "
+            "consent in connection with a merger, acquisition, or sale of substantially all "
+            "of its assets, provided the assignee assumes all obligations hereunder."
+        ),
+        "legal_basis": "UCC § 2-210; Restatement (Second) of Contracts § 317",
+        "consumer_standard": "Consent not to be unreasonably withheld; M&A carve-out",
+        "risk_level": "HIGH",
+    },
+
+    # ── MEDIUM Risk Clauses ────────────────────────────────────────
+    "Jurisdiction": {
+        "risky_pattern": "Exclusive jurisdiction in distant/foreign state",
+        "safe_alternative": (
+            "The Consumer may bring claims in their jurisdiction of residence or the Provider's "
+            "principal place of business. Small claims actions may be brought in any court of "
+            "competent jurisdiction. For commercial contracts: disputes shall be resolved in "
+            "[mutually agreed location] or the defendant's principal place of business."
+        ),
+        "legal_basis": "EU Regulation 1215/2012 (Brussels I); CJEU C-585/08",
+        "consumer_standard": "Consumer may sue in home jurisdiction (EU Directive 93/13)",
+        "risk_level": "MEDIUM",
+    },
+    "Choice of law": {
+        "risky_pattern": "Governed by laws of a jurisdiction that disadvantages consumer",
+        "safe_alternative": (
+            "This Agreement shall be governed by the laws of [State/Country]. Notwithstanding "
+            "the foregoing, nothing in this choice of law provision shall deprive the Consumer "
+            "of the protection afforded by mandatory provisions of the law of the Consumer's "
+            "habitual residence."
+        ),
+        "legal_basis": "EU Regulation 593/2008 (Rome I) Art. 6; UCC § 1-301",
+        "consumer_standard": "Cannot override mandatory consumer protection of home jurisdiction",
+        "risk_level": "MEDIUM",
+    },
+    "Contract by using": {
+        "risky_pattern": "Bound to contract by merely using the service (browsewrap)",
+        "safe_alternative": (
+            "By creating an account, the User acknowledges they have read, understood, and agree "
+            "to be bound by these Terms. The User must affirmatively accept these Terms via "
+            "checkbox or click-through before account creation. Continued use after material "
+            "changes requires re-acceptance."
+        ),
+        "legal_basis": "Specht v. Netscape, 306 F.3d 17 (2d Cir. 2002)",
+        "consumer_standard": "Clickwrap > browsewrap; affirmative acceptance required",
+        "risk_level": "MEDIUM",
+    },
+
+    # ── Additional Common Clauses ──────────────────────────────────
+    "Auto-Renewal": {
+        "risky_pattern": "Auto-renews silently without notice",
+        "safe_alternative": (
+            "This Agreement shall automatically renew for successive [term] periods unless "
+            "either party provides written notice of non-renewal at least thirty (30) days "
+            "before the end of the then-current term. The Provider shall send a reminder "
+            "notice thirty (30) to sixty (60) days before renewal. The Consumer may cancel "
+            "within fifteen (15) days of renewal for a pro-rated refund."
+        ),
+        "legal_basis": "California Auto-Renewal Law (ARL) Bus. & Prof. Code § 17600; FTC Negative Option Rule",
+        "consumer_standard": "Reminder notice required; easy cancellation; pro-rated refund",
+        "risk_level": "HIGH",
+    },
+    "Indemnification": {
+        "risky_pattern": "User indemnifies company for all claims without limit",
+        "safe_alternative": (
+            "Each party shall indemnify, defend, and hold harmless the other party from "
+            "third-party claims arising from: (a) the indemnifying party's breach of this "
+            "Agreement, (b) the indemnifying party's negligence or willful misconduct, or "
+            "(c) the indemnifying party's violation of applicable law. The User's indemnification "
+            "obligation is limited to claims arising from the User's own negligence or "
+            "intentional acts. The maximum indemnification obligation shall not exceed [amount]."
+        ),
+        "legal_basis": "Restatement (Second) of Contracts § 345; UCC § 2-607",
+        "consumer_standard": "Mutual indemnification; limited to own acts; capped",
+        "risk_level": "HIGH",
+    },
+    "Confidentiality": {
+        "risky_pattern": "Overly broad confidentiality with no exceptions or time limit",
+        "safe_alternative": (
+            "Each party agrees to maintain the confidentiality of the other's Confidential "
+            "Information for a period of [3-5] years from disclosure. Confidential Information "
+            "excludes: (a) publicly available information, (b) independently developed "
+            "information, (c) information received from a third party without restriction, "
+            "(d) information required to be disclosed by law or court order (with prompt notice "
+            "to the disclosing party)."
+        ),
+        "legal_basis": "Restatement (Third) of Unfair Competition § 39-45",
+        "consumer_standard": "Time-limited; standard exceptions; required disclosure carve-out",
+        "risk_level": "MEDIUM",
+    },
+}
+
+# Mapping from CUAD/unfair labels to our template keys
+_LABEL_TO_TEMPLATE = {
+    "Uncapped Liability": "Uncapped Liability",
+    "Arbitration": "Arbitration",
+    "IP Ownership Assignment": "IP Ownership Assignment",
+    "Termination for Convenience": "Termination for Convenience",
+    "Limitation of liability": "Limitation of liability",
+    "Unilateral termination": "Unilateral termination",
+    "Liquidated Damages": "Liquidated Damages",
+    "Unilateral change": "Unilateral change",
+    "Content removal": "Content removal",
+    "Non-Compete": "Non-Compete",
+    "Exclusivity": "Exclusivity",
+    "Anti-Assignment": "Anti-Assignment",
+    "Jurisdiction": "Jurisdiction",
+    "Choice of law": "Choice of law",
+    "Contract by using": "Contract by using",
+    "Cap on Liability": "Limitation of liability",  # Similar enough
+    "No-Solicit of Customers": "Non-Compete",  # Use non-compete template
+    "No-Solicit of Employees": "Non-Compete",
+    "Non-Disparagement": "Confidentiality",  # Similar restrictive clause
+}
+
+
+# ═══════════════════════════════════════════════════════════════════════
+# TIER 2: RAG RETRIEVAL (find fairer precedent clauses)
+# ═══════════════════════════════════════════════════════════════════════
+
+def _find_similar_templates(clause_label, clause_text):
+    """
+    Find the most relevant safe alternative template(s) for a given clause.
+    Returns list of matching templates.
+    """
+    matches = []
+
+    # Direct label match
+    template_key = _LABEL_TO_TEMPLATE.get(clause_label)
+    if template_key and template_key in SAFE_ALTERNATIVES:
+        matches.append((template_key, SAFE_ALTERNATIVES[template_key], 1.0))
+
+    # Also do keyword matching for clauses that might not have exact label matches
+    clause_lower = clause_text.lower()
+    keyword_map = {
+        "Uncapped Liability": ["unlimited liability", "uncapped", "no limit on liability"],
+        "Arbitration": ["arbitration", "arbitrate", "waive right to court", "class action waiver"],
+        "Termination for Convenience": ["terminate at any time", "terminate without cause", "terminate without notice"],
+        "Limitation of liability": ["not liable", "limitation of liability", "in no event", "disclaim"],
+        "Unilateral change": ["modify at any time", "sole discretion", "change terms", "without notice"],
+        "Content removal": ["remove content", "delete content", "remove at sole discretion"],
+        "Auto-Renewal": ["auto-renew", "automatically renew", "automatic renewal"],
+        "Indemnification": ["indemnif", "hold harmless"],
+    }
+
+    for key, keywords in keyword_map.items():
+        if key in SAFE_ALTERNATIVES:
+            for kw in keywords:
+                if kw in clause_lower:
+                    # Avoid duplicates
+                    if not any(m[0] == key for m in matches):
+                        matches.append((key, SAFE_ALTERNATIVES[key], 0.7))
+                    break
+
+    return matches
+
+
+# ═══════════════════════════════════════════════════════════════════════
+# TIER 3: LLM REFINEMENT
+# ═══════════════════════════════════════════════════════════════════════
+
+_LLM_MODEL = "Qwen/Qwen2.5-7B-Instruct"
+
+def _refine_with_llm(original_clause, template, clause_label):
+    """
+    Use LLM to adapt the template to the specific clause context.
+    The LLM refines — it does NOT generate from scratch (anti-hallucination).
+    """
+    if not _HAS_INFERENCE:
+        return None
+
+    try:
+        token = os.environ.get("HF_TOKEN", "")
+        client = InferenceClient(
+            provider="hf-inference",
+            api_key=token if token else None,
+        )
+
+        prompt = f"""You are a legal contract redlining assistant. Your task is to adapt a safe clause template to fit the specific context of an original risky clause.
+
+RULES:
+1. You MUST use the provided template as your base — do NOT generate clauses from scratch.
+2. Preserve the legal protections in the template.
+3. Adapt specific details (parties, amounts, timeframes) from the original clause.
+4. Keep the same legal standard cited in the template.
+5. Output ONLY the refined clause text, nothing else.
+6. The refined clause should be immediately usable in a contract.
+
+ORIGINAL RISKY CLAUSE:
+{original_clause[:500]}
+
+CLAUSE TYPE: {clause_label}
+
+SAFE TEMPLATE:
+{template['safe_alternative']}
+
+LEGAL BASIS: {template['legal_basis']}
+
+Write the refined safer clause (adapt the template to this specific contract's context):"""
+
+        response = client.chat_completion(
+            model=_LLM_MODEL,
+            messages=[
+                {"role": "system", "content": "You are a legal contract redlining expert. Output ONLY the refined clause text."},
+                {"role": "user", "content": prompt},
+            ],
+            max_tokens=512,
+            temperature=0.2,
+        )
+        refined = response.choices[0].message.content.strip()
+
+        # Sanity check: refined should be substantial
+        if len(refined) < 50:
+            return None
+        return refined
+
+    except Exception as e:
+        print(f"[ClauseGuard Redline] LLM refinement error: {e}")
+        return None
+
+
+# ═══════════════════════════════════════════════════════════════════════
+# PUBLIC API
+# ═══════════════════════════════════════════════════════════════════════
+
+def generate_redlines(analysis_result, use_llm=True):
+    """
+    Generate redline suggestions for all flagged clauses in the analysis.
+
+    Returns list of redline suggestions:
+        [{
+            "original_text": str,
+            "clause_label": str,
+            "risk_level": str,
+            "safe_alternative": str,
+            "template_alternative": str,
+            "legal_basis": str,
+            "consumer_standard": str,
+            "tier": "template" | "llm_refined",
+        }]
+    """
+    if analysis_result is None:
+        return []
+
+    clauses = analysis_result.get("clauses", [])
+    if not clauses:
+        return []
+
+    redlines = []
+    seen_labels = set()  # Deduplicate by label
+
+    # Sort by risk level: CRITICAL first
+    risk_order = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
+    sorted_clauses = sorted(clauses, key=lambda c: risk_order.get(c.get("risk", "LOW"), 3))
+
+    for clause in sorted_clauses:
+        label = clause.get("label", "")
+        risk = clause.get("risk", "LOW")
+        text = clause.get("text", "")
+
+        # Skip LOW risk and already-seen labels
+        if risk == "LOW" or label in seen_labels:
+            continue
+        seen_labels.add(label)
+
+        # Find matching templates (Tier 1 + Tier 2)
+        matches = _find_similar_templates(label, text)
+        if not matches:
+            continue
+
+        best_key, best_template, score = matches[0]
+
+        # Tier 3: Try LLM refinement if enabled
+        refined_text = None
+        tier = "template"
+        if use_llm and risk in ("CRITICAL", "HIGH"):
+            refined_text = _refine_with_llm(text, best_template, label)
+            if refined_text:
+                tier = "llm_refined"
+
+        redlines.append({
+            "original_text": text[:500],
+            "clause_label": label,
+            "risk_level": risk,
+            "safe_alternative": refined_text or best_template["safe_alternative"],
+            "template_alternative": best_template["safe_alternative"],
+            "legal_basis": best_template["legal_basis"],
+            "consumer_standard": best_template["consumer_standard"],
+            "tier": tier,
+        })
+
+    return redlines
+
+
+def render_redlines_html(redlines):
+    """Render redline suggestions as HTML for Gradio."""
+    if not redlines:
+        return '''<div style="padding:24px;text-align:center;color:#6b7280;font-family:system-ui,sans-serif;">
+            <p style="font-size:16px;">📝 No redline suggestions available.</p>
+            <p style="font-size:13px;">Analyze a contract first — redlining suggestions will appear for risky clauses.</p>
+        </div>'''
+
+    risk_styles = {
+        "CRITICAL": ("#dc2626", "#fef2f2", "⚠️"),
+        "HIGH": ("#ea580c", "#fff7ed", "⚡"),
+        "MEDIUM": ("#ca8a04", "#fefce8", "📋"),
+        "LOW": ("#16a34a", "#f0fdf4", "✓"),
+    }
+
+    html = '<div style="font-family:system-ui,sans-serif;">'
+
+    # Summary header
+    crit = sum(1 for r in redlines if r["risk_level"] == "CRITICAL")
+    high = sum(1 for r in redlines if r["risk_level"] == "HIGH")
+    med = sum(1 for r in redlines if r["risk_level"] == "MEDIUM")
+    llm_count = sum(1 for r in redlines if r["tier"] == "llm_refined")
+
+    html += f'''
+    <div style="padding:16px;background:linear-gradient(135deg,#eff6ff,#f0fdf4);border-radius:12px;margin-bottom:16px;border:1px solid #e5e7eb;">
+        <div style="display:flex;align-items:center;gap:8px;margin-bottom:8px;">
+            <span style="font-size:24px;">✏️</span>
+            <h2 style="margin:0;font-size:18px;color:#1f2937;">Clause Redlining Suggestions</h2>
+        </div>
+        <p style="font-size:13px;color:#6b7280;margin:0;">
+            {len(redlines)} suggestions: {crit} Critical · {high} High · {med} Medium
+            {f" · {llm_count} LLM-refined" if llm_count else ""}
+        </p>
+    </div>
+    '''
+
+    for i, redline in enumerate(redlines):
+        border_color, bg_color, icon = risk_styles.get(
+            redline["risk_level"], ("#6b7280", "#f9fafb", "•")
+        )
+        tier_badge = (
+            '<span style="font-size:10px;background:#eff6ff;color:#3b82f6;padding:2px 8px;border-radius:4px;">🤖 LLM Refined</span>'
+            if redline["tier"] == "llm_refined"
+            else '<span style="font-size:10px;background:#f0fdf4;color:#16a34a;padding:2px 8px;border-radius:4px;">📋 Template</span>'
+        )
+
+        # Escape "&" first so existing entities are not double-escaped
+        original_preview = redline["original_text"][:200].replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
+        safe_text = redline["safe_alternative"].replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
+
+        html += f'''
+        <div style="border:1px solid #e5e7eb;border-left:4px solid {border_color};border-radius:8px;margin-bottom:12px;overflow:hidden;">
+            <!-- Header -->
+            <div style="padding:12px 16px;background:{bg_color};border-bottom:1px solid #e5e7eb;">
+                <div style="display:flex;align-items:center;justify-content:space-between;">
+                    <div style="display:flex;align-items:center;gap:8px;">
+                        <span style="font-size:16px;">{icon}</span>
+                        <span style="font-size:14px;font-weight:600;color:{border_color};">{redline["clause_label"]}</span>
+                        <span style="font-size:11px;color:{border_color};text-transform:uppercase;font-weight:600;">{redline["risk_level"]}</span>
+                    </div>
+                    {tier_badge}
+                </div>
+            </div>
+
+            <!-- Body -->
+            <div style="padding:16px;">
+                <!-- Original (risky) -->
+                <div style="margin-bottom:12px;">
+                    <div style="font-size:11px;font-weight:600;color:#991b1b;text-transform:uppercase;margin-bottom:4px;">❌ Original (Risky)</div>
+                    <div style="background:#fef2f2;border:1px solid #fecaca;border-radius:6px;padding:10px;font-size:12px;color:#991b1b;line-height:1.6;">
+                        <del>{original_preview}{"..." if len(redline["original_text"]) > 200 else ""}</del>
+                    </div>
+                </div>
+
+                <!-- Suggested (safe) -->
+                <div style="margin-bottom:12px;">
+                    <div style="font-size:11px;font-weight:600;color:#166534;text-transform:uppercase;margin-bottom:4px;">✅ Suggested Alternative</div>
+                    <div style="background:#f0fdf4;border:1px solid #bbf7d0;border-radius:6px;padding:10px;font-size:12px;color:#166534;line-height:1.6;">
+                        {safe_text}
+                    </div>
+                </div>
+
+                <!-- Legal basis -->
+                <div style="display:flex;gap:12px;flex-wrap:wrap;">
+                    <div style="flex:1;min-width:200px;">
+                        <div style="font-size:10px;font-weight:600;color:#6b7280;text-transform:uppercase;margin-bottom:2px;">📚 Legal Basis</div>
+                        <div style="font-size:11px;color:#4b5563;">{redline["legal_basis"]}</div>
+                    </div>
+                    <div style="flex:1;min-width:200px;">
+                        <div style="font-size:10px;font-weight:600;color:#6b7280;text-transform:uppercase;margin-bottom:2px;">🛡️ Consumer Standard</div>
+                        <div style="font-size:11px;color:#4b5563;">{redline["consumer_standard"]}</div>
+                    </div>
+                </div>
+            </div>
+        </div>
+        '''
+
+    # Disclaimer
+    html += '''
+    <div style="margin-top:16px;padding:12px;background:#fefce8;border:1px solid #fde68a;border-radius:8px;">
+        <p style="font-size:11px;color:#92400e;margin:0;line-height:1.5;">
+            <strong>⚠️ Disclaimer:</strong> These are AI-generated suggestions based on legal templates and consumer protection standards.
+            They are NOT legal advice. The suggested alternatives are starting points that should be reviewed and customized by a
+            qualified attorney before use in any contract. Legal requirements vary by jurisdiction.
+        </p>
+    </div>
+    '''
+
+    html += '</div>'
+    return html
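
The tiered flow above (direct label lookup → keyword retrieval → optional LLM refinement) can be exercised without the Space or the LLM. A minimal self-contained sketch of the template tier — the dict and clause data below are illustrative, not the module's real `SAFE_ALTERNATIVES`:

```python
# Sketch of generate_redlines' template tier: sort by risk, skip LOW,
# dedupe by label, and attach the matching safe template.
SAFE_ALTS = {  # hypothetical, trimmed stand-in for SAFE_ALTERNATIVES
    "Arbitration": {
        "safe_alternative": "Arbitration shall be optional for the Consumer...",
        "legal_basis": "FAA § 2 (illustrative)",
    },
}

def sketch_redlines(clauses):
    redlines, seen = [], set()
    order = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
    for c in sorted(clauses, key=lambda c: order.get(c.get("risk", "LOW"), 3)):
        label, risk = c.get("label", ""), c.get("risk", "LOW")
        if risk == "LOW" or label in seen or label not in SAFE_ALTS:
            continue  # LOW-risk and duplicate labels produce no suggestion
        seen.add(label)
        redlines.append({
            "clause_label": label,
            "risk_level": risk,
            "safe_alternative": SAFE_ALTS[label]["safe_alternative"],
            "tier": "template",  # Tier 3 would flip this to "llm_refined"
        })
    return redlines

out = sketch_redlines([
    {"label": "Arbitration", "risk": "HIGH", "text": "All disputes go to arbitration."},
    {"label": "Arbitration", "risk": "HIGH", "text": "duplicate — deduped by label"},
    {"label": "Jurisdiction", "risk": "LOW", "text": "skipped — LOW risk"},
])
```

With `use_llm=False` (or when the Inference API is unavailable) the real module degrades to exactly this behavior, which is why every redline carries a `template_alternative` fallback.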
requirements.txt CHANGED
@@ -1,5 +1,5 @@
 gradio>=5.23.0
-transformers>=5.6.1
+transformers>=4.45.0
 torch>=2.5.0
 numpy>=2.0.0
 pdfplumber>=0.11.0
@@ -7,3 +7,5 @@ python-docx>=1.1.0
 peft>=0.15.0
 accelerate>=1.2.0
 sentence-transformers>=3.0.0
+python-doctr[torch]>=0.9.0
+huggingface_hub>=0.25.0
web/.env.example CHANGED
@@ -18,3 +18,10 @@ RESEND_API_KEY=re_...
 # App
 NEXT_PUBLIC_SITE_URL=http://localhost:3000
 CLAUSEGUARD_API_URL=https://gaurv007-clauseguard-api.hf.space
+
+# HF Inference API (for chatbot + redlining LLM)
+HF_TOKEN=hf_...
+
+# Optional: SaulLM for explain endpoint
+SAULLM_ENDPOINT=
+HF_API_TOKEN=
web/app/api/analyze/route.ts CHANGED
@@ -1,4 +1,5 @@
 import { NextRequest, NextResponse } from "next/server";
+import { createClient } from "@/lib/supabase/server";
 
 const API_URL = process.env.CLAUSEGUARD_API_URL || "https://gaurv007-clauseguard-api.hf.space";
 
@@ -14,10 +15,19 @@ export async function POST(req: NextRequest) {
     );
   }
 
-  // Forward to backend API v2.0 (full text, clauses split server-side)
+  // Forward auth token to backend
+  const headers: Record<string, string> = { "Content-Type": "application/json" };
+  try {
+    const supabase = await createClient();
+    const { data: { session } } = await supabase.auth.getSession();
+    if (session?.access_token) {
+      headers["Authorization"] = `Bearer ${session.access_token}`;
+    }
+  } catch {}
+
   const response = await fetch(`${API_URL}/api/analyze`, {
     method: "POST",
-    headers: { "Content-Type": "application/json" },
+    headers,
     body: JSON.stringify({ text, source_url }),
   });
 
web/app/api/chat/route.ts ADDED
@@ -0,0 +1,37 @@
+import { NextRequest, NextResponse } from "next/server";
+
+const API_URL = process.env.CLAUSEGUARD_API_URL || "https://gaurv007-clauseguard-api.hf.space";
+
+export async function POST(req: NextRequest) {
+  try {
+    const body = await req.json();
+    const { message, session_id, history } = body;
+
+    if (!message || !session_id) {
+      return NextResponse.json(
+        { error: "message and session_id are required" },
+        { status: 400 }
+      );
+    }
+
+    const response = await fetch(`${API_URL}/api/chat`, {
+      method: "POST",
+      headers: { "Content-Type": "application/json" },
+      body: JSON.stringify({ message, session_id, history: history || [] }),
+    });
+
+    if (!response.ok) {
+      const err = await response.text().catch(() => "");
+      throw new Error(err || `Backend error: ${response.status}`);
+    }
+
+    const result = await response.json();
+    return NextResponse.json(result);
+  } catch (error: any) {
+    console.error("Chat error:", error.message);
+    return NextResponse.json(
+      { error: error.message || "Chat failed. Try again." },
+      { status: 500 }
+    );
+  }
+}
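
The chat proxy above accepts a JSON body of `{message, session_id, history}` and returns 400 when `message` or `session_id` is missing; the frontend truncates `history` to the last six messages before sending. A small sketch of building and validating that payload from any client (the `build_chat_payload` helper is hypothetical, not part of the repo):

```python
import json

def build_chat_payload(message, session_id, history=None):
    """Build the JSON body the /api/chat proxy expects.

    Mirrors the frontend behavior: history defaults to [] and is
    truncated to the last 6 messages. Illustrative helper only.
    """
    if not message or not session_id:
        # The proxy would answer 400 in this case; fail fast client-side.
        raise ValueError("message and session_id are required")
    history = (history or [])[-6:]
    return json.dumps({"message": message, "session_id": session_id, "history": history})

payload = build_chat_payload(
    "What does the arbitration clause mean?",
    "sess-123",
    [{"role": "user", "content": f"m{i}"} for i in range(10)],  # 10 msgs -> last 6 kept
)
```

The same shape (plus `use_llm` instead of `message`/`history`) applies to the `/api/redline` route that follows.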
web/app/api/redline/route.ts ADDED
@@ -0,0 +1,37 @@
+import { NextRequest, NextResponse } from "next/server";
+
+const API_URL = process.env.CLAUSEGUARD_API_URL || "https://gaurv007-clauseguard-api.hf.space";
+
+export async function POST(req: NextRequest) {
+  try {
+    const body = await req.json();
+    const { session_id, text, use_llm } = body;
+
+    if (!session_id && !text) {
+      return NextResponse.json(
+        { error: "Provide session_id or text" },
+        { status: 400 }
+      );
+    }
+
+    const response = await fetch(`${API_URL}/api/redline`, {
+      method: "POST",
+      headers: { "Content-Type": "application/json" },
+      body: JSON.stringify({ session_id, text, use_llm: use_llm ?? true }),
+    });
+
+    if (!response.ok) {
+      const err = await response.text().catch(() => "");
+      throw new Error(err || `Backend error: ${response.status}`);
+    }
+
+    const result = await response.json();
+    return NextResponse.json(result);
+  } catch (error: any) {
+    console.error("Redline error:", error.message);
+    return NextResponse.json(
+      { error: error.message || "Redlining failed" },
+      { status: 500 }
+    );
+  }
+}
web/app/dashboard-pages/analyze/page.tsx CHANGED
@@ -9,7 +9,8 @@ import {
   AlertTriangle, Tag, BookOpen, ClipboardList, DollarSign,
   Calendar, Building, MapPin, Hash, Bot, FileSearch, Percent, Clock,
   User, BookMarked, ShieldX, HelpCircle, Cpu, PenTool, Zap,
-  ShieldOff, CircleSlash, MessageSquareWarning, Construction
+  ShieldOff, CircleSlash, MessageSquareWarning, Construction,
+  MessageSquare, Send, Loader2
 } from "lucide-react";
 
 interface Cat { name: string; severity: string; description?: string; confidence?: number; }
@@ -19,6 +20,17 @@ interface Contradiction { type: string; explanation: string; severity: string; c
 interface Obligation { type: string; party: string; description: string; deadline: string; priority?: number; }
 interface ComplianceCheck { requirement: string; description: string; severity: string; status: string; matched_keywords: string[]; context?: string[]; }
 interface ComplianceReg { description: string; compliance_rate: number; checks: ComplianceCheck[]; overall_status: string; negated_count?: number; ambiguous_count?: number; }
+interface Redline {
+  original_text: string;
+  clause_label: string;
+  risk_level: string;
+  safe_alternative: string;
+  template_alternative?: string;
+  legal_basis: string;
+  consumer_standard: string;
+  tier: string;
+}
+interface ChatMessage { role: "user" | "assistant"; content: string; }
 interface AnalysisResult {
   risk_score: number;
   grade: string;
@@ -29,8 +41,10 @@ interface AnalysisResult {
   contradictions: Contradiction[];
   obligations: Obligation[];
   compliance: Record<string, ComplianceReg>;
+  redlines: Redline[];
   model: string;
   latency_ms: number;
+  session_id?: string;
 }
 
 const SEV_CONFIG: Record<string, { icon: any; label: string; text: string; bg: string; border: string; ring: string }> = {
@@ -169,6 +183,9 @@ export default function AnalyzePage() {
   const [scanLimit, setScanLimit] = useState(10);
   const [canUpload, setCanUpload] = useState(false);
   const [showUpgrade, setShowUpgrade] = useState(false);
+  const [chatMessages, setChatMessages] = useState<ChatMessage[]>([]);
+  const [chatInput, setChatInput] = useState("");
+  const [chatLoading, setChatLoading] = useState(false);
   const fileInputRef = useRef<HTMLInputElement>(null);
 
   // Fetch user profile from DB on mount — no hardcoded emails or plans
@@ -237,6 +254,31 @@ export default function AnalyzePage() {
     setCopied(true); setTimeout(() => setCopied(false), 2000);
   }
 
+  async function handleChat() {
+    if (!chatInput.trim() || !results?.session_id) return;
+    const userMsg: ChatMessage = { role: "user", content: chatInput.trim() };
+    setChatMessages(prev => [...prev, userMsg]);
+    setChatInput("");
+    setChatLoading(true);
+    try {
+      const res = await fetch("/api/chat", {
+        method: "POST",
+        headers: { "Content-Type": "application/json" },
+        body: JSON.stringify({
+          message: userMsg.content,
+          session_id: results.session_id,
+          history: chatMessages.slice(-6),
+        }),
+      });
+      if (!res.ok) throw new Error((await res.json()).error || "Chat failed");
+      const data = await res.json();
+      setChatMessages(prev => [...prev, { role: "assistant", content: data.response }]);
+    } catch (e: any) {
+      setChatMessages(prev => [...prev, { role: "assistant", content: `⚠️ ${e.message}` }]);
    }
+    setChatLoading(false);
+  }
+
   const flagged = results?.results.filter(r => r.categories.length > 0) || [];
   const filtered = filter === "all" ? flagged : flagged.filter(r => r.categories.some(c => c.severity === filter));
   const sevCounts = { CRITICAL: 0, HIGH: 0, MEDIUM: 0, LOW: 0 };
@@ -260,6 +302,8 @@ export default function AnalyzePage() {
     { key: "contradictions", label: "Issues", icon: AlertTriangle, count: results?.contradictions.length || 0 },
     { key: "obligations", label: "Obligations", icon: ClipboardList, count: results?.obligations.length || 0 },
     { key: "compliance", label: "Compliance", icon: ShieldCheck, count: Object.keys(results?.compliance || {}).length },
+    { key: "redlining", label: "Redlining", icon: PenTool, count: results?.redlines?.length || 0 },
+    { key: "chat", label: "Q&A", icon: MessageSquare, count: chatMessages.length },
   ];
 
   return (
@@ -668,6 +712,139 @@ export default function AnalyzePage() {
             })}
           </div>
         )}
+
+        {/* Redlining */}
+        {activeTab === "redlining" && (
+          <div className="space-y-3">
+            {(!results.redlines || results.redlines.length === 0) ? (
+              <div className="border border-dashed border-zinc-200 rounded-xl p-8 sm:p-10 text-center bg-white">
+                <PenTool className="w-8 h-8 text-zinc-300 mx-auto mb-2" />
+                <p className="text-sm text-zinc-500">No redlining suggestions for this contract.</p>
+              </div>
+            ) : (
+              <>
+                <div className="bg-gradient-to-r from-blue-50 to-emerald-50 rounded-xl p-4 border border-zinc-200 mb-2">
+                  <div className="flex items-center gap-2 mb-1">
+                    <PenTool className="w-4 h-4 text-zinc-600" />
+                    <span className="text-sm font-semibold text-zinc-800">Clause Redlining Suggestions</span>
+                  </div>
+                  <p className="text-xs text-zinc-500">
+                    {results.redlines.length} suggestions · {results.redlines.filter(r => r.tier === "llm_refined").length} LLM-refined
+                  </p>
+                </div>
+                {results.redlines.map((rl, i) => {
+                  const isHigh = rl.risk_level === "CRITICAL" || rl.risk_level === "HIGH";
+                  const conf = SEV_CONFIG[rl.risk_level] || SEV_CONFIG.MEDIUM;
+                  return (
+                    <div key={i} className={`bg-white border rounded-xl overflow-hidden ${conf.border}`}>
+                      <div className={`px-4 py-3 ${conf.bg} border-b ${conf.border} flex items-center justify-between`}>
+                        <div className="flex items-center gap-2">
+                          <conf.icon className={`w-4 h-4 ${conf.text}`} />
+                          <span className={`text-sm font-semibold ${conf.text}`}>{rl.clause_label}</span>
+                          <span className={`text-[10px] uppercase font-bold ${conf.text}`}>{rl.risk_level}</span>
+                        </div>
+                        <span className={`text-[10px] px-2 py-0.5 rounded border ${
+                          rl.tier === "llm_refined"
+                            ? "bg-indigo-50 text-indigo-600 border-indigo-200"
+                            : "bg-emerald-50 text-emerald-600 border-emerald-200"
+                        }`}>
+                          {rl.tier === "llm_refined" ? "🤖 LLM Refined" : "📋 Template"}
+                        </span>
+                      </div>
+                      <div className="p-4 space-y-3">
+                        <div>
+                          <p className="text-[10px] font-semibold text-red-600 uppercase mb-1">❌ Original (Risky)</p>
+                          <div className="bg-red-50 border border-red-100 rounded-lg p-3 text-xs text-red-800 leading-relaxed line-through">
+                            {rl.original_text.slice(0, 200)}{rl.original_text.length > 200 ? "..." : ""}
+                          </div>
+                        </div>
+                        <div>
+                          <p className="text-[10px] font-semibold text-emerald-600 uppercase mb-1">✅ Suggested Alternative</p>
+                          <div className="bg-emerald-50 border border-emerald-100 rounded-lg p-3 text-xs text-emerald-800 leading-relaxed">
+                            {rl.safe_alternative}
+                          </div>
+                        </div>
+                        <div className="flex gap-3 flex-wrap text-[10px] text-zinc-500">
+                          <span>📚 {rl.legal_basis}</span>
+                          <span>🛡️ {rl.consumer_standard}</span>
+                        </div>
+                      </div>
+                    </div>
+                  );
+                })}
+                <div className="bg-amber-50 border border-amber-200 rounded-lg p-3 text-[11px] text-amber-800">
+                  <strong>⚠️ Disclaimer:</strong> These are AI-generated suggestions, NOT legal advice. Consult an attorney before use.
+                </div>
+              </>
+            )}
+          </div>
+        )}
+
+        {/* Chat */}
+        {activeTab === "chat" && (
+          <div className="flex flex-col h-[350px] sm:h-[420px]">
+            {!results.session_id ? (
+              <div className="flex-1 flex items-center justify-center">
+                <div className="text-center">
+                  <MessageSquare className="w-8 h-8 text-zinc-300 mx-auto mb-2" />
+                  <p className="text-sm text-zinc-500">Chat unavailable — session not initialized.</p>
+                  <p className="text-xs text-zinc-400 mt-1">Try analyzing again with the backend running.</p>
793
+ </div>
794
+ ) : (
795
+ <>
796
+ <div className="flex-1 overflow-y-auto space-y-3 pr-1 mb-3">
797
+ {chatMessages.length === 0 && (
798
+ <div className="text-center py-8">
799
+ <MessageSquare className="w-8 h-8 text-zinc-200 mx-auto mb-2" />
800
+ <p className="text-sm text-zinc-400">Ask a question about your contract</p>
801
+ <div className="mt-3 flex flex-wrap justify-center gap-2">
802
+ {["What are the main risks?", "Who are the parties?", "Is there an arbitration clause?", "Summarize key terms"].map(q => (
803
+ <button key={q} onClick={() => { setChatInput(q); }}
804
+ className="text-xs px-3 py-1.5 rounded-full border border-zinc-200 text-zinc-500 hover:bg-zinc-50 transition-colors">
805
+ {q}
806
+ </button>
807
+ ))}
808
+ </div>
809
+ </div>
810
+ )}
811
+ {chatMessages.map((msg, i) => (
812
+ <div key={i} className={`flex ${msg.role === "user" ? "justify-end" : "justify-start"}`}>
813
+ <div className={`max-w-[85%] rounded-xl px-3.5 py-2.5 text-sm leading-relaxed ${
814
+ msg.role === "user"
815
+ ? "bg-zinc-900 text-white"
816
+ : "bg-zinc-100 text-zinc-700 border border-zinc-200"
817
+ }`}>
818
+ {msg.content}
819
+ </div>
820
+ </div>
821
+ ))}
822
+ {chatLoading && (
823
+ <div className="flex justify-start">
824
+ <div className="bg-zinc-100 border border-zinc-200 rounded-xl px-4 py-3">
825
+ <Loader2 className="w-4 h-4 text-zinc-400 animate-spin" />
826
+ </div>
827
+ </div>
828
+ )}
829
+ </div>
830
+ <div className="flex gap-2 border-t border-zinc-100 pt-3">
831
+ <input
832
+ value={chatInput}
833
+ onChange={(e) => setChatInput(e.target.value)}
834
+ onKeyDown={(e) => e.key === "Enter" && !e.shiftKey && handleChat()}
835
+ placeholder="Ask about your contract..."
836
+ className="flex-1 px-3 py-2 border border-zinc-200 rounded-lg text-sm focus:outline-none focus:ring-2 focus:ring-zinc-900/10"
837
+ disabled={chatLoading}
838
+ />
839
+ <button onClick={handleChat} disabled={chatLoading || !chatInput.trim()}
840
+ className="px-3 py-2 bg-zinc-900 text-white rounded-lg hover:bg-zinc-800 disabled:opacity-40 transition-colors">
841
+ <Send className="w-4 h-4" />
842
+ </button>
843
+ </div>
844
+ </>
845
+ )}
846
+ </div>
847
+ )}
848
  </div>
849
  </div>
850
  ) : (
web/app/page.tsx CHANGED
@@ -3,7 +3,8 @@ import {
  ShieldCheck, ShieldAlert, Scale, Gavel, ScanText, FileCheck,
  TriangleAlert, ArrowRight, Zap, Eye, Download, ChevronRight,
  Sparkles, Lock, Globe, Ban, FileX, Stamp, Layers, Tag, AlertTriangle,
- ClipboardList, Landmark, Building, BookOpen, CheckCircle, Cpu
+ ClipboardList, Landmark, Building, BookOpen, CheckCircle, Cpu,
+ MessageSquare, PenTool, ScanLine
  } from "lucide-react";
 
  const CLAUSES = [
@@ -21,22 +22,26 @@ const CLAUSES = [
  { icon: ClipboardList, name: "Obligations", desc: "Track monetary, compliance, reporting tasks with priority", severity: "medium" },
  { icon: Landmark, name: "Compliance", desc: "GDPR, CCPA, SOX, HIPAA, FINRA with negation detection", severity: "high" },
  { icon: BookOpen, name: "Compare Contracts", desc: "Semantic similarity with sentence embeddings", severity: "low" },
+ { icon: PenTool, name: "Clause Redlining", desc: "AI suggests safer alternatives with legal citations", severity: "critical" },
+ { icon: MessageSquare, name: "Q&A Chatbot", desc: "Ask questions about your contract — RAG-powered answers", severity: "medium" },
+ { icon: ScanLine, name: "OCR for Scanned PDFs", desc: "docTR engine auto-detects and OCRs scanned contracts", severity: "low" },
+ { icon: Cpu, name: "6 AI Models", desc: "Legal-BERT, NER, NLI, Embeddings, OCR, Qwen2.5-7B LLM", severity: "low" },
  ];
 
  const STEPS = [
- { icon: Download, title: "Upload or paste", desc: "Drop a PDF, DOCX, or paste contract text directly." },
- { icon: ScanText, title: "3 AI models analyze", desc: "Legal-BERT classifier + Legal NER + DeBERTa NLI scan your contract." },
- { icon: TriangleAlert, title: "Get precise insights", desc: "Risk score, contradictions, obligations, compliance gaps with source indicators." },
+ { icon: Download, title: "Upload or paste", desc: "Drop a PDF (even scanned!), DOCX, or paste contract text directly." },
+ { icon: ScanText, title: "6 AI models analyze", desc: "Legal-BERT + NER + NLI + OCR + Embeddings + LLM scan your contract." },
+ { icon: TriangleAlert, title: "Get precise insights", desc: "Risk score, redlining, Q&A chatbot, contradictions, obligations, and compliance." },
  ];
 
  const PRICING = [
  {
  name: "Free", price: "0", period: "", highlight: false, cta: "Get started",
- features: ["10 scans per month", "41 clause categories", "Risk scoring", "ML Legal NER", "NLI contradiction detection", "Compliance with negation detection"],
+ features: ["10 scans per month", "41 clause categories", "Risk scoring", "ML Legal NER", "NLI contradiction detection", "Compliance with negation detection", "Clause redlining suggestions", "OCR for scanned PDFs"],
  },
  {
  name: "Pro", price: "999", period: "/mo", highlight: true, cta: "Start free trial",
- features: ["Unlimited scans", "Upload PDF/DOCX files", "Contract comparison", "AI clause explanations", "Scan history", "PDF report export", "Obligation tracker with priority", "Priority support"],
+ features: ["Unlimited scans", "Upload PDF/DOCX files", "Contract comparison", "Q&A Chatbot (RAG)", "AI clause explanations", "LLM-refined redlining", "Scan history", "PDF report export", "Obligation tracker with priority", "Priority support"],
  },
  {
  name: "Team", price: "3,999", period: "/mo", highlight: false, cta: "Talk to us",
@@ -59,14 +64,14 @@ export default function Home() {
  <div className="max-w-2xl">
  <div className="inline-flex items-center gap-2 px-3 py-1 rounded-full border border-zinc-200 text-[13px] text-zinc-500 mb-6">
  <Sparkles className="w-3.5 h-3.5 text-zinc-400" />
- 3 ML models · 41 clause categories · negation-aware compliance
+ 6 AI models · 41 clause categories · RAG chatbot · clause redlining · OCR
  </div>
  <h1 className="text-3xl sm:text-[42px] lg:text-5xl font-semibold tracking-tight leading-[1.1]">
  Know what you are<br className="hidden sm:block" /> agreeing to
  </h1>
  <p className="mt-5 text-base sm:text-[17px] text-zinc-500 leading-relaxed max-w-lg">
- ClauseGuard scans contracts, terms of service, and leases using 3 specialized AI models.
- Get precise clause detection, risk scoring, ML entity extraction, NLI contradiction alerts, and negation-aware compliance checks.
+ ClauseGuard scans contracts using 6 AI models. Get clause detection, risk scoring,
+ safer alternatives, Q&A chatbot, OCR for scanned PDFs, and compliance checks.
  </p>
  <div className="mt-8 flex flex-col sm:flex-row gap-3">
  <Link href="/dashboard-pages/analyze" className="inline-flex items-center justify-center gap-2 bg-zinc-900 text-white px-5 py-2.5 rounded-lg text-sm font-medium hover:bg-zinc-800 transition-colors">
@@ -87,11 +92,11 @@
  <ShieldCheck className="w-4 h-4 text-zinc-400" />
  <p className="text-[13px] font-medium text-zinc-400 uppercase tracking-wider">Detection</p>
  </div>
- <h2 className="text-xl sm:text-2xl font-semibold tracking-tight">14 powerful analysis features</h2>
+ <h2 className="text-xl sm:text-2xl font-semibold tracking-tight">18 powerful analysis features</h2>
  <p className="mt-2 text-zinc-500 text-sm sm:text-[15px] max-w-lg">
- Based on the CUAD taxonomy + CLAUDETTE framework, the same datasets used by EU consumer protection researchers and Stanford NLP.
+ Based on the CUAD taxonomy + CLAUDETTE framework. Now with RAG chatbot, clause redlining, and OCR.
  </p>
- <div className="mt-8 sm:mt-10 grid grid-cols-2 sm:grid-cols-2 lg:grid-cols-4 gap-2 sm:gap-3">
+ <div className="mt-8 sm:mt-10 grid grid-cols-2 sm:grid-cols-3 lg:grid-cols-4 gap-2 sm:gap-3">
  {CLAUSES.map((c) => (
  <div key={c.name} className="group border border-zinc-100 rounded-xl p-3 sm:p-4 hover:border-zinc-200 hover:shadow-sm transition-all cursor-default">
  <div className={`w-7 h-7 sm:w-8 sm:h-8 rounded-lg flex items-center justify-center border ${sevColor[c.severity]}`}>
@@ -135,15 +140,15 @@
  <Cpu className="w-4 h-4 text-zinc-400" />
  <p className="text-[13px] font-medium text-zinc-400 uppercase tracking-wider">Technology</p>
  </div>
- <h2 className="text-xl sm:text-2xl font-semibold tracking-tight">Built on 3 production ML models</h2>
+ <h2 className="text-xl sm:text-2xl font-semibold tracking-tight">Built on 6 production AI models</h2>
  <div className="mt-8 grid sm:grid-cols-2 lg:grid-cols-3 gap-3 sm:gap-4">
  {[
- { name: "Legal-BERT Classifier", icon: Cpu, desc: "LoRA fine-tuned on 41 CUAD categories with sigmoid multi-label classification and per-class thresholds", source: "Mokshith31/legalbert-contract-clause-classification" },
- { name: "Legal-BERT NER", icon: Tag, desc: "ML-based named entity recognition for parties, dates, money, jurisdictions with regex augmentation", source: "matterstack/legal-bert-ner" },
- { name: "DeBERTa-v3 NLI", icon: AlertTriangle, desc: "Cross-encoder model for semantic contradiction detection between clause pairs", source: "cross-encoder/nli-deberta-v3-base" },
- { name: "Compliance Engine", icon: ShieldCheck, desc: "GDPR, CCPA, SOX, HIPAA, FINRA checking with negation detection and context snippets", source: "Negation-aware keyword + semantic" },
- { name: "Obligation Tracker", icon: ClipboardList, desc: "Extracts monetary, compliance, reporting, delivery obligations with priority scoring", source: "Context-filtered regex" },
- { name: "Comparison Engine", icon: Layers, desc: "Semantic similarity via sentence-transformers with SequenceMatcher fallback", source: "all-MiniLM-L6-v2" },
+ { name: "Legal-BERT Classifier", icon: Cpu, desc: "LoRA fine-tuned on 41 CUAD categories with sigmoid multi-label classification", source: "Mokshith31/legalbert-contract-clause-classification" },
+ { name: "Legal-BERT NER", icon: Tag, desc: "Named entity recognition for parties, dates, money, jurisdictions", source: "matterstack/legal-bert-ner" },
+ { name: "DeBERTa-v3 NLI", icon: AlertTriangle, desc: "Semantic contradiction detection between clause pairs", source: "cross-encoder/nli-deberta-v3-base" },
+ { name: "RAG Chatbot", icon: MessageSquare, desc: "Embedding retrieval + Qwen2.5-7B LLM for contract Q&A", source: "all-MiniLM-L6-v2 + Qwen/Qwen2.5-7B-Instruct" },
+ { name: "Clause Redlining", icon: PenTool, desc: "18+ legal templates + LLM refinement for safer clause alternatives", source: "FTC/EU/CFPB standards + Qwen2.5-7B" },
+ { name: "docTR OCR", icon: ScanLine, desc: "Smart PDF router: auto-detects scanned PDFs and extracts text", source: "docTR fast_base + crnn_vgg16_bn" },
  ].map((m) => (
  <div key={m.name} className="border border-zinc-100 rounded-xl p-4 hover:border-zinc-200 hover:shadow-sm transition-all">
  <div className="flex items-center gap-2 mb-2">
@@ -211,7 +216,7 @@
  <div className="max-w-6xl mx-auto px-4 sm:px-6 py-8 flex flex-col sm:flex-row justify-between items-center gap-4">
  <div className="flex items-center gap-2">
  <ShieldCheck className="w-4 h-4 text-zinc-300" />
- <span className="text-[13px] text-zinc-400">ClauseGuard v3.0 — not legal advice</span>
+ <span className="text-[13px] text-zinc-400">ClauseGuard v4.0 — not legal advice</span>
  </div>
  <div className="flex gap-5 text-[13px] text-zinc-400">
  <Link href="/privacy" className="hover:text-zinc-600">Privacy</Link>
web/components/nav.tsx CHANGED
@@ -2,7 +2,7 @@
 
  import Link from "next/link";
  import { usePathname } from "next/navigation";
- import { ShieldCheck, Menu, X, Crown, GitCompare } from "lucide-react";
+ import { ShieldCheck, Menu, X, Crown, GitCompare, MessageSquare } from "lucide-react";
  import { useState, useEffect } from "react";
  import { createClient } from "@/lib/supabase/client";
 
@@ -27,7 +27,6 @@ export function Nav() {
  const user = data.user;
  setUserEmail(user?.email || null);
  if (user) {
- // Fetch role from database — no hardcoded emails
  const { data: profile } = await supabase
  .from("profiles")
  .select("role")
@@ -44,7 +43,7 @@
  <Link href="/" className="flex items-center gap-2">
  <ShieldCheck className="w-5 h-5 text-zinc-900" strokeWidth={2.2} />
  <span className="font-semibold text-[15px] tracking-tight text-zinc-900">ClauseGuard</span>
- <span className="hidden sm:inline text-[10px] font-medium text-zinc-400 ml-1 border border-zinc-200 px-1.5 py-0.5 rounded">v3.0</span>
+ <span className="hidden sm:inline text-[10px] font-medium text-zinc-400 ml-1 border border-zinc-200 px-1.5 py-0.5 rounded">v4.0</span>
  </Link>
 
  <div className="hidden md:flex items-center gap-1">