gaurv007 commited on
Commit
28c983c
Β·
verified Β·
1 Parent(s): 52ee555

v4.0: Update README with OCR, Chatbot, and Redlining docs

Browse files
Files changed (1) hide show
  1. README.md +73 -14
README.md CHANGED
@@ -10,9 +10,17 @@ app_file: app.py
10
  pinned: false
11
  ---
12
 
13
- # πŸ›‘οΈ ClauseGuard β€” World's Best Open-Source Legal Contract Analysis
14
 
15
- **ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments.
 
 
 
 
 
 
 
 
16
 
17
  ## ✨ Core Features
18
 
@@ -26,9 +34,12 @@ pinned: false
26
  | **Obligation Tracker** | Categorizes action items: monetary πŸ’°, compliance βš–οΈ, reporting πŸ“Š, delivery πŸ“¦, termination πŸ›‘ |
27
  | **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
28
  | **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
 
 
 
29
 
30
  ### Document Support
31
- - **PDF** parsing via `pdfplumber`
32
  - **DOCX/DOC** parsing via `python-docx`
33
  - **TXT / Markdown** direct text input
34
 
@@ -36,6 +47,8 @@ pinned: false
36
  - **3-Panel Professional Layout** β€” Upload sidebar + Main analysis + Summary dashboard
37
  - **Document Viewer** β€” Inline entity highlights (colored annotations)
38
  - **Clause Cards** β€” Expandable risk-badged cards with confidence scores
 
 
39
  - **Export Reports** β€” JSON (structured) and CSV (tabular) downloads
40
  - **Color-Coded Risk Badges** β€” Instant visual triage
41
 
@@ -44,12 +57,61 @@ pinned: false
44
  | Component | Technology |
45
  |-----------|------------|
46
  | Clause Classification | `Mokshith31/legalbert-contract-clause-classification` β€” LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
47
- | NER | Rule-based with 7 entity types (dates, money, parties, jurisdictions, defined terms) |
48
- | NLI | Heuristic contradiction detection with 5 conflict patterns + missing-clause detection |
 
 
 
49
  | Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
50
- | Comparison | SequenceMatcher-based clause alignment with risk delta analysis |
51
  | Obligations | Regex pattern matching across 5 obligation categories |
52
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
53
  ## πŸ“Š Risk Scoring Methodology
54
 
55
  Risk scores combine clause detection with weighted severity:
@@ -65,16 +127,10 @@ Final score normalized to 0-100 with letter grades:
65
  - D (50-69): High risk
66
  - F (70+): Critical risk
67
 
68
- ## πŸ“š Datasets & Research
69
-
70
- - [CUAD](https://huggingface.co/datasets/theatticusproject/cuad-qa) β€” 510 contracts, 13K annotations, 41 clause categories
71
- - [LegalBench](https://huggingface.co/datasets/nguha/legalbench) β€” 322 legal reasoning tasks
72
- - [LexGLUE](https://huggingface.co/datasets/coastalcph/lex_glue) β€” Unfair Terms of Service classification
73
- - Paper: [CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review](https://arxiv.org/abs/2103.06268) (Hendrycks et al., 2021)
74
-
75
  ## πŸš€ Usage
76
 
77
  1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
 
78
  2. Click **Analyze Contract**
79
  3. View results across tabs:
80
  - **Document**: Full text with inline entity highlights
@@ -83,7 +139,9 @@ Final score normalized to 0-100 with letter grades:
83
  - **Contradictions**: Conflicting clauses and missing provisions
84
  - **Obligations**: Action items categorized by type
85
  - **Compliance**: Regulatory framework checks
 
86
  4. **Export** JSON/CSV reports
 
87
 
88
  ## πŸ”€ Compare Contracts
89
 
@@ -91,7 +149,6 @@ Switch to the **Compare Contracts** tab to:
91
  - Upload or paste two contracts side-by-side
92
  - See clause-level diffs (added, removed, modified)
93
  - Get an alignment score and risk delta
94
- - View raw JSON comparison data
95
 
96
  ## ⚠️ Disclaimer
97
 
@@ -103,6 +160,8 @@ Switch to the **Compare Contracts** tab to:
103
  - [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
104
  - [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
105
  - [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
 
 
106
  - [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)
107
 
108
  ---
 
10
  pinned: false
11
  ---
12
 
13
+ # πŸ›‘οΈ ClauseGuard v4.0 β€” World's Best Open-Source Legal Contract Analysis
14
 
15
+ **ClauseGuard** is the most comprehensive open-source AI-powered legal contract analysis tool. It analyzes contracts using state-of-the-art legal NLP models and provides actionable risk assessments, Q&A chatbot, clause redlining, and OCR for scanned PDFs.
16
+
17
+ ## πŸ†• What's New in v4.0
18
+
19
+ | Feature | Description |
20
+ |---------|-------------|
21
+ | **πŸ” OCR for Scanned PDFs** | Smart PDF router: auto-detects native vs scanned PDFs. Scanned PDFs are processed via docTR OCR engine (CPU-friendly, ~150MB models) |
22
+ | **πŸ’¬ Contract Q&A Chatbot** | RAG-powered chatbot that answers questions about your analyzed contract. Uses sentence-transformers for retrieval + Qwen2.5-7B via HF Inference API for generation |
23
+ | **✏️ Clause Redlining** | 3-tier system: (1) Template lookup from 18+ legal templates based on FTC/EU standards, (2) Keyword-based matching, (3) LLM refinement for CRITICAL/HIGH risk clauses |
24
 
25
  ## ✨ Core Features
26
 
 
34
  | **Obligation Tracker** | Categorizes action items: monetary πŸ’°, compliance βš–οΈ, reporting πŸ“Š, delivery πŸ“¦, termination πŸ›‘ |
35
  | **Compliance Checker** | Validates against GDPR, CCPA, SOX, HIPAA, and FINRA requirements |
36
  | **Contract Comparison** | Side-by-side diff between two contracts with alignment scoring |
37
+ | **Clause Redlining** | Suggests safer alternatives for risky clauses with legal citations |
38
+ | **Q&A Chatbot** | Ask questions about your contract using RAG (Retrieval-Augmented Generation) |
39
+ | **OCR Support** | Process scanned PDFs with docTR OCR engine |
40
 
41
  ### Document Support
42
+ - **PDF** parsing via `pdfplumber` (native) + `docTR` OCR (scanned)
43
  - **DOCX/DOC** parsing via `python-docx`
44
  - **TXT / Markdown** direct text input
45
 
 
47
  - **3-Panel Professional Layout** β€” Upload sidebar + Main analysis + Summary dashboard
48
  - **Document Viewer** β€” Inline entity highlights (colored annotations)
49
  - **Clause Cards** β€” Expandable risk-badged cards with confidence scores
50
+ - **Redlining Tab** β€” Side-by-side original vs suggested safer alternatives
51
+ - **Q&A Chat Tab** β€” Conversational interface to ask questions about the contract
52
  - **Export Reports** β€” JSON (structured) and CSV (tabular) downloads
53
  - **Color-Coded Risk Badges** β€” Instant visual triage
54
 
 
57
  | Component | Technology |
58
  |-----------|------------|
59
  | Clause Classification | `Mokshith31/legalbert-contract-clause-classification` β€” LoRA adapter on `nlpaueb/legal-bert-base-uncased`, fine-tuned on CUAD 41-class taxonomy |
60
+ | Legal NER | `matterstack/legal-bert-ner` (ML) with regex fallback for 7 entity types |
61
+ | NLI | `cross-encoder/nli-deberta-v3-base` (semantic contradiction detection) |
62
+ | Embeddings | `sentence-transformers/all-MiniLM-L6-v2` (384-dim, RAG retrieval) |
63
+ | LLM | `Qwen/Qwen2.5-7B-Instruct` via HF Inference API (chatbot + redlining) |
64
+ | OCR | `docTR` (fast_base + crnn_vgg16_bn) for scanned PDF text extraction |
65
  | Compliance | Regulatory keyword matching across GDPR, CCPA, SOX, HIPAA, FINRA |
66
+ | Comparison | Semantic similarity with sentence embeddings + string matching fallback |
67
  | Obligations | Regex pattern matching across 5 obligation categories |
68
 
69
+ ## πŸ” OCR Architecture (Smart PDF Router)
70
+
71
+ ```
72
+ PDF uploaded
73
+ ↓
74
+ [detect_if_scanned] β€” pdfplumber extracts >50 chars/page?
75
+ ↓ ↓
76
+ Native PDF Scanned PDF
77
+ ↓ ↓
78
+ pdfplumber docTR OCR (CPU)
79
+ ↓ ↓
80
+ Contract text β†’ existing analysis pipeline
81
+ ```
82
+
83
+ ## πŸ’¬ Q&A Chatbot Architecture (RAG)
84
+
85
+ ```
86
+ User asks question about their contract
87
+ ↓
88
+ [1] Embed question with all-MiniLM-L6-v2
89
+ ↓
90
+ [2] Retrieve top-5 most relevant chunks from contract
91
+ ↓
92
+ [3] Build prompt:
93
+ - System: ClauseGuard analysis results (clauses, entities, risk scores)
94
+ - Context: Retrieved contract chunks (≀2.5K tokens)
95
+ - User question
96
+ ↓
97
+ [4] Stream response from Qwen2.5-7B via HF Inference API
98
+ ```
99
+
100
+ **Key design:** Analyzed data (clauses, entities, risk scores) goes in the system prompt β€” NOT through RAG retrieval. Only the raw contract text goes through RAG. This gives the model both structured analysis AND verbatim evidence.
101
+
102
+ ## ✏️ Clause Redlining Architecture (3-Tier)
103
+
104
+ | Tier | Method | Speed | Hallucination Risk |
105
+ |------|--------|-------|--------------------|
106
+ | **1. Template Lookup** | 18+ pre-written safe alternatives based on FTC/EU/CFPB standards | Instant | Zero |
107
+ | **2. Keyword Matching** | Match clause text to relevant templates via legal keywords | Instant | Zero |
108
+ | **3. LLM Refinement** | Qwen2.5-7B adapts template to specific clause context | ~3-5s | Low (template-anchored) |
109
+
110
+ Anti-hallucination guardrails:
111
+ - **Template anchor:** LLM can only refine, not generate from scratch
112
+ - **Legal citation:** Every suggestion includes legal basis and consumer standard
113
+ - **Disclaimer:** Clear "Not legal advice" warning
114
+
115
  ## πŸ“Š Risk Scoring Methodology
116
 
117
  Risk scores combine clause detection with weighted severity:
 
127
  - D (50-69): High risk
128
  - F (70+): Critical risk
129
 
 
 
 
 
 
 
 
130
  ## πŸš€ Usage
131
 
132
  1. **Upload** a contract (PDF, DOCX, or TXT) or paste text directly
133
+ - πŸ’‘ Scanned PDFs are automatically processed with OCR
134
  2. Click **Analyze Contract**
135
  3. View results across tabs:
136
  - **Document**: Full text with inline entity highlights
 
139
  - **Contradictions**: Conflicting clauses and missing provisions
140
  - **Obligations**: Action items categorized by type
141
  - **Compliance**: Regulatory framework checks
142
+ - **Redlining**: ✏️ Safer clause alternatives with legal citations
143
  4. **Export** JSON/CSV reports
144
+ 5. Switch to **πŸ’¬ Contract Q&A** tab to ask questions about your contract
145
 
146
  ## πŸ”€ Compare Contracts
147
 
 
149
  - Upload or paste two contracts side-by-side
150
  - See clause-level diffs (added, removed, modified)
151
  - Get an alignment score and risk delta
 
152
 
153
  ## ⚠️ Disclaimer
154
 
 
160
  - [Clause Classifier Model](https://huggingface.co/Mokshith31/legalbert-contract-clause-classification)
161
  - [Legal-BERT Base](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
162
  - [CUAD Dataset](https://huggingface.co/datasets/theatticusproject/cuad-qa)
163
+ - [Qwen2.5-7B (Chatbot LLM)](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
164
+ - [docTR OCR](https://github.com/mindee/doctr)
165
  - [CUAD Paper (arXiv:2103.06268)](https://arxiv.org/abs/2103.06268)
166
 
167
  ---