# CivicSetu - RAG Techniques Reference

**Version:** 2.3 - Mobile Ledger + Quality Hardening  
**Last Updated:** 2026-05-01

This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.

---

## 1. Current Status Snapshot

As of **2026-05-01**, CivicSetu's RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval quality fixes live.

- **Phase 9 Complete (Mobile Responsive)**
  - Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
  - Interactive Graph Explorer with section drill-down.
- **Cloud Infrastructure Live**
  - Relational & Vector: **Neon (Postgres + pgvector)**
  - Graph: **Neo4j AuraDB**
  - Frontend: **Vercel**
  - Backend API: **Hugging Face Spaces**
- **Live app routes**
  - `POST /api/v1/query` - buffered response
  - `POST /api/v1/query/stream` - SSE token streaming
  - `POST /api/v1/query/section-context` - section-focused chat
  - `/api/v1/graph/*` - graph explorer and section drill-down
- **Session-aware graph**
  - LangGraph uses `session_id` as thread key.
  - Each turn clears retrieval/generation fields but preserves conversation history.
- **Active retrieval routing**
  - `fact_lookup -> vector_retrieval`
  - `cross_reference|penalty_lookup|temporal -> graph_retrieval`
  - `conflict_detection -> hybrid_retrieval`
- **Streaming is now first-class**
  - streaming path reuses classifier, retrieval, and reranker
  - answer text streams first
  - citations and metadata are extracted in a second fast pass
- **Latest eval artifact (0.90 Faithfulness)**
  - `eval_results.json` dated **2026-04-28**
  - `faithfulness=0.900`
  - `answer_relevancy=0.858`
  - `context_precision=0.696`
  - `pass_rate=0.581`
- **Knowledge Graph Scale (as of 2026-05-01)**
  - Documents: `6`
  - Sections: `2,090`
  - Edges: `2,321` (REFERENCES, DERIVED_FROM, HAS_SECTION)
- **Main remaining weakness**
  - multi-jurisdiction retrieval still weak (`MULTI` rows pass only `20%`)
  - context precision for broad fact lookups needs further HNSW tuning

---

## 2. System Overview

CivicSetu is a legal-domain RAG system over five Indian RERA jurisdictions plus cross-jurisdiction queries.

Core problem:

- legal text is structured around sections, rules, sub-clauses, and cross-references
- users ask imprecise natural-language questions
- answers must stay grounded and cite the right legal section

Why plain semantic RAG fails here:

- embeddings blur important legal entities
- user queries often omit exact statute wording
- conflict questions need more than one legal source
- generation models tend to fill gaps unless grounding is strict

---

## 3. Ingestion Pipeline

### 3.1 PDF Parsing

`ingestion/parser.py` uses **PyMuPDF**.

Important guards:

- document-level `max_pages` trims form-heavy tails
- scanned PDF detection avoids unusable OCR-free sources
- metadata stores capped page count, not necessarily total PDF pages
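The guards above can be sketched as a small pure function over already-extracted page texts. This is an illustrative sketch, not the real parser: `ingestion/parser.py` works on PyMuPDF page objects, and the `min_chars_per_page` threshold here is an assumed value.

```python
def apply_parse_guards(page_texts, max_pages, min_chars_per_page=50):
    """Trim form-heavy tails and reject scanned PDFs with no text layer.

    `min_chars_per_page` is an illustrative threshold, not the real one.
    """
    pages = page_texts[:max_pages]  # document-level cap on pages ingested
    avg_chars = sum(len(t) for t in pages) / max(len(pages), 1)
    if avg_chars < min_chars_per_page:
        # scanned PDF detection: near-empty text layer means OCR-free source
        raise ValueError("scanned_pdf_suspected: no usable text layer")
    # metadata stores the capped count, not the total PDF page count
    return pages, len(pages)
```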

### 3.2 Section Boundary Chunking

`ingestion/chunker.py` applies multiple regex families in priority order to detect section and rule boundaries.

Current purpose:

- preserve citation boundaries
- keep section hierarchy intact
- split oversized sections without destroying legal structure

Fallback mode is paragraph chunking on double newlines, logged as `fallback_paragraph_chunking`.
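A minimal sketch of the priority-ordered boundary detection, with fallback to paragraph chunking. The regex families shown here are simplified stand-ins; the real patterns in `ingestion/chunker.py` are more extensive.

```python
import re

# Hypothetical priority-ordered regex families (simplified for illustration).
SECTION_PATTERNS = [
    re.compile(r"^Section\s+\d+[A-Z]?\.", re.MULTILINE),  # "Section 18. ..."
    re.compile(r"^Rule\s+\d+[A-Z]?\.", re.MULTILINE),     # "Rule 4. ..."
    re.compile(r"^\d+\.\s+[A-Z]", re.MULTILINE),          # bare "18. Title"
]

def split_sections(text: str) -> list[str]:
    """Try each regex family in priority order; fall back to paragraphs."""
    for pattern in SECTION_PATTERNS:
        starts = [m.start() for m in pattern.finditer(text)]
        if len(starts) >= 2:  # need at least two boundaries to be useful
            bounds = starts + [len(text)]
            return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # fallback_paragraph_chunking: split on double newlines
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```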

### 3.3 Deterministic Chunk IDs

`chunk_id` is a UUID5 over stable section identity data.

Effect:

- re-ingestion is idempotent
- `ON CONFLICT DO UPDATE` replaces old chunk content
- same legal section does not duplicate across re-runs
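The idea can be shown in a few lines. The namespace constant and the exact identity fields are assumptions for illustration; the real composition lives in the chunker.

```python
import uuid

# Hypothetical namespace; the real one is defined in ingestion code.
CHUNK_NAMESPACE = uuid.NAMESPACE_URL

def make_chunk_id(doc_name: str, section_id: str, part: int = 0) -> str:
    """UUID5 over stable section identity, so re-ingestion reuses the same key
    and `ON CONFLICT DO UPDATE` replaces content instead of duplicating it."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{doc_name}|{section_id}|{part}"))
```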

### 3.4 Section Title Prepended to Embeddings

During embedding, section title is prepended to chunk text.

Reason:

- split sub-sections often lose the title phrase that users actually search for
- title prefix restores semantic recall for questions like "obligations of promoter"

Reranker still reads raw chunk text, not the prefixed text.

### 3.5 Embedding Model

Current defaults from `config/settings.py`:

- `embedding_model = nomic-embed-text`
- `embedding_dimension = 768`

Query and document embeddings use asymmetric prefixes (`search_query: ` vs `search_document: `) compatible with Nomic-style retrieval.
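Combined with the title prefix from §3.4, the two embedding inputs might be built like this (helper names are illustrative, not the real API):

```python
def document_embedding_text(section_title: str, chunk_text: str) -> str:
    """Asymmetric document-side input: Nomic-style prefix + prepended title."""
    return f"search_document: {section_title}\n{chunk_text}"

def query_embedding_text(user_query: str) -> str:
    """Asymmetric query-side input."""
    return f"search_query: {user_query}"
```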

### 3.6 Graph Seeding

`ingestion/graph_seeder.py` populates the Neo4j knowledge graph using data already persisted in PostgreSQL.

Key steps:
- **Idempotent Upsert:** Documents and Sections are merged into Neo4j using UUID5 `chunk_id`.
- **Relationship Extraction:** 
  - `REFERENCES`: `MetadataExtractor` identifies section numbers in text (e.g., "under section 18"). Handles internal and cross-jurisdiction links.
  - `DERIVED_FROM`: Static mapping identifies which State Rule sections derive from which Central Act sections (both at Document and Section level).
- **Execution:** Automatically triggered at the end of `scripts/ingest.py` or manually via `scripts/seed_phase3.py`.

---

## 4. Query Pipeline

### 4.1 Query Classification and Rewriting

`agent/nodes.py::classifier_node` classifies query and rewrites it for retrieval.

Output shape:

```json
{
  "query_type": "fact_lookup | cross_reference | temporal | penalty_lookup | conflict_detection",
  "rewritten_query": "expanded retrieval-friendly query"
}
```

Current route mapping:

| Query Type | Route |
|---|---|
| `fact_lookup` | `vector_retrieval` |
| `cross_reference` | `graph_retrieval` |
| `penalty_lookup` | `graph_retrieval` |
| `temporal` | `graph_retrieval` |
| `conflict_detection` | `hybrid_retrieval` |

Classifier fallback: if JSON parse fails, default to `fact_lookup` with original query.
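The fallback behavior can be sketched as a defensive parse (function name and validation details are illustrative):

```python
import json

VALID_TYPES = {"fact_lookup", "cross_reference", "temporal",
               "penalty_lookup", "conflict_detection"}

def parse_classifier_output(raw: str, original_query: str) -> dict:
    """Fall back to fact_lookup with the original query if parsing fails."""
    try:
        data = json.loads(raw)
        if data.get("query_type") in VALID_TYPES and data.get("rewritten_query"):
            return data
    except (json.JSONDecodeError, TypeError):
        pass
    return {"query_type": "fact_lookup", "rewritten_query": original_query}
```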

### 4.2 LLM Routing and Fallback Chain

All non-streaming LLM calls use `_llm_call()`. Streaming uses `_llm_stream()`.

Current model chain:

```text
THINKING tier (Generator)
1. gemini/gemini-1.5-flash
2. groq/llama-3.3-70b-versatile
3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7

FAST tier (Classifier/Validator)
1. gemini/gemini-1.5-flash
```

Provider notes:

- NVIDIA-hosted models (Minimax, GLM) use `https://integrate.api.nvidia.com/v1`
- `temperature=0.0` for all grounding tasks
- Gemini models use `temperature=1.0` only where provider requirements mandate it for certain tiers
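The fallback chain can be sketched as ordered attempts. The provider callables below are stand-ins for the real model clients (Gemini, Groq, NVIDIA NIM); this is not the actual `_llm_call()` implementation.

```python
def llm_call_with_fallback(prompt, providers):
    """Try each provider in order; return the first successful completion.

    `providers` is an ordered list of callables standing in for real clients.
    """
    last_err = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as err:  # rate limit, outage, malformed response
            last_err = err
    raise RuntimeError("all providers in the chain failed") from last_err
```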

---

## 5. Hybrid Retrieval

Hybrid retrieval combines vector similarity and PostgreSQL full-text search, then expands section families.

### 5.1 Vector Similarity Search

Used to catch semantic matches when wording differs from statute text.

### 5.2 Full-Text Search

Used for exact legal wording, section numbers, and important terms via `websearch_to_tsquery`.

### 5.3 Reciprocal Rank Fusion

Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
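A minimal sketch of the fusion step, using the standard RRF formula with the conventional `k=60` (the constant actually used in the retriever is not stated here):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked chunk-id lists; score(id) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```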

### 5.4 Section-ID-Aware Direct Lookup

If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and **pins** them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.
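The extraction-and-pin step might look like this (the regex and chunk shape are assumptions; the real lookup hits an indexed Postgres query):

```python
import re

SECTION_REF = re.compile(r"\b(?:section|rule)\s+(\d+[A-Z]?)\b", re.IGNORECASE)

def extract_section_ids(query: str) -> list[str]:
    """Pull explicit section/rule numbers out of the user query."""
    return SECTION_REF.findall(query)

def pin_exact_sections(query, retrieved, lookup_by_section):
    """Pin directly-looked-up sections to the top of the retrieval list."""
    pinned = []
    for sid in extract_section_ids(query):
        pinned.extend(lookup_by_section(sid))
    pinned_ids = {c["chunk_id"] for c in pinned}
    rest = [c for c in retrieved if c["chunk_id"] not in pinned_ids]
    return pinned + rest
```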

### 5.5 Central Act Supplementation

For queries filtered by a specific State Jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the **Central RERA Act 2016**. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.

---

## 6. Graph-Based Retrieval

Used for section-centric questions and legal relationships.

Current behavior:

- extract section or rule IDs from query
- traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
- hydrate matching sections back from Postgres
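The traversal step could be expressed as a one-hop Cypher expansion along both relationship types. Label and property names below are assumptions, not the exact schema produced by `ingestion/graph_seeder.py`:

```python
# Hypothetical Cypher for one-hop expansion; returned chunk_ids are then
# hydrated back from Postgres.
EXPAND_SECTION = """
MATCH (s:Section {section_id: $section_id, doc_name: $doc_name})
OPTIONAL MATCH (s)-[:REFERENCES]->(ref:Section)
OPTIONAL MATCH (s)-[:DERIVED_FROM]->(src:Section)
RETURN s.chunk_id AS seed,
       collect(DISTINCT ref.chunk_id) AS referenced,
       collect(DISTINCT src.chunk_id) AS derived_from
"""
```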

Graph retrieval is especially important for:

- explicit section lookups
- penalty questions
- central vs state derivation paths

Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.

---

## 7. Reranking

### 7.1 Cross-Encoder

`retrieval/reranker.py` uses FlashRank (`ms-marco-MiniLM-L-12-v2`).

Pipeline:

1. deduplicate by `(section_id, doc_name)`
2. split pinned vs rankable chunks
3. rerank rankable chunks with cross-encoder
4. filter by minimum score (0.05)
5. apply score-gap cutoff (0.95)
6. prepend pinned chunks
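The post-scoring part of the pipeline can be sketched as a pure function over chunks and cross-encoder scores. The exact score-gap rule is an assumption (here: cut once a score drops below `gap_ratio` times the previous one); the real rule lives in `retrieval/reranker.py`.

```python
def assemble_reranked(chunks, scores, min_score=0.05, gap_ratio=0.95):
    """Dedupe, split pinned vs rankable, filter, gap-cut, prepend pinned."""
    # 1. deduplicate by (section_id, doc_name)
    seen, deduped = set(), []
    for c in chunks:
        key = (c["section_id"], c["doc_name"])
        if key not in seen:
            seen.add(key)
            deduped.append(c)
    # 2. split pinned vs rankable
    pinned = [c for c in deduped if c.get("pinned")]
    rankable = [c for c in deduped if not c.get("pinned")]
    # 3-4. order by cross-encoder score, drop chunks below the floor
    rankable = sorted((c for c in rankable if scores[c["chunk_id"]] >= min_score),
                      key=lambda c: scores[c["chunk_id"]], reverse=True)
    # 5. score-gap cutoff (assumed rule: stop on a sharp relative drop)
    kept = []
    for c in rankable:
        if kept and scores[c["chunk_id"]] < gap_ratio * scores[kept[-1]["chunk_id"]]:
            break
        kept.append(c)
    # 6. prepend pinned chunks
    return pinned + kept
```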

### 7.2 Context Assembly

Max context size is **7 chunks**. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.

---

## 8. Generation

### 8.1 Buffered Generation

`generator_node()` builds a numbered context block and asks for JSON output.

### 8.2 Streaming Generation

`stream_generator_node()` now drives SSE output:

1. Run classification, retrieval, and reranking.
2. Stream answer tokens immediately.
3. Run a second fast metadata-extraction prompt.
4. Push metadata/citations as the final SSE event.
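The token-first, metadata-last ordering can be sketched as a generator (event names and the metadata extractor are illustrative, not the real SSE schema):

```python
def stream_events(token_iter, extract_metadata):
    """Yield SSE-style events: answer tokens first, then one metadata event.

    `extract_metadata` stands in for the second fast extraction pass over
    the completed answer text.
    """
    answer = []
    for tok in token_iter:
        answer.append(tok)
        yield {"event": "token", "data": tok}
    yield {"event": "metadata", "data": extract_metadata("".join(answer))}
```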

### 8.3 Tone Hints by Query Type

| Type | Tone Guidance |
|---|---|
| `fact_lookup` | Direct, no metaphors, cite per bullet. |
| `penalty_lookup` | Lead with consequence/penalty. |
| `cross_reference` | Explain primary section, then connections. |
| `conflict_detection` | Flag contradiction ONLY if both sides are in context. |
| `temporal` | Lead with exact numeric deadline/time. |

---

## 9. Validation

### 9.1 Validator Design

`validator_node()` treats `confidence_score < 0.2` as a hallucination risk.
- Returns `hallucination_flag: True` if score is below floor.
- Graph triggers a **retry** (up to 2 times) with different retrieval parameters if flagged.
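The retry loop might be sketched as follows; `run_once` stands in for a full graph pass, and how retrieval parameters vary per attempt is left abstract:

```python
def validate_with_retry(run_once, max_retries=2, floor=0.2):
    """Retry generation while confidence stays below the hallucination floor."""
    result = run_once(attempt=0)
    attempt = 0
    while result["confidence_score"] < floor and attempt < max_retries:
        attempt += 1
        result = run_once(attempt=attempt)  # varied retrieval parameters
    result["hallucination_flag"] = result["confidence_score"] < floor
    return result
```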

### 9.2 Output Guardrails

`guardrails/output_guard.py`:
- Intercepts low-confidence or safe-guard failures.
- Returns `InsufficientInfoResponse` when grounding is weak.
- Appends legal disclaimer.

---

## 10. RAGAS Evaluation Pipeline

### 10.1 Two-Phase Architecture

- **Phase 1:** Graph invocation -> `eval_phase1_results.json`.
- **Phase 2:** RAGAS scoring -> `eval_results.json`.

### 10.2 Dataset & Metrics

- **Rows:** 31 (Central, 4 States, Multi-Jurisdiction).
- **Primary Metrics:** Faithfulness, Answer Relevancy, Context Precision.
- **Goal:** Faithfulness > 0.85; Answer Relevancy > 0.80.

---

## 11. Known Failure Modes

- **Multi-Jurisdiction Retrieval:** Reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
- **Large Context Noise:** 7 chunks sometimes include irrelevant sub-clauses that distract the generator.

---

## 12. Implementation Checklist

- [x] Add `DocumentSpec` to registry.
- [x] Verify PDF text extraction.
- [x] Run `make ingest`.
- [x] Seed Neo4j graph.
- [x] Run `make eval-smoke` to verify precision.