# CivicSetu β€” Low Level Design (LLD)

**Version:** 2.0.0 β€” Phase 8 Complete (RAGAS Evaluation + Retrieval Improvements)
**Live:** https://civicsetu-two.vercel.app
**Last Updated:** April 2026

---

## 1. Module Map

```
src/civicsetu/
β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ settings.py           Pydantic BaseSettings singleton (lru_cache)
β”‚   └── document_registry.py  All document URLs + metadata (single source of truth)
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ enums.py              StrEnum: Jurisdiction, DocType, QueryType, etc.
β”‚   └── schemas.py            Pydantic models: LegalChunk, Citation, RetrievedChunk, CivicSetuResponse
β”œβ”€β”€ ingestion/
β”‚   β”œβ”€β”€ downloader.py         httpx PDF downloader with MD5 cache check
β”‚   β”œβ”€β”€ parser.py             PyMuPDF text extractor β€” max_pages cap, scanned PDF detection
β”‚   β”œβ”€β”€ chunker.py            Section-boundary regex chunker β€” 6 format patterns + fallback
β”‚   β”œβ”€β”€ metadata_extractor.py Date/Section/Rule reference/amendment regex extraction
β”‚   β”œβ”€β”€ embedder.py           nomic-embed-text-v1.5 via sentence-transformers β€” truncate at 4000 chars pre-prefix
β”‚   β”œβ”€β”€ pipeline.py           Orchestrates ingestion; prepends section_title to embeddings
β”‚   └── graph_seeder.py       Post-ingestion REFERENCES + DERIVED_FROM edge seeding
β”œβ”€β”€ stores/
β”‚   β”œβ”€β”€ relational_store.py   Async SQLAlchemy β€” documents + legal_chunks tables
β”‚   β”œβ”€β”€ vector_store.py       pgvector HNSW cosine search
β”‚   └── graph_store.py        Neo4j Cypher interface β€” fresh driver per call
β”œβ”€β”€ retrieval/
β”‚   β”œβ”€β”€ vector_retriever.py   Wraps VectorStore for agent use
β”‚   β”œβ”€β”€ graph_retriever.py    REFERENCES + DERIVED_FROM traversal, Section/Rule ID extraction
β”‚   └── reranker.py           FlashRank cross-encoder wrapper
β”œβ”€β”€ agent/
β”‚   β”œβ”€β”€ state.py              CivicSetuState TypedDict (frozen contract)
β”‚   β”œβ”€β”€ nodes.py              Pure functions: classifier, _rrf_retrieve (shared hybrid),
β”‚   β”‚                         vector_retrieval, graph_retrieval, hybrid_retrieval,
β”‚   β”‚                         reranker, generator, validator
β”‚   β”œβ”€β”€ edges.py              Conditional routing: route_after_classifier,
β”‚   β”‚                         route_after_validator
β”‚   └── graph.py              StateGraph assembly + get_compiled_graph()
β”œβ”€β”€ prompts/
β”‚   β”œβ”€β”€ classifier.py         Query type classification + rewriting prompt
β”‚   β”œβ”€β”€ generator.py          Cited answer generation prompt
β”‚   └── validator.py          Hallucination + confidence check prompt
β”œβ”€β”€ guardrails/
β”‚   β”œβ”€β”€ input_guard.py        PII detection + off-topic filter
β”‚   └── output_guard.py       Faithfulness check + disclaimer injection
└── api/
    β”œβ”€β”€ main.py               FastAPI app factory + lifespan (graph pre-compiled)
    β”œβ”€β”€ routes/
    β”‚   β”œβ”€β”€ health.py         GET /health β€” DB ping
    β”‚   β”œβ”€β”€ query.py          POST /api/v1/query β€” main RAG endpoint
    β”‚   └── ingest.py         POST /api/v1/ingest β€” admin endpoint
    └── middleware/
        └── logging.py        Request/response structured logging

eval/
β”œβ”€β”€ golden_dataset.jsonl      31-row RAGAS evaluation dataset across 5 jurisdictions
scripts/
β”œβ”€β”€ run_eval.py               Two-phase RAGAS evaluation: Phase 1 (graph invoke) + Phase 2 (RAGAS scoring)

frontend/                     Next.js 15 App Router β€” deployed on Vercel
β”œβ”€β”€ src/app/
β”‚   β”œβ”€β”€ layout.tsx            Root layout: ThemeProvider + dark mode
β”‚   β”œβ”€β”€ page.tsx              Main page: wires all components together
β”‚   └── globals.css           Tailwind directives + gradient utilities
β”œβ”€β”€ src/components/
β”‚   β”œβ”€β”€ Header.tsx            Logo, new chat, theme toggle, GitHub link
β”‚   β”œβ”€β”€ ChatThread.tsx        Scrollable message list + empty state examples
β”‚   β”œβ”€β”€ MessageBubble.tsx     User/assistant/error bubbles with badges + citations
β”‚   β”œβ”€β”€ ConfidenceBadge.tsx   HIGH/MEDIUM/LOW pill
β”‚   β”œβ”€β”€ CitationsPanel.tsx    Collapsible citation cards
β”‚   └── InputBar.tsx          Auto-resize textarea, jurisdiction select, send
β”œβ”€β”€ src/hooks/
β”‚   └── useChat.ts            Chat state, session_id localStorage, sendMessage
└── src/lib/
    β”œβ”€β”€ types.ts              TypeScript interfaces (mirrors backend Pydantic models)
    └── api.ts                queryRera() fetch wrapper β†’ /api/v1/query
```

---

## 2. Database Schema

### PostgreSQL Tables

```sql
documents (
    doc_id          UUID PRIMARY KEY,
    doc_name        TEXT,
    jurisdiction    TEXT,   -- Jurisdiction enum value
    doc_type        TEXT,   -- DocType enum value  (stored uppercase: ACT, RULES, CIRCULAR)
    source_url      TEXT,
    effective_date  DATE,
    gazette_number  TEXT,
    total_chunks    INTEGER,
    ingested_at     TIMESTAMPTZ,
    is_active       BOOLEAN
)

legal_chunks (
    chunk_id            UUID PRIMARY KEY,
    doc_id              UUID β†’ documents.doc_id,
    jurisdiction        TEXT,
    doc_type            TEXT,
    doc_name            TEXT,
    section_id          TEXT,   -- "18", "3(2)", "Para-3"
    section_title       TEXT,
    section_hierarchy   TEXT[], -- ["RERA Act 2016", "18"]
    text                TEXT,
    effective_date      DATE,
    superseded_by       UUID β†’ legal_chunks.chunk_id,
    status              TEXT,   -- ChunkStatus enum value
    source_url          TEXT,
    page_number         INTEGER,
    embedding           vector(768)  -- HNSW indexed
)
```

### pgvector Index

```sql
CREATE INDEX legal_chunks_embedding_idx
    ON legal_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

`m=16` β€” 16 connections per node. `ef_construction=64` β€” 64 candidates during index build.
Tuned for recall/speed balance at <10K vectors. Revisit at 100K+.

### Neo4j Graph Schema

```
Nodes:
  (:Document {doc_id, doc_name, jurisdiction, doc_type, effective_date})
  (:Section  {section_id, title, chunk_id, jurisdiction, doc_name, is_active})

Edges:
  (:Document)-[:HAS_SECTION]->(:Section)
  (:Section) -[:REFERENCES]->(:Section)       -- intra + cross-jurisdiction citations
  (:Section) -[:DERIVED_FROM]->(:Section)     -- State Rule N β†’ RERA Act Sec M
  (:Document)-[:DERIVED_FROM]->(:Document)    -- State Rules β†’ RERA Act 2016

Planned (Phase 7+):
  (:Section) -[:SUPERSEDES]->(:Section)
  (:Section) -[:AMENDED_BY]->(:Amendment)
  (:Section) -[:CONFLICTS_WITH]->(:Section)
```

**Live graph stats (Phase 6):**

| Metric       | Count |
|--------------|-------|
| Documents    | 9     |
| Sections     | 2090  |
| HAS_SECTION  | 1297  |
| REFERENCES   | 933   |
| DERIVED_FROM | 91    |

---

## 3. Document Registry

`document_registry.py` β€” single source of truth for all ingested documents.

```python
@dataclass(frozen=True)
class DocumentSpec:
    name: str
    url: str
    jurisdiction: Jurisdiction
    doc_type: DocType
    effective_date: date | None
    filename: str
    dest_subdir: str
    max_pages: int | None = None  # None = all pages; cap excludes forms/schedules appendices
```

### Ingested Documents (Phase 6)

| Key | Document | Jurisdiction | DocType | Chunks | max_pages |
|---|---|---|---|---|---|
| `rera_act_2016` | RERA Act 2016 | CENTRAL | ACT | ~224 | None |
| `mahrera_rules_2017` | MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | None |
| `up_rera_rules_2016` | UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 24 |
| `up_rera_general_regulations_2019` | UP RERA General Regulations 2019 | UTTAR_PRADESH | CIRCULAR | 85 | None |
| `karnataka_rera_rules_2017` | Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 37 |
| `tn_rera_rules_2017` | Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 15 |

**PDF source notes:**
- Karnataka official PDF (`rera.karnataka.gov.in`) is fully scanned (19MB image) β€” NAREDCO mirror used
- TN PDF bundles rules + forms (101 pages); `max_pages=15` excludes Forms A–O
- UP Rules PDF bundles rules + forms (52 pages); `max_pages=24` excludes prescribed forms

---

## 4. LangGraph State Machine

### State Contract (`agent/state.py`)

```python
class CivicSetuState(TypedDict):
    # Input
    query: str
    session_id: Optional[str]
    jurisdiction_filter: Optional[Jurisdiction]
    top_k: int

    # Classification
    query_type: Optional[QueryType]
    rewritten_query: Optional[str]

    # Retrieval β€” Annotated[list, operator.add] enables parallel node merging
    retrieved_chunks: Annotated[list[RetrievedChunk], operator.add]
    reranked_chunks: list[RetrievedChunk]

    # Generation
    raw_response: Optional[str]
    citations: list[Citation]
    confidence_score: float
    conflict_warnings: list[str]
    amendment_notice: Optional[str]

    # Control
    retry_count: int          # max 2 retries
    hallucination_flag: bool
    error: Optional[str]
```

### RetrievedChunk Schema (`models/schemas.py`)

```python
class RetrievedChunk(BaseModel):
    chunk: LegalChunk
    vector_score: float | None = None
    rerank_score: float | None = None
    retrieval_source: str = "vector"   # "vector" | "graph"
    graph_path: Optional[str] = None   # e.g. "source:18@CENTRAL"
    is_pinned: bool = False            # True = exact source section, bypasses reranker sort
```

### Node Responsibilities

| Node | Input Keys | Output Keys | LLM Call |
| :-- | :-- | :-- | :-- |
| classifier | query | query_type, rewritten_query | Yes |
| vector_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| graph_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| reranker | retrieved_chunks, query | reranked_chunks | No |
| generator | reranked_chunks, query | raw_response, citations, confidence_score | Yes |
| validator | raw_response, reranked_chunks | hallucination_flag, confidence_score | Yes |
| retry | retry_count | retry_count+1, cleared retrieval fields | No |

### Routing Logic

| query_type (classifier) | route_after_classifier target            |
|-------------------------|------------------------------------------|
| fact_lookup             | vector_retrieval (RRF hybrid)            |
| cross_reference         | graph_retrieval (→ RRF fallback)         |
| penalty_lookup          | graph_retrieval (→ RRF fallback)         |
| temporal                | graph_retrieval (→ RRF fallback)         |
| conflict_detection      | hybrid_retrieval (RRF across jurisdictions) |

```
validator β†’ route_after_validator:
    confidence >= 0.5 AND not hallucinated β†’ END
    (confidence < 0.5 OR hallucinated) AND retry_count < 2 β†’ retry β†’ classifier
    (confidence < 0.5 OR hallucinated) AND retry_count >= 2 β†’ END (low confidence answer)
```
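The validator routing above can be sketched as a pure function (a minimal sketch with a hypothetical flat signature; the real `route_after_validator` in `agent/edges.py` reads these fields from `CivicSetuState`):

```python
MAX_RETRIES = 2

def route_after_validator(confidence: float, hallucinated: bool, retry_count: int) -> str:
    """Return the next graph target: 'END' or 'retry'."""
    if confidence >= 0.5 and not hallucinated:
        return "END"        # answer accepted
    if retry_count < MAX_RETRIES:
        return "retry"      # retry node increments retry_count, loops back to classifier
    return "END"            # give up: emit the low-confidence answer

print(route_after_validator(0.9, False, 0))  # END
print(route_after_validator(0.3, False, 1))  # retry
print(route_after_validator(0.3, True, 2))   # END
```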

---

## 5. Chunking Strategy

### Section Boundary Detection

Six regex patterns for `DocType.RULES`, tried in order (first match wins per line):

| # | Pattern | Format | Jurisdiction |
|---|---|---|---|
| 1 | `\n(?P<id>\d{1,2}[A-Z]?)\.\s*\n(?P<title>...)` | Newline-dot-newline | MahaRERA |
| 2 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...)\.?β€”` | Same-line em-dash | MahaRERA |
| 3 | `^Rule\s+(?P<id>\d{1,2}[A-Z]?)\s*[.\-–]\s*(?P<title>...)` | Explicit Rule prefix | Generic |
| 4 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...?)\.–` | ASCII hyphen `.-` | Karnataka, Tamil Nadu |
| 5 | `(?P<id>\d{1,2}[A-Z]?)-\(1\)\s*\n(?P<title>...)` | `N-(1)\nTitle` | UP RERA multi-clause |
| 6 | `(?P<id>\d{1,2}[A-Z]?)-(?!\()\s*\n(?P<title>...)` | `N-\nTitle` | UP RERA single-clause |

`DocType.ACT` uses a separate pattern set. Fallback: paragraph split on double newlines.
Rule IDs capped at `\d{1,2}` (max 2 digits) β€” prevents year strings like `2016` matching as rule IDs.
Logs `no_section_boundaries_found` + `fallback_paragraph_chunking` when falling back.

### Chunk Size Limits

```
MIN_CHARS = 100   β€” discard fragments (headers, page numbers)
MAX_CHARS = 1500  β€” split large sections at subsection markers (1), (2), (a), (b)
```

### Split Priority for Large Sections

```
1. Subsection markers: \n\s*\((?:\d+|[a-z]{1,3})\)\s+
2. Sentence boundary near MAX_CHARS: rfind('. ')
3. Hard cut at MAX_CHARS (last resort)
```
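The split priority can be sketched as a recursive helper (a hypothetical standalone function; the real implementation lives in `ingestion/chunker.py` and may keep the subsection markers attached):

```python
import re

MAX_CHARS = 1500
SUBSECTION_RE = re.compile(r"\n\s*\((?:\d+|[a-z]{1,3})\)\s+")

def split_large_section(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split an oversized section: subsection markers first, then a sentence
    boundary near max_chars, then a hard cut as last resort."""
    if len(text) <= max_chars:
        return [text]
    # 1. Prefer subsection markers like (1), (2), (a)
    parts = [p.strip() for p in SUBSECTION_RE.split(text) if p.strip()]
    if len(parts) > 1:
        return parts
    # 2. Sentence boundary nearest to max_chars
    cut = text.rfind(". ", 0, max_chars)
    if cut != -1:
        return [text[:cut + 1].strip()] + split_large_section(text[cut + 1:].strip(), max_chars)
    # 3. Hard cut (last resort)
    return [text[:max_chars]] + split_large_section(text[max_chars:], max_chars)
```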

### parser.py β€” max_pages cap

```python
@staticmethod
def parse(source: str | Path, max_pages: int | None = None) -> ParsedDocument:
    all_pages = list(doc)
    if max_pages is not None:
        all_pages = all_pages[:max_pages]   # slice before fulltext build
```

---

## 6. Embedding Strategy

**Model:** `nomic-embed-text-v1.5` (via `sentence-transformers`, local β€” no Ollama required)
**Dimension:** 768
**Asymmetric prefixes** (MTEB/nomic-embed requirement):

```
Ingestion time:  "search_document: {section_title}\n{text}"  β†’ pipeline.py
Query time:      "search_query: {rewritten_query}"            β†’ retrieval/__init__.py
```

**Section title prepend (Phase 8 change):** `pipeline.py` prepends `section_title` to the
embedded text so sub-chunks (e.g. `S.11(2)`) retain their section context.
Without this, sub-chunks embed without "Obligations of promoter" β€” cosine similarity misses them.
The reranker still receives raw `chunk.text` (no title prefix).

Using the wrong prefix at query time causes roughly 10–15% recall degradation.

### Truncation Guard

```python
MAX_EMBED_CHARS = 4000   # ~1000 tokens β€” safe ceiling before prefix added

def embed_document(self, text: str) -> list[float]:
    if len(text) > MAX_EMBED_CHARS:
        log.warning("embedding_truncated", original_len=len(text), truncated_to=MAX_EMBED_CHARS)
        text = text[:MAX_EMBED_CHARS]
    prefixed = f"search_document: {text.strip()}"  # prefix AFTER truncation
    return self.embed_one(prefixed)
```

Truncation happens **before** the prefix is added — this prevented 500 errors from the earlier
Ollama-served embedder on Tamil Nadu and other gazette PDFs where sub-sections exceed 10K chars.

---

## 7. Hybrid Retrieval β€” `_rrf_retrieve()`

All retrieval nodes share a single async helper `_rrf_retrieve()` in `agent/nodes.py`.

### Reciprocal Rank Fusion (RRF)

```python
RRF_K = 60   # standard constant

rrf_score(chunk) = 1/(K + rank_in_vector) + 1/(K + rank_in_fts)
```

Fetches `top_k Γ— 3` vector results and `top_k Γ— 2` FTS results, deduplicates by `chunk_id`,
merges via RRF, returns top `top_k Γ— 2`.
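A self-contained sketch of the fusion step (hypothetical `rrf_merge` helper operating on ranked chunk-id lists; `_rrf_retrieve()` additionally handles the over-fetching and dedup described above):

```python
RRF_K = 60  # standard constant

def rrf_merge(vector_ids: list[str], fts_ids: list[str], top_n: int) -> list[str]:
    """Fuse two ranked chunk-id lists via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for rank, cid in enumerate(vector_ids):  # rank 0 = best hit
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (RRF_K + rank + 1)
    for rank, cid in enumerate(fts_ids):
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (RRF_K + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A chunk ranked in BOTH lists beats a chunk that appears in only one:
print(rrf_merge(["a", "b", "c"], ["b", "d"], top_n=3))  # ['b', 'a', 'd']
```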

### Full-Text Search

`VectorStore.full_text_search()` uses `websearch_to_tsquery` in OR mode:

```sql
WHERE to_tsvector('english', text) @@ websearch_to_tsquery('english', :query)
ORDER BY ts_rank(to_tsvector('english', text), websearch_to_tsquery('english', :query)) DESC
```

Changed from `plainto_tsquery` (AND-mode) β€” AND required all query words to match,
excluding relevant sections that matched most but not all words.

### Section Family Expansion

After RRF merge, top-3 results trigger family expansion:

```python
for rc in merged[:3]:
    base_sid = re.sub(r'\([^)]*\)$', '', section_id).strip()  # "5(4)" β†’ "5"
    family = await VectorStore.get_section_family(section_id=base_sid, jurisdiction=jur)
    # returns all chunks where section_id = '5' OR section_id LIKE '5(%'
```

`get_section_family` guard: skips if `section_id` already contains `(` (base_sid computation
strips this before calling). Hard cap: `_MAX_VECTOR_EXPANDED = 40` chunks before reranker.

**Why top-3 not top-1:** If top-1 RRF result is a sub-section (`S.5(4)`), its parent
family is expanded. But if the truly relevant parent section (`S.11`) appears at RRF rank 2,
only expanding top-1 misses it. Expanding top-3 covers more cases at the cost of a slightly
larger pool.

---

## 7b. Reranker Detail

`reranker_score_threshold = 0.1` β€” minimum cross-encoder score to enter candidate pool.
`reranker_score_gap = 0.6` β€” gap filter cliff threshold.

**Gap filter:**

```python
def _apply_score_gap(chunks, gap=0.6):
    for i in range(1, len(chunks)):
        if chunks[i-1].rerank_score - chunks[i].rerank_score >= gap:
            return chunks[:i]
    return chunks
```

**Threshold history:** Originally `threshold=0.3, gap=0.35`. Gap=0.35 was too aggressive β€”
cut chunks with 0.36 score drop, leaving only 1 context for generator. Raised to 0.6 (Phase 8).

Final context: `pinned_chunks + gap_filtered[:max(0, 5 - len(pinned))]` β†’ max 5 chunks.
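A minimal self-contained demo of the gap filter plus pinning assembly (the `Chunk` dataclass stands in for `RetrievedChunk`; `reranked` is assumed sorted by rerank score descending):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    name: str
    rerank_score: float
    is_pinned: bool = False

def apply_score_gap(chunks: list[Chunk], gap: float = 0.6) -> list[Chunk]:
    """Cut the ranked list at the first score cliff >= gap."""
    for i in range(1, len(chunks)):
        if chunks[i - 1].rerank_score - chunks[i].rerank_score >= gap:
            return chunks[:i]
    return chunks

def final_context(pinned: list[Chunk], reranked: list[Chunk], max_total: int = 5) -> list[Chunk]:
    """Pinned chunks bypass the gap filter; gap-filtered chunks fill the remaining slots."""
    kept = apply_score_gap(reranked)
    return pinned + kept[: max(0, max_total - len(pinned))]

reranked = [Chunk("a", 0.98), Chunk("b", 0.90), Chunk("c", 0.20)]  # 0.70 cliff after "b"
ctx = final_context([Chunk("pin", 1.0, is_pinned=True)], reranked)
print([c.name for c in ctx])  # ['pin', 'a', 'b']
```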

---

## 8. Graph Retriever

`graph_retriever.py` β€” called on `cross_reference`, `penalty_lookup`, `temporal` query types.

### Section ID Extraction

```python
section_pattern = re.compile(r'\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b', re.IGNORECASE)
rule_pattern    = re.compile(r'\bRule\s+(\d+[A-Z]?)\b', re.IGNORECASE)
```
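For example, running the two patterns over a query:

```python
import re

section_pattern = re.compile(r'\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b', re.IGNORECASE)
rule_pattern    = re.compile(r'\bRule\s+(\d+[A-Z]?)\b', re.IGNORECASE)

query = "Does Rule 9 of the Karnataka rules implement Section 13 or S. 18A?"
print(section_pattern.findall(query))  # ['13', '18A']
print(rule_pattern.findall(query))     # ['9']
```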

### Traversal Strategy (per jurisdiction)

For each jurisdiction (`CENTRAL`, `MAHARASHTRA`, `UTTAR_PRADESH`, `KARNATAKA`, `TAMIL_NADU`):

```
1. Source section chunks    β€” exact section_id match β†’ is_pinned=True
2. REFERENCES outgoing      β€” sections source cites (depth=2)
3. REFERENCES incoming      β€” sections that cite source
4. DERIVED_FROM outgoing    β€” Act sections this Rule derives from
5. DERIVED_FROM incoming    β€” Rule sections implementing this Act section
```

### Pinning Rule

Only the exact `section_id` match gets `is_pinned=True`. Sub-sections are NOT pinned.
Max pinned chunks: 2 (one per jurisdiction). Remaining 3 slots filled by reranker.

---

## 9. Response Contract

```python
class CivicSetuResponse(BaseModel):
    answer: str                    # plain English, cites section numbers
    citations: list[Citation]      # min_length=1 — NEVER empty
    confidence_score: float        # 0.0–1.0
    confidence_level: str          # "high" / "medium" / "low"
    query_type_resolved: QueryType
    conflict_warnings: list[str]   # empty until Phase 7
    amendment_notice: Optional[str]
    disclaimer: str                # always present

class Citation(BaseModel):
    section_id: str
    doc_name: str
    jurisdiction: Jurisdiction
    effective_date: Optional[date]
    source_url: str
    chunk_id: UUID

---

## 9b. Error Handling

| Scenario | Behaviour |
| :-- | :-- |
| LLM provider rate limited | LiteLLM auto-rotates to next provider |
| All LLM providers fail | `RuntimeError` β†’ FastAPI 500 |
| No chunks retrieved | `InsufficientInfoResponse` returned |
| Hallucination detected | retry (max 2x) β†’ low confidence answer |
| DB unreachable | `/health` returns `degraded`, query returns 500 |
| Scanned PDF detected | Warning logged, fallback URL used (Karnataka) |
| Section patterns not matched | Fallback paragraph chunking, warning logged |
| Neo4j event loop mismatch | Prevented β€” `_get_driver()` creates fresh driver per call |
| Embedding input too long | Truncated at 4000 chars before prefix; warning logged |
| max_pages exceeded | Parser silently caps pages; total_pages reflects capped count |

---

## 10. Neo4j Graph β€” Phase 6 State (Current)

**Nodes:** 9 Documents, 2090 Sections
**Edges:** 1297 HAS_SECTION, 933 REFERENCES, 91 DERIVED_FROM

### Documents in Graph

| Document | Jurisdiction | DocType | Chunks | Sections | DERIVED_FROM edges |
|---|---|---|---|---|---|
| RERA Act 2016 | CENTRAL | ACT | ~224 | ~224 | β€” |
| MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | ~214 | 17 sec + 1 doc |
| UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 33 | 11 sec + 1 doc |
| UP RERA General Regs 2019 | UTTAR_PRADESH | CIRCULAR | 85 | 53 | β€” |
| Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 45 | 15 sec + 1 doc |
| Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 36 | 15 sec + 1 doc |

### Known Open Issues (non-blocking)

| Issue | Affected | Root Cause |
|---|---|---|
| Act Β§13 missing from graph | UP rule 14, KA rule 11, TN rule 11 | RERA Act ingestion β€” Β§13 chunked under different ID |
| Act Β§66 missing from graph | KA rule 19, TN rule 19 | RERA Act ingestion β€” Β§66 not ingested |

### DERIVED_FROM Map Summary

| Jurisdiction | Mapped pairs | Resolved | Unresolved |
|---|---|---|---|
| MAHARASHTRA | 17 | 17 | 0 |
| UTTAR_PRADESH | 15 | 11 | 4 |
| KARNATAKA | 17 | 15 | 2 |
| TAMIL_NADU | 17 | 15 | 2 |

### PDF Source Decisions

| Jurisdiction | Primary URL | Issue | Resolution |
|---|---|---|---|
| CENTRAL | indiacode.nic.in | β€” | β€” |
| MAHARASHTRA | naredco.in | β€” | β€” |
| UTTAR_PRADESH | up-rera.in/pdf/rera.pdf | pages 25–52 are forms | max_pages=24 |
| KARNATAKA | naredco.in (mirror) | Official PDF fully scanned (19MB) | NAREDCO born-digital |
| TAMIL_NADU | cms.tn.gov.in | pages 16–101 are Forms A–O | max_pages=15 |

---

## 11. Agent Pipeline — Bug Fixes (2026-03-22)

Three production bugs fixed after 12-case E2E suite. All verified: 0 retries, 0
hallucinations, avg latency 7.6s.

### Fix 1 β€” `vector_store.py::get_section_family` β€” Pydantic crash on SELECT *

`SELECT *` returned `embedding` as a raw string; Pydantic `list[float]` validation
failed. Fix: explicit column projection, `embedding=None` on all returned chunks.
Matches every other `VectorStore` method.

### Fix 2 β€” `nodes.py::vector_retrieval_node` β€” Reranker blowup on section expansion

Section family expansion ran on all 5 similarity hits β†’ up to 121 chunks β†’ FlashRank
cross-encoder serial scoring β†’ 65s reranker time. Fix (Phase 5): expand top-1 hit only; hard
cap at 25 chunks before reranker.

```python
for rc in results[:1]:
    ...family expansion...
expanded = expanded[:25]  # hard safety cap
```

**Phase 8 update:** Expanded to top-3 after RAGAS eval revealed that when a sub-section
(e.g. `S.5(4)`) ranks #1, its parent `S.5` (with the 30-day rule) was never expanded.
Cap raised to 40 to accommodate larger families.

```python
for rc in merged[:3]:    # top-3 RRF results (was: top-1)
    ...family expansion...
expanded = expanded[:40]  # was: 25
```


### Fix 3 β€” `nodes.py::validator_node` β€” False hallucination flag

Validator built context as raw `chunk.text` joined string. Generator answer cites
`"Section 11(1)"` but raw text has no section number β†’ validator scores 0.2 β†’
`hallucinated=True` β†’ spurious retry loops (7 retries across 12 tests).

Fix: mirror generator's numbered context block `[i] doc β€” section_id: title\ntext`.
Validator can now match cited section numbers to source context.
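The fix can be sketched as a shared context builder (hypothetical helper; the point is that generator and validator format chunks identically, so cited section numbers are matchable):

```python
def build_numbered_context(chunks: list[dict]) -> str:
    """Format chunks as '[i] doc — section_id: title' followed by the chunk text,
    mirroring the generator's context block."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = f"[{i}] {c['doc_name']} — {c['section_id']}: {c['section_title']}"
        blocks.append(f"{header}\n{c['text']}")
    return "\n\n".join(blocks)

ctx = build_numbered_context([
    {"doc_name": "RERA Act 2016", "section_id": "11(1)",
     "section_title": "Obligations of promoter", "text": "The promoter shall..."},
])
print(ctx.splitlines()[0])  # [1] RERA Act 2016 — 11(1): Obligations of promoter
```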

### E2E Regression Results (post-fix)

| Metric | Pre-fix | Post-fix |
| :-- | :-- | :-- |
| Avg latency | 19.6s | **7.6s** |
| Max latency | 87.1s | **13.3s** |
| Avg confidence | 0.908 | **0.958** |
| Total retries | 7 | **0** |
| Slow (>20s) | 3 | **0** |
| Low conf (<0.7) | 2 | **0** |
| Pass rate | 12/12 | **12/12** |

---

## 12. Agent Pipeline β€” RAGAS Eval Fixes (Phase 8, April 2026)

Five changes from RAGAS evaluation revealing retrieval and faithfulness failures.

### Fix 4 β€” Reranker thresholds too aggressive (`settings.py`)

Old `score_gap=0.35` cut after any 0.36 point drop β†’ only 1 chunk reached generator.
New: `score_threshold=0.1`, `score_gap=0.6`. Keeps secondary relevant chunks while still
filtering genuine noise (0.98 β†’ 0.20 drop would still cut at 0.78 gap).

### Fix 5 β€” Generator analogy instruction caused hallucination (`generator.py`)

"Use an analogy or real-world example" produced analogies ("Think of it like selling a
used car") not present in retrieved context β†’ faithfulness judge scored as hallucination.
Fix: removed analogy instruction; replaced with "using only information from the provided context".

### Fix 6 β€” Generator weak grounding for sparse contexts (`generator.py`)

Generator constructed legal conclusions from reasoning even when context lacked evidence.
Added explicit rules:
- For sparse context: say "Based on the available context: [X]" and note missing elements
- For conflict detection: only assert conflict if BOTH provisions present in context

### Fix 7 β€” CONFLICT_DETECTION tone hint implied precedence reasoning (`nodes.py`)

Tone hint said "state which jurisdiction takes precedence when context supports it" β€”
LLM interpreted "when context supports it" loosely and applied legal reasoning.
Rewritten to: "Never infer precedence from legal reasoning β€” only state precedence if
the context explicitly says so."

### Fix 8 β€” Temporal query rewrite too generic (`classifier.py`)

Query "What is the timeline for project registration?" produced rewrite "registration
timeline period" β€” FTS missed Section 5 which uses "within thirty days" and "deemed registered".
Added rewriting guidance to expand temporal queries with specific legal time-period keywords.

### RAGAS Results (Phase 8 baseline, 5-row smoke, gemma-4-31b-it judge)

| Row | Faith (before) | Faith (after) | Prec (before) | Prec (after) |
|---|---|---|---|---|
| CENTRAL-FACT-001 | 1.00 | 0.50 | 0.00 | 0.00 |
| CENTRAL-FACT-002 | 0.80 | 0.62 | 0.00 | 0.33 |
| CENTRAL-XREF-001 | 0.63 | 0.50 | 1.00 | 1.00 |
| CENTRAL-CONF-001 | 0.00 | 0.62 | 0.00 | 0.00 |
| CENTRAL-TEMP-001 | 0.67 | 1.00 | 1.00 | 0.00 |
| **Overall** | 0.618 | **0.650** | 0.400* | 0.267 |

\* Before baseline had inflated precision from duplicate chunks (non-deterministic doc_id).
After Phase 8: deterministic UUID5 chunk IDs prevent duplicates on re-ingest.
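
Deterministic IDs of this kind can be derived with `uuid.uuid5` (a sketch under assumed key fields; the namespace constant and the exact name composition used by the pipeline are illustrative):

```python
import uuid

# Fixed namespace so IDs are stable across runs (hypothetical constant).
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "civicsetu/legal_chunks")

def chunk_id(doc_name: str, section_id: str, chunk_index: int) -> uuid.UUID:
    """Same (doc, section, index) always yields the same UUID,
    so re-ingesting upserts rows instead of duplicating them."""
    return uuid.uuid5(CHUNK_NAMESPACE, f"{doc_name}|{section_id}|{chunk_index}")

a = chunk_id("RERA Act 2016", "18", 0)
b = chunk_id("RERA Act 2016", "18", 0)
print(a == b)  # True
```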