# CivicSetu – Low Level Design (LLD)
**Version:** 2.0.0 – Phase 8 Complete (RAGAS Evaluation + Retrieval Improvements)
**Live:** https://civicsetu-two.vercel.app
**Last Updated:** April 2026
---
## 1. Module Map
```
src/civicsetu/
├── config/
│   ├── settings.py            Pydantic BaseSettings singleton (lru_cache)
│   └── document_registry.py   All document URLs + metadata (single source of truth)
├── models/
│   ├── enums.py               StrEnum: Jurisdiction, DocType, QueryType, etc.
│   └── schemas.py             Pydantic models: LegalChunk, Citation, RetrievedChunk, CivicSetuResponse
├── ingestion/
│   ├── downloader.py          httpx PDF downloader with MD5 cache check
│   ├── parser.py              PyMuPDF text extractor – max_pages cap, scanned PDF detection
│   ├── chunker.py             Section-boundary regex chunker – 6 format patterns + fallback
│   ├── metadata_extractor.py  Date/Section/Rule reference/amendment regex extraction
│   ├── embedder.py            nomic-embed-text-v1.5 via sentence-transformers – truncate at 4000 chars pre-prefix
│   ├── pipeline.py            Orchestrates ingestion; prepends section_title to embeddings
│   └── graph_seeder.py        Post-ingestion REFERENCES + DERIVED_FROM edge seeding
├── stores/
│   ├── relational_store.py    Async SQLAlchemy – documents + legal_chunks tables
│   ├── vector_store.py        pgvector HNSW cosine search
│   └── graph_store.py         Neo4j Cypher interface – fresh driver per call
├── retrieval/
│   ├── vector_retriever.py    Wraps VectorStore for agent use
│   ├── graph_retriever.py     REFERENCES + DERIVED_FROM traversal, Section/Rule ID extraction
│   └── reranker.py            FlashRank cross-encoder wrapper
├── agent/
│   ├── state.py               CivicSetuState TypedDict (frozen contract)
│   ├── nodes.py               Pure functions: classifier, _rrf_retrieve (shared hybrid),
│   │                          vector_retrieval, graph_retrieval, hybrid_retrieval,
│   │                          reranker, generator, validator
│   ├── edges.py               Conditional routing: route_after_classifier,
│   │                          route_after_validator
│   └── graph.py               StateGraph assembly + get_compiled_graph()
├── prompts/
│   ├── classifier.py          Query type classification + rewriting prompt
│   ├── generator.py           Cited answer generation prompt
│   └── validator.py           Hallucination + confidence check prompt
├── guardrails/
│   ├── input_guard.py         PII detection + off-topic filter
│   └── output_guard.py        Faithfulness check + disclaimer injection
└── api/
    ├── main.py                FastAPI app factory + lifespan (graph pre-compiled)
    ├── routes/
    │   ├── health.py          GET /health → DB ping
    │   ├── query.py           POST /api/v1/query → main RAG endpoint
    │   └── ingest.py          POST /api/v1/ingest → admin endpoint
    └── middleware/
        └── logging.py         Request/response structured logging

eval/
└── golden_dataset.jsonl       31-row RAGAS evaluation dataset across 5 jurisdictions

scripts/
└── run_eval.py                Two-phase RAGAS evaluation: Phase 1 (graph invoke) + Phase 2 (RAGAS scoring)

frontend/                      Next.js 15 App Router – deployed on Vercel
├── src/app/
│   ├── layout.tsx             Root layout: ThemeProvider + dark mode
│   ├── page.tsx               Main page: wires all components together
│   └── globals.css            Tailwind directives + gradient utilities
├── src/components/
│   ├── Header.tsx             Logo, new chat, theme toggle, GitHub link
│   ├── ChatThread.tsx         Scrollable message list + empty state examples
│   ├── MessageBubble.tsx      User/assistant/error bubbles with badges + citations
│   ├── ConfidenceBadge.tsx    HIGH/MEDIUM/LOW pill
│   ├── CitationsPanel.tsx     Collapsible citation cards
│   └── InputBar.tsx           Auto-resize textarea, jurisdiction select, send
├── src/hooks/
│   └── useChat.ts             Chat state, session_id localStorage, sendMessage
└── src/lib/
    ├── types.ts               TypeScript interfaces (mirrors backend Pydantic models)
    └── api.ts                 queryRera() fetch wrapper → /api/v1/query
```
---
## 2. Database Schema
### PostgreSQL Tables
```sql
documents (
doc_id UUID PRIMARY KEY,
doc_name TEXT,
jurisdiction TEXT, -- Jurisdiction enum value
doc_type TEXT, -- DocType enum value (stored uppercase: ACT, RULES, CIRCULAR)
source_url TEXT,
effective_date DATE,
gazette_number TEXT,
total_chunks INTEGER,
ingested_at TIMESTAMPTZ,
is_active BOOLEAN
)
legal_chunks (
chunk_id UUID PRIMARY KEY,
doc_id UUID REFERENCES documents(doc_id),
jurisdiction TEXT,
doc_type TEXT,
doc_name TEXT,
section_id TEXT, -- "18", "3(2)", "Para-3"
section_title TEXT,
section_hierarchy TEXT[], -- ["RERA Act 2016", "18"]
text TEXT,
effective_date DATE,
superseded_by UUID REFERENCES legal_chunks(chunk_id),
status TEXT, -- ChunkStatus enum value
source_url TEXT,
page_number INTEGER,
embedding vector(768) -- HNSW indexed
)
```
### pgvector Index
```sql
CREATE INDEX legal_chunks_embedding_idx
ON legal_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
`m=16` → 16 connections per node. `ef_construction=64` → 64 candidates evaluated during index build.
Tuned for a recall/speed balance at <10K vectors. Revisit at 100K+.
### Neo4j Graph Schema
```
Nodes:
(:Document {doc_id, doc_name, jurisdiction, doc_type, effective_date})
(:Section {section_id, title, chunk_id, jurisdiction, doc_name, is_active})
Edges:
(:Document)-[:HAS_SECTION]->(:Section)
(:Section) -[:REFERENCES]->(:Section) -- intra + cross-jurisdiction citations
(:Section) -[:DERIVED_FROM]->(:Section) -- State Rule N → RERA Act Sec M
(:Document)-[:DERIVED_FROM]->(:Document) -- State Rules → RERA Act 2016
Planned (Phase 7+):
(:Section) -[:SUPERSEDES]->(:Section)
(:Section) -[:AMENDED_BY]->(:Amendment)
(:Section) -[:CONFLICTS_WITH]->(:Section)
```
**Live graph stats (Phase 6):**
| Metric | Count |
|--------------|-------|
| Documents | 9 |
| Sections | 2090 |
| HAS_SECTION | 1297 |
| REFERENCES | 933 |
| DERIVED_FROM | 91 |
---
## 3. Document Registry
`document_registry.py` – the single source of truth for all ingested documents.
```python
@dataclass(frozen=True)
class DocumentSpec:
name: str
url: str
jurisdiction: Jurisdiction
doc_type: DocType
effective_date: date | None
filename: str
dest_subdir: str
max_pages: int | None = None # None = all pages; the cap excludes forms/schedule appendices
```
### Ingested Documents (Phase 6)
| Key | Document | Jurisdiction | DocType | Chunks | max_pages |
|---|---|---|---|---|---|
| `rera_act_2016` | RERA Act 2016 | CENTRAL | ACT | ~224 | None |
| `mahrera_rules_2017` | MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | None |
| `up_rera_rules_2016` | UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 24 |
| `up_rera_general_regulations_2019` | UP RERA General Regulations 2019 | UTTAR_PRADESH | CIRCULAR | 85 | None |
| `karnataka_rera_rules_2017` | Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 37 |
| `tn_rera_rules_2017` | Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 15 |
**PDF source notes:**
- Karnataka official PDF (`rera.karnataka.gov.in`) is fully scanned (19MB image), so the NAREDCO mirror is used
- TN PDF bundles rules + forms (101 pages); `max_pages=15` excludes Forms A–O
- UP Rules PDF bundles rules + forms (52 pages); `max_pages=24` excludes prescribed forms
---
## 4. LangGraph State Machine
### State Contract (`agent/state.py`)
```python
class CivicSetuState(TypedDict):
# Input
query: str
session_id: Optional[str]
jurisdiction_filter: Optional[Jurisdiction]
top_k: int
# Classification
query_type: Optional[QueryType]
rewritten_query: Optional[str]
# Retrieval – Annotated[list, operator.add] enables parallel node merging
retrieved_chunks: Annotated[list[RetrievedChunk], operator.add]
reranked_chunks: list[RetrievedChunk]
# Generation
raw_response: Optional[str]
citations: list[Citation]
confidence_score: float
conflict_warnings: list[str]
amendment_notice: Optional[str]
# Control
retry_count: int # max 2 retries
hallucination_flag: bool
error: Optional[str]
```
### RetrievedChunk Schema (`models/schemas.py`)
```python
class RetrievedChunk(BaseModel):
chunk: LegalChunk
vector_score: float | None = None
rerank_score: float | None = None
retrieval_source: str = "vector" # "vector" | "graph"
graph_path: Optional[str] = None # e.g. "source:18@CENTRAL"
is_pinned: bool = False # True = exact source section, bypasses reranker sort
```
### Node Responsibilities
| Node | Input Keys | Output Keys | LLM Call |
| :-- | :-- | :-- | :-- |
| classifier | query | query_type, rewritten_query | Yes |
| vector_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| graph_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| reranker | retrieved_chunks, query | reranked_chunks | No |
| generator | reranked_chunks, query | raw_response, citations, confidence_score | Yes |
| validator | raw_response, reranked_chunks | hallucination_flag, confidence_score | Yes |
| retry | retry_count | retry_count+1, cleared retrieval fields | No |
### Routing Logic
| query_type (route_after_classifier) | retrieval node |
|-------------------------------------|------------------------------------|
| fact_lookup | vector_retrieval (RRF hybrid) |
| cross_reference | graph_retrieval (→ RRF fallback) |
| penalty_lookup | graph_retrieval (→ RRF fallback) |
| temporal | graph_retrieval (→ RRF fallback) |
| conflict_detection | hybrid_retrieval (RRF across jurisdictions) |
```
validator → route_after_validator:
  confidence >= 0.5 AND not hallucinated                  → END
  (confidence < 0.5 OR hallucinated) AND retry_count < 2  → retry → classifier
  (confidence < 0.5 OR hallucinated) AND retry_count >= 2 → END (low-confidence answer)
```
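The validator routing above can be sketched as a pure edge function (a sketch; the actual `agent/edges.py` signature and return sentinels may differ):

```python
def route_after_validator(state: dict) -> str:
    """Decide the next node from validator output: accept, retry, or give up."""
    ok = state["confidence_score"] >= 0.5 and not state["hallucination_flag"]
    if ok:
        return "END"
    if state["retry_count"] < 2:
        return "retry"   # retry node clears retrieval fields and re-enters classifier
    return "END"         # retries exhausted: return the low-confidence answer as-is
```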
---
## 5. Chunking Strategy
### Section Boundary Detection
Six regex patterns for `DocType.RULES`, tried in order (first match wins per line):
| # | Pattern | Format | Jurisdiction |
|---|---|---|---|
| 1 | `\n(?P<id>\d{1,2}[A-Z]?)\.\s*\n(?P<title>...)` | Newline-dot-newline | MahaRERA |
| 2 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...)\.?—` | Same-line em-dash | MahaRERA |
| 3 | `^Rule\s+(?P<id>\d{1,2}[A-Z]?)\s*[.\-—]\s*(?P<title>...)` | Explicit Rule prefix | Generic |
| 4 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...?)\.-` | ASCII hyphen `.-` | Karnataka, Tamil Nadu |
| 5 | `(?P<id>\d{1,2}[A-Z]?)-\(1\)\s*\n(?P<title>...)` | `N-(1)\nTitle` | UP RERA multi-clause |
| 6 | `(?P<id>\d{1,2}[A-Z]?)-(?!\()\s*\n(?P<title>...)` | `N-\nTitle` | UP RERA single-clause |
`DocType.ACT` uses a separate pattern set. Fallback: paragraph split on double newlines.
Rule IDs are capped at `\d{1,2}` (max 2 digits), which prevents year strings like `2016` from matching as rule IDs.
Logs `no_section_boundaries_found` + `fallback_paragraph_chunking` when falling back.
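Pattern 3 can be exercised directly. The table elides the title sub-pattern as `...`, so the `.+` title group below is an illustrative assumption; the separator class allows `.`, `-`, or an em-dash:

```python
import re

# Pattern 3 from the table ("explicit Rule prefix"); `.+` for the title is assumed.
RULE_PREFIX = re.compile(
    r"^Rule\s+(?P<id>\d{1,2}[A-Z]?)\s*[.\-\u2014]\s*(?P<title>.+)$",
    re.MULTILINE,
)

m = RULE_PREFIX.search("Rule 18 - Rate of interest payable by promoter and allottee")
```

The `\d{1,2}` cap means a heading like `Rule 2016 - Short title` never matches: the engine cannot place a separator after a 1- or 2-digit prefix of `2016`.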
### Chunk Size Limits
```
MIN_CHARS = 100   → discard fragments (headers, page numbers)
MAX_CHARS = 1500  → split large sections at subsection markers (1), (2), (a), (b)
```
### Split Priority for Large Sections
```
1. Subsection markers: \n\s*\((?:\d+|[a-z]{1,3})\)\s+
2. Sentence boundary near MAX_CHARS: rfind('. ')
3. Hard cut at MAX_CHARS (last resort)
```
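The three-step split priority can be sketched as a recursive helper (a simplified sketch; `split_large_section` is hypothetical, not the real `chunker.py` code):

```python
import re

MAX_CHARS = 1500
_SUBSECTION = re.compile(r"\n\s*\((?:\d+|[a-z]{1,3})\)\s+")


def split_large_section(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split an oversized section: subsection marker first, then a sentence
    boundary near the limit, then a hard cut as the last resort."""
    if len(text) <= max_chars:
        return [text]
    # 1. Prefer the last subsection marker like "\n(1) " / "\n(a) " within the limit
    marks = [m.start() for m in _SUBSECTION.finditer(text)]
    cut = next((p for p in reversed(marks) if 0 < p <= max_chars), None)
    # 2. Else a sentence boundary near MAX_CHARS
    if cut is None:
        pos = text.rfind(". ", 0, max_chars)
        cut = pos + 1 if pos > 0 else None   # keep the period with the left half
    # 3. Hard cut
    if cut is None:
        cut = max_chars
    head, tail = text[:cut].rstrip(), text[cut:].lstrip()
    return [head] + split_large_section(tail, max_chars)
```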
### parser.py – max_pages cap
```python
@staticmethod
def parse(source: str | Path, max_pages: int | None = None) -> ParsedDocument:
    doc = fitz.open(str(source))           # PyMuPDF handle (abridged; the real method does more)
    all_pages = list(doc)
    if max_pages is not None:
        all_pages = all_pages[:max_pages]  # slice before the full-text build
```
---
## 6. Embedding Strategy
**Model:** `nomic-embed-text-v1.5` (via `sentence-transformers`, local – no Ollama required)
**Dimension:** 768
**Asymmetric prefixes** (MTEB/nomic-embed requirement):
```
Ingestion time: "search_document: {section_title}\n{text}"  → pipeline.py
Query time:     "search_query: {rewritten_query}"           → retrieval/__init__.py
```
**Section title prepend (Phase 8 change):** `pipeline.py` prepends `section_title` to the
embedded text so sub-chunks (e.g. `S.11(2)`) retain their section context.
Without this, sub-chunks embed without "Obligations of promoter" and cosine similarity misses them.
The reranker still receives raw `chunk.text` (no title prefix).
Using the wrong prefix at query time causes roughly 10–15% recall degradation.
### Truncation Guard
```python
MAX_EMBED_CHARS = 4000  # ~1000 tokens; safe ceiling before the prefix is added
def embed_document(self, text: str) -> list[float]:
if len(text) > MAX_EMBED_CHARS:
log.warning("embedding_truncated", original_len=len(text), truncated_to=MAX_EMBED_CHARS)
text = text[:MAX_EMBED_CHARS]
prefixed = f"search_document: {text.strip()}" # prefix AFTER truncation
return self.embed_one(prefixed)
```
Truncation happens **before** the prefix is added, which prevents Ollama 500 errors on Tamil Nadu
and other gazette PDFs whose sub-sections exceed 10K chars.
---
## 7. Hybrid Retrieval – `_rrf_retrieve()`
All retrieval nodes share a single async helper `_rrf_retrieve()` in `agent/nodes.py`.
### Reciprocal Rank Fusion (RRF)
```python
RRF_K = 60 # standard constant
rrf_score(chunk) = 1/(K + rank_in_vector) + 1/(K + rank_in_fts)
```
Fetches `top_k × 3` vector results and `top_k × 2` FTS results, deduplicates by `chunk_id`,
merges via RRF, and returns the top `top_k × 2`.
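The fusion step can be sketched as a pure function (a minimal sketch; the real `_rrf_retrieve()` also does the fetching and `chunk_id` dedup):

```python
RRF_K = 60  # standard RRF constant


def rrf_merge(vector_ids: list[str], fts_ids: list[str], top_n: int) -> list[str]:
    """Merge two ranked chunk-id lists with Reciprocal Rank Fusion (1-based ranks)."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, fts_ids):
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (RRF_K + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A chunk appearing in both lists accumulates both reciprocal-rank terms, so it outranks a chunk that tops only one list.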
### Full-Text Search
`VectorStore.full_text_search()` uses `websearch_to_tsquery` in OR mode:
```sql
WHERE to_tsvector('english', text) @@ websearch_to_tsquery('english', :query)
ORDER BY ts_rank(to_tsvector('english', text), websearch_to_tsquery('english', :query)) DESC
```
Changed from `plainto_tsquery` (AND mode): AND required all query words to match,
excluding relevant sections that matched most but not all words.
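The document does not show how OR mode is constructed; one plausible sketch is to join the query terms with `websearch_to_tsquery`'s `OR` keyword before binding the `:query` parameter (an assumption, not the confirmed `full_text_search()` implementation):

```python
def to_or_websearch_query(query: str) -> str:
    """Join terms with websearch_to_tsquery's OR keyword so a chunk matching
    most (but not all) words can still rank."""
    terms = [t for t in query.split() if t.lower() != "or"]
    return " OR ".join(terms)
```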
### Section Family Expansion
After RRF merge, top-3 results trigger family expansion:
```python
for rc in merged[:3]:
base_sid = re.sub(r'\([^)]*\)$', '', section_id).strip() # "5(4)" β "5"
family = await VectorStore.get_section_family(section_id=base_sid, jurisdiction=jur)
# returns all chunks where section_id = '5' OR section_id LIKE '5(%'
```
`get_section_family` guard: skips if `section_id` already contains `(` (base_sid computation
strips this before calling). Hard cap: `_MAX_VECTOR_EXPANDED = 40` chunks before reranker.
**Why top-3 not top-1:** If top-1 RRF result is a sub-section (`S.5(4)`), its parent
family is expanded. But if the truly relevant parent section (`S.11`) appears at RRF rank 2,
only expanding top-1 misses it. Expanding top-3 covers more cases at the cost of a slightly
larger pool.
---
## 7b. Reranker Detail
`reranker_score_threshold = 0.1` → minimum cross-encoder score to enter the candidate pool.
`reranker_score_gap = 0.6` → score-gap ("cliff") threshold for the gap filter.
**Gap filter:**
```python
def _apply_score_gap(chunks, gap=0.6):
for i in range(1, len(chunks)):
if chunks[i-1].rerank_score - chunks[i].rerank_score >= gap:
return chunks[:i]
return chunks
```
**Threshold history:** Originally `threshold=0.3, gap=0.35`. A gap of 0.35 was too aggressive: it
cut chunks after a 0.36 score drop, leaving only one context chunk for the generator. Raised to 0.6 (Phase 8).
Final context: `pinned_chunks + gap_filtered[:max(0, 5 - len(pinned))]` → max 5 chunks.
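The selection rule above can be sketched end to end (`Candidate` is a minimal stand-in for `RetrievedChunk`):

```python
from dataclasses import dataclass


@dataclass
class Candidate:                 # minimal stand-in for RetrievedChunk
    rerank_score: float
    is_pinned: bool = False


def build_final_context(chunks: list[Candidate], gap: float = 0.6,
                        max_chunks: int = 5) -> list[Candidate]:
    """Pinned source-section chunks first; gap-filtered reranked chunks fill
    the remaining slots."""
    pinned = [c for c in chunks if c.is_pinned]
    rest = sorted((c for c in chunks if not c.is_pinned),
                  key=lambda c: c.rerank_score, reverse=True)
    kept: list[Candidate] = []
    for i, c in enumerate(rest):
        if i > 0 and rest[i - 1].rerank_score - c.rerank_score >= gap:
            break                # score cliff: everything after the gap is dropped
        kept.append(c)
    return pinned + kept[:max(0, max_chunks - len(pinned))]
```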
---
## 8. Graph Retriever
`graph_retriever.py` is invoked for the `cross_reference`, `penalty_lookup`, and `temporal` query types.
### Section ID Extraction
```python
section_pattern = re.compile(r'\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b', re.IGNORECASE)
rule_pattern = re.compile(r'\bRule\s+(\d+[A-Z]?)\b', re.IGNORECASE)
```
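The two patterns can be wrapped in a small extraction helper (a sketch around the exact regexes shown above; the function name is illustrative):

```python
import re

SECTION_PAT = re.compile(r'\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b', re.IGNORECASE)
RULE_PAT = re.compile(r'\bRule\s+(\d+[A-Z]?)\b', re.IGNORECASE)


def extract_ids(query: str) -> tuple[list[str], list[str]]:
    """Pull Section and Rule identifiers out of a rewritten query."""
    return SECTION_PAT.findall(query), RULE_PAT.findall(query)
```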
### Traversal Strategy (per jurisdiction)
For each jurisdiction (`CENTRAL`, `MAHARASHTRA`, `UTTAR_PRADESH`, `KARNATAKA`, `TAMIL_NADU`):
```
1. Source section chunks    → exact section_id match → is_pinned=True
2. REFERENCES outgoing      → sections the source cites (depth=2)
3. REFERENCES incoming      → sections that cite the source
4. DERIVED_FROM outgoing    → Act sections this Rule derives from
5. DERIVED_FROM incoming    → Rule sections implementing this Act section
```
### Pinning Rule
Only the exact `section_id` match gets `is_pinned=True`. Sub-sections are NOT pinned.
Max pinned chunks: 2 (one per jurisdiction). Remaining 3 slots filled by reranker.
---
## 9. Response Contract
```python
CivicSetuResponse:
answer: str # plain English, cites section numbers
citations: list[Citation] # min_length=1 → never empty
confidence_score: float # 0.0β1.0
confidence_level: str # "high"/"medium"/"low"
query_type_resolved: QueryType
conflict_warnings: list[str] # empty until Phase 7
amendment_notice: Optional[str]
disclaimer: str # always present
Citation:
section_id: str
doc_name: str
jurisdiction: Jurisdiction
effective_date: Optional[date]
source_url: str
chunk_id: UUID
```
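The `confidence_level` string is derived from `confidence_score`, but the cut-offs are not stated in this document; only the 0.5 validator boundary appears in Section 4. The thresholds below are therefore assumptions for illustration:

```python
def confidence_level(score: float) -> str:
    """Map a 0.0-1.0 score to the HIGH/MEDIUM/LOW pill (assumed thresholds)."""
    if score >= 0.8:        # assumption: "high" band
        return "high"
    if score >= 0.5:        # matches the validator's accept boundary
        return "medium"
    return "low"
```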
---
## 10. Error Handling
| Scenario | Behaviour |
| :-- | :-- |
| LLM provider rate limited | LiteLLM auto-rotates to next provider |
| All LLM providers fail | `RuntimeError` → FastAPI 500 |
| No chunks retrieved | `InsufficientInfoResponse` returned |
| Hallucination detected | retry (max 2x) → low-confidence answer |
| DB unreachable | `/health` returns `degraded`, query returns 500 |
| Scanned PDF detected | Warning logged, fallback URL used (Karnataka) |
| Section patterns not matched | Fallback paragraph chunking, warning logged |
| Neo4j event loop mismatch | Prevented: `_get_driver()` creates a fresh driver per call |
| Embedding input too long | Truncated at 4000 chars before prefix; warning logged |
| max_pages exceeded | Parser silently caps pages; total_pages reflects capped count |
---
## 11. Neo4j Graph – Phase 6 State (Current)
**Nodes:** 9 Documents, 2090 Sections
**Edges:** 1297 HAS_SECTION, 933 REFERENCES, 91 DERIVED_FROM
### Documents in Graph
| Document | Jurisdiction | DocType | Chunks | Sections | DERIVED_FROM edges |
|---|---|---|---|---|---|
| RERA Act 2016 | CENTRAL | ACT | ~224 | ~224 | – |
| MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | ~214 | 17 sec + 1 doc |
| UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 33 | 11 sec + 1 doc |
| UP RERA General Regs 2019 | UTTAR_PRADESH | CIRCULAR | 85 | 53 | – |
| Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 45 | 15 sec + 1 doc |
| Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 36 | 15 sec + 1 doc |
### Known Open Issues (non-blocking)
| Issue | Affected | Root Cause |
|---|---|---|
| Act §13 missing from graph | UP rule 14, KA rule 11, TN rule 11 | RERA Act ingestion: §13 chunked under a different ID |
| Act §66 missing from graph | KA rule 19, TN rule 19 | RERA Act ingestion: §66 not ingested |
### DERIVED_FROM Map Summary
| Jurisdiction | Mapped pairs | Resolved | Unresolved |
|---|---|---|---|
| MAHARASHTRA | 17 | 17 | 0 |
| UTTAR_PRADESH | 15 | 11 | 4 |
| KARNATAKA | 17 | 15 | 2 |
| TAMIL_NADU | 17 | 15 | 2 |
### PDF Source Decisions
| Jurisdiction | Primary URL | Issue | Resolution |
|---|---|---|---|
| CENTRAL | indiacode.nic.in | – | – |
| MAHARASHTRA | naredco.in | – | – |
| UTTAR_PRADESH | up-rera.in/pdf/rera.pdf | pages 25–52 are forms | max_pages=24 |
| KARNATAKA | naredco.in (mirror) | Official PDF fully scanned (19MB) | NAREDCO born-digital |
| TAMIL_NADU | cms.tn.gov.in | pages 16–101 are Forms A–O | max_pages=15 |
## 12. Agent Pipeline – Bug Fixes (2026-03-22)
Three production bugs fixed after a 12-case E2E suite. All verified: 0 retries, 0
hallucinations, avg latency 7.6s.
### Fix 1 – `vector_store.py::get_section_family`: Pydantic crash on `SELECT *`
`SELECT *` returned `embedding` as a raw string; Pydantic `list[float]` validation
failed. Fix: explicit column projection, `embedding=None` on all returned chunks.
Matches every other `VectorStore` method.
### Fix 2 – `nodes.py::vector_retrieval_node`: reranker blowup on section expansion
Section family expansion ran on all 5 similarity hits → up to 121 chunks → FlashRank
cross-encoder scoring them serially → 65s reranker time. Fix (Phase 5): expand the top-1
hit only; hard cap at 25 chunks before the reranker.
```python
for rc in results[:1]:
...family expansion...
expanded = expanded[:25] # hard safety cap
```
**Phase 8 update:** Expanded to top-3 after RAGAS eval revealed that when a sub-section
(e.g. `S.5(4)`) ranks #1, its parent `S.5` (with the 30-day rule) was never expanded.
Cap raised to 40 to accommodate larger families.
```python
for rc in merged[:3]: # top-3 RRF results (was: top-1)
...family expansion...
expanded = expanded[:40] # was: 25
```
### Fix 3 – `nodes.py::validator_node`: false hallucination flag
The validator built its context as a raw joined `chunk.text` string. The generator's answer
cites "Section 11(1)", but the raw text has no section number, so the validator scored 0.2 →
`hallucinated=True` → spurious retry loops (7 retries across 12 tests).
Fix: mirror the generator's numbered context block `[i] doc – section_id: title\ntext`.
The validator can now match cited section numbers to the source context.
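A sketch of that numbered context block (the separator and exact layout are inferred from the format string above; the function name is illustrative):

```python
def build_context_block(chunks: list[dict]) -> str:
    """Number chunks the way the generator does so the validator can match
    cited section numbers back to their source."""
    lines = []
    for i, c in enumerate(chunks, start=1):
        lines.append(
            f"[{i}] {c['doc_name']} – {c['section_id']}: {c['section_title']}\n{c['text']}"
        )
    return "\n\n".join(lines)
```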
### E2E Regression Results (post-fix)
| Metric | Pre-fix | Post-fix |
| :-- | :-- | :-- |
| Avg latency | 19.6s | **7.6s** |
| Max latency | 87.1s | **13.3s** |
| Avg confidence | 0.908 | **0.958** |
| Total retries | 7 | **0** |
| Slow (>20s) | 3 | **0** |
| Low conf (<0.7) | 2 | **0** |
| Pass rate | 12/12 | **12/12** |
---
## 13. Agent Pipeline – RAGAS Eval Fixes (Phase 8, April 2026)
Five changes driven by a RAGAS evaluation that revealed retrieval and faithfulness failures.
### Fix 4 – Reranker thresholds too aggressive (`settings.py`)
The old `score_gap=0.35` cut after any 0.36-point drop, so only 1 chunk reached the generator.
New: `score_threshold=0.1`, `score_gap=0.6`. This keeps secondary relevant chunks while still
filtering genuine noise (a 0.98 → 0.20 drop would still cut, at a 0.78 gap).
### Fix 5 – Generator analogy instruction caused hallucination (`generator.py`)
"Use an analogy or real-world example" produced analogies ("Think of it like selling a
used car") not present in the retrieved context, which the faithfulness judge scored as hallucination.
Fix: removed analogy instruction; replaced with "using only information from the provided context".
### Fix 6 – Generator weak grounding for sparse contexts (`generator.py`)
Generator constructed legal conclusions from reasoning even when context lacked evidence.
Added explicit rules:
- For sparse context: say "Based on the available context: [X]" and note missing elements
- For conflict detection: only assert conflict if BOTH provisions present in context
### Fix 7 – CONFLICT_DETECTION tone hint implied precedence reasoning (`nodes.py`)
The tone hint said "state which jurisdiction takes precedence when context supports it";
the LLM interpreted "when context supports it" loosely and applied its own legal reasoning.
Rewritten to: "Never infer precedence from legal reasoning; only state precedence if
the context explicitly says so."
### Fix 8 – Temporal query rewrite too generic (`classifier.py`)
The query "What is the timeline for project registration?" was rewritten to "registration
timeline period", so FTS missed Section 5, which uses "within thirty days" and "deemed registered".
Added rewriting guidance to expand temporal queries with specific legal time-period keywords.
### RAGAS Results (Phase 8 baseline, 5-row smoke, gemma-4-31b-it judge)
| Row | Faith (before) | Faith (after) | Prec (before) | Prec (after) |
|---|---|---|---|---|
| CENTRAL-FACT-001 | 1.00 | 0.50 | 0.00 | 0.00 |
| CENTRAL-FACT-002 | 0.80 | 0.62 | 0.00 | 0.33 |
| CENTRAL-XREF-001 | 0.63 | 0.50 | 1.00 | 1.00 |
| CENTRAL-CONF-001 | 0.00 | 0.62 | 0.00 | 0.00 |
| CENTRAL-TEMP-001 | 0.67 | 1.00 | 1.00 | 0.00 |
| **Overall** | 0.618 | **0.650** | 0.400* | 0.267 |
\* The "before" baseline had inflated precision from duplicate chunks (non-deterministic doc_id).
After Phase 8: deterministic UUID5 chunk IDs prevent duplicates on re-ingest.
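Deterministic IDs of that kind can be produced with `uuid.uuid5` over stable chunk coordinates. The namespace and name-string composition below are assumptions for illustration; the actual Phase 8 key is not specified here:

```python
import uuid

# Assumed namespace; any fixed UUID works as long as it never changes.
_CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "civicsetu.chunks")


def chunk_uuid(doc_name: str, section_id: str, ordinal: int) -> uuid.UUID:
    """Same inputs always yield the same chunk_id, so re-ingesting a document
    overwrites rows instead of duplicating them."""
    return uuid.uuid5(_CHUNK_NAMESPACE, f"{doc_name}|{section_id}|{ordinal}")
```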