# CivicSetu: Low Level Design (LLD)

**Version:** 2.0.0, Phase 8 Complete (RAGAS Evaluation + Retrieval Improvements)
**Live:** https://civicsetu-two.vercel.app
**Last Updated:** April 2026

---

## 1. Module Map

```
src/civicsetu/
├── config/
│   ├── settings.py              Pydantic BaseSettings singleton (lru_cache)
│   └── document_registry.py     All document URLs + metadata (single source of truth)
├── models/
│   ├── enums.py                 StrEnum: Jurisdiction, DocType, QueryType, etc.
│   └── schemas.py               Pydantic models: LegalChunk, Citation, RetrievedChunk, CivicSetuResponse
├── ingestion/
│   ├── downloader.py            httpx PDF downloader with MD5 cache check
│   ├── parser.py                PyMuPDF text extractor: max_pages cap, scanned PDF detection
│   ├── chunker.py               Section-boundary regex chunker: 6 format patterns + fallback
│   ├── metadata_extractor.py    Date/Section/Rule reference/amendment regex extraction
│   ├── embedder.py              nomic-embed-text-v1.5 via sentence-transformers; truncate at 4000 chars pre-prefix
│   ├── pipeline.py              Orchestrates ingestion; prepends section_title to embeddings
│   └── graph_seeder.py          Post-ingestion REFERENCES + DERIVED_FROM edge seeding
├── stores/
│   ├── relational_store.py      Async SQLAlchemy: documents + legal_chunks tables
│   ├── vector_store.py          pgvector HNSW cosine search
│   └── graph_store.py           Neo4j Cypher interface; fresh driver per call
├── retrieval/
│   ├── vector_retriever.py      Wraps VectorStore for agent use
│   ├── graph_retriever.py       REFERENCES + DERIVED_FROM traversal, Section/Rule ID extraction
│   └── reranker.py              FlashRank cross-encoder wrapper
├── agent/
│   ├── state.py                 CivicSetuState TypedDict (frozen contract)
│   ├── nodes.py                 Pure functions: classifier, _rrf_retrieve (shared hybrid),
│   │                            vector_retrieval, graph_retrieval, hybrid_retrieval,
│   │                            reranker, generator, validator
│   ├── edges.py                 Conditional routing: route_after_classifier,
│   │                            route_after_validator
│   └── graph.py                 StateGraph assembly + get_compiled_graph()
├── prompts/
│   ├── classifier.py            Query type classification + rewriting prompt
│   ├── generator.py             Cited answer generation prompt
│   └── validator.py             Hallucination + confidence check prompt
├── guardrails/
│   ├── input_guard.py           PII detection + off-topic filter
│   └── output_guard.py          Faithfulness check + disclaimer injection
└── api/
    ├── main.py                  FastAPI app factory + lifespan (graph pre-compiled)
    ├── routes/
    │   ├── health.py            GET /health: DB ping
    │   ├── query.py             POST /api/v1/query: main RAG endpoint
    │   └── ingest.py            POST /api/v1/ingest: admin endpoint
    └── middleware/
        └── logging.py           Request/response structured logging

eval/
└── golden_dataset.jsonl         31-row RAGAS evaluation dataset across 5 jurisdictions
scripts/
└── run_eval.py                  Two-phase RAGAS evaluation: Phase 1 (graph invoke) + Phase 2 (RAGAS scoring)

frontend/                        Next.js 15 App Router, deployed on Vercel
├── src/app/
│   ├── layout.tsx               Root layout: ThemeProvider + dark mode
│   ├── page.tsx                 Main page: wires all components together
│   └── globals.css              Tailwind directives + gradient utilities
├── src/components/
│   ├── Header.tsx               Logo, new chat, theme toggle, GitHub link
│   ├── ChatThread.tsx           Scrollable message list + empty state examples
│   ├── MessageBubble.tsx        User/assistant/error bubbles with badges + citations
│   ├── ConfidenceBadge.tsx      HIGH/MEDIUM/LOW pill
│   ├── CitationsPanel.tsx       Collapsible citation cards
│   └── InputBar.tsx             Auto-resize textarea, jurisdiction select, send
├── src/hooks/
│   └── useChat.ts               Chat state, session_id localStorage, sendMessage
└── src/lib/
    ├── types.ts                 TypeScript interfaces (mirrors backend Pydantic models)
    └── api.ts                   queryRera() fetch wrapper → /api/v1/query
```

---

## 2. Database Schema

### PostgreSQL Tables

```sql
-- The vector(768) column type requires the pgvector extension.
CREATE TABLE documents (
    doc_id          UUID PRIMARY KEY,
    doc_name        TEXT,
    jurisdiction    TEXT,         -- Jurisdiction enum value
    doc_type        TEXT,         -- DocType enum value (stored uppercase: ACT, RULES, CIRCULAR)
    source_url      TEXT,
    effective_date  DATE,
    gazette_number  TEXT,
    total_chunks    INTEGER,
    ingested_at     TIMESTAMPTZ,
    is_active       BOOLEAN
);

CREATE TABLE legal_chunks (
    chunk_id          UUID PRIMARY KEY,
    doc_id            UUID REFERENCES documents (doc_id),
    jurisdiction      TEXT,
    doc_type          TEXT,
    doc_name          TEXT,
    section_id        TEXT,        -- "18", "3(2)", "Para-3"
    section_title     TEXT,
    section_hierarchy TEXT[],      -- ["RERA Act 2016", "18"]
    text              TEXT,
    effective_date    DATE,
    superseded_by     UUID REFERENCES legal_chunks (chunk_id),
    status            TEXT,        -- ChunkStatus enum value
    source_url        TEXT,
    page_number       INTEGER,
    embedding         vector(768)  -- HNSW indexed
);
```

### pgvector Index

```sql
CREATE INDEX legal_chunks_embedding_idx
    ON legal_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

`m=16` means up to 16 connections per node; `ef_construction=64` means 64 candidates are considered during index build.
Tuned for a recall/speed balance at <10K vectors. Revisit at 100K+.

### Neo4j Graph Schema

```
Nodes:
  (:Document {doc_id, doc_name, jurisdiction, doc_type, effective_date})
  (:Section  {section_id, title, chunk_id, jurisdiction, doc_name, is_active})

Edges:
  (:Document)-[:HAS_SECTION]->(:Section)
  (:Section)-[:REFERENCES]->(:Section)       -- intra + cross-jurisdiction citations
  (:Section)-[:DERIVED_FROM]->(:Section)     -- State Rule N → RERA Act Sec M
  (:Document)-[:DERIVED_FROM]->(:Document)   -- State Rules → RERA Act 2016

Planned (Phase 7+):
  (:Section)-[:SUPERSEDES]->(:Section)
  (:Section)-[:AMENDED_BY]->(:Amendment)
  (:Section)-[:CONFLICTS_WITH]->(:Section)
```

**Live graph stats (Phase 6):**

| Metric       | Count |
|--------------|-------|
| Documents    | 9     |
| Sections     | 2090  |
| HAS_SECTION  | 1297  |
| REFERENCES   | 933   |
| DERIVED_FROM | 91    |

---

## 3. Document Registry

`document_registry.py` is the single source of truth for all ingested documents.

```python
from dataclasses import dataclass
from datetime import date

# Enums are defined in models/enums.py (see Module Map).
from civicsetu.models.enums import DocType, Jurisdiction


@dataclass(frozen=True)
class DocumentSpec:
    name: str
    url: str
    jurisdiction: Jurisdiction
    doc_type: DocType
    effective_date: date | None
    filename: str
    dest_subdir: str
    max_pages: int | None = None  # None = all pages; a cap excludes forms/schedules appendices
```

### Ingested Documents (Phase 6)

| Key | Document | Jurisdiction | DocType | Chunks | max_pages |
|---|---|---|---|---|---|
| `rera_act_2016` | RERA Act 2016 | CENTRAL | ACT | ~224 | None |
| `mahrera_rules_2017` | MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | None |
| `up_rera_rules_2016` | UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 24 |
| `up_rera_general_regulations_2019` | UP RERA General Regulations 2019 | UTTAR_PRADESH | CIRCULAR | 85 | None |
| `karnataka_rera_rules_2017` | Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 37 |
| `tn_rera_rules_2017` | Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 15 |

**PDF source notes:**
- The Karnataka official PDF (`rera.karnataka.gov.in`) is fully scanned (19MB image), so the NAREDCO mirror is used
- The TN PDF bundles rules + forms (101 pages); `max_pages=15` excludes Forms A-O
- The UP Rules PDF bundles rules + forms (52 pages); `max_pages=24` excludes prescribed forms

---

## 4. LangGraph State Machine

### State Contract (`agent/state.py`)

```python
import operator
from typing import Annotated, Optional, TypedDict

from civicsetu.models.enums import Jurisdiction, QueryType
from civicsetu.models.schemas import Citation, RetrievedChunk


class CivicSetuState(TypedDict):
    # Input
    query: str
    session_id: Optional[str]
    jurisdiction_filter: Optional[Jurisdiction]
    top_k: int

    # Classification
    query_type: Optional[QueryType]
    rewritten_query: Optional[str]

    # Retrieval: Annotated[list, operator.add] lets parallel nodes append to one list
    retrieved_chunks: Annotated[list[RetrievedChunk], operator.add]
    reranked_chunks: list[RetrievedChunk]

    # Generation
    raw_response: Optional[str]
    citations: list[Citation]
    confidence_score: float
    conflict_warnings: list[str]
    amendment_notice: Optional[str]

    # Control
    retry_count: int  # max 2 retries
    hallucination_flag: bool
    error: Optional[str]
```

### RetrievedChunk Schema (`models/schemas.py`)

```python
from typing import Optional

from pydantic import BaseModel


class RetrievedChunk(BaseModel):
    chunk: LegalChunk                 # LegalChunk is defined in the same module
    vector_score: float | None = None
    rerank_score: float | None = None
    retrieval_source: str = "vector"  # "vector" | "graph"
    graph_path: Optional[str] = None  # e.g. "source:18@CENTRAL"
    is_pinned: bool = False           # True = exact source section, bypasses reranker sort
```

### Node Responsibilities

| Node | Input Keys | Output Keys | LLM Call |
| :-- | :-- | :-- | :-- |
| classifier | query | query_type, rewritten_query | Yes |
| vector_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| graph_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| reranker | retrieved_chunks, query | reranked_chunks | No |
| generator | reranked_chunks, query | raw_response, citations, confidence_score | Yes |
| validator | raw_response, reranked_chunks | hallucination_flag, confidence_score | Yes |
| retry | retry_count | retry_count+1, cleared retrieval fields | No |

### Routing Logic

| Query type (classifier → route_after_classifier) | Next node |
|---------------------------------------|------------------------------------------|
| fact_lookup | vector_retrieval (RRF hybrid) |
| cross_reference | graph_retrieval (→ RRF fallback) |
| penalty_lookup | graph_retrieval (→ RRF fallback) |
| temporal | graph_retrieval (→ RRF fallback) |
| conflict_detection | hybrid_retrieval (RRF across jurisdictions) |

```
validator → route_after_validator:
  confidence >= 0.5 AND not hallucinated                    → END
  (confidence < 0.5 OR hallucinated) AND retry_count < 2    → retry → classifier
  (confidence < 0.5 OR hallucinated) AND retry_count >= 2   → END (low confidence answer)
```

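The two routing tables above can be read as a pair of pure functions. The sketch below is illustrative and is not the code in `agent/edges.py`: the `RouterState` type, the lowercase node-name strings, the `"end"` sentinel, and the fallback for unknown query types are all assumptions. Only the query types, their targets, the 0.5 threshold, and the retry cap of 2 come from the tables.

```python
from typing import TypedDict

CONFIDENCE_THRESHOLD = 0.5
MAX_RETRIES = 2

# query_type -> first retrieval node (from the route_after_classifier table)
CLASSIFIER_ROUTES = {
    "fact_lookup": "vector_retrieval",
    "cross_reference": "graph_retrieval",
    "penalty_lookup": "graph_retrieval",
    "temporal": "graph_retrieval",
    "conflict_detection": "hybrid_retrieval",
}


class RouterState(TypedDict):
    """Hypothetical subset of CivicSetuState that the routers read."""
    query_type: str
    confidence_score: float
    hallucination_flag: bool
    retry_count: int


def route_after_classifier(state: RouterState) -> str:
    # Assumption: unknown query types fall back to plain vector retrieval.
    return CLASSIFIER_ROUTES.get(state["query_type"], "vector_retrieval")


def route_after_validator(state: RouterState) -> str:
    passed = (
        state["confidence_score"] >= CONFIDENCE_THRESHOLD
        and not state["hallucination_flag"]
    )
    if passed:
        return "end"
    if state["retry_count"] < MAX_RETRIES:
        return "retry"  # the retry node bumps retry_count, then re-enters classifier
    return "end"  # retries exhausted: return the low-confidence answer
```

Keeping both routers free of I/O means the whole decision table can be unit-tested without compiling the graph.
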
---

## 5. Chunking Strategy

### Section Boundary Detection

Six regex patterns cover `DocType.RULES` documents. They are tried in order, and the first match wins per line:

| # | Pattern | Format | Jurisdiction |
|---|---|---|---|
| 1 | `\n(?P<id>\d{1,2}[A-Z]?)\.\s*\n(?P<title>...)` | Newline-dot-newline | MahaRERA |
| 2 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...)\.?\u2014` | Same-line em-dash | MahaRERA |
| 3 | `^Rule\s+(?P<id>\d{1,2}[A-Z]?)\s*[.\-\u2014]\s*(?P<title>...)` | Explicit Rule prefix | Generic |
| 4 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...?)\.-` | ASCII hyphen `.-` | Karnataka, Tamil Nadu |
| 5 | `(?P<id>\d{1,2}[A-Z]?)-\(1\)\s*\n(?P<title>...)` | `N-(1)\nTitle` | UP RERA multi-clause |
| 6 | `(?P<id>\d{1,2}[A-Z]?)-(?!\()\s*\n(?P<title>...)` | `N-\nTitle` | UP RERA single-clause |

`DocType.ACT` uses a separate pattern set. Fallback: paragraph split on double newlines.
Rule IDs are capped at `\d{1,2}` (max 2 digits), which prevents year strings like `2016` from matching as rule IDs.
The chunker logs `no_section_boundaries_found` and `fallback_paragraph_chunking` when it falls back.

### Chunk Size Limits

```
MIN_CHARS = 100    → discard fragments (headers, page numbers)
MAX_CHARS = 1500   → split large sections at subsection markers (1), (2), (a), (b)
```

### Split Priority for Large Sections

```
1. Subsection markers: \n\s*\((?:\d+|[a-z]{1,3})\)\s+
2. Sentence boundary near MAX_CHARS: rfind('. ')
3. Hard cut at MAX_CHARS (last resort)
```

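The three-step priority above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in `chunker.py`: the function names are hypothetical, pieces are not merged back together when several small subsections would fit in one chunk, and the `MIN_CHARS` filter is left out.

```python
import re

MAX_CHARS = 1500
SUBSECTION_RE = re.compile(r"\n\s*\((?:\d+|[a-z]{1,3})\)\s+")


def _split_by_length(text: str, max_chars: int) -> list[str]:
    """Priorities 2 and 3: sentence boundary near the limit, else hard cut."""
    parts = []
    while len(text) > max_chars:
        cut = text.rfind(". ", 0, max_chars)
        cut = cut + 1 if cut != -1 else max_chars  # keep the period, or hard cut
        head = text[:cut].strip()
        if head:
            parts.append(head)
        text = text[cut:].strip()
    if text:
        parts.append(text)
    return parts


def split_large_section(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    # Priority 1: cut in front of every subsection marker such as "(1)" or "(a)".
    bounds = [0, *(m.start() for m in SUBSECTION_RE.finditer(text)), len(text)]
    chunks = []
    for start, end in zip(bounds, bounds[1:]):
        piece = text[start:end].strip()
        if piece:
            chunks.extend(_split_by_length(piece, max_chars))
    return chunks
```

Cutting in front of each marker keeps `(1)`, `(a)` and similar labels attached to the text they introduce, which matters later when citations refer to sub-section numbers.
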
### parser.py: max_pages cap

```python
@staticmethod
def parse(source: str | Path, max_pages: int | None = None) -> ParsedDocument:
    # Excerpt: `doc` is the PyMuPDF document already opened from `source`.
    all_pages = list(doc)
    if max_pages is not None:
        all_pages = all_pages[:max_pages]  # slice before the full-text build
```

---

## 6. Embedding Strategy

**Model:** `nomic-embed-text-v1.5` (via `sentence-transformers`, local; no Ollama required)
**Dimension:** 768
**Asymmetric prefixes** (MTEB/nomic-embed requirement):

```
Ingestion time:  "search_document: {section_title}\n{text}"   ← pipeline.py
Query time:      "search_query: {rewritten_query}"            ← retrieval/__init__.py
```

**Section title prepend (Phase 8 change):** `pipeline.py` prepends `section_title` to the
embedded text so sub-chunks (e.g. `S.11(2)`) retain their section context.
Without this, sub-chunks embed without "Obligations of promoter", so cosine similarity misses them.
The reranker still receives raw `chunk.text` (no title prefix).

Using the wrong prefix at query time causes ~10-15% recall degradation.

### Truncation Guard

```python
MAX_EMBED_CHARS = 4000  # ~1000 tokens: safe ceiling before the prefix is added


def embed_document(self, text: str) -> list[float]:
    if len(text) > MAX_EMBED_CHARS:
        log.warning("embedding_truncated", original_len=len(text), truncated_to=MAX_EMBED_CHARS)
        text = text[:MAX_EMBED_CHARS]
    prefixed = f"search_document: {text.strip()}"  # prefix AFTER truncation
    return self.embed_one(prefixed)
```

Truncation happens **before** the prefix is added. This prevents Ollama 500 errors on Tamil Nadu
and other gazette PDFs where sub-sections exceed 10K chars.

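Putting the title prepend, the truncation guard, and the two prefixes together, the string handed to the model can be assembled as below. This is a sketch of the ordering described in this section, not the code in `pipeline.py` or `embedder.py`: both function names are hypothetical, and it assumes the title is prepended before the 4000-character cut is applied.

```python
MAX_EMBED_CHARS = 4000
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def build_document_input(section_title: str, text: str) -> str:
    """Title prepend -> truncate -> prefix, in that order."""
    body = f"{section_title}\n{text}" if section_title else text
    body = body[:MAX_EMBED_CHARS]  # truncate BEFORE the prefix is added
    return f"{DOC_PREFIX}{body.strip()}"


def build_query_input(rewritten_query: str) -> str:
    """Queries get the asymmetric counterpart of the document prefix."""
    return f"{QUERY_PREFIX}{rewritten_query.strip()}"
```

Because the cut is applied to the body only, the prefix is never truncated away and the title always survives at the front of the embedded text.
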
---

## 7. Hybrid Retrieval: `_rrf_retrieve()`

All retrieval nodes share a single async helper `_rrf_retrieve()` in `agent/nodes.py`.

### Reciprocal Rank Fusion (RRF)

```python
RRF_K = 60  # standard constant


def rrf_score(rank_in_vector: int, rank_in_fts: int, k: int = RRF_K) -> float:
    """RRF score of one chunk from its rank in each result list."""
    return 1 / (k + rank_in_vector) + 1 / (k + rank_in_fts)
```

Fetches `top_k × 3` vector results and `top_k × 2` FTS results, deduplicates by `chunk_id`,
merges via RRF, and returns the top `top_k × 2`.

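As a concrete illustration of the fusion step, the sketch below merges two ranked lists of chunk IDs. It assumes 1-based ranks and that a chunk missing from one list simply contributes no term for that list; `rrf_merge` is a hypothetical name, not the signature of `_rrf_retrieve()`.

```python
RRF_K = 60


def rrf_merge(
    vector_ids: list[str],
    fts_ids: list[str],
    top_n: int,
    k: int = RRF_K,
) -> list[tuple[str, float]]:
    """Fuse two ranked ID lists; the dict keyed by chunk_id deduplicates."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, fts_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


merged = rrf_merge(["a", "b", "c"], ["b", "d"], top_n=3)
print([chunk_id for chunk_id, _ in merged])  # ['b', 'a', 'd']
```

Chunk `b` wins because it appears in both lists (1/62 + 1/61), even though it is not first in the vector list. That agreement bonus is the reason for fusing ranks rather than raw scores, which are not comparable between cosine similarity and `ts_rank`.
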
### Full-Text Search

`VectorStore.full_text_search()` uses `websearch_to_tsquery` in OR mode:

```sql
WHERE to_tsvector('english', text) @@ websearch_to_tsquery('english', :query)
ORDER BY ts_rank(to_tsvector('english', text), websearch_to_tsquery('english', :query)) DESC
```

Changed from `plainto_tsquery` (AND mode): AND required every query word to match,
which excluded relevant sections that matched most but not all words.
`websearch_to_tsquery` ANDs bare terms and only ORs terms separated by the keyword `or`,
so OR mode presumably relies on `:query` being built with `or` between terms.

### Section Family Expansion

After the RRF merge, the top-3 results trigger family expansion:

```python
for rc in merged[:3]:
    section_id = rc.chunk.section_id
    base_sid = re.sub(r'\([^)]*\)$', '', section_id).strip()  # "5(4)" → "5"
    family = await VectorStore.get_section_family(section_id=base_sid, jurisdiction=jur)
    # returns all chunks where section_id = '5' OR section_id LIKE '5(%'
```

`get_section_family` has a guard: it skips any `section_id` that still contains `(`. The caller
strips the trailing parenthesised suffix via `base_sid` before calling.
Hard cap: `_MAX_VECTOR_EXPANDED = 40` chunks before the reranker.

**Why top-3, not top-1:** If the top-1 RRF result is a sub-section (`S.5(4)`), its parent
family is expanded. But if the truly relevant parent section (`S.11`) appears at RRF rank 2,
expanding only top-1 misses it. Expanding top-3 covers more cases at the cost of a slightly
larger pool.

---

## 7b. Reranker Detail

- `reranker_score_threshold = 0.1`: minimum cross-encoder score to enter the candidate pool.
- `reranker_score_gap = 0.6`: score drop between adjacent chunks at which the gap filter cuts the list.

**Gap filter:**

```python
def _apply_score_gap(chunks, gap=0.6):
    for i in range(1, len(chunks)):
        if chunks[i - 1].rerank_score - chunks[i].rerank_score >= gap:
            return chunks[:i]
    return chunks
```

**Threshold history:** Originally `threshold=0.3, gap=0.35`. A gap of 0.35 was too aggressive:
it cut on a 0.36 score drop, leaving only 1 context chunk for the generator. Phase 8 lowered the
threshold to 0.1 and raised the gap to 0.6.

Final context: `pinned_chunks + gap_filtered[:max(0, 5 - len(pinned_chunks))]`, so at most 5 chunks.

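A worked example makes the effect of the two settings visible. The sketch below is illustrative only: `ScoredChunk` is a hypothetical stand-in for `RetrievedChunk`, the scores are made up, and it assumes pinned chunks are set aside before the threshold and gap filter are applied.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.1  # reranker_score_threshold
SCORE_GAP = 0.6        # reranker_score_gap
MAX_CONTEXT = 5


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for RetrievedChunk."""
    chunk_id: str
    rerank_score: float
    is_pinned: bool = False


def apply_score_gap(chunks: list[ScoredChunk], gap: float = SCORE_GAP) -> list[ScoredChunk]:
    """Cut the sorted list at the first drop of at least `gap`."""
    for i in range(1, len(chunks)):
        if chunks[i - 1].rerank_score - chunks[i].rerank_score >= gap:
            return chunks[:i]
    return chunks


def build_final_context(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    pinned = [c for c in chunks if c.is_pinned]
    pool = sorted(
        (c for c in chunks if not c.is_pinned and c.rerank_score >= SCORE_THRESHOLD),
        key=lambda c: c.rerank_score,
        reverse=True,
    )
    kept = apply_score_gap(pool)
    return pinned + kept[: max(0, MAX_CONTEXT - len(pinned))]


candidates = [
    ScoredChunk("p", 0.30, is_pinned=True),
    ScoredChunk("a", 0.98),
    ScoredChunk("b", 0.62),
    ScoredChunk("c", 0.55),
    ScoredChunk("d", 0.12),
    ScoredChunk("e", 0.05),
]
print([c.chunk_id for c in build_final_context(candidates)])  # ['p', 'a', 'b', 'c', 'd']
```

With the old `gap=0.35`, the 0.98 to 0.62 drop alone would have cut the pool down to chunk `a`, which is the single-context failure described in the threshold history.
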
---

## 8. Graph Retriever

`graph_retriever.py` is called for the `cross_reference`, `penalty_lookup`, and `temporal` query types.

### Section ID Extraction

```python
import re

section_pattern = re.compile(r'\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b', re.IGNORECASE)
rule_pattern = re.compile(r'\bRule\s+(\d+[A-Z]?)\b', re.IGNORECASE)
```

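A short usage sketch shows what these two patterns pull out of a query. The patterns are copied from the block above; `extract_ids` and the sample queries are hypothetical.

```python
import re

section_pattern = re.compile(r"\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b", re.IGNORECASE)
rule_pattern = re.compile(r"\bRule\s+(\d+[A-Z]?)\b", re.IGNORECASE)


def extract_ids(query: str) -> dict[str, list[str]]:
    """Collect every Section and Rule ID mentioned in the query text."""
    return {
        "sections": section_pattern.findall(query),
        "rules": rule_pattern.findall(query),
    }


print(extract_ids("How does Rule 14 implement Section 18?"))
print(extract_ids("penalty under s. 11(1)"))
```

The first call yields sections `['18']` and rules `['14']`. The second yields sections `['11']`: the capture group stops before the parenthesis, so a sub-section reference resolves to its parent section ID.
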
### Traversal Strategy (per jurisdiction)

For each jurisdiction (`CENTRAL`, `MAHARASHTRA`, `UTTAR_PRADESH`, `KARNATAKA`, `TAMIL_NADU`):

```
1. Source section chunks   → exact section_id match → is_pinned=True
2. REFERENCES outgoing     → sections the source cites (depth=2)
3. REFERENCES incoming     → sections that cite the source
4. DERIVED_FROM outgoing   → Act sections this Rule derives from
5. DERIVED_FROM incoming   → Rule sections implementing this Act section
```

### Pinning Rule

Only the exact `section_id` match gets `is_pinned=True`. Sub-sections are NOT pinned.
Max pinned chunks: 2 (one per jurisdiction). The remaining 3 slots are filled by the reranker.

---

## 9. Response Contract

```python
from datetime import date
from typing import Optional
from uuid import UUID

from pydantic import BaseModel, Field

from civicsetu.models.enums import Jurisdiction, QueryType


class Citation(BaseModel):
    section_id: str
    doc_name: str
    jurisdiction: Jurisdiction
    effective_date: Optional[date]
    source_url: str
    chunk_id: UUID


class CivicSetuResponse(BaseModel):
    answer: str                                      # plain English, cites section numbers
    citations: list[Citation] = Field(min_length=1)  # NEVER empty
    confidence_score: float                          # 0.0-1.0
    confidence_level: str                            # "high"/"medium"/"low"
    query_type_resolved: QueryType
    conflict_warnings: list[str]                     # empty until Phase 7
    amendment_notice: Optional[str]
    disclaimer: str                                  # always present
```

---

## 9b. Error Handling

| Scenario | Behaviour |
| :-- | :-- |
| LLM provider rate limited | LiteLLM auto-rotates to next provider |
| All LLM providers fail | `RuntimeError` → FastAPI 500 |
| No chunks retrieved | `InsufficientInfoResponse` returned |
| Hallucination detected | retry (max 2x) → low confidence answer |
| DB unreachable | `/health` returns `degraded`, query returns 500 |
| Scanned PDF detected | Warning logged, fallback URL used (Karnataka) |
| Section patterns not matched | Fallback paragraph chunking, warning logged |
| Neo4j event loop mismatch | Prevented: `_get_driver()` creates a fresh driver per call |
| Embedding input too long | Truncated at 4000 chars before prefix; warning logged |
| max_pages exceeded | Parser silently caps pages; total_pages reflects capped count |

---

## 10. Neo4j Graph: Phase 6 State (Current)

**Nodes:** 9 Documents, 2090 Sections
**Edges:** 1297 HAS_SECTION, 933 REFERENCES, 91 DERIVED_FROM

### Documents in Graph

| Document | Jurisdiction | DocType | Chunks | Sections | DERIVED_FROM edges |
|---|---|---|---|---|---|
| RERA Act 2016 | CENTRAL | ACT | ~224 | ~224 | - |
| MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | ~214 | 17 sec + 1 doc |
| UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 33 | 11 sec + 1 doc |
| UP RERA General Regs 2019 | UTTAR_PRADESH | CIRCULAR | 85 | 53 | - |
| Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 45 | 15 sec + 1 doc |
| Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 36 | 15 sec + 1 doc |

### Known Open Issues (non-blocking)

| Issue | Affected | Root Cause |
|---|---|---|
| Act §13 missing from graph | UP rule 14, KA rule 11, TN rule 11 | RERA Act ingestion: §13 chunked under a different ID |
| Act §66 missing from graph | KA rule 19, TN rule 19 | RERA Act ingestion: §66 not ingested |

### DERIVED_FROM Map Summary

| Jurisdiction | Mapped pairs | Resolved | Unresolved |
|---|---|---|---|
| MAHARASHTRA | 17 | 17 | 0 |
| UTTAR_PRADESH | 15 | 11 | 4 |
| KARNATAKA | 17 | 15 | 2 |
| TAMIL_NADU | 17 | 15 | 2 |

### PDF Source Decisions

| Jurisdiction | Primary URL | Issue | Resolution |
|---|---|---|---|
| CENTRAL | indiacode.nic.in | - | - |
| MAHARASHTRA | naredco.in | - | - |
| UTTAR_PRADESH | up-rera.in/pdf/rera.pdf | pages 25-52 are forms | max_pages=24 |
| KARNATAKA | naredco.in (mirror) | Official PDF fully scanned (19MB) | NAREDCO born-digital |
| TAMIL_NADU | cms.tn.gov.in | pages 16-101 are Forms A-O | max_pages=15 |

## 11. Agent Pipeline: Bug Fixes (2026-03-22)

Three production bugs were fixed after a 12-case E2E suite. All fixes verified: 0 retries, 0
hallucinations, avg latency 7.6s.

### Fix 1: `vector_store.py::get_section_family` (Pydantic crash on `SELECT *`)

`SELECT *` returned `embedding` as a raw string, so Pydantic `list[float]` validation
failed. Fix: explicit column projection, with `embedding=None` on all returned chunks.
This matches every other `VectorStore` method.

### Fix 2: `nodes.py::vector_retrieval_node` (reranker blowup on section expansion)

Section family expansion ran on all 5 similarity hits → up to 121 chunks → FlashRank
cross-encoder serial scoring → 65s reranker time. Fix (Phase 5): expand the top-1 hit only, with a hard
cap of 25 chunks before the reranker.

```python
for rc in results[:1]:
    ...  # family expansion
expanded = expanded[:25]  # hard safety cap
```

**Phase 8 update:** Expanded to top-3 after the RAGAS eval revealed that when a sub-section
(e.g. `S.5(4)`) ranks #1, its parent `S.5` (with the 30-day rule) was never expanded.
The cap was raised to 40 to accommodate larger families.

```python
for rc in merged[:3]:  # top-3 RRF results (was: top-1)
    ...  # family expansion
expanded = expanded[:40]  # was: 25
```

### Fix 3: `nodes.py::validator_node` (false hallucination flag)

The validator built its context as a string of raw `chunk.text` values joined together. The generator's answer cites
`"Section 11(1)"` but the raw text has no section number → validator scores 0.2 →
`hallucinated=True` → spurious retry loops (7 retries across 12 tests).

Fix: mirror the generator's numbered context block `[i] doc - section_id: title\ntext`.
The validator can now match cited section numbers to the source context.

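The numbered context block from Fix 3 can be sketched as below. This is an illustration, not the code in `nodes.py`: `ContextChunk` is a hypothetical stand-in for the `LegalChunk` fields involved, and the 1-based numbering, the ` - ` separator, and the blank line between blocks are assumptions about the exact layout.

```python
from dataclasses import dataclass


@dataclass
class ContextChunk:
    """Hypothetical stand-in for the LegalChunk fields used here."""
    doc_name: str
    section_id: str
    section_title: str
    text: str


def build_numbered_context(chunks: list[ContextChunk]) -> str:
    """Render chunks as '[i] doc - section_id: title' headers followed by text."""
    blocks = [
        f"[{i}] {c.doc_name} - {c.section_id}: {c.section_title}\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    ]
    return "\n\n".join(blocks)
```

Sharing one builder between the generator and the validator would keep the two prompts from drifting apart again, since any change to the header format then reaches both nodes.
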
### E2E Regression Results (post-fix)

| Metric | Pre-fix | Post-fix |
| :-- | :-- | :-- |
| Avg latency | 19.6s | **7.6s** |
| Max latency | 87.1s | **13.3s** |
| Avg confidence | 0.908 | **0.958** |
| Total retries | 7 | **0** |
| Slow (>20s) | 3 | **0** |
| Low conf (<0.7) | 2 | **0** |
| Pass rate | 12/12 | **12/12** |

---

## 12. Agent Pipeline: RAGAS Eval Fixes (Phase 8, April 2026)

Five changes driven by the RAGAS evaluation, which revealed retrieval and faithfulness failures.

### Fix 4: Reranker thresholds too aggressive (`settings.py`)

The old `score_gap=0.35` cut after any 0.36-point drop, so only 1 chunk reached the generator.
New values: `score_threshold=0.1`, `score_gap=0.6`. These keep secondary relevant chunks while still
filtering genuine noise (a 0.98 → 0.20 drop is a 0.78 gap and would still cut).

### Fix 5: Generator analogy instruction caused hallucination (`generator.py`)

The instruction "Use an analogy or real-world example" produced analogies ("Think of it like selling a
used car") that were not present in the retrieved context, and the faithfulness judge scored them as hallucination.
Fix: removed the analogy instruction and replaced it with "using only information from the provided context".

### Fix 6: Generator weak grounding for sparse contexts (`generator.py`)

The generator constructed legal conclusions from reasoning even when the context lacked evidence.
Added explicit rules:
- For sparse context: say "Based on the available context: [X]" and note missing elements
- For conflict detection: only assert a conflict if BOTH provisions are present in the context

### Fix 7: CONFLICT_DETECTION tone hint implied precedence reasoning (`nodes.py`)

The tone hint said "state which jurisdiction takes precedence when context supports it".
The LLM interpreted "when context supports it" loosely and applied legal reasoning.
Rewritten to: "Never infer precedence from legal reasoning; only state precedence if
the context explicitly says so."

### Fix 8: Temporal query rewrite too generic (`classifier.py`)

The query "What is the timeline for project registration?" produced the rewrite "registration
timeline period", and FTS missed Section 5, which uses "within thirty days" and "deemed registered".
Added rewriting guidance to expand temporal queries with specific legal time-period keywords.

### RAGAS Results (Phase 8 baseline, 5-row smoke, gemma-4-31b-it judge)

| Row | Faithfulness (before) | Faithfulness (after) | Precision (before) | Precision (after) |
|---|---|---|---|---|
| CENTRAL-FACT-001 | 1.00 | 0.50 | 0.00 | 0.00 |
| CENTRAL-FACT-002 | 0.80 | 0.62 | 0.00 | 0.33 |
| CENTRAL-XREF-001 | 0.63 | 0.50 | 1.00 | 1.00 |
| CENTRAL-CONF-001 | 0.00 | 0.62 | 0.00 | 0.00 |
| CENTRAL-TEMP-001 | 0.67 | 1.00 | 1.00 | 0.00 |
| **Overall** | 0.618 | **0.650** | 0.400* | 0.267 |

\* The "before" baseline had inflated precision from duplicate chunks (non-deterministic doc_id).
After Phase 8, deterministic UUID5 chunk IDs prevent duplicates on re-ingest.
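
The deterministic-ID idea behind that footnote can be sketched with the standard library. Everything specific here is an assumption: the document does not say which namespace or which fields feed the UUID5 name, so `deterministic_chunk_id`, the `civicsetu:` name prefix, and the choice of `(doc_key, section_id, chunk_index)` are illustrative only.

```python
import uuid


def deterministic_chunk_id(doc_key: str, section_id: str, chunk_index: int) -> uuid.UUID:
    """Same inputs always give the same UUID, so a re-ingest reuses existing IDs."""
    # Assumption: the registry key, section ID, and position identify a chunk.
    name = f"civicsetu:{doc_key}:{section_id}:{chunk_index}"
    return uuid.uuid5(uuid.NAMESPACE_URL, name)


first = deterministic_chunk_id("rera_act_2016", "18", 0)
again = deterministic_chunk_id("rera_act_2016", "18", 0)
other = deterministic_chunk_id("rera_act_2016", "18", 1)
print(first == again, first == other)  # True False
```

With stable IDs, a second ingestion run can upsert on `chunk_id` instead of inserting fresh rows, which is what keeps duplicate chunks out of the retrieval pool.
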