# CivicSetu - High Level Design (HLD)
**Version:** 1.0.0 (Phase 8 complete)
**Status:** Phase 8 complete. RAGAS evaluation pipeline live; retrieval improvements shipped
**Current Scope:** RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules.
---
## 1. System Overview
CivicSetu is an open-source RAG (Retrieval-Augmented Generation) system that answers
plain-English questions about Indian civic and legal documents with accurate citations,
amendment tracking, and conflict detection between laws.
**Target Users:** Indian citizens, lawyers, homebuyers, activists navigating RERA, RTI,
labor law, GST compliance, and other civic frameworks.
**Current Scope:** RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules (5 jurisdictions).
---
## 2. Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                          CLIENT LAYER                           │
│               HTTP REST (FastAPI) - /api/v1/query               │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                         LANGGRAPH AGENT                         │
│                                                                 │
│         [Classifier] → [Vector Retrieval] → [Reranker]          │
│              ↑         [Graph Retrieval] ↗                      │
│           [Retry] ← [Validator] ← [Generator]                   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
           ┌─────────────────────┼─────────────────────┐
           │                     │                     │
   ┌───────▼───────┐     ┌───────▼───────┐     ┌───────▼───────┐
   │   pgvector    │     │     Neo4j     │     │  PostgreSQL   │
   │   (vectors)   │     │    (graph)    │     │  (metadata)   │
   │    Phase 0    │     │    Phase 1    │     │    Phase 0    │
   └───────┬───────┘     └───────┬───────┘     └───────┬───────┘
           │                     │                     │
           └─────────────────────┼─────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                       INGESTION PIPELINE                        │
│        Download → Parse → Chunk → Enrich → Embed → Store        │
│ document_registry.py - single source of truth for all doc URLs  │
└─────────────────────────────────────────────────────────────────┘
```
---
## 3. Two Pipelines
### 3.1 Ingestion Pipeline (Offline)
Runs once per document. Triggered via `make ingest` or `POST /api/v1/ingest`.
```
PDF URL (from document_registry.py)
  → Downloader (httpx, cached locally with MD5 check)
  → PDFParser (PyMuPDF, text extraction, scanned page detection)
  → LegalChunker (multi-format regex: Act + Rule boundary detection)
  → MetadataExtractor (dates, Section X + Rule X cross-references, amendment signals)
  → Embedder (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
  → RelationalStore (PostgreSQL - documents + legal_chunks tables)
  → VectorStore (pgvector - HNSW index, cosine similarity)
  → GraphStore (Neo4j - Document + Section nodes + edges)
  → GraphSeeder (REFERENCES + DERIVED_FROM edges seeded post-ingestion)
```
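
The LegalChunker step above splits text at Act and Rule boundaries using regular expressions. The following is a minimal sketch of that idea, not the project's implementation: the `BOUNDARY` pattern, the `split_sections` function, and the sample text are all assumptions made for illustration, since the actual patterns are not shown in this document.

```python
import re

# Hypothetical boundary pattern: a line starting with "Section 18." or "Rule 3."
# The real LegalChunker patterns are not given in this document.
BOUNDARY = re.compile(r"^(Section|Rule)\s+(\d+[A-Z]?)\.", re.MULTILINE)


def split_sections(text: str) -> list[dict]:
    """Split text into chunks, one per detected Section/Rule heading."""
    matches = list(BOUNDARY.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        # A chunk runs from this heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "kind": m.group(1),
            "section_id": m.group(2),
            "text": text[m.start():end].strip(),
        })
    return chunks


sample = (
    "Section 18. Return of amount and compensation.\n"
    "If the promoter fails to complete...\n"
    "Rule 3. Information to be furnished.\n"
    "The promoter shall..."
)
chunks = split_sections(sample)
```

Keeping `kind` and `section_id` on each chunk is what would let later stages (metadata extraction, graph seeding) link a Rule chunk back to the Act section it refers to.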
### 3.2 Query Pipeline (Online, per-request)
Triggered on every `POST /api/v1/query`.
```
User Query
  → Input Guardrails (PII + off-topic filter)
  → Classifier Node (LLM - query_type + rewritten_query)
  → Vector Retrieval (RRF: pgvector cosine + PostgreSQL FTS merged)   ← fact_lookup
  → Graph Retrieval (Neo4j - REFERENCES + DERIVED_FROM traversal)     ← cross_reference / penalty / temporal
      ├── Outgoing REFERENCES (sections this section cites)
      ├── Incoming REFERENCES (sections that cite this section)
      ├── DERIVED_FROM outgoing (Act sections this Rule derives from)
      └── DERIVED_FROM incoming (Rule sections that implement this Act section)
      Fallback: RRF hybrid retrieval when no section ID in query
  → Hybrid Retrieval (dual-query RRF across jurisdictions)            ← conflict_detection
  → Source section pinning (exact match chunks bypass reranker)
  → Reranker (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
  → Generator Node (LLM - structured JSON answer with citations)
  → Validator Node (LLM - hallucination + confidence check)
  → Output Guardrails (faithfulness check + disclaimer injection)
  → CivicSetuResponse (answer + citations + confidence + disclaimer)
```
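
The vector retrieval step merges pgvector and full-text results with RRF (reciprocal rank fusion). A minimal sketch of the standard RRF formula follows, where each list contributes `1 / (k + rank)` to a chunk's score. The function name `rrf_merge`, the constant `k=60` (a commonly used default, not confirmed by this document), and the chunk IDs are assumptions for illustration.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical chunk IDs standing in for pgvector and FTS result lists.
vector_hits = ["rera-18", "rera-19", "maha-rule-4"]
fts_hits = ["rera-19", "rera-71", "rera-18"]
merged = rrf_merge([vector_hits, fts_hits])
```

Because RRF uses only ranks, it needs no normalisation between cosine similarity and full-text scores, which is one plausible reason to prefer it for hybrid retrieval.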
---
## 4. Component Responsibilities
| Component | Responsibility | Technology |
| ----------------- | ------------------------------------------------------ | ------------------------------ |
| DocumentRegistry | Centralised doc URL + metadata management | Python dataclass |
| PDFParser | Text extraction from PDFs | PyMuPDF |
| LegalChunker | Multi-format section-boundary splitting | Regex (Act + Rule patterns) |
| MetadataExtractor | Date, Section X + Rule X reference, amendment-signal extraction | Regex |
| Embedder | Dense vector generation + truncation guard | nomic-embed-text (Ollama) |
| VectorStore | Semantic similarity search | pgvector + HNSW |
| GraphStore | Section relationship traversal - fresh driver per call | Neo4j Community |
| GraphSeeder | Post-ingestion REFERENCES + DERIVED_FROM edge seeding | Neo4j Cypher |
| RelationalStore | Metadata persistence + chunk storage | PostgreSQL + SQLAlchemy |
| LangGraph Agent | Query orchestration state machine | LangGraph |
| LiteLLM Gateway | LLM provider fallback routing | LiteLLM |
| FastAPI | HTTP API layer | FastAPI + Uvicorn |
| FlashRank | Cross-encoder reranking (pinned source chunks exempt) | ONNX local model |
| Next.js Frontend | Chat UI, multi-turn sessions, citations panel | Next.js 15 App Router + Vercel |
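
The Embedder row mentions a truncation guard, and the ingestion pipeline quotes `MAX_EMBED_CHARS=6000`. One way such a guard might look is sketched below. Only the constant's name and value come from this document; the function name and the back-off to a whitespace boundary are assumptions.

```python
MAX_EMBED_CHARS = 6000  # value quoted in the ingestion pipeline


def truncate_for_embedding(text: str, limit: int = MAX_EMBED_CHARS) -> str:
    """Cut text to at most `limit` characters before it is sent to the embedder."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Assumed behaviour: back off to the last space so a word is not split.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut


long_chunk = "word " * 2000          # 10,000 characters
safe_chunk = truncate_for_embedding(long_chunk)
```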
---
## 5. LLM Fallback Chain
```
Primary  → gemini/gemini-2.5-flash-lite (Gemini API)
Backup 1 → openrouter/meta-llama/llama-3.3-70b-instruct:free (OpenRouter)
Backup 2 → groq/llama-3.3-70b-versatile (Groq API)
Backup 3 → openrouter/qwen/qwen3.6-plus:free (OpenRouter)
Local    → ollama/mistral (offline)
```
LiteLLM handles all routing, so swapping a model requires only a configuration change.
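
To make the ordering concrete, here is a minimal sketch of a try-in-order fallback loop. It illustrates the idea only and does not use or reproduce LiteLLM's API: `complete_with_fallback`, the injected `call_model` callable, and `fake_call` are hypothetical names, and only the model identifiers are taken from the chain above.

```python
from typing import Callable

# Model identifiers as listed in the fallback chain above.
FALLBACK_CHAIN = [
    "gemini/gemini-2.5-flash-lite",
    "openrouter/meta-llama/llama-3.3-70b-instruct:free",
    "groq/llama-3.3-70b-versatile",
    "openrouter/qwen/qwen3.6-plus:free",
    "ollama/mistral",
]


def complete_with_fallback(
    prompt: str, call_model: Callable[[str, str], str]
) -> tuple[str, str]:
    """Try each model in order; return (model, answer) from the first success."""
    errors = []
    for model in FALLBACK_CHAIN:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # provider error, rate limit, timeout, ...
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a provider call; only Groq 'works' here."""
    if not model.startswith("groq/"):
        raise ConnectionError("simulated outage")
    return f"answer to {prompt!r}"


used_model, answer = complete_with_fallback("What is RERA?", fake_call)
```

In this simulated run the first two providers fail, so the request lands on the Groq model, the third entry in the chain.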
---
## 6. Data Flow: Query to Response
```
Input: {"query": "What are builder obligations under Section 18?"}
Step 1  Classify → query_type=cross_reference (explicit section number detected)
Step 2  Graph    → traverse Section 18 node
                   REFERENCES outgoing + incoming (depth=2)
                   DERIVED_FROM outgoing → Act sections this Rule derives from
                   DERIVED_FROM incoming → Rule sections implementing this Act section
Step 2b Fallback → vector retrieval if graph returns 0 results
Step 3  Pin      → exact source section chunks marked is_pinned=True, skip reranker
Step 4  Rerank   → cross-encoder scores remaining chunks, top 5 total
Step 5  Generate → LLM produces JSON with answer + citations
Step 6  Validate → hallucination check, confidence score
Step 7  Respond  → CivicSetuResponse with citations + disclaimer
Output: {
  "answer": "Under Section 18(1)...",
  "citations": [{"section_id": "18", "doc_name": "RERA Act 2016", ...},
                {"section_id": "18", "doc_name": "Maharashtra Real Estate...", ...}],
  "confidence_score": 0.95,
  "confidence_level": "high",
  "disclaimer": "This is AI-generated information..."
}
```
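
Steps 3 and 4 above can be sketched as follows. This is an illustrative reading of "pin, then rerank, top 5 total", in which pinned chunks count toward the five slots. The `Chunk` dataclass, the `pin_and_rank` function, and the scores are assumptions; in the real pipeline the scores would come from the cross-encoder reranker.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    section_id: str
    text: str
    score: float = 0.0       # hypothetical reranker score
    is_pinned: bool = False


def pin_and_rank(chunks: list[Chunk], source_section: str, top_k: int = 5) -> list[Chunk]:
    """Pin exact-match chunks, then fill the remaining slots by score."""
    for c in chunks:
        c.is_pinned = c.section_id == source_section
    pinned = [c for c in chunks if c.is_pinned]
    # Only unpinned chunks compete on reranker score.
    rest = sorted(
        (c for c in chunks if not c.is_pinned),
        key=lambda c: c.score,
        reverse=True,
    )
    return (pinned + rest)[:top_k]


candidates = [
    Chunk("19", "...", 0.91),
    Chunk("18", "...", 0.10),   # the section named in the query
    Chunk("71", "...", 0.40),
    Chunk("2", "...", 0.05),
    Chunk("31", "...", 0.77),
    Chunk("12", "...", 0.63),
]
selected = pin_and_rank(candidates, source_section="18")
```

Pinning guarantees that the section the user asked about reaches the generator even when the reranker scores it low, as with Section 18 in this example.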
---
## 7. Phase Roadmap
| Phase | Scope | Status |
| ----- | ------------------------------------------------ | ----------- |
| 0 | RERA Act 2016, vector RAG, FastAPI | ✅ Complete |
| 1 | Neo4j graph, cross-reference queries | ✅ Complete |
| 2 | MahaRERA Rules 2017, multi-jurisdiction | ✅ Complete |
| 3 | DERIVED_FROM edges, cross-jurisdiction graph | ✅ Complete |
| 4 | Multi-state expansion (UP, TN, Karnataka RERA) | ✅ Complete |
| 5 | Agent pipeline hardening, E2E test suite | ✅ Complete |
| 6 | Next.js frontend, Vercel deployment, public URL | ✅ Complete |
| 7 | Graph explorer, section content drawer, D3 vis | ✅ Complete |
| 8 | RAGAS eval pipeline, hybrid RRF, retrieval fixes | ✅ Complete |
---
## 8. Non-Functional Requirements
| Requirement | Target | Current Status |
| ------------------------- | ------------------------------------ | -------------------------------------------------------------------------------- |
| Response latency | < 10s per query | 7.6s avg, 12/12 E2E PASS (2026-03-22). Live at https://civicsetu-two.vercel.app |
| Citation accuracy | 100% (never answer without citation) | Enforced by schema |
| Hallucination rate | < 5% | Validator node + confidence gate |
| Cost | $0 for dev/staging | All free tier |
| Portability | Runs on any machine with Docker | Docker Compose |
| Faithfulness (RAGAS) | ≥ 0.75 overall | 0.650 (Phase 8 baseline, 5-row smoke) |
| Context precision (RAGAS) | ≥ 0.65 overall | 0.267 (Phase 8 baseline, 5-row smoke) |