CivicSetu: High Level Design (HLD)
Version: 1.0.0 (Phase 8 Complete)
Status: Phase 8 Complete. RAGAS evaluation pipeline live; retrieval improvements shipped.
Current Scope: RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules.
1. System Overview
CivicSetu is an open-source RAG (Retrieval-Augmented Generation) system that answers plain-English questions about Indian civic and legal documents with accurate citations, amendment tracking, and conflict detection between laws.
Target Users: Indian citizens, lawyers, homebuyers, and activists navigating RERA, RTI, labor law, GST compliance, and other civic frameworks.
Current Scope: RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules (5 jurisdictions).
2. Architecture Overview
┌────────────────────────────────────────────────────────────────┐
│                          CLIENT LAYER                          │
│              HTTP REST (FastAPI) - /api/v1/query               │
└───────────────────────────────┬────────────────────────────────┘
                                │
┌───────────────────────────────▼────────────────────────────────┐
│                        LANGGRAPH AGENT                         │
│                                                                │
│    [Classifier] → [Vector Retrieval] → [Reranker]              │
│                 ↘ [Graph Retrieval]  ↗                         │
│    [Retry] ← [Validator] ← [Generator]                         │
└───────────────────────────────┬────────────────────────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
  ┌───────▼───────┐    ┌────────▼────────┐    ┌───────▼───────┐
  │ pgvector      │    │ Neo4j           │    │ PostgreSQL    │
  │ (vectors)     │    │ (graph)         │    │ (metadata)    │
  │ Phase 0       │    │ Phase 1         │    │ Phase 0       │
  └───────┬───────┘    └────────┬────────┘    └───────┬───────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
┌───────────────────────────────▼────────────────────────────────┐
│                       INGESTION PIPELINE                       │
│       Download → Parse → Chunk → Enrich → Embed → Store        │
│ document_registry.py: single source of truth for all doc URLs  │
└────────────────────────────────────────────────────────────────┘
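The branch from the Classifier to the retrieval nodes can be sketched as a plain lookup on `query_type`, using the query types listed in Section 3.2. This is a hedged illustration, not the project's LangGraph code: the node names, the `route` function, the section-reference regex, and the default for unknown query types are all assumptions.

```python
import re

# Query types come from Section 3.2; the node names are hypothetical labels.
ROUTES = {
    "fact_lookup": "vector_retrieval",
    "cross_reference": "graph_retrieval",
    "penalty": "graph_retrieval",
    "temporal": "graph_retrieval",
    "conflict_detection": "hybrid_retrieval",
}

# Assumed shape of an explicit section reference, e.g. "Section 18" or "Rule 3A".
SECTION_REF = re.compile(r"\b(?:Section|Rule)\s+(\d+[A-Z]?)\b", re.IGNORECASE)


def route(query_type: str, query: str) -> str:
    """Pick a retrieval node for a classified query."""
    # Assumption: unknown query types default to vector retrieval.
    node = ROUTES.get(query_type, "vector_retrieval")
    # Graph traversal needs a starting section; without one, fall back to
    # the RRF-based retrieval path (assumed here to be the vector node).
    if node == "graph_retrieval" and not SECTION_REF.search(query):
        return "vector_retrieval"
    return node
```

For example, `route("cross_reference", "What are builder obligations under Section 18?")` selects the graph path, while a `penalty` query that names no section falls back to RRF retrieval.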
3. Two Pipelines
3.1 Ingestion Pipeline (Offline)
Runs once per document. Triggered via make ingest or POST /api/v1/ingest.
PDF URL (from document_registry.py)
  → Downloader (httpx, cached locally with MD5 check)
  → PDFParser (PyMuPDF, text extraction, scanned page detection)
  → LegalChunker (multi-format regex: Act + Rule boundary detection)
  → MetadataExtractor (dates, Section X + Rule X cross-references, amendment signals)
  → Embedder (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
  → RelationalStore (PostgreSQL: documents + legal_chunks tables)
  → VectorStore (pgvector: HNSW index, cosine similarity)
  → GraphStore (Neo4j: Document + Section nodes + edges)
  → GraphSeeder (REFERENCES + DERIVED_FROM edges seeded post-ingestion)
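The chunking and embedding-guard steps above can be pictured with a minimal sketch. The heading pattern is an assumption for illustration (the project's actual multi-format patterns are not shown in this document), and `Chunk`, `split_sections`, and `truncate_for_embedding` are hypothetical names; only the `MAX_EMBED_CHARS=6000` value comes from the pipeline description.

```python
import re
from dataclasses import dataclass

MAX_EMBED_CHARS = 6000  # guard value stated in the ingestion pipeline

# Assumed heading shape: a line starting with "Section 18" or "Rule 3A".
BOUNDARY = re.compile(r"^(Section|Rule)\s+(\d+[A-Z]?)\b", re.MULTILINE)


@dataclass
class Chunk:
    kind: str        # "Section" (Act) or "Rule"
    section_id: str  # e.g. "18"
    text: str


def split_sections(text: str) -> list[Chunk]:
    """Split text at each heading; a chunk runs until the next heading."""
    matches = list(BOUNDARY.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(Chunk(m.group(1), m.group(2), text[m.start():end].strip()))
    return chunks


def truncate_for_embedding(text: str, limit: int = MAX_EMBED_CHARS) -> str:
    """Cut over-long chunks so the embedding call stays within the limit."""
    return text[:limit]
```

A plain character cut is the simplest possible guard; whether the real Embedder truncates, splits, or summarises over-long chunks is not stated here.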
3.2 Query Pipeline (Online, per-request)
Triggered on every POST /api/v1/query.
User Query
  → Input Guardrails (PII + off-topic filter)
  → Classifier Node (LLM → query_type + rewritten_query)
  → Vector Retrieval (RRF: pgvector cosine + PostgreSQL FTS merged) ← fact_lookup
  → Graph Retrieval (Neo4j: REFERENCES + DERIVED_FROM traversal) ← cross_reference / penalty / temporal
      ├── Outgoing REFERENCES (sections this section cites)
      ├── Incoming REFERENCES (sections that cite this section)
      ├── DERIVED_FROM outgoing (Act sections this Rule derives from)
      └── DERIVED_FROM incoming (Rule sections that implement this Act section)
      Fallback: RRF hybrid retrieval when no section ID in query
  → Hybrid Retrieval (dual-query RRF across jurisdictions) ← conflict_detection
  → Source section pinning (exact match chunks bypass reranker)
  → Reranker (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
  → Generator Node (LLM → structured JSON answer with citations)
  → Validator Node (LLM → hallucination + confidence check)
  → Output Guardrails (faithfulness check + disclaimer injection)
  → CivicSetuResponse (answer + citations + confidence + disclaimer)
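The RRF (Reciprocal Rank Fusion) merge used by the retrieval steps can be sketched as follows. This is a generic RRF implementation, not the project's code: `rrf_merge` is a hypothetical name, and `k=60` is the constant commonly used with RRF, assumed here because the document does not state the value.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every ID it contains,
    with rank starting at 1 for the best hit in that list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Usage: one list from pgvector cosine search, one from PostgreSQL FTS.
vector_hits = ["c1", "c2", "c3"]
fts_hits = ["c3", "c1", "c4"]
merged = rrf_merge([vector_hits, fts_hits])
```

Because RRF works on ranks rather than raw scores, cosine similarities and full-text scores never need to be normalised against each other; chunks that appear in both lists (`c1`, `c3` above) rise to the top.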
4. Component Responsibilities
| Component | Responsibility | Technology |
|---|---|---|
| DocumentRegistry | Centralised doc URL + metadata management | Python dataclass |
| PDFParser | Text extraction from PDFs | PyMuPDF |
| LegalChunker | Multi-format section-boundary splitting | Regex (Act + Rule patterns) |
| MetadataExtractor | Extraction of dates, Section X + Rule X cross-references, and amendment signals | Regex |
| Embedder | Dense vector generation + truncation guard | nomic-embed-text (Ollama) |
| VectorStore | Semantic similarity search | pgvector + HNSW |
| GraphStore | Section relationship traversal (fresh driver per call) | Neo4j Community |
| GraphSeeder | Post-ingestion REFERENCES + DERIVED_FROM edge seeding | Neo4j Cypher |
| RelationalStore | Metadata persistence + chunk storage | PostgreSQL + SQLAlchemy |
| LangGraph Agent | Query orchestration state machine | LangGraph |
| LiteLLM Gateway | LLM provider fallback routing | LiteLLM |
| FastAPI | HTTP API layer | FastAPI + Uvicorn |
| FlashRank | Cross-encoder reranking (pinned source chunks exempt) | ONNX local model |
| Next.js Frontend | Chat UI, multi-turn sessions, citations panel | Next.js 15 App Router + Vercel |
5. LLM Fallback Chain
Primary  → gemini/gemini-2.5-flash-lite (Gemini API)
Backup 1 → openrouter/meta-llama/llama-3.3-70b-instruct:free (OpenRouter)
Backup 2 → groq/llama-3.3-70b-versatile (Groq API)
Backup 3 → openrouter/qwen/qwen3.6-plus:free (OpenRouter)
Local    → ollama/mistral (offline)
All routing handled by LiteLLM. Model swap = config change only.
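The fallback behaviour can be illustrated without any provider SDK. In the real system LiteLLM performs this routing; the sketch below only shows the idea of walking the chain in order. `complete_with_fallback` and the injected `call_model` callable are hypothetical stand-ins for whatever client call the gateway makes.

```python
from typing import Callable, Sequence

# Model identifiers copied from the chain above, in priority order.
FALLBACK_CHAIN = [
    "gemini/gemini-2.5-flash-lite",
    "openrouter/meta-llama/llama-3.3-70b-instruct:free",
    "groq/llama-3.3-70b-versatile",
    "openrouter/qwen/qwen3.6-plus:free",
    "ollama/mistral",
]


def complete_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    models: Sequence[str] = FALLBACK_CHAIN,
) -> tuple[str, str]:
    """Try each model in order; return (model_used, completion_text)."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # rate limit, outage, quota, etc.
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Keeping the chain as plain data is what makes a model swap a config-only change: reordering or replacing an entry changes behaviour without touching the calling code.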
6. Data Flow: Query to Response
Input: {"query": "What are builder obligations under Section 18?"}
Step 1  Classify → query_type=cross_reference (explicit section number detected)
Step 2  Graph    → traverse Section 18 node
                   REFERENCES outgoing + incoming (depth=2)
                   DERIVED_FROM outgoing → Act sections this Rule derives from
                   DERIVED_FROM incoming → Rule sections implementing this Act section
Step 2b Fallback → vector retrieval if graph returns 0 results
Step 3  Pin      → exact source section chunks marked is_pinned=True, skip reranker
Step 4  Rerank   → cross-encoder scores remaining chunks, top 5 total
Step 5  Generate → LLM produces JSON with answer + citations
Step 6  Validate → hallucination check, confidence score
Step 7  Respond  → CivicSetuResponse with citations + disclaimer
Output: {
"answer": "Under Section 18(1)...",
"citations": [{"section_id": "18", "doc_name": "RERA Act 2016", ...},
{"section_id": "18", "doc_name": "Maharashtra Real Estate...", ...}],
"confidence_score": 0.95,
"confidence_level": "high",
"disclaimer": "This is AI-generated information..."
}
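Steps 3 and 4 (pinning, then reranking to a top-5 total) can be sketched as below. This is an illustration under assumptions: `RetrievedChunk`, `pin_and_rerank`, and the `score_fn` callable are hypothetical, with `score_fn` standing in for the FlashRank cross-encoder score; only the `is_pinned` flag and the top-5 limit come from the steps above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetrievedChunk:
    chunk_id: str
    section_id: str
    text: str
    is_pinned: bool = False


def pin_and_rerank(
    chunks: list[RetrievedChunk],
    source_section_id: str,
    score_fn: Callable[[RetrievedChunk], float],
    top_k: int = 5,
) -> list[RetrievedChunk]:
    """Keep exact-match section chunks first; rerank only the rest."""
    pinned, rest = [], []
    for chunk in chunks:
        chunk.is_pinned = chunk.section_id == source_section_id
        (pinned if chunk.is_pinned else rest).append(chunk)
    # Only unpinned chunks are scored by the (stand-in) cross-encoder.
    rest.sort(key=score_fn, reverse=True)
    # Assumption: pinned chunks count toward the top_k total.
    return (pinned + rest)[:top_k]
```

Pinning guards against a known weakness of rerankers on legal text: the section the user asked about may score lower than a loosely related but more "question-like" passage, and would otherwise be dropped from the context.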
7. Phase Roadmap
| Phase | Scope | Status |
|---|---|---|
| 0 | RERA Act 2016, vector RAG, FastAPI | ✅ Complete |
| 1 | Neo4j graph, cross-reference queries | ✅ Complete |
| 2 | MahaRERA Rules 2017, multi-jurisdiction | ✅ Complete |
| 3 | DERIVED_FROM edges, cross-jurisdiction graph | ✅ Complete |
| 4 | Multi-state expansion (UP, TN, Karnataka RERA) | ✅ Complete |
| 5 | Agent pipeline hardening, E2E test suite | ✅ Complete |
| 6 | Next.js frontend, Vercel deployment, public URL | ✅ Complete |
| 7 | Graph explorer, section content drawer, D3 vis | ✅ Complete |
| 8 | RAGAS eval pipeline, hybrid RRF, retrieval fixes | ✅ Complete |
8. Non-Functional Requirements
| Requirement | Target | Current Status |
|---|---|---|
| Response latency | < 10s per query | 7.6s avg; 12/12 E2E PASS (2026-03-22). Live at https://civicsetu-two.vercel.app |
| Citation accuracy | 100% (never answer without a citation) | Enforced by schema |
| Hallucination rate | < 5% | Validator node + confidence gate |
| Cost | $0 for dev/staging | All free tier |
| Portability | Runs on any machine with Docker | Docker Compose |
| Faithfulness (RAGAS) | ≥ 0.75 overall | 0.650 (Phase 8 baseline, 5-row smoke) |
| Context precision (RAGAS) | ≥ 0.65 overall | 0.267 (Phase 8 baseline, 5-row smoke) |
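The two RAGAS rows can be turned into a simple gate that reports how far each metric sits below its target. This is a hedged sketch: `shortfalls` and the metric key names are hypothetical, and it does not call the RAGAS library; the targets and baseline values are the ones in the table above.

```python
# Targets from the table above; key names are assumed for illustration.
TARGETS = {"faithfulness": 0.75, "context_precision": 0.65}


def shortfalls(scores: dict[str, float], targets: dict[str, float] = TARGETS) -> dict[str, float]:
    """Return (target - score) for every metric that is below its target.

    Metrics missing from `scores` are treated as 0.0, so they always fail.
    """
    gaps = {}
    for metric, target in targets.items():
        gap = target - scores.get(metric, 0.0)
        if gap > 0:
            gaps[metric] = round(gap, 3)
    return gaps


# Phase 8 baseline from the 5-row smoke run.
baseline = {"faithfulness": 0.650, "context_precision": 0.267}
gaps = shortfalls(baseline)
```

On the Phase 8 baseline both metrics miss their targets, by about 0.100 for faithfulness and 0.383 for context precision. With only five rows behind the baseline, these gaps are best read as a rough signal rather than a stable measurement.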