# CivicSetu: High Level Design (HLD)

**Version:** 1.0.0 (Phase 8 Complete)
**Status:** Phase 8 Complete: RAGAS evaluation pipeline live; retrieval improvements shipped
**Current Scope:** RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules.

---

## 1. System Overview

CivicSetu is an open-source RAG (Retrieval-Augmented Generation) system that answers
plain-English questions about Indian civic and legal documents with accurate citations,
amendment tracking, and conflict detection between laws.

**Target Users:** Indian citizens, lawyers, homebuyers, and activists navigating RERA, RTI,
labor law, GST compliance, and other civic frameworks.

**Current Scope:** RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules (5 jurisdictions).

---

## 2. Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                           CLIENT LAYER                           │
│               HTTP REST (FastAPI) → /api/v1/query                │
└────────────────────────────┬─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                         LANGGRAPH AGENT                          │
│                                                                  │
│  [Classifier] → [Vector Retrieval] → [Reranker]                  │
│               ↘ [Graph Retrieval] ↗                              │
│  [Retry] ← [Validator] ← [Generator]                             │
└────────────────────────────┬─────────────────────────────────────┘
                             │
          ┌──────────────────┼─────────────────────┐
          │                  │                     │
  ┌───────▼──────┐   ┌───────▼─────────┐   ┌───────▼────────┐
  │   pgvector   │   │      Neo4j      │   │   PostgreSQL   │
  │  (vectors)   │   │     (graph)     │   │   (metadata)   │
  │   Phase 0    │   │     Phase 1     │   │    Phase 0     │
  └───────┬──────┘   └───────┬─────────┘   └───────┬────────┘
          │                  │                     │
          └──────────────────┴─────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                        INGESTION PIPELINE                        │
│  Download → Parse → Chunk → Enrich → Embed → Store               │
│  document_registry.py: single source of truth for all doc URLs   │
└──────────────────────────────────────────────────────────────────┘
```

---

## 3. Two Pipelines

### 3.1 Ingestion Pipeline (Offline)

Runs once per document. Triggered via `make ingest` or `POST /api/v1/ingest`.

```
PDF URL (from document_registry.py)
  → Downloader (httpx, cached locally with MD5 check)
  → PDFParser (PyMuPDF, text extraction, scanned page detection)
  → LegalChunker (multi-format regex: Act + Rule boundary detection)
  → MetadataExtractor (dates, Section X + Rule X cross-references, amendment signals)
  → Embedder (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
  → RelationalStore (PostgreSQL: documents + legal_chunks tables)
  → VectorStore (pgvector: HNSW index, cosine similarity)
  → GraphStore (Neo4j: Document + Section nodes + edges)
  → GraphSeeder (REFERENCES + DERIVED_FROM edges seeded post-ingestion)
```
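The chunking and embedding-guard steps listed above can be sketched roughly as follows. This is an illustration only, not the project's `LegalChunker`: the boundary regex, the `Chunk` dataclass, and both helper names are assumptions, and the sample text is placeholder content. Only the `MAX_EMBED_CHARS=6000` limit comes from the pipeline listing.

```python
import re
from dataclasses import dataclass

MAX_EMBED_CHARS = 6000  # limit named in the pipeline listing

# Hypothetical boundary pattern: a line that starts with "Section 18." or "Rule 3."
# The real chunker's patterns are not shown in this document.
BOUNDARY = re.compile(r"^(Section|Rule)\s+(\d+[A-Z]?)\.", re.MULTILINE)


@dataclass
class Chunk:
    kind: str        # "Section" or "Rule"
    section_id: str  # e.g. "18"
    text: str


def chunk_legal_text(text: str) -> list[Chunk]:
    """Split text at each Section/Rule heading; text before the first heading is dropped."""
    matches = list(BOUNDARY.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(Chunk(m.group(1), m.group(2), text[m.start():end].strip()))
    return chunks


def truncate_for_embedding(text: str, limit: int = MAX_EMBED_CHARS) -> str:
    """Cut over-long chunk text so the embedding input stays within the limit."""
    return text if len(text) <= limit else text[:limit]


# Placeholder text, not a quotation from any Act or Rule.
sample = "Section 18. Heading A.\nBody A.\nSection 19. Heading B.\nBody B."
chunks = chunk_legal_text(sample)
embed_inputs = [truncate_for_embedding(c.text) for c in chunks]
```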

### 3.2 Query Pipeline (Online, per-request)

Triggered on every `POST /api/v1/query`.

```
User Query
  → Input Guardrails (PII + off-topic filter)
  → Classifier Node (LLM → query_type + rewritten_query)
  → Vector Retrieval (RRF: pgvector cosine + PostgreSQL FTS merged)   ← fact_lookup
  → Graph Retrieval (Neo4j: REFERENCES + DERIVED_FROM traversal)      ← cross_reference / penalty / temporal
      ├── Outgoing REFERENCES (sections this section cites)
      ├── Incoming REFERENCES (sections that cite this section)
      ├── DERIVED_FROM outgoing (Act sections this Rule derives from)
      └── DERIVED_FROM incoming (Rule sections that implement this Act section)
      Fallback: RRF hybrid retrieval when no section ID in query
  → Hybrid Retrieval (dual-query RRF across jurisdictions)            ← conflict_detection
  → Source section pinning (exact match chunks bypass reranker)
  → Reranker (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
  → Generator Node (LLM → structured JSON answer with citations)
  → Validator Node (LLM → hallucination + confidence check)
  → Output Guardrails (faithfulness check + disclaimer injection)
  → CivicSetuResponse (answer + citations + confidence + disclaimer)
```
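Reciprocal Rank Fusion (RRF), which the listing above uses to merge pgvector and full-text results, can be sketched as below. The function name, the chunk IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions for illustration, not values taken from CivicSetu.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["s18", "s19", "s31"]  # hypothetical cosine-similarity ranking
fts_hits = ["s19", "s40", "s18"]     # hypothetical full-text-search ranking
fused = rrf_merge([vector_hits, fts_hits])
```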

---

## 4. Component Responsibilities

| Component         | Responsibility                                         | Technology                     |
| ----------------- | ------------------------------------------------------ | ------------------------------ |
| DocumentRegistry  | Centralised doc URL + metadata management              | Python dataclass               |
| PDFParser         | Text extraction from PDFs                              | PyMuPDF                        |
| LegalChunker      | Multi-format section-boundary splitting                | Regex (Act + Rule patterns)    |
| MetadataExtractor | Date, Section X + Rule X reference, amendment extract  | Regex                          |
| Embedder          | Dense vector generation + truncation guard             | nomic-embed-text (Ollama)      |
| VectorStore       | Semantic similarity search                             | pgvector + HNSW                |
| GraphStore        | Section relationship traversal; fresh driver per call  | Neo4j Community                |
| GraphSeeder       | Post-ingestion REFERENCES + DERIVED_FROM edge seeding  | Neo4j Cypher                   |
| RelationalStore   | Metadata persistence + chunk storage                   | PostgreSQL + SQLAlchemy        |
| LangGraph Agent   | Query orchestration state machine                      | LangGraph                      |
| LiteLLM Gateway   | LLM provider fallback routing                          | LiteLLM                        |
| FastAPI           | HTTP API layer                                         | FastAPI + Uvicorn              |
| FlashRank         | Cross-encoder reranking (pinned source chunks exempt)  | ONNX local model               |
| Next.js Frontend  | Chat UI, multi-turn sessions, citations panel          | Next.js 15 App Router + Vercel |

---

## 5. LLM Fallback Chain

```
Primary  → gemini/gemini-2.5-flash-lite                       (Gemini API)
Backup 1 → openrouter/meta-llama/llama-3.3-70b-instruct:free  (OpenRouter)
Backup 2 → groq/llama-3.3-70b-versatile                       (Groq API)
Backup 3 → openrouter/qwen/qwen3.6-plus:free                  (OpenRouter)
Local    → ollama/mistral                                     (offline)
```

LiteLLM handles all routing, so swapping a model is a config change only.
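To make the fallback order concrete, the loop below shows the general idea of trying providers in sequence. It is a plain-Python sketch and does not call LiteLLM: `call_with_fallback`, the `send` callable, and `fake_send` are hypothetical names, and in the real system this behaviour is expected to come from LiteLLM configuration rather than hand-written code. The model identifiers are copied from the chain above.

```python
from typing import Callable

# Model identifiers copied from the fallback chain, in priority order.
FALLBACK_CHAIN = [
    "gemini/gemini-2.5-flash-lite",
    "openrouter/meta-llama/llama-3.3-70b-instruct:free",
    "groq/llama-3.3-70b-versatile",
    "openrouter/qwen/qwen3.6-plus:free",
    "ollama/mistral",
]


def call_with_fallback(prompt: str, send: Callable[[str, str], str]) -> tuple[str, str]:
    """Try each model in order; return (model, reply) from the first that succeeds."""
    errors = []
    for model in FALLBACK_CHAIN:
        try:
            return model, send(model, prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def fake_send(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a provider call: the first two models fail."""
    if model in FALLBACK_CHAIN[:2]:
        raise TimeoutError("rate limited")
    return f"answer to {prompt!r}"


used_model, reply = call_with_fallback("What is RERA?", fake_send)
```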

---

## 6. Data Flow: Query to Response

```
Input: {"query": "What are builder obligations under Section 18?"}

Step 1  Classify  → query_type=cross_reference (explicit section number detected)
Step 2  Graph     → traverse Section 18 node
                    REFERENCES outgoing + incoming (depth=2)
                    DERIVED_FROM outgoing → Act sections this Rule derives from
                    DERIVED_FROM incoming → Rule sections implementing this Act section
Step 2b Fallback  → vector retrieval if graph returns 0 results
Step 3  Pin       → exact source section chunks marked is_pinned=True, skip reranker
Step 4  Rerank    → cross-encoder scores remaining chunks, top 5 total
Step 5  Generate  → LLM produces JSON with answer + citations
Step 6  Validate  → hallucination check, confidence score
Step 7  Respond   → CivicSetuResponse with citations + disclaimer

Output: {
  "answer": "Under Section 18(1)...",
  "citations": [{"section_id": "18", "doc_name": "RERA Act 2016", ...},
                {"section_id": "18", "doc_name": "Maharashtra Real Estate...", ...}],
  "confidence_score": 0.95,
  "confidence_level": "high",
  "disclaimer": "This is AI-generated information..."
}
```
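Steps 3 and 4 (pin the exact source section, rerank the rest, keep five chunks in total) might look like the sketch below. The `Chunk` fields other than `is_pinned`, the `pin_and_rerank` helper, and the `toy_scores` table are assumptions; the score lookup merely stands in for the cross-encoder, which is not invoked here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    section_id: str
    is_pinned: bool = False


def pin_and_rerank(chunks, source_section, score, top_k=5):
    """Keep exact-section chunks first, then fill the remaining slots by score."""
    for c in chunks:
        c.is_pinned = c.section_id == source_section
    pinned = [c for c in chunks if c.is_pinned]
    # Only unpinned chunks are scored; pinned chunks bypass the reranker.
    rest = sorted((c for c in chunks if not c.is_pinned), key=score, reverse=True)
    return (pinned + rest)[:top_k]


# Hypothetical retrieval results for a query that names Section 18.
candidates = [
    Chunk("a", "18"), Chunk("b", "19"), Chunk("c", "71"), Chunk("d", "18"),
    Chunk("e", "31"), Chunk("f", "12"), Chunk("g", "40"),
]
# Made-up relevance scores standing in for cross-encoder output.
toy_scores = {"b": 0.2, "c": 0.9, "e": 0.5, "f": 0.1, "g": 0.7}

top = pin_and_rerank(candidates, "18", lambda c: toy_scores[c.chunk_id])
top_ids = [c.chunk_id for c in top]
```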

---

## 7. Phase Roadmap

| Phase | Scope                                            | Status      |
| ----- | ------------------------------------------------ | ----------- |
| 0     | RERA Act 2016, vector RAG, FastAPI               | ✅ Complete |
| 1     | Neo4j graph, cross-reference queries             | ✅ Complete |
| 2     | MahaRERA Rules 2017, multi-jurisdiction          | ✅ Complete |
| 3     | DERIVED_FROM edges, cross-jurisdiction graph     | ✅ Complete |
| 4     | Multi-state expansion (UP, TN, Karnataka RERA)   | ✅ Complete |
| 5     | Agent pipeline hardening, E2E test suite         | ✅ Complete |
| 6     | Next.js frontend, Vercel deployment, public URL  | ✅ Complete |
| 7     | Graph explorer, section content drawer, D3 vis   | ✅ Complete |
| 8     | RAGAS eval pipeline, hybrid RRF, retrieval fixes | ✅ Complete |

---

## 8. Non-Functional Requirements

| Requirement               | Target                                | Current Status                                                                   |
| ------------------------- | ------------------------------------- | -------------------------------------------------------------------------------- |
| Response latency          | < 10s per query                       | 7.6s avg; 12/12 E2E PASS (2026-03-22). Live at https://civicsetu-two.vercel.app |
| Citation accuracy         | 100%; never answer without a citation | Enforced by schema                                                               |
| Hallucination rate        | < 5%                                  | Validator node + confidence gate                                                 |
| Cost                      | $0 for dev/staging                    | All free tier                                                                    |
| Portability               | Runs on any machine with Docker       | Docker Compose                                                                   |
| Faithfulness (RAGAS)      | ≥ 0.75 overall                        | 0.650 (Phase 8 baseline, 5-row smoke)                                            |
| Context precision (RAGAS) | ≥ 0.65 overall                        | 0.267 (Phase 8 baseline, 5-row smoke)                                            |

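One way the "never answer without a citation" rule could be enforced by schema is shown below. This sketch uses plain dataclasses; the field names follow the sample output in Section 6, but the validation logic and the class layout are assumptions and may differ from the project's actual response model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    section_id: str
    doc_name: str


@dataclass(frozen=True)
class CivicSetuResponse:
    answer: str
    citations: tuple[Citation, ...]
    confidence_score: float
    disclaimer: str

    def __post_init__(self) -> None:
        # Reject any answer that carries no citation at construction time.
        if not self.citations:
            raise ValueError("response must carry at least one citation")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be in [0, 1]")


ok = CivicSetuResponse(
    answer="Under Section 18(1)...",
    citations=(Citation("18", "RERA Act 2016"),),
    confidence_score=0.95,
    disclaimer="This is AI-generated information...",
)

try:
    CivicSetuResponse("Uncited answer", (), 0.9, "This is AI-generated information...")
    rejected = False
except ValueError:
    rejected = True
```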