# CivicSetu — High Level Design (HLD)

**Version:** 1.0.0
**Status:** Phase 8 Complete — RAGAS evaluation pipeline live; retrieval improvements shipped

---

## 1. System Overview

CivicSetu is an open-source RAG (Retrieval-Augmented Generation) system that answers plain-English questions about Indian civic and legal documents with accurate citations, amendment tracking, and conflict detection between laws.

**Target Users:** Indian citizens, lawyers, homebuyers, and activists navigating RERA, RTI, labor law, GST compliance, and other civic frameworks.

**Current Scope:** RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, and Tamil Nadu Rules (5 jurisdictions).

---

## 2. Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                           CLIENT LAYER                           │
│               HTTP REST (FastAPI) — /api/v1/query                │
└────────────────────────────┬─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                         LANGGRAPH AGENT                          │
│                                                                  │
│   [Classifier] → [Vector Retrieval] → [Reranker]                 │
│         ↑          [Graph Retrieval] ↗                           │
│   [Retry] ← [Validator] ← [Generator]                            │
└────────────────────────────┬─────────────────────────────────────┘
                             │
          ┌──────────────────┼──────────────────────┐
          │                  │                      │
  ┌───────▼──────┐   ┌───────▼─────────┐   ┌───────▼────────┐
  │   pgvector   │   │      Neo4j      │   │   PostgreSQL   │
  │  (vectors)   │   │     (graph)     │   │   (metadata)   │
  │   Phase 0    │   │     Phase 1     │   │    Phase 0     │
  └───────┬──────┘   └───────┬─────────┘   └───────┬────────┘
          │                  │                     │
          └──────────────────┴─────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                        INGESTION PIPELINE                        │
│        Download → Parse → Chunk → Enrich → Embed → Store         │
│  document_registry.py — single source of truth for all doc URLs  │
└──────────────────────────────────────────────────────────────────┘
```

---
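The architecture above names `document_registry.py` as the single source of truth for document URLs. A minimal sketch of what such a registry could look like; the field names, IDs, and URLs here are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentEntry:
    """One registered source document (illustrative fields, not the real schema)."""
    doc_id: str        # stable key, e.g. "rera-act-2016" (hypothetical)
    title: str
    jurisdiction: str  # "Central" or a state name
    url: str           # canonical PDF URL (placeholder below)
    doc_type: str      # "act" or "rule"


# The registry is then just a keyed collection of entries.
REGISTRY = {
    e.doc_id: e
    for e in [
        DocumentEntry("rera-act-2016", "RERA Act 2016", "Central",
                      "https://example.org/rera-act-2016.pdf", "act"),
        DocumentEntry("maha-rera-rules-2017", "Maharashtra RERA Rules 2017",
                      "Maharashtra", "https://example.org/maha-rules-2017.pdf", "rule"),
    ]
}


def lookup(doc_id: str) -> DocumentEntry:
    """Resolve a document ID to its registered metadata."""
    return REGISTRY[doc_id]
```

Centralising entries this way means the ingestion pipeline never hard-codes a URL: adding a jurisdiction is one new `DocumentEntry`.

---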
## 3. Two Pipelines

### 3.1 Ingestion Pipeline (Offline)

Runs once per document. Triggered via `make ingest` or `POST /api/v1/ingest`.

```
PDF URL (from document_registry.py)
  → Downloader (httpx, cached locally with MD5 check)
  → PDFParser (PyMuPDF, text extraction, scanned page detection)
  → LegalChunker (multi-format regex: Act + Rule boundary detection)
  → MetadataExtractor (dates, Section X + Rule X cross-references, amendment signals)
  → Embedder (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
  → RelationalStore (PostgreSQL — documents + legal_chunks tables)
  → VectorStore (pgvector — HNSW index, cosine similarity)
  → GraphStore (Neo4j — Document + Section nodes + edges)
  → GraphSeeder (REFERENCES + DERIVED_FROM edges seeded post-ingestion)
```

### 3.2 Query Pipeline (Online, per-request)

Triggered on every `POST /api/v1/query`.

```
User Query
  → Input Guardrails (PII + off-topic filter)
  → Classifier Node (LLM — query_type + rewritten_query)
  → Vector Retrieval (RRF: pgvector cosine + PostgreSQL FTS merged)    ← fact_lookup
  → Graph Retrieval (Neo4j — REFERENCES + DERIVED_FROM traversal)      ← cross_reference / penalty / temporal
      ├── Outgoing REFERENCES (sections this section cites)
      ├── Incoming REFERENCES (sections that cite this section)
      ├── DERIVED_FROM outgoing (Act sections this Rule derives from)
      └── DERIVED_FROM incoming (Rule sections that implement this Act section)
      Fallback: RRF hybrid retrieval when no section ID in query
  → Hybrid Retrieval (dual-query RRF across jurisdictions)             ← conflict_detection
  → Source section pinning (exact match chunks bypass reranker)
  → Reranker (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
  → Generator Node (LLM — structured JSON answer with citations)
  → Validator Node (LLM — hallucination + confidence check)
  → Output Guardrails (faithfulness check + disclaimer injection)
  → CivicSetuResponse (answer + citations + confidence + disclaimer)
```

---
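Several retrieval stages above merge ranked lists with RRF. A minimal sketch of standard Reciprocal Rank Fusion, using the conventional k = 60; the project's implementation may differ in constants and tie-breaking:

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of chunk IDs.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by multiple retrievers float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative chunk IDs: dense (pgvector) and keyword (FTS) rankings
# disagree, but the chunk both retrievers like ends up first.
dense = ["s18", "s19", "s4"]
fts = ["s4", "s18", "s71"]
print(rrf_merge([dense, fts]))  # → ['s18', 's4', 's19', 's71']
```

`s18` wins because 1/61 + 1/62 (ranks 1 and 2) beats `s4`'s 1/63 + 1/61 (ranks 3 and 1) — appearing near the top of both lists outranks a single first place.

---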
## 4. Component Responsibilities

| Component         | Responsibility                                          | Technology                     |
| ----------------- | ------------------------------------------------------- | ------------------------------ |
| DocumentRegistry  | Centralised doc URL + metadata management               | Python dataclass               |
| PDFParser         | Text extraction from PDFs                               | PyMuPDF                        |
| LegalChunker      | Multi-format section-boundary splitting                 | Regex (Act + Rule patterns)    |
| MetadataExtractor | Date, Section X + Rule X reference, amendment extraction | Regex                          |
| Embedder          | Dense vector generation + truncation guard              | nomic-embed-text (Ollama)      |
| VectorStore       | Semantic similarity search                              | pgvector + HNSW                |
| GraphStore        | Section relationship traversal — fresh driver per call  | Neo4j Community                |
| GraphSeeder       | Post-ingestion REFERENCES + DERIVED_FROM edge seeding   | Neo4j Cypher                   |
| RelationalStore   | Metadata persistence + chunk storage                    | PostgreSQL + SQLAlchemy        |
| LangGraph Agent   | Query orchestration state machine                       | LangGraph                      |
| LiteLLM Gateway   | LLM provider fallback routing                           | LiteLLM                        |
| FastAPI           | HTTP API layer                                          | FastAPI + Uvicorn              |
| FlashRank         | Cross-encoder reranking (pinned source chunks exempt)   | ONNX local model               |
| Next.js Frontend  | Chat UI, multi-turn sessions, citations panel           | Next.js 15 App Router + Vercel |

---

## 5. LLM Fallback Chain

```
Primary  → gemini/gemini-2.5-flash-lite (Gemini API)
Backup 1 → openrouter/meta-llama/llama-3.3-70b-instruct:free (OpenRouter)
Backup 2 → groq/llama-3.3-70b-versatile (Groq API)
Backup 3 → openrouter/qwen/qwen3.6-plus:free (OpenRouter)
Local    → ollama/mistral (offline)
```

All routing is handled by LiteLLM; swapping a model is a configuration change only.

---
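LiteLLM performs this routing internally; conceptually, the chain behaves like the sketch below. `call_model` is a stand-in for the actual LLM call, not LiteLLM's API — only the model identifiers come from the chain above:

```python
FALLBACK_CHAIN = [
    "gemini/gemini-2.5-flash-lite",
    "openrouter/meta-llama/llama-3.3-70b-instruct:free",
    "groq/llama-3.3-70b-versatile",
    "openrouter/qwen/qwen3.6-plus:free",
    "ollama/mistral",
]


def complete_with_fallback(prompt: str, call_model, chain=FALLBACK_CHAIN):
    """Try each provider in order; return (model, response) from the first
    that succeeds.

    `call_model(model, prompt)` stands in for the real LLM call; LiteLLM
    does equivalent routing when configured with a fallback list.
    """
    last_error = None
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # rate limit, outage, auth failure, ...
            last_error = exc
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```

The key property is that a provider outage degrades latency (one extra round-trip per failed hop) rather than availability, ending at the offline Ollama model.

---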
## 6. Data Flow: Query to Response

```
Input: {"query": "What are builder obligations under Section 18?"}

Step 1   Classify → query_type=cross_reference (explicit section number detected)
Step 2   Graph    → traverse Section 18 node
                    REFERENCES outgoing + incoming (depth=2)
                    DERIVED_FROM outgoing → Act sections this Rule derives from
                    DERIVED_FROM incoming → Rule sections implementing this Act section
Step 2b  Fallback → vector retrieval if graph returns 0 results
Step 3   Pin      → exact source section chunks marked is_pinned=True, skip reranker
Step 4   Rerank   → cross-encoder scores remaining chunks, top 5 total
Step 5   Generate → LLM produces JSON with answer + citations
Step 6   Validate → hallucination check, confidence score
Step 7   Respond  → CivicSetuResponse with citations + disclaimer

Output: {
  "answer": "Under Section 18(1)...",
  "citations": [
    {"section_id": "18", "doc_name": "RERA Act 2016", ...},
    {"section_id": "18", "doc_name": "Maharashtra Real Estate...", ...}
  ],
  "confidence_score": 0.95,
  "confidence_level": "high",
  "disclaimer": "This is AI-generated information..."
}
```

---

## 7. Phase Roadmap

| Phase | Scope                                            | Status      |
| ----- | ------------------------------------------------ | ----------- |
| 0     | RERA Act 2016, vector RAG, FastAPI               | ✅ Complete |
| 1     | Neo4j graph, cross-reference queries             | ✅ Complete |
| 2     | MahaRERA Rules 2017, multi-jurisdiction          | ✅ Complete |
| 3     | DERIVED_FROM edges, cross-jurisdiction graph     | ✅ Complete |
| 4     | Multi-state expansion (UP, TN, Karnataka RERA)   | ✅ Complete |
| 5     | Agent pipeline hardening, E2E test suite         | ✅ Complete |
| 6     | Next.js frontend, Vercel deployment, public URL  | ✅ Complete |
| 7     | Graph explorer, section content drawer, D3 vis   | ✅ Complete |
| 8     | RAGAS eval pipeline, hybrid RRF, retrieval fixes | ✅ Complete |

---
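Returning to the retrieval steps in Section 6: the pinning rule (Steps 3 and 4, where exact source-section chunks marked `is_pinned=True` bypass the reranker and reranked chunks fill the remaining slots, five total) can be sketched as follows. `select_top_chunks` and `rerank_score` are illustrative names; only `is_pinned` comes from the pipeline itself, and `rerank_score` stands in for the FlashRank cross-encoder:

```python
def select_top_chunks(chunks: list[dict], rerank_score, top_n: int = 5) -> list[dict]:
    """Keep pinned chunks (exact source-section matches) unconditionally,
    then fill the remaining slots with the highest-scoring reranked
    chunks, returning at most top_n chunks in total.
    """
    pinned = [c for c in chunks if c.get("is_pinned")]
    reranked = sorted(
        (c for c in chunks if not c.get("is_pinned")),
        key=rerank_score,
        reverse=True,
    )
    return (pinned + reranked)[:top_n]
```

The design point: the cross-encoder never gets a chance to demote the section the user explicitly asked about, while still ordering everything else by relevance.

---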
## 8. Non-Functional Requirements

| Requirement               | Target                               | Current Status                                                                   |
| ------------------------- | ------------------------------------ | -------------------------------------------------------------------------------- |
| Response latency          | < 10s per query                      | 7.6s avg — 12/12 E2E PASS (2026-03-22). Live at https://civicsetu-two.vercel.app |
| Citation accuracy         | 100% — never answer without citation | Enforced by schema                                                               |
| Hallucination rate        | < 5%                                 | Validator node + confidence gate                                                 |
| Cost                      | $0 for dev/staging                   | All free tier                                                                    |
| Portability               | Runs on any machine with Docker      | Docker Compose                                                                   |
| Faithfulness (RAGAS)      | ≥ 0.75 overall                       | 0.650 (Phase 8 baseline, 5-row smoke run) — below target                         |
| Context precision (RAGAS) | ≥ 0.65 overall                       | 0.267 (Phase 8 baseline, 5-row smoke run) — below target                         |
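The two RAGAS rows can drive a simple regression gate in CI. A minimal sketch using the targets and Phase 8 baselines from the table above; the `gate` helper is hypothetical, and the real pipeline would obtain these scores from RAGAS itself:

```python
# Targets and baselines copied from the NFR table above.
TARGETS = {"faithfulness": 0.75, "context_precision": 0.65}
PHASE_8_BASELINE = {"faithfulness": 0.650, "context_precision": 0.267}


def gate(scores: dict, targets: dict = TARGETS) -> dict:
    """Return {metric: (score, target)} for every metric below its target."""
    return {m: (s, targets[m]) for m, s in scores.items() if s < targets[m]}


# Both Phase 8 baselines currently trip the gate.
for metric, (score, target) in gate(PHASE_8_BASELINE).items():
    print(f"{metric}: {score:.3f} below target {target:.2f}")
```

Wiring this into CI makes the eval pipeline a hard check rather than a dashboard: a retrieval change that regresses faithfulness fails the build instead of silently shipping.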