CivicSetu: High Level Design (HLD)

Version: 1.0.0 (Phase 8 Complete)
Status: Phase 8 Complete. RAGAS evaluation pipeline live; retrieval improvements shipped.
Current Scope: RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules.


1. System Overview

CivicSetu is an open-source RAG (Retrieval-Augmented Generation) system that answers plain-English questions about Indian civic and legal documents with accurate citations, amendment tracking, and conflict detection between laws.

Target Users: Indian citizens, lawyers, homebuyers, and activists navigating RERA, RTI, labor law, GST compliance, and other civic frameworks.

Current Scope: RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, Tamil Nadu Rules (5 jurisdictions).


2. Architecture Overview


┌──────────────────────────────────────────────────────────────────┐
│                        CLIENT LAYER                              │
│              HTTP REST (FastAPI) - /api/v1/query                 │
└────────────────────────────┬─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                     LANGGRAPH AGENT                              │
│                                                                  │
│  [Classifier] → [Vector Retrieval] → [Reranker]                  │
│       ↑         [Graph Retrieval]  ↗                             │
│  [Retry]  ←  [Validator] ← [Generator]                           │
└────────────────────────────┬─────────────────────────────────────┘
                             │
          ┌──────────────────┼──────────────────────┐
          │                  │                      │
  ┌───────▼──────┐   ┌───────▼─────────┐    ┌───────▼────────┐
  │  pgvector    │   │   Neo4j         │    │   PostgreSQL   │
  │  (vectors)   │   │   (graph)       │    │   (metadata)   │
  │  Phase 0     │   │   Phase 1       │    │   Phase 0      │
  └───────┬──────┘   └───────┬─────────┘    └───────┬────────┘
          │                  │                      │
          └──────────────────┴──────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                    INGESTION PIPELINE                            │
│  Download → Parse → Chunk → Enrich → Embed → Store               │
│  document_registry.py - single source of truth for all doc URLs  │
└──────────────────────────────────────────────────────────────────┘

3. Two Pipelines

3.1 Ingestion Pipeline (Offline)

Runs once per document. Triggered via make ingest or POST /api/v1/ingest.


PDF URL (from document_registry.py)
→ Downloader        (httpx, cached locally with MD5 check)
→ PDFParser         (PyMuPDF, text extraction, scanned page detection)
→ LegalChunker      (multi-format regex: Act + Rule boundary detection)
→ MetadataExtractor (dates, Section X + Rule X cross-references, amendment signals)
→ Embedder          (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
→ RelationalStore   (PostgreSQL: documents + legal_chunks tables)
→ VectorStore       (pgvector: HNSW index, cosine similarity)
→ GraphStore        (Neo4j: Document + Section nodes + edges)
→ GraphSeeder       (REFERENCES + DERIVED_FROM edges seeded post-ingestion)
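
To make the boundary-detection step more concrete, here is a minimal sketch of regex-based section splitting. The heading pattern, the `Chunk` dataclass, and `split_sections` are all hypothetical; the real LegalChunker patterns are not shown in this document and are likely more elaborate.

```python
import re
from dataclasses import dataclass

# Hypothetical heading pattern; the project's real regexes may differ.
# Act style:  "18. Some heading"
# Rule style: "Rule 4. Some heading"
BOUNDARY = re.compile(
    r"^(?:(?P<rule>Rule\s+(?P<rule_id>\d+[A-Z]?))|(?P<act_id>\d+[A-Z]?))"
    r"\.\s+(?P<title>.+)$",
    re.MULTILINE,
)


@dataclass
class Chunk:
    kind: str        # "rule" or "section"
    section_id: str
    title: str
    text: str


def split_sections(text: str) -> list[Chunk]:
    """Cut text at each heading; a chunk runs until the next heading.

    Text that appears before the first heading is dropped in this sketch.
    """
    matches = list(BOUNDARY.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        is_rule = m.group("rule") is not None
        chunks.append(
            Chunk(
                kind="rule" if is_rule else "section",
                section_id=m.group("rule_id") if is_rule else m.group("act_id"),
                title=m.group("title").strip(),
                text=text[m.start():end].strip(),
            )
        )
    return chunks
```

Splitting on headings rather than on a fixed token window keeps each chunk aligned with a citable section, which is what the citation and graph layers key on.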

3.2 Query Pipeline (Online, per-request)

Triggered on every POST /api/v1/query.


User Query
→ Input Guardrails  (PII + off-topic filter)
→ Classifier Node   (LLM: query_type + rewritten_query)
→ Vector Retrieval  (RRF: pgvector cosine + PostgreSQL FTS merged)    ← fact_lookup
→ Graph Retrieval   (Neo4j: REFERENCES + DERIVED_FROM traversal)      ← cross_reference / penalty / temporal
  ├── Outgoing REFERENCES  (sections this section cites)
  ├── Incoming REFERENCES  (sections that cite this section)
  ├── DERIVED_FROM outgoing (Act sections this Rule derives from)
  └── DERIVED_FROM incoming (Rule sections that implement this Act section)
  Fallback: RRF hybrid retrieval when no section ID in query
→ Hybrid Retrieval  (dual-query RRF across jurisdictions)             ← conflict_detection
→ Source section pinning (exact match chunks bypass reranker)
→ Reranker          (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
→ Generator Node    (LLM: structured JSON answer with citations)
→ Validator Node    (LLM: hallucination + confidence check)
→ Output Guardrails (faithfulness check + disclaimer injection)
→ CivicSetuResponse (answer + citations + confidence + disclaimer)
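
The RRF merge used by the vector and hybrid retrieval steps can be sketched as follows. This is the standard Reciprocal Rank Fusion formula, not the project's actual code: the function name, the chunk IDs, and the constant `k=60` (a common default for RRF) are assumptions, since this document does not state them.

```python
from collections import defaultdict


def rrf_merge(
    rankings: list[list[str]], k: int = 60, top_n: int = 10
) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Illustrative chunk ids only.
vector_hits = ["sec-18", "sec-19", "sec-31"]   # pgvector cosine order
fts_hits = ["sec-18", "sec-71", "sec-19"]      # PostgreSQL FTS order
merged = rrf_merge([vector_hits, fts_hits])
```

Because RRF uses only rank positions, cosine distances and FTS scores never need to be normalised onto a common scale before merging.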

4. Component Responsibilities

| Component | Responsibility | Technology |
|---|---|---|
| DocumentRegistry | Centralised doc URL + metadata management | Python dataclass |
| PDFParser | Text extraction from PDFs | PyMuPDF |
| LegalChunker | Multi-format section-boundary splitting | Regex (Act + Rule patterns) |
| MetadataExtractor | Date, Section X + Rule X reference, amendment extract | Regex |
| Embedder | Dense vector generation + truncation guard | nomic-embed-text (Ollama) |
| VectorStore | Semantic similarity search | pgvector + HNSW |
| GraphStore | Section relationship traversal (fresh driver per call) | Neo4j Community |
| GraphSeeder | Post-ingestion REFERENCES + DERIVED_FROM edge seeding | Neo4j Cypher |
| RelationalStore | Metadata persistence + chunk storage | PostgreSQL + SQLAlchemy |
| LangGraph Agent | Query orchestration state machine | LangGraph |
| LiteLLM Gateway | LLM provider fallback routing | LiteLLM |
| FastAPI | HTTP API layer | FastAPI + Uvicorn |
| FlashRank | Cross-encoder reranking (pinned source chunks exempt) | ONNX local model |
| Next.js Frontend | Chat UI, multi-turn sessions, citations panel | Next.js 15 App Router + Vercel |
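
As an illustration of the DocumentRegistry row, a dataclass-backed registry might look roughly like this. The field names, the `docs_for` helper, and the URLs are placeholders invented for this sketch; they are not the contents of document_registry.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentSpec:
    doc_id: str
    name: str
    jurisdiction: str
    doc_type: str   # "act" or "rules"
    url: str


# Placeholder entries: the URLs below are NOT the project's real sources.
REGISTRY: dict[str, DocumentSpec] = {
    spec.doc_id: spec
    for spec in (
        DocumentSpec("rera_act_2016", "RERA Act 2016", "central", "act",
                     "https://example.invalid/rera_act_2016.pdf"),
        DocumentSpec("maharashtra_rules_2017", "MahaRERA Rules 2017",
                     "maharashtra", "rules",
                     "https://example.invalid/maharashtra_rules_2017.pdf"),
    )
}


def docs_for(jurisdiction: str) -> list[DocumentSpec]:
    """Return every registered document for one jurisdiction."""
    return [s for s in REGISTRY.values() if s.jurisdiction == jurisdiction]
```

Keeping every URL in one frozen structure means the ingestion pipeline never hard-codes a source, and adding a state becomes a one-entry change.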

5. LLM Fallback Chain


Primary  → gemini/gemini-2.5-flash-lite                          (Gemini API)
Backup 1 → openrouter/meta-llama/llama-3.3-70b-instruct:free     (OpenRouter)
Backup 2 → groq/llama-3.3-70b-versatile                          (Groq API)
Backup 3 → openrouter/qwen/qwen3.6-plus:free                     (OpenRouter)
Local    → ollama/mistral                                        (offline)

All routing handled by LiteLLM. Model swap = config change only.
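
The behaviour of the chain can be sketched in plain Python. LiteLLM performs this routing itself, so the code below is not LiteLLM's API: `call_model` is a hypothetical stand-in for whatever function actually sends the prompt to a provider.

```python
from typing import Callable

# Model identifiers copied from the chain above, in priority order.
FALLBACK_CHAIN = [
    "gemini/gemini-2.5-flash-lite",
    "openrouter/meta-llama/llama-3.3-70b-instruct:free",
    "groq/llama-3.3-70b-versatile",
    "openrouter/qwen/qwen3.6-plus:free",
    "ollama/mistral",
]


def complete_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],   # hypothetical: (model, prompt) -> text
    chain: list[str] = FALLBACK_CHAIN,
) -> tuple[str, str]:
    """Try each model in order; return (model_used, text) from the first success."""
    errors: list[str] = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # rate limit, timeout, provider outage, ...
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the model name alongside the text makes it easy to log which tier actually served a request.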


6. Data Flow: Query to Response


Input:  {"query": "What are builder obligations under Section 18?"}

Step 1  Classify    → query_type=cross_reference (explicit section number detected)
Step 2  Graph       → traverse Section 18 node
                      REFERENCES outgoing + incoming (depth=2)
                      DERIVED_FROM outgoing → Act sections this Rule derives from
                      DERIVED_FROM incoming → Rule sections implementing this Act section
Step 2b Fallback    → vector retrieval if graph returns 0 results
Step 3  Pin         → exact source section chunks marked is_pinned=True, skip reranker
Step 4  Rerank      → cross-encoder scores remaining chunks, top 5 total
Step 5  Generate    → LLM produces JSON with answer + citations
Step 6  Validate    → hallucination check, confidence score
Step 7  Respond     → CivicSetuResponse with citations + disclaimer

Output: {
  "answer": "Under Section 18(1)...",
  "citations": [{"section_id": "18", "doc_name": "RERA Act 2016", ...},
                {"section_id": "18", "doc_name": "Maharashtra Real Estate...", ...}],
  "confidence_score": 0.95,
  "confidence_level": "high",
  "disclaimer": "This is AI-generated information..."
}
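
Steps 3 and 4 (pin, then rerank into the remaining slots) might be sketched like this. Only `is_pinned` and the top-5 cap come from the flow above; `RetrievedChunk`, `pin_and_rerank`, and the `score` callable, which stands in for the FlashRank cross-encoder, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetrievedChunk:
    section_id: str
    text: str
    is_pinned: bool = False


def pin_and_rerank(
    chunks: list[RetrievedChunk],
    source_section: str,
    score: Callable[[RetrievedChunk], float],  # stand-in for the cross-encoder
    top_k: int = 5,
) -> list[RetrievedChunk]:
    """Keep exact source-section chunks first; rerank the rest into what is left."""
    for chunk in chunks:
        chunk.is_pinned = chunk.section_id == source_section
    pinned = [c for c in chunks if c.is_pinned]
    # Only unpinned chunks are scored, so pinned ones cannot be ranked out.
    others = sorted(
        (c for c in chunks if not c.is_pinned), key=score, reverse=True
    )
    return (pinned + others)[:top_k]
```

Pinning guards against the case where a cross-encoder scores the literal text of the requested section below a more conversational neighbouring chunk.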

7. Phase Roadmap

| Phase | Scope | Status |
|---|---|---|
| 0 | RERA Act 2016, vector RAG, FastAPI | ✅ Complete |
| 1 | Neo4j graph, cross-reference queries | ✅ Complete |
| 2 | MahaRERA Rules 2017, multi-jurisdiction | ✅ Complete |
| 3 | DERIVED_FROM edges, cross-jurisdiction graph | ✅ Complete |
| 4 | Multi-state expansion (UP, TN, Karnataka RERA) | ✅ Complete |
| 5 | Agent pipeline hardening, E2E test suite | ✅ Complete |
| 6 | Next.js frontend, Vercel deployment, public URL | ✅ Complete |
| 7 | Graph explorer, section content drawer, D3 vis | ✅ Complete |
| 8 | RAGAS eval pipeline, hybrid RRF, retrieval fixes | ✅ Complete |

8. Non-Functional Requirements

| Requirement | Target | Current Status |
|---|---|---|
| Response latency | < 10s per query | 7.6s avg; 12/12 E2E PASS (2026-03-22). Live at https://civicsetu-two.vercel.app |
| Citation accuracy | 100% (never answer without citation) | Enforced by schema |
| Hallucination rate | < 5% | Validator node + confidence gate |
| Cost | $0 for dev/staging | All free tier |
| Portability | Runs on any machine with Docker | Docker Compose |
| Faithfulness (RAGAS) | ≥ 0.75 overall | 0.650 (Phase 8 baseline, 5-row smoke) |
| Context precision (RAGAS) | ≥ 0.65 overall | 0.267 (Phase 8 baseline, 5-row smoke) |
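
One way a schema can enforce the "never answer without citation" requirement is sketched below with standard-library dataclasses. The field names follow the sample output in section 6; the validation rules, including the assumed 0 to 1 range for `confidence_score`, are illustrative and may not match the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    section_id: str
    doc_name: str


@dataclass(frozen=True)
class CivicSetuResponse:
    answer: str
    citations: tuple[Citation, ...]
    confidence_score: float
    confidence_level: str
    disclaimer: str

    def __post_init__(self) -> None:
        # Reject any answer that arrives without at least one citation.
        if not self.citations:
            raise ValueError("response must carry at least one citation")
        # Assumed range; the document does not define the score's bounds.
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be within [0, 1]")
```

With the check in the constructor, an uncited answer fails before it can be serialised, so the guarantee does not depend on every caller remembering to validate.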