# CivicSetu — High Level Design (HLD)

**Version:** 1.0.0
**Status:** Phase 8 Complete — RAGAS evaluation pipeline live; retrieval improvements shipped

---

## 1. System Overview

CivicSetu is an open-source RAG (Retrieval-Augmented Generation) system that answers plain-English questions about Indian civic and legal documents with accurate citations, amendment tracking, and conflict detection between laws.

**Target Users:** Indian citizens, lawyers, homebuyers, and activists navigating RERA, RTI, labor law, GST compliance, and other civic frameworks.

**Current Scope:** RERA Act 2016 (Central) + Maharashtra, Uttar Pradesh, Karnataka, and Tamil Nadu Rules (5 jurisdictions).

---

## 2. Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                           CLIENT LAYER                           │
│               HTTP REST (FastAPI) — /api/v1/query                │
└────────────────────────────┬─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                         LANGGRAPH AGENT                          │
│                                                                  │
│   [Classifier] → [Vector Retrieval] → [Reranker]                 │
│         ↑          [Graph Retrieval] ↗                           │
│   [Retry] ← [Validator] ← [Generator]                            │
└────────────────────────────┬─────────────────────────────────────┘
                             │
          ┌──────────────────┼──────────────────────┐
          │                  │                      │
  ┌───────▼──────┐   ┌───────▼─────────┐   ┌───────▼────────┐
  │   pgvector   │   │      Neo4j      │   │   PostgreSQL   │
  │  (vectors)   │   │     (graph)     │   │   (metadata)   │
  │   Phase 0    │   │     Phase 1     │   │    Phase 0     │
  └───────┬──────┘   └───────┬─────────┘   └───────┬────────┘
          │                  │                     │
          └──────────────────┴─────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                        INGESTION PIPELINE                        │
│        Download → Parse → Chunk → Enrich → Embed → Store         │
│  document_registry.py — single source of truth for all doc URLs  │
└──────────────────────────────────────────────────────────────────┘
```

---
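The architecture above names `document_registry.py` as the single source of truth for document URLs. A minimal sketch of what such a registry could look like; the field names, IDs, and URLs here are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentEntry:
    """One registered source document (illustrative fields, not the real schema)."""
    doc_id: str        # stable key, e.g. "rera-act-2016" (hypothetical)
    title: str
    jurisdiction: str  # "Central" or a state name
    url: str           # canonical PDF URL (placeholder below)
    doc_type: str      # "act" or "rule"


# The registry is then just a keyed collection of entries.
REGISTRY = {
    e.doc_id: e
    for e in [
        DocumentEntry("rera-act-2016", "RERA Act 2016", "Central",
                      "https://example.org/rera-act-2016.pdf", "act"),
        DocumentEntry("maha-rera-rules-2017", "Maharashtra RERA Rules 2017",
                      "Maharashtra", "https://example.org/maha-rules-2017.pdf", "rule"),
    ]
}


def lookup(doc_id: str) -> DocumentEntry:
    """Resolve a document ID to its registered metadata."""
    return REGISTRY[doc_id]
```

Centralising entries this way means the ingestion pipeline never hard-codes a URL: adding a jurisdiction is one new `DocumentEntry`.

---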
## 3. Two Pipelines

### 3.1 Ingestion Pipeline (Offline)

Runs once per document. Triggered via `make ingest` or `POST /api/v1/ingest`.

```
PDF URL (from document_registry.py)
  → Downloader (httpx, cached locally with MD5 check)
  → PDFParser (PyMuPDF, text extraction, scanned page detection)
  → LegalChunker (multi-format regex: Act + Rule boundary detection)
  → MetadataExtractor (dates, Section X + Rule X cross-references, amendment signals)
  → Embedder (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
  → RelationalStore (PostgreSQL — documents + legal_chunks tables)
  → VectorStore (pgvector — HNSW index, cosine similarity)
  → GraphStore (Neo4j — Document + Section nodes + edges)
  → GraphSeeder (REFERENCES + DERIVED_FROM edges seeded post-ingestion)
```

### 3.2 Query Pipeline (Online, per-request)

Triggered on every `POST /api/v1/query`.

```
User Query
  → Input Guardrails (PII + off-topic filter)
  → Classifier Node (LLM — query_type + rewritten_query)
  → Vector Retrieval (RRF: pgvector cosine + PostgreSQL FTS merged)    ← fact_lookup
  → Graph Retrieval (Neo4j — REFERENCES + DERIVED_FROM traversal)      ← cross_reference / penalty / temporal
      ├── Outgoing REFERENCES (sections this section cites)
      ├── Incoming REFERENCES (sections that cite this section)
      ├── DERIVED_FROM outgoing (Act sections this Rule derives from)
      └── DERIVED_FROM incoming (Rule sections that implement this Act section)
      Fallback: RRF hybrid retrieval when no section ID in query
  → Hybrid Retrieval (dual-query RRF across jurisdictions)             ← conflict_detection
  → Source section pinning (exact match chunks bypass reranker)
  → Reranker (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
  → Generator Node (LLM — structured JSON answer with citations)
  → Validator Node (LLM — hallucination + confidence check)
  → Output Guardrails (faithfulness check + disclaimer injection)
  → CivicSetuResponse (answer + citations + confidence + disclaimer)
```

---
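Several retrieval stages above merge ranked lists with RRF. A minimal sketch of standard Reciprocal Rank Fusion, using the conventional k = 60; the project's implementation may differ in constants and tie-breaking:

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of chunk IDs.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by multiple retrievers float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative chunk IDs: dense (pgvector) and keyword (FTS) rankings
# disagree, but the chunk both retrievers like ends up first.
dense = ["s18", "s19", "s4"]
fts = ["s4", "s18", "s71"]
print(rrf_merge([dense, fts]))  # → ['s18', 's4', 's19', 's71']
```

`s18` wins because 1/61 + 1/62 (ranks 1 and 2) beats `s4`'s 1/63 + 1/61 (ranks 3 and 1) — appearing near the top of both lists outranks a single first place.

---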
## 4. Component Responsibilities

| Component         | Responsibility                                          | Technology                     |
| ----------------- | ------------------------------------------------------- | ------------------------------ |
| DocumentRegistry  | Centralised doc URL + metadata management               | Python dataclass               |
| PDFParser         | Text extraction from PDFs                               | PyMuPDF                        |
| LegalChunker      | Multi-format section-boundary splitting                 | Regex (Act + Rule patterns)    |
| MetadataExtractor | Date, Section X + Rule X reference, amendment extraction | Regex                          |
| Embedder          | Dense vector generation + truncation guard              | nomic-embed-text (Ollama)      |
| VectorStore       | Semantic similarity search                              | pgvector + HNSW                |
| GraphStore        | Section relationship traversal — fresh driver per call  | Neo4j Community                |
| GraphSeeder       | Post-ingestion REFERENCES + DERIVED_FROM edge seeding   | Neo4j Cypher                   |
| RelationalStore   | Metadata persistence + chunk storage                    | PostgreSQL + SQLAlchemy        |
| LangGraph Agent   | Query orchestration state machine                       | LangGraph                      |
| LiteLLM Gateway   | LLM provider fallback routing                           | LiteLLM                        |
| FastAPI           | HTTP API layer                                          | FastAPI + Uvicorn              |
| FlashRank         | Cross-encoder reranking (pinned source chunks exempt)   | ONNX local model               |
| Next.js Frontend  | Chat UI, multi-turn sessions, citations panel           | Next.js 15 App Router + Vercel |

---

## 5. LLM Fallback Chain

```
Primary  → gemini/gemini-2.5-flash-lite (Gemini API)
Backup 1 → openrouter/meta-llama/llama-3.3-70b-instruct:free (OpenRouter)
Backup 2 → groq/llama-3.3-70b-versatile (Groq API)
Backup 3 → openrouter/qwen/qwen3.6-plus:free (OpenRouter)
Local    → ollama/mistral (offline)
```

All routing is handled by LiteLLM; swapping a model is a configuration change only.

---
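LiteLLM performs this routing internally; conceptually, the chain behaves like the sketch below. `call_model` is a stand-in for the actual LLM call, not LiteLLM's API — only the model identifiers come from the chain above:

```python
FALLBACK_CHAIN = [
    "gemini/gemini-2.5-flash-lite",
    "openrouter/meta-llama/llama-3.3-70b-instruct:free",
    "groq/llama-3.3-70b-versatile",
    "openrouter/qwen/qwen3.6-plus:free",
    "ollama/mistral",
]


def complete_with_fallback(prompt: str, call_model, chain=FALLBACK_CHAIN):
    """Try each provider in order; return (model, response) from the first
    that succeeds.

    `call_model(model, prompt)` stands in for the real LLM call; LiteLLM
    does equivalent routing when configured with a fallback list.
    """
    last_error = None
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # rate limit, outage, auth failure, ...
            last_error = exc
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```

The key property is that a provider outage degrades latency (one extra round-trip per failed hop) rather than availability, ending at the offline Ollama model.

---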
## 6. Data Flow: Query to Response

```
Input: {"query": "What are builder obligations under Section 18?"}

Step 1   Classify → query_type=cross_reference (explicit section number detected)
Step 2   Graph    → traverse Section 18 node
                    REFERENCES outgoing + incoming (depth=2)
                    DERIVED_FROM outgoing → Act sections this Rule derives from
                    DERIVED_FROM incoming → Rule sections implementing this Act section
Step 2b  Fallback → vector retrieval if graph returns 0 results
Step 3   Pin      → exact source section chunks marked is_pinned=True, skip reranker
Step 4   Rerank   → cross-encoder scores remaining chunks, top 5 total
Step 5   Generate → LLM produces JSON with answer + citations
Step 6   Validate → hallucination check, confidence score
Step 7   Respond  → CivicSetuResponse with citations + disclaimer

Output: {
  "answer": "Under Section 18(1)...",
  "citations": [
    {"section_id": "18", "doc_name": "RERA Act 2016", ...},
    {"section_id": "18", "doc_name": "Maharashtra Real Estate...", ...}
  ],
  "confidence_score": 0.95,
  "confidence_level": "high",
  "disclaimer": "This is AI-generated information..."
}
```

---

## 7. Phase Roadmap

| Phase | Scope                                            | Status      |
| ----- | ------------------------------------------------ | ----------- |
| 0     | RERA Act 2016, vector RAG, FastAPI               | ✅ Complete |
| 1     | Neo4j graph, cross-reference queries             | ✅ Complete |
| 2     | MahaRERA Rules 2017, multi-jurisdiction          | ✅ Complete |
| 3     | DERIVED_FROM edges, cross-jurisdiction graph     | ✅ Complete |
| 4     | Multi-state expansion (UP, TN, Karnataka RERA)   | ✅ Complete |
| 5     | Agent pipeline hardening, E2E test suite         | ✅ Complete |
| 6     | Next.js frontend, Vercel deployment, public URL  | ✅ Complete |
| 7     | Graph explorer, section content drawer, D3 vis   | ✅ Complete |
| 8     | RAGAS eval pipeline, hybrid RRF, retrieval fixes | ✅ Complete |

---
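Returning to the retrieval steps in Section 6: the pinning rule (Steps 3 and 4, where exact source-section chunks marked `is_pinned=True` bypass the reranker and reranked chunks fill the remaining slots, five total) can be sketched as follows. `select_top_chunks` and `rerank_score` are illustrative names; only `is_pinned` comes from the pipeline itself, and `rerank_score` stands in for the FlashRank cross-encoder:

```python
def select_top_chunks(chunks: list[dict], rerank_score, top_n: int = 5) -> list[dict]:
    """Keep pinned chunks (exact source-section matches) unconditionally,
    then fill the remaining slots with the highest-scoring reranked
    chunks, returning at most top_n chunks in total.
    """
    pinned = [c for c in chunks if c.get("is_pinned")]
    reranked = sorted(
        (c for c in chunks if not c.get("is_pinned")),
        key=rerank_score,
        reverse=True,
    )
    return (pinned + reranked)[:top_n]
```

The design point: the cross-encoder never gets a chance to demote the section the user explicitly asked about, while still ordering everything else by relevance.

---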
## 8. Non-Functional Requirements

| Requirement               | Target                               | Current Status                                                                   |
| ------------------------- | ------------------------------------ | -------------------------------------------------------------------------------- |
| Response latency          | < 10s per query                      | 7.6s avg — 12/12 E2E PASS (2026-03-22). Live at https://civicsetu-two.vercel.app |
| Citation accuracy         | 100% — never answer without citation | Enforced by schema                                                               |
| Hallucination rate        | < 5%                                 | Validator node + confidence gate                                                 |
| Cost                      | $0 for dev/staging                   | All free tier                                                                    |
| Portability               | Runs on any machine with Docker      | Docker Compose                                                                   |
| Faithfulness (RAGAS)      | ≥ 0.75 overall                       | 0.650 (Phase 8 baseline, 5-row smoke run) — below target                         |
| Context precision (RAGAS) | ≥ 0.65 overall                       | 0.267 (Phase 8 baseline, 5-row smoke run) — below target                         |
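The two RAGAS rows can drive a simple regression gate in CI. A minimal sketch using the targets and Phase 8 baselines from the table above; the `gate` helper is hypothetical, and the real pipeline would obtain these scores from RAGAS itself:

```python
# Targets and baselines copied from the NFR table above.
TARGETS = {"faithfulness": 0.75, "context_precision": 0.65}
PHASE_8_BASELINE = {"faithfulness": 0.650, "context_precision": 0.267}


def gate(scores: dict, targets: dict = TARGETS) -> dict:
    """Return {metric: (score, target)} for every metric below its target."""
    return {m: (s, targets[m]) for m, s in scores.items() if s < targets[m]}


# Both Phase 8 baselines currently trip the gate.
for metric, (score, target) in gate(PHASE_8_BASELINE).items():
    print(f"{metric}: {score:.3f} below target {target:.2f}")
```

Wiring this into CI makes the eval pipeline a hard check rather than a dashboard: a retrieval change that regresses faithfulness fails the build instead of silently shipping.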