# CivicSetu - RAG Techniques Reference

**Version:** 2.3 - Mobile Ledger + Quality Hardening
**Last Updated:** 2026-05-01

This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.

---

## 1. Current Status Snapshot

As of **2026-05-01**, CivicSetu's RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval quality fixes live.

- **Phase 9 Complete (Mobile Responsive)**
  - Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
  - Interactive Graph Explorer with section drill-down.
- **Cloud Infrastructure Live**
  - Relational & Vector: **Neon (Postgres + pgvector)**
  - Graph: **Neo4j AuraDB**
  - Frontend: **Vercel**
  - Backend API: **Hugging Face Spaces**
- **Live app routes**
  - `POST /api/v1/query` - buffered response
  - `POST /api/v1/query/stream` - SSE token streaming
  - `POST /api/v1/query/section-context` - section-focused chat
  - `/api/v1/graph/*` - graph explorer and section drill-down
- **Session-aware graph**
  - LangGraph uses `session_id` as the thread key.
  - Each turn clears retrieval/generation fields but preserves conversation history.
- **Active retrieval routing**
  - `fact_lookup -> vector_retrieval`
  - `cross_reference|penalty_lookup|temporal -> graph_retrieval`
  - `conflict_detection -> hybrid_retrieval`
- **Streaming is now first-class**
  - The streaming path reuses the classifier, retrieval, and reranker.
  - Answer text streams first.
  - Citations and metadata are extracted in a second fast pass.
- **Latest eval artifact (0.90 Faithfulness)**
  - `eval_results.json` dated **2026-04-28**
  - `faithfulness=0.900`
  - `answer_relevancy=0.858`
  - `context_precision=0.696`
  - `pass_rate=0.581`
- **Knowledge Graph Scale (as of 2026-05-01)**
  - Documents: `6`
  - Sections: `2,090`
  - Edges: `2,321` (REFERENCES, DERIVED_FROM, HAS_SECTION)
- **Main remaining weaknesses**
  - Multi-jurisdiction retrieval is still weak (`MULTI` rows pass only `20%`).
  - Context precision for broad fact lookups needs further HNSW tuning.

---

## 2. System Overview

CivicSetu is a legal-domain RAG system over five Indian RERA jurisdictions plus cross-jurisdiction queries.

Core problem:

- legal text is structured around sections, rules, sub-clauses, and cross-references
- users ask imprecise natural-language questions
- answers must stay grounded and cite the right legal section

Why plain semantic RAG fails here:

- embeddings blur important legal entities
- user queries often omit exact statute wording
- conflict questions need more than one legal source
- generation models tend to fill gaps unless grounding is strict

---

## 3. Ingestion Pipeline

### 3.1 PDF Parsing

`ingestion/parser.py` uses **PyMuPDF**. Important guards:

- a document-level `max_pages` cap trims form-heavy tails
- scanned-PDF detection skips image-only sources with no extractable text layer
- metadata stores the capped page count, not necessarily the total PDF page count

### 3.2 Section Boundary Chunking

`ingestion/chunker.py` applies multiple regex families in priority order to detect section and rule boundaries.

Current purpose:

- preserve citation boundaries
- keep section hierarchy intact
- split oversized sections without destroying legal structure

Fallback mode is paragraph chunking on double newlines, logged as `fallback_paragraph_chunking`.

### 3.3 Deterministic Chunk IDs

`chunk_id` is a UUID5 over stable section identity data.

Effect:

- re-ingestion is idempotent
- `ON CONFLICT DO UPDATE` replaces old chunk content
- the same legal section does not duplicate across re-runs
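A minimal sketch of how such an id can be derived, assuming the identity key combines document name, section id, and chunk index (the namespace constant, field choice, and function name here are hypothetical; the actual identity data lives in the ingestion code):

```python
import uuid

# Hypothetical namespace; the project presumably pins its own constant.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "civicsetu.chunks")

def make_chunk_id(doc_name: str, section_id: str, chunk_index: int) -> str:
    """Derive a stable UUID5 from section identity so re-ingestion is idempotent."""
    key = f"{doc_name}|{section_id}|{chunk_index}"
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))

# Same inputs always produce the same id, so the Postgres upsert
# (`ON CONFLICT DO UPDATE`) replaces old content instead of duplicating it.
assert make_chunk_id("rera_act_2016.pdf", "18", 0) == make_chunk_id("rera_act_2016.pdf", "18", 0)
```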
### 3.4 Section Title Prepended to Embeddings

During embedding, the section title is prepended to the chunk text.

Reason:

- split sub-sections often lose the title phrase that users actually search for
- the title prefix restores semantic recall for questions like "obligations of promoter"

The reranker still reads the raw chunk text, not the prefixed text.

### 3.5 Embedding Model

Current defaults from `config/settings.py`:

- `embedding_model = nomic-embed-text`
- `embedding_dimension = 768`

Query and document embeddings use asymmetric prefixes (`search_query: ` vs `search_document: `) compatible with Nomic-style retrieval.

### 3.6 Graph Seeding

`ingestion/graph_seeder.py` populates the Neo4j knowledge graph using data already persisted in PostgreSQL.

Key steps:

- **Idempotent Upsert:** Documents and Sections are merged into Neo4j using the UUID5 `chunk_id`.
- **Relationship Extraction:**
  - `REFERENCES`: `MetadataExtractor` identifies section numbers in text (e.g., "under section 18"). Handles internal and cross-jurisdiction links.
  - `DERIVED_FROM`: a static mapping identifies which State Rule sections derive from which Central Act sections (at both the Document and Section level).
- **Execution:** Automatically triggered at the end of `scripts/ingest.py`, or manually via `scripts/seed_phase3.py`.

---

## 4. Query Pipeline

### 4.1 Query Classification and Rewriting

`agent/nodes.py::classifier_node` classifies the query and rewrites it for retrieval.

Output shape:

```json
{
  "query_type": "fact_lookup | cross_reference | temporal | penalty_lookup | conflict_detection",
  "rewritten_query": "expanded retrieval-friendly query"
}
```

Current route mapping:

| Query Type | Route |
|---|---|
| `fact_lookup` | `vector_retrieval` |
| `cross_reference` | `graph_retrieval` |
| `penalty_lookup` | `graph_retrieval` |
| `temporal` | `graph_retrieval` |
| `conflict_detection` | `hybrid_retrieval` |

Classifier fallback: if JSON parsing fails, default to `fact_lookup` with the original query.

### 4.2 LLM Routing and Fallback Chain

All non-streaming LLM calls use `_llm_call()`. Streaming uses `_llm_stream()`.

Current model chain:

```text
THINKING tier (Generator)
  1. gemini/gemini-1.5-flash
  2. groq/llama-3.3-70b-versatile
  3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7

FAST tier (Classifier/Validator)
  1. gemini/gemini-1.5-flash
```

Provider notes:

- NVIDIA-hosted models (Minimax, GLM) use `https://integrate.api.nvidia.com/v1`.
- `temperature=0.0` for all grounding tasks.
- Gemini models use `temperature=1.0` on tiers where the provider requires it.

---

## 5. Hybrid Retrieval

Hybrid retrieval combines vector similarity and PostgreSQL full-text search, then expands section families.

### 5.1 Vector Similarity Search

Used to catch semantic matches when the user's wording differs from the statute text.

### 5.2 Full-Text Search

Used for exact legal wording, section numbers, and important terms, via `websearch_to_tsquery`.

### 5.3 Reciprocal Rank Fusion

Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
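A minimal sketch of the fusion step, assuming each input is a ranked list of chunk ids; the `k = 60` smoothing constant is the common default from the RRF literature, not a value confirmed from the codebase:

```python
def rrf_fuse(vector_hits: list[str], fts_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, fts_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Chunks ranked by BOTH vector and FTS accumulate two terms and rise.
    return sorted(scores, key=scores.get, reverse=True)

# Example: "sec-18" is second in both lists but tops the fused ranking
# because it appears in both signals.
print(rrf_fuse(["sec-31", "sec-18"], ["sec-40", "sec-18"]))  # ['sec-18', ...]
```

Because RRF only needs each chunk's rank, it sidesteps the problem of mixing vector-similarity scores and FTS scores that live on incomparable scales.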
### 5.4 Section-ID-Aware Direct Lookup

If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and **pins** them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.

### 5.5 Central Act Supplementation

For queries filtered to a specific state jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the **Central RERA Act 2016**. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.

---

## 6. Graph-Based Retrieval

Used for section-centric questions and legal relationships.

Current behavior:

- extract section or rule IDs from the query
- traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
- hydrate matching sections back from Postgres

Graph retrieval is especially important for:

- explicit section lookups
- penalty questions
- central vs state derivation paths

Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.

---

## 7. Reranking

### 7.1 Cross-Encoder

`retrieval/reranker.py` uses FlashRank (`ms-marco-MiniLM-L-12-v2`).

Pipeline:

1. deduplicate by `(section_id, doc_name)`
2. split pinned vs rankable chunks
3. rerank rankable chunks with the cross-encoder
4. filter by minimum score (0.05)
5. apply the score-gap cutoff (0.95)
6. prepend pinned chunks

### 7.2 Context Assembly

Max context size is **7 chunks**. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.

---

## 8. Generation

### 8.1 Buffered Generation

`generator_node()` builds a numbered context block and asks for JSON output.

### 8.2 Streaming Generation

`stream_generator_node()` now drives SSE output:

1. Run classification/retrieval/reranking.
2. Stream answer tokens immediately.
3. Run a second fast metadata-extraction prompt.
4. Push metadata/citations as the final SSE event.

### 8.3 Tone Hints by Query Type

| Type | Tone Guidance |
|---|---|
| `fact_lookup` | Direct, no metaphors, cite per bullet. |
| `penalty_lookup` | Lead with the consequence/penalty. |
| `cross_reference` | Explain the primary section, then its connections. |
| `conflict_detection` | Flag a contradiction ONLY if both sides are in context. |
| `temporal` | Lead with the exact numeric deadline/time. |

---

## 9. Validation

### 9.1 Validator Design

`validator_node()` treats `confidence_score < 0.2` as a hallucination risk.

- Returns `hallucination_flag: True` if the score is below the floor.
- The graph triggers a **retry** (up to 2 times) with different retrieval parameters if flagged.

### 9.2 Output Guardrails

`guardrails/output_guard.py`:

- Intercepts low-confidence or safeguard failures.
- Returns `InsufficientInfoResponse` when grounding is weak.
- Appends a legal disclaimer.

---

## 10. RAGAS Evaluation Pipeline

### 10.1 Two-Phase Architecture

- **Phase 1:** Graph invocation -> `eval_phase1_results.json`.
- **Phase 2:** RAGAS scoring -> `eval_results.json` (a scoring sketch appears at the end of this document).

### 10.2 Dataset & Metrics

- **Rows:** 31 (Central, 4 States, Multi-Jurisdiction).
- **Primary Metrics:** Faithfulness, Answer Relevancy, Context Precision.
- **Goal:** Faithfulness > 0.85; Answer Relevancy > 0.80.

---

## 11. Known Failure Modes

- **Multi-Jurisdiction Retrieval:** The reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
- **Large Context Noise:** 7 chunks sometimes include irrelevant sub-clauses that distract the generator.

---

## 12. Implementation Checklist

- [x] Add `DocumentSpec` to registry.
- [x] Verify PDF text extraction.
- [x] Run `make ingest`.
- [x] Seed Neo4j graph.
- [x] Run `make eval-smoke` to verify precision.
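For reference, a minimal sketch of the Phase 2 scoring step from Section 10, using the public `ragas` evaluation API. The artifact file names come from Section 10.1; the row schema, the JSON glue, and the judge-model configuration are assumptions:

```python
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# Assumed row schema: question, answer, contexts (list of strings),
# ground_truth. The real eval_phase1_results.json layout is not documented here.
with open("eval_phase1_results.json") as f:
    rows = json.load(f)

dataset = Dataset.from_list(rows)

# Phase 2: score the three primary metrics from Section 10.2. evaluate()
# falls back to ragas's default judge LLM unless one is passed explicitly.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # aggregate scores, e.g. {'faithfulness': 0.900, ...}

# Persist per-row scores alongside the aggregate artifact.
result.to_pandas().to_json("eval_results.json", orient="records", indent=2)
```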