# CivicSetu - RAG Techniques Reference

**Version:** 2.3 - Mobile Ledger + Quality Hardening
**Last Updated:** 2026-05-01

This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.

---

## 1. Current Status Snapshot

As of **2026-05-01**, CivicSetu's RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval quality fixes live.

- **Phase 9 Complete (Mobile Responsive)**
  - Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
  - Interactive Graph Explorer with section drill-down.
- **Cloud Infrastructure Live**
  - Relational & Vector: **Neon (Postgres + pgvector)**
  - Graph: **Neo4j AuraDB**
  - Frontend: **Vercel**
  - Backend API: **Hugging Face Spaces**
- **Live app routes**
  - `POST /api/v1/query` - buffered response
  - `POST /api/v1/query/stream` - SSE token streaming
  - `POST /api/v1/query/section-context` - section-focused chat
  - `/api/v1/graph/*` - graph explorer and section drill-down
- **Session-aware graph**
  - LangGraph uses `session_id` as the thread key.
  - Each turn clears retrieval/generation fields but preserves conversation history.
- **Active retrieval routing**
  - `fact_lookup -> vector_retrieval`
  - `cross_reference|penalty_lookup|temporal -> graph_retrieval`
  - `conflict_detection -> hybrid_retrieval`
- **Streaming is now first-class**
  - The streaming path reuses the classifier, retrieval, and reranker.
  - Answer text streams first.
  - Citations and metadata are extracted in a second fast pass.
- **Latest eval artifact (0.90 Faithfulness)**
  - `eval_results.json` dated **2026-04-28**
  - `faithfulness=0.900`
  - `answer_relevancy=0.858`
  - `context_precision=0.696`
  - `pass_rate=0.581`
- **Knowledge Graph Scale (as of 2026-05-01)**
  - Documents: `6`
  - Sections: `2,090`
  - Edges: `2,321` (REFERENCES, DERIVED_FROM, HAS_SECTION)
- **Main remaining weaknesses**
  - Multi-jurisdiction retrieval is still weak (`MULTI` rows pass only `20%`).
  - Context precision for broad fact lookups needs further HNSW tuning.

---

## 2. System Overview

CivicSetu is a legal-domain RAG system over five Indian RERA jurisdictions plus cross-jurisdiction queries.

Core problem:

- legal text is structured around sections, rules, sub-clauses, and cross-references
- users ask imprecise natural-language questions
- answers must stay grounded and cite the right legal section

Why plain semantic RAG fails here:

- embeddings blur important legal entities
- user queries often omit exact statute wording
- conflict questions need more than one legal source
- generation models tend to fill gaps unless grounding is strict

---

## 3. Ingestion Pipeline

### 3.1 PDF Parsing

`ingestion/parser.py` uses **PyMuPDF**. Important guards:

- a document-level `max_pages` cap trims form-heavy tails
- scanned-PDF detection skips image-only sources with no extractable text layer
- metadata stores the capped page count, not necessarily the total PDF page count

### 3.2 Section Boundary Chunking

`ingestion/chunker.py` applies multiple regex families in priority order to detect section and rule boundaries.

Current purpose:

- preserve citation boundaries
- keep section hierarchy intact
- split oversized sections without destroying legal structure

Fallback mode is paragraph chunking on double newlines, logged as `fallback_paragraph_chunking`.

### 3.3 Deterministic Chunk IDs

`chunk_id` is a UUID5 over stable section identity data.

Effect:

- re-ingestion is idempotent
- `ON CONFLICT DO UPDATE` replaces old chunk content
- the same legal section does not duplicate across re-runs
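A minimal sketch of how such an id can be derived, assuming the identity key combines document name, section id, and chunk index (the namespace constant, field choice, and function name here are hypothetical; the actual identity data lives in the ingestion code):

```python
import uuid

# Hypothetical namespace; the project presumably pins its own constant.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "civicsetu.chunks")

def make_chunk_id(doc_name: str, section_id: str, chunk_index: int) -> str:
    """Derive a stable UUID5 from section identity so re-ingestion is idempotent."""
    key = f"{doc_name}|{section_id}|{chunk_index}"
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))

# Same inputs always produce the same id, so the Postgres upsert
# (`ON CONFLICT DO UPDATE`) replaces old content instead of duplicating it.
assert make_chunk_id("rera_act_2016.pdf", "18", 0) == make_chunk_id("rera_act_2016.pdf", "18", 0)
```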
### 3.4 Section Title Prepended to Embeddings

During embedding, the section title is prepended to the chunk text.

Reason:

- split sub-sections often lose the title phrase that users actually search for
- the title prefix restores semantic recall for questions like "obligations of promoter"

The reranker still reads the raw chunk text, not the prefixed text.

### 3.5 Embedding Model

Current defaults from `config/settings.py`:

- `embedding_model = nomic-embed-text`
- `embedding_dimension = 768`

Query and document embeddings use asymmetric prefixes (`search_query: ` vs `search_document: `) compatible with Nomic-style retrieval.

### 3.6 Graph Seeding

`ingestion/graph_seeder.py` populates the Neo4j knowledge graph using data already persisted in PostgreSQL.

Key steps:

- **Idempotent Upsert:** Documents and Sections are merged into Neo4j using the UUID5 `chunk_id`.
- **Relationship Extraction:**
  - `REFERENCES`: `MetadataExtractor` identifies section numbers in text (e.g., "under section 18"). Handles internal and cross-jurisdiction links.
  - `DERIVED_FROM`: a static mapping identifies which State Rule sections derive from which Central Act sections (at both the Document and Section level).
- **Execution:** Automatically triggered at the end of `scripts/ingest.py`, or manually via `scripts/seed_phase3.py`.

---

## 4. Query Pipeline

### 4.1 Query Classification and Rewriting

`agent/nodes.py::classifier_node` classifies the query and rewrites it for retrieval.

Output shape:

```json
{
  "query_type": "fact_lookup | cross_reference | temporal | penalty_lookup | conflict_detection",
  "rewritten_query": "expanded retrieval-friendly query"
}
```

Current route mapping:

| Query Type | Route |
|---|---|
| `fact_lookup` | `vector_retrieval` |
| `cross_reference` | `graph_retrieval` |
| `penalty_lookup` | `graph_retrieval` |
| `temporal` | `graph_retrieval` |
| `conflict_detection` | `hybrid_retrieval` |

Classifier fallback: if JSON parsing fails, default to `fact_lookup` with the original query.

### 4.2 LLM Routing and Fallback Chain

All non-streaming LLM calls use `_llm_call()`. Streaming uses `_llm_stream()`.

Current model chain:

```text
THINKING tier (Generator)
  1. gemini/gemini-1.5-flash
  2. groq/llama-3.3-70b-versatile
  3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7

FAST tier (Classifier/Validator)
  1. gemini/gemini-1.5-flash
```

Provider notes:

- NVIDIA-hosted models (Minimax, GLM) use `https://integrate.api.nvidia.com/v1`.
- `temperature=0.0` for all grounding tasks.
- Gemini models use `temperature=1.0` on tiers where the provider requires it.

---

## 5. Hybrid Retrieval

Hybrid retrieval combines vector similarity and PostgreSQL full-text search, then expands section families.

### 5.1 Vector Similarity Search

Used to catch semantic matches when the user's wording differs from the statute text.

### 5.2 Full-Text Search

Used for exact legal wording, section numbers, and important terms, via `websearch_to_tsquery`.

### 5.3 Reciprocal Rank Fusion

Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
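A minimal sketch of the fusion step, assuming each input is a ranked list of chunk ids; the `k = 60` smoothing constant is the common default from the RRF literature, not a value confirmed from the codebase:

```python
def rrf_fuse(vector_hits: list[str], fts_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, fts_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Chunks ranked by BOTH vector and FTS accumulate two terms and rise.
    return sorted(scores, key=scores.get, reverse=True)

# Example: "sec-18" is second in both lists but tops the fused ranking
# because it appears in both signals.
print(rrf_fuse(["sec-31", "sec-18"], ["sec-40", "sec-18"]))  # ['sec-18', ...]
```

Because RRF only needs each chunk's rank, it sidesteps the problem of mixing vector-similarity scores and FTS scores that live on incomparable scales.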
### 5.4 Section-ID-Aware Direct Lookup

If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and **pins** them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.

### 5.5 Central Act Supplementation

For queries filtered to a specific state jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the **Central RERA Act 2016**. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.

---

## 6. Graph-Based Retrieval

Used for section-centric questions and legal relationships.

Current behavior:

- extract section or rule IDs from the query
- traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
- hydrate matching sections back from Postgres

Graph retrieval is especially important for:

- explicit section lookups
- penalty questions
- central vs state derivation paths

Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.

---

## 7. Reranking

### 7.1 Cross-Encoder

`retrieval/reranker.py` uses FlashRank (`ms-marco-MiniLM-L-12-v2`).

Pipeline:

1. deduplicate by `(section_id, doc_name)`
2. split pinned vs rankable chunks
3. rerank rankable chunks with the cross-encoder
4. filter by minimum score (0.05)
5. apply the score-gap cutoff (0.95)
6. prepend pinned chunks

### 7.2 Context Assembly

Max context size is **7 chunks**. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.

---

## 8. Generation

### 8.1 Buffered Generation

`generator_node()` builds a numbered context block and asks for JSON output.

### 8.2 Streaming Generation

`stream_generator_node()` now drives SSE output:

1. Run classification/retrieval/reranking.
2. Stream answer tokens immediately.
3. Run a second fast metadata-extraction prompt.
4. Push metadata/citations as the final SSE event.

### 8.3 Tone Hints by Query Type

| Type | Tone Guidance |
|---|---|
| `fact_lookup` | Direct, no metaphors, cite per bullet. |
| `penalty_lookup` | Lead with the consequence/penalty. |
| `cross_reference` | Explain the primary section, then its connections. |
| `conflict_detection` | Flag a contradiction ONLY if both sides are in context. |
| `temporal` | Lead with the exact numeric deadline/time. |

---

## 9. Validation

### 9.1 Validator Design

`validator_node()` treats `confidence_score < 0.2` as a hallucination risk.

- Returns `hallucination_flag: True` if the score is below the floor.
- The graph triggers a **retry** (up to 2 times) with different retrieval parameters if flagged.

### 9.2 Output Guardrails

`guardrails/output_guard.py`:

- Intercepts low-confidence or safeguard failures.
- Returns `InsufficientInfoResponse` when grounding is weak.
- Appends a legal disclaimer.

---

## 10. RAGAS Evaluation Pipeline

### 10.1 Two-Phase Architecture

- **Phase 1:** Graph invocation -> `eval_phase1_results.json`.
- **Phase 2:** RAGAS scoring -> `eval_results.json` (a scoring sketch appears at the end of this document).

### 10.2 Dataset & Metrics

- **Rows:** 31 (Central, 4 States, Multi-Jurisdiction).
- **Primary Metrics:** Faithfulness, Answer Relevancy, Context Precision.
- **Goal:** Faithfulness > 0.85; Answer Relevancy > 0.80.

---

## 11. Known Failure Modes

- **Multi-Jurisdiction Retrieval:** The reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
- **Large Context Noise:** 7 chunks sometimes include irrelevant sub-clauses that distract the generator.

---

## 12. Implementation Checklist

- [x] Add `DocumentSpec` to registry.
- [x] Verify PDF text extraction.
- [x] Run `make ingest`.
- [x] Seed Neo4j graph.
- [x] Run `make eval-smoke` to verify precision.
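For reference, a minimal sketch of the Phase 2 scoring step from Section 10, using the public `ragas` evaluation API. The artifact file names come from Section 10.1; the row schema, the JSON glue, and the judge-model configuration are assumptions:

```python
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# Assumed row schema: question, answer, contexts (list of strings),
# ground_truth. The real eval_phase1_results.json layout is not documented here.
with open("eval_phase1_results.json") as f:
    rows = json.load(f)

dataset = Dataset.from_list(rows)

# Phase 2: score the three primary metrics from Section 10.2. evaluate()
# falls back to ragas's default judge LLM unless one is passed explicitly.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # aggregate scores, e.g. {'faithfulness': 0.900, ...}

# Persist per-row scores alongside the aggregate artifact.
result.to_pandas().to_json("eval_results.json", orient="records", indent=2)
```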