CivicSetu - RAG Techniques Reference

Version: 2.3 - Mobile Ledger + Quality Hardening
Last Updated: 2026-05-01

This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.


1. Current Status Snapshot

As of 2026-05-01, CivicSetu's RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval quality fixes live.

  • Phase 9 Complete (Mobile Responsive)
    • Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
    • Interactive Graph Explorer with section drill-down.
  • Cloud Infrastructure Live
    • Relational & Vector: Neon (Postgres + pgvector)
    • Graph: Neo4j AuraDB
    • Frontend: Vercel
    • Backend API: Hugging Face Spaces
  • Live app routes
    • POST /api/v1/query - buffered response
    • POST /api/v1/query/stream - SSE token streaming
    • POST /api/v1/query/section-context - section-focused chat
    • /api/v1/graph/* - graph explorer and section drill-down
  • Session-aware graph
    • LangGraph uses session_id as thread key.
    • Each turn clears retrieval/generation fields but preserves conversation history.
  • Active retrieval routing
    • fact_lookup -> vector_retrieval
    • cross_reference|penalty_lookup|temporal -> graph_retrieval
    • conflict_detection -> hybrid_retrieval
  • Streaming is now first-class
    • streaming path reuses classifier, retrieval, and reranker
    • answer text streams first
    • citations and metadata are extracted in a second fast pass
  • Latest eval artifact (0.90 Faithfulness)
    • eval_results.json dated 2026-04-28
    • faithfulness=0.900
    • answer_relevancy=0.858
    • context_precision=0.696
    • pass_rate=0.581
  • Knowledge Graph Scale (as of 2026-05-01)
    • Documents: 6
    • Sections: 2,090
    • Edges: 2,321 (REFERENCES, DERIVED_FROM, HAS_SECTION)
  • Main remaining weakness
    • multi-jurisdiction retrieval still weak (MULTI rows pass only 20%)
    • context precision for broad fact lookups needs further HNSW tuning

2. System Overview

CivicSetu is a legal-domain RAG system covering five Indian RERA jurisdictions, with support for cross-jurisdiction queries.

Core problem:

  • legal text is structured around sections, rules, sub-clauses, and cross-references
  • users ask imprecise natural-language questions
  • answers must stay grounded and cite the right legal section

Why plain semantic RAG fails here:

  • embeddings blur important legal entities
  • user queries often omit exact statute wording
  • conflict questions need more than one legal source
  • generation models tend to fill gaps unless grounding is strict

3. Ingestion Pipeline

3.1 PDF Parsing

ingestion/parser.py uses PyMuPDF.

Important guards:

  • document-level max_pages trims form-heavy tails
  • scanned PDF detection avoids unusable OCR-free sources
  • metadata stores capped page count, not necessarily total PDF pages
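
A minimal sketch of these guards, assuming the PyMuPDF (fitz) API; the cap and threshold values are illustrative, not the ones in ingestion/parser.py:

import fitz  # PyMuPDF

MAX_PAGES = 60            # illustrative cap; the real per-document value differs
MIN_CHARS_PER_PAGE = 40   # below this average, treat the PDF as scanned

def parse_pdf(path: str) -> dict:
    doc = fitz.open(path)
    pages = min(doc.page_count, MAX_PAGES)        # trim form-heavy tails
    texts = [doc.load_page(i).get_text() for i in range(pages)]
    # Scanned-PDF guard: near-zero extractable text means an OCR-free scan.
    if sum(len(t) for t in texts) / max(pages, 1) < MIN_CHARS_PER_PAGE:
        raise ValueError(f"{path}: looks scanned, skipping")
    return {"text": "\n".join(texts), "pages": pages}  # capped page count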

3.2 Section Boundary Chunking

ingestion/chunker.py applies multiple regex families in priority order to detect section and rule boundaries.

Current purpose:

  • preserve citation boundaries
  • keep section hierarchy intact
  • split oversized sections without destroying legal structure

Fallback mode is paragraph chunking on double newlines, logged as fallback_paragraph_chunking.
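
A sketch of the priority-ordered boundary detection with the paragraph fallback; the regex families shown are illustrative stand-ins for the real patterns in ingestion/chunker.py:

import re

BOUNDARY_PATTERNS = [                                # highest priority first
    re.compile(r"^\s*Section\s+\d+[A-Z]?\b", re.M),  # "Section 18", "Section 4A"
    re.compile(r"^\s*Rule\s+\d+\b", re.M),           # "Rule 3"
]

def chunk(text: str) -> list[str]:
    for pattern in BOUNDARY_PATTERNS:
        starts = [m.start() for m in pattern.finditer(text)]
        if len(starts) >= 2:                     # pattern really segments the doc
            bounds = starts + [len(text)]
            return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # fallback_paragraph_chunking: split on double newlines
    return [p.strip() for p in text.split("\n\n") if p.strip()]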

3.3 Deterministic Chunk IDs

chunk_id is a UUID5 over stable section identity data.

Effect:

  • re-ingestion is idempotent
  • ON CONFLICT DO UPDATE replaces old chunk content
  • same legal section does not duplicate across re-runs
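
A sketch of how this fits together; the namespace, key layout, and column names are assumptions:

import uuid

CHUNK_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "civicsetu.chunks")  # illustrative

def make_chunk_id(doc_name: str, section_id: str, part: int = 0) -> str:
    # Same section identity -> same UUID5 on every re-ingestion run.
    return str(uuid.uuid5(CHUNK_NS, f"{doc_name}|{section_id}|{part}"))

UPSERT_SQL = """
INSERT INTO chunks (chunk_id, doc_name, section_id, content)
VALUES (%s, %s, %s, %s)
ON CONFLICT (chunk_id) DO UPDATE SET content = EXCLUDED.content;
"""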

3.4 Section Title Prepended to Embeddings

During embedding, section title is prepended to chunk text.

Reason:

  • split sub-sections often lose the title phrase that users actually search for
  • title prefix restores semantic recall for questions like "obligations of promoter"

Reranker still reads raw chunk text, not the prefixed text.
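
A sketch of that prefixing (not the actual helper):

def embedding_input(section_title: str, chunk_text: str) -> str:
    # The title phrase (e.g., "Obligations of promoter") survives even when
    # the section body was split; the reranker still sees raw chunk_text.
    return f"{section_title}\n{chunk_text}"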

3.5 Embedding Model

Current defaults from config/settings.py:

  • embedding_model = nomic-embed-text
  • embedding_dimension = 768

Query and document embeddings use asymmetric prefixes (search_query: vs. search_document:) compatible with Nomic-style retrieval.
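
A sketch of the asymmetric prefixing; serving the model via Ollama's Python client is an assumption, not something this document states:

import ollama  # assumption: nomic-embed-text served locally via Ollama

def embed_document(title: str, text: str) -> list[float]:
    prompt = f"search_document: {title}\n{text}"   # document-side prefix
    return ollama.embeddings(model="nomic-embed-text", prompt=prompt)["embedding"]

def embed_query(rewritten_query: str) -> list[float]:
    prompt = f"search_query: {rewritten_query}"    # query-side prefix
    return ollama.embeddings(model="nomic-embed-text", prompt=prompt)["embedding"]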

3.6 Graph Seeding

ingestion/graph_seeder.py populates the Neo4j knowledge graph using data already persisted in PostgreSQL.

Key steps:

  • Idempotent Upsert: Documents and Sections are merged into Neo4j using UUID5 chunk_id.
  • Relationship Extraction:
    • REFERENCES: MetadataExtractor identifies section numbers in text (e.g., "under section 18"). Handles internal and cross-jurisdiction links.
    • DERIVED_FROM: Static mapping identifies which State Rule sections derive from which Central Act sections (both at Document and Section level).
  • Execution: Automatically triggered at the end of scripts/ingest.py or manually via scripts/seed_phase3.py.
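
A sketch of the idempotent upsert, assuming the neo4j 5.x Python driver; the Cypher and property names are illustrative:

MERGE_SECTION = """
MERGE (d:Document {name: $doc_name})
MERGE (s:Section {chunk_id: $chunk_id})
  ON CREATE SET s.section_id = $section_id, s.title = $title
MERGE (d)-[:HAS_SECTION]->(s)
"""

def seed_section(driver, doc_name, chunk_id, section_id, title):
    # MERGE keys on the UUID5 chunk_id, so re-seeding matches existing
    # nodes instead of creating duplicates.
    driver.execute_query(MERGE_SECTION, doc_name=doc_name, chunk_id=chunk_id,
                         section_id=section_id, title=title)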

4. Query Pipeline

4.1 Query Classification and Rewriting

agent/nodes.py::classifier_node classifies query and rewrites it for retrieval.

Output shape:

{
  "query_type": "fact_lookup | cross_reference | temporal | penalty_lookup | conflict_detection",
  "rewritten_query": "expanded retrieval-friendly query"
}

Current route mapping:

Query Type            Route
fact_lookup           vector_retrieval
cross_reference       graph_retrieval
penalty_lookup        graph_retrieval
temporal              graph_retrieval
conflict_detection    hybrid_retrieval

Classifier fallback: if JSON parse fails, default to fact_lookup with original query.
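
A sketch of that fallback behavior:

import json

def parse_classification(raw: str, original_query: str) -> dict:
    try:
        out = json.loads(raw)
        return {"query_type": out["query_type"],
                "rewritten_query": out["rewritten_query"]}
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed classifier output degrades to the default route.
        return {"query_type": "fact_lookup", "rewritten_query": original_query}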

4.2 LLM Routing and Fallback Chain

All non-streaming LLM calls use _llm_call(). Streaming uses _llm_stream().

Current model chain:

THINKING tier (Generator)
1. gemini/gemini-1.5-flash
2. groq/llama-3.3-70b-versatile
3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7

FAST tier (Classifier/Validator)
1. gemini/gemini-1.5-flash

Provider notes:

  • NVIDIA-hosted models (Minimax, GLM) use https://integrate.api.nvidia.com/v1
  • temperature=0.0 for all grounding tasks
Gemini models run at temperature=1.0 only where provider requirements mandate it for certain tiers.
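
The fallback chain behaves roughly like the sketch below; the LiteLLM-style completion call is an assumption based on the provider/model string format, not a confirmed dependency:

import litellm

THINKING_CHAIN = [
    "gemini/gemini-1.5-flash",
    "groq/llama-3.3-70b-versatile",
    # NVIDIA NIM models follow, pointed at integrate.api.nvidia.com/v1
]

def _llm_call(messages: list[dict]) -> str:
    last_err = None
    for model in THINKING_CHAIN:                   # try providers in order
        try:
            resp = litellm.completion(model=model, messages=messages,
                                      temperature=0.0)   # grounding default
            return resp.choices[0].message.content
        except Exception as err:                   # rate limit / outage -> next
            last_err = err
    raise RuntimeError("all providers in the chain failed") from last_err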

5. Hybrid Retrieval

Hybrid retrieval combines vector similarity and PostgreSQL full-text search, then expands section families.

5.1 Vector Similarity Search

Used to catch semantic matches when wording differs from statute text.

5.2 Full-Text Search

Used for exact legal wording, section numbers, and important terms via websearch_to_tsquery.
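
Together, the two legs map onto SQL roughly like the following; table and column names (chunks, embedding, tsv) are assumptions, and <=> is pgvector's cosine-distance operator:

VECTOR_SQL = """
SELECT chunk_id, content
FROM chunks
ORDER BY embedding <=> %(qvec)s::vector           -- cosine distance
LIMIT 20;
"""

FTS_SQL = """
SELECT chunk_id, content
FROM chunks
WHERE tsv @@ websearch_to_tsquery('english', %(q)s)
ORDER BY ts_rank(tsv, websearch_to_tsquery('english', %(q)s)) DESC
LIMIT 20;
"""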

5.3 Reciprocal Rank Fusion

Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
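
A minimal RRF sketch; the constant k = 60 is the conventional default, not necessarily the value used here:

def rrf_fuse(vector_ids: list[str], fts_ids: list[str], k: int = 60) -> list[str]:
    # score(c) = sum over result lists of 1 / (k + rank_of_c_in_that_list)
    scores: dict[str, float] = {}
    for ranked in (vector_ids, fts_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)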

5.4 Section-ID-Aware Direct Lookup

If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and pins them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.
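
A sketch of the section-ID extraction that drives the pinned lookup; the regex is illustrative:

import re

SECTION_REF = re.compile(r"\b(?:section|rule)\s+(\d+[A-Z]?)\b", re.I)

def pinned_section_ids(query: str) -> list[str]:
    # "Section 18 refund" -> ["18"]; these get a direct indexed lookup
    # and are pinned ahead of every reranked chunk.
    return SECTION_REF.findall(query)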

5.5 Central Act Supplementation

For queries filtered by a specific State Jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the Central RERA Act 2016. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.


6. Graph-Based Retrieval

Used for section-centric questions and legal relationships.

Current behavior:

  • extract section or rule IDs from query
  • traverse Neo4j relationships (REFERENCES and DERIVED_FROM)
  • hydrate matching sections back from Postgres
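
A sketch of the traversal, assuming the neo4j 5.x driver; the Cypher mirrors the relationship names above but is otherwise illustrative:

TRAVERSE = """
MATCH (s:Section {section_id: $section_id})
OPTIONAL MATCH (s)-[:REFERENCES|DERIVED_FROM]-(n:Section)
RETURN s.chunk_id AS root, collect(DISTINCT n.chunk_id) AS neighbors
"""

def graph_retrieve(driver, section_id: str) -> list[str]:
    records, _, _ = driver.execute_query(TRAVERSE, section_id=section_id)
    row = records[0]
    # chunk_ids are hydrated back into full sections from Postgres.
    return [row["root"]] + [c for c in row["neighbors"] if c]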

Graph retrieval is especially important for:

  • explicit section lookups
  • penalty questions
  • central vs state derivation paths

Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.


7. Reranking

7.1 Cross-Encoder

retrieval/reranker.py uses FlashRank (ms-marco-MiniLM-L-12-v2).

Pipeline:

  1. deduplicate by (section_id, doc_name)
  2. split pinned vs rankable chunks
  3. rerank rankable chunks with cross-encoder
  4. filter by minimum score (0.05)
  5. apply score-gap cutoff (0.95)
  6. prepend pinned chunks
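
A sketch of steps 3-5, assuming FlashRank's Ranker/RerankRequest API; the gap rule below is one plausible reading of the 0.95 cutoff, not a confirmed implementation:

from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")

def rerank_rankable(query: str, chunks: list[dict]) -> list[dict]:
    passages = [{"id": c["chunk_id"], "text": c["text"]} for c in chunks]
    ranked = ranker.rerank(RerankRequest(query=query, passages=passages))
    kept = []
    for r in ranked:                               # descending by score
        if r["score"] < 0.05:                      # minimum-score filter
            break
        if kept and r["score"] < kept[-1]["score"] * 0.95:
            break                                  # score-gap cutoff (one reading)
        kept.append(r)
    return kept                                    # caller prepends pinned chunks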

7.2 Context Assembly

Max context size is 7 chunks. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.


8. Generation

8.1 Buffered Generation

generator_node() builds a numbered context block and asks for JSON output.

8.2 Streaming Generation

stream_generator_node() now drives SSE output.

  1. Run classification/retrieval/reranking.
  2. Stream answer tokens immediately.
  3. Run a second fast metadata-extraction prompt.
  4. Push metadata/citations as the final SSE event.
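
A sketch of the SSE flow, assuming a FastAPI StreamingResponse; extract_metadata is a hypothetical helper for the second pass, and _llm_stream is the streaming call from section 4.2:

import json
from fastapi.responses import StreamingResponse

async def stream_answer(state) -> StreamingResponse:
    async def events():
        # Answer tokens stream first, straight from the generator model.
        async for token in _llm_stream(state["messages"]):
            yield f"data: {json.dumps({'type': 'token', 'text': token})}\n\n"
        # Second fast pass: citations/metadata as the final SSE event.
        meta = await extract_metadata(state)       # hypothetical helper
        yield f"data: {json.dumps({'type': 'metadata', **meta})}\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")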

8.3 Tone Hints by Query Type

Query Type            Tone Guidance
fact_lookup           Direct, no metaphors, cite per bullet.
penalty_lookup        Lead with consequence/penalty.
cross_reference       Explain primary section, then connections.
conflict_detection    Flag contradiction ONLY if both sides are in context.
temporal              Lead with exact numeric deadline/time.

9. Validation

9.1 Validator Design

validator_node() treats confidence_score < 0.2 as a hallucination risk.

  • Returns hallucination_flag: True if score is below floor.
  • Graph triggers a retry (up to 2 times) with different retrieval parameters if flagged.
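
A sketch of the flag-and-retry routing; state keys and node names are illustrative:

HALLUCINATION_FLOOR = 0.2
MAX_RETRIES = 2

def route_after_validation(state: dict) -> str:
    flagged = state["confidence_score"] < HALLUCINATION_FLOOR
    if flagged and state.get("retry_count", 0) < MAX_RETRIES:
        return "retrieval"        # re-enter with different retrieval parameters
    return "output_guard"         # accept, or hand off to guardrails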

9.2 Output Guardrails

guardrails/output_guard.py:

  • Intercepts low-confidence answers and safeguard failures.
  • Returns InsufficientInfoResponse when grounding is weak.
  • Appends legal disclaimer.

10. RAGAS Evaluation Pipeline

10.1 Two-Phase Architecture

  • Phase 1: Graph invocation -> eval_phase1_results.json.
  • Phase 2: RAGAS scoring -> eval_results.json.

10.2 Dataset & Metrics

  • Rows: 31 (Central, 4 States, Multi-Jurisdiction).
  • Primary Metrics: Faithfulness, Answer Relevancy, Context Precision.
  • Goal: Faithfulness > 0.85; Answer Relevancy > 0.80.
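
Phase 2 scoring is roughly the following, assuming the pre-1.0 ragas API and that Phase 1 wrote RAGAS-shaped rows (question/answer/contexts/ground_truth):

import json
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

rows = json.load(open("eval_phase1_results.json"))   # Phase 1 output
dataset = Dataset.from_list(rows)

result = evaluate(dataset,
                  metrics=[faithfulness, answer_relevancy, context_precision])
print(result)   # Phase 2 scores, persisted to eval_results.json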

11. Known Failure Modes

  • Multi-Jurisdiction Retrieval: Reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
  • Large Context Noise: 7 chunks sometimes include irrelevant sub-clauses that distract the generator.

12. Implementation Checklist

  • Add DocumentSpec to registry.
  • Verify PDF text extraction.
  • Run make ingest.
  • Seed Neo4j graph.
  • Run make eval-smoke to verify precision.