# CivicSetu - RAG Techniques Reference
**Version:** 2.3 - Mobile Ledger + Quality Hardening
**Last Updated:** 2026-05-01
This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.
---
## 1. Current Status Snapshot
As of **2026-05-01**, the CivicSetu RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval-quality fixes live.
- **Phase 9 Complete (Mobile Responsive)**
- Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
- Interactive Graph Explorer with section drill-down.
- **Cloud Infrastructure Live**
- Relational & Vector: **Neon (Postgres + pgvector)**
- Graph: **Neo4j AuraDB**
- Frontend: **Vercel**
- Backend API: **Hugging Face Spaces**
- **Live app routes**
- `POST /api/v1/query` - buffered response
- `POST /api/v1/query/stream` - SSE token streaming
- `POST /api/v1/query/section-context` - section-focused chat
- `/api/v1/graph/*` - graph explorer and section drill-down
- **Session-aware graph**
- LangGraph uses `session_id` as thread key.
- Each turn clears retrieval/generation fields but preserves conversation history.
- **Active retrieval routing**
- `fact_lookup -> vector_retrieval`
- `cross_reference|penalty_lookup|temporal -> graph_retrieval`
- `conflict_detection -> hybrid_retrieval`
- **Streaming is now first-class**
- streaming path reuses classifier, retrieval, and reranker
- answer text streams first
- citations and metadata are extracted in a second fast pass
- **Latest eval artifact (0.90 Faithfulness)**
- `eval_results.json` dated **2026-04-28**
- `faithfulness=0.900`
- `answer_relevancy=0.858`
- `context_precision=0.696`
- `pass_rate=0.581`
- **Knowledge Graph Scale (as of 2026-05-01)**
- Documents: `6`
- Sections: `2,090`
- Edges: `2,321` (REFERENCES, DERIVED_FROM, HAS_SECTION)
- **Main remaining weakness**
- multi-jurisdiction retrieval still weak (`MULTI` rows pass only `20%`)
- context precision for broad fact lookups needs further HNSW tuning
---
## 2. System Overview
CivicSetu is a legal-domain RAG system over five Indian RERA jurisdictions plus cross-jurisdiction queries.
Core problem:
- legal text is structured around sections, rules, sub-clauses, and cross-references
- users ask imprecise natural-language questions
- answers must stay grounded and cite the right legal section
Why plain semantic RAG fails here:
- embeddings blur important legal entities
- user queries often omit exact statute wording
- conflict questions need more than one legal source
- generation models tend to fill gaps unless grounding is strict
---
## 3. Ingestion Pipeline
### 3.1 PDF Parsing
`ingestion/parser.py` uses **PyMuPDF**.
Important guards:
- document-level `max_pages` trims form-heavy tails
- scanned PDF detection avoids unusable OCR-free sources
- metadata stores capped page count, not necessarily total PDF pages
### 3.2 Section Boundary Chunking
`ingestion/chunker.py` applies multiple regex families in priority order to detect section and rule boundaries.
Current purpose:
- preserve citation boundaries
- keep section hierarchy intact
- split oversized sections without destroying legal structure
Fallback mode is paragraph chunking on double newlines, logged as `fallback_paragraph_chunking`.
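A minimal sketch of the priority-ordered boundary detection, under assumed pattern families (the real regexes live in `ingestion/chunker.py`):

```python
import re

# Hypothetical pattern families; the real ones live in ingestion/chunker.py.
SECTION_PATTERNS = [
    re.compile(r"^\s*Section\s+\d+[A-Z]?\b", re.MULTILINE),  # "Section 18"
    re.compile(r"^\s*\d+[A-Z]?\.\s+[A-Z]", re.MULTILINE),    # "18. Obligations of ..."
    re.compile(r"^\s*Rule\s+\d+\b", re.MULTILINE),           # "Rule 4"
]

def split_into_sections(text: str) -> list[str]:
    """Use the first pattern family that finds at least two boundaries."""
    for pattern in SECTION_PATTERNS:
        starts = [m.start() for m in pattern.finditer(text)]
        if len(starts) >= 2:
            bounds = starts + [len(text)]
            return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # fallback_paragraph_chunking: split on double newlines
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```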
### 3.3 Deterministic Chunk IDs
`chunk_id` is a UUID5 over stable section identity data.
Effect:
- re-ingestion is idempotent
- `ON CONFLICT DO UPDATE` replaces old chunk content
- same legal section does not duplicate across re-runs
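A sketch of the idea, with a hypothetical namespace constant and an illustrative table schema (the real values are project-specific):

```python
import uuid

# Hypothetical namespace constant; the real value lives in ingestion/chunker.py.
CHUNK_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "civicsetu.chunks")

def make_chunk_id(doc_name: str, section_id: str, split_index: int = 0) -> str:
    """Same section identity -> identical UUID on every re-ingestion run."""
    return str(uuid.uuid5(CHUNK_NS, f"{doc_name}|{section_id}|{split_index}"))

# Postgres side (illustrative schema): the deterministic key turns
# re-ingestion into an upsert instead of an append.
UPSERT_SQL = """
INSERT INTO chunks (chunk_id, content, embedding)
VALUES (%(chunk_id)s, %(content)s, %(embedding)s)
ON CONFLICT (chunk_id) DO UPDATE
  SET content = EXCLUDED.content, embedding = EXCLUDED.embedding;
"""
```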
### 3.4 Section Title Prepended to Embeddings
During embedding, section title is prepended to chunk text.
Reason:
- split sub-sections often lose the title phrase that users actually search for
- title prefix restores semantic recall for questions like "obligations of promoter"
The reranker still reads raw chunk text, not the prefixed text.
### 3.5 Embedding Model
Current defaults from `config/settings.py`:
- `embedding_model = nomic-embed-text`
- `embedding_dimension = 768`
Query and document embeddings use asymmetric prefixes (`search_query: ` vs `search_document: `) compatible with Nomic-style retrieval.
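A combined sketch of 3.4 and 3.5, assuming `embed` is a callable that wraps the `nomic-embed-text` model; the function names and the title separator are illustrative:

```python
def embed_chunk(title: str, text: str, embed) -> list[float]:
    """Document side: section title prepended (3.4) plus the document prefix (3.5)."""
    return embed(f"search_document: {title}\n{text}")

def embed_query(query: str, embed) -> list[float]:
    """Query side: asymmetric counterpart to the document prefix."""
    return embed(f"search_query: {query}")
```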
### 3.6 Graph Seeding
`ingestion/graph_seeder.py` populates the Neo4j knowledge graph using data already persisted in PostgreSQL.
Key steps:
- **Idempotent Upsert:** Documents and Sections are merged into Neo4j using UUID5 `chunk_id`.
- **Relationship Extraction:**
- `REFERENCES`: `MetadataExtractor` identifies section numbers in text (e.g., "under section 18"). Handles internal and cross-jurisdiction links.
- `DERIVED_FROM`: Static mapping identifies which State Rule sections derive from which Central Act sections (both at Document and Section level).
- **Execution:** Automatically triggered at the end of `scripts/ingest.py` or manually via `scripts/seed_phase3.py`.
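A sketch of the idempotent upsert using the official `neo4j` Python driver; node labels and property names are assumptions about the real schema:

```python
from neo4j import GraphDatabase

# Node labels and property names are assumptions about the real schema.
MERGE_SECTION = """
MERGE (d:Document {name: $doc_name})
MERGE (s:Section {chunk_id: $chunk_id})
  SET s.section_id = $section_id, s.title = $title
MERGE (d)-[:HAS_SECTION]->(s)
"""

MERGE_REFERENCE = """
MATCH (a:Section {chunk_id: $src_chunk_id})
MATCH (b:Section {section_id: $target_section_id})
MERGE (a)-[:REFERENCES]->(b)
"""

def seed(uri: str, auth: tuple[str, str], rows: list[dict]) -> None:
    """Re-runnable seeding: MERGE on the UUID5 chunk_id makes upserts idempotent."""
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for row in rows:  # rows are read from PostgreSQL, not re-parsed PDFs
                session.execute_write(lambda tx, r=row: tx.run(MERGE_SECTION, **r))
```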
---
## 4. Query Pipeline
### 4.1 Query Classification and Rewriting
`agent/nodes.py::classifier_node` classifies the query and rewrites it for retrieval.
Output shape:
```json
{
  "query_type": "fact_lookup | cross_reference | temporal | penalty_lookup | conflict_detection",
  "rewritten_query": "expanded retrieval-friendly query"
}
```
Current route mapping:
| Query Type | Route |
|---|---|
| `fact_lookup` | `vector_retrieval` |
| `cross_reference` | `graph_retrieval` |
| `penalty_lookup` | `graph_retrieval` |
| `temporal` | `graph_retrieval` |
| `conflict_detection` | `hybrid_retrieval` |
Classifier fallback: if JSON parse fails, default to `fact_lookup` with original query.
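A minimal sketch of that fallback, assuming the classifier returns raw JSON text; the helper name is illustrative:

```python
import json

VALID_TYPES = {"fact_lookup", "cross_reference", "temporal",
               "penalty_lookup", "conflict_detection"}

def parse_classifier_output(raw: str, original_query: str) -> dict:
    """Fall back to fact_lookup with the original query on any parse failure."""
    try:
        data = json.loads(raw)
        if data.get("query_type") in VALID_TYPES and data.get("rewritten_query"):
            return data
    except (json.JSONDecodeError, TypeError, AttributeError):
        pass
    return {"query_type": "fact_lookup", "rewritten_query": original_query}
```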
### 4.2 LLM Routing and Fallback Chain
All non-streaming LLM calls use `_llm_call()`. Streaming uses `_llm_stream()`.
Current model chain:
```text
THINKING tier (Generator)
1. gemini/gemini-1.5-flash
2. groq/llama-3.3-70b-versatile
3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7
FAST tier (Classifier/Validator)
1. gemini/gemini-1.5-flash
```
Provider notes:
- NVIDIA-hosted models (Minimax, GLM) use `https://integrate.api.nvidia.com/v1`
- `temperature=0.0` for all grounding tasks
- Gemini models use `temperature=1.0` on tiers where the provider requires it.
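A sketch of how such a chain can work; `_llm_call()` is named above, but its signature and the `call_provider` adapter are assumptions:

```python
THINKING_CHAIN = [
    "gemini/gemini-1.5-flash",
    "groq/llama-3.3-70b-versatile",
    "nvidia/z-ai/glm4.7",
    "nvidia/minimaxai/minimax-m2.7",
]

def call_provider(model: str, prompt: str, temperature: float) -> str:
    """Hypothetical adapter around the per-provider SDKs."""
    raise NotImplementedError

def _llm_call(prompt: str, chain: list[str] = THINKING_CHAIN,
              temperature: float = 0.0) -> str:
    """Try each model in order; fall through on rate limits, timeouts, outages."""
    last_error: Exception | None = None
    for model in chain:
        try:
            return call_provider(model, prompt, temperature=temperature)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```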
---
## 5. Hybrid Retrieval
Hybrid retrieval combines vector similarity and PostgreSQL full-text search, then expands section families.
### 5.1 Vector Similarity Search
Used to catch semantic matches when wording differs from statute text.
### 5.2 Full-Text Search
Used for exact legal wording, section numbers, and important terms via `websearch_to_tsquery`.
### 5.3 Reciprocal Rank Fusion
Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
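RRF scores each chunk as `score(d) = sum over result lists of 1 / (k + rank_of_d_in_list)`. A minimal sketch, using the conventional `k = 60` (the project's actual constant is not documented here):

```python
def rrf_fuse(vector_ids: list[str], fts_ids: list[str], k: int = 60) -> list[str]:
    """Chunks ranking well in both lists accumulate the highest fused score."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, fts_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```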
### 5.4 Section-ID-Aware Direct Lookup
If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and **pins** them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.
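A sketch of the pinning behavior; the regex and the `lookup` helper (an indexed Postgres query by section ID) are hypothetical:

```python
import re

# Hypothetical pattern; the real extraction is done by MetadataExtractor.
SECTION_REF = re.compile(r"\b(?:section|rule)\s+(\d+[A-Z]?)\b", re.IGNORECASE)

def pin_explicit_sections(query: str, ranked: list[dict], lookup) -> list[dict]:
    """Direct-lookup chunks for explicitly named sections go ahead of ranked results."""
    pinned = [c for sid in SECTION_REF.findall(query) for c in lookup(sid)]
    pinned_ids = {c["chunk_id"] for c in pinned}
    return pinned + [c for c in ranked if c["chunk_id"] not in pinned_ids]
```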
### 5.5 Central Act Supplementation
For queries filtered by a specific State Jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the **Central RERA Act 2016**. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.
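A sketch of the supplementation step; the document name and the `search` helper are hypothetical:

```python
CENTRAL_DOC = "RERA_Act_2016"  # hypothetical document name in the registry

def supplement_with_central_act(query_vec, state_results: list[dict], search) -> list[dict]:
    """After a state-filtered search, append Central Act chunks not already present."""
    central = search(query_vec, doc_filter=CENTRAL_DOC, top_k=3)
    seen = {c["chunk_id"] for c in state_results}
    return state_results + [c for c in central if c["chunk_id"] not in seen]
```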
---
## 6. Graph-Based Retrieval
Used for section-centric questions and legal relationships.
Current behavior:
- extract section or rule IDs from query
- traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
- hydrate matching sections back from Postgres
Graph retrieval is especially important for:
- explicit section lookups
- penalty questions
- central vs state derivation paths
Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.
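A sketch of the traverse-then-hydrate pattern using a `neo4j` session; the Cypher schema is assumed from the edge types listed above:

```python
TRAVERSE = """
MATCH (s:Section {section_id: $section_id})
OPTIONAL MATCH (s)-[:REFERENCES]->(ref:Section)
OPTIONAL MATCH (s)-[:DERIVED_FROM]->(src:Section)
RETURN s.chunk_id AS root,
       collect(DISTINCT ref.chunk_id) AS refs,
       collect(DISTINCT src.chunk_id) AS sources
"""

def graph_neighbourhood(session, section_id: str) -> list[str]:
    """Collect chunk_ids for a section and its 1-hop legal relations; the ids
    then hydrate full text from Postgres (WHERE chunk_id = ANY(...))."""
    record = session.run(TRAVERSE, section_id=section_id).single()
    if record is None:
        return []
    ids = [record["root"], *record["refs"], *record["sources"]]
    return [i for i in ids if i is not None]  # defensive null filter
```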
---
## 7. Reranking
### 7.1 Cross-Encoder
`retrieval/reranker.py` uses FlashRank (`ms-marco-MiniLM-L-12-v2`).
Pipeline:
1. deduplicate by `(section_id, doc_name)`
2. split pinned vs rankable chunks
3. rerank rankable chunks with cross-encoder
4. filter by minimum score (0.05)
5. apply score-gap cutoff (0.95)
6. prepend pinned chunks
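A sketch of steps 3-6 with the `flashrank` package; the threshold semantics, especially the 0.95 score-gap cutoff, are one plausible reading of the pipeline above:

```python
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")

MIN_SCORE = 0.05   # step 4: minimum-score filter
SCORE_GAP = 0.95   # step 5: score-gap cutoff

def rerank(query: str, rankable: list[dict], pinned: list[dict]) -> list[dict]:
    passages = [{"id": c["chunk_id"], "text": c["text"]} for c in rankable]
    results = ranker.rerank(RerankRequest(query=query, passages=passages))
    kept, prev = [], None
    for r in results:  # flashrank returns passages sorted by score, descending
        if r["score"] < MIN_SCORE:
            break
        # One reading of the gap cutoff: stop when consecutive scores drop
        # by more than SCORE_GAP.
        if prev is not None and prev - r["score"] > SCORE_GAP:
            break
        kept.append(r)
        prev = r["score"]
    return pinned + kept  # step 6: pinned chunks stay ahead
```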
### 7.2 Context Assembly
Max context size is **7 chunks**. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.
---
## 8. Generation
### 8.1 Buffered Generation
`generator_node()` builds a numbered context block and asks for JSON output.
### 8.2 Streaming Generation
`stream_generator_node()` now drives SSE output.
1. Run classification/retrieval/reranking.
2. Stream answer tokens immediately.
3. Run a second fast metadata-extraction prompt.
4. Push metadata/citations as the final SSE event.
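A sketch of that ordering as a FastAPI SSE endpoint; `run_retrieval` and `extract_metadata` stand in for the real helpers, while `_llm_stream` is named in the source:

```python
import json
from fastapi.responses import StreamingResponse

async def stream_answer(query: str, run_retrieval, _llm_stream, extract_metadata):
    """Answer tokens stream first; citations/metadata arrive as the final event."""
    async def event_stream():
        state = await run_retrieval(query)          # shared classify/retrieve/rerank
        async for token in _llm_stream(state):      # 1) answer tokens immediately
            yield f"data: {json.dumps({'type': 'token', 'text': token})}\n\n"
        meta = await extract_metadata(state)        # 2) second fast pass
        yield f"data: {json.dumps({'type': 'metadata', **meta})}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```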
### 8.3 Tone Hints by Query Type
| Type | Tone Guidance |
|---|---|
| `fact_lookup` | Direct, no metaphors, cite per bullet. |
| `penalty_lookup` | Lead with consequence/penalty. |
| `cross_reference` | Explain primary section, then connections. |
| `conflict_detection` | Flag contradiction ONLY if both sides are in context. |
| `temporal` | Lead with exact numeric deadline/time. |
---
## 9. Validation
### 9.1 Validator Design
`validator_node()` treats `confidence_score < 0.2` as a hallucination risk.
- Returns `hallucination_flag: True` if score is below floor.
- If flagged, the graph triggers a **retry** (up to 2 times) with different retrieval parameters.
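A sketch of the flag-and-retry loop as a LangGraph-style node plus conditional edge; the node names and the `retry_count` field are assumptions:

```python
CONFIDENCE_FLOOR = 0.2
MAX_RETRIES = 2

def validator_node(state: dict) -> dict:
    """Flag answers whose confidence score falls below the floor."""
    state["hallucination_flag"] = state["confidence_score"] < CONFIDENCE_FLOOR
    return state

def route_after_validation(state: dict) -> str:
    """Conditional edge: retry retrieval with different parameters, or finish."""
    if state["hallucination_flag"] and state.get("retry_count", 0) < MAX_RETRIES:
        state["retry_count"] = state.get("retry_count", 0) + 1
        return "retrieval"     # hypothetical node name
    return "output_guard"      # hypothetical node name
```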
### 9.2 Output Guardrails
`guardrails/output_guard.py`:
- Intercepts low-confidence answers and safeguard failures.
- Returns `InsufficientInfoResponse` when grounding is weak.
- Appends legal disclaimer.
---
## 10. RAGAS Evaluation Pipeline
### 10.1 Two-Phase Architecture
- **Phase 1:** Graph invocation -> `eval_phase1_results.json`.
- **Phase 2:** RAGAS scoring -> `eval_results.json`.
### 10.2 Dataset & Metrics
- **Rows:** 31 (Central, 4 States, Multi-Jurisdiction).
- **Primary Metrics:** Faithfulness, Answer Relevancy, Context Precision.
- **Goal:** Faithfulness > 0.85; Answer Relevancy > 0.80.
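A sketch of Phase 2 with the `ragas` package; the column names follow recent ragas releases (older versions use `ground_truths`) and the sample row is illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One illustrative row; real rows come from eval_phase1_results.json.
rows = Dataset.from_dict({
    "question": ["What is the refund timeline under Section 18?"],
    "answer": ["..."],       # Phase 1 generated answer
    "contexts": [["..."]],   # retrieved chunk texts for that row
    "ground_truth": ["..."],
})

scores = evaluate(rows, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)  # written out as eval_results.json in Phase 2
```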
---
## 11. Known Failure Modes
- **Multi-Jurisdiction Retrieval:** Reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
- **Large Context Noise:** 7 chunks sometimes include irrelevant sub-clauses that distract the generator.
---
## 12. Implementation Checklist
- [x] Add `DocumentSpec` to registry.
- [x] Verify PDF text extraction.
- [x] Run `make ingest`.
- [x] Seed Neo4j graph.
- [x] Run `make eval-smoke` to verify precision.