# CivicSetu - RAG Techniques Reference
**Version:** 2.3 - Mobile Ledger + Quality Hardening
**Last Updated:** 2026-05-01
This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.
---
## 1. Current Status Snapshot
As of **2026-05-01**, the CivicSetu RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval-quality fixes live.
- **Phase 9 Complete (Mobile Responsive)**
- Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
- Interactive Graph Explorer with section drill-down.
- **Cloud Infrastructure Live**
- Relational & Vector: **Neon (Postgres + pgvector)**
- Graph: **Neo4j AuraDB**
- Frontend: **Vercel**
- Backend API: **Hugging Face Spaces**
- **Live app routes**
- `POST /api/v1/query` - buffered response
- `POST /api/v1/query/stream` - SSE token streaming
- `POST /api/v1/query/section-context` - section-focused chat
- `/api/v1/graph/*` - graph explorer and section drill-down
- **Session-aware graph**
- LangGraph uses `session_id` as thread key.
- Each turn clears retrieval/generation fields but preserves conversation history.
- **Active retrieval routing**
- `fact_lookup -> vector_retrieval`
- `cross_reference|penalty_lookup|temporal -> graph_retrieval`
- `conflict_detection -> hybrid_retrieval`
- **Streaming is now first-class**
- streaming path reuses classifier, retrieval, and reranker
- answer text streams first
- citations and metadata are extracted in a second fast pass
- **Latest eval artifact (0.90 Faithfulness)**
- `eval_results.json` dated **2026-04-28**
- `faithfulness=0.900`
- `answer_relevancy=0.858`
- `context_precision=0.696`
- `pass_rate=0.581`
- **Knowledge Graph Scale (as of 2026-05-01)**
- Documents: `6`
- Sections: `2,090`
- Edges: `2,321` (REFERENCES, DERIVED_FROM, HAS_SECTION)
- **Main remaining weakness**
- multi-jurisdiction retrieval still weak (`MULTI` rows pass only `20%`)
- context precision for broad fact lookups needs further HNSW tuning
---
## 2. System Overview
CivicSetu is a legal-domain RAG system over five Indian RERA jurisdictions plus cross-jurisdiction queries.
Core problem:
- legal text is structured around sections, rules, sub-clauses, and cross-references
- users ask imprecise natural-language questions
- answers must stay grounded and cite the right legal section
Why plain semantic RAG fails here:
- embeddings blur important legal entities
- user queries often omit exact statute wording
- conflict questions need more than one legal source
- generation models tend to fill gaps unless grounding is strict
---
## 3. Ingestion Pipeline
### 3.1 PDF Parsing
`ingestion/parser.py` uses **PyMuPDF**.
Important guards:
- document-level `max_pages` trims form-heavy tails
- scanned PDF detection avoids unusable OCR-free sources
- metadata stores capped page count, not necessarily total PDF pages
### 3.2 Section Boundary Chunking
`ingestion/chunker.py` applies multiple regex families in priority order to detect section and rule boundaries.
Current purpose:
- preserve citation boundaries
- keep section hierarchy intact
- split oversized sections without destroying legal structure
Fallback mode is paragraph chunking on double newlines, logged as `fallback_paragraph_chunking`.
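A minimal sketch of the priority-ordered boundary detection, under assumed pattern families (the real regexes live in `ingestion/chunker.py`):

```python
import re

# Hypothetical pattern families; the real ones live in ingestion/chunker.py.
SECTION_PATTERNS = [
    re.compile(r"^\s*Section\s+\d+[A-Z]?\b", re.MULTILINE),  # "Section 18"
    re.compile(r"^\s*\d+[A-Z]?\.\s+[A-Z]", re.MULTILINE),    # "18. Obligations of ..."
    re.compile(r"^\s*Rule\s+\d+\b", re.MULTILINE),           # "Rule 4"
]

def split_into_sections(text: str) -> list[str]:
    """Use the first pattern family that finds at least two boundaries."""
    for pattern in SECTION_PATTERNS:
        starts = [m.start() for m in pattern.finditer(text)]
        if len(starts) >= 2:
            bounds = starts + [len(text)]
            return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # fallback_paragraph_chunking: split on double newlines
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```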
### 3.3 Deterministic Chunk IDs
`chunk_id` is a UUID5 over stable section identity data.
Effect:
- re-ingestion is idempotent
- `ON CONFLICT DO UPDATE` replaces old chunk content
- same legal section does not duplicate across re-runs
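A sketch of the idea, with a hypothetical namespace constant and an illustrative table schema (the real values are project-specific):

```python
import uuid

# Hypothetical namespace constant; the real value lives in ingestion/chunker.py.
CHUNK_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "civicsetu.chunks")

def make_chunk_id(doc_name: str, section_id: str, split_index: int = 0) -> str:
    """Same section identity -> identical UUID on every re-ingestion run."""
    return str(uuid.uuid5(CHUNK_NS, f"{doc_name}|{section_id}|{split_index}"))

# Postgres side (illustrative schema): the deterministic key turns
# re-ingestion into an upsert instead of an append.
UPSERT_SQL = """
INSERT INTO chunks (chunk_id, content, embedding)
VALUES (%(chunk_id)s, %(content)s, %(embedding)s)
ON CONFLICT (chunk_id) DO UPDATE
  SET content = EXCLUDED.content, embedding = EXCLUDED.embedding;
"""
```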
### 3.4 Section Title Prepended to Embeddings
During embedding, section title is prepended to chunk text.
Reason:
- split sub-sections often lose the title phrase that users actually search for
- title prefix restores semantic recall for questions like "obligations of promoter"
The reranker still reads raw chunk text, not the prefixed text.
### 3.5 Embedding Model
Current defaults from `config/settings.py`:
- `embedding_model = nomic-embed-text`
- `embedding_dimension = 768`
Query and document embeddings use asymmetric prefixes (`search_query: ` vs `search_document: `) compatible with Nomic-style retrieval.
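A combined sketch of 3.4 and 3.5, assuming `embed` is a callable that wraps the `nomic-embed-text` model; the function names and the title separator are illustrative:

```python
def embed_chunk(title: str, text: str, embed) -> list[float]:
    """Document side: section title prepended (3.4) plus the document prefix (3.5)."""
    return embed(f"search_document: {title}\n{text}")

def embed_query(query: str, embed) -> list[float]:
    """Query side: asymmetric counterpart to the document prefix."""
    return embed(f"search_query: {query}")
```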
### 3.6 Graph Seeding
`ingestion/graph_seeder.py` populates the Neo4j knowledge graph using data already persisted in PostgreSQL.
Key steps:
- **Idempotent Upsert:** Documents and Sections are merged into Neo4j using UUID5 `chunk_id`.
- **Relationship Extraction:**
- `REFERENCES`: `MetadataExtractor` identifies section numbers in text (e.g., "under section 18"). Handles internal and cross-jurisdiction links.
- `DERIVED_FROM`: Static mapping identifies which State Rule sections derive from which Central Act sections (both at Document and Section level).
- **Execution:** Automatically triggered at the end of `scripts/ingest.py` or manually via `scripts/seed_phase3.py`.
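A sketch of the idempotent upsert using the official `neo4j` Python driver; node labels and property names are assumptions about the real schema:

```python
from neo4j import GraphDatabase

# Node labels and property names are assumptions about the real schema.
MERGE_SECTION = """
MERGE (d:Document {name: $doc_name})
MERGE (s:Section {chunk_id: $chunk_id})
  SET s.section_id = $section_id, s.title = $title
MERGE (d)-[:HAS_SECTION]->(s)
"""

MERGE_REFERENCE = """
MATCH (a:Section {chunk_id: $src_chunk_id})
MATCH (b:Section {section_id: $target_section_id})
MERGE (a)-[:REFERENCES]->(b)
"""

def seed(uri: str, auth: tuple[str, str], rows: list[dict]) -> None:
    """Re-runnable seeding: MERGE on the UUID5 chunk_id makes upserts idempotent."""
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for row in rows:  # rows are read from PostgreSQL, not re-parsed PDFs
                session.execute_write(lambda tx, r=row: tx.run(MERGE_SECTION, **r))
```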
---
## 4. Query Pipeline
### 4.1 Query Classification and Rewriting
`agent/nodes.py::classifier_node` classifies the query and rewrites it for retrieval.
Output shape:
```json
{
  "query_type": "fact_lookup | cross_reference | temporal | penalty_lookup | conflict_detection",
  "rewritten_query": "expanded retrieval-friendly query"
}
```
Current route mapping:
| Query Type | Route |
|---|---|
| `fact_lookup` | `vector_retrieval` |
| `cross_reference` | `graph_retrieval` |
| `penalty_lookup` | `graph_retrieval` |
| `temporal` | `graph_retrieval` |
| `conflict_detection` | `hybrid_retrieval` |
Classifier fallback: if JSON parse fails, default to `fact_lookup` with original query.
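A minimal sketch of that fallback, assuming the classifier returns raw JSON text; the helper name is illustrative:

```python
import json

VALID_TYPES = {"fact_lookup", "cross_reference", "temporal",
               "penalty_lookup", "conflict_detection"}

def parse_classifier_output(raw: str, original_query: str) -> dict:
    """Fall back to fact_lookup with the original query on any parse failure."""
    try:
        data = json.loads(raw)
        if data.get("query_type") in VALID_TYPES and data.get("rewritten_query"):
            return data
    except (json.JSONDecodeError, TypeError, AttributeError):
        pass
    return {"query_type": "fact_lookup", "rewritten_query": original_query}
```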
### 4.2 LLM Routing and Fallback Chain
All non-streaming LLM calls use `_llm_call()`. Streaming uses `_llm_stream()`.
Current model chain:
```text
THINKING tier (Generator)
1. gemini/gemini-1.5-flash
2. groq/llama-3.3-70b-versatile
3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7
FAST tier (Classifier/Validator)
1. gemini/gemini-1.5-flash
```
Provider notes:
- NVIDIA-hosted models (Minimax, GLM) use `https://integrate.api.nvidia.com/v1`
- `temperature=0.0` for all grounding tasks
- Gemini models use `temperature=1.0` on tiers where the provider requires it.
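A sketch of how such a chain can work; `_llm_call()` is named above, but its signature and the `call_provider` adapter are assumptions:

```python
THINKING_CHAIN = [
    "gemini/gemini-1.5-flash",
    "groq/llama-3.3-70b-versatile",
    "nvidia/z-ai/glm4.7",
    "nvidia/minimaxai/minimax-m2.7",
]

def call_provider(model: str, prompt: str, temperature: float) -> str:
    """Hypothetical adapter around the per-provider SDKs."""
    raise NotImplementedError

def _llm_call(prompt: str, chain: list[str] = THINKING_CHAIN,
              temperature: float = 0.0) -> str:
    """Try each model in order; fall through on rate limits, timeouts, outages."""
    last_error: Exception | None = None
    for model in chain:
        try:
            return call_provider(model, prompt, temperature=temperature)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```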
---
## 5. Hybrid Retrieval
Hybrid retrieval combines vector similarity and PostgreSQL full-text search, then expands section families.
### 5.1 Vector Similarity Search
Used to catch semantic matches when wording differs from statute text.
### 5.2 Full-Text Search
Used for exact legal wording, section numbers, and important terms via `websearch_to_tsquery`.
### 5.3 Reciprocal Rank Fusion
Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
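RRF scores each chunk as `score(d) = sum over result lists of 1 / (k + rank_of_d_in_list)`. A minimal sketch, using the conventional `k = 60` (the project's actual constant is not documented here):

```python
def rrf_fuse(vector_ids: list[str], fts_ids: list[str], k: int = 60) -> list[str]:
    """Chunks ranking well in both lists accumulate the highest fused score."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, fts_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```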
### 5.4 Section-ID-Aware Direct Lookup
If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and **pins** them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.
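A sketch of the pinning behavior; the regex and the `lookup` helper (an indexed Postgres query by section ID) are hypothetical:

```python
import re

# Hypothetical pattern; the real extraction is done by MetadataExtractor.
SECTION_REF = re.compile(r"\b(?:section|rule)\s+(\d+[A-Z]?)\b", re.IGNORECASE)

def pin_explicit_sections(query: str, ranked: list[dict], lookup) -> list[dict]:
    """Direct-lookup chunks for explicitly named sections go ahead of ranked results."""
    pinned = [c for sid in SECTION_REF.findall(query) for c in lookup(sid)]
    pinned_ids = {c["chunk_id"] for c in pinned}
    return pinned + [c for c in ranked if c["chunk_id"] not in pinned_ids]
```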
### 5.5 Central Act Supplementation
For queries filtered by a specific State Jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the **Central RERA Act 2016**. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.
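A sketch of the supplementation step; the document name and the `search` helper are hypothetical:

```python
CENTRAL_DOC = "RERA_Act_2016"  # hypothetical document name in the registry

def supplement_with_central_act(query_vec, state_results: list[dict], search) -> list[dict]:
    """After a state-filtered search, append Central Act chunks not already present."""
    central = search(query_vec, doc_filter=CENTRAL_DOC, top_k=3)
    seen = {c["chunk_id"] for c in state_results}
    return state_results + [c for c in central if c["chunk_id"] not in seen]
```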
---
## 6. Graph-Based Retrieval
Used for section-centric questions and legal relationships.
Current behavior:
- extract section or rule IDs from query
- traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
- hydrate matching sections back from Postgres
Graph retrieval is especially important for:
- explicit section lookups
- penalty questions
- central vs state derivation paths
Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.
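A sketch of the traverse-then-hydrate pattern using a `neo4j` session; the Cypher schema is assumed from the edge types listed above:

```python
TRAVERSE = """
MATCH (s:Section {section_id: $section_id})
OPTIONAL MATCH (s)-[:REFERENCES]->(ref:Section)
OPTIONAL MATCH (s)-[:DERIVED_FROM]->(src:Section)
RETURN s.chunk_id AS root,
       collect(DISTINCT ref.chunk_id) AS refs,
       collect(DISTINCT src.chunk_id) AS sources
"""

def graph_neighbourhood(session, section_id: str) -> list[str]:
    """Collect chunk_ids for a section and its 1-hop legal relations; the ids
    then hydrate full text from Postgres (WHERE chunk_id = ANY(...))."""
    record = session.run(TRAVERSE, section_id=section_id).single()
    if record is None:
        return []
    ids = [record["root"], *record["refs"], *record["sources"]]
    return [i for i in ids if i is not None]  # defensive null filter
```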
---
## 7. Reranking
### 7.1 Cross-Encoder
`retrieval/reranker.py` uses FlashRank (`ms-marco-MiniLM-L-12-v2`).
Pipeline:
1. deduplicate by `(section_id, doc_name)`
2. split pinned vs rankable chunks
3. rerank rankable chunks with cross-encoder
4. filter by minimum score (0.05)
5. apply score-gap cutoff (0.95)
6. prepend pinned chunks
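A sketch of steps 3-6 with the `flashrank` package; the threshold semantics, especially the 0.95 score-gap cutoff, are one plausible reading of the pipeline above:

```python
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")

MIN_SCORE = 0.05   # step 4: minimum-score filter
SCORE_GAP = 0.95   # step 5: score-gap cutoff

def rerank(query: str, rankable: list[dict], pinned: list[dict]) -> list[dict]:
    passages = [{"id": c["chunk_id"], "text": c["text"]} for c in rankable]
    results = ranker.rerank(RerankRequest(query=query, passages=passages))
    kept, prev = [], None
    for r in results:  # flashrank returns passages sorted by score, descending
        if r["score"] < MIN_SCORE:
            break
        # One reading of the gap cutoff: stop when consecutive scores drop
        # by more than SCORE_GAP.
        if prev is not None and prev - r["score"] > SCORE_GAP:
            break
        kept.append(r)
        prev = r["score"]
    return pinned + kept  # step 6: pinned chunks stay ahead
```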
### 7.2 Context Assembly
Max context size is **7 chunks**. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.
---
## 8. Generation
### 8.1 Buffered Generation
`generator_node()` builds a numbered context block and asks for JSON output.
### 8.2 Streaming Generation
`stream_generator_node()` now drives SSE output.
1. Run classification/retrieval/reranking.
2. Stream answer tokens immediately.
3. Run a second fast metadata-extraction prompt.
4. Push metadata/citations as the final SSE event.
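A sketch of that ordering as a FastAPI SSE endpoint; `run_retrieval` and `extract_metadata` stand in for the real helpers, while `_llm_stream` is named in the source:

```python
import json
from fastapi.responses import StreamingResponse

async def stream_answer(query: str, run_retrieval, _llm_stream, extract_metadata):
    """Answer tokens stream first; citations/metadata arrive as the final event."""
    async def event_stream():
        state = await run_retrieval(query)          # shared classify/retrieve/rerank
        async for token in _llm_stream(state):      # 1) answer tokens immediately
            yield f"data: {json.dumps({'type': 'token', 'text': token})}\n\n"
        meta = await extract_metadata(state)        # 2) second fast pass
        yield f"data: {json.dumps({'type': 'metadata', **meta})}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```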
### 8.3 Tone Hints by Query Type
| Type | Tone Guidance |
|---|---|
| `fact_lookup` | Direct, no metaphors, cite per bullet. |
| `penalty_lookup` | Lead with consequence/penalty. |
| `cross_reference` | Explain primary section, then connections. |
| `conflict_detection` | Flag contradiction ONLY if both sides are in context. |
| `temporal` | Lead with exact numeric deadline/time. |
---
## 9. Validation
### 9.1 Validator Design
`validator_node()` treats `confidence_score < 0.2` as a hallucination risk.
- Returns `hallucination_flag: True` if score is below floor.
- If flagged, the graph triggers a **retry** (up to 2 times) with different retrieval parameters.
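A sketch of the flag-and-retry loop as a LangGraph-style node plus conditional edge; the node names and the `retry_count` field are assumptions:

```python
CONFIDENCE_FLOOR = 0.2
MAX_RETRIES = 2

def validator_node(state: dict) -> dict:
    """Flag answers whose confidence score falls below the floor."""
    state["hallucination_flag"] = state["confidence_score"] < CONFIDENCE_FLOOR
    return state

def route_after_validation(state: dict) -> str:
    """Conditional edge: retry retrieval with different parameters, or finish."""
    if state["hallucination_flag"] and state.get("retry_count", 0) < MAX_RETRIES:
        state["retry_count"] = state.get("retry_count", 0) + 1
        return "retrieval"     # hypothetical node name
    return "output_guard"      # hypothetical node name
```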
### 9.2 Output Guardrails
`guardrails/output_guard.py`:
- Intercepts low-confidence answers and safeguard failures.
- Returns `InsufficientInfoResponse` when grounding is weak.
- Appends legal disclaimer.
---
## 10. RAGAS Evaluation Pipeline
### 10.1 Two-Phase Architecture
- **Phase 1:** Graph invocation -> `eval_phase1_results.json`.
- **Phase 2:** RAGAS scoring -> `eval_results.json`.
### 10.2 Dataset & Metrics
- **Rows:** 31 (Central, 4 States, Multi-Jurisdiction).
- **Primary Metrics:** Faithfulness, Answer Relevancy, Context Precision.
- **Goal:** Faithfulness > 0.85; Answer Relevancy > 0.80.
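A sketch of Phase 2 with the `ragas` package; the column names follow recent ragas releases (older versions use `ground_truths`) and the sample row is illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One illustrative row; real rows come from eval_phase1_results.json.
rows = Dataset.from_dict({
    "question": ["What is the refund timeline under Section 18?"],
    "answer": ["..."],       # Phase 1 generated answer
    "contexts": [["..."]],   # retrieved chunk texts for that row
    "ground_truth": ["..."],
})

scores = evaluate(rows, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)  # written out as eval_results.json in Phase 2
```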
---
## 11. Known Failure Modes
- **Multi-Jurisdiction Retrieval:** Reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
- **Large Context Noise:** 7 chunks sometimes include irrelevant sub-clauses that distract the generator.
---
## 12. Implementation Checklist
- [x] Add `DocumentSpec` to registry.
- [x] Verify PDF text extraction.
- [x] Run `make ingest`.
- [x] Seed Neo4j graph.
- [x] Run `make eval-smoke` to verify precision.