adeshboudh16 committed on
Commit ·
8de7198
Parent(s): 77dd060
updated docs
- README.md +3 -3
- docs/RAG.md +69 -331
README.md
CHANGED
|
@@ -16,8 +16,7 @@ pinned: false
|
|
| 16 |
Open-source RAG system for querying Indian civic and legal documents — with accurate
|
| 17 |
citations, cross-reference traversal, and conflict detection between laws.
|
| 18 |
|
| 19 |
-
**Current status:** Phase
|
| 20 |
-
RAGAS evaluation pipeline live, hybrid RRF retrieval, Next.js frontend deployed on Vercel.
|
| 21 |
|
| 22 |
---
|
| 23 |
|
|
@@ -97,7 +96,7 @@ make serve
|
|
| 97 |
|
| 98 |
## Production
|
| 99 |
|
| 100 |
-
- **Frontend:** [Vercel](https://civicsetu-two.vercel.app) — Next.js 15 App Router
|
| 101 |
- **API:** [Hugging Face Spaces](https://huggingface.co/spaces/adesh01/civicsetu) — FastAPI + Docker + 550MB model baked in
|
| 102 |
- **PostgreSQL + pgvector:** [Neon](https://neon.tech) — 1203 chunks
|
| 103 |
- **Neo4j:** [AuraDB Free](https://neo4j.com/cloud/aura) — 2090 sections, 2321 edges
|
|
@@ -184,6 +183,7 @@ Graph: 2090 Section nodes, 1297 HAS_SECTION edges, 933 REFERENCES edges, 91 DERI
|
|
| 184 |
| 6 | Next.js frontend, Vercel deployment, public URL | ✅ Complete |
|
| 185 |
| 7 | Graph explorer, section content drawer, D3 visualization | ✅ Complete |
|
| 186 |
| 8 | RAGAS eval pipeline, hybrid RRF retrieval, retrieval quality fixes | ✅ Complete |
|
|
|
|
| 187 |
|
| 188 |
|
| 189 |
---
|
|
|
|
| 16 |
Open-source RAG system for querying Indian civic and legal documents — with accurate
|
| 17 |
citations, cross-reference traversal, and conflict detection between laws.
|
| 18 |
|
| 19 |
+
**Current status:** Phase 9 complete — 5-jurisdiction RERA coverage, RAGAS evaluation pipeline (0.90 faithfulness), hybrid RRF retrieval, and mobile-responsive Next.js frontend live on Vercel.
|
|
|
|
| 20 |
|
| 21 |
---
|
| 22 |
|
|
|
|
| 96 |
|
| 97 |
## Production
|
| 98 |
|
| 99 |
+
- **Frontend:** [Vercel](https://civicsetu-two.vercel.app) — Next.js 15 App Router (Mobile Responsive)
|
| 100 |
- **API:** [Hugging Face Spaces](https://huggingface.co/spaces/adesh01/civicsetu) — FastAPI + Docker + 550MB model baked in
|
| 101 |
- **PostgreSQL + pgvector:** [Neon](https://neon.tech) — 1203 chunks
|
| 102 |
- **Neo4j:** [AuraDB Free](https://neo4j.com/cloud/aura) — 2090 sections, 2321 edges
|
|
|
|
| 183 |
| 6 | Next.js frontend, Vercel deployment, public URL | ✅ Complete |
|
| 184 |
| 7 | Graph explorer, section content drawer, D3 visualization | ✅ Complete |
|
| 185 |
| 8 | RAGAS eval pipeline, hybrid RRF retrieval, retrieval quality fixes | ✅ Complete |
|
| 186 |
+
| 9 | Mobile responsiveness, frontend polish, dual-pane layout, interaction animations | ✅ Complete |
|
| 187 |
|
| 188 |
|
| 189 |
---
|
docs/RAG.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
| 1 |
# CivicSetu - RAG Techniques Reference
|
| 2 |
|
| 3 |
-
**Version:** 2.
|
| 4 |
-
**Last Updated:** 2026-
|
| 5 |
|
| 6 |
This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.
|
| 7 |
|
|
@@ -9,8 +9,11 @@ This document describes the retrieval-augmented generation stack currently used
|
|
| 9 |
|
| 10 |
## 1. Current Status Snapshot
|
| 11 |
|
| 12 |
-
As of **2026-
|
| 13 |
|
|
|
|
|
|
|
|
|
|
| 14 |
- **Cloud Infrastructure Live**
|
| 15 |
- Relational & Vector: **Neon (Postgres + pgvector)**
|
| 16 |
- Graph: **Neo4j AuraDB**
|
|
@@ -32,21 +35,19 @@ As of **2026-04-30**, CivicSetu's RAG app is usable end-to-end, with a fresh ing
|
|
| 32 |
- streaming path reuses classifier, retrieval, and reranker
|
| 33 |
- answer text streams first
|
| 34 |
- citations and metadata are extracted in a second fast pass
|
| 35 |
-
- **Latest eval artifact
|
| 36 |
- `eval_results.json` dated **2026-04-28**
|
| 37 |
- `faithfulness=0.900`
|
| 38 |
- `answer_relevancy=0.858`
|
| 39 |
- `context_precision=0.696`
|
| 40 |
- `pass_rate=0.581`
|
| 41 |
-
- **Knowledge Graph Scale (as of 2026-
|
| 42 |
- Documents: `6`
|
| 43 |
-
- Sections: `
|
| 44 |
-
-
|
| 45 |
-
- `DERIVED_FROM` edges: `62`
|
| 46 |
- **Main remaining weakness**
|
| 47 |
-
- multi-jurisdiction retrieval still weak
|
| 48 |
-
-
|
| 49 |
-
- common `fact_lookup` traffic is still less reliable than penalty or graph-heavy queries
|
| 50 |
|
| 51 |
---
|
| 52 |
|
|
@@ -121,7 +122,7 @@ Current defaults from `config/settings.py`:
|
|
| 121 |
- `embedding_model = nomic-embed-text`
|
| 122 |
- `embedding_dimension = 768`
|
| 123 |
|
| 124 |
-
Query and document embeddings use asymmetric prefixes compatible with Nomic-style retrieval.
|
| 125 |
|
| 126 |
### 3.6 Graph Seeding
|
| 127 |
|
|
@@ -156,10 +157,10 @@ Current route mapping:
|
|
| 156 |
| Query Type | Route |
|
| 157 |
|---|---|
|
| 158 |
| `fact_lookup` | `vector_retrieval` |
|
| 159 |
-
| `cross_reference
|
| 160 |
| `penalty_lookup` | `graph_retrieval` |
|
| 161 |
| `temporal` | `graph_retrieval` |
|
| 162 |
-
| `conflict_detection
|
| 163 |
|
| 164 |
Classifier fallback: if JSON parse fails, default to `fact_lookup` with original query.
|
| 165 |
|
|
@@ -170,22 +171,20 @@ All non-streaming LLM calls use `_llm_call()`. Streaming uses `_llm_stream()`.
|
|
| 170 |
Current model chain:
|
| 171 |
|
| 172 |
```text
|
| 173 |
-
THINKING tier
|
| 174 |
-
1. gemini/gemini-
|
| 175 |
2. groq/llama-3.3-70b-versatile
|
| 176 |
-
3.
|
| 177 |
-
4. openrouter/qwen/qwen3.6-plus:free
|
| 178 |
|
| 179 |
-
FAST tier
|
| 180 |
-
1. gemini/gemini-
|
| 181 |
```
|
| 182 |
|
| 183 |
Provider notes:
|
| 184 |
|
| 185 |
-
-
|
| 186 |
-
-
|
| 187 |
-
-
|
| 188 |
-
- fast-tier tasks use the lighter chain
|
| 189 |
|
| 190 |
---
|
| 191 |
|
|
@@ -197,38 +196,21 @@ Hybrid retrieval combines vector similarity and PostgreSQL full-text search, the
|
|
| 197 |
|
| 198 |
Used to catch semantic matches when wording differs from statute text.
|
| 199 |
|
| 200 |
-
Strength:
|
| 201 |
-
|
| 202 |
-
- good for paraphrase and plain-English phrasing
|
| 203 |
-
|
| 204 |
-
Weakness:
|
| 205 |
-
|
| 206 |
-
- can still over-focus on one jurisdiction or sub-clause family
|
| 207 |
-
|
| 208 |
### 5.2 Full-Text Search
|
| 209 |
|
| 210 |
-
Used for exact legal wording, section numbers, and important terms.
|
| 211 |
-
|
| 212 |
-
Strength:
|
| 213 |
-
|
| 214 |
-
- precise keyword and section hits
|
| 215 |
-
|
| 216 |
-
Weakness:
|
| 217 |
-
|
| 218 |
-
- misses paraphrases and concept-only questions
|
| 219 |
|
| 220 |
### 5.3 Reciprocal Rank Fusion
|
| 221 |
|
| 222 |
Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
|
| 223 |
|
| 224 |
-
### 5.4 Section
|
| 225 |
|
| 226 |
-
|
| 227 |
|
| 228 |
-
|
| 229 |
|
| 230 |
-
|
| 231 |
-
- prevent generator from seeing one isolated sub-clause only
|
| 232 |
|
| 233 |
---
|
| 234 |
|
|
@@ -239,17 +221,16 @@ Used for section-centric questions and legal relationships.
|
|
| 239 |
Current behavior:
|
| 240 |
|
| 241 |
- extract section or rule IDs from query
|
| 242 |
-
- traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
|
| 243 |
- hydrate matching sections back from Postgres
|
| 244 |
|
| 245 |
Graph retrieval is especially important for:
|
| 246 |
|
| 247 |
- explicit section lookups
|
| 248 |
- penalty questions
|
| 249 |
-
- temporal questions
|
| 250 |
- central vs state derivation paths
|
| 251 |
|
| 252 |
-
Pinned chunks
|
| 253 |
|
| 254 |
---
|
| 255 |
|
|
@@ -257,36 +238,20 @@ Pinned chunks stay ahead of reranked chunks so exact requested sections do not g
|
|
| 257 |
|
| 258 |
### 7.1 Cross-Encoder
|
| 259 |
|
| 260 |
-
`retrieval/reranker.py` uses FlashRank
|
| 261 |
-
|
| 262 |
-
- `reranker_model = ms-marco-MiniLM-L-12-v2`
|
| 263 |
|
| 264 |
Pipeline:
|
| 265 |
|
| 266 |
1. deduplicate by `(section_id, doc_name)`
|
| 267 |
2. split pinned vs rankable chunks
|
| 268 |
3. rerank rankable chunks with cross-encoder
|
| 269 |
-
4. filter by minimum score
|
| 270 |
-
5. apply score-gap cutoff
|
| 271 |
6. prepend pinned chunks
|
| 272 |
|
| 273 |
-
### 7.2
|
| 274 |
-
|
| 275 |
-
Current defaults:
|
| 276 |
-
|
| 277 |
-
- `reranker_score_threshold = 0.05`
|
| 278 |
-
- `reranker_score_gap = 0.95`
|
| 279 |
-
|
| 280 |
-
These are intentionally recall-friendly compared to older stricter settings.
|
| 281 |
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
Current max context size is **7 chunks**.
|
| 285 |
-
|
| 286 |
-
Assembly rule:
|
| 287 |
-
|
| 288 |
-
- pinned chunks first
|
| 289 |
-
- then top reranked chunks until 7 total
|
| 290 |
|
| 291 |
---
|
| 292 |
|
|
@@ -294,98 +259,42 @@ Assembly rule:
|
|
| 294 |
|
| 295 |
### 8.1 Buffered Generation
|
| 296 |
|
| 297 |
-
`generator_node()` builds a numbered context block and asks
|
| 298 |
-
|
| 299 |
-
```json
|
| 300 |
-
{
|
| 301 |
-
"answer": "<markdown>",
|
| 302 |
-
"confidence_score": 0.0,
|
| 303 |
-
"cited_chunks": [1, 3],
|
| 304 |
-
"amendment_notice": null,
|
| 305 |
-
"conflict_warnings": []
|
| 306 |
-
}
|
| 307 |
-
```
|
| 308 |
-
|
| 309 |
-
If parsing fails but raw text exists, answer text is salvaged and citations fall back to all visible chunks.
|
| 310 |
|
| 311 |
### 8.2 Streaming Generation
|
| 312 |
|
| 313 |
`stream_generator_node()` now drives SSE output.
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
2. stream plain-text answer tokens to client
|
| 319 |
-
3. run a second fast metadata extraction prompt
|
| 320 |
-
4. map `cited_chunks` indices back to real citations
|
| 321 |
-
|
| 322 |
-
Why split answer and metadata:
|
| 323 |
-
|
| 324 |
-
- keeps first-token latency lower
|
| 325 |
-
- avoids forcing model to stream valid JSON
|
| 326 |
-
- still preserves grounded citations in final event
|
| 327 |
|
| 328 |
### 8.3 Tone Hints by Query Type
|
| 329 |
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
| Type | Current Hint |
|
| 333 |
|---|---|
|
| 334 |
-
| `fact_lookup` |
|
| 335 |
-
| `penalty_lookup` |
|
| 336 |
-
| `cross_reference` |
|
| 337 |
-
| `conflict_detection` |
|
| 338 |
-
| `temporal` |
|
| 339 |
-
|
| 340 |
-
### 8.4 Citation Extraction
|
| 341 |
-
|
| 342 |
-
Both buffered and streaming generators anchor citations from 1-based `cited_chunks` indices.
|
| 343 |
-
|
| 344 |
-
Effect:
|
| 345 |
-
|
| 346 |
-
- citations reflect chunks actually used by generator
|
| 347 |
-
- not every retrieved chunk becomes a citation
|
| 348 |
|
| 349 |
---
|
| 350 |
|
| 351 |
## 9. Validation
|
| 352 |
|
| 353 |
-
### 9.1
|
| 354 |
-
|
| 355 |
-
Current validator is lightweight and **not** an LLM verifier.
|
| 356 |
|
| 357 |
-
`validator_node()`
|
|
|
|
|
|
|
| 358 |
|
| 359 |
-
|
| 360 |
-
- treats `confidence_score < 0.2` as a hallucination risk signal
|
| 361 |
-
- otherwise preserves generator confidence
|
| 362 |
|
| 363 |
-
|
| 364 |
-
|
| 365 |
-
|
| 366 |
-
|
| 367 |
-
Current retry rule:
|
| 368 |
-
|
| 369 |
-
```text
|
| 370 |
-
no reranked chunks -> end
|
| 371 |
-
confidence < 0.2 and retry_count < 2 -> retry
|
| 372 |
-
otherwise -> end
|
| 373 |
-
```
|
| 374 |
-
|
| 375 |
-
Max retries: **2**
|
| 376 |
-
|
| 377 |
-
Consequence:
|
| 378 |
-
|
| 379 |
-
- avoids excessive cost
|
| 380 |
-
- surfaces more low-confidence answers instead of repeatedly re-running graph
|
| 381 |
-
|
| 382 |
-
### 9.3 Output Guardrails
|
| 383 |
-
|
| 384 |
-
`guardrails/output_guard.py` still does final shaping:
|
| 385 |
-
|
| 386 |
-
- below confidence floor, return `InsufficientInfoResponse`
|
| 387 |
-
- always append disclaimer
|
| 388 |
-
- input guard handles safety/PII before graph execution
|
| 389 |
|
| 390 |
---
|
| 391 |
|
|
@@ -393,199 +302,28 @@ Consequence:
|
|
| 393 |
|
| 394 |
### 10.1 Two-Phase Architecture
|
| 395 |
|
| 396 |
-
|
| 397 |
-
|
| 398 |
-
- **Phase 1:** invoke graph -> `eval_phase1_results.json`
|
| 399 |
-
- **Phase 2:** score with RAGAS -> `eval_results.json`
|
| 400 |
-
|
| 401 |
-
Important current behavior:
|
| 402 |
-
|
| 403 |
-
- phase 1 resumes from valid cached rows
|
| 404 |
-
- phase 2 resumes from already scored rows
|
| 405 |
-
- failures do not require restarting from row 1
|
| 406 |
|
| 407 |
-
### 10.2 Dataset
|
| 408 |
|
| 409 |
-
|
| 410 |
-
|
| 411 |
-
-
|
| 412 |
-
- `MAHARASHTRA`
|
| 413 |
-
- `UTTAR_PRADESH`
|
| 414 |
-
- `KARNATAKA`
|
| 415 |
-
- `TAMIL_NADU`
|
| 416 |
-
- `MULTI`
|
| 417 |
-
|
| 418 |
-
Each row includes:
|
| 419 |
-
|
| 420 |
-
- `query`
|
| 421 |
-
- `query_type`
|
| 422 |
-
- `ground_truth`
|
| 423 |
-
- `expected_section_ids`
|
| 424 |
-
|
| 425 |
-
### 10.3 Judge Providers
|
| 426 |
-
|
| 427 |
-
Current supported judge providers:
|
| 428 |
-
|
| 429 |
-
```bash
|
| 430 |
-
JUDGE_PROVIDER=groq JUDGE_MODEL=llama-3.3-70b-versatile
|
| 431 |
-
JUDGE_PROVIDER=gemini JUDGE_MODEL=gemini/gemma-4-31b-it
|
| 432 |
-
JUDGE_PROVIDER=openrouter JUDGE_MODEL=meta-llama/llama-3.3-70b-instruct
|
| 433 |
-
JUDGE_PROVIDER=osmapi JUDGE_MODEL=qwen3.5-397b-a17b
|
| 434 |
-
JUDGE_PROVIDER=nvidia JUDGE_MODEL=z-ai/glm4.7
|
| 435 |
-
```
|
| 436 |
-
|
| 437 |
-
Current rate-limit controls:
|
| 438 |
-
|
| 439 |
-
- `PHASE2_DELAY_SEC=20`
|
| 440 |
-
- `PHASE2_MAX_RETRIES=4`
|
| 441 |
-
- retry delay parsed from provider errors when available
|
| 442 |
-
|
| 443 |
-
### 10.4 Judge Input Trimming
|
| 444 |
-
|
| 445 |
-
Current defaults:
|
| 446 |
-
|
| 447 |
-
```text
|
| 448 |
-
RAGAS_MAX_CONTEXTS = 7
|
| 449 |
-
RAGAS_CONTEXT_CHAR_LIMIT = 1200
|
| 450 |
-
RAGAS_ANSWER_CHAR_LIMIT = 1500
|
| 451 |
-
RAGAS_REFERENCE_CHAR_LIMIT = 600
|
| 452 |
-
```
|
| 453 |
-
|
| 454 |
-
### 10.5 Latest Full Eval Results
|
| 455 |
-
|
| 456 |
-
Source: `eval_results.json` generated on **2026-04-28**
|
| 457 |
-
|
| 458 |
-
Run config captured in artifact:
|
| 459 |
-
|
| 460 |
-
- graph model: `openai/z-ai/glm4.7`
|
| 461 |
-
- judge model: `qwen3.5-397b-a17b`
|
| 462 |
-
- pass threshold: `0.7`
|
| 463 |
-
|
| 464 |
-
Overall:
|
| 465 |
-
|
| 466 |
-
| Metric | Value |
|
| 467 |
-
|---|---|
|
| 468 |
-
| Faithfulness | `0.900` |
|
| 469 |
-
| Answer Relevancy | `0.858` |
|
| 470 |
-
| Context Precision | `0.696` |
|
| 471 |
-
| Pass Rate | `0.581` |
|
| 472 |
-
| P50 Latency | `101,853.7 ms` |
|
| 473 |
-
| P90 Latency | `161,085.2 ms` |
|
| 474 |
-
|
| 475 |
-
By query type:
|
| 476 |
-
|
| 477 |
-
| Query Type | Faithfulness | Relevancy | Precision | Pass Rate |
|
| 478 |
-
|---|---|---|---|---|
|
| 479 |
-
| `penalty_lookup` | `0.866` | `0.873` | `1.000` | `1.000` |
|
| 480 |
-
| `temporal` | `0.965` | `0.841` | `0.713` | `0.500` |
|
| 481 |
-
| `cross_reference` | `0.894` | `0.796` | `0.736` | `0.500` |
|
| 482 |
-
| `conflict_detection` | `0.899` | `0.911` | `0.551` | `0.500` |
|
| 483 |
-
| `fact_lookup` | `0.880` | `0.865` | `0.513` | `0.429` |
|
| 484 |
-
|
| 485 |
-
By jurisdiction:
|
| 486 |
-
|
| 487 |
-
| Jurisdiction | Faithfulness | Relevancy | Precision | Pass Rate |
|
| 488 |
-
|---|---|---|---|---|
|
| 489 |
-
| `TAMIL_NADU` | `0.978` | `0.900` | `0.747` | `0.800` |
|
| 490 |
-
| `KARNATAKA` | `0.907` | `0.847` | `0.851` | `0.800` |
|
| 491 |
-
| `UTTAR_PRADESH` | `0.978` | `0.904` | `0.600` | `0.600` |
|
| 492 |
-
| `MAHARASHTRA` | `0.834` | `0.879` | `0.702` | `0.600` |
|
| 493 |
-
| `CENTRAL` | `0.933` | `0.792` | `0.912` | `0.500` |
|
| 494 |
-
| `MULTI` | `0.763` | `0.838` | `0.322` | `0.200` |
|
| 495 |
-
|
| 496 |
-
Interpretation:
|
| 497 |
-
|
| 498 |
-
- grounding is now strong enough that hallucination is no longer the primary failure mode
|
| 499 |
-
- ranking quality improved materially
|
| 500 |
-
- multi-jurisdiction comparison remains the biggest retrieval gap
|
| 501 |
-
- eval latency is high enough that streaming is important for user-facing UX
|
| 502 |
|
| 503 |
---
|
| 504 |
|
| 505 |
## 11. Known Failure Modes
|
| 506 |
|
| 507 |
-
|
| 508 |
-
|
| 509 |
-
Cross-jurisdiction questions still underperform.
|
| 510 |
-
|
| 511 |
-
Observed symptom:
|
| 512 |
-
|
| 513 |
-
- `MULTI` pass rate only `0.200`
|
| 514 |
-
|
| 515 |
-
Likely causes:
|
| 516 |
-
|
| 517 |
-
- one jurisdiction dominates semantic retrieval
|
| 518 |
-
- reranker still sees mixed chunks without explicit two-sided decomposition
|
| 519 |
-
|
| 520 |
-
### 11.2 Fact Lookup Precision
|
| 521 |
-
|
| 522 |
-
`fact_lookup` is still weakest among common traffic:
|
| 523 |
-
|
| 524 |
-
- pass rate `0.429`
|
| 525 |
-
- context precision `0.513`
|
| 526 |
-
|
| 527 |
-
Likely causes:
|
| 528 |
-
|
| 529 |
-
- broad questions invite many semantically related sub-clauses
|
| 530 |
-
- exact top-level section sometimes loses to more specific descendants
|
| 531 |
-
|
| 532 |
-
### 11.3 Latency
|
| 533 |
-
|
| 534 |
-
Eval-mode latency remains large:
|
| 535 |
-
|
| 536 |
-
- p50 ~102 seconds
|
| 537 |
-
- p90 ~161 seconds
|
| 538 |
-
|
| 539 |
-
Implication:
|
| 540 |
-
|
| 541 |
-
- buffered endpoint is expensive for user experience
|
| 542 |
-
- streaming endpoint is necessary, not optional
|
| 543 |
|
| 544 |
---
|
| 545 |
|
| 546 |
-
## 12.
|
| 547 |
-
|
| 548 |
-
Important RAG settings from `config/settings.py`:
|
| 549 |
-
|
| 550 |
-
| Parameter | Default | Effect |
|
| 551 |
-
|---|---|---|
|
| 552 |
-
| `primary_model` | `gemini/gemini-3.1-flash-lite-preview` | main thinking-tier model |
|
| 553 |
-
| `fast_model` | `gemini/gemini-3.1-flash-lite-preview` | fast-tier tasks |
|
| 554 |
-
| `fallback_model_1` | `groq/llama-3.3-70b-versatile` | first fallback |
|
| 555 |
-
| `fallback_model_2` | `openrouter/meta-llama/llama-3.3-70b-instruct:free` | second fallback |
|
| 556 |
-
| `fallback_model_3` | `openrouter/qwen/qwen3.6-plus:free` | third fallback |
|
| 557 |
-
| `embedding_model` | `nomic-embed-text` | embedding model |
|
| 558 |
-
| `embedding_dimension` | `768` | pgvector dimension |
|
| 559 |
-
| `reranker_model` | `ms-marco-MiniLM-L-12-v2` | FlashRank cross-encoder |
|
| 560 |
-
| `reranker_score_threshold` | `0.05` | minimum score to keep |
|
| 561 |
-
| `reranker_score_gap` | `0.95` | score-cliff cutoff |
|
| 562 |
-
|
| 563 |
-
Important eval env vars:
|
| 564 |
-
|
| 565 |
-
| Env Var | Effect |
|
| 566 |
-
|---|---|
|
| 567 |
-
| `EVAL_PHASE` | run phase 1 only, phase 2 only, or both |
|
| 568 |
-
| `EVAL_LIMIT` | limit number of rows |
|
| 569 |
-
| `EVAL_IDS` | evaluate only specific row IDs |
|
| 570 |
-
| `EVAL_JURISDICTION` | evaluate one jurisdiction only |
|
| 571 |
-
| `JUDGE_PROVIDER` | judge backend |
|
| 572 |
-
| `JUDGE_MODEL` | judge model |
|
| 573 |
-
| `NO_REASONING` | disable thinking where supported |
|
| 574 |
-
| `PASS_THRESHOLD` | pass threshold |
|
| 575 |
-
| `PHASE2_DELAY_SEC` | inter-row delay for judge calls |
|
| 576 |
-
| `RAGAS_MAX_CONTEXTS` | contexts sent to judge |
|
| 577 |
-
| `RAGAS_CONTEXT_CHAR_LIMIT` | trim limit per context |
|
| 578 |
-
|
| 579 |
-
---
|
| 580 |
-
|
| 581 |
-
## 13. Implementation Checklist
|
| 582 |
-
|
| 583 |
-
When adding a new jurisdiction or corpus:
|
| 584 |
|
| 585 |
-
-
|
| 586 |
-
-
|
| 587 |
-
-
|
| 588 |
-
-
|
| 589 |
-
-
|
| 590 |
-
- add eval rows for all major query types
|
| 591 |
-
- rerun full eval and inspect jurisdiction-specific precision
|
|
|
|
| 1 |
# CivicSetu - RAG Techniques Reference
|
| 2 |
|
| 3 |
+
**Version:** 2.3 - Mobile Ledger + Quality Hardening
|
| 4 |
+
**Last Updated:** 2026-05-01
|
| 5 |
|
| 6 |
This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.
|
| 7 |
|
|
|
|
| 9 |
|
| 10 |
## 1. Current Status Snapshot
|
| 11 |
|
| 12 |
+
As of **2026-05-01**, CivicSetu's RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval quality fixes live.
|
| 13 |
|
| 14 |
+
- **Phase 9 Complete (Mobile Responsive)**
|
| 15 |
+
- Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
|
| 16 |
+
- Interactive Graph Explorer with section drill-down.
|
| 17 |
- **Cloud Infrastructure Live**
|
| 18 |
- Relational & Vector: **Neon (Postgres + pgvector)**
|
| 19 |
- Graph: **Neo4j AuraDB**
|
|
|
|
| 35 |
- streaming path reuses classifier, retrieval, and reranker
|
| 36 |
- answer text streams first
|
| 37 |
- citations and metadata are extracted in a second fast pass
|
| 38 |
+
- **Latest eval artifact (0.90 Faithfulness)**
|
| 39 |
- `eval_results.json` dated **2026-04-28**
|
| 40 |
- `faithfulness=0.900`
|
| 41 |
- `answer_relevancy=0.858`
|
| 42 |
- `context_precision=0.696`
|
| 43 |
- `pass_rate=0.581`
|
| 44 |
+
- **Knowledge Graph Scale (as of 2026-05-01)**
|
| 45 |
- Documents: `6`
|
| 46 |
+
- Sections: `2,090`
|
| 47 |
+
- Edges: `2,321` (REFERENCES, DERIVED_FROM, HAS_SECTION)
|
|
|
|
| 48 |
- **Main remaining weakness**
|
| 49 |
+
- multi-jurisdiction retrieval still weak (`MULTI` rows pass only `20%`)
|
| 50 |
+
- context precision for broad fact lookups needs further HNSW tuning
|
|
|
|
| 51 |
|
| 52 |
---
|
| 53 |
|
|
|
|
| 122 |
- `embedding_model = nomic-embed-text`
|
| 123 |
- `embedding_dimension = 768`
|
| 124 |
|
| 125 |
+
Query and document embeddings use asymmetric prefixes (`search_query: ` vs `search_document: `) compatible with Nomic-style retrieval.
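A minimal sketch of the asymmetric prefixing described above. The helper names `prefix_query` and `prefix_documents` are hypothetical and not taken from the CivicSetu codebase; only the two prefix strings come from this document.

```python
# Prefix strings as stated in the docs; helper names are illustrative only.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def prefix_query(text: str) -> str:
    """Prepare a user query for embedding."""
    return QUERY_PREFIX + text.strip()


def prefix_documents(chunks: list[str]) -> list[str]:
    """Prepare document chunks for embedding at ingest time."""
    return [DOCUMENT_PREFIX + chunk.strip() for chunk in chunks]
```

The prefixed strings, not the raw text, would then be passed to the embedding model on both the ingest side and the query side.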
|
| 126 |
|
| 127 |
### 3.6 Graph Seeding
|
| 128 |
|
|
|
|
| 157 |
| Query Type | Route |
|
| 158 |
|---|---|
|
| 159 |
| `fact_lookup` | `vector_retrieval` |
|
| 160 |
+
| `cross_reference` | `graph_retrieval` |
|
| 161 |
| `penalty_lookup` | `graph_retrieval` |
|
| 162 |
| `temporal` | `graph_retrieval` |
|
| 163 |
+
| `conflict_detection` | `hybrid_retrieval` |
|
| 164 |
|
| 165 |
Classifier fallback: if JSON parse fails, default to `fact_lookup` with original query.
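The route table and fallback can be sketched as follows. This is an illustration under stated assumptions: the JSON field names `query_type` and `rewritten_query` and the function `route_query` are hypothetical, and treating an unknown query type like a parse failure is an assumption rather than documented behavior.

```python
import json

# Route mapping as listed in the table above.
ROUTES = {
    "fact_lookup": "vector_retrieval",
    "cross_reference": "graph_retrieval",
    "penalty_lookup": "graph_retrieval",
    "temporal": "graph_retrieval",
    "conflict_detection": "hybrid_retrieval",
}


def route_query(classifier_output: str, original_query: str) -> tuple[str, str, str]:
    """Return (query_type, route, query) from raw classifier output."""
    try:
        parsed = json.loads(classifier_output)
        query_type = parsed["query_type"]  # hypothetical field name
        if query_type not in ROUTES:
            raise KeyError(query_type)  # assumption: unknown types also fall back
        query = parsed.get("rewritten_query", original_query)  # hypothetical field
    except (json.JSONDecodeError, KeyError, TypeError):
        # Documented fallback: fact_lookup with the original query.
        query_type, query = "fact_lookup", original_query
    return query_type, ROUTES[query_type], query
```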
|
| 166 |
|
|
|
|
| 171 |
Current model chain:
|
| 172 |
|
| 173 |
```text
|
| 174 |
+
THINKING tier (Generator)
|
| 175 |
+
1. gemini/gemini-1.5-flash
|
| 176 |
2. groq/llama-3.3-70b-versatile
|
| 177 |
+
3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7
|
|
|
|
| 178 |
|
| 179 |
+
FAST tier (Classifier/Validator)
|
| 180 |
+
1. gemini/gemini-1.5-flash
|
| 181 |
```
|
| 182 |
|
| 183 |
Provider notes:
|
| 184 |
|
| 185 |
+
- NVIDIA-hosted models (Minimax, GLM) use `https://integrate.api.nvidia.com/v1`
|
| 186 |
+
- `temperature=0.0` for all grounding tasks
|
| 187 |
+
- Gemini models use `temperature=1.0` only where the provider requires it for a given tier.
|
|
|
|
| 188 |
|
| 189 |
---
|
| 190 |
|
|
|
|
| 196 |
|
| 197 |
Used to catch semantic matches when wording differs from statute text.
|
| 198 |
|
| 199 |
### 5.2 Full-Text Search
|
| 200 |
|
| 201 |
+
Used for exact legal wording, section numbers, and important terms via `websearch_to_tsquery`.
|
| 202 |
|
| 203 |
### 5.3 Reciprocal Rank Fusion
|
| 204 |
|
| 205 |
Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
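For reference, a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. Each chunk scores `1 / (k + rank)` per list and the scores are summed. The constant `k = 60` is the commonly used default, not a value confirmed for CivicSetu, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["a", "b", "c"]  # ranked by embedding similarity
fts_hits = ["b", "d", "a"]     # ranked by full-text search
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
```

Chunks `a` and `b` appear in both lists, so they outrank `c` and `d`, which appear in only one.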
|
| 206 |
|
| 207 |
+
### 5.4 Section-ID-Aware Direct Lookup
|
| 208 |
|
| 209 |
+
If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and **pins** them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.
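A rough sketch of how explicit section references could be detected and pinned. The regular expression, the function names, and the `index` dictionary (section number to chunk) are all assumptions made for illustration; the real retriever performs an indexed database lookup instead.

```python
import re

# Matches "Section 18", "rule 3A", etc. (illustrative pattern only).
SECTION_PATTERN = re.compile(r"\b(section|rule)\s+(\d+[A-Z]?)\b", re.IGNORECASE)


def extract_section_refs(query: str) -> list[tuple[str, str]]:
    """Return unique (kind, number) pairs in the order they appear."""
    refs: list[tuple[str, str]] = []
    for kind, number in SECTION_PATTERN.findall(query):
        ref = (kind.lower(), number.upper())
        if ref not in refs:
            refs.append(ref)
    return refs


def pin_sections(query: str, retrieved: list[str], index: dict[str, str]) -> list[str]:
    """Place directly looked-up sections ahead of the other retrieved chunks."""
    pinned = [index[num] for _, num in extract_section_refs(query) if num in index]
    rest = [chunk for chunk in retrieved if chunk not in pinned]
    return pinned + rest
```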
|
| 210 |
|
| 211 |
+
### 5.5 Central Act Supplementation
|
| 212 |
|
| 213 |
+
For queries filtered by a specific State Jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the **Central RERA Act 2016**. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.
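One way to picture the supplementation rule is as an expansion of the jurisdiction filter before retrieval. This is a sketch only: the function name is hypothetical, and the labels `CENTRAL` and `MAHARASHTRA` are assumed to match the jurisdiction labels used in the eval dataset.

```python
from typing import Optional

CENTRAL = "CENTRAL"  # assumed label for the Central RERA Act 2016


def jurisdictions_to_search(jurisdiction_filter: Optional[str]) -> Optional[list[str]]:
    """Expand a state filter so Central Act chunks are also retrieved."""
    if jurisdiction_filter is None:
        return None  # no filter: search every jurisdiction
    if jurisdiction_filter == CENTRAL:
        return [CENTRAL]
    return [jurisdiction_filter, CENTRAL]
```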
|
|
|
|
| 214 |
|
| 215 |
---
|
| 216 |
|
|
|
|
| 221 |
Current behavior:
|
| 222 |
|
| 223 |
- extract section or rule IDs from query
|
| 224 |
+
- traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
|
| 225 |
- hydrate matching sections back from Postgres
|
| 226 |
|
| 227 |
Graph retrieval is especially important for:
|
| 228 |
|
| 229 |
- explicit section lookups
|
| 230 |
- penalty questions
|
|
|
|
| 231 |
- central vs state derivation paths
|
| 232 |
|
| 233 |
+
Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.
|
| 234 |
|
| 235 |
---
|
| 236 |
|
|
|
|
| 238 |
|
| 239 |
### 7.1 Cross-Encoder
|
| 240 |
|
| 241 |
+
`retrieval/reranker.py` uses FlashRank (`ms-marco-MiniLM-L-12-v2`).
|
|
|
|
|
|
|
| 242 |
|
| 243 |
Pipeline:
|
| 244 |
|
| 245 |
1. deduplicate by `(section_id, doc_name)`
|
| 246 |
2. split pinned vs rankable chunks
|
| 247 |
3. rerank rankable chunks with cross-encoder
|
| 248 |
+
4. filter by minimum score (0.05)
|
| 249 |
+
5. apply score-gap cutoff (0.95)
|
| 250 |
6. prepend pinned chunks
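The pipeline steps above can be sketched as follows. The cross-encoder call itself is omitted: scores are assumed to be already attached to each chunk. The `Chunk` type and `postprocess` function are hypothetical, and reading the score-gap cutoff as "stop at the first drop between neighbouring scores larger than the gap" is an assumption about how that setting behaves.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    section_id: str
    doc_name: str
    score: float          # cross-encoder score, assumed precomputed
    pinned: bool = False


def postprocess(chunks: list[Chunk], min_score: float = 0.05,
                max_gap: float = 0.95) -> list[Chunk]:
    # 1. Deduplicate by (section_id, doc_name), keeping the first occurrence.
    seen, unique = set(), []
    for chunk in chunks:
        key = (chunk.section_id, chunk.doc_name)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    # 2. Split pinned vs rankable chunks.
    pinned = [c for c in unique if c.pinned]
    rankable = [c for c in unique if not c.pinned]
    # 3. Order rankable chunks by score, best first.
    ranked = sorted(rankable, key=lambda c: c.score, reverse=True)
    # 4. Drop chunks below the minimum score.
    ranked = [c for c in ranked if c.score >= min_score]
    # 5. Stop at the first score cliff larger than max_gap (assumed semantics).
    kept = ranked[:1]
    for prev, cur in zip(ranked, ranked[1:]):
        if prev.score - cur.score > max_gap:
            break
        kept.append(cur)
    # 6. Pinned chunks go first.
    return pinned + kept
```

With the defaults shown (`0.05` and `0.95`), the cutoff rarely triggers when scores lie in the 0 to 1 range.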
|
| 251 |
|
| 252 |
+
### 7.2 Context Assembly
|
| 253 |
|
| 254 |
+
Max context size is **7 chunks**. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.
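A small sketch of the assembly rule, assuming chunks are represented by plain identifiers and that pinned chunks are truncated only when they alone exceed the limit. The function name is illustrative.

```python
def assemble_context(pinned: list[str], reranked: list[str],
                     max_chunks: int = 7) -> list[str]:
    """Pinned chunks first, then top reranked chunks until the limit is reached."""
    context = list(pinned[:max_chunks])
    for chunk in reranked:
        if len(context) >= max_chunks:
            break
        if chunk not in context:  # avoid repeating a pinned chunk
            context.append(chunk)
    return context
```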
|
| 255 |
|
| 256 |
---
|
| 257 |
|
|
|
|
| 259 |
|
| 260 |
### 8.1 Buffered Generation
|
| 261 |
|
| 262 |
+
`generator_node()` builds a numbered context block and asks for JSON output.
|
| 263 |
|
| 264 |
### 8.2 Streaming Generation
|
| 265 |
|
| 266 |
`stream_generator_node()` now drives SSE output.
|
| 267 |
+
1. Run classification/retrieval/reranking.
|
| 268 |
+
2. Stream answer tokens immediately.
|
| 269 |
+
3. Run a second fast metadata extraction prompt.
|
| 270 |
+
4. Push metadata/citations as the final SSE event.
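The streaming order above can be illustrated with a small generator that emits Server-Sent Events. The event names `token` and `metadata`, the payload shapes, and the function names are assumptions. The sketch also assumes the metadata pass returns 1-based `cited_chunks` indices into the numbered context.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Event frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(tokens: Iterable[str], cited_chunks: list[int],
                  chunks: list[str]) -> Iterator[str]:
    # Answer text streams first, one event per token.
    for token in tokens:
        yield sse_event("token", {"text": token})
    # Map 1-based indices back to real chunks, ignoring out-of-range values.
    citations = [chunks[i - 1] for i in cited_chunks if 1 <= i <= len(chunks)]
    # Metadata and citations arrive as the final event.
    yield sse_event("metadata", {"citations": citations})
```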
|
| 271 |
|
| 272 |
### 8.3 Tone Hints by Query Type
|
| 273 |
|
| 274 |
+
| Type | Tone Guidance |
|
|
|
|
|
|
|
| 275 |
|---|---|
|
| 276 |
+
| `fact_lookup` | Direct, no metaphors, cite per bullet. |
|
| 277 |
+
| `penalty_lookup` | Lead with consequence/penalty. |
|
| 278 |
+
| `cross_reference` | Explain primary section, then connections. |
|
| 279 |
+
| `conflict_detection` | Flag contradiction ONLY if both sides are in context. |
|
| 280 |
+
| `temporal` | Lead with exact numeric deadline/time. |
|
| 281 |
|
| 282 |
---
|
| 283 |
|
| 284 |
## 9. Validation
|
| 285 |
|
| 286 |
+
### 9.1 Validator Design
|
|
|
|
|
|
|
| 287 |
|
| 288 |
+
`validator_node()` treats `confidence_score < 0.2` as a hallucination risk.
|
| 289 |
+
- Returns `hallucination_flag: True` if score is below floor.
|
| 290 |
+
- Graph triggers a **retry** (up to 2 times) with different retrieval parameters if flagged.
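The retry decision can be sketched as a small routing function. The thresholds come from this document (confidence floor `0.2`, at most 2 retries, and no retry when nothing was reranked); the function and constant names are hypothetical.

```python
CONFIDENCE_FLOOR = 0.2
MAX_RETRIES = 2


def next_step(reranked_count: int, confidence_score: float, retry_count: int) -> str:
    """Decide whether the graph should retry retrieval or finish."""
    if reranked_count == 0:
        return "end"  # nothing to ground an answer on; retrying will not help
    if confidence_score < CONFIDENCE_FLOOR and retry_count < MAX_RETRIES:
        return "retry"
    return "end"
```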
|
| 291 |
|
| 292 |
+
### 9.2 Output Guardrails
|
|
|
|
|
|
|
| 293 |
|
| 294 |
+
`guardrails/output_guard.py`:
|
| 295 |
+
- Intercepts low-confidence answers and guardrail failures.
|
| 296 |
+
- Returns `InsufficientInfoResponse` when grounding is weak.
|
| 297 |
+
- Appends legal disclaimer.
|
| 298 |
|
| 299 |
---
|
| 300 |
|
|
|
|
| 302 |
|
| 303 |
### 10.1 Two-Phase Architecture
|
| 304 |
|
| 305 |
+
- **Phase 1:** Graph invocation -> `eval_phase1_results.json`.
|
| 306 |
+
- **Phase 2:** RAGAS scoring -> `eval_results.json`.
|
| 307 |
|
| 308 |
+
### 10.2 Dataset & Metrics
|
| 309 |
|
| 310 |
+
- **Rows:** 31 (Central, 4 States, Multi-Jurisdiction).
|
| 311 |
+
- **Primary Metrics:** Faithfulness, Answer Relevancy, Context Precision.
|
| 312 |
+
- **Goal:** Faithfulness > 0.85; Answer Relevancy > 0.80.
|
| 313 |
|
| 314 |
---
|
| 315 |
|
| 316 |
## 11. Known Failure Modes
|
| 317 |
|
| 318 |
+
- **Multi-Jurisdiction Retrieval:** Reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
|
| 319 |
+
- **Large Context Noise:** 7 chunks sometimes include irrelevant sub-clauses that distract the generator.
|
| 320 |
|
| 321 |
---
|
| 322 |
|
| 323 |
+
## 12. Implementation Checklist
|
| 324 |
|
| 325 |
+
- [x] Add `DocumentSpec` to registry.
|
| 326 |
+
- [x] Verify PDF text extraction.
|
| 327 |
+
- [x] Run `make ingest`.
|
| 328 |
+
- [x] Seed Neo4j graph.
|
| 329 |
+
- [x] Run `make eval-smoke` to verify precision.
|