adeshboudh16 committed on
Commit
8de7198
·
1 Parent(s): 77dd060

updated docs

Files changed (2)
  1. README.md +3 -3
  2. docs/RAG.md +69 -331
README.md CHANGED
@@ -16,8 +16,7 @@ pinned: false
16
  Open-source RAG system for querying Indian civic and legal documents — with accurate
17
  citations, cross-reference traversal, and conflict detection between laws.
18
 
19
- **Current status:** Phase 8 complete — 5-jurisdiction RERA coverage (Central + MH + UP + KA + TN),
20
- RAGAS evaluation pipeline live, hybrid RRF retrieval, Next.js frontend deployed on Vercel.
21
 
22
  ---
23
 
@@ -97,7 +96,7 @@ make serve
97
 
98
  ## Production
99
 
100
- - **Frontend:** [Vercel](https://civicsetu-two.vercel.app) — Next.js 15 App Router
101
  - **API:** [Hugging Face Spaces](https://huggingface.co/spaces/adesh01/civicsetu) — FastAPI + Docker + 550MB model baked in
102
  - **PostgreSQL + pgvector:** [Neon](https://neon.tech) — 1203 chunks
103
  - **Neo4j:** [AuraDB Free](https://neo4j.com/cloud/aura) — 2090 sections, 2321 edges
@@ -184,6 +183,7 @@ Graph: 2090 Section nodes, 1297 HAS_SECTION edges, 933 REFERENCES edges, 91 DERI
184
  | 6 | Next.js frontend, Vercel deployment, public URL | ✅ Complete |
185
  | 7 | Graph explorer, section content drawer, D3 visualization | ✅ Complete |
186
  | 8 | RAGAS eval pipeline, hybrid RRF retrieval, retrieval quality fixes | ✅ Complete |
 
187
 
188
 
189
  ---
 
16
  Open-source RAG system for querying Indian civic and legal documents — with accurate
17
  citations, cross-reference traversal, and conflict detection between laws.
18
 
19
+ **Current status:** Phase 9 complete — 5-jurisdiction RERA coverage, RAGAS evaluation pipeline (0.90 faithfulness), hybrid RRF retrieval, and mobile-responsive Next.js frontend live on Vercel.
 
20
 
21
  ---
22
 
 
96
 
97
  ## Production
98
 
99
+ - **Frontend:** [Vercel](https://civicsetu-two.vercel.app) — Next.js 15 App Router (Mobile Responsive)
100
  - **API:** [Hugging Face Spaces](https://huggingface.co/spaces/adesh01/civicsetu) — FastAPI + Docker + 550MB model baked in
101
  - **PostgreSQL + pgvector:** [Neon](https://neon.tech) — 1203 chunks
102
  - **Neo4j:** [AuraDB Free](https://neo4j.com/cloud/aura) — 2090 sections, 2321 edges
 
183
  | 6 | Next.js frontend, Vercel deployment, public URL | ✅ Complete |
184
  | 7 | Graph explorer, section content drawer, D3 visualization | ✅ Complete |
185
  | 8 | RAGAS eval pipeline, hybrid RRF retrieval, retrieval quality fixes | ✅ Complete |
186
+ | 9 | Mobile responsiveness, frontend polish, dual-pane layout, interaction animations | ✅ Complete |
187
 
188
 
189
  ---
docs/RAG.md CHANGED
@@ -1,7 +1,7 @@
1
  # CivicSetu - RAG Techniques Reference
2
 
3
- **Version:** 2.2 - Cloud Sync + Ingestion Refresh
4
- **Last Updated:** 2026-04-30
5
 
6
  This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.
7
 
@@ -9,8 +9,11 @@ This document describes the retrieval-augmented generation stack currently used
9
 
10
  ## 1. Current Status Snapshot
11
 
12
- As of **2026-04-30**, CivicSetu's RAG app is usable end-to-end, with a fresh ingestion cycle completed.
13
 
 
 
 
14
  - **Cloud Infrastructure Live**
15
  - Relational & Vector: **Neon (Postgres + pgvector)**
16
  - Graph: **Neo4j AuraDB**
@@ -32,21 +35,19 @@ As of **2026-04-30**, CivicSetu's RAG app is usable end-to-end, with a fresh ing
32
  - streaming path reuses classifier, retrieval, and reranker
33
  - answer text streams first
34
  - citations and metadata are extracted in a second fast pass
35
- - **Latest eval artifact improved a lot over old smoke baseline**
36
  - `eval_results.json` dated **2026-04-28**
37
  - `faithfulness=0.900`
38
  - `answer_relevancy=0.858`
39
  - `context_precision=0.696`
40
  - `pass_rate=0.581`
41
- - **Knowledge Graph Scale (as of 2026-04-30)**
42
  - Documents: `6`
43
- - Sections: `1,160`
44
- - `REFERENCES` edges: `314`
45
- - `DERIVED_FROM` edges: `62`
46
  - **Main remaining weakness**
47
- - multi-jurisdiction retrieval still weak
48
- - `MULTI` rows pass only `20%`
49
- - common `fact_lookup` traffic is still less reliable than penalty or graph-heavy queries
50
 
51
  ---
52
 
@@ -121,7 +122,7 @@ Current defaults from `config/settings.py`:
121
  - `embedding_model = nomic-embed-text`
122
  - `embedding_dimension = 768`
123
 
124
- Query and document embeddings use asymmetric prefixes compatible with Nomic-style retrieval.
125
 
126
  ### 3.6 Graph Seeding
127
 
@@ -156,10 +157,10 @@ Current route mapping:
156
  | Query Type | Route |
157
  |---|---|
158
  | `fact_lookup` | `vector_retrieval` |
159
- | `cross_reference` | `graph_retrieval` |
160
  | `penalty_lookup` | `graph_retrieval` |
161
  | `temporal` | `graph_retrieval` |
162
- | `conflict_detection` | `hybrid_retrieval` |
163
 
164
  Classifier fallback: if JSON parse fails, default to `fact_lookup` with original query.
165
 
@@ -170,22 +171,20 @@ All non-streaming LLM calls use `_llm_call()`. Streaming uses `_llm_stream()`.
170
  Current model chain:
171
 
172
  ```text
173
- THINKING tier
174
- 1. gemini/gemini-3.1-flash-lite-preview
175
  2. groq/llama-3.3-70b-versatile
176
- 3. openrouter/meta-llama/llama-3.3-70b-instruct:free
177
- 4. openrouter/qwen/qwen3.6-plus:free
178
 
179
- FAST tier
180
- 1. gemini/gemini-3.1-flash-lite-preview
181
  ```
182
 
183
  Provider notes:
184
 
185
- - non-NVIDIA models go through LiteLLM
186
- - NVIDIA-backed models use `ChatNVIDIA` directly
187
- - generator and metadata extraction use `temperature=0.0`
188
- - fast-tier tasks use the lighter chain
189
 
190
  ---
191
 
@@ -197,38 +196,21 @@ Hybrid retrieval combines vector similarity and PostgreSQL full-text search, the
197
 
198
  Used to catch semantic matches when wording differs from statute text.
199
 
200
- Strength:
201
-
202
- - good for paraphrase and plain-English phrasing
203
-
204
- Weakness:
205
-
206
- - can still over-focus on one jurisdiction or sub-clause family
207
-
208
  ### 5.2 Full-Text Search
209
 
210
- Used for exact legal wording, section numbers, and important terms.
211
-
212
- Strength:
213
-
214
- - precise keyword and section hits
215
-
216
- Weakness:
217
-
218
- - misses paraphrases and concept-only questions
219
 
220
  ### 5.3 Reciprocal Rank Fusion
221
 
222
  Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
223
 
224
- ### 5.4 Section Family Expansion
225
 
226
- Top merged sections expand to include parent and sibling sub-sections.
227
 
228
- Purpose:
229
 
230
- - restore surrounding legal context for split sections
231
- - prevent generator from seeing one isolated sub-clause only
232
 
233
  ---
234
 
@@ -239,17 +221,16 @@ Used for section-centric questions and legal relationships.
239
  Current behavior:
240
 
241
  - extract section or rule IDs from query
242
- - traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`) populated during [Graph Seeding](#36-graph-seeding)
243
  - hydrate matching sections back from Postgres
244
 
245
  Graph retrieval is especially important for:
246
 
247
  - explicit section lookups
248
  - penalty questions
249
- - temporal questions
250
  - central vs state derivation paths
251
 
252
- Pinned chunks stay ahead of reranked chunks so exact requested sections do not get buried.
253
 
254
  ---
255
 
@@ -257,36 +238,20 @@ Pinned chunks stay ahead of reranked chunks so exact requested sections do not g
257
 
258
  ### 7.1 Cross-Encoder
259
 
260
- `retrieval/reranker.py` uses FlashRank with current default:
261
-
262
- - `reranker_model = ms-marco-MiniLM-L-12-v2`
263
 
264
  Pipeline:
265
 
266
  1. deduplicate by `(section_id, doc_name)`
267
  2. split pinned vs rankable chunks
268
  3. rerank rankable chunks with cross-encoder
269
- 4. filter by minimum score
270
- 5. apply score-gap cutoff
271
  6. prepend pinned chunks
272
 
273
- ### 7.2 Current Thresholds
274
-
275
- Current defaults:
276
-
277
- - `reranker_score_threshold = 0.05`
278
- - `reranker_score_gap = 0.95`
279
-
280
- These are intentionally recall-friendly compared to older stricter settings.
281
 
282
- ### 7.3 Final Context Assembly
283
-
284
- Current max context size is **7 chunks**.
285
-
286
- Assembly rule:
287
-
288
- - pinned chunks first
289
- - then top reranked chunks until 7 total
290
 
291
  ---
292
 
@@ -294,98 +259,42 @@ Assembly rule:
294
 
295
  ### 8.1 Buffered Generation
296
 
297
- `generator_node()` builds a numbered context block and asks the model for JSON output:
298
-
299
- ```json
300
- {
301
- "answer": "<markdown>",
302
- "confidence_score": 0.0,
303
- "cited_chunks": [1, 3],
304
- "amendment_notice": null,
305
- "conflict_warnings": []
306
- }
307
- ```
308
-
309
- If parsing fails but raw text exists, answer text is salvaged and citations fall back to all visible chunks.
310
 
311
  ### 8.2 Streaming Generation
312
 
313
  `stream_generator_node()` now drives SSE output.
314
-
315
- Flow:
316
-
317
- 1. run classifier, retrieval, reranker
318
- 2. stream plain-text answer tokens to client
319
- 3. run a second fast metadata extraction prompt
320
- 4. map `cited_chunks` indices back to real citations
321
-
322
- Why split answer and metadata:
323
-
324
- - keeps first-token latency lower
325
- - avoids forcing model to stream valid JSON
326
- - still preserves grounded citations in final event
327
 
328
  ### 8.3 Tone Hints by Query Type
329
 
330
- Current tone shaping:
331
-
332
- | Type | Current Hint |
333
  |---|---|
334
- | `fact_lookup` | direct answer, no analogies or metaphors |
335
- | `penalty_lookup` | lead with consequence |
336
- | `cross_reference` | explain cited section first, then linked sections present in context |
337
- | `conflict_detection` | only claim conflict if both sides are present |
338
- | `temporal` | lead with exact time value if present, otherwise say it is missing |
339
-
340
- ### 8.4 Citation Extraction
341
-
342
- Both buffered and streaming generators anchor citations from 1-based `cited_chunks` indices.
343
-
344
- Effect:
345
-
346
- - citations reflect chunks actually used by generator
347
- - not every retrieved chunk becomes a citation
348
 
349
  ---
350
 
351
  ## 9. Validation
352
 
353
- ### 9.1 Current Validator Design
354
-
355
- Current validator is lightweight and **not** an LLM verifier.
356
 
357
- `validator_node()` currently:
 
 
358
 
359
- - returns early if answer or chunks are missing
360
- - treats `confidence_score < 0.2` as a hallucination risk signal
361
- - otherwise preserves generator confidence
362
 
363
- This is cheaper and faster than second-pass validation, but weaker as a grounding guarantee.
364
-
365
- ### 9.2 Retry Logic
366
-
367
- Current retry rule:
368
-
369
- ```text
370
- no reranked chunks -> end
371
- confidence < 0.2 and retry_count < 2 -> retry
372
- otherwise -> end
373
- ```
374
-
375
- Max retries: **2**
376
-
377
- Consequence:
378
-
379
- - avoids excessive cost
380
- - surfaces more low-confidence answers instead of repeatedly re-running graph
381
-
382
- ### 9.3 Output Guardrails
383
-
384
- `guardrails/output_guard.py` still does final shaping:
385
-
386
- - below confidence floor, return `InsufficientInfoResponse`
387
- - always append disclaimer
388
- - input guard handles safety/PII before graph execution
389
 
390
  ---
391
 
@@ -393,199 +302,28 @@ Consequence:
393
 
394
  ### 10.1 Two-Phase Architecture
395
 
396
- Evaluation is checkpointed into two phases:
397
-
398
- - **Phase 1:** invoke graph -> `eval_phase1_results.json`
399
- - **Phase 2:** score with RAGAS -> `eval_results.json`
400
-
401
- Important current behavior:
402
-
403
- - phase 1 resumes from valid cached rows
404
- - phase 2 resumes from already scored rows
405
- - failures do not require restarting from row 1
406
 
407
- ### 10.2 Dataset
408
 
409
- `eval/golden_dataset.jsonl` currently has **31 rows** across:
410
-
411
- - `CENTRAL`
412
- - `MAHARASHTRA`
413
- - `UTTAR_PRADESH`
414
- - `KARNATAKA`
415
- - `TAMIL_NADU`
416
- - `MULTI`
417
-
418
- Each row includes:
419
-
420
- - `query`
421
- - `query_type`
422
- - `ground_truth`
423
- - `expected_section_ids`
424
-
425
- ### 10.3 Judge Providers
426
-
427
- Current supported judge providers:
428
-
429
- ```bash
430
- JUDGE_PROVIDER=groq JUDGE_MODEL=llama-3.3-70b-versatile
431
- JUDGE_PROVIDER=gemini JUDGE_MODEL=gemini/gemma-4-31b-it
432
- JUDGE_PROVIDER=openrouter JUDGE_MODEL=meta-llama/llama-3.3-70b-instruct
433
- JUDGE_PROVIDER=osmapi JUDGE_MODEL=qwen3.5-397b-a17b
434
- JUDGE_PROVIDER=nvidia JUDGE_MODEL=z-ai/glm4.7
435
- ```
436
-
437
- Current rate-limit controls:
438
-
439
- - `PHASE2_DELAY_SEC=20`
440
- - `PHASE2_MAX_RETRIES=4`
441
- - retry delay parsed from provider errors when available
442
-
443
- ### 10.4 Judge Input Trimming
444
-
445
- Current defaults:
446
-
447
- ```text
448
- RAGAS_MAX_CONTEXTS = 7
449
- RAGAS_CONTEXT_CHAR_LIMIT = 1200
450
- RAGAS_ANSWER_CHAR_LIMIT = 1500
451
- RAGAS_REFERENCE_CHAR_LIMIT = 600
452
- ```
453
-
454
- ### 10.5 Latest Full Eval Results
455
-
456
- Source: `eval_results.json` generated on **2026-04-28**
457
-
458
- Run config captured in artifact:
459
-
460
- - graph model: `openai/z-ai/glm4.7`
461
- - judge model: `qwen3.5-397b-a17b`
462
- - pass threshold: `0.7`
463
-
464
- Overall:
465
-
466
- | Metric | Value |
467
- |---|---|
468
- | Faithfulness | `0.900` |
469
- | Answer Relevancy | `0.858` |
470
- | Context Precision | `0.696` |
471
- | Pass Rate | `0.581` |
472
- | P50 Latency | `101,853.7 ms` |
473
- | P90 Latency | `161,085.2 ms` |
474
-
475
- By query type:
476
-
477
- | Query Type | Faithfulness | Relevancy | Precision | Pass Rate |
478
- |---|---|---|---|---|
479
- | `penalty_lookup` | `0.866` | `0.873` | `1.000` | `1.000` |
480
- | `temporal` | `0.965` | `0.841` | `0.713` | `0.500` |
481
- | `cross_reference` | `0.894` | `0.796` | `0.736` | `0.500` |
482
- | `conflict_detection` | `0.899` | `0.911` | `0.551` | `0.500` |
483
- | `fact_lookup` | `0.880` | `0.865` | `0.513` | `0.429` |
484
-
485
- By jurisdiction:
486
-
487
- | Jurisdiction | Faithfulness | Relevancy | Precision | Pass Rate |
488
- |---|---|---|---|---|
489
- | `TAMIL_NADU` | `0.978` | `0.900` | `0.747` | `0.800` |
490
- | `KARNATAKA` | `0.907` | `0.847` | `0.851` | `0.800` |
491
- | `UTTAR_PRADESH` | `0.978` | `0.904` | `0.600` | `0.600` |
492
- | `MAHARASHTRA` | `0.834` | `0.879` | `0.702` | `0.600` |
493
- | `CENTRAL` | `0.933` | `0.792` | `0.912` | `0.500` |
494
- | `MULTI` | `0.763` | `0.838` | `0.322` | `0.200` |
495
-
496
- Interpretation:
497
-
498
- - grounding is now strong enough that hallucination is no longer the primary failure mode
499
- - ranking quality improved materially
500
- - multi-jurisdiction comparison remains the biggest retrieval gap
501
- - eval latency is high enough that streaming is important for user-facing UX
502
 
503
  ---
504
 
505
  ## 11. Known Failure Modes
506
 
507
- ### 11.1 Multi-Jurisdiction Retrieval
508
-
509
- Cross-jurisdiction questions still underperform.
510
-
511
- Observed symptom:
512
-
513
- - `MULTI` pass rate only `0.200`
514
-
515
- Likely causes:
516
-
517
- - one jurisdiction dominates semantic retrieval
518
- - reranker still sees mixed chunks without explicit two-sided decomposition
519
-
520
- ### 11.2 Fact Lookup Precision
521
-
522
- `fact_lookup` is still weakest among common traffic:
523
-
524
- - pass rate `0.429`
525
- - context precision `0.513`
526
-
527
- Likely causes:
528
-
529
- - broad questions invite many semantically related sub-clauses
530
- - exact top-level section sometimes loses to more specific descendants
531
-
532
- ### 11.3 Latency
533
-
534
- Eval-mode latency remains large:
535
-
536
- - p50 ~102 seconds
537
- - p90 ~161 seconds
538
-
539
- Implication:
540
-
541
- - buffered endpoint is expensive for user experience
542
- - streaming endpoint is necessary, not optional
543
 
544
  ---
545
 
546
- ## 12. Configuration Reference
547
-
548
- Important RAG settings from `config/settings.py`:
549
-
550
- | Parameter | Default | Effect |
551
- |---|---|---|
552
- | `primary_model` | `gemini/gemini-3.1-flash-lite-preview` | main thinking-tier model |
553
- | `fast_model` | `gemini/gemini-3.1-flash-lite-preview` | fast-tier tasks |
554
- | `fallback_model_1` | `groq/llama-3.3-70b-versatile` | first fallback |
555
- | `fallback_model_2` | `openrouter/meta-llama/llama-3.3-70b-instruct:free` | second fallback |
556
- | `fallback_model_3` | `openrouter/qwen/qwen3.6-plus:free` | third fallback |
557
- | `embedding_model` | `nomic-embed-text` | embedding model |
558
- | `embedding_dimension` | `768` | pgvector dimension |
559
- | `reranker_model` | `ms-marco-MiniLM-L-12-v2` | FlashRank cross-encoder |
560
- | `reranker_score_threshold` | `0.05` | minimum score to keep |
561
- | `reranker_score_gap` | `0.95` | score-cliff cutoff |
562
-
563
- Important eval env vars:
564
-
565
- | Env Var | Effect |
566
- |---|---|
567
- | `EVAL_PHASE` | run phase 1 only, phase 2 only, or both |
568
- | `EVAL_LIMIT` | limit number of rows |
569
- | `EVAL_IDS` | evaluate only specific row IDs |
570
- | `EVAL_JURISDICTION` | evaluate one jurisdiction only |
571
- | `JUDGE_PROVIDER` | judge backend |
572
- | `JUDGE_MODEL` | judge model |
573
- | `NO_REASONING` | disable thinking where supported |
574
- | `PASS_THRESHOLD` | pass threshold |
575
- | `PHASE2_DELAY_SEC` | inter-row delay for judge calls |
576
- | `RAGAS_MAX_CONTEXTS` | contexts sent to judge |
577
- | `RAGAS_CONTEXT_CHAR_LIMIT` | trim limit per context |
578
-
579
- ---
580
-
581
- ## 13. Implementation Checklist
582
-
583
- When adding a new jurisdiction or corpus:
584
 
585
- - add `DocumentSpec` with correct page cap
586
- - verify PDF is text-extractable, not image-only
587
- - ingest and inspect fallback chunking logs
588
- - verify chunk counts in Postgres
589
- - run graph seeding (manual via `scripts/seed_phase3.py` or via `scripts/ingest.py`)
590
- - add eval rows for all major query types
591
- - rerun full eval and inspect jurisdiction-specific precision
 
1
  # CivicSetu - RAG Techniques Reference
2
 
3
+ **Version:** 2.3 - Mobile Ledger + Quality Hardening
4
+ **Last Updated:** 2026-05-01
5
 
6
  This document describes the retrieval-augmented generation stack currently used in CivicSetu, what is live in the app today, and where the weak spots still are.
7
 
 
9
 
10
  ## 1. Current Status Snapshot
11
 
12
+ As of **2026-05-01**, CivicSetu's RAG app is at production-grade stability (v1.0.0-level), with mobile responsiveness and retrieval quality fixes live.
13
 
14
+ - **Phase 9 Complete (Mobile Responsive)**
15
+ - Dual-pane layout for desktop; tabbed "Digital Ledger" UI for mobile.
16
+ - Interactive Graph Explorer with section drill-down.
17
  - **Cloud Infrastructure Live**
18
  - Relational & Vector: **Neon (Postgres + pgvector)**
19
  - Graph: **Neo4j AuraDB**
 
35
  - streaming path reuses classifier, retrieval, and reranker
36
  - answer text streams first
37
  - citations and metadata are extracted in a second fast pass
38
+ - **Latest eval artifact (0.90 Faithfulness)**
39
  - `eval_results.json` dated **2026-04-28**
40
  - `faithfulness=0.900`
41
  - `answer_relevancy=0.858`
42
  - `context_precision=0.696`
43
  - `pass_rate=0.581`
44
+ - **Knowledge Graph Scale (as of 2026-05-01)**
45
  - Documents: `6`
46
+ - Sections: `2,090`
47
+ - Edges: `2,321` (REFERENCES, DERIVED_FROM, HAS_SECTION)
 
48
  - **Main remaining weakness**
49
+ - multi-jurisdiction retrieval still weak (`MULTI` rows pass only `20%`)
50
+ - context precision for broad fact lookups needs further HNSW tuning
 
51
 
52
  ---
53
 
 
122
  - `embedding_model = nomic-embed-text`
123
  - `embedding_dimension = 768`
124
 
125
+ Query and document embeddings use asymmetric prefixes (`search_query: ` vs `search_document: `) compatible with Nomic-style retrieval.
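
A minimal sketch of how the asymmetric prefixes could be applied before embedding. The function name `prepare_for_embedding` is hypothetical; only the two prefix strings come from the text above.

```python
# Prefix strings as stated in the doc; everything else here is illustrative.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def prepare_for_embedding(text: str, *, is_query: bool) -> str:
    """Return the text with the prefix matching its role (query vs. document)."""
    prefix = QUERY_PREFIX if is_query else DOCUMENT_PREFIX
    return prefix + text.strip()
```

Queries and chunks must go through the same helper with the right flag; mixing the prefixes would embed both sides in the wrong role.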
126
 
127
  ### 3.6 Graph Seeding
128
 
 
157
  | Query Type | Route |
158
  |---|---|
159
  | `fact_lookup` | `vector_retrieval` |
160
+ | `cross_reference` | `graph_retrieval` |
161
  | `penalty_lookup` | `graph_retrieval` |
162
  | `temporal` | `graph_retrieval` |
163
+ | `conflict_detection` | `hybrid_retrieval` |
164
 
165
  Classifier fallback: if JSON parse fails, default to `fact_lookup` with original query.
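
The fallback rule can be sketched roughly as follows. The JSON field names `query_type` and `rewritten_query`, and the function `parse_classification`, are assumptions for illustration; the route table mirrors the mapping above.

```python
import json

# Route mapping taken from the table in this section.
ROUTES = {
    "fact_lookup": "vector_retrieval",
    "cross_reference": "graph_retrieval",
    "penalty_lookup": "graph_retrieval",
    "temporal": "graph_retrieval",
    "conflict_detection": "hybrid_retrieval",
}


def parse_classification(raw: str, original_query: str) -> dict:
    """Parse classifier output; fall back to fact_lookup on any parse problem."""
    try:
        parsed = json.loads(raw)
        query_type = parsed["query_type"]  # hypothetical field name
        if query_type not in ROUTES:
            raise KeyError(query_type)
        query = parsed.get("rewritten_query") or original_query  # hypothetical field
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fallback described in the doc: fact_lookup with the original query.
        query_type, query = "fact_lookup", original_query
    return {"query_type": query_type, "query": query, "route": ROUTES[query_type]}
```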
166
 
 
171
  Current model chain:
172
 
173
  ```text
174
+ THINKING tier (Generator)
175
+ 1. gemini/gemini-1.5-flash
176
  2. groq/llama-3.3-70b-versatile
177
+ 3. NVIDIA NIM: z-ai/glm4.7 | minimaxai/minimax-m2.7
 
178
 
179
+ FAST tier (Classifier/Validator)
180
+ 1. gemini/gemini-1.5-flash
181
  ```
182
 
183
  Provider notes:
184
 
185
+ - NVIDIA-hosted models (Minimax, GLM) use `https://integrate.api.nvidia.com/v1`
186
+ - `temperature=0.0` for all grounding tasks
187
+ - Gemini models use `temperature=1.0` only on tiers where the provider requires it.
 
188
 
189
  ---
190
 
 
196
 
197
  Used to catch semantic matches when wording differs from statute text.
198
 
 
 
 
 
 
 
 
 
199
  ### 5.2 Full-Text Search
200
 
201
+ Used for exact legal wording, section numbers, and important terms via `websearch_to_tsquery`.
 
 
 
 
 
 
 
 
202
 
203
  ### 5.3 Reciprocal Rank Fusion
204
 
205
  Vector and FTS results are merged with RRF so chunks that rank well in both signals rise to the top.
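
A small sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The constant `k=60` is the value commonly used for RRF and is an assumption here; the doc does not state which value CivicSetu uses.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["a", "b", "c"]  # illustrative IDs, not real chunks
fts_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
```

Chunks that appear in both lists (`a` and `b` here) accumulate two contributions, so they outrank chunks seen by only one retriever.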
206
 
207
+ ### 5.4 Section-ID-Aware Direct Lookup
208
 
209
+ If a query contains explicit section/rule numbers (e.g., "Section 18 refund"), the retriever performs a direct indexed lookup for those sections and **pins** them to the top of the retrieval list. This acts as a safety net when semantic search fails to rank the exact section high enough.
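
One way the extraction and pinning could look. The regex, the `kind:number` ID format, and both function names are hypothetical; the real retriever may normalise section IDs differently.

```python
import re
from typing import Dict, List

# Matches "Section 18", "rule 3A", etc. This pattern is an illustrative assumption.
SECTION_PATTERN = re.compile(r"\b(section|rule)\s+(\d+[A-Za-z]?)", re.IGNORECASE)


def extract_section_ids(query: str) -> List[str]:
    """Return explicit section/rule references in order of first appearance."""
    found: List[str] = []
    for kind, number in SECTION_PATTERN.findall(query):
        ref = f"{kind.lower()}:{number.upper()}"
        if ref not in found:
            found.append(ref)
    return found


def pin_direct_hits(direct_hits: List[Dict], ranked: List[Dict]) -> List[Dict]:
    """Place directly looked-up chunks first and drop their duplicates from the rest."""
    pinned_ids = {chunk["section_id"] for chunk in direct_hits}
    rest = [chunk for chunk in ranked if chunk["section_id"] not in pinned_ids]
    return direct_hits + rest
```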
210
 
211
+ ### 5.5 Central Act Supplementation
212
 
213
+ For queries filtered by a specific State Jurisdiction (e.g., Maharashtra), the retriever automatically supplements results with chunks from the **Central RERA Act 2016**. This is critical because state rules often omit core definitions or penalties that are defined once in the Central Act.
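
A rough sketch of the supplementation step, assuming chunks are dicts with `section_id` and `doc_name` keys. The function name and the `max_central` cap are hypothetical; the doc does not say how many Central Act chunks are added.

```python
from typing import Dict, List, Optional


def supplement_with_central(
    state_chunks: List[Dict],
    central_chunks: List[Dict],
    jurisdiction: Optional[str],
    max_central: int = 2,  # assumed cap, not taken from the doc
) -> List[Dict]:
    """Append a few Central Act chunks when the query is filtered to a state."""
    if jurisdiction is None or jurisdiction == "CENTRAL":
        return list(state_chunks)  # nothing to supplement
    seen = {(c["section_id"], c["doc_name"]) for c in state_chunks}
    extras = [c for c in central_chunks if (c["section_id"], c["doc_name"]) not in seen]
    return list(state_chunks) + extras[:max_central]
```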
 
214
 
215
  ---
216
 
 
221
  Current behavior:
222
 
223
  - extract section or rule IDs from query
224
+ - traverse Neo4j relationships (`REFERENCES` and `DERIVED_FROM`)
225
  - hydrate matching sections back from Postgres
226
 
227
  Graph retrieval is especially important for:
228
 
229
  - explicit section lookups
230
  - penalty questions
 
231
  - central vs state derivation paths
232
 
233
+ Pinned chunks (from direct lookup or graph traversal) stay ahead of reranked chunks.
234
 
235
  ---
236
 
 
238
 
239
  ### 7.1 Cross-Encoder
240
 
241
+ `retrieval/reranker.py` uses FlashRank (`ms-marco-MiniLM-L-12-v2`).
 
 
242
 
243
  Pipeline:
244
 
245
  1. deduplicate by `(section_id, doc_name)`
246
  2. split pinned vs rankable chunks
247
  3. rerank rankable chunks with cross-encoder
248
+ 4. filter by minimum score (0.05)
249
+ 5. apply score-gap cutoff (0.95)
250
  6. prepend pinned chunks
251
 
252
+ ### 7.2 Context Assembly
 
 
 
 
 
 
 
253
 
254
+ Max context size is **7 chunks**. Pinned chunks (exact matches) are never discarded by the reranker unless the context is fully saturated.
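
The post-rerank steps and the 7-chunk cap could be sketched as below. The threshold values come from this section; the reading of the score-gap cutoff (stop when the drop between consecutive scores exceeds the gap) is an assumption, and `assemble_context` is a hypothetical name rather than the function in `retrieval/reranker.py`.

```python
from typing import Dict, List, Tuple

MIN_SCORE = 0.05   # minimum reranker score to keep
SCORE_GAP = 0.95   # assumed meaning: max allowed drop between neighbours
MAX_CONTEXT = 7    # max chunks passed to the generator

Chunk = Dict[str, str]


def _key(chunk: Chunk) -> Tuple[str, str]:
    return (chunk["section_id"], chunk["doc_name"])


def assemble_context(pinned: List[Chunk], scored: List[Tuple[Chunk, float]]) -> List[Chunk]:
    """Pinned chunks first, then reranked chunks that survive both cutoffs."""
    seen = set()
    context: List[Chunk] = []
    for chunk in pinned:  # deduplicate pinned chunks, keep their order
        if _key(chunk) not in seen:
            seen.add(_key(chunk))
            context.append(chunk)
    previous = None
    for chunk, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if _key(chunk) in seen:
            continue  # duplicate of a pinned or already kept chunk
        if score < MIN_SCORE:
            break  # scores are descending, so nothing later can pass
        if previous is not None and previous - score > SCORE_GAP:
            break  # score cliff
        seen.add(_key(chunk))
        context.append(chunk)
        previous = score
    return context[:MAX_CONTEXT]
```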
 
 
 
 
 
 
 
255
 
256
  ---
257
 
 
259
 
260
  ### 8.1 Buffered Generation
261
 
262
+ `generator_node()` builds a numbered context block and asks for JSON output.
 
 
 
 
 
 
 
 
 
 
 
 
263
 
264
  ### 8.2 Streaming Generation
265
 
266
  `stream_generator_node()` now drives SSE output.
267
+ 1. Run classification/retrieval/reranking.
268
+ 2. Stream answer tokens immediately.
269
+ 3. Run a second fast metadata extraction prompt.
270
+ 4. Push metadata/citations as the final SSE event.
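
The flow above can be sketched as a generator that emits SSE-formatted strings. The event names `token` and `final`, and the `extract_metadata` callback, are hypothetical stand-ins for whatever `stream_generator_node()` actually uses.

```python
import json
from typing import Callable, Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Event with a named event type and JSON payload."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(
    tokens: Iterable[str],
    extract_metadata: Callable[[str], dict],
) -> Iterator[str]:
    """Stream answer tokens first, then one final event with citations/metadata."""
    parts = []
    for token in tokens:
        parts.append(token)
        yield sse_event("token", {"text": token})
    # Second pass: metadata is extracted from the complete answer text.
    yield sse_event("final", extract_metadata("".join(parts)))
```

Keeping metadata out of the token stream means the model never has to stream valid JSON, and the client can render text as soon as the first token arrives.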
 
 
 
 
 
 
 
 
 
271
 
272
  ### 8.3 Tone Hints by Query Type
273
 
274
+ | Type | Tone Guidance |
 
 
275
  |---|---|
276
+ | `fact_lookup` | Direct, no metaphors, cite per bullet. |
277
+ | `penalty_lookup` | Lead with consequence/penalty. |
278
+ | `cross_reference` | Explain primary section, then connections. |
279
+ | `conflict_detection` | Flag contradiction ONLY if both sides are in context. |
280
+ | `temporal` | Lead with exact numeric deadline/time. |
 
 
 
 
 
 
 
 
 
281
 
282
  ---
283
 
284
  ## 9. Validation
285
 
286
+ ### 9.1 Validator Design
 
 
287
 
288
+ `validator_node()` treats `confidence_score < 0.2` as a hallucination risk.
289
+ - Returns `hallucination_flag: True` if score is below floor.
290
+ - Graph triggers a **retry** (up to 2 times) with different retrieval parameters if flagged.
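
A minimal sketch of the flag-and-retry rule, assuming the graph state is a dict with `confidence_score` and `retry_count` keys. The function names `validate` and `next_step` are illustrative, not the project's actual node functions.

```python
CONFIDENCE_FLOOR = 0.2  # below this, the answer is treated as a hallucination risk
MAX_RETRIES = 2


def validate(state: dict) -> dict:
    """Return a copy of the state with hallucination_flag set from the confidence score."""
    flagged = state["confidence_score"] < CONFIDENCE_FLOOR
    return {**state, "hallucination_flag": flagged}


def next_step(state: dict) -> str:
    """Retry only when flagged and the retry budget is not exhausted."""
    if state["hallucination_flag"] and state.get("retry_count", 0) < MAX_RETRIES:
        return "retry"
    return "end"
```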
291
 
292
+ ### 9.2 Output Guardrails
 
 
293
 
294
+ `guardrails/output_guard.py`:
295
+ - Intercepts low-confidence answers and safeguard failures.
296
+ - Returns `InsufficientInfoResponse` when grounding is weak.
297
+ - Appends legal disclaimer.
 
298
 
299
  ---
300
 
 
302
 
303
  ### 10.1 Two-Phase Architecture
304
 
305
+ - **Phase 1:** Graph invocation -> `eval_phase1_results.json`.
306
+ - **Phase 2:** RAGAS scoring -> `eval_results.json`.
 
 
 
 
 
 
 
 
307
 
308
+ ### 10.2 Dataset & Metrics
309
 
310
+ - **Rows:** 31 (Central, 4 States, Multi-Jurisdiction).
311
+ - **Primary Metrics:** Faithfulness, Answer Relevancy, Context Precision.
312
+ - **Goal:** Faithfulness > 0.85; Answer Relevancy > 0.80.
 
313
 
314
  ---
315
 
316
  ## 11. Known Failure Modes
317
 
318
+ - **Multi-Jurisdiction Retrieval:** Reranker often prefers one jurisdiction's terminology, leading to unbalanced context for comparison queries.
319
+ - **Large Context Noise:** 7 chunks sometimes include irrelevant sub-clauses that distract the generator.
 
320
 
321
  ---
322
 
323
+ ## 12. Implementation Checklist
 
324
 
325
+ - [x] Add `DocumentSpec` to registry.
326
+ - [x] Verify PDF text extraction.
327
+ - [x] Run `make ingest`.
328
+ - [x] Seed Neo4j graph.
329
+ - [x] Run `make eval-smoke` to verify precision.