# CivicSetu β€” Low Level Design (LLD)

**Version:** 2.0.0 β€” Phase 8 Complete (RAGAS Evaluation + Retrieval Improvements)
**Live:** https://civicsetu-two.vercel.app
**Last Updated:** April 2026

---

## 1. Module Map

```
src/civicsetu/
β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ settings.py           Pydantic BaseSettings singleton (lru_cache)
β”‚   └── document_registry.py  All document URLs + metadata (single source of truth)
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ enums.py              StrEnum: Jurisdiction, DocType, QueryType, etc.
β”‚   └── schemas.py            Pydantic models: LegalChunk, Citation, RetrievedChunk, CivicSetuResponse
β”œβ”€β”€ ingestion/
β”‚   β”œβ”€β”€ downloader.py         httpx PDF downloader with MD5 cache check
β”‚   β”œβ”€β”€ parser.py             PyMuPDF text extractor β€” max_pages cap, scanned PDF detection
β”‚   β”œβ”€β”€ chunker.py            Section-boundary regex chunker β€” 6 format patterns + fallback
β”‚   β”œβ”€β”€ metadata_extractor.py Date/Section/Rule reference/amendment regex extraction
β”‚   β”œβ”€β”€ embedder.py           nomic-embed-text-v1.5 via sentence-transformers β€” truncate at 4000 chars pre-prefix
β”‚   β”œβ”€β”€ pipeline.py           Orchestrates ingestion; prepends section_title to embeddings
β”‚   └── graph_seeder.py       Post-ingestion REFERENCES + DERIVED_FROM edge seeding
β”œβ”€β”€ stores/
β”‚   β”œβ”€β”€ relational_store.py   Async SQLAlchemy β€” documents + legal_chunks tables
β”‚   β”œβ”€β”€ vector_store.py       pgvector HNSW cosine search
β”‚   └── graph_store.py        Neo4j Cypher interface β€” fresh driver per call
β”œβ”€β”€ retrieval/
β”‚   β”œβ”€β”€ vector_retriever.py   Wraps VectorStore for agent use
β”‚   β”œβ”€β”€ graph_retriever.py    REFERENCES + DERIVED_FROM traversal, Section/Rule ID extraction
β”‚   └── reranker.py           FlashRank cross-encoder wrapper
β”œβ”€β”€ agent/
β”‚   β”œβ”€β”€ state.py              CivicSetuState TypedDict (frozen contract)
β”‚   β”œβ”€β”€ nodes.py              Pure functions: classifier, _rrf_retrieve (shared hybrid),
β”‚   β”‚                         vector_retrieval, graph_retrieval, hybrid_retrieval,
β”‚   β”‚                         reranker, generator, validator
β”‚   β”œβ”€β”€ edges.py              Conditional routing: route_after_classifier,
β”‚   β”‚                         route_after_validator
β”‚   └── graph.py              StateGraph assembly + get_compiled_graph()
β”œβ”€β”€ prompts/
β”‚   β”œβ”€β”€ classifier.py         Query type classification + rewriting prompt
β”‚   β”œβ”€β”€ generator.py          Cited answer generation prompt
β”‚   └── validator.py          Hallucination + confidence check prompt
β”œβ”€β”€ guardrails/
β”‚   β”œβ”€β”€ input_guard.py        PII detection + off-topic filter
β”‚   └── output_guard.py       Faithfulness check + disclaimer injection
└── api/
    β”œβ”€β”€ main.py               FastAPI app factory + lifespan (graph pre-compiled)
    β”œβ”€β”€ routes/
    β”‚   β”œβ”€β”€ health.py         GET /health β€” DB ping
    β”‚   β”œβ”€β”€ query.py          POST /api/v1/query β€” main RAG endpoint
    β”‚   └── ingest.py         POST /api/v1/ingest β€” admin endpoint
    └── middleware/
        └── logging.py        Request/response structured logging

eval/
β”œβ”€β”€ golden_dataset.jsonl      31-row RAGAS evaluation dataset across 5 jurisdictions
scripts/
β”œβ”€β”€ run_eval.py               Two-phase RAGAS evaluation: Phase 1 (graph invoke) + Phase 2 (RAGAS scoring)

frontend/                     Next.js 15 App Router β€” deployed on Vercel
β”œβ”€β”€ src/app/
β”‚   β”œβ”€β”€ layout.tsx            Root layout: ThemeProvider + dark mode
β”‚   β”œβ”€β”€ page.tsx              Main page: wires all components together
β”‚   └── globals.css           Tailwind directives + gradient utilities
β”œβ”€β”€ src/components/
β”‚   β”œβ”€β”€ Header.tsx            Logo, new chat, theme toggle, GitHub link
β”‚   β”œβ”€β”€ ChatThread.tsx        Scrollable message list + empty state examples
β”‚   β”œβ”€β”€ MessageBubble.tsx     User/assistant/error bubbles with badges + citations
β”‚   β”œβ”€β”€ ConfidenceBadge.tsx   HIGH/MEDIUM/LOW pill
β”‚   β”œβ”€β”€ CitationsPanel.tsx    Collapsible citation cards
β”‚   └── InputBar.tsx          Auto-resize textarea, jurisdiction select, send
β”œβ”€β”€ src/hooks/
β”‚   └── useChat.ts            Chat state, session_id localStorage, sendMessage
└── src/lib/
    β”œβ”€β”€ types.ts              TypeScript interfaces (mirrors backend Pydantic models)
    └── api.ts                queryRera() fetch wrapper β†’ /api/v1/query
```

---

## 2. Database Schema

### PostgreSQL Tables

```sql
documents (
    doc_id          UUID PRIMARY KEY,
    doc_name        TEXT,
    jurisdiction    TEXT,   -- Jurisdiction enum value
    doc_type        TEXT,   -- DocType enum value  (stored uppercase: ACT, RULES, CIRCULAR)
    source_url      TEXT,
    effective_date  DATE,
    gazette_number  TEXT,
    total_chunks    INTEGER,
    ingested_at     TIMESTAMPTZ,
    is_active       BOOLEAN
)

legal_chunks (
    chunk_id            UUID PRIMARY KEY,
    doc_id              UUID β†’ documents.doc_id,
    jurisdiction        TEXT,
    doc_type            TEXT,
    doc_name            TEXT,
    section_id          TEXT,   -- "18", "3(2)", "Para-3"
    section_title       TEXT,
    section_hierarchy   TEXT[], -- ["RERA Act 2016", "18"]
    text                TEXT,
    effective_date      DATE,
    superseded_by       UUID β†’ legal_chunks.chunk_id,
    status              TEXT,   -- ChunkStatus enum value
    source_url          TEXT,
    page_number         INTEGER,
    embedding           vector(768)  -- HNSW indexed
)
```

### pgvector Index

```sql
CREATE INDEX legal_chunks_embedding_idx
    ON legal_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

`m=16` β€” 16 connections per node. `ef_construction=64` β€” 64 candidates during index build.
Tuned for recall/speed balance at <10K vectors. Revisit at 100K+.

### Neo4j Graph Schema

```
Nodes:
  (:Document {doc_id, doc_name, jurisdiction, doc_type, effective_date})
  (:Section  {section_id, title, chunk_id, jurisdiction, doc_name, is_active})

Edges:
  (:Document)-[:HAS_SECTION]->(:Section)
  (:Section) -[:REFERENCES]->(:Section)       -- intra + cross-jurisdiction citations
  (:Section) -[:DERIVED_FROM]->(:Section)     -- State Rule N β†’ RERA Act Sec M
  (:Document)-[:DERIVED_FROM]->(:Document)    -- State Rules β†’ RERA Act 2016

Planned (Phase 7+):
  (:Section) -[:SUPERSEDES]->(:Section)
  (:Section) -[:AMENDED_BY]->(:Amendment)
  (:Section) -[:CONFLICTS_WITH]->(:Section)
```

**Live graph stats (Phase 6):**

| Metric       | Count |
|--------------|-------|
| Documents    | 9     |
| Sections     | 2090  |
| HAS_SECTION  | 1297  |
| REFERENCES   | 933   |
| DERIVED_FROM | 91    |

---

## 3. Document Registry

`document_registry.py` β€” single source of truth for all ingested documents.

```python
@dataclass(frozen=True)
class DocumentSpec:
    name: str
    url: str
    jurisdiction: Jurisdiction
    doc_type: DocType
    effective_date: date | None
    filename: str
    dest_subdir: str
    max_pages: int | None = None  # None = all pages; cap excludes forms/schedules appendices
```

### Ingested Documents (Phase 6)

| Key | Document | Jurisdiction | DocType | Chunks | max_pages |
|---|---|---|---|---|---|
| `rera_act_2016` | RERA Act 2016 | CENTRAL | ACT | ~224 | None |
| `mahrera_rules_2017` | MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | None |
| `up_rera_rules_2016` | UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 24 |
| `up_rera_general_regulations_2019` | UP RERA General Regulations 2019 | UTTAR_PRADESH | CIRCULAR | 85 | None |
| `karnataka_rera_rules_2017` | Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 37 |
| `tn_rera_rules_2017` | Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 15 |

**PDF source notes:**
- Karnataka official PDF (`rera.karnataka.gov.in`) is fully scanned (19MB image) β€” NAREDCO mirror used
- TN PDF bundles rules + forms (101 pages); `max_pages=15` excludes Forms A–O
- UP Rules PDF bundles rules + forms (52 pages); `max_pages=24` excludes prescribed forms

---

## 4. LangGraph State Machine

### State Contract (`agent/state.py`)

```python
class CivicSetuState(TypedDict):
    # Input
    query: str
    session_id: Optional[str]
    jurisdiction_filter: Optional[Jurisdiction]
    top_k: int

    # Classification
    query_type: Optional[QueryType]
    rewritten_query: Optional[str]

    # Retrieval β€” Annotated[list, operator.add] enables parallel node merging
    retrieved_chunks: Annotated[list[RetrievedChunk], operator.add]
    reranked_chunks: list[RetrievedChunk]

    # Generation
    raw_response: Optional[str]
    citations: list[Citation]
    confidence_score: float
    conflict_warnings: list[str]
    amendment_notice: Optional[str]

    # Control
    retry_count: int          # max 2 retries
    hallucination_flag: bool
    error: Optional[str]
```

### RetrievedChunk Schema (`models/schemas.py`)

```python
class RetrievedChunk(BaseModel):
    chunk: LegalChunk
    vector_score: float | None = None
    rerank_score: float | None = None
    retrieval_source: str = "vector"   # "vector" | "graph"
    graph_path: Optional[str] = None   # e.g. "source:18@CENTRAL"
    is_pinned: bool = False            # True = exact source section, bypasses reranker sort
```

### Node Responsibilities

| Node | Input Keys | Output Keys | LLM Call |
| :-- | :-- | :-- | :-- |
| classifier | query | query_type, rewritten_query | Yes |
| vector_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| graph_retrieval | rewritten_query, top_k | retrieved_chunks | No |
| reranker | retrieved_chunks, query | reranked_chunks | No |
| generator | reranked_chunks, query | raw_response, citations, confidence_score | Yes |
| validator | raw_response, reranked_chunks | hallucination_flag, confidence_score | Yes |
| retry | retry_count | retry_count+1, cleared retrieval fields | No |

### Routing Logic

| query_type (classifier) | route_after_classifier target            |
|-------------------------|------------------------------------------|
| fact_lookup             | vector_retrieval (RRF hybrid)            |
| cross_reference         | graph_retrieval (→ RRF fallback)         |
| penalty_lookup          | graph_retrieval (→ RRF fallback)         |
| temporal                | graph_retrieval (→ RRF fallback)         |
| conflict_detection      | hybrid_retrieval (RRF across jurisdictions) |

```
validator β†’ route_after_validator:
    confidence >= 0.5 AND not hallucinated β†’ END
    (confidence < 0.5 OR hallucinated) AND retry_count < 2 β†’ retry β†’ classifier
    (confidence < 0.5 OR hallucinated) AND retry_count >= 2 β†’ END (low confidence answer)
```
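The validator routing above can be sketched as a pure function (a minimal sketch with a hypothetical flat signature; the real `route_after_validator` in `agent/edges.py` reads these fields from `CivicSetuState`):

```python
MAX_RETRIES = 2

def route_after_validator(confidence: float, hallucinated: bool, retry_count: int) -> str:
    """Return the next graph target: 'END' or 'retry'."""
    if confidence >= 0.5 and not hallucinated:
        return "END"        # answer accepted
    if retry_count < MAX_RETRIES:
        return "retry"      # retry node increments retry_count, loops back to classifier
    return "END"            # give up: emit the low-confidence answer

print(route_after_validator(0.9, False, 0))  # END
print(route_after_validator(0.3, False, 1))  # retry
print(route_after_validator(0.3, True, 2))   # END
```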

---

## 5. Chunking Strategy

### Section Boundary Detection

Six regex patterns for `DocType.RULES`, tried in order (first match wins per line):

| # | Pattern | Format | Jurisdiction |
|---|---|---|---|
| 1 | `\n(?P<id>\d{1,2}[A-Z]?)\.\s*\n(?P<title>...)` | Newline-dot-newline | MahaRERA |
| 2 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...)\.?β€”` | Same-line em-dash | MahaRERA |
| 3 | `^Rule\s+(?P<id>\d{1,2}[A-Z]?)\s*[.\-–]\s*(?P<title>...)` | Explicit Rule prefix | Generic |
| 4 | `^\s*(?P<id>\d{1,2}[A-Z]?)\.\s+(?P<title>...?)\.–` | ASCII hyphen `.-` | Karnataka, Tamil Nadu |
| 5 | `(?P<id>\d{1,2}[A-Z]?)-\(1\)\s*\n(?P<title>...)` | `N-(1)\nTitle` | UP RERA multi-clause |
| 6 | `(?P<id>\d{1,2}[A-Z]?)-(?!\()\s*\n(?P<title>...)` | `N-\nTitle` | UP RERA single-clause |

`DocType.ACT` uses a separate pattern set. Fallback: paragraph split on double newlines.
Rule IDs capped at `\d{1,2}` (max 2 digits) β€” prevents year strings like `2016` matching as rule IDs.
Logs `no_section_boundaries_found` + `fallback_paragraph_chunking` when falling back.

### Chunk Size Limits

```
MIN_CHARS = 100   β€” discard fragments (headers, page numbers)
MAX_CHARS = 1500  β€” split large sections at subsection markers (1), (2), (a), (b)
```

### Split Priority for Large Sections

```
1. Subsection markers: \n\s*\((?:\d+|[a-z]{1,3})\)\s+
2. Sentence boundary near MAX_CHARS: rfind('. ')
3. Hard cut at MAX_CHARS (last resort)
```
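The split priority can be sketched as a recursive helper (a hypothetical standalone function; the real implementation lives in `ingestion/chunker.py` and may keep the subsection markers attached):

```python
import re

MAX_CHARS = 1500
SUBSECTION_RE = re.compile(r"\n\s*\((?:\d+|[a-z]{1,3})\)\s+")

def split_large_section(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split an oversized section: subsection markers first, then a sentence
    boundary near max_chars, then a hard cut as last resort."""
    if len(text) <= max_chars:
        return [text]
    # 1. Prefer subsection markers like (1), (2), (a)
    parts = [p.strip() for p in SUBSECTION_RE.split(text) if p.strip()]
    if len(parts) > 1:
        return parts
    # 2. Sentence boundary nearest to max_chars
    cut = text.rfind(". ", 0, max_chars)
    if cut != -1:
        return [text[:cut + 1].strip()] + split_large_section(text[cut + 1:].strip(), max_chars)
    # 3. Hard cut (last resort)
    return [text[:max_chars]] + split_large_section(text[max_chars:], max_chars)
```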

### parser.py β€” max_pages cap

```python
@staticmethod
def parse(source: str | Path, max_pages: int | None = None) -> ParsedDocument:
    all_pages = list(doc)
    if max_pages is not None:
        all_pages = all_pages[:max_pages]   # slice before fulltext build
```

---

## 6. Embedding Strategy

**Model:** `nomic-embed-text-v1.5` (via `sentence-transformers`, local β€” no Ollama required)
**Dimension:** 768
**Asymmetric prefixes** (MTEB/nomic-embed requirement):

```
Ingestion time:  "search_document: {section_title}\n{text}"  β†’ pipeline.py
Query time:      "search_query: {rewritten_query}"            β†’ retrieval/__init__.py
```

**Section title prepend (Phase 8 change):** `pipeline.py` prepends `section_title` to the
embedded text so sub-chunks (e.g. `S.11(2)`) retain their section context.
Without this, sub-chunks embed without "Obligations of promoter" β€” cosine similarity misses them.
The reranker still receives raw `chunk.text` (no title prefix).

Using the wrong prefix at query time causes roughly 10–15% recall degradation.

### Truncation Guard

```python
MAX_EMBED_CHARS = 4000   # ~1000 tokens β€” safe ceiling before prefix added

def embed_document(self, text: str) -> list[float]:
    if len(text) > MAX_EMBED_CHARS:
        log.warning("embedding_truncated", original_len=len(text), truncated_to=MAX_EMBED_CHARS)
        text = text[:MAX_EMBED_CHARS]
    prefixed = f"search_document: {text.strip()}"  # prefix AFTER truncation
    return self.embed_one(prefixed)
```

Truncation happens **before** the prefix is added — this prevented 500 errors from the earlier
Ollama-served embedder on Tamil Nadu and other gazette PDFs where sub-sections exceed 10K chars.

---

## 7. Hybrid Retrieval β€” `_rrf_retrieve()`

All retrieval nodes share a single async helper `_rrf_retrieve()` in `agent/nodes.py`.

### Reciprocal Rank Fusion (RRF)

```python
RRF_K = 60   # standard constant

rrf_score(chunk) = 1/(K + rank_in_vector) + 1/(K + rank_in_fts)
```

Fetches `top_k Γ— 3` vector results and `top_k Γ— 2` FTS results, deduplicates by `chunk_id`,
merges via RRF, returns top `top_k Γ— 2`.
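A self-contained sketch of the fusion step (hypothetical `rrf_merge` helper operating on ranked chunk-id lists; `_rrf_retrieve()` additionally handles the over-fetching and dedup described above):

```python
RRF_K = 60  # standard constant

def rrf_merge(vector_ids: list[str], fts_ids: list[str], top_n: int) -> list[str]:
    """Fuse two ranked chunk-id lists via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for rank, cid in enumerate(vector_ids):  # rank 0 = best hit
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (RRF_K + rank + 1)
    for rank, cid in enumerate(fts_ids):
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (RRF_K + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A chunk ranked in BOTH lists beats a chunk that appears in only one:
print(rrf_merge(["a", "b", "c"], ["b", "d"], top_n=3))  # ['b', 'a', 'd']
```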

### Full-Text Search

`VectorStore.full_text_search()` uses `websearch_to_tsquery` in OR mode:

```sql
WHERE to_tsvector('english', text) @@ websearch_to_tsquery('english', :query)
ORDER BY ts_rank(to_tsvector('english', text), websearch_to_tsquery('english', :query)) DESC
```

Changed from `plainto_tsquery` (AND-mode) β€” AND required all query words to match,
excluding relevant sections that matched most but not all words.

### Section Family Expansion

After RRF merge, top-3 results trigger family expansion:

```python
for rc in merged[:3]:
    base_sid = re.sub(r'\([^)]*\)$', '', section_id).strip()  # "5(4)" β†’ "5"
    family = await VectorStore.get_section_family(section_id=base_sid, jurisdiction=jur)
    # returns all chunks where section_id = '5' OR section_id LIKE '5(%'
```

`get_section_family` guard: skips if `section_id` already contains `(` (base_sid computation
strips this before calling). Hard cap: `_MAX_VECTOR_EXPANDED = 40` chunks before reranker.

**Why top-3 not top-1:** If top-1 RRF result is a sub-section (`S.5(4)`), its parent
family is expanded. But if the truly relevant parent section (`S.11`) appears at RRF rank 2,
only expanding top-1 misses it. Expanding top-3 covers more cases at the cost of a slightly
larger pool.

---

## 7b. Reranker Detail

`reranker_score_threshold = 0.1` β€” minimum cross-encoder score to enter candidate pool.
`reranker_score_gap = 0.6` β€” gap filter cliff threshold.

**Gap filter:**

```python
def _apply_score_gap(chunks, gap=0.6):
    for i in range(1, len(chunks)):
        if chunks[i-1].rerank_score - chunks[i].rerank_score >= gap:
            return chunks[:i]
    return chunks
```

**Threshold history:** Originally `threshold=0.3, gap=0.35`. Gap=0.35 was too aggressive β€”
cut chunks with 0.36 score drop, leaving only 1 context for generator. Raised to 0.6 (Phase 8).

Final context: `pinned_chunks + gap_filtered[:max(0, 5 - len(pinned))]` β†’ max 5 chunks.
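A minimal self-contained demo of the gap filter plus pinning assembly (the `Chunk` dataclass stands in for `RetrievedChunk`; `reranked` is assumed sorted by rerank score descending):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    name: str
    rerank_score: float
    is_pinned: bool = False

def apply_score_gap(chunks: list[Chunk], gap: float = 0.6) -> list[Chunk]:
    """Cut the ranked list at the first score cliff >= gap."""
    for i in range(1, len(chunks)):
        if chunks[i - 1].rerank_score - chunks[i].rerank_score >= gap:
            return chunks[:i]
    return chunks

def final_context(pinned: list[Chunk], reranked: list[Chunk], max_total: int = 5) -> list[Chunk]:
    """Pinned chunks bypass the gap filter; gap-filtered chunks fill the remaining slots."""
    kept = apply_score_gap(reranked)
    return pinned + kept[: max(0, max_total - len(pinned))]

reranked = [Chunk("a", 0.98), Chunk("b", 0.90), Chunk("c", 0.20)]  # 0.70 cliff after "b"
ctx = final_context([Chunk("pin", 1.0, is_pinned=True)], reranked)
print([c.name for c in ctx])  # ['pin', 'a', 'b']
```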

---

## 8. Graph Retriever

`graph_retriever.py` β€” called on `cross_reference`, `penalty_lookup`, `temporal` query types.

### Section ID Extraction

```python
section_pattern = re.compile(r'\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b', re.IGNORECASE)
rule_pattern    = re.compile(r'\bRule\s+(\d+[A-Z]?)\b', re.IGNORECASE)
```
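For example, running the two patterns over a query:

```python
import re

section_pattern = re.compile(r'\b(?:section|sec\.?|s\.)\s*(\d+[A-Z]?)\b', re.IGNORECASE)
rule_pattern    = re.compile(r'\bRule\s+(\d+[A-Z]?)\b', re.IGNORECASE)

query = "Does Rule 9 of the Karnataka rules implement Section 13 or S. 18A?"
print(section_pattern.findall(query))  # ['13', '18A']
print(rule_pattern.findall(query))     # ['9']
```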

### Traversal Strategy (per jurisdiction)

For each jurisdiction (`CENTRAL`, `MAHARASHTRA`, `UTTAR_PRADESH`, `KARNATAKA`, `TAMIL_NADU`):

```
1. Source section chunks    β€” exact section_id match β†’ is_pinned=True
2. REFERENCES outgoing      β€” sections source cites (depth=2)
3. REFERENCES incoming      β€” sections that cite source
4. DERIVED_FROM outgoing    β€” Act sections this Rule derives from
5. DERIVED_FROM incoming    β€” Rule sections implementing this Act section
```

### Pinning Rule

Only the exact `section_id` match gets `is_pinned=True`. Sub-sections are NOT pinned.
Max pinned chunks: 2 (one per jurisdiction). Remaining 3 slots filled by reranker.

---

## 9. Response Contract

```python
class CivicSetuResponse(BaseModel):
    answer: str                    # plain English, cites section numbers
    citations: list[Citation]      # min_length=1 — NEVER empty
    confidence_score: float        # 0.0–1.0
    confidence_level: str          # "high" / "medium" / "low"
    query_type_resolved: QueryType
    conflict_warnings: list[str]   # empty until Phase 7
    amendment_notice: Optional[str]
    disclaimer: str                # always present

class Citation(BaseModel):
    section_id: str
    doc_name: str
    jurisdiction: Jurisdiction
    effective_date: Optional[date]
    source_url: str
    chunk_id: UUID

---

## 9b. Error Handling

| Scenario | Behaviour |
| :-- | :-- |
| LLM provider rate limited | LiteLLM auto-rotates to next provider |
| All LLM providers fail | `RuntimeError` β†’ FastAPI 500 |
| No chunks retrieved | `InsufficientInfoResponse` returned |
| Hallucination detected | retry (max 2x) β†’ low confidence answer |
| DB unreachable | `/health` returns `degraded`, query returns 500 |
| Scanned PDF detected | Warning logged, fallback URL used (Karnataka) |
| Section patterns not matched | Fallback paragraph chunking, warning logged |
| Neo4j event loop mismatch | Prevented β€” `_get_driver()` creates fresh driver per call |
| Embedding input too long | Truncated at 4000 chars before prefix; warning logged |
| max_pages exceeded | Parser silently caps pages; total_pages reflects capped count |

---

## 10. Neo4j Graph β€” Phase 6 State (Current)

**Nodes:** 9 Documents, 2090 Sections
**Edges:** 1297 HAS_SECTION, 933 REFERENCES, 91 DERIVED_FROM

### Documents in Graph

| Document | Jurisdiction | DocType | Chunks | Sections | DERIVED_FROM edges |
|---|---|---|---|---|---|
| RERA Act 2016 | CENTRAL | ACT | ~224 | ~224 | β€” |
| MahaRERA Rules 2017 | MAHARASHTRA | RULES | ~214 | ~214 | 17 sec + 1 doc |
| UP RERA Rules 2016 | UTTAR_PRADESH | RULES | 170 | 33 | 11 sec + 1 doc |
| UP RERA General Regs 2019 | UTTAR_PRADESH | CIRCULAR | 85 | 53 | β€” |
| Karnataka RERA Rules 2017 | KARNATAKA | RULES | 235 | 45 | 15 sec + 1 doc |
| Tamil Nadu RERA Rules 2017 | TAMIL_NADU | RULES | 157 | 36 | 15 sec + 1 doc |

### Known Open Issues (non-blocking)

| Issue | Affected | Root Cause |
|---|---|---|
| Act Β§13 missing from graph | UP rule 14, KA rule 11, TN rule 11 | RERA Act ingestion β€” Β§13 chunked under different ID |
| Act Β§66 missing from graph | KA rule 19, TN rule 19 | RERA Act ingestion β€” Β§66 not ingested |

### DERIVED_FROM Map Summary

| Jurisdiction | Mapped pairs | Resolved | Unresolved |
|---|---|---|---|
| MAHARASHTRA | 17 | 17 | 0 |
| UTTAR_PRADESH | 15 | 11 | 4 |
| KARNATAKA | 17 | 15 | 2 |
| TAMIL_NADU | 17 | 15 | 2 |

### PDF Source Decisions

| Jurisdiction | Primary URL | Issue | Resolution |
|---|---|---|---|
| CENTRAL | indiacode.nic.in | β€” | β€” |
| MAHARASHTRA | naredco.in | β€” | β€” |
| UTTAR_PRADESH | up-rera.in/pdf/rera.pdf | pages 25–52 are forms | max_pages=24 |
| KARNATAKA | naredco.in (mirror) | Official PDF fully scanned (19MB) | NAREDCO born-digital |
| TAMIL_NADU | cms.tn.gov.in | pages 16–101 are Forms A–O | max_pages=15 |

---

## 11. Agent Pipeline — Bug Fixes (2026-03-22)

Three production bugs fixed after 12-case E2E suite. All verified: 0 retries, 0
hallucinations, avg latency 7.6s.

### Fix 1 β€” `vector_store.py::get_section_family` β€” Pydantic crash on SELECT *

`SELECT *` returned `embedding` as a raw string; Pydantic `list[float]` validation
failed. Fix: explicit column projection, `embedding=None` on all returned chunks.
Matches every other `VectorStore` method.

### Fix 2 β€” `nodes.py::vector_retrieval_node` β€” Reranker blowup on section expansion

Section family expansion ran on all 5 similarity hits β†’ up to 121 chunks β†’ FlashRank
cross-encoder serial scoring β†’ 65s reranker time. Fix (Phase 5): expand top-1 hit only; hard
cap at 25 chunks before reranker.

```python
for rc in results[:1]:
    ...family expansion...
expanded = expanded[:25]  # hard safety cap
```

**Phase 8 update:** Expanded to top-3 after RAGAS eval revealed that when a sub-section
(e.g. `S.5(4)`) ranks #1, its parent `S.5` (with the 30-day rule) was never expanded.
Cap raised to 40 to accommodate larger families.

```python
for rc in merged[:3]:    # top-3 RRF results (was: top-1)
    ...family expansion...
expanded = expanded[:40]  # was: 25
```


### Fix 3 β€” `nodes.py::validator_node` β€” False hallucination flag

Validator built context as raw `chunk.text` joined string. Generator answer cites
`"Section 11(1)"` but raw text has no section number β†’ validator scores 0.2 β†’
`hallucinated=True` β†’ spurious retry loops (7 retries across 12 tests).

Fix: mirror generator's numbered context block `[i] doc β€” section_id: title\ntext`.
Validator can now match cited section numbers to source context.
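The fix can be sketched as a shared context builder (hypothetical helper; the point is that generator and validator format chunks identically, so cited section numbers are matchable):

```python
def build_numbered_context(chunks: list[dict]) -> str:
    """Format chunks as '[i] doc — section_id: title' followed by the chunk text,
    mirroring the generator's context block."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = f"[{i}] {c['doc_name']} — {c['section_id']}: {c['section_title']}"
        blocks.append(f"{header}\n{c['text']}")
    return "\n\n".join(blocks)

ctx = build_numbered_context([
    {"doc_name": "RERA Act 2016", "section_id": "11(1)",
     "section_title": "Obligations of promoter", "text": "The promoter shall..."},
])
print(ctx.splitlines()[0])  # [1] RERA Act 2016 — 11(1): Obligations of promoter
```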

### E2E Regression Results (post-fix)

| Metric | Pre-fix | Post-fix |
| :-- | :-- | :-- |
| Avg latency | 19.6s | **7.6s** |
| Max latency | 87.1s | **13.3s** |
| Avg confidence | 0.908 | **0.958** |
| Total retries | 7 | **0** |
| Slow (>20s) | 3 | **0** |
| Low conf (<0.7) | 2 | **0** |
| Pass rate | 12/12 | **12/12** |

---

## 12. Agent Pipeline β€” RAGAS Eval Fixes (Phase 8, April 2026)

Five changes from RAGAS evaluation revealing retrieval and faithfulness failures.

### Fix 4 β€” Reranker thresholds too aggressive (`settings.py`)

Old `score_gap=0.35` cut after any 0.36 point drop β†’ only 1 chunk reached generator.
New: `score_threshold=0.1`, `score_gap=0.6`. Keeps secondary relevant chunks while still
filtering genuine noise (0.98 β†’ 0.20 drop would still cut at 0.78 gap).

### Fix 5 β€” Generator analogy instruction caused hallucination (`generator.py`)

"Use an analogy or real-world example" produced analogies ("Think of it like selling a
used car") not present in retrieved context β†’ faithfulness judge scored as hallucination.
Fix: removed analogy instruction; replaced with "using only information from the provided context".

### Fix 6 β€” Generator weak grounding for sparse contexts (`generator.py`)

Generator constructed legal conclusions from reasoning even when context lacked evidence.
Added explicit rules:
- For sparse context: say "Based on the available context: [X]" and note missing elements
- For conflict detection: only assert conflict if BOTH provisions present in context

### Fix 7 β€” CONFLICT_DETECTION tone hint implied precedence reasoning (`nodes.py`)

Tone hint said "state which jurisdiction takes precedence when context supports it" β€”
LLM interpreted "when context supports it" loosely and applied legal reasoning.
Rewritten to: "Never infer precedence from legal reasoning β€” only state precedence if
the context explicitly says so."

### Fix 8 β€” Temporal query rewrite too generic (`classifier.py`)

Query "What is the timeline for project registration?" produced rewrite "registration
timeline period" β€” FTS missed Section 5 which uses "within thirty days" and "deemed registered".
Added rewriting guidance to expand temporal queries with specific legal time-period keywords.

### RAGAS Results (Phase 8 baseline, 5-row smoke, gemma-4-31b-it judge)

| Row | Faith (before) | Faith (after) | Prec (before) | Prec (after) |
|---|---|---|---|---|
| CENTRAL-FACT-001 | 1.00 | 0.50 | 0.00 | 0.00 |
| CENTRAL-FACT-002 | 0.80 | 0.62 | 0.00 | 0.33 |
| CENTRAL-XREF-001 | 0.63 | 0.50 | 1.00 | 1.00 |
| CENTRAL-CONF-001 | 0.00 | 0.62 | 0.00 | 0.00 |
| CENTRAL-TEMP-001 | 0.67 | 1.00 | 1.00 | 0.00 |
| **Overall** | 0.618 | **0.650** | 0.400* | 0.267 |

\* Before baseline had inflated precision from duplicate chunks (non-deterministic doc_id).
After Phase 8: deterministic UUID5 chunk IDs prevent duplicates on re-ingest.
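
Deterministic IDs of this kind can be derived with `uuid.uuid5` (a sketch under assumed key fields; the namespace constant and the exact name composition used by the pipeline are illustrative):

```python
import uuid

# Fixed namespace so IDs are stable across runs (hypothetical constant).
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "civicsetu/legal_chunks")

def chunk_id(doc_name: str, section_id: str, chunk_index: int) -> uuid.UUID:
    """Same (doc, section, index) always yields the same UUID,
    so re-ingesting upserts rows instead of duplicating them."""
    return uuid.uuid5(CHUNK_NAMESPACE, f"{doc_name}|{section_id}|{chunk_index}")

a = chunk_id("RERA Act 2016", "18", 0)
b = chunk_id("RERA Act 2016", "18", 0)
print(a == b)  # True
```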