adeshboudh16 committed on
Commit
6c8a2d0
·
1 Parent(s): 7ea4089

docs: update HLD/LLD to v0.3.0, add README, ADR 004-005


HLD: bump to v0.3.0, Phase 2 Complete, update roadmap table
LLD: add document_registry module, update chunker patterns,
add embedder truncation guard spec, update Neo4j stats to Phase 2
README: first version - quickstart, stack table, phase roadmap, ADR index
ADR 004: multi-format chunker (Act vs Rule boundary regex decision)
ADR 005: document registry as single source of truth

README.md CHANGED
@@ -0,0 +1,159 @@
+ # CivicSetu
+
+ Open-source RAG system for querying Indian civic and legal documents — with accurate
+ citations, cross-reference traversal, and conflict detection between laws.
+
+ **Current status:** Phase 2 complete — RERA Act 2016 (Central) + Maharashtra Rules 2017.
+
+ ---
+
+ ## What it does
+
+ Ask a plain-English question about RERA or Maharashtra real estate rules. Get a cited,
+ structured answer with section references, confidence score, and a legal disclaimer.
+
+ ```
+
+ Query: "What must a promoter disclose before selling a flat?"
+
+ Answer: "Under Section 11(3) of RERA Act 2016, a promoter must disclose...
+ Rule 3(2) of Maharashtra Rules further requires..."
+
+ Citations: [Section 11, RERA Act 2016], [Rule 3(2), Maharashtra Rules 2017]
+ Confidence: 0.95 (high)
+
+ ```
+
+ ---
+
+ ## Architecture
+
+ ```
+
+ FastAPI → LangGraph Agent → pgvector + Neo4j + PostgreSQL
+                                          ↑
+               Ingestion Pipeline (PDF → chunks → embeddings → graph)
+
+ ```
+
+ Three stores per query:
+ - **pgvector** — semantic similarity (fact lookups)
+ - **Neo4j** — section graph traversal (cross-references, penalties)
+ - **PostgreSQL** — full chunk text + metadata
+
+ Full design: [HLD.md](docs/HLD.md) | [LLD.md](docs/LLD.md)
+
+ ---
+
+ ## Quickstart
+
+ ### Prerequisites
+
+ - Docker + Docker Compose
+ - [Ollama](https://ollama.ai) running locally
+ - `uv` package manager
+
+ ### 1. Start infrastructure
+
+ ```bash
+ docker compose up -d # PostgreSQL + pgvector + Neo4j
+ ollama pull nomic-embed-text # embedding model
+ ```
+
+
+ ### 2. Configure environment
+
+ ```bash
+ cp .env.example .env
+ # Set GEMINI_API_KEY (or GROQ_API_KEY for backup)
+ # Neo4j and Postgres defaults work out of the box with Docker Compose
+ ```
+
+
+ ### 3. Install dependencies
+
+ ```bash
+ uv sync
+ ```
+
+
+ ### 4. Ingest documents
+
+ ```bash
+ uv run python scripts/ingest_phase0.py # RERA Act 2016
+ uv run python scripts/ingest_phase2.py # Maharashtra Rules 2017
+ ```
+
+
+ ### 5. Run the API
+
+ ```bash
+ uv run uvicorn civicsetu.api.main:app --reload
+ ```
+
+
+ ### 6. Query
+
+ ```bash
+ curl -X POST http://localhost:8000/api/v1/query \
+   -H "Content-Type: application/json" \
+   -d '{"query": "What are the penalties for a promoter who delays possession?"}'
+ ```
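For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal illustration, not part of the commit: `build_query_request` and `DEFAULT_URL` are hypothetical names, and the URL simply mirrors the local default used in the Quickstart.

```python
import json
import urllib.request

# Assumption: the API from step 5 is listening on the Quickstart's local default.
DEFAULT_URL = "http://localhost:8000/api/v1/query"


def build_query_request(question: str, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl example issues."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send it (requires the API to be running):
#   with urllib.request.urlopen(build_query_request("..."), timeout=30) as resp:
#       answer = json.loads(resp.read())
```

Building the request separately from sending it keeps the payload shape easy to check without a running server.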
+
+
+ ---
+
+ ## Documents ingested
+
+ | Document | Jurisdiction | Chunks | Sections |
+ | :-- | :-- | :-- | :-- |
+ | RERA Act 2016 | Central | 224 | 92 |
+ | Maharashtra Real Estate Rules 2017 | Maharashtra | 214 | 44 |
+
+
+ ---
+
+ ## Tech stack
+
+ | Layer | Technology |
+ | :-- | :-- |
+ | API | FastAPI + Uvicorn |
+ | Orchestration | LangGraph StateGraph |
+ | LLM routing | LiteLLM (Gemini → Groq → OpenRouter) |
+ | Embeddings | nomic-embed-text via Ollama (local) |
+ | Vector DB | pgvector + HNSW index |
+ | Graph DB | Neo4j Community |
+ | Relational | PostgreSQL + SQLAlchemy |
+ | Reranker | FlashRank (ms-marco-MiniLM-L-12-v2) |
+ | PDF parsing | PyMuPDF |
+
+
+ ---
+
+ ## Phase roadmap
+
+ | Phase | Scope | Status |
+ | :-- | :-- | :-- |
+ | 0 | RERA Act 2016, vector RAG, FastAPI | Complete |
+ | 1 | Neo4j graph, cross-reference queries | Complete |
+ | 2 | MahaRERA Rules 2017, multi-jurisdiction | Complete |
+ | 3 | DERIVED_FROM edges, conflict detection | Next |
+ | 4 | Multi-state expansion (UP, TN, Karnataka) | Planned |
+ | 5 | Open-source SaaS, UI, public API | Planned |
+
+
+ ---
+
+ ## ADRs
+
+ - [ADR 001 — three store architecture](docs/adr/001-three-store-architecture.md)
+ - [ADR 002 — section boundary chunking](docs/adr/002-section-boundary-chunking.md)
+ - [ADR 003 — LangGraph over LangChain chains](docs/adr/003-langgraph-over-langchain.md)
+ - [ADR 004 — Multi-format chunker](docs/adr/004-multi-format-chunker.md)
+ - [ADR 005 — Document registry](docs/adr/005-document-registry.md)
+
+
+ ## Disclaimer
+
+ CivicSetu provides AI-generated legal information, not legal advice.
+ Always verify with a qualified lawyer or the official gazette.
docs/HLD.md CHANGED
@@ -1,8 +1,8 @@
  # CivicSetu — High Level Design (HLD)

- **Version:** 0.2.0 — Phase 1 Complete
  **Last Updated:** March 2026
- **Status:** Phase 1 Complete — Graph Retrieval Live

  ---

@@ -15,7 +15,7 @@ amendment tracking, and conflict detection between laws.
  **Target Users:** Indian citizens, lawyers, homebuyers, activists navigating RERA, RTI,
  labor law, GST compliance, and other civic frameworks.

- **Phase 0 Scope:** RERA Act 2016 (Central) — queryable via REST API.

  ---

@@ -27,28 +27,29 @@ labor law, GST compliance, and other civic frameworks.
  │                           CLIENT LAYER                           │
  │               HTTP REST (FastAPI) — /api/v1/query                │
  └────────────────────────────┬─────────────────────────────────────┘
-                              │
  ┌────────────────────────────▼─────────────────────────────────────┐
- │                         LANGGRAPH AGENT                          │
- │                                                                  │
  │  [Classifier] → [Vector Retrieval] → [Reranker]                  │
- │       ↑            ↓     (Phase 1: + Graph Retrieval)            │
- │    [Retry] ← [Validator] ← [Generator]                           │
  └────────────────────────────┬─────────────────────────────────────┘
-                              │
-         ┌────────────────────┼─────────────────────┐
-         │                    │                     │
- ┌───────▼──────┐   ┌─────────▼───────┐    ┌────────▼────────┐
- │   pgvector   │   │      Neo4j      │    │   PostgreSQL    │
- │  (vectors)   │   │     (graph)     │    │   (metadata)    │
- │  Phase 0 ✅  │   │   Phase 1 🔜    │    │   Phase 0 ✅    │
- └───────┬──────┘   └─────────┬───────┘    └────────┬────────┘
-         │                    │                     │
-         └────────────────────┴─────────────────────┘
-                              │
  ┌────────────────────────────▼─────────────────────────────────────┐
- │                        INGESTION PIPELINE                        │
  │  Download → Parse → Chunk → Enrich → Embed → Store               │
  └──────────────────────────────────────────────────────────────────┘

  ```
@@ -63,15 +64,15 @@ Runs once per document. Triggered via `make ingest` or `POST /api/v1/ingest`.

  ```

- PDF URL
-   → Downloader        (httpx, cached locally)
-   → PDFParser         (PyMuPDF, text extraction)
-   → LegalChunker      (section-boundary regex)
-   → MetadataExtractor (dates, references, amendment signals)
-   → Embedder          (nomic-embed-text via Ollama, local)
-   → RelationalStore   (PostgreSQL — documents + legal_chunks tables)
-   → VectorStore       (pgvector — HNSW index, cosine similarity)
-   → GraphStore        (Neo4j — Phase 1)

  ```

@@ -82,15 +83,15 @@ Triggered on every `POST /api/v1/query`.
  ```

  User Query
-   → Input Guardrails   (Phase 1)
    → Classifier Node    (LLM — query_type + rewritten_query)
-   → Vector Retrieval   (pgvector cosine search, top_k chunks)
-   → Graph Retrieval    (Neo4j Cypher, REFERENCES traversal (bidirectional, depth=2))
-       Fallback: vector retrieval when no section ID in query
    → Reranker           (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
    → Generator Node     (LLM — structured JSON answer with citations)
    → Validator Node     (LLM — hallucination + confidence check)
-   → Output Guardrails  (Phase 1)
    → CivicSetuResponse  (answer + citations + confidence + disclaimer)

  ```
@@ -99,19 +100,20 @@ User Query

  ## 4. Component Responsibilities

- | Component | Responsibility | Technology |
- |---|---|---|
- | PDFParser | Text extraction from PDFs | PyMuPDF |
- | LegalChunker | Section-boundary splitting | Regex + fallback |
- | MetadataExtractor | Date, reference, amendment extraction | Regex |
- | Embedder | Dense vector generation | nomic-embed-text (Ollama) |
- | VectorStore | Semantic similarity search | pgvector + HNSW |
- | GraphStore | Section relationship traversal | Neo4j Community |
- | RelationalStore | Metadata persistence + chunk storage | PostgreSQL + SQLAlchemy |
- | LangGraph Agent | Query orchestration state machine | LangGraph |
- | LiteLLM Gateway | LLM provider fallback routing | LiteLLM |
- | FastAPI | HTTP API layer | FastAPI + Uvicorn |
- | FlashRank | Cross-encoder reranking | ONNX local model |


  ---

@@ -141,7 +143,7 @@ Step 2 Graph → traverse Section 18 node, incoming + outgoing REFERENCES
  Step 2b  Fallback  → vector retrieval if graph returns 0 results
  Step 3   Rerank    → cross-encoder scores, top 5 ordered
  Step 4   Generate  → LLM produces JSON with answer + citations
- Step 5   Validate  → hallucination check, confidence score (skip retry if empty retrieval)
  Step 6   Respond   → CivicSetuResponse with citations + disclaimer

  Output: {
@@ -158,24 +160,23 @@ Output: {

  ## 7. Phase Roadmap

- | Phase | Scope | Status |
- |-------|------------------------------------------------|---------------------|
- | 0 | RERA Act 2016, vector RAG, FastAPI | ✅ Complete |
- | 1 | Neo4j graph, cross-reference queries | ✅ Complete |
- | 2 | MahaRERA Rules + Circulars, amendment tracking | 🔜 Next |
- | 3 | Conflict detection, multi-document reasoning | Planned |
- | 4 | Multi-state expansion (UP, TN, Karnataka RERA) | Planned |
- | 5 | Open-source SaaS, UI, public API | Planned |
-

  ---

  ## 8. Non-Functional Requirements

- | Requirement | Target | Current Status |
- |---|---|---|
- | Response latency | < 10s per query | ~5–8s (local embedding) |
- | Citation accuracy | 100% — never answer without citation | Enforced by schema |
- | Hallucination rate | < 5% | Validator node + confidence gate |
- | Cost | $0 for dev/staging | ✅ All free tier |
- | Portability | Runs on any machine with Docker | ✅ Docker Compose |

  # CivicSetu — High Level Design (HLD)

+ **Version:** 0.3.0 — Phase 2 Complete
  **Last Updated:** March 2026
+ **Status:** Phase 2 Complete — Multi-jurisdiction ingestion live

  ---


  **Target Users:** Indian citizens, lawyers, homebuyers, activists navigating RERA, RTI,
  labor law, GST compliance, and other civic frameworks.

+ **Current Scope:** RERA Act 2016 (Central) + Maharashtra Real Estate Rules 2017.

  ---


  │                           CLIENT LAYER                           │
  │               HTTP REST (FastAPI) — /api/v1/query                │
  └────────────────────────────┬─────────────────────────────────────┘
+                              │
  ┌────────────────────────────▼─────────────────────────────────────┐
+ │                         LANGGRAPH AGENT                          │
+ │                                                                  │
  │  [Classifier] → [Vector Retrieval] → [Reranker]                  │
+ │       ↑         [Graph Retrieval] ↗                              │
+ │    [Retry] ← [Validator] ← [Generator]                           │
  └────────────────────────────┬─────────────────────────────────────┘
+                              │
+           ┌──────────────────┼──────────────────────┐
+           │                  │                      │
+   ┌───────▼──────┐   ┌───────▼─────────┐    ┌───────▼────────┐
+   │   pgvector   │   │      Neo4j      │    │   PostgreSQL   │
+   │  (vectors)   │   │     (graph)     │    │   (metadata)   │
+   │   Phase 0    │   │     Phase 1     │    │    Phase 0     │
+   └───────┬──────┘   └───────┬─────────┘    └───────┬────────┘
+           │                  │                      │
+           └──────────────────┴──────────────────────┘
+                              │
  ┌────────────────────────────▼─────────────────────────────────────┐
+ │                        INGESTION PIPELINE                        │
  │  Download → Parse → Chunk → Enrich → Embed → Store               │
+ │  document_registry.py — single source of truth for all doc URLs  │
  └──────────────────────────────────────────────────────────────────┘

  ```


  ```

+ PDF URL (from document_registry.py)
+   → Downloader        (httpx, cached locally with MD5 check)
+   → PDFParser         (PyMuPDF, text extraction, scanned page detection)
+   → LegalChunker      (multi-format regex: Act + Rule boundary detection)
+   → MetadataExtractor (dates, cross-references, amendment signals)
+   → Embedder          (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
+   → RelationalStore   (PostgreSQL — documents + legal_chunks tables)
+   → VectorStore       (pgvector — HNSW index, cosine similarity)
+   → GraphStore        (Neo4j — Document + Section nodes + edges)

  ```
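The "cached locally with MD5 check" step in the pipeline above could look roughly like the following. This is only a sketch under stated assumptions: the commit does not show `downloader.py`, so `md5_of_file` and `is_cache_valid` are hypothetical helpers, and comparing against a stored hex digest is one plausible reading of "MD5 check".

```python
import hashlib
from pathlib import Path
from typing import Optional


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in fixed-size blocks so large PDFs are not loaded at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_cache_valid(path: Path, expected_md5: Optional[str]) -> bool:
    """Return True when the cached file exists and matches the expected digest.

    Assumption: when no digest is recorded yet, any existing file is accepted.
    """
    if not path.exists():
        return False
    if expected_md5 is None:
        return True
    return md5_of_file(path) == expected_md5.lower()
```

A downloader would call `is_cache_valid` before issuing the HTTP request and skip the download when it returns True.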


  ```

  User Query
+   → Input Guardrails   (PII + off-topic filter)
    → Classifier Node    (LLM — query_type + rewritten_query)
+   → Vector Retrieval   (pgvector cosine search, top_k chunks)  ← fact_lookup
+   → Graph Retrieval    (Neo4j, REFERENCES traversal, depth=2)  ← cross_reference / penalty / temporal
+       Fallback: vector retrieval when no section ID in query
    → Reranker           (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
    → Generator Node     (LLM — structured JSON answer with citations)
    → Validator Node     (LLM — hallucination + confidence check)
+   → Output Guardrails  (faithfulness check + disclaimer injection)
    → CivicSetuResponse  (answer + citations + confidence + disclaimer)

  ```


  ## 4. Component Responsibilities

+ | Component | Responsibility | Technology |
+ |--------------------|---------------------------------------------|---------------------------------|
+ | DocumentRegistry | Centralised doc URL + metadata management | Python dataclass |
+ | PDFParser | Text extraction from PDFs | PyMuPDF |
+ | LegalChunker | Multi-format section-boundary splitting | Regex (Act + Rule patterns) |
+ | MetadataExtractor | Date, reference, amendment extraction | Regex |
+ | Embedder | Dense vector generation + truncation guard | nomic-embed-text (Ollama) |
+ | VectorStore | Semantic similarity search | pgvector + HNSW |
+ | GraphStore | Section relationship traversal | Neo4j Community |
+ | RelationalStore | Metadata persistence + chunk storage | PostgreSQL + SQLAlchemy |
+ | LangGraph Agent | Query orchestration state machine | LangGraph |
+ | LiteLLM Gateway | LLM provider fallback routing | LiteLLM |
+ | FastAPI | HTTP API layer | FastAPI + Uvicorn |
+ | FlashRank | Cross-encoder reranking | ONNX local model |

  ---


  Step 2b  Fallback  → vector retrieval if graph returns 0 results
  Step 3   Rerank    → cross-encoder scores, top 5 ordered
  Step 4   Generate  → LLM produces JSON with answer + citations
+ Step 5   Validate  → hallucination check, confidence score
  Step 6   Respond   → CivicSetuResponse with citations + disclaimer

  Output: {


  ## 7. Phase Roadmap

+ | Phase | Scope | Status |
+ |-------|------------------------------------------------|-----------------|
+ | 0 | RERA Act 2016, vector RAG, FastAPI | ✅ Complete |
+ | 1 | Neo4j graph, cross-reference queries | ✅ Complete |
+ | 2 | MahaRERA Rules 2017, multi-jurisdiction | ✅ Complete |
+ | 3 | DERIVED_FROM edges, conflict detection | Next |
+ | 4 | Multi-state expansion (UP, TN, Karnataka RERA) | Planned |
+ | 5 | Open-source SaaS, UI, public API | Planned |


  ---

  ## 8. Non-Functional Requirements

+ | Requirement | Target | Current Status |
+ |--------------------|--------------------------------------|---------------------------------|
+ | Response latency | < 10s per query | ~5–8s (local embedding) |
+ | Citation accuracy | 100% — never answer without citation | Enforced by schema |
+ | Hallucination rate | < 5% | Validator node + confidence gate|
+ | Cost | $0 for dev/staging | All free tier |
+ | Portability | Runs on any machine with Docker | Docker Compose |
docs/LLD.md CHANGED
@@ -11,47 +11,48 @@

  src/civicsetu/
  ├── config/
- │   └── settings.py             Pydantic BaseSettings singleton (lru_cache)
  ├── models/
- │   ├── enums.py                StrEnum: Jurisdiction, DocType, QueryType, etc.
- │   └── schemas.py              Pydantic models: LegalChunk, Citation, CivicSetuResponse
  ├── ingestion/
- │   ├── downloader.py           httpx PDF downloader with MD5 cache check
- │   ├── parser.py               PyMuPDF text extractor, scanned PDF detection
- │   ├── chunker.py              Section-boundary regex chunker + fallback
  │   ├── metadata_extractor.py   Date/reference/amendment regex extraction
- │   ├── embedder.py             nomic-embed-text via Ollama (document + query prefixes)
- │   └── pipeline.py             Orchestrates all ingestion steps end-to-end
  ├── stores/
- │   ├── relational_store.py     Async SQLAlchemy — documents + legal_chunks tables
- │   ├── vector_store.py         pgvector HNSW cosine search
- │   └── graph_store.py          Neo4j Cypher interface (Phase 1)
  ├── retrieval/
- │   ├── vector_retriever.py     Wraps VectorStore for agent use
- │   ├── graph_retriever.py      Cypher query builder (Phase 1)
- │   └── reranker.py             FlashRank cross-encoder wrapper
  ├── agent/
- │   ├── state.py                CivicSetuState TypedDict (frozen contract)
- │   ├── nodes.py                Pure functions: classifier, retrieval, reranker,
- │   │                           generator, validator
- │   ├── edges.py                Conditional routing: route_after_classifier,
- │   │                           route_after_validator
- │   └── graph.py                StateGraph assembly + get_compiled_graph()
  ├── prompts/
- │   ├── classifier.py           Query type classification + rewriting prompt
- │   ├── generator.py            Cited answer generation prompt
- │   └── validator.py            Hallucination + confidence check prompt
  ├── guardrails/
- │   ├── input_guard.py          PII detection + off-topic filter (Phase 1)
- │   └── output_guard.py         Faithfulness check + disclaimer injection (Phase 1)
  └── api/
-     ├── main.py                 FastAPI app factory + lifespan (graph pre-compiled)
      ├── routes/
-     │   ├── health.py           GET /health — DB ping
-     │   ├── query.py            POST /api/v1/query — main RAG endpoint
-     │   └── ingest.py           POST /api/v1/ingest — Phase 1 admin endpoint
      └── middleware/
-         └── logging.py          Request/response structured logging

  ```

@@ -201,25 +202,35 @@ validator → route_after_validator:

  ### Section Boundary Detection

- Indian legal acts follow a consistent numbering format:

  ```
  ```

- ^\s*(?P<id>\d+[A-Z]?)\.\s+(?P<title>[A-Za-z][^—\n]{3,80})\.?—

  ```
  ```

- Matches: `1. Short title.—` / `18A. Special provisions.—` / ` 2. Definitions.—`

  ### Chunk Size Limits

  ```
  MIN_CHARS = 100 — discard fragments (headers, page numbers)
- MAX_CHARS = 2000 — split large sections at subsection markers (1), (2), (a), (b)
  ```


  ### Split Priority for Large Sections

@@ -253,6 +264,18 @@ Query time: "search_query: {query}" → embed_query()
  Using wrong prefix at query time causes ~10–15% recall degradation.
  The `embed_document()` / `embed_query()` method split enforces this at the API level.

  ---

  ## 6. Response Contract
@@ -316,3 +339,17 @@ If `citations` would be empty → return `InsufficientInfoResponse` instead.
  - Any query with explicit section number (e.g. "Section 18") → cross_reference
  - cross_reference + penalty_lookup + temporal → graph_retrieval node
  - fact_lookup + conflict_detection → vector_retrieval node


  src/civicsetu/
  ├── config/
+ │   ├── settings.py             Pydantic BaseSettings singleton (lru_cache)
+ │   └── document_registry.py    All document URLs + metadata (single source of truth)
  ├── models/
+ │   ├── enums.py                StrEnum: Jurisdiction, DocType, QueryType, etc.
+ │   └── schemas.py              Pydantic models: LegalChunk, Citation, CivicSetuResponse
  ├── ingestion/
+ │   ├── downloader.py           httpx PDF downloader with MD5 cache check
+ │   ├── parser.py               PyMuPDF text extractor, scanned PDF detection
+ │   ├── chunker.py              Section-boundary regex chunker + fallback
  │   ├── metadata_extractor.py   Date/reference/amendment regex extraction
+ │   ├── embedder.py             nomic-embed-text via Ollama (document + query prefixes)
+ │   └── pipeline.py             Orchestrates all ingestion steps end-to-end
  ├── stores/
+ │   ├── relational_store.py     Async SQLAlchemy — documents + legal_chunks tables
+ │   ├── vector_store.py         pgvector HNSW cosine search
+ │   └── graph_store.py          Neo4j Cypher interface (Phase 1)
  ├── retrieval/
+ │   ├── vector_retriever.py     Wraps VectorStore for agent use
+ │   ├── graph_retriever.py      Cypher query builder (Phase 1)
+ │   └── reranker.py             FlashRank cross-encoder wrapper
  ├── agent/
+ │   ├── state.py                CivicSetuState TypedDict (frozen contract)
+ │   ├── nodes.py                Pure functions: classifier, retrieval, reranker,
+ │   │                           generator, validator
+ │   ├── edges.py                Conditional routing: route_after_classifier,
+ │   │                           route_after_validator
+ │   └── graph.py                StateGraph assembly + get_compiled_graph()
  ├── prompts/
+ │   ├── classifier.py           Query type classification + rewriting prompt
+ │   ├── generator.py            Cited answer generation prompt
+ │   └── validator.py            Hallucination + confidence check prompt
  ├── guardrails/
+ │   ├── input_guard.py          PII detection + off-topic filter (Phase 1)
+ │   └── output_guard.py         Faithfulness check + disclaimer injection (Phase 1)
  └── api/
+     ├── main.py                 FastAPI app factory + lifespan (graph pre-compiled)
      ├── routes/
+     │   ├── health.py           GET /health — DB ping
+     │   ├── query.py            POST /api/v1/query — main RAG endpoint
+     │   └── ingest.py           POST /api/v1/ingest — Phase 1 admin endpoint
      └── middleware/
+         └── logging.py          Request/response structured logging

  ```



  ### Section Boundary Detection

+ Two regex patterns to cover both document formats ingested:
+
+ **Act format** (RERA Act 2016):

  ```
+ ^\s*(?P<id>\d+[A-Z]?)\.?\s*(?P<title>[A-Z][^\n—]{3,80})\.?—
  ```

+ Matches: `18. Return of amount and compensation.—`
+
+ **Rule format** (MahaRERA Rules 2017):

  ```
+ \n(?P<id>\d+)\.\s*\n(?P<title>[A-Z][^\n]{3,80})\n
  ```

+ Matches: `\n3.\nInformation to be furnished...\n`
+
+ Chunker tries Act pattern first; falls back to Rule pattern; falls back to paragraph
+ split if neither matches. Logs `chunking_fallback_used` on paragraph path.

  ### Chunk Size Limits

  ```
  MIN_CHARS = 100 — discard fragments (headers, page numbers)
+ MAX_CHARS = 1500 — split large sections at subsection markers (1), (2), (a), (b)
  ```

+ Reduced from 2000 → 1500 to stay within nomic-embed-text practical token window.
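To make the `MAX_CHARS` rule concrete, here is one way splitting at subsection markers might be sketched. It is an illustration under assumptions, not the chunker from this commit: `split_section` and `SUBSECTION_RE` are hypothetical names, markers are assumed to start a line, and the `MIN_CHARS` filter is left out for brevity.

```python
import re

MAX_CHARS = 1500  # value taken from the LLD above

# Zero-width split points at line starts that open with "(1)", "(2)", "(a)", "(b)", ...
SUBSECTION_RE = re.compile(r"(?m)^(?=\((?:\d+|[a-z])\))")


def split_section(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack subsection pieces into chunks of at most max_chars.

    A single subsection longer than max_chars is returned as-is; that is the
    case a downstream truncation guard would have to catch.
    """
    if len(text) <= max_chars:
        return [text]
    pieces = [p for p in SUBSECTION_RE.split(text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

Because the split points are zero-width, joining the returned chunks reproduces the original section text, which makes the function easy to check.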

  ### Split Priority for Large Sections



  Using wrong prefix at query time causes ~10–15% recall degradation.
  The `embed_document()` / `embed_query()` method split enforces this at the API level.

+ ### Truncation Guard
+
+ ```python
+ MAX_EMBED_CHARS = 6000 # ~1500 tokens for nomic-embed-text
+ if len(text) > MAX_EMBED_CHARS:
+     log.warning("embedding_truncated", original_len=len(text), truncated_to=MAX_EMBED_CHARS)
+     text = text[:MAX_EMBED_CHARS]
+ ```
+
+ Prevents silent API errors on oversized chunks. Expected to fire on 0–2 chunks per
+ document where subsection splitting fails (complex tables, long definition lists).
+
  ---

  ## 6. Response Contract

  - Any query with explicit section number (e.g. "Section 18") → cross_reference
  - cross_reference + penalty_lookup + temporal → graph_retrieval node
  - fact_lookup + conflict_detection → vector_retrieval node
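The three routing rules above can be expressed as a small lookup. The sketch below is illustrative only: the real `route_after_classifier` in `edges.py` is not shown in this commit, so the enum members, `route_for`, and `has_explicit_section` are assumed names, and the "Section N" regex is a guess at how an explicit section number might be detected.

```python
import re
from enum import Enum


class QueryType(str, Enum):
    # Member names are assumptions based on the query types listed above.
    FACT_LOOKUP = "fact_lookup"
    CROSS_REFERENCE = "cross_reference"
    PENALTY_LOOKUP = "penalty_lookup"
    TEMPORAL = "temporal"
    CONFLICT_DETECTION = "conflict_detection"


# Query types that the rules above send to the graph retrieval node.
GRAPH_TYPES = {QueryType.CROSS_REFERENCE, QueryType.PENALTY_LOOKUP, QueryType.TEMPORAL}

SECTION_RE = re.compile(r"\bsection\s+\d+[A-Z]?\b", re.IGNORECASE)


def has_explicit_section(query: str) -> bool:
    """True when the query names a section number, e.g. 'Section 18'."""
    return SECTION_RE.search(query) is not None


def route_for(query_type: QueryType) -> str:
    """Map a classified query type to the name of the retrieval node."""
    return "graph_retrieval" if query_type in GRAPH_TYPES else "vector_retrieval"
```

Keeping the graph-bound types in one set means a new query type defaults to vector retrieval until it is explicitly added.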
+
+ ## 8. Neo4j Graph — Phase 2 State
+
+ **Nodes seeded:** 2 Documents, 438 Section nodes
+ **Edges seeded:** 438 HAS_SECTION, 124 REFERENCES, 0 DERIVED_FROM (Phase 3)
+
+ **Documents in graph:**
+ - RERA Act 2016 (CENTRAL) — 224 sections, 63 REFERENCES edges
+ - Maharashtra Real Estate Rules 2017 (MAHARASHTRA) — 214 sections, 61 REFERENCES edges
+
+ **Known issue (Phase 3 backlog):**
+ Citation deduplication keys on `section_id` only, not `(section_id, doc_name)`.
+ Cross-doc queries may show duplicate section IDs from different documents.
+ Fix: update generator citation dedup to composite key.
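The composite-key fix named in the known issue could be sketched as follows. This is a hypothetical illustration: the `Citation` dataclass here is a stand-in with only the two fields the issue mentions (the project's real `Citation` is a Pydantic model whose fields are not shown), and `dedupe_citations` is an assumed helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    # Stand-in for the project's Citation schema; only the two key fields are modelled.
    section_id: str
    doc_name: str


def dedupe_citations(citations: list[Citation]) -> list[Citation]:
    """Drop repeats keyed on (section_id, doc_name), preserving first-seen order."""
    seen: set[tuple[str, str]] = set()
    unique: list[Citation] = []
    for citation in citations:
        key = (citation.section_id, citation.doc_name)
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique
```

With the composite key, "3" in the RERA Act and "3" in the Maharashtra Rules remain distinct citations, while a true repeat of the same section in the same document is removed.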
docs/adr/004-multi-format-chunker.md ADDED
@@ -0,0 +1,86 @@
+ # ADR 004 — Multi-format Legal Document Chunker
+
+ **Date:** March 2026
+ **Status:** Accepted
+
+ ---
+
+ ## Context
+
+ Phase 2 required ingesting Maharashtra Real Estate Rules 2017 alongside the RERA Act
+ 2016. Both are Indian legal PDFs but use structurally different numbering formats:
+
+ **Act format (RERA Act 2016):**
+
+
+ 18. Return of amount and compensation.—
+ (1) If the promoter fails to complete...
+
+ Section title and em-dash are on the same line as the section number.
+
+ **Rule format (Maharashtra Rules 2017):**
+
+
+ 3.
+
+ Information to be furnished by the promoter...
+
+
+ Section number is on its own line, followed by a blank line, then the title.
+
+ The existing Act-format regex (`^\s*\d+[A-Z]?\.\s+[A-Z][^—\n]{3,80}\.?—`) produces
+ zero section boundaries on MahaRERA, triggering fallback paragraph chunking.
+ Paragraph chunking on MahaRERA produces 80+ chunks with no section_id metadata —
+ breaking citation accuracy entirely.
+
+ ## Decision
+
+ Extend `LegalChunker` with a second boundary pattern for Rule format, applied as
+ a sequential fallback:
+
+ ```python
+ PATTERNS = [
+     ("act", r'^\s*(?P<id>\d+[A-Z]?)\.?\s*(?P<title>[A-Z][^\n—]{3,80})\.?—'),
+     ("rule", r'\n(?P<id>\d+)\.\s*\n(?P<title>[A-Z][^\n]{3,80})\n'),
+ ]
+
+ for name, pattern in PATTERNS:
+     matches = list(re.finditer(pattern, text, re.MULTILINE))
+     if len(matches) >= MIN_SECTIONS:
+         log.info("chunker_pattern_selected", pattern=name, sections=len(matches))
+         break
+ ```
+
+ `MIN_SECTIONS = 5` — fewer than 5 matches is treated as noise, not real boundaries.
+
+ The chunker logs which pattern was selected per document. Paragraph fallback is only
+ reached if both patterns fail.
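A self-contained version of the selection loop, including the paragraph fallback described above, might look like this. It is a sketch rather than the committed `LegalChunker`: `select_pattern` and the `"paragraph"` sentinel are assumed names, logging is omitted, and the em-dash is written as the escape `\u2014` so the pattern survives copy-paste through encodings that mangle it.

```python
import re

MIN_SECTIONS = 5  # value from the ADR: fewer matches are treated as noise

# Same two patterns as in the ADR, with the em-dash spelled as \u2014.
PATTERNS = [
    ("act", r"^\s*(?P<id>\d+[A-Z]?)\.?\s*(?P<title>[A-Z][^\n\u2014]{3,80})\.?\u2014"),
    ("rule", r"\n(?P<id>\d+)\.\s*\n(?P<title>[A-Z][^\n]{3,80})\n"),
]


def select_pattern(text: str) -> tuple[str, list[re.Match]]:
    """Try each pattern in order; return the first with at least MIN_SECTIONS hits.

    Returns ("paragraph", []) when neither format is detected, signalling that
    the caller should fall back to paragraph splitting.
    """
    for name, pattern in PATTERNS:
        matches = list(re.finditer(pattern, text, re.MULTILINE))
        if len(matches) >= MIN_SECTIONS:
            return name, matches
    return "paragraph", []
```

Returning the pattern name alongside the matches keeps the choice observable to the caller, in line with the ADR's point that selection should be logged rather than silent.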
+
+ ## Consequences
+
+ **Positive:**
+
+ - MahaRERA produces 214 meaningful chunks with proper section_id metadata (44 sections)
+ - Citation accuracy preserved — every chunk maps to an identifiable Rule number
+ - Pattern selection is logged — observable, not silent
+ - Adding a third pattern (e.g. circular format) requires one array entry
+
+ **Negative:**
+
+ - Pattern priority is implicit — if a document accidentally matches Rule pattern first
+ with >= 5 hits, it bypasses Act pattern (mitigated by trying Act first)
+ - Regex fragility: PDFs with unusual whitespace will still hit fallback
+
+
+ ## Alternatives Rejected
+
+ - **Hardcode document type in ingestion config:** Requires caller to know format ahead
+ of time; breaks the "any PDF URL" contract of the ingestion pipeline
+ - **ML-based section detector:** Overkill for deterministic numbered formats; adds
+ model dependency with no recall benefit on well-formatted government PDFs
+ - **Single universal regex:** No single pattern can match both `18. Title.—` and
+ `\n18.\n\nTitle\n` without catastrophic false positives
docs/adr/005-document-registry.md ADDED
@@ -0,0 +1,80 @@
+ # ADR 005 — Document Registry as Single Source of Truth
+
+ **Date:** March 2026
+ **Status:** Accepted
+
+ ---
+
+ ## Context
+
+ Phase 2 introduced a second document. With two documents, ingestion scripts started
+ duplicating URL strings, jurisdiction values, and doc_name strings across:
+
+ - `scripts/ingest_phase0.py`
+ - `scripts/ingest_phase2.py`
+ - Tests
+ - Any future migration or re-ingestion scripts
+
+ A URL change (e.g. NAREDCO moves their PDF) would require grep-and-replace across
+ multiple files with no compile-time safety.
+
+ ## Decision
+
+ Introduce `src/civicsetu/config/document_registry.py` as the single authoritative
+ source for all document metadata:
+
+ ```python
+ @dataclass(frozen=True)
+ class DocumentSpec:
+     name: str
+     url: str
+     jurisdiction: Jurisdiction
+     doc_type: DocType
+     effective_date: date | None = None
+
+ DOCUMENT_REGISTRY: dict[str, DocumentSpec] = {
+     "rera_act_2016": DocumentSpec(
+         name="RERA Act 2016",
+         url="https://...",
+         jurisdiction=Jurisdiction.CENTRAL,
+         doc_type=DocType.ACT,
+         effective_date=date(2016, 5, 26),
+     ),
+     "mahrera_rules_2017": DocumentSpec(
+         name="Maharashtra Real Estate (Regulation and Development) Rules 2017",
+         url="https://naredco.in/...",
+         jurisdiction=Jurisdiction.MAHARASHTRA,
+         doc_type=DocType.RULES,
+         effective_date=date(2017, 4, 21),
+     ),
+ }
+ ```
+
+ All ingestion scripts import from `document_registry`. No URL strings appear outside
+ this file.
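How an ingestion script or test might consume such a registry can be sketched as below. Everything here is illustrative: the enums are minimal stand-ins for the project's `Jurisdiction` and `DocType`, the URLs are deliberately fake placeholders because the real ones are elided in the ADR, and `get_document` and `documents_for` are hypothetical helpers that the ADR does not define.

```python
from dataclasses import dataclass
from enum import Enum


class Jurisdiction(str, Enum):  # stand-in for civicsetu.models.enums.Jurisdiction
    CENTRAL = "CENTRAL"
    MAHARASHTRA = "MAHARASHTRA"


class DocType(str, Enum):  # stand-in for civicsetu.models.enums.DocType
    ACT = "ACT"
    RULES = "RULES"


@dataclass(frozen=True)
class DocumentSpec:
    name: str
    url: str
    jurisdiction: Jurisdiction
    doc_type: DocType


# Placeholder URLs only; the real URLs are not given in the ADR.
DOCUMENT_REGISTRY = {
    "rera_act_2016": DocumentSpec(
        "RERA Act 2016", "https://example.invalid/act.pdf",
        Jurisdiction.CENTRAL, DocType.ACT,
    ),
    "mahrera_rules_2017": DocumentSpec(
        "Maharashtra Real Estate (Regulation and Development) Rules 2017",
        "https://example.invalid/rules.pdf",
        Jurisdiction.MAHARASHTRA, DocType.RULES,
    ),
}


def get_document(key: str) -> DocumentSpec:
    """Look up a spec by registry key, listing the valid keys on failure."""
    try:
        return DOCUMENT_REGISTRY[key]
    except KeyError:
        known = ", ".join(sorted(DOCUMENT_REGISTRY))
        raise KeyError(f"unknown document {key!r}; known keys: {known}") from None


def documents_for(jurisdiction: Jurisdiction) -> list[DocumentSpec]:
    """Return every registered document for one jurisdiction."""
    return [s for s in DOCUMENT_REGISTRY.values() if s.jurisdiction == jurisdiction]
```

A script would then call `get_document("rera_act_2016").url` instead of embedding the URL string, which is the propagation guarantee the ADR is after.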
+
+ ## Consequences
+
+ **Positive:**
+
+ - URL change = one-line edit, guaranteed to propagate everywhere
+ - `DocumentSpec` is a frozen dataclass — immutable, hashable, diffable in git
+ - Phase 4 (multi-state expansion) is a registry append, not a script rewrite
+ - Tests can iterate `DOCUMENT_REGISTRY.values()` for fixture generation
+
+ **Negative:**
+
+ - Adding a document requires a code change + deploy (not a DB insert)
+ - Acceptable for Phase 0–3 volume (~10 documents); revisit for Phase 4+
+
+
+ ## Alternatives Rejected
+
+ - **Database table for document registry:** Correct long-term, premature for current
+ volume. Adds a DB round-trip to every ingestion bootstrap.
+ - **Environment variables per document:** Unscalable beyond 2–3 documents;
+ no structure, no type safety
+ - **YAML/TOML config file:** Adds a parsing layer with no type safety; dataclass
+ achieves the same with Python's own type checker
+
+