Commit 6c8a2d0 by adeshboudh16
Parent(s): 7ea4089
docs: update HLD/LLD to v0.3.0, add README, ADR 004-005
HLD: bump to v0.3.0, Phase 2 Complete, update roadmap table
LLD: add document_registry module, update chunker patterns,
add embedder truncation guard spec, update Neo4j stats to Phase 2
README: first version - quickstart, stack table, phase roadmap, ADR index
ADR 004: multi-format chunker (Act vs Rule boundary regex decision)
ADR 005: document registry as single source of truth
- README.md +159 -0
- docs/HLD.md +65 -64
- docs/LLD.md +71 -34
- docs/adr/004-multi-format-chunker.md +86 -0
- docs/adr/005-document-registry.md +80 -0
README.md
CHANGED
|
@@ -0,0 +1,159 @@
|
|
| 1 |
+
# CivicSetu
|
| 2 |
+
|
| 3 |
+
Open-source RAG system for querying Indian civic and legal documents, with accurate
|
| 4 |
+
citations, cross-reference traversal, and conflict detection between laws.
|
| 5 |
+
|
| 6 |
+
**Current status:** Phase 2 complete. RERA Act 2016 (Central) + Maharashtra Rules 2017.
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## What it does
|
| 11 |
+
|
| 12 |
+
Ask a plain-English question about RERA or Maharashtra real estate rules. Get a cited,
|
| 13 |
+
structured answer with section references, confidence score, and a legal disclaimer.
|
| 14 |
+
|
| 15 |
+
```
|
| 16 |
+
|
| 17 |
+
Query: "What must a promoter disclose before selling a flat?"
|
| 18 |
+
|
| 19 |
+
Answer: "Under Section 11(3) of RERA Act 2016, a promoter must disclose...
|
| 20 |
+
Rule 3(2) of Maharashtra Rules further requires..."
|
| 21 |
+
|
| 22 |
+
Citations: [Section 11, RERA Act 2016], [Rule 3(2), Maharashtra Rules 2017]
|
| 23 |
+
Confidence: 0.95 (high)
|
| 24 |
+
|
| 25 |
+
```
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## Architecture
|
| 30 |
+
|
| 31 |
+
```
|
| 32 |
+
|
| 33 |
+
FastAPI → LangGraph Agent → pgvector + Neo4j + PostgreSQL
|
| 34 |
+
↑
|
| 35 |
+
Ingestion Pipeline (PDF → chunks → embeddings → graph)
|
| 36 |
+
|
| 37 |
+
```
|
| 38 |
+
|
| 39 |
+
Three stores per query:
|
| 40 |
+
- **pgvector**: semantic similarity (fact lookups)
|
| 41 |
+
- **Neo4j**: section graph traversal (cross-references, penalties)
|
| 42 |
+
- **PostgreSQL**: full chunk text + metadata
|
| 43 |
+
|
| 44 |
+
Full design: [HLD.md](docs/HLD.md) | [LLD.md](docs/LLD.md)
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## Quickstart
|
| 49 |
+
|
| 50 |
+
### Prerequisites
|
| 51 |
+
|
| 52 |
+
- Docker + Docker Compose
|
| 53 |
+
- [Ollama](https://ollama.ai) running locally
|
| 54 |
+
- `uv` package manager
|
| 55 |
+
|
| 56 |
+
### 1. Start infrastructure
|
| 57 |
+
|
| 58 |
+
```bash
|
| 59 |
+
docker compose up -d # PostgreSQL + pgvector + Neo4j
|
| 60 |
+
ollama pull nomic-embed-text # embedding model
|
| 61 |
+
```
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
### 2. Configure environment
|
| 65 |
+
|
| 66 |
+
```bash
|
| 67 |
+
cp .env.example .env
|
| 68 |
+
# Set GEMINI_API_KEY (or GROQ_API_KEY for backup)
|
| 69 |
+
# Neo4j and Postgres defaults work out of the box with Docker Compose
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
### 3. Install dependencies
|
| 74 |
+
|
| 75 |
+
```bash
|
| 76 |
+
uv sync
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
### 4. Ingest documents
|
| 81 |
+
|
| 82 |
+
```bash
|
| 83 |
+
uv run python scripts/ingest_phase0.py # RERA Act 2016
|
| 84 |
+
uv run python scripts/ingest_phase2.py # Maharashtra Rules 2017
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
### 5. Run the API
|
| 89 |
+
|
| 90 |
+
```bash
|
| 91 |
+
uv run uvicorn civicsetu.api.main:app --reload
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
### 6. Query
|
| 96 |
+
|
| 97 |
+
```bash
|
| 98 |
+
curl -X POST http://localhost:8000/api/v1/query \
|
| 99 |
+
-H "Content-Type: application/json" \
|
| 100 |
+
-d '{"query": "What are the penalties for a promoter who delays possession?"}'
|
| 101 |
+
```
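The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the API is running locally as in step 5. The helper names `build_query_request` and `ask` are hypothetical and not part of CivicSetu, and no response field names are assumed.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/v1/query"  # endpoint from the curl example


def build_query_request(question: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request (hypothetical helper, not part of CivicSetu)."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, timeout: float = 30.0) -> dict:
    """Send the question and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_query_request(question), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `ask("What are the penalties for a promoter who delays possession?")` should return the same JSON payload that the curl command prints, provided the server is up.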
|
| 102 |
+
|
| 103 |
+
|
| 104 |
+
---
|
| 105 |
+
|
| 106 |
+
## Documents ingested
|
| 107 |
+
|
| 108 |
+
| Document | Jurisdiction | Chunks | Sections |
|
| 109 |
+
| :-- | :-- | :-- | :-- |
|
| 110 |
+
| RERA Act 2016 | Central | 224 | 92 |
|
| 111 |
+
| Maharashtra Real Estate Rules 2017 | Maharashtra | 214 | 44 |
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
---
|
| 115 |
+
|
| 116 |
+
## Tech stack
|
| 117 |
+
|
| 118 |
+
| Layer | Technology |
|
| 119 |
+
| :-- | :-- |
|
| 120 |
+
| API | FastAPI + Uvicorn |
|
| 121 |
+
| Orchestration | LangGraph StateGraph |
|
| 122 |
+
| LLM routing | LiteLLM (Gemini → Groq → OpenRouter) |
|
| 123 |
+
| Embeddings | nomic-embed-text via Ollama (local) |
|
| 124 |
+
| Vector DB | pgvector + HNSW index |
|
| 125 |
+
| Graph DB | Neo4j Community |
|
| 126 |
+
| Relational | PostgreSQL + SQLAlchemy |
|
| 127 |
+
| Reranker | FlashRank (ms-marco-MiniLM-L-12-v2) |
|
| 128 |
+
| PDF parsing | PyMuPDF |
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
---
|
| 132 |
+
|
| 133 |
+
## Phase roadmap
|
| 134 |
+
|
| 135 |
+
| Phase | Scope | Status |
|
| 136 |
+
| :-- | :-- | :-- |
|
| 137 |
+
| 0 | RERA Act 2016, vector RAG, FastAPI | Complete |
|
| 138 |
+
| 1 | Neo4j graph, cross-reference queries | Complete |
|
| 139 |
+
| 2 | MahaRERA Rules 2017, multi-jurisdiction | Complete |
|
| 140 |
+
| 3 | DERIVED_FROM edges, conflict detection | Next |
|
| 141 |
+
| 4 | Multi-state expansion (UP, TN, Karnataka) | Planned |
|
| 142 |
+
| 5 | Open-source SaaS, UI, public API | Planned |
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
---
|
| 146 |
+
|
| 147 |
+
## ADRs
|
| 148 |
+
|
| 149 |
+
- [ADR 001: three store architecture](docs/adr/001-three-store-architecture.md)
|
| 150 |
+
- [ADR 002: section boundary chunking](docs/adr/002-section-boundary-chunking.md)
|
| 151 |
+
- [ADR 003: LangGraph over LangChain chains](docs/adr/003-langgraph-over-langchain.md)
|
| 152 |
+
- [ADR 004: Multi-format chunker](docs/adr/004-multi-format-chunker.md)
|
| 153 |
+
- [ADR 005: Document registry](docs/adr/005-document-registry.md)
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
## Disclaimer
|
| 157 |
+
|
| 158 |
+
CivicSetu provides AI-generated legal information, not legal advice.
|
| 159 |
+
Always verify with a qualified lawyer or the official gazette.
|
docs/HLD.md
CHANGED
|
@@ -1,8 +1,8 @@
|
|
| 1 |
# CivicSetu: High Level Design (HLD)
|
| 2 |
|
| 3 |
-
**Version:** 0.
|
| 4 |
**Last Updated:** March 2026
|
| 5 |
-
**Status:** Phase
|
| 6 |
|
| 7 |
---
|
| 8 |
|
|
@@ -15,7 +15,7 @@ amendment tracking, and conflict detection between laws.
|
|
| 15 |
**Target Users:** Indian citizens, lawyers, homebuyers, activists navigating RERA, RTI,
|
| 16 |
labor law, GST compliance, and other civic frameworks.
|
| 17 |
|
| 18 |
-
**
|
| 19 |
|
| 20 |
---
|
| 21 |
|
|
@@ -27,28 +27,29 @@ labor law, GST compliance, and other civic frameworks.
|
|
| 27 |
β CLIENT LAYER β
|
| 28 |
β HTTP REST (FastAPI) β /api/v1/query β
|
| 29 |
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
|
| 30 |
-
β
|
| 31 |
ββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββ
|
| 32 |
-
β LANGGRAPH AGENT
|
| 33 |
-
β
|
| 34 |
β [Classifier] β [Vector Retrieval] β [Reranker] β
|
| 35 |
-
β β
|
| 36 |
-
β [Retry] β [Validator] β [Generator]
|
| 37 |
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
|
| 38 |
-
β
|
| 39 |
-
βββββββββββββββββββ
|
| 40 |
-
β
|
| 41 |
-
βββββββββΌβββββββ ββββββββββ
|
| 42 |
-
β pgvector β β Neo4j β
|
| 43 |
-
β (vectors) β β
|
| 44 |
-
β Phase 0
|
| 45 |
-
βββββββββ¬βββββββ ββββββββββ
|
| 46 |
-
β
|
| 47 |
-
βββββββββββββββββββ
|
| 48 |
-
β
|
| 49 |
ββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββ
|
| 50 |
-
β INGESTION PIPELINE
|
| 51 |
β Download β Parse β Chunk β Enrich β Embed β Store β
|
|
|
|
| 52 |
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 53 |
|
| 54 |
```
|
|
@@ -63,15 +64,15 @@ Runs once per document. Triggered via `make ingest` or `POST /api/v1/ingest`.
|
|
| 63 |
|
| 64 |
```
|
| 65 |
|
| 66 |
-
PDF URL
|
| 67 |
-
β Downloader
|
| 68 |
-
β PDFParser
|
| 69 |
-
|
| 70 |
-
β MetadataExtractor(dates, references, amendment signals)
|
| 71 |
-
β Embedder
|
| 72 |
-
β RelationalStore
|
| 73 |
-
β VectorStore
|
| 74 |
-
β GraphStore
|
| 75 |
|
| 76 |
```
|
| 77 |
|
|
@@ -82,15 +83,15 @@ Triggered on every `POST /api/v1/query`.
|
|
| 82 |
```
|
| 83 |
|
| 84 |
User Query
|
| 85 |
-
β Input Guardrails (
|
| 86 |
β Classifier Node (LLM β query_type + rewritten_query)
|
| 87 |
-
β Vector Retrieval (pgvector cosine search, top_k chunks)
|
| 88 |
-
β Graph Retrieval (Neo4j
|
| 89 |
-
|
| 90 |
β Reranker (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
|
| 91 |
β Generator Node (LLM β structured JSON answer with citations)
|
| 92 |
β Validator Node (LLM β hallucination + confidence check)
|
| 93 |
-
β Output Guardrails (
|
| 94 |
β CivicSetuResponse (answer + citations + confidence + disclaimer)
|
| 95 |
|
| 96 |
```
|
|
@@ -99,19 +100,20 @@ User Query
|
|
| 99 |
|
| 100 |
## 4. Component Responsibilities
|
| 101 |
|
| 102 |
-
| Component
|
| 103 |
-
|---|---|---|
|
| 104 |
-
|
|
| 105 |
-
|
|
| 106 |
-
|
|
| 107 |
-
|
|
| 108 |
-
|
|
| 109 |
-
|
|
| 110 |
-
|
|
| 111 |
-
|
|
| 112 |
-
|
|
| 113 |
-
|
|
| 114 |
-
|
|
|
|
|
| 115 |
|
| 116 |
---
|
| 117 |
|
|
@@ -141,7 +143,7 @@ Step 2 Graph β traverse Section 18 node, incoming + outgoing REFERENCES
|
|
| 141 |
Step 2b Fallback β vector retrieval if graph returns 0 results
|
| 142 |
Step 3 Rerank β cross-encoder scores, top 5 ordered
|
| 143 |
Step 4 Generate β LLM produces JSON with answer + citations
|
| 144 |
-
Step 5 Validate β hallucination check, confidence score
|
| 145 |
Step 6 Respond β CivicSetuResponse with citations + disclaimer
|
| 146 |
|
| 147 |
Output: {
|
|
@@ -158,24 +160,23 @@ Output: {
|
|
| 158 |
|
| 159 |
## 7. Phase Roadmap
|
| 160 |
|
| 161 |
-
| Phase | Scope | Status
|
| 162 |
-
|-------|------------------------------------------------|-----------------
|
| 163 |
-
| 0 | RERA Act 2016, vector RAG, FastAPI | ✅ Complete
|
| 164 |
-
| 1 | Neo4j graph, cross-reference queries | ✅ Complete
|
| 165 |
-
| 2 | MahaRERA Rules
|
| 166 |
-
| 3 |
|
| 167 |
-
| 4 | Multi-state expansion (UP, TN, Karnataka RERA) | Planned
|
| 168 |
-
| 5 | Open-source SaaS, UI, public API | Planned
|
| 169 |
-
|
| 170 |
|
| 171 |
---
|
| 172 |
|
| 173 |
## 8. Non-Functional Requirements
|
| 174 |
|
| 175 |
-
| Requirement
|
| 176 |
-
|---|---|---|
|
| 177 |
-
| Response latency
|
| 178 |
-
| Citation accuracy
|
| 179 |
-
| Hallucination rate | < 5%
|
| 180 |
-
| Cost
|
| 181 |
-
| Portability
|
|
|
|
| 1 |
# CivicSetu: High Level Design (HLD)
|
| 2 |
|
| 3 |
+
**Version:** 0.3.0 (Phase 2 Complete)
|
| 4 |
**Last Updated:** March 2026
|
| 5 |
+
**Status:** Phase 2 Complete, multi-jurisdiction ingestion live
|
| 6 |
|
| 7 |
---
|
| 8 |
|
|
|
|
| 15 |
**Target Users:** Indian citizens, lawyers, homebuyers, activists navigating RERA, RTI,
|
| 16 |
labor law, GST compliance, and other civic frameworks.
|
| 17 |
|
| 18 |
+
**Current Scope:** RERA Act 2016 (Central) + Maharashtra Real Estate Rules 2017.
|
| 19 |
|
| 20 |
---
|
| 21 |
|
|
|
|
| 27 |
β CLIENT LAYER β
|
| 28 |
β HTTP REST (FastAPI) β /api/v1/query β
|
| 29 |
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
|
| 30 |
+
β
|
| 31 |
ββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββ
|
| 32 |
+
β LANGGRAPH AGENT β
|
| 33 |
+
β β
|
| 34 |
β [Classifier] β [Vector Retrieval] β [Reranker] β
|
| 35 |
+
β β [Graph Retrieval] β β
|
| 36 |
+
β [Retry] β [Validator] β [Generator] β
|
| 37 |
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
|
| 38 |
+
β
|
| 39 |
+
ββββββββββββββββββββΌβββββββββββββββββββββββ
|
| 40 |
+
β β β
|
| 41 |
+
βββββββββΌβββββββ βββββββββΌββββββββββ βββββββββΌβββββββββ
|
| 42 |
+
β pgvector β β Neo4j β β PostgreSQL β
|
| 43 |
+
β (vectors) β β (graph) β β (metadata) β
|
| 44 |
+
β Phase 0 β β Phase 1 β β Phase 0 β
|
| 45 |
+
βββββββββ¬βββββββ βββββββββ¬ββββββββββ βββββββββ¬βββββββββ
|
| 46 |
+
β β β
|
| 47 |
+
ββββββββββββββββββββ΄βββββββββββββββββββββββ
|
| 48 |
+
β
|
| 49 |
ββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββ
|
| 50 |
+
β INGESTION PIPELINE β
|
| 51 |
β Download β Parse β Chunk β Enrich β Embed β Store β
|
| 52 |
+
β document_registry.py β single source of truth for all doc URLs β
|
| 53 |
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 54 |
|
| 55 |
```
|
|
|
|
| 64 |
|
| 65 |
```
|
| 66 |
|
| 67 |
+
PDF URL (from document_registry.py)
|
| 68 |
+
→ Downloader (httpx, cached locally with MD5 check)
|
| 69 |
+
→ PDFParser (PyMuPDF, text extraction, scanned page detection)
|
| 70 |
+
→ LegalChunker (multi-format regex: Act + Rule boundary detection)
|
| 71 |
+
→ MetadataExtractor (dates, cross-references, amendment signals)
|
| 72 |
+
→ Embedder (nomic-embed-text via Ollama, MAX_EMBED_CHARS=6000 guard)
|
| 73 |
+
→ RelationalStore (PostgreSQL: documents + legal_chunks tables)
|
| 74 |
+
→ VectorStore (pgvector: HNSW index, cosine similarity)
|
| 75 |
+
→ GraphStore (Neo4j: Document + Section nodes + edges)
|
| 76 |
|
| 77 |
```
|
| 78 |
|
|
|
|
| 83 |
```
|
| 84 |
|
| 85 |
User Query
|
| 86 |
+
→ Input Guardrails (PII + off-topic filter)
|
| 87 |
→ Classifier Node (LLM → query_type + rewritten_query)
|
| 88 |
+
→ Vector Retrieval (pgvector cosine search, top_k chunks), used for fact_lookup
|
| 89 |
+
→ Graph Retrieval (Neo4j, REFERENCES traversal, depth=2), used for cross_reference / penalty / temporal
|
| 90 |
+
Fallback: vector retrieval when no section ID in query
|
| 91 |
→ Reranker (FlashRank ms-marco-MiniLM-L-12-v2, cross-encoder)
|
| 92 |
→ Generator Node (LLM → structured JSON answer with citations)
|
| 93 |
→ Validator Node (LLM → hallucination + confidence check)
|
| 94 |
+
→ Output Guardrails (faithfulness check + disclaimer injection)
|
| 95 |
→ CivicSetuResponse (answer + citations + confidence + disclaimer)
|
| 96 |
|
| 97 |
```
|
|
|
|
| 100 |
|
| 101 |
## 4. Component Responsibilities
|
| 102 |
|
| 103 |
+
| Component | Responsibility | Technology |
|
| 104 |
+
|--------------------|---------------------------------------------|---------------------------------|
|
| 105 |
+
| DocumentRegistry | Centralised doc URL + metadata management | Python dataclass |
|
| 106 |
+
| PDFParser | Text extraction from PDFs | PyMuPDF |
|
| 107 |
+
| LegalChunker | Multi-format section-boundary splitting | Regex (Act + Rule patterns) |
|
| 108 |
+
| MetadataExtractor | Date, reference, amendment extraction | Regex |
|
| 109 |
+
| Embedder | Dense vector generation + truncation guard | nomic-embed-text (Ollama) |
|
| 110 |
+
| VectorStore | Semantic similarity search | pgvector + HNSW |
|
| 111 |
+
| GraphStore | Section relationship traversal | Neo4j Community |
|
| 112 |
+
| RelationalStore | Metadata persistence + chunk storage | PostgreSQL + SQLAlchemy |
|
| 113 |
+
| LangGraph Agent | Query orchestration state machine | LangGraph |
|
| 114 |
+
| LiteLLM Gateway | LLM provider fallback routing | LiteLLM |
|
| 115 |
+
| FastAPI | HTTP API layer | FastAPI + Uvicorn |
|
| 116 |
+
| FlashRank | Cross-encoder reranking | ONNX local model |
|
| 117 |
|
| 118 |
---
|
| 119 |
|
|
|
|
| 143 |
Step 2b Fallback: vector retrieval if graph returns 0 results
|
| 144 |
Step 3 Rerank: cross-encoder scores, top 5 ordered
|
| 145 |
Step 4 Generate: LLM produces JSON with answer + citations
|
| 146 |
+
Step 5 Validate: hallucination check, confidence score
|
| 147 |
Step 6 Respond: CivicSetuResponse with citations + disclaimer
|
| 148 |
|
| 149 |
Output: {
|
|
|
|
| 160 |
|
| 161 |
## 7. Phase Roadmap
|
| 162 |
|
| 163 |
+
| Phase | Scope | Status |
|
| 164 |
+
|-------|------------------------------------------------|-----------------|
|
| 165 |
+
| 0 | RERA Act 2016, vector RAG, FastAPI | ✅ Complete |
|
| 166 |
+
| 1 | Neo4j graph, cross-reference queries | ✅ Complete |
|
| 167 |
+
| 2 | MahaRERA Rules 2017, multi-jurisdiction | ✅ Complete |
|
| 168 |
+
| 3 | DERIVED_FROM edges, conflict detection | Next |
|
| 169 |
+
| 4 | Multi-state expansion (UP, TN, Karnataka RERA) | Planned |
|
| 170 |
+
| 5 | Open-source SaaS, UI, public API | Planned |
|
|
|
|
| 171 |
|
| 172 |
---
|
| 173 |
|
| 174 |
## 8. Non-Functional Requirements
|
| 175 |
|
| 176 |
+
| Requirement | Target | Current Status |
|
| 177 |
+
|--------------------|--------------------------------------|---------------------------------|
|
| 178 |
+
| Response latency | < 10s per query | ~5-8s (local embedding) |
|
| 179 |
+
| Citation accuracy | 100% (never answer without citation) | Enforced by schema |
|
| 180 |
+
| Hallucination rate | < 5% | Validator node + confidence gate|
|
| 181 |
+
| Cost | $0 for dev/staging | All free tier |
|
| 182 |
+
| Portability | Runs on any machine with Docker | Docker Compose |
|
docs/LLD.md
CHANGED
|
@@ -11,47 +11,48 @@
|
|
| 11 |
|
| 12 |
src/civicsetu/
|
| 13 |
βββ config/
|
| 14 |
-
β
|
|
|
|
| 15 |
βββ models/
|
| 16 |
-
β βββ enums.py
|
| 17 |
-
β βββ schemas.py
|
| 18 |
βββ ingestion/
|
| 19 |
-
β βββ downloader.py
|
| 20 |
-
β βββ parser.py
|
| 21 |
-
β βββ chunker.py
|
| 22 |
β βββ metadata_extractor.py Date/reference/amendment regex extraction
|
| 23 |
-
β βββ embedder.py
|
| 24 |
-
β βββ pipeline.py
|
| 25 |
βββ stores/
|
| 26 |
-
β βββ relational_store.py
|
| 27 |
-
β βββ vector_store.py
|
| 28 |
-
β βββ graph_store.py
|
| 29 |
βββ retrieval/
|
| 30 |
-
β βββ vector_retriever.py
|
| 31 |
-
β βββ graph_retriever.py
|
| 32 |
-
β βββ reranker.py
|
| 33 |
βββ agent/
|
| 34 |
-
β βββ state.py
|
| 35 |
-
β βββ nodes.py
|
| 36 |
-
β β
|
| 37 |
-
β βββ edges.py
|
| 38 |
-
β β
|
| 39 |
-
β βββ graph.py
|
| 40 |
βββ prompts/
|
| 41 |
-
β βββ classifier.py
|
| 42 |
-
β βββ generator.py
|
| 43 |
-
β βββ validator.py
|
| 44 |
βββ guardrails/
|
| 45 |
-
β βββ input_guard.py
|
| 46 |
-
β βββ output_guard.py
|
| 47 |
βββ api/
|
| 48 |
-
βββ main.py
|
| 49 |
βββ routes/
|
| 50 |
-
β βββ health.py
|
| 51 |
-
β βββ query.py
|
| 52 |
-
β βββ ingest.py
|
| 53 |
βββ middleware/
|
| 54 |
-
βββ logging.py
|
| 55 |
|
| 56 |
```
|
| 57 |
|
|
@@ -201,25 +202,35 @@ validator β route_after_validator:
|
|
| 201 |
|
| 202 |
### Section Boundary Detection
|
| 203 |
|
| 204 |
-
|
|
|
|
|
|
|
| 205 |
|
| 206 |
```
|
|
|
|
| 207 |
```
|
| 208 |
|
| 209 |
-
|
|
|
|
|
|
|
| 210 |
|
| 211 |
```
|
|
|
|
| 212 |
```
|
| 213 |
|
| 214 |
-
Matches: `
|
|
|
|
|
|
|
|
|
|
| 215 |
|
| 216 |
### Chunk Size Limits
|
| 217 |
|
| 218 |
```
|
| 219 |
MIN_CHARS = 100 β discard fragments (headers, page numbers)
|
| 220 |
-
MAX_CHARS =
|
| 221 |
```
|
| 222 |
|
|
|
|
| 223 |
|
| 224 |
### Split Priority for Large Sections
|
| 225 |
|
|
@@ -253,6 +264,18 @@ Query time: "search_query: {query}" β embed_query()
|
|
| 253 |
Using wrong prefix at query time causes ~10-15% recall degradation.
|
| 254 |
The `embed_document()` / `embed_query()` method split enforces this at the API level.
|
| 255 |
|
| 256 |
---
|
| 257 |
|
| 258 |
## 6. Response Contract
|
|
@@ -316,3 +339,17 @@ If `citations` would be empty β return `InsufficientInfoResponse` instead.
|
|
| 316 |
- Any query with explicit section number (e.g. "Section 18") → cross_reference
|
| 317 |
- cross_reference + penalty_lookup + temporal → graph_retrieval node
|
| 318 |
- fact_lookup + conflict_detection → vector_retrieval node
|
|
| 11 |
|
| 12 |
src/civicsetu/
|
| 13 |
βββ config/
|
| 14 |
+
β βββ settings.py Pydantic BaseSettings singleton (lru_cache)
|
| 15 |
+
β βββ document_registry.py All document URLs + metadata (single source of truth)
|
| 16 |
βββ models/
|
| 17 |
+
β βββ enums.py StrEnum: Jurisdiction, DocType, QueryType, etc.
|
| 18 |
+
β βββ schemas.py Pydantic models: LegalChunk, Citation, CivicSetuResponse
|
| 19 |
βββ ingestion/
|
| 20 |
+
β βββ downloader.py httpx PDF downloader with MD5 cache check
|
| 21 |
+
β βββ parser.py PyMuPDF text extractor, scanned PDF detection
|
| 22 |
+
β βββ chunker.py Section-boundary regex chunker + fallback
|
| 23 |
β βββ metadata_extractor.py Date/reference/amendment regex extraction
|
| 24 |
+
β βββ embedder.py nomic-embed-text via Ollama (document + query prefixes)
|
| 25 |
+
β βββ pipeline.py Orchestrates all ingestion steps end-to-end
|
| 26 |
βββ stores/
|
| 27 |
+
β βββ relational_store.py Async SQLAlchemy β documents + legal_chunks tables
|
| 28 |
+
β βββ vector_store.py pgvector HNSW cosine search
|
| 29 |
+
β βββ graph_store.py Neo4j Cypher interface (Phase 1)
|
| 30 |
βββ retrieval/
|
| 31 |
+
β βββ vector_retriever.py Wraps VectorStore for agent use
|
| 32 |
+
β βββ graph_retriever.py Cypher query builder (Phase 1)
|
| 33 |
+
β βββ reranker.py FlashRank cross-encoder wrapper
|
| 34 |
βββ agent/
|
| 35 |
+
β βββ state.py CivicSetuState TypedDict (frozen contract)
|
| 36 |
+
β βββ nodes.py Pure functions: classifier, retrieval, reranker,
|
| 37 |
+
β β generator, validator
|
| 38 |
+
β βββ edges.py Conditional routing: route_after_classifier,
|
| 39 |
+
β β route_after_validator
|
| 40 |
+
β βββ graph.py StateGraph assembly + get_compiled_graph()
|
| 41 |
βββ prompts/
|
| 42 |
+
β βββ classifier.py Query type classification + rewriting prompt
|
| 43 |
+
β βββ generator.py Cited answer generation prompt
|
| 44 |
+
β βββ validator.py Hallucination + confidence check prompt
|
| 45 |
βββ guardrails/
|
| 46 |
+
β βββ input_guard.py PII detection + off-topic filter (Phase 1)
|
| 47 |
+
β βββ output_guard.py Faithfulness check + disclaimer injection (Phase 1)
|
| 48 |
βββ api/
|
| 49 |
+
βββ main.py FastAPI app factory + lifespan (graph pre-compiled)
|
| 50 |
βββ routes/
|
| 51 |
+
β βββ health.py GET /health β DB ping
|
| 52 |
+
β βββ query.py POST /api/v1/query β main RAG endpoint
|
| 53 |
+
β βββ ingest.py POST /api/v1/ingest β Phase 1 admin endpoint
|
| 54 |
βββ middleware/
|
| 55 |
+
βββ logging.py Request/response structured logging
|
| 56 |
|
| 57 |
```
|
| 58 |
|
|
|
|
| 202 |
|
| 203 |
### Section Boundary Detection
|
| 204 |
|
| 205 |
+
Two regex patterns to cover both document formats ingested:
|
| 206 |
+
|
| 207 |
+
**Act format** (RERA Act 2016):
|
| 208 |
|
| 209 |
```
|
| 210 |
+
^\s*(?P<id>\d+[A-Z]?)\.?\s*(?P<title>[A-Z][^\n—]{3,80})\.?—
|
| 211 |
```
|
| 212 |
|
| 213 |
+
Matches: `18. Return of amount and compensation.—`
|
| 214 |
+
|
| 215 |
+
**Rule format** (MahaRERA Rules 2017):
|
| 216 |
|
| 217 |
```
|
| 218 |
+
\n(?P<id>\d+)\.\s*\n(?P<title>[A-Z][^\n]{3,80})\n
|
| 219 |
```
|
| 220 |
|
| 221 |
+
Matches: `\n3.\nInformation to be furnished...\n`
|
| 222 |
+
|
| 223 |
+
Chunker tries Act pattern first; falls back to Rule pattern; falls back to paragraph
|
| 224 |
+
split if neither matches. Logs `chunking_fallback_used` on paragraph path.
|
| 225 |
|
| 226 |
### Chunk Size Limits
|
| 227 |
|
| 228 |
```
|
| 229 |
MIN_CHARS = 100   # discard fragments (headers, page numbers)
|
| 230 |
+
MAX_CHARS = 1500  # split large sections at subsection markers (1), (2), (a), (b)
|
| 231 |
```
|
| 232 |
|
| 233 |
+
Reduced from 2000 to 1500 to stay within the practical token window of nomic-embed-text.
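A rough sketch of how an oversized section might be split at subsection markers and repacked under `MAX_CHARS`. This is an illustration under stated assumptions, not the chunker's actual code: the function name `split_large_section` is hypothetical, and markers such as `(1)` or `(a)` are assumed to start a line.

```python
import re

MAX_CHARS = 1500  # value from the spec above

# Assumption: subsection markers such as (1), (2), (a), (b) begin a line.
SUBSECTION_MARKER = re.compile(r"(?m)^(?=\((?:\d+|[a-z])\)\s)")


def split_large_section(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split at subsection markers, then greedily pack pieces up to max_chars."""
    if len(text) <= max_chars:
        return [text]
    # Zero-width split keeps each marker attached to the text that follows it.
    pieces = [p for p in SUBSECTION_MARKER.split(text) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

A single subsection longer than `max_chars` is returned as-is by this sketch, so a downstream length check on embedder input is still needed.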
|
| 234 |
|
| 235 |
### Split Priority for Large Sections
|
| 236 |
|
|
|
|
| 264 |
Using wrong prefix at query time causes ~10-15% recall degradation.
|
| 265 |
The `embed_document()` / `embed_query()` method split enforces this at the API level.
|
| 266 |
|
| 267 |
+
### Truncation Guard
|
| 268 |
+
|
| 269 |
+
```python
|
| 270 |
+
MAX_EMBED_CHARS = 6000 # ~1500 tokens for nomic-embed-text
|
| 271 |
+
if len(text) > MAX_EMBED_CHARS:
|
| 272 |
+
    log.warning("embedding_truncated", original_len=len(text), truncated_to=MAX_EMBED_CHARS)
|
| 273 |
+
    text = text[:MAX_EMBED_CHARS]
|
| 274 |
+
```
|
| 275 |
+
|
| 276 |
+
Prevents silent API errors on oversized chunks. Expected to fire on 0-2 chunks per
|
| 277 |
+
document where subsection splitting fails (complex tables, long definition lists).
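As a self-contained illustration, the guard can be wrapped in a small function. This sketch swaps in the standard-library `logging` module, which does not accept the keyword-style fields used in the spec snippet; the function name `guard_embedding_input` and the logger name are assumptions.

```python
import logging

log = logging.getLogger("civicsetu.embedder")  # logger name is an assumption

MAX_EMBED_CHARS = 6000  # value from the spec above


def guard_embedding_input(text: str, max_chars: int = MAX_EMBED_CHARS) -> str:
    """Return text unchanged if short enough, otherwise truncate and warn."""
    if len(text) > max_chars:
        log.warning(
            "embedding_truncated original_len=%d truncated_to=%d",
            len(text),
            max_chars,
        )
        return text[:max_chars]
    return text
```

Returning the truncated string, instead of raising, keeps ingestion moving while the warning leaves a trace of which chunks lost text.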
|
| 278 |
+
|
| 279 |
---
|
| 280 |
|
| 281 |
## 6. Response Contract
|
|
|
|
| 339 |
- Any query with explicit section number (e.g. "Section 18") → cross_reference
|
| 340 |
- cross_reference + penalty_lookup + temporal → graph_retrieval node
|
| 341 |
- fact_lookup + conflict_detection → vector_retrieval node
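The routing rules above can be sketched as two small pure functions. This is a hedged illustration only: the enum below is a stand-in for the project's `QueryType`, the regex for "explicit section number" is an assumption, and although `route_after_classifier` is named in the module layout, its real signature is not shown in this document.

```python
import re
from enum import Enum


class QueryType(str, Enum):  # stand-in for the project's QueryType enum
    FACT_LOOKUP = "fact_lookup"
    CROSS_REFERENCE = "cross_reference"
    PENALTY_LOOKUP = "penalty_lookup"
    TEMPORAL = "temporal"
    CONFLICT_DETECTION = "conflict_detection"


GRAPH_TYPES = {QueryType.CROSS_REFERENCE, QueryType.PENALTY_LOOKUP, QueryType.TEMPORAL}

# Assumption: an explicit section number looks like "Section 18" or "section 18A".
SECTION_REF = re.compile(r"\bsection\s+\d+[A-Z]?\b", re.IGNORECASE)


def apply_section_override(query: str, query_type: QueryType) -> QueryType:
    """Force cross_reference when the query names a section explicitly."""
    return QueryType.CROSS_REFERENCE if SECTION_REF.search(query) else query_type


def route_after_classifier(query_type: QueryType) -> str:
    """Map a query type to the name of the next retrieval node."""
    return "graph_retrieval" if query_type in GRAPH_TYPES else "vector_retrieval"
```

Keeping the override separate from the routing table makes each rule testable on its own.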
|
| 342 |
+
|
| 343 |
+
## 8. Neo4j Graph: Phase 2 State
|
| 344 |
+
|
| 345 |
+
**Nodes seeded:** 2 Documents, 438 Section nodes
|
| 346 |
+
**Edges seeded:** 438 HAS_SECTION, 124 REFERENCES, 0 DERIVED_FROM (Phase 3)
|
| 347 |
+
|
| 348 |
+
**Documents in graph:**
|
| 349 |
+
- RERA Act 2016 (CENTRAL): 224 sections, 63 REFERENCES edges
|
| 350 |
+
- Maharashtra Real Estate Rules 2017 (MAHARASHTRA): 214 sections, 61 REFERENCES edges
|
| 351 |
+
|
| 352 |
+
**Known issue (Phase 3 backlog):**
|
| 353 |
+
Citation deduplication keys on `section_id` only, not `(section_id, doc_name)`.
|
| 354 |
+
Cross-doc queries may show duplicate section IDs from different documents.
|
| 355 |
+
Fix: update generator citation dedup to composite key.
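One possible shape for the composite-key fix, shown as a sketch. The `Citation` dataclass here is a simplified stand-in for the project's Pydantic model, and `dedupe_citations` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:  # simplified stand-in; only the two key fields are modelled
    section_id: str
    doc_name: str


def dedupe_citations(citations: list[Citation]) -> list[Citation]:
    """Drop repeats keyed on (section_id, doc_name), keeping first-seen order."""
    seen: set[tuple[str, str]] = set()
    unique: list[Citation] = []
    for citation in citations:
        key = (citation.section_id, citation.doc_name)
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique
```

With this key, Section 3 of the Act and Rule 3 of the Maharashtra Rules are treated as different citations, while true repeats within one document still collapse.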
|
docs/adr/004-multi-format-chunker.md
ADDED
|
@@ -0,0 +1,86 @@
|
|
| 1 |
+
# ADR 004: Multi-format Legal Document Chunker
|
| 2 |
+
|
| 3 |
+
**Date:** March 2026
|
| 4 |
+
**Status:** Accepted
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## Context
|
| 9 |
+
|
| 10 |
+
Phase 2 required ingesting Maharashtra Real Estate Rules 2017 alongside the RERA Act
|
| 11 |
+
2016. Both are Indian legal PDFs but use structurally different numbering formats:
|
| 12 |
+
|
| 13 |
+
**Act format (RERA Act 2016):**
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
18. Return of amount and compensation.—
|
| 17 |
+
(1) If the promoter fails to complete...
|
| 18 |
+
|
| 19 |
+
Section title and em-dash are on the same line as the section number.
|
| 20 |
+
|
| 21 |
+
**Rule format (Maharashtra Rules 2017):**
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
3.
|
| 25 |
+
|
| 26 |
+
Information to be furnished by the promoter...
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
Section number is on its own line, followed by a blank line, then the title.
|
| 30 |
+
|
| 31 |
+
The existing Act-format regex (`^\s*\d+[A-Z]?\.\s+[A-Z][^—\n]{3,80}\.?—`) produces
|
| 32 |
+
zero section boundaries on MahaRERA, triggering fallback paragraph chunking.
|
| 33 |
+
Paragraph chunking on MahaRERA produces 80+ chunks with no section_id metadata,
|
| 34 |
+
breaking citation accuracy entirely.
|
| 35 |
+
|
| 36 |
+
## Decision
|
| 37 |
+
|
| 38 |
+
Extend `LegalChunker` with a second boundary pattern for Rule format, applied as
|
| 39 |
+
a sequential fallback:
|
| 40 |
+
|
| 41 |
+
```python
|
| 42 |
+
PATTERNS = [
|
| 44 |
+
    ("act", r'^\s*(?P<id>\d+[A-Z]?)\.?\s*(?P<title>[A-Z][^\n—]{3,80})\.?—'),
|
| 47 |
+
    ("rule", r'\n(?P<id>\d+)\.\s*\n(?P<title>[A-Z][^\n]{3,80})\n'),
|
| 49 |
+
]
|
| 50 |
+
|
| 51 |
+
for name, pattern in PATTERNS:
|
| 52 |
+
    matches = list(re.finditer(pattern, text, re.MULTILINE))
|
| 53 |
+
    if len(matches) >= MIN_SECTIONS:
|
| 54 |
+
        log.info("chunker_pattern_selected", pattern=name, sections=len(matches))
|
| 55 |
+
        break
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
`MIN_SECTIONS = 5`: fewer than 5 matches is treated as noise, not real boundaries.
|
| 59 |
+
|
| 60 |
+
The chunker logs which pattern was selected per document. Paragraph fallback is only
|
| 61 |
+
reached if both patterns fail.
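A runnable sketch of the selection logic with the paragraph fallback made explicit. The function name `select_pattern` and the `"paragraph"` label are assumptions for illustration; the em-dash is written as `\u2014` so the pattern survives encoding round-trips.

```python
import re

MIN_SECTIONS = 5  # threshold from this ADR

PATTERNS = [
    ("act", r"^\s*(?P<id>\d+[A-Z]?)\.?\s*(?P<title>[A-Z][^\n\u2014]{3,80})\.?\u2014"),
    ("rule", r"\n(?P<id>\d+)\.\s*\n(?P<title>[A-Z][^\n]{3,80})\n"),
]


def select_pattern(text: str) -> tuple[str, list[re.Match]]:
    """Return the first pattern with enough matches, else the paragraph fallback."""
    for name, pattern in PATTERNS:
        matches = list(re.finditer(pattern, text, re.MULTILINE))
        if len(matches) >= MIN_SECTIONS:
            return name, matches
    # Neither pattern produced enough boundaries; caller falls back to paragraphs.
    return "paragraph", []
```

Returning the pattern name alongside the matches lets the caller log which format was detected, as the ADR requires.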
|
| 62 |
+
|
| 63 |
+
## Consequences
|
| 64 |
+
|
| 65 |
+
**Positive:**
|
| 66 |
+
|
| 67 |
+
- MahaRERA produces 214 meaningful chunks with proper section_id metadata (44 sections)
|
| 68 |
+
- Citation accuracy preserved: every chunk maps to an identifiable Rule number
|
| 69 |
+
- Pattern selection is logged, so it is observable, not silent
|
| 70 |
+
- Adding a third pattern (e.g. circular format) requires one array entry
|
| 71 |
+
|
| 72 |
+
**Negative:**
|
| 73 |
+
|
| 74 |
+
- Pattern priority is implicit: if a document accidentally matches Rule pattern first
|
| 75 |
+
with >= 5 hits, it bypasses Act pattern (mitigated by trying Act first)
|
| 76 |
+
- Regex fragility: PDFs with unusual whitespace will still hit fallback
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
## Alternatives Rejected
|
| 80 |
+
|
| 81 |
+
- **Hardcode document type in ingestion config:** Requires caller to know format ahead
|
| 82 |
+
of time; breaks the "any PDF URL" contract of the ingestion pipeline
|
| 83 |
+
- **ML-based section detector:** Overkill for deterministic numbered formats; adds
|
| 84 |
+
model dependency with no recall benefit on well-formatted government PDFs
|
| 85 |
+
- **Single universal regex:** No single pattern can match both `18. Title.—` and
|
| 86 |
+
`\n18.\n\nTitle\n` without catastrophic false positives
|
docs/adr/005-document-registry.md
ADDED
|
@@ -0,0 +1,80 @@
|
|
| 1 |
+
# ADR 005: Document Registry as Single Source of Truth
|
| 2 |
+
|
| 3 |
+
**Date:** March 2026
|
| 4 |
+
**Status:** Accepted
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## Context
|
| 9 |
+
|
| 10 |
+
Phase 2 introduced a second document. With two documents, ingestion scripts started
|
| 11 |
+
duplicating URL strings, jurisdiction values, and doc_name strings across:
|
| 12 |
+
|
| 13 |
+
- `scripts/ingest_phase0.py`
|
| 14 |
+
- `scripts/ingest_phase2.py`
|
| 15 |
+
- Tests
|
| 16 |
+
- Any future migration or re-ingestion scripts
|
| 17 |
+
|
| 18 |
+
A URL change (e.g. NAREDCO moves their PDF) would require grep-and-replace across
|
| 19 |
+
multiple files with no compile-time safety.
|
| 20 |
+
|
| 21 |
+
## Decision
|
| 22 |
+
|
| 23 |
+
Introduce `src/civicsetu/config/document_registry.py` as the single authoritative
|
| 24 |
+
source for all document metadata:
|
| 25 |
+
|
| 26 |
+
```python
|
| 27 |
+
@dataclass(frozen=True)
|
| 28 |
+
class DocumentSpec:
|
| 29 |
+
    name: str
|
| 30 |
+
    url: str
|
| 31 |
+
    jurisdiction: Jurisdiction
|
| 32 |
+
    doc_type: DocType
|
| 33 |
+
    effective_date: date | None = None
|
| 34 |
+
|
| 35 |
+
DOCUMENT_REGISTRY: dict[str, DocumentSpec] = {
|
| 36 |
+
    "rera_act_2016": DocumentSpec(
|
| 37 |
+
        name="RERA Act 2016",
|
| 38 |
+
        url="https://...",
|
| 39 |
+
        jurisdiction=Jurisdiction.CENTRAL,
|
| 40 |
+
        doc_type=DocType.ACT,
|
| 41 |
+
        effective_date=date(2016, 5, 26),
|
| 42 |
+
    ),
|
| 43 |
+
    "mahrera_rules_2017": DocumentSpec(
|
| 44 |
+
        name="Maharashtra Real Estate (Regulation and Development) Rules 2017",
|
| 45 |
+
        url="https://naredco.in/...",
|
| 46 |
+
        jurisdiction=Jurisdiction.MAHARASHTRA,
|
| 47 |
+
        doc_type=DocType.RULES,
|
| 48 |
+
        effective_date=date(2017, 4, 21),
|
| 49 |
+
    ),
|
| 50 |
+
}
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
All ingestion scripts import from `document_registry`. No URL strings appear outside
|
| 54 |
+
this file.
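A sketch of how an ingestion script might look up a spec by key. The enums are minimal stand-ins for the project's `Jurisdiction` and `DocType` (their values are assumed), `get_document_spec` is a hypothetical helper, and the URL stays elided exactly as in the ADR.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date
from enum import Enum
from typing import Optional


class Jurisdiction(str, Enum):  # stand-in; values are assumptions
    CENTRAL = "CENTRAL"
    MAHARASHTRA = "MAHARASHTRA"


class DocType(str, Enum):  # stand-in; values are assumptions
    ACT = "ACT"
    RULES = "RULES"


@dataclass(frozen=True)
class DocumentSpec:
    name: str
    url: str
    jurisdiction: Jurisdiction
    doc_type: DocType
    effective_date: Optional[date] = None


DOCUMENT_REGISTRY: dict[str, DocumentSpec] = {
    "rera_act_2016": DocumentSpec(
        name="RERA Act 2016",
        url="https://...",  # elided in the ADR; left as-is
        jurisdiction=Jurisdiction.CENTRAL,
        doc_type=DocType.ACT,
        effective_date=date(2016, 5, 26),
    ),
}


def get_document_spec(key: str) -> DocumentSpec:
    """Look up a spec, failing loudly with the list of known keys."""
    try:
        return DOCUMENT_REGISTRY[key]
    except KeyError:
        known = ", ".join(sorted(DOCUMENT_REGISTRY))
        raise KeyError(f"unknown document {key!r}; known keys: {known}") from None
```

A lookup helper of this kind turns a mistyped key into an immediate, descriptive error at script start, before any download is attempted.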
|
| 55 |
+
|
| 56 |
+
## Consequences
|
| 57 |
+
|
| 58 |
+
**Positive:**
|
| 59 |
+
|
| 60 |
+
- URL change = one-line edit, guaranteed to propagate everywhere
|
| 61 |
+
- `DocumentSpec` is a frozen dataclass: immutable, hashable, diffable in git
|
| 62 |
+
- Phase 4 (multi-state expansion) is a registry append, not a script rewrite
|
| 63 |
+
- Tests can iterate `DOCUMENT_REGISTRY.values()` for fixture generation
|
| 64 |
+
|
| 65 |
+
**Negative:**
|
| 66 |
+
|
| 67 |
+
- Adding a document requires a code change + deploy (not a DB insert)
|
| 68 |
+
- Acceptable for Phase 0-3 volume (~10 documents); revisit for Phase 4+
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
## Alternatives Rejected
|
| 72 |
+
|
| 73 |
+
- **Database table for document registry:** Correct long-term, premature for current
|
| 74 |
+
volume. Adds a DB round-trip to every ingestion bootstrap.
|
| 75 |
+
- **Environment variables per document:** Unscalable beyond 2-3 documents;
|
| 76 |
+
no structure, no type safety
|
| 77 |
+
- **YAML/TOML config file:** Adds a parsing layer with no type safety; dataclass
|
| 78 |
+
achieves the same with Python's own type checker
|
| 79 |
+
|
| 80 |
+
|