# Architecture
A LangGraph-native Document Intelligence platform. This document goes beyond
the README — it covers design decisions, the subgraph hierarchy, state
design, and the anti-hallucination stack.
## 1. High-level architecture
### 4 compiled LangGraph artifacts
The system is organized around four graphs sharing a common `AsyncSqliteSaver`
checkpointer:
| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload |
| 2 | `chat_graph` | `app.ask(question)` | chat tab |
| 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button |
| 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button |
Chat tools read from the persisted pipeline state — they do not re-read
files. They access the in-memory `ChatToolContext`, which holds the
HybridStore and a documents snapshot.
### Pipeline graph topology
```
START
→ start_timer
→ dispatch_ingest (Send API: per-doc fan-out)
→ ingest_per_doc (PDF/DOCX/PNG/TXT loader subgraph)
→ ingest_join (fan-in)
→ dispatch_classify (Send API)
→ classify_per_doc (regex/keyword classifier in dummy mode;
vision-aware in vLLM mode)
→ classify_join
→ dispatch_extract (Send API)
→ extract_per_doc (regex extractor in dummy mode +
flatten_universal; structured LLM in vLLM mode)
→ extract_join
  → quote_validator (anti-hallucination layer #6)
→ dispatch_rag_index (Send API)
→ rag_index_per_doc (chunker + batched embed + Chroma+BM25 upsert)
→ rag_join
→ compare_node (three-way matching, sync)
→ risk_subgraph (basic + 14 domain × Send + plausibility +
LLM ensemble + duplicate)
→ finish_timer
→ report_node (10-section JSON structure)
→ END
```
The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
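The dispatch pattern can be sketched as follows. Node and field names are taken from the topology above, but `Send` is stubbed as a named tuple here so the shape is visible without LangGraph installed (the real code returns `langgraph.types.Send` objects):

```python
from typing import NamedTuple

class Send(NamedTuple):
    # Stand-in for langgraph.types.Send: (target node, per-branch input)
    node: str
    arg: dict

def dispatch_ingest(state: dict) -> list[Send]:
    # One Send per uploaded file; LangGraph runs ingest_per_doc once per
    # Send, concurrently, then ingest_join fans the results back in.
    return [
        Send("ingest_per_doc", {"file_name": name, "raw": data})
        for name, data in state["files"]
    ]

sends = dispatch_ingest({"files": [("a.pdf", b"%PDF"), ("b.docx", b"PK")]})
```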
### Risk subgraph topology
```
risk_subgraph (input: PipelineState):
→ basic_risk_dispatch (Send: per-doc basic risk)
→ basic_risk / noop_basic
→ domain_dispatch_node (Send: per-doc × per-applicable-check, ~30 parallel)
→ apply_domain_check
→ [if llm provided] llm_risk_dispatch (Send: per-doc LLM risk + 3-filter chain)
→ llm_risk_per_doc / noop_llm
→ plausibility_dispatch (Send: per-doc plausibility)
→ plausibility / noop_plaus
→ evidence_score_node (per-doc info)
→ duplicate_detector_node (package-level, sync, ISA 240)
END
```
Inside `llm_risk_per_doc`, the LLM call is followed by the 3-filter
anti-hallucination chain (layer 5 of the 5+1 stack):
`llm_risk → filter_llm_risks → drop_business_normal → drop_repeats`.
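The chain is ordinary function composition over a risk list. A sketch with stubbed filters — the real gates are richer, and the thresholds here are illustrative only:

```python
from functools import reduce

def chain(*filters):
    """Apply each filter in order; each takes and returns a list of risks."""
    def run(risks):
        return reduce(lambda rs, f: f(rs), filters, risks)
    return run

# Hypothetical stand-ins for the real filters:
def filter_llm_risks(risks):
    # formal gate (sketch): keep risks whose description has >= 5 words
    return [r for r in risks if len(r["description"].split()) >= 5]

def drop_business_normal(risks):
    # semantic gate (sketch): drop a known false-positive pattern
    return [r for r in risks
            if "standard payment terms" not in r["description"].lower()]

def drop_repeats(risks):
    # textual dedup, stubbed as exact match (the real check uses a 70%
    # similarity threshold against basic risks)
    seen, out = set(), []
    for r in risks:
        key = r["description"].lower()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

llm_risk_filter = chain(filter_llm_risks, drop_business_normal, drop_repeats)
```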
### DD multi-agent supervisor graph
```
dd_graph:
START
→ contract_filter_node (keep only contract-type docs)
→ per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
→ supervisor_node (LLM router or heuristic; Command(goto=...))
├─ → audit_specialist (pricing anomalies, overcharging)
├─ → legal_specialist (red flags, change-of-control, non-compete)
├─ → compliance_specialist (GDPR, AML, data protection)
└─ → financial_specialist (monthly obligations, expirations)
↺ (loops back to supervisor up to dd_supervisor_max_iterations)
→ dd_synthesizer (one LLM call: executive_summary +
top_red_flags + per-contract risk_level rating)
END
```
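The heuristic fallback router can be sketched like this. Specialist names come from the graph above; the real supervisor returns `Command(goto=...)` and may use an LLM rather than keyword matching, so the routing conditions below are purely illustrative:

```python
def route(summary: str, iteration: int, max_iterations: int = 3) -> str:
    # Cap reached: hand off to the synthesizer and exit the loop.
    if iteration >= max_iterations:
        return "dd_synthesizer"
    text = summary.lower()
    if "gdpr" in text or "aml" in text:
        return "compliance_specialist"
    if "change of control" in text or "non-compete" in text:
        return "legal_specialist"
    if "overcharg" in text or "pricing" in text:
        return "audit_specialist"
    # Default: monthly obligations / expirations.
    return "financial_specialist"
```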
### Package insights graph
A simple 1-LLM-call graph: ingests the full document package and produces
cross-doc findings using a perspective-driven prompt
(`audit | dd | compliance | general`).
## 2. State design
### `PipelineState` (TypedDict)
Read-mostly fields with **reducer-driven Send fan-in**:
- `files: list[tuple[str, bytes]]` — raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]` —
per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` — dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` — Streamlit progress feed
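A sketch of the reducer-driven fan-in, with `merge_doc_results` stubbed over plain dicts. The real reducer operates on `ProcessedDocument`; the merge policy assumed here (later non-None fields win, keyed by `file_name`) is an illustration of the idea, not the project's exact code:

```python
from operator import add
from typing import Annotated, TypedDict

def merge_doc_results(left: list[dict], right: list[dict]) -> list[dict]:
    """Field-level merge keyed by file_name: parallel branches each return
    a partial document; non-None fields from the right side win."""
    merged = {d["file_name"]: dict(d) for d in left}
    for d in right:
        tgt = merged.setdefault(d["file_name"], {})
        tgt.update({k: v for k, v in d.items() if v is not None})
    return list(merged.values())

class PipelineState(TypedDict, total=False):
    files: list[tuple[str, bytes]]
    # LangGraph calls the reducer to fan Send results back into one list.
    documents: Annotated[list[dict], merge_doc_results]
    progress_events: Annotated[list[str], add]
```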
### `Risk` (Pydantic)
The single risk type used everywhere:
- `description: str`
- `severity: str` (`"high" | "medium" | "low" | "info"`)
- `rationale: str`
- `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`)
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
- `affected_document: str | None`
- `source_check_id: str | None`
## 3. Anti-hallucination stack (5+1 layers)
1. **`temperature=0`** — greedy decoding, so every LLM call is near-deterministic.
2. **`_quotes` schema field** — verbatim source citations.
3. **`_confidence` schema field** — per-field reliability (high|medium|low).
4. **`validate_plausibility()`** — Python deterministic plausibility checks
(negative VAT, non-standard rates, future dates, etc.).
5. **3-filter LLM risk pipeline** —
`filter_llm_risks` (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact)
→ `drop_business_normal_risks` (semantic: cross-check vs extracted_data,
6 known false-positive patterns)
→ `drop_repeats_of_basic` (textual dedup vs basic risks, 70% threshold).
6. **Quote validator** — final cross-check that every `_quotes` entry
actually appears in the source `full_text` (whitespace + diacritic +
case normalized). If invalid, downgrades confidence.
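The normalization the quote validator is described as applying (whitespace collapse, diacritic stripping, case folding) can be sketched like this — an illustration, not the project's actual helper:

```python
import unicodedata

def _normalize(s: str) -> str:
    # Collapse runs of whitespace, strip combining diacritics, casefold.
    s = " ".join(s.split())
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    return s.casefold()

def quote_appears(quote: str, full_text: str) -> bool:
    # A _quotes entry is valid only if it survives in the source text.
    return _normalize(quote) in _normalize(full_text)
```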
## 4. Domain checks (14 deterministic rules)
| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|-----------|--------------|------------|
| 01 | `check_01_invoice_mandatory` | HU VAT Act §169 | yes | invoice |
| 02 | `check_02_tax_cdv` | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract |
| 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract |
| 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) |
| 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report |
| 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract |
| 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract |
| 10 | `check_10_incoterms` | Incoterms 2020 | no | contract |
| 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report |
| 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | `check_14_contract_dates` | Contract date best practice | no | contract |
The dispatch in `domain_dispatch_node` skips `check_06` and `check_12` (they
have separate entry points) and filters out `is_hu_specific=True` checks for
non-HU documents.
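The applicability filter can be sketched as follows (check-table rows abbreviated; field names are assumptions, not the project's actual schema):

```python
# Checks with their own entry points are excluded from the fan-out.
SEPARATE_ENTRY = {"check_06_evidence_score", "check_12_duplicate_invoice"}

def applicable_checks(checks: list[dict], doc_type: str,
                      is_hu_doc: bool) -> list[str]:
    out = []
    for c in checks:
        if c["check_id"] in SEPARATE_ENTRY:
            continue                      # dispatched elsewhere
        if c["is_hu_specific"] and not is_hu_doc:
            continue                      # HU-only rule, non-HU document
        if doc_type in c["applies_to"]:
            out.append(c["check_id"])
    return out
```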
## 5. Provider system
Three providers via `configurable_alternatives`:
- **`vllm`** — `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the
AMD MI300X vLLM endpoint. Production default.
- **`ollama`** — `ChatOllama` with a local Ollama daemon (Qwen 2.5 7B
Instruct). Development fallback.
- **`dummy`** — `DummyChatModel` (deterministic stub, no network).
  Used for CI, evals, and load tests.
Provider selection is **runtime-switchable** without restart:
```python
graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})
```
## 6. Embedding
`BAAI/bge-m3` (2.27 GB, 1024 dim, multilingual) by default.
Sentence-transformers loads it on first call via `@lru_cache`.
Pre-downloaded at Docker build time so runtime has no network call.
## 7. Hybrid retrieval (Chroma + BM25)
`store/hybrid_store.py` runs vector search and BM25 in parallel and merges
with Reciprocal Rank Fusion (RRF). The chunker uses natural break points
(paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char
overlap.
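RRF itself is only a few lines. A generic sketch — `k=60` is the conventional constant from the RRF literature, not necessarily what `hybrid_store.py` uses:

```python
def rrf_merge(vector_hits: list[str], bm25_hits: list[str],
              k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; documents found by both rankers rise.
    return sorted(scores, key=scores.get, reverse=True)
```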
## 8. Async-first runtime
LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer
on a long-lived background event loop (`app/async_runtime.py`'s `AsyncRuntime`
singleton). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP
session, and the `AsyncSqliteSaver` SQLite pool persistent across user
interactions — they do not rebuild per request.
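The general pattern looks like this — a sketch of the technique, not the actual `app/async_runtime.py` code:

```python
import asyncio
import threading

class AsyncRuntime:
    """Long-lived background event loop: coroutines submitted from
    Streamlit's synchronous script thread all run on one loop, so
    loop-bound resources (HTTP sessions, DB pools) survive reruns."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        t = threading.Thread(target=self._loop.run_forever, daemon=True)
        t.start()

    def run(self, coro):
        # Thread-safe submission; blocks the caller until the result is ready.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
```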
## 9. Multilingual support
The codebase is English-first but multilingual-tolerant:
- The classifier matches HU/EN/DE keyword patterns.
- Risk filters tolerate HU/DE business terms.
- The OCR layer keeps `eng + hun + deu` as Tesseract languages.
- Demo data may include mixed-language documents.
The output (UI, exec summary, DOCX report) is **always English**.