# Architecture

A LangGraph-native Document Intelligence platform. This document goes beyond
the README: it covers design decisions, the subgraph hierarchy, state
design, and the anti-hallucination stack.

## 1. High-level architecture

### 4 compiled LangGraph artifacts

The system is organized around four graphs sharing a common `AsyncSqliteSaver`
checkpointer:

| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload |
| 2 | `chat_graph` | `app.ask(question)` | chat tab |
| 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button |
| 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button |

Chat tools read from the persisted pipeline state; they do not re-read the
files. They access the in-memory `ChatToolContext`, which holds the
HybridStore and a snapshot of the documents.

### Pipeline graph topology

```
START
  → start_timer
  → dispatch_ingest          (Send API: per-doc fan-out)
  → ingest_per_doc           (PDF/DOCX/PNG/TXT loader subgraph)
  → ingest_join              (fan-in)
  → dispatch_classify        (Send API)
  → classify_per_doc         (regex/keyword classifier in dummy mode;
                              vision-aware in vLLM mode)
  → classify_join
  → dispatch_extract         (Send API)
  → extract_per_doc          (regex extractor in dummy mode +
                              flatten_universal; structured LLM in vLLM mode)
  → extract_join
  → quote_validator          (anti-hallucination layer #6)
  → dispatch_rag_index       (Send API)
  → rag_index_per_doc        (chunker + batched embed + Chroma+BM25 upsert)
  → rag_join
  → compare_node             (three-way matching, sync)
  → risk_subgraph            (basic + 14 domain × Send + plausibility +
                              LLM ensemble + duplicate)
  → finish_timer
  → report_node              (10-section JSON structure)
  → END
```

The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
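The dispatch/join contract can be sketched without LangGraph: a dispatcher emits one payload per document, workers return partial per-document results, and a join merges them back into state keyed by `file_name`. This is a minimal plain-Python sketch of those semantics only; the real pipeline uses LangGraph's `Send` API and runs the workers concurrently, and all names below are illustrative.

```python
# Plain-Python sketch of the Send-style fan-out/fan-in semantics.
# Illustrative names; the real graph uses LangGraph's Send API.
def dispatch(state: dict) -> list[dict]:
    """Fan-out: one payload per uploaded file."""
    return [{"file_name": name, "data": data} for name, data in state["files"]]

def ingest_per_doc(payload: dict) -> dict:
    """Worker: returns a partial per-document result."""
    return {"file_name": payload["file_name"], "size": len(payload["data"])}

def join(partials: list[dict]) -> dict:
    """Fan-in: collect partial results into a documents map keyed by file_name."""
    return {"documents": {p["file_name"]: p for p in partials}}

state = {"files": [("a.pdf", b"%PDF"), ("b.txt", b"hello")]}
merged = join([ingest_per_doc(p) for p in dispatch(state)])
```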

### Risk subgraph topology

```
risk_subgraph (input: PipelineState):
  → basic_risk_dispatch         (Send: per-doc basic risk)
  → basic_risk / noop_basic
  → domain_dispatch_node        (Send: per-doc × per-applicable-check, ~30 parallel)
  → apply_domain_check
  → [if llm provided] llm_risk_dispatch  (Send: per-doc LLM risk + 3-filter chain)
  → llm_risk_per_doc / noop_llm
  → plausibility_dispatch       (Send: per-doc plausibility)
  → plausibility / noop_plaus
  → evidence_score_node         (per-doc info)
  → duplicate_detector_node     (package-level, sync, ISA 240)
END
```

The LLM-side portion of the anti-hallucination stack runs inside `llm_risk_per_doc`:
`llm_risk → filter_llm_risks → drop_business_normal → drop_repeats`.
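The final dedup step of that chain can be sketched with stdlib tooling: keep an LLM-generated risk only if no basic risk is textually similar above the 70% threshold. The function name mirrors the one in the text, but the body is an assumption (the real similarity metric is not shown in this document); `difflib.SequenceMatcher` stands in for it here.

```python
from difflib import SequenceMatcher

def drop_repeats_of_basic(llm_risks: list[str], basic_risks: list[str],
                          threshold: float = 0.70) -> list[str]:
    """Drop LLM risks that textually repeat an already-reported basic risk.

    Sketch only: the real implementation's similarity metric is assumed;
    here an LLM risk survives if every basic risk is < `threshold` similar.
    """
    def similar(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    return [r for r in llm_risks
            if all(similar(r, b) < threshold for b in basic_risks)]

kept = drop_repeats_of_basic(
    ["Invoice total does not match the contract amount.",
     "Counterparty is missing a tax number."],
    ["Invoice total does not match the contract amount"],
)
```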

### DD multi-agent supervisor graph

```
dd_graph:
  START
  → contract_filter_node      (keep only contract-type docs)
  → per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
  → supervisor_node           (LLM router or heuristic; Command(goto=...))
        ├─ → audit_specialist      (pricing anomalies, overcharging)
        ├─ → legal_specialist      (red flags, change-of-control, non-compete)
        ├─ → compliance_specialist (GDPR, AML, data protection)
        └─ → financial_specialist  (monthly obligations, expirations)
  ↺ (loops back to supervisor up to dd_supervisor_max_iterations)
  → dd_synthesizer            (one LLM call: executive_summary +
                               top_red_flags + per-contract risk_level rating)
  END
```

### Package insights graph

A simple single-LLM-call graph: it ingests the full document package and
produces cross-document findings using a perspective-driven prompt
(`audit | dd | compliance | general`).

## 2. State design

### `PipelineState` (TypedDict)

Read-mostly fields with **reducer-driven Send fan-in**:

- `files: list[tuple[str, bytes]]` – raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]` –
  per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` – dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` – Streamlit progress feed
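The reducer contract behind `merge_doc_results` can be sketched as follows. The real reducer is not shown in this document; this illustrates only the stated behavior (concurrent per-doc updates merged field-by-field, keyed by `file_name`), with later non-`None` fields winning as an assumed tie-break.

```python
# Sketch of a field-level merge reducer for the Send fan-in. The real
# merge_doc_results is not reproduced here; this shows only its contract:
# partial per-doc updates merge by file_name, non-None fields winning.
def merge_doc_results(existing: list[dict], updates: list[dict]) -> list[dict]:
    by_name = {d["file_name"]: dict(d) for d in existing}
    for upd in updates:
        doc = by_name.setdefault(upd["file_name"], {})
        for field, value in upd.items():
            if value is not None:
                doc[field] = value
    return list(by_name.values())

merged = merge_doc_results(
    [{"file_name": "a.pdf", "doc_type": None}],
    [{"file_name": "a.pdf", "doc_type": "invoice"}],
)
```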

### `Risk` (Pydantic)

The single risk type used everywhere:

- `description: str`
- `severity: str` (`"high" | "medium" | "low" | "info"`)
- `rationale: str`
- `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`)
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
- `affected_document: str | None`
- `source_check_id: str | None`

## 3. Anti-hallucination stack (5+1 layers)

1. **`temperature=0`** – every LLM call is as deterministic as the backend allows.
2. **`_quotes` schema field** – verbatim source citations.
3. **`_confidence` schema field** – per-field reliability (`high | medium | low`).
4. **`validate_plausibility()`** – deterministic Python plausibility checks
   (negative VAT, non-standard rates, future dates, etc.).
5. **3-filter LLM risk pipeline** –
   `filter_llm_risks` (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact)
   → `drop_business_normal_risks` (semantic: cross-check vs `extracted_data`,
   6 known false-positive patterns)
   → `drop_repeats_of_basic` (textual dedup vs basic risks, 70% threshold).
6. **Quote validator** – a final cross-check that every `_quotes` entry
   actually appears in the source `full_text` (whitespace-, diacritic-, and
   case-normalized). Invalid quotes downgrade the field's confidence.
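The quote-validator normalization (layer 6) can be sketched with the stdlib: decompose to NFKD, drop combining marks, lowercase, and collapse whitespace, then check substring containment. The helper names are illustrative; the actual validator code is not shown in this document.

```python
import unicodedata

def normalize(text: str) -> str:
    """Whitespace-, diacritic-, and case-insensitive form used for matching.

    Sketch of the normalization described above; real helper names may differ.
    """
    # Strip diacritics: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and collapse all whitespace runs to single spaces.
    return " ".join(stripped.lower().split())

def quote_appears(quote: str, full_text: str) -> bool:
    """True if the normalized quote is a substring of the normalized source."""
    return normalize(quote) in normalize(full_text)
```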

## 4. Domain checks (14 deterministic rules)

| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|-----------|--------------|------------|
| 01 | `check_01_invoice_mandatory` | HU VAT Act §169 | yes | invoice |
| 02 | `check_02_tax_cdv` | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract |
| 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract |
| 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) |
| 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report |
| 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract |
| 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract |
| 10 | `check_10_incoterms` | Incoterms 2020 | no | contract |
| 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report |
| 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | `check_14_contract_dates` | Contract date best practice | no | contract |

The dispatch in `domain_dispatch_node` skips `check_06` and `check_12` (they
have separate entry points) and filters out `is_hu_specific=True` checks for
non-HU documents.
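That dispatch filter can be sketched directly. The check metadata below is abbreviated from the table above, and the field names (`applies_to`, `is_hu_specific`) are illustrative assumptions about how the real check registry is shaped.

```python
# Sketch of the applicability filter in domain_dispatch_node. Check metadata
# is abbreviated from the table above; field names are assumptions.
CHECKS = [
    {"check_id": "check_01_invoice_mandatory", "is_hu_specific": True,
     "applies_to": {"invoice"}},
    {"check_id": "check_03_contract_completeness", "is_hu_specific": False,
     "applies_to": {"contract"}},
    {"check_id": "check_06_evidence_score", "is_hu_specific": False,
     "applies_to": set()},  # has its own entry point, never dispatched here
]
SEPARATE_ENTRY = {"check_06_evidence_score", "check_12_duplicate_invoice"}

def applicable_checks(doc_type: str, is_hu: bool) -> list[str]:
    """One Send per (doc, check) pair that survives all three filters."""
    return [c["check_id"] for c in CHECKS
            if c["check_id"] not in SEPARATE_ENTRY
            and doc_type in c["applies_to"]
            and (is_hu or not c["is_hu_specific"])]
```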

## 5. Provider system

Three providers via `configurable_alternatives`:

- **`vllm`** – `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the
  AMD MI300X vLLM endpoint. Production default.
- **`ollama`** – `ChatOllama` against a local Ollama daemon (Qwen 2.5 7B
  Instruct). Development fallback.
- **`dummy`** – `DummyChatModel` (deterministic stub, no network). Used for
  CI, evals, and load testing.

Provider selection is **runtime-switchable** without restart:

```python
graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})
```
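The switching mechanism can be sketched without LangChain as a registry consulted per call. The real code uses LangChain's `configurable_alternatives`; the stub "models" and the `invoke` helper below are illustrative, keeping only the `config["configurable"]["llm_profile"]` shape from the snippet above.

```python
# Plain-Python sketch of per-call provider switching. The real code uses
# LangChain's configurable_alternatives; these stub "models" are illustrative.
PROVIDERS = {
    "vllm": lambda prompt: f"[vllm] {prompt}",
    "ollama": lambda prompt: f"[ollama] {prompt}",
    "dummy": lambda prompt: "[dummy] fixed deterministic reply",
}

def invoke(prompt: str, config: dict) -> str:
    """Pick the provider from the per-call config, defaulting to vllm."""
    profile = config.get("configurable", {}).get("llm_profile", "vllm")
    return PROVIDERS[profile](prompt)
```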

## 6. Embedding

`BAAI/bge-m3` (2.27 GB, 1024-dim, multilingual) by default.
sentence-transformers loads it on first use via an `@lru_cache`-wrapped loader.
The model is pre-downloaded at Docker build time, so runtime needs no network call.
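The lazy-loading pattern is simple to sketch: cache the loader so the heavy model is constructed once and every later call reuses the same instance. The real loader returns a sentence-transformers model; it is stubbed here to stay dependency-free.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_embedder(model_name: str = "BAAI/bge-m3"):
    """Load the embedding model once; later calls reuse the cached instance.

    Sketch of the @lru_cache pattern described above. The real loader
    returns SentenceTransformer(model_name); a dict stands in for it here.
    """
    return {"model": model_name, "dim": 1024}
```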

## 7. Hybrid retrieval (Chroma + BM25)

`store/hybrid_store.py` runs vector search and BM25 in parallel and merges
with Reciprocal Rank Fusion (RRF). The chunker uses natural break points
(paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char
overlap.
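Reciprocal Rank Fusion scores each document `1 / (k + rank)` in every ranking it appears in and sums the contributions, so items ranked well by both retrievers float to the top. A minimal sketch over two ranked ID lists (the store's actual `k` and weighting are not stated in this document; `k=60` is the conventional constant):

```python
# Reciprocal Rank Fusion over two ranked ID lists, as used to merge the
# vector and BM25 result sets. k=60 is the conventional constant; the
# HybridStore's actual parameters are not shown in this document.
def rrf_merge(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1/(k + rank); sums accumulate.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(["a", "b", "c"], ["b", "d"])
# "b" wins: it appears in both rankings, so its contributions sum.
```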

## 8. Async-first runtime

LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer
on a long-lived background event loop (the `AsyncRuntime` singleton in
`app/async_runtime.py`). This keeps the ChromaDB connection, the
Anthropic / OpenAI HTTP session, and the `AsyncSqliteSaver` SQLite pool
persistent across user interactions; they are not rebuilt per request.
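The singleton pattern can be sketched with the stdlib: one daemon thread owns the event loop, and synchronous Streamlit callbacks submit coroutines to it via `run_coroutine_threadsafe`. This is an assumption about the shape of `app/async_runtime.py`, not its actual contents.

```python
import asyncio
import threading

class AsyncRuntime:
    """Long-lived background event loop, sketching the pattern described above.

    Assumed shape, not the actual app/async_runtime.py: one daemon thread
    owns the loop, and sync code submits coroutines to it.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._start()
        return cls._instance

    def _start(self) -> None:
        # The loop lives in a daemon thread for the process's lifetime.
        self.loop = asyncio.new_event_loop()
        thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        thread.start()

    def run(self, coro):
        """Run a coroutine on the background loop from synchronous code."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```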

## 9. Multilingual support

The codebase is English-first but multilingual-tolerant:

- The classifier matches HU/EN/DE keyword patterns.
- Risk filters tolerate HU/DE business terms.
- The OCR layer keeps `eng + hun + deu` as Tesseract languages.
- Demo data may include mixed-language documents.

The output (UI, exec summary, DOCX report) is **always English**.