| # Architecture |
|
|
This is the architecture reference for the LangGraph-native Document
Intelligence platform. It goes beyond the README: it covers design
decisions, the subgraph hierarchy, state design, and the anti-hallucination
stack.
|
|
| ## 1. High-level architecture |
|
|
| ### 4 compiled LangGraph artifacts |
|
|
| The system is organized around four graphs sharing a common `AsyncSqliteSaver` |
| checkpointer: |
|
|
| | # | Graph | Entry point | When | |
| |---|-------|-------------|------| |
| | 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload | |
| | 2 | `chat_graph` | `app.ask(question)` | chat tab | |
| | 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button | |
| | 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button | |
|
|
Chat tools read from the persisted pipeline state; they do not re-read
files. They access the in-memory `ChatToolContext`, which holds the
HybridStore and a documents snapshot.
|
|
| ### Pipeline graph topology |
|
|
| ``` |
START
  ↓ start_timer
  ↓ dispatch_ingest (Send API: per-doc fan-out)
  ↓ ingest_per_doc (PDF/DOCX/PNG/TXT loader subgraph)
  ↓ ingest_join (fan-in)
  ↓ dispatch_classify (Send API)
  ↓ classify_per_doc (regex/keyword classifier in dummy mode;
                      vision-aware in vLLM mode)
  ↓ classify_join
  ↓ dispatch_extract (Send API)
  ↓ extract_per_doc (regex extractor in dummy mode +
                     flatten_universal; structured LLM in vLLM mode)
  ↓ extract_join
  ↓ quote_validator (anti-hallucination layer #6)
  ↓ dispatch_rag_index (Send API)
  ↓ rag_index_per_doc (chunker + batched embed + Chroma+BM25 upsert)
  ↓ rag_join
  ↓ compare_node (three-way matching, sync)
  ↓ risk_subgraph (basic + 14 domain × Send + plausibility +
                   LLM ensemble + duplicate)
  ↓ finish_timer
  ↓ report_node (10-section JSON structure)
  ↓ END
| ``` |
|
|
The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
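The dispatch step can be sketched as a plain function. In the real graph each returned pair is a `langgraph.types.Send` object, which LangGraph executes in parallel; the payload keys here are illustrative:

```python
def dispatch_ingest(state: dict) -> list[tuple[str, dict]]:
    """Fan out one ingest task per uploaded file.

    Returns (node_name, payload) pairs; the real dispatch node returns
    Send(node_name, payload) objects, one per document, so all
    ingest_per_doc runs happen concurrently before ingest_join fans in.
    """
    return [
        ("ingest_per_doc", {"file_name": name, "data": blob})
        for name, blob in state["files"]
    ]
```

The same shape repeats for `dispatch_classify`, `dispatch_extract`, and `dispatch_rag_index`.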
|
|
| ### Risk subgraph topology |
|
|
| ``` |
| risk_subgraph (input: PipelineState): |
  ↓ basic_risk_dispatch (Send: per-doc basic risk)
  ↓ basic_risk / noop_basic
  ↓ domain_dispatch_node (Send: per-doc × per-applicable-check, ~30 parallel)
  ↓ apply_domain_check
  ↓ [if llm provided] llm_risk_dispatch (Send: per-doc LLM risk + 3-filter chain)
  ↓ llm_risk_per_doc / noop_llm
  ↓ plausibility_dispatch (Send: per-doc plausibility)
  ↓ plausibility / noop_plaus
  ↓ evidence_score_node (per-doc info)
  ↓ duplicate_detector_node (package-level, sync, ISA 240)
| END |
| ``` |
|
|
The LLM-side layers of the 5+1 anti-hallucination stack run inside
`llm_risk_per_doc` as a chain:
`llm_risk → filter_llm_risks → drop_business_normal → drop_repeats`.
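The chain idea can be sketched as sequential list filters, each stage only removing risks. The predicate shown is a simplified stand-in for the real formal filter:

```python
from typing import Callable

Filter = Callable[[list[str]], list[str]]


def run_filter_chain(risks: list[str], filters: list[Filter]) -> list[str]:
    """Apply each filter in order; every stage may only shrink the list."""
    for f in filters:
        risks = f(risks)
    return risks


def min_five_words(risks: list[str]) -> list[str]:
    # Simplified stand-in for the formal filter (the real one also
    # requires domain terms and a concrete fact in each risk).
    return [r for r in risks if len(r.split()) >= 5]
```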
|
|
| ### DD multi-agent supervisor graph |
|
|
| ``` |
| dd_graph: |
| START |
  ↓ contract_filter_node (keep only contract-type docs)
  ↓ per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
  ↓ supervisor_node (LLM router or heuristic; Command(goto=...))
      ├─→ audit_specialist (pricing anomalies, overcharging)
      ├─→ legal_specialist (red flags, change-of-control, non-compete)
      ├─→ compliance_specialist (GDPR, AML, data protection)
      └─→ financial_specialist (monthly obligations, expirations)
      ↺ (loops back to supervisor up to dd_supervisor_max_iterations)
  ↓ dd_synthesizer (one LLM call: executive_summary +
                    top_red_flags + per-contract risk_level rating)
| END |
| ``` |
|
|
| ### Package insights graph |
|
|
A simple one-LLM-call graph: it ingests the full document package and
produces cross-doc findings using a perspective-driven prompt
(`audit | dd | compliance | general`).
|
|
| ## 2. State design |
|
|
| ### `PipelineState` (TypedDict) |
|
|
| Read-mostly fields with **reducer-driven Send fan-in**: |
|
|
- `files: list[tuple[str, bytes]]` – raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]` –
  per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` – dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` – Streamlit progress feed
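The reducer idea behind the `Annotated` fields can be sketched with plain merge functions. This is simplified: the real `merge_doc_results` performs field-level merges on `ProcessedDocument` objects, not dicts:

```python
def merge_risks(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Send fan-in reducer: concatenate, deduplicating by description."""
    seen = {r["description"] for r in existing}
    merged = list(existing)
    for r in incoming:
        if r["description"] not in seen:
            merged.append(r)
            seen.add(r["description"])
    return merged


def merge_doc_results(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Send fan-in reducer: merge per-doc partial results keyed by file_name."""
    by_name = {d["file_name"]: dict(d) for d in existing}
    for d in incoming:
        by_name.setdefault(d["file_name"], {}).update(d)
    return list(by_name.values())
```

LangGraph calls the reducer for each parallel branch's partial update, so fan-in order never produces duplicates or lost fields.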
|
|
| ### `Risk` (Pydantic) |
|
|
| The single risk type used everywhere: |
|
|
| - `description: str` |
| - `severity: str` (`"high" | "medium" | "low" | "info"`) |
| - `rationale: str` |
| - `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`) |
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
| - `affected_document: str | None` |
| - `source_check_id: str | None` |
|
|
| ## 3. Anti-hallucination stack (5+1 layers) |
|
|
1. **`temperature=0`** – every LLM call is deterministic-ish.
2. **`_quotes` schema field** – verbatim source citations.
3. **`_confidence` schema field** – per-field reliability (high|medium|low).
4. **`validate_plausibility()`** – Python deterministic plausibility checks
   (negative VAT, non-standard rates, future dates, etc.).
5. **3-filter LLM risk pipeline** –
   `filter_llm_risks` (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact)
   → `drop_business_normal_risks` (semantic: cross-check vs extracted_data,
   6 known false-positive patterns)
   → `drop_repeats_of_basic` (textual dedup vs basic risks, 70% threshold).
6. **Quote validator** – final cross-check that every `_quotes` entry
   actually appears in the source `full_text` (whitespace + diacritic +
   case normalized). If invalid, downgrades confidence.
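The normalization used by the quote validator (layer 6) can be sketched with the standard library; the function names here are illustrative:

```python
import re
import unicodedata


def normalize_for_quote_match(text: str) -> str:
    """Whitespace-, diacritic- and case-insensitive form of a string.

    Sketch of the idea behind the quote validator: decompose to NFD,
    drop combining marks (diacritics), collapse whitespace, lowercase.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", stripped).strip().lower()


def quote_appears(quote: str, full_text: str) -> bool:
    """True if the quoted span occurs in the source text after normalization."""
    return normalize_for_quote_match(quote) in normalize_for_quote_match(full_text)
```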
|
|
| ## 4. Domain checks (14 deterministic rules) |
|
|
| | # | check_id | Regulation | HU-specific? | Applies to | |
| |---|----------|-----------|--------------|------------| |
| | 01 | `check_01_invoice_mandatory` | HU VAT Act Β§169 | yes | invoice | |
| | 02 | `check_02_tax_cdv` | HU Tax Procedure Act Β§22 mod-11 | yes | invoice + contract + ... | |
| | 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract | |
| | 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract | |
| | 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice | |
| | 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) | |
| | 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report | |
| | 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract | |
| | 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract | |
| | 10 | `check_10_incoterms` | Incoterms 2020 | no | contract | |
| | 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report | |
| | 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) | |
| | 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... | |
| | 14 | `check_14_contract_dates` | Contract date best practice | no | contract | |
|
|
The dispatch in `domain_dispatch_node` skips `check_06` and `check_12` (they
have separate entry points) and filters out `is_hu_specific=True` checks for
non-HU documents.
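The dispatch logic can be sketched as a pure planning function. The real node emits `Send` objects, and the dict shapes here are illustrative:

```python
def plan_domain_checks(documents: list[dict],
                       checks: list[dict]) -> list[tuple[str, str]]:
    """Build the per-doc × per-check fan-out list.

    Sketch of domain_dispatch_node: skip check_06 / check_12 (they have
    separate entry points), skip HU-specific checks for non-HU documents,
    and skip checks whose applies_to set excludes the document type.
    """
    skipped_ids = {"check_06_evidence_score", "check_12_duplicate_invoice"}
    plan = []
    for doc in documents:
        for check in checks:
            if check["check_id"] in skipped_ids:
                continue
            if check.get("is_hu_specific") and not doc.get("is_hu"):
                continue
            if doc["doc_type"] not in check["applies_to"]:
                continue
            plan.append((doc["file_name"], check["check_id"]))
    return plan
```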
|
|
| ## 5. Provider system |
|
|
| Three providers via `configurable_alternatives`: |
|
|
- **`vllm`** – `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the
  AMD MI300X vLLM endpoint. Production default.
- **`ollama`** – `ChatOllama` with a local Ollama daemon (Qwen 2.5 7B
  Instruct). Development fallback.
- **`dummy`** – `DummyChatModel` (deterministic stub, no network).
  CI / eval / load testing.
|
|
| Provider selection is **runtime-switchable** without restart: |
|
|
| ```python |
| graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}}) |
| ``` |
|
|
| ## 6. Embedding |
|
|
`BAAI/bge-m3` (2.27 GB, 1024-dim, multilingual) by default.
Sentence-transformers loads it on first call via `@lru_cache`.
The model is pre-downloaded at Docker build time, so the runtime makes no
network call.
|
|
| ## 7. Hybrid retrieval (Chroma + BM25) |
|
|
| `store/hybrid_store.py` runs vector search and BM25 in parallel and merges |
| with Reciprocal Rank Fusion (RRF). The chunker uses natural break points |
| (paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char |
| overlap. |
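The RRF merge can be sketched as a small scoring function. `k=60` is the conventional RRF constant; the constant actually used by `hybrid_store.py` may differ:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    Sketch of the merge step: the real store fuses the Chroma vector
    ranking with the BM25 ranking this way, rewarding documents that
    rank well in either list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```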
|
|
| ## 8. Async-first runtime |
|
|
| LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer |
| on a long-lived background event loop (`app/async_runtime.py`'s `AsyncRuntime` |
| singleton). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP |
| session, and the `AsyncSqliteSaver` SQLite pool persistent across user |
interactions; they are not rebuilt per request.
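The singleton idea can be sketched with the standard library. This is a simplified stand-in for `AsyncRuntime`; the real class additionally owns the long-lived clients:

```python
import asyncio
import threading


class AsyncRuntime:
    """Run all async work on one long-lived background event loop.

    Sketch of the singleton: sync Streamlit callbacks submit coroutines
    to this loop, so loop-bound resources (DB connections, HTTP sessions,
    the checkpointer pool) survive across user interactions.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._start()
        return cls._instance

    def _start(self) -> None:
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        """Submit a coroutine from sync code and block for its result."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```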
|
|
| ## 9. Multilingual support |
|
|
| The codebase is English-first but multilingual-tolerant: |
|
|
| - The classifier matches HU/EN/DE keyword patterns. |
| - Risk filters tolerate HU/DE business terms. |
| - The OCR layer keeps `eng + hun + deu` as Tesseract languages. |
| - Demo data may include mixed-language documents. |
|
|
| The output (UI, exec summary, DOCX report) is **always English**. |
|
|