# Architecture

A LangGraph-native Document Intelligence platform. This document goes beyond the README: it covers design decisions, the subgraph hierarchy, state design, and the anti-hallucination stack.

## 1. High-level architecture

### Four compiled LangGraph artifacts

The system is organized around four graphs sharing a common `AsyncSqliteSaver` checkpointer:
| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload |
| 2 | `chat_graph` | `app.ask(question)` | chat tab |
| 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button |
| 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button |
Chat tools read from the persisted pipeline state; they do not re-read files. They access the in-memory `ChatToolContext`, which holds the `HybridStore` and a documents snapshot.
### Pipeline graph topology

```
START
→ start_timer
→ dispatch_ingest (Send API: per-doc fan-out)
→ ingest_per_doc (PDF/DOCX/PNG/TXT loader subgraph)
→ ingest_join (fan-in)
→ dispatch_classify (Send API)
→ classify_per_doc (regex/keyword classifier in dummy mode;
                    vision-aware in vLLM mode)
→ classify_join
→ dispatch_extract (Send API)
→ extract_per_doc (regex extractor in dummy mode +
                   flatten_universal; structured LLM in vLLM mode)
→ extract_join
→ quote_validator (anti-hallucination layer #7)
→ dispatch_rag_index (Send API)
→ rag_index_per_doc (chunker + batched embed + Chroma+BM25 upsert)
→ rag_join
→ compare_node (three-way matching, sync)
→ risk_subgraph (basic + 14 domain × Send + plausibility +
                 LLM ensemble + duplicate)
→ finish_timer
→ report_node (10-section JSON structure)
→ END
```
The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
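The fan-out/fan-in pattern can be sketched as follows. This is a dependency-free illustration (a stand-in `Send` dataclass replaces the real `langgraph.types.Send`, and the node payloads are assumptions):

```python
from dataclasses import dataclass

# Stand-in for langgraph's Send: names a target node and its private input.
@dataclass
class Send:
    node: str
    arg: dict

def dispatch_ingest(state: dict) -> list[Send]:
    # One Send per uploaded file; LangGraph would run ingest_per_doc
    # once per Send, in parallel.
    return [Send("ingest_per_doc", {"file_name": name, "data": data})
            for name, data in state["files"]]

def ingest_per_doc(payload: dict) -> dict:
    # Each branch returns a partial state update; the documents reducer
    # merges the branches back together at ingest_join.
    return {"documents": [{"file_name": payload["file_name"], "status": "ingested"}]}

sends = dispatch_ingest({"files": [("a.pdf", b""), ("b.docx", b"")]})
```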
### Risk subgraph topology

```
risk_subgraph (input: PipelineState):
→ basic_risk_dispatch (Send: per-doc basic risk)
→ basic_risk / noop_basic
→ domain_dispatch_node (Send: per-doc × per-applicable-check, ~30 parallel)
→ apply_domain_check
→ [if llm provided] llm_risk_dispatch (Send: per-doc LLM risk + 3-filter chain)
→ llm_risk_per_doc / noop_llm
→ plausibility_dispatch (Send: per-doc plausibility)
→ plausibility / noop_plaus
→ evidence_score_node (per-doc info)
→ duplicate_detector_node (package-level, sync, ISA 240)
END
```
The anti-hallucination filter chain runs inside `llm_risk_per_doc`:
`llm_risk` → `filter_llm_risks` → `drop_business_normal` → `drop_repeats`.
### DD multi-agent supervisor graph

```
dd_graph:
START
→ contract_filter_node (keep only contract-type docs)
→ per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
→ supervisor_node (LLM router or heuristic; Command(goto=...))
   ├─ audit_specialist (pricing anomalies, overcharging)
   ├─ legal_specialist (red flags, change-of-control, non-compete)
   ├─ compliance_specialist (GDPR, AML, data protection)
   ├─ financial_specialist (monthly obligations, expirations)
   ↺ (loops back to supervisor up to dd_supervisor_max_iterations)
→ dd_synthesizer (one LLM call: executive_summary +
                  top_red_flags + per-contract risk_level rating)
END
```
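The heuristic side of the supervisor can be sketched as a keyword router. Node names mirror the graph above; the routing keywords and the iteration cap default are assumptions, and the real node returns `Command(goto=...)` rather than a string:

```python
def route(summaries: list[str], visited: set[str], max_iterations: int = 4) -> str:
    """Pick the next specialist, or hand off to dd_synthesizer."""
    if len(visited) >= max_iterations:
        return "dd_synthesizer"              # dd_supervisor_max_iterations reached
    text = " ".join(summaries).lower()
    specialists = [
        ("audit_specialist", ("price", "overcharg")),
        ("legal_specialist", ("change-of-control", "non-compete")),
        ("compliance_specialist", ("gdpr", "aml")),
        ("financial_specialist", ("monthly", "expir")),
    ]
    for name, keywords in specialists:
        # Route to the first unvisited specialist whose topic appears.
        if name not in visited and any(k in text for k in keywords):
            return name                      # real graph: Command(goto=name)
    return "dd_synthesizer"                  # nothing left to inspect
```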
### Package insights graph

A simple one-LLM-call graph: it ingests the full document package and produces cross-doc findings using a perspective-driven prompt (`audit | dd | compliance | general`).
## 2. State design

### PipelineState (TypedDict)

Read-mostly fields with reducer-driven Send fan-in:

- `files: list[tuple[str, bytes]]` – raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]` – per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` – dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` – Streamlit progress feed
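A sketch of this state shape, assuming a simplified stand-in for the project's `merge_doc_results` reducer (the real one merges typed `ProcessedDocument` objects; `operator.add` stands in for the other reducers):

```python
import operator
from typing import Annotated, TypedDict

def merge_doc_results(left: list[dict], right: list[dict]) -> list[dict]:
    # Field-level merge keyed by file_name: a later partial result updates
    # the existing entry for that file instead of appending a duplicate.
    merged = {d["file_name"]: dict(d) for d in left}
    for d in right:
        merged.setdefault(d["file_name"], {}).update(d)
    return list(merged.values())

class PipelineState(TypedDict, total=False):
    files: list[tuple[str, bytes]]
    documents: Annotated[list[dict], merge_doc_results]
    risks: Annotated[list[dict], operator.add]            # real reducer dedups by description
    progress_events: Annotated[list[str], operator.add]
```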
### Risk (Pydantic)

The single risk type used everywhere:

- `description: str`
- `severity: str` (`"high" | "medium" | "low" | "info"`)
- `rationale: str`
- `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`)
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
- `affected_document: str | None`
- `source_check_id: str | None`
3. Anti-hallucination stack (5+1 layers)
temperature=0β every LLM call is deterministic-ish._quotesschema field β verbatim source citations._confidenceschema field β per-field reliability (high|medium|low).validate_plausibility()β Python deterministic plausibility checks (negative VAT, non-standard rates, future dates, etc.).- 3-filter LLM risk pipeline β
filter_llm_risks(formal: β₯5 words, β₯2 domain terms, β₯1 concrete fact) βdrop_business_normal_risks(semantic: cross-check vs extracted_data, 6 known false-positive patterns) βdrop_repeats_of_basic(textual dedup vs basic risks, 70% threshold). - Quote validator β final cross-check that every
_quotesentry actually appears in the sourcefull_text(whitespace + diacritic + case normalized). If invalid, downgrades confidence.
## 4. Domain checks (14 deterministic rules)

| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|------------|--------------|------------|
| 01 | `check_01_invoice_mandatory` | HU VAT Act §169 | yes | invoice |
| 02 | `check_02_tax_cdv` | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract |
| 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract |
| 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) |
| 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report |
| 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract |
| 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract |
| 10 | `check_10_incoterms` | Incoterms 2020 | no | contract |
| 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report |
| 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | `check_14_contract_dates` | Contract date best practice | no | contract |
The dispatch in `domain_dispatch_node` skips check_06 and check_12 (they have separate entry points) and filters out `is_hu_specific=True` checks for non-HU documents.
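To make the shape of a deterministic check concrete, here is an illustrative sketch in the spirit of `check_05_rounded_amounts` (ISA 240 flags suspiciously round figures as a fraud indicator). The field names and the round-amount threshold are assumptions, not the project's actual rule:

```python
def check_rounded_amounts(doc: dict) -> list[dict]:
    """Flag conspicuously round invoice totals (illustrative threshold)."""
    risks: list[dict] = []
    total = doc.get("extracted_data", {}).get("total_amount")
    # Assumed heuristic: a positive total divisible by 10,000 is "round".
    if isinstance(total, (int, float)) and total > 0 and total % 10_000 == 0:
        risks.append({
            "description": f"Round invoice total {total:,.0f} in {doc['file_name']}",
            "severity": "low",
            "kind": "domain_rule",
            "regulation": "ISA 240",
            "source_check_id": "check_05_rounded_amounts",
        })
    return risks
```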
## 5. Provider system

Three providers via `configurable_alternatives`:

- `vllm` – `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the AMD MI300X vLLM endpoint. Production default.
- `ollama` – `ChatOllama` against a local Ollama daemon (Qwen 2.5 7B Instruct). Development fallback.
- `dummy` – `DummyChatModel` (deterministic stub, no network). CI / eval / load testing.
Provider selection is runtime-switchable without restart:

```python
graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})
```
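A dependency-free sketch of what that switch amounts to. The real code attaches the alternatives to the chat-model object via LangChain's `configurable_alternatives`; the lookup function and the description strings here are illustrative:

```python
def select_llm(config: dict) -> str:
    """Resolve the llm_profile key from an invoke config (sketch)."""
    profiles = {
        "vllm": "ChatOpenAI(base_url=VLLM_BASE_URL)",        # production default
        "ollama": "ChatOllama(model='qwen2.5:7b-instruct')", # dev fallback
        "dummy": "DummyChatModel()",                         # CI / eval / load
    }
    # Absent a profile, fall back to the production default.
    profile = config.get("configurable", {}).get("llm_profile", "vllm")
    return profiles[profile]
```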
## 6. Embedding

`BAAI/bge-m3` (2.27 GB, 1024-dim, multilingual) by default. sentence-transformers loads it on first call via `@lru_cache`. The model is pre-downloaded at Docker build time, so runtime makes no network call.
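The cached-loader pattern can be sketched as follows. A load counter stands in for the real `SentenceTransformer(model_name)` construction so the caching behavior is visible without the heavy dependency; the function name is an assumption:

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}

@lru_cache(maxsize=1)
def get_embedder(model_name: str = "BAAI/bge-m3"):
    # In the real loader this constructs SentenceTransformer(model_name);
    # @lru_cache means only the first call pays the model-load cost.
    LOAD_COUNT["n"] += 1
    return f"<embedder {model_name}>"
```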
## 7. Hybrid retrieval (Chroma + BM25)

`store/hybrid_store.py` runs vector search and BM25 in parallel and merges the results with Reciprocal Rank Fusion (RRF). The chunker uses natural break points (paragraph and sentence boundaries), tuned to ~15K-char chunks with 500-char overlap.
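RRF itself is simple: each ranked list contributes `1 / (k + rank)` per document, and the merged list is sorted by the summed score. A minimal sketch, assuming the common `k = 60` default (the store's actual `k` may differ):

```python
def rrf_merge(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranked in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(ranked, start=1):
            # A document found by both retrievers accumulates both scores.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(["a", "b", "c"], ["b", "d"])
```

"b" wins because it appears in both lists, which is exactly why RRF favors documents that vector search and BM25 agree on.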
## 8. Async-first runtime

LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer on a long-lived background event loop (the `AsyncRuntime` singleton in `app/async_runtime.py`). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP session, and the `AsyncSqliteSaver` SQLite pool persistent across user interactions; they do not rebuild per request.
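A minimal sketch of that pattern, assuming the described behavior (one daemon thread owns a forever-running loop; sync Streamlit callbacks submit coroutines to it and block for the result). Method names are assumptions about the real module:

```python
import asyncio
import threading

class AsyncRuntime:
    _instance = None

    def __init__(self) -> None:
        # One long-lived loop on a background daemon thread.
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    @classmethod
    def get(cls) -> "AsyncRuntime":
        # Singleton: every Streamlit rerun reuses the same loop,
        # so HTTP sessions and DB pools stay alive between requests.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def run(self, coro):
        # Block the calling (sync) thread until the coroutine
        # finishes on the persistent loop.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```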
## 9. Multilingual support

The codebase is English-first but multilingual-tolerant:

- The classifier matches HU/EN/DE keyword patterns.
- Risk filters tolerate HU/DE business terms.
- The OCR layer keeps `eng + hun + deu` as Tesseract languages.
- Demo data may include mixed-language documents.

The output (UI, exec summary, DOCX report) is always English.