# Architecture

LangGraph-native Document Intelligence platform. This document goes beyond the README: it covers design decisions, the subgraph hierarchy, state design, and the anti-hallucination stack.
## 1. High-level architecture
### Four compiled LangGraph artifacts

The system is organized around four graphs sharing a common `AsyncSqliteSaver` checkpointer:
| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload |
| 2 | `chat_graph` | `app.ask(question)` | chat tab |
| 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button |
| 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button |
Chat tools read from the persisted pipeline state; they do not re-read files. Instead they access the in-memory `ChatToolContext`, which holds the `HybridStore` and a snapshot of the processed documents.
### Pipeline graph topology
```
START
  → start_timer
  → dispatch_ingest     (Send API: per-doc fan-out)
  → ingest_per_doc      (PDF/DOCX/PNG/TXT loader subgraph)
  → ingest_join         (fan-in)
  → dispatch_classify   (Send API)
  → classify_per_doc    (regex/keyword classifier in dummy mode; vision-aware in vLLM mode)
  → classify_join
  → dispatch_extract    (Send API)
  → extract_per_doc     (regex extractor in dummy mode + flatten_universal; structured LLM in vLLM mode)
  → extract_join
  → quote_validator     (final anti-hallucination cross-check)
  → dispatch_rag_index  (Send API)
  → rag_index_per_doc   (chunker + batched embed + Chroma+BM25 upsert)
  → rag_join
  → compare_node        (three-way matching, sync)
  → risk_subgraph       (basic + 14 domain × Send + plausibility + LLM ensemble + duplicate)
  → finish_timer
  → report_node         (10-section JSON structure)
  → END
```
The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
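Conceptually, each `dispatch_*` node emits one work item per document. A sketch with `Send` modeled as a plain `(target_node, private_state)` tuple — the real graph uses `langgraph.types.Send`, which carries exactly this shape:

```python
# Conceptual sketch of the per-document fan-out. Each ingest_per_doc branch
# receives only its own file, so branches run in parallel and the reducer
# on `documents` merges their results back at the join node.

def dispatch_ingest(state: dict) -> list[tuple[str, dict]]:
    """One work item per uploaded file (stand-in for one Send per file)."""
    return [
        ("ingest_per_doc", {"file_name": name, "raw": blob})
        for name, blob in state["files"]
    ]
```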
### Risk subgraph topology
```
risk_subgraph (input: PipelineState):
  → basic_risk_dispatch     (Send: per-doc basic risk)
  → basic_risk / noop_basic
  → domain_dispatch_node    (Send: per-doc × per-applicable-check, ~30 parallel)
  → apply_domain_check
  → [if llm provided] llm_risk_dispatch (Send: per-doc LLM risk + 3-filter chain)
  → llm_risk_per_doc / noop_llm
  → plausibility_dispatch   (Send: per-doc plausibility)
  → plausibility / noop_plaus
  → evidence_score_node     (per-doc info)
  → duplicate_detector_node (package-level, sync, ISA 240)
  → END
```
The LLM-risk leg of the anti-hallucination stack runs inside `llm_risk_per_doc`: `llm_risk → filter_llm_risks → drop_business_normal → drop_repeats`.
### DD multi-agent supervisor graph
```
dd_graph:
START
  → contract_filter_node       (keep only contract-type docs)
  → per_contract_summary_node  (deterministic Python per-contract DDContractSummary)
  → supervisor_node            (LLM router or heuristic; Command(goto=...))
      ├─ audit_specialist      (pricing anomalies, overcharging)
      ├─ legal_specialist      (red flags, change-of-control, non-compete)
      ├─ compliance_specialist (GDPR, AML, data protection)
      └─ financial_specialist  (monthly obligations, expirations)
      ↺ (each specialist loops back to the supervisor, up to dd_supervisor_max_iterations)
  → dd_synthesizer             (one LLM call: executive_summary + top_red_flags +
                                per-contract risk_level rating)
  → END
```
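The heuristic side of the router can be sketched as plain keyword matching. The keyword table below is invented for illustration; in the real graph the node returns `Command(goto=...)` and also has an LLM routing path:

```python
# Heuristic fallback routing for the DD supervisor (LLM router path omitted).
# Returns the name of the next node; the keyword-to-specialist mapping is
# illustrative, loosely mirroring each specialist's focus area.

SPECIALISTS = {
    "audit_specialist": ("pricing", "overcharg", "anomal"),
    "legal_specialist": ("change-of-control", "non-compete", "red flag"),
    "compliance_specialist": ("gdpr", "aml", "data protection"),
    "financial_specialist": ("monthly", "expir", "obligation"),
}

def route(question: str, iteration: int, max_iterations: int = 4) -> str:
    if iteration >= max_iterations:
        return "dd_synthesizer"          # stop looping, synthesize the report
    q = question.lower()
    for node, keywords in SPECIALISTS.items():
        if any(k in q for k in keywords):
            return node
    return "dd_synthesizer"              # nothing left to investigate
```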
### Package insights graph
A simple one-LLM-call graph: it ingests the full document package and produces cross-document findings using a perspective-driven prompt (`audit` | `dd` | `compliance` | `general`).
## 2. State design
### `PipelineState` (TypedDict)
Read-mostly fields with reducer-driven Send fan-in:
- `files: list[tuple[str, bytes]]` — raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]` — per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` — dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` — Streamlit progress feed
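The documents reducer can be sketched as a field-level merge keyed by `file_name`. Here it operates over plain dicts; the real reducer works on `ProcessedDocument` models:

```python
# Sketch of the documents reducer: parallel Send branches each return a
# partial document record, and the reducer merges them field by field
# keyed by file_name, so ingest/classify/extract results for the same
# file end up in a single record.

def merge_doc_results(existing: list[dict], new: list[dict]) -> list[dict]:
    by_name = {d["file_name"]: dict(d) for d in existing}
    for doc in new:
        merged = by_name.setdefault(doc["file_name"], {})
        # only non-None fields overwrite, so a later branch cannot
        # erase what an earlier branch already filled in
        merged.update({k: v for k, v in doc.items() if v is not None})
    return list(by_name.values())
```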
### `Risk` (Pydantic)
The single risk type used everywhere:
- `description: str`
- `severity: str` (`"high" | "medium" | "low" | "info"`)
- `rationale: str`
- `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`)
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
- `affected_document: str | None`
- `source_check_id: str | None`
## 3. Anti-hallucination stack (5+1 layers)
1. `temperature=0` — every LLM call is deterministic-ish.
2. `_quotes` schema field — verbatim source citations.
3. `_confidence` schema field — per-field reliability (`high | medium | low`).
4. `validate_plausibility()` — deterministic Python plausibility checks (negative VAT, non-standard rates, future dates, etc.).
5. 3-filter LLM risk pipeline — `filter_llm_risks` (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact) → `drop_business_normal_risks` (semantic: cross-check vs `extracted_data`, 6 known false-positive patterns) → `drop_repeats_of_basic` (textual dedup vs basic risks, 70% threshold).
6. (+1) Quote validator — final cross-check that every `_quotes` entry actually appears in the source `full_text` (whitespace-, diacritic-, and case-normalized). Invalid quotes downgrade the field's confidence.
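The quote validator's normalization can be sketched as a normalize-then-substring check. This is a sketch assuming NFD decomposition for diacritic stripping; the project's exact normalization may differ:

```python
import unicodedata

def _normalize(text: str) -> str:
    """Whitespace + diacritic + case normalization."""
    stripped = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"      # drop combining accent marks
    )
    return " ".join(stripped.lower().split())   # lowercase, collapse whitespace

def quote_in_source(quote: str, full_text: str) -> bool:
    """True if the quote appears verbatim in the source after normalization."""
    return _normalize(quote) in _normalize(full_text)
```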
## 4. Domain checks (14 deterministic rules)
| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|------------|--------------|------------|
| 01 | `check_01_invoice_mandatory` | HU VAT Act §169 | yes | invoice |
| 02 | `check_02_tax_cdv` | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract |
| 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract |
| 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) |
| 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report |
| 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract |
| 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract |
| 10 | `check_10_incoterms` | Incoterms 2020 | no | contract |
| 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report |
| 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | `check_14_contract_dates` | Contract date best practice | no | contract |
The dispatch in `domain_dispatch_node` skips check_06 and check_12 (they have separate entry points) and filters out `is_hu_specific=True` checks for non-HU documents.
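A sketch of that filter, with check records and documents modeled as dicts. The `country` field used for the HU test is an assumption, not a field named in the text:

```python
# Sketch of the domain-check dispatch filter: check_06 and check_12 are
# excluded (they have their own entry points), HU-specific checks are
# skipped for non-HU documents, and each check must apply to the doc type.

SKIPPED = {"check_06_evidence_score", "check_12_duplicate_invoice"}

def applicable_checks(checks: list[dict], doc: dict) -> list[dict]:
    return [
        c for c in checks
        if c["check_id"] not in SKIPPED
        and (not c.get("is_hu_specific") or doc.get("country") == "HU")
        and doc["doc_type"] in c["applies_to"]
    ]
```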
## 5. Provider system
Three providers via `configurable_alternatives`:
- `vllm` — `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the AMD MI300X vLLM endpoint. Production default.
- `ollama` — `ChatOllama` with a local Ollama daemon (Qwen 2.5 7B Instruct). Development fallback.
- `dummy` — `DummyChatModel` (deterministic stub, no network). CI / eval / load.
Provider selection is runtime-switchable without a restart:

```python
graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})
```
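A dict-based analogue of the same selection mechanism. Stand-in strings replace the real `ChatOpenAI` / `ChatOllama` / `DummyChatModel` instances; the real code wires them together with LangChain's `configurable_alternatives` rather than a hand-rolled registry:

```python
# Analogue of runtime provider switching: the profile name travels in the
# same config["configurable"]["llm_profile"] slot used at invoke time.

PROVIDERS = {
    "vllm": lambda: "ChatOpenAI(base_url=VLLM_BASE_URL)",    # stand-in string
    "ollama": lambda: "ChatOllama(model='qwen2.5:7b')",      # stand-in string
    "dummy": lambda: "DummyChatModel()",                     # stand-in string
}

def resolve_llm(config: dict, default: str = "vllm"):
    """Pick the provider named in the per-invocation config, else the default."""
    profile = config.get("configurable", {}).get("llm_profile", default)
    return PROVIDERS[profile]()
```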
## 6. Embedding
`BAAI/bge-m3` (2.27 GB, 1024-dim, multilingual) is the default. sentence-transformers loads it on first call via `@lru_cache`, and the model is pre-downloaded at Docker build time so runtime makes no network call.
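The lazy-load pattern in miniature. A stand-in object replaces the real `SentenceTransformer("BAAI/bge-m3")` so the sketch has no heavyweight dependency:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_embedder() -> object:
    """Loaded once per process on first call; every later call returns the
    cached instance. In the real code this constructs a sentence-transformers
    SentenceTransformer("BAAI/bge-m3"); here an opaque object stands in."""
    return object()
```

Because `lru_cache` memoizes the zero-argument call, the 2.27 GB model is paid for exactly once, and only if embedding is actually needed.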
## 7. Hybrid retrieval (Chroma + BM25)
`store/hybrid_store.py` runs vector search and BM25 in parallel and merges the results with Reciprocal Rank Fusion (RRF). The chunker splits at natural break points (paragraph and sentence boundaries), tuned to ~15K-char chunks with 500-char overlap.
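RRF itself is small; a sketch over two ranked lists of chunk ids (k=60 is the conventional RRF constant, not a value stated here):

```python
# Minimal Reciprocal Rank Fusion: each ranking contributes 1/(k + rank + 1)
# to a chunk's score, so items ranked well by both vector search and BM25
# float to the top even when their raw scores are incomparable.

def rrf_merge(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```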
## 8. Async-first runtime
LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer on a long-lived background event loop (the `AsyncRuntime` singleton in `app/async_runtime.py`). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP session, and the `AsyncSqliteSaver` SQLite pool alive across user interactions instead of rebuilding them per request.
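The pattern in miniature (a sketch, not the project's actual `AsyncRuntime`):

```python
import asyncio
import threading

class AsyncRuntime:
    """One long-lived event loop on a daemon thread: sync Streamlit
    callbacks submit coroutines to it, so async resources created on
    that loop persist across user interactions."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def run(self, coro):
        """Block the calling (sync) thread until the coroutine finishes."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```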
## 9. Multilingual support
The codebase is English-first but multilingual-tolerant:
- The classifier matches HU/EN/DE keyword patterns.
- Risk filters tolerate HU/DE business terms.
- The OCR layer keeps `eng + hun + deu` as Tesseract languages.
- Demo data may include mixed-language documents.
The output (UI, exec summary, DOCX report) is always English.