paperhawk / ARCHITECTURE.md
Nándorfi Vince

Architecture

LangGraph-native Document Intelligence platform. This document goes beyond the README — it covers design decisions, the subgraph hierarchy, state design, and the anti-hallucination stack.

1. High-level architecture

4 compiled LangGraph artifacts

The system is organized around four graphs sharing a common AsyncSqliteSaver checkpointer:

| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | pipeline_graph | app.run_pipeline(files) | on upload |
| 2 | chat_graph | app.ask(question) | chat tab |
| 3 | dd_graph | app.dd_report(thread_id) | DD tab button |
| 4 | package_insights_graph | app.package_insights(thread_id, pkg_type) | demo button |

Chat tools read from the persisted pipeline state — they do not re-read files. They access the in-memory ChatToolContext, which holds the HybridStore and a documents snapshot.

Pipeline graph topology

START
  → start_timer
  → dispatch_ingest          (Send API: per-doc fan-out)
  → ingest_per_doc           (PDF/DOCX/PNG/TXT loader subgraph)
  → ingest_join              (fan-in)
  → dispatch_classify        (Send API)
  → classify_per_doc         (regex/keyword classifier in dummy mode;
                              vision-aware in vLLM mode)
  → classify_join
  → dispatch_extract         (Send API)
  → extract_per_doc          (regex extractor in dummy mode +
                              flatten_universal; structured LLM in vLLM mode)
  → extract_join
  → quote_validator          (anti-hallucination layer #6)
  → dispatch_rag_index       (Send API)
  → rag_index_per_doc        (chunker + batched embed + Chroma+BM25 upsert)
  → rag_join
  → compare_node             (three-way matching, sync)
  → risk_subgraph            (basic + 14 domain × Send + plausibility +
                              LLM ensemble + duplicate)
  → finish_timer
  → report_node              (10-section JSON structure)
  → END

The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.

Risk subgraph topology

risk_subgraph (input: PipelineState):
  → basic_risk_dispatch         (Send: per-doc basic risk)
  → basic_risk / noop_basic
  → domain_dispatch_node        (Send: per-doc × per-applicable-check, ~30 parallel)
  → apply_domain_check
  → [if llm provided] llm_risk_dispatch  (Send: per-doc LLM risk + 3-filter chain)
  → llm_risk_per_doc / noop_llm
  → plausibility_dispatch       (Send: per-doc plausibility)
  → plausibility / noop_plaus
  → evidence_score_node         (per-doc info)
  → duplicate_detector_node     (package-level, sync, ISA 240)
END

The 3-filter anti-hallucination chain (layer #5 of the stack) runs inside llm_risk_per_doc: llm_risk → filter_llm_risks → drop_business_normal → drop_repeats.
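The first stage of that chain, the formal gate, can be sketched with stdlib Python. The domain-term list and the digit-based "concrete fact" heuristic are assumptions; the real vocabulary lives in the risk module.

```python
import re

# Hypothetical domain vocabulary; the real list is larger and multilingual.
DOMAIN_TERMS = {"invoice", "vat", "contract", "payment", "penalty", "deadline"}


def passes_formal_filter(description: str) -> bool:
    """Formal gate of filter_llm_risks: >=5 words, >=2 domain terms,
    >=1 concrete fact. A 'concrete fact' is approximated here as any
    digit sequence (amount, date, percentage) -- an assumption."""
    words = description.lower().split()
    if len(words) < 5:
        return False
    if sum(w.strip(".,;:") in DOMAIN_TERMS for w in words) < 2:
        return False
    return re.search(r"\d", description) is not None
```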

DD multi-agent supervisor graph

dd_graph:
  START
  → contract_filter_node      (keep only contract-type docs)
  → per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
  → supervisor_node           (LLM router or heuristic; Command(goto=...))
        ├─ → audit_specialist     (pricing anomalies, overcharging)
        ├─ → legal_specialist     (red flags, change-of-control, non-compete)
        ├─ → compliance_specialist (GDPR, AML, data protection)
        └─ → financial_specialist (monthly obligations, expirations)
  ↺ (loops back to supervisor up to dd_supervisor_max_iterations)
  → dd_synthesizer            (one LLM call: executive_summary +
                               top_red_flags + per-contract risk_level rating)
  END
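The heuristic fallback side of the supervisor can be sketched as a pure routing function; the real supervisor_node wraps the chosen name in `Command(goto=...)`. The specialist names match the graph above; the visit-once policy and signature are assumptions.

```python
SPECIALISTS = ("audit_specialist", "legal_specialist",
               "compliance_specialist", "financial_specialist")


def route_next(visited: set[str], iteration: int, max_iterations: int) -> str:
    """Heuristic router: visit each specialist once, in order, then hand
    off to the synthesizer; bail out at dd_supervisor_max_iterations."""
    if iteration >= max_iterations:
        return "dd_synthesizer"
    for name in SPECIALISTS:
        if name not in visited:
            return name
    return "dd_synthesizer"
```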

Package insights graph

A simple one-LLM-call graph: it ingests the full document package and produces cross-document findings using a perspective-driven prompt (audit | dd | compliance | general).

2. State design

PipelineState (TypedDict)

Read-mostly fields with reducer-driven Send fan-in:

  • files: list[tuple[str, bytes]] — raw upload
  • documents: Annotated[list[ProcessedDocument], merge_doc_results] — per-doc field-level merge keyed by file_name
  • risks: Annotated[list[Risk], merge_risks] — dedup by description
  • comparison: ComparisonReport | None
  • report: dict
  • package_insights: PackageInsights | None
  • dd_report: DDPortfolioReport | None
  • started_at, finished_at, processing_seconds
  • progress_events: Annotated[list[str], add] — Streamlit progress feed
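The documents reducer is the piece that makes the Send fan-in safe: parallel branches each return a partial list, and the reducer merges them field-by-field keyed on file_name. A minimal sketch, using plain dicts in place of ProcessedDocument (an assumption):

```python
def merge_doc_results(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Reducer for PipelineState.documents: field-level merge keyed by
    file_name, so each parallel Send branch can contribute a partial
    per-doc result without clobbering the others."""
    merged = {doc["file_name"]: dict(doc) for doc in existing}
    for doc in incoming:
        merged.setdefault(doc["file_name"], {}).update(doc)
    return list(merged.values())
```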

Risk (Pydantic)

The single risk type used everywhere:

  • description: str
  • severity: str ("high" | "medium" | "low" | "info")
  • rationale: str
  • kind: str ("validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check")
  • regulation: str | None (e.g. "HU VAT Act §169", "ISA 240", "GDPR Article 28")
  • affected_document: str | None
  • source_check_id: str | None

3. Anti-hallucination stack (5+1 layers)

  1. temperature=0 — every LLM call is deterministic-ish.
  2. _quotes schema field — verbatim source citations.
  3. _confidence schema field — per-field reliability (high|medium|low).
  4. validate_plausibility() — Python deterministic plausibility checks (negative VAT, non-standard rates, future dates, etc.).
  5. 3-filter LLM risk pipeline: filter_llm_risks (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact) → drop_business_normal_risks (semantic: cross-check vs extracted_data, 6 known false-positive patterns) → drop_repeats_of_basic (textual dedup vs basic risks, 70% threshold).
  6. Quote validator — final cross-check that every _quotes entry actually appears in the source full_text (whitespace + diacritic + case normalized). If invalid, downgrades confidence.
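The normalization behind layer 6 can be sketched with the stdlib: NFKD-decompose, strip combining marks (diacritics), casefold, collapse whitespace, then do a substring check. The helper names are illustrative; the real node also downgrades `_confidence` on failure.

```python
import re
import unicodedata


def _normalize(text: str) -> str:
    """Whitespace-, diacritic- and case-insensitive canonical form."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).casefold().strip()


def quote_appears(quote: str, full_text: str) -> bool:
    """Layer-6 check: does a claimed verbatim _quotes entry actually
    occur in the source document's full_text?"""
    return _normalize(quote) in _normalize(full_text)
```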

4. Domain checks (14 deterministic rules)

| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|------------|--------------|------------|
| 01 | check_01_invoice_mandatory | HU VAT Act §169 | yes | invoice |
| 02 | check_02_tax_cdv | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | check_03_contract_completeness | Universal contract completeness | no | contract |
| 04 | check_04_proportionality | Universal contract proportionality (>31.7%) | no | contract |
| 05 | check_05_rounded_amounts | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | check_06_evidence_score | ISA 500 | no | (separate entry, info-only) |
| 07 | check_07_materiality | ISA 320 | no | invoice + contract + financial_report |
| 08 | check_08_gdpr_28 | GDPR Article 28 | no (EU) | contract |
| 09 | check_09_dd_red_flags | M&A DD best practice | no | contract |
| 10 | check_10_incoterms | Incoterms 2020 | no | contract |
| 11 | check_11_ifrs_har | IFRS / national GAAP comparison | no | financial_report |
| 12 | check_12_duplicate_invoice | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | check_13_aml_sanctions | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | check_14_contract_dates | Contract date best practice | no | contract |

The dispatch in domain_dispatch_node skips check_06 and check_12 (they have separate entry points) and filters is_hu_specific=True out for non-HU documents.
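The dispatch filter can be sketched as a pure function over a check registry. The registry below is a truncated, hypothetical stand-in (the real one holds all 14 checks with their applies-to sets):

```python
# Hypothetical registry rows: (check_id, is_hu_specific, applies_to).
CHECKS = [
    ("check_01_invoice_mandatory", True, {"invoice"}),
    ("check_03_contract_completeness", False, {"contract"}),
    ("check_06_evidence_score", False, {"invoice", "contract"}),
    ("check_12_duplicate_invoice", False, {"invoice"}),
]
SEPARATE_ENTRY = {"check_06_evidence_score", "check_12_duplicate_invoice"}


def applicable_checks(doc_type: str, is_hu_doc: bool) -> list[str]:
    """Mirror of the domain_dispatch_node filter: skip the two checks
    with their own entry points, drop HU-specific rules for non-HU
    documents, keep only checks applicable to this doc_type."""
    return [
        check_id
        for check_id, hu_only, applies_to in CHECKS
        if check_id not in SEPARATE_ENTRY
        and (is_hu_doc or not hu_only)
        and doc_type in applies_to
    ]
```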

5. Provider system

Three providers via configurable_alternatives:

  • vllm: ChatOpenAI with base_url=VLLM_BASE_URL pointing at the AMD MI300X vLLM endpoint. Production default.
  • ollama: ChatOllama with a local Ollama daemon (Qwen 2.5 7B Instruct). Development fallback.
  • dummy: DummyChatModel (deterministic stub, no network). CI / eval / load.

Provider selection is runtime-switchable without restart:

graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})

6. Embedding

BAAI/bge-m3 (2.27 GB, 1024 dim, multilingual) by default. Sentence-transformers loads it on first call via @lru_cache. Pre-downloaded at Docker build time so runtime has no network call.

7. Hybrid retrieval (Chroma + BM25)

store/hybrid_store.py runs vector search and BM25 in parallel and merges with Reciprocal Rank Fusion (RRF). The chunker uses natural break points (paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char overlap.
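The RRF merge itself is a few lines; a minimal sketch with the common k=60 constant (the real HybridStore parameters may differ):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over result lists of
    1 / (k + rank(d)), with ranks starting at 1. Documents high in
    either the vector or the BM25 ranking float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```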

8. Async-first runtime

LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer on a long-lived background event loop (app/async_runtime.py's AsyncRuntime singleton). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP session, and the AsyncSqliteSaver SQLite pool persistent across user interactions — they do not rebuild per request.
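The shape of such a runtime can be sketched with the stdlib: one event loop on a daemon thread, and a sync `run()` bridge for Streamlit callbacks. This is the general pattern only; the details of app/async_runtime.py are assumptions.

```python
import asyncio
import threading


class AsyncRuntime:
    """Long-lived background event loop: sync callers submit coroutines
    to one persistent loop, so connections and sessions built on that
    loop survive across requests instead of being rebuilt each time."""
    _instance = None

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    @classmethod
    def get(cls) -> "AsyncRuntime":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def run(self, coro):
        """Block the calling sync thread until the coroutine finishes."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```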

9. Multilingual support

The codebase is English-first but multilingual-tolerant:

  • The classifier matches HU/EN/DE keyword patterns.
  • Risk filters tolerate HU/DE business terms.
  • The OCR layer keeps eng + hun + deu as Tesseract languages.
  • Demo data may include mixed-language documents.

The output (UI, exec summary, DOCX report) is always English.