paperhawk / ARCHITECTURE.md
Nándorfi Vince

Architecture

LangGraph-native Document Intelligence platform. This document goes beyond the README — it covers design decisions, the subgraph hierarchy, state design, and the anti-hallucination stack.

1. High-level architecture

4 compiled LangGraph artifacts

The system is organized around four graphs sharing a common AsyncSqliteSaver checkpointer:

| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | pipeline_graph | app.run_pipeline(files) | on upload |
| 2 | chat_graph | app.ask(question) | chat tab |
| 3 | dd_graph | app.dd_report(thread_id) | DD tab button |
| 4 | package_insights_graph | app.package_insights(thread_id, pkg_type) | demo button |

Chat tools read from the persisted pipeline state — they do not re-read files. They access the in-memory ChatToolContext, which holds the HybridStore and a documents snapshot.

Pipeline graph topology

START
  → start_timer
  → dispatch_ingest          (Send API: per-doc fan-out)
  → ingest_per_doc           (PDF/DOCX/PNG/TXT loader subgraph)
  → ingest_join              (fan-in)
  → dispatch_classify        (Send API)
  → classify_per_doc         (regex/keyword classifier in dummy mode;
                              vision-aware in vLLM mode)
  → classify_join
  → dispatch_extract         (Send API)
  → extract_per_doc          (regex extractor in dummy mode +
                              flatten_universal; structured LLM in vLLM mode)
  → extract_join
  → quote_validator          (anti-hallucination layer #6)
  → dispatch_rag_index       (Send API)
  → rag_index_per_doc        (chunker + batched embed + Chroma+BM25 upsert)
  → rag_join
  → compare_node             (three-way matching, sync)
  → risk_subgraph            (basic + 14 domain × Send + plausibility +
                              LLM ensemble + duplicate)
  → finish_timer
  → report_node              (10-section JSON structure)
  → END

The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.

Risk subgraph topology

risk_subgraph (input: PipelineState):
  → basic_risk_dispatch         (Send: per-doc basic risk)
  → basic_risk / noop_basic
  → domain_dispatch_node        (Send: per-doc × per-applicable-check, ~30 parallel)
  → apply_domain_check
  → [if llm provided] llm_risk_dispatch  (Send: per-doc LLM risk + 3-filter chain)
  → llm_risk_per_doc / noop_llm
  → plausibility_dispatch       (Send: per-doc plausibility)
  → plausibility / noop_plaus
  → evidence_score_node         (per-doc info)
  → duplicate_detector_node     (package-level, sync, ISA 240)
END

The 3-filter anti-hallucination chain (layer #5 of the stack) runs inside llm_risk_per_doc: llm_risk → filter_llm_risks → drop_business_normal → drop_repeats.
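The first stage of that chain, the formal gate, can be sketched with stdlib Python. The domain-term list and the digit-based "concrete fact" heuristic are assumptions; the real vocabulary lives in the risk module.

```python
import re

# Hypothetical domain vocabulary; the real list is larger and multilingual.
DOMAIN_TERMS = {"invoice", "vat", "contract", "payment", "penalty", "deadline"}


def passes_formal_filter(description: str) -> bool:
    """Formal gate of filter_llm_risks: >=5 words, >=2 domain terms,
    >=1 concrete fact. A 'concrete fact' is approximated here as any
    digit sequence (amount, date, percentage) -- an assumption."""
    words = description.lower().split()
    if len(words) < 5:
        return False
    if sum(w.strip(".,;:") in DOMAIN_TERMS for w in words) < 2:
        return False
    return re.search(r"\d", description) is not None
```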

DD multi-agent supervisor graph

dd_graph:
  START
  → contract_filter_node      (keep only contract-type docs)
  → per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
  → supervisor_node           (LLM router or heuristic; Command(goto=...))
        ├─ → audit_specialist     (pricing anomalies, overcharging)
        ├─ → legal_specialist     (red flags, change-of-control, non-compete)
        ├─ → compliance_specialist (GDPR, AML, data protection)
        └─ → financial_specialist (monthly obligations, expirations)
  ↺ (loops back to supervisor up to dd_supervisor_max_iterations)
  → dd_synthesizer            (one LLM call: executive_summary +
                               top_red_flags + per-contract risk_level rating)
  END
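The heuristic fallback side of the supervisor can be sketched as a pure routing function; the real supervisor_node wraps the chosen name in `Command(goto=...)`. The specialist names match the graph above; the visit-once policy and signature are assumptions.

```python
SPECIALISTS = ("audit_specialist", "legal_specialist",
               "compliance_specialist", "financial_specialist")


def route_next(visited: set[str], iteration: int, max_iterations: int) -> str:
    """Heuristic router: visit each specialist once, in order, then hand
    off to the synthesizer; bail out at dd_supervisor_max_iterations."""
    if iteration >= max_iterations:
        return "dd_synthesizer"
    for name in SPECIALISTS:
        if name not in visited:
            return name
    return "dd_synthesizer"
```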

Package insights graph

A simple one-LLM-call graph: it ingests the full document package and produces cross-document findings using a perspective-driven prompt (audit | dd | compliance | general).

2. State design

PipelineState (TypedDict)

Read-mostly fields with reducer-driven Send fan-in:

  • files: list[tuple[str, bytes]] — raw upload
  • documents: Annotated[list[ProcessedDocument], merge_doc_results] — per-doc field-level merge keyed by file_name
  • risks: Annotated[list[Risk], merge_risks] — dedup by description
  • comparison: ComparisonReport | None
  • report: dict
  • package_insights: PackageInsights | None
  • dd_report: DDPortfolioReport | None
  • started_at, finished_at, processing_seconds
  • progress_events: Annotated[list[str], add] — Streamlit progress feed
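The documents reducer is the piece that makes the Send fan-in safe: parallel branches each return a partial list, and the reducer merges them field-by-field keyed on file_name. A minimal sketch, using plain dicts in place of ProcessedDocument (an assumption):

```python
def merge_doc_results(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Reducer for PipelineState.documents: field-level merge keyed by
    file_name, so each parallel Send branch can contribute a partial
    per-doc result without clobbering the others."""
    merged = {doc["file_name"]: dict(doc) for doc in existing}
    for doc in incoming:
        merged.setdefault(doc["file_name"], {}).update(doc)
    return list(merged.values())
```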

Risk (Pydantic)

The single risk type used everywhere:

  • description: str
  • severity: str ("high" | "medium" | "low" | "info")
  • rationale: str
  • kind: str ("validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check")
  • regulation: str | None (e.g. "HU VAT Act §169", "ISA 240", "GDPR Article 28")
  • affected_document: str | None
  • source_check_id: str | None

3. Anti-hallucination stack (5+1 layers)

  1. temperature=0 — every LLM call is deterministic-ish.
  2. _quotes schema field — verbatim source citations.
  3. _confidence schema field — per-field reliability (high|medium|low).
  4. validate_plausibility() — Python deterministic plausibility checks (negative VAT, non-standard rates, future dates, etc.).
  5. 3-filter LLM risk pipeline: filter_llm_risks (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact) → drop_business_normal_risks (semantic: cross-check vs extracted_data, 6 known false-positive patterns) → drop_repeats_of_basic (textual dedup vs basic risks, 70% threshold).
  6. Quote validator — final cross-check that every _quotes entry actually appears in the source full_text (whitespace + diacritic + case normalized). If invalid, downgrades confidence.
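The normalization behind layer 6 can be sketched with the stdlib: NFKD-decompose, strip combining marks (diacritics), casefold, collapse whitespace, then do a substring check. The helper names are illustrative; the real node also downgrades `_confidence` on failure.

```python
import re
import unicodedata


def _normalize(text: str) -> str:
    """Whitespace-, diacritic- and case-insensitive canonical form."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).casefold().strip()


def quote_appears(quote: str, full_text: str) -> bool:
    """Layer-6 check: does a claimed verbatim _quotes entry actually
    occur in the source document's full_text?"""
    return _normalize(quote) in _normalize(full_text)
```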

4. Domain checks (14 deterministic rules)

| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|------------|--------------|------------|
| 01 | check_01_invoice_mandatory | HU VAT Act §169 | yes | invoice |
| 02 | check_02_tax_cdv | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | check_03_contract_completeness | Universal contract completeness | no | contract |
| 04 | check_04_proportionality | Universal contract proportionality (>31.7%) | no | contract |
| 05 | check_05_rounded_amounts | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | check_06_evidence_score | ISA 500 | no | (separate entry, info-only) |
| 07 | check_07_materiality | ISA 320 | no | invoice + contract + financial_report |
| 08 | check_08_gdpr_28 | GDPR Article 28 | no (EU) | contract |
| 09 | check_09_dd_red_flags | M&A DD best practice | no | contract |
| 10 | check_10_incoterms | Incoterms 2020 | no | contract |
| 11 | check_11_ifrs_har | IFRS / national GAAP comparison | no | financial_report |
| 12 | check_12_duplicate_invoice | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | check_13_aml_sanctions | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | check_14_contract_dates | Contract date best practice | no | contract |

The dispatch in domain_dispatch_node skips check_06 and check_12 (they have separate entry points) and filters is_hu_specific=True out for non-HU documents.
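The dispatch filter can be sketched as a pure function over a check registry. The registry below is a truncated, hypothetical stand-in (the real one holds all 14 checks with their applies-to sets):

```python
# Hypothetical registry rows: (check_id, is_hu_specific, applies_to).
CHECKS = [
    ("check_01_invoice_mandatory", True, {"invoice"}),
    ("check_03_contract_completeness", False, {"contract"}),
    ("check_06_evidence_score", False, {"invoice", "contract"}),
    ("check_12_duplicate_invoice", False, {"invoice"}),
]
SEPARATE_ENTRY = {"check_06_evidence_score", "check_12_duplicate_invoice"}


def applicable_checks(doc_type: str, is_hu_doc: bool) -> list[str]:
    """Mirror of the domain_dispatch_node filter: skip the two checks
    with their own entry points, drop HU-specific rules for non-HU
    documents, keep only checks applicable to this doc_type."""
    return [
        check_id
        for check_id, hu_only, applies_to in CHECKS
        if check_id not in SEPARATE_ENTRY
        and (is_hu_doc or not hu_only)
        and doc_type in applies_to
    ]
```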

5. Provider system

Three providers via configurable_alternatives:

  • vllm: ChatOpenAI with base_url=VLLM_BASE_URL pointing at the AMD MI300X vLLM endpoint. Production default.
  • ollama: ChatOllama with a local Ollama daemon (Qwen 2.5 7B Instruct). Development fallback.
  • dummy: DummyChatModel (deterministic stub, no network). CI / eval / load.

Provider selection is runtime-switchable without restart:

graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})

6. Embedding

BAAI/bge-m3 (2.27 GB, 1024 dim, multilingual) by default. Sentence-transformers loads it on first call via @lru_cache. Pre-downloaded at Docker build time so runtime has no network call.

7. Hybrid retrieval (Chroma + BM25)

store/hybrid_store.py runs vector search and BM25 in parallel and merges with Reciprocal Rank Fusion (RRF). The chunker uses natural break points (paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char overlap.
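The RRF merge itself is a few lines; a minimal sketch with the common k=60 constant (the real HybridStore parameters may differ):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over result lists of
    1 / (k + rank(d)), with ranks starting at 1. Documents high in
    either the vector or the BM25 ranking float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```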

8. Async-first runtime

LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer on a long-lived background event loop (app/async_runtime.py's AsyncRuntime singleton). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP session, and the AsyncSqliteSaver SQLite pool persistent across user interactions — they do not rebuild per request.
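The shape of such a runtime can be sketched with the stdlib: one event loop on a daemon thread, and a sync `run()` bridge for Streamlit callbacks. This is the general pattern only; the details of app/async_runtime.py are assumptions.

```python
import asyncio
import threading


class AsyncRuntime:
    """Long-lived background event loop: sync callers submit coroutines
    to one persistent loop, so connections and sessions built on that
    loop survive across requests instead of being rebuilt each time."""
    _instance = None

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    @classmethod
    def get(cls) -> "AsyncRuntime":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def run(self, coro):
        """Block the calling sync thread until the coroutine finishes."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```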

9. Multilingual support

The codebase is English-first but multilingual-tolerant:

  • The classifier matches HU/EN/DE keyword patterns.
  • Risk filters tolerate HU/DE business terms.
  • The OCR layer keeps eng + hun + deu as Tesseract languages.
  • Demo data may include mixed-language documents.

The output (UI, exec summary, DOCX report) is always English.