PaperHawk Architecture
How PaperHawk is built and why each piece is where it is. This document explains the multi-graph LangGraph orchestration, the 14 deterministic domain checks, the 6-layer anti-hallucination stack, and the multi-agent DD assistant.
High-level architecture
┌───────────────────────────────────────────────────────────────────────┐
│                       USER (Streamlit 5-tab UI)                       │
│            Upload | Results | Chat | DD Assistant | Report            │
└────────────────────────────────┬──────────────────────────────────────┘
                                 │
          ┌──────────────────────┼─────────────────────────┐
          │                      │                         │
          ▼                      ▼                         ▼
┌───────────────────┐  ┌───────────────────┐  ┌─────────────────────────┐
│ pipeline_graph    │  │ chat_graph        │  │ dd_graph                │
│                   │  │                   │  │                         │
│ Ingest →          │  │ Intent classify   │  │ Contract filter →       │
│ Classify →        │  │ → Plan →          │  │ Per-contract summary →  │
│ Extract →         │  │ Agent (5 tools)   │  │ Multi-agent specialists │
│ Compare →         │  │ → Synthesizer     │  │ (audit/legal/compliance │
│ Risk →            │  │ → Validator       │  │ /financial) →           │
│ Report            │  │ ([Source: …])     │  │ Supervisor → Synthesizer│
└─────────┬─────────┘  └───────────────────┘  └────────────┬────────────┘
          │                                                │
          └───────────────────────┬────────────────────────┘
                                  ▼
                  ┌───────────────────────────────┐
                  │ package_insights_graph        │
                  │                               │
                  │ Cross-document analysis       │
                  │ (price-drift, dupes,          │
                  │ three-way matching)           │
                  └───────────────┬───────────────┘
                                  │
                                  ▼
                  ┌───────────────────────────────┐
                  │ Provider abstraction          │
                  │ (configurable_alternatives)   │
                  │                               │
                  │ vLLM | Ollama | Dummy         │
                  └───────────────┬───────────────┘
                                  │
                                  ▼
                  ┌───────────────────────────────┐
                  │ AMD MI300X (vLLM)             │
                  │ Qwen 2.5 14B Instruct         │
                  │ 192 GB HBM3, ROCm 7.0         │
                  └───────────────────────────────┘
Compiled graphs (4)
Every entry-point in the system is a separately compiled LangGraph artifact with its own typed state and AsyncSqliteSaver checkpointer:
1. pipeline_graph: the document processing pipeline
The 6-step end-to-end flow when the user uploads a package:
- Ingest: PDF (PyMuPDF + pdfplumber for table extraction), DOCX (native), images (vision-first via the LLM), with Tesseract OCR fallback for scanned PDFs (EN/HU/DE)
- Classify: 6-way doc-type classifier with structured output (`invoice`, `delivery_note`, `purchase_order`, `contract`, `financial_report`, `other`); ISA 500 evidence-quality score
- Extract: per-doc-type Pydantic v2 schema with `_quotes` and `_confidence` fields; universal fallback schema for unknown types
- Compare: three-way matching subgraph (invoice + delivery note + PO), duplicate-invoice detection (ISA 240)
- Risk: basic plausibility + 14 domain checks (Send-API parallel fan-out) + LLM risk ensemble + 3-stage filter chain
- Report: DOCX export, JSON output, Streamlit UI rendering
State: PipelineState (Pydantic), with reducers for risk lists and per-document results.
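The reducer idea can be illustrated without LangGraph itself. The sketch below is plain Python and is not the project's `PipelineState`: the class `SketchState`, its field names, and the `apply_update` helper are all hypothetical. It only mimics how a reducer-annotated field merges updates from parallel branches instead of overwriting them.

```python
import operator
from typing import Annotated, Any, get_type_hints


def merge_dicts(left: dict, right: dict) -> dict:
    """Reducer for per-document results: keep existing keys, add new ones."""
    return {**left, **right}


class SketchState:
    """Hypothetical schema; the second Annotated argument is the reducer."""
    risks: Annotated[list, operator.add]   # lists from parallel checks are concatenated
    results: Annotated[dict, merge_dicts]  # per-document results are merged by key
    current_step: str                      # no reducer: the last write wins


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Apply one node's partial update, using a reducer where one is declared."""
    hints = get_type_hints(SketchState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints.get(key), "__metadata__", ())
        reducer = metadata[0] if metadata else None
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"risks": ["missing tax ID"], "results": {"a.pdf": "ok"}, "current_step": "extract"}
state = apply_update(state, {"risks": ["rounded amounts"], "results": {"b.pdf": "ok"}, "current_step": "risk"})
```

After the update, `state["risks"]` holds both findings, while `current_step` has simply been replaced.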
2. chat_graph: the agentic chat
5-tool ReAct agent with strict citation enforcement:
- Tools: `list_documents`, `get_extraction`, `search_documents` (hybrid Chroma + BM25 with Reciprocal Rank Fusion), `compare_documents`, `validate_document`
- Prompt: 17-rule system prompt enforcing `[Source: filename.pdf]` format
- Validator node: post-processor that drops any answer without citations
- Intent classifier: routes to direct-answer vs tool-use paths to keep latency low for casual queries
State: ChatState with message history, retrieved chunks, and citation list.
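As a rough illustration of the validator node, the sketch below checks an answer for `[Source: filename.pdf]` citations and drops it when none are present. The function names and the regular expression are assumptions made for this example. The optional `known_files` check goes beyond what is described above and is included only as a possible extension.

```python
import re
from typing import Optional

# Matches citations such as [Source: invoice_001.pdf]; the pattern is an assumption.
CITATION_RE = re.compile(r"\[Source:\s*([^\]\s][^\]]*?)\s*\]")


def extract_citations(answer: str) -> list:
    """Return every cited filename in order of appearance."""
    return [name.strip() for name in CITATION_RE.findall(answer)]


def validate_answer(answer: str, known_files: Optional[set] = None) -> Optional[str]:
    """Return the answer unchanged if it is properly cited, otherwise None."""
    cited = extract_citations(answer)
    if not cited:
        return None  # no citation at all: drop the answer
    if known_files is not None and not all(name in known_files for name in cited):
        return None  # hypothetical extension: cites a document not in the package
    return answer
```

A caller would treat `None` as "no answer passed validation" and fall back to a refusal or a retry.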
3. dd_graph: the multi-agent DD assistant
For M&A due-diligence packages:
- Contract filter: selects only contract-type documents from the package
- Per-contract summary: extracts each contract's key terms (parties, term, value, change-of-control, non-compete, auto-renewal)
- 4 specialist agents (run in parallel via Send-API):
  - `audit_specialist`: material misstatement risk, ISA 240 fraud indicators
  - `legal_specialist`: change-of-control, non-compete, automatic-renewal red flags
  - `compliance_specialist`: GDPR Art. 28 sub-processor language, AML counterparty checks
  - `financial_specialist`: Ptk. 6:98 disproportionate penalty clauses, materiality thresholds
- Supervisor: coordinates specialists, drops business-normal noise
- Synthesizer: writes 3-paragraph executive summary
State: DDState with contract list, per-contract summaries, specialist findings, executive summary.
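The fan-out and supervisor steps can be approximated with plain `asyncio`. This is a sketch under stated assumptions, not the `dd_graph` implementation: `asyncio.gather` stands in for the Send-API, the specialists are rule-based stubs rather than LLM agents, and the `Finding` fields and the "drop info-level findings" rule are hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    specialist: str
    message: str
    severity: str  # "info", "medium", or "high" (hypothetical scale)


SPECIALISTS = [
    "audit_specialist",
    "legal_specialist",
    "compliance_specialist",
    "financial_specialist",
]


async def run_specialist(name: str, summary: dict) -> list:
    """Stub specialist; a real one would call the LLM with a role-specific prompt."""
    await asyncio.sleep(0)  # yield to the event loop, as a model call would
    findings = []
    if name == "legal_specialist":
        if summary.get("change_of_control"):
            findings.append(Finding(name, "change-of-control clause present", "high"))
        if summary.get("auto_renewal"):
            findings.append(Finding(name, "automatic renewal", "info"))
    return findings


async def review_contract(summary: dict) -> list:
    # Fan out: all four specialists run concurrently on the same summary.
    groups = await asyncio.gather(*(run_specialist(n, summary) for n in SPECIALISTS))
    merged = [finding for group in groups for finding in group]
    # Supervisor step (assumed rule): drop info-level, business-normal noise.
    return [finding for finding in merged if finding.severity != "info"]


result = asyncio.run(review_contract({"change_of_control": True, "auto_renewal": True}))
```

Here `result` keeps the change-of-control finding and discards the info-level renewal note.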
4. package_insights_graph: cross-document analysis
Package-level analyzers that don't fit into the per-document pipeline:
- Pricing-drift detector: flags price changes above 30% for the same line item across invoices in a package (caught the 57.5% drift in our live demo)
- Duplicate-invoice detector: exact + near-match (date within 13 days, amount within 1%)
- Counterparty consistency: same supplier name spelled differently across documents
State: PackageState with per-document extractions and aggregated findings.
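The first two analyzers can be sketched with the thresholds quoted above (30% drift, 13 days, 1% amount tolerance). The data classes, the min-to-max definition of drift, and the same-supplier requirement for duplicates are assumptions made for this example, not details taken from the codebase.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class InvoiceLine:
    invoice_id: str
    item: str
    unit_price: float


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    supplier: str
    issued: date
    total: float


def price_drift(lines: list, threshold: float = 0.30) -> list:
    """Return (item, relative_change) for items whose lowest-to-highest
    unit price change across the package exceeds the threshold."""
    by_item = {}
    for line in lines:
        by_item.setdefault(line.item, []).append(line.unit_price)
    flagged = []
    for item, prices in by_item.items():
        low, high = min(prices), max(prices)
        if low > 0 and (high - low) / low > threshold:
            flagged.append((item, (high - low) / low))
    return flagged


def near_duplicates(invoices: list, max_days: int = 13, amount_tol: float = 0.01) -> list:
    """Return pairs of invoice IDs that look like exact or near duplicates."""
    pairs = []
    for a, b in combinations(invoices, 2):
        same_supplier = a.supplier == b.supplier  # assumed precondition
        close_date = abs((a.issued - b.issued).days) <= max_days
        close_amount = abs(a.total - b.total) <= amount_tol * max(a.total, b.total)
        if same_supplier and close_date and close_amount:
            pairs.append((a.invoice_id, b.invoice_id))
    return pairs
```

Exact duplicates fall out of the same comparison, since a zero-day, zero-amount difference satisfies both tolerances.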
Subgraphs (6)
Reusable LangGraph subgraphs imported by the main graphs:
| Subgraph | Purpose |
|---|---|
| `extract_subgraph` | Per-document extraction with quote validator |
| `ingest_subgraph` | PDF/DOCX/image loading with OCR fallback |
| `llm_risk_subgraph` | LLM risk generation with structured output |
| `rag_index_subgraph` | Chunking, embedding, ChromaDB indexing |
| `rag_query_subgraph` | Hybrid Chroma + BM25 retrieval with RRF |
| `risk_subgraph` | Domain check fan-out + LLM risk + 3-stage filters |
14 deterministic domain checks
The check registry (`domain_checks/__init__.py`) is the heart of PaperHawk's auditor-grade output. Every check is a Python Protocol implementation, not an LLM prompt: checks cannot hallucinate, can be unit-tested, and produce defensible findings with explicit regulation sources.
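To make the Protocol-based design concrete, here is a minimal sketch of what a check and a registry could look like. The names `DomainCheck`, `Finding`, `REGISTRY`, and `run_all`, as well as the document field names, are hypothetical and may not match the real `domain_checks` package. The example check uses the 31.7% disproportionality threshold listed under the A-tier checks.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Finding:
    check_id: str
    severity: str
    message: str
    source: str  # regulation or standard the finding rests on


@runtime_checkable
class DomainCheck(Protocol):
    check_id: str

    def run(self, doc: dict) -> list: ...


class PenaltyDisproportionality:
    """Flags penalty clauses that exceed a share of the contract value."""
    check_id = "disproportionality"

    def __init__(self, threshold: float = 0.317) -> None:
        self.threshold = threshold

    def run(self, doc: dict) -> list:
        value = doc.get("contract_value")    # hypothetical field name
        penalty = doc.get("penalty_amount")  # hypothetical field name
        if not value or penalty is None:
            return []  # not enough data to judge; stay silent rather than guess
        ratio = penalty / value
        if ratio > self.threshold:
            message = f"penalty is {ratio:.1%} of contract value"
            return [Finding(self.check_id, "HIGH", message, "Ptk. 6:98")]
        return []


REGISTRY = [PenaltyDisproportionality()]


def run_all(doc: dict) -> list:
    """Run every registered check and collect the findings."""
    return [finding for check in REGISTRY for finding in check.run(doc)]
```

Because a check is an ordinary object with a `run` method, it can be unit-tested with a plain dictionary and no model in the loop.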
A-tier (essential)
- Mandatory invoice elements (HU VAT Act §169): 18 required elements per invoice
- Tax-ID checksum (Art. 22 §): mod-11 Hungarian tax-ID validation
- Contract completeness (Ptk. Book 6): termination, governing law, penalty, confidentiality clauses
- Disproportionality (Ptk. 6:98): penalty clause > 31.7% of contract value flagged HIGH
- Rounded amounts (ISA 240): > 14.7% rounded amounts flagged suspicious, > 24.3% flagged HIGH
- Evidence hierarchy (ISA 500): document-type reliability score (8/10 invoice, 7/10 contract)
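As an illustration of how small these deterministic checks can be, the rounded-amounts rule might look like the sketch below. The 14.7% and 24.3% thresholds come from the list above; treating "rounded" as "an exact multiple of 1000" and the function names are assumptions for this example.

```python
from typing import Optional


def rounded_share(amounts: list, round_unit: int = 1000) -> float:
    """Fraction of non-zero amounts that are exact multiples of round_unit."""
    if not amounts:
        return 0.0
    rounded = sum(1 for a in amounts if a != 0 and a % round_unit == 0)
    return rounded / len(amounts)


def rounded_amount_severity(
    amounts: list,
    suspicious: float = 0.147,
    high: float = 0.243,
    round_unit: int = 1000,
) -> Optional[str]:
    """Map the share of rounded amounts to a severity label, or None."""
    share = rounded_share(amounts, round_unit)
    if share > high:
        return "HIGH"
    if share > suspicious:
        return "SUSPICIOUS"
    return None
```

For example, one rounded amount out of five (20%) lands in the suspicious band, while two out of four (50%) is flagged HIGH.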
B-tier (supplementary)
- Materiality (ISA 320): 1.93% of document value as info-level threshold
- GDPR Article 28: 10 mandatory sub-processor language elements + PII detection
- DD red flags (M&A): change-of-control, non-compete, automatic-renewal triggers
C-tier (informational)
- Incoterms 2020: 11 incoterm rules detected via regex word-boundaries
- IFRS/HAR anomaly: goodwill amortization flag, operational lease in IFRS context
- Duplicate invoice (ISA 240): exact + near-match with 13-day date filter
- AML sanctions (Pmt.): static EU/OFAC snapshot with fuzzy name match
- Contract dates: start-end consistency, expiry detection
Jurisdiction-aware: Hungarian-specific rules (HU VAT Act, Ptk., Art.) apply only to Hungarian documents. Universal rules (ISA, GDPR, Incoterms, AML) apply everywhere.
6-layer anti-hallucination stack
The system is designed so the LLM cannot lie about a document and have the lie pass through.
| Layer | What it does |
|---|---|
| 1. `temperature=0` | Deterministic outputs every run |
| 2. Source quote requirement | Every extraction must include a verbatim quote from the source PDF in `_quotes` |
| 3. Confidence scoring | high / medium / low per extracted field, surfaced to the user |
| 4. Plausibility validators | Deterministic Python checks for math, dates, totals, item-level VAT, currency normalization |
| 5. 3-stage LLM-risk filter chain | Drops business-normal noise, drops repeats of basic deterministic checks, drops contradictions |
| 6. Quote validator | Text-search the source PDF for the claimed quote; downgrade confidence if not found verbatim, drop entirely if obviously fabricated |
In our live audit demo, layer 6 caught 4 of 6 hallucinated citations from Qwen 2.5 14B and downgraded them to low confidence.
The validation/ package is one of the most-edited folders in the repo precisely because we treat anti-hallucination as a first-class concern, not a guardrail layer slapped on top.
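Layer 6 can be pictured as a plain text search plus a fallback heuristic. The sketch below is one possible shape, not the code in `validation/`. Whitespace and case normalization, the word-overlap test for "obviously fabricated", and the 0.5 cutoff are all assumptions.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Collapse whitespace and case so PDF line breaks do not defeat the match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def validate_quote(
    quote: str,
    source_text: str,
    confidence: str,
    min_overlap: float = 0.5,
) -> Optional[str]:
    """Return the (possibly downgraded) confidence, or None to drop the field."""
    q = normalize(quote)
    src = normalize(source_text)
    if not q:
        return None  # an empty quote cannot support an extraction
    if q in src:
        return confidence  # found verbatim: keep the model's confidence
    # Not verbatim. Assumed heuristic: measure how many quote words occur in the source.
    q_words = set(q.split())
    overlap = len(q_words & set(src.split())) / len(q_words)
    if overlap < min_overlap:
        return None  # shares too little with the source: treat as fabricated
    return "low"  # close but not verbatim: downgrade
```

A field that comes back as `None` would be removed from the extraction, while a `"low"` result stays visible to the user with its reduced confidence.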
Provider abstraction
configurable_alternatives lets us swap LLM backends with a single env var:
| `LLM_PROFILE` | Backend | Use case |
|---|---|---|
| `vllm` | vLLM REST endpoint (OpenAI-compatible) | Production on AMD MI300X |
| `ollama` | Local Ollama at `localhost:11434` | Dev on consumer GPU |
| `dummy` | Deterministic stub | CI tests, smoke tests, judge quick-demo |
The application code never imports an LLM SDK directly: all calls go through `providers/` factory functions with `configurable_alternatives`. Switching from Anthropic Claude (our original dev target) to Qwen on vLLM required zero application code changes, only env vars.
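The env-var switch can be sketched as a small factory. This example deliberately avoids LangChain and does not use `configurable_alternatives`; it only shows the selection logic. The class names, the `VLLM_BASE_URL` variable, the default URLs, and the choice of `dummy` as the fallback profile are assumptions.

```python
import os
from typing import Optional, Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class DummyModel:
    """Deterministic stub for CI and smoke tests."""

    def complete(self, prompt: str) -> str:
        return "DUMMY: " + prompt[:40]


class HttpModel:
    """Placeholder for an HTTP-backed client; no network call is made here."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire up a real client in application code")


# Keys mirror the LLM_PROFILE values in the table above.
PROVIDERS = {
    "vllm": lambda: HttpModel(os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")),
    "ollama": lambda: HttpModel("http://localhost:11434"),
    "dummy": DummyModel,
}


def get_model(profile: Optional[str] = None) -> ChatModel:
    """Build the backend named by the argument or by the LLM_PROFILE env var."""
    key = (profile or os.environ.get("LLM_PROFILE", "dummy")).lower()
    if key not in PROVIDERS:
        raise ValueError(f"unknown LLM_PROFILE {key!r}; expected one of {sorted(PROVIDERS)}")
    return PROVIDERS[key]()
```

Application code would call `get_model()` and depend only on the `ChatModel` interface, which is what keeps a backend swap free of code changes.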
Embedding + retrieval
- Model: BAAI/bge-m3 (1024-dim, multilingual EN/HU/DE/FR via sentence-transformers)
- Storage: ChromaDB persistent (per-session) + BM25 in-memory keyword index
- Hybrid retrieval: Reciprocal Rank Fusion of Chroma top-K and BM25 top-K
- Chunking: Natural-boundary chunking (paragraph-aware, ~500 tokens with overlap)
The embedding model loads once at app startup (2.3 GB to RAM/VRAM). On first run it downloads from Hugging Face Hub to `/.cache/huggingface/`.
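Reciprocal Rank Fusion, used for the hybrid retrieval step above, combines rankings by summing `1 / (k + rank)` for each document across the input lists. The sketch below is a generic implementation, not the project's `rag_query_subgraph`. The constant `k = 60` is a commonly used default and an assumption here, as is the alphabetical tie-break.

```python
from typing import Optional


def reciprocal_rank_fusion(rankings: list, k: int = 60, top_n: Optional[int] = None) -> list:
    """Fuse several ranked lists of document IDs into one ranking.

    Each list contributes 1 / (k + rank) per document, with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


dense_hits = ["c1", "c2", "c3"]   # e.g. Chroma top-K, best first
sparse_hits = ["c3", "c1", "c4"]  # e.g. BM25 top-K, best first
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, without any need to put the dense and BM25 scores on a common scale.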
State persistence
- Per-session: Streamlit `session_state` for UI state (uploaded files, current package)
- Per-graph: `AsyncSqliteSaver` checkpointer at `data/checkpoints.sqlite` for LangGraph state
- Vector store: ChromaDB at `chroma_db/` (gitignored)
Restarting the app loads the last checkpoint, so chat history and extraction results survive a restart.
Streamlit UI (5 tabs)
- Upload: drag-and-drop (PDF, DOCX, PNG, JPG, TXT), 200 MB per file, plus 3 pre-bundled demo packages
- Results: classification confidence, extracted data, risks per document, package-level cross-doc analysis
- Chat: agentic chat with `[Source: filename.pdf]` citations
- DD Assistant: for M&A packages, per-contract summaries + 4 specialist findings + executive summary + downloadable DOCX
- Report: JSON output + DOCX export
The async runtime uses a long-lived background event loop (app/async_runtime.py) so the UI stays responsive during multi-minute pipeline runs.
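A long-lived background event loop is a standard `asyncio` pattern, sketched below. This is not the contents of `app/async_runtime.py`; the `BackgroundLoop` class and the `slow_pipeline` coroutine are hypothetical stand-ins.

```python
import asyncio
import threading
from concurrent.futures import Future


class BackgroundLoop:
    """Runs one asyncio event loop in a daemon thread for the app's lifetime."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def submit(self, coro) -> Future:
        """Schedule a coroutine from any thread and return a concurrent Future."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def stop(self) -> None:
        """Stop the loop, wait for the thread to exit, then release the loop."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def slow_pipeline(n: int) -> int:
    """Stand-in for a multi-minute pipeline run."""
    await asyncio.sleep(0.01)
    return n * 2


runtime = BackgroundLoop()
future = runtime.submit(slow_pipeline(21))  # returns immediately
result = future.result(timeout=5)           # the UI thread could poll instead of blocking
runtime.stop()
```

In a UI script the caller would typically keep the `Future`, check `future.done()` on each rerun, and render progress until the result is available.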
Cross-references
- `docs/AMD_DEPLOYMENT.md`: how the production vLLM endpoint runs on AMD MI300X
- `docs/HUGGINGFACE_DEPLOYMENT.md`: how the Streamlit app deploys as a public HF Space
- `docs/SUBMISSION.md`: full hackathon submission brief with TAM/SAM, competitor positioning, live deployment validation