# PaperHawk Architecture

How PaperHawk is built and why each piece is where it is. This document explains the multi-graph LangGraph orchestration, the 14 deterministic domain checks, the 6-layer anti-hallucination stack, and the multi-agent DD assistant.

---

## High-level architecture

```
+-------------------------------------------------------------------------+
|                        USER (Streamlit 5-tab UI)                        |
|             Upload | Results | Chat | DD Assistant | Report             |
+-------------------------------+-----------------------------------------+
                                |
         +----------------------+---------------------------+
         |                      |                           |
         v                      v                           v
+------------------+  +--------------------+  +---------------------------+
| pipeline_graph   |  | chat_graph         |  | dd_graph                  |
|                  |  |                    |  |                           |
| Ingest ->        |  | Intent classify -> |  | Contract filter ->        |
| Classify ->      |  | Plan ->            |  | Per-contract summary ->   |
| Extract ->       |  | Agent (5 tools) -> |  | Multi-agent specialists   |
| Compare ->       |  | Synthesizer ->     |  | (audit/legal/compliance   |
| Risk ->          |  | Validator          |  |  /financial) ->           |
| Report           |  | ([Source: …])      |  | Supervisor -> Synthesizer |
+------------------+  +--------------------+  +---------------------------+
         |                                                  |
         +----------------------+---------------------------+
                                |
                                v
                +------------------------------+
                | package_insights_graph       |
                |                              |
                | Cross-document analysis      |
                | (price-drift, dupes,         |
                |  three-way matching)         |
                +------------------------------+
                                |
                                v
                +------------------------------+
                | Provider abstraction         |
                | (configurable_alternatives)  |
                |                              |
                | vLLM | Ollama | Dummy        |
                +------------------------------+
                                |
                                v
                +------------------------------+
                | AMD MI300X (vLLM)            |
                | Qwen 2.5 14B Instruct        |
                | 192 GB HBM3, ROCm 7.0        |
                +------------------------------+
```

---

## Compiled graphs (4)

Every entry-point in the system is a separately compiled LangGraph artifact with its own typed state and `AsyncSqliteSaver` checkpointer:

### 1. `pipeline_graph`: the document processing pipeline

The 6-step end-to-end flow when the user uploads a package:

1. **Ingest**: PDF (PyMuPDF + pdfplumber for table extraction), DOCX (native), images (vision-first via the LLM), with Tesseract OCR fallback for scanned PDFs (EN/HU/DE)
2. **Classify**: 6-way doc-type classifier with structured output (`invoice`, `delivery_note`, `purchase_order`, `contract`, `financial_report`, `other`); ISA 500 evidence-quality score
3. **Extract**: per doc-type Pydantic v2 schema with `_quotes` and `_confidence` fields; universal fallback schema for unknown types
4. **Compare**: three-way matching subgraph (invoice + delivery note + PO), duplicate-invoice detection (ISA 240)
5. **Risk**: basic plausibility + 14 domain checks (Send-API parallel fan-out) + LLM risk ensemble + 3-stage filter chain
6. **Report**: DOCX export, JSON output, Streamlit UI rendering

State: `PipelineState` (Pydantic), with reducers for risk lists and per-document results.

### 2. `chat_graph`: the agentic chat

5-tool ReAct agent with strict citation enforcement:

- **Tools**: `list_documents`, `get_extraction`, `search_documents` (hybrid Chroma + BM25 with Reciprocal Rank Fusion), `compare_documents`, `validate_document`
- **Prompt**: 17-rule system prompt enforcing `[Source: filename.pdf]` format
- **Validator node**: post-processor that drops any answer without citations
- **Intent classifier**: routes to direct-answer vs tool-use paths to keep latency low for casual queries

State: `ChatState` with message history, retrieved chunks, and citation list.

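The validator node's rule (no citation, no answer) can be sketched with a regular expression. This is a minimal illustration, not PaperHawk's actual validator: the function names, the fallback message, and the extra check that cited files exist in the package are all assumptions.

```python
import re

# Assumed citation shape: "[Source: <filename>]"; the real validator may be stricter.
CITATION_RE = re.compile(r"\[Source:\s*([^\]\s][^\]]*?)\s*\]")

# Hypothetical wording for the message returned when an answer is dropped.
FALLBACK = "I could not find a cited source for that answer."


def extract_citations(answer: str) -> list[str]:
    """Return the filenames cited in an answer, in order of appearance."""
    return [m.group(1) for m in CITATION_RE.finditer(answer)]


def validate_answer(answer: str, known_files: set[str]) -> str:
    """Drop the answer unless every citation names a document in the package.

    Checking against known_files is an assumption added for this sketch;
    the documented rule only requires that a citation is present.
    """
    cited = extract_citations(answer)
    if not cited or not all(name in known_files for name in cited):
        return FALLBACK
    return answer
```

For example, `validate_answer("Total is 100.", {"invoice_01.pdf"})` returns the fallback message because the answer carries no `[Source: …]` marker.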
### 3. `dd_graph`: the multi-agent DD assistant

For M&A due-diligence packages:

- **Contract filter**: selects only contract-type documents from the package
- **Per-contract summary**: extracts each contract's key terms (parties, term, value, change-of-control, non-compete, auto-renewal)
- **4 specialist agents** (run in parallel via Send-API):
  - `audit_specialist`: material misstatement risk, ISA 240 fraud indicators
  - `legal_specialist`: change-of-control, non-compete, automatic-renewal red flags
  - `compliance_specialist`: GDPR Art. 28 sub-processor language, AML counterparty checks
  - `financial_specialist`: Ptk. 6:98 disproportionate penalty clauses, materiality thresholds
- **Supervisor**: coordinates specialists, drops business-normal noise
- **Synthesizer**: writes 3-paragraph executive summary

State: `DDState` with contract list, per-contract summaries, specialist findings, executive summary.

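The parallel specialist step can be pictured with plain `asyncio`. The sketch below uses `asyncio.gather` as a stand-in for LangGraph's Send-API fan-out, which it does not reproduce; `run_specialist` is a hypothetical stub where a real specialist would call the LLM.

```python
import asyncio

SPECIALISTS = (
    "audit_specialist",
    "legal_specialist",
    "compliance_specialist",
    "financial_specialist",
)


async def run_specialist(name: str, summaries: list[str]) -> dict:
    """Hypothetical stub: a real specialist would prompt the LLM here."""
    await asyncio.sleep(0)  # yield control, standing in for network latency
    return {
        "specialist": name,
        "findings": [f"{name} reviewed {len(summaries)} contract(s)"],
    }


async def fan_out(summaries: list[str]) -> list[dict]:
    """Run all four specialists concurrently; gather keeps the input order."""
    return await asyncio.gather(
        *(run_specialist(name, summaries) for name in SPECIALISTS)
    )


results = asyncio.run(fan_out(["Contract A summary", "Contract B summary"]))
```

Because `gather` returns results in submission order, a supervisor step can rely on a stable ordering of specialist findings regardless of which specialist finishes first.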
### 4. `package_insights_graph`: cross-document analysis

Package-level analyzers that don't fit into the per-document pipeline:

- **Pricing-drift detector**: flags > 30% price changes for the same line item across invoices in a package (caught the 57.5% drift in our live demo)
- **Duplicate-invoice detector**: exact + near-match (date within 13 days, amount within 1%)
- **Counterparty consistency**: same supplier name spelled differently across documents

State: `PackageState` with per-document extractions and aggregated findings.

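The two numeric rules above (30% drift, 13 days and 1% for near-duplicates) can be written as small pure functions. This is a sketch under stated assumptions: the `Invoice` fields are hypothetical, drift is measured relative to the earlier price, the amount tolerance is taken relative to the larger amount, and near-duplicates are assumed to require the same supplier.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Invoice:
    # Hypothetical minimal shape; the real extraction schema has more fields.
    number: str
    supplier: str
    issued: date
    amount: float


def price_drift(old_price: float, new_price: float) -> float:
    """Relative change between two unit prices, as a fraction of the old price."""
    return abs(new_price - old_price) / old_price


def is_drift(old_price: float, new_price: float, threshold: float = 0.30) -> bool:
    """True when the change exceeds the 30% threshold."""
    return price_drift(old_price, new_price) > threshold


def is_near_duplicate(
    a: Invoice, b: Invoice, max_days: int = 13, tolerance: float = 0.01
) -> bool:
    """Same supplier, dates within max_days, amounts within 1% of the larger one."""
    if a.supplier != b.supplier:
        return False
    close_in_time = abs((a.issued - b.issued).days) <= max_days
    close_in_amount = abs(a.amount - b.amount) <= tolerance * max(a.amount, b.amount)
    return close_in_time and close_in_amount
```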

---

## Subgraphs (6)

Reusable LangGraph subgraphs imported by the main graphs:

| Subgraph | Purpose |
|---|---|
| `extract_subgraph` | Per-document extraction with quote validator |
| `ingest_subgraph` | PDF/DOCX/image loading with OCR fallback |
| `llm_risk_subgraph` | LLM risk generation with structured output |
| `rag_index_subgraph` | Chunking, embedding, ChromaDB indexing |
| `rag_query_subgraph` | Hybrid Chroma + BM25 retrieval with RRF |
| `risk_subgraph` | Domain check fan-out + LLM risk + 3-stage filters |

---

## 14 deterministic domain checks

The check registry (`domain_checks/__init__.py`) is the heart of PaperHawk's auditor-grade output. Every check is a Python `Protocol` implementation, not an LLM prompt, so it cannot hallucinate, can be unit-tested, and produces defensible findings with explicit regulation sources.

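To make the `Protocol` idea concrete, here is a minimal sketch of what a check interface and registry could look like. The names `Finding`, `DomainCheck`, `ContractDatesCheck`, `REGISTRY`, and `run_all` are hypothetical and not taken from `domain_checks/__init__.py`; the example check mirrors the start-end consistency rule listed under the C-tier.

```python
from dataclasses import dataclass
from datetime import date
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    check_id: str
    severity: str  # e.g. "INFO", "MEDIUM", "HIGH"
    message: str
    source: str  # regulation or standard the finding cites


class DomainCheck(Protocol):
    """Structural interface: any class with these members is a check."""

    check_id: str

    def run(self, extraction: dict) -> list[Finding]: ...


class ContractDatesCheck:
    """Flags contracts whose end date precedes the start date."""

    check_id = "contract_dates"

    def run(self, extraction: dict) -> list[Finding]:
        start = extraction.get("start_date")
        end = extraction.get("end_date")
        if isinstance(start, date) and isinstance(end, date) and end < start:
            message = f"End date {end} is before start date {start}"
            return [Finding(self.check_id, "HIGH", message, "internal consistency")]
        return []


# Hypothetical registry: a plain list that the risk step iterates over.
REGISTRY: list[DomainCheck] = [ContractDatesCheck()]


def run_all(extraction: dict) -> list[Finding]:
    """Run every registered check and flatten the findings."""
    return [finding for check in REGISTRY for finding in check.run(extraction)]
```

Because the checks are plain Python, each one can be unit-tested with a dictionary fixture and no model call.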
### A-tier (essential)

1. **Mandatory invoice elements** (HU VAT Act §169): 18 required elements per invoice
2. **Tax-ID checksum** (Art. 22 §): mod-11 Hungarian tax-ID validation
3. **Contract completeness** (Ptk. Book 6): termination, governing law, penalty, confidentiality clauses
4. **Disproportionality** (Ptk. 6:98): penalty clause > 31.7% of contract value flagged HIGH
5. **Rounded amounts** (ISA 240): > 14.7% rounded amounts flagged suspicious, > 24.3% flagged HIGH
6. **Evidence hierarchy** (ISA 500): document-type reliability score (8/10 invoice, 7/10 contract)

### B-tier (supplementary)

7. **Materiality** (ISA 320): 1.93% of document value as info-level threshold
8. **GDPR Article 28**: 10 mandatory sub-processor language elements + PII detection
9. **DD red flags** (M&A): change-of-control, non-compete, automatic-renewal triggers

### C-tier (informational)

10. **Incoterms 2020**: 11 incoterm rules detected via regex word-boundaries
11. **IFRS/HAR anomaly**: goodwill amortization flag, operational lease in IFRS context
12. **Duplicate invoice** (ISA 240): exact + near-match with 13-day date filter
13. **AML sanctions** (Pmt.): static EU/OFAC snapshot with fuzzy name match
14. **Contract dates**: start-end consistency, expiry detection

**Jurisdiction-aware**: Hungarian-specific rules (HU VAT Act, Ptk., Art.) apply only to Hungarian documents. Universal rules (ISA, GDPR, Incoterms, AML) apply everywhere.

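Two of the threshold checks above (disproportionality and rounded amounts) reduce to a ratio compared against fixed cut-offs. The sketch below shows that arithmetic only. It assumes that "rounded" means a whole multiple of 1000 and that the percentages are the share of amounts in one document; neither definition is stated in this document, and the function names are hypothetical.

```python
from typing import Optional


def penalty_severity(
    penalty: float, contract_value: float, threshold: float = 0.317
) -> Optional[str]:
    """Return "HIGH" when the penalty exceeds 31.7% of the contract value."""
    if contract_value <= 0:
        return None
    return "HIGH" if penalty / contract_value > threshold else None


def is_rounded(amount: float, step: int = 1000) -> bool:
    """Assumption: an amount is 'rounded' if it is a whole multiple of step."""
    return amount > 0 and amount % step == 0


def rounded_amount_severity(
    amounts: list[float], suspicious: float = 0.147, high: float = 0.243
) -> Optional[str]:
    """Compare the share of rounded amounts against the two thresholds."""
    if not amounts:
        return None
    share = sum(is_rounded(a) for a in amounts) / len(amounts)
    if share > high:
        return "HIGH"
    if share > suspicious:
        return "SUSPICIOUS"
    return None
```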

---

## 6-layer anti-hallucination stack

The system is designed so the LLM **cannot** lie about a document and have the lie pass through.

| Layer | What it does |
|---|---|
| 1. `temperature=0` | Deterministic outputs every run |
| 2. Source quote requirement | Every extraction must include a verbatim quote from the source PDF in `_quotes` |
| 3. Confidence scoring | high / medium / low per extracted field, surfaced to the user |
| 4. Plausibility validators | Deterministic Python checks for math, dates, totals, item-level VAT, currency normalization |
| 5. 3-stage LLM-risk filter chain | Drops business-normal noise, drops repeats of basic deterministic checks, drops contradictions |
| 6. Quote validator | Text-search the source PDF for the claimed quote; downgrade confidence if not found verbatim, drop entirely if obviously fabricated |

In our live audit demo, layer 6 caught **4 of 6** hallucinated citations from Qwen 2.5 14B and downgraded them to `low` confidence.

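Layer 6 can be illustrated with a short text-search function. This is a sketch, not the code in `validation/`: whitespace and case normalization, the word-overlap heuristic for "obviously fabricated", and the 0.5 cut-off are all assumptions chosen for the example.

```python
import re
from typing import Optional


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so PDF line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def validate_quote(
    quote: str, source_text: str, confidence: str, min_overlap: float = 0.5
) -> Optional[str]:
    """Return the confidence to keep, "low" if downgraded, or None if dropped."""
    norm_quote = _normalize(quote)
    norm_source = _normalize(source_text)
    # Verbatim hit: keep the confidence the extractor reported.
    if norm_quote and norm_quote in norm_source:
        return confidence
    # Assumed heuristic: a quote sharing most words with the source is a
    # near miss (downgrade); one sharing few words is treated as fabricated.
    words = norm_quote.split()
    source_words = set(norm_source.split())
    overlap = sum(w in source_words for w in words) / len(words) if words else 0.0
    return "low" if overlap >= min_overlap else None
```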
The `validation/` package is one of the most-edited folders in the repo precisely because we treat anti-hallucination as a first-class concern, not a guardrail layer slapped on top.

---

## Provider abstraction

`configurable_alternatives` lets us swap LLM backends with a single env var:

| `LLM_PROFILE` | Backend | Use case |
|---|---|---|
| `vllm` | vLLM REST endpoint (OpenAI-compatible) | Production on AMD MI300X |
| `ollama` | Local Ollama at `localhost:11434` | Dev on consumer GPU |
| `dummy` | Deterministic stub | CI tests, smoke tests, judge quick-demo |

The application code never imports an LLM SDK directly: all calls go through `providers/` factory functions with `configurable_alternatives`. Switching from Anthropic Claude (our original dev target) to Qwen on vLLM required **zero application code changes**, only env vars.

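The selection-by-env-var pattern can be sketched with the standard library alone. This example does not use LangChain's `configurable_alternatives`; it only shows the factory idea behind it. Only `LLM_PROFILE` and the Ollama address come from this document; `VLLM_BASE_URL`, its default URL, and all class and function names are hypothetical.

```python
import os
from typing import Callable, Optional, Protocol


class ChatModel(Protocol):
    def invoke(self, prompt: str) -> str: ...


class DummyModel:
    """Deterministic stub, in the spirit of the `dummy` profile."""

    def invoke(self, prompt: str) -> str:
        return f"[dummy] {prompt[:40]}"


class RemoteModel:
    """Placeholder for an HTTP-backed model; this sketch makes no network call."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def invoke(self, prompt: str) -> str:
        raise NotImplementedError("wire up an HTTP client here")


# VLLM_BASE_URL and its default are illustrative, not PaperHawk settings.
FACTORIES: dict[str, Callable[[], ChatModel]] = {
    "vllm": lambda: RemoteModel(
        os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
    ),
    "ollama": lambda: RemoteModel("http://localhost:11434"),
    "dummy": DummyModel,
}


def get_model(profile: Optional[str] = None) -> ChatModel:
    """Pick a backend from the argument or the LLM_PROFILE env var."""
    profile = profile or os.environ.get("LLM_PROFILE", "dummy")
    try:
        return FACTORIES[profile]()
    except KeyError:
        raise ValueError(
            f"Unknown LLM_PROFILE {profile!r}; expected one of {sorted(FACTORIES)}"
        ) from None
```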

---

## Embedding + retrieval

- **Model**: BAAI/bge-m3 (1024-dim, multilingual EN/HU/DE/FR via sentence-transformers)
- **Storage**: ChromaDB persistent (per-session) + BM25 in-memory keyword index
- **Hybrid retrieval**: Reciprocal Rank Fusion of Chroma top-K and BM25 top-K
- **Chunking**: Natural-boundary chunking (paragraph-aware, ~500 tokens with overlap)

The embedding model loads once at app startup (~2.3 GB to RAM/VRAM). On first run it downloads from Hugging Face Hub to `~/.cache/huggingface/`.

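Reciprocal Rank Fusion merges the two ranked lists by summing `1 / (k + rank)` for each chunk across rankings. The sketch below uses `k = 60`, the constant commonly used for RRF; the value PaperHawk uses is not stated here, and the function name and chunk IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; a higher summed 1/(k + rank) score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, then by ID so ties resolve deterministically.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["c1", "c2", "c3"]  # e.g. Chroma top-K, best first
keyword_hits = ["c3", "c1", "c4"]  # e.g. BM25 top-K, best first
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```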

---

## State persistence

- **Per-session**: Streamlit `session_state` for UI state (uploaded files, current package)
- **Per-graph**: `AsyncSqliteSaver` checkpointer at `data/checkpoints.sqlite` for LangGraph state
- **Vector store**: ChromaDB at `chroma_db/` (gitignored)

Restarting the app loads the last checkpoint, so chat history and extraction results survive a restart.

---

## Streamlit UI (5 tabs)

1. **Upload**: drag-and-drop (PDF, DOCX, PNG, JPG, TXT), 200 MB per file, plus 3 pre-bundled demo packages
2. **Results**: classification confidence, extracted data, risks per document, package-level cross-doc analysis
3. **Chat**: agentic chat with `[Source: filename.pdf]` citations
4. **DD Assistant**: for M&A packages: per-contract summaries + 4 specialist findings + executive summary + downloadable DOCX
5. **Report**: JSON output + DOCX export

The async runtime uses a long-lived background event loop (`app/async_runtime.py`) so the UI stays responsive during multi-minute pipeline runs.

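A long-lived background loop of this kind is typically one `asyncio` event loop running on a daemon thread, with work submitted from the UI thread. The sketch below shows that pattern with the standard library only; it is not the contents of `app/async_runtime.py`, and `BackgroundLoop` and `slow_pipeline` are hypothetical names.

```python
import asyncio
import threading
from concurrent.futures import Future
from typing import Any, Coroutine


class BackgroundLoop:
    """Owns one event loop that runs forever on a daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def submit(self, coro: Coroutine[Any, Any, Any]) -> Future:
        """Schedule a coroutine from any thread; the caller can poll the future."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def stop(self) -> None:
        """Ask the loop to stop, wait for the thread, then release the loop."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def slow_pipeline(n: int) -> int:
    await asyncio.sleep(0.01)  # stands in for a multi-minute pipeline run
    return n * 2


runtime = BackgroundLoop()
future = runtime.submit(slow_pipeline(21))
result = future.result(timeout=5)  # a UI would poll future.done() instead
runtime.stop()
```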

---

## Cross-references

- [`docs/AMD_DEPLOYMENT.md`](AMD_DEPLOYMENT.md): how the production vLLM endpoint runs on AMD MI300X
- [`docs/HUGGINGFACE_DEPLOYMENT.md`](HUGGINGFACE_DEPLOYMENT.md): how the Streamlit app deploys as a public HF Space
- [`docs/SUBMISSION.md`](SUBMISSION.md): full hackathon submission brief with TAM/SAM, competitor positioning, live deployment validation