
PaperHawk Architecture

How PaperHawk is built and why each piece is where it is. This document explains the multi-graph LangGraph orchestration, the 14 deterministic domain checks, the 6-layer anti-hallucination stack, and the multi-agent DD assistant.


High-level architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                          USER (Streamlit 5-tab UI)                       │
│   Upload  │  Results  │  Chat  │  DD Assistant  │  Report                │
└────────────────────────────────┬─────────────────────────────────────────┘
                                 │
            ┌────────────────────┼────────────────────────┐
            │                    │                        │
            ▼                    ▼                        ▼
   ┌──────────────────┐ ┌──────────────────┐  ┌─────────────────────────┐
   │  pipeline_graph  │ │   chat_graph     │  │    dd_graph             │
   │                  │ │                  │  │                         │
   │ Ingest →         │ │ Intent classify  │  │ Contract filter →       │
   │ Classify →       │ │ → Plan →         │  │ Per-contract summary →  │
   │ Extract →        │ │ Agent (5 tools)  │  │ Multi-agent specialists │
   │ Compare →        │ │ → Synthesizer →  │  │ (audit/legal/compliance │
   │ Risk →           │ │ Validator        │  │  /financial) →          │
   │ Report           │ │ ([Source: …])    │  │ Supervisor → Synthesizer│
   └──────────────────┘ └──────────────────┘  └─────────────────────────┘
            │                                        │
            └─────────────┬──────────────────────────┘
                          ▼
                ┌──────────────────────────┐
                │  package_insights_graph  │
                │                          │
                │  Cross-document analysis │
                │  (price-drift, dupes,    │
                │   three-way matching)    │
                └──────────────────────────┘
                          │
                          ▼
                ┌──────────────────────────┐
                │    Provider abstraction  │
                 │ configurable_alternatives│
                │                          │
                │ vLLM ←→ Ollama ←→ Dummy  │
                └──────────────────────────┘
                          │
                          ▼
                ┌──────────────────────────┐
                │  AMD MI300X (vLLM)       │
                │  Qwen 2.5 14B Instruct   │
                │  192 GB HBM3, ROCm 7.0   │
                └──────────────────────────┘

Compiled graphs (4)

Every entry point in the system is a separately compiled LangGraph artifact with its own typed state and AsyncSqliteSaver checkpointer:

1. pipeline_graph — the document processing pipeline

The 6-step end-to-end flow when the user uploads a package:

  1. Ingest — PDF (PyMuPDF + pdfplumber for table extraction), DOCX (native), images (vision-first via the LLM), with Tesseract OCR fallback for scanned PDFs (EN/HU/DE)
  2. Classify — 6-way doc-type classifier with structured output (invoice, delivery_note, purchase_order, contract, financial_report, other); ISA 500 evidence-quality score
  3. Extract — per doc-type Pydantic v2 schema with _quotes and _confidence fields; universal fallback schema for unknown types
  4. Compare — three-way matching subgraph (invoice + delivery note + PO), duplicate-invoice detection (ISA 240)
  5. Risk — basic plausibility + 14 domain checks (Send-API parallel fan-out) + LLM risk ensemble + 3-stage filter chain
  6. Report — DOCX export, JSON output, Streamlit UI rendering

State: PipelineState (Pydantic), with reducers for risk lists and per-document results.
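The six-step sequence can be sketched as plain functions threaded over a shared state — an illustrative stand-in for the LangGraph nodes and the Pydantic PipelineState, with hypothetical names and stub bodies:

```python
# Illustrative sketch: each step is a LangGraph node in the real graph;
# here they are stand-ins that just record execution order in the state.

def make_step(name):
    def step(state):
        state.setdefault("trace", []).append(name)  # record the node that ran
        return state
    return step

PIPELINE = [make_step(n) for n in
            ("ingest", "classify", "extract", "compare", "risk", "report")]

def run_pipeline(state):
    """Thread the state through all six steps in order."""
    for step in PIPELINE:
        state = step(state)
    return state
```
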

2. chat_graph — the agentic chat

5-tool ReAct agent with strict citation enforcement:

  • Tools: list_documents, get_extraction, search_documents (hybrid Chroma + BM25 with Reciprocal Rank Fusion), compare_documents, validate_document
  • Prompt: 17-rule system prompt enforcing [Source: filename.pdf] format
  • Validator node: post-processor that drops any answer without citations
  • Intent classifier: routes to direct-answer vs tool-use paths to keep latency low for casual queries

State: ChatState with message history, retrieved chunks, and citation list.
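The validator node's core idea fits in a few lines — a hedged sketch assuming a regex over the [Source: …] format (the real node operates on ChatState and its message list):

```python
import re

# Matches citations like [Source: invoice_01.pdf]; extensions mirror the
# file types the Upload tab accepts.
CITATION_RE = re.compile(r"\[Source:\s*[^\]]+\.(pdf|docx|png|jpg|txt)\]",
                         re.IGNORECASE)

def validator_node(answer):
    """Pass an answer through only if it carries at least one citation;
    return None to drop it (sketch of the post-processor idea)."""
    return answer if CITATION_RE.search(answer) else None
```
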

3. dd_graph — the multi-agent DD assistant

For M&A due-diligence packages:

  • Contract filter — selects only contract-type documents from the package
  • Per-contract summary — extracts each contract's key terms (parties, term, value, change-of-control, non-compete, auto-renewal)
  • 4 specialist agents (run in parallel via Send-API):
    • audit_specialist — material misstatement risk, ISA 240 fraud indicators
    • legal_specialist — change-of-control, non-compete, automatic-renewal red flags
    • compliance_specialist — GDPR Art. 28 sub-processor language, AML counterparty checks
    • financial_specialist — Ptk. 6:98 disproportionate penalty clauses, materiality thresholds
  • Supervisor — coordinates specialists, drops business-normal noise
  • Synthesizer — writes a 3-paragraph executive summary

State: DDState with contract list, per-contract summaries, specialist findings, executive summary.
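The parallel dispatch can be mimicked with a thread pool — an illustrative sketch only; in dd_graph each specialist is an LLM-backed node dispatched as a Send-API branch, and the lambdas below are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four specialist agents.
SPECIALISTS = {
    "audit":      lambda c: {"agent": "audit",      "contract": c},
    "legal":      lambda c: {"agent": "legal",      "contract": c},
    "compliance": lambda c: {"agent": "compliance", "contract": c},
    "financial":  lambda c: {"agent": "financial",  "contract": c},
}

def fan_out(contract_id):
    """Run all four specialists concurrently and collect their findings,
    mirroring the shape of the Send-API fan-out with a thread pool."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(fn, contract_id)
                   for name, fn in SPECIALISTS.items()}
        return {name: fut.result() for name, fut in futures.items()}
```
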

4. package_insights_graph — cross-document analysis

Package-level analyzers that don't fit into the per-document pipeline:

  • Pricing-drift detector — flags > 30% price changes for the same line item across invoices in a package (caught the 57.5% drift in our live demo)
  • Duplicate-invoice detector — exact + near-match (date within 13 days, amount within 1%)
  • Counterparty consistency — same supplier name spelled differently across documents

State: PackageState with per-document extractions and aggregated findings.
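A minimal sketch of the pricing-drift rule, assuming invoices arrive date-ordered as {line_item: unit_price} dicts (a simplification of the real extraction schema):

```python
def price_drift(invoices, threshold=0.30):
    """Flag line items whose unit price moves more than `threshold` between
    the first and last occurrence across the package."""
    by_item = {}
    for inv in invoices:                      # assumed ordered by date
        for item, price in inv.items():
            by_item.setdefault(item, []).append(price)
    findings = []
    for item, prices in by_item.items():
        drift = abs(prices[-1] - prices[0]) / prices[0]
        if drift > threshold:
            findings.append((item, round(drift, 3)))
    return findings
```

With prices 40.0 then 63.0 this reports a 0.575 drift — the 57.5% case from the live demo.
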


Subgraphs (6)

Reusable LangGraph subgraphs imported by the main graphs:

  • extract_subgraph — per-document extraction with quote validator
  • ingest_subgraph — PDF/DOCX/image loading with OCR fallback
  • llm_risk_subgraph — LLM risk generation with structured output
  • rag_index_subgraph — chunking, embedding, ChromaDB indexing
  • rag_query_subgraph — hybrid Chroma + BM25 retrieval with RRF
  • risk_subgraph — domain check fan-out + LLM risk + 3-stage filters

14 deterministic domain checks

The check registry (domain_checks/__init__.py) is the heart of PaperHawk's auditor-grade output. Every check is a Python Protocol implementation, not an LLM prompt — they cannot hallucinate, can be unit-tested, and produce defensible findings with explicit regulation sources.
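The registry pattern might look roughly like this — a sketch with hypothetical names (DomainCheck, REGISTRY, ContractDatesCheck), using the contract-dates check from the C-tier list as the example:

```python
from typing import Protocol

class DomainCheck(Protocol):
    """Shape every check satisfies; illustrative, not the actual module."""
    check_id: str
    def run(self, doc: dict) -> list: ...

REGISTRY = {}

def register(check):
    """Add a check instance to the registry, keyed by its id."""
    REGISTRY[check.check_id] = check
    return check

class ContractDatesCheck:
    check_id = "contract_dates"
    def run(self, doc):
        start, end = doc.get("start_date"), doc.get("end_date")
        if start and end and end < start:   # ISO date strings compare correctly
            return [{"check": self.check_id, "severity": "HIGH",
                     "message": "end date precedes start date"}]
        return []

register(ContractDatesCheck())
```

Because each check is deterministic Python, a unit test is one assertion per scenario — no LLM in the loop.
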

A-tier (essential)

  1. Mandatory invoice elements (HU VAT Act §169) — 18 required elements per invoice
  2. Tax-ID checksum (Art. 22 §) — mod-11 Hungarian tax-ID validation
  3. Contract completeness (Ptk. Book 6) — termination, governing law, penalty, confidentiality clauses
  4. Disproportionality (Ptk. 6:98) — penalty clause > 31.7% of contract value flagged HIGH
  5. Rounded amounts (ISA 240) — > 14.7% rounded amounts flagged suspicious, > 24.3% flagged HIGH
  6. Evidence hierarchy (ISA 500) — document-type reliability score (8/10 invoice, 7/10 contract)
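As an example of how mechanical these checks are, the rounded-amounts rule (check 5) reduces to simple arithmetic; treating "rounded" as an exact multiple of 1,000 is a simplifying assumption of this sketch:

```python
def rounded_ratio(amounts, unit=1000):
    """Share of amounts that are exact multiples of `unit` (assumed
    definition of 'rounded' for this sketch)."""
    if not amounts:
        return 0.0
    return sum(1 for a in amounts if a % unit == 0) / len(amounts)

def rounded_severity(ratio, warn=0.147, high=0.243):
    """Map the ratio onto the thresholds stated in the list above."""
    if ratio > high:
        return "HIGH"
    if ratio > warn:
        return "suspicious"
    return None
```
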

B-tier (supplementary)

  1. Materiality (ISA 320) — 1.93% of document value as info-level threshold
  2. GDPR Article 28 — 10 mandatory sub-processor language elements + PII detection
  3. DD red flags (M&A) — change-of-control, non-compete, automatic-renewal triggers

C-tier (informational)

  1. Incoterms 2020 — all 11 Incoterms rules detected via regex word boundaries
  2. IFRS/HAR anomaly — goodwill amortization flag, operational lease in IFRS context
  3. Duplicate invoice (ISA 240) — exact + near-match with 13-day date filter
  4. AML sanctions (Pmt.) — static EU/OFAC snapshot with fuzzy name match
  5. Contract dates — start-end consistency, expiry detection
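The Incoterms detection is essentially one regex; the word boundaries keep a term like EXW from matching inside a longer token such as EXWORKS. An illustrative sketch:

```python
import re

# The 11 Incoterms 2020 rules.
INCOTERMS_2020 = ["EXW", "FCA", "CPT", "CIP", "DAP", "DPU", "DDP",
                  "FAS", "FOB", "CFR", "CIF"]
PATTERN = re.compile(r"\b(" + "|".join(INCOTERMS_2020) + r")\b")

def detect_incoterms(text):
    """Return the distinct Incoterms mentioned in the text, sorted."""
    return sorted(set(PATTERN.findall(text.upper())))
```
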

Jurisdiction-aware: Hungarian-specific rules (HU VAT Act, Ptk., Art.) apply only to Hungarian documents. Universal rules (ISA, GDPR, Incoterms, AML) apply everywhere.


6-layer anti-hallucination stack

The system is designed so the LLM cannot lie about a document and have the lie pass through.

  1. temperature=0 — deterministic outputs on every run
  2. Source-quote requirement — every extraction must include a verbatim quote from the source PDF in _quotes
  3. Confidence scoring — high / medium / low per extracted field, surfaced to the user
  4. Plausibility validators — deterministic Python checks for math, dates, totals, item-level VAT, currency normalization
  5. 3-stage LLM-risk filter chain — drops business-normal noise, repeats of basic deterministic checks, and contradictions
  6. Quote validator — text-searches the source PDF for the claimed quote; downgrades confidence if not found verbatim, drops entirely if obviously fabricated
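Layer 6's core can be sketched as a whitespace-insensitive substring search; collapsing the "obviously fabricated" case to an empty quote is a simplification of the real logic:

```python
import re

def normalize(s):
    """Collapse whitespace and lowercase so PDF line breaks and spacing
    quirks do not break the match."""
    return re.sub(r"\s+", " ", s).strip().lower()

def validate_quote(quote, source_text, confidence):
    """Return the confidence to keep: unchanged if the quote is found
    verbatim, 'low' if not, None (drop) if the quote is empty."""
    if not quote.strip():
        return None
    if normalize(quote) in normalize(source_text):
        return confidence
    return "low"
```
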

In our live audit demo, layer 6 caught 4 of 6 hallucinated citations from Qwen 2.5 14B and downgraded them to low confidence.

The validation/ package is one of the most-edited folders in the repo precisely because we treat anti-hallucination as a first-class concern, not a guardrail layer slapped on top.


Provider abstraction

configurable_alternatives lets us swap LLM backends with a single env var:

  • LLM_PROFILE=vllm — vLLM REST endpoint (OpenAI-compatible); production on AMD MI300X
  • LLM_PROFILE=ollama — local Ollama at localhost:11434; dev on a consumer GPU
  • LLM_PROFILE=dummy — deterministic stub; CI tests, smoke tests, judge quick-demo

The application code never imports an LLM SDK directly — all calls go through providers/ factory functions with configurable_alternatives. Switching from Anthropic Claude (our original dev target) to Qwen on vLLM required zero application code changes — only env vars.
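Stripped of LangChain, the switch is just an env-var lookup over a provider table — a pure-Python sketch of the idea, not the actual providers/ code (the non-stub backends are elided here):

```python
import os

def _vllm(prompt):
    raise NotImplementedError("OpenAI-compatible REST call in production")

def _ollama(prompt):
    raise NotImplementedError("local Ollama call in dev")

def _dummy(prompt):
    return "DUMMY RESPONSE"        # deterministic stub for CI/smoke tests

PROVIDERS = {"vllm": _vllm, "ollama": _ollama, "dummy": _dummy}

def get_llm():
    """Pick the backend from LLM_PROFILE, defaulting to the stub."""
    return PROVIDERS[os.environ.get("LLM_PROFILE", "dummy")]
```
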


Embedding + retrieval

  • Model: BAAI/bge-m3 (1024-dim, multilingual EN/HU/DE/FR via sentence-transformers)
  • Storage: ChromaDB persistent (per-session) + BM25 in-memory keyword index
  • Hybrid retrieval: Reciprocal Rank Fusion of Chroma top-K and BM25 top-K
  • Chunking: Natural-boundary chunking (paragraph-aware, ~500 tokens with overlap)
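Reciprocal Rank Fusion itself is tiny: each ranker contributes 1/(k + rank) per document, and the fused order is by summed score. A sketch with the conventional k=60:

```python
def rrf(rankings, k=60):
    """Fuse ranked lists (e.g. Chroma top-K and BM25 top-K) via
    Reciprocal Rank Fusion; returns doc ids best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers outranks one ranked first by only a single retriever, which is the point of fusing.
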

The embedding model loads once at app startup (2.3 GB into RAM/VRAM). On first run it downloads from the Hugging Face Hub to `~/.cache/huggingface/`.


State persistence

  • Per-session: Streamlit session_state for UI state (uploaded files, current package)
  • Per-graph: AsyncSqliteSaver checkpointer at data/checkpoints.sqlite for LangGraph state
  • Vector store: ChromaDB at chroma_db/ (gitignored)

Restarting the app loads the last checkpoint, so chat history and extraction results survive a restart.


Streamlit UI (5 tabs)

  1. Upload — drag-and-drop (PDF, DOCX, PNG, JPG, TXT), 200 MB per file, plus 3 pre-bundled demo packages
  2. Results — classification confidence, extracted data, risks per document, package-level cross-doc analysis
  3. Chat — agentic chat with [Source: filename.pdf] citations
  4. DD Assistant — for M&A packages: per-contract summaries + 4 specialist findings + executive summary + downloadable DOCX
  5. Report — JSON output + DOCX export

The async runtime uses a long-lived background event loop (app/async_runtime.py) so the UI stays responsive during multi-minute pipeline runs.
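The pattern in app/async_runtime.py presumably boils down to a daemon-thread event loop plus run_coroutine_threadsafe — a minimal sketch of that idea, not the file's actual contents:

```python
import asyncio
import threading

# One loop in a daemon thread outlives every Streamlit rerun, so long
# pipeline coroutines keep running while the UI thread stays free.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def run_async(coro):
    """Submit a coroutine from sync UI code and block for its result."""
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()
```
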


Cross-references