Nándorfi Vince
# Architecture
LangGraph-native Document Intelligence platform. This document goes beyond
the README — it covers design decisions, the subgraph hierarchy, state
design, and the anti-hallucination stack.
## 1. High-level architecture
### 4 compiled LangGraph artifacts
The system is organized around four graphs sharing a common `AsyncSqliteSaver`
checkpointer:
| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload |
| 2 | `chat_graph` | `app.ask(question)` | chat tab |
| 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button |
| 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button |
Chat tools read from the persisted pipeline state — they do not re-read
files. They access the in-memory `ChatToolContext`, which holds the
HybridStore and a documents snapshot.
### Pipeline graph topology
```
START
→ start_timer
→ dispatch_ingest (Send API: per-doc fan-out)
→ ingest_per_doc (PDF/DOCX/PNG/TXT loader subgraph)
→ ingest_join (fan-in)
→ dispatch_classify (Send API)
→ classify_per_doc (regex/keyword classifier in dummy mode;
vision-aware in vLLM mode)
→ classify_join
→ dispatch_extract (Send API)
→ extract_per_doc (regex extractor in dummy mode +
flatten_universal; structured LLM in vLLM mode)
→ extract_join
  → quote_validator (anti-hallucination layer #6)
→ dispatch_rag_index (Send API)
→ rag_index_per_doc (chunker + batched embed + Chroma+BM25 upsert)
→ rag_join
→ compare_node (three-way matching, sync)
→ risk_subgraph (basic + 14 domain × Send + plausibility +
LLM ensemble + duplicate)
→ finish_timer
→ report_node (10-section JSON structure)
→ END
```
The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
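The per-doc fan-out can be sketched as a dispatch node that returns one `Send` per uploaded file. `Send` here is a local stand-in for `langgraph.types.Send` so the sketch is self-contained, and the payload keys are illustrative:

```python
from typing import NamedTuple


class Send(NamedTuple):
    """Stand-in for langgraph.types.Send: (target node, node-local state)."""
    node: str
    arg: dict


def dispatch_ingest(state: dict) -> list[Send]:
    # One Send per file; LangGraph runs the target node once per Send,
    # concurrently, and the reducer on `documents` merges results at fan-in.
    return [
        Send("ingest_per_doc", {"file_name": name, "raw": data})
        for name, data in state["files"]
    ]
```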
### Risk subgraph topology
```
risk_subgraph (input: PipelineState):
→ basic_risk_dispatch (Send: per-doc basic risk)
→ basic_risk / noop_basic
→ domain_dispatch_node (Send: per-doc × per-applicable-check, ~30 parallel)
→ apply_domain_check
→ [if llm provided] llm_risk_dispatch (Send: per-doc LLM risk + 3-filter chain)
→ llm_risk_per_doc / noop_llm
→ plausibility_dispatch (Send: per-doc plausibility)
→ plausibility / noop_plaus
→ evidence_score_node (per-doc info)
→ duplicate_detector_node (package-level, sync, ISA 240)
END
```
The 3-filter chain (anti-hallucination layer 5) runs inside `llm_risk_per_doc`:
`llm_risk → filter_llm_risks → drop_business_normal → drop_repeats`.
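Two of the three filters can be sketched over plain risk dicts, using the thresholds from §3. The domain-term list and the digit-based proxy for "concrete fact" are assumptions for illustration:

```python
# Assumed, abbreviated domain vocabulary; the real list is larger.
DOMAIN_TERMS = {"vat", "invoice", "contract", "liability", "payment", "gdpr"}


def filter_llm_risks(risks: list[dict]) -> list[dict]:
    # Formal gate: >= 5 words, >= 2 domain terms, >= 1 concrete fact
    # (approximated here as "contains a digit").
    def ok(r: dict) -> bool:
        words = r["description"].lower().split()
        return (
            len(words) >= 5
            and sum(w.strip(".,") in DOMAIN_TERMS for w in words) >= 2
            and any(ch.isdigit() for ch in r["description"])
        )
    return [r for r in risks if ok(r)]


def drop_repeats_of_basic(llm_risks, basic_risks, threshold=0.7):
    # Textual dedup: drop an LLM risk whose wording overlaps a basic risk.
    def overlap(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, min(len(wa), len(wb)))
    return [
        r for r in llm_risks
        if all(overlap(r["description"], b["description"]) < threshold
               for b in basic_risks)
    ]
```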
### DD multi-agent supervisor graph
```
dd_graph:
START
→ contract_filter_node (keep only contract-type docs)
→ per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
→ supervisor_node (LLM router or heuristic; Command(goto=...))
├─ → audit_specialist (pricing anomalies, overcharging)
├─ → legal_specialist (red flags, change-of-control, non-compete)
├─ → compliance_specialist (GDPR, AML, data protection)
└─ → financial_specialist (monthly obligations, expirations)
↺ (loops back to supervisor up to dd_supervisor_max_iterations)
→ dd_synthesizer (one LLM call: executive_summary +
top_red_flags + per-contract risk_level rating)
END
```
### Package insights graph
A simple 1-LLM-call graph: ingests the full document package and produces
cross-doc findings using a perspective-driven prompt
(`audit | dd | compliance | general`).
## 2. State design
### `PipelineState` (TypedDict)
Read-mostly fields with **reducer-driven Send fan-in**:
- `files: list[tuple[str, bytes]]` — raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]`
per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` — dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` — Streamlit progress feed
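A sketch of what the `merge_doc_results` reducer could look like, using plain dicts in place of `ProcessedDocument`. The merge semantics are inferred from the description above (field-level, keyed by `file_name`, later non-`None` values win):

```python
def merge_doc_results(existing: list[dict], incoming: list[dict]) -> list[dict]:
    # Each parallel Send branch returns a partial document; fan-in merges
    # them per file_name without clobbering fields set by other branches.
    by_name = {d["file_name"]: dict(d) for d in existing}
    for doc in incoming:
        slot = by_name.setdefault(doc["file_name"], {})
        for key, value in doc.items():
            if value is not None:
                slot[key] = value
    return list(by_name.values())
```

LangGraph calls a reducer like this whenever two branches write to the same state key, which is what makes the per-doc Send fan-out safe.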
### `Risk` (Pydantic)
The single risk type used everywhere:
- `description: str`
- `severity: str` (`"high" | "medium" | "low" | "info"`)
- `rationale: str`
- `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`)
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
- `affected_document: str | None`
- `source_check_id: str | None`
## 3. Anti-hallucination stack (5+1 layers)
1. **`temperature=0`** — every LLM call is deterministic-ish.
2. **`_quotes` schema field** — verbatim source citations.
3. **`_confidence` schema field** — per-field reliability (high|medium|low).
4. **`validate_plausibility()`** — Python deterministic plausibility checks
(negative VAT, non-standard rates, future dates, etc.).
5. **3-filter LLM risk pipeline** —
`filter_llm_risks` (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact)
→ `drop_business_normal_risks` (semantic: cross-check vs extracted_data,
6 known false-positive patterns)
→ `drop_repeats_of_basic` (textual dedup vs basic risks, 70% threshold).
6. **Quote validator** — final cross-check that every `_quotes` entry
actually appears in the source `full_text` (whitespace + diacritic +
case normalized). If invalid, downgrades confidence.
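The normalization in layer 6 can be sketched with stdlib tools; the function names are illustrative:

```python
import re
import unicodedata


def _normalize(text: str) -> str:
    # Strip diacritics (á→a, ő→o), case-fold, collapse whitespace runs.
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )
    return re.sub(r"\s+", " ", stripped.casefold()).strip()


def quote_appears(quote: str, full_text: str) -> bool:
    # A _quotes entry must appear verbatim (modulo normalization) in the
    # source document; otherwise the field's confidence is downgraded.
    return _normalize(quote) in _normalize(full_text)
```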
## 4. Domain checks (14 deterministic rules)
| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|-----------|--------------|------------|
| 01 | `check_01_invoice_mandatory` | HU VAT Act §169 | yes | invoice |
| 02 | `check_02_tax_cdv` | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract |
| 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract |
| 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) |
| 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report |
| 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract |
| 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract |
| 10 | `check_10_incoterms` | Incoterms 2020 | no | contract |
| 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report |
| 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | `check_14_contract_dates` | Contract date best practice | no | contract |
The dispatch in `domain_dispatch_node` skips `check_06` and `check_12` (they
have separate entry points) and filters `is_hu_specific=True` out for non-HU
documents.
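That filtering can be sketched against a cut-down check registry. Only four of the 14 rules are shown, and the registry shape is an assumption:

```python
CHECKS = [
    {"check_id": "check_01_invoice_mandatory",    "is_hu_specific": True,  "applies_to": {"invoice"}},
    {"check_id": "check_03_contract_completeness", "is_hu_specific": False, "applies_to": {"contract"}},
    {"check_id": "check_06_evidence_score",        "is_hu_specific": False, "applies_to": {"invoice"}},
    {"check_id": "check_12_duplicate_invoice",     "is_hu_specific": False, "applies_to": {"invoice"}},
]

# Checks 06 and 12 have their own graph nodes and are never Send-dispatched.
SEPARATE_ENTRY = {"check_06_evidence_score", "check_12_duplicate_invoice"}


def applicable_checks(doc_type: str, is_hu_doc: bool) -> list[str]:
    # Mirrors domain_dispatch_node: skip separate-entry checks, drop
    # HU-specific rules for non-HU documents, match the document type.
    return [
        c["check_id"] for c in CHECKS
        if c["check_id"] not in SEPARATE_ENTRY
        and doc_type in c["applies_to"]
        and (is_hu_doc or not c["is_hu_specific"])
    ]
```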
## 5. Provider system
Three providers via `configurable_alternatives`:
- **`vllm`**: `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the
  AMD MI300X vLLM endpoint. Production default.
- **`ollama`**: `ChatOllama` against a local Ollama daemon (Qwen 2.5 7B
  Instruct). Development fallback.
- **`dummy`**: `DummyChatModel`, a deterministic stub with no network access.
  Used for CI, eval, and load testing.
Provider selection is **runtime-switchable** without restart:
```python
graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})
```
## 6. Embedding
`BAAI/bge-m3` (2.27 GB, 1024 dim, multilingual) by default.
Sentence-transformers loads it on first call via `@lru_cache`.
Pre-downloaded at Docker build time so runtime has no network call.
## 7. Hybrid retrieval (Chroma + BM25)
`store/hybrid_store.py` runs vector search and BM25 in parallel and merges
with Reciprocal Rank Fusion (RRF). The chunker uses natural break points
(paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char
overlap.
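RRF itself is a few lines: each document scores `1 / (k + rank)` for every ranking it appears in, summed across rankings, with the conventional `k = 60`:

```python
def rrf_merge(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: documents ranked highly by either retriever
    # float to the top; appearing in both rankings compounds the score.
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.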
## 8. Async-first runtime
LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer
on a long-lived background event loop (`app/async_runtime.py`'s `AsyncRuntime`
singleton). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP
session, and the `AsyncSqliteSaver` SQLite pool persistent across user
interactions — they do not rebuild per request.
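The background-loop pattern can be sketched as follows; the class and method names mirror the description above, but the details are assumptions:

```python
import asyncio
import threading


class AsyncRuntime:
    """Long-lived background event loop that outlives Streamlit reruns."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._start()
        return cls._instance

    def _start(self) -> None:
        self.loop = asyncio.new_event_loop()
        # A daemon thread keeps the loop (and any pooled connections bound
        # to it) alive across requests instead of rebuilding them each time.
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def run(self, coro):
        # Bridge sync → async: submit from Streamlit's synchronous callback
        # and block the caller until the coroutine finishes.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```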
## 9. Multilingual support
The codebase is English-first but multilingual-tolerant:
- The classifier matches HU/EN/DE keyword patterns.
- Risk filters tolerate HU/DE business terms.
- The OCR layer keeps `eng + hun + deu` as Tesseract languages.
- Demo data may include mixed-language documents.
The output (UI, exec summary, DOCX report) is **always English**.