Nándorfi Vince
# Architecture
LangGraph-native Document Intelligence platform. This document goes beyond
the README — it covers design decisions, the subgraph hierarchy, state
design, and the anti-hallucination stack.
## 1. High-level architecture
### 4 compiled LangGraph artifacts
The system is organized around four graphs sharing a common `AsyncSqliteSaver`
checkpointer:
| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload |
| 2 | `chat_graph` | `app.ask(question)` | chat tab |
| 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button |
| 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button |
Chat tools read from the persisted pipeline state — they do not re-read
files. They access the in-memory `ChatToolContext`, which holds the
HybridStore and a documents snapshot.
### Pipeline graph topology
```
START
→ start_timer
→ dispatch_ingest (Send API: per-doc fan-out)
→ ingest_per_doc (PDF/DOCX/PNG/TXT loader subgraph)
→ ingest_join (fan-in)
→ dispatch_classify (Send API)
→ classify_per_doc (regex/keyword classifier in dummy mode;
vision-aware in vLLM mode)
→ classify_join
→ dispatch_extract (Send API)
→ extract_per_doc (regex extractor in dummy mode +
flatten_universal; structured LLM in vLLM mode)
→ extract_join
  → quote_validator (anti-hallucination layer #6)
→ dispatch_rag_index (Send API)
→ rag_index_per_doc (chunker + batched embed + Chroma+BM25 upsert)
→ rag_join
→ compare_node (three-way matching, sync)
→ risk_subgraph (basic + 14 domain × Send + plausibility +
LLM ensemble + duplicate)
→ finish_timer
→ report_node (10-section JSON structure)
→ END
```
The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
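The per-doc fan-out can be sketched as a dispatch node that returns one `Send` per uploaded file. `Send` here is a local stand-in for `langgraph.types.Send` so the sketch is self-contained, and the payload keys are illustrative:

```python
from typing import NamedTuple


class Send(NamedTuple):
    """Stand-in for langgraph.types.Send: (target node, node-local state)."""
    node: str
    arg: dict


def dispatch_ingest(state: dict) -> list[Send]:
    # One Send per file; LangGraph runs the target node once per Send,
    # concurrently, and the reducer on `documents` merges results at fan-in.
    return [
        Send("ingest_per_doc", {"file_name": name, "raw": data})
        for name, data in state["files"]
    ]
```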
### Risk subgraph topology
```
risk_subgraph (input: PipelineState):
→ basic_risk_dispatch (Send: per-doc basic risk)
→ basic_risk / noop_basic
→ domain_dispatch_node (Send: per-doc × per-applicable-check, ~30 parallel)
→ apply_domain_check
→ [if llm provided] llm_risk_dispatch (Send: per-doc LLM risk + 3-filter chain)
→ llm_risk_per_doc / noop_llm
→ plausibility_dispatch (Send: per-doc plausibility)
→ plausibility / noop_plaus
→ evidence_score_node (per-doc info)
→ duplicate_detector_node (package-level, sync, ISA 240)
END
```
The 3-filter chain (anti-hallucination layer 5) runs inside `llm_risk_per_doc`:
`llm_risk → filter_llm_risks → drop_business_normal → drop_repeats`.
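Two of the three filters can be sketched over plain risk dicts, using the thresholds from §3. The domain-term list and the digit-based proxy for "concrete fact" are assumptions for illustration:

```python
# Assumed, abbreviated domain vocabulary; the real list is larger.
DOMAIN_TERMS = {"vat", "invoice", "contract", "liability", "payment", "gdpr"}


def filter_llm_risks(risks: list[dict]) -> list[dict]:
    # Formal gate: >= 5 words, >= 2 domain terms, >= 1 concrete fact
    # (approximated here as "contains a digit").
    def ok(r: dict) -> bool:
        words = r["description"].lower().split()
        return (
            len(words) >= 5
            and sum(w.strip(".,") in DOMAIN_TERMS for w in words) >= 2
            and any(ch.isdigit() for ch in r["description"])
        )
    return [r for r in risks if ok(r)]


def drop_repeats_of_basic(llm_risks, basic_risks, threshold=0.7):
    # Textual dedup: drop an LLM risk whose wording overlaps a basic risk.
    def overlap(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, min(len(wa), len(wb)))
    return [
        r for r in llm_risks
        if all(overlap(r["description"], b["description"]) < threshold
               for b in basic_risks)
    ]
```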
### DD multi-agent supervisor graph
```
dd_graph:
START
→ contract_filter_node (keep only contract-type docs)
→ per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
→ supervisor_node (LLM router or heuristic; Command(goto=...))
├─ → audit_specialist (pricing anomalies, overcharging)
├─ → legal_specialist (red flags, change-of-control, non-compete)
├─ → compliance_specialist (GDPR, AML, data protection)
└─ → financial_specialist (monthly obligations, expirations)
↺ (loops back to supervisor up to dd_supervisor_max_iterations)
→ dd_synthesizer (one LLM call: executive_summary +
top_red_flags + per-contract risk_level rating)
END
```
### Package insights graph
A simple 1-LLM-call graph: ingests the full document package and produces
cross-doc findings using a perspective-driven prompt
(`audit | dd | compliance | general`).
## 2. State design
### `PipelineState` (TypedDict)
Read-mostly fields with **reducer-driven Send fan-in**:
- `files: list[tuple[str, bytes]]` — raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]`
per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` — dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` — Streamlit progress feed
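A sketch of what the `merge_doc_results` reducer could look like, using plain dicts in place of `ProcessedDocument`. The merge semantics are inferred from the description above (field-level, keyed by `file_name`, later non-`None` values win):

```python
def merge_doc_results(existing: list[dict], incoming: list[dict]) -> list[dict]:
    # Each parallel Send branch returns a partial document; fan-in merges
    # them per file_name without clobbering fields set by other branches.
    by_name = {d["file_name"]: dict(d) for d in existing}
    for doc in incoming:
        slot = by_name.setdefault(doc["file_name"], {})
        for key, value in doc.items():
            if value is not None:
                slot[key] = value
    return list(by_name.values())
```

LangGraph calls a reducer like this whenever two branches write to the same state key, which is what makes the per-doc Send fan-out safe.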
### `Risk` (Pydantic)
The single risk type used everywhere:
- `description: str`
- `severity: str` (`"high" | "medium" | "low" | "info"`)
- `rationale: str`
- `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`)
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
- `affected_document: str | None`
- `source_check_id: str | None`
## 3. Anti-hallucination stack (5+1 layers)
1. **`temperature=0`** — every LLM call is deterministic-ish.
2. **`_quotes` schema field** — verbatim source citations.
3. **`_confidence` schema field** — per-field reliability (high|medium|low).
4. **`validate_plausibility()`** — Python deterministic plausibility checks
(negative VAT, non-standard rates, future dates, etc.).
5. **3-filter LLM risk pipeline** —
`filter_llm_risks` (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact)
→ `drop_business_normal_risks` (semantic: cross-check vs extracted_data,
6 known false-positive patterns)
→ `drop_repeats_of_basic` (textual dedup vs basic risks, 70% threshold).
6. **Quote validator** — final cross-check that every `_quotes` entry
actually appears in the source `full_text` (whitespace + diacritic +
case normalized). If invalid, downgrades confidence.
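The normalization in layer 6 can be sketched with stdlib tools; the function names are illustrative:

```python
import re
import unicodedata


def _normalize(text: str) -> str:
    # Strip diacritics (á→a, ő→o), case-fold, collapse whitespace runs.
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )
    return re.sub(r"\s+", " ", stripped.casefold()).strip()


def quote_appears(quote: str, full_text: str) -> bool:
    # A _quotes entry must appear verbatim (modulo normalization) in the
    # source document; otherwise the field's confidence is downgraded.
    return _normalize(quote) in _normalize(full_text)
```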
## 4. Domain checks (14 deterministic rules)
| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|-----------|--------------|------------|
| 01 | `check_01_invoice_mandatory` | HU VAT Act §169 | yes | invoice |
| 02 | `check_02_tax_cdv` | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract |
| 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract |
| 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) |
| 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report |
| 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract |
| 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract |
| 10 | `check_10_incoterms` | Incoterms 2020 | no | contract |
| 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report |
| 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | `check_14_contract_dates` | Contract date best practice | no | contract |
The dispatch in `domain_dispatch_node` skips `check_06` and `check_12` (they
have separate entry points) and filters `is_hu_specific=True` out for non-HU
documents.
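That filtering can be sketched against a cut-down check registry. Only four of the 14 rules are shown, and the registry shape is an assumption:

```python
CHECKS = [
    {"check_id": "check_01_invoice_mandatory",    "is_hu_specific": True,  "applies_to": {"invoice"}},
    {"check_id": "check_03_contract_completeness", "is_hu_specific": False, "applies_to": {"contract"}},
    {"check_id": "check_06_evidence_score",        "is_hu_specific": False, "applies_to": {"invoice"}},
    {"check_id": "check_12_duplicate_invoice",     "is_hu_specific": False, "applies_to": {"invoice"}},
]

# Checks 06 and 12 have their own graph nodes and are never Send-dispatched.
SEPARATE_ENTRY = {"check_06_evidence_score", "check_12_duplicate_invoice"}


def applicable_checks(doc_type: str, is_hu_doc: bool) -> list[str]:
    # Mirrors domain_dispatch_node: skip separate-entry checks, drop
    # HU-specific rules for non-HU documents, match the document type.
    return [
        c["check_id"] for c in CHECKS
        if c["check_id"] not in SEPARATE_ENTRY
        and doc_type in c["applies_to"]
        and (is_hu_doc or not c["is_hu_specific"])
    ]
```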
## 5. Provider system
Three providers via `configurable_alternatives`:
- **`vllm`**: `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the
  AMD MI300X vLLM endpoint. Production default.
- **`ollama`**: `ChatOllama` against a local Ollama daemon (Qwen 2.5 7B
  Instruct). Development fallback.
- **`dummy`**: `DummyChatModel`, a deterministic stub with no network access.
  Used for CI, eval, and load testing.
Provider selection is **runtime-switchable** without restart:
```python
graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})
```
## 6. Embedding
`BAAI/bge-m3` (2.27 GB, 1024 dim, multilingual) by default.
Sentence-transformers loads it on first call via `@lru_cache`.
Pre-downloaded at Docker build time so runtime has no network call.
## 7. Hybrid retrieval (Chroma + BM25)
`store/hybrid_store.py` runs vector search and BM25 in parallel and merges
with Reciprocal Rank Fusion (RRF). The chunker uses natural break points
(paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char
overlap.
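RRF itself is a few lines: each document scores `1 / (k + rank)` for every ranking it appears in, summed across rankings, with the conventional `k = 60`:

```python
def rrf_merge(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: documents ranked highly by either retriever
    # float to the top; appearing in both rankings compounds the score.
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.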
## 8. Async-first runtime
LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer
on a long-lived background event loop (`app/async_runtime.py`'s `AsyncRuntime`
singleton). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP
session, and the `AsyncSqliteSaver` SQLite pool persistent across user
interactions — they do not rebuild per request.
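The background-loop pattern can be sketched as follows; the class and method names mirror the description above, but the details are assumptions:

```python
import asyncio
import threading


class AsyncRuntime:
    """Long-lived background event loop that outlives Streamlit reruns."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._start()
        return cls._instance

    def _start(self) -> None:
        self.loop = asyncio.new_event_loop()
        # A daemon thread keeps the loop (and any pooled connections bound
        # to it) alive across requests instead of rebuilding them each time.
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def run(self, coro):
        # Bridge sync → async: submit from Streamlit's synchronous callback
        # and block the caller until the coroutine finishes.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```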
## 9. Multilingual support
The codebase is English-first but multilingual-tolerant:
- The classifier matches HU/EN/DE keyword patterns.
- Risk filters tolerate HU/DE business terms.
- The OCR layer keeps `eng + hun + deu` as Tesseract languages.
- Demo data may include mixed-language documents.
The output (UI, exec summary, DOCX report) is **always English**.