# PaperHawk Architecture

How PaperHawk is built and why each piece is where it is. This document explains the multi-graph LangGraph orchestration, the 14 deterministic domain checks, the 6-layer anti-hallucination stack, and the multi-agent DD assistant.

---

## High-level architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                          USER (Streamlit 5-tab UI)                       │
│   Upload  │  Results  │  Chat  │  DD Assistant  │  Report                │
└────────────────────────────────┬─────────────────────────────────────────┘
                                 │
            ┌────────────────────┼────────────────────────┐
            │                    │                        │
            ▼                    ▼                        ▼
   ┌──────────────────┐ ┌──────────────────┐  ┌─────────────────────────┐
   │  pipeline_graph  │ │   chat_graph     │  │    dd_graph             │
   │                  │ │                  │  │                         │
   │ Ingest →         │ │ Intent classify  │  │ Contract filter →       │
   │ Classify →       │ │ → Plan →         │  │ Per-contract summary →  │
   │ Extract →        │ │ Agent (5 tools)  │  │ Multi-agent specialists │
   │ Compare →        │ │ → Synthesizer →  │  │ (audit/legal/compliance │
   │ Risk →           │ │ Validator        │  │  /financial) →          │
   │ Report           │ │ ([Source: …])    │  │ Supervisor → Synthesizer│
   └──────────────────┘ └──────────────────┘  └─────────────────────────┘
            │                                        │
            └─────────────┬──────────────────────────┘
                          ▼
                ┌──────────────────────────┐
                │  package_insights_graph  │
                │                          │
                │  Cross-document analysis │
                │  (price-drift, dupes,    │
                │   three-way matching)    │
                └──────────────────────────┘
                          │
                          ▼
                ┌──────────────────────────┐
                │    Provider abstraction  │
                │ (configurable_alternatives)
                │                          │
                │ vLLM ←→ Ollama ←→ Dummy  │
                └──────────────────────────┘
                          │
                          ▼
                ┌──────────────────────────┐
                │  AMD MI300X (vLLM)       │
                │  Qwen 2.5 14B Instruct   │
                │  192 GB HBM3, ROCm 7.0   │
                └──────────────────────────┘
```

---

## Compiled graphs (4)

Every entry-point in the system is a separately compiled LangGraph artifact with its own typed state and `AsyncSqliteSaver` checkpointer:

### 1. `pipeline_graph`: the document processing pipeline

The 6-step end-to-end flow when the user uploads a package:

1. **Ingest**: PDF (PyMuPDF + pdfplumber for table extraction), DOCX (native), images (vision-first via the LLM), with Tesseract OCR fallback for scanned PDFs (EN/HU/DE)
2. **Classify**: 6-way doc-type classifier with structured output (`invoice`, `delivery_note`, `purchase_order`, `contract`, `financial_report`, `other`); ISA 500 evidence-quality score
3. **Extract**: per doc-type Pydantic v2 schema with `_quotes` and `_confidence` fields; universal fallback schema for unknown types
4. **Compare**: three-way matching subgraph (invoice + delivery note + PO), duplicate-invoice detection (ISA 240)
5. **Risk**: basic plausibility + 14 domain checks (Send-API parallel fan-out) + LLM risk ensemble + 3-stage filter chain
6. **Report**: DOCX export, JSON output, Streamlit UI rendering

State: `PipelineState` (Pydantic), with reducers for risk lists and per-document results.
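
The reducer idea can be sketched with the standard library alone. This is a minimal illustration, not the real `PipelineState`: the field names, the `TypedDict` (instead of Pydantic), and the `apply_update` helper are all assumptions. It follows the LangGraph convention of attaching a reducer to a field with `Annotated[...]`, so parallel nodes append to a list instead of overwriting it.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class PipelineStateSketch(TypedDict):
    # Hypothetical fields; the real PipelineState is a Pydantic model.
    risks: Annotated[list, operator.add]    # reducer: concatenate lists
    results: Annotated[dict, operator.or_]  # reducer: merge dicts by key
    package_id: str                         # no reducer: last write wins


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into state using each field's reducer."""
    hints = get_type_hints(PipelineStateSketch, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"risks": ["missing tax ID"], "results": {"inv.pdf": "ok"}, "package_id": "p1"}
state = apply_update(state, {"risks": ["rounded total"], "results": {"po.pdf": "ok"}})
```

After the update, `risks` holds both entries and `results` holds both documents, while `package_id` is untouched.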

### 2. `chat_graph`: the agentic chat

5-tool ReAct agent with strict citation enforcement:

- **Tools**: `list_documents`, `get_extraction`, `search_documents` (hybrid Chroma + BM25 with Reciprocal Rank Fusion), `compare_documents`, `validate_document`
- **Prompt**: 17-rule system prompt enforcing `[Source: filename.pdf]` format
- **Validator node**: post-processor that drops any answer without citations
- **Intent classifier**: routes to direct-answer vs tool-use paths to keep latency low for casual queries

State: `ChatState` with message history, retrieved chunks, and citation list.
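
A validator of this kind could look roughly like the sketch below. The regex, the function names, and the extra rule that cited files must exist in the package are assumptions for illustration; the only behaviour taken from the description above is the `[Source: filename.pdf]` format and dropping answers that carry no citation.

```python
import re
from typing import List, Optional, Set

# Matches the citation format named above, e.g. [Source: invoice_01.pdf].
CITATION_RE = re.compile(r"\[Source:\s*([^\]\s][^\]]*?)\s*\]")


def extract_citations(answer: str) -> List[str]:
    """Return every filename cited in the answer, in order of appearance."""
    return [match.group(1) for match in CITATION_RE.finditer(answer)]


def validate_answer(answer: str, known_files: Set[str]) -> Optional[str]:
    """Return the answer if it cites only known files, otherwise None."""
    cited = extract_citations(answer)
    if not cited:
        return None  # no citation at all: drop the answer
    if any(name not in known_files for name in cited):
        return None  # assumption: a citation to an unknown file is also dropped
    return answer
```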

### 3. `dd_graph`: the multi-agent DD assistant

For M&A due-diligence packages:

- **Contract filter**: selects only contract-type documents from the package
- **Per-contract summary**: extracts each contract's key terms (parties, term, value, change-of-control, non-compete, auto-renewal)
- **4 specialist agents** (run in parallel via Send-API):
  - `audit_specialist`: material misstatement risk, ISA 240 fraud indicators
  - `legal_specialist`: change-of-control, non-compete, automatic-renewal red flags
  - `compliance_specialist`: GDPR Art. 28 sub-processor language, AML counterparty checks
  - `financial_specialist`: Ptk. 6:98 disproportionate penalty clauses, materiality thresholds
- **Supervisor**: coordinates specialists, drops business-normal noise
- **Synthesizer**: writes 3-paragraph executive summary

State: `DDState` with contract list, per-contract summaries, specialist findings, executive summary.
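
The parallel fan-out to the four specialists can be illustrated with plain `asyncio`. This is a stand-in that uses `asyncio.gather`, not LangGraph's Send API, and `run_specialist` is a hypothetical placeholder for an LLM call; it only shows the shape of the step: one task per specialist, all receiving the same contract summaries.

```python
import asyncio
from typing import Dict, List

SPECIALISTS = ("audit", "legal", "compliance", "financial")


async def run_specialist(name: str, summaries: List[str]) -> Dict[str, object]:
    # Placeholder for an LLM call; here each specialist only reports a count.
    await asyncio.sleep(0)
    return {"specialist": name, "contracts_reviewed": len(summaries)}


async def fan_out(summaries: List[str]) -> List[Dict[str, object]]:
    """Run all specialists concurrently and collect their findings."""
    tasks = [run_specialist(name, summaries) for name in SPECIALISTS]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


findings = asyncio.run(fan_out(["contract A summary", "contract B summary"]))
```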

### 4. `package_insights_graph`: cross-document analysis

Package-level analyzers that don't fit into the per-document pipeline:

- **Pricing-drift detector**: flags > 30% price changes for the same line item across invoices in a package (caught the 57.5% drift in our live demo)
- **Duplicate-invoice detector**: exact + near-match (date within 13 days, amount within 1%)
- **Counterparty consistency**: same supplier name spelled differently across documents

State: `PackageState` with per-document extractions and aggregated findings.
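
As a rough sketch of the pricing-drift rule: group unit prices by line item and flag items whose highest price exceeds the lowest by more than 30%. Measuring drift relative to the lowest observed price, the tuple layout, and the function name are assumptions; the prices in the usage lines are made up.

```python
from collections import defaultdict

DRIFT_THRESHOLD = 0.30  # the "> 30%" rule described above


def find_price_drift(lines, threshold=DRIFT_THRESHOLD):
    """lines: iterable of (invoice_id, item_key, unit_price) tuples."""
    prices = defaultdict(list)
    for invoice_id, item_key, unit_price in lines:
        prices[item_key].append((invoice_id, unit_price))

    findings = []
    for item_key, seen in prices.items():
        low = min(seen, key=lambda pair: pair[1])
        high = max(seen, key=lambda pair: pair[1])
        if low[1] <= 0:
            continue  # skip free or malformed lines to avoid dividing by zero
        drift = (high[1] - low[1]) / low[1]
        if drift > threshold:
            findings.append({"item": item_key, "drift": round(drift, 4),
                             "low": low[0], "high": high[0]})
    return findings


demo = [("INV-1", "widget", 100.0), ("INV-2", "widget", 157.5),
        ("INV-1", "bolt", 10.0), ("INV-2", "bolt", 12.0)]
flagged = find_price_drift(demo)  # only "widget" exceeds the threshold
```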

---

## Subgraphs (6)

Reusable LangGraph subgraphs imported by the main graphs:

| Subgraph | Purpose |
|---|---|
| `extract_subgraph` | Per-document extraction with quote validator |
| `ingest_subgraph` | PDF/DOCX/image loading with OCR fallback |
| `llm_risk_subgraph` | LLM risk generation with structured output |
| `rag_index_subgraph` | Chunking, embedding, ChromaDB indexing |
| `rag_query_subgraph` | Hybrid Chroma + BM25 retrieval with RRF |
| `risk_subgraph` | Domain check fan-out + LLM risk + 3-stage filters |

---

## 14 deterministic domain checks

The check registry (`domain_checks/__init__.py`) is the heart of PaperHawk's auditor-grade output. Every check is a Python `Protocol` implementation, not an LLM prompt, so the checks cannot hallucinate, can be unit-tested, and produce defensible findings with explicit regulation sources.
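
A minimal sketch of what a `Protocol`-based check and registry might look like. The `Finding` and `DomainCheck` shapes, the `run` method, and the document dictionary keys are assumptions, not the actual interface in `domain_checks/`; the example check uses the Ptk. 6:98 threshold listed under the A-tier checks.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class Finding:
    check_id: str
    severity: str
    message: str
    source: str  # the regulation the finding is based on


@runtime_checkable
class DomainCheck(Protocol):
    check_id: str

    def run(self, doc: dict) -> Optional[Finding]: ...


class DisproportionalityCheck:
    check_id = "disproportionality"
    threshold = 0.317  # penalty above 31.7% of contract value

    def run(self, doc: dict) -> Optional[Finding]:
        value, penalty = doc.get("contract_value"), doc.get("penalty")
        if not value or penalty is None:
            return None  # nothing to evaluate
        ratio = penalty / value
        if ratio > self.threshold:
            return Finding(self.check_id, "HIGH",
                           f"Penalty is {ratio:.1%} of contract value", "Ptk. 6:98")
        return None


REGISTRY: List[DomainCheck] = [DisproportionalityCheck()]


def run_checks(doc: dict) -> List[Finding]:
    """Run every registered check and keep only the ones that fired."""
    results = (check.run(doc) for check in REGISTRY)
    return [finding for finding in results if finding is not None]
```

Because each check is a plain class with no LLM dependency, a unit test only needs a dictionary as input.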

### A-tier (essential)

1. **Mandatory invoice elements** (HU VAT Act §169): 18 required elements per invoice
2. **Tax-ID checksum** (Art. 22 §): mod-11 Hungarian tax-ID validation
3. **Contract completeness** (Ptk. Book 6): termination, governing law, penalty, confidentiality clauses
4. **Disproportionality** (Ptk. 6:98): penalty clause > 31.7% of contract value flagged HIGH
5. **Rounded amounts** (ISA 240): > 14.7% rounded amounts flagged suspicious, > 24.3% flagged HIGH
6. **Evidence hierarchy** (ISA 500): document-type reliability score (8/10 invoice, 7/10 contract)
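
Check 5 can be sketched as a simple ratio test. Two things are assumed here because they are not stated above: that "rounded" means an exact multiple of 1,000, and that the percentages refer to the share of rounded amounts among all amounts in a document. The function names are hypothetical.

```python
SUSPICIOUS_SHARE = 0.147  # above 14.7%: suspicious
HIGH_SHARE = 0.243        # above 24.3%: HIGH


def rounded_share(amounts, unit=1000):
    """Fraction of amounts that are exact multiples of `unit`."""
    amounts = list(amounts)
    if not amounts:
        return 0.0
    rounded = sum(1 for amount in amounts if amount % unit == 0)
    return rounded / len(amounts)


def classify_rounded(amounts, unit=1000):
    """Return 'HIGH', 'SUSPICIOUS', or None for a list of amounts."""
    share = rounded_share(amounts, unit)
    if share > HIGH_SHARE:
        return "HIGH"
    if share > SUSPICIOUS_SHARE:
        return "SUSPICIOUS"
    return None
```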

### B-tier (supplementary)

7. **Materiality** (ISA 320): 1.93% of document value as info-level threshold
8. **GDPR Article 28**: 10 mandatory sub-processor language elements + PII detection
9. **DD red flags** (M&A): change-of-control, non-compete, automatic-renewal triggers

### C-tier (informational)

10. **Incoterms 2020**: 11 incoterm rules detected via regex word-boundaries
11. **IFRS/HAR anomaly**: goodwill amortization flag, operational lease in IFRS context
12. **Duplicate invoice** (ISA 240): exact + near-match with 13-day date filter
13. **AML sanctions** (Pmt.): static EU/OFAC snapshot with fuzzy name match
14. **Contract dates**: start-end consistency, expiry detection

**Jurisdiction-aware**: Hungarian-specific rules (HU VAT Act, Ptk., Art.) apply only to Hungarian documents. Universal rules (ISA, GDPR, Incoterms, AML) apply everywhere.

---

## 6-layer anti-hallucination stack

The system is designed so the LLM **cannot** lie about a document and have the lie pass through.

| Layer | What it does |
|---|---|
| 1. `temperature=0` | Greedy decoding, so repeated runs on the same input give the same output in practice |
| 2. Source quote requirement | Every extraction must include a verbatim quote from the source PDF in `_quotes` |
| 3. Confidence scoring | high / medium / low per extracted field, surfaced to the user |
| 4. Plausibility validators | Deterministic Python checks for math, dates, totals, item-level VAT, currency normalization |
| 5. 3-stage LLM-risk filter chain | Drops business-normal noise, drops repeats of basic deterministic checks, drops contradictions |
| 6. Quote validator | Text-search the source PDF for the claimed quote; downgrade confidence if not found verbatim, drop entirely if obviously fabricated |

In our live audit demo, layer 6 caught **4 of 6** hallucinated citations from Qwen 2.5 14B and downgraded them to `low` confidence.

The `validation/` package is one of the most-edited folders in the repo precisely because we treat anti-hallucination as a first-class concern, not a guardrail layer slapped on top.
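
Layer 6 could be approximated as below. This is a sketch under assumptions: it normalizes Unicode, case, and whitespace before searching (PDF text extraction often changes line breaks), and it only implements the downgrade-to-`low` path. The rule for dropping "obviously fabricated" quotes is not specified here, so it is left out. Function names are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def validate_quote(quote: str, source_text: str, confidence: str) -> str:
    """Keep the confidence if the quote occurs in the source, else return 'low'."""
    needle = normalize(quote)
    if needle and needle in normalize(source_text):
        return confidence
    return "low"


source = "Invoice 2024/17\nTotal: 1 000\nEUR due in 30 days"
kept = validate_quote("Total:  1 000 EUR", source, "high")     # quote is present
lowered = validate_quote("Total: 2 000 EUR", source, "high")   # quote is absent
```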

---

## Provider abstraction

`configurable_alternatives` lets us swap LLM backends with a single env var:

| `LLM_PROFILE` | Backend | Use case |
|---|---|---|
| `vllm` | vLLM REST endpoint (OpenAI-compatible) | Production on AMD MI300X |
| `ollama` | Local Ollama at `localhost:11434` | Dev on consumer GPU |
| `dummy` | Deterministic stub | CI tests, smoke tests, judge quick-demo |

The application code never imports an LLM SDK directly; all calls go through `providers/` factory functions with `configurable_alternatives`. Switching from Anthropic Claude (our original dev target) to Qwen on vLLM required **zero application code changes**, only env vars.
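
The env-var switch can be illustrated with a plain factory registry. This sketch does not use LangChain's `configurable_alternatives`; it only shows the idea of resolving `LLM_PROFILE` to a backend in one place. The `get_llm` name, the `DummyLLM` class, the placeholder factories, and defaulting to `dummy` are all assumptions.

```python
import os
from typing import Callable, Dict, Optional


class DummyLLM:
    """Deterministic stub: returns the same canned reply for every prompt."""

    def invoke(self, prompt: str) -> str:
        return "DUMMY_RESPONSE"


def _make_vllm() -> object:
    # Placeholder: a real factory would build an OpenAI-compatible client here.
    raise NotImplementedError("vLLM client is not wired up in this sketch")


def _make_ollama() -> object:
    # Placeholder: a real factory would point at the local Ollama server.
    raise NotImplementedError("Ollama client is not wired up in this sketch")


FACTORIES: Dict[str, Callable[[], object]] = {
    "vllm": _make_vllm,
    "ollama": _make_ollama,
    "dummy": DummyLLM,
}


def get_llm(profile: Optional[str] = None) -> object:
    """Resolve the backend from the argument or the LLM_PROFILE env var."""
    name = (profile or os.environ.get("LLM_PROFILE", "dummy")).lower()
    if name not in FACTORIES:
        raise ValueError(f"Unknown LLM_PROFILE: {name!r}")
    return FACTORIES[name]()
```

Application code would call `get_llm()` and never reference a concrete backend class.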

---

## Embedding + retrieval

- **Model**: BAAI/bge-m3 (1024-dim, multilingual EN/HU/DE/FR via sentence-transformers)
- **Storage**: ChromaDB persistent (per-session) + BM25 in-memory keyword index
- **Hybrid retrieval**: Reciprocal Rank Fusion of Chroma top-K and BM25 top-K
- **Chunking**: Natural-boundary chunking (paragraph-aware, ~500 tokens with overlap)

The embedding model loads once at app startup (~2.3 GB to RAM/VRAM). On first run it downloads from Hugging Face Hub to `~/.cache/huggingface/`.
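
Reciprocal Rank Fusion itself is small enough to show in full. The sketch below takes ranked lists of chunk IDs (for example, one from Chroma and one from BM25) and scores each ID by the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is the value commonly used in the literature and is an assumption here, as are the function name and the alphabetical tie-break.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse ranked lists of IDs; a higher fused score means a better rank."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


chroma_hits = ["c1", "c2", "c3"]
bm25_hits = ["c2", "c4", "c1"]
fused = reciprocal_rank_fusion([chroma_hits, bm25_hits])
# fused == ["c2", "c1", "c4", "c3"]
```

Chunks that appear in both lists (`c2`, `c1`) outrank chunks found by only one retriever, which is the point of the fusion step.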

---

## State persistence

- **Per-session**: Streamlit `session_state` for UI state (uploaded files, current package)
- **Per-graph**: `AsyncSqliteSaver` checkpointer at `data/checkpoints.sqlite` for LangGraph state
- **Vector store**: ChromaDB at `chroma_db/` (gitignored)

Restarting the app loads the last checkpoint, so chat history and extraction results survive a restart.

---

## Streamlit UI (5 tabs)

1. **Upload**: drag-and-drop (PDF, DOCX, PNG, JPG, TXT), 200 MB per file, plus 3 pre-bundled demo packages
2. **Results**: classification confidence, extracted data, risks per document, package-level cross-doc analysis
3. **Chat**: agentic chat with `[Source: filename.pdf]` citations
4. **DD Assistant**: for M&A packages: per-contract summaries + 4 specialist findings + executive summary + downloadable DOCX
5. **Report**: JSON output + DOCX export

The async runtime uses a long-lived background event loop (`app/async_runtime.py`) so the UI stays responsive during multi-minute pipeline runs.
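
A long-lived background loop of this kind is typically built from a daemon thread running `loop.run_forever()` plus `asyncio.run_coroutine_threadsafe`. The sketch below shows that pattern with the standard library only; the `BackgroundLoop` class and its method names are hypothetical and not taken from `app/async_runtime.py`.

```python
import asyncio
import threading


class BackgroundLoop:
    """Owns one event loop that runs in a daemon thread for the app's lifetime."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def submit(self, coro):
        """Schedule a coroutine on the background loop.

        Returns a concurrent.futures.Future that the caller can poll or wait on.
        """
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def stop(self) -> None:
        """Stop the loop, wait for the thread to exit, and release the loop."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def slow_pipeline(x: int) -> int:
    await asyncio.sleep(0.01)  # stands in for a multi-minute pipeline run
    return x * 2


runtime = BackgroundLoop()
future = runtime.submit(slow_pipeline(21))
result = future.result(timeout=5)  # a UI thread could poll future.done() instead
runtime.stop()
```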

---

## Cross-references

- [`docs/AMD_DEPLOYMENT.md`](AMD_DEPLOYMENT.md): how the production vLLM endpoint runs on AMD MI300X
- [`docs/HUGGINGFACE_DEPLOYMENT.md`](HUGGINGFACE_DEPLOYMENT.md): how the Streamlit app deploys as a public HF Space
- [`docs/SUBMISSION.md`](SUBMISSION.md): full hackathon submission brief with TAM/SAM, competitor positioning, live deployment validation