# CLAUDE.md β€” paperhawk

Project-level instructions for Claude Code working in this repository. Any
session that starts in this folder reads this file automatically.

**Last updated:** 2026-05-03

---

## 1. Project overview

A LangGraph-native, multi-agent Document Intelligence platform built for the
**AMD Developer Hackathon Γ— lablab.ai** (May 2026). MIT-licensed, English-only
codebase, designed to run on **AMD Instinct MI300X** GPUs via the vLLM runtime
serving **Qwen 2.5 Instruct** open-source models.

The system processes business document packages (invoices, contracts, delivery
notes, purchase orders, financial reports) end-to-end:

1. **Ingest** β€” PDF / DOCX / image with vision-first scanned fallback
2. **Classify** β€” 6-way doc-type classifier (LLM with structured output)
3. **Extract** β€” typed Pydantic schema extraction with anti-hallucination
4. **Cross-reference** β€” three-way matching (invoice + delivery + PO)
5. **Risk analysis** β€” basic + 14 domain rules + LLM ensemble + 3 filters
6. **Report** β€” DOCX export, JSON API, executive summary

The chat layer is a 5-tool agentic ReAct loop with explicit `[Source: filename]`
citations and an anti-hallucination validator.

---

## 2. Workflow rules

### Language

- **English everywhere** β€” code, comments, docstrings, prompts, UI, error
  messages, log lines.
- **Multilingual fallback** β€” for legacy interop and the multilingual demo:
  some loaders, classifiers, and regex filters accept HU/DE input. EN is
  always the primary path.
- Two HU reference documents are kept under `docs/` with `_HU.md` suffix
  (`Teljes-rendszer-attekintes-langgraph_HU.md`, `MUKODESI_LEIRAS_HU.md`).
  These are read-only references; do not edit.

### License + IP

- **MIT licensed** β€” see `LICENSE`.
- `NOTICE.md` is a non-binding author request (no legal force).
- Never paste proprietary code from outside this repo.

### Provider

- The default chat provider is `vllm` (Qwen 2.5 14B Instruct on AMD MI300X
  through the OpenAI-compatible vLLM endpoint).
- `ollama` is a local dev fallback (Qwen 2.5 7B Instruct on a laptop GPU/CPU).
- `dummy` is the deterministic CI / eval / smoke provider (no network, no LLM).
- Never re-introduce a Claude / Anthropic provider here β€” that path is
  out of scope for the AMD edition.
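A minimal sketch of how the profile switch might be resolved (assumptions: `LLM_PROFILE` and `VLLM_BASE_URL` come from the testing section below; the function name, dataclass, and exact model identifiers here are illustrative, not the repo's actual `providers/` API):

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderConfig:
    name: str
    base_url: Optional[str]  # OpenAI-compatible endpoint; None for dummy
    model: Optional[str]


def resolve_provider(profile: Optional[str] = None) -> ProviderConfig:
    """Map an LLM_PROFILE value to a provider config (hypothetical helper)."""
    profile = profile or os.environ.get("LLM_PROFILE", "vllm")
    if profile == "vllm":
        # Qwen 2.5 14B Instruct served on AMD MI300X via vLLM.
        return ProviderConfig("vllm", os.environ.get("VLLM_BASE_URL"),
                              "Qwen2.5-14B-Instruct")
    if profile == "ollama":
        # Local dev fallback: smaller Qwen on a laptop GPU/CPU.
        return ProviderConfig("ollama", "http://localhost:11434/v1",
                              "qwen2.5:7b-instruct")
    if profile == "dummy":
        # Deterministic CI/eval provider: no network, no LLM.
        return ProviderConfig("dummy", None, None)
    raise ValueError(f"unknown LLM profile: {profile!r}")
```

The key property: any profile other than the three supported ones (including a Claude/Anthropic one) fails loudly instead of falling through.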

### Git

- The AI **NEVER** runs git operations on `main` (no commit, no push, no
  cherry-pick, no merge). The user runs all `main`-branch git operations.
- The AI MAY commit on non-`main` feature branches when explicitly asked.
- The AI **NEVER** pushes β€” push is the user's task only.

### Build hygiene

- Do not commit `.env`, `chroma_db/`, `data/checkpoints.sqlite`, `__pycache__/`.
- Hungarian or English commit messages are both fine; English is preferred
  for the public history of an MIT repo.

### Anti-hallucination is sacred

- The 5+1 layers (`temperature=0`, `_quotes`, `_confidence`, plausibility
  filters, the three LLM-risk filters, quote validator) are not optional.
  Every piece of LLM-generated data is cross-checked.
- Source citations in the chat use the canonical `[Source: filename]` format
  (validator enforces this).

---

## 3. Repo layout

```
paperhawk/
β”œβ”€β”€ app/                   # Streamlit UI (5 tabs) + async runtime
β”œβ”€β”€ config.py              # Pydantic Settings (env-bound)
β”œβ”€β”€ domain_checks/         # 14 deterministic rules + base + registry
β”œβ”€β”€ eval/                  # Eval harness (questions + run_eval)
β”œβ”€β”€ graph/                 # 4 compiled graphs (pipeline / chat / dd /
β”‚                          # package_insights) + 6 states + checkpointer
β”œβ”€β”€ ingest/                # PDF / DOCX / image / OCR / tables / txt
β”œβ”€β”€ infra/vllm/            # AMD MI300X deployment (Dockerfile + serve.sh + README)
β”œβ”€β”€ load/                  # Load benchmarks
β”œβ”€β”€ nodes/                 # Per-stage node functions:
β”‚   β”œβ”€β”€ chat/              #   chat agent + 5 tools
β”‚   β”œβ”€β”€ dd/                #   DD specialists + supervisor + synthesizer
β”‚   β”œβ”€β”€ extract/           #   extract + dummy + quote validator
β”‚   β”œβ”€β”€ ingest/            #   ingest helpers
β”‚   β”œβ”€β”€ pipeline/          #   classify / compare / duplicate / report / docx
β”‚   └── risk/              #   basic / domain dispatch / LLM risk + 3 filters
β”œβ”€β”€ providers/             # vLLM / Ollama / Dummy LLM providers + embeddings
β”œβ”€β”€ schemas/               # 6 JSON schemas + pydantic_models + flatten_universal
β”œβ”€β”€ store/                 # ChromaDB + BM25 hybrid + chunking
β”œβ”€β”€ subgraphs/             # 6 reusable subgraphs (Send API parallelism)
β”œβ”€β”€ tests/                 # unit + integration + e2e_api + e2e_screenshot
β”œβ”€β”€ tools/                 # 5 chat tools + ChatToolContext
β”œβ”€β”€ utils/                 # dates + numbers + docx_export
└── validation/            # anti-halluc layers (5+1)
```

---

## 4. Hot files

When fixing bugs or adding features, these are the most-edited files:

- `graph/states/pipeline_state.py` β€” `Risk`, `Classification`, `ExtractedData`,
  `merge_risks`, `merge_doc_results` reducers.
- `domain_checks/__init__.py` β€” the 14-check registry.
- `domain_checks/check_*_*.py` β€” individual deterministic rules.
- `nodes/risk/_prompts.py` β€” `RISK_SYSTEM_PROMPT` (anti-halluc 9+6+4 examples).
- `nodes/chat/_prompts.py` β€” `AGENTIC_SYSTEM_PROMPT` (17 rules).
- `validation/llm_risk_filters.py` β€” 3-filter chain.
- `app/main.py` β€” Streamlit UI (5 tabs).

---

## 5. Testing

```bash
# Fast: unit + integration (dummy LLM)
LLM_PROFILE=dummy pytest tests/unit tests/integration -x --tb=short

# Slow: end-to-end with real LLM
LLM_PROFILE=vllm pytest tests/e2e_api -m e2e -x --tb=short

# UI Playwright (real LLM, slow)
LLM_PROFILE=vllm pytest tests/e2e_screenshot -x --tb=short
```

`LLM_PROFILE=dummy` works without any external service. `LLM_PROFILE=vllm`
requires `VLLM_BASE_URL` to point at a running vLLM endpoint.

---

## 6. Deploy targets

- **Hugging Face Space** β€” Streamlit Space under
  `huggingface.co/spaces/lablab-ai-amd-developer-hackathon/<your-space>`.
  See `docs/hf-space-deployment.md`.
- **AMD Developer Cloud MI300X** β€” vLLM serving Qwen 2.5 14B (or 32B).
  See `docs/qwen-vllm-deployment.md` and `infra/vllm/README.md`.

---

## 7. Pitch positioning

When writing project descriptions, the README, video, or social posts:

- **Beyond simple RAG** β€” multi-agent platform with 14 deterministic checks
  + an LLM ensemble. The 5-tool chat is *agentic*, not retrieval-only.
- **Track 1** (AI Agents & Agentic Workflows) is the target track.
- **Cross-track**: Build in Public is in scope (AMD GPU prize).
- **HF Special Prize** is in scope (Reachy Mini robot β€” like-vote driven).

---

## 8. The Glossary (HU β†’ EN field names)

The full per-field rename map is in
`pwc-ai-verseny/document-intelligence-agentic-langgraph-amd/ATIRASI_TERV.md`
sections **32 (field names) and 33 (severity literals)**. Keep that file
open when editing extraction schemas, domain checks, or anything that
touches the `Risk` Pydantic.

---

## 9. Common pitfalls

- **Severity literals**: always `"high" | "medium" | "low" | "info"` β€”
  never `"magas" | "kozepes" | "alacsony"`. Many `_normalize_severity()`
  helpers map HU β†’ EN if legacy data sneaks in, but new code emits EN.
- **Risk fields**: `description`, `severity`, `rationale`, `kind`,
  `regulation`, `affected_document`, `source_check_id`. NOT
  `leiras / sulyossag / indoklas / tipus / jogszabaly / erinto_dokumentum / forras_check_id`.
- **Doc types**: `"invoice" | "delivery_note" | "purchase_order" | "contract" | "financial_report" | "other"`.
- **`_quotes` alias** (not `_idezetek`) β€” both in JSON schemas and Pydantic models.
- **Multilingual fallback**: HU/DE input is accepted in classifiers and regex
  filters for legacy interop only; never emit HU in new code.
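The severity and field-name rules above can be sketched as follows (a stdlib-only sketch: the field names and literals come from this section, but the real models are Pydantic classes in `graph/states/pipeline_state.py` and `schemas/`, so treat the dataclass and helper name as illustrative):

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Canonical EN severity literals -- never the legacy HU strings.
Severity = Literal["high", "medium", "low", "info"]

_HU_SEVERITY = {"magas": "high", "kozepes": "medium", "alacsony": "low"}


def normalize_severity(value: str) -> str:
    """Map legacy HU severity strings to EN; pass EN values through unchanged."""
    return _HU_SEVERITY.get(value.strip().lower(), value.strip().lower())


@dataclass
class Risk:
    # EN field names only (never leiras / sulyossag / indoklas / ...).
    # The real Pydantic model also carries the `_quotes` alias
    # (e.g. Field(alias="_quotes")) for the anti-hallucination layer.
    description: str
    severity: Severity
    rationale: str
    kind: str
    regulation: Optional[str] = None
    affected_document: Optional[str] = None
    source_check_id: Optional[str] = None
```

New code emits EN directly; `normalize_severity` exists only as a safety net for legacy data.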