# CLAUDE.md – paperhawk
Project-level instructions for Claude Code working in this repository. Any
session that starts in this folder reads this file automatically.
**Last updated:** 2026-05-03
---
## 1. Project overview
A LangGraph-native, multi-agent Document Intelligence platform built for the
**AMD Developer Hackathon × lablab.ai** (May 2026). MIT-licensed, English-only
codebase, designed to run on **AMD Instinct MI300X** GPUs via the vLLM runtime
serving **Qwen 2.5 Instruct** open-source models.
The system processes business document packages (invoices, contracts, delivery
notes, purchase orders, financial reports) end-to-end:
1. **Ingest** – PDF / DOCX / image with vision-first scanned fallback
2. **Classify** – 6-way doc-type classifier (LLM with structured output)
3. **Extract** – typed Pydantic schema extraction with anti-hallucination
4. **Cross-reference** – three-way matching (invoice + delivery + PO)
5. **Risk analysis** – basic + 14 domain rules + LLM ensemble + 3 filters
6. **Report** – DOCX export, JSON API, executive summary
The chat layer is a 5-tool agentic ReAct loop with explicit `[Source: filename]`
citations and an anti-hallucination validator.
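The six stages above run as compiled LangGraph graphs under `graph/`. The following is a dependency-free sketch of the stage order and state flow only; the state fields and stage bodies are hypothetical stand-ins, not the actual `PipelineState` or node implementations.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PipelineState:
    """Minimal stand-in for the real pipeline state (field names hypothetical)."""
    docs: list = field(default_factory=list)
    doc_types: dict = field(default_factory=dict)
    risks: list = field(default_factory=list)
    report: str = ""

def ingest(s: PipelineState) -> PipelineState:
    s.docs = ["invoice.pdf", "delivery_note.pdf", "po.pdf"]  # stand-in loaders
    return s

def classify(s: PipelineState) -> PipelineState:
    # The real 6-way classifier is an LLM with structured output; stubbed here.
    for d in s.docs:
        s.doc_types[d] = "invoice" if "invoice" in d else "other"
    return s

def risk(s: PipelineState) -> PipelineState:
    # Deterministic + LLM risk checks append Risk records; one demo finding.
    s.risks.append({"severity": "info", "description": "demo finding"})
    return s

def report(s: PipelineState) -> PipelineState:
    s.report = f"{len(s.docs)} docs, {len(s.risks)} risks"
    return s

STAGES: list = [ingest, classify, risk, report]

def run() -> PipelineState:
    state = PipelineState()
    for stage in STAGES:
        state = stage(state)
    return state
```

In the real system each stage is a graph node and several run in parallel via the Send API; the linear loop here is only to show what flows between them.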
---
## 2. Workflow rules
### Language
- **English everywhere** – code, comments, docstrings, prompts, UI, error
  messages, log lines.
- **Multilingual fallback** – for legacy interop and the multilingual demo:
  some loaders, classifiers, and regex filters accept HU/DE input. EN is
  always the primary path.
- Two HU reference documents are kept under `docs/` with `_HU.md` suffix
(`Teljes-rendszer-attekintes-langgraph_HU.md`, `MUKODESI_LEIRAS_HU.md`).
These are read-only references; do not edit.
### License + IP
- **MIT licensed** – see `LICENSE`.
- `NOTICE.md` is a non-binding author request (no legal force).
- Never paste proprietary code from outside this repo.
### Provider
- The default chat provider is `vllm` (Qwen 2.5 14B Instruct on AMD MI300X
through the OpenAI-compatible vLLM endpoint).
- `ollama` is a local dev fallback (Qwen 2.5 7B Instruct on a laptop GPU/CPU).
- `dummy` is the deterministic CI / eval / smoke provider (no network, no LLM).
- Never re-introduce a Claude / Anthropic provider here – that path is
out of scope for the AMD edition.
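The provider choice is driven by `LLM_PROFILE` (see §5). A hedged sketch of the dispatch, with illustrative factory and return values – the real factories live in `providers/` and the vLLM path builds an OpenAI-compatible client against `VLLM_BASE_URL`:

```python
import os

def make_provider(profile=None):
    """Return a callable LLM stub keyed on LLM_PROFILE (names illustrative)."""
    profile = profile or os.environ.get("LLM_PROFILE", "vllm")
    if profile == "dummy":
        # Deterministic, network-free provider for CI / eval / smoke runs.
        return lambda prompt: f"[dummy] {prompt[:40]}"
    if profile == "vllm":
        base_url = os.environ.get("VLLM_BASE_URL")
        if not base_url:
            raise RuntimeError("LLM_PROFILE=vllm requires VLLM_BASE_URL")
        # Real code would call the OpenAI-compatible vLLM endpoint here.
        return lambda prompt: f"[vllm@{base_url}] {prompt[:40]}"
    if profile == "ollama":
        # Local dev fallback (Qwen 2.5 7B Instruct on a laptop GPU/CPU).
        return lambda prompt: f"[ollama] {prompt[:40]}"
    raise ValueError(f"unknown LLM_PROFILE: {profile}")
```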
### Git
- The AI **NEVER** runs git operations on `main` (no commit, no push, no
cherry-pick, no merge). The user runs all `main`-branch git operations.
- The AI MAY commit on non-`main` feature branches when explicitly asked.
- The AI **NEVER** pushes – push is the user's task only.
### Build hygiene
- Do not commit `.env`, `chroma_db/`, `data/checkpoints.sqlite`, `__pycache__/`.
- Hungarian / English commit messages are both fine; English is preferred for
  the public history of an MIT repo.
### Anti-hallucination is sacred
- The 5+1 layers (`temperature=0`, `_quotes`, `_confidence`, plausibility
filters, LLM-risk 3 filters, quote validator) are not optional. Every
LLM-generated piece of data is cross-checked.
- Source citations in the chat use the canonical `[Source: filename]` format
(validator enforces this).
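A minimal sketch of what such a citation check can look like – the regex and function names below are assumptions for illustration, the real validator lives in `validation/`:

```python
import re

# Canonical chat citation format: [Source: filename]
CITATION_RE = re.compile(r"\[Source: ([^\]]+)\]")

def cited_sources(answer: str) -> list:
    """Return every filename cited in a chat answer."""
    return CITATION_RE.findall(answer)

def invalid_citations(answer: str, known_files: set) -> list:
    """Flag citations pointing at files not present in the ingested package."""
    return [f for f in cited_sources(answer) if f not in known_files]
```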
---
## 3. Repo layout
```
paperhawk/
├── app/             # Streamlit UI (5 tabs) + async runtime
├── config.py        # Pydantic Settings (env-bound)
├── domain_checks/   # 14 deterministic rules + base + registry
├── eval/            # Eval harness (questions + run_eval)
├── graph/           # 4 compiled graphs (pipeline / chat / dd /
│                    #   package_insights) + 6 states + checkpointer
├── ingest/          # PDF / DOCX / image / OCR / tables / txt
├── infra/vllm/      # AMD MI300X deployment (Dockerfile + serve.sh + README)
├── load/            # Load benchmarks
├── nodes/           # Per-stage node functions:
│   ├── chat/        # chat agent + 5 tools
│   ├── dd/          # DD specialists + supervisor + synthesizer
│   ├── extract/     # extract + dummy + quote validator
│   ├── ingest/      # ingest helpers
│   ├── pipeline/    # classify / compare / duplicate / report / docx
│   └── risk/        # basic / domain dispatch / LLM risk + 3 filters
├── providers/       # vLLM / Ollama / Dummy LLM providers + embeddings
├── schemas/         # 6 JSON schemas + pydantic_models + flatten_universal
├── store/           # ChromaDB + BM25 hybrid + chunking
├── subgraphs/       # 6 reusable subgraphs (Send API parallelism)
├── tests/           # unit + integration + e2e_api + e2e_screenshot
├── tools/           # 5 chat tools + ChatToolContext
├── utils/           # dates + numbers + docx_export
└── validation/      # anti-halluc layers (5+1)
```
---
## 4. Hot files
When fixing bugs or adding features, these are the most-edited files:
- `graph/states/pipeline_state.py` – `Risk`, `Classification`, `ExtractedData`,
  `merge_risks`, `merge_doc_results` reducers.
- `domain_checks/__init__.py` – the 14-check registry.
- `domain_checks/check_*_*.py` – individual deterministic rules.
- `nodes/risk/_prompts.py` – `RISK_SYSTEM_PROMPT` (anti-halluc 9+6+4 examples).
- `nodes/chat/_prompts.py` – `AGENTIC_SYSTEM_PROMPT` (17 rules).
- `validation/llm_risk_filters.py` – 3-filter chain.
- `app/main.py` – Streamlit UI (5 tabs).
---
## 5. Testing
```bash
# Fast: unit + integration (dummy LLM)
LLM_PROFILE=dummy pytest tests/unit tests/integration -x --tb=short
# Slow: end-to-end with real LLM
LLM_PROFILE=vllm pytest tests/e2e_api -m e2e -x --tb=short
# UI Playwright (real LLM, slow)
LLM_PROFILE=vllm pytest tests/e2e_screenshot -x --tb=short
```
`LLM_PROFILE=dummy` works without any external service. `LLM_PROFILE=vllm`
requires `VLLM_BASE_URL` to point at a running vLLM endpoint.
---
## 6. Deploy targets
- **Hugging Face Space** – Streamlit Space under
  `huggingface.co/spaces/lablab-ai-amd-developer-hackathon/<your-space>`.
  See `docs/hf-space-deployment.md`.
- **AMD Developer Cloud MI300X** – vLLM serving Qwen 2.5 14B (or 32B).
See `docs/qwen-vllm-deployment.md` and `infra/vllm/README.md`.
---
## 7. Pitch positioning
When writing project descriptions, the README, video, or social posts:
- **Beyond simple RAG** – multi-agent platform with 14 deterministic checks
+ an LLM ensemble. The 5-tool chat is *agentic*, not retrieval-only.
- **Track 1** (AI Agents & Agentic Workflows) is the target track.
- **Cross-track**: Build in Public is in scope (AMD GPU prize).
- **HF Special Prize** is in scope (Reachy Mini robot – like-vote driven).
---
## 8. The Glossary (HU → EN field names)
The full per-field rename map is in
`pwc-ai-verseny/document-intelligence-agentic-langgraph-amd/ATIRASI_TERV.md`
sections **32 (field names) and 33 (severity literals)**. Keep that file
open when editing extraction schemas, domain checks, or anything that
touches the `Risk` Pydantic.
---
## 9. Common pitfalls
- **Severity literals**: always `"high" | "medium" | "low" | "info"` –
  never `"magas" | "kozepes" | "alacsony"`. Many `_normalize_severity()`
  helpers map HU → EN if legacy data sneaks in, but new code emits EN.
- **Risk fields**: `description`, `severity`, `rationale`, `kind`,
`regulation`, `affected_document`, `source_check_id`. NOT
`leiras / sulyossag / indoklas / tipus / jogszabaly / erinto_dokumentum / forras_check_id`.
- **Doc types**: `"invoice" | "delivery_note" | "purchase_order" | "contract" | "financial_report" | "other"`.
- **`_quotes` alias** (not `_idezetek`) – both in JSON schemas and Pydantic models.
- **Multilingual fallback**: read-only in classifiers and regex filters;
never emit HU in new code.
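A hedged sketch of the severity-literal pitfall from the list above – the HU → EN mapping follows the literals named there, but the repo's own `_normalize_severity()` helpers may differ in detail:

```python
from typing import Literal

Severity = Literal["high", "medium", "low", "info"]

# Legacy Hungarian literals -> canonical English (per the pitfall list).
_HU_SEVERITY = {
    "magas": "high",
    "kozepes": "medium",
    "alacsony": "low",
}

def normalize_severity(value: str) -> Severity:
    """Map a legacy HU literal to EN; pass canonical EN literals through."""
    v = value.strip().lower()
    v = _HU_SEVERITY.get(v, v)
    if v not in ("high", "medium", "low", "info"):
        raise ValueError(f"unknown severity: {value!r}")
    return v
```

New code should emit the EN literals directly; normalization is only a safety net for legacy data.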