Nándorfi Vince · Initial paperhawk push to HF Space (LFS for binaries) · 7ff7119
# CLAUDE.md — paperhawk
Project-level instructions for Claude Code working in this repository. Any
session that starts in this folder reads this file automatically.
**Last updated:** 2026-05-03
---
## 1. Project overview
A LangGraph-native, multi-agent Document Intelligence platform built for the
**AMD Developer Hackathon × lablab.ai** (May 2026). MIT-licensed, English-only
codebase, designed to run on **AMD Instinct MI300X** GPUs via the vLLM runtime
serving **Qwen 2.5 Instruct** open-source models.
The system processes business document packages (invoices, contracts, delivery
notes, purchase orders, financial reports) end-to-end:
1. **Ingest** — PDF / DOCX / image with vision-first scanned fallback
2. **Classify** — 6-way doc-type classifier (LLM with structured output)
3. **Extract** — typed Pydantic schema extraction with anti-hallucination
4. **Cross-reference** — three-way matching (invoice + delivery + PO)
5. **Risk analysis** — basic + 14 domain rules + LLM ensemble + 3 filters
6. **Report** — DOCX export, JSON API, executive summary
The chat layer is a 5-tool agentic ReAct loop with explicit `[Source: filename]`
citations and an anti-hallucination validator.
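The `[Source: filename]` citation contract can be sketched as a small validator. This is an illustrative sketch only; the function names and the exact regex are assumptions, and the repo's real validator may differ in detail:

```python
import re

# Hypothetical sketch of the canonical [Source: filename] check;
# the repo's actual validator may be stricter.
CITATION_RE = re.compile(r"\[Source: ([^\]\n]+)\]")

def extract_citations(answer: str) -> list[str]:
    """Return every cited filename found in a chat answer."""
    return CITATION_RE.findall(answer)

def has_valid_citation(answer: str, known_files: set[str]) -> bool:
    """An answer passes only if it cites at least one file, and every
    cited file is one the pipeline actually ingested."""
    cited = extract_citations(answer)
    return bool(cited) and all(f in known_files for f in cited)
```

A citation-free answer, or one citing a file that was never ingested, would be rejected by a check of this shape.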
---
## 2. Workflow rules
### Language
- **English everywhere** — code, comments, docstrings, prompts, UI, error
  messages, log lines.
- **Multilingual fallback** — for legacy interop and the multilingual demo:
  some loaders, classifiers, and regex filters accept HU/DE input. EN is
  always the primary path.
- Two HU reference documents are kept under `docs/` with `_HU.md` suffix
(`Teljes-rendszer-attekintes-langgraph_HU.md`, `MUKODESI_LEIRAS_HU.md`).
These are read-only references; do not edit.
### License + IP
- **MIT licensed** — see `LICENSE`.
- `NOTICE.md` is a non-binding author request (no legal force).
- Never paste proprietary code from outside this repo.
### Provider
- The default chat provider is `vllm` (Qwen 2.5 14B Instruct on AMD MI300X
through the OpenAI-compatible vLLM endpoint).
- `ollama` is a local dev fallback (Qwen 2.5 7B Instruct on a laptop GPU/CPU).
- `dummy` is the deterministic CI / eval / smoke provider (no network, no LLM).
- Never re-introduce a Claude / Anthropic provider here — that path is
  out of scope for the AMD edition.
### Git
- The AI **NEVER** runs git operations on `main` (no commit, no push, no
cherry-pick, no merge). The user runs all `main`-branch git operations.
- The AI MAY commit on non-`main` feature branches when explicitly asked.
- The AI **NEVER** pushes — push is the user's task only.
### Build hygiene
- Do not commit `.env`, `chroma_db/`, `data/checkpoints.sqlite`, `__pycache__/`.
- Hungarian and English commit messages are both fine; English is preferred
  for the public history of an MIT repo.
### Anti-hallucination is sacred
- The 5+1 layers (`temperature=0`, `_quotes`, `_confidence`, plausibility
  filters, the 3 LLM-risk filters, quote validator) are not optional. Every
  LLM-generated piece of data is cross-checked.
- Source citations in the chat use the canonical `[Source: filename]` format
(validator enforces this).
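The quote-validation idea behind these layers can be sketched as follows. This is a minimal sketch under assumed names (`quote_supported`, `filter_extraction`); the real chain lives in `validation/` and `nodes/extract/` and may normalize more aggressively:

```python
def quote_supported(quotes: list[str], source_text: str) -> bool:
    """An extracted value survives only if every supporting _quotes
    snippet is a verbatim substring of the ingested document text
    (whitespace collapsed before matching)."""
    norm = " ".join(source_text.split())
    return bool(quotes) and all(" ".join(q.split()) in norm for q in quotes)

def filter_extraction(fields: dict, source_text: str) -> dict:
    """Drop any field whose _quotes cannot be located in the source.
    A field with no quotes at all is also dropped: unsupported data
    never reaches downstream stages."""
    return {
        name: payload
        for name, payload in fields.items()
        if quote_supported(payload.get("_quotes", []), source_text)
    }
```

The point of this shape is that hallucinated values fail closed: a fabricated number has no matching quote, so the whole field is discarded rather than propagated.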
---
## 3. Repo layout
```
paperhawk/
├── app/              # Streamlit UI (5 tabs) + async runtime
├── config.py         # Pydantic Settings (env-bound)
├── domain_checks/    # 14 deterministic rules + base + registry
├── eval/             # Eval harness (questions + run_eval)
├── graph/            # 4 compiled graphs (pipeline / chat / dd /
│                     #   package_insights) + 6 states + checkpointer
├── ingest/           # PDF / DOCX / image / OCR / tables / txt
├── infra/vllm/       # AMD MI300X deployment (Dockerfile + serve.sh + README)
├── load/             # Load benchmarks
├── nodes/            # Per-stage node functions:
│   ├── chat/         # chat agent + 5 tools
│   ├── dd/           # DD specialists + supervisor + synthesizer
│   ├── extract/      # extract + dummy + quote validator
│   ├── ingest/       # ingest helpers
│   ├── pipeline/     # classify / compare / duplicate / report / docx
│   └── risk/         # basic / domain dispatch / LLM risk + 3 filters
├── providers/        # vLLM / Ollama / Dummy LLM providers + embeddings
├── schemas/          # 6 JSON schemas + pydantic_models + flatten_universal
├── store/            # ChromaDB + BM25 hybrid + chunking
├── subgraphs/        # 6 reusable subgraphs (Send API parallelism)
├── tests/            # unit + integration + e2e_api + e2e_screenshot
├── tools/            # 5 chat tools + ChatToolContext
├── utils/            # dates + numbers + docx_export
└── validation/       # anti-halluc layers (5+1)
```
---
## 4. Hot files
When fixing bugs or adding features, these are the most-edited files:
- `graph/states/pipeline_state.py` — `Risk`, `Classification`, `ExtractedData`,
  `merge_risks`, `merge_doc_results` reducers.
- `domain_checks/__init__.py` — the 14-check registry.
- `domain_checks/check_*_*.py` — individual deterministic rules.
- `nodes/risk/_prompts.py` — `RISK_SYSTEM_PROMPT` (anti-halluc 9+6+4 examples).
- `nodes/chat/_prompts.py` — `AGENTIC_SYSTEM_PROMPT` (17 rules).
- `validation/llm_risk_filters.py` — 3-filter chain.
- `app/main.py` — Streamlit UI (5 tabs).
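A plausible shape for the `merge_risks` reducer named above, as a LangGraph-style list reducer that merges risk lists produced by parallel branches. The dedup key used here is an assumption for illustration, not the repo's actual logic:

```python
def merge_risks(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Append-and-dedupe reducer for risks arriving from parallel
    branches (Send API fan-out). Dedup key is illustrative: the pair
    (source_check_id, description) identifies a risk here."""
    seen = {(r.get("source_check_id"), r.get("description")) for r in existing}
    merged = list(existing)
    for r in incoming:
        key = (r.get("source_check_id"), r.get("description"))
        if key not in seen:
            seen.add(key)
            merged.append(r)
    return merged
```

The reducer must be order-insensitive and idempotent, because parallel subgraphs can deliver the same risk twice and in any order.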
---
## 5. Testing
```bash
# Fast: unit + integration (dummy LLM)
LLM_PROFILE=dummy pytest tests/unit tests/integration -x --tb=short
# Slow: end-to-end with real LLM
LLM_PROFILE=vllm pytest tests/e2e_api -m e2e -x --tb=short
# UI Playwright (real LLM, slow)
LLM_PROFILE=vllm pytest tests/e2e_screenshot -x --tb=short
```
`LLM_PROFILE=dummy` works without any external service. `LLM_PROFILE=vllm`
requires `VLLM_BASE_URL` to point at a running vLLM endpoint.
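The reason `LLM_PROFILE=dummy` needs no external service is that the dummy provider returns canned, deterministic output. A sketch of that idea (class and method names are illustrative, not the repo's actual API in `providers/`):

```python
class DummyLLM:
    """Deterministic stand-in for a chat model: returns canned replies
    keyed on substrings of the prompt, with a fixed fallback. No
    network, no randomness, so CI and eval runs are reproducible."""

    def __init__(self, canned: dict[str, str],
                 default: str = '{"doc_type": "other"}'):
        self.canned = canned
        self.default = default

    def complete(self, prompt: str) -> str:
        for needle, reply in self.canned.items():
            if needle in prompt:
                return reply
        return self.default
```

Tests built on a provider like this exercise graph wiring, reducers, and validators without ever asserting on real model output.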
---
## 6. Deploy targets
- **Hugging Face Space** — Streamlit Space under
  `huggingface.co/spaces/lablab-ai-amd-developer-hackathon/<your-space>`.
  See `docs/hf-space-deployment.md`.
- **AMD Developer Cloud MI300X** — vLLM serving Qwen 2.5 14B (or 32B).
  See `docs/qwen-vllm-deployment.md` and `infra/vllm/README.md`.
---
## 7. Pitch positioning
When writing project descriptions, the README, video, or social posts:
- **Beyond simple RAG** — multi-agent platform with 14 deterministic checks
  + an LLM ensemble. The 5-tool chat is *agentic*, not retrieval-only.
- **Track 1** (AI Agents & Agentic Workflows) is the target track.
- **Cross-track**: Build in Public is in scope (AMD GPU prize).
- **HF Special Prize** is in scope (Reachy Mini robot — like-vote driven).
---
## 8. The Glossary (HU → EN field names)
The full per-field rename map is in
`pwc-ai-verseny/document-intelligence-agentic-langgraph-amd/ATIRASI_TERV.md`
sections **32 (field names) and 33 (severity literals)**. Keep that file
open when editing extraction schemas, domain checks, or anything that
touches the `Risk` Pydantic.
---
## 9. Common pitfalls
- **Severity literals**: always `"high" | "medium" | "low" | "info"` —
  never `"magas" | "kozepes" | "alacsony"`. Many `_normalize_severity()`
  helpers map HU → EN if legacy data sneaks in, but new code emits EN.
- **Risk fields**: `description`, `severity`, `rationale`, `kind`,
  `regulation`, `affected_document`, `source_check_id`. NOT
  `leiras / sulyossag / indoklas / tipus / jogszabaly / erinto_dokumentum / forras_check_id`.
- **Doc types**: `"invoice" | "delivery_note" | "purchase_order" | "contract" | "financial_report" | "other"`.
- **`_quotes` alias** (not `_idezetek`) — both in JSON schemas and Pydantic models.
- **Multilingual fallback**: read-only in classifiers and regex filters;
never emit HU in new code.