
CLAUDE.md — paperhawk

Project-level instructions for Claude Code working in this repository. Any session that starts in this folder reads this file automatically.

Last updated: 2026-05-03


1. Project overview

A LangGraph-native, multi-agent Document Intelligence platform built for the AMD Developer Hackathon × lablab.ai (May 2026). MIT-licensed, English-only codebase, designed to run on AMD Instinct MI300X GPUs via the vLLM runtime serving Qwen 2.5 Instruct open-source models.

The system processes business document packages (invoices, contracts, delivery notes, purchase orders, financial reports) end-to-end:

  1. Ingest — PDF / DOCX / image with vision-first scanned fallback
  2. Classify — 6-way doc-type classifier (LLM with structured output)
  3. Extract — typed Pydantic schema extraction with anti-hallucination
  4. Cross-reference — three-way matching (invoice + delivery + PO)
  5. Risk analysis — basic + 14 domain rules + LLM ensemble + 3 filters
  6. Report — DOCX export, JSON API, executive summary
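The six stages above can be sketched as plain functions over a shared state dict. This is a minimal illustrative sketch only (the real graphs are LangGraph-compiled in graph/; all names here are hypothetical):

```python
from typing import Callable

State = dict

def ingest(state: State) -> State:
    # Parse PDF / DOCX / image into text (vision-first fallback for scans).
    state["stages"] = state.get("stages", []) + ["ingest"]
    return state

def classify(state: State) -> State:
    # 6-way doc-type classification via structured LLM output.
    state["stages"].append("classify")
    state["doc_type"] = "invoice"
    return state

def extract(state: State) -> State:
    # Typed Pydantic schema extraction with anti-hallucination layers.
    state["stages"].append("extract")
    return state

def cross_reference(state: State) -> State:
    # Three-way matching: invoice + delivery note + purchase order.
    state["stages"].append("cross_reference")
    return state

def risk_analysis(state: State) -> State:
    # Basic checks + 14 domain rules + LLM ensemble + 3 filters.
    state["stages"].append("risk_analysis")
    return state

def report(state: State) -> State:
    # DOCX export, JSON API, executive summary.
    state["stages"].append("report")
    return state

PIPELINE: list[Callable[[State], State]] = [
    ingest, classify, extract, cross_reference, risk_analysis, report,
]

def run_pipeline(path: str) -> State:
    state: State = {"path": path}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

The sequential loop stands in for the compiled graph; in the real system the stages run under LangGraph with checkpointing and Send-API parallelism.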

The chat layer is a 5-tool agentic ReAct loop with explicit [Source: filename] citations and an anti-hallucination validator.
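The citation contract can be checked with a small regex pass. A sketch of the idea only (the real validator lives in validation/; the function name is hypothetical):

```python
import re

# Canonical citation format: [Source: filename]
CITATION_RE = re.compile(r"\[Source: ([^\]]+)\]")

def invalid_citations(answer: str, known_files: set[str]) -> list[str]:
    """Return cited filenames that are not part of the ingested package."""
    cited = CITATION_RE.findall(answer)
    return [name for name in cited if name not in known_files]
```

An answer citing a file outside the package fails validation; the chat layer can then reject or regenerate the response.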


2. Workflow rules

Language

  • English everywhere — code, comments, docstrings, prompts, UI, error messages, log lines.
  • Multilingual fallback — for legacy interop and the multilingual demo: some loaders, classifiers, and regex filters accept HU/DE input. EN is always the primary path.
  • Two HU reference documents are kept under docs/ with _HU.md suffix (Teljes-rendszer-attekintes-langgraph_HU.md, MUKODESI_LEIRAS_HU.md). These are read-only references; do not edit.

License + IP

  • MIT licensed — see LICENSE.
  • NOTICE.md is a non-binding author request (no legal force).
  • Never paste proprietary code from outside this repo.

Provider

  • The default chat provider is vllm (Qwen 2.5 14B Instruct on AMD MI300X through the OpenAI-compatible vLLM endpoint).
  • ollama is a local dev fallback (Qwen 2.5 7B Instruct on a laptop GPU/CPU).
  • dummy is the deterministic CI / eval / smoke provider (no network, no LLM).
  • Never re-introduce a Claude / Anthropic provider here — that path is out of scope for the AMD edition.

Git

  • The AI NEVER runs git operations on main (no commit, no push, no cherry-pick, no merge). The user runs all main-branch git operations.
  • The AI MAY commit on non-main feature branches when explicitly asked.
  • The AI NEVER pushes β€” push is the user's task only.

Build hygiene

  • Do not commit .env, chroma_db/, data/checkpoints.sqlite, __pycache__/.
  • Hungarian / English commit messages are both fine; English preferred for the public history of an MIT repo.
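One way to enforce the never-commit list above is a .gitignore fragment covering exactly those paths:

```
.env
chroma_db/
data/checkpoints.sqlite
__pycache__/
```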

Anti-hallucination is sacred

  • The 5+1 layers (temperature=0, _quotes, _confidence, plausibility filters, LLM-risk 3 filters, quote validator) are not optional. Every LLM-generated piece of data is cross-checked.
  • Source citations in the chat use the canonical [Source: filename] format (validator enforces this).
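The core idea behind the _quotes layer: every extracted field carries a verbatim supporting quote that must appear in the source text, otherwise the value is treated as hallucinated. A minimal sketch, assuming a `_quotes` mapping of field name to quote (the real checks live in validation/ and nodes/extract/):

```python
def suspect_fields(extracted: dict, source_text: str) -> list[str]:
    """Return field names whose supporting _quotes are absent from the source."""
    suspect = []
    for field, quote in extracted.get("_quotes", {}).items():
        if quote and quote not in source_text:
            suspect.append(field)
    return suspect
```

Fields flagged here would be dropped or sent back for re-extraction rather than surfaced to the user.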

3. Repo layout

paperhawk/
├── app/                   # Streamlit UI (5 tabs) + async runtime
├── config.py              # Pydantic Settings (env-bound)
├── domain_checks/         # 14 deterministic rules + base + registry
├── eval/                  # Eval harness (questions + run_eval)
├── graph/                 # 4 compiled graphs (pipeline / chat / dd /
│                          # package_insights) + 6 states + checkpointer
├── ingest/                # PDF / DOCX / image / OCR / tables / txt
├── infra/vllm/            # AMD MI300X deployment (Dockerfile + serve.sh + README)
├── load/                  # Load benchmarks
├── nodes/                 # Per-stage node functions:
│   ├── chat/              #   chat agent + 5 tools
│   ├── dd/                #   DD specialists + supervisor + synthesizer
│   ├── extract/           #   extract + dummy + quote validator
│   ├── ingest/            #   ingest helpers
│   ├── pipeline/          #   classify / compare / duplicate / report / docx
│   └── risk/              #   basic / domain dispatch / LLM risk + 3 filters
├── providers/             # vLLM / Ollama / Dummy LLM providers + embeddings
├── schemas/               # 6 JSON schemas + pydantic_models + flatten_universal
├── store/                 # ChromaDB + BM25 hybrid + chunking
├── subgraphs/             # 6 reusable subgraphs (Send API parallelism)
├── tests/                 # unit + integration + e2e_api + e2e_screenshot
├── tools/                 # 5 chat tools + ChatToolContext
├── utils/                 # dates + numbers + docx_export
└── validation/            # anti-halluc layers (5+1)

4. Hot files

When fixing bugs or adding features, these are the most-edited files:

  • graph/states/pipeline_state.py — Risk, Classification, ExtractedData, merge_risks, merge_doc_results reducers.
  • domain_checks/__init__.py — the 14-check registry.
  • domain_checks/check_*_*.py — individual deterministic rules.
  • nodes/risk/_prompts.py — RISK_SYSTEM_PROMPT (anti-halluc 9+6+4 examples).
  • nodes/chat/_prompts.py — AGENTIC_SYSTEM_PROMPT (17 rules).
  • validation/llm_risk_filters.py — 3-filter chain.
  • app/main.py — Streamlit UI (5 tabs).
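A merge_risks-style reducer combines risk lists produced by concurrent graph branches into one state field, dropping duplicates by a stable identity key. A hypothetical sketch of the shape (the real reducer is in graph/states/pipeline_state.py; the dedup key is an assumption):

```python
def merge_risks(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Append-only merge of risk dicts, deduplicated by check id + document."""
    seen = {(r["source_check_id"], r["affected_document"]) for r in existing}
    merged = list(existing)
    for risk in incoming:
        key = (risk["source_check_id"], risk["affected_document"])
        if key not in seen:
            seen.add(key)
            merged.append(risk)
    return merged
```

Reducers like this are what let parallel Send-API branches write to the same state key without clobbering each other.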

5. Testing

# Fast: unit + integration (dummy LLM)
LLM_PROFILE=dummy pytest tests/unit tests/integration -x --tb=short

# Slow: end-to-end with real LLM
LLM_PROFILE=vllm pytest tests/e2e_api -m e2e -x --tb=short

# UI Playwright (real LLM, slow)
LLM_PROFILE=vllm pytest tests/e2e_screenshot -x --tb=short

LLM_PROFILE=dummy works without any external service. LLM_PROFILE=vllm requires VLLM_BASE_URL to point at a running vLLM endpoint.


6. Deploy targets

  • Hugging Face Space — Streamlit Space under huggingface.co/spaces/lablab-ai-amd-developer-hackathon/<your-space>. See docs/hf-space-deployment.md.
  • AMD Developer Cloud MI300X — vLLM serving Qwen 2.5 14B (or 32B). See docs/qwen-vllm-deployment.md and infra/vllm/README.md.

7. Pitch positioning

When writing project descriptions, the README, video, or social posts:

  • Beyond simple RAG — multi-agent platform with 14 deterministic checks + an LLM ensemble. The 5-tool chat is agentic, not retrieval-only.
  • Track 1 (AI Agents & Agentic Workflows) is the target track.
  • Cross-track: Build in Public is in scope (AMD GPU prize).
  • HF Special Prize is in scope (Reachy Mini robot — like-vote driven).

8. Glossary (HU → EN field names)

The full per-field rename map is in pwc-ai-verseny/document-intelligence-agentic-langgraph-amd/ATIRASI_TERV.md sections 32 (field names) and 33 (severity literals). Keep that file open when editing extraction schemas, domain checks, or anything that touches the Risk Pydantic.


9. Common pitfalls

  • Severity literals: always "high" | "medium" | "low" | "info" — never "magas" | "kozepes" | "alacsony". Many _normalize_severity() helpers map HU → EN if legacy data sneaks in, but new code emits EN.
  • Risk fields: description, severity, rationale, kind, regulation, affected_document, source_check_id. NOT leiras / sulyossag / indoklas / tipus / jogszabaly / erinto_dokumentum / forras_check_id.
  • Doc types: "invoice" | "delivery_note" | "purchase_order" | "contract" | "financial_report" | "other".
  • _quotes alias (not _idezetek) — both in JSON schemas and Pydantic models.
  • Multilingual fallback: read-only in classifiers and regex filters; never emit HU in new code.
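A _normalize_severity()-style helper, as referenced in the severity pitfall above, can be sketched like this (illustrative only; the real helpers live alongside the domain checks):

```python
# Legacy Hungarian severity literals mapped to the canonical EN set.
_HU_TO_EN = {"magas": "high", "kozepes": "medium", "alacsony": "low"}
_VALID = {"high", "medium", "low", "info"}

def normalize_severity(value: str) -> str:
    """Map legacy HU severity literals to EN; reject anything unknown."""
    v = value.strip().lower()
    v = _HU_TO_EN.get(v, v)
    if v not in _VALID:
        raise ValueError(f"Unknown severity literal: {value!r}")
    return v
```

New code should emit the EN literals directly; the mapping exists only so legacy data does not crash the Risk Pydantic.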