Spaces:

Vincsipe
/

paperhawk

Running

App Files Files Community

paperhawk / docs /SUBMISSION.md

Nándorfi Vince

Sync documentation overhaul from main (markdown only, LFS history preserved)

3385e0e 2 days ago

preview code

raw

history blame contribute delete

21.2 kB

	# PaperHawk — Hackathon Submission Brief

	> One-pager for the AMD Developer Hackathon × lablab.ai (May 2026) submission form.
	> Every section below is ready to paste directly into the lablab.ai project page.

	---

	## Project Title

	PaperHawk

	---

	## Short Description

	> Multi-agent document intelligence that catches what RAG misses. 14 deterministic domain checks, 5+1 anti-hallucination layers, and a 5-tool agentic chat — running Qwen 2.5 on AMD Instinct MI300X via vLLM. Open source, MIT licensed.

	(247 characters)

	---

	## Long Description (Submission Form — 600-2000 char limit, copy-paste-ready)

	> Use this version when filing the lablab.ai Submission Form Long Description field. Compact, all key points covered (problem, solution, target audience, USP, performance, market, future), exactly within the 600-2000 character envelope. Char count: ~1880.

	```
	The Problem
	Audit, legal due diligence, tax compliance, and M&A rely on humans reading dozens of documents looking for errors and red flags. A senior auditor needs ~8 hours per 50-page package. ChatGPT/Copilot/Harvey handle one document at a time, hallucinate citations, and lack jurisdiction-specific compliance knowledge.

	Our Solution: PaperHawk
	PaperHawk is an agentic multi-document intelligence platform processing 3-50 PDFs simultaneously, detecting cross-document inconsistencies humans miss. It combines:
	- 14 deterministic statutory rules (HU VAT Act §169, ISA 240/320/500, GDPR Art. 28, AML, Ptk. 6:98, Art. 22) hand-coded in Python
	- 6-layer anti-hallucination stack (temperature=0, source quotes, confidence scores, plausibility, LLM-risk filters, quote validator)
	- Multi-agent LangGraph orchestration (4 graphs + 6 subgraphs, 5-tool agentic chat)
	- Cross-document red flag detection (e.g. 57.5% price drift across 3 invoices auto-detected)

	Target Audience
	Auditors, lawyers, tax advisors, DD analysts, compliance officers, CFOs, forensic accountants, banking risk teams. EU + Hungarian focus initially.

	Why We Win (vs Harvey, ChatPwC, OWL, Copilot)
	These tools handle ONE document well. We handle MANY together — three-way matching, cross-doc consistency, package-level red flags. Plus jurisdiction-specific compliance rules hard-coded, not prompt-engineered. Open-source MIT, self-hostable on AMD MI300X.

	Performance
	23.3 sec for 3-document audit (61.7x faster than manual). Qwen 2.5 14B Instruct on AMD MI300X via vLLM (307 t/s prompt, 252 t/s generation, 30.4% prefix cache hit rate).

	Market & Future
	EU professional services market ~$280B TAM, document workflows ~$45B SAM, HU/CEE audit beachhead ~$2B SOM. Roadmap: NAV eAFA integration, fraud detection (Benford's Law), partner risk scoring, human-in-the-loop M2M validation. SaaS revenue ($500-2k/seat/month) + on-prem enterprise for Big Four.
	```

	---

	## Extended Reference Material — Long Description Source (NOT for Submission Form)

	> The 10-section detailed write-up below is the source material for the demo video voiceover, the slide deck (`docs/slides/PaperHawk_Slides.pdf`), and the technical walkthrough README. Do not paste this into the Submission Form — it would exceed the 2000-char limit several times over. Keep it here as the canonical "what we built" reference.

	### The Problem

	RAG retrieves passages. Audit finds inconsistencies. Today's RAG chatbots can't do the second.

	When someone opens a folder of 25 invoices, three contracts, two purchase orders, and a financial report, they don't ask a chatbot to summarize the contract. They ask: "Does the supplier in Invoice #7 match the vendor in PO #3? Is the VAT rate consistent across the package? Is there a hidden change-of-control clause? Is the math on the gross total correct? Are any of these counterparties on the EU/OFAC sanctions list?"

	These are not retrieval questions. They are reasoning, validation, and cross-reference questions over multiple typed documents. A standard chunk-embed-retrieve-generate pipeline cannot answer them, because the question is not contained in any single chunk. It lives in the relationship between documents.

	PaperHawk is built specifically for this gap.

	### What We Built

	PaperHawk is a LangGraph 0.6-native system with 4 compiled graphs (pipeline, chat, DD assistant, package insights) wired together with Send-API parallelism, an `AsyncSqliteSaver` checkpointer, and a `configurable_alternatives` provider that swaps cleanly between vLLM (production), Ollama (local dev), and a deterministic dummy (CI). It is not a single-agent retrieval pipeline.

	Concretely:

	- 6 reusable subgraphs for ingest, classification, extraction, risk dispatch, LLM risk ensemble, and chat tool routing
	- 14 deterministic domain checks wired into a registry — ISA 240/500/320 (audit standards), GDPR Article 28, Incoterms 2020, AML sanctions, tax-ID validation, contract completeness, materiality thresholds, and more. Every check is a Python `Protocol` implementation, not an LLM prompt.
	- 5+1 anti-hallucination layers: `temperature=0`, a `_quotes` field for verbatim source citation, `_confidence` per extracted field, plausibility validators, a 3-layer LLM-risk filter chain, and a quote validator that drops any LLM output whose claimed source quote isn't found in the document.
	- 5-tool agentic chat (`list_documents`, `get_extraction`, `search_documents`, `compare_documents`, `validate_document`) with strict `[Source: filename.pdf]` citations validated by a post-processor — answers without provenance never reach the user.
	- Multi-agent DD assistant: 4 specialist agents (audit / legal / compliance / financial) coordinated by a supervisor and a synthesizer, in the spirit of the LangGraph supervisor cookbook but production-shaped.
	- Streamlit 5-tab UI: Upload, Results, Chat, DD Assistant, Report — drivable in 30 seconds with three pre-bundled demo packages.

	The codebase ships with 61 tests passing in CI without any LLM (the deterministic dummy provider), is MIT licensed, and is English-first with a multilingual fallback path for EN/HU/DE inputs.

	### Why AMD Instinct MI300X

	The MI300X gives us 192 GB of HBM3 memory in a single accelerator — enough headroom to host Qwen 2.5 14B Instruct in BF16 with comfortable KV-cache space for our long agentic conversations. The DD supervisor plus four specialists in one session easily exceeds 32k tokens of context, and the MI300X handles it without paging.

	vLLM's continuous batching on ROCm lets the Streamlit UI fire concurrent requests during a multi-document upload without queueing artifacts. The FP8 / BF16 paths supported by the MI300X memory bandwidth open a clean upgrade route to Qwen 2.5 32B for finals night.

	We're using the AMD Developer Cloud — `infra/vllm/Dockerfile` and `infra/vllm/serve.sh` are committed in the repo and start vLLM with `--api-key`, `--max-model-len 32768`, and a configurable model tag. The whole inference stack is containerized; nothing is hand-rolled on the GPU node.

	### Why Qwen 2.5 Instruct

	Three reasons.

	First, strong tool calling. Qwen 2.5 14B handles our 5-tool chat router reliably; tool-routing accuracy in our integration tests is on par with the proprietary reference model we used in early development. The tool-call JSON is well-formed, parameters are typed correctly, and unnecessary tool calls are rare.

	Second, structured output that holds. `with_structured_output` returns valid Pydantic v2 JSON every time in our extraction subgraph, including the nested `_quotes` and `_confidence` fields. This is where many smaller open-source models fail under load — Qwen 2.5 doesn't.

	Third, multilingual fluency. Our pipeline often reads Hungarian, German, and English documents in the same package, and Qwen handles cross-lingual extraction without dropping accuracy. We don't fine-tune; we pull `Qwen/Qwen2.5-14B-Instruct` from Hugging Face directly into the vLLM container — clean, reproducible, and rerunnable by anyone.

	### The Pipeline (5-Step End-to-End)

	1. Ingest — PDF, DOCX, and image inputs go through three loaders. Scanned PDFs hit a vision-first fallback (the LLM reads the rendered page directly); native PDFs use PyMuPDF + pdfplumber for table-aware extraction; DOCX is parsed natively.
	2. Classify — A 6-way doc-type classifier (`invoice`, `delivery_note`, `purchase_order`, `contract`, `financial_report`, `other`) with structured output, calibrated for ISA 500 evidence-quality scoring.
	3. Extract — Per doc-type Pydantic schema, with a universal extraction subgraph as a fallback for unknown types. Every extracted field carries its own `_quotes` and `_confidence` — anti-hallucination is built into the type system, not a post-hoc check.
	4. Cross-reference — Three-way matching (invoice + delivery note + purchase order) for audit packages; multi-agent synthesis for DD packages; package-level analyzers for duplicate-invoice detection (ISA 240) and pricing anomalies.
	5. Risk + Report — Plausibility checks + 14 domain checks (deterministic, parallel via Send fan-out) + LLM risk ensemble + 3-layer filter that drops repeats, business-normal flags, and unsupported claims. Final output: a ranked risk list with severity, regulation source, and source citations; a downloadable DOCX report; structured JSON for API consumers.

	### Anti-Hallucination Is Non-Negotiable

	The system is designed so the LLM cannot lie about a document and have the lie pass through.

	Every LLM-generated extraction includes a `_quotes` array with the verbatim text the model cites as source. A post-processor scans each quote against the document body. If the quote isn't there, the field is rejected — period. The 3-layer LLM-risk filter rejects any risk claim whose quoted evidence isn't in the package, repeats a finding from the deterministic domain checks, or describes a normal business condition.

	This isn't a guardrail layer slapped on top — it's the trust contract between the model and the user, and it runs on every output. The `validation/` package is one of the most-edited folders in the repo precisely because we treat it as a first-class concern, not an afterthought.

	### Demo Packages

	Three pre-built scenarios are bundled in `test_data/demo_packages/`. Each is a one-click demo from the Upload tab:

	- Audit Demo — Three invoices from the same supplier; the March one is 50% pricier than January and February. The package-level analyzer flags it as an over-billing pattern, and the chat answers "Why is the March invoice more expensive?" with cited line items.
	- DD Demo — An NDA, a service agreement, and an amendment in an acquisition scenario. The DD assistant flags a hidden change-of-control trigger and an automatic-renewal red flag, and the synthesizer writes an executive summary in three paragraphs.
	- Compliance Demo — Two contracts; one is missing GDPR Article 28 sub-processor language. Domain check #8 detects it, and the report includes the exact regulatory citation.

	End-to-end demo time on AMD MI300X: 30–90 seconds per package.

	### Track 1 + Build in Public + Hugging Face Special Prize

	Track 1 — AI Agents & Agentic Workflows is our primary submission. The track brief asks for projects that "move beyond simple RAG to build sophisticated AI agentic systems and workloads." PaperHawk fits the brief: 4 compiled graphs, 6 subgraphs, multi-agent DD orchestration, 5-tool agentic chat, and a registry-based deterministic check fabric. None of this is retrieval-only. The chat is an agent; the DD assistant is a multi-agent system; the pipeline is a typed-state orchestration.

	Ship It + Build in Public is a natural cross-track fit. The repo is MIT licensed and public on GitHub. We're publishing a technical walkthrough and at least two updates on X / LinkedIn — tagging `@AIatAMD` and `@lablab` — covering two design choices that don't usually appear in hackathon RAG demos: the LangGraph Send-API parallelism for the deterministic check fan-out, and the post-hoc citation validator for the chat tool outputs.

	Hugging Face Special Prize: deployed as a Streamlit Space under the `lablab-ai-amd-developer-hackathon` organization. Public, runnable in the browser, no signup required. The Space carries the same `paperhawk.jpeg` cover and points at our vLLM endpoint; visitors can drive the three demo packages from the front page.

	One codebase, one MIT license, three prize pools.

	### Tech Stack

	\| Layer \| Choice \|
	\|---\|---\|
	\| Orchestration \| LangGraph 0.6 (4 compiled graphs, 6 subgraphs, AsyncSqliteSaver) \|
	\| LLM \| Qwen 2.5 14B Instruct on vLLM (AMD Instinct MI300X, ROCm) \|
	\| Embedding \| BAAI/bge-m3 (multilingual, 1024-dim, sentence-transformers) \|
	\| Retrieval \| ChromaDB + BM25 hybrid with Reciprocal Rank Fusion \|
	\| Schemas \| Pydantic v2 with field aliases for the `_quotes` JSON contract \|
	\| UI \| Streamlit 5-tab + async runtime + long-lived background event loop \|
	\| Deploy \| Hugging Face Spaces (Streamlit SDK) + AMD Developer Cloud (vLLM container) \|
	\| Testing \| pytest 8 (61 PASS in CI without any LLM), Playwright UI smoke tests \|
	\| License \| MIT \|

	### Built By

	Team CsimpiCsirkek:

	- Vince Nándorfi — Lead, LangGraph architecture, AMD adaptation
	- Tamás Vitai
	- Gábor Murcsik

	---

	## Technology & Category Tags

	`agentic-ai` · `multi-agent` · `langgraph` · `qwen` · `amd-mi300x` · `vllm` · `rocm` · `huggingface-spaces` · `document-intelligence` · `streamlit` · `python` · `mit-license`

	---

	## Tracks Targeted

	\| Track / Prize \| Status \| Rationale \|
	\|---\|---\|---\|
	\| Track 1 — AI Agents & Agentic Workflows \| Primary submission \| Multi-agent system, 4 compiled graphs, 6 subgraphs, 5-tool agentic chat — well past the "simple RAG" line \|
	\| Ship It + Build in Public \| Cross-track \| MIT-licensed public GitHub repo + technical walkthrough + ≥2 social posts tagging `@AIatAMD` and `@lablab` \|
	\| Hugging Face Special Prize \| Special category \| Streamlit Space published under the `lablab-ai-amd-developer-hackathon` HF organization \|

	---

	## Submission Checklist

	\| Item \| Status \| Notes \|
	\|---\|---\|---\|
	\| Project Title \| DONE \| `PaperHawk` \|
	\| Short Description \| DONE \| 247 characters, A+C blend \|
	\| Long Description \| DONE \| 10 sections, builder-energy tone \|
	\| Cover Image \| DONE \| `docs/slides/01_cover.png` (1280 × 720, 16:9) \|
	\| Slide Presentation \| DONE \| `docs/slides/PaperHawk_Slides.pdf` (10 slides) \|
	\| Technology & Category Tags \| DONE \| 12 tags \|
	\| Public GitHub Repository \| DONE \| `github.com/nandorfivince/paperhawk` \|
	\| Live HF Space — `Vincsipe/paperhawk` (Plan-B) \| DONE \| Validated end-to-end 2026-05-05 \|
	\| Live HF Space — `lablab-ai-amd-developer-hackathon/paperhawk` (Plan-A) \| BLOCKED \| Org-quota issue, ticket pending \|
	\| Build-in-Public Posts \| TODO at posting time \| 4 drafts ready in `docs/social-posts/` \|
	\| Video Presentation \| TODO \| Demo walkthrough video (max 3 min) \|
	\| AMD Developer Experience Feedback \| DONE \| See section below \|

	---

	## Live Deployment Validation (2026-05-05)

	End-to-end live test of the full stack succeeded on 2026-05-05 reggel with the following measured results:

	\| Metric \| Value \|
	\|---\|---\|
	\| Audit Demo processing time (3 PDFs) \| 23.3 seconds \|
	\| Speedup vs manual auditor (24 min estimate) \| 61.7× \|
	\| vLLM cold-start from snapshot (HF cache preserved) \| ~30 seconds (vs 70 sec clean install) \|
	\| Prompt throughput \| 307 tokens/sec \|
	\| Generation throughput \| 252 tokens/sec \|
	\| Prefix cache hit rate \| 30.4% \|
	\| Cross-document red flag detected \| 57.5% price drift (78,740 → 124,016 Ft over 3 invoices) \|
	\| Anti-hallucination quote validator \| Caught 4 of 6 hallucinated citations, downgraded confidence \|
	\| Jurisdictional standards applied \| HU VAT Act §169, ISA 500, ISA 320 \|

	The full pipeline ran from a publicly-deployed Hugging Face Space (`Vincsipe/paperhawk`) through to the AMD MI300X vLLM endpoint and back, with all 14 deterministic domain checks executing and the package-level cross-doc analyzer correctly identifying the price-drift red flag without human prompting.

	Recorded outputs: 4 win-screenshots (`Screenshot from 2026-05-05 10-07-{15,22,31,37}.png`) usable in the Submission video and slides.

	---

	## AMD Developer Experience Feedback

	Our team had a generally positive experience deploying our agentic document intelligence platform on AMD's stack. Key feedback by component:

	### ROCm 7.0

	The vLLM 0.17.1 + ROCm 7.0 build was stable out of the box on the Quick Start image. Qwen 2.5 14B Instruct loaded in 17.4 sec to MI300X VRAM (27.6 GB model + 141 GB available KV cache), CUDA graph compilation took 20.5 sec, total cold-start ~70 sec. Production-grade throughput: 307 tokens/sec prompt, 252 tokens/sec generation, 30.4% prefix cache hit rate. The OpenAI-compatible REST endpoint at port 8000 worked transparently. We did not need any ROCm-specific code changes from our development setup — vLLM abstracted everything. Recommendation: keep the Quick Start vLLM image fresh; it saved us hours of setup.

	### AMD Developer Cloud (DigitalOcean-powered)

	Strengths:

	- $1.99/hour MI300X pricing is fair and predictable
	- The Quick Start vLLM image saved hours of setup (Docker + ROCm + vLLM pre-installed, JupyterLab launched on port 80)
	- 192 GB HBM3 + 141 GB available KV cache — lots of headroom for large-context multi-agent workloads
	- Snapshot-and-destroy workflow excellent for cost control: $0.32/day storage for ~96 GB snapshot, 5-10 min recreate from snapshot, HF model cache preserved inside the Docker container layer means warm restart is ~30 sec instead of cold-start 70 sec
	- Auto-destroy on credit runout (when no payment method) is a built-in safety net we appreciated
	- Free $100 promo credit makes the platform genuinely accessible to hackathon participants

	Pain points and UI improvement opportunities:

	1. Sidebar `GPU Droplets` link in the left navigation routes to the CPU Droplet flow (a clear UI bug — workaround is the homepage `Create a GPU Droplet` card or the top-right `Create` dropdown). We hit this twice in our first hour.
	2. Default region NYC1 was 'out of capacity' for MI300X plan — we had to switch to ATL1 via URL parameter (`?region=atl1`). The region selector on the GPU Droplet creation page does not appear to be exposed in the UI; we found the workaround by inspecting the URL of a successful creation. Adding region availability indicators on the GPU Plan selector would help.
	3. Reboot after `apt-get upgrade` (recommended via Security notice) does not auto-restart the `rocm` Docker container — needed `docker start rocm` manually. Worth documenting in the Quick Start onboarding.

	### AMD APIs

	We did not use the lower-level ROCm-API or AMD-specific SDKs directly. Our stack was vLLM + OpenAI-compatible REST → all hardware-specific work was abstracted away through standard Python tooling. This is actually a strength: we ran a production-grade paperhawk pipeline (originally developed against Anthropic Claude API) on AMD MI300X with zero application code changes — proving the AMD stack via vLLM is a real drop-in alternative for production AI workloads. We changed only environment variables (`LLM_PROFILE`, `VLLM_BASE_URL`, `VLLM_API_KEY`, `VLLM_MODEL`).

	### Overall verdict

	AMD MI300X via the Developer Cloud is a viable production deployment platform for agentic LLM applications. The Quick Start vLLM image is a major time-saver. The few UI bugs and capacity-region issues are minor compared to the platform's strengths. The combination of $1.99/hour MI300X pricing + snapshot-restore workflow + OpenAI-compatible vLLM endpoint makes this a credible alternative to AWS p4d/p5 or GCP A3 for inference workloads, especially at the price point.

	---

	## Submission URLs (filled at submission time)

	### Plan-A (lablab-org admin reagált) — preferred

	- GitHub repo: https://github.com/nandorfivince/paperhawk
	- Hugging Face Space (official): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/paperhawk
	- Live application URL: same as HF Space URL above
	- Slide deck: `docs/slides/PaperHawk_Slides.pdf`
	- Demo video: (uploaded at submission time)

	### Plan-B (lablab-org quota unresolved) — fallback

	- GitHub repo: https://github.com/nandorfivince/paperhawk
	- Hugging Face Space (working, parallel): https://huggingface.co/spaces/Vincsipe/paperhawk
	- Live application URL: same as HF Space URL above
	- Slide deck: `docs/slides/PaperHawk_Slides.pdf`
	- Demo video: (uploaded at submission time)

	Plan-B trade-off: HF Special Prize (Reachy Mini robot + HF PRO + $500 credits) requires the Space to be under the `lablab-ai-amd-developer-hackathon` org. If we ship under `Vincsipe/paperhawk`, we forfeit the HF Special Prize but retain qualification for the four main judging criteria (Presentation, Business Value, Application of Technology, Originality).

	---

	This document is the canonical submission brief. Paste sections directly into the lablab.ai project page when filing the submission.