AMD Developer Hackathon × lablab.ai · May 2026
PaperHawk hero

PaperHawk

Multi-agent document intelligence on AMD Instinct MI300X.
Built by engineers who ship.

Vince Nándorfi Tamás Vitai Gábor Murcsik
Team CsimpiCsirkek · MIT
The Problem

RAG retrieves.
Audit finds.

Today's RAG chatbots can do the first. They cannot do the second.

What RAG does well

Chunk a document. Embed the chunks. Retrieve top-K passages. Generate an answer with the retrieved context.

Great for FAQ chatbots. Great for Q&A on a single document.

What auditors actually need

"Does the supplier in Invoice #7 match the vendor in PO #3? Is the VAT rate consistent across the package? Any change-of-control clauses? Sanctions hits?"

These questions live in the relationship between documents — not in any single chunk.

What We Built

A multi-agent system.
Not a retrieval pipeline.

LangGraph 0.6-native. Production-shaped. Open source under MIT.

4
Compiled
graphs
6
Reusable
subgraphs
14
Deterministic
domain checks
5+1
Anti-halluc
layers
5
Agentic
chat tools

Send-API parallelism · AsyncSqliteSaver checkpointer · configurable_alternatives provider stack (vLLM / Ollama / dummy) · multi-agent DD assistant with 4 specialists + supervisor + synthesizer · Streamlit 5-tab UI · 61 tests passing in CI without an LLM.

The Pipeline

Five steps. End-to-end.

Every step is a typed Pydantic-state node. Every LLM call has structured output.

1
Ingest
PDF · DOCX · image. Vision-first OCR fallback for scanned pages.
2
Classify
6-way doc-type classifier. ISA 500 evidence-quality score.
3
Extract
Pydantic schema per doc-type. _quotes + _confidence per field.
4
Cross-ref
3-way matching. Package-level analyzer. DD multi-agent.
5
Risk + Report
14 checks (parallel Send) · LLM ensemble · 3-layer filter · DOCX export.

On AMD MI300X with Qwen 2.5 14B: 30–90 seconds end-to-end per package.

Beyond LLMs · Deterministic Reasoning

Fourteen rules. In Python.

Every check is a typed Protocol, not a prompt. Run in parallel via the LangGraph Send API.

Tier A — Audit · 6 checks

  • ISA 500 Evidence hierarchy
  • ISA 320 Materiality threshold
  • ISA 240 Duplicate invoice detector
  • ISA 240 Rounded-amount anomaly
  • Tax-ID CDV mod-11 checksum
  • Mandatory fields Invoice completeness

Tier B — Compliance · 4 checks

  • GDPR Art. 28 Sub-processor clause
  • AML / Sanctions EU + OFAC fuzzy match
  • M&A red flag Change-of-control · auto-renewal
  • Disproportionality Penalty-vs-value ratio

Tier C — Standards · 4 checks

  • Incoterms 2020 11-rule recognizer
  • IFRS / GAAP Goodwill + lease anomaly
  • Math validation Net + VAT + gross
  • Contract completeness 6-key-clause check

Jurisdiction-aware: locale-specific rules trigger only on locale-tagged inputs. Universal rules run everywhere.

Trust by Design

Anti-halluc 5+1. DD multi-agent.

5+1 layers, every output
1
temperature=0 on every LLM call
2
_quotes verbatim source citation
3
_confidence per extracted field
4
Plausibility validators (math · dates · ranges)
5
3-layer LLM-risk filter chain
+1
Quote validator: drops claims whose quotes aren't in the doc
DD supervisor pattern
Audit specialist
Legal specialist
Compliance specialist
Financial specialist
Supervisor — routing & coordination
Synthesizer → Executive Summary

Four specialists read the same package independently. The supervisor coordinates routing. The synthesizer writes a 3-paragraph executive brief with cited red flags.

The Stack

Qwen on AMD MI300X via vLLM.

192 GB HBM3. ROCm-native. Open-source models, end-to-end.

Streamlit · 5-tab UI
Upload · Results · Chat · DD · Report
LangGraph 0.6 orchestration
4 graphs · 6 subgraphs · Send API · AsyncSqliteSaver
Qwen 2.5 14B Instruct (open source)
tool-calling · structured-output · multilingual
vLLM continuous batching
--api-key · --max-model-len 32768 · OpenAI-compatible
AMD Instinct MI300X · ROCm
192 GB HBM3 · BF16 / FP8 · AMD Developer Cloud
Hugging Face Spaces deploy
lablab-ai-amd-developer-hackathon · Streamlit SDK
See It In Action

Three one-click demos.

Bundled in the repo. Drivable from the Streamlit Upload tab in 30 seconds.

Audit Demo

Three invoices from the same supplier. The March one is 50% pricier than January and February.

→ ISA 240 over-billing pattern flagged with cited line items.

DD Demo

NDA + service agreement + amendment in an acquisition scenario.

→ Hidden change-of-control + auto-renewal red flags.

Compliance Demo

Two contracts; one is missing GDPR Article 28 sub-processor language.

→ Domain check #8 detects the gap with regulatory citation.
On AMD MI300X with Qwen 2.5 14B Instruct: 30–90 seconds per package · end-to-end · with citations.
Open · Reproducible · Public

Built for builders.

MIT licensed. Reproducible from a clean clone. No closed weights, no proprietary extensions.

/ 01

Open source · MIT

Public GitHub repo. No "training data not included" footnotes. Clone it, run it, fork it. The whole codebase is yours to read.

/ 02

Reproducible

Same stack from laptop to MI300X. infra/vllm/Dockerfile + serve.sh + requirements.txt. One command, one container.

/ 03

Battle-tested

61 tests passing in CI without any LLM. Deterministic dummy provider for CI; vLLM and Ollama for everything else.

github.com/nandorfivince/paperhawk | HF Space: lablab-ai-amd-developer-hackathon/paperhawk | License: MIT
The Team

Three engineers.
One shipped product.

We've shipped together for nearly a decade. PaperHawk is what happens when domain knowledge, engineering rigor, and product instinct meet on the same codebase.

Lead · LangGraph · AMD Adaptation
Vince Nándorfi
System architecture, domain research, ROCm/vLLM adaptation, testing. PaperHawk's blueprint and the AMD-edition rewrite.
Engineering · DevOps
Tamás Vitai
Senior++ engineer. Implementation, infrastructure, integration testing. Where the code meets the runtime.
Engineering · Algorithms
Gábor Murcsik
Engineering rigor. Algorithmic precision. Senior systems thinking, sharpened over years of complex production builds.

Beyond simple RAG. Built to ship.