# Architecture
A LangGraph-native Document Intelligence platform. This document goes beyond
the README — it covers design decisions, the subgraph hierarchy, state
design, and the anti-hallucination stack.
## 1. High-level architecture
### 4 compiled LangGraph artifacts
The system is organized around four graphs sharing a common `AsyncSqliteSaver`
checkpointer:
| # | Graph | Entry point | When |
|---|-------|-------------|------|
| 1 | `pipeline_graph` | `app.run_pipeline(files)` | on upload |
| 2 | `chat_graph` | `app.ask(question)` | chat tab |
| 3 | `dd_graph` | `app.dd_report(thread_id)` | DD tab button |
| 4 | `package_insights_graph` | `app.package_insights(thread_id, pkg_type)` | demo button |
Chat tools read from the persisted pipeline state — they do not re-read
files. They access the in-memory `ChatToolContext`, which holds the
HybridStore and a documents snapshot.
### Pipeline graph topology
```
START
→ start_timer
→ dispatch_ingest (Send API: per-doc fan-out)
→ ingest_per_doc (PDF/DOCX/PNG/TXT loader subgraph)
→ ingest_join (fan-in)
→ dispatch_classify (Send API)
→ classify_per_doc (regex/keyword classifier in dummy mode;
vision-aware in vLLM mode)
→ classify_join
→ dispatch_extract (Send API)
→ extract_per_doc (regex extractor in dummy mode +
flatten_universal; structured LLM in vLLM mode)
→ extract_join
  → quote_validator (anti-hallucination layer #6)
→ dispatch_rag_index (Send API)
→ rag_index_per_doc (chunker + batched embed + Chroma+BM25 upsert)
→ rag_join
→ compare_node (three-way matching, sync)
→ risk_subgraph (basic + 14 domain × Send + plausibility +
LLM ensemble + duplicate)
→ finish_timer
→ report_node (10-section JSON structure)
→ END
```
The per-doc Send fan-out yields a 5–8× speedup in a CPU-bound environment.
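The dispatch pattern can be sketched as follows. Node and field names are taken from the topology above, but `Send` is stubbed as a named tuple here so the shape is visible without LangGraph installed (the real code returns `langgraph.types.Send` objects):

```python
from typing import NamedTuple

class Send(NamedTuple):
    # Stand-in for langgraph.types.Send: (target node, per-branch input)
    node: str
    arg: dict

def dispatch_ingest(state: dict) -> list[Send]:
    # One Send per uploaded file; LangGraph runs ingest_per_doc once per
    # Send, concurrently, then ingest_join fans the results back in.
    return [
        Send("ingest_per_doc", {"file_name": name, "raw": data})
        for name, data in state["files"]
    ]

sends = dispatch_ingest({"files": [("a.pdf", b"%PDF"), ("b.docx", b"PK")]})
```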
### Risk subgraph topology
```
risk_subgraph (input: PipelineState):
→ basic_risk_dispatch (Send: per-doc basic risk)
→ basic_risk / noop_basic
→ domain_dispatch_node (Send: per-doc × per-applicable-check, ~30 parallel)
→ apply_domain_check
→ [if llm provided] llm_risk_dispatch (Send: per-doc LLM risk + 3-filter chain)
→ llm_risk_per_doc / noop_llm
→ plausibility_dispatch (Send: per-doc plausibility)
→ plausibility / noop_plaus
→ evidence_score_node (per-doc info)
→ duplicate_detector_node (package-level, sync, ISA 240)
END
```
Inside `llm_risk_per_doc`, the LLM call is followed by the 3-filter
anti-hallucination chain (layer 5 of the 5+1 stack):
`llm_risk → filter_llm_risks → drop_business_normal → drop_repeats`.
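The chain is ordinary function composition over a risk list. A sketch with stubbed filters — the real gates are richer, and the thresholds here are illustrative only:

```python
from functools import reduce

def chain(*filters):
    """Apply each filter in order; each takes and returns a list of risks."""
    def run(risks):
        return reduce(lambda rs, f: f(rs), filters, risks)
    return run

# Hypothetical stand-ins for the real filters:
def filter_llm_risks(risks):
    # formal gate (sketch): keep risks whose description has >= 5 words
    return [r for r in risks if len(r["description"].split()) >= 5]

def drop_business_normal(risks):
    # semantic gate (sketch): drop a known false-positive pattern
    return [r for r in risks
            if "standard payment terms" not in r["description"].lower()]

def drop_repeats(risks):
    # textual dedup, stubbed as exact match (the real check uses a 70%
    # similarity threshold against basic risks)
    seen, out = set(), []
    for r in risks:
        key = r["description"].lower()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

llm_risk_filter = chain(filter_llm_risks, drop_business_normal, drop_repeats)
```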
### DD multi-agent supervisor graph
```
dd_graph:
START
→ contract_filter_node (keep only contract-type docs)
→ per_contract_summary_node (Python-deterministic per-contract DDContractSummary)
→ supervisor_node (LLM router or heuristic; Command(goto=...))
├─ → audit_specialist (pricing anomalies, overcharging)
├─ → legal_specialist (red flags, change-of-control, non-compete)
├─ → compliance_specialist (GDPR, AML, data protection)
└─ → financial_specialist (monthly obligations, expirations)
↺ (loops back to supervisor up to dd_supervisor_max_iterations)
→ dd_synthesizer (one LLM call: executive_summary +
top_red_flags + per-contract risk_level rating)
END
```
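The heuristic fallback router can be sketched like this. Specialist names come from the graph above; the real supervisor returns `Command(goto=...)` and may use an LLM rather than keyword matching, so the routing conditions below are purely illustrative:

```python
def route(summary: str, iteration: int, max_iterations: int = 3) -> str:
    # Cap reached: hand off to the synthesizer and exit the loop.
    if iteration >= max_iterations:
        return "dd_synthesizer"
    text = summary.lower()
    if "gdpr" in text or "aml" in text:
        return "compliance_specialist"
    if "change of control" in text or "non-compete" in text:
        return "legal_specialist"
    if "overcharg" in text or "pricing" in text:
        return "audit_specialist"
    # Default: monthly obligations / expirations.
    return "financial_specialist"
```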
### Package insights graph
A simple 1-LLM-call graph: ingests the full document package and produces
cross-doc findings using a perspective-driven prompt
(`audit | dd | compliance | general`).
## 2. State design
### `PipelineState` (TypedDict)
Read-mostly fields with **reducer-driven Send fan-in**:
- `files: list[tuple[str, bytes]]` — raw upload
- `documents: Annotated[list[ProcessedDocument], merge_doc_results]` —
per-doc field-level merge keyed by `file_name`
- `risks: Annotated[list[Risk], merge_risks]` — dedup by description
- `comparison: ComparisonReport | None`
- `report: dict`
- `package_insights: PackageInsights | None`
- `dd_report: DDPortfolioReport | None`
- `started_at`, `finished_at`, `processing_seconds`
- `progress_events: Annotated[list[str], add]` — Streamlit progress feed
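A sketch of the reducer-driven fan-in, with `merge_doc_results` stubbed over plain dicts. The real reducer operates on `ProcessedDocument`; the merge policy assumed here (later non-None fields win, keyed by `file_name`) is an illustration of the idea, not the project's exact code:

```python
from operator import add
from typing import Annotated, TypedDict

def merge_doc_results(left: list[dict], right: list[dict]) -> list[dict]:
    """Field-level merge keyed by file_name: parallel branches each return
    a partial document; non-None fields from the right side win."""
    merged = {d["file_name"]: dict(d) for d in left}
    for d in right:
        tgt = merged.setdefault(d["file_name"], {})
        tgt.update({k: v for k, v in d.items() if v is not None})
    return list(merged.values())

class PipelineState(TypedDict, total=False):
    files: list[tuple[str, bytes]]
    # LangGraph calls the reducer to fan Send results back into one list.
    documents: Annotated[list[dict], merge_doc_results]
    progress_events: Annotated[list[str], add]
```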
### `Risk` (Pydantic)
The single risk type used everywhere:
- `description: str`
- `severity: str` (`"high" | "medium" | "low" | "info"`)
- `rationale: str`
- `kind: str` (`"validation" | "domain_rule" | "plausibility" | "llm_analysis" | "cross_check"`)
- `regulation: str | None` (e.g. `"HU VAT Act §169"`, `"ISA 240"`, `"GDPR Article 28"`)
- `affected_document: str | None`
- `source_check_id: str | None`
## 3. Anti-hallucination stack (5+1 layers)
1. **`temperature=0`** — greedy decoding, so every LLM call is near-deterministic.
2. **`_quotes` schema field** — verbatim source citations.
3. **`_confidence` schema field** — per-field reliability (high|medium|low).
4. **`validate_plausibility()`** — Python deterministic plausibility checks
(negative VAT, non-standard rates, future dates, etc.).
5. **3-filter LLM risk pipeline** —
`filter_llm_risks` (formal: ≥5 words, ≥2 domain terms, ≥1 concrete fact)
→ `drop_business_normal_risks` (semantic: cross-check vs extracted_data,
6 known false-positive patterns)
→ `drop_repeats_of_basic` (textual dedup vs basic risks, 70% threshold).
6. **Quote validator** — final cross-check that every `_quotes` entry
actually appears in the source `full_text` (whitespace + diacritic +
case normalized). If invalid, downgrades confidence.
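The normalization the quote validator is described as applying (whitespace collapse, diacritic stripping, case folding) can be sketched like this — an illustration, not the project's actual helper:

```python
import unicodedata

def _normalize(s: str) -> str:
    # Collapse runs of whitespace, strip combining diacritics, casefold.
    s = " ".join(s.split())
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    return s.casefold()

def quote_appears(quote: str, full_text: str) -> bool:
    # A _quotes entry is valid only if it survives in the source text.
    return _normalize(quote) in _normalize(full_text)
```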
## 4. Domain checks (14 deterministic rules)
| # | check_id | Regulation | HU-specific? | Applies to |
|---|----------|-----------|--------------|------------|
| 01 | `check_01_invoice_mandatory` | HU VAT Act §169 | yes | invoice |
| 02 | `check_02_tax_cdv` | HU Tax Procedure Act §22 mod-11 | yes | invoice + contract + ... |
| 03 | `check_03_contract_completeness` | Universal contract completeness | no | contract |
| 04 | `check_04_proportionality` | Universal contract proportionality (>31.7%) | no | contract |
| 05 | `check_05_rounded_amounts` | ISA 240 (Journal of Accountancy 2018) | no | invoice |
| 06 | `check_06_evidence_score` | ISA 500 | no | (separate entry, info-only) |
| 07 | `check_07_materiality` | ISA 320 | no | invoice + contract + financial_report |
| 08 | `check_08_gdpr_28` | GDPR Article 28 | no (EU) | contract |
| 09 | `check_09_dd_red_flags` | M&A DD best practice | no | contract |
| 10 | `check_10_incoterms` | Incoterms 2020 | no | contract |
| 11 | `check_11_ifrs_har` | IFRS / national GAAP comparison | no | financial_report |
| 12 | `check_12_duplicate_invoice` | ISA 240 (duplicate invoice) | no | (separate entry, package-level) |
| 13 | `check_13_aml_sanctions` | AML / Sanctions screening | no | invoice + contract + ... |
| 14 | `check_14_contract_dates` | Contract date best practice | no | contract |
The dispatch in `domain_dispatch_node` skips `check_06` and `check_12` (they
have separate entry points) and filters out `is_hu_specific=True` checks for
non-HU documents.
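The applicability filter can be sketched as follows (check-table rows abbreviated; field names are assumptions, not the project's actual schema):

```python
# Checks with their own entry points are excluded from the fan-out.
SEPARATE_ENTRY = {"check_06_evidence_score", "check_12_duplicate_invoice"}

def applicable_checks(checks: list[dict], doc_type: str,
                      is_hu_doc: bool) -> list[str]:
    out = []
    for c in checks:
        if c["check_id"] in SEPARATE_ENTRY:
            continue                      # dispatched elsewhere
        if c["is_hu_specific"] and not is_hu_doc:
            continue                      # HU-only rule, non-HU document
        if doc_type in c["applies_to"]:
            out.append(c["check_id"])
    return out
```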
## 5. Provider system
Three providers via `configurable_alternatives`:
- **`vllm`** — `ChatOpenAI` with `base_url=VLLM_BASE_URL` pointing at the
AMD MI300X vLLM endpoint. Production default.
- **`ollama`** — `ChatOllama` with a local Ollama daemon (Qwen 2.5 7B
Instruct). Development fallback.
- **`dummy`** — `DummyChatModel` (deterministic stub, no network).
  Used for CI, evals, and load tests.
Provider selection is **runtime-switchable** without restart:
```python
graph.invoke(state, config={"configurable": {"llm_profile": "dummy"}})
```
## 6. Embedding
`BAAI/bge-m3` (2.27 GB, 1024 dim, multilingual) by default.
Sentence-transformers loads it on first call via `@lru_cache`.
Pre-downloaded at Docker build time so runtime has no network call.
## 7. Hybrid retrieval (Chroma + BM25)
`store/hybrid_store.py` runs vector search and BM25 in parallel and merges
with Reciprocal Rank Fusion (RRF). The chunker uses natural break points
(paragraph + sentence boundaries), tuned to ~15K-char chunks with 500-char
overlap.
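RRF itself is only a few lines. A generic sketch — `k=60` is the conventional constant from the RRF literature, not necessarily what `hybrid_store.py` uses:

```python
def rrf_merge(vector_hits: list[str], bm25_hits: list[str],
              k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; documents found by both rankers rise.
    return sorted(scores, key=scores.get, reverse=True)
```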
## 8. Async-first runtime
LangGraph 0.6 is async-first. The Streamlit app runs the entire async layer
on a long-lived background event loop (`app/async_runtime.py`'s `AsyncRuntime`
singleton). This keeps the ChromaDB connection, the Anthropic / OpenAI HTTP
session, and the `AsyncSqliteSaver` SQLite pool persistent across user
interactions — they do not rebuild per request.
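The general pattern looks like this — a sketch of the technique, not the actual `app/async_runtime.py` code:

```python
import asyncio
import threading

class AsyncRuntime:
    """Long-lived background event loop: coroutines submitted from
    Streamlit's synchronous script thread all run on one loop, so
    loop-bound resources (HTTP sessions, DB pools) survive reruns."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        t = threading.Thread(target=self._loop.run_forever, daemon=True)
        t.start()

    def run(self, coro):
        # Thread-safe submission; blocks the caller until the result is ready.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
```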
## 9. Multilingual support
The codebase is English-first but multilingual-tolerant:
- The classifier matches HU/EN/DE keyword patterns.
- Risk filters tolerate HU/DE business terms.
- The OCR layer keeps `eng + hun + deu` as Tesseract languages.
- Demo data may include mixed-language documents.
The output (UI, exec summary, DOCX report) is **always English**.