nkshirsa committed on
Commit 19e7d25 · verified · 1 Parent(s): 0438ca3

Upload SYSTEM_DESIGN.md with huggingface_hub

# PhD Research OS: Complete System Design
## Version 2.0 | Post-Audit Architecture

**Date**: 2026-04-23
**Status**: DESIGN COMPLETE; ready for phased implementation
**Addresses**: all 87 blindspots from the audit
**Hardware Target**: 16-24 GB VRAM consumer GPU (RTX 4090 / RTX 3090 / A6000)

---

## 1. System Overview

```
                      PhD Research OS v2.0
                      "The Epistemic Engine"

INPUTS
  PDF Bundles | Supplements | Datasets | Code Repos | Lab Notes
    ↓
LAYER 0: STRUCTURAL INGESTION
  Marker → Nougat → GROBID | Region Classifier | Plot Digitizer
  Section-aware chunks | Bounding boxes | Quality scores
    ↓
LAYER 1: ENTITY RESOLUTION
  Ontology normalizer | Citation resolver | VoR lineage | Retraction checks
    ↓
LAYER 2: QUALIFIED EXTRACTION
  AI Model Council (parallel) | Epistemic Separation Engine
  Qualifier preservation | Statistical extraction | OOD gating
  Guidance constrained decoding | Source quotes + bboxes
    ↓
LAYER 3: CANONICALIZATION
  Embedding dedup | Canonical registry | Alias merging
  Evidence aggregation | Temporal versioning | Lineage diff
    ↓
LAYER 4: KNOWLEDGE GRAPH
  SQLite-backed graph | Typed epistemic edges | Lab lineage
  Method compatibility | Transitive constraints | Gap analysis
  Null evidence | Conflict clustering | Versioned ontology
    ↓
LAYER 5: CALIBRATED SCORING
  Code-computed confidence | 3 separate scores | Statistical gate
  Parser confidence propagation | Section modifiers | Brier monitoring
    ↓
LAYER 6: EVALUATION
  LLM-as-Judge CI/CD | Versioned golden set | Stochastic tests
  Hidden holdout | Fatigue management | Counter-metrics
    ↓
LAYER 7: PROVENANCE & REPRODUCIBILITY
  Version pinning | Output lineage | PDF.js viewer | Containers
  Security sandbox | License checking | Epistemic Embargo
    ↓
OUTPUTS
  Obsidian Vault | Courtroom UI | Gap Analysis | Decision Objects

CROSS-CUTTING (applies to every layer)
  AI Model Council | Meta-Improver | Superpowers Skills
  ECC Harness | Companion Agents | Manual Synthesis Mode
```

---

## 2. Model Architecture

### 2.1 The Two-Model Strategy

The system runs TWO models, not one. This resolves the tension between local privacy and online capability:

```
PRIMARY BRAIN (fully local; never touches the internet)

  Model:    Qwen3-8B Q4 AWQ
  VRAM:     ~5 GB weights + ~4 GB KV cache (PolarQuant)
  Total:    ~9 GB (fits a 16 GB GPU with room for batching)
  Context:  128K tokens (full paper length)
  Serving:  Ollama (simplest) or vLLM (fastest)

  Tasks:
    - Claim extraction (Layer 2)
    - Epistemic classification
    - Confidence component estimation
    - Conflict hypothesis generation
    - Query decomposition
    - Decision object generation

  Constrained decoding: Guidance engine
  Training: SFT → DPO → GRPO → ConfTuner (4-stage pipeline)
  Privacy:  ALL paper data stays local

COMPANION BRAIN (online; for non-sensitive tasks)

  Model: Claude API / GPT-4o-mini / OpenRouter
  OR:    local Qwen3-30B-A3B MoE Q4 (~6 GB, 3B active)

  Tasks:
    - Meta-Improver external scanning (arXiv, GitHub)
    - Prompt optimization A/B testing
    - Training data generation for new domains
    - Retraction/correction checking (needs internet)
    - Repository URL validation

  Privacy: NEVER sees raw paper text.
  Only receives: metadata, queries, anonymized claims.
```
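The privacy boundary between the two brains can be sketched as a small dispatcher. This is an illustrative sketch, not part of the design above: the task names and the `PRIVATE_TASKS` set are hypothetical, and the real router would sit in front of the serving layer.

```python
# Hypothetical sketch of the two-brain privacy router.
# Task names and PRIVATE_TASKS are illustrative assumptions.

PRIVATE_TASKS = {
    "claim_extraction", "epistemic_classification", "confidence_estimation",
    "conflict_hypothesis", "query_decomposition", "decision_object",
}

def route(task: str, payload: dict) -> str:
    """Return which brain is allowed to see this payload."""
    if task in PRIVATE_TASKS or payload.get("contains_paper_text", False):
        # Local Qwen3-8B: raw paper text never leaves the machine.
        return "primary"
    # Online model: metadata, queries, anonymized claims only.
    return "companion"

assert route("claim_extraction", {"contains_paper_text": True}) == "primary"
assert route("retraction_check", {"doi": "10.1234/example"}) == "companion"
```

The key property is that the check is structural (task class plus a payload flag), not a prompt instruction, so a misrouted call fails closed to the local brain.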

### 2.2 Why Qwen3-8B, Not Qwen2.5-3B

| Metric | Qwen2.5-3B | Qwen3-8B | Improvement |
|--------|------------|----------|-------------|
| AIME (math reasoning) | ~15% | ~45%+ | 3× |
| MATH-500 | ~85% | ~95%+ | +10 pts |
| JSON structural accuracy (SFT) | ~65% | ~80%+ | +15 pts |
| Context window | 32K | 128K | 4× |
| Hybrid thinking mode | No | Yes | New capability |
| VRAM at Q4 AWQ | ~2.5 GB | ~5 GB | Acceptable |

### 2.3 Alternative: Qwen3-30B-A3B MoE (The Stealth Option)

For users with 8 GB+ VRAM who want maximum quality:
- 30B total parameters, only 3B activated per token (Mixture of Experts)
- ~6 GB at Q4 quantization
- Quality equivalent to dense 14B+ models
- Apache 2.0 license
- Available: `Qwen/Qwen3-30B-A3B-Instruct-2507` (1M downloads)

### 2.4 Multimodal: Qwen3-VL-8B-Instruct

For figure/diagram processing (Layer 0):
- Same architecture as the text model, but with a vision encoder
- Available: `Qwen/Qwen3-VL-8B-Instruct` (3.9M downloads)
- AWQ 4-bit: `cyankiwi/Qwen3-VL-8B-Instruct-AWQ-4bit` (~5 GB)
- Handles: figure classification, diagram understanding, micrograph analysis
- Does NOT replace the plot digitizer for quantitative data

### 2.5 VLM for Multimodal Figures: Qwen3-VL-30B-A3B-Instruct

For maximum figure understanding with MoE efficiency:
- Available: `Qwen/Qwen3-VL-30B-A3B-Instruct` (1.5M downloads)
- AWQ: `QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ` (667K downloads)
- Only 3B active params, so it fits alongside the primary brain

---

## 3. Training Pipeline (4-Stage)

### Stage 1: SFT on Domain Data

```python
# Current implementation (train.py): KEEP, but upgrade the base model
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",  # Upgraded from Qwen2.5-3B
    args=SFTConfig(
        output_dir="./research-os-sft",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_length=4096,  # Longer, for paper sections
        assistant_only_loss=True,
        bf16=True,
        gradient_checkpointing=True,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
    ),
    train_dataset=expanded_dataset,  # 10K+ examples (up from 1,900)
    peft_config=LoraConfig(r=64, lora_alpha=16, target_modules="all-linear"),
)
trainer.train()
```

### Stage 2: DPO on Preference Pairs

```python
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

# Dataset: pairs of (correct extraction, incorrect extraction) for the same text
trainer = DPOTrainer(
    model="./research-os-sft",  # From stage 1
    args=DPOConfig(
        output_dir="./research-os-dpo",
        learning_rate=5e-7,
        num_train_epochs=1,
        max_length=4096,
        bf16=True,
        push_to_hub=True,
    ),
    train_dataset=preference_dataset,
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```

### Stage 3: GRPO with Epistemic Reward Functions

This is the critical stage that bakes JSON reliability and epistemic correctness into the model:

```python
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig
import json

# ── Reward Function 1: JSON Validity ──
def json_validity_reward(completions, **kwargs):
    """Binary reward: is the output valid JSON?"""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        try:
            json.loads(content)
            rewards.append(1.0)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
    return rewards

# ── Reward Function 2: Schema Compliance ──
REQUIRED_KEYS = {"text", "epistemic_tag", "confidence", "missing_fields", "status"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}

def schema_compliance_reward(completions, **kwargs):
    """Reward for matching the Research OS claim schema."""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        score = 0.0
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])

            for claim in claims:
                if not isinstance(claim, dict):
                    continue
                # Key presence: 0.3
                present_keys = set(claim.keys()) & REQUIRED_KEYS
                score += 0.3 * len(present_keys) / len(REQUIRED_KEYS)
                # Valid epistemic tag: 0.3
                if claim.get("epistemic_tag") in VALID_TAGS:
                    score += 0.3
                # Confidence in range: 0.2
                conf = claim.get("confidence", -1)
                if isinstance(conf, (int, float)) and 0 <= conf <= 1:
                    score += 0.2
                # Status consistency: 0.2
                missing = claim.get("missing_fields", [])
                status = claim.get("status", "")
                if (missing and status == "Incomplete") or (not missing and status == "Complete"):
                    score += 0.2

            if claims:
                score /= len(claims)
        except (json.JSONDecodeError, TypeError, AttributeError):
            score = 0.0
        rewards.append(score)
    return rewards

# ── Reward Function 3: Qualifier Preservation ──
HEDGING_WORDS = {"may", "might", "could", "suggests", "possibly", "potentially",
                 "appears", "seems", "likely", "unlikely", "not significant"}

def qualifier_preservation_reward(completions, prompts, **kwargs):
    """Reward for preserving hedging language from the source text."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        content = completion[0]["content"] if isinstance(completion, list) else completion
        prompt_text = prompt[0]["content"] if isinstance(prompt, list) else prompt

        # Find hedging words in the source
        source_hedges = {w for w in HEDGING_WORDS if w in prompt_text.lower()}
        if not source_hedges:
            rewards.append(0.5)  # Neutral if no hedging in source
            continue

        # Check whether the hedging is preserved in the extraction
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])
            claim_text = " ".join(c.get("text", "") for c in claims if isinstance(c, dict)).lower()

            preserved = sum(1 for h in source_hedges if h in claim_text)
            rewards.append(preserved / len(source_hedges))
        except (json.JSONDecodeError, TypeError, AttributeError):
            rewards.append(0.0)
    return rewards

# ── GRPO Training ──
trainer = GRPOTrainer(
    model="./research-os-dpo",  # From stage 2
    reward_funcs=[
        json_validity_reward,           # Weight: 0.3
        schema_compliance_reward,       # Weight: 0.4
        qualifier_preservation_reward,  # Weight: 0.3
    ],
    args=GRPOConfig(
        output_dir="./research-os-grpo",
        learning_rate=1e-6,
        num_generations=8,
        max_completion_length=2048,
        bf16=True,
        gradient_checkpointing=True,
        logging_steps=10,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
        reward_weights=[0.3, 0.4, 0.3],
    ),
    train_dataset=prompt_dataset,  # "prompt" column with paper excerpts
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```

### Stage 4: Calibration Fine-Tuning (ConfTuner)

After GRPO, apply ConfTuner with a tokenized Brier-score loss to fix confidence calibration. This is a specialized fine-tuning pass that targets only the confidence output tokens.

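The loss ConfTuner optimizes is, at its core, the Brier score over the model's stated confidences. A minimal illustration of the metric itself (not the token-level training pass, which operates on logits):

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome.
    0.0 is perfect; 0.25 is what always answering 0.5 scores."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# An overconfident extractor (says 0.9 but is right half the time) scores
# worse than an honest one that says 0.5 when it is genuinely 50/50.
overconfident = brier_score([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])  # ≈ 0.41
honest        = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # 0.25
assert overconfident > honest
```

This is also the monitoring statistic Layer 5 tracks ("Brier mon." in the overview), so the training objective and the production health check agree.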
---

## 4. Layer Specifications

### 4.0 Layer 0: Structural Ingestion Engine

**Purpose**: Convert PDF bundles into section-aware, bbox-annotated, quality-scored structured regions.

**Technology Stack**:

| Component | Tool | Purpose |
|-----------|------|---------|
| Layout detection | Marker (VikParuchuri/marker) | PDF → structured markdown with layout awareness |
| Math/equations | Nougat (facebookresearch/nougat) | Scientific PDFs → LaTeX equations |
| Bibliographic | GROBID | Headers, authors, citations, references |
| Region classifier | LayoutLMv3 or DocTR | Classify page regions: text, table, figure, equation |
| Plot digitizer | PlotDigitizer (algorithmic) | Quantitative plots → CSV of (x, y) coordinates |
| VLM for figures | Qwen3-VL-8B-Instruct Q4 AWQ | Semantic figure understanding |
| OCR quality | Per-span confidence scoring | Flag degraded regions |

**Output Schema** (per region):

```json
{
  "region_id": "REG_00042",
  "document_type": "main|supplement_1|supplement_2",
  "page": 5,
  "bbox": [72, 340, 540, 420],
  "region_type": "body_text|table|figure|equation|caption|header|reference|footnote",
  "section": "results",
  "subsection": "3.2_sensitivity_characterization",
  "content": {
    "text": "The LOD was 0.8 ± 0.03 fM (Table 2)",
    "markdown": "The LOD was 0.8 ± 0.03 fM ([Table 2](#table-2))",
    "parse_method": "marker",
    "parse_confidence": 0.95,
    "ocr_source": false
  },
  "cross_references": [
    {"ref_text": "Table 2", "ref_type": "table", "resolved_to": "REG_00038", "verified": true}
  ],
  "extraction_status": "extractable|low_confidence|unextractable",
  "quality_flags": [],
  "figures": {
    "detected": true,
    "figure_type": "scatter_plot|bar_chart|diagram|micrograph|schematic",
    "digitizable": true,
    "digitized_data": null
  }
}
```
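A schema this central deserves a cheap machine check before regions enter the pipeline. A minimal validator sketch (field names come from the schema above; the validator itself and its error strings are assumptions):

```python
REGION_TYPES = {"body_text", "table", "figure", "equation", "caption",
                "header", "reference", "footnote"}
EXTRACTION_STATUSES = {"extractable", "low_confidence", "unextractable"}

def validate_region(region: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the region passes."""
    errors = []
    if region.get("region_type") not in REGION_TYPES:
        errors.append("unknown region_type")
    bbox = region.get("bbox", [])
    if len(bbox) != 4 or not all(isinstance(v, (int, float)) for v in bbox):
        errors.append("bbox must be [x0, y0, x1, y1]")
    conf = region.get("content", {}).get("parse_confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        errors.append("parse_confidence must be in [0, 1]")
    if region.get("extraction_status") not in EXTRACTION_STATUSES:
        errors.append("invalid extraction_status")
    return errors

region = {"region_type": "body_text", "bbox": [72, 340, 540, 420],
          "content": {"parse_confidence": 0.95}, "extraction_status": "extractable"}
assert validate_region(region) == []
```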

**Chunking Strategy**: section-aware, NOT page-based.
1. Marker identifies section boundaries (Introduction, Methods, Results subsections)
2. Chunk by section, with a 1-paragraph overlap into the preceding and following sections
3. Tables are always kept whole (never split across chunks)
4. Figure + caption are always kept together
5. Maximum chunk size: 4096 tokens (the model context allows it)

**Paper Bundle Handling**:
```
Input: {
  "main_pdf": "path/to/paper.pdf",
  "supplements": ["path/to/supplement_1.pdf", "path/to/supplement_data.xlsx"],
  "code_repo": "https://github.com/author/repo",
  "dataset": "https://zenodo.org/record/12345"
}
```

### 4.1 Layer 1: Entity Resolution

**Purpose**: Normalize entities, resolve citations, check retractions, establish version lineage.

**Components**:

```
Entity Normalizer
├── Gene/protein names → UniProt ID
├── Chemical names → PubChem CID
├── Disease names → MeSH ID
├── Assay names → BAO ontology
├── Abbreviations → canonical form (LRU cache)
└── Custom domain ontology (user-extensible)

Citation Chain Resolver
├── In-text "[32]" → reference list → DOI
├── DOI → CrossRef metadata
├── Check: is the cited paper in the knowledge base?
├── If yes: link the claim to the original source
├── If no: flag as "citation_orphan" for potential ingestion
└── Classify: primary claim vs inherited citation

Version of Record (VoR) Lineage
├── Before ingestion: query DOI/arXiv for the version chain
├── If a preprint exists in the DB and a VoR is arriving: supersede
├── If a VoR exists and an erratum is arriving: amend specific claims
├── If retraction: invalidate ALL claims, propagate penalty
└── Store full lineage: preprint_doi → vor_doi → errata → retraction

Retraction Checker
├── CrossRef "update-to" relationship
├── Retraction Watch database (periodic sync via companion model)
└── Propagate retraction status through citation chains
```
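The first two steps of the Citation Chain Resolver can be sketched as a mapping from bracketed in-text markers to the reference list, with orphan flagging. The regex and data shapes are assumptions; the real resolver would also hit CrossRef for metadata.

```python
import re

def resolve_citations(text: str, reference_list: dict, known_dois: set) -> list:
    """Map in-text markers like "[32]" to DOIs and flag citation orphans.
    `reference_list` maps reference number -> DOI; `known_dois` is the local KB."""
    resolved = []
    for marker in re.findall(r"\[(\d+)\]", text):
        doi = reference_list.get(int(marker))
        in_kb = doi in known_dois if doi else False
        resolved.append({
            "marker": f"[{marker}]",
            "doi": doi,
            "in_knowledge_base": in_kb,
            # Anything not resolvable to a paper already in the KB is an orphan
            "flag": None if in_kb else "citation_orphan",
        })
    return resolved

out = resolve_citations("as shown previously [32] and [7]",
                        {32: "10.1000/xyz32"}, known_dois={"10.1000/xyz32"})
assert out[0]["flag"] is None
assert out[1]["flag"] == "citation_orphan"
```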

### 4.2 Layer 2: Qualified Extraction

**Purpose**: Extract claims with full epistemic qualification using the AI Model Council.

**Council Architecture** (Parallel-Then-Merge):

```
Round 1 (PARALLEL; no visibility between members):

  Query Planner    Extractor       Extractor 2      Critic
  (decompose)      (Qwen3-8B)      (if heterog.)    (adversarial)
       |                |               |               |
       ▼                ▼               ▼               ▼
  sub-queries       claims_A        claims_B        critique

Round 2 (DEBATE; see tags and reasoning, NOT confidence):
  All members see each other's epistemic tags and reasoning chains.
  Each member can revise their classification.
  Confidence scores remain HIDDEN (prevents anchoring).

Round 3 (SYNTHESIS; Chairman):
  The Chairman sees everything, including confidence.
  Applies the completeness penalty (code-enforced, not prompt-instructed).
  Resolves disagreements with documented reasoning.
  Tags each claim with council_vote_distribution.
```
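Round 1's "parallel, no visibility" property is easy to get right structurally: each member is invoked from the same input with no shared state. A sketch with a thread pool (the member functions here are stubs standing in for model calls; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_round_1(section_text: str, members: dict) -> dict:
    """Run all council members in parallel; no member sees another's output."""
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        futures = {name: pool.submit(fn, section_text) for name, fn in members.items()}
        return {name: f.result() for name, f in futures.items()}

# Stub members standing in for model invocations
members = {
    "planner":     lambda t: {"sub_queries": [t[:10]]},
    "extractor_1": lambda t: {"claims": ["claim_A"]},
    "extractor_2": lambda t: {"claims": ["claim_B"]},
    "critic":      lambda t: {"critique": "check Table 2"},
}
round_1 = run_round_1("The LOD was 0.8 fM ...", members)
assert set(round_1) == {"planner", "extractor_1", "extractor_2", "critic"}
```

Because members only ever receive `section_text`, anchoring between extractors is impossible by construction rather than by prompt discipline.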

**Epistemic Separation Engine**:

| Section | Epistemic Default | Confidence Modifier |
|---------|-------------------|---------------------|
| Results (with statistics) | Fact (if p < threshold) | 1.0 |
| Results (narrative) | Interpretation | 0.85 |
| Methods | Protocol metadata (not a claim) | N/A |
| Abstract | Interpretation (forced) | 0.7 penalty |
| Discussion | Interpretation or Hypothesis | 0.75 penalty |
| Conclusion | Cross-checked against Results | 0.8 if supported, 0.5 if not |
| Supplement | Same as main-body section rules | 1.0 (no penalty for supplement source) |

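The table above is a pure lookup plus two special cases, which keeps it in code rather than in a prompt. A sketch (values transcribed from the table; the section keys and the handling of Methods and Conclusion are assumptions about how Layer 5 would consume it):

```python
SECTION_RULES = {
    "results_statistical": {"default_tag": "Fact",           "modifier": 1.00},
    "results_narrative":   {"default_tag": "Interpretation", "modifier": 0.85},
    "abstract":            {"default_tag": "Interpretation", "modifier": 0.70},
    "discussion":          {"default_tag": "Interpretation", "modifier": 0.75},
    # Supplements inherit main-body rules: no penalty for the source itself
    "supplement":          {"default_tag": None,             "modifier": 1.00},
}

def section_modifier(section: str, supported_by_results: bool = True) -> float:
    if section == "conclusion":          # cross-checked against Results
        return 0.8 if supported_by_results else 0.5
    if section == "methods":             # protocol metadata, not a claim
        raise ValueError("Methods text yields protocol metadata, not claims")
    return SECTION_RULES[section]["modifier"]

assert section_modifier("abstract") == 0.70
assert section_modifier("conclusion", supported_by_results=False) == 0.5
```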
**Constrained Decoding** (Guidance engine):

```python
from guidance import models, gen, select

TAGS = ["Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"]

lm = models.Transformers("./research-os-grpo")  # Local fine-tuned model

output = lm + f"""
Analyze this scientific text and extract claims.

Text: {section_text}
Section: {section_name}

<reasoning>{gen('reasoning', max_tokens=500)}</reasoning>

Claims:
[
  {{
    "text": "{gen('claim_text', max_tokens=200)}",
    "epistemic_tag": "{select(TAGS, name='tag')}",
    "confidence_components": {{
      "evidence_strength": {gen('evidence_strength', regex=r'0\.[0-9][0-9]?[0-9]?')},
      "qualifiers": ["{gen('qualifiers', max_tokens=100)}"]
    }},
    "source_quote": "{gen('source_quote', max_tokens=200)}",
    "source_page": {gen('source_page', regex=r'[0-9]+')},
    "is_null_result": {select(['true', 'false'], name='is_null')},
    "is_inherited_citation": {select(['true', 'false'], name='is_inherited')}
  }}
]
"""
# output["tag"] is GUARANTEED to be one of TAGS
# output["is_null"] is GUARANTEED to be "true" or "false"
```

**Claim Schema v2** (expanded from v1):

```json
{
  "claim_id": "CLM_00042",
  "text": "The LOD was 0.8 fM in 10 mM PBS",
  "epistemic_tag": "Fact",
  "confidence": 0.855,
  "confidence_components": {
    "evidence_strength": 900,
    "study_quality_weight": 1000,
    "journal_tier_weight": 1000,
    "completeness_penalty": 1000,
    "section_modifier": 1000,
    "qualifier_penalty": 950
  },
  "qualifiers": ["in 10 mM PBS only", "n=5"],
  "missing_fields": [],
  "status": "Complete",
  "is_null_result": false,
  "is_inherited_citation": false,
  "causal_direction": "observed_correlation",
  "statistical_evidence": {
    "p_value": 0.001,
    "effect_size": 2.1,
    "effect_size_type": "cohens_d",
    "sample_size": 5,
    "confidence_interval": [0.6, 1.0],
    "practical_significance": true
  },
  "source_quote": "The limit of detection was determined to be 0.8 fM using the 3σ/slope method.",
  "source_page": 5,
  "source_bbox": [72, 340, 540, 365],
  "source_section": "results",
  "source_doi": "10.1234/example",
  "council_vote": {
    "extractor_1": {"tag": "Fact", "reasoning": "Direct measurement with statistics"},
    "extractor_2": {"tag": "Fact", "reasoning": "Quantitative with clear methodology"},
    "critic": {"tag": "Fact", "reasoning": "Supported by Table 2 data"},
    "chairman": {"tag": "Fact", "reasoning": "Unanimous agreement, strong statistics"}
  },
  "granularity": "atomic",
  "parent_claim_id": null,
  "sub_claims": [],
  "ontology_version": "quantum_bio_v1",
  "pipeline_version": "2.1.0",
  "taxonomy_version": "quantum_bio_v1",
  "extraction_timestamp": "2026-04-23T10:30:00Z"
}
```
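Note how the fixed-point components in the example multiply out to the stored `confidence`: 900/1000 × 950/1000 = 0.855. A sketch of that aggregation (the multiplicative rule is implied by the example values and should be treated as an assumption about Layer 5's formula):

```python
def confidence_from_components(components: dict) -> float:
    """Multiply fixed-point (×1000) components into a 0-1 confidence."""
    conf = 1.0
    for value in components.values():
        conf *= value / 1000.0
    return round(conf, 3)

components = {
    "evidence_strength": 900,
    "study_quality_weight": 1000,
    "journal_tier_weight": 1000,
    "completeness_penalty": 1000,
    "section_modifier": 1000,
    "qualifier_penalty": 950,
}
assert confidence_from_components(components) == 0.855  # matches the schema example
```

A multiplicative form means any single weak component (say a 0.5 completeness penalty) caps the whole score, which is the behavior you want from penalties.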

### 4.3 Layer 3: Canonicalization

**Purpose**: Deduplicate claims, merge aliases, aggregate evidence, track temporal versions.

```
New claim arrives →
1. Embed the claim text (local embedding model or Qwen3-8B last-hidden-state)
2. Search existing canonical claims (cosine similarity)
3. If similarity > 0.85:
   ├── MERGE: add the new source as evidence for the existing canonical claim
   ├── Update evidence_count, source_list, confidence (re-aggregate)
   ├── If confidence_components differ significantly: flag for human review
   └── Store alias mapping: new_claim_id → canonical_claim_id
4. If similarity 0.70-0.85:
   ├── FLAG as "potential duplicate; review recommended"
   └── Show both claims in the review queue with the similarity score
5. If similarity < 0.70:
   └── CREATE a new canonical claim
```
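The threshold logic above is independent of which embedding model is plugged in. A self-contained sketch with plain cosine similarity (the toy 2-d vectors are illustrative; real claims would use high-dimensional sentence embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dedup_decision(new_vec, canonical: dict) -> tuple:
    """Return (action, best_match_id) per the 0.85 / 0.70 thresholds above."""
    best_id, best_sim = None, -1.0
    for claim_id, vec in canonical.items():
        sim = cosine(new_vec, vec)
        if sim > best_sim:
            best_id, best_sim = claim_id, sim
    if best_sim > 0.85:
        return ("merge", best_id)
    if best_sim >= 0.70:
        return ("flag_potential_duplicate", best_id)
    return ("create_new", None)

canonical = {"CLM_001": [1.0, 0.0], "CLM_002": [0.0, 1.0]}
assert dedup_decision([0.99, 0.05], canonical) == ("merge", "CLM_001")
assert dedup_decision([1.0, 1.0], canonical)[0] == "flag_potential_duplicate"
```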

**Temporal Versioning**:
```
canonical_claim:
  version_history: [
    {version: 1, source: "preprint_2024", confidence: 0.65, date: "2024-03"},
    {version: 2, source: "vor_2024", confidence: 0.85, date: "2024-09"},
    {version: 3, source: "new_study_2025", confidence: 0.90, date: "2025-02"}
  ]
  current_version: 3
  supersedes: null
  superseded_by: null
```

### 4.4 Layer 4: Knowledge Graph

**Implementation**: SQLite-backed adjacency list (NOT Neo4j; keeps the system local and zero-dependency).

**Schema**:

```sql
CREATE TABLE graph_nodes (
    node_id    TEXT PRIMARY KEY,  -- canonical_claim_id or entity_id
    node_type  TEXT NOT NULL,     -- claim | entity | method | condition
    label      TEXT NOT NULL,
    properties TEXT,              -- JSON
    created_at TEXT NOT NULL
);

CREATE TABLE graph_edges (
    edge_id           TEXT PRIMARY KEY,
    source_node       TEXT NOT NULL,
    target_node       TEXT NOT NULL,
    edge_type         TEXT NOT NULL,      -- supports | refutes | extends | depends_on |
                                          -- supersedes | blocks | investigative_hypothesis |
                                          -- method_uses | condition_applies
    confidence        INTEGER NOT NULL,   -- Fixed-point ×1000
    evidence_sources  TEXT,               -- JSON array of source DOIs
    is_inferred       INTEGER DEFAULT 0,  -- 0=observed, 1=inferred (transitive)
    inference_chain   TEXT,               -- JSON: hop details if inferred
    method_compatible INTEGER,            -- NULL=unchecked, 0=incompatible, 1=compatible
    created_at        TEXT NOT NULL,
    updated_at        TEXT NOT NULL,
    FOREIGN KEY(source_node) REFERENCES graph_nodes(node_id),
    FOREIGN KEY(target_node) REFERENCES graph_nodes(node_id)
);

-- Indexes for fast graph traversal
CREATE INDEX idx_edges_source ON graph_edges(source_node);
CREATE INDEX idx_edges_target ON graph_edges(target_node);
CREATE INDEX idx_edges_type ON graph_edges(edge_type);
```
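Because the store is plain SQLite, the schema runs unchanged on Python's built-in `sqlite3`. A quick smoke test (the node and edge values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    node_id TEXT PRIMARY KEY, node_type TEXT NOT NULL, label TEXT NOT NULL,
    properties TEXT, created_at TEXT NOT NULL
);
CREATE TABLE graph_edges (
    edge_id TEXT PRIMARY KEY, source_node TEXT NOT NULL, target_node TEXT NOT NULL,
    edge_type TEXT NOT NULL, confidence INTEGER NOT NULL, evidence_sources TEXT,
    is_inferred INTEGER DEFAULT 0, inference_chain TEXT, method_compatible INTEGER,
    created_at TEXT NOT NULL, updated_at TEXT NOT NULL,
    FOREIGN KEY(source_node) REFERENCES graph_nodes(node_id),
    FOREIGN KEY(target_node) REFERENCES graph_nodes(node_id)
);
CREATE INDEX idx_edges_source ON graph_edges(source_node);
""")
conn.execute("INSERT INTO graph_nodes VALUES (?,?,?,?,?)",
             ("CLM_001", "claim", "LOD 0.8 fM", "{}", "2026-04-23"))
conn.execute("INSERT INTO graph_nodes VALUES (?,?,?,?,?)",
             ("CLM_002", "claim", "LOD 1.2 fM", "{}", "2026-04-23"))
conn.execute("INSERT INTO graph_edges VALUES (?,?,?,?,?,?,?,?,?,?,?)",
             ("E_001", "CLM_001", "CLM_002", "refutes", 850, "[]", 0, None, None,
              "2026-04-23", "2026-04-23"))
row = conn.execute("SELECT edge_type, confidence FROM graph_edges "
                   "WHERE source_node = ?", ("CLM_001",)).fetchone()
# row == ("refutes", 850)
```

Note that confidence lands in the table as the fixed-point integer 850, matching the ×1000 convention in the schema comment.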

**Edge Types**:

| Type | Meaning | Confidence Rule |
|------|---------|-----------------|
| `supports` | Claim A provides evidence for Claim B | From source text, observed |
| `refutes` | Claim A contradicts Claim B | From source text or conflict detection |
| `extends` | Claim A adds conditions/parameters to B | Section analysis |
| `depends_on` | Claim A assumes Claim B is true | Citation chain analysis |
| `supersedes` | Claim A replaces older Claim B (newer data) | Temporal versioning |
| `blocks` | Null finding: no evidence of relationship | Null result extraction |
| `investigative_hypothesis` | Inferred multi-hop (NOT observed) | min(hop_confidences) × 0.5 |

650
**Transitive Inference Constraints**:
- NEVER auto-generate `supports` edges across multiple hops
- Multi-hop inferences may only produce `investigative_hypothesis` edges
- Require `method_compatible=1` on every hop before generating an inference
- Default queries return observed edges only
- Graph queries must set `include_inferred=True` to receive inferences

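The constraints above can be sketched directly against the `graph_edges` schema. This is a minimal illustration; `inferred_confidence` and `get_edges` are hypothetical helper names, not existing modules:

```python
import sqlite3

def inferred_confidence(hop_confidences: list[int]) -> int:
    """Fixed-point (Γ—1000) confidence for a multi-hop inference:
    min(hop_confidences) Γ— 0.5, per the edge-type table."""
    return min(hop_confidences) // 2

def get_edges(conn: sqlite3.Connection, source: str, include_inferred: bool = False):
    """Return observed edges by default; inferred edges only on explicit opt-in."""
    sql = ("SELECT edge_id, edge_type, confidence, is_inferred "
           "FROM graph_edges WHERE source_node = ?")
    if not include_inferred:
        sql += " AND is_inferred = 0"
    return conn.execute(sql + " ORDER BY edge_id", (source,)).fetchall()

# Toy in-memory graph with one observed edge and one inferred edge
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE graph_edges (
    edge_id TEXT PRIMARY KEY, source_node TEXT, target_node TEXT,
    edge_type TEXT, confidence INTEGER, is_inferred INTEGER DEFAULT 0)""")
conn.execute("INSERT INTO graph_edges VALUES ('e1', 'A', 'B', 'supports', 900, 0)")
# Two observed hops at 900 and 700 yield an investigative hypothesis at 350
conn.execute("INSERT INTO graph_edges VALUES ('e2', 'A', 'C', 'investigative_hypothesis', ?, 1)",
             (inferred_confidence([900, 700]),))
```

By default `get_edges(conn, "A")` returns only the observed `e1`; the inferred `e2` surfaces only when the caller passes `include_inferred=True`.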
**Gap Analysis Protocol**:
```python
from itertools import combinations

def find_gaps(self, domain_id: str) -> list:
    """Find structural holes in the knowledge graph."""
    gaps = []

    # 1. Get all entities in the domain
    entities = self.get_entities(domain_id)
    max_degree = max((self.get_degree(e.id) for e in entities), default=1) or 1

    # 2. For each entity pair in the same domain
    for a, b in combinations(entities, 2):
        # 3. Check whether an edge already exists
        edges = self.get_edges(a.id, b.id)
        if not edges:
            # 4. Check whether both are well-connected (dense neighborhood)
            a_degree = self.get_degree(a.id)
            b_degree = self.get_degree(b.id)
            if a_degree > 3 and b_degree > 3:
                # 5. This is a high-value gap
                info_gain = (a_degree + b_degree) / max_degree
                gaps.append({
                    "entity_a": a, "entity_b": b,
                    "information_gain": info_gain,
                    "suggested_action": "experiment" if info_gain > 0.7 else "literature_search",
                })

    return sorted(gaps, key=lambda g: -g["information_gain"])
```

### 4.5 Layer 5: Calibrated Scoring

**Purpose**: Compute confidence using CODE, not the LLM. Three separate scores.

```python
def compute_claim_scores(claim: dict, source: dict, section: str) -> dict:
    """
    Code-computed scoring. The LLM provides COMPONENTS,
    the code computes the FINAL SCORES.

    The LLM NEVER sets the final confidence directly.
    (taxonomy, domain_id, and the weight tables come from module-level config.)
    """
    # ── Score 1: Evidence Quality ──
    evidence_strength = claim["confidence_components"]["evidence_strength"]  # From LLM
    study_quality = taxonomy.get_weight(source["study_type"], domain_id)     # From taxonomy
    journal_tier = JOURNAL_TIER_WEIGHTS[source["journal_tier"]]              # From config
    completeness = 700 if claim["missing_fields"] else 1000                  # Binary: code-enforced
    section_mod = SECTION_MODIFIERS[section]                                 # From config

    # Fixed-point multiplication chain
    evidence_quality = (evidence_strength * study_quality // 1000
                        * journal_tier // 1000
                        * completeness // 1000
                        * section_mod // 1000)

    # ── Score 2: Claim Truth Likelihood ──
    # Based on evidence quality + source count + conflict status
    source_count_bonus = min(claim["evidence_count"] * 50, 200)  # Max +0.2 for multiple sources
    conflict_penalty = -300 if claim.get("has_active_conflict") else 0
    null_evidence_penalty = -200 if claim.get("has_null_evidence") else 0

    truth_likelihood = min(1000, max(0,
        evidence_quality + source_count_bonus + conflict_penalty + null_evidence_penalty
    ))

    # ── Score 3: Qualifier Strength ──
    # How definitive is the claim's language?
    qualifier_count = len(claim.get("qualifiers", []))
    is_null = claim.get("is_null_result", False)
    is_inherited = claim.get("is_inherited_citation", False)

    qualifier_strength = 1000
    if qualifier_count > 0:
        qualifier_strength -= qualifier_count * 100        # -0.1 per qualifier
    if is_null:
        qualifier_strength = min(qualifier_strength, 500)  # Cap at 0.5 for null results
    if is_inherited:
        qualifier_strength -= 200                          # -0.2 for inherited citations
    qualifier_strength = max(0, qualifier_strength)

    # ── Statistical Evidence Gate ──
    stats = claim.get("statistical_evidence", {})
    if stats.get("effect_size") is not None:
        effect = stats["effect_size"]
        sample_n = stats.get("sample_size", 0)

        # Large N + tiny effect = statistically significant but practically meaningless
        if sample_n > 1000 and abs(effect) < 0.1:
            # Override: this is NOT practically significant
            evidence_quality = min(evidence_quality, 400)  # Cap at 0.4
            claim["practical_significance"] = False

    # ── Parser Confidence Propagation ──
    parse_conf = claim.get("parse_confidence", 1000)
    evidence_quality = min(evidence_quality, parse_conf)  # Parser uncertainty CAPS the claim

    return {
        "evidence_quality": evidence_quality,        # Fixed-point Γ—1000
        "truth_likelihood": truth_likelihood,        # Fixed-point Γ—1000
        "qualifier_strength": qualifier_strength,    # Fixed-point Γ—1000
        "composite_confidence": (evidence_quality + truth_likelihood + qualifier_strength) // 3,
        "practical_significance": claim.get("practical_significance", True),
    }
```

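The Γ—1000 fixed-point chain in `compute_claim_scores` floors after every multiplication, so it can be checked in isolation. `fp_mul` and the component values below are illustrative, not calibrated weights:

```python
def fp_mul(*factors: int) -> int:
    """Multiply fixed-point (Γ—1000) factors, flooring after each step,
    mirroring the `* x // 1000` chain in compute_claim_scores."""
    result = factors[0]
    for f in factors[1:]:
        result = result * f // 1000
    return result

# Illustrative components: strong evidence (0.9), top study weight (1.0),
# tier-1 journal (1.0), missing fields (0.7), Results section (1.0)
evidence_quality = fp_mul(900, 1000, 1000, 700, 1000)  # 630, i.e. 0.63
```

Because every step floors, the chain never exceeds what float multiplication would give, which keeps the score conservative.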
### 4.6 Layer 6: Evaluation

**Evaluation Pipeline** (runs in CI/CD on every prompt/model/taxonomy change):

```
1. STRUCTURAL TESTS (existing 119 tests β€” code correctness)
   └── pytest tests/ β†’ all pass?

2. GOLDEN DATASET REGRESSION (versioned annotations)
   β”œβ”€β”€ Extraction recall β‰₯ 70%
   β”œβ”€β”€ Hallucination rate ≀ 10%
   β”œβ”€β”€ Epistemic accuracy β‰₯ 60%
   β”œβ”€β”€ Qualifier preservation rate β‰₯ 80% (NEW)
   └── Null result detection rate β‰₯ 50% (NEW)

3. LLM-AS-JUDGE (faithfulness & grounding)
   β”œβ”€β”€ Faithfulness: does the extracted claim appear in the source text?
   β”œβ”€β”€ Grounding: can the claim be traced to a specific source quote?
   β”œβ”€β”€ Tag correctness: does the epistemic tag match expert judgment?
   β”œβ”€β”€ Qualifier preservation: are hedging words maintained?
   └── Run on 5 golden papers, 3 times each (stochastic check)

4. CALIBRATION CHECK (monthly)
   β”œβ”€β”€ Brier score from calibration_log
   β”œβ”€β”€ Alert if ECE > 0.25
   └── Trigger ConfTuner re-training if needed

5. HIDDEN HOLDOUT (never seen during development)
   β”œβ”€β”€ 3 papers reserved, never used in training or the golden set
   β”œβ”€β”€ Evaluated quarterly
   └── Detects benchmark overfitting
```

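Step 4's two metrics are standard calibration measures. A minimal sketch, where the (confidence, outcome) pairs stand in for rows of `calibration_log` and the 10-bin ECE is an assumed configuration:

```python
def brier_score(preds: list[tuple[float, int]]) -> float:
    """Mean squared error between predicted confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in preds) / len(preds)

def expected_calibration_error(preds: list[tuple[float, int]], bins: int = 10) -> float:
    """Bin predictions by confidence; ECE is the count-weighted gap
    between mean confidence and empirical accuracy per bin."""
    total, ece = len(preds), 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        bucket = [(p, y) for p, y in preds
                  if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if bucket:
            conf = sum(p for p, _ in bucket) / len(bucket)
            acc = sum(y for _, y in bucket) / len(bucket)
            ece += len(bucket) / total * abs(conf - acc)
    return ece
```

A perfectly calibrated log (e.g. 0.8-confidence claims that are right 80% of the time) gives ECE 0; the monthly alert fires when ECE exceeds 0.25.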
**Versioned Annotation Guidelines**:
```
/evaluation/
β”œβ”€β”€ guidelines_v1.0.md         # Annotation rules (version controlled)
β”œβ”€β”€ golden_dataset/
β”‚   β”œβ”€β”€ paper_001.json         # Annotated under guidelines v1.0
β”‚   β”œβ”€β”€ paper_002.json         # Annotated under guidelines v1.0
β”‚   └── paper_006.json         # Annotated under guidelines v1.1
β”œβ”€β”€ frozen_anchors/            # NEVER re-annotated
β”‚   β”œβ”€β”€ paper_001_frozen.json
β”‚   └── paper_002_frozen.json
└── holdout/                   # NEVER seen during development
    β”œβ”€β”€ paper_H1.json
    └── paper_H2.json
```

### 4.7 Layer 7: Provenance & Reproducibility

**Output Lineage** (every claim tagged):
```json
{
  "pipeline_version": "2.1.0",
  "model_checkpoint": "research-os-grpo-v2-step-5000",
  "parser_version": "marker-1.2.0",
  "taxonomy_version": "quantum_bio_v1",
  "prompt_hash": "sha256:a3b4c5...",
  "extraction_timestamp": "2026-04-23T10:30:00Z",
  "guidance_schema_version": "1.0"
}
```

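The `prompt_hash` field can be produced by content-addressing the exact prompt text, so any prompt edit changes the lineage record. A sketch; `canonical_prompt` is an illustrative stand-in for the real prompt file:

```python
import hashlib

def prompt_hash(prompt_text: str) -> str:
    """Content-address a prompt so lineage records pin the exact version."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"

canonical_prompt = "Extract qualified claims as JSON."  # illustrative only
lineage_field = prompt_hash(canonical_prompt)
```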
**Security Sandbox** (for repository validation):
```
β”Œβ”€β”€β”€ SANDBOX (isolated from main system) ──────────────────┐
β”‚ β€’ Timeout: 60 seconds max per URL check                  β”‚
β”‚ β€’ Network: HTTP GET only, no POST/PUT/DELETE             β”‚
β”‚ β€’ Download limit: 100MB per artifact                     β”‚
β”‚ β€’ No code execution (dry-run validation only)            β”‚
β”‚ β€’ Actual code execution requires human authorization     β”‚
β”‚ β€’ Credential isolation: no access to main DB or API keys β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Epistemic Embargo** (for IP protection):
```
User creates "Private Graph" β†’
All claims extracted in this mode go to private subgraph β†’
Private subgraph is NOT visible to other users / companion agents β†’
After paper submission: user clicks "Merge to Lab Graph" β†’
Claims move from private to shared graph with full provenance
```

---

## 5. UI Architecture

### 5.1 Courtroom UI (Conflict Resolution)

```
Default View (Review Queue):
  ⚠️ 3-way conflict detected β€” Debye screening threshold
  Papers: Chen 2022, Nakamura 2023, Williams 2024
  Comparability confidence: 0.58 (method differences detected)
  [Review] [Defer] [Dismiss]

Expanded View (Courtroom β€” click to open):
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Chen 2022   β”‚ Nakamura 23 β”‚ Williams 24 β”‚
  β”‚ ACS Nano T1 β”‚ Biosens. T1 β”‚ Sensors T3  β”‚
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ Claim text  β”‚ Claim text  β”‚ Claim text  β”‚
  β”‚ (nestable)  β”‚ (nestable)  β”‚ (nestable)  β”‚
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ Method box  β”‚ Method box  β”‚ Method box  β”‚
  β”‚ N=5 p<.001  β”‚ N=12 p<.01  β”‚ N=3 p=.12   β”‚
  β”‚ [PDFπŸ“„]     β”‚ [PDFπŸ“„]     β”‚ [PDFπŸ“„]     β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  System Analysis (Level 5 β€” unverified):
    "These claims are not directly comparable..."
    Confidence in analysis: 0.62

  Council Votes: Ext1: scope_diff | Ext2: value_mismatch | Critic: scope_diff

  [Agree] [Override with custom] [Defer β€” need more info]

  ⚠️ Missing competitor evidence:
    "3 papers cited by these sources are not yet ingested"
    [Ingest Park 2023] [Ingest Liu 2024] [Ingest Fernandez 2023]
```

### 5.2 Progressive Disclosure Levels

```
Level 0: Dashboard
  Epistemic Health Score per claim cluster
  Today's review queue (priority-ranked)

Level 1: Claim Detail
  Text + tag + composite confidence + source
  [Expand to see scoring breakdown]

Level 2: Scoring Breakdown
  3 separate scores (evidence, truth, qualifier)
  Statistical evidence if available
  Parser confidence for this region

Level 3: Provenance Chain
  Source quote + page + bbox
  Council vote distribution
  Pipeline version + model checkpoint

Level 4: Graph Neighborhood
  2-hop subgraph around this claim
  Typed edges visible
  Inferred edges dashed + labeled

Level 5: Full Debug
  Raw LLM outputs from each council member
  Token-level confidence distribution
  Parse regions and quality flags
```

### 5.3 Manual Synthesis Mode

```
[Toggle] 🧠 Manual Synthesis Mode: ON

In this mode:
  βœ… Claims displayed (text + source)
  βœ… Organized by topic clusters
  ❌ NO confidence scores shown
  ❌ NO conflict flags shown
  ❌ NO gap analysis shown
  ❌ NO system suggestions

The researcher draws connections manually,
then switches back to compare with the system's analysis.
```

---

## 6. Local Deployment

### 6.1 Minimal Setup (16GB VRAM)

```bash
# 1. Install Ollama (simplest local LLM server)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull the quantized model (after fine-tuning and uploading the GGUF)
ollama pull nkshirsa/research-os-brain:q4_k_m

# 3. Verify it's running
curl http://localhost:11434/api/generate -d '{"model": "research-os-brain:q4_k_m", "prompt": "test"}'

# 4. Start the Research OS
pip install -r requirements.txt
python -m phd_research_os.serve --model ollama://research-os-brain:q4_k_m --port 8080

# 5. Open the UI
# http://localhost:8080
```

### 6.2 VRAM Budget

```
Qwen3-8B Q4 AWQ weights:       ~5.0 GB
PolarQuant KV cache (128K):    ~3.8 GB
Qwen3-VL-8B Q4 (for figures):  ~5.0 GB (loaded on-demand, not persistent)
Guidance engine overhead:      ~0.5 GB
ChromaDB embeddings:           ~0.5 GB
──────────────────────────────────────
Total (text only):             ~9.8 GB  ← fits 16GB GPU
Total (with VLM loaded):      ~14.8 GB  ← fits 16GB GPU (tight)
Total (with VLM on-demand):    ~9.8 GB  ← swap VLM in/out per figure
```

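The KV-cache line can be sanity-checked with the standard sizing formula. The layer/head dimensions below are illustrative assumptions for an 8B-class GQA model, not confirmed Qwen3-8B internals, and the quantized width is a rough stand-in for PolarQuant:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Standard KV-cache sizing: 2 tensors (K and V) per layer,
    kv_heads Γ— head_dim values per token, at the given quantized width."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8

# Assumed dims for an 8B-class GQA model at a 128K context,
# with an aggressively quantized (β‰ˆ2-bit) KV cache:
gb = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128,
                    seq_len=131072, bits_per_value=2) / 1e9
```

Under these assumptions the cache lands at a few GB, the same order as the ~3.8 GB budget line; at fp16 the same cache would be roughly 8Γ— larger and would not fit.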
---

## 7. Data Flow (Complete Pipeline)

```
PDF Bundle arrives
        β”‚
        β–Ό
LAYER 0: Structural Ingestion
β”œβ”€β”€ Marker: layout-aware markdown with section boundaries
β”œβ”€β”€ Nougat: equations β†’ LaTeX (routed by region classifier)
β”œβ”€β”€ GROBID: references β†’ structured citations
β”œβ”€β”€ Figure regions β†’ classify β†’ VLM (semantic) or Digitizer (quantitative)
β”œβ”€β”€ Per-region quality scoring (parse_confidence, ocr_confidence)
β”œβ”€β”€ Cross-reference verification (Figure 3 β†’ correct figure object?)
└── Output: list of annotated regions with bbox, section, quality
        β”‚
        β–Ό
LAYER 1: Entity Resolution
β”œβ”€β”€ Normalize entities (gene names, chemicals, assays β†’ canonical IDs)
β”œβ”€β”€ Resolve in-text citations ([32] β†’ DOI β†’ metadata)
β”œβ”€β”€ Check VoR lineage (is this a preprint we already have?)
β”œβ”€β”€ Check retraction status (CrossRef + Retraction Watch)
└── Tag: primary vs inherited claims
        β”‚
        β–Ό
LAYER 2: Qualified Extraction (AI Model Council)
β”œβ”€β”€ Round 1 (parallel): Query Planner + 2 Extractors + Critic
β”‚     Each independently processes section-aware chunks
β”‚     Guidance engine enforces: valid JSON, valid tags, valid ranges
β”‚     Section modifier applied (Abstract=0.7, Results=1.0, Discussion=0.75)
β”œβ”€β”€ Round 2 (debate): Share tags + reasoning (NOT confidence)
β”œβ”€β”€ Round 3 (chairman): Synthesize final claims
β”‚     Apply completeness penalty (code-enforced: 0.7 if missing fields)
β”‚     Preserve qualifiers from source text
β”‚     Extract statistical evidence (N, p, d, CI)
β”‚     Tag null results, inherited citations, causal direction
└── Output: list of qualified claims with full provenance
        β”‚
        β–Ό
LAYER 3: Canonicalization
β”œβ”€β”€ Embed each new claim
β”œβ”€β”€ Compare against existing canonical claims (cosine > 0.85 = merge)
β”œβ”€β”€ Merge: add source as evidence, update confidence aggregation
β”œβ”€β”€ Create: new canonical claim with first source
└── Temporal versioning: a VoR version of the same claim supersedes the preprint version
        β”‚
        β–Ό
LAYER 4: Knowledge Graph
β”œβ”€β”€ Insert claim as graph node
β”œβ”€β”€ Create edges from citation analysis (supports, depends_on)
β”œβ”€β”€ Run conflict detector (keyword + embedding similarity for candidates)
β”œβ”€β”€ Council evaluates candidate conflicts β†’ typed edges (refutes, scope_diff)
β”œβ”€β”€ Check for null evidence β†’ blocking edges
β”œβ”€β”€ Update method-compatibility metadata on edges
β”œβ”€β”€ Cluster related conflicts into case files
└── Run gap analysis (if in Research Landscape mode)
        β”‚
        β–Ό
LAYER 5: Calibrated Scoring (CODE-COMPUTED)
β”œβ”€β”€ evidence_quality = evidence Γ— quality Γ— tier Γ— completeness Γ— section
β”œβ”€β”€ truth_likelihood = evidence_quality + source_bonus - conflict_penalty
β”œβ”€β”€ qualifier_strength = 1.0 - qualifier_countΓ—0.1 - null_penalty - inherited_penalty
β”œβ”€β”€ Statistical evidence gate: large N + tiny effect β†’ cap confidence
β”œβ”€β”€ Parser confidence propagation: parse_confidence caps evidence_quality
└── Store all 3 scores + composite on claim
        β”‚
        β–Ό
LAYER 6: Evaluation (on config change)
β”œβ”€β”€ Regression gate against golden dataset
β”œβ”€β”€ LLM-as-Judge faithfulness + grounding check
β”œβ”€β”€ Brier score monitoring (monthly)
└── Hidden holdout benchmark (quarterly)
        β”‚
        β–Ό
LAYER 7: Provenance
β”œβ”€β”€ Tag claim with full pipeline version lineage
β”œβ”€β”€ Store bbox + source quote for UI traceability
└── Export: Obsidian vault, Courtroom UI, CSV, BibTeX
```

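Layer 3's merge test reduces to a cosine threshold over claim embeddings. A minimal sketch; the 0.85 threshold comes from the pipeline above, while `should_merge` and the toy 2-d vectors are illustrative:

```python
import math

MERGE_THRESHOLD = 0.85  # cosine similarity above which two claims merge

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def should_merge(emb_a: list[float], emb_b: list[float]) -> bool:
    """Merge a new claim into an existing canonical claim when the
    embeddings are near-duplicates (cosine > 0.85)."""
    return cosine(emb_a, emb_b) > MERGE_THRESHOLD
```

In the real layer the vectors would come from the ChromaDB embedding store; claims below the threshold instead create a new canonical claim.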
---

## 8. Implementation Phases (Aligned with PhD Timeline)

### Phase A: Foundation (Weeks 1-6) β€” MUST BE FIRST

| Week | Task | Deliverable |
|------|------|-------------|
| 1-2 | Integrate Marker for PDF β†’ structured markdown | Section-aware regions with bbox |
| 3 | Add Nougat routing for equation-heavy regions | LaTeX preservation |
| 4 | Implement section-aware chunking (replace page-based) | Semantic chunks |
| 5 | Add quality scoring per-region | parse_confidence on every span |
| 6 | Integrate Guidance engine for constrained decoding | Guaranteed valid JSON output |

### Phase B: Identity (Weeks 7-12)

| Week | Task | Deliverable |
|------|------|-------------|
| 7-8 | Claim canonicalization with embedding dedup | Canonical registry |
| 9 | Entity normalization (abbreviations, synonyms) | Ontology mapper |
| 10-11 | Citation chain resolution ([32] β†’ DOI) | Primary vs inherited tagging |
| 12 | VoR lineage detection | Preprint β†’ VoR superseding |

### Phase C: Structure (Weeks 13-20)

| Week | Task | Deliverable |
|------|------|-------------|
| 13-14 | SQLite-backed knowledge graph with typed edges | Graph schema + CRUD |
| 15-16 | Qualifier preservation + null result handling | Blocking edges |
| 17-18 | Method-compatibility layer | Comparability confidence |
| 19-20 | Conflict clustering into case files | Case file UI |

### Phase D: Calibration (Weeks 21-26)

| Week | Task | Deliverable |
|------|------|-------------|
| 21-22 | Epistemic Separation Engine (section modifiers) | Section-aware scoring |
| 23-24 | Statistical evidence extraction (N, p, d, CI) | Practical significance gate |
| 25-26 | GRPO training with epistemic reward functions | Trained model v2 |

### Phase E: Judgment (Weeks 27-32)

| Week | Task | Deliverable |
|------|------|-------------|
| 27-28 | Courtroom UI with PDF.js bounding box viewer | Provenance display |
| 29-30 | Council parallel-then-merge architecture | Hidden confidence protocol |
| 31-32 | Conflict clustering + case file resolution | Batch conflict resolution |

### Phase F: Longevity (Ongoing, PhD Year 1+)

| Task | Trigger |
|------|---------|
| Versioned ontology with backward-compatible queries | 3rd taxonomy update |
| VoR lineage tracking | First preprint β†’ VoR encounter |
| Ongoing Brier calibration monitoring | 50+ calibration data points |
| Gold-standard drift detection | 2nd annotation batch |
| Gap Analysis Protocol | 100+ papers ingested |
| Manual Synthesis Mode | Thesis writing phase |

---

## 9. File Structure (v2.0)

```
phd-research-os/
β”œβ”€β”€ SYSTEM_DESIGN.md                   # THIS DOCUMENT
β”œβ”€β”€ BLINDSPOT_AUDIT_COMPLETE.md        # 87-blindspot audit
β”‚
β”œβ”€β”€ phd_research_os/                   # Core Python package
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚
β”‚   β”œβ”€β”€ layer0/                        # Structural Ingestion
β”‚   β”‚   β”œβ”€β”€ parser.py                  # Marker + Nougat + GROBID orchestrator
β”‚   β”‚   β”œβ”€β”€ region_classifier.py       # LayoutLMv3 region classification
β”‚   β”‚   β”œβ”€β”€ chunker.py                 # Section-aware chunking
β”‚   β”‚   β”œβ”€β”€ figure_router.py           # VLM vs Digitizer routing
β”‚   β”‚   β”œβ”€β”€ plot_digitizer.py          # Quantitative plot β†’ CSV
β”‚   β”‚   β”œβ”€β”€ quality_scorer.py          # Per-span quality scoring
β”‚   β”‚   └── cross_ref_verifier.py      # Figure/Table reference integrity
β”‚   β”‚
β”‚   β”œβ”€β”€ layer1/                        # Entity Resolution
β”‚   β”‚   β”œβ”€β”€ entity_normalizer.py       # Ontology-aware normalization
β”‚   β”‚   β”œβ”€β”€ citation_resolver.py       # In-text [32] β†’ DOI
β”‚   β”‚   β”œβ”€β”€ vor_lineage.py             # Version of Record tracking
β”‚   β”‚   └── retraction_checker.py      # CrossRef + Retraction Watch
β”‚   β”‚
β”‚   β”œβ”€β”€ layer2/                        # Qualified Extraction
β”‚   β”‚   β”œβ”€β”€ council.py                 # Parallel-then-merge council (upgraded)
β”‚   β”‚   β”œβ”€β”€ epistemic_separator.py     # Abstract vs Results scoring
β”‚   β”‚   β”œβ”€β”€ qualifier_extractor.py     # Hedging, negation, conditions
β”‚   β”‚   β”œβ”€β”€ statistical_extractor.py   # N, p, d, CI extraction
β”‚   β”‚   β”œβ”€β”€ constrained_decoder.py     # Guidance engine integration
β”‚   β”‚   └── ood_detector.py            # Mahalanobis distance OOD gating
β”‚   β”‚
β”‚   β”œβ”€β”€ layer3/                        # Canonicalization
β”‚   β”‚   β”œβ”€β”€ deduplicator.py            # Embedding-based near-duplicate detection
β”‚   β”‚   β”œβ”€β”€ canonical_registry.py      # Canonical claim management
β”‚   β”‚   β”œβ”€β”€ alias_merger.py            # Alias mapping and merging
β”‚   β”‚   └── temporal_versioner.py      # Claim version history
β”‚   β”‚
β”‚   β”œβ”€β”€ layer4/                        # Knowledge Graph
β”‚   β”‚   β”œβ”€β”€ graph.py                   # SQLite-backed graph with typed edges
β”‚   β”‚   β”œβ”€β”€ conflict_detector.py       # Pairwise conflict detection (upgraded)
β”‚   β”‚   β”œβ”€β”€ conflict_clusterer.py      # Case file generation
β”‚   β”‚   β”œβ”€β”€ method_compatibility.py    # Cross-paper method comparison
β”‚   β”‚   β”œβ”€β”€ gap_analyzer.py            # Structural hole detection
β”‚   β”‚   └── transitive_constraints.py  # Multi-hop inference safety
β”‚   β”‚
β”‚   β”œβ”€β”€ layer5/                        # Calibrated Scoring
β”‚   β”‚   β”œβ”€β”€ scorer.py                  # Code-computed 3-score system
β”‚   β”‚   β”œβ”€β”€ statistical_gate.py        # Effect size / practical significance
β”‚   β”‚   β”œβ”€β”€ section_modifiers.py       # Abstract/Results/Discussion weights
β”‚   β”‚   └── calibration_monitor.py     # Brier score tracking
β”‚   β”‚
β”‚   β”œβ”€β”€ layer6/                        # Evaluation
β”‚   β”‚   β”œβ”€β”€ regression_gate.py         # Golden dataset regression
β”‚   β”‚   β”œβ”€β”€ llm_judge.py               # Faithfulness/grounding evaluation
β”‚   β”‚   β”œβ”€β”€ stochastic_tester.py       # Run-N-times variance check
β”‚   β”‚   └── annotation_drift.py        # Gold-standard drift detection
β”‚   β”‚
β”‚   β”œβ”€β”€ layer7/                        # Provenance
β”‚   β”‚   β”œβ”€β”€ lineage_tagger.py          # Pipeline version tagging
β”‚   β”‚   β”œβ”€β”€ security_sandbox.py        # Isolated URL/repo validation
β”‚   β”‚   β”œβ”€β”€ license_checker.py         # Usage rights verification
β”‚   β”‚   └── embargo_manager.py         # Private graph / merge workflow
β”‚   β”‚
β”‚   β”œβ”€β”€ ui/                            # Gradio UI
β”‚   β”‚   β”œβ”€β”€ app.py                     # Main application
β”‚   β”‚   β”œβ”€β”€ courtroom.py               # Conflict resolution courtroom
β”‚   β”‚   β”œβ”€β”€ dashboard.py               # Epistemic health dashboard
β”‚   β”‚   β”œβ”€β”€ pdf_viewer.py              # PDF.js with bbox highlighting
β”‚   β”‚   β”œβ”€β”€ manual_synthesis.py        # AI-free exploration mode
β”‚   β”‚   └── export.py                  # CSV, BibTeX, JSON, Obsidian export
β”‚   β”‚
β”‚   β”œβ”€β”€ core/                          # Shared infrastructure
β”‚   β”‚   β”œβ”€β”€ db.py                      # SQLite data layer (existing, extended)
β”‚   β”‚   β”œβ”€β”€ taxonomy.py                # Quantum-Bio V2 (existing)
β”‚   β”‚   β”œβ”€β”€ agents.py                  # Brain interface (existing, upgraded)
β”‚   β”‚   β”œβ”€β”€ agent_os.py                # ECC Harness (existing)
β”‚   β”‚   β”œβ”€β”€ meta_improver.py           # Meta-Improver (existing)
β”‚   β”‚   └── skills/                    # Superpowers (existing)
β”‚   β”‚
β”‚   β”œβ”€β”€ training/                      # Model training
β”‚   β”‚   β”œβ”€β”€ train_sft.py               # Stage 1: SFT
β”‚   β”‚   β”œβ”€β”€ train_dpo.py               # Stage 2: DPO
β”‚   β”‚   β”œβ”€β”€ train_grpo.py              # Stage 3: GRPO with epistemic rewards
β”‚   β”‚   β”œβ”€β”€ train_calibration.py       # Stage 4: ConfTuner
β”‚   β”‚   β”œβ”€β”€ reward_functions.py        # JSON validity, schema, qualifier rewards
β”‚   β”‚   └── generate_dataset.py        # Synthetic + real data generation
β”‚   β”‚
β”‚   └── config/                        # Version-controlled configuration
β”‚       β”œβ”€β”€ prompts/                   # All system prompts (git-tracked)
β”‚       β”œβ”€β”€ taxonomy/                  # Domain taxonomies
β”‚       β”œβ”€β”€ scoring/                   # Weight tables, thresholds
β”‚       └── evaluation/                # Golden dataset + guidelines
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_layer0.py                 # Structural ingestion tests
β”‚   β”œβ”€β”€ test_layer1.py                 # Entity resolution tests
β”‚   β”œβ”€β”€ test_layer2.py                 # Extraction tests
β”‚   β”œβ”€β”€ test_layer3.py                 # Canonicalization tests
β”‚   β”œβ”€β”€ test_layer4.py                 # Knowledge graph tests
β”‚   β”œβ”€β”€ test_layer5.py                 # Scoring tests
β”‚   β”œβ”€β”€ test_layer6.py                 # Evaluation tests
β”‚   β”œβ”€β”€ test_layer7.py                 # Provenance tests
β”‚   β”œβ”€β”€ test_db.py                     # Data layer (existing 22 tests)
β”‚   β”œβ”€β”€ test_agent_os.py               # ECC harness (existing 21 tests)
β”‚   β”œβ”€β”€ test_taxonomy.py               # Taxonomy (existing 27 tests)
β”‚   β”œβ”€β”€ test_skills_and_meta.py        # Skills + meta (existing 30 tests)
β”‚   └── test_council.py                # Council (existing 19 tests)
β”‚
└── docs/
    β”œβ”€β”€ ARCHITECTURE.md                # Project map (existing)
    β”œβ”€β”€ AGENTS.md                      # Agent registry (existing)
    β”œβ”€β”€ USAGE.md                       # Daily workflow guide
    β”œβ”€β”€ ANNOTATION_GUIDELINES.md       # Versioned golden dataset rules
    └── DEPLOYMENT.md                  # Local setup guide
```

---

## 10. Success Criteria

The system is DONE when:

1. **A researcher can drop in a PDF and get back epistemic-tagged claims with source bounding boxes in under 5 minutes**
2. **Two claims from different papers that say the same thing are automatically recognized as the same canonical claim**
3. **A null result creates a blocking edge, not a gap, in the knowledge graph**
4. **An Abstract claim that overstates the Results gets automatically penalized**
5. **The courtroom shows three conflicting papers side-by-side with method comparison, and the researcher can resolve the conflict in 2 clicks**
6. **The gap analyzer identifies untested entity pairs and generates Decision Objects**
7. **The system knows when it doesn't know β€” OOD papers, unextractable regions, and uncalibrated confidence all surface to the human**
8. **All of the above works on a 16GB consumer GPU with zero internet dependency for paper processing**

---

*This design addresses all 87 blindspots from the complete audit.*
*Implementation timeline: ~32 weeks pre-PhD + ongoing during PhD Years 1-3.*
*The hardest part is not building it. It's keeping it honest.*

---