# PhD Research OS – Complete System Design
## Version 2.0 | Post-Audit Architecture

**Date**: 2026-04-23
**Status**: DESIGN COMPLETE – Ready for phased implementation
**Addresses**: All 87 blindspots from the audit
**Hardware Target**: 16-24GB VRAM consumer GPU (RTX 4090 / RTX 3090 / A6000)

---

## 1. System Overview
```
PhD Research OS v2.0 – "The Epistemic Engine"

INPUTS
  PDF Bundles | Supplements | Datasets | Code Repos | Lab Notes
    ↓
LAYER 0: STRUCTURAL INGESTION
  Marker | Nougat | GROBID | Region Classifier | Plot Digitizer
  Section-aware chunks | Bounding boxes | Quality scores
    ↓
LAYER 1: ENTITY RESOLUTION
  Ontology normalizer | Citation resolver | VoR lineage | Retraction checks
    ↓
LAYER 2: QUALIFIED EXTRACTION
  AI Model Council (parallel) | Epistemic Separation Engine
  Qualifier preservation | Statistical extraction | OOD gating
  Guidance constrained decoding | Source quotes + bboxes
    ↓
LAYER 3: CANONICALIZATION
  Embedding dedup | Canonical registry | Alias merging
  Evidence aggregation | Temporal versioning | Lineage diff
    ↓
LAYER 4: KNOWLEDGE GRAPH
  SQLite-backed graph | Typed epistemic edges | Lab lineage
  Method compatibility | Transitive constraints | Gap analysis
  Null evidence | Conflict clustering | Versioned ontology
    ↓
LAYER 5: CALIBRATED SCORING
  Code-computed confidence | 3 separate scores | Statistical gate
  Parser confidence propagation | Section modifiers | Brier monitoring
    ↓
LAYER 6: EVALUATION
  LLM-as-Judge CI/CD | Versioned golden set | Stochastic tests
  Hidden holdout | Fatigue management | Counter-metrics
    ↓
LAYER 7: PROVENANCE & REPRODUCIBILITY
  Version pinning | Output lineage | PDF.js viewer | Containers
  Security sandbox | License checking | Epistemic Embargo
    ↓
OUTPUTS
  Obsidian Vault | Courtroom UI | Gap Analysis | Decision Objects

CROSS-CUTTING (spans all layers)
  AI Model Council | Meta-Improver | Superpowers Skills
  ECC Harness | Companion Agents | Manual Synthesis Mode
```

---

## 2. Model Architecture

### 2.1 The Two-Model Strategy

The system runs TWO models, not one. This resolves the local-vs-online tension:
```
PRIMARY BRAIN (Fully Local – Never Touches the Internet)

  Model:   Qwen3-8B Q4 AWQ
  VRAM:    ~5GB weights + ~4GB KV cache (PolarQuant)
  Total:   ~9GB (fits a 16GB GPU with room for batching)
  Context: 128K tokens (full paper length)
  Serving: Ollama (simplest) or vLLM (fastest)

  Tasks:
    - Claim extraction (Layer 2)
    - Epistemic classification
    - Confidence component estimation
    - Conflict hypothesis generation
    - Query decomposition
    - Decision object generation

  Constrained decoding: Guidance engine
  Training: SFT → DPO → GRPO → ConfTuner (4-stage pipeline)
  Privacy:  ALL paper data stays local

COMPANION BRAIN (Online – For Non-Sensitive Tasks)

  Model: Claude API / GPT-4o-mini / OpenRouter
  OR:    Local Qwen3-30B-A3B MoE Q4 (~16GB resident, 3B active)

  Tasks:
    - Meta-Improver external scanning (arXiv, GitHub)
    - Prompt optimization A/B testing
    - Training data generation for new domains
    - Retraction/correction checking (needs internet)
    - Repository URL validation

  Privacy: NEVER sees raw paper text
  Only receives: metadata, queries, anonymized claims
```
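The privacy boundary between the two brains is worth enforcing in code, not just by convention. A minimal routing sketch; the task names and the `route` helper are illustrative, not part of the design:

```python
# Hypothetical privacy router between the two brains.
# Task names below are illustrative, not part of the shipped system.
SENSITIVE_TASKS = {
    "claim_extraction", "epistemic_classification",
    "confidence_estimation", "conflict_hypothesis", "decision_object",
}
COMPANION_TASKS = {
    "external_scan", "prompt_ab_test", "retraction_check", "repo_url_validation",
}

def route(task: str, payload_contains_paper_text: bool) -> str:
    """Return which model may serve this task, enforcing the privacy rule."""
    if payload_contains_paper_text or task in SENSITIVE_TASKS:
        return "primary_local"   # raw paper data never leaves the machine
    if task in COMPANION_TASKS:
        return "companion_online"
    return "primary_local"       # default-deny: unknown tasks stay local
```

The default-deny branch matters: any task not explicitly cleared for the companion model stays local.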

### 2.2 Why Qwen3-8B, Not Qwen2.5-3B

| Metric | Qwen2.5-3B | Qwen3-8B | Improvement |
|--------|-----------|----------|-------------|
| AIME (math reasoning) | ~15% | ~45%+ | 3× |
| MATH-500 | ~85% | ~95%+ | +10 pts |
| JSON structural accuracy (SFT) | ~65% | ~80%+ | +15 pts |
| Context window | 32K | 128K | 4× |
| Hybrid thinking mode | No | Yes | New capability |
| VRAM at Q4 AWQ | ~2.5GB | ~5GB | Acceptable |
### 2.3 Alternative: Qwen3-30B-A3B MoE (The Stealth Option)

For users with 24GB+ VRAM who want maximum quality:
- 30B total parameters, only 3B activated per token (Mixture of Experts)
- ~16GB at Q4 quantization (all experts must stay resident; the 3B active parameters reduce compute per token, not memory)
- Quality equivalent to dense 14B+ models
- Apache 2.0 license
- Available: `Qwen/Qwen3-30B-A3B-Instruct-2507` (1M downloads)
### 2.4 Multimodal: Qwen3-VL-8B-Instruct

For figure/diagram processing (Layer 0):
- Same architecture as the text model, with a vision encoder
- Available: `Qwen/Qwen3-VL-8B-Instruct` (3.9M downloads)
- AWQ 4-bit: `cyankiwi/Qwen3-VL-8B-Instruct-AWQ-4bit` (~5GB)
- Handles: figure classification, diagram understanding, micrograph analysis
- Does NOT replace the plot digitizer for quantitative data
### 2.5 VLM for Multimodal Figures: Qwen3-VL-30B-A3B-Instruct

For maximum figure understanding with MoE efficiency:
- Available: `Qwen/Qwen3-VL-30B-A3B-Instruct` (1.5M downloads)
- AWQ: `QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ` (667K downloads)
- Only 3B active params → fits alongside the primary brain

---
## 3. Training Pipeline (4-Stage)

### Stage 1: SFT on Domain Data
```python
# Current implementation (train.py) – KEEP, but upgrade the base model
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",  # Upgraded from Qwen2.5-3B
    args=SFTConfig(
        output_dir="./research-os-sft",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_length=4096,  # Longer for paper sections
        assistant_only_loss=True,
        bf16=True,
        gradient_checkpointing=True,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
    ),
    train_dataset=expanded_dataset,  # 10K+ examples (up from 1,900)
    peft_config=LoraConfig(r=64, lora_alpha=16, target_modules="all-linear"),
)
trainer.train()
```

### Stage 2: DPO on Preference Pairs
```python
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

# Dataset: pairs of (correct extraction, incorrect extraction) for the same text
trainer = DPOTrainer(
    model="./research-os-sft",  # From stage 1
    args=DPOConfig(
        output_dir="./research-os-dpo",
        learning_rate=5e-7,
        num_train_epochs=1,
        max_length=4096,
        bf16=True,
        push_to_hub=True,
    ),
    train_dataset=preference_dataset,
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```

### Stage 3: GRPO with Epistemic Reward Functions

This is the critical stage that bakes JSON reliability and epistemic correctness into the model:
```python
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig
import json

# ── Reward Function 1: JSON Validity ──
def json_validity_reward(completions, **kwargs):
    """Binary reward: is the output valid JSON?"""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        try:
            json.loads(content)
            rewards.append(1.0)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
    return rewards

# ── Reward Function 2: Schema Compliance ──
REQUIRED_KEYS = {"text", "epistemic_tag", "confidence", "missing_fields", "status"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}

def schema_compliance_reward(completions, **kwargs):
    """Reward for matching the Research OS claim schema."""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        score = 0.0
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])

            for claim in claims:
                if not isinstance(claim, dict):
                    continue
                # Key presence: 0.3
                present_keys = set(claim.keys()) & REQUIRED_KEYS
                score += 0.3 * len(present_keys) / len(REQUIRED_KEYS)
                # Valid epistemic tag: 0.3
                if claim.get("epistemic_tag") in VALID_TAGS:
                    score += 0.3
                # Confidence in range: 0.2
                conf = claim.get("confidence", -1)
                if isinstance(conf, (int, float)) and 0 <= conf <= 1:
                    score += 0.2
                # Status consistency: 0.2
                missing = claim.get("missing_fields", [])
                status = claim.get("status", "")
                if (missing and status == "Incomplete") or (not missing and status == "Complete"):
                    score += 0.2

            if claims:
                score /= len(claims)
        except (json.JSONDecodeError, TypeError, AttributeError):
            score = 0.0
        rewards.append(score)
    return rewards

# ── Reward Function 3: Qualifier Preservation ──
HEDGING_WORDS = {"may", "might", "could", "suggests", "possibly", "potentially",
                 "appears", "seems", "likely", "unlikely", "not significant"}

def qualifier_preservation_reward(completions, prompts, **kwargs):
    """Reward for preserving hedging language from the source text."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        content = completion[0]["content"] if isinstance(completion, list) else completion
        prompt_text = prompt[0]["content"] if isinstance(prompt, list) else prompt

        # Find hedging words in the source
        source_hedges = {w for w in HEDGING_WORDS if w in prompt_text.lower()}
        if not source_hedges:
            rewards.append(0.5)  # Neutral if no hedging in the source
            continue

        # Check whether the hedging is preserved in the extraction
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])
            claim_text = " ".join(c.get("text", "") for c in claims if isinstance(c, dict)).lower()

            preserved = sum(1 for h in source_hedges if h in claim_text)
            rewards.append(preserved / len(source_hedges))
        except (json.JSONDecodeError, TypeError, AttributeError):
            rewards.append(0.0)
    return rewards

# ── GRPO Training ──
trainer = GRPOTrainer(
    model="./research-os-dpo",  # From stage 2
    reward_funcs=[
        json_validity_reward,           # Weight: 0.3
        schema_compliance_reward,       # Weight: 0.4
        qualifier_preservation_reward,  # Weight: 0.3
    ],
    args=GRPOConfig(
        output_dir="./research-os-grpo",
        learning_rate=1e-6,
        num_generations=8,
        max_completion_length=2048,
        bf16=True,
        gradient_checkpointing=True,
        logging_steps=10,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
        reward_weights=[0.3, 0.4, 0.3],
    ),
    train_dataset=prompt_dataset,  # "prompt" column with paper excerpts
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```

### Stage 4: Calibration Fine-Tuning (ConfTuner)

After GRPO, apply ConfTuner with a tokenized Brier-score loss to fix confidence calibration. This is a specialized fine-tuning pass that targets only the confidence output tokens.
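The Brier objective underlying this pass fits in a few lines. A minimal sketch, assuming a plain per-claim formulation (ConfTuner's actual tokenized loss operates on confidence tokens inside the decoder, which this deliberately abstracts away):

```python
# Brier score: mean squared gap between stated confidence and actual correctness.
# `predictions` are model confidences in [0, 1]; `outcomes` are 1 if the claim
# survived verification, else 0. Perfect calibration with perfect accuracy -> 0.
def brier_loss(predictions: list[float], outcomes: list[int]) -> float:
    assert len(predictions) == len(outcomes) and predictions
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)
```

A model that always says 0.5 scores 0.25 on any outcome, which is why the loss pushes confidence tokens toward honest extremes only when the model is actually right.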
---

## 4. Layer Specifications
### 4.0 Layer 0: Structural Ingestion Engine

**Purpose**: Convert PDF bundles into section-aware, bbox-annotated, quality-scored structured regions.

**Technology Stack**:

| Component | Tool | Purpose |
|-----------|------|---------|
| Layout detection | Marker (VikParuchuri/marker) | PDF → structured markdown with layout awareness |
| Math/equation | Nougat (facebookresearch/nougat) | Scientific PDFs → LaTeX equations |
| Bibliographic | GROBID | Headers, authors, citations, references |
| Region classifier | LayoutLMv3 or DocTR | Classify page regions: text, table, figure, equation |
| Plot digitizer | PlotDigitizer (algorithmic) | Quantitative plots → CSV of (x,y) coordinates |
| VLM for figures | Qwen3-VL-8B-Instruct Q4 AWQ | Semantic figure understanding |
| OCR quality | Per-span confidence scoring | Flag degraded regions |

**Output Schema** (per region):
```json
{
  "region_id": "REG_00042",
  "document_type": "main|supplement_1|supplement_2",
  "page": 5,
  "bbox": [72, 340, 540, 420],
  "region_type": "body_text|table|figure|equation|caption|header|reference|footnote",
  "section": "results",
  "subsection": "3.2_sensitivity_characterization",
  "content": {
    "text": "The LOD was 0.8 ± 0.03 fM (Table 2)",
    "markdown": "The LOD was 0.8 ± 0.03 fM ([Table 2](#table-2))",
    "parse_method": "marker",
    "parse_confidence": 0.95,
    "ocr_source": false
  },
  "cross_references": [
    {"ref_text": "Table 2", "ref_type": "table", "resolved_to": "REG_00038", "verified": true}
  ],
  "extraction_status": "extractable|low_confidence|unextractable",
  "quality_flags": [],
  "figures": {
    "detected": true,
    "figure_type": "scatter_plot|bar_chart|diagram|micrograph|schematic",
    "digitizable": true,
    "digitized_data": null
  }
}
```

**Chunking Strategy**: Section-aware, NOT page-based.
1. Marker identifies section boundaries (Introduction, Methods, Results subsections)
2. Chunk by section, with a 1-paragraph overlap into the preceding and following sections
3. Tables are always kept whole (never split across chunks)
4. Figure + caption are always kept together
5. Maximum chunk size: 4096 tokens (the model context allows it)
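The rules above can be sketched as a small grouping pass over Layer 0 regions. A sketch, assuming regions arrive in reading order; token counting is stubbed with a whitespace split, and tables or figure+caption pairs survive intact because each is a single atomic region:

```python
# Section-aware chunker sketch. Region dicts follow the Layer 0 schema
# ("section", "text"); a real version would use the model tokenizer and
# add the 1-paragraph overlap between neighboring chunks.
def chunk_sections(regions, max_tokens=4096):
    """Group regions into chunks that never cross a section boundary."""
    chunks, current, current_section, size = [], [], None, 0
    for reg in regions:
        n_tokens = len(reg["text"].split())  # stand-in for a real tokenizer
        new_section = reg["section"] != current_section
        if current and (new_section or size + n_tokens > max_tokens):
            chunks.append(current)  # flush on section change or overflow
            current, size = [], 0
        current.append(reg)
        current_section, size = reg["section"], size + n_tokens
    if current:
        chunks.append(current)
    return chunks
```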
**Paper Bundle Handling**:
```
Input: {
  "main_pdf": "path/to/paper.pdf",
  "supplements": ["path/to/supplement_1.pdf", "path/to/supplement_data.xlsx"],
  "code_repo": "https://github.com/author/repo",
  "dataset": "https://zenodo.org/record/12345"
}
```
### 4.1 Layer 1: Entity Resolution

**Purpose**: Normalize entities, resolve citations, check retractions, establish version lineage.

**Components**:
```
Entity Normalizer
├── Gene/protein names → UniProt ID
├── Chemical names → PubChem CID
├── Disease names → MeSH ID
├── Assay names → BAO ontology
├── Abbreviations → canonical form (LRU cache)
└── Custom domain ontology (user-extensible)

Citation Chain Resolver
├── In-text "[32]" → reference list → DOI
├── DOI → CrossRef metadata
├── Check: is the cited paper in the knowledge base?
├── If yes: link claim to original source
├── If no: flag as "citation_orphan" for potential ingestion
└── Classify: primary claim vs inherited citation

Version of Record (VoR) Lineage
├── Before ingestion: query DOI/arXiv for the version chain
├── If a preprint exists in the DB and the VoR is arriving: supersede
├── If the VoR exists and an erratum is arriving: amend the specific claims
├── If retraction: invalidate ALL claims, propagate the penalty
└── Store the full lineage: preprint_doi → vor_doi → errata → retraction

Retraction Checker
├── CrossRef "update-to" relationship
├── Retraction Watch database (periodic sync via the companion model)
└── Propagate retraction status through citation chains
```
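The "Abbreviations → canonical form (LRU cache)" step maps surface forms to canonical names behind a cache. A sketch with a placeholder lookup table; a real deployment would back this with the versioned domain ontology:

```python
from functools import lru_cache

# Placeholder lookup table; entries here are illustrative examples only.
_ABBREVIATIONS = {
    "lod": "limit of detection",
    "pbs": "phosphate-buffered saline",
}

@lru_cache(maxsize=4096)
def normalize_abbreviation(token: str) -> str:
    """Return the canonical form if known, else the token unchanged."""
    return _ABBREVIATIONS.get(token.lower(), token)
```

Unknown tokens pass through unchanged, so the normalizer never destroys information it cannot resolve.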

### 4.2 Layer 2: Qualified Extraction

**Purpose**: Extract claims with full epistemic qualification using the AI Model Council.

**Council Architecture** (Parallel-Then-Merge):
```
Round 1 (PARALLEL – no visibility between members):

  Query Planner   Extractor 1    Extractor 2      Critic
  (decompose)     (Qwen3-8B)     (if heterog.)    (adversarial)
       |               |              |               |
       v               v              v               v
  sub-queries       claims_A       claims_B        critique

Round 2 (DEBATE – see tags and reasoning, NOT confidence):
  All members see each other's epistemic tags and reasoning chains
  Each member can revise their classification
  Confidence scores remain HIDDEN (prevents anchoring)

Round 3 (SYNTHESIS – Chairman):
  The Chairman sees everything, including confidence
  Applies the completeness penalty (code-enforced, not prompt-instructed)
  Resolves disagreements with documented reasoning
  Tags each claim with council_vote_distribution
```
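Round 1's fan-out can be orchestrated with a thread pool, since each member is an independent model call with no shared state. A sketch with placeholder member callables (the member names and `run_round_one` helper are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_round_one(members: dict, chunk: str) -> dict:
    """Run every council member on the same chunk concurrently.

    Members receive only the chunk, never each other's outputs, which
    enforces the no-visibility rule of Round 1.
    """
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        futures = {name: pool.submit(fn, chunk) for name, fn in members.items()}
        return {name: fut.result() for name, fut in futures.items()}
```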

**Epistemic Separation Engine**:

| Section | Epistemic Default | Confidence Modifier |
|---------|-------------------|---------------------|
| Results (with statistics) | Fact (if p < threshold) | 1.0 |
| Results (narrative) | Interpretation | 0.85 |
| Methods | Protocol metadata (not a claim) | N/A |
| Abstract | Interpretation (forced) | 0.7 penalty |
| Discussion | Interpretation or Hypothesis | 0.75 penalty |
| Conclusion | Cross-check against Results | 0.8 if supported, 0.5 if not |
| Supplement | Same as main-body section rules | 1.0 (no penalty for supplement source) |
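The table above can be applied as a plain lookup. A sketch; the key names and the conservative 0.75 fallback for unrecognized sections are assumptions, while the multiplier values come from the table:

```python
# Section -> confidence multiplier, per the Epistemic Separation table.
# Key names and the fallback value are illustrative choices.
SECTION_MODIFIERS = {
    "results_statistical": 1.0,
    "results_narrative": 0.85,
    "abstract": 0.7,
    "discussion": 0.75,
    "conclusion_supported": 0.8,
    "conclusion_unsupported": 0.5,
    "supplement": 1.0,
}

def section_modifier(section: str) -> float:
    """Confidence multiplier for a claim's source section."""
    return SECTION_MODIFIERS.get(section, 0.75)  # conservative default
```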
**Constrained Decoding** (Guidance engine):
```python
from guidance import models, gen, select

TAGS = ["Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"]

lm = models.Transformers("./research-os-grpo")  # Local model

# section_text / section_name come from the Layer 0 chunker
output = lm + f"""
Analyze this scientific text and extract claims.

Text: {section_text}
Section: {section_name}

<reasoning>{gen('reasoning', max_tokens=500)}</reasoning>

Claims:
[
  {{
    "text": "{gen('claim_text', max_tokens=200)}",
    "epistemic_tag": "{select(TAGS, name='tag')}",
    "confidence_components": {{
      "evidence_strength": {gen('evidence_strength', regex=r'0\.[0-9][0-9]?[0-9]?')},
      "qualifiers": ["{gen('qualifiers', max_tokens=100)}"]
    }},
    "source_quote": "{gen('source_quote', max_tokens=200)}",
    "source_page": {gen('source_page', regex=r'[0-9]+')},
    "is_null_result": {select(['true', 'false'], name='is_null')},
    "is_inherited_citation": {select(['true', 'false'], name='is_inherited')}
  }}
]
"""
# output["tag"] is GUARANTEED to be one of TAGS
# output["is_null"] is GUARANTEED to be "true" or "false"
```

**Claim Schema v2** (expanded from v1):
```json
{
  "claim_id": "CLM_00042",
  "text": "The LOD was 0.8 fM in 10 mM PBS",
  "epistemic_tag": "Fact",
  "confidence": 0.855,
  "confidence_components": {
    "evidence_strength": 0.9,
    "study_quality_weight": 1.0,
    "journal_tier_weight": 1.0,
    "completeness_penalty": 1.0,
    "section_modifier": 1.0,
    "qualifier_penalty": 0.95
  },
  "qualifiers": ["in 10 mM PBS only", "n=5"],
  "missing_fields": [],
  "status": "Complete",
  "is_null_result": false,
  "is_inherited_citation": false,
  "causal_direction": "observed_correlation",
  "statistical_evidence": {
    "p_value": 0.001,
    "effect_size": 2.1,
    "effect_size_type": "cohens_d",
    "sample_size": 5,
    "confidence_interval": [0.6, 1.0],
    "practical_significance": true
  },
  "source_quote": "The limit of detection was determined to be 0.8 fM using the 3σ/slope method.",
  "source_page": 5,
  "source_bbox": [72, 340, 540, 365],
  "source_section": "results",
  "source_doi": "10.1234/example",
  "council_vote": {
    "extractor_1": {"tag": "Fact", "reasoning": "Direct measurement with statistics"},
    "extractor_2": {"tag": "Fact", "reasoning": "Quantitative with clear methodology"},
    "critic": {"tag": "Fact", "reasoning": "Supported by Table 2 data"},
    "chairman": {"tag": "Fact", "reasoning": "Unanimous agreement, strong statistics"}
  },
  "granularity": "atomic",
  "parent_claim_id": null,
  "sub_claims": [],
  "ontology_version": "quantum_bio_v1",
  "pipeline_version": "2.1.0",
  "taxonomy_version": "quantum_bio_v1",
  "extraction_timestamp": "2026-04-23T10:30:00Z"
}
```
|
| 565 |
+
|
| 566 |
+
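The `confidence` field above is the fixed-point (×1000) product of the `confidence_components`. A minimal sketch of that arithmetic (the helper name is hypothetical, not part of the pipeline):

```python
def compose_confidence(components: dict) -> float:
    """Multiply fixed-point (x1000) components; 1000 acts as the identity."""
    result = 1000
    for value in components.values():
        result = result * value // 1000  # stay in integer fixed-point
    return result / 1000  # convert to float only for display

components = {
    "evidence_strength": 900,
    "study_quality_weight": 1000,
    "journal_tier_weight": 1000,
    "completeness_penalty": 1000,
    "section_modifier": 1000,
    "qualifier_penalty": 950,
}
print(compose_confidence(components))  # 0.855, matching the "confidence" field above
```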
### 4.3 Layer 3: Canonicalization

**Purpose**: Deduplicate claims, merge aliases, aggregate evidence, track temporal versions.

```
New claim arrives →
1. Embed claim text (local embedding model or Qwen3-8B last-hidden-state)
2. Search existing canonical claims (cosine similarity)
3. If similarity > 0.85:
   ├── MERGE: Add new source as evidence for existing canonical claim
   ├── Update evidence_count, source_list, confidence (re-aggregate)
   ├── If confidence_components differ significantly: flag for human review
   └── Store alias mapping: new_claim_id → canonical_claim_id
4. If similarity 0.70-0.85:
   ├── FLAG as "potential duplicate → review recommended"
   └── Show both claims in review queue with similarity score
5. If similarity < 0.70:
   └── CREATE new canonical claim
```

**Temporal Versioning**:
```
canonical_claim:
  version_history: [
    {version: 1, source: "preprint_2024", confidence: 0.65, date: "2024-03"},
    {version: 2, source: "vor_2024", confidence: 0.85, date: "2024-09"},
    {version: 3, source: "new_study_2025", confidence: 0.90, date: "2025-02"}
  ]
  current_version: 3
  supersedes: null
  superseded_by: null
```

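The similarity thresholds above can be sketched as a small decision function (names hypothetical; the real deduplicator would also store alias mappings and re-aggregate confidence):

```python
def dedup_action(similarity: float) -> str:
    """Map cosine similarity against the nearest canonical claim to an action."""
    if similarity > 0.85:
        return "merge"             # attach as new evidence for the canonical claim
    if similarity >= 0.70:
        return "flag_for_review"   # potential duplicate, human decides
    return "create_new"            # distinct canonical claim

print(dedup_action(0.92))  # merge
print(dedup_action(0.78))  # flag_for_review
print(dedup_action(0.40))  # create_new
```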
### 4.4 Layer 4: Knowledge Graph

**Implementation**: SQLite-backed adjacency list (NOT Neo4j; keeps the system local and zero-dependency).

**Schema**:

```sql
CREATE TABLE graph_nodes (
    node_id TEXT PRIMARY KEY,        -- canonical_claim_id or entity_id
    node_type TEXT NOT NULL,         -- claim | entity | method | condition
    label TEXT NOT NULL,
    properties TEXT,                 -- JSON
    created_at TEXT NOT NULL
);

CREATE TABLE graph_edges (
    edge_id TEXT PRIMARY KEY,
    source_node TEXT NOT NULL,
    target_node TEXT NOT NULL,
    edge_type TEXT NOT NULL,         -- supports | refutes | extends | depends_on |
                                     -- supersedes | blocks | investigative_hypothesis |
                                     -- method_uses | condition_applies
    confidence INTEGER NOT NULL,     -- Fixed-point ×1000
    evidence_sources TEXT,           -- JSON array of source DOIs
    is_inferred INTEGER DEFAULT 0,   -- 0=observed, 1=inferred (transitive)
    inference_chain TEXT,            -- JSON: hop details if inferred
    method_compatible INTEGER,       -- NULL=unchecked, 0=incompatible, 1=compatible
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    FOREIGN KEY(source_node) REFERENCES graph_nodes(node_id),
    FOREIGN KEY(target_node) REFERENCES graph_nodes(node_id)
);

-- Indexes for fast graph traversal
CREATE INDEX idx_edges_source ON graph_edges(source_node);
CREATE INDEX idx_edges_target ON graph_edges(target_node);
CREATE INDEX idx_edges_type ON graph_edges(edge_type);
```

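A quick usage sketch against this schema with Python's built-in `sqlite3` (in-memory database; node and edge values are illustrative). Note the default query filters to observed edges only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
  node_id TEXT PRIMARY KEY, node_type TEXT NOT NULL,
  label TEXT NOT NULL, properties TEXT, created_at TEXT NOT NULL
);
CREATE TABLE graph_edges (
  edge_id TEXT PRIMARY KEY, source_node TEXT NOT NULL, target_node TEXT NOT NULL,
  edge_type TEXT NOT NULL, confidence INTEGER NOT NULL, evidence_sources TEXT,
  is_inferred INTEGER DEFAULT 0, inference_chain TEXT, method_compatible INTEGER,
  created_at TEXT NOT NULL, updated_at TEXT NOT NULL,
  FOREIGN KEY(source_node) REFERENCES graph_nodes(node_id),
  FOREIGN KEY(target_node) REFERENCES graph_nodes(node_id)
);
""")

now = "2026-04-23T10:30:00Z"
conn.executemany(
    "INSERT INTO graph_nodes VALUES (?, ?, ?, NULL, ?)",
    [("CLM_1", "claim", "LOD was 0.8 fM", now),
     ("CLM_2", "claim", "LOD below 1 fM", now)],
)
conn.execute(
    "INSERT INTO graph_edges VALUES "
    "('E_1', 'CLM_1', 'CLM_2', 'supports', 855, NULL, 0, NULL, NULL, ?, ?)",
    (now, now),
)

# Default query: observed edges only (inferred edges require an explicit opt-in)
rows = conn.execute(
    "SELECT edge_type, confidence FROM graph_edges WHERE is_inferred = 0"
).fetchall()
print(rows)  # [('supports', 855)]
```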
**Edge Types**:

| Type | Meaning | Confidence Rule |
|------|---------|----------------|
| `supports` | Claim A provides evidence for Claim B | From source text, observed |
| `refutes` | Claim A contradicts Claim B | From source text or conflict detection |
| `extends` | Claim A adds conditions/parameters to B | Section analysis |
| `depends_on` | Claim A assumes Claim B is true | Citation chain analysis |
| `supersedes` | Claim A replaces older Claim B (newer data) | Temporal versioning |
| `blocks` | Null finding: no evidence of relationship | Null result extraction |
| `investigative_hypothesis` | Inferred multi-hop (NOT observed) | min(hop_confidences) × 0.5 |

**Transitive Inference Constraints**:
- NEVER auto-generate `supports` across multiple hops
- Only `investigative_hypothesis` edges for multi-hop
- Require method_compatible=1 for each hop before generating inference
- Default queries return observed edges only
- `include_inferred=True` flag required for graph queries that include inferences

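The `investigative_hypothesis` confidence rule and its method-compatibility gate can be sketched together (helper name hypothetical; confidences in fixed-point ×1000):

```python
def infer_hypothesis_confidence(hops):
    """Fixed-point (x1000) confidence for an inferred multi-hop edge.

    Returns None when any hop fails the method-compatibility gate;
    otherwise min(hop confidences) x 0.5, per the edge-type table.
    """
    if any(hop.get("method_compatible") != 1 for hop in hops):
        return None  # never infer across method-incompatible hops
    return min(hop["confidence"] for hop in hops) * 500 // 1000

hops = [{"confidence": 800, "method_compatible": 1},
        {"confidence": 900, "method_compatible": 1}]
print(infer_hypothesis_confidence(hops))  # 400

# One unchecked hop blocks the inference entirely
print(infer_hypothesis_confidence([{"confidence": 800, "method_compatible": 0}]))  # None
```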
**Gap Analysis Protocol**:
```python
from itertools import combinations

def find_gaps(self, domain_id: str) -> list:
    """Find structural holes in the knowledge graph."""
    # 1. Get all entities in domain
    entities = self.get_entities(domain_id)
    max_degree = max((self.get_degree(e.id) for e in entities), default=1)
    gaps = []

    # 2. For each entity pair in same domain
    for a, b in combinations(entities, 2):
        # 3. Check if an edge already exists
        edges = self.get_edges(a.id, b.id)
        if not edges:
            # 4. Check if both are well-connected (dense neighborhood)
            a_degree = self.get_degree(a.id)
            b_degree = self.get_degree(b.id)
            if a_degree > 3 and b_degree > 3:
                # 5. This is a high-value gap
                info_gain = (a_degree + b_degree) / max_degree
                gaps.append({
                    "entity_a": a, "entity_b": b,
                    "information_gain": info_gain,
                    "suggested_action": "experiment" if info_gain > 0.7 else "literature_search"
                })

    return sorted(gaps, key=lambda g: -g["information_gain"])
```

### 4.5 Layer 5: Calibrated Scoring

**Purpose**: Compute confidence using CODE, not LLM. Three separate scores.

```python
def compute_claim_scores(claim: dict, source: dict, section: str) -> dict:
    """
    Code-computed scoring. The LLM provides COMPONENTS,
    the code computes the FINAL SCORES.

    The LLM NEVER sets the final confidence directly.
    """
    # ── Score 1: Evidence Quality ──
    evidence_strength = claim["confidence_components"]["evidence_strength"]  # From LLM
    study_quality = taxonomy.get_weight(source["study_type"], domain_id)     # From taxonomy
    journal_tier = JOURNAL_TIER_WEIGHTS[source["journal_tier"]]              # From config
    completeness = 700 if claim["missing_fields"] else 1000                  # Binary: code enforced
    section_mod = SECTION_MODIFIERS[section]                                 # From config

    # Fixed-point multiplication chain
    evidence_quality = (evidence_strength * study_quality // 1000
                        * journal_tier // 1000
                        * completeness // 1000
                        * section_mod // 1000)

    # ── Score 2: Claim Truth Likelihood ──
    # Based on evidence quality + source count + conflict status
    source_count_bonus = min(claim["evidence_count"] * 50, 200)  # Max +0.2 for multiple sources
    conflict_penalty = -300 if claim.get("has_active_conflict") else 0
    null_evidence_penalty = -200 if claim.get("has_null_evidence") else 0

    truth_likelihood = min(1000, max(0,
        evidence_quality + source_count_bonus + conflict_penalty + null_evidence_penalty
    ))

    # ── Score 3: Qualifier Strength ──
    # How definitive is the claim's language?
    qualifier_count = len(claim.get("qualifiers", []))
    is_null = claim.get("is_null_result", False)
    is_inherited = claim.get("is_inherited_citation", False)

    qualifier_strength = 1000
    if qualifier_count > 0:
        qualifier_strength -= qualifier_count * 100  # -0.1 per qualifier
    if is_null:
        qualifier_strength = min(qualifier_strength, 500)  # Cap at 0.5 for null results
    if is_inherited:
        qualifier_strength -= 200  # -0.2 for inherited citations
    qualifier_strength = max(0, qualifier_strength)

    # ── Statistical Evidence Gate ──
    stats = claim.get("statistical_evidence", {})
    if stats.get("effect_size") is not None:
        effect = stats["effect_size"]
        sample_n = stats.get("sample_size", 0)

        # Large N + tiny effect = statistically significant but practically meaningless
        if sample_n > 1000 and abs(effect) < 0.1:
            # Override: this is NOT practically significant
            evidence_quality = min(evidence_quality, 400)  # Cap at 0.4
            claim["practical_significance"] = False

    # ── Parser Confidence Propagation ──
    parse_conf = claim.get("parse_confidence", 1000)
    evidence_quality = min(evidence_quality, parse_conf)  # Parser uncertainty CAPS claim

    return {
        "evidence_quality": evidence_quality,       # Fixed-point ×1000
        "truth_likelihood": truth_likelihood,       # Fixed-point ×1000
        "qualifier_strength": qualifier_strength,   # Fixed-point ×1000
        "composite_confidence": (evidence_quality + truth_likelihood + qualifier_strength) // 3,
        "practical_significance": claim.get("practical_significance", True),
    }
```

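The statistical-evidence gate inside the function above can be isolated as a self-contained sketch (hypothetical helper; evidence quality in fixed-point ×1000):

```python
def apply_statistical_gate(evidence_quality, effect_size, sample_size):
    """Large N + tiny effect: statistically significant but practically meaningless.

    Returns (possibly capped evidence quality, practical_significance flag).
    """
    if sample_size > 1000 and abs(effect_size) < 0.1:
        return min(evidence_quality, 400), False  # cap at 0.4, flag as not practical
    return evidence_quality, True

print(apply_statistical_gate(850, 0.05, 50000))  # (400, False)
print(apply_statistical_gate(850, 2.1, 5))       # (850, True)
```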
### 4.6 Layer 6: Evaluation

**Evaluation Pipeline** (runs in CI/CD on every prompt/model/taxonomy change):

```
1. STRUCTURAL TESTS (existing 119 tests → code correctness)
   └── pytest tests/ → all pass?

2. GOLDEN DATASET REGRESSION (versioned annotations)
   ├── Extraction recall ≥ 70%
   ├── Hallucination rate ≤ 10%
   ├── Epistemic accuracy ≥ 60%
   ├── Qualifier preservation rate ≥ 80% (NEW)
   └── Null result detection rate ≥ 50% (NEW)

3. LLM-AS-JUDGE (faithfulness & grounding)
   ├── Faithfulness: does extracted claim appear in source text?
   ├── Grounding: can claim be traced to specific source quote?
   ├── Tag correctness: does epistemic tag match expert judgment?
   ├── Qualifier preservation: are hedging words maintained?
   └── Run on 5 golden papers, 3 times each (stochastic check)

4. CALIBRATION CHECK (monthly)
   ├── Brier score from calibration_log
   ├── Alert if ECE > 0.25
   └── Trigger ConfTuner re-training if needed

5. HIDDEN HOLDOUT (never seen during development)
   ├── 3 papers reserved, never used in training or golden set
   ├── Evaluated quarterly
   └── Detects benchmark overfitting
```

**Versioned Annotation Guidelines**:
```
/evaluation/
├── guidelines_v1.0.md           # Annotation rules (version controlled)
├── golden_dataset/
│   ├── paper_001.json           # Annotated under guidelines v1.0
│   ├── paper_002.json           # Annotated under guidelines v1.0
│   └── paper_006.json           # Annotated under guidelines v1.1
├── frozen_anchors/              # NEVER re-annotated
│   ├── paper_001_frozen.json
│   └── paper_002_frozen.json
└── holdout/                     # NEVER seen during development
    ├── paper_H1.json
    └── paper_H2.json
```

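The monthly Brier check can be sketched as follows. The `calibration_log` record shape is an assumption here: pairs of (predicted confidence, observed 0/1 outcome):

```python
def brier_score(records):
    """Mean squared gap between predicted confidence and observed 0/1 outcome."""
    return sum((p - outcome) ** 2 for p, outcome in records) / len(records)

log = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1)]  # illustrative entries
print(round(brier_score(log), 3))  # 0.175
```

Brier alone catches gross miscalibration; the ECE > 0.25 alert would additionally require binning predictions by confidence and comparing per-bin accuracy.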
### 4.7 Layer 7: Provenance & Reproducibility

**Output Lineage** (every claim tagged):
```json
{
  "pipeline_version": "2.1.0",
  "model_checkpoint": "research-os-grpo-v2-step-5000",
  "parser_version": "marker-1.2.0",
  "taxonomy_version": "quantum_bio_v1",
  "prompt_hash": "sha256:a3b4c5...",
  "extraction_timestamp": "2026-04-23T10:30:00Z",
  "guidance_schema_version": "1.0"
}
```

**Security Sandbox** (for repository validation):
```
┌─── SANDBOX (isolated from main system) ──────────────────┐
│ • Timeout: 60 seconds max per URL check                  │
│ • Network: HTTP GET only, no POST/PUT/DELETE             │
│ • Download limit: 100MB per artifact                     │
│ • No code execution (dry-run validation only)            │
│ • Actual code execution requires human authorization     │
│ • Credential isolation: no access to main DB or API keys │
└──────────────────────────────────────────────────────────┘
```

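A minimal sketch of the sandbox's fetch rules using only the standard library (function name and error handling are illustrative; real isolation would also need process and credential separation):

```python
import urllib.request

TIMEOUT_S = 60                  # 60 seconds max per URL check
MAX_BYTES = 100 * 1024 * 1024   # 100MB download limit per artifact

def sandboxed_fetch(url: str) -> bytes:
    """GET-only fetch honoring the sandbox limits; nothing fetched is executed."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("sandbox allows plain HTTP(S) GET only")
    req = urllib.request.Request(url, method="GET")
    with urllib.request.urlopen(req, timeout=TIMEOUT_S) as resp:
        data = resp.read(MAX_BYTES + 1)   # read one byte past the cap to detect overflow
        if len(data) > MAX_BYTES:
            raise ValueError("artifact exceeds 100MB sandbox limit")
        return data
```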
**Epistemic Embargo** (for IP protection):
```
User creates "Private Graph" →
All claims extracted in this mode go to private subgraph →
Private subgraph is NOT visible to other users / companion agents →
After paper submission: user clicks "Merge to Lab Graph" →
Claims move from private to shared graph with full provenance
```

---

## 5. UI Architecture

### 5.1 Courtroom UI (Conflict Resolution)

```
Default View (Review Queue):
  ⚠️ 3-way conflict detected → Debye screening threshold
  Papers: Chen 2022, Nakamura 2023, Williams 2024
  Comparability confidence: 0.58 (method differences detected)
  [Review] [Defer] [Dismiss]

Expanded View (Courtroom → click to open):
  ┌─────────────┬─────────────┬─────────────┐
  │ Chen 2022   │ Nakamura 23 │ Williams 24 │
  │ ACS Nano T1 │ Biosens. T1 │ Sensors T3  │
  ├─────────────┼─────────────┼─────────────┤
  │ Claim text  │ Claim text  │ Claim text  │
  │ (nestable)  │ (nestable)  │ (nestable)  │
  ├─────────────┼─────────────┼─────────────┤
  │ Method box  │ Method box  │ Method box  │
  │ N=5 p<.001  │ N=12 p<.01  │ N=3 p=.12   │
  │ [PDF 📄]    │ [PDF 📄]    │ [PDF 📄]    │
  └─────────────┴─────────────┴─────────────┘

System Analysis (Level 5 → unverified):
  "These claims are not directly comparable..."
  Confidence in analysis: 0.62

Council Votes: Ext1: scope_diff | Ext2: value_mismatch | Critic: scope_diff

[Agree] [Override with custom] [Defer → need more info]

⚠️ Missing competitor evidence:
  "3 papers cited by these sources are not yet ingested"
  [Ingest Park 2023] [Ingest Liu 2024] [Ingest Fernandez 2023]
```

### 5.2 Progressive Disclosure Levels

```
Level 0: Dashboard
  Epistemic Health Score per claim cluster
  Today's review queue (priority-ranked)

Level 1: Claim Detail
  Text + tag + composite confidence + source
  [Expand to see scoring breakdown]

Level 2: Scoring Breakdown
  3 separate scores (evidence, truth, qualifier)
  Statistical evidence if available
  Parser confidence for this region

Level 3: Provenance Chain
  Source quote + page + bbox
  Council vote distribution
  Pipeline version + model checkpoint

Level 4: Graph Neighborhood
  2-hop subgraph around this claim
  Typed edges visible
  Inferred edges dashed + labeled

Level 5: Full Debug
  Raw LLM outputs from each council member
  Token-level confidence distribution
  Parse regions and quality flags
```

### 5.3 Manual Synthesis Mode

```
[Toggle] 🧠 Manual Synthesis Mode: ON

In this mode:
  ✅ Claims displayed (text + source)
  ✅ Organized by topic clusters
  ❌ NO confidence scores shown
  ❌ NO conflict flags shown
  ❌ NO gap analysis shown
  ❌ NO system suggestions

The researcher draws connections manually.
Then switches back to compare with system's analysis.
```

---

## 6. Local Deployment

### 6.1 Minimal Setup (16GB VRAM)

```bash
# 1. Install Ollama (simplest local LLM server)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull quantized model (after fine-tuning and uploading GGUF)
ollama pull nkshirsa/research-os-brain:q4_k_m

# 3. Verify it's running
curl http://localhost:11434/api/generate -d '{"model": "research-os-brain:q4_k_m", "prompt": "test"}'

# 4. Start the Research OS
pip install -r requirements.txt
python -m phd_research_os.serve --model ollama://research-os-brain:q4_k_m --port 8080

# 5. Open UI
# http://localhost:8080
```

### 6.2 VRAM Budget

```
Qwen3-8B Q4 AWQ weights:       ~5.0 GB
PolarQuant KV cache (128K):    ~3.8 GB
Qwen3-VL-8B Q4 (for figures):  ~5.0 GB (loaded on-demand, not persistent)
Guidance engine overhead:      ~0.5 GB
ChromaDB embeddings:           ~0.5 GB
──────────────────────────────────────
Total (text only):             ~9.8 GB → fits 16GB GPU
Total (with VLM loaded):      ~14.8 GB → fits 16GB GPU (tight)
Total (with VLM on-demand):    ~9.8 GB → swap VLM in/out per figure
```

---

## 7. Data Flow (Complete Pipeline)

```
PDF Bundle arrives
        │
        ▼
LAYER 0: Structural Ingestion
├── Marker: layout-aware markdown with section boundaries
├── Nougat: equations → LaTeX (routed by region classifier)
├── GROBID: references → structured citations
├── Figure regions → classify → VLM (semantic) or Digitizer (quantitative)
├── Per-region quality scoring (parse_confidence, ocr_confidence)
├── Cross-reference verification (Figure 3 → correct figure object?)
└── Output: list of annotated regions with bbox, section, quality
        │
        ▼
LAYER 1: Entity Resolution
├── Normalize entities (gene names, chemicals, assays → canonical IDs)
├── Resolve in-text citations ([32] → DOI → metadata)
├── Check VoR lineage (is this a preprint we already have?)
├── Check retraction status (CrossRef + Retraction Watch)
└── Tag: primary vs inherited claims
        │
        ▼
LAYER 2: Qualified Extraction (AI Model Council)
├── Round 1 (parallel): Query Planner + 2 Extractors + Critic
│     Each independently processes section-aware chunks
│     Guidance engine enforces: valid JSON, valid tags, valid ranges
│     Section modifier applied (Abstract=0.7, Results=1.0, Discussion=0.75)
├── Round 2 (debate): Share tags + reasoning (NOT confidence)
├── Round 3 (chairman): Synthesize final claims
│     Apply completeness penalty (code-enforced: 0.7 if missing fields)
│     Preserve qualifiers from source text
│     Extract statistical evidence (N, p, d, CI)
│     Tag null results, inherited citations, causal direction
└── Output: list of qualified claims with full provenance
        │
        ▼
LAYER 3: Canonicalization
├── Embed each new claim
├── Compare against existing canonical claims (cosine > 0.85 = merge)
├── Merge: add source as evidence, update confidence aggregation
├── Create: new canonical claim with first source
└── Temporal versioning: if same claim from VoR supersedes preprint version
        │
        ▼
LAYER 4: Knowledge Graph
├── Insert claim as graph node
├── Create edges from citation analysis (supports, depends_on)
├── Run conflict detector (keyword + embedding similarity for candidates)
├── Council evaluates candidate conflicts → typed edges (refutes, scope_diff)
├── Check for null evidence → blocking edges
├── Update method-compatibility metadata on edges
├── Cluster related conflicts into case files
└── Run gap analysis (if in Research Landscape mode)
        │
        ▼
LAYER 5: Calibrated Scoring (CODE-COMPUTED)
├── evidence_quality = evidence × quality × tier × completeness × section
├── truth_likelihood = evidence_quality + source_bonus - conflict_penalty
├── qualifier_strength = 1.0 - qualifier_count×0.1 - null_penalty - inherited_penalty
├── Statistical evidence gate: large N + tiny effect → cap confidence
├── Parser confidence propagation: parse_confidence caps evidence_quality
└── Store all 3 scores + composite on claim
        │
        ▼
LAYER 6: Evaluation (on config change)
├── Regression gate against golden dataset
├── LLM-as-Judge faithfulness + grounding check
├── Brier score monitoring (monthly)
└── Hidden holdout benchmark (quarterly)
        │
        ▼
LAYER 7: Provenance
├── Tag claim with full pipeline version lineage
├── Store bbox + source quote for UI traceability
└── Export: Obsidian vault, Courtroom UI, CSV, BibTeX
```

---

## 8. Implementation Phases (Aligned with PhD Timeline)

### Phase A: Foundation (Weeks 1-6) – MUST BE FIRST

| Week | Task | Deliverable |
|------|------|-------------|
| 1-2 | Integrate Marker for PDF → structured markdown | Section-aware regions with bbox |
| 3 | Add Nougat routing for equation-heavy regions | LaTeX preservation |
| 4 | Implement section-aware chunking (replace page-based) | Semantic chunks |
| 5 | Add quality scoring per-region | parse_confidence on every span |
| 6 | Integrate Guidance engine for constrained decoding | Guaranteed valid JSON output |

### Phase B: Identity (Weeks 7-12)

| Week | Task | Deliverable |
|------|------|-------------|
| 7-8 | Claim canonicalization with embedding dedup | Canonical registry |
| 9 | Entity normalization (abbreviations, synonyms) | Ontology mapper |
| 10-11 | Citation chain resolution ([32] → DOI) | Primary vs inherited tagging |
| 12 | VoR lineage detection | Preprint → VoR superseding |

### Phase C: Structure (Weeks 13-20)

| Week | Task | Deliverable |
|------|------|-------------|
| 13-14 | SQLite-backed knowledge graph with typed edges | Graph schema + CRUD |
| 15-16 | Qualifier preservation + null result handling | Blocking edges |
| 17-18 | Method-compatibility layer | Comparability confidence |
| 19-20 | Conflict clustering into case files | Case file UI |

### Phase D: Calibration (Weeks 21-26)

| Week | Task | Deliverable |
|------|------|-------------|
| 21-22 | Epistemic Separation Engine (section modifiers) | Section-aware scoring |
| 23-24 | Statistical evidence extraction (N, p, d, CI) | Practical significance gate |
| 25-26 | GRPO training with epistemic reward functions | Trained model v2 |

### Phase E: Judgment (Weeks 27-32)

| Week | Task | Deliverable |
|------|------|-------------|
| 27-28 | Courtroom UI with PDF.js bounding box viewer | Provenance display |
| 29-30 | Council parallel-then-merge architecture | Hidden confidence protocol |
| 31-32 | Conflict clustering + case file resolution | Batch conflict resolution |

### Phase F: Longevity (Ongoing, PhD Year 1+)

| Task | Trigger |
|------|---------|
| Versioned ontology with backward-compatible queries | 3rd taxonomy update |
| VoR lineage tracking | First preprint → VoR encounter |
| Ongoing Brier calibration monitoring | 50+ calibration data points |
| Gold-standard drift detection | 2nd annotation batch |
| Gap Analysis Protocol | 100+ papers ingested |
| Manual Synthesis Mode | Thesis writing phase |

---

## 9. File Structure (v2.0)
|
| 1113 |
+
|
| 1114 |
+
```
|
| 1115 |
+
phd-research-os/
|
| 1116 |
+
βββ SYSTEM_DESIGN.md # THIS DOCUMENT
|
| 1117 |
+
βββ BLINDSPOT_AUDIT_COMPLETE.md # 87-blindspot audit
|
| 1118 |
+
β
|
| 1119 |
+
βββ phd_research_os/ # Core Python package
|
| 1120 |
+
β βββ __init__.py
|
| 1121 |
+
β β
|
| 1122 |
+
β βββ layer0/ # Structural Ingestion
|
| 1123 |
+
β β βββ parser.py # Marker + Nougat + GROBID orchestrator
|
| 1124 |
+
β β βββ region_classifier.py # LayoutLMv3 region classification
|
| 1125 |
+
β β βββ chunker.py # Section-aware chunking
|
| 1126 |
+
β β βββ figure_router.py # VLM vs Digitizer routing
|
| 1127 |
+
β β βββ plot_digitizer.py # Quantitative plot β CSV
|
| 1128 |
+
β β βββ quality_scorer.py # Per-span quality scoring
|
| 1129 |
+
β β βββ cross_ref_verifier.py # Figure/Table reference integrity
|
| 1130 |
+
β β
|
| 1131 |
+
β βββ layer1/ # Entity Resolution
|
| 1132 |
+
β β βββ entity_normalizer.py # Ontology-aware normalization
|
| 1133 |
+
β β βββ citation_resolver.py # In-text [32] β DOI
|
| 1134 |
+
β β βββ vor_lineage.py # Version of Record tracking
|
| 1135 |
+
β β βββ retraction_checker.py # CrossRef + Retraction Watch
|
| 1136 |
+
β β
|
| 1137 |
+
β βββ layer2/ # Qualified Extraction
|
| 1138 |
+
β β βββ council.py # Parallel-then-merge council (upgraded)
|
| 1139 |
+
β β βββ epistemic_separator.py # Abstract vs Results scoring
|
| 1140 |
+
β β βββ qualifier_extractor.py # Hedging, negation, conditions
|
| 1141 |
+
β β βββ statistical_extractor.py # N, p, d, CI extraction
|
| 1142 |
+
β β βββ constrained_decoder.py # Guidance engine integration
|
| 1143 |
+
β β βββ ood_detector.py # Mahalanobis distance OOD gating
|
| 1144 |
+
β β
|
| 1145 |
+
β βββ layer3/ # Canonicalization
|
| 1146 |
+
β β βββ deduplicator.py # Embedding-based near-duplicate detection
|
| 1147 |
+
β β βββ canonical_registry.py # Canonical claim management
|
| 1148 |
+
β β βββ alias_merger.py # Alias mapping and merging
|
| 1149 |
+
β β βββ temporal_versioner.py # Claim version history
|
| 1150 |
+
β β
|
| 1151 |
+
β βββ layer4/ # Knowledge Graph
|
| 1152 |
+
β β βββ graph.py # SQLite-backed graph with typed edges
|
| 1153 |
+
β β βββ conflict_detector.py # Pairwise conflict detection (upgraded)
|
| 1154 |
+
β β βββ conflict_clusterer.py # Case file generation
|
| 1155 |
+
β β βββ method_compatibility.py # Cross-paper method comparison
|
| 1156 |
+
β β βββ gap_analyzer.py # Structural hole detection
|
| 1157 |
+
β β βββ transitive_constraints.py # Multi-hop inference safety
|
| 1158 |
+
β β
|
| 1159 |
+
β βββ layer5/ # Calibrated Scoring
|
| 1160 |
+
β β βββ scorer.py # Code-computed 3-score system
|
│   │   ├── statistical_gate.py    # Effect size / practical significance
│   │   ├── section_modifiers.py   # Abstract/Results/Discussion weights
│   │   └── calibration_monitor.py # Brier score tracking
│   │
│   ├── layer6/                    # Evaluation
│   │   ├── regression_gate.py     # Golden dataset regression
│   │   ├── llm_judge.py           # Faithfulness/grounding evaluation
│   │   ├── stochastic_tester.py   # Run-N-times variance check
│   │   └── annotation_drift.py    # Gold-standard drift detection
│   │
│   ├── layer7/                    # Provenance
│   │   ├── lineage_tagger.py      # Pipeline version tagging
│   │   ├── security_sandbox.py    # Isolated URL/repo validation
│   │   ├── license_checker.py     # Usage rights verification
│   │   └── embargo_manager.py     # Private graph / merge workflow
│   │
│   ├── ui/                        # Gradio UI
│   │   ├── app.py                 # Main application
│   │   ├── courtroom.py           # Conflict resolution courtroom
│   │   ├── dashboard.py           # Epistemic health dashboard
│   │   ├── pdf_viewer.py          # PDF.js with bbox highlighting
│   │   ├── manual_synthesis.py    # AI-free exploration mode
│   │   └── export.py              # CSV, BibTeX, JSON, Obsidian export
│   │
│   ├── core/                      # Shared infrastructure
│   │   ├── db.py                  # SQLite data layer (existing, extended)
│   │   ├── taxonomy.py            # Quantum-Bio V2 (existing)
│   │   ├── agents.py              # Brain interface (existing, upgraded)
│   │   ├── agent_os.py            # ECC Harness (existing)
│   │   ├── meta_improver.py       # Meta-Improver (existing)
│   │   └── skills/                # Superpowers (existing)
│   │
│   ├── training/                  # Model training
│   │   ├── train_sft.py           # Stage 1: SFT
│   │   ├── train_dpo.py           # Stage 2: DPO
│   │   ├── train_grpo.py          # Stage 3: GRPO with epistemic rewards
│   │   ├── train_calibration.py   # Stage 4: ConfTuner
│   │   ├── reward_functions.py    # JSON validity, schema, qualifier rewards
│   │   └── generate_dataset.py    # Synthetic + real data generation
│   │
│   └── config/                    # Version-controlled configuration
│       ├── prompts/               # All system prompts (git-tracked)
│       ├── taxonomy/              # Domain taxonomies
│       ├── scoring/               # Weight tables, thresholds
│       └── evaluation/            # Golden dataset + guidelines
│
├── tests/
│   ├── test_layer0.py             # Structural ingestion tests
│   ├── test_layer1.py             # Entity resolution tests
│   ├── test_layer2.py             # Extraction tests
│   ├── test_layer3.py             # Canonicalization tests
│   ├── test_layer4.py             # Knowledge graph tests
│   ├── test_layer5.py             # Scoring tests
│   ├── test_layer6.py             # Evaluation tests
│   ├── test_layer7.py             # Provenance tests
│   ├── test_db.py                 # Data layer (existing 22 tests)
│   ├── test_agent_os.py           # ECC harness (existing 21 tests)
│   ├── test_taxonomy.py           # Taxonomy (existing 27 tests)
│   ├── test_skills_and_meta.py    # Skills + meta (existing 30 tests)
│   └── test_council.py            # Council (existing 19 tests)
│
└── docs/
    ├── ARCHITECTURE.md            # Project map (existing)
    ├── AGENTS.md                  # Agent registry (existing)
    ├── USAGE.md                   # Daily workflow guide
    ├── ANNOTATION_GUIDELINES.md   # Versioned golden dataset rules
    └── DEPLOYMENT.md              # Local setup guide
```
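The `training/reward_functions.py` entry above names three reward families: JSON validity, schema conformance, and qualifier rewards. A minimal sketch of what the first two could look like — the function names and the `REQUIRED_KEYS` claim schema here are illustrative assumptions, not the module's actual API:

```python
import json

# Illustrative claim schema; the real field set is defined by the extraction layer.
REQUIRED_KEYS = {"claim", "qualifier", "confidence"}

def json_validity_reward(completion: str) -> float:
    """1.0 if the model output parses as JSON, else 0.0."""
    try:
        json.loads(completion)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def schema_reward(completion: str) -> float:
    """Fraction of required claim fields present in the parsed object."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    return len(REQUIRED_KEYS & obj.keys()) / len(REQUIRED_KEYS)

good = '{"claim": "X inhibits Y", "qualifier": "in vitro", "confidence": 0.7}'
partial = '{"claim": "X inhibits Y"}'
print(json_validity_reward(good))        # 1.0
print(round(schema_reward(partial), 2))  # 0.33
```

Graded partial credit (rather than a 0/1 schema check) is the usual choice for GRPO-style training, since it gives the policy a gradient toward fuller outputs instead of a cliff.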

---

## 10. Success Criteria

The system is DONE when:

1. **A researcher can drop a PDF and get back epistemic-tagged claims with source bounding boxes in under 5 minutes**
2. **Two claims from different papers that say the same thing are automatically recognized as the same canonical claim**
3. **A null result creates a blocking edge, not a gap, in the knowledge graph**
4. **An Abstract claim that overstates the Results gets automatically penalized**
5. **The courtroom shows three conflicting papers side by side with a method comparison, and the researcher can resolve the conflict in two clicks**
6. **The gap analyzer identifies untested entity pairs and generates Decision Objects**
7. **The system knows when it doesn't know: OOD papers, unextractable regions, and uncalibrated confidence all surface to the human**
8. **All of the above works on a 16 GB consumer GPU with zero internet dependency for paper processing**
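Criterion 7 rests on the Brier-score tracking assigned to `calibration_monitor.py` in the tree above. As a hedged sketch (the function name and the numbers are illustrative), the score is just the mean squared gap between stated confidence and the 0/1 outcome:

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome.
    0.0 is perfect calibration; 0.25 is what always answering 0.5 earns."""
    assert confidences and len(confidences) == len(outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# An extractor that claims 0.9 confidence but is right only half the time
# scores worse than one that honestly says 0.5 -- overconfidence surfaces:
print(round(brier_score([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]), 2))  # 0.41
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))            # 0.25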

---

*This design addresses all 87 blindspots from the complete audit.*
*Implementation timeline: ~32 weeks pre-PhD, then ongoing through PhD Years 1-3.*
*The hardest part is not building it. It's keeping it honest.*

---