nkshirsa committed on
Commit 19e7d25 · verified · 1 Parent(s): 0438ca3

Upload SYSTEM_DESIGN.md with huggingface_hub

# PhD Research OS: Complete System Design
## Version 2.0 | Post-Audit Architecture

**Date**: 2026-04-23
**Status**: DESIGN COMPLETE; ready for phased implementation
**Addresses**: all 87 blindspots from the audit
**Hardware Target**: 16-24 GB VRAM consumer GPU (RTX 4090 / RTX 3090 / A6000)

---

## 1. System Overview

```
                      PhD Research OS v2.0
                      "The Epistemic Engine"

INPUTS
  PDF Bundles | Supplements | Datasets | Code Repos | Lab Notes
    ↓
LAYER 0: STRUCTURAL INGESTION
  Marker → Nougat → GROBID | Region Classifier | Plot Digitizer
  Section-aware chunks | Bounding boxes | Quality scores
    ↓
LAYER 1: ENTITY RESOLUTION
  Ontology normalizer | Citation resolver | VoR lineage | Retraction checks
    ↓
LAYER 2: QUALIFIED EXTRACTION
  AI Model Council (parallel) | Epistemic Separation Engine
  Qualifier preservation | Statistical extraction | OOD gating
  Guidance constrained decoding | Source quotes + bboxes
    ↓
LAYER 3: CANONICALIZATION
  Embedding dedup | Canonical registry | Alias merging
  Evidence aggregation | Temporal versioning | Lineage diff
    ↓
LAYER 4: KNOWLEDGE GRAPH
  SQLite-backed graph | Typed epistemic edges | Lab lineage
  Method compatibility | Transitive constraints | Gap analysis
  Null evidence | Conflict clustering | Versioned ontology
    ↓
LAYER 5: CALIBRATED SCORING
  Code-computed confidence | 3 separate scores | Statistical gate
  Parser confidence propagation | Section modifiers | Brier monitoring
    ↓
LAYER 6: EVALUATION
  LLM-as-Judge CI/CD | Versioned golden set | Stochastic tests
  Hidden holdout | Fatigue management | Counter-metrics
    ↓
LAYER 7: PROVENANCE & REPRODUCIBILITY
  Version pinning | Output lineage | PDF.js viewer | Containers
  Security sandbox | License checking | Epistemic Embargo
    ↓
OUTPUTS
  Obsidian Vault | Courtroom UI | Gap Analysis | Decision Objects

CROSS-CUTTING (applies to every layer)
  AI Model Council | Meta-Improver | Superpowers Skills
  ECC Harness | Companion Agents | Manual Synthesis Mode
```

---

## 2. Model Architecture

### 2.1 The Two-Model Strategy

The system runs TWO models, not one. This resolves the tension between local privacy and online capability:

```
PRIMARY BRAIN (fully local; never touches the internet)

  Model:    Qwen3-8B Q4 AWQ
  VRAM:     ~5 GB weights + ~4 GB KV cache (PolarQuant)
  Total:    ~9 GB (fits a 16 GB GPU with room for batching)
  Context:  128K tokens (full paper length)
  Serving:  Ollama (simplest) or vLLM (fastest)

  Tasks:
    - Claim extraction (Layer 2)
    - Epistemic classification
    - Confidence component estimation
    - Conflict hypothesis generation
    - Query decomposition
    - Decision object generation

  Constrained decoding: Guidance engine
  Training: SFT → DPO → GRPO → ConfTuner (4-stage pipeline)
  Privacy:  ALL paper data stays local

COMPANION BRAIN (online; for non-sensitive tasks)

  Model: Claude API / GPT-4o-mini / OpenRouter
  OR:    local Qwen3-30B-A3B MoE Q4 (~6 GB, 3B active)

  Tasks:
    - Meta-Improver external scanning (arXiv, GitHub)
    - Prompt optimization A/B testing
    - Training data generation for new domains
    - Retraction/correction checking (needs internet)
    - Repository URL validation

  Privacy: NEVER sees raw paper text.
  Only receives: metadata, queries, anonymized claims.
```
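The privacy boundary between the two brains can be sketched as a small dispatcher. This is an illustrative sketch, not part of the design above: the task names and the `PRIVATE_TASKS` set are hypothetical, and the real router would sit in front of the serving layer.

```python
# Hypothetical sketch of the two-brain privacy router.
# Task names and PRIVATE_TASKS are illustrative assumptions.

PRIVATE_TASKS = {
    "claim_extraction", "epistemic_classification", "confidence_estimation",
    "conflict_hypothesis", "query_decomposition", "decision_object",
}

def route(task: str, payload: dict) -> str:
    """Return which brain is allowed to see this payload."""
    if task in PRIVATE_TASKS or payload.get("contains_paper_text", False):
        # Local Qwen3-8B: raw paper text never leaves the machine.
        return "primary"
    # Online model: metadata, queries, anonymized claims only.
    return "companion"

assert route("claim_extraction", {"contains_paper_text": True}) == "primary"
assert route("retraction_check", {"doi": "10.1234/example"}) == "companion"
```

The key property is that the check is structural (task class plus a payload flag), not a prompt instruction, so a misrouted call fails closed to the local brain.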

### 2.2 Why Qwen3-8B, Not Qwen2.5-3B

| Metric | Qwen2.5-3B | Qwen3-8B | Improvement |
|--------|------------|----------|-------------|
| AIME (math reasoning) | ~15% | ~45%+ | 3× |
| MATH-500 | ~85% | ~95%+ | +10 pts |
| JSON structural accuracy (SFT) | ~65% | ~80%+ | +15 pts |
| Context window | 32K | 128K | 4× |
| Hybrid thinking mode | No | Yes | New capability |
| VRAM at Q4 AWQ | ~2.5 GB | ~5 GB | Acceptable |

### 2.3 Alternative: Qwen3-30B-A3B MoE (The Stealth Option)

For users with 8 GB+ VRAM who want maximum quality:
- 30B total parameters, only 3B activated per token (Mixture of Experts)
- ~6 GB at Q4 quantization
- Quality equivalent to dense 14B+ models
- Apache 2.0 license
- Available: `Qwen/Qwen3-30B-A3B-Instruct-2507` (1M downloads)

### 2.4 Multimodal: Qwen3-VL-8B-Instruct

For figure/diagram processing (Layer 0):
- Same architecture as the text model, but with a vision encoder
- Available: `Qwen/Qwen3-VL-8B-Instruct` (3.9M downloads)
- AWQ 4-bit: `cyankiwi/Qwen3-VL-8B-Instruct-AWQ-4bit` (~5 GB)
- Handles: figure classification, diagram understanding, micrograph analysis
- Does NOT replace the plot digitizer for quantitative data

### 2.5 VLM for Multimodal Figures: Qwen3-VL-30B-A3B-Instruct

For maximum figure understanding with MoE efficiency:
- Available: `Qwen/Qwen3-VL-30B-A3B-Instruct` (1.5M downloads)
- AWQ: `QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ` (667K downloads)
- Only 3B active params, so it fits alongside the primary brain

---

## 3. Training Pipeline (4-Stage)

### Stage 1: SFT on Domain Data

```python
# Current implementation (train.py): KEEP, but upgrade the base model
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",  # Upgraded from Qwen2.5-3B
    args=SFTConfig(
        output_dir="./research-os-sft",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_length=4096,  # Longer, for paper sections
        assistant_only_loss=True,
        bf16=True,
        gradient_checkpointing=True,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
    ),
    train_dataset=expanded_dataset,  # 10K+ examples (up from 1,900)
    peft_config=LoraConfig(r=64, lora_alpha=16, target_modules="all-linear"),
)
trainer.train()
```

### Stage 2: DPO on Preference Pairs

```python
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

# Dataset: pairs of (correct extraction, incorrect extraction) for the same text
trainer = DPOTrainer(
    model="./research-os-sft",  # From stage 1
    args=DPOConfig(
        output_dir="./research-os-dpo",
        learning_rate=5e-7,
        num_train_epochs=1,
        max_length=4096,
        bf16=True,
        push_to_hub=True,
    ),
    train_dataset=preference_dataset,
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```

### Stage 3: GRPO with Epistemic Reward Functions

This is the critical stage that bakes JSON reliability and epistemic correctness into the model:

```python
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig
import json

# ── Reward Function 1: JSON Validity ──
def json_validity_reward(completions, **kwargs):
    """Binary reward: is the output valid JSON?"""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        try:
            json.loads(content)
            rewards.append(1.0)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
    return rewards

# ── Reward Function 2: Schema Compliance ──
REQUIRED_KEYS = {"text", "epistemic_tag", "confidence", "missing_fields", "status"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}

def schema_compliance_reward(completions, **kwargs):
    """Reward for matching the Research OS claim schema."""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        score = 0.0
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])

            for claim in claims:
                if not isinstance(claim, dict):
                    continue
                # Key presence: 0.3
                present_keys = set(claim.keys()) & REQUIRED_KEYS
                score += 0.3 * len(present_keys) / len(REQUIRED_KEYS)
                # Valid epistemic tag: 0.3
                if claim.get("epistemic_tag") in VALID_TAGS:
                    score += 0.3
                # Confidence in range: 0.2
                conf = claim.get("confidence", -1)
                if isinstance(conf, (int, float)) and 0 <= conf <= 1:
                    score += 0.2
                # Status consistency: 0.2
                missing = claim.get("missing_fields", [])
                status = claim.get("status", "")
                if (missing and status == "Incomplete") or (not missing and status == "Complete"):
                    score += 0.2

            if claims:
                score /= len(claims)
        except (json.JSONDecodeError, TypeError, AttributeError):
            score = 0.0
        rewards.append(score)
    return rewards

# ── Reward Function 3: Qualifier Preservation ──
HEDGING_WORDS = {"may", "might", "could", "suggests", "possibly", "potentially",
                 "appears", "seems", "likely", "unlikely", "not significant"}

def qualifier_preservation_reward(completions, prompts, **kwargs):
    """Reward for preserving hedging language from the source text."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        content = completion[0]["content"] if isinstance(completion, list) else completion
        prompt_text = prompt[0]["content"] if isinstance(prompt, list) else prompt

        # Find hedging words in the source
        source_hedges = {w for w in HEDGING_WORDS if w in prompt_text.lower()}
        if not source_hedges:
            rewards.append(0.5)  # Neutral if no hedging in source
            continue

        # Check whether the hedging is preserved in the extraction
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])
            claim_text = " ".join(c.get("text", "") for c in claims if isinstance(c, dict)).lower()

            preserved = sum(1 for h in source_hedges if h in claim_text)
            rewards.append(preserved / len(source_hedges))
        except (json.JSONDecodeError, TypeError, AttributeError):
            rewards.append(0.0)
    return rewards

# ── GRPO Training ──
trainer = GRPOTrainer(
    model="./research-os-dpo",  # From stage 2
    reward_funcs=[
        json_validity_reward,           # Weight: 0.3
        schema_compliance_reward,       # Weight: 0.4
        qualifier_preservation_reward,  # Weight: 0.3
    ],
    args=GRPOConfig(
        output_dir="./research-os-grpo",
        learning_rate=1e-6,
        num_generations=8,
        max_completion_length=2048,
        bf16=True,
        gradient_checkpointing=True,
        logging_steps=10,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
        reward_weights=[0.3, 0.4, 0.3],
    ),
    train_dataset=prompt_dataset,  # "prompt" column with paper excerpts
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```

### Stage 4: Calibration Fine-Tuning (ConfTuner)

After GRPO, apply ConfTuner with a tokenized Brier-score loss to fix confidence calibration. This is a specialized fine-tuning pass that targets only the confidence output tokens.

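The loss ConfTuner optimizes is, at its core, the Brier score over the model's stated confidences. A minimal illustration of the metric itself (not the token-level training pass, which operates on logits):

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome.
    0.0 is perfect; 0.25 is what always answering 0.5 scores."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# An overconfident extractor (says 0.9 but is right half the time) scores
# worse than an honest one that says 0.5 when it is genuinely 50/50.
overconfident = brier_score([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])  # ≈ 0.41
honest        = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # 0.25
assert overconfident > honest
```

This is also the monitoring statistic Layer 5 tracks ("Brier mon." in the overview), so the training objective and the production health check agree.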
---

## 4. Layer Specifications

### 4.0 Layer 0: Structural Ingestion Engine

**Purpose**: Convert PDF bundles into section-aware, bbox-annotated, quality-scored structured regions.

**Technology Stack**:

| Component | Tool | Purpose |
|-----------|------|---------|
| Layout detection | Marker (VikParuchuri/marker) | PDF → structured markdown with layout awareness |
| Math/equations | Nougat (facebookresearch/nougat) | Scientific PDFs → LaTeX equations |
| Bibliographic | GROBID | Headers, authors, citations, references |
| Region classifier | LayoutLMv3 or DocTR | Classify page regions: text, table, figure, equation |
| Plot digitizer | PlotDigitizer (algorithmic) | Quantitative plots → CSV of (x, y) coordinates |
| VLM for figures | Qwen3-VL-8B-Instruct Q4 AWQ | Semantic figure understanding |
| OCR quality | Per-span confidence scoring | Flag degraded regions |

**Output Schema** (per region):

```json
{
  "region_id": "REG_00042",
  "document_type": "main|supplement_1|supplement_2",
  "page": 5,
  "bbox": [72, 340, 540, 420],
  "region_type": "body_text|table|figure|equation|caption|header|reference|footnote",
  "section": "results",
  "subsection": "3.2_sensitivity_characterization",
  "content": {
    "text": "The LOD was 0.8 ± 0.03 fM (Table 2)",
    "markdown": "The LOD was 0.8 ± 0.03 fM ([Table 2](#table-2))",
    "parse_method": "marker",
    "parse_confidence": 0.95,
    "ocr_source": false
  },
  "cross_references": [
    {"ref_text": "Table 2", "ref_type": "table", "resolved_to": "REG_00038", "verified": true}
  ],
  "extraction_status": "extractable|low_confidence|unextractable",
  "quality_flags": [],
  "figures": {
    "detected": true,
    "figure_type": "scatter_plot|bar_chart|diagram|micrograph|schematic",
    "digitizable": true,
    "digitized_data": null
  }
}
```
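A schema this central deserves a cheap machine check before regions enter the pipeline. A minimal validator sketch (field names come from the schema above; the validator itself and its error strings are assumptions):

```python
REGION_TYPES = {"body_text", "table", "figure", "equation", "caption",
                "header", "reference", "footnote"}
EXTRACTION_STATUSES = {"extractable", "low_confidence", "unextractable"}

def validate_region(region: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the region passes."""
    errors = []
    if region.get("region_type") not in REGION_TYPES:
        errors.append("unknown region_type")
    bbox = region.get("bbox", [])
    if len(bbox) != 4 or not all(isinstance(v, (int, float)) for v in bbox):
        errors.append("bbox must be [x0, y0, x1, y1]")
    conf = region.get("content", {}).get("parse_confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        errors.append("parse_confidence must be in [0, 1]")
    if region.get("extraction_status") not in EXTRACTION_STATUSES:
        errors.append("invalid extraction_status")
    return errors

region = {"region_type": "body_text", "bbox": [72, 340, 540, 420],
          "content": {"parse_confidence": 0.95}, "extraction_status": "extractable"}
assert validate_region(region) == []
```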

**Chunking Strategy**: section-aware, NOT page-based.
1. Marker identifies section boundaries (Introduction, Methods, Results subsections)
2. Chunk by section, with a 1-paragraph overlap into the preceding and following sections
3. Tables are always kept whole (never split across chunks)
4. Figure + caption are always kept together
5. Maximum chunk size: 4096 tokens (the model context allows it)

**Paper Bundle Handling**:
```
Input: {
  "main_pdf": "path/to/paper.pdf",
  "supplements": ["path/to/supplement_1.pdf", "path/to/supplement_data.xlsx"],
  "code_repo": "https://github.com/author/repo",
  "dataset": "https://zenodo.org/record/12345"
}
```

### 4.1 Layer 1: Entity Resolution

**Purpose**: Normalize entities, resolve citations, check retractions, establish version lineage.

**Components**:

```
Entity Normalizer
├── Gene/protein names → UniProt ID
├── Chemical names → PubChem CID
├── Disease names → MeSH ID
├── Assay names → BAO ontology
├── Abbreviations → canonical form (LRU cache)
└── Custom domain ontology (user-extensible)

Citation Chain Resolver
├── In-text "[32]" → reference list → DOI
├── DOI → CrossRef metadata
├── Check: is the cited paper in the knowledge base?
├── If yes: link the claim to the original source
├── If no: flag as "citation_orphan" for potential ingestion
└── Classify: primary claim vs inherited citation

Version of Record (VoR) Lineage
├── Before ingestion: query DOI/arXiv for the version chain
├── If a preprint exists in the DB and a VoR is arriving: supersede
├── If a VoR exists and an erratum is arriving: amend specific claims
├── If retraction: invalidate ALL claims, propagate penalty
└── Store full lineage: preprint_doi → vor_doi → errata → retraction

Retraction Checker
├── CrossRef "update-to" relationship
├── Retraction Watch database (periodic sync via companion model)
└── Propagate retraction status through citation chains
```
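The first two steps of the Citation Chain Resolver can be sketched as a mapping from bracketed in-text markers to the reference list, with orphan flagging. The regex and data shapes are assumptions; the real resolver would also hit CrossRef for metadata.

```python
import re

def resolve_citations(text: str, reference_list: dict, known_dois: set) -> list:
    """Map in-text markers like "[32]" to DOIs and flag citation orphans.
    `reference_list` maps reference number -> DOI; `known_dois` is the local KB."""
    resolved = []
    for marker in re.findall(r"\[(\d+)\]", text):
        doi = reference_list.get(int(marker))
        in_kb = doi in known_dois if doi else False
        resolved.append({
            "marker": f"[{marker}]",
            "doi": doi,
            "in_knowledge_base": in_kb,
            # Anything not resolvable to a paper already in the KB is an orphan
            "flag": None if in_kb else "citation_orphan",
        })
    return resolved

out = resolve_citations("as shown previously [32] and [7]",
                        {32: "10.1000/xyz32"}, known_dois={"10.1000/xyz32"})
assert out[0]["flag"] is None
assert out[1]["flag"] == "citation_orphan"
```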

### 4.2 Layer 2: Qualified Extraction

**Purpose**: Extract claims with full epistemic qualification using the AI Model Council.

**Council Architecture** (Parallel-Then-Merge):

```
Round 1 (PARALLEL; no visibility between members):

  Query Planner    Extractor       Extractor 2      Critic
  (decompose)      (Qwen3-8B)      (if heterog.)    (adversarial)
       |                |               |               |
       ▼                ▼               ▼               ▼
  sub-queries       claims_A        claims_B        critique

Round 2 (DEBATE; see tags and reasoning, NOT confidence):
  All members see each other's epistemic tags and reasoning chains.
  Each member can revise their classification.
  Confidence scores remain HIDDEN (prevents anchoring).

Round 3 (SYNTHESIS; Chairman):
  The Chairman sees everything, including confidence.
  Applies the completeness penalty (code-enforced, not prompt-instructed).
  Resolves disagreements with documented reasoning.
  Tags each claim with council_vote_distribution.
```
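Round 1's "parallel, no visibility" property is easy to get right structurally: each member is invoked from the same input with no shared state. A sketch with a thread pool (the member functions here are stubs standing in for model calls; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_round_1(section_text: str, members: dict) -> dict:
    """Run all council members in parallel; no member sees another's output."""
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        futures = {name: pool.submit(fn, section_text) for name, fn in members.items()}
        return {name: f.result() for name, f in futures.items()}

# Stub members standing in for model invocations
members = {
    "planner":     lambda t: {"sub_queries": [t[:10]]},
    "extractor_1": lambda t: {"claims": ["claim_A"]},
    "extractor_2": lambda t: {"claims": ["claim_B"]},
    "critic":      lambda t: {"critique": "check Table 2"},
}
round_1 = run_round_1("The LOD was 0.8 fM ...", members)
assert set(round_1) == {"planner", "extractor_1", "extractor_2", "critic"}
```

Because members only ever receive `section_text`, anchoring between extractors is impossible by construction rather than by prompt discipline.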

**Epistemic Separation Engine**:

| Section | Epistemic Default | Confidence Modifier |
|---------|-------------------|---------------------|
| Results (with statistics) | Fact (if p < threshold) | 1.0 |
| Results (narrative) | Interpretation | 0.85 |
| Methods | Protocol metadata (not a claim) | N/A |
| Abstract | Interpretation (forced) | 0.7 penalty |
| Discussion | Interpretation or Hypothesis | 0.75 penalty |
| Conclusion | Cross-checked against Results | 0.8 if supported, 0.5 if not |
| Supplement | Same as main-body section rules | 1.0 (no penalty for supplement source) |

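The table above is a pure lookup plus two special cases, which keeps it in code rather than in a prompt. A sketch (values transcribed from the table; the section keys and the handling of Methods and Conclusion are assumptions about how Layer 5 would consume it):

```python
SECTION_RULES = {
    "results_statistical": {"default_tag": "Fact",           "modifier": 1.00},
    "results_narrative":   {"default_tag": "Interpretation", "modifier": 0.85},
    "abstract":            {"default_tag": "Interpretation", "modifier": 0.70},
    "discussion":          {"default_tag": "Interpretation", "modifier": 0.75},
    # Supplements inherit main-body rules: no penalty for the source itself
    "supplement":          {"default_tag": None,             "modifier": 1.00},
}

def section_modifier(section: str, supported_by_results: bool = True) -> float:
    if section == "conclusion":          # cross-checked against Results
        return 0.8 if supported_by_results else 0.5
    if section == "methods":             # protocol metadata, not a claim
        raise ValueError("Methods text yields protocol metadata, not claims")
    return SECTION_RULES[section]["modifier"]

assert section_modifier("abstract") == 0.70
assert section_modifier("conclusion", supported_by_results=False) == 0.5
```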
**Constrained Decoding** (Guidance engine):

```python
from guidance import models, gen, select

TAGS = ["Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"]

lm = models.Transformers("./research-os-grpo")  # Local fine-tuned model

output = lm + f"""
Analyze this scientific text and extract claims.

Text: {section_text}
Section: {section_name}

<reasoning>{gen('reasoning', max_tokens=500)}</reasoning>

Claims:
[
  {{
    "text": "{gen('claim_text', max_tokens=200)}",
    "epistemic_tag": "{select(TAGS, name='tag')}",
    "confidence_components": {{
      "evidence_strength": {gen('evidence_strength', regex=r'0\.[0-9][0-9]?[0-9]?')},
      "qualifiers": ["{gen('qualifiers', max_tokens=100)}"]
    }},
    "source_quote": "{gen('source_quote', max_tokens=200)}",
    "source_page": {gen('source_page', regex=r'[0-9]+')},
    "is_null_result": {select(['true', 'false'], name='is_null')},
    "is_inherited_citation": {select(['true', 'false'], name='is_inherited')}
  }}
]
"""
# output["tag"] is GUARANTEED to be one of TAGS
# output["is_null"] is GUARANTEED to be "true" or "false"
```

**Claim Schema v2** (expanded from v1):

```json
{
  "claim_id": "CLM_00042",
  "text": "The LOD was 0.8 fM in 10 mM PBS",
  "epistemic_tag": "Fact",
  "confidence": 0.855,
  "confidence_components": {
    "evidence_strength": 900,
    "study_quality_weight": 1000,
    "journal_tier_weight": 1000,
    "completeness_penalty": 1000,
    "section_modifier": 1000,
    "qualifier_penalty": 950
  },
  "qualifiers": ["in 10 mM PBS only", "n=5"],
  "missing_fields": [],
  "status": "Complete",
  "is_null_result": false,
  "is_inherited_citation": false,
  "causal_direction": "observed_correlation",
  "statistical_evidence": {
    "p_value": 0.001,
    "effect_size": 2.1,
    "effect_size_type": "cohens_d",
    "sample_size": 5,
    "confidence_interval": [0.6, 1.0],
    "practical_significance": true
  },
  "source_quote": "The limit of detection was determined to be 0.8 fM using the 3σ/slope method.",
  "source_page": 5,
  "source_bbox": [72, 340, 540, 365],
  "source_section": "results",
  "source_doi": "10.1234/example",
  "council_vote": {
    "extractor_1": {"tag": "Fact", "reasoning": "Direct measurement with statistics"},
    "extractor_2": {"tag": "Fact", "reasoning": "Quantitative with clear methodology"},
    "critic": {"tag": "Fact", "reasoning": "Supported by Table 2 data"},
    "chairman": {"tag": "Fact", "reasoning": "Unanimous agreement, strong statistics"}
  },
  "granularity": "atomic",
  "parent_claim_id": null,
  "sub_claims": [],
  "ontology_version": "quantum_bio_v1",
  "pipeline_version": "2.1.0",
  "taxonomy_version": "quantum_bio_v1",
  "extraction_timestamp": "2026-04-23T10:30:00Z"
}
```
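Note how the fixed-point components in the example multiply out to the stored `confidence`: 900/1000 × 950/1000 = 0.855. A sketch of that aggregation (the multiplicative rule is implied by the example values and should be treated as an assumption about Layer 5's formula):

```python
def confidence_from_components(components: dict) -> float:
    """Multiply fixed-point (×1000) components into a 0-1 confidence."""
    conf = 1.0
    for value in components.values():
        conf *= value / 1000.0
    return round(conf, 3)

components = {
    "evidence_strength": 900,
    "study_quality_weight": 1000,
    "journal_tier_weight": 1000,
    "completeness_penalty": 1000,
    "section_modifier": 1000,
    "qualifier_penalty": 950,
}
assert confidence_from_components(components) == 0.855  # matches the schema example
```

A multiplicative form means any single weak component (say a 0.5 completeness penalty) caps the whole score, which is the behavior you want from penalties.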

### 4.3 Layer 3: Canonicalization

**Purpose**: Deduplicate claims, merge aliases, aggregate evidence, track temporal versions.

```
New claim arrives →
1. Embed the claim text (local embedding model or Qwen3-8B last-hidden-state)
2. Search existing canonical claims (cosine similarity)
3. If similarity > 0.85:
   ├── MERGE: add the new source as evidence for the existing canonical claim
   ├── Update evidence_count, source_list, confidence (re-aggregate)
   ├── If confidence_components differ significantly: flag for human review
   └── Store alias mapping: new_claim_id → canonical_claim_id
4. If similarity 0.70-0.85:
   ├── FLAG as "potential duplicate; review recommended"
   └── Show both claims in the review queue with the similarity score
5. If similarity < 0.70:
   └── CREATE a new canonical claim
```
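The threshold logic above is independent of which embedding model is plugged in. A self-contained sketch with plain cosine similarity (the toy 2-d vectors are illustrative; real claims would use high-dimensional sentence embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dedup_decision(new_vec, canonical: dict) -> tuple:
    """Return (action, best_match_id) per the 0.85 / 0.70 thresholds above."""
    best_id, best_sim = None, -1.0
    for claim_id, vec in canonical.items():
        sim = cosine(new_vec, vec)
        if sim > best_sim:
            best_id, best_sim = claim_id, sim
    if best_sim > 0.85:
        return ("merge", best_id)
    if best_sim >= 0.70:
        return ("flag_potential_duplicate", best_id)
    return ("create_new", None)

canonical = {"CLM_001": [1.0, 0.0], "CLM_002": [0.0, 1.0]}
assert dedup_decision([0.99, 0.05], canonical) == ("merge", "CLM_001")
assert dedup_decision([1.0, 1.0], canonical)[0] == "flag_potential_duplicate"
```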

**Temporal Versioning**:
```
canonical_claim:
  version_history: [
    {version: 1, source: "preprint_2024", confidence: 0.65, date: "2024-03"},
    {version: 2, source: "vor_2024", confidence: 0.85, date: "2024-09"},
    {version: 3, source: "new_study_2025", confidence: 0.90, date: "2025-02"}
  ]
  current_version: 3
  supersedes: null
  superseded_by: null
```

### 4.4 Layer 4: Knowledge Graph

**Implementation**: SQLite-backed adjacency list (NOT Neo4j; keeps the system local and zero-dependency).

**Schema**:

```sql
CREATE TABLE graph_nodes (
    node_id    TEXT PRIMARY KEY,  -- canonical_claim_id or entity_id
    node_type  TEXT NOT NULL,     -- claim | entity | method | condition
    label      TEXT NOT NULL,
    properties TEXT,              -- JSON
    created_at TEXT NOT NULL
);

CREATE TABLE graph_edges (
    edge_id           TEXT PRIMARY KEY,
    source_node       TEXT NOT NULL,
    target_node       TEXT NOT NULL,
    edge_type         TEXT NOT NULL,      -- supports | refutes | extends | depends_on |
                                          -- supersedes | blocks | investigative_hypothesis |
                                          -- method_uses | condition_applies
    confidence        INTEGER NOT NULL,   -- Fixed-point ×1000
    evidence_sources  TEXT,               -- JSON array of source DOIs
    is_inferred       INTEGER DEFAULT 0,  -- 0=observed, 1=inferred (transitive)
    inference_chain   TEXT,               -- JSON: hop details if inferred
    method_compatible INTEGER,            -- NULL=unchecked, 0=incompatible, 1=compatible
    created_at        TEXT NOT NULL,
    updated_at        TEXT NOT NULL,
    FOREIGN KEY(source_node) REFERENCES graph_nodes(node_id),
    FOREIGN KEY(target_node) REFERENCES graph_nodes(node_id)
);

-- Indexes for fast graph traversal
CREATE INDEX idx_edges_source ON graph_edges(source_node);
CREATE INDEX idx_edges_target ON graph_edges(target_node);
CREATE INDEX idx_edges_type ON graph_edges(edge_type);
```
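Because the store is plain SQLite, the schema runs unchanged on Python's built-in `sqlite3`. A quick smoke test (the node and edge values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    node_id TEXT PRIMARY KEY, node_type TEXT NOT NULL, label TEXT NOT NULL,
    properties TEXT, created_at TEXT NOT NULL
);
CREATE TABLE graph_edges (
    edge_id TEXT PRIMARY KEY, source_node TEXT NOT NULL, target_node TEXT NOT NULL,
    edge_type TEXT NOT NULL, confidence INTEGER NOT NULL, evidence_sources TEXT,
    is_inferred INTEGER DEFAULT 0, inference_chain TEXT, method_compatible INTEGER,
    created_at TEXT NOT NULL, updated_at TEXT NOT NULL,
    FOREIGN KEY(source_node) REFERENCES graph_nodes(node_id),
    FOREIGN KEY(target_node) REFERENCES graph_nodes(node_id)
);
CREATE INDEX idx_edges_source ON graph_edges(source_node);
""")
conn.execute("INSERT INTO graph_nodes VALUES (?,?,?,?,?)",
             ("CLM_001", "claim", "LOD 0.8 fM", "{}", "2026-04-23"))
conn.execute("INSERT INTO graph_nodes VALUES (?,?,?,?,?)",
             ("CLM_002", "claim", "LOD 1.2 fM", "{}", "2026-04-23"))
conn.execute("INSERT INTO graph_edges VALUES (?,?,?,?,?,?,?,?,?,?,?)",
             ("E_001", "CLM_001", "CLM_002", "refutes", 850, "[]", 0, None, None,
              "2026-04-23", "2026-04-23"))
row = conn.execute("SELECT edge_type, confidence FROM graph_edges "
                   "WHERE source_node = ?", ("CLM_001",)).fetchone()
# row == ("refutes", 850)
```

Note that confidence lands in the table as the fixed-point integer 850, matching the ×1000 convention in the schema comment.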

**Edge Types**:

| Type | Meaning | Confidence Rule |
|------|---------|-----------------|
| `supports` | Claim A provides evidence for Claim B | From source text, observed |
| `refutes` | Claim A contradicts Claim B | From source text or conflict detection |
| `extends` | Claim A adds conditions/parameters to B | Section analysis |
| `depends_on` | Claim A assumes Claim B is true | Citation chain analysis |
| `supersedes` | Claim A replaces older Claim B (newer data) | Temporal versioning |
| `blocks` | Null finding: no evidence of relationship | Null result extraction |
| `investigative_hypothesis` | Inferred multi-hop (NOT observed) | min(hop_confidences) × 0.5 |

650
**Transitive Inference Constraints**:
- NEVER auto-generate `supports` edges across multiple hops
- Multi-hop inferences may only produce `investigative_hypothesis` edges
- Require `method_compatible=1` on every hop before generating an inference
- Default queries return observed edges only
- Graph queries must set `include_inferred=True` to receive inferences

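The constraints above can be sketched directly against the `graph_edges` schema. This is a minimal illustration; `inferred_confidence` and `get_edges` are hypothetical helper names, not existing modules:

```python
import sqlite3

def inferred_confidence(hop_confidences: list[int]) -> int:
    """Fixed-point (Γ—1000) confidence for a multi-hop inference:
    min(hop_confidences) Γ— 0.5, per the edge-type table."""
    return min(hop_confidences) // 2

def get_edges(conn: sqlite3.Connection, source: str, include_inferred: bool = False):
    """Return observed edges by default; inferred edges only on explicit opt-in."""
    sql = ("SELECT edge_id, edge_type, confidence, is_inferred "
           "FROM graph_edges WHERE source_node = ?")
    if not include_inferred:
        sql += " AND is_inferred = 0"
    return conn.execute(sql + " ORDER BY edge_id", (source,)).fetchall()

# Toy in-memory graph with one observed edge and one inferred edge
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE graph_edges (
    edge_id TEXT PRIMARY KEY, source_node TEXT, target_node TEXT,
    edge_type TEXT, confidence INTEGER, is_inferred INTEGER DEFAULT 0)""")
conn.execute("INSERT INTO graph_edges VALUES ('e1', 'A', 'B', 'supports', 900, 0)")
# Two observed hops at 900 and 700 yield an investigative hypothesis at 350
conn.execute("INSERT INTO graph_edges VALUES ('e2', 'A', 'C', 'investigative_hypothesis', ?, 1)",
             (inferred_confidence([900, 700]),))
```

By default `get_edges(conn, "A")` returns only the observed `e1`; the inferred `e2` surfaces only when the caller passes `include_inferred=True`.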
**Gap Analysis Protocol**:
```python
from itertools import combinations

def find_gaps(self, domain_id: str) -> list:
    """Find structural holes in the knowledge graph."""
    gaps = []

    # 1. Get all entities in the domain
    entities = self.get_entities(domain_id)
    max_degree = max((self.get_degree(e.id) for e in entities), default=1) or 1

    # 2. For each entity pair in the same domain
    for a, b in combinations(entities, 2):
        # 3. Check whether an edge already exists
        edges = self.get_edges(a.id, b.id)
        if not edges:
            # 4. Check whether both are well-connected (dense neighborhood)
            a_degree = self.get_degree(a.id)
            b_degree = self.get_degree(b.id)
            if a_degree > 3 and b_degree > 3:
                # 5. This is a high-value gap
                info_gain = (a_degree + b_degree) / max_degree
                gaps.append({
                    "entity_a": a, "entity_b": b,
                    "information_gain": info_gain,
                    "suggested_action": "experiment" if info_gain > 0.7 else "literature_search",
                })

    return sorted(gaps, key=lambda g: -g["information_gain"])
```

### 4.5 Layer 5: Calibrated Scoring

**Purpose**: Compute confidence using CODE, not the LLM. Three separate scores.

```python
def compute_claim_scores(claim: dict, source: dict, section: str) -> dict:
    """
    Code-computed scoring. The LLM provides COMPONENTS,
    the code computes the FINAL SCORES.

    The LLM NEVER sets the final confidence directly.
    (taxonomy, domain_id, and the weight tables come from module-level config.)
    """
    # ── Score 1: Evidence Quality ──
    evidence_strength = claim["confidence_components"]["evidence_strength"]  # From LLM
    study_quality = taxonomy.get_weight(source["study_type"], domain_id)     # From taxonomy
    journal_tier = JOURNAL_TIER_WEIGHTS[source["journal_tier"]]              # From config
    completeness = 700 if claim["missing_fields"] else 1000                  # Binary: code-enforced
    section_mod = SECTION_MODIFIERS[section]                                 # From config

    # Fixed-point multiplication chain
    evidence_quality = (evidence_strength * study_quality // 1000
                        * journal_tier // 1000
                        * completeness // 1000
                        * section_mod // 1000)

    # ── Score 2: Claim Truth Likelihood ──
    # Based on evidence quality + source count + conflict status
    source_count_bonus = min(claim["evidence_count"] * 50, 200)  # Max +0.2 for multiple sources
    conflict_penalty = -300 if claim.get("has_active_conflict") else 0
    null_evidence_penalty = -200 if claim.get("has_null_evidence") else 0

    truth_likelihood = min(1000, max(0,
        evidence_quality + source_count_bonus + conflict_penalty + null_evidence_penalty
    ))

    # ── Score 3: Qualifier Strength ──
    # How definitive is the claim's language?
    qualifier_count = len(claim.get("qualifiers", []))
    is_null = claim.get("is_null_result", False)
    is_inherited = claim.get("is_inherited_citation", False)

    qualifier_strength = 1000
    if qualifier_count > 0:
        qualifier_strength -= qualifier_count * 100        # -0.1 per qualifier
    if is_null:
        qualifier_strength = min(qualifier_strength, 500)  # Cap at 0.5 for null results
    if is_inherited:
        qualifier_strength -= 200                          # -0.2 for inherited citations
    qualifier_strength = max(0, qualifier_strength)

    # ── Statistical Evidence Gate ──
    stats = claim.get("statistical_evidence", {})
    if stats.get("effect_size") is not None:
        effect = stats["effect_size"]
        sample_n = stats.get("sample_size", 0)

        # Large N + tiny effect = statistically significant but practically meaningless
        if sample_n > 1000 and abs(effect) < 0.1:
            # Override: this is NOT practically significant
            evidence_quality = min(evidence_quality, 400)  # Cap at 0.4
            claim["practical_significance"] = False

    # ── Parser Confidence Propagation ──
    parse_conf = claim.get("parse_confidence", 1000)
    evidence_quality = min(evidence_quality, parse_conf)  # Parser uncertainty CAPS the claim

    return {
        "evidence_quality": evidence_quality,        # Fixed-point Γ—1000
        "truth_likelihood": truth_likelihood,        # Fixed-point Γ—1000
        "qualifier_strength": qualifier_strength,    # Fixed-point Γ—1000
        "composite_confidence": (evidence_quality + truth_likelihood + qualifier_strength) // 3,
        "practical_significance": claim.get("practical_significance", True),
    }
```

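The Γ—1000 fixed-point chain in `compute_claim_scores` floors after every multiplication, so it can be checked in isolation. `fp_mul` and the component values below are illustrative, not calibrated weights:

```python
def fp_mul(*factors: int) -> int:
    """Multiply fixed-point (Γ—1000) factors, flooring after each step,
    mirroring the `* x // 1000` chain in compute_claim_scores."""
    result = factors[0]
    for f in factors[1:]:
        result = result * f // 1000
    return result

# Illustrative components: strong evidence (0.9), top study weight (1.0),
# tier-1 journal (1.0), missing fields (0.7), Results section (1.0)
evidence_quality = fp_mul(900, 1000, 1000, 700, 1000)  # 630, i.e. 0.63
```

Because every step floors, the chain never exceeds what float multiplication would give, which keeps the score conservative.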
### 4.6 Layer 6: Evaluation

**Evaluation Pipeline** (runs in CI/CD on every prompt/model/taxonomy change):

```
1. STRUCTURAL TESTS (existing 119 tests β€” code correctness)
   └── pytest tests/ β†’ all pass?

2. GOLDEN DATASET REGRESSION (versioned annotations)
   β”œβ”€β”€ Extraction recall β‰₯ 70%
   β”œβ”€β”€ Hallucination rate ≀ 10%
   β”œβ”€β”€ Epistemic accuracy β‰₯ 60%
   β”œβ”€β”€ Qualifier preservation rate β‰₯ 80% (NEW)
   └── Null result detection rate β‰₯ 50% (NEW)

3. LLM-AS-JUDGE (faithfulness & grounding)
   β”œβ”€β”€ Faithfulness: does the extracted claim appear in the source text?
   β”œβ”€β”€ Grounding: can the claim be traced to a specific source quote?
   β”œβ”€β”€ Tag correctness: does the epistemic tag match expert judgment?
   β”œβ”€β”€ Qualifier preservation: are hedging words maintained?
   └── Run on 5 golden papers, 3 times each (stochastic check)

4. CALIBRATION CHECK (monthly)
   β”œβ”€β”€ Brier score from calibration_log
   β”œβ”€β”€ Alert if ECE > 0.25
   └── Trigger ConfTuner re-training if needed

5. HIDDEN HOLDOUT (never seen during development)
   β”œβ”€β”€ 3 papers reserved, never used in training or the golden set
   β”œβ”€β”€ Evaluated quarterly
   └── Detects benchmark overfitting
```

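Step 4's two metrics are standard calibration measures. A minimal sketch, where the (confidence, outcome) pairs stand in for rows of `calibration_log` and the 10-bin ECE is an assumed configuration:

```python
def brier_score(preds: list[tuple[float, int]]) -> float:
    """Mean squared error between predicted confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in preds) / len(preds)

def expected_calibration_error(preds: list[tuple[float, int]], bins: int = 10) -> float:
    """Bin predictions by confidence; ECE is the count-weighted gap
    between mean confidence and empirical accuracy per bin."""
    total, ece = len(preds), 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        bucket = [(p, y) for p, y in preds
                  if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if bucket:
            conf = sum(p for p, _ in bucket) / len(bucket)
            acc = sum(y for _, y in bucket) / len(bucket)
            ece += len(bucket) / total * abs(conf - acc)
    return ece
```

A perfectly calibrated log (e.g. 0.8-confidence claims that are right 80% of the time) gives ECE 0; the monthly alert fires when ECE exceeds 0.25.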
**Versioned Annotation Guidelines**:
```
/evaluation/
β”œβ”€β”€ guidelines_v1.0.md         # Annotation rules (version controlled)
β”œβ”€β”€ golden_dataset/
β”‚   β”œβ”€β”€ paper_001.json         # Annotated under guidelines v1.0
β”‚   β”œβ”€β”€ paper_002.json         # Annotated under guidelines v1.0
β”‚   └── paper_006.json         # Annotated under guidelines v1.1
β”œβ”€β”€ frozen_anchors/            # NEVER re-annotated
β”‚   β”œβ”€β”€ paper_001_frozen.json
β”‚   └── paper_002_frozen.json
└── holdout/                   # NEVER seen during development
    β”œβ”€β”€ paper_H1.json
    └── paper_H2.json
```

### 4.7 Layer 7: Provenance & Reproducibility

**Output Lineage** (every claim tagged):
```json
{
  "pipeline_version": "2.1.0",
  "model_checkpoint": "research-os-grpo-v2-step-5000",
  "parser_version": "marker-1.2.0",
  "taxonomy_version": "quantum_bio_v1",
  "prompt_hash": "sha256:a3b4c5...",
  "extraction_timestamp": "2026-04-23T10:30:00Z",
  "guidance_schema_version": "1.0"
}
```

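The `prompt_hash` field can be produced by content-addressing the exact prompt text, so any prompt edit changes the lineage record. A sketch; `canonical_prompt` is an illustrative stand-in for the real prompt file:

```python
import hashlib

def prompt_hash(prompt_text: str) -> str:
    """Content-address a prompt so lineage records pin the exact version."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"

canonical_prompt = "Extract qualified claims as JSON."  # illustrative only
lineage_field = prompt_hash(canonical_prompt)
```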
**Security Sandbox** (for repository validation):
```
β”Œβ”€β”€β”€ SANDBOX (isolated from main system) ──────────────────┐
β”‚ β€’ Timeout: 60 seconds max per URL check                  β”‚
β”‚ β€’ Network: HTTP GET only, no POST/PUT/DELETE             β”‚
β”‚ β€’ Download limit: 100MB per artifact                     β”‚
β”‚ β€’ No code execution (dry-run validation only)            β”‚
β”‚ β€’ Actual code execution requires human authorization     β”‚
β”‚ β€’ Credential isolation: no access to main DB or API keys β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Epistemic Embargo** (for IP protection):
```
User creates "Private Graph" β†’
All claims extracted in this mode go to private subgraph β†’
Private subgraph is NOT visible to other users / companion agents β†’
After paper submission: user clicks "Merge to Lab Graph" β†’
Claims move from private to shared graph with full provenance
```

---

## 5. UI Architecture

### 5.1 Courtroom UI (Conflict Resolution)

```
Default View (Review Queue):
  ⚠️ 3-way conflict detected β€” Debye screening threshold
  Papers: Chen 2022, Nakamura 2023, Williams 2024
  Comparability confidence: 0.58 (method differences detected)
  [Review] [Defer] [Dismiss]

Expanded View (Courtroom β€” click to open):
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Chen 2022   β”‚ Nakamura 23 β”‚ Williams 24 β”‚
  β”‚ ACS Nano T1 β”‚ Biosens. T1 β”‚ Sensors T3  β”‚
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ Claim text  β”‚ Claim text  β”‚ Claim text  β”‚
  β”‚ (nestable)  β”‚ (nestable)  β”‚ (nestable)  β”‚
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ Method box  β”‚ Method box  β”‚ Method box  β”‚
  β”‚ N=5 p<.001  β”‚ N=12 p<.01  β”‚ N=3 p=.12   β”‚
  β”‚ [PDFπŸ“„]     β”‚ [PDFπŸ“„]     β”‚ [PDFπŸ“„]     β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  System Analysis (Level 5 β€” unverified):
    "These claims are not directly comparable..."
    Confidence in analysis: 0.62

  Council Votes: Ext1: scope_diff | Ext2: value_mismatch | Critic: scope_diff

  [Agree] [Override with custom] [Defer β€” need more info]

  ⚠️ Missing competitor evidence:
    "3 papers cited by these sources are not yet ingested"
    [Ingest Park 2023] [Ingest Liu 2024] [Ingest Fernandez 2023]
```

### 5.2 Progressive Disclosure Levels

```
Level 0: Dashboard
  Epistemic Health Score per claim cluster
  Today's review queue (priority-ranked)

Level 1: Claim Detail
  Text + tag + composite confidence + source
  [Expand to see scoring breakdown]

Level 2: Scoring Breakdown
  3 separate scores (evidence, truth, qualifier)
  Statistical evidence if available
  Parser confidence for this region

Level 3: Provenance Chain
  Source quote + page + bbox
  Council vote distribution
  Pipeline version + model checkpoint

Level 4: Graph Neighborhood
  2-hop subgraph around this claim
  Typed edges visible
  Inferred edges dashed + labeled

Level 5: Full Debug
  Raw LLM outputs from each council member
  Token-level confidence distribution
  Parse regions and quality flags
```

### 5.3 Manual Synthesis Mode

```
[Toggle] 🧠 Manual Synthesis Mode: ON

In this mode:
  βœ… Claims displayed (text + source)
  βœ… Organized by topic clusters
  ❌ NO confidence scores shown
  ❌ NO conflict flags shown
  ❌ NO gap analysis shown
  ❌ NO system suggestions

The researcher draws connections manually,
then switches back to compare with the system's analysis.
```

---

## 6. Local Deployment

### 6.1 Minimal Setup (16GB VRAM)

```bash
# 1. Install Ollama (simplest local LLM server)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull the quantized model (after fine-tuning and uploading the GGUF)
ollama pull nkshirsa/research-os-brain:q4_k_m

# 3. Verify it's running
curl http://localhost:11434/api/generate -d '{"model": "research-os-brain:q4_k_m", "prompt": "test"}'

# 4. Start the Research OS
pip install -r requirements.txt
python -m phd_research_os.serve --model ollama://research-os-brain:q4_k_m --port 8080

# 5. Open the UI
# http://localhost:8080
```

### 6.2 VRAM Budget

```
Qwen3-8B Q4 AWQ weights:       ~5.0 GB
PolarQuant KV cache (128K):    ~3.8 GB
Qwen3-VL-8B Q4 (for figures):  ~5.0 GB (loaded on-demand, not persistent)
Guidance engine overhead:      ~0.5 GB
ChromaDB embeddings:           ~0.5 GB
──────────────────────────────────────
Total (text only):             ~9.8 GB  ← fits 16GB GPU
Total (with VLM loaded):      ~14.8 GB  ← fits 16GB GPU (tight)
Total (with VLM on-demand):    ~9.8 GB  ← swap VLM in/out per figure
```

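The KV-cache line can be sanity-checked with the standard sizing formula. The layer/head dimensions below are illustrative assumptions for an 8B-class GQA model, not confirmed Qwen3-8B internals, and the quantized width is a rough stand-in for PolarQuant:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Standard KV-cache sizing: 2 tensors (K and V) per layer,
    kv_heads Γ— head_dim values per token, at the given quantized width."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8

# Assumed dims for an 8B-class GQA model at a 128K context,
# with an aggressively quantized (β‰ˆ2-bit) KV cache:
gb = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128,
                    seq_len=131072, bits_per_value=2) / 1e9
```

Under these assumptions the cache lands at a few GB, the same order as the ~3.8 GB budget line; at fp16 the same cache would be roughly 8Γ— larger and would not fit.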
---

## 7. Data Flow (Complete Pipeline)

```
PDF Bundle arrives
        β”‚
        β–Ό
LAYER 0: Structural Ingestion
β”œβ”€β”€ Marker: layout-aware markdown with section boundaries
β”œβ”€β”€ Nougat: equations β†’ LaTeX (routed by region classifier)
β”œβ”€β”€ GROBID: references β†’ structured citations
β”œβ”€β”€ Figure regions β†’ classify β†’ VLM (semantic) or Digitizer (quantitative)
β”œβ”€β”€ Per-region quality scoring (parse_confidence, ocr_confidence)
β”œβ”€β”€ Cross-reference verification (Figure 3 β†’ correct figure object?)
└── Output: list of annotated regions with bbox, section, quality
        β”‚
        β–Ό
LAYER 1: Entity Resolution
β”œβ”€β”€ Normalize entities (gene names, chemicals, assays β†’ canonical IDs)
β”œβ”€β”€ Resolve in-text citations ([32] β†’ DOI β†’ metadata)
β”œβ”€β”€ Check VoR lineage (is this a preprint we already have?)
β”œβ”€β”€ Check retraction status (CrossRef + Retraction Watch)
└── Tag: primary vs inherited claims
        β”‚
        β–Ό
LAYER 2: Qualified Extraction (AI Model Council)
β”œβ”€β”€ Round 1 (parallel): Query Planner + 2 Extractors + Critic
β”‚     Each independently processes section-aware chunks
β”‚     Guidance engine enforces: valid JSON, valid tags, valid ranges
β”‚     Section modifier applied (Abstract=0.7, Results=1.0, Discussion=0.75)
β”œβ”€β”€ Round 2 (debate): Share tags + reasoning (NOT confidence)
β”œβ”€β”€ Round 3 (chairman): Synthesize final claims
β”‚     Apply completeness penalty (code-enforced: 0.7 if missing fields)
β”‚     Preserve qualifiers from source text
β”‚     Extract statistical evidence (N, p, d, CI)
β”‚     Tag null results, inherited citations, causal direction
└── Output: list of qualified claims with full provenance
        β”‚
        β–Ό
LAYER 3: Canonicalization
β”œβ”€β”€ Embed each new claim
β”œβ”€β”€ Compare against existing canonical claims (cosine > 0.85 = merge)
β”œβ”€β”€ Merge: add source as evidence, update confidence aggregation
β”œβ”€β”€ Create: new canonical claim with first source
└── Temporal versioning: a VoR version of the same claim supersedes the preprint version
        β”‚
        β–Ό
LAYER 4: Knowledge Graph
β”œβ”€β”€ Insert claim as graph node
β”œβ”€β”€ Create edges from citation analysis (supports, depends_on)
β”œβ”€β”€ Run conflict detector (keyword + embedding similarity for candidates)
β”œβ”€β”€ Council evaluates candidate conflicts β†’ typed edges (refutes, scope_diff)
β”œβ”€β”€ Check for null evidence β†’ blocking edges
β”œβ”€β”€ Update method-compatibility metadata on edges
β”œβ”€β”€ Cluster related conflicts into case files
└── Run gap analysis (if in Research Landscape mode)
        β”‚
        β–Ό
LAYER 5: Calibrated Scoring (CODE-COMPUTED)
β”œβ”€β”€ evidence_quality = evidence Γ— quality Γ— tier Γ— completeness Γ— section
β”œβ”€β”€ truth_likelihood = evidence_quality + source_bonus - conflict_penalty
β”œβ”€β”€ qualifier_strength = 1.0 - qualifier_countΓ—0.1 - null_penalty - inherited_penalty
β”œβ”€β”€ Statistical evidence gate: large N + tiny effect β†’ cap confidence
β”œβ”€β”€ Parser confidence propagation: parse_confidence caps evidence_quality
└── Store all 3 scores + composite on claim
        β”‚
        β–Ό
LAYER 6: Evaluation (on config change)
β”œβ”€β”€ Regression gate against golden dataset
β”œβ”€β”€ LLM-as-Judge faithfulness + grounding check
β”œβ”€β”€ Brier score monitoring (monthly)
└── Hidden holdout benchmark (quarterly)
        β”‚
        β–Ό
LAYER 7: Provenance
β”œβ”€β”€ Tag claim with full pipeline version lineage
β”œβ”€β”€ Store bbox + source quote for UI traceability
└── Export: Obsidian vault, Courtroom UI, CSV, BibTeX
```

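Layer 3's merge test reduces to a cosine threshold over claim embeddings. A minimal sketch; the 0.85 threshold comes from the pipeline above, while `should_merge` and the toy 2-d vectors are illustrative:

```python
import math

MERGE_THRESHOLD = 0.85  # cosine similarity above which two claims merge

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def should_merge(emb_a: list[float], emb_b: list[float]) -> bool:
    """Merge a new claim into an existing canonical claim when the
    embeddings are near-duplicates (cosine > 0.85)."""
    return cosine(emb_a, emb_b) > MERGE_THRESHOLD
```

In the real layer the vectors would come from the ChromaDB embedding store; claims below the threshold instead create a new canonical claim.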
---

## 8. Implementation Phases (Aligned with PhD Timeline)

### Phase A: Foundation (Weeks 1-6) β€” MUST BE FIRST

| Week | Task | Deliverable |
|------|------|-------------|
| 1-2 | Integrate Marker for PDF β†’ structured markdown | Section-aware regions with bbox |
| 3 | Add Nougat routing for equation-heavy regions | LaTeX preservation |
| 4 | Implement section-aware chunking (replace page-based) | Semantic chunks |
| 5 | Add quality scoring per-region | parse_confidence on every span |
| 6 | Integrate Guidance engine for constrained decoding | Guaranteed valid JSON output |

### Phase B: Identity (Weeks 7-12)

| Week | Task | Deliverable |
|------|------|-------------|
| 7-8 | Claim canonicalization with embedding dedup | Canonical registry |
| 9 | Entity normalization (abbreviations, synonyms) | Ontology mapper |
| 10-11 | Citation chain resolution ([32] β†’ DOI) | Primary vs inherited tagging |
| 12 | VoR lineage detection | Preprint β†’ VoR superseding |

### Phase C: Structure (Weeks 13-20)

| Week | Task | Deliverable |
|------|------|-------------|
| 13-14 | SQLite-backed knowledge graph with typed edges | Graph schema + CRUD |
| 15-16 | Qualifier preservation + null result handling | Blocking edges |
| 17-18 | Method-compatibility layer | Comparability confidence |
| 19-20 | Conflict clustering into case files | Case file UI |

### Phase D: Calibration (Weeks 21-26)

| Week | Task | Deliverable |
|------|------|-------------|
| 21-22 | Epistemic Separation Engine (section modifiers) | Section-aware scoring |
| 23-24 | Statistical evidence extraction (N, p, d, CI) | Practical significance gate |
| 25-26 | GRPO training with epistemic reward functions | Trained model v2 |

### Phase E: Judgment (Weeks 27-32)

| Week | Task | Deliverable |
|------|------|-------------|
| 27-28 | Courtroom UI with PDF.js bounding box viewer | Provenance display |
| 29-30 | Council parallel-then-merge architecture | Hidden confidence protocol |
| 31-32 | Conflict clustering + case file resolution | Batch conflict resolution |

### Phase F: Longevity (Ongoing, PhD Year 1+)

| Task | Trigger |
|------|---------|
| Versioned ontology with backward-compatible queries | 3rd taxonomy update |
| VoR lineage tracking | First preprint β†’ VoR encounter |
| Ongoing Brier calibration monitoring | 50+ calibration data points |
| Gold-standard drift detection | 2nd annotation batch |
| Gap Analysis Protocol | 100+ papers ingested |
| Manual Synthesis Mode | Thesis writing phase |

---

## 9. File Structure (v2.0)

```
phd-research-os/
β”œβ”€β”€ SYSTEM_DESIGN.md                   # THIS DOCUMENT
β”œβ”€β”€ BLINDSPOT_AUDIT_COMPLETE.md        # 87-blindspot audit
β”‚
β”œβ”€β”€ phd_research_os/                   # Core Python package
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚
β”‚   β”œβ”€β”€ layer0/                        # Structural Ingestion
β”‚   β”‚   β”œβ”€β”€ parser.py                  # Marker + Nougat + GROBID orchestrator
β”‚   β”‚   β”œβ”€β”€ region_classifier.py       # LayoutLMv3 region classification
β”‚   β”‚   β”œβ”€β”€ chunker.py                 # Section-aware chunking
β”‚   β”‚   β”œβ”€β”€ figure_router.py           # VLM vs Digitizer routing
β”‚   β”‚   β”œβ”€β”€ plot_digitizer.py          # Quantitative plot β†’ CSV
β”‚   β”‚   β”œβ”€β”€ quality_scorer.py          # Per-span quality scoring
β”‚   β”‚   └── cross_ref_verifier.py      # Figure/Table reference integrity
β”‚   β”‚
β”‚   β”œβ”€β”€ layer1/                        # Entity Resolution
β”‚   β”‚   β”œβ”€β”€ entity_normalizer.py       # Ontology-aware normalization
β”‚   β”‚   β”œβ”€β”€ citation_resolver.py       # In-text [32] β†’ DOI
β”‚   β”‚   β”œβ”€β”€ vor_lineage.py             # Version of Record tracking
β”‚   β”‚   └── retraction_checker.py      # CrossRef + Retraction Watch
β”‚   β”‚
β”‚   β”œβ”€β”€ layer2/                        # Qualified Extraction
β”‚   β”‚   β”œβ”€β”€ council.py                 # Parallel-then-merge council (upgraded)
β”‚   β”‚   β”œβ”€β”€ epistemic_separator.py     # Abstract vs Results scoring
β”‚   β”‚   β”œβ”€β”€ qualifier_extractor.py     # Hedging, negation, conditions
β”‚   β”‚   β”œβ”€β”€ statistical_extractor.py   # N, p, d, CI extraction
β”‚   β”‚   β”œβ”€β”€ constrained_decoder.py     # Guidance engine integration
β”‚   β”‚   └── ood_detector.py            # Mahalanobis distance OOD gating
β”‚   β”‚
β”‚   β”œβ”€β”€ layer3/                        # Canonicalization
β”‚   β”‚   β”œβ”€β”€ deduplicator.py            # Embedding-based near-duplicate detection
β”‚   β”‚   β”œβ”€β”€ canonical_registry.py      # Canonical claim management
β”‚   β”‚   β”œβ”€β”€ alias_merger.py            # Alias mapping and merging
β”‚   β”‚   └── temporal_versioner.py      # Claim version history
β”‚   β”‚
β”‚   β”œβ”€β”€ layer4/                        # Knowledge Graph
β”‚   β”‚   β”œβ”€β”€ graph.py                   # SQLite-backed graph with typed edges
β”‚   β”‚   β”œβ”€β”€ conflict_detector.py       # Pairwise conflict detection (upgraded)
β”‚   β”‚   β”œβ”€β”€ conflict_clusterer.py      # Case file generation
β”‚   β”‚   β”œβ”€β”€ method_compatibility.py    # Cross-paper method comparison
β”‚   β”‚   β”œβ”€β”€ gap_analyzer.py            # Structural hole detection
β”‚   β”‚   └── transitive_constraints.py  # Multi-hop inference safety
β”‚   β”‚
β”‚   β”œβ”€β”€ layer5/                        # Calibrated Scoring
β”‚   β”‚   β”œβ”€β”€ scorer.py                  # Code-computed 3-score system
β”‚   β”‚   β”œβ”€β”€ statistical_gate.py        # Effect size / practical significance
β”‚   β”‚   β”œβ”€β”€ section_modifiers.py       # Abstract/Results/Discussion weights
β”‚   β”‚   └── calibration_monitor.py     # Brier score tracking
β”‚   β”‚
β”‚   β”œβ”€β”€ layer6/                        # Evaluation
β”‚   β”‚   β”œβ”€β”€ regression_gate.py         # Golden dataset regression
β”‚   β”‚   β”œβ”€β”€ llm_judge.py               # Faithfulness/grounding evaluation
β”‚   β”‚   β”œβ”€β”€ stochastic_tester.py       # Run-N-times variance check
β”‚   β”‚   └── annotation_drift.py        # Gold-standard drift detection
β”‚   β”‚
β”‚   β”œβ”€β”€ layer7/                        # Provenance
β”‚   β”‚   β”œβ”€β”€ lineage_tagger.py          # Pipeline version tagging
β”‚   β”‚   β”œβ”€β”€ security_sandbox.py        # Isolated URL/repo validation
β”‚   β”‚   β”œβ”€β”€ license_checker.py         # Usage rights verification
β”‚   β”‚   └── embargo_manager.py         # Private graph / merge workflow
β”‚   β”‚
β”‚   β”œβ”€β”€ ui/                            # Gradio UI
β”‚   β”‚   β”œβ”€β”€ app.py                     # Main application
β”‚   β”‚   β”œβ”€β”€ courtroom.py               # Conflict resolution courtroom
β”‚   β”‚   β”œβ”€β”€ dashboard.py               # Epistemic health dashboard
β”‚   β”‚   β”œβ”€β”€ pdf_viewer.py              # PDF.js with bbox highlighting
β”‚   β”‚   β”œβ”€β”€ manual_synthesis.py        # AI-free exploration mode
β”‚   β”‚   └── export.py                  # CSV, BibTeX, JSON, Obsidian export
β”‚   β”‚
β”‚   β”œβ”€β”€ core/                          # Shared infrastructure
β”‚   β”‚   β”œβ”€β”€ db.py                      # SQLite data layer (existing, extended)
β”‚   β”‚   β”œβ”€β”€ taxonomy.py                # Quantum-Bio V2 (existing)
β”‚   β”‚   β”œβ”€β”€ agents.py                  # Brain interface (existing, upgraded)
β”‚   β”‚   β”œβ”€β”€ agent_os.py                # ECC Harness (existing)
β”‚   β”‚   β”œβ”€β”€ meta_improver.py           # Meta-Improver (existing)
β”‚   β”‚   └── skills/                    # Superpowers (existing)
β”‚   β”‚
β”‚   β”œβ”€β”€ training/                      # Model training
β”‚   β”‚   β”œβ”€β”€ train_sft.py               # Stage 1: SFT
β”‚   β”‚   β”œβ”€β”€ train_dpo.py               # Stage 2: DPO
β”‚   β”‚   β”œβ”€β”€ train_grpo.py              # Stage 3: GRPO with epistemic rewards
β”‚   β”‚   β”œβ”€β”€ train_calibration.py       # Stage 4: ConfTuner
β”‚   β”‚   β”œβ”€β”€ reward_functions.py        # JSON validity, schema, qualifier rewards
β”‚   β”‚   └── generate_dataset.py        # Synthetic + real data generation
β”‚   β”‚
β”‚   └── config/                        # Version-controlled configuration
β”‚       β”œβ”€β”€ prompts/                   # All system prompts (git-tracked)
β”‚       β”œβ”€β”€ taxonomy/                  # Domain taxonomies
β”‚       β”œβ”€β”€ scoring/                   # Weight tables, thresholds
β”‚       └── evaluation/                # Golden dataset + guidelines
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_layer0.py                 # Structural ingestion tests
β”‚   β”œβ”€β”€ test_layer1.py                 # Entity resolution tests
β”‚   β”œβ”€β”€ test_layer2.py                 # Extraction tests
β”‚   β”œβ”€β”€ test_layer3.py                 # Canonicalization tests
β”‚   β”œβ”€β”€ test_layer4.py                 # Knowledge graph tests
β”‚   β”œβ”€β”€ test_layer5.py                 # Scoring tests
β”‚   β”œβ”€β”€ test_layer6.py                 # Evaluation tests
β”‚   β”œβ”€β”€ test_layer7.py                 # Provenance tests
β”‚   β”œβ”€β”€ test_db.py                     # Data layer (existing 22 tests)
β”‚   β”œβ”€β”€ test_agent_os.py               # ECC harness (existing 21 tests)
β”‚   β”œβ”€β”€ test_taxonomy.py               # Taxonomy (existing 27 tests)
β”‚   β”œβ”€β”€ test_skills_and_meta.py        # Skills + meta (existing 30 tests)
β”‚   └── test_council.py                # Council (existing 19 tests)
β”‚
└── docs/
    β”œβ”€β”€ ARCHITECTURE.md                # Project map (existing)
    β”œβ”€β”€ AGENTS.md                      # Agent registry (existing)
    β”œβ”€β”€ USAGE.md                       # Daily workflow guide
    β”œβ”€β”€ ANNOTATION_GUIDELINES.md       # Versioned golden dataset rules
    └── DEPLOYMENT.md                  # Local setup guide
```

---

## 10. Success Criteria

The system is DONE when:

1. **A researcher can drop in a PDF and get back epistemic-tagged claims with source bounding boxes in under 5 minutes**
2. **Two claims from different papers that say the same thing are automatically recognized as the same canonical claim**
3. **A null result creates a blocking edge, not a gap, in the knowledge graph**
4. **An Abstract claim that overstates the Results gets automatically penalized**
5. **The courtroom shows three conflicting papers side-by-side with method comparison, and the researcher can resolve the conflict in 2 clicks**
6. **The gap analyzer identifies untested entity pairs and generates Decision Objects**
7. **The system knows when it doesn't know β€” OOD papers, unextractable regions, and uncalibrated confidence all surface to the human**
8. **All of the above works on a 16GB consumer GPU with zero internet dependency for paper processing**

---

*This design addresses all 87 blindspots from the complete audit.*
*Implementation timeline: ~32 weeks pre-PhD + ongoing during PhD Years 1-3.*
*The hardest part is not building it. It's keeping it honest.*

---