
PhD Research OS — Deferred Features Registry

Features That Are Real But Not Built Now

Date: 2026-04-23
Status: LIVING DOCUMENT — updated as features move from deferred → active
Purpose: Honest accounting of what we're NOT building yet and WHY


How This Document Works

Every feature below was validated as technically sound. These features are NOT rejected; they are deferred. Each one has:

  • What it is (plain English)
  • Why it's deferred (the honest reason)
  • What must exist first (prerequisites)
  • Estimated effort (in weeks, not days)
  • When to revisit (trigger condition)

Features are split into two tiers:

  • Tier 1: Significant Effort — implementable with current knowledge, but requires weeks of focused engineering
  • Tier 2: Aspirational / Research-Grade — requires novel research, unproven at our scale, or depends on immature ecosystems

Tier 1: Implementable But Requires Significant Effort

These are engineering tasks, not research tasks. The path is clear but the work is substantial.


T1-1: 4-Stage Training Pipeline (SFT → DPO → GRPO → ConfTuner)

What it is: The full training progression where each stage teaches the model something the previous stage couldn't:

  • SFT: "Here's the right answer" (~70% quality)
  • DPO: "This answer is better than that one" (~80%)
  • GRPO: "Reward for good JSON, correct tags, preserved qualifiers" (~85-90%)
  • ConfTuner: "Make your confidence numbers match reality" (calibrated)

Why it's deferred: Each stage depends on the previous stage's output AND requires its own dataset format:

  • DPO needs preference pairs (correct vs incorrect extractions for the same input)
  • GRPO needs a prompt-only dataset plus reward functions
  • ConfTuner needs calibration data (model predictions vs ground truth)

None of these datasets exist yet. Rough sketches of the three record shapes follow below.
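The sketches below use placeholder field names, not a finalized schema. The prompt/chosen/rejected layout for DPO follows the convention TRL's DPOTrainer expects; the reward function is one plausible example of the three needed for GRPO.

```python
# Rough sketches of the three record shapes. Field names are placeholders.

# DPO: preference pair. Same prompt, correct extraction vs the SFT model's mistake.
dpo_record = {
    "prompt": "Extract claims from: <paper chunk>",
    "chosen": '{"claim": "X inhibits Y", "epistemic_tag": "suggested"}',
    "rejected": '{"claim": "X inhibits Y", "epistemic_tag": "established"}',
}

# GRPO: prompt-only. Rewards are computed from the model's sampled completions.
grpo_record = {"prompt": "Extract claims from: <paper chunk>"}

def reward_valid_json(completion: str) -> float:
    """One plausible reward function: well-formed JSON earns reward."""
    import json
    try:
        json.loads(completion)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

# ConfTuner: calibration point. Stated confidence vs actual correctness.
calibration_record = {"predicted_confidence": 0.9, "was_correct": False}
```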

What must exist first:

  • ✅ SFT stage (exists, being upgraded)
  • A working model from SFT that produces plausible but imperfect outputs
  • 500+ preference pairs for DPO (run SFT model, collect its mistakes, pair with correct answers)
  • 3 reward functions validated on held-out data for GRPO
  • 100+ calibration data points for ConfTuner

Estimated effort: 8-12 weeks total (2-3 weeks per stage including dataset creation)

When to revisit: After the SFT-trained model has been evaluated on SciFact and its error patterns are understood. You need to know WHERE it fails before you can build DPO pairs that target those failures.

Source documents: SYSTEM_DESIGN.md §3, FUTURE_IMPROVEMENTS.md §9 (TR-1)


T1-2: Conflict Investigation Protocol (from CLAIRE)

What it is: When two claims potentially contradict, don't just flag it — investigate. Retrieve surrounding context for both claims, check method compatibility, search the knowledge graph for additional evidence, and produce a conflict report grounded in multiple sources.
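The protocol reduces to four lookups plus a judgment. A minimal sketch, assuming hypothetical `retriever` and `graph` interfaces and the method-compatibility metadata listed below:

```python
from dataclasses import dataclass, field

@dataclass
class ConflictReport:
    """Result of investigating a flagged contradiction."""
    claim_a: str
    claim_b: str
    methods_comparable: bool            # same buffer / species / assay type?
    evidence: list = field(default_factory=list)
    verdict: str = "unresolved"         # e.g. "genuine conflict" or "method artifact"

def investigate(claim_a, claim_b, retriever, graph) -> ConflictReport:
    # 1. Retrieve surrounding context for both claims.
    context = [retriever.context(claim_a), retriever.context(claim_b)]
    # 2. Check method compatibility from claim metadata.
    comparable = claim_a.methods == claim_b.methods
    # 3. Search the knowledge graph for additional evidence.
    context += graph.related_claims(claim_a) + graph.related_claims(claim_b)
    # 4. An LLM pass over `context` would fill in the verdict.
    return ConflictReport(claim_a.text, claim_b.text, comparable, context)
```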

Why it's deferred: Requires three infrastructure pieces that don't exist yet:

  1. Embedding-based retrieval (to find related claims) — being built now
  2. A populated knowledge graph with >100 claims (to have enough context to investigate)
  3. Method-compatibility metadata on claims (buffer, species, assay type)

What must exist first:

  • ✅ SPECTER2 embeddings (being integrated)
  • Knowledge graph with typed edges populated from real papers
  • Method-compatibility fields in claim schema
  • At least 50 papers ingested to have enough cross-paper evidence

Estimated effort: 3-4 weeks

When to revisit: After 50+ papers have been ingested through the pipeline and the knowledge graph has enough density for cross-referencing.

Source documents: SYSTEM_INSPIRATIONS.md §2 (AD-6), PRIOR_ART_ANALYSIS.md §2 (CLAIRE)


T1-3: Devil's Advocate Mode

What it is: A UI mode that actively challenges high-confidence claims by finding counter-evidence, suggesting alternative epistemic labels, and identifying what would need to be true for the claim to be wrong.
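A rough sketch of the challenge record and target selection, with hypothetical names (the `graph.claims` interface and the 0.9 confidence threshold are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Challenge:
    """What the mode would produce for one high-confidence claim."""
    claim_id: str
    counter_evidence: list    # graph claims that cut against it
    alternative_label: str    # e.g. "suggested" rather than "established"
    falsifiers: list          # what would have to be true for the claim to be wrong

def challenge_targets(graph, threshold: float = 0.9):
    """Only high-confidence claims get challenged; that is where
    unwarranted certainty does the most damage."""
    return [c for c in graph.claims if c.confidence >= threshold]
```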

Why it's deferred: Requires a working knowledge graph with conflict detection. Without a populated graph, the devil's advocate has nothing to advocate with.

What must exist first:

  • Knowledge graph with >200 claims and typed edges
  • Conflict detection pipeline (at least the SciBERT-NLI pre-filter)
  • Epistemic Trigger Words validator (to suggest alternative labels)
  • Working Courtroom UI (to display the challenge)

Estimated effort: 2-3 weeks

When to revisit: After the Courtroom UI is functional and conflict detection is producing real results on real papers.

Source documents: SYSTEM_INSPIRATIONS.md §4 (NF-2)


T1-4: Pre-Extraction Filter (RCS Pattern from PaperQA2)

What it is: Before sending paper sections to the expensive AI Council, run a lightweight filter that scores each chunk for "claim density." Only high-density chunks go to the full extraction pipeline; low-density chunks (boilerplate, acknowledgments, generic intro) get lightweight extraction or are skipped entirely.
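One plausible first-pass heuristic, before any learned model exists: count the lexical signals that tend to co-occur with claims. The signal lists and threshold below are illustrative and would be calibrated against the 20-paper baseline described under the prerequisites.

```python
import re

# Illustrative signal patterns; calibrate against real extraction data.
STATS = re.compile(r"p\s*[<=>]\s*0?\.\d+|95%\s*CI|\bn\s*=\s*\d+|±", re.IGNORECASE)
HEDGES = re.compile(r"\b(suggests?|indicates?|may|appears?|likely)\b", re.IGNORECASE)

def claim_density(chunk: str) -> float:
    """Claim signals (statistics, hedging verbs) per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    if not sentences:
        return 0.0
    signals = len(STATS.findall(chunk)) + len(HEDGES.findall(chunk))
    return signals / len(sentences)

def route(chunk: str, threshold: float = 0.3) -> str:
    """High-density chunks go to the full AI Council; the rest don't."""
    return "full_council" if claim_density(chunk) >= threshold else "lightweight"
```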

Why it's deferred: Requires a claim-density scoring model or heuristic, plus enough real extraction data to know what "high density" looks like in practice. Building the filter before understanding the extraction patterns would be premature optimization.

What must exist first:

  • End-to-end pipeline running on real papers (to understand what chunks produce claims)
  • SPECTER2 embeddings (for semantic scoring)
  • At least 20 papers processed to establish claim-density baselines per section type

Estimated effort: 2 weeks

When to revisit: After processing 20+ real papers and observing which sections consistently produce zero claims (those are the ones to filter).

Source documents: SYSTEM_INSPIRATIONS.md §2 (AD-1), PRIOR_ART_ANALYSIS.md §2 (PaperQA2 RCS)


T1-5: Cross-Reference Verification (Dual Evidence from FactReview)

What it is: For every extracted claim, check it against TWO sources: (1) the paper's own evidence (does the data actually support the claim?), and (2) the existing knowledge graph (do other papers agree or disagree?).
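In code, the dual check is two independent verdicts that both attach to the claim. A minimal sketch, assuming hypothetical `judge`, `paper`, and `graph` interfaces:

```python
def verify_claim(claim, paper, graph, judge):
    """Dual-evidence check: internal (paper) and external (graph) verdicts."""
    # (1) Internal: does the paper's own evidence support the claim?
    # `judge` stands in for the verification prompt comparing claim vs source.
    internal = judge.supports(claim.text, paper.evidence_for(claim))
    # (2) External: do other papers in the knowledge graph agree or disagree?
    related = graph.find_similar(claim, min_similarity=0.8)
    external = {
        "agree": [r for r in related if r.stance == "supports"],
        "disagree": [r for r in related if r.stance == "contradicts"],
    }
    return {"internal_supported": internal, "external": external}
```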

Why it's deferred: The internal check requires structured understanding of tables and figures (Layer 0 improvements). The external check requires a populated knowledge graph. Neither dependency is met yet.

What must exist first:

  • Table parsing that preserves header-cell associations
  • Knowledge graph with >100 claims for external cross-referencing
  • A verification prompt that can compare a claim against its source quote

Estimated effort: 2-3 weeks

When to revisit: After table parsing is improved and the knowledge graph has enough claims for meaningful cross-referencing.

Source documents: SYSTEM_INSPIRATIONS.md §2 (AD-2), PRIOR_ART_ANALYSIS.md §2 (FactReview)


T1-6: Coverage Checker / Completeness Auditor (from Paper Circle)

What it is: After the AI Council finishes extraction, a separate agent checks: Did we extract at least one claim from every Results subsection? Are there orphaned figures/tables that no claim references? Are there unreferenced statistics?
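A minimal sketch of the audit, assuming hypothetical `paper` and `claims` objects with section and figure metadata (exactly the Layer 0 outputs listed as prerequisites below):

```python
def audit_coverage(paper, claims) -> dict:
    """Post-extraction completeness checks, Paper Circle style."""
    report = {}
    # Every Results subsection should yield at least one claim.
    covered = {c.section_id for c in claims}
    report["empty_sections"] = [
        s.id for s in paper.sections if s.is_results and s.id not in covered
    ]
    # Orphaned figures/tables: present in the paper, referenced by no claim.
    referenced = {ref for c in claims for ref in c.figure_refs}
    report["orphaned_figures"] = [
        f.label for f in paper.figures if f.label not in referenced
    ]
    return report
```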

Why it's deferred: Requires section-aware parsing (Layer 0) to reliably identify which sections exist, and figure/table detection to know what should be referenced. Also requires the extraction pipeline to be stable enough that "missing" claims are genuinely missed, not artifacts of pipeline bugs.

What must exist first:

  • Reliable section identification in Layer 0
  • Figure and table detection with labels
  • Stable extraction pipeline (not producing false positives that mask false negatives)

Estimated effort: 2 weeks

When to revisit: After the extraction pipeline has been validated on 10+ papers with manual review confirming it's stable.

Source documents: SYSTEM_INSPIRATIONS.md §2 (AD-4), PRIOR_ART_ANALYSIS.md §2 (Paper Circle)


T1-7: Epistemic Provenance Levels (Human Verification Tracking)

What it is: Every claim gets a verification level from 🤖 (machine-extracted, unreviewed) through ⭐ (expert-verified) to 📄 (published/peer-reviewed). Level 0 claims can't be cited in exports. Level 2+ claims become training data.
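The levels map naturally onto an ordered enum with gating rules. A sketch; the exact number of intermediate levels isn't pinned down here, so REVIEWED is a placeholder name:

```python
from enum import IntEnum

class VerificationLevel(IntEnum):
    """Ordered provenance levels; intermediate levels are placeholders."""
    MACHINE = 0      # 🤖 machine-extracted, unreviewed
    REVIEWED = 1     # human-reviewed (placeholder)
    EXPERT = 2       # ⭐ expert-verified
    PUBLISHED = 3    # 📄 published / peer-reviewed

def citable(claim) -> bool:
    # Level 0 claims can't be cited in exports.
    return claim.level > VerificationLevel.MACHINE

def training_eligible(claim) -> bool:
    # Level 2+ claims become training data.
    return claim.level >= VerificationLevel.EXPERT
```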

Why it's deferred: Requires a UI for human review and verification, plus a feedback loop that captures corrections. The correction-to-training-data pipeline is itself a significant piece of infrastructure.

What must exist first:

  • Working UI with claim display and edit capability
  • Database schema for verification levels (straightforward)
  • A workflow for researchers to review and upgrade claims
  • Pipeline to convert human corrections into training examples

Estimated effort: 2-3 weeks

When to revisit: After the core UI (Courtroom View) is functional and researchers can interact with individual claims.

Source documents: SYSTEM_INSPIRATIONS.md §4 (NF-3)


T1-8: Constrained Decoding (Guidance / SGLang Integration)

What it is: Instead of asking the LLM to output JSON and hoping for the best, use the Guidance library or SGLang to GUARANTEE valid JSON with valid enum values. The model literally cannot output an invalid epistemic tag.
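A minimal sketch in Guidance's grammar-composition style. Guidance's API has shifted across versions, and the tag vocabulary and model name below are illustrative, not the finalized schema:

```python
# Sketch only: API details vary by Guidance version.
from guidance import models, gen, select

lm = models.Transformers("Qwen/Qwen2.5-7B-Instruct")  # any local HF model

EPISTEMIC_TAGS = ["established", "suggested", "hypothesized", "contested"]

lm += "Extract the main claim as JSON.\n"
lm += '{"claim": "' + gen("claim", stop='"', max_tokens=60) \
    + '", "epistemic_tag": "' + select(EPISTEMIC_TAGS, name="tag") + '"}'

assert lm["tag"] in EPISTEMIC_TAGS  # enforced at the token level, by construction
```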

Why it's deferred: Requires a local model serving layer (Ollama/vLLM) to be integrated first. Guidance/SGLang operate at the token-generation level, which means they need direct access to the model's logits, not just an API endpoint.

What must exist first:

  • Local model serving (Ollama or vLLM running)
  • A finalized claim schema (the constrained schema must match what the training data produces)
  • Testing infrastructure to compare constrained vs unconstrained output quality

Estimated effort: 2 weeks

When to revisit: After local model serving is set up and the claim schema is stable.

Source documents: SYSTEM_DESIGN.md §4.2, BLINDSPOT_AUDIT_COMPLETE.md (I-3)


T1-9: Marker PDF Parser Integration

What it is: Replace PyMuPDF/pdfplumber with Marker (ML-based layout-aware parser) for section-aware parsing, table structure preservation, and equation handling.

Why it's deferred: Marker requires PyTorch and has significant dependencies. Integration needs testing against real papers to verify it improves over the current parser, and a fallback strategy for when Marker fails.
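The fallback routing can be as simple as a try/except wrapper. In the sketch below, `parse_with_marker` stands in for whichever Marker entry point the integration settles on (Marker's Python API has shifted between releases); the fallback path uses PyMuPDF:

```python
import fitz  # PyMuPDF, the current battle-tested fallback

def parse_with_marker(pdf_path: str) -> str:
    """Placeholder for the Marker call; its Python API varies by release."""
    raise NotImplementedError

def parse_pdf(pdf_path: str) -> tuple[str, str]:
    """Try the layout-aware parser first; fall back to PyMuPDF on failure."""
    try:
        return parse_with_marker(pdf_path), "marker"
    except Exception:
        with fitz.open(pdf_path) as doc:
            return "\n".join(page.get_text() for page in doc), "pymupdf"
```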

What must exist first:

  • A test set of 10+ diverse real papers (different layouts, formats, journals)
  • Comparison metrics: current parser vs Marker on the same papers
  • Fallback routing logic (Marker fails → PyMuPDF)

Estimated effort: 2-3 weeks (including testing and fallback logic)

When to revisit: After the end-to-end pipeline works with the basic parser. Parser upgrade should improve quality, not block initial functionality.

Source documents: SYSTEM_DESIGN.md §4.0, FUTURE_IMPROVEMENTS.md §4 (P-1)


Tier 2: Aspirational / Research-Grade

These are intellectually compelling but carry research risk. Some require novel work at the frontier. Others depend on ecosystems that aren't mature enough.


T2-1: BDH (Baby Dragon Hatchling) as Verification Model

What it is: Train a 100-500M parameter BDH model as an interpretable claim verifier. BDH's monosemantic synapses would let you literally see which concepts the model activates during verification — "statistical evidence" synapse fires, "negation" synapse fires, etc.

Why it's aspirational:

  • BDH is at GPT-2 scale (10M-1B params). Proven on Sudoku and simple language tasks, NOT on scientific claim verification.
  • Training a BDH model from scratch on scientific text is itself a research project (no pre-trained scientific BDH exists).
  • The interpretability story is compelling but unvalidated for this specific task.
  • The codebase is relatively new and community support is limited.

Research risk: HIGH. The synapse activation patterns that work for "Euro" and "France" may not cleanly separate "statistical significance" from "hedging language" in scientific text. Monosemanticity at the concept level doesn't guarantee monosemanticity at the epistemic level.

What would need to happen:

  • Train a BDH model on a large scientific corpus (months of work)
  • Validate that monosemantic synapses emerge for epistemic concepts (not guaranteed)
  • Compare verification quality against a fine-tuned BERT/SciBERT baseline (must justify the effort)

Estimated effort: 3-6 months (research project)

When to revisit: When BDH pre-trained models for scientific text become available on HuggingFace, OR when someone publishes BDH results on a scientific NLU task.

Source documents: REQUIREMENTS_FROM_SOURCES.md (B-REQ-4, B-REQ-5, B-REQ-6, INT-REQ-2, INT-REQ-3)


T2-2: DTA-Coupled Multi-Agent Council (from Emergence Transformer)

What it is: Instead of a sequential council pipeline, implement Dynamical Temporal Attention between council members. Each member attends to other members' reasoning history (neighbor-DTA) and its own past outputs (self-DTA), with tunable α (consensus vs diversity) and β (memory decay) parameters.
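To make α and β concrete, here is one toy reading of the coupling over embedded agent outputs. This is emphatically NOT the Emergence Transformer's actual equations, just an illustration of the two knobs:

```python
import numpy as np

def dta_context(self_history, neighbor_histories, alpha=0.5, beta=0.9):
    """Toy DTA-style context vector for one council member.

    alpha: consensus vs diversity (weight on neighbors' histories).
    beta:  memory decay (how quickly older outputs fade).
    Histories are lists of embedding vectors, oldest first.
    """
    def decayed(history):
        w = np.array([beta ** (len(history) - 1 - t) for t in range(len(history))])
        return (w[:, None] * np.stack(history)).sum(axis=0) / w.sum()

    self_term = decayed(self_history)                                  # self-DTA
    neighbor_term = np.mean([decayed(h) for h in neighbor_histories], axis=0)  # neighbor-DTA
    return (1 - alpha) * self_term + alpha * neighbor_term
```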

Why it's aspirational:

  • The Emergence Transformer paper proves DTA works on opinion dynamics models and Hopfield networks — NOT on LLM agent systems.
  • The analogy between coupled oscillators and LLM council members is conceptually elegant but lacks empirical validation.
  • Implementing "attention between agents across papers" requires a persistent state store for each agent's history, plus attention computation across that history — this is a new architecture, not a configuration change.
  • The α/β tuning requires ground truth about when consensus IS correct (to know if your consensus level is right).

Research risk: HIGH. The oscillator→agent mapping is a metaphor, not a proof. DTA might not improve council quality at all — or it might improve it in ways that are hard to measure without a gold standard.

What would need to happen:

  • Implement basic persistent agent memory (store outputs across papers)
  • Design and run a controlled experiment: sequential council vs DTA-coupled council on the same 50 papers
  • Develop metrics for "council quality" beyond just extraction accuracy

Estimated effort: 2-4 months (research + engineering)

When to revisit: After the sequential council is working well enough to serve as a baseline. You need a working baseline to measure improvement against.

Source documents: REQUIREMENTS_FROM_SOURCES.md (B-REQ-3, A-REQ-1, G-REQ-1, TR-REQ-3)


T2-3: MAGMA Multi-Graph Memory Architecture

What it is: Instead of one knowledge graph mixing all relationship types, maintain separate graphs for temporal, causal, semantic, and entity relationships. Queries are routed to the appropriate graph based on question type.
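A sketch of the routing idea. The four graphs and the keyword cues are placeholders; real routing would be a learned classifier, and the brittleness of this heuristic is exactly the failure mode noted below:

```python
# Illustrative only: keyword routing is the naive baseline.
GRAPHS = {"temporal": ..., "causal": ..., "semantic": ..., "entity": ...}

ROUTING_CUES = {
    "temporal": ("when", "before", "after", "since", "trend"),
    "causal":   ("why", "cause", "effect", "leads to", "because"),
    "entity":   ("who", "which protein", "which sensor", "what is"),
}

def route_query(question: str):
    q = question.lower()
    for graph_name, cues in ROUTING_CUES.items():
        if any(cue in q for cue in cues):
            return GRAPHS[graph_name]
    return GRAPHS["semantic"]  # default: semantic similarity search
```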

Why it's aspirational:

  • The current SQLite-backed single graph hasn't hit its limits yet. Premature optimization for scale we don't have.
  • Multi-graph consistency maintenance (keeping 4 graphs in sync when a claim is added or modified) is a hard distributed systems problem.
  • Query routing ("is this a temporal question or a causal question?") is itself a classification task that can fail.

Research risk: MEDIUM. The concept is well-understood in graph database literature, but the epistemic-specific routing is novel.

What would need to happen:

  • The single graph needs to hit a performance or quality wall (>10,000 claims, queries becoming slow or returning irrelevant results)
  • A clear demonstration that query quality improves with graph separation (not just query speed)

Estimated effort: 4-6 weeks (engineering, not research)

When to revisit: After 1,000+ claims are in the graph and you can measure query quality degradation.

Source documents: SYSTEM_DESIGN.md Appendix A


T2-4: Post-Transformer Architecture Migration (Mamba/RWKV/Jamba)

What it is: Replace the Transformer backbone (Qwen) with a State Space Model (Mamba), linear attention model (RWKV), or hybrid (Jamba/Griffin) for better long-sequence handling and constant-memory inference.

Why it's aspirational:

  • The Transformer ecosystem (AWQ quantization, vLLM serving, TRL training, LoRA adapters) is mature and battle-tested.
  • SSM fine-tuning ecosystems are immature. LoRA for Mamba is experimental. GRPO for Mamba doesn't exist in TRL.
  • Qwen3-8B at 128K context already handles full papers. The quadratic attention bottleneck isn't the current blocker.
  • Migration would require retraining from scratch or developing Mamba-specific adapter techniques.

Research risk: MEDIUM-HIGH. The architectures work well for pre-training. Fine-tuning for structured output (our use case) is uncharted.

What would need to happen:

  • A Mamba/RWKV/Jamba model pre-trained on scientific text with instruction-following capability
  • TRL support for SSM fine-tuning (SFT at minimum, DPO/GRPO preferred)
  • Demonstrated advantage over Transformer on our specific task (claim extraction quality, not just perplexity)

Estimated effort: 3-6 months (depends on ecosystem maturity)

When to revisit: When a Mamba-based instruction model appears on HuggingFace with >10K downloads and TRL adds SSM support. Check quarterly.

Source documents: SYSTEM_DESIGN.md Appendix A, REQUIREMENTS_FROM_SOURCES.md (Emergence Transformer discussion)


T2-5: CompreSSM (Control-Theory-Based Model Compression)

What it is: Train a large SSM, then use Hankel singular value decomposition to identify which hidden dimensions matter and surgically remove the rest. Produces a compact model that outperforms one trained small from the start.
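For a toy linear SSM, the Hankel singular values fall out of the two Gramians. This is the standard balanced-truncation computation, shown only to make "which hidden dimensions matter" concrete; it is not CompreSSM's pipeline:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Toy linear SSM: x_{t+1} = A x_t + B u_t,  y_t = C x_t
rng = np.random.default_rng(0)
n = 8
A = 0.9 * np.eye(n) + 0.01 * rng.standard_normal((n, n))  # stable-ish dynamics
B, C = rng.standard_normal((n, 2)), rng.standard_normal((2, n))

# Controllability and observability Gramians (discrete Lyapunov equations).
P = solve_discrete_lyapunov(A, B @ B.T)
Q = solve_discrete_lyapunov(A.T, C.T @ C)

# Hankel singular values: sqrt of the eigenvalues of P @ Q.
hsv = np.sort(np.sqrt(np.abs(np.linalg.eigvals(P @ Q))))[::-1]
print(hsv)  # small values mark hidden dims that barely affect I/O behavior
```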

Why it's aspirational:

  • Requires a trained large SSM first (see T2-4 — which is itself aspirational).
  • CompreSSM is bleeding-edge research from a single paper. No production implementations exist.
  • The control theory mathematics (Hankel singular values, balanced truncation) require specialized expertise.

Research risk: VERY HIGH. Novel technique, single paper, no independent replication.

What would need to happen:

  • T2-4 must be completed first (have a trained SSM)
  • CompreSSM paper must be replicated by other groups
  • Open-source implementation of the compression pipeline

Estimated effort: Unknown (research frontier)

When to revisit: If CompreSSM gets replicated and an open-source tool is published.

Source documents: SYSTEM_DESIGN.md Appendix A


T2-6: BDH Model Concatenation for Domain Composability

What it is: Train separate BDH models for different scientific domains (biosensors, materials science, ecology) and physically concatenate them into a combined model with no retraining and no forgetting.

Why it's aspirational:

  • Depends on T2-1 (BDH verification model) existing first.
  • Model concatenation has been demonstrated for translation tasks in the BDH paper, not for scientific claim extraction.
  • The claim that "each domain's neurons are physically separate" assumes monosemantic domain boundaries — which may not hold for interdisciplinary papers (biosensor papers that use materials science concepts).

Research risk: HIGH. Interesting if BDH works for science at all (T2-1), but the composability claim needs empirical validation on overlapping domains.

Estimated effort: 2-3 months (after T2-1 is complete)

When to revisit: Only after T2-1 demonstrates that BDH works for scientific claim verification at all.

Source documents: REQUIREMENTS_FROM_SOURCES.md (B-REQ-6)


T2-7: Hebbian Working Memory for Paper-Level Context (from BDH)

What it is: Instead of processing paper chunks independently, use BDH's synaptic plasticity to build cumulative understanding as the model reads through a paper. By the time it reaches the Discussion, it "remembers" the Introduction through synapse state.
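A generic Hebbian update (not BDH's specific rule) makes "remembering through synapse state" concrete; `encode` is a hypothetical chunk embedder:

```python
import numpy as np

def read_paper(chunks, encode, synapses, lr=0.1, decay=0.99):
    """Generic Hebbian sketch: synapse state accumulates as chunks are
    read in order, so the Discussion is processed in the context of
    everything read since the Introduction."""
    for chunk in chunks:              # sequential by construction: no parallelism
        x = encode(chunk)             # activation vector for this chunk
        y = np.tanh(synapses @ x)     # response shaped by accumulated state
        synapses = decay * synapses + lr * np.outer(y, x)  # fire together, wire together
    return synapses                   # "memory" of the whole paper
```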

Why it's aspirational:

  • Requires BDH (T2-1).
  • Standard Transformer context windows (128K for Qwen3-8B) already solve this problem for most papers.
  • The Hebbian approach is architecturally elegant but requires the entire paper to be processed sequentially (no parallelization), which is slower.

Research risk: MEDIUM. Interesting alternative to long-context Transformers, but the practical advantage over "just use 128K context" is unclear.

When to revisit: If context window limitations become a real blocker (papers >128K tokens, or KV cache memory becomes prohibitive).

Source documents: REQUIREMENTS_FROM_SOURCES.md (B-REQ-5)


T2-8: DTA-Inspired Continual Learning for Domain Adaptation

What it is: When fine-tuning on a new scientific domain, use the Emergence Transformer's DTA principle: the model attends to its own past activations (self-DTA) to remember old domains while learning new ones. Practically maps to O-LoRA (orthogonal LoRA adapters).
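The core of O-LoRA is a regularizer pushing the new adapter's input subspace orthogonal to the frozen earlier adapters. A minimal sketch of that penalty, simplified from the O-LoRA paper's formulation:

```python
import torch

def olora_penalty(A_new: torch.Tensor, frozen_As: list,
                  lam: float = 0.5) -> torch.Tensor:
    """Orthogonality penalty on LoRA 'A' matrices (shape [r, d_in]).

    Pushes the new domain's low-rank subspace away from the subspaces of
    earlier, frozen adapters, so new learning lands in directions that
    don't overwrite the old domains."""
    penalty = torch.zeros((), device=A_new.device)
    for A_old in frozen_As:            # one per previously learned domain
        penalty = penalty + (A_new @ A_old.T).abs().sum()
    return lam * penalty

# During fine-tuning on domain t:
#   loss = task_loss + olora_penalty(adapter_t.A, [a.A for a in earlier_adapters])
```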

Why it's aspirational:

  • O-LoRA itself is implementable (Tier 1 level), but the DTA-theoretic justification for WHY orthogonal subspaces prevent forgetting is the aspirational part.
  • The Emergence Transformer paper provides theoretical backing for Hopfield networks, not LoRA adapters. The connection is by analogy, not by proof.
  • Standard sequential LoRA training with replay buffers may work just as well without the theoretical framework.

Research risk: LOW for O-LoRA itself, HIGH for proving the DTA connection.

What would need to happen:

  • Standard O-LoRA implementation (weeks)
  • Controlled experiment comparing O-LoRA vs naive sequential LoRA vs full retraining on domain expansion
  • Optional: verify that the DTA theory predicts the empirical results

Estimated effort: O-LoRA alone: 2-3 weeks. With DTA validation: 2-3 months.

When to revisit: When the model needs to learn a second scientific domain and forgetting is observed.

Source documents: REQUIREMENTS_FROM_SOURCES.md (TR-REQ-3), BLINDSPOT_AUDIT_COMPLETE.md (I-8)


Summary Table

Tier 1: Significant Effort (Implementable)

| ID | Feature | Effort | Prerequisite | Revisit When |
|----|---------|--------|--------------|--------------|
| T1-1 | 4-Stage Training (DPO→GRPO→ConfTuner) | 8-12 weeks | SFT model + error analysis | SFT evaluated on SciFact |
| T1-2 | Conflict Investigation Protocol | 3-4 weeks | Embeddings + populated graph | 50+ papers ingested |
| T1-3 | Devil's Advocate Mode | 2-3 weeks | Conflict detection + graph | Courtroom UI functional |
| T1-4 | Pre-Extraction Filter (RCS) | 2 weeks | Embeddings + 20 papers processed | Section-level claim-density baselines exist |
| T1-5 | Cross-Reference Verification | 2-3 weeks | Table parsing + populated graph | 100+ claims in graph |
| T1-6 | Completeness Auditor | 2 weeks | Section detection + stable pipeline | Pipeline validated on 10+ papers |
| T1-7 | Epistemic Provenance Levels | 2-3 weeks | UI for claim review + feedback pipeline | Courtroom UI functional |
| T1-8 | Constrained Decoding | 2 weeks | Local model serving | Ollama/vLLM integrated |
| T1-9 | Marker PDF Parser | 2-3 weeks | Test set of 10+ real papers | End-to-end pipeline works with basic parser |

Total Tier 1: ~25-34 weeks of work, summing the estimates above (spread over PhD Years 1-2, NOT sequential)

Tier 2: Aspirational / Research-Grade

| ID | Feature | Risk | Effort | Revisit When |
|----|---------|------|--------|--------------|
| T2-1 | BDH Verification Model | HIGH | 3-6 months | Scientific BDH model on HuggingFace |
| T2-2 | DTA-Coupled Council | HIGH | 2-4 months | Sequential council is a working baseline |
| T2-3 | MAGMA Multi-Graph | MEDIUM | 4-6 weeks | 1,000+ claims in graph |
| T2-4 | Post-Transformer Migration | MED-HIGH | 3-6 months | SSM fine-tuning in TRL |
| T2-5 | CompreSSM Compression | VERY HIGH | Unknown | Paper replicated + open-source tool |
| T2-6 | BDH Domain Concatenation | HIGH | 2-3 months | T2-1 works for science |
| T2-7 | Hebbian Working Memory | MEDIUM | 2-3 months | Context window is a real bottleneck |
| T2-8 | DTA Continual Learning | LOW-HIGH | 2 weeks-3 months | Domain forgetting is observed |

What We're Building NOW (for reference)

The following features are in the "strongly implementable" category and are being built in the current sprint:

| Feature | Effort | Status |
|---------|--------|--------|
| SPECTER2 embedding integration | 1-2 days | 🔨 Building |
| SciFact benchmark evaluation | 1 day | 🔨 Building |
| SciRIFF training data integration | 2-3 days | 🔨 Building |
| Epistemic Trigger Words validator | 1 week | 🔨 Building |
| Low Confidence Quarantine | 3-5 days | 🔨 Building |
| Epistemic Velocity tracking | 3-5 days | 🔨 Building |
| Confidence Decomposition Display | 3-5 days | 🔨 Building |
| SciBERT-NLI contradiction pre-filter | 1-2 days | 🔨 Building |
| Upgraded SFT training with expanded data | 2-3 days | 🔨 Building |

This document is the complement to FUTURE_IMPROVEMENTS.md and SYSTEM_INSPIRATIONS.md. Those documents describe WHAT to build. This document describes WHAT TO WAIT ON and WHY.

Updated: 2026-04-23