nkshirsa committed on
Commit 7bf9fe8 · verified · 1 Parent(s): 2d0df06

Integrate BDH into requirements — interpretable verifier, Hebbian memory, domain composability, sparsity monitoring

Files changed (1)
  1. REQUIREMENTS_FROM_SOURCES.md +508 -1
REQUIREMENTS_FROM_SOURCES.md CHANGED
@@ -1 +1,508 @@
1
- replace_with_file:/app/REQUIREMENTS_FROM_SOURCES.md
1
+ # PhD Research OS — Requirements Derived from Sources
2
+ ## Grounded Entirely in the Emergence Transformer Paper + BDH Paper + Awesome Open Source AI List
3
+
4
+ **Written so a high school student can understand every word.**
5
+
6
+ **Sources**:
7
+ 1. 📄 [Emergence Transformer: Dynamical Temporal Attention Matters](https://arxiv.org/abs/2604.19816) — A paper that redesigns the Transformer's attention mechanism so components interact with their own past states through time-varying queries, keys, and values. Shows that neighbor-attention promotes coherence while self-attention has an optimal sweet spot, and applies this to social opinion models and Hopfield networks for continual learning without forgetting.
8
+ 2. 📦 [Awesome Open Source AI](https://github.com/alvinreal/awesome-opensource-ai) — A curated list of 500+ production-proven open-source AI tools across 14 categories (April 2026).
9
+ 3. 🧠 [Baby Dragon Hatchling (BDH)](https://github.com/pathwaycom/bdh) — A biologically-inspired language model architecture from Pathway that replaces the standard Transformer with a scale-free network of locally-interacting neuron particles. Matches GPT-2 performance at the same parameter count (10M-1B) while providing inherent interpretability, monosemantic synapses, sparse positive activations, and continual learning via Hebbian plasticity. Paper: [arXiv:2509.26507](https://arxiv.org/abs/2509.26507).
10
+
11
+ ---
12
+
13
+ ## What These Sources Tell Us
14
+
15
+ ### From the Emergence Transformer Paper
16
+
17
+ The paper introduces **Dynamical Temporal Attention (DTA)** — a version of the Transformer where the query, key, and value matrices change over time. The key insights that apply to our Research OS:
18
+
19
+ 1. **Neighbor-DTA vs Self-DTA**: When components pay attention to their neighbors' history, coherence (agreement) always increases. When they pay attention to their OWN history, there's an optimal attention weight — too much self-attention actually hurts. This directly maps to our AI Council: council members should attend to EACH OTHER'S reasoning (neighbor-DTA), not just their own previous outputs (self-DTA).
20
+
21
+ 2. **Emergent Continual Learning**: The paper shows DTA applied to Hopfield neural networks achieves continual learning WITHOUT catastrophic forgetting. This is exactly what our Research OS needs — the model should learn from new papers without forgetting what it learned from old ones.
22
+
23
+ 3. **Social Coherence Modulation**: DTA can either enhance agreement or preserve plurality in social opinion models. For our system, this means the AI Council's dynamics are a design choice — we can tune them to either push toward consensus (for clear-cut cases) or deliberately preserve disagreement (for genuinely ambiguous cases).
24
+
25
+ 4. **Time-Varying Attention Kernels**: Standard Transformers have fixed attention patterns. DTA makes attention evolve over time. For our system, this means: as the model processes more of a paper, its attention to earlier sections should change. Reading the Discussion should update how the model interprets the Abstract.
26
+
27
+ ### From the Baby Dragon Hatchling (BDH) Paper
28
+
29
+ BDH is a completely different way of building a language model. Instead of the standard Transformer (which our Qwen model uses), BDH builds the model as a **network of neurons that talk to their neighbors** — like a brain, not like a calculator.
30
+
31
+ Here's why this matters for our Research OS:
32
+
33
+ 1. **Inherent Interpretability**: In a standard Transformer, we can't easily tell WHY the model made a decision. In BDH, individual synapses (connections between neurons) activate for specific concepts. The paper shows that when BDH processes text containing "Euro" or "France," specific, identifiable synapses light up. For our system, this means: if BDH processes a science paper, we could literally SEE which synapses activate for "LOD" vs "p-value" vs "hypothesis." We'd know exactly what the model is paying attention to.
34
+
35
+ 2. **Monosemantic Neurons**: In standard Transformers, each neuron fires for many unrelated concepts (polysemantic — one neuron means many things). BDH's neurons are monosemantic — each one means ONE thing. This is like the difference between a filing cabinet where each drawer has one label (monosemantic) and one where every drawer contains a random mix of documents (polysemantic). For scientific claim extraction, monosemantic representations mean: one neuron for "statistical significance," another for "qualifier words," another for "causal language." Debugging becomes possible.
36
+
37
+ 3. **Sparse, Positive Activations**: BDH's activation vectors are positive (no negative numbers) and sparse (~5% active at any time). This is like a brain where only a few neurons fire at once. For our system, sparsity means: we can quickly check which neurons are active and verify they make sense for the input. If the "statistical significance" neuron fires but the text has no statistics, something is wrong.
38
+
39
+ 4. **Model Composability**: BDH models can be concatenated — two separately trained models can be stitched together into a larger model. The paper demonstrates this for translation tasks. For our system, this means: train one model on biosensors, another on materials science, concatenate them into a model that knows both domains. No catastrophic forgetting because the models are physically separate.
40
+
41
+ 5. **Hebbian Working Memory**: BDH's in-context learning works through synaptic plasticity — connections strengthen when neurons fire together ("neurons that fire together wire together"). This is real-time learning during inference, not just retrieval. For our system, this means: as BDH reads through a paper, it builds paper-specific working memory in its synapses. By the time it reaches the Discussion, its synapses already encode the key findings from the Results section.
42
+
43
+ 6. **97.4% on Sudoku Extreme**: BDH achieves 97.4% accuracy on extreme Sudoku puzzles WITHOUT chain-of-thought reasoning, where leading LLMs (including O3-mini, DeepSeek R1) score ~0%. This demonstrates that BDH can perform complex constraint satisfaction internally. Scientific claim extraction IS a constraint satisfaction problem — the model must satisfy constraints like "claims from the Abstract must be tagged Interpretation" and "null results must be flagged" simultaneously.
44
+
45
+ ### From the Awesome Open Source AI List
46
+
47
+ The list catalogs production-ready tools that exist today. The layer-by-layer requirements below map specific tools to each part of our system, organized by what they replace or enable.
48
+
49
+ ---
50
+
51
+ ## Requirements by System Layer
52
+
53
+ ### Layer 0: PDF Parsing — Replace Basic Scrapers with ML Parsers
54
+
55
+ **Current state**: PyMuPDF/pdfplumber (basic text extraction)
56
+
57
+ **Required tools from the awesome list**:
58
+
59
+ | Tool | What It Does | Why We Need It |
60
+ |------|-------------|----------------|
61
+ | **[Marker](https://github.com/datalab-to/marker)** | Fast, accurate PDF-to-markdown with table extraction and equation handling | Replaces pdfplumber — preserves document structure, tables, equations |
62
+ | **[MinerU](https://github.com/opendatalab/MinerU)** | High-accuracy PDF parsing with VLM+OCR dual engine | Handles scanned papers and complex layouts that Marker misses |
63
+ | **[Docling](https://github.com/docling-project/docling)** | Document processing toolkit for GenAI workflows | Backup parser for non-standard formats (Word, PPT, Excel supplements) |
64
+ | **[Unstructured](https://github.com/Unstructured-IO/unstructured)** | Best-in-class document preprocessing | Universal fallback for any document type |
65
+ | **[MarkItDown](https://github.com/microsoft/markitdown)** | Microsoft's file-to-Markdown converter | Handles supplementary files (Excel data, PowerPoint presentations) |
66
+ | **[OmniParse](https://github.com/adithya-s-k/omniparse)** | Parses documents, tables, images, videos, audio, web pages | Multi-modal supplement handling (video supplements, audio recordings) |
67
+
68
+ **Requirement P-REQ-1**: Integrate Marker as the primary parser. Fall back to MinerU for scanned/OCR documents. Use Docling/MarkItDown for non-PDF supplements.
69
+
70
+ **Requirement P-REQ-2**: Use Chonkie ([chonkie-inc/chonkie](https://github.com/chonkie-inc/chonkie)) for intelligent document chunking. It supports semantic, token, and recursive chunking strategies — replacing the current simple section-merge chunking in parser.py.
71
+
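+ A minimal sketch of P-REQ-2, assuming Chonkie's `TokenChunker` interface (class name and parameters should be checked against the current Chonkie docs):
+
+ ```python
+ # Sketch: chunk Marker's markdown output into model-sized pieces with Chonkie.
+ from chonkie import TokenChunker
+
+ chunker = TokenChunker(chunk_size=512, chunk_overlap=64)
+
+ def chunk_markdown(markdown_text: str):
+     """Return (text, token_count) pairs for downstream batching."""
+     return [(c.text, c.token_count) for c in chunker.chunk(markdown_text)]
+ ```
+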
72
+ ---
73
+
74
+ ### Layer 1: Entity Resolution — Add Embedding-Based Matching
75
+
76
+ **Current state**: No embedding model, no entity normalization
77
+
78
+ **Required tools from the awesome list**:
79
+
80
+ | Tool | What It Does | Why We Need It |
81
+ |------|-------------|----------------|
82
+ | **[BGE / FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)** | Best-in-class text embeddings | Convert claims to vectors for semantic matching instead of word overlap |
83
+ | **[FastEmbed](https://github.com/qdrant/fastembed)** | Lightweight embedding with ONNX Runtime, no GPU needed | Local-first embedding that runs on CPU for privacy |
84
+ | **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search as a SQLite extension | Adds vector similarity to our existing SQLite database — zero new infrastructure |
85
+ | **[MTEB](https://github.com/embeddings-benchmark/mteb)** | Embedding benchmark | Choose the best embedding model for scientific text by testing on MTEB |
86
+
87
+ **Requirement M-REQ-1**: Replace Jaccard word overlap in `canonicalizer.py` with embedding-based cosine similarity using FastEmbed + sqlite-vec. This keeps the system local-first (SQLite + CPU embeddings) while enabling semantic deduplication.
88
+
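+ A minimal sketch of M-REQ-1, assuming the published FastEmbed and sqlite-vec Python APIs (the table name and `k` are illustrative choices):
+
+ ```python
+ import sqlite3
+ import sqlite_vec
+ from fastembed import TextEmbedding
+
+ model = TextEmbedding("BAAI/bge-small-en-v1.5")  # 384-dim, runs on CPU
+
+ db = sqlite3.connect("claims.db")
+ db.enable_load_extension(True)
+ sqlite_vec.load(db)
+ db.enable_load_extension(False)
+ db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS claim_vecs USING vec0(embedding float[384])")
+
+ def nearest_claims(claim_text: str, k: int = 5):
+     """Return (rowid, distance) of the k most semantically similar stored claims."""
+     vec = next(model.embed([claim_text]))
+     q = sqlite_vec.serialize_float32(vec.tolist())
+     return db.execute(
+         "SELECT rowid, distance FROM claim_vecs WHERE embedding MATCH ? ORDER BY distance LIMIT ?",
+         (q, k),
+     ).fetchall()
+ ```
+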
89
+ **Requirement M-REQ-2**: Use MTEB to benchmark which embedding model performs best on scientific claim similarity before committing to one.
90
+
91
+ ---
92
+
93
+ ### Layer 2: Extraction — Better Models and Constrained Output
94
+
95
+ **Current state**: Qwen2.5-3B with mock fallback, no output guarantees
96
+
97
+ **Required models from the awesome list**:
98
+
99
+ | Model | Why It's Better |
100
+ |-------|----------------|
101
+ | **[Qwen3.6-Plus](https://github.com/QwenLM/Qwen)** | April 2026 flagship, 1M context window, competitive with Claude 4.5 Opus |
102
+ | **[Kimi K2.5](https://github.com/MoonshotAI/Kimi-K2.5)** | 256K context, strong reasoning, native tool-use for agentic workflows |
103
+ | **[Phi-4](https://github.com/microsoft/PhiCookBook)** | Small but highly capable for reasoning and edge/on-device inference |
104
+ | **[OLMo 2](https://github.com/allenai/OLMo)** | Fully open-source (data + code + logs) — by scientists, for scientists |
105
+ | **[GLM-5](https://github.com/zai-org/GLM-5)** | Strong coding, reasoning, and agentic-task performance |
106
+
107
+ **Requirement B-REQ-1**: Upgrade the primary brain to Qwen3.6-Plus (or its quantized variant) for maximum reasoning quality. Use Phi-4 as the local/edge fallback for 16GB VRAM deployment.
108
+
109
+ **Requirement B-REQ-2**: Use the **Instructor** library ([jxnl/instructor](https://github.com/jxnl/instructor)) for structured output extraction with Pydantic validation. This replaces the need for the Guidance library — Instructor handles validation, retries, and error handling for extracting claims as structured JSON from any LLM.
110
+
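+ A minimal sketch of B-REQ-2 against a local OpenAI-compatible endpoint (the model tag and the claim schema fields are placeholders for our actual schema):
+
+ ```python
+ from typing import Literal
+
+ import instructor
+ from openai import OpenAI
+ from pydantic import BaseModel, Field
+
+ class Claim(BaseModel):
+     text: str
+     tag: Literal["Fact", "Interpretation", "Hypothesis"]
+     confidence: float = Field(ge=0.0, le=1.0)
+     source_quote: str  # must be a verbatim span from the input
+
+ client = instructor.from_openai(
+     OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
+ )
+
+ def extract_claims(section_text: str) -> list[Claim]:
+     return client.chat.completions.create(
+         model="qwen2.5:3b",  # placeholder local model tag
+         response_model=list[Claim],
+         max_retries=2,  # Instructor re-prompts when validation fails
+         messages=[{"role": "user", "content": f"Extract claims:\n{section_text}"}],
+     )
+ ```
+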
111
+ **From the Emergence Transformer paper — Requirement B-REQ-3**: Implement **Dynamical Temporal Attention** in the council architecture:
112
+
113
+ The paper shows that **neighbor-DTA consistently promotes coherence** while **self-DTA has an optimal weight**. Applied to the AI Council:
114
+
115
+ - **Neighbor-DTA for council members**: Each council member (Extractor, Critic, Chairman) should attend to OTHER members' reasoning history, not just their own. This promotes convergence on genuinely shared insights.
116
+ - **Self-DTA with tunable weight (α)**: Each member also attends to their OWN past outputs, but with a tunable weight. The paper proves there's an optimal α — too much self-attention causes overconfidence. Too little means no memory.
117
+ - **Practical implementation**: Store each council member's outputs across multiple papers. When processing paper N, the Extractor can attend to its OWN extractions from papers 1…N-1 (self-DTA) and to the Critic's feedback from papers 1…N-1 (neighbor-DTA). The attention weights are tunable per task.
118
+
119
+ This turns the council from a stateless sequential pipeline into a **stateful attention-based ensemble** where members learn from each other's history.
120
+
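+ As a toy illustration of this mixing (not the paper's exact equations — α and β here just play the roles described above):
+
+ ```python
+ import numpy as np
+
+ def dta_context(self_hist, neighbor_hist, alpha=0.4, beta=0.2):
+     """Toy DTA-style blend: alpha weights self-history against neighbor-history;
+     beta sets exponential memory decay over past papers."""
+     def decayed_mean(history):
+         if not history:
+             return 0.0
+         T = len(history)
+         w = np.exp(-beta * np.arange(T - 1, -1, -1))  # older entries decay more
+         return (w[:, None] * np.stack(history)).sum(axis=0) / w.sum()
+     return alpha * decayed_mean(self_hist) + (1 - alpha) * decayed_mean(neighbor_hist)
+ ```
+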
121
+ **From the BDH paper — Requirement B-REQ-4**: Evaluate **BDH as the verification model** architecture:
122
+
123
+ BDH's unique properties make it ideal for the VERIFICATION step (not the primary extraction, but the double-check):
124
+
125
+ - **Why BDH for verification, not extraction**: BDH currently matches GPT-2 scale (10M-1B parameters). That's too small for primary claim extraction from complex papers. But verification is a simpler task — "Does this claim actually appear in the source text? Yes or no." — and BDH's constraint satisfaction ability (97.4% Sudoku) maps directly to this.
126
+ - **Interpretable verification**: When BDH says "yes, this claim is supported," we can inspect WHICH synapses activated and see if they correspond to the relevant text spans. If the "statistical evidence" synapse activates but the source text has no statistics, we know the verification is wrong. No other architecture offers this level of transparency.
127
+ - **Practical implementation**: Train a small BDH model (100M-500M parameters) as the "verifier head." For each claim the primary Qwen model extracts, the BDH verifier re-reads the source span and answers: "Is the claim faithfully grounded in this text?" Because BDH's activations are sparse and monosemantic, we can audit every verification decision.
128
+
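+ The BDH repository does not, to our knowledge, ship a verification API, so the interface below is a hypothetical sketch of what B-REQ-4 asks for — `bdh_model.run`, its output fields, and `Verdict` are all invented names:
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class Verdict:
+     grounded: bool              # does the claim appear in the span?
+     active_synapses: list[str]  # labels of the synapses that fired
+     sparsity: float             # fraction of neurons active for this input
+
+ def verify_claim(bdh_model, claim: str, source_span: str) -> Verdict:
+     """Hypothetical wrapper: ask a small BDH verifier whether `claim` is
+     faithfully grounded in `source_span`, returning the synapse activations
+     so every decision can be audited."""
+     out = bdh_model.run(f"CLAIM: {claim}\nSOURCE: {source_span}\nGROUNDED?")
+     return Verdict(
+         grounded=(out.answer == "yes"),
+         active_synapses=out.synapse_labels,
+         sparsity=out.activation_sparsity,
+     )
+ ```
+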
129
+ **From the BDH paper — Requirement B-REQ-5**: Use **BDH's Hebbian working memory** for paper-level context:
130
+
131
+ Standard Transformers process each chunk of a paper independently — they don't remember what they read 3 pages ago unless it's explicitly in the context window. BDH's Hebbian synaptic plasticity builds working memory DURING inference:
132
+
133
+ - As BDH processes Section 1, its synapses encode the key terms and relationships
134
+ - When it reaches Section 3 (Results), those synapses still carry the priming from Section 1 (Introduction)
135
+ - The model doesn't need explicit "here's what we said earlier" prompting — it remembers through its synapse state
136
+
137
+ For our system, this means BDH can process an entire paper sequentially, building a cumulative understanding, rather than the current approach of processing isolated chunks.
138
+
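+ The textbook rule behind this is one line of math; a toy sketch (BDH's actual update is defined precisely in the paper — this just shows the mechanism):
+
+ ```python
+ import numpy as np
+
+ def hebbian_step(W, pre, post, eta=0.01, decay=0.001):
+     """Toy 'fire together, wire together': co-activation strengthens a synapse,
+     and mild decay lets stale working memory fade."""
+     W += eta * np.outer(post, pre)   # strengthen links between co-active neurons
+     W *= (1.0 - decay)               # slow forgetting
+     np.clip(W, 0.0, None, out=W)     # keep synapses non-negative, BDH-style
+     return W
+ ```
+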
139
+ **From the BDH paper — Requirement B-REQ-6**: Use **BDH's model composability** for domain specialization:
140
+
141
+ The paper proves that BDH models can be concatenated: a biosensors-BDH + a materials-science-BDH = a combined model that knows both domains, with no retraining. This is fundamentally different from LoRA adapters (which can interfere) or full retraining (which can forget).
142
+
143
+ Implementation path:
144
+ 1. Train BDH-biosensors (100M params) on biosensor papers
145
+ 2. Train BDH-materials (100M params) on materials science papers
146
+ 3. Concatenate → BDH-combined (200M params) knows both domains
147
+ 4. Each domain's knowledge lives in physically separate neurons — no interference
148
+
149
+ This directly addresses the continual learning requirement without any of the forgetting risks of standard approaches.
150
+
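+ Conceptually the concatenation is block-diagonal — each domain keeps its own neurons; a toy sketch (BDH's actual merge procedure is specified in its paper and repo):
+
+ ```python
+ import numpy as np
+
+ def concatenate_models(W_bio: np.ndarray, W_mat: np.ndarray) -> np.ndarray:
+     """Toy view of BDH concatenation: the combined weight matrix is
+     block-diagonal, so neither domain can overwrite the other."""
+     n, m = W_bio.shape[0], W_mat.shape[0]
+     W = np.zeros((n + m, n + m), dtype=W_bio.dtype)
+     W[:n, :n] = W_bio   # biosensor neurons
+     W[n:, n:] = W_mat   # materials-science neurons
+     return W
+ ```
+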
151
+ ---
152
+
153
+ ### Layer 3: Deduplication — Semantic Matching at Scale
154
+
155
+ **Current state**: Jaccard word overlap
156
+
157
+ **Required tools from the awesome list**:
158
+
159
+ | Tool | What It Does | Why We Need It |
160
+ |------|-------------|----------------|
161
+ | **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search in SQLite | Find duplicate claims by meaning, not word overlap, inside our existing DB |
162
+ | **[Chroma](https://github.com/chroma-core/chroma)** | Embedding database | If claim count exceeds SQLite performance, scale to dedicated vector DB |
163
+ | **[rerankers](https://github.com/AnswerDotAI/rerankers)** | Unified reranking API | After finding candidate duplicates by embedding, use cross-encoder reranking for precision |
164
+
165
+ **Requirement M-REQ-3**: Implement two-stage deduplication: (1) Fast approximate matching via sqlite-vec embeddings (recall-optimized), (2) Precise reranking via cross-encoder for candidate pairs (precision-optimized).
166
+
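+ A sketch of the two stages, reusing `nearest_claims` from the M-REQ-1 sketch; the `rerankers` call and result fields follow that package's interface as we understand it, and `claim_text_for` is a hypothetical lookup helper:
+
+ ```python
+ from rerankers import Reranker
+
+ ranker = Reranker("cross-encoder")  # loads a default cross-encoder model
+
+ def find_duplicates(new_claim: str, threshold: float = 0.9):
+     candidates = nearest_claims(new_claim, k=20)                 # stage 1: recall
+     texts = [claim_text_for(rowid) for rowid, _ in candidates]   # hypothetical helper
+     ranked = ranker.rank(query=new_claim, docs=texts)            # stage 2: precision
+     # The threshold depends on the chosen cross-encoder's score scale.
+     return [r for r in ranked.results if r.score >= threshold]
+ ```
+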
167
+ ---
168
+
169
+ ### Layer 4: Knowledge Graph — Add Temporal Reasoning and Graph RAG
170
+
171
+ **Current state**: SQLite adjacency list, word-overlap conflict detection
172
+
173
+ **Required tools from the awesome list**:
174
+
175
+ | Tool | What It Does | Why We Need It |
176
+ |------|-------------|----------------|
177
+ | **[Graphiti](https://github.com/getzep/graphiti)** | Real-time temporal knowledge graphs with provenance tracking | Tracks how facts change over time — exactly what we need for claim versioning |
178
+ | **[GraphRAG](https://github.com/microsoft/graphrag)** | Knowledge-graph-based retrieval | Enables multi-hop reasoning over the claim graph |
179
+ | **[LightRAG](https://github.com/HKUDS/LightRAG)** | Graph-based RAG with dual-level retrieval | Simpler alternative to GraphRAG for our scale |
180
+ | **[KAG (OpenSPG)](https://github.com/OpenSPG/KAG)** | Knowledge Augmented Generation for logical reasoning | Schema-constrained knowledge construction for professional domains |
181
+
182
+ **From the Emergence Transformer paper — Requirement G-REQ-1**: Apply DTA to the knowledge graph for **emergent conflict detection**:
183
+
184
+ The paper models N coupled oscillators where coherence emerges from attention-mediated interactions. Claims in the knowledge graph are analogous to oscillators:
185
+
186
+ - Each claim has a "phase" (its epistemic state: Fact/Interpretation/Hypothesis and confidence)
187
+ - Claims interact through graph edges (supports/refutes/extends)
188
+ - **Neighbor-DTA** on the graph: When scoring a claim, attend to the HISTORY of its graph neighbors. A claim that was "Interpretation" but whose supporting claims have all been upgraded to "Fact" over time should be reconsidered.
189
+ - **Conflict detection as coherence breakdown**: The paper's order parameter (r_t) measures global coherence. In our graph, sudden drops in local coherence (a cluster of claims that were previously consistent suddenly becoming contradictory because of a new paper) are analogous to desynchronization events. These should trigger alerts.
190
+
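+ The order parameter has the standard Kuramoto form, so the alarm is cheap to compute once each claim's epistemic state is mapped to a phase (that mapping is our modeling choice, not the paper's):
+
+ ```python
+ import numpy as np
+
+ def coherence(phases: np.ndarray) -> float:
+     """Kuramoto-style order parameter r = |mean(e^{i*theta})|:
+     1.0 = full local agreement, near 0 = maximal disagreement."""
+     return float(np.abs(np.exp(1j * phases).mean()))
+
+ def conflict_alert(prev_phases, new_phases, drop=0.3) -> bool:
+     """Flag a claim cluster whose coherence suddenly falls after a new paper."""
+     return coherence(prev_phases) - coherence(new_phases) > drop
+ ```
+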
191
+ **Requirement G-REQ-2**: Use Graphiti for temporal provenance. Every claim stores when it was first extracted, when it was last confirmed by a new source, and when it was contradicted. The graph should answer queries like "What changed about LOD claims for GFET sensors between 2022 and 2025?"
192
+
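+ Whether backed by Graphiti or plain SQLite, the minimum temporal provenance this requires looks roughly like the sketch below (column names are illustrative):
+
+ ```python
+ import sqlite3
+
+ db = sqlite3.connect("claims.db")
+ db.executescript("""
+ CREATE TABLE IF NOT EXISTS claim_provenance (
+     claim_id          TEXT PRIMARY KEY,
+     first_extracted   TEXT NOT NULL,  -- ISO-8601 timestamps
+     last_confirmed    TEXT,           -- newest source that re-supports the claim
+     last_contradicted TEXT            -- newest source that refutes it
+ );
+ """)
+
+ # Example temporal query: claims contradicted between 2022 and 2025.
+ rows = db.execute("""
+     SELECT claim_id FROM claim_provenance
+     WHERE last_contradicted BETWEEN '2022-01-01' AND '2025-12-31'
+ """).fetchall()
+ ```
+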
193
+ ---
194
+
195
+ ### Layer 5: Scoring — Add Calibration Infrastructure
196
+
197
+ **Current state**: Fixed-point formula works, calibration only planned
198
+
199
+ **Required tools from the awesome list**:
200
+
201
+ | Tool | What It Does | Why We Need It |
202
+ |------|-------------|----------------|
203
+ | **[DeepEval](https://github.com/confident-ai/deepeval)** | LLM evaluation with hallucination detection, bias detection | Automated checking of extraction quality and confidence calibration |
204
+ | **[RAGAs](https://github.com/explodinggradients/ragas)** | RAG evaluation (faithfulness, relevance, context recall) | Evaluate whether extracted claims are faithful to source text |
205
+
206
+ **Requirement S-REQ-1**: Use DeepEval's faithfulness metric to automatically check: "Does this extracted claim actually appear in the source text?" This replaces manual gold-standard checking for high-volume papers.
207
+
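+ A minimal sketch of S-REQ-1 using DeepEval's documented faithfulness metric (the metric is judged by an LLM under the hood, so a judge model must be configured in DeepEval first):
+
+ ```python
+ from deepeval.metrics import FaithfulnessMetric
+ from deepeval.test_case import LLMTestCase
+
+ def faithfulness_score(claim_json: str, source_text: str) -> float:
+     """Score whether an extracted claim is supported by its source span."""
+     case = LLMTestCase(
+         input="Extract claims from the source text.",
+         actual_output=claim_json,
+         retrieval_context=[source_text],
+     )
+     metric = FaithfulnessMetric(threshold=0.7)
+     metric.measure(case)
+     return metric.score
+ ```
+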
208
+ **Requirement S-REQ-2**: Use RAGAs for end-to-end pipeline evaluation — measure whether the system retrieves the right evidence and generates faithful extractions.
209
+
210
+ ---
211
+
212
+ ### Layer 6: Evaluation — Build a Real Test Suite
213
+
214
+ **Current state**: Counts distributions, no ground-truth comparison
215
+
216
+ **Required tools from the awesome list**:
217
+
218
+ | Tool | What It Does | Why We Need It |
219
+ |------|-------------|----------------|
220
+ | **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** | De-facto standard for model evaluation | Standardized evaluation across model versions |
221
+ | **[DeepEval](https://github.com/confident-ai/deepeval)** | "Pytest for LLMs" | Unit-test each extraction task with pass/fail criteria |
222
+ | **[Promptfoo](https://github.com/promptfoo/promptfoo)** | LLM testing and red-teaming | Systematic prompt testing, side-by-side model comparison |
223
+ | **[Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai)** | UK AI Safety Institute's evaluation framework | Multi-turn dialog evaluation with tool use |
224
+ | **[Lighteval](https://github.com/huggingface/lighteval)** | Lightweight model evaluation | Quick evaluation during training |
225
+
226
+ **Requirement T-REQ-1**: Implement DeepEval-based unit tests for each extraction task. Each test has:
227
+ - Input: paper excerpt
228
+ - Expected output: correct claims with correct tags
229
+ - Pass criteria: extraction recall ≥ 70%, epistemic accuracy ≥ 60%, qualifier preservation ≥ 80%
230
+
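+ The pass criteria reduce to three ratios; a library-free sketch of the check (claim matching is simplified to exact-text matching here):
+
+ ```python
+ def passes(expected: list[dict], extracted: list[dict]) -> bool:
+     """Each claim dict: {'text': ..., 'tag': ..., 'qualifiers': [...]}."""
+     by_text = {c["text"]: c for c in extracted}
+     hits = [e for e in expected if e["text"] in by_text]
+     recall = len(hits) / len(expected)
+     tag_ok = sum(by_text[e["text"]]["tag"] == e["tag"] for e in hits)
+     qual_ok = sum(
+         set(e["qualifiers"]) <= set(by_text[e["text"]]["qualifiers"]) for e in hits
+     )
+     epistemic_acc = tag_ok / len(hits) if hits else 0.0
+     qualifier_pres = qual_ok / len(hits) if hits else 0.0
+     return recall >= 0.70 and epistemic_acc >= 0.60 and qualifier_pres >= 0.80
+ ```
+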
231
+ **Requirement T-REQ-2**: Use Promptfoo for prompt regression testing. Every time a system prompt changes, automatically compare outputs before and after.
232
+
233
+ ---
234
+
235
+ ### Training Pipeline — Use Production Frameworks
236
+
237
+ **Current state**: Custom train.py with TRL SFTTrainer, ZeroGPU micro-batching
238
+
239
+ **Required tools from the awesome list**:
240
+
241
+ | Tool | What It Does | Why We Need It |
242
+ |------|-------------|----------------|
243
+ | **[TRL](https://github.com/huggingface/trl)** | Official SFT, DPO, GRPO, PPO | Already used for SFT — extend to DPO and GRPO stages |
244
+ | **[Axolotl](https://github.com/axolotl-ai-cloud/axolotl)** | YAML-driven SFT, DPO, GRPO pipeline | Simpler configuration than custom scripts |
245
+ | **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** | One-stop SFT, DPO, ORPO with web UI | GUI for non-programmers to run training |
246
+ | **[Unsloth](https://github.com/unslothai/unsloth)** | 2× faster, 70% less memory fine-tuning | Makes training feasible on consumer GPUs |
247
+ | **[OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)** | Scalable RLHF with PPO, GRPO, REINFORCE++ | For the GRPO stage with custom epistemic reward functions |
248
+ | **[verl](https://github.com/volcengine/verl)** | ByteDance's RL for LLMs with PPO, GRPO | Alternative GRPO implementation |
249
+ | **[PEFT](https://github.com/huggingface/peft)** | Parameter-efficient fine-tuning (LoRA, etc.) | Already used — continue with LoRA |
250
+
251
+ **Requirement TR-REQ-1**: Replace ZeroGPU micro-batching with a single continuous training job using Unsloth (for 2× speedup on consumer GPU) or Axolotl (for YAML-driven pipeline).
252
+
253
+ **Requirement TR-REQ-2**: Implement the 4-stage pipeline:
254
+ 1. **SFT** via TRL/Unsloth (already partially built)
255
+ 2. **DPO** via TRL DPOTrainer on preference pairs
256
+ 3. **GRPO** via OpenRLHF or verl with the 3 custom reward functions (JSON validity, schema compliance, qualifier preservation)
257
+ 4. **ConfTuner** via custom training loop with tokenized Brier score loss
258
+
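+ The three custom reward functions are easy to sketch in isolation (the exact callback signature depends on whether OpenRLHF, verl, or TRL is used; these just map a completion string to a score):
+
+ ```python
+ import json
+
+ ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis"}
+ QUALIFIERS = ("may", "might", "suggests", "appears", "likely", "preliminary")
+
+ def reward_json_validity(completion: str) -> float:
+     try:
+         json.loads(completion)
+         return 1.0
+     except json.JSONDecodeError:
+         return 0.0
+
+ def reward_schema(completion: str) -> float:
+     try:
+         claims = json.loads(completion)
+     except json.JSONDecodeError:
+         return 0.0
+     if not isinstance(claims, list):
+         return 0.0
+     ok = all(
+         c.get("tag") in ALLOWED_TAGS and 0.0 <= c.get("confidence", -1.0) <= 1.0
+         for c in claims
+     )
+     return 1.0 if ok else 0.0
+
+ def reward_qualifiers(completion: str, source: str) -> float:
+     """Reward keeping the source's hedging words in the extracted claims."""
+     present = [q for q in QUALIFIERS if q in source.lower()]
+     if not present:
+         return 1.0
+     return sum(q in completion.lower() for q in present) / len(present)
+ ```
+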
259
+ **From the Emergence Transformer paper — Requirement TR-REQ-3**: Implement **DTA-inspired continual learning** for domain adaptation:
260
+
261
+ The paper demonstrates that DTA applied to Hopfield networks achieves continual learning without catastrophic forgetting. For our model:
262
+
263
+ - When fine-tuning on a new scientific domain (e.g., adding ecology to a biosensors-trained model), use the DTA principle: the model should attend to its OWN past activations (self-DTA) to remember old domains while learning new ones.
264
+ - Practically: this maps to **O-LoRA** (orthogonal LoRA) — training new LoRA adapters in orthogonal subspaces so they don't interfere with existing adapters. The DTA paper provides the theoretical foundation for WHY this works: self-attention on past states preserves memory while neighbor-attention on new data drives learning.
265
+
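+ The core of O-LoRA is one extra loss term; a torch sketch of the penalty (normalization details in the O-LoRA paper may differ):
+
+ ```python
+ import torch
+
+ def olora_penalty(A_new: torch.Tensor, frozen_As: list[torch.Tensor],
+                   lam: float = 0.5) -> torch.Tensor:
+     """Push the new adapter's subspace (rows of its LoRA A matrix) toward
+     orthogonality with every frozen adapter's subspace."""
+     penalty = torch.tensor(0.0, device=A_new.device)
+     for A_old in frozen_As:
+         penalty = penalty + (A_new @ A_old.T).pow(2).sum()
+     return lam * penalty
+
+ # total_loss = task_loss + olora_penalty(A_new, frozen_As)
+ ```
+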
266
+ ---
267
+
268
+ ### Synthetic Data Generation
269
+
270
+ **Required tools from the awesome list**:
271
+
272
+ | Tool | What It Does | Why We Need It |
273
+ |------|-------------|----------------|
274
+ | **[Distilabel](https://github.com/argilla-io/distilabel)** | Synthetic data generation and distillation | Generate 10K+ training examples using teacher models |
275
+ | **[Argilla](https://github.com/argilla-io/argilla)** | Data annotation and human-in-the-loop | Label real paper extractions with human experts |
276
+
277
+ **Requirement D-REQ-1**: Use Distilabel to generate the teacher ensemble outputs. Run 3-5 teacher models (Qwen3.6-Plus, Kimi K2.5, GLM-5) on 100 real papers and store ALL outputs with disagreement signals.
278
+
279
+ **Requirement D-REQ-2**: Use Argilla for human expert labeling of the gold standard test set (10 papers, every claim manually annotated).
280
+
281
+ ---
282
+
283
+ ### Data Quality and Labeling
284
+
285
+ **Required tools from the awesome list**:
286
+
287
+ | Tool | What It Does | Why We Need It |
288
+ |------|-------------|----------------|
289
+ | **[Cleanlab](https://github.com/cleanlab/cleanlab)** | Find and fix label errors in datasets | Detect mislabeled training examples automatically |
290
+ | **[Great Expectations](https://github.com/great-expectations/great_expectations)** | Data validation for pipelines | Validate that every training example has required fields, valid JSON, correct tag |
291
+ | **[Label Studio](https://github.com/HumanSignal/label-studio)** | Multi-type data labeling | Interface for human annotators to label paper excerpts |
292
+
293
+ **Requirement D-REQ-3**: Run Cleanlab on the existing 1,900 training examples to detect any mislabeled examples (wrong epistemic tags, missing qualifiers).
294
+
295
+ **Requirement D-REQ-4**: Use Great Expectations to validate every training example before it enters the training pipeline: valid JSON, tag in allowed set, confidence in [0,1], non-empty source quote.
296
+
297
+ ---
298
+
299
+ ### Inference Serving — Replace Mock with Real AI
300
+
301
+ **Current state**: No model serving, everything runs through optional API calls
302
+
303
+ **Required tools from the awesome list**:
304
+
305
+ | Tool | What It Does | Why We Need It |
306
+ |------|-------------|----------------|
307
+ | **[Ollama](https://github.com/ollama/ollama)** | Simplest local LLM serving | One-command model serving on consumer hardware |
308
+ | **[vLLM](https://github.com/vllm-project/vllm)** | High-throughput LLM serving | Fast batch processing of many paper sections |
309
+ | **[llama.cpp](https://github.com/ggml-org/llama.cpp)** | CPU/GPU inference for quantized models | Run on laptops without dedicated GPU |
310
+ | **[SGLang](https://github.com/sgl-project/sglang)** | Fast structured generation | Guaranteed valid JSON output via grammar-constrained decoding |
311
+
312
+ **Requirement I-REQ-1**: Integrate Ollama as the default local model server. One command to start: `ollama pull qwen3.6-plus:q4` → model available at `http://localhost:11434`.
313
+
314
+ **Requirement I-REQ-2**: Use SGLang for constrained decoding — it guarantees valid JSON output with valid enum values. This eliminates broken JSON, invalid tags, and mixed text/JSON output.
315
+
316
+ ---
317
+
318
+ ### AI Safety and Security
319
+
320
+ **Required tools from the awesome list**:
321
+
322
+ | Tool | What It Does | Why We Need It |
323
+ |------|-------------|----------------|
324
+ | **[Guardrails AI](https://github.com/guardrails-ai/guardrails)** | Input/output validation for LLMs | Validate extraction outputs match expected schema |
325
+ | **[LLM Guard](https://github.com/protectai/llm-guard)** | Security toolkit for LLM interactions | Detect prompt injection if system is exposed as API |
326
+ | **[Garak](https://github.com/NVIDIA/garak)** | LLM vulnerability scanner | Test model for hallucination patterns specific to scientific claims |
327
+ | **[DeepTeam](https://github.com/confident-ai/deepteam)** | Red teaming framework | Adversarial testing of extraction robustness |
328
+
329
+ **Requirement SEC-REQ-1**: Use Guardrails AI to validate every LLM output before it enters the database. Schema validation, tag validation, confidence range checking.
330
+
331
+ **Requirement SEC-REQ-2**: Use Garak to scan the fine-tuned model for scientific hallucination patterns. Test: does the model invent statistics? Does it fabricate citations? Does it claim certainty where the paper was uncertain?
332
+
333
+ ---
334
+
335
+ ### Interpretability
336
+
337
+ **Required tools from the awesome list**:
338
+
339
+ | Tool | What It Does | Why We Need It |
340
+ |------|-------------|----------------|
341
+ | **[TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)** | Mechanistic interpretability | Understand WHICH attention heads are responsible for qualifier detection vs claim extraction |
342
+ | **[Captum](https://github.com/pytorch/captum)** | PyTorch interpretability | Attribution analysis — which input tokens influenced each output |
343
+
344
+ **From the Emergence Transformer paper — Requirement INT-REQ-1**: Use the paper's DTA attention kernel analysis to interpret the fine-tuned model:
345
+
346
+ The paper derives explicit formulas for how attention weights evolve over time (Equations 9-10). After fine-tuning, we can:
347
+ - Visualize which tokens in the input (qualifier words like "may," "suggests") have the highest attention weight when the model outputs epistemic tags
348
+ - Track whether attention to the Abstract section decreases when the model has already processed the Results section (temporal attention shift)
349
+ - Identify attention heads that specialize in specific tasks (one head for qualifier detection, another for statistical parsing) — this validates whether the model has learned task-specific representations, answering the "specialist heads" question empirically
350
+
351
+ **From the BDH paper — Requirement INT-REQ-2**: Use **BDH's monosemantic synapses** for claim provenance auditing:
352
+
353
+ The BDH paper demonstrates that individual synapses activate consistently for specific concepts (e.g., currency names, country names) across multiple prompts. For the Research OS:
354
+
355
+ - After training BDH on scientific text, identify which synapses activate for epistemic keywords ("significant," "may," "hypothesis," "demonstrates")
356
+ - Build a **synapse audit dashboard**: for each extracted claim, show which synapses were active during extraction. If the "negation" synapse fires when the model tags something as a Fact, that's a red flag — the model may be ignoring a negation.
357
+ - This is inherent to BDH's architecture — no external interpretability tool needed. The sparse, positive activations ARE the explanation. Compare this to standard Transformers where TransformerLens requires significant post-hoc analysis to achieve similar (but less clean) results.
358
+
359
+ **From the BDH paper — Requirement INT-REQ-3**: Use **BDH's sparsity as a silent failure detector**:
360
+
361
+ BDH activations are ~5% sparse. The paper shows that sparsity levels reflect "the amount of activity being performed by BDH-GPU for a given token." For our system:
362
+
363
+ - If a claim extraction triggers unusually HIGH sparsity (many neurons active), the model is working hard — this is likely a complex, ambiguous case that should be flagged for human review
364
+ - If a claim extraction triggers unusually LOW sparsity (very few neurons active), the model is barely engaged — it may be producing a default/generic response (silent failure / mode collapse)
365
+ - Track sparsity distributions over time. If average sparsity drifts, the model's behavior is changing — early drift detection without needing a gold standard test set
366
+
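+ A sketch of the monitoring side; how BDH exposes activations is an open integration question, so `observe` just takes a sparsity number, and the z-score bands are our choice:
+
+ ```python
+ import numpy as np
+
+ class SparsityMonitor:
+     """Track per-extraction activation sparsity; flag outliers and drift."""
+
+     def __init__(self, baseline: float = 0.05, band: float = 3.0):
+         self.baseline, self.band, self.history = baseline, band, []
+
+     def observe(self, sparsity: float) -> str:
+         self.history.append(sparsity)
+         mu = float(np.mean(self.history))
+         sd = float(np.std(self.history)) or 1e-6
+         if (sparsity - mu) / sd > self.band:
+             return "review"          # unusually high activity: ambiguous case
+         if (mu - sparsity) / sd > self.band:
+             return "silent-failure"  # unusually low activity: possible default output
+         if abs(mu - self.baseline) > 0.02:
+             return "drift"           # running average wandered from ~5%
+         return "ok"
+ ```
+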
367
+ ---
368
+
369
+ ### MLOps and Monitoring
370
+
371
+ **Required tools from the awesome list**:
372
+
373
+ | Tool | What It Does | Why We Need It |
374
+ |------|-------------|----------------|
375
+ | **[MLflow](https://github.com/mlflow/mlflow)** | Experiment tracking, model registry | Track every training run, compare model versions |
376
+ | **[Weights & Biases (wandb)](https://github.com/wandb/wandb)** | Experiment tracking with visualization | Dashboard for training metrics across all 4 stages |
377
+ | **[DVC](https://github.com/iterative/dvc)** | Data and model versioning | Version the training dataset and gold standard |
378
+ | **[Evidently](https://github.com/evidentlyai/evidently)** | ML monitoring and observability | Detect model drift in production |
379
+ | **[Phoenix](https://github.com/Arize-ai/phoenix)** | AI observability | Monitor extraction quality in real-time |
380
+
381
+ **Requirement OPS-REQ-1**: Use MLflow or W&B to track all training experiments. Every training run logs: loss curves, evaluation metrics, model checkpoints, hyperparameters, dataset version.
382
+
383
+ **Requirement OPS-REQ-2**: Use Evidently for drift detection. Weekly check: run the model on the gold standard test set. If any metric drops >5%, alert.
384
+
385
+ ---
386
+
387
+ ### Agent Framework — Connect Real Brains to Agent Bodies
388
+
389
+ **Current state**: Full agent lifecycle works, but agents have no AI model connected
390
+
391
+ **Required tools from the awesome list**:
392
+
393
+ | Tool | What It Does | Why We Need It |
394
+ |------|-------------|----------------|
395
+ | **[smolagents](https://github.com/huggingface/smolagents)** | Lightweight agent framework from HuggingFace | Simpler than current custom AgentOS for basic tasks |
396
+ | **[LangGraph](https://github.com/langchain-ai/langgraph)** | Stateful, multi-actor agent orchestration | For the multi-agent council with memory |
397
+ | **[CrewAI](https://github.com/crewAIInc/crewAI)** | Multi-agent collaboration framework | Define roles (Extractor, Critic, Chairman) with collaboration protocols |
398
+ | **[Letta (MemGPT)](https://github.com/letta-ai/letta)** | Stateful agents with persistent memory | Agents that remember across sessions |
399
+ | **[Mem0](https://github.com/mem0ai/mem0)** | Universal memory layer for agents | Persistent memory for the MetaImprover and CitationChaser agents |
400
+
401
+ **From the Emergence Transformer paper — Requirement A-REQ-1**: Implement the AI Council as a **DTA-coupled multi-agent system**:
402
+
403
+ The paper's model has N oscillators coupled through an adjacency matrix A_ij. The AI Council has N=4 members coupled through information sharing. The DTA framework tells us:
404
+
405
+ - **Coupling topology matters**: The paper shows that different network structures (fully connected, small-world, scale-free) produce different coherence patterns. For 4 council members, fully connected (everyone sees everyone) promotes maximum consensus. Star topology (Chairman sees all, others only see Chairman) preserves more diversity.
406
+ - **α parameter tunes consensus vs diversity**: At α=0, no temporal attention → members are stateless → pure diversity. At α=1, full temporal attention → members converge → pure consensus. The paper proves there's an optimal α between 0 and 1 for maximum USEFUL coherence. Tune this per task: high α for clear-cut Fact/Interpretation decisions, low α for ambiguous Conflict_Hypothesis cases.
407
+ - **β parameter controls memory decay**: In the paper, β determines how fast old attention information decays. For the council, β controls how much members remember from previous papers. High β = short memory (each paper is fresh). Low β = long memory (patterns from 100 papers ago still influence decisions).
408
+
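+ The two topologies are just different adjacency matrices for N=4; a sketch (index 3 stands for the Chairman):
+
+ ```python
+ import numpy as np
+
+ N = 4  # council members; index 3 = Chairman
+
+ # Fully connected: every member attends to every other -> maximum consensus.
+ A_full = np.ones((N, N)) - np.eye(N)
+
+ # Star: members exchange only with the Chairman -> more preserved diversity.
+ A_star = np.zeros((N, N))
+ A_star[3, :] = 1.0
+ A_star[:, 3] = 1.0
+ A_star[3, 3] = 0.0
+ ```
+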
409
+ ---
410
+
411
+ ### UI and User Experience
412
+
413
+ **Required tools from the awesome list**:
414
+
415
+ | Tool | What It Does | Why We Need It |
416
+ |------|-------------|----------------|
417
+ | **[Gradio](https://github.com/gradio-app/gradio)** | Already used | Continue using for main UI |
418
+ | **[Kotaemon](https://github.com/Cinnamon/kotaemon)** | RAG-based document chat with Gradio UI | Reference implementation for document Q&A interface |
419
+ | **[Open Notebook](https://github.com/lfnovo/open-notebook)** | AI-powered notebook with multi-modal support | Model for how to build the Obsidian-like research interface |
420
+
421
+ **Requirement UI-REQ-1**: Study Kotaemon's architecture for the "chat with your papers" interface. It has hybrid RAG, re-ranking, and multi-modal support — exactly what the Research OS courtroom UI needs.
422
+
423
+ ---
424
+
425
+ ## Summary: The Complete Requirements Stack
426
+
427
+ | Layer | Current Tool | Required Tool(s) | Source |
428
+ |-------|-------------|------------------|--------|
429
+ | PDF Parsing | pdfplumber/PyMuPDF | **Marker** + MinerU + Docling | Awesome List §5 |
430
+ | Chunking | Custom section merger | **Chonkie** | Awesome List §5 |
431
+ | Embeddings | None | **FastEmbed** + sqlite-vec | Awesome List §5 |
432
+ | Deduplication | Jaccard overlap | FastEmbed + **rerankers** | Awesome List §5 |
433
+ | Base Model | Qwen2.5-3B | **Qwen3.6-Plus** / Phi-4 | Awesome List §2 |
434
+ | Structured Output | Hope-for-JSON | **Instructor** / SGLang | Awesome List §13 |
435
+ | Model Serving | None (mock) | **Ollama** / vLLM | Awesome List §3 |
436
+ | Council Architecture | Sequential pipeline | **DTA-coupled agents** (LangGraph) | Emergence Paper |
437
+ | Knowledge Graph | SQLite adjacency | SQLite + **Graphiti** temporal layer | Awesome List §5 |
438
+ | Graph Reasoning | Word-overlap conflicts | **LightRAG** / GraphRAG | Awesome List §5 |
439
+ | Continual Learning | Retrain from scratch | **O-LoRA** (DTA-inspired) | Emergence Paper |
440
+ | Training Framework | Custom train.py | **Unsloth** / Axolotl / TRL | Awesome List §7 |
441
+ | GRPO Training | Not built | **OpenRLHF** / verl | Awesome List §7 |
442
+ | Synthetic Data | Template generator | **Distilabel** | Awesome List §7 |
443
+ | Human Labeling | Not built | **Argilla** / Label Studio | Awesome List §7 |
444
+ | Data Validation | Not built | **Cleanlab** + Great Expectations | Awesome List §9 |
445
+ | LLM Evaluation | Count-based metrics | **DeepEval** + Promptfoo | Awesome List §9 |
446
+ | RAG Evaluation | Not built | **RAGAs** | Awesome List §9 |
447
+ | Safety Scanning | Not built | **Garak** + LLM Guard | Awesome List §10 |
448
+ | Output Validation | Not built | **Guardrails AI** | Awesome List §10 |
449
+ | Interpretability | Not built | **TransformerLens** + Captum | Awesome List §10 |
450
+ | Experiment Tracking | Tensorboard only | **MLflow** / W&B | Awesome List §8 |
451
+ | Drift Detection | Not built | **Evidently** | Awesome List §8 |
452
+ | Data Versioning | Not built | **DVC** | Awesome List §8 |
453
+ | Agent Memory | Custom memory_store | **Letta** / Mem0 | Awesome List §4 |
454
+ | Agent Orchestration | Custom AgentOS | Keep AgentOS + add **LangGraph** | Awesome List §4 |
455
+ | Document Chat UI | Gradio tabs | Study **Kotaemon** architecture | Awesome List §5 |
456
+ | Claim Verification | Not built | **BDH** (100-500M, interpretable verifier) | BDH Paper |
457
+ | Paper-Level Memory | Chunk-based (no memory) | **BDH Hebbian** working memory | BDH Paper |
458
+ | Domain Composability | Retrain or LoRA merge | **BDH concatenation** (no forgetting) | BDH Paper |
459
+ | Silent Failure Detection | Not built | **BDH sparsity** monitoring | BDH Paper |
460
+ | Claim Audit Trail | Source quote only | **BDH synapse** activation maps | BDH Paper |
461
+
462
+ ---
463
+
464
+ ## The Emergence Transformer's 3 Key Contributions to This System
465
+
466
+ ### 1. Council as Coupled Oscillators (Neighbor-DTA)
467
+ Instead of a sequential pipeline, council members interact through attention-mediated coupling. The Emergence Transformer proves that neighbor-attention promotes coherence — members converge on genuinely shared insights while preserving dissent where it matters.
468
+
469
+ ### 2. Continual Learning Without Forgetting (Self-DTA + Hopfield)
470
+ When adding new scientific domains, the model maintains its existing knowledge through self-attention on past states. The paper provides the theoretical proof that DTA-modified Hopfield networks achieve continual memory storage — directly applicable to our O-LoRA domain adaptation strategy.
471
+
472
+ ### 3. Tunable Consensus vs Diversity (α Parameter)
473
+ The system can be configured to either push toward agreement (for clear-cut cases) or deliberately preserve plurality (for genuinely ambiguous epistemic classifications). The paper proves that the optimal α depends on network structure — for our 4-member council, this is a tunable hyperparameter.
474
+
475
+ ---
476
+
477
+ ## BDH's 4 Key Contributions to This System
478
+
479
+ ### 1. Interpretable Verification (Monosemantic Synapses)
480
+ BDH's synapses activate for specific concepts — we can literally see what the model is "thinking about" when it verifies a claim. No other architecture provides this level of transparency. This makes BDH the ideal architecture for the verification layer where trust matters most.
481
+
482
+ ### 2. Paper-Level Working Memory (Hebbian Plasticity)
483
+ Standard Transformers forget what they read at the start of a paper by the time they reach the end (unless everything fits in the context window). BDH's synaptic plasticity builds cumulative understanding through the paper — the Results section is processed with the Introduction already "wired in."
484
+
485
+ ### 3. Domain Composability Without Forgetting (Model Concatenation)
486
+ Train separate BDH models for each scientific domain and physically concatenate them. Each domain's neurons are separate — adding materials science never touches the biosensor neurons. This is categorically different from LoRA merging (where adapters can interfere) or full retraining (where old knowledge can be overwritten).
487
+
488
+ ### 4. Built-In Silent Failure Detection (Sparsity Monitoring)
489
+ BDH's ~5% activation sparsity is a free diagnostic signal. High sparsity = complex input the model is working hard on (flag for human review). Low sparsity = the model is coasting on defaults (flag as potential mode collapse). No external monitoring tool needed — the architecture IS the monitor.
490
+
491
+ ### How BDH and the Emergence Transformer Complement Each Other
492
+
493
+ | Property | Emergence Transformer (DTA) | Baby Dragon Hatchling (BDH) |
494
+ |----------|---------------------------|----------------------------|
495
+ | **Core idea** | Time-varying attention between components | Scale-free network of locally-interacting neurons |
496
+ | **Best applied to** | Multi-agent council coupling | Single-model verification and interpretation |
497
+ | **Continual learning** | Via attention on past states (theoretical) | Via Hebbian plasticity (architectural, inherent) |
498
+ | **Interpretability** | Attention weight analysis (post-hoc) | Monosemantic synapses (built-in) |
499
+ | **Composability** | Tune α,β parameters for council dynamics | Physically concatenate domain models |
500
+ | **Our system uses it for** | How agents COMMUNICATE and REMEMBER | How the verifier WORKS and EXPLAINS itself |
501
+
502
+ The Emergence Transformer tells us HOW TO ORGANIZE the multi-agent system (coupling topology, memory decay, consensus tuning). BDH tells us WHAT TO BUILD the verification layer FROM (an architecture that is interpretable by design, not by analysis).
503
+
504
+ Together: the AI Council uses DTA-inspired coupling to debate and converge, then each claim is verified by a BDH model that shows its work through synapse activations, and domain expansion happens through BDH concatenation without forgetting.
505
+
506
+ ---
507
+
508
+ *Every requirement in this document traces to a specific tool in the Awesome Open Source AI list, a specific result in the Emergence Transformer paper, or a specific property of the BDH architecture. No requirements were invented outside these three sources.*