## Grounded Entirely in the Emergence Transformer Paper + Awesome Open Source AI List

**Written so a high school student can understand every word.**

**Sources**:

1. 📄 [Emergence Transformer: Dynamical Temporal Attention Matters](https://arxiv.org/abs/2604.19816) — A paper that redesigns the Transformer's attention mechanism so components interact with their own past states through time-varying queries, keys, and values. It shows that neighbor-attention promotes coherence while self-attention has an optimal sweet spot, and applies this to social opinion models and Hopfield networks for continual learning without forgetting.
2. 📦 [Awesome Open Source AI](https://github.com/alvinreal/awesome-opensource-ai) — A curated list of 500+ battle-tested, production-proven open-source AI tools across 14 categories (April 2026).

---
## What These Sources Tell Us

### From the Emergence Transformer Paper

The paper introduces **Dynamical Temporal Attention (DTA)** — a version of the Transformer where the query, key, and value matrices change over time. The key insights that apply to our Research OS:

1. **Neighbor-DTA vs Self-DTA**: When components pay attention to their neighbors' history, coherence (agreement) always increases. When they pay attention to their OWN history, there's an optimal attention weight — too much self-attention actually hurts. This directly maps to our AI Council: council members should attend to EACH OTHER'S reasoning (neighbor-DTA), not just their own previous outputs (self-DTA).

2. **Emergent Continual Learning**: The paper shows DTA applied to Hopfield neural networks achieves continual learning WITHOUT catastrophic forgetting. This is exactly what our Research OS needs — the model should learn from new papers without forgetting what it learned from old ones.

3. **Social Coherence Modulation**: DTA can either enhance agreement or preserve plurality in social opinion models. For our system, this means the AI Council should be designable — we can tune it to either push toward consensus (for clear-cut cases) or deliberately preserve disagreement (for genuinely ambiguous cases).

4. **Time-Varying Attention Kernels**: Standard Transformers have fixed attention patterns; DTA makes attention evolve over time. For our system, this means that as the model processes more of a paper, its attention to earlier sections should change: reading the Discussion should update how the model interprets the Abstract.

### From the Awesome Open Source AI List

The list catalogs the production-ready tools that exist today. Here are the specific tools relevant to each part of our system, organized by what they replace or enable:

---
## Requirements by System Layer

### Layer 0: PDF Parsing — Replace Basic Scrapers with ML Parsers

**Current state**: PyMuPDF/pdfplumber (basic text extraction)

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Marker](https://github.com/datalab-to/marker)** | Fast, accurate PDF-to-markdown with table extraction and equation handling | Replaces pdfplumber — preserves document structure, tables, equations |
| **[MinerU](https://github.com/opendatalab/MinerU)** | High-accuracy PDF parsing with a dual VLM+OCR engine | Handles scanned papers and complex layouts that Marker misses |
| **[Docling](https://github.com/docling-project/docling)** | Document processing toolkit for GenAI workflows | Backup parser for non-standard formats (Word, PPT, Excel supplements) |
| **[Unstructured](https://github.com/Unstructured-IO/unstructured)** | Best-in-class document preprocessing | Universal fallback for any document type |
| **[MarkItDown](https://github.com/microsoft/markitdown)** | Microsoft's file-to-Markdown converter | Handles supplementary files (Excel data, PowerPoint presentations) |
| **[OmniParse](https://github.com/adithya-s-k/omniparse)** | Parses documents, tables, images, videos, audio, and web pages | Multi-modal supplement handling (video supplements, audio recordings) |

**Requirement P-REQ-1**: Integrate Marker as the primary parser. Fall back to MinerU for scanned/OCR documents. Use Docling/MarkItDown for non-PDF supplements.

**Requirement P-REQ-2**: Use Chonkie ([chonkie-inc/chonkie](https://github.com/chonkie-inc/chonkie)) for intelligent document chunking. It supports semantic, token, and recursive chunking strategies — replacing the current simple section-merge chunking in `parser.py`.

---
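The P-REQ-1 fallback chain can be sketched as a small driver. The parser functions below are hypothetical stand-ins for real Marker/MinerU wrappers — only the chain logic is the point:

```python
def parse_with_fallback(parsers, path):
    """Try each (name, parse_fn) in order; return the first successful result."""
    errors = []
    for name, parse_fn in parsers:
        try:
            return name, parse_fn(path)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append((name, exc))
    raise RuntimeError(f"all parsers failed for {path}: {errors}")

# Hypothetical usage: Marker first, MinerU for scans that have no text layer.
def marker_parse(path):
    raise ValueError("scanned PDF, no text layer")  # simulate a Marker failure

def mineru_parse(path):
    return "# Parsed markdown via OCR"  # simulate a MinerU success

name, markdown = parse_with_fallback(
    [("marker", marker_parse), ("mineru", mineru_parse)], "paper.pdf"
)
```

In the real pipeline each `parse_fn` would wrap the corresponding library and the Docling/MarkItDown branches would key off the file extension.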
### Layer 1: Entity Resolution — Add Embedding-Based Matching

**Current state**: No embedding model, no entity normalization

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[BGE / FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)** | Best-in-class text embeddings | Convert claims to vectors for semantic matching instead of word overlap |
| **[FastEmbed](https://github.com/qdrant/fastembed)** | Lightweight embedding with ONNX Runtime, no GPU needed | Local-first embedding that runs on CPU for privacy |
| **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search as a SQLite extension | Adds vector similarity to our existing SQLite database — zero new infrastructure |
| **[MTEB](https://github.com/embeddings-benchmark/mteb)** | Embedding benchmark | Choose the best embedding model for scientific text by testing on MTEB |

**Requirement M-REQ-1**: Replace Jaccard word overlap in `canonicalizer.py` with embedding-based cosine similarity using FastEmbed + sqlite-vec. This keeps the system local-first (SQLite + CPU embeddings) while enabling semantic deduplication.

**Requirement M-REQ-2**: Use MTEB to benchmark which embedding model performs best on scientific claim similarity before committing to one.

---
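To see why M-REQ-1 matters, compare Jaccard overlap with cosine similarity on a paraphrased claim pair. The three-dimensional "embeddings" are made-up stand-ins for FastEmbed output:

```python
import math

def jaccard(a: str, b: str) -> float:
    """The word-overlap similarity that M-REQ-1 replaces."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

# Toy vectors: paraphrases point the same way even with zero shared words,
# which Jaccard misses entirely.
emb = {
    "LOD was 1 nM": [0.9, 0.1, 0.2],
    "detection limit reached one nanomolar": [0.85, 0.15, 0.25],
}
print(jaccard("LOD was 1 nM", "detection limit reached one nanomolar"))  # 0.0
print(round(cosine(*emb.values()), 3))  # 0.996
```

With sqlite-vec, the `cosine` call becomes a `vec0` virtual-table query, so candidate duplicates come back from SQLite itself.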
### Layer 2: Extraction — Better Models and Constrained Output

**Current state**: Qwen2.5-3B with mock fallback, no output guarantees

**Required models from the awesome list**:

| Model | Why It's Better |
|-------|----------------|
| **[Qwen3.6-Plus](https://github.com/QwenLM/Qwen)** | April 2026 flagship, 1M context window, competitive with Claude 4.5 Opus |
| **[Kimi K2.5](https://github.com/MoonshotAI/Kimi-K2.5)** | 256K context, strong reasoning, native tool use for agentic workflows |
| **[Phi-4](https://github.com/microsoft/PhiCookBook)** | Small but highly capable for reasoning and edge/on-device inference |
| **[OLMo 2](https://github.com/allenai/OLMo)** | Fully open source (data + code + logs) — by scientists, for scientists |
| **[GLM-5](https://github.com/zai-org/GLM-5)** | Strong coding, reasoning, and agentic-task performance |

**Requirement B-REQ-1**: Upgrade the primary brain to Qwen3.6-Plus (or its quantized variant) for maximum reasoning quality. Use Phi-4 as the local/edge fallback for 16GB-VRAM deployment.

**Requirement B-REQ-2**: Use the **Instructor** library ([jxnl/instructor](https://github.com/jxnl/instructor)) for structured output extraction with Pydantic validation. This replaces the need for the Guidance library — Instructor handles validation, retries, and error handling for extracting claims as structured JSON from any LLM.

**From the Emergence Transformer paper — Requirement B-REQ-3**: Implement **Dynamical Temporal Attention** in the council architecture.

The paper shows that **neighbor-DTA consistently promotes coherence** while **self-DTA has an optimal weight**. Applied to the AI Council:

- **Neighbor-DTA for council members**: Each council member (Extractor, Critic, Chairman) should attend to OTHER members' reasoning history, not just their own. This promotes convergence on genuinely shared insights.
- **Self-DTA with tunable weight (α)**: Each member also attends to their OWN past outputs, but with a tunable weight. The paper proves there's an optimal α — too much self-attention causes overconfidence; too little means no memory.
- **Practical implementation**: Store each council member's outputs across multiple papers. When processing paper N, the Extractor can attend to its OWN extractions from papers 1…N−1 (self-DTA) and to the Critic's feedback from papers 1…N−1 (neighbor-DTA). The attention weights are tunable per task.

This turns the council from a stateless sequential pipeline into a **stateful, attention-based ensemble** whose members learn from each other's history.

---
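A minimal sketch of the B-REQ-3 blending, assuming (our simplification, not the paper's exact kernel) that self- and neighbor-history are mixed linearly with weight α:

```python
def blend_context(self_history, neighbor_history, alpha):
    """Mix a member's own past outputs (self-DTA) with other members' histories
    (neighbor-DTA); alpha is the tunable self-attention weight from B-REQ-3."""
    def mean(vectors):
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]
    own, others = mean(self_history), mean(neighbor_history)
    return [alpha * s + (1 - alpha) * o for s, o in zip(own, others)]

# Extractor at paper N: its own extraction summaries from papers 1..N-1 (self)
# and the Critic's feedback from papers 1..N-1 (neighbor), as toy 2-d vectors.
ctx = blend_context(
    self_history=[[1.0, 0.0], [0.8, 0.2]],
    neighbor_history=[[0.0, 1.0]],
    alpha=0.3,  # the paper argues the optimum is interior, not 0 or 1
)
```

A real implementation would replace the plain means with softmax attention over the stored histories, with time-varying weights as in the paper.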
### Layer 3: Deduplication — Semantic Matching at Scale

**Current state**: Jaccard word overlap

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search in SQLite | Find duplicate claims by meaning, not word overlap, inside our existing DB |
| **[Chroma](https://github.com/chroma-core/chroma)** | Embedding database | If claim count exceeds SQLite performance, scale to a dedicated vector DB |
| **[rerankers](https://github.com/AnswerDotAI/rerankers)** | Unified reranking API | After finding candidate duplicates by embedding, use cross-encoder reranking for precision |

**Requirement M-REQ-3**: Implement two-stage deduplication: (1) fast approximate matching via sqlite-vec embeddings (recall-optimized), then (2) precise reranking of the candidate pairs via a cross-encoder (precision-optimized).

---
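M-REQ-3's two stages in miniature; `rerank` is a hypothetical stand-in for a cross-encoder from the `rerankers` library:

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

def two_stage_dedup(query_vec, candidates, rerank, recall_cut=0.7, precision_cut=0.9):
    """Stage 1 keeps anything plausibly duplicate (recall-optimized);
    stage 2 reranks the survivors with an expensive scorer (precision-optimized)."""
    survivors = [text for text, vec in candidates if cosine(query_vec, vec) >= recall_cut]
    return [text for text in survivors if rerank(text) >= precision_cut]

candidates = [
    ("LOD of 1 nM reported", [0.99, 0.10]),
    ("sensor fabricated via CVD", [0.05, 0.99]),
]
# The lambda fakes a cross-encoder score for the sketch.
dups = two_stage_dedup([1.0, 0.0], candidates, rerank=lambda t: 0.95 if "LOD" in t else 0.1)
```

The split matters because cross-encoders are accurate but too slow to run against every stored claim; stage 1 shrinks the candidate set first.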
### Layer 4: Knowledge Graph — Add Temporal Reasoning and Graph RAG

**Current state**: SQLite adjacency list, word-overlap conflict detection

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Graphiti](https://github.com/getzep/graphiti)** | Real-time temporal knowledge graphs with provenance tracking | Tracks how facts change over time — exactly what we need for claim versioning |
| **[GraphRAG](https://github.com/microsoft/graphrag)** | Knowledge-graph-based retrieval | Enables multi-hop reasoning over the claim graph |
| **[LightRAG](https://github.com/HKUDS/LightRAG)** | Graph-based RAG with dual-level retrieval | Simpler alternative to GraphRAG at our scale |
| **[KAG (OpenSPG)](https://github.com/OpenSPG/KAG)** | Knowledge Augmented Generation for logical reasoning | Schema-constrained knowledge construction for professional domains |

**From the Emergence Transformer paper — Requirement G-REQ-1**: Apply DTA to the knowledge graph for **emergent conflict detection**.

The paper models N coupled oscillators whose coherence emerges from attention-mediated interactions. Claims in the knowledge graph are analogous to oscillators:

- Each claim has a "phase" (its epistemic state: Fact/Interpretation/Hypothesis, plus confidence)
- Claims interact through graph edges (supports/refutes/extends)
- **Neighbor-DTA on the graph**: When scoring a claim, attend to the HISTORY of its graph neighbors. A claim that was tagged "Interpretation" but whose supporting claims have all been upgraded to "Fact" over time should be reconsidered.
- **Conflict detection as coherence breakdown**: The paper's order parameter (r_t) measures global coherence. In our graph, sudden drops in local coherence — a cluster of claims that were previously consistent suddenly becoming contradictory because of a new paper — are analogous to desynchronization events and should trigger alerts.

**Requirement G-REQ-2**: Use Graphiti for temporal provenance. Every claim stores when it was first extracted, when it was last confirmed by a new source, and when it was contradicted. The graph should answer queries like "What changed about LOD claims for GFET sensors between 2022 and 2025?"

---
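The order parameter r_t is the standard Kuramoto one, r = |(1/N) Σ_j e^(iθ_j)|; mapping claim states to phases is our illustrative assumption, not something the paper specifies for knowledge graphs:

```python
import cmath
import math

def order_parameter(phases):
    """Kuramoto order parameter: 1 = fully coherent cluster, near 0 = desynchronized."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)

# Illustrative encoding: claims in an agreeing cluster sit near the same angle;
# a new contradicting paper pushes some claims toward the opposite phase.
coherent = [0.10, 0.15, 0.05, 0.12]
conflicted = [0.10, math.pi, 0.20, 0.9 * math.pi]

assert order_parameter(coherent) > 0.95   # cluster is consistent
assert order_parameter(conflicted) < 0.5  # sudden drop → desynchronization alert
```

A nightly job could compute r over each claim cluster and alert whenever the value falls sharply between runs.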
### Layer 5: Scoring — Add Calibration Infrastructure

**Current state**: Fixed-point formula works; calibration only planned

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[DeepEval](https://github.com/confident-ai/deepeval)** | LLM evaluation with hallucination detection, bias detection | Automated checking of extraction quality and confidence calibration |
| **[RAGAs](https://github.com/explodinggradients/ragas)** | RAG evaluation (faithfulness, relevance, context recall) | Evaluate whether extracted claims are faithful to the source text |

**Requirement S-REQ-1**: Use DeepEval's faithfulness metric to automatically check: "Does this extracted claim actually appear in the source text?" This replaces manual gold-standard checking for high-volume papers.

**Requirement S-REQ-2**: Use RAGAs for end-to-end pipeline evaluation — measure whether the system retrieves the right evidence and generates faithful extractions.

---
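DeepEval's faithfulness metric is LLM-judged; a crude lexical stand-in (our heuristic, not DeepEval's method) still shows the shape of the S-REQ-1 check:

```python
def faithfulness(claim: str, source: str) -> float:
    """Fraction of claim words that appear in the source passage.
    A lexical proxy only — the real metric judges semantic support."""
    claim_words = claim.lower().split()
    source_words = set(source.lower().split())
    if not claim_words:
        return 0.0
    return sum(w in source_words for w in claim_words) / len(claim_words)

source = "the sensor achieved a detection limit of 1 nM in buffer"
assert faithfulness("detection limit of 1 nM", source) == 1.0
assert faithfulness("detection limit of 5 nM", source) < 1.0  # "5" is unsupported
```

A score well below 1.0 flags the claim for the slower LLM-judged check rather than rejecting it outright.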
### Layer 6: Evaluation — Build a Real Test Suite

**Current state**: Counts distributions, no ground-truth comparison

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** | De facto standard for model evaluation | Standardized evaluation across model versions |
| **[DeepEval](https://github.com/confident-ai/deepeval)** | "Pytest for LLMs" | Unit-test each extraction task with pass/fail criteria |
| **[Promptfoo](https://github.com/promptfoo/promptfoo)** | LLM testing and red-teaming | Systematic prompt testing, side-by-side model comparison |
| **[Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai)** | UK AI Safety Institute's evaluation framework | Multi-turn dialog evaluation with tool use |
| **[Lighteval](https://github.com/huggingface/lighteval)** | Lightweight model evaluation | Quick evaluation during training |

**Requirement T-REQ-1**: Implement DeepEval-based unit tests for each extraction task. Each test has:

- Input: a paper excerpt
- Expected output: the correct claims with the correct tags
- Pass criteria: extraction recall ≥ 70%, epistemic accuracy ≥ 60%, qualifier preservation ≥ 80%

**Requirement T-REQ-2**: Use Promptfoo for prompt regression testing. Every time a system prompt changes, automatically compare outputs before and after.

---
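A T-REQ-1 case written as a plain unit test with the recall threshold from the list above (DeepEval would wrap the same logic in its metric API; the claim strings are illustrative):

```python
def extraction_recall(expected: set, predicted: set) -> float:
    """Fraction of gold-standard claims the model recovered."""
    return len(expected & predicted) / len(expected) if expected else 1.0

def test_extraction_case():
    # A toy gold-standard case; real tests load annotated paper excerpts.
    expected = {"lod 1 nM", "linear range 1-100 nM", "selective vs dopamine", "stable 30 days"}
    predicted = {"lod 1 nM", "linear range 1-100 nM", "selective vs dopamine", "response 5 s"}
    recall = extraction_recall(expected, predicted)
    assert recall >= 0.70, f"recall {recall:.2f} below the T-REQ-1 threshold"

test_extraction_case()
```

Epistemic accuracy and qualifier preservation get the same treatment: one metric function, one threshold, one test per excerpt.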
### Training Pipeline — Use Production Frameworks

**Current state**: Custom `train.py` with TRL SFTTrainer, ZeroGPU micro-batching

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[TRL](https://github.com/huggingface/trl)** | Official SFT, DPO, GRPO, PPO | Already used for SFT — extend to the DPO and GRPO stages |
| **[Axolotl](https://github.com/axolotl-ai-cloud/axolotl)** | YAML-driven SFT, DPO, GRPO pipeline | Simpler configuration than custom scripts |
| **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** | One-stop SFT, DPO, ORPO with web UI | GUI for non-programmers to run training |
| **[Unsloth](https://github.com/unslothai/unsloth)** | 2× faster, 70% less memory fine-tuning | Makes training feasible on consumer GPUs |
| **[OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)** | Scalable RLHF with PPO, GRPO, REINFORCE++ | For the GRPO stage with custom epistemic reward functions |
| **[verl](https://github.com/volcengine/verl)** | ByteDance's RL for LLMs with PPO, GRPO | Alternative GRPO implementation |
| **[PEFT](https://github.com/huggingface/peft)** | Parameter-efficient fine-tuning (LoRA, etc.) | Already used — continue with LoRA |

**Requirement TR-REQ-1**: Replace ZeroGPU micro-batching with a single continuous training job using Unsloth (for a 2× speedup on a consumer GPU) or Axolotl (for a YAML-driven pipeline).

**Requirement TR-REQ-2**: Implement the 4-stage pipeline:

1. **SFT** via TRL/Unsloth (already partially built)
2. **DPO** via TRL's DPOTrainer on preference pairs
3. **GRPO** via OpenRLHF or verl with the 3 custom reward functions (JSON validity, schema compliance, qualifier preservation)
4. **ConfTuner** via a custom training loop with tokenized Brier-score loss

**From the Emergence Transformer paper — Requirement TR-REQ-3**: Implement **DTA-inspired continual learning** for domain adaptation.

The paper demonstrates that DTA applied to Hopfield networks achieves continual learning without catastrophic forgetting. For our model:

- When fine-tuning on a new scientific domain (e.g., adding ecology to a biosensors-trained model), use the DTA principle: the model should attend to its OWN past activations (self-DTA) to remember old domains while learning new ones.
- Practically, this maps to **O-LoRA** (orthogonal LoRA) — training new LoRA adapters in orthogonal subspaces so they don't interfere with existing adapters. The DTA paper provides the theoretical foundation for WHY this works: self-attention on past states preserves memory while neighbor-attention on new data drives learning.

---
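The three GRPO reward functions from stage 3 can be sketched directly; the tag set and JSON field names are assumptions about the claim schema, not fixed by the sources:

```python
import json

ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis"}  # assumed tag set
QUALIFIERS = ("may", "might", "suggests", "appears", "likely")

def reward_json_validity(output: str) -> float:
    """Reward 1: the model emitted parseable JSON at all."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def reward_schema(output: str) -> float:
    """Reward 2: the claim matches the schema (tag, confidence range, quote)."""
    try:
        claim = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    conf = claim.get("confidence")
    ok = (claim.get("tag") in ALLOWED_TAGS
          and isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0
          and bool(claim.get("quote")))
    return 1.0 if ok else 0.0

def reward_qualifiers(output: str, source: str) -> float:
    """Reward 3: hedging words from the source survive into the extraction."""
    hedges = [q for q in QUALIFIERS if q in source.lower()]
    if not hedges:
        return 1.0
    return sum(q in output.lower() for q in hedges) / len(hedges)

good = '{"tag": "Hypothesis", "confidence": 0.6, "quote": "may improve LOD"}'
assert reward_json_validity(good) == 1.0
assert reward_schema(good) == 1.0
assert reward_qualifiers(good, "This may improve LOD.") == 1.0
```

OpenRLHF and verl both accept plain callables like these as reward functions for a GRPO run, combined as a weighted sum.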
### Synthetic Data Generation

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Distilabel](https://github.com/argilla-io/distilabel)** | Synthetic data generation and distillation | Generate 10K+ training examples using teacher models |
| **[Argilla](https://github.com/argilla-io/argilla)** | Data annotation and human-in-the-loop workflows | Label real paper extractions with human experts |

**Requirement D-REQ-1**: Use Distilabel to generate the teacher-ensemble outputs. Run 3–5 teacher models (Qwen3.6-Plus, Kimi K2.5, GLM-5) on 100 real papers and store ALL outputs with disagreement signals.

**Requirement D-REQ-2**: Use Argilla for human expert labeling of the gold-standard test set (10 papers, every claim manually annotated).

---
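One way to compute D-REQ-1's disagreement signal (our choice of metric; Distilabel orchestrates the generation but does not mandate a score): mean pairwise Jaccard distance between the teachers' claim sets:

```python
from itertools import combinations

def disagreement(claim_sets):
    """Mean pairwise Jaccard distance between teachers' claim sets:
    0 = teachers fully agree, 1 = no shared claims at all."""
    dists = []
    for a, b in combinations(claim_sets, 2):
        union = a | b
        dists.append(1 - len(a & b) / len(union) if union else 0.0)
    return sum(dists) / len(dists)

teachers = [
    {"lod 1 nM", "linear to 100 nM"},  # e.g. Qwen3.6-Plus output
    {"lod 1 nM", "linear to 100 nM"},  # e.g. Kimi K2.5 output
    {"lod 1 nM", "selective"},         # e.g. GLM-5 output
]
score = disagreement(teachers)  # a high score routes the paper to human review
```

Storing this score alongside every generated example lets the SFT stage upweight easy (low-disagreement) papers and send hard ones to Argilla.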
### Data Quality and Labeling

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Cleanlab](https://github.com/cleanlab/cleanlab)** | Find and fix label errors in datasets | Detect mislabeled training examples automatically |
| **[Great Expectations](https://github.com/great-expectations/great_expectations)** | Data validation for pipelines | Validate that every training example has required fields, valid JSON, and the correct tag |
| **[Label Studio](https://github.com/HumanSignal/label-studio)** | Multi-type data labeling | Interface for human annotators to label paper excerpts |

**Requirement D-REQ-3**: Run Cleanlab on the existing 1,900 training examples to detect mislabeled examples (wrong epistemic tags, missing qualifiers).

**Requirement D-REQ-4**: Use Great Expectations to validate every training example before it enters the training pipeline: valid JSON, tag in the allowed set, confidence in [0, 1], non-empty source quote.

---
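D-REQ-4's checks as a plain validator. Great Expectations would express the same rules as an expectation suite; the field names here are assumptions about the example schema:

```python
import json

ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis"}  # assumed tag set

def validate_example(raw: str) -> list[str]:
    """Return the list of violations; an empty list means the example may enter training."""
    try:
        ex = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    problems = []
    if ex.get("tag") not in ALLOWED_TAGS:
        problems.append(f"tag {ex.get('tag')!r} not in allowed set")
    conf = ex.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence outside [0, 1]")
    if not str(ex.get("quote", "")).strip():
        problems.append("empty source quote")
    return problems

ok = '{"tag": "Fact", "confidence": 0.9, "quote": "LOD of 1 nM was achieved"}'
bad = '{"tag": "Guess", "confidence": 1.7, "quote": ""}'
assert validate_example(ok) == []
assert len(validate_example(bad)) == 3
```

Running this gate before every training job means a single malformed example fails loudly instead of silently degrading the fine-tune.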
### Inference Serving — Replace Mock with Real AI

**Current state**: No model serving; everything runs through optional API calls

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Ollama](https://github.com/ollama/ollama)** | Simplest local LLM serving | One-command model serving on consumer hardware |
| **[vLLM](https://github.com/vllm-project/vllm)** | High-throughput LLM serving | Fast batch processing of many paper sections |
| **[llama.cpp](https://github.com/ggml-org/llama.cpp)** | CPU/GPU inference for quantized models | Run on laptops without a dedicated GPU |
| **[SGLang](https://github.com/sgl-project/sglang)** | Fast structured generation | Guaranteed valid JSON output via grammar-constrained decoding |

**Requirement I-REQ-1**: Integrate Ollama as the default local model server. One command to start: `ollama pull qwen3.6-plus:q4` → model available at `http://localhost:11434`.

**Requirement I-REQ-2**: Use SGLang for constrained decoding — it guarantees valid JSON output with valid enum values. This eliminates broken JSON, invalid tags, and mixed text/JSON output.

---
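Talking to the I-REQ-1 server goes through Ollama's HTTP API. This sketch only builds the request (the model tag is the one pulled above; sending it requires the server to be running):

```python
import json
import urllib.request

def build_ollama_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,   # one JSON response instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_ollama_request("qwen3.6-plus:q4", "Extract claims from: ...")
# urllib.request.urlopen(req) would return a JSON body with a "response"
# field once the server from I-REQ-1 is running.
```

Ollama's `"format": "json"` option covers only JSON well-formedness; enum-valid tags still need the SGLang grammar constraints from I-REQ-2 or Instructor-side validation.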
### AI Safety and Security

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Guardrails AI](https://github.com/guardrails-ai/guardrails)** | Input/output validation for LLMs | Validate that extraction outputs match the expected schema |
| **[LLM Guard](https://github.com/protectai/llm-guard)** | Security toolkit for LLM interactions | Detect prompt injection if the system is exposed as an API |
| **[Garak](https://github.com/NVIDIA/garak)** | LLM vulnerability scanner | Test the model for hallucination patterns specific to scientific claims |
| **[DeepTeam](https://github.com/confident-ai/deepteam)** | Red-teaming framework | Adversarial testing of extraction robustness |

**Requirement SEC-REQ-1**: Use Guardrails AI to validate every LLM output before it enters the database: schema validation, tag validation, confidence-range checking.

**Requirement SEC-REQ-2**: Use Garak to scan the fine-tuned model for scientific hallucination patterns. Test: does the model invent statistics? Does it fabricate citations? Does it claim certainty where the paper was uncertain?

---
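One SEC-REQ-2 probe ("does the model invent statistics?") as a simple heuristic of our own, not Garak's actual probe suite: every number in a claim should also appear in the source:

```python
import re

def invented_numbers(claim: str, source: str) -> list[str]:
    """Numbers present in the claim but absent from the source text."""
    number = re.compile(r"\d+(?:\.\d+)?")
    source_numbers = set(number.findall(source))
    return [n for n in number.findall(claim) if n not in source_numbers]

source = "We observed a limit of detection of 1.2 nM (n = 30)."
assert invented_numbers("LOD of 1.2 nM across 30 trials", source) == []
assert invented_numbers("LOD of 0.5 nM with p < 0.05", source) == ["0.5", "0.05"]
```

Unit conversions (1.2 nM vs 1200 pM) would need normalization before comparison; this catches only the blunt case of a fabricated figure.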
### Interpretability

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)** | Mechanistic interpretability | Understand WHICH attention heads are responsible for qualifier detection vs. claim extraction |
| **[Captum](https://github.com/pytorch/captum)** | PyTorch interpretability | Attribution analysis — which input tokens influenced each output |

**From the Emergence Transformer paper — Requirement INT-REQ-1**: Use the paper's DTA attention-kernel analysis to interpret the fine-tuned model.

The paper derives explicit formulas for how attention weights evolve over time (Equations 9–10). After fine-tuning, we can:

- Visualize which input tokens (qualifier words like "may" and "suggests") carry the highest attention weight when the model outputs epistemic tags
- Track whether attention to the Abstract section decreases once the model has processed the Results section (temporal attention shift)
- Identify attention heads that specialize in specific tasks (one head for qualifier detection, another for statistical parsing) — this tests whether the model has learned task-specific representations, answering the "specialist heads" question empirically

---
### MLOps and Monitoring

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[MLflow](https://github.com/mlflow/mlflow)** | Experiment tracking, model registry | Track every training run, compare model versions |
| **[Weights & Biases (wandb)](https://github.com/wandb/wandb)** | Experiment tracking with visualization | Dashboard for training metrics across all 4 stages |
| **[DVC](https://github.com/iterative/dvc)** | Data and model versioning | Version the training dataset and the gold standard |
| **[Evidently](https://github.com/evidentlyai/evidently)** | ML monitoring and observability | Detect model drift in production |
| **[Phoenix](https://github.com/Arize-ai/phoenix)** | AI observability | Monitor extraction quality in real time |

**Requirement OPS-REQ-1**: Use MLflow or W&B to track all training experiments. Every training run logs loss curves, evaluation metrics, model checkpoints, hyperparameters, and the dataset version.

**Requirement OPS-REQ-2**: Use Evidently for drift detection. Weekly check: run the model on the gold-standard test set; if any metric drops by more than 5%, alert.

---
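OPS-REQ-2's weekly check as a stand-in for an Evidently report (the metric names are illustrative; the 5% threshold is the one stated above):

```python
def drift_alerts(baseline: dict, current: dict, tolerance: float = 0.05) -> list[str]:
    """Names of gold-standard metrics that dropped more than `tolerance`
    relative to their baseline values."""
    return [
        name
        for name, base in baseline.items()
        if (base - current.get(name, 0.0)) / base > tolerance
    ]

baseline = {"recall": 0.80, "epistemic_accuracy": 0.70, "qualifier_preservation": 0.90}
current = {"recall": 0.79, "epistemic_accuracy": 0.62, "qualifier_preservation": 0.91}
assert drift_alerts(baseline, current) == ["epistemic_accuracy"]
```

Using a relative drop (rather than absolute) keeps the alert meaningful for metrics that sit at very different baseline levels.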
### Agent Framework — Connect Real Brains to Agent Bodies

**Current state**: Full agent lifecycle works, but agents have no AI model connected

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[smolagents](https://github.com/huggingface/smolagents)** | Lightweight agent framework from Hugging Face | Simpler than the current custom AgentOS for basic tasks |
| **[LangGraph](https://github.com/langchain-ai/langgraph)** | Stateful, multi-actor agent orchestration | For the multi-agent council with memory |
| **[CrewAI](https://github.com/crewAIInc/crewAI)** | Multi-agent collaboration framework | Define roles (Extractor, Critic, Chairman) with collaboration protocols |
| **[Letta (MemGPT)](https://github.com/letta-ai/letta)** | Stateful agents with persistent memory | Agents that remember across sessions |
| **[Mem0](https://github.com/mem0ai/mem0)** | Universal memory layer for agents | Persistent memory for the MetaImprover and CitationChaser agents |

**From the Emergence Transformer paper — Requirement A-REQ-1**: Implement the AI Council as a **DTA-coupled multi-agent system**.

The paper's model has N oscillators coupled through an adjacency matrix A_ij; the AI Council has N=4 members coupled through information sharing. The DTA framework tells us:

- **Coupling topology matters**: The paper shows that different network structures (fully connected, small-world, scale-free) produce different coherence patterns. For 4 council members, a fully connected topology (everyone sees everyone) promotes maximum consensus, while a star topology (the Chairman sees all, the others see only the Chairman) preserves more diversity.
- **The α parameter tunes consensus vs diversity**: At α=0 there is no temporal attention — members are stateless, giving pure diversity. At α=1, full temporal attention makes members converge, giving pure consensus. The paper proves there is an optimal α between 0 and 1 for maximum USEFUL coherence. Tune this per task: high α for clear-cut Fact/Interpretation decisions, low α for ambiguous Conflict_Hypothesis cases.
- **The β parameter controls memory decay**: In the paper, β determines how fast old attention information decays. For the council, β controls how much members remember from previous papers: high β means short memory (each paper is fresh); low β means long memory (patterns from 100 papers ago still influence decisions).

---
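One plausible reading of β is exponential decay over paper age (the functional form e^(−β·age) is our assumption, not the paper's exact kernel):

```python
import math

def memory_weights(num_papers: int, beta: float) -> list[float]:
    """Normalized attention over past papers, newest last. High beta ≈ short
    memory; low beta lets old papers keep influencing current decisions."""
    raw = [math.exp(-beta * age) for age in reversed(range(num_papers))]
    total = sum(raw)
    return [w / total for w in raw]

short = memory_weights(5, beta=2.0)  # mass concentrates on the latest paper
long_ = memory_weights(5, beta=0.1)  # spread across all five papers
assert short[-1] > 0.8 > long_[-1]
```

Stored per council member, these weights decide how strongly each remembered output from papers 1…N−1 enters the member's context for paper N.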
### UI and User Experience

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Gradio](https://github.com/gradio-app/gradio)** | Already used | Continue using for the main UI |
| **[Kotaemon](https://github.com/Cinnamon/kotaemon)** | RAG-based document chat with a Gradio UI | Reference implementation for a document Q&A interface |
| **[Open Notebook](https://github.com/lfnovo/open-notebook)** | AI-powered notebook with multi-modal support | Model for how to build the Obsidian-like research interface |

**Requirement UI-REQ-1**: Study Kotaemon's architecture for the "chat with your papers" interface. It has hybrid RAG, reranking, and multi-modal support — exactly what the Research OS courtroom UI needs.

---
|
| 359 |
-
|
| 360 |
-
## Summary: The Complete Requirements Stack
|
| 361 |
-
|
| 362 |
-
| Layer | Current Tool | Required Tool(s) | Source |
|-------|-------------|------------------|--------|
| PDF Parsing | pdfplumber/PyMuPDF | **Marker** + MinerU + Docling | Awesome List §5 |
| Chunking | Custom section merger | **Chonkie** | Awesome List §5 |
| Embeddings | None | **FastEmbed** + sqlite-vec | Awesome List §5 |
| Deduplication | Jaccard overlap | FastEmbed + **rerankers** | Awesome List §5 |
| Base Model | Qwen2.5-3B | **Qwen3.6-Plus** / Phi-4 | Awesome List §2 |
| Structured Output | Hope-for-JSON | **Instructor** / SGLang | Awesome List §13 |
| Model Serving | None (mock) | **Ollama** / vLLM | Awesome List §3 |
| Council Architecture | Sequential pipeline | **DTA-coupled agents** (LangGraph) | Emergence Paper |
| Knowledge Graph | SQLite adjacency | SQLite + **Graphiti** temporal layer | Awesome List §5 |
| Graph Reasoning | Word-overlap conflicts | **LightRAG** / GraphRAG | Awesome List §5 |
| Continual Learning | Retrain from scratch | **O-LoRA** (DTA-inspired) | Emergence Paper |
| Training Framework | Custom train.py | **Unsloth** / Axolotl / TRL | Awesome List §7 |
| GRPO Training | Not built | **OpenRLHF** / verl | Awesome List §7 |
| Synthetic Data | Template generator | **Distilabel** | Awesome List §7 |
| Human Labeling | Not built | **Argilla** / Label Studio | Awesome List §7 |
| Data Validation | Not built | **Cleanlab** + Great Expectations | Awesome List §9 |
| LLM Evaluation | Count-based metrics | **DeepEval** + Promptfoo | Awesome List §9 |
| RAG Evaluation | Not built | **RAGAs** | Awesome List §9 |
| Safety Scanning | Not built | **Garak** + LLM Guard | Awesome List §10 |
| Output Validation | Not built | **Guardrails AI** | Awesome List §10 |
| Interpretability | Not built | **TransformerLens** + Captum | Awesome List §10 |
| Experiment Tracking | Tensorboard only | **MLflow** / W&B | Awesome List §8 |
| Drift Detection | Not built | **Evidently** | Awesome List §8 |
| Data Versioning | Not built | **DVC** | Awesome List §8 |
| Agent Memory | Custom memory_store | **Letta** / Mem0 | Awesome List §4 |
| Agent Orchestration | Custom AgentOS | Keep AgentOS + add **LangGraph** | Awesome List §4 |
| Document Chat UI | Gradio tabs | Study **Kotaemon** architecture | Awesome List §5 |
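To make one row of the stack concrete: moving from "Hope-for-JSON" to Instructor-style structured output means wrapping every model response in a validate-or-reject step. A dependency-free sketch of that idea follows (Instructor does this with Pydantic models and automatic retries; `Claim` and `parse_claim` are hypothetical names used only for illustration).

```python
import json
from dataclasses import dataclass

@dataclass
class Claim:
    statement: str
    confidence: float  # must lie in [0.0, 1.0]

def parse_claim(raw: str) -> Claim:
    """Validate-or-raise: malformed output fails loudly instead of slipping through."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    claim = Claim(statement=str(data["statement"]),
                  confidence=float(data["confidence"]))
    if not 0.0 <= claim.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {claim.confidence}")
    return claim

good = parse_claim('{"statement": "neighbor-DTA promotes coherence", "confidence": 0.9}')
print(good.statement)
```

A library like Instructor adds the retry loop (re-prompting the model with the validation error), which is the part that hand-rolled JSON hoping never gets right.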
---

## The Emergence Transformer's 3 Key Contributions to This System

### 1. Council as Coupled Oscillators (Neighbor-DTA)

Instead of a sequential pipeline, council members interact through attention-mediated coupling. The Emergence Transformer proves that neighbor-attention promotes coherence — members converge on genuinely shared insights while preserving dissent where it matters.
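A toy numerical sketch of that coupling, under the simplifying assumptions that each member's state is a single scalar "opinion" and attention scores are negative squared distances; this is an illustration of neighbor coupling, not the paper's exact update rule:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neighbor_dta_step(x, temp=1.0):
    """Each member moves toward an attention-weighted average of the OTHER
    members' states (neighbor-DTA), not just its own history."""
    new_x = []
    for i, xi in enumerate(x):
        others = [xj for j, xj in enumerate(x) if j != i]
        # Attention score: negative squared distance, so closer views weigh more.
        weights = softmax([-((xi - xj) ** 2) / temp for xj in others])
        new_x.append(xi + 0.5 * sum(w * (xj - xi) for w, xj in zip(weights, others)))
    return new_x

def spread(x):
    """Variance of the members' opinions; low spread means coherence."""
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / len(x)

start = [0.1, 0.4, 0.6, 0.9]  # four council members' initial opinions
x = list(start)
for _ in range(10):
    x = neighbor_dta_step(x)
# After neighbor coupling, the spread has shrunk: coherence increased.
```

Because every update is a convex combination of the members' states, opinions stay within the initial range while the spread contracts, which is the qualitative behavior the paper attributes to neighbor-attention.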

### 2. Continual Learning Without Forgetting (Self-DTA + Hopfield)

When adding new scientific domains, the model maintains its existing knowledge through self-attention on past states. The paper provides the theoretical proof that DTA-modified Hopfield networks achieve continual memory storage — directly applicable to our O-LoRA domain adaptation strategy.
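The substrate this result builds on is the classic Hopfield network, where Hebbian outer-product storage lets a new memory be written without revisiting the old ones. A minimal sketch of that substrate (the paper's DTA modification is more involved than this):

```python
def add_pattern(W, p):
    """Hebbian outer-product update: a new memory is written into W without
    touching (or even seeing) the patterns stored before it."""
    n = len(p)
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] += p[i] * p[j] / n

def recall(W, probe, steps=5):
    """Iterate the sign dynamics until the state settles on a stored attractor."""
    s = list(probe)
    for _ in range(steps):
        s = [1 if sum(W[i][j] * s[j] for j in range(len(s))) >= 0 else -1
             for i in range(len(s))]
    return s

n = 8
W = [[0.0] * n for _ in range(n)]
p1 = [1, 1, 1, 1, -1, -1, -1, -1]   # memory from the "old" domain
p2 = [1, -1, 1, -1, 1, -1, 1, -1]   # memory from a "new" domain, added later
add_pattern(W, p1)
add_pattern(W, p2)
noisy = list(p1)
noisy[0] = -1                        # corrupt one bit of the old memory
# recall(W, noisy) still recovers p1: the old memory survives the new one.
```

Note the continual-learning property lives in `add_pattern`: storage is a running sum, so adding `p2` never requires access to `p1`, unlike gradient retraining.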

### 3. Tunable Consensus vs Diversity (α Parameter)

The system can be configured to either push toward agreement (for clear-cut cases) or deliberately preserve plurality (for genuinely ambiguous epistemic classifications). The paper proves that the optimal α depends on network structure — for our 4-member council, this is a tunable hyperparameter.
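An illustrative linear sketch of the α trade-off, assuming scalar opinions and a plain mean over the other members; this shows the knob's effect, not the paper's update rule:

```python
def council_step(x, alpha):
    """Blend self-attention (weight alpha: stick with your own state) against
    neighbor coupling (weight 1 - alpha: move toward the others' mean)."""
    n = len(x)
    new_x = []
    for i, xi in enumerate(x):
        neighbor_mean = (sum(x) - xi) / (n - 1)
        new_x.append(alpha * xi + (1 - alpha) * neighbor_mean)
    return new_x

def spread(x):
    """Variance of the opinions: low means consensus, high means plurality."""
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / len(x)

start = [0.1, 0.4, 0.6, 0.9]            # four council members' initial opinions
consensus, plural = list(start), list(start)
for _ in range(10):
    consensus = council_step(consensus, alpha=0.2)  # low alpha: push to agreement
    plural = council_step(plural, alpha=0.9)        # high alpha: keep diversity longer
# Low alpha collapses the spread quickly; high alpha preserves dissent far longer.
```

For a clear-cut epistemic classification the council would run with low α; for genuinely ambiguous cases a high α keeps minority positions alive across more deliberation rounds.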

---

*Every requirement in this document traces to a specific tool in the Awesome Open Source AI list or a specific result in the Emergence Transformer paper. No requirements were invented outside these two sources.*
replace_with_file:/app/REQUIREMENTS_FROM_SOURCES.md