nkshirsa committed on
Commit 2d0df06 · verified · 1 Parent(s): 2e5325c

Integrate BDH (Baby Dragon Hatchling) into requirements — interpretable verifier, Hebbian memory, domain composability, sparsity monitoring

Files changed (1):
  1. REQUIREMENTS_FROM_SOURCES.md +1 -407
REQUIREMENTS_FROM_SOURCES.md CHANGED
@@ -1,407 +1 @@
# PhD Research OS — Requirements Derived from Sources

## Grounded Entirely in the Emergence Transformer Paper + Awesome Open Source AI List

**Written so a high school student can understand every word.**

**Sources**:
1. 📄 [Emergence Transformer: Dynamical Temporal Attention Matters](https://arxiv.org/abs/2604.19816) — A paper that redesigns the Transformer's attention mechanism so components interact with their own past states through time-varying queries, keys, and values. Shows that neighbor-attention promotes coherence while self-attention has an optimal sweet spot, and applies this to social opinion models and Hopfield networks for continual learning without forgetting.
2. 📦 [Awesome Open Source AI](https://github.com/alvinreal/awesome-opensource-ai) — A curated list of 500+ battle-tested, production-proven open-source AI tools across 14 categories (April 2026).

---

## What These Sources Tell Us

### From the Emergence Transformer Paper

The paper introduces **Dynamical Temporal Attention (DTA)** — a version of the Transformer where the query, key, and value matrices change over time. The key insights that apply to our Research OS:

1. **Neighbor-DTA vs Self-DTA**: When components pay attention to their neighbors' history, coherence (agreement) always increases. When they pay attention to their OWN history, there's an optimal attention weight — too much self-attention actually hurts. This directly maps to our AI Council: council members should attend to EACH OTHER'S reasoning (neighbor-DTA), not just their own previous outputs (self-DTA).

2. **Emergent Continual Learning**: The paper shows DTA applied to Hopfield neural networks achieves continual learning WITHOUT catastrophic forgetting. This is exactly what our Research OS needs — the model should learn from new papers without forgetting what it learned from old ones.

3. **Social Coherence Modulation**: DTA can either enhance agreement or preserve plurality in social opinion models. For our system, this means the AI Council should be designable — we can tune it to either push toward consensus (for clear-cut cases) or deliberately preserve disagreement (for genuinely ambiguous cases).

4. **Time-Varying Attention Kernels**: Standard Transformers have fixed attention patterns. DTA makes attention evolve over time. For our system, this means: as the model processes more of a paper, its attention to earlier sections should change. Reading the Discussion should update how the model interprets the Abstract.

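The qualitative claim in insight 1 can be illustrated with a toy simulation. This is NOT the paper's actual DTA equations — the update rule, the running-mean "attention," and the variance-based coherence measure are all our simplifications — but it shows the direction of the effect:

```python
import random

def simulate(alpha_self, alpha_neighbor, n=8, steps=50, seed=0):
    """Toy dynamics: each component blends its state with averages of its own
    history (self-DTA) and of its neighbors' histories (neighbor-DTA)."""
    rng = random.Random(seed)
    history = [[rng.uniform(-1, 1)] for _ in range(n)]  # one trace per component
    for _ in range(steps):
        new = []
        for i in range(n):
            own_mem = sum(history[i]) / len(history[i])          # self-DTA: own past
            neigh = [sum(history[j]) / len(history[j])           # neighbor-DTA: others' past
                     for j in range(n) if j != i]
            neigh_mem = sum(neigh) / len(neigh)
            x = history[i][-1]
            new.append(x + alpha_self * (own_mem - x) + alpha_neighbor * (neigh_mem - x))
        for i in range(n):
            history[i].append(new[i])
    states = [h[-1] for h in history]
    mean = sum(states) / n
    return sum((s - mean) ** 2 for s in states) / n  # low spread = high coherence

# Neighbor attention drives agreement; pure self-attention just anchors each
# component on its own history, preserving disagreement.
assert simulate(0.1, 0.5) < simulate(0.1, 0.0)
```

Running this with the neighbor weight turned off leaves the components scattered, which is the pipeline failure mode the AI Council design is meant to avoid.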
### From the Awesome Open Source AI List

The list catalogs the production-ready tools that exist today. Here are the specific tools relevant to each part of our system, organized by what they replace or enable:

---

## Requirements by System Layer

### Layer 0: PDF Parsing — Replace Basic Scrapers with ML Parsers

**Current state**: PyMuPDF/pdfplumber (basic text extraction)

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Marker](https://github.com/datalab-to/marker)** | Fast, accurate PDF-to-markdown with table extraction and equation handling | Replaces pdfplumber — preserves document structure, tables, equations |
| **[MinerU](https://github.com/opendatalab/MinerU)** | High-accuracy PDF parsing with VLM+OCR dual engine | Handles scanned papers and complex layouts that Marker misses |
| **[Docling](https://github.com/docling-project/docling)** | Document processing toolkit for GenAI workflows | Backup parser for non-standard formats (Word, PPT, Excel supplements) |
| **[Unstructured](https://github.com/Unstructured-IO/unstructured)** | Best-in-class document preprocessing | Universal fallback for any document type |
| **[MarkItDown](https://github.com/microsoft/markitdown)** | Microsoft's file-to-Markdown converter | Handles supplementary files (Excel data, PowerPoint presentations) |
| **[OmniParse](https://github.com/adithya-s-k/omniparse)** | Parses documents, tables, images, videos, audio, web pages | Multi-modal supplement handling (video supplements, audio recordings) |

**Requirement P-REQ-1**: Integrate Marker as the primary parser. Fall back to MinerU for scanned/OCR documents. Use Docling/MarkItDown for non-PDF supplements.

**Requirement P-REQ-2**: Use Chonkie ([chonkie-inc/chonkie](https://github.com/chonkie-inc/chonkie)) for intelligent document chunking. It supports semantic, token, and recursive chunking strategies — replacing the current simple section-merge chunking in parser.py.

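P-REQ-1's fallback order can be sketched as a simple priority chain. The `marker_parse`/`mineru_parse` callables below are hypothetical stand-ins for the real Marker and MinerU bindings — only the chaining logic is the point:

```python
from typing import Callable, Optional

def parse_with_fallbacks(path: str,
                         parsers: list[tuple[str, Callable[[str], Optional[str]]]]) -> tuple[str, str]:
    """Try each parser in priority order; return (parser_name, markdown)."""
    errors = []
    for name, parse in parsers:
        try:
            result = parse(path)
            if result:  # non-empty markdown counts as success
                return name, result
            errors.append(f"{name}: empty output")
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))

# Hypothetical stand-ins for the real Marker/MinerU calls.
def marker_parse(path): return None if path.endswith(".scan.pdf") else "# Parsed by Marker"
def mineru_parse(path): return "# Parsed by MinerU (OCR)"

chain = [("marker", marker_parse), ("mineru", mineru_parse)]
assert parse_with_fallbacks("paper.pdf", chain)[0] == "marker"
assert parse_with_fallbacks("paper.scan.pdf", chain)[0] == "mineru"
```

Docling/MarkItDown slot in as additional `(name, callable)` entries at the end of the chain for non-PDF supplements.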
---

### Layer 1: Entity Resolution — Add Embedding-Based Matching

**Current state**: No embedding model, no entity normalization

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[BGE / FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)** | Best-in-class text embeddings | Convert claims to vectors for semantic matching instead of word overlap |
| **[FastEmbed](https://github.com/qdrant/fastembed)** | Lightweight embedding with ONNX Runtime, no GPU needed | Local-first embedding that runs on CPU for privacy |
| **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search as a SQLite extension | Adds vector similarity to our existing SQLite database — zero new infrastructure |
| **[MTEB](https://github.com/embeddings-benchmark/mteb)** | Embedding benchmark | Choose the best embedding model for scientific text by testing on MTEB |

**Requirement M-REQ-1**: Replace Jaccard word overlap in `canonicalizer.py` with embedding-based cosine similarity using FastEmbed + sqlite-vec. This keeps the system local-first (SQLite + CPU embeddings) while enabling semantic deduplication.

**Requirement M-REQ-2**: Use MTEB to benchmark which embedding model performs best on scientific claim similarity before committing to one.

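M-REQ-1 swaps Jaccard word overlap for cosine similarity over embeddings. A dependency-free sketch of why that matters for paraphrased claims — in production the vectors would come from FastEmbed and live in a sqlite-vec virtual table; the three-dimensional "embeddings" here are made-up illustrations:

```python
import math

def jaccard(a: str, b: str) -> float:
    """Current canonicalizer metric: shared words / all words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Paraphrased claims share few surface words, so Jaccard scores them low...
c1 = "The sensor's limit of detection was 10 fM"
c2 = "An LOD of 10 femtomolar was achieved by the device"
assert jaccard(c1, c2) < 0.3

# ...while embeddings of paraphrases sit close in vector space. These vectors
# are illustrative stand-ins for FastEmbed output.
e1, e2, e_unrelated = [0.9, 0.1, 0.2], [0.85, 0.15, 0.25], [0.1, 0.9, -0.3]
assert cosine(e1, e2) > 0.95
assert cosine(e1, e_unrelated) < 0.5
```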
---

### Layer 2: Extraction — Better Models and Constrained Output

**Current state**: Qwen2.5-3B with mock fallback, no output guarantees

**Required models from the awesome list**:

| Model | Why It's Better |
|-------|----------------|
| **[Qwen3.6-Plus](https://github.com/QwenLM/Qwen)** | April 2026 flagship, 1M context window, competitive with Claude 4.5 Opus |
| **[Kimi K2.5](https://github.com/MoonshotAI/Kimi-K2.5)** | 256K context, strong reasoning, native tool-use for agentic workflows |
| **[Phi-4](https://github.com/microsoft/PhiCookBook)** | Small but highly capable for reasoning and edge/on-device inference |
| **[OLMo 2](https://github.com/allenai/OLMo)** | Fully open-source (data + code + logs) — by scientists, for scientists |
| **[GLM-5](https://github.com/zai-org/GLM-5)** | Strong coding, reasoning, and agentic-task performance |

**Requirement B-REQ-1**: Upgrade the primary brain to Qwen3.6-Plus (or its quantized variant) for maximum reasoning quality. Use Phi-4 as the local/edge fallback for 16GB VRAM deployment.

**Requirement B-REQ-2**: Use the **Instructor** library ([jxnl/instructor](https://github.com/jxnl/instructor)) for structured output extraction with Pydantic validation. This replaces the need for the Guidance library — Instructor handles validation, retries, and error handling for extracting claims as structured JSON from any LLM.

**From the Emergence Transformer paper — Requirement B-REQ-3**: Implement **Dynamical Temporal Attention** in the council architecture:

The paper shows that **neighbor-DTA consistently promotes coherence** while **self-DTA has an optimal weight**. Applied to the AI Council:

- **Neighbor-DTA for council members**: Each council member (Extractor, Critic, Chairman) should attend to OTHER members' reasoning history, not just their own. This promotes convergence on genuinely shared insights.
- **Self-DTA with tunable weight (α)**: Each member also attends to their OWN past outputs, but with a tunable weight. The paper proves there's an optimal α — too much self-attention causes overconfidence. Too little means no memory.
- **Practical implementation**: Store each council member's outputs across multiple papers. When processing paper N, the Extractor can attend to its OWN extractions from papers 1…N-1 (self-DTA) and to the Critic's feedback from papers 1…N-1 (neighbor-DTA). The attention weights are tunable per task.

This turns the council from a stateless sequential pipeline into a **stateful attention-based ensemble** where members learn from each other's history.

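B-REQ-3's self/neighbor split can be sketched as a context builder that weights a member's own history by α and peers' histories by 1 − α. The flat log-of-dicts layout and the weighting scheme are our illustration, not a prescribed format:

```python
def council_context(member: str, history: list[dict],
                    alpha: float, k: int = 3) -> list[tuple[float, str]]:
    """Weighted context for one council member: its own last k outputs
    (self-DTA, weight alpha) plus other members' last k outputs
    (neighbor-DTA, weight 1 - alpha)."""
    own = [h["text"] for h in history if h["member"] == member][-k:]
    peers = [h["text"] for h in history if h["member"] != member][-k:]
    return [(alpha, t) for t in own] + [(1 - alpha, t) for t in peers]

log = [
    {"member": "extractor", "text": "paper 1: claim A (Fact)"},
    {"member": "critic", "text": "paper 1: claim A lacks a source quote"},
    {"member": "extractor", "text": "paper 2: claim B (Hypothesis)"},
]
ctx = council_context("extractor", log, alpha=0.3)
# With alpha < 0.5, peer feedback outweighs the member's own past outputs.
assert max(w for w, _ in ctx) == 0.7
assert any("lacks a source quote" in t for _, t in ctx)
```

Sweeping α per task type (high for clear-cut tags, low for ambiguous ones) is then a one-parameter experiment.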
---

### Layer 3: Deduplication — Semantic Matching at Scale

**Current state**: Jaccard word overlap

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search in SQLite | Find duplicate claims by meaning, not word overlap, inside our existing DB |
| **[Chroma](https://github.com/chroma-core/chroma)** | Embedding database | If claim count exceeds SQLite performance, scale to dedicated vector DB |
| **[rerankers](https://github.com/AnswerDotAI/rerankers)** | Unified reranking API | After finding candidate duplicates by embedding, use cross-encoder reranking for precision |

**Requirement M-REQ-3**: Implement two-stage deduplication: (1) Fast approximate matching via sqlite-vec embeddings (recall-optimized), (2) Precise reranking via cross-encoder for candidate pairs (precision-optimized).

---

### Layer 4: Knowledge Graph — Add Temporal Reasoning and Graph RAG

**Current state**: SQLite adjacency list, word-overlap conflict detection

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Graphiti](https://github.com/getzep/graphiti)** | Real-time temporal knowledge graphs with provenance tracking | Tracks how facts change over time — exactly what we need for claim versioning |
| **[GraphRAG](https://github.com/microsoft/graphrag)** | Knowledge-graph-based retrieval | Enables multi-hop reasoning over the claim graph |
| **[LightRAG](https://github.com/HKUDS/LightRAG)** | Graph-based RAG with dual-level retrieval | Simpler alternative to GraphRAG for our scale |
| **[KAG (OpenSPG)](https://github.com/OpenSPG/KAG)** | Knowledge Augmented Generation for logical reasoning | Schema-constrained knowledge construction for professional domains |

**From the Emergence Transformer paper — Requirement G-REQ-1**: Apply DTA to the knowledge graph for **emergent conflict detection**:

The paper models N coupled oscillators where coherence emerges from attention-mediated interactions. Claims in the knowledge graph are analogous to oscillators:

- Each claim has a "phase" (its epistemic state: Fact/Interpretation/Hypothesis and confidence)
- Claims interact through graph edges (supports/refutes/extends)
- **Neighbor-DTA** on the graph: When scoring a claim, attend to the HISTORY of its graph neighbors. A claim that was "Interpretation" but whose supporting claims have all been upgraded to "Fact" over time should be reconsidered.
- **Conflict detection as coherence breakdown**: The paper's order parameter (r_t) measures global coherence. In our graph, sudden drops in local coherence (a cluster of claims that were previously consistent suddenly becoming contradictory because of a new paper) are analogous to desynchronization events. These should trigger alerts.

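The "coherence breakdown" signal in G-REQ-1 can be prototyped with a Kuramoto-style order parameter: map each claim's epistemic state to a phase and flag clusters whose coherence drops after a new paper arrives. The phase mapping and the alert threshold are our illustration, not the paper's construction:

```python
import cmath
import math

def order_parameter(phases: list[float]) -> float:
    """Kuramoto order parameter r in [0, 1]; 1 = fully synchronized."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)

def coherence_alert(before: list[float], after: list[float], drop: float = 0.3) -> bool:
    """Alert when a claim cluster desynchronizes after new evidence arrives."""
    return order_parameter(before) - order_parameter(after) > drop

# Hypothetical mapping: supported claims sit near phase 0, refuted near pi.
consistent = [0.1, 0.2, 0.15, 0.05]
contradicted = [0.1, 0.2, math.pi, 0.05]   # one claim flipped by a new paper
assert order_parameter(consistent) > 0.95
assert coherence_alert(consistent, contradicted)
```

A per-cluster version of this check, run whenever a new paper's claims are merged, gives the desynchronization alerts the requirement asks for.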
**Requirement G-REQ-2**: Use Graphiti for temporal provenance. Every claim stores when it was first extracted, when it was last confirmed by a new source, and when it was contradicted. The graph should answer queries like "What changed about LOD claims for GFET sensors between 2022 and 2025?"

---

### Layer 5: Scoring — Add Calibration Infrastructure

**Current state**: Fixed-point formula works, calibration only planned

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[DeepEval](https://github.com/confident-ai/deepeval)** | LLM evaluation with hallucination detection, bias detection | Automated checking of extraction quality and confidence calibration |
| **[RAGAs](https://github.com/explodinggradients/ragas)** | RAG evaluation (faithfulness, relevance, context recall) | Evaluate whether extracted claims are faithful to source text |

**Requirement S-REQ-1**: Use DeepEval's faithfulness metric to automatically check: "Does this extracted claim actually appear in the source text?" This replaces manual gold-standard checking for high-volume papers.

**Requirement S-REQ-2**: Use RAGAs for end-to-end pipeline evaluation — measure whether the system retrieves the right evidence and generates faithful extractions.

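S-REQ-1's question ("does this claim actually appear in the source?") has a crude dependency-free baseline worth keeping alongside the LLM-judged metric: content-word overlap between claim and source. The stop-word list and scoring are ours, not DeepEval's:

```python
def support_score(claim: str, source: str) -> float:
    """Fraction of the claim's content words that occur in the source text."""
    stop = {"the", "a", "an", "of", "was", "is", "in", "by"}
    words = [w for w in claim.lower().split() if w not in stop]
    src = set(source.lower().split())
    return sum(1 for w in words if w in src) / len(words)

source = "the limit of detection was 10 fm for the graphene sensor"
assert support_score("detection limit was 10 fm", source) == 1.0   # grounded
assert support_score("the sensor cures cancer", source) < 0.5      # hallucinated
```

A low baseline score is a cheap pre-filter: only claims that pass it need the more expensive DeepEval faithfulness call.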
---

### Layer 6: Evaluation — Build a Real Test Suite

**Current state**: Counts distributions, no ground-truth comparison

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** | De-facto standard for model evaluation | Standardized evaluation across model versions |
| **[DeepEval](https://github.com/confident-ai/deepeval)** | "Pytest for LLMs" | Unit-test each extraction task with pass/fail criteria |
| **[Promptfoo](https://github.com/promptfoo/promptfoo)** | LLM testing and red-teaming | Systematic prompt testing, side-by-side model comparison |
| **[Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai)** | UK AI Safety Institute's evaluation framework | Multi-turn dialog evaluation with tool use |
| **[Lighteval](https://github.com/huggingface/lighteval)** | Lightweight model evaluation | Quick evaluation during training |

**Requirement T-REQ-1**: Implement DeepEval-based unit tests for each extraction task. Each test has:
- Input: paper excerpt
- Expected output: correct claims with correct tags
- Pass criteria: extraction recall ≥ 70%, epistemic accuracy ≥ 60%, qualifier preservation ≥ 80%

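The T-REQ-1 pass criteria can be wired up as plain assertions even before adopting DeepEval. This minimal scorer treats exact string match as "correct" (a real harness would relax that) and omits qualifier preservation for brevity:

```python
def score_extraction(predicted: list[dict], gold: list[dict]) -> dict:
    """Recall against gold claims, plus tag accuracy on the claims that matched."""
    gold_texts = {g["text"] for g in gold}
    hits = [p for p in predicted if p["text"] in gold_texts]
    tag_by_text = {g["text"]: g["tag"] for g in gold}
    tag_ok = sum(1 for p in hits if p["tag"] == tag_by_text[p["text"]])
    return {
        "recall": len(hits) / len(gold),
        "epistemic_accuracy": tag_ok / len(hits) if hits else 0.0,
    }

def passes(metrics: dict) -> bool:
    # Thresholds straight from T-REQ-1.
    return metrics["recall"] >= 0.70 and metrics["epistemic_accuracy"] >= 0.60

gold = [{"text": "LOD was 10 fM", "tag": "Fact"},
        {"text": "may enable early diagnosis", "tag": "Hypothesis"}]
pred = [{"text": "LOD was 10 fM", "tag": "Fact"},
        {"text": "may enable early diagnosis", "tag": "Interpretation"}]
m = score_extraction(pred, gold)
assert m["recall"] == 1.0 and m["epistemic_accuracy"] == 0.5
assert not passes(m)  # epistemic accuracy below the 60% bar
```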
**Requirement T-REQ-2**: Use Promptfoo for prompt regression testing. Every time a system prompt changes, automatically compare outputs before and after.

---

### Training Pipeline — Use Production Frameworks

**Current state**: Custom train.py with TRL SFTTrainer, ZeroGPU micro-batching

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[TRL](https://github.com/huggingface/trl)** | Official SFT, DPO, GRPO, PPO | Already used for SFT — extend to DPO and GRPO stages |
| **[Axolotl](https://github.com/axolotl-ai-cloud/axolotl)** | YAML-driven SFT, DPO, GRPO pipeline | Simpler configuration than custom scripts |
| **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** | One-stop SFT, DPO, ORPO with web UI | GUI for non-programmers to run training |
| **[Unsloth](https://github.com/unslothai/unsloth)** | 2× faster, 70% less memory fine-tuning | Makes training feasible on consumer GPUs |
| **[OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)** | Scalable RLHF with PPO, GRPO, REINFORCE++ | For the GRPO stage with custom epistemic reward functions |
| **[verl](https://github.com/volcengine/verl)** | ByteDance's RL for LLMs with PPO, GRPO | Alternative GRPO implementation |
| **[PEFT](https://github.com/huggingface/peft)** | Parameter-efficient fine-tuning (LoRA, etc.) | Already used — continue with LoRA |

**Requirement TR-REQ-1**: Replace ZeroGPU micro-batching with a single continuous training job using Unsloth (for 2× speedup on consumer GPU) or Axolotl (for YAML-driven pipeline).

**Requirement TR-REQ-2**: Implement the 4-stage pipeline:
1. **SFT** via TRL/Unsloth (already partially built)
2. **DPO** via TRL DPOTrainer on preference pairs
3. **GRPO** via OpenRLHF or verl with the 3 custom reward functions (JSON validity, schema compliance, qualifier preservation)
4. **ConfTuner** via custom training loop with tokenized Brier score loss

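The three custom reward functions named in the GRPO stage can be prototyped and unit-tested independently of OpenRLHF/verl. The binary reward shaping and the qualifier word list are our choices, not prescribed by either framework:

```python
import json

EPISTEMIC_TAGS = {"Fact", "Interpretation", "Hypothesis"}
QUALIFIERS = ("may", "might", "suggests", "could", "appears")

def reward_json_validity(output: str) -> float:
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def reward_schema(output: str) -> float:
    try:
        claim = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    ok = (claim.get("tag") in EPISTEMIC_TAGS
          and isinstance(claim.get("text"), str)
          and 0.0 <= claim.get("confidence", -1) <= 1.0)
    return 1.0 if ok else 0.0

def reward_qualifier_preservation(source: str, output: str) -> float:
    """Fraction of hedging words in the source that survive into the claim."""
    present = [q for q in QUALIFIERS if q in source.lower()]
    if not present:
        return 1.0
    try:
        text = json.loads(output).get("text", "").lower()
    except json.JSONDecodeError:
        return 0.0
    return sum(1 for q in present if q in text) / len(present)

good = '{"text": "results suggest the film may be conductive", "tag": "Hypothesis", "confidence": 0.6}'
src = "Our results suggest the film may be conductive."
assert reward_json_validity(good) == 1.0
assert reward_schema(good) == 1.0
assert reward_qualifier_preservation(src, good) == 1.0
assert reward_schema('{"tag": "Truth", "text": "x", "confidence": 0.5}') == 0.0
```

Each function maps a model rollout to a scalar, which is the shape OpenRLHF and verl both expect from a custom reward.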
**From the Emergence Transformer paper — Requirement TR-REQ-3**: Implement **DTA-inspired continual learning** for domain adaptation:

The paper demonstrates that DTA applied to Hopfield networks achieves continual learning without catastrophic forgetting. For our model:

- When fine-tuning on a new scientific domain (e.g., adding ecology to a biosensors-trained model), use the DTA principle: the model should attend to its OWN past activations (self-DTA) to remember old domains while learning new ones.
- Practically: this maps to **O-LoRA** (orthogonal LoRA) — training new LoRA adapters in orthogonal subspaces so they don't interfere with existing adapters. The DTA paper provides the theoretical foundation for WHY this works: self-attention on past states preserves memory while neighbor-attention on new data drives learning.

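The O-LoRA idea reduces to keeping new adapter directions orthogonal to the subspace old adapters already occupy. A minimal Gram–Schmidt sketch with plain lists — real O-LoRA applies this per layer to the LoRA A-matrices, which this toy does not attempt:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(v, basis):
    """Remove from v its components along each orthonormal basis vector, so
    the remaining direction cannot interfere with earlier adapters."""
    for b in basis:
        coeff = dot(v, b)
        v = [vi - coeff * bi for vi, bi in zip(v, b)]
    return v

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [vi / n for vi in v]

# Directions already claimed by previously trained domain adapters.
old_adapter_dirs = [normalize([1.0, 0.0, 0.0]), normalize([0.0, 1.0, 0.0])]
proposed = [0.5, 0.5, 0.5]                       # new-domain update direction
new_dir = normalize(project_out(proposed, old_adapter_dirs))

# The surviving direction is orthogonal to everything the old domains use.
assert all(abs(dot(new_dir, b)) < 1e-9 for b in old_adapter_dirs)
```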
---

### Synthetic Data Generation

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Distilabel](https://github.com/argilla-io/distilabel)** | Synthetic data generation and distillation | Generate 10K+ training examples using teacher models |
| **[Argilla](https://github.com/argilla-io/argilla)** | Data annotation and human-in-the-loop | Label real paper extractions with human experts |

**Requirement D-REQ-1**: Use Distilabel to generate the teacher ensemble outputs. Run 3-5 teacher models (Qwen3.6-Plus, Kimi K2.5, GLM-5) on 100 real papers and store ALL outputs with disagreement signals.

**Requirement D-REQ-2**: Use Argilla for human expert labeling of the gold standard test set (10 papers, every claim manually annotated).

---

### Data Quality and Labeling

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Cleanlab](https://github.com/cleanlab/cleanlab)** | Find and fix label errors in datasets | Detect mislabeled training examples automatically |
| **[Great Expectations](https://github.com/great-expectations/great_expectations)** | Data validation for pipelines | Validate that every training example has required fields, valid JSON, correct tag |
| **[Label Studio](https://github.com/HumanSignal/label-studio)** | Multi-type data labeling | Interface for human annotators to label paper excerpts |

**Requirement D-REQ-3**: Run Cleanlab on the existing 1,900 training examples to detect any mislabeled examples (wrong epistemic tags, missing qualifiers).

**Requirement D-REQ-4**: Use Great Expectations to validate every training example before it enters the training pipeline: valid JSON, tag in allowed set, confidence in [0,1], non-empty source quote.

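D-REQ-4's four checks are easy to state as plain predicates; Great Expectations would express the same rules as an expectation suite, but this version can gate the pipeline today. Field names are assumed from the requirement text:

```python
import json

ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis"}

def validate_example(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the example may enter training."""
    try:
        ex = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    errors = []
    if ex.get("tag") not in ALLOWED_TAGS:
        errors.append("tag not in allowed set")
    conf = ex.get("confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        errors.append("confidence outside [0, 1]")
    if not ex.get("source_quote", "").strip():
        errors.append("empty source quote")
    return errors

ok = '{"tag": "Fact", "confidence": 0.9, "source_quote": "LOD was 10 fM"}'
bad = '{"tag": "Guess", "confidence": 1.5, "source_quote": ""}'
assert validate_example(ok) == []
assert len(validate_example(bad)) == 3
```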
---

### Inference Serving — Replace Mock with Real AI

**Current state**: No model serving, everything runs through optional API calls

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Ollama](https://github.com/ollama/ollama)** | Simplest local LLM serving | One-command model serving on consumer hardware |
| **[vLLM](https://github.com/vllm-project/vllm)** | High-throughput LLM serving | Fast batch processing of many paper sections |
| **[llama.cpp](https://github.com/ggml-org/llama.cpp)** | CPU/GPU inference for quantized models | Run on laptops without dedicated GPU |
| **[SGLang](https://github.com/sgl-project/sglang)** | Fast structured generation | Guaranteed valid JSON output via grammar-constrained decoding |

**Requirement I-REQ-1**: Integrate Ollama as the default local model server. One command to start: `ollama pull qwen3.6-plus:q4` → model available at `http://localhost:11434`.

**Requirement I-REQ-2**: Use SGLang for constrained decoding — it guarantees valid JSON output with valid enum values. This eliminates broken JSON, invalid tags, and mixed text/JSON output.

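I-REQ-1 and I-REQ-2 combine naturally: ask the local server for schema-constrained output. The sketch below only builds the request body; actually sending it assumes an Ollama-style `/api/chat` endpoint that accepts a JSON schema in the `format` field (supported in recent Ollama releases), and the model tag mirrors the requirement text rather than a verified registry entry:

```python
import json

CLAIM_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "tag": {"type": "string", "enum": ["Fact", "Interpretation", "Hypothesis"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["text", "tag", "confidence"],
}

def build_extraction_request(section_text: str, model: str = "qwen3.6-plus:q4") -> dict:
    """Request body for a local, schema-constrained extraction call."""
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"Extract one claim as JSON:\n\n{section_text}"}],
        "format": CLAIM_SCHEMA,   # constrains decoding to valid claim JSON
        "stream": False,
    }

req = build_extraction_request("The film may be conductive.")
assert req["format"]["properties"]["tag"]["enum"][0] == "Fact"
assert json.dumps(req)  # the payload is serializable as-is
```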
---

### AI Safety and Security

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Guardrails AI](https://github.com/guardrails-ai/guardrails)** | Input/output validation for LLMs | Validate extraction outputs match expected schema |
| **[LLM Guard](https://github.com/protectai/llm-guard)** | Security toolkit for LLM interactions | Detect prompt injection if system is exposed as API |
| **[Garak](https://github.com/NVIDIA/garak)** | LLM vulnerability scanner | Test model for hallucination patterns specific to scientific claims |
| **[DeepTeam](https://github.com/confident-ai/deepteam)** | Red teaming framework | Adversarial testing of extraction robustness |

**Requirement SEC-REQ-1**: Use Guardrails AI to validate every LLM output before it enters the database. Schema validation, tag validation, confidence range checking.

**Requirement SEC-REQ-2**: Use Garak to scan the fine-tuned model for scientific hallucination patterns. Test: does the model invent statistics? Does it fabricate citations? Does it claim certainty where the paper was uncertain?

---

### Interpretability

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)** | Mechanistic interpretability | Understand WHICH attention heads are responsible for qualifier detection vs claim extraction |
| **[Captum](https://github.com/pytorch/captum)** | PyTorch interpretability | Attribution analysis — which input tokens influenced each output |

**From the Emergence Transformer paper — Requirement INT-REQ-1**: Use the paper's DTA attention kernel analysis to interpret the fine-tuned model:

The paper derives explicit formulas for how attention weights evolve over time (Equations 9-10). After fine-tuning, we can:
- Visualize which tokens in the input (qualifier words like "may," "suggests") have the highest attention weight when the model outputs epistemic tags
- Track whether attention to the Abstract section decreases when the model has already processed the Results section (temporal attention shift)
- Identify attention heads that specialize in specific tasks (one head for qualifier detection, another for statistical parsing) — this validates whether the model has learned task-specific representations, answering the "specialist heads" question empirically

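The first bullet of INT-REQ-1 can be prototyped on any attention matrix. In a real run the row would come from the fine-tuned model's attention outputs (e.g., via TransformerLens hooks); the toy tokens and weights below are fabricated for illustration:

```python
QUALIFIERS = {"may", "suggests", "might"}

def qualifier_attention(tokens: list[str], attn_row: list[float]) -> float:
    """Share of one output position's attention mass landing on qualifier tokens."""
    total = sum(attn_row)
    hedged = sum(w for t, w in zip(tokens, attn_row) if t.lower() in QUALIFIERS)
    return hedged / total

tokens = ["the", "film", "may", "be", "conductive"]
attn_when_tagging = [0.05, 0.10, 0.70, 0.05, 0.10]  # fabricated attention row
assert qualifier_attention(tokens, attn_when_tagging) > 0.5
```

A head whose rows consistently score high on this measure when the model emits "Hypothesis" is a candidate qualifier-detection specialist.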
---

### MLOps and Monitoring

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[MLflow](https://github.com/mlflow/mlflow)** | Experiment tracking, model registry | Track every training run, compare model versions |
| **[Weights & Biases (wandb)](https://github.com/wandb/wandb)** | Experiment tracking with visualization | Dashboard for training metrics across all 4 stages |
| **[DVC](https://github.com/iterative/dvc)** | Data and model versioning | Version the training dataset and gold standard |
| **[Evidently](https://github.com/evidentlyai/evidently)** | ML monitoring and observability | Detect model drift in production |
| **[Phoenix](https://github.com/Arize-ai/phoenix)** | AI observability | Monitor extraction quality in real-time |

**Requirement OPS-REQ-1**: Use MLflow or W&B to track all training experiments. Every training run logs: loss curves, evaluation metrics, model checkpoints, hyperparameters, dataset version.

**Requirement OPS-REQ-2**: Use Evidently for drift detection. Weekly check: run the model on the gold standard test set. If any metric drops >5%, alert.

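OPS-REQ-2's weekly gate is worth pinning down precisely, since "drops >5%" can mean relative or absolute. This sketch takes the relative reading; Evidently would wrap the same comparison in a monitoring report:

```python
def drift_alert(baseline: dict, current: dict, tolerance: float = 0.05) -> list[str]:
    """Metrics whose relative drop from the gold-standard baseline exceeds tolerance."""
    return [name for name, base in baseline.items()
            if (base - current.get(name, 0.0)) / base > tolerance]

baseline = {"recall": 0.80, "epistemic_accuracy": 0.70}
current = {"recall": 0.79, "epistemic_accuracy": 0.60}
# recall fell 1.25% (fine); epistemic accuracy fell ~14% (alert).
assert drift_alert(baseline, current) == ["epistemic_accuracy"]
```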
---

### Agent Framework — Connect Real Brains to Agent Bodies

**Current state**: Full agent lifecycle works, but agents have no AI model connected

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[smolagents](https://github.com/huggingface/smolagents)** | Lightweight agent framework from HuggingFace | Simpler than current custom AgentOS for basic tasks |
| **[LangGraph](https://github.com/langchain-ai/langgraph)** | Stateful, multi-actor agent orchestration | For the multi-agent council with memory |
| **[CrewAI](https://github.com/crewAIInc/crewAI)** | Multi-agent collaboration framework | Define roles (Extractor, Critic, Chairman) with collaboration protocols |
| **[Letta (MemGPT)](https://github.com/letta-ai/letta)** | Stateful agents with persistent memory | Agents that remember across sessions |
| **[Mem0](https://github.com/mem0ai/mem0)** | Universal memory layer for agents | Persistent memory for the MetaImprover and CitationChaser agents |

**From the Emergence Transformer paper — Requirement A-REQ-1**: Implement the AI Council as a **DTA-coupled multi-agent system**:

The paper's model has N oscillators coupled through an adjacency matrix A_ij. The AI Council has N=4 members coupled through information sharing. The DTA framework tells us:

- **Coupling topology matters**: The paper shows that different network structures (fully connected, small-world, scale-free) produce different coherence patterns. For 4 council members, fully connected (everyone sees everyone) promotes maximum consensus. Star topology (Chairman sees all, others only see Chairman) preserves more diversity.
- **α parameter tunes consensus vs diversity**: At α=0, no temporal attention → members are stateless → pure diversity. At α=1, full temporal attention → members converge → pure consensus. The paper proves there's an optimal α between 0 and 1 for maximum USEFUL coherence. Tune this per task: high α for clear-cut Fact/Interpretation decisions, low α for ambiguous Conflict_Hypothesis cases.
- **β parameter controls memory decay**: In the paper, β determines how fast old attention information decays. For the council, β controls how much members remember from previous papers. High β = short memory (each paper is fresh). Low β = long memory (patterns from 100 papers ago still influence decisions).

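The topology and memory-decay knobs in A-REQ-1 can be written down directly: a coupling matrix picks who attends to whom, and β exponentially decays older history. Both constructions are our sketch of the ingredients, not the paper's equations, and the fourth member name ("verifier") is hypothetical since the text names only Extractor, Critic, and Chairman:

```python
import math

MEMBERS = ["extractor", "critic", "verifier", "chairman"]

def coupling(topology: str) -> list[list[int]]:
    """Adjacency matrix A_ij over council members."""
    n = len(MEMBERS)
    if topology == "full":   # everyone attends to everyone else
        return [[int(i != j) for j in range(n)] for i in range(n)]
    if topology == "star":   # chairman (last member) is the hub
        hub = n - 1
        return [[int((i == hub or j == hub) and i != j) for j in range(n)]
                for i in range(n)]
    raise ValueError(topology)

def memory_weights(n_past: int, beta: float) -> list[float]:
    """Normalized weight per past step, newest last; high beta = short memory."""
    raw = [math.exp(-beta * age) for age in range(n_past - 1, -1, -1)]
    total = sum(raw)
    return [w / total for w in raw]

full, star = coupling("full"), coupling("star")
assert sum(map(sum, full)) == 12   # 4 members, all 12 directed edges
assert sum(map(sum, star)) == 6    # only edges touching the chairman
short = memory_weights(5, beta=2.0)
assert short[-1] > 0.8             # the most recent paper dominates
```

Comparing extraction disagreement rates under `full` vs `star` coupling is then a direct test of the paper's consensus-vs-diversity prediction.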
---

### UI and User Experience

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Gradio](https://github.com/gradio-app/gradio)** | Already used | Continue using for main UI |
| **[Kotaemon](https://github.com/Cinnamon/kotaemon)** | RAG-based document chat with Gradio UI | Reference implementation for document Q&A interface |
| **[Open Notebook](https://github.com/lfnovo/open-notebook)** | AI-powered notebook with multi-modal support | Model for how to build the Obsidian-like research interface |

**Requirement UI-REQ-1**: Study Kotaemon's architecture for the "chat with your papers" interface. It has hybrid RAG, re-ranking, and multi-modal support — exactly what the Research OS courtroom UI needs.

---

## Summary: The Complete Requirements Stack

| Layer | Current Tool | Required Tool(s) | Source |
|-------|-------------|------------------|--------|
| PDF Parsing | pdfplumber/PyMuPDF | **Marker** + MinerU + Docling | Awesome List §5 |
| Chunking | Custom section merger | **Chonkie** | Awesome List §5 |
| Embeddings | None | **FastEmbed** + sqlite-vec | Awesome List §5 |
| Deduplication | Jaccard overlap | FastEmbed + **rerankers** | Awesome List §5 |
| Base Model | Qwen2.5-3B | **Qwen3.6-Plus** / Phi-4 | Awesome List §2 |
| Structured Output | Hope-for-JSON | **Instructor** / SGLang | Awesome List §13 |
| Model Serving | None (mock) | **Ollama** / vLLM | Awesome List §3 |
| Council Architecture | Sequential pipeline | **DTA-coupled agents** (LangGraph) | Emergence Paper |
| Knowledge Graph | SQLite adjacency | SQLite + **Graphiti** temporal layer | Awesome List §5 |
| Graph Reasoning | Word-overlap conflicts | **LightRAG** / GraphRAG | Awesome List §5 |
| Continual Learning | Retrain from scratch | **O-LoRA** (DTA-inspired) | Emergence Paper |
| Training Framework | Custom train.py | **Unsloth** / Axolotl / TRL | Awesome List §7 |
| GRPO Training | Not built | **OpenRLHF** / verl | Awesome List §7 |
| Synthetic Data | Template generator | **Distilabel** | Awesome List §7 |
| Human Labeling | Not built | **Argilla** / Label Studio | Awesome List §7 |
| Data Validation | Not built | **Cleanlab** + Great Expectations | Awesome List §9 |
| LLM Evaluation | Count-based metrics | **DeepEval** + Promptfoo | Awesome List §9 |
| RAG Evaluation | Not built | **RAGAs** | Awesome List §9 |
| Safety Scanning | Not built | **Garak** + LLM Guard | Awesome List §10 |
| Output Validation | Not built | **Guardrails AI** | Awesome List §10 |
| Interpretability | Not built | **TransformerLens** + Captum | Awesome List §10 |
| Experiment Tracking | Tensorboard only | **MLflow** / W&B | Awesome List §8 |
| Drift Detection | Not built | **Evidently** | Awesome List §8 |
| Data Versioning | Not built | **DVC** | Awesome List §8 |
| Agent Memory | Custom memory_store | **Letta** / Mem0 | Awesome List §4 |
| Agent Orchestration | Custom AgentOS | Keep AgentOS + add **LangGraph** | Awesome List §4 |
| Document Chat UI | Gradio tabs | Study **Kotaemon** architecture | Awesome List §5 |

---

## The Emergence Transformer's 3 Key Contributions to This System

### 1. Council as Coupled Oscillators (Neighbor-DTA)
Instead of a sequential pipeline, council members interact through attention-mediated coupling. The Emergence Transformer proves that neighbor-attention promotes coherence — members converge on genuinely shared insights while preserving dissent where it matters.

### 2. Continual Learning Without Forgetting (Self-DTA + Hopfield)
When adding new scientific domains, the model maintains its existing knowledge through self-attention on past states. The paper provides the theoretical proof that DTA-modified Hopfield networks achieve continual memory storage — directly applicable to our O-LoRA domain adaptation strategy.

### 3. Tunable Consensus vs Diversity (α Parameter)
The system can be configured to either push toward agreement (for clear-cut cases) or deliberately preserve plurality (for genuinely ambiguous epistemic classifications). The paper proves that the optimal α depends on network structure — for our 4-member council, this is a tunable hyperparameter.

---

*Every requirement in this document traces to a specific tool in the Awesome Open Source AI list or a specific result in the Emergence Transformer paper. No requirements were invented outside these two sources.*

+ replace_with_file:/app/REQUIREMENTS_FROM_SOURCES.md