nkshirsa committed on
Commit
2e5325c
·
verified ·
1 Parent(s): 3f62132

Add REQUIREMENTS_FROM_SOURCES.md — requirements grounded in Emergence Transformer paper + Awesome Open Source AI list, highschool-readable

Files changed (1): REQUIREMENTS_FROM_SOURCES.md (+407 −0)
# PhD Research OS — Requirements Derived from Sources
## Grounded Entirely in the Emergence Transformer Paper + Awesome Open Source AI List

**Written so a high school student can understand every word.**

**Sources**:
1. 📄 [Emergence Transformer: Dynamical Temporal Attention Matters](https://arxiv.org/abs/2604.19816) — A paper that redesigns the Transformer's attention mechanism so components interact with their own past states through time-varying queries, keys, and values. It shows that neighbor-attention promotes coherence while self-attention has an optimal sweet spot, and it applies this to social opinion models and Hopfield networks for continual learning without forgetting.
2. 📦 [Awesome Open Source AI](https://github.com/alvinreal/awesome-opensource-ai) — A curated list of 500+ production-proven open-source AI tools across 14 categories (April 2026).

---

## What These Sources Tell Us

### From the Emergence Transformer Paper

The paper introduces **Dynamical Temporal Attention (DTA)** — a version of the Transformer in which the query, key, and value matrices change over time. The key insights that apply to our Research OS:

1. **Neighbor-DTA vs. Self-DTA**: When components pay attention to their neighbors' history, coherence (agreement) always increases. When they pay attention to their OWN history, there is an optimal attention weight — too much self-attention actually hurts. This maps directly onto our AI Council: council members should attend to EACH OTHER'S reasoning (neighbor-DTA), not just their own previous outputs (self-DTA).

2. **Emergent Continual Learning**: The paper shows DTA applied to Hopfield neural networks achieves continual learning WITHOUT catastrophic forgetting. This is exactly what our Research OS needs — the model should learn from new papers without forgetting what it learned from old ones.

3. **Social Coherence Modulation**: DTA can either enhance agreement or preserve plurality in social opinion models. For our system, this means the AI Council should be tunable: we can either push it toward consensus (for clear-cut cases) or deliberately preserve disagreement (for genuinely ambiguous cases).

4. **Time-Varying Attention Kernels**: Standard Transformers have fixed attention patterns. DTA makes attention evolve over time. For our system, this means: as the model processes more of a paper, its attention to earlier sections should change. Reading the Discussion should update how the model interprets the Abstract.

### From the Awesome Open Source AI List

The list catalogs the production-ready tools that exist today. Here are the specific tools relevant to each part of our system, organized by what they replace or enable:

---

## Requirements by System Layer

### Layer 0: PDF Parsing — Replace Basic Scrapers with ML Parsers

**Current state**: PyMuPDF/pdfplumber (basic text extraction)

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Marker](https://github.com/datalab-to/marker)** | Fast, accurate PDF-to-markdown with table extraction and equation handling | Replaces pdfplumber — preserves document structure, tables, equations |
| **[MinerU](https://github.com/opendatalab/MinerU)** | High-accuracy PDF parsing with VLM+OCR dual engine | Handles scanned papers and complex layouts that Marker misses |
| **[Docling](https://github.com/docling-project/docling)** | Document processing toolkit for GenAI workflows | Backup parser for non-standard formats (Word, PPT, Excel supplements) |
| **[Unstructured](https://github.com/Unstructured-IO/unstructured)** | Best-in-class document preprocessing | Universal fallback for any document type |
| **[MarkItDown](https://github.com/microsoft/markitdown)** | Microsoft's file-to-Markdown converter | Handles supplementary files (Excel data, PowerPoint presentations) |
| **[OmniParse](https://github.com/adithya-s-k/omniparse)** | Parses documents, tables, images, videos, audio, web pages | Multi-modal supplement handling (video supplements, audio recordings) |

**Requirement P-REQ-1**: Integrate Marker as the primary parser. Fall back to MinerU for scanned/OCR documents. Use Docling/MarkItDown for non-PDF supplements.

**Requirement P-REQ-2**: Use Chonkie ([chonkie-inc/chonkie](https://github.com/chonkie-inc/chonkie)) for intelligent document chunking. It supports semantic, token, and recursive chunking strategies, replacing the current simple section-merge chunking in parser.py.
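The P-REQ-1 fallback chain can be sketched as a priority-ordered loop. This is a minimal illustration under assumptions, not Marker's or MinerU's actual API: each entry in `parsers` is a hypothetical wrapper callable that returns markdown text on success or raises on failure.

```python
def parse_document(path, parsers):
    """Try each (name, parser) pair in priority order; return the first success.

    `parsers` might look like [("marker", parse_with_marker),
    ("mineru", parse_with_mineru), ("docling", parse_with_docling)],
    where each wrapper is a thin adapter around the real library.
    """
    errors = {}
    for name, parser in parsers:
        try:
            text = parser(path)
            if text and text.strip():  # reject empty extractions too
                return name, text
        except Exception as exc:  # a real pipeline would narrow this per library
            errors[name] = str(exc)
    raise RuntimeError(f"all parsers failed: {errors}")
```

In practice each wrapper would call the real library and the `except` clause would be narrowed to that library's error types.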

---

### Layer 1: Entity Resolution — Add Embedding-Based Matching

**Current state**: No embedding model, no entity normalization

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[BGE / FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)** | Best-in-class text embeddings | Convert claims to vectors for semantic matching instead of word overlap |
| **[FastEmbed](https://github.com/qdrant/fastembed)** | Lightweight embedding with ONNX Runtime, no GPU needed | Local-first embedding that runs on CPU for privacy |
| **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search as a SQLite extension | Adds vector similarity to our existing SQLite database — zero new infrastructure |
| **[MTEB](https://github.com/embeddings-benchmark/mteb)** | Embedding benchmark | Choose the best embedding model for scientific text by testing on MTEB |

**Requirement M-REQ-1**: Replace Jaccard word overlap in `canonicalizer.py` with embedding-based cosine similarity using FastEmbed + sqlite-vec. This keeps the system local-first (SQLite + CPU embeddings) while enabling semantic deduplication.
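A minimal sketch of what M-REQ-1 replaces Jaccard with: cosine similarity over claim vectors. The vectors here are plain Python lists standing in for FastEmbed output (in production the comparison would run inside sqlite-vec), and the 0.9 threshold is an illustrative assumption, not a tuned value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_duplicate(claim_vec, stored_vecs, threshold=0.9):
    """Flag a new claim as a duplicate if any stored vector is close enough."""
    return any(cosine(claim_vec, v) >= threshold for v in stored_vecs)
```

Unlike word overlap, this matches paraphrases ("LOD of 1 nM" vs. "detection limit of 1 nanomolar") because the embedding model places them near each other.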

**Requirement M-REQ-2**: Use MTEB to benchmark which embedding model performs best on scientific claim similarity before committing to one.

---

### Layer 2: Extraction — Better Models and Constrained Output

**Current state**: Qwen2.5-3B with mock fallback, no output guarantees

**Required models from the awesome list**:

| Model | Why It's Better |
|-------|----------------|
| **[Qwen3.6-Plus](https://github.com/QwenLM/Qwen)** | April 2026 flagship, 1M context window, competitive with Claude 4.5 Opus |
| **[Kimi K2.5](https://github.com/MoonshotAI/Kimi-K2.5)** | 256K context, strong reasoning, native tool-use for agentic workflows |
| **[Phi-4](https://github.com/microsoft/PhiCookBook)** | Small but highly capable for reasoning and edge/on-device inference |
| **[OLMo 2](https://github.com/allenai/OLMo)** | Fully open-source (data + code + logs) — by scientists, for scientists |
| **[GLM-5](https://github.com/zai-org/GLM-5)** | Strong coding, reasoning, and agentic-task performance |

**Requirement B-REQ-1**: Upgrade the primary brain to Qwen3.6-Plus (or its quantized variant) for maximum reasoning quality. Use Phi-4 as the local/edge fallback for 16GB VRAM deployment.

**Requirement B-REQ-2**: Use the **Instructor** library ([jxnl/instructor](https://github.com/jxnl/instructor)) for structured output extraction with Pydantic validation. This replaces the need for the Guidance library — Instructor handles validation, retries, and error handling for extracting claims as structured JSON from any LLM.

**From the Emergence Transformer paper — Requirement B-REQ-3**: Implement **Dynamical Temporal Attention** in the council architecture:

The paper shows that **neighbor-DTA consistently promotes coherence** while **self-DTA has an optimal weight**. Applied to the AI Council:

- **Neighbor-DTA for council members**: Each council member (Extractor, Critic, Chairman) should attend to OTHER members' reasoning history, not just their own. This promotes convergence on genuinely shared insights.
- **Self-DTA with tunable weight (α)**: Each member also attends to their OWN past outputs, but with a tunable weight. The paper proves there's an optimal α — too much self-attention causes overconfidence. Too little means no memory.
- **Practical implementation**: Store each council member's outputs across multiple papers. When processing paper N, the Extractor can attend to its OWN extractions from papers 1…N-1 (self-DTA) and to the Critic's feedback from papers 1…N-1 (neighbor-DTA). The attention weights are tunable per task.

This turns the council from a stateless sequential pipeline into a **stateful attention-based ensemble** where members learn from each other's history.
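The blending above can be sketched as a toy function. This is our own simplification, not the paper's equations: scalar "opinions" stand in for real member outputs, α weights self-history against neighbor-history, and β is an assumed exponential-decay rate echoing the paper's memory kernel.

```python
import math

def dta_mix(self_history, neighbor_history, alpha, beta):
    """Blend a member's own past outputs (self-DTA, weight alpha) with its
    neighbors' past outputs (neighbor-DTA, weight 1 - alpha).

    Histories are ordered oldest-to-newest; older entries decay as
    exp(-beta * age), so high beta = short memory."""
    def decayed_mean(history):
        if not history:
            return 0.0
        # newest entry has age 0, oldest has age len(history) - 1
        ws = [math.exp(-beta * age) for age in range(len(history) - 1, -1, -1)]
        return sum(w * h for w, h in zip(ws, history)) / sum(ws)
    return alpha * decayed_mean(self_history) + (1 - alpha) * decayed_mean(neighbor_history)
```

Setting `alpha=1.0` recovers a purely self-attending (stateful but isolated) member; `alpha=0.0` gives a member driven entirely by its neighbors.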

---

### Layer 3: Deduplication — Semantic Matching at Scale

**Current state**: Jaccard word overlap

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[sqlite-vec](https://github.com/asg017/sqlite-vec)** | Vector search in SQLite | Find duplicate claims by meaning, not word overlap, inside our existing DB |
| **[Chroma](https://github.com/chroma-core/chroma)** | Embedding database | If claim count exceeds SQLite performance, scale to dedicated vector DB |
| **[rerankers](https://github.com/AnswerDotAI/rerankers)** | Unified reranking API | After finding candidate duplicates by embedding, use cross-encoder reranking for precision |

**Requirement M-REQ-3**: Implement two-stage deduplication: (1) fast approximate matching via sqlite-vec embeddings (recall-optimized), (2) precise reranking via cross-encoder for candidate pairs (precision-optimized).
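M-REQ-3's retrieve-then-rerank staging in miniature. The scoring callables are injected so the sketch stays library-agnostic: `coarse_score` stands in for a cheap sqlite-vec embedding score and `fine_score` for an expensive rerankers cross-encoder; both names, `coarse_k`, and the threshold are illustrative assumptions.

```python
def two_stage_dedup(query, candidates, coarse_score, fine_score,
                    coarse_k=10, fine_threshold=0.8):
    """Stage 1: keep the top-k candidates by a cheap score (recall-optimized).
    Stage 2: rescore only the survivors with a precise scorer (precision)."""
    shortlist = sorted(candidates,
                       key=lambda c: coarse_score(query, c),
                       reverse=True)[:coarse_k]
    return [c for c in shortlist if fine_score(query, c) >= fine_threshold]
```

The point of the split is cost: the cross-encoder runs on at most `coarse_k` pairs per query instead of on every stored claim.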

---

### Layer 4: Knowledge Graph — Add Temporal Reasoning and Graph RAG

**Current state**: SQLite adjacency list, word-overlap conflict detection

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Graphiti](https://github.com/getzep/graphiti)** | Real-time temporal knowledge graphs with provenance tracking | Tracks how facts change over time — exactly what we need for claim versioning |
| **[GraphRAG](https://github.com/microsoft/graphrag)** | Knowledge-graph-based retrieval | Enables multi-hop reasoning over the claim graph |
| **[LightRAG](https://github.com/HKUDS/LightRAG)** | Graph-based RAG with dual-level retrieval | Simpler alternative to GraphRAG for our scale |
| **[KAG (OpenSPG)](https://github.com/OpenSPG/KAG)** | Knowledge Augmented Generation for logical reasoning | Schema-constrained knowledge construction for professional domains |

**From the Emergence Transformer paper — Requirement G-REQ-1**: Apply DTA to the knowledge graph for **emergent conflict detection**:

The paper models N coupled oscillators where coherence emerges from attention-mediated interactions. Claims in the knowledge graph are analogous to oscillators:

- Each claim has a "phase" (its epistemic state: Fact/Interpretation/Hypothesis and confidence)
- Claims interact through graph edges (supports/refutes/extends)
- **Neighbor-DTA** on the graph: When scoring a claim, attend to the HISTORY of its graph neighbors. A claim that was "Interpretation" but whose supporting claims have all been upgraded to "Fact" over time should be reconsidered.
- **Conflict detection as coherence breakdown**: The paper's order parameter (r_t) measures global coherence. In our graph, sudden drops in local coherence (a cluster of claims that were previously consistent suddenly becoming contradictory because of a new paper) are analogous to desynchronization events. These should trigger alerts.
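The order parameter is cheap to monitor per graph neighborhood. The Kuramoto-style form below, r = |N⁻¹ Σⱼ exp(iθⱼ)|, is the standard coherence measure for coupled oscillators; mapping epistemic states to phase angles, and the 0.3 drop threshold, are our assumptions, not the paper's.

```python
import cmath

def order_parameter(phases):
    """Coherence r in [0, 1]: 1 = all claims in phase (consistent),
    near 0 = desynchronized (contradictory cluster)."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)

def coherence_drop_alert(prev_r, new_r, drop=0.3):
    """Flag a desynchronization event: local coherence fell sharply
    after a new paper's claims were merged into the neighborhood."""
    return prev_r - new_r > drop
```

Run `order_parameter` over a claim's neighborhood before and after ingesting a new paper; a large drop is the alert condition G-REQ-1 describes.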

**Requirement G-REQ-2**: Use Graphiti for temporal provenance. Every claim stores when it was first extracted, when it was last confirmed by a new source, and when it was contradicted. The graph should answer queries like "What changed about LOD claims for GFET sensors between 2022 and 2025?"

---

### Layer 5: Scoring — Add Calibration Infrastructure

**Current state**: Fixed-point formula works, calibration only planned

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[DeepEval](https://github.com/confident-ai/deepeval)** | LLM evaluation with hallucination detection, bias detection | Automated checking of extraction quality and confidence calibration |
| **[RAGAs](https://github.com/explodinggradients/ragas)** | RAG evaluation (faithfulness, relevance, context recall) | Evaluate whether extracted claims are faithful to source text |

**Requirement S-REQ-1**: Use DeepEval's faithfulness metric to automatically check: "Does this extracted claim actually appear in the source text?" This replaces manual gold-standard checking for high-volume papers.

**Requirement S-REQ-2**: Use RAGAs for end-to-end pipeline evaluation — measure whether the system retrieves the right evidence and generates faithful extractions.

---

### Layer 6: Evaluation — Build a Real Test Suite

**Current state**: Counts distributions, no ground-truth comparison

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** | De-facto standard for model evaluation | Standardized evaluation across model versions |
| **[DeepEval](https://github.com/confident-ai/deepeval)** | "Pytest for LLMs" | Unit-test each extraction task with pass/fail criteria |
| **[Promptfoo](https://github.com/promptfoo/promptfoo)** | LLM testing and red-teaming | Systematic prompt testing, side-by-side model comparison |
| **[Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai)** | UK AI Safety Institute's evaluation framework | Multi-turn dialog evaluation with tool use |
| **[Lighteval](https://github.com/huggingface/lighteval)** | Lightweight model evaluation | Quick evaluation during training |

**Requirement T-REQ-1**: Implement DeepEval-based unit tests for each extraction task. Each test has:
- Input: paper excerpt
- Expected output: correct claims with correct tags
- Pass criteria: extraction recall ≥ 70%, epistemic accuracy ≥ 60%, qualifier preservation ≥ 80%
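The pass criteria above reduce to a small gate function; a DeepEval test would wrap the same arithmetic in its metric API rather than hand-rolling it. The count arguments are hypothetical outputs of a scoring step that compares predictions against the gold annotations.

```python
def extraction_gate(expected_claims, found_claims, correct_tags,
                    kept_qualifiers, total_qualifiers):
    """Apply the T-REQ-1 pass criteria to one test case.

    recall   = found / expected            (must be >= 0.70)
    tag_acc  = correctly tagged / found    (must be >= 0.60)
    qual     = qualifiers kept / present   (must be >= 0.80)
    """
    recall = found_claims / expected_claims
    tag_acc = correct_tags / found_claims if found_claims else 0.0
    qual = kept_qualifiers / total_qualifiers if total_qualifiers else 1.0
    return recall >= 0.70 and tag_acc >= 0.60 and qual >= 0.80
```

A test case with no qualifiers in the source trivially passes the qualifier criterion, which is why the function defaults `qual` to 1.0.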

**Requirement T-REQ-2**: Use Promptfoo for prompt regression testing. Every time a system prompt changes, automatically compare outputs before and after.

---

### Training Pipeline — Use Production Frameworks

**Current state**: Custom train.py with TRL SFTTrainer, ZeroGPU micro-batching

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[TRL](https://github.com/huggingface/trl)** | Official SFT, DPO, GRPO, PPO | Already used for SFT — extend to DPO and GRPO stages |
| **[Axolotl](https://github.com/axolotl-ai-cloud/axolotl)** | YAML-driven SFT, DPO, GRPO pipeline | Simpler configuration than custom scripts |
| **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** | One-stop SFT, DPO, ORPO with web UI | GUI for non-programmers to run training |
| **[Unsloth](https://github.com/unslothai/unsloth)** | 2× faster, 70% less memory fine-tuning | Makes training feasible on consumer GPUs |
| **[OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)** | Scalable RLHF with PPO, GRPO, REINFORCE++ | For the GRPO stage with custom epistemic reward functions |
| **[verl](https://github.com/volcengine/verl)** | ByteDance's RL for LLMs with PPO, GRPO | Alternative GRPO implementation |
| **[PEFT](https://github.com/huggingface/peft)** | Parameter-efficient fine-tuning (LoRA, etc.) | Already used — continue with LoRA |

**Requirement TR-REQ-1**: Replace ZeroGPU micro-batching with a single continuous training job using Unsloth (for 2× speedup on consumer GPU) or Axolotl (for a YAML-driven pipeline).

**Requirement TR-REQ-2**: Implement the 4-stage pipeline:
1. **SFT** via TRL/Unsloth (already partially built)
2. **DPO** via TRL DPOTrainer on preference pairs
3. **GRPO** via OpenRLHF or verl with the 3 custom reward functions (JSON validity, schema compliance, qualifier preservation)
4. **ConfTuner** via custom training loop with tokenized Brier score loss
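The three GRPO reward functions in step 3 can be prototyped as plain scorers before wiring them into OpenRLHF or verl. A sketch under assumptions: the tag vocabulary, the field names (`claim`, `tag`, `confidence`), and the qualifier word list are illustrative, not the project's actual schema.

```python
import json

ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis"}   # assumed tag set
QUALIFIERS = ("may", "might", "suggests", "appears", "likely")  # assumed list

def reward_json_validity(output):
    """1.0 if the model's output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except (ValueError, TypeError):
        return 0.0

def reward_schema(output):
    """1.0 if the parsed claim has required fields, a legal tag,
    and a confidence inside [0, 1]."""
    try:
        obj = json.loads(output)
    except (ValueError, TypeError):
        return 0.0
    ok = (isinstance(obj, dict)
          and obj.get("tag") in ALLOWED_TAGS
          and isinstance(obj.get("confidence"), (int, float))
          and 0.0 <= obj["confidence"] <= 1.0
          and bool(obj.get("claim")))
    return 1.0 if ok else 0.0

def reward_qualifiers(source, output):
    """Fraction of hedging words present in the source that survive
    into the model's output (1.0 when the source has none)."""
    present = [q for q in QUALIFIERS if q in source.lower()]
    if not present:
        return 1.0
    return sum(q in output.lower() for q in present) / len(present)
```

Each function maps an output to a scalar in [0, 1], which is the shape GRPO-style trainers expect for a reward signal.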

**From the Emergence Transformer paper — Requirement TR-REQ-3**: Implement **DTA-inspired continual learning** for domain adaptation:

The paper demonstrates that DTA applied to Hopfield networks achieves continual learning without catastrophic forgetting. For our model:

- When fine-tuning on a new scientific domain (e.g., adding ecology to a biosensors-trained model), use the DTA principle: the model should attend to its OWN past activations (self-DTA) to remember old domains while learning new ones.
- Practically: this maps to **O-LoRA** (orthogonal LoRA) — training new LoRA adapters in orthogonal subspaces so they don't interfere with existing adapters. The DTA paper provides the theoretical foundation for WHY this works: self-attention on past states preserves memory while neighbor-attention on new data drives learning.
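The orthogonality idea can be illustrated with a projection: remove from a candidate update any component lying in the subspace the old adapters already occupy, so the new adapter cannot overwrite them. Real O-LoRA operates on LoRA weight matrices with a regularization loss; this vector version is only a sketch and assumes `old_basis` is orthonormal.

```python
def project_out(update, old_basis):
    """Return `update` with its components along each old-adapter
    direction removed (Gram-Schmidt-style rejection).

    Assumes every vector in `old_basis` is unit-length and the basis
    vectors are mutually orthogonal."""
    result = list(update)
    for b in old_basis:
        coeff = sum(u * v for u, v in zip(result, b))      # projection length
        result = [u - coeff * v for u, v in zip(result, b)]  # subtract it
    return result
```

After the projection, the update is orthogonal to every old direction, which is the geometric reason new-domain training leaves old-domain behavior intact.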

---

### Synthetic Data Generation

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Distilabel](https://github.com/argilla-io/distilabel)** | Synthetic data generation and distillation | Generate 10K+ training examples using teacher models |
| **[Argilla](https://github.com/argilla-io/argilla)** | Data annotation and human-in-the-loop | Label real paper extractions with human experts |

**Requirement D-REQ-1**: Use Distilabel to generate the teacher ensemble outputs. Run 3-5 teacher models (Qwen3.6-Plus, Kimi K2.5, GLM-5) on 100 real papers and store ALL outputs with disagreement signals.

**Requirement D-REQ-2**: Use Argilla for human expert labeling of the gold standard test set (10 papers, every claim manually annotated).

---

### Data Quality and Labeling

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Cleanlab](https://github.com/cleanlab/cleanlab)** | Find and fix label errors in datasets | Detect mislabeled training examples automatically |
| **[Great Expectations](https://github.com/great-expectations/great_expectations)** | Data validation for pipelines | Validate that every training example has required fields, valid JSON, correct tag |
| **[Label Studio](https://github.com/HumanSignal/label-studio)** | Multi-type data labeling | Interface for human annotators to label paper excerpts |

**Requirement D-REQ-3**: Run Cleanlab on the existing 1,900 training examples to detect any mislabeled examples (wrong epistemic tags, missing qualifiers).

**Requirement D-REQ-4**: Use Great Expectations to validate every training example before it enters the training pipeline: valid JSON, tag in allowed set, confidence in [0,1], non-empty source quote.
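D-REQ-4's rules as a standalone validator (an empty list means the example passes). In production these would be declared as Great Expectations expectations rather than hand-rolled; the field names `tag`, `confidence`, and `source_quote` are assumed schema names for illustration.

```python
import json

def validate_example(raw):
    """Return a list of D-REQ-4 rule violations for one training example."""
    errors = []
    try:
        ex = json.loads(raw)
    except (ValueError, TypeError):
        return ["not valid JSON"]           # nothing else is checkable
    if ex.get("tag") not in {"Fact", "Interpretation", "Hypothesis"}:
        errors.append("tag not in allowed set")
    conf = ex.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence outside [0, 1]")
    if not str(ex.get("source_quote") or "").strip():
        errors.append("empty source quote")
    return errors
```

Running this as a gate before the training pipeline means a malformed example is rejected with a named reason instead of silently degrading a training run.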

---

### Inference Serving — Replace Mock with Real AI

**Current state**: No model serving, everything runs through optional API calls

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Ollama](https://github.com/ollama/ollama)** | Simplest local LLM serving | One-command model serving on consumer hardware |
| **[vLLM](https://github.com/vllm-project/vllm)** | High-throughput LLM serving | Fast batch processing of many paper sections |
| **[llama.cpp](https://github.com/ggml-org/llama.cpp)** | CPU/GPU inference for quantized models | Run on laptops without dedicated GPU |
| **[SGLang](https://github.com/sgl-project/sglang)** | Fast structured generation | Guaranteed valid JSON output via grammar-constrained decoding |

**Requirement I-REQ-1**: Integrate Ollama as the default local model server. One command to start: `ollama pull qwen3.6-plus:q4` → model available at `http://localhost:11434`.

**Requirement I-REQ-2**: Use SGLang for constrained decoding — it guarantees valid JSON output with valid enum values. This eliminates broken JSON, invalid tags, and mixed text/JSON output.

---

### AI Safety and Security

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Guardrails AI](https://github.com/guardrails-ai/guardrails)** | Input/output validation for LLMs | Validate extraction outputs match expected schema |
| **[LLM Guard](https://github.com/protectai/llm-guard)** | Security toolkit for LLM interactions | Detect prompt injection if system is exposed as API |
| **[Garak](https://github.com/NVIDIA/garak)** | LLM vulnerability scanner | Test model for hallucination patterns specific to scientific claims |
| **[DeepTeam](https://github.com/confident-ai/deepteam)** | Red teaming framework | Adversarial testing of extraction robustness |

**Requirement SEC-REQ-1**: Use Guardrails AI to validate every LLM output before it enters the database. Schema validation, tag validation, confidence range checking.

**Requirement SEC-REQ-2**: Use Garak to scan the fine-tuned model for scientific hallucination patterns. Test: does the model invent statistics? Does it fabricate citations? Does it claim certainty where the paper was uncertain?

---

### Interpretability

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)** | Mechanistic interpretability | Understand WHICH attention heads are responsible for qualifier detection vs claim extraction |
| **[Captum](https://github.com/pytorch/captum)** | PyTorch interpretability | Attribution analysis — which input tokens influenced each output |

**From the Emergence Transformer paper — Requirement INT-REQ-1**: Use the paper's DTA attention kernel analysis to interpret the fine-tuned model:

The paper derives explicit formulas for how attention weights evolve over time (Equations 9-10). After fine-tuning, we can:
- Visualize which tokens in the input (qualifier words like "may," "suggests") have the highest attention weight when the model outputs epistemic tags
- Track whether attention to the Abstract section decreases when the model has already processed the Results section (temporal attention shift)
- Identify attention heads that specialize in specific tasks (one head for qualifier detection, another for statistical parsing) — this validates whether the model has learned task-specific representations, answering the "specialist heads" question empirically

---

### MLOps and Monitoring

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[MLflow](https://github.com/mlflow/mlflow)** | Experiment tracking, model registry | Track every training run, compare model versions |
| **[Weights & Biases (wandb)](https://github.com/wandb/wandb)** | Experiment tracking with visualization | Dashboard for training metrics across all 4 stages |
| **[DVC](https://github.com/iterative/dvc)** | Data and model versioning | Version the training dataset and gold standard |
| **[Evidently](https://github.com/evidentlyai/evidently)** | ML monitoring and observability | Detect model drift in production |
| **[Phoenix](https://github.com/Arize-ai/phoenix)** | AI observability | Monitor extraction quality in real-time |

**Requirement OPS-REQ-1**: Use MLflow or W&B to track all training experiments. Every training run logs: loss curves, evaluation metrics, model checkpoints, hyperparameters, dataset version.

**Requirement OPS-REQ-2**: Use Evidently for drift detection. Weekly check: run the model on the gold standard test set. If any metric drops >5%, alert.

---

### Agent Framework — Connect Real Brains to Agent Bodies

**Current state**: Full agent lifecycle works, but agents have no AI model connected

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[smolagents](https://github.com/huggingface/smolagents)** | Lightweight agent framework from HuggingFace | Simpler than current custom AgentOS for basic tasks |
| **[LangGraph](https://github.com/langchain-ai/langgraph)** | Stateful, multi-actor agent orchestration | For the multi-agent council with memory |
| **[CrewAI](https://github.com/crewAIInc/crewAI)** | Multi-agent collaboration framework | Define roles (Extractor, Critic, Chairman) with collaboration protocols |
| **[Letta (MemGPT)](https://github.com/letta-ai/letta)** | Stateful agents with persistent memory | Agents that remember across sessions |
| **[Mem0](https://github.com/mem0ai/mem0)** | Universal memory layer for agents | Persistent memory for the MetaImprover and CitationChaser agents |

**From the Emergence Transformer paper — Requirement A-REQ-1**: Implement the AI Council as a **DTA-coupled multi-agent system**:

The paper's model has N oscillators coupled through an adjacency matrix A_ij. The AI Council has N=4 members coupled through information sharing. The DTA framework tells us:

- **Coupling topology matters**: The paper shows that different network structures (fully connected, small-world, scale-free) produce different coherence patterns. For 4 council members, fully connected (everyone sees everyone) promotes maximum consensus. Star topology (Chairman sees all, others only see Chairman) preserves more diversity.
- **α parameter tunes consensus vs diversity**: At α=0, no temporal attention → members are stateless → pure diversity. At α=1, full temporal attention → members converge → pure consensus. The paper proves there's an optimal α between 0 and 1 for maximum USEFUL coherence. Tune this per task: high α for clear-cut Fact/Interpretation decisions, low α for ambiguous Conflict_Hypothesis cases.
- **β parameter controls memory decay**: In the paper, β determines how fast old attention information decays. For the council, β controls how much members remember from previous papers. High β = short memory (each paper is fresh). Low β = long memory (patterns from 100 papers ago still influence decisions).

---

### UI and User Experience

**Required tools from the awesome list**:

| Tool | What It Does | Why We Need It |
|------|-------------|----------------|
| **[Gradio](https://github.com/gradio-app/gradio)** | Already used | Continue using for main UI |
| **[Kotaemon](https://github.com/Cinnamon/kotaemon)** | RAG-based document chat with Gradio UI | Reference implementation for document Q&A interface |
| **[Open Notebook](https://github.com/lfnovo/open-notebook)** | AI-powered notebook with multi-modal support | Model for how to build the Obsidian-like research interface |

**Requirement UI-REQ-1**: Study Kotaemon's architecture for the "chat with your papers" interface. It has hybrid RAG, re-ranking, and multi-modal support — exactly what the Research OS courtroom UI needs.

---

## Summary: The Complete Requirements Stack

| Layer | Current Tool | Required Tool(s) | Source |
|-------|-------------|------------------|--------|
| PDF Parsing | pdfplumber/PyMuPDF | **Marker** + MinerU + Docling | Awesome List §5 |
| Chunking | Custom section merger | **Chonkie** | Awesome List §5 |
| Embeddings | None | **FastEmbed** + sqlite-vec | Awesome List §5 |
| Deduplication | Jaccard overlap | FastEmbed + **rerankers** | Awesome List §5 |
| Base Model | Qwen2.5-3B | **Qwen3.6-Plus** / Phi-4 | Awesome List §2 |
| Structured Output | Hope-for-JSON | **Instructor** / SGLang | Awesome List §13 |
| Model Serving | None (mock) | **Ollama** / vLLM | Awesome List §3 |
| Council Architecture | Sequential pipeline | **DTA-coupled agents** (LangGraph) | Emergence Paper |
| Knowledge Graph | SQLite adjacency | SQLite + **Graphiti** temporal layer | Awesome List §5 |
| Graph Reasoning | Word-overlap conflicts | **LightRAG** / GraphRAG | Awesome List §5 |
| Continual Learning | Retrain from scratch | **O-LoRA** (DTA-inspired) | Emergence Paper |
| Training Framework | Custom train.py | **Unsloth** / Axolotl / TRL | Awesome List §7 |
| GRPO Training | Not built | **OpenRLHF** / verl | Awesome List §7 |
| Synthetic Data | Template generator | **Distilabel** | Awesome List §7 |
| Human Labeling | Not built | **Argilla** / Label Studio | Awesome List §7 |
| Data Validation | Not built | **Cleanlab** + Great Expectations | Awesome List §9 |
| LLM Evaluation | Count-based metrics | **DeepEval** + Promptfoo | Awesome List §9 |
| RAG Evaluation | Not built | **RAGAs** | Awesome List §9 |
| Safety Scanning | Not built | **Garak** + LLM Guard | Awesome List §10 |
| Output Validation | Not built | **Guardrails AI** | Awesome List §10 |
| Interpretability | Not built | **TransformerLens** + Captum | Awesome List §10 |
| Experiment Tracking | Tensorboard only | **MLflow** / W&B | Awesome List §8 |
| Drift Detection | Not built | **Evidently** | Awesome List §8 |
| Data Versioning | Not built | **DVC** | Awesome List §8 |
| Agent Memory | Custom memory_store | **Letta** / Mem0 | Awesome List §4 |
| Agent Orchestration | Custom AgentOS | Keep AgentOS + add **LangGraph** | Awesome List §4 |
| Document Chat UI | Gradio tabs | Study **Kotaemon** architecture | Awesome List §5 |

---

## The Emergence Transformer's 3 Key Contributions to This System

### 1. Council as Coupled Oscillators (Neighbor-DTA)
Instead of a sequential pipeline, council members interact through attention-mediated coupling. The Emergence Transformer proves that neighbor-attention promotes coherence — members converge on genuinely shared insights while preserving dissent where it matters.

### 2. Continual Learning Without Forgetting (Self-DTA + Hopfield)
When adding new scientific domains, the model maintains its existing knowledge through self-attention on past states. The paper provides the theoretical proof that DTA-modified Hopfield networks achieve continual memory storage — directly applicable to our O-LoRA domain adaptation strategy.

### 3. Tunable Consensus vs Diversity (α Parameter)
The system can be configured to either push toward agreement (for clear-cut cases) or deliberately preserve plurality (for genuinely ambiguous epistemic classifications). The paper proves that the optimal α depends on network structure — for our 4-member council, this is a tunable hyperparameter.

---

*Every requirement in this document traces to a specific tool in the Awesome Open Source AI list or a specific result in the Emergence Transformer paper. No requirements were invented outside these two sources.*