Add SYSTEM_INSPIRATIONS.md — concrete adoption plan: what to steal, what to adapt, what to learn from
# PhD Research OS — System Inspirations

## What to Steal, What to Adapt, and What to Learn From

**Date**: 2026-04-23
**Status**: ACTIONABLE — Every item maps to a specific layer, priority, and source system
**Written for**: Anyone with a high school education
**Purpose**: Turn the Prior Art Analysis into concrete things we should actually DO

---

## What Is This Document?

The [PRIOR_ART_ANALYSIS.md](PRIOR_ART_ANALYSIS.md) tells you WHO has built what. This document tells you WHAT we should do about it. Every item here is one of:

- **🔵 DIRECT ADOPTION** — Use this system/tool/dataset exactly as-is, plugged into our pipeline
- **🟣 ADAPTATION** — Take the idea or technique and rebuild it to fit our specific needs
- **🟤 INDIRECT INSPIRATION** — Learn from their approach without copying their code

Think of it like cooking. Direct adoption = buying a jar of pasta sauce. Adaptation = taking someone's recipe but adjusting the spices. Inspiration = visiting a great restaurant and getting ideas for your own menu.

---

## Table of Contents

1. [Direct Adoptions — Plug These In](#1-direct-adoptions)
2. [Adaptations — Rebuild These for Our Needs](#2-adaptations)
3. [Indirect Inspirations — Learn From These Approaches](#3-indirect-inspirations)
4. [New Features Inspired by Prior Art](#4-new-features)
5. [Integration Map — Where Each Inspiration Fits](#5-integration-map)
6. [Priority Execution Order](#6-priority-order)

---
## 1. Direct Adoptions — Plug These In {#1-direct-adoptions}

These are tools, datasets, and models that we should use EXACTLY as they are. No modification needed. Just install and connect.

---

### 🔵 DA-1: SPECTER2 for Claim Embeddings (Layer 3: Deduplication)

**Source system**: AllenAI / Semantic Scholar
**What to adopt**: [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) with the proximity adapter

**The problem we have**: Our deduplication system uses Jaccard word overlap. "The LOD was 0.8 fM" and "A detection limit of 800 attomolar was achieved" share almost no words but mean the SAME thing. Our current system thinks they're different. That's like a librarian who thinks "dog" and "canine" are different animals.

**The fix**: SPECTER2 converts any scientific sentence into a list of 768 numbers (an "embedding"). Sentences that mean the same thing get similar numbers, even if they use completely different words. It was trained on millions of scientific papers, so it actually understands scientific vocabulary.

**How to plug it in**:
```python
# Replace word-overlap deduplication with embedding-based similarity
from transformers import AutoTokenizer
from adapters import AutoAdapterModel
import torch

# Load once at startup
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
model = AutoAdapterModel.from_pretrained('allenai/specter2_base')
# Activate the proximity adapter, which is tuned for similarity/retrieval
model.load_adapter("allenai/specter2", source="hf", set_active=True)

def embed_claim(text: str) -> list[float]:
    """Convert a claim into a 768-dimensional vector (the [CLS] embedding)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :].squeeze().tolist()

def are_duplicates(claim_a: str, claim_b: str, threshold: float = 0.85) -> bool:
    """Check if two claims are semantic duplicates via cosine similarity."""
    emb_a = embed_claim(claim_a)
    emb_b = embed_claim(claim_b)
    cosine_sim = sum(a * b for a, b in zip(emb_a, emb_b)) / (
        sum(a**2 for a in emb_a) ** 0.5 * sum(b**2 for b in emb_b) ** 0.5
    )
    return cosine_sim > threshold
```

**Analogy**: This is like upgrading from a librarian who only matches exact book titles to a librarian who actually reads the books and knows which ones are about the same topic.

**Layer affected**: Layer 3 (Canonicalization)
**Priority**: 🔴 Critical — do this first
**Effort**: 1-2 days (model download + integration)

---
### 🔵 DA-2: SciFact as Our Evaluation Benchmark (Layer 6: Evaluation)

**Source system**: AllenAI
**What to adopt**: [`bigbio/scifact`](https://huggingface.co/datasets/bigbio/scifact)

**The problem we have**: We have 143 tests that check if the code runs. Zero tests that check if the SCIENCE is right. That's like testing a calculator by checking if the buttons click, but never checking if 2+2 actually equals 4.

**The fix**: SciFact contains 1,400 scientific claims with expert labels (SUPPORTS / REFUTES / NOINFO) and rationale sentences. Feed our extraction pipeline the same claims and check if our labels match the expert labels. Instant quality measurement.

**How to plug it in**:
```python
from datasets import load_dataset

# Load the SciFact benchmark
scifact = load_dataset("bigbio/scifact")

# For each claim in the test set, compare our label to the expert label
n_correct = 0
for example in scifact["test"]:
    claim_text = example["text_1"]      # The scientific claim
    evidence_text = example["text_2"]   # The abstract/evidence
    expert_label = example["label"]     # SUPPORTS, REFUTES, or NOINFO

    # Run our pipeline on the same inputs
    our_label = our_pipeline.extract_and_classify(claim_text, evidence_text)

    # Do we agree with the expert?
    n_correct += (our_label == expert_label)

accuracy = n_correct / len(scifact["test"])
```

**Analogy**: SciFact is like the answer key for a standardized test. We can't know how well our system works until we check it against the answer key.

**Layer affected**: Layer 6 (Evaluation)
**Priority**: 🔴 Critical — do this alongside SPECTER2
**Effort**: 1 day (dataset download + evaluation script)

---
### 🔵 DA-3: SciRIFF for Training Data (Training Pipeline)

**Source system**: AllenAI
**What to adopt**: [`allenai/SciRIFF`](https://huggingface.co/datasets/allenai/SciRIFF)

**The problem we have**: Our training data is 1,900 examples generated by a Python template script. SciRIFF has 137,000 examples written by human experts across 54 scientific tasks. That's 72× more data, and it's real human expertise, not computer-generated patterns.

**The fix**: Use SciRIFF as supplementary training data alongside our existing dataset. The SciRIFF paper specifically found that expert-written examples outperform GPT-4-generated examples — confirming that our synthetic data is a weakness.

**Key tasks from SciRIFF that map to our pipeline** (a loading sketch follows the table):

| SciRIFF Task | Our Pipeline Layer | How It Helps |
|--------------|--------------------|--------------|
| Claim verification (from SciFact) | Layer 2 + Layer 4 | Teaches claim extraction + supports/refutes labeling |
| Information extraction (from SciERC) | Layer 2 | Teaches entity + relation extraction from papers |
| NER (from various sources) | Layer 1 | Teaches entity recognition |
| Summarization | Layer 2 | Teaches faithful text compression |
| Question answering (from QASPER) | Cross-cutting | Teaches reading comprehension of full papers |

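A minimal conversion sketch. The config name (`"4096"`) and the `"input"`/`"output"` field names are assumptions about the dataset schema; check the SciRIFF dataset card before running a real conversion:

```python
from datasets import load_dataset

# Assumed: SciRIFF ships context-length configs and instruction-style
# "input"/"output" fields. Verify both against the dataset card.
sciriff = load_dataset("allenai/SciRIFF", "4096")

def to_our_format(example: dict) -> dict:
    """Map one SciRIFF instruction/response pair into our training schema."""
    return {
        "instruction": example["input"],
        "response": example["output"],
        "source": "sciriff",  # tag provenance so we can ablate later
    }

supplementary = [to_our_format(ex) for ex in sciriff["train"]]
```
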
**Analogy**: We're currently training our student from a workbook where all the practice problems were invented by a computer. SciRIFF is like getting 137,000 practice problems written by real teachers. The student will learn much better.

**Layer affected**: Training pipeline (all stages)
**Priority**: 🔴 Critical — do this before the next training run
**Effort**: 2-3 days (dataset conversion to our format + integration)

---
### 🔵 DA-4: Nougat for Equation Parsing (Layer 0: Ingestion)

**Source system**: Meta/Facebook Research
**What to adopt**: [`facebook/nougat-base`](https://huggingface.co/facebook/nougat-base)

**The problem we have**: Our parser turns equations into garbage. A paper that says "E = mc²" gets turned into "E mc 2" or worse. For papers in physics, chemistry, and engineering, equations ARE the findings.

**The fix**: Nougat was specifically trained to convert scientific PDFs into markdown WITH proper LaTeX equations. It reads the visual layout of the PDF and outputs structured text that preserves math symbols, subscripts, superscripts, and equation structure.

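A minimal integration sketch, assuming pages are rasterized first (here with the `pdf2image` package, an assumption; any PDF rasterizer works):

```python
import torch
from pdf2image import convert_from_path  # assumed rasterizer, not prescribed
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

def page_to_markdown(pdf_path: str, page: int = 0) -> str:
    """OCR one PDF page into markdown with equations preserved as LaTeX."""
    image = convert_from_path(pdf_path)[page]
    pixel_values = processor(image, return_tensors="pt").pixel_values
    with torch.no_grad():
        outputs = model.generate(pixel_values, max_new_tokens=3584)
    sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    # Nougat's own post-processing cleans repetition and markdown artifacts
    return processor.post_process_generation(sequence, fix_markdown=True)
```
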
**Analogy**: Our current parser is like a translator who speaks English but not math. Nougat speaks both fluently.

**Layer affected**: Layer 0 (Structural Ingestion)
**Priority**: 🔴 Critical — already in our design, needs actual integration
**Effort**: 2-3 days (model integration + routing logic for equation regions)

---

### 🔵 DA-5: SciBERT-NLI as a Fast Contradiction Pre-Filter (Layer 4: Knowledge Graph)

**Source system**: AllenAI (base) + Gabriele Sarti (NLI fine-tuning)
**What to adopt**: [`gsarti/scibert-nli`](https://huggingface.co/gsarti/scibert-nli)

**The problem we have**: Our conflict detection checks all claim pairs using expensive LLM calls. With 1,000 claims, that's 499,500 pairs. Even checking only the top 500 claims is slow.

**The fix**: SciBERT-NLI is a small, fast model specifically trained to detect entailment and contradiction in scientific text. Use it as a pre-filter: run all pairs through SciBERT-NLI (fast, cheap), then only send the pairs flagged as potential contradictions to the expensive main model.

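A minimal pre-filter sketch. Two loud assumptions: that the checkpoint exposes a 3-way NLI classification head, and that the contradiction label sits at index 2; verify both on the model card before trusting this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: the checkpoint ships a 3-way NLI head (entailment / neutral /
# contradiction). If it only ships embeddings, fine-tune a head first.
tokenizer = AutoTokenizer.from_pretrained("gsarti/scibert-nli")
model = AutoModelForSequenceClassification.from_pretrained("gsarti/scibert-nli")

def maybe_contradicts(claim_a: str, claim_b: str, threshold: float = 0.5) -> bool:
    """Cheap screen: True means 'worth sending to the expensive main model'."""
    inputs = tokenizer(claim_a, claim_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze()
    return probs[2].item() > threshold  # index 2 assumed = contradiction
```
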
**Analogy**: Instead of having a doctor examine every person in a city (expensive), first have nurses do quick screenings (cheap), then only send people with concerning results to the doctor.

**Layer affected**: Layer 4 (Knowledge Graph — conflict detection)
**Priority**: 🟠 High — significant speed improvement
**Effort**: 1-2 days (model download + pre-filter pipeline)

---

## 2. Adaptations — Rebuild These for Our Needs {#2-adaptations}

These are techniques and patterns from other systems that we should adapt and rebuild, not copy directly.

---
### 🟣 AD-1: PaperQA2's RCS Pattern → Our "Pre-Extraction Filter"

**Source system**: PaperQA2 (FutureHouse)
**Paper**: [arxiv:2409.13740](https://arxiv.org/abs/2409.13740)

**What they do**: Before generating an answer, PaperQA2 runs **RCS (Rerank + Contextual Summarize)**:
1. Retrieve many text chunks from papers
2. Have the AI rerank them by relevance (throw away the irrelevant ones)
3. Have the AI write a contextual summary of each remaining chunk
4. Generate the final answer using ONLY the summaries

**Why this matters**: PaperQA2 found that the RCS step is the single biggest contributor to answer quality. Without it, the model drowns in irrelevant text. With it, the model sees only the information that matters.

**How we adapt it**: Before our AI Council processes a paper section, add a pre-filtering step (sketched after this list):
1. After Layer 0 parsing, get all text chunks from a section
2. Have a lightweight model (SciBERT) score each chunk for "claim density" — how many extractable findings does this chunk contain?
3. Send only high-density chunks to the expensive Council extraction
4. Low-density chunks (boilerplate methodology, acknowledgments, generic intro text) get a lightweight extraction or are skipped entirely

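A minimal sketch of the scorer and router. The regex heuristic is a hypothetical stand-in for the SciBERT scorer named above, and the 0.4 threshold is illustrative:

```python
import re

# Hypothetical heuristic: sentences with numbers, units, or outcome verbs
# tend to carry extractable findings. A SciBERT scorer would replace this.
FINDING_PATTERN = re.compile(
    r"\d|increase|decrease|higher|lower|significant|achieved|measured|observed",
    re.IGNORECASE,
)

def claim_density(chunk: str) -> float:
    """Fraction of sentences in a chunk that look like findings."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if FINDING_PATTERN.search(s)) / len(sentences)

def route_chunks(chunks: list[str], threshold: float = 0.4) -> tuple[list[str], list[str]]:
    """Dense chunks go to the Council; sparse ones take the lightweight path."""
    dense = [c for c in chunks if claim_density(c) >= threshold]
    sparse = [c for c in chunks if claim_density(c) < threshold]
    return dense, sparse
```
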
**The analogy**: When studying for an exam, you don't re-read the entire textbook. You first skim to find the important sections, highlight them, then study the highlights deeply. RCS is the highlighting step.

**Our adaptation vs. their original**: PaperQA2 uses RCS for question-answering. We adapt it for claim extraction. Different task, same principle — filter before you analyze.

**Layer affected**: Layer 2 (Qualified Extraction — pre-processing step)
**Priority**: 🟠 High — significant quality improvement
**Effort**: 1 week

---
### 🟣 AD-2: FactReview's Dual Evidence Pattern → Our "Cross-Reference Verification"

**Source system**: FactReview
**Paper**: [arxiv:2604.04074](https://arxiv.org/abs/2604.04074)

**What they do**: For every claim in a paper, FactReview checks it against TWO sources:
1. The paper's own evidence (does the data in the paper actually support this claim?)
2. External literature (do other papers agree or disagree?)

**Why this matters**: A claim might look strong within its own paper but be contradicted by 10 other papers. Or a claim might look weak in isolation but be supported by a mountain of external evidence. You need both perspectives.

**How we adapt it**: After Layer 2 extraction produces claims from a single paper, add a cross-reference step before Layer 4 scoring (sketched after this list):
1. For each extracted claim, check: does the claim's evidence (source quote, table reference, figure reference) actually support what the claim says? (Internal check)
2. Then check: does the claim align with or contradict the existing knowledge graph? (External check)
3. If the internal and external evidence disagree, flag the claim for human review

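A minimal verdict sketch. The `claim` fields and the `graph.find_related` / `stance` API are hypothetical stand-ins for our Layer 4 interfaces:

```python
def cross_reference(claim: dict, graph) -> str:
    """Dual-evidence verdict: internal check vs. external check."""
    # Internal check: the quoted evidence must contain the values the
    # claim asserts (crude but auditable).
    internal_ok = all(value in claim["source_quote"] for value in claim["key_values"])

    # External check: what do the claim's nearest neighbors in the graph say?
    neighbors = graph.find_related(claim["embedding"], top_k=10)  # hypothetical API
    supports = sum(1 for n in neighbors if n.stance == "supports")
    refutes = sum(1 for n in neighbors if n.stance == "refutes")
    external_ok = supports >= refutes

    if internal_ok != external_ok:
        return "flag_for_human_review"  # the two evidence sources disagree
    return "verified" if internal_ok else "rejected"
```
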
**The analogy**: FactReview is like a detective who interviews both the suspect (the paper) AND the witnesses (other papers). We should do the same — don't trust what a paper says about itself without checking the neighbors.

**Layer affected**: Between Layer 2 and Layer 4 (new verification step)
**Priority**: 🟠 High — reduces hallucination
**Effort**: 1-2 weeks

---
### 🟣 AD-3: KGX3's Deterministic Language Filters → Our "Epistemic Trigger Words"

**Source system**: KGX3 / iKuhn
**Paper**: [arxiv:2002.03531](https://arxiv.org/abs/2002.03531)

**What they do**: KGX3 doesn't ask an AI "is this a confirmation or a challenge?" Instead, they use **specific word patterns** (called "language-game filters") that detect each epistemic category:
- "consistent with previous findings" → Confirms
- "contrary to expectations" → Challenges
- "first time that" → Novel
- "replication of" → Confirms

Each filter has a fixed activation threshold (θ = 0.7), validated across multiple scientific disciplines.

**Why this matters**: Rule-based classification is deterministic — it gives the same answer every time, and you can audit exactly WHY a claim was classified a certain way. AI-based classification is stochastic — it might give different answers on different runs, and you can't fully explain why.

**How we adapt it**: Build a set of "epistemic trigger word" dictionaries for our three categories:

```python
FACT_TRIGGERS = {
    "strong": ["demonstrated", "measured", "observed", "detected", "confirmed",
               "showed that", "resulted in", "was found to be", "achieved"],
    "moderate": ["correlated with", "associated with", "was consistent with"],
}

INTERPRETATION_TRIGGERS = {
    "strong": ["suggests that", "indicates that", "implies", "may be attributed to",
               "could be explained by", "appears to", "is likely due to"],
    "moderate": ["consistent with", "in line with", "supports the hypothesis"],
}

HYPOTHESIS_TRIGGERS = {
    "strong": ["may", "might", "could potentially", "we hypothesize",
               "it is possible that", "remains to be determined", "future work"],
    "moderate": ["we propose", "we speculate", "it is conceivable"],
}

TRIGGER_SETS = {
    "Fact": FACT_TRIGGERS,
    "Interpretation": INTERPRETATION_TRIGGERS,
    "Hypothesis": HYPOTHESIS_TRIGGERS,
}

def compute_epistemic_trigger_score(claim_text: str, source_section: str) -> tuple[str, dict]:
    """
    Code-computed epistemic classification based on trigger words.
    This runs ALONGSIDE the AI classification as a validator.
    If the AI says 'Fact' but the trigger words say 'Hypothesis',
    flag for human review.
    """
    scores = {"Fact": 0.0, "Interpretation": 0.0, "Hypothesis": 0.0}
    text_lower = claim_text.lower()

    # Strong triggers count more than moderate ones, for every category
    for category, triggers in TRIGGER_SETS.items():
        for trigger in triggers["strong"]:
            if trigger in text_lower:
                scores[category] += 0.3
        for trigger in triggers["moderate"]:
            if trigger in text_lower:
                scores[category] += 0.15

    # Section modifier (from the Epistemic Separation Engine)
    if source_section == "abstract":
        scores["Interpretation"] += 0.2  # Bias toward Interpretation for abstracts
    elif source_section == "results":
        scores["Fact"] += 0.2  # Bias toward Fact for results

    # Return the highest-scoring category plus the full score breakdown
    return max(scores, key=scores.get), scores
```

Use this as a **validator** — if the AI Council says "Fact" but the trigger analysis says "Hypothesis," the disagreement IS the signal. Flag it for human review.

**The analogy**: KGX3's approach is like a grammar checker — it uses specific rules to flag errors. Our AI Council is like a writing tutor — it uses judgment. We should use BOTH. The grammar checker catches things the tutor misses, and vice versa.

**Layer affected**: Layer 2 (Qualified Extraction — validation step) + Layer 5 (Scoring)
**Priority**: 🟠 High — improves epistemic accuracy
**Effort**: 1 week

---
### 🟣 AD-4: Paper Circle's Coverage Checker → Our "Completeness Auditor"

**Source system**: Paper Circle
**Paper**: [arxiv:2604.06170](https://arxiv.org/abs/2604.06170)

**What they do**: After all extraction agents have finished, a "Coverage Checker" agent reviews the results and asks: "Did we miss anything? Are there figures without linked concepts? Are there methods in the text that we didn't extract? Are there experimental results that nobody captured?"

**Why this matters**: The most dangerous failure in any extraction system is **silent omission** — the system looks like it's done, but it actually missed the most important finding in the paper. Without a coverage check, you'd never know.

**How we adapt it**: After the AI Council finishes extraction (end of Layer 2), add a completeness audit (sketched after this list):

1. **Section coverage**: Did we extract at least one claim from every Results subsection? If not, why not? Flag unextracted sections.
2. **Figure/table coverage**: Every figure and table should be referenced by at least one claim. Orphaned figures/tables suggest missed findings.
3. **Statistical coverage**: If the paper reports N, p-values, effect sizes, or confidence intervals, at least one claim should capture them. Unreferenced statistics suggest missed quantitative findings.
4. **Citation coverage**: Claims that cite other papers should be tagged as inherited citations. Untagged citations suggest the system confused someone else's finding for this paper's finding.

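A minimal auditor sketch covering the first two checks. The `paper` dict schema (claims, figures, tables, results subsections with these keys) is a hypothetical stand-in for our Layer 2 output format:

```python
def audit_coverage(paper: dict) -> list[str]:
    """Return human-readable warnings about possible silent omissions."""
    warnings = []
    referenced = {ref for claim in paper["claims"] for ref in claim["evidence_refs"]}

    # Figure/table coverage: every artifact should back at least one claim
    for artifact in paper["figures"] + paper["tables"]:
        if artifact["id"] not in referenced:
            warnings.append(f"Orphaned {artifact['kind']}: {artifact['id']}")

    # Section coverage: every Results subsection should yield >= 1 claim
    claimed_sections = {claim["source_section"] for claim in paper["claims"]}
    for section in paper["results_subsections"]:
        if section not in claimed_sections:
            warnings.append(f"No claims extracted from section: {section}")

    return warnings
```
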
**The analogy**: After a team of movers loads a truck, a supervisor walks through the house checking every room. "Did we get the bedroom? Kitchen? Garage? Anything left behind?" The Coverage Checker is that supervisor.

**Layer affected**: Layer 2 (Qualified Extraction — post-processing step)
**Priority**: 🟡 Medium — prevents silent omission
**Effort**: 1 week

---
### 🟣 AD-5: PaperQA2's "Refuse to Answer" → Our "Low Confidence Quarantine"

**Source system**: PaperQA2 (FutureHouse)
**Paper**: [arxiv:2409.13740](https://arxiv.org/abs/2409.13740)

**What they do**: When PaperQA2 doesn't have enough evidence to answer a question, it says "I cannot answer this question with the available evidence." It literally refuses rather than making something up.

**Why this matters**: A truth-seeking system that ALWAYS produces output is lying. Real scientists say "I don't know" constantly — it's the most honest and most important thing they say. Our system should too.

**How we adapt it**: Add a confidence quarantine zone (routing gate sketched after this list):

1. **At extraction time**: If the AI Council's composite confidence for a claim is below 0.3, the claim goes into a "Low Confidence Quarantine" queue instead of the main knowledge graph.
2. **In the UI**: Quarantined claims are visible in a separate tab labeled "⚠️ Uncertain — Needs Evidence." They are NOT shown in the main dashboard.
3. **In exports**: Quarantined claims are excluded from Obsidian exports, CSV downloads, and BibTeX files by default. A user can manually include them with a checkbox: "☐ Include uncertain claims (confidence < 0.3)."
4. **In the knowledge graph**: Quarantined claims are NOT connected to other claims via supports/refutes edges. They exist as isolated nodes in a "pending" state.

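A minimal routing gate implementing the rules above (the dict-based claim schema is illustrative):

```python
QUARANTINE_THRESHOLD = 0.3  # the cut-off from rule 1 above

def route_claim(claim: dict) -> str:
    """Decide where a freshly extracted claim lives."""
    if claim["composite_confidence"] < QUARANTINE_THRESHOLD:
        return "quarantine"  # isolated "pending" node, no supports/refutes edges
    return "knowledge_graph"

def export_claims(claims: list[dict], include_uncertain: bool = False) -> list[dict]:
    """Exports drop quarantined claims unless the user ticks the checkbox."""
    if include_uncertain:
        return claims
    return [c for c in claims if c["composite_confidence"] >= QUARANTINE_THRESHOLD]
```
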
**The analogy**: A weather forecaster who says "it might rain tomorrow" is more honest than one who says "it WILL rain tomorrow" when they're not sure. Our system should be the honest forecaster.

**Layer affected**: Layer 2, Layer 4, Layer 7 (cross-cutting)
**Priority**: 🟠 High — prevents false confidence
**Effort**: 3-5 days

---
### 🟣 AD-6: CLAIRE's Agentic Contradiction Loop → Our "Conflict Investigation Protocol"

**Source system**: CLAIRE (Stanford)
**Paper**: [arxiv:2509.23233](https://arxiv.org/abs/2509.23233)

**What they do**: Instead of just comparing two claims and saying "these contradict," CLAIRE runs an investigation:
1. Extract a claim
2. Search the entire corpus for related claims
3. For each related claim, use AI reasoning to determine: agree, disagree, or unrelated?
4. If they disagree, retrieve MORE evidence from the surrounding context to confirm the contradiction isn't just a misunderstanding
5. Only then flag it as a real contradiction

**How we adapt it**: Upgrade our conflict detection from simple pairwise comparison to an investigation protocol (report structure sketched after this list):

1. **Pre-filter** (DA-5: SciBERT-NLI): Quick check — do these two claims even potentially disagree?
2. **Context retrieval**: For both claims, retrieve the full paragraph they came from (not just the claim text)
3. **Method check**: Do both claims use comparable methods? (Our existing method-compatibility layer)
4. **Evidence deepening**: Search for other claims in the knowledge graph that relate to EITHER claim. Do they provide additional evidence for or against the contradiction?
5. **Confidence assessment**: How confident are we that this is a REAL contradiction vs. a difference in methods, definitions, or scope?
6. **Output**: A conflict report with evidence from multiple sources, not just two isolated claims

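A minimal report structure for step 6. The schema is hypothetical; each field records the output of one protocol step so the verdict stays auditable:

```python
from dataclasses import dataclass, field

@dataclass
class ConflictReport:
    """Evidence trail for one suspected contradiction (hypothetical schema)."""
    claim_a_id: str
    claim_b_id: str
    prefilter_score: float                 # step 1: NLI contradiction probability
    context_a: str                         # step 2: full source paragraph, claim A
    context_b: str                         # step 2: full source paragraph, claim B
    methods_comparable: bool               # step 3: method-compatibility check
    related_claim_ids: list[str] = field(default_factory=list)  # step 4
    confidence_real_conflict: float = 0.0  # step 5
    verdict: str = "pending"               # e.g. "real_conflict", "method_difference"
```
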
**The analogy**: CLAIRE's approach is like a detective who doesn't just interview two witnesses separately — they interview 10 witnesses, check the physical evidence, review the security footage, and THEN decide if there's a real disagreement. We should be that thorough.

**Layer affected**: Layer 4 (Knowledge Graph — conflict detection upgrade)
**Priority**: 🟡 Medium — significant quality improvement for contradiction detection
**Effort**: 2-3 weeks

---

### 🟣 AD-7: CritiCal's Critique-Based Calibration → Our "Confidence Explanation Engine"

**Source system**: CritiCal
**Paper**: [arxiv:2510.24505](https://arxiv.org/abs/2510.24505)

**What they do**: Instead of just asking an AI "how confident are you?", CritiCal asks the AI to WRITE a critique of its own answer first, then derives confidence from the critique. An AI that writes "I'm not sure about the denominator" naturally produces a lower confidence than one that writes "this is clearly supported by Figure 3."

**Why this matters**: Their key finding is that the ACT of critiquing improves calibration. The AI becomes more honest about uncertainty when forced to articulate where the uncertainty comes from.

**How we adapt it**: Add a critique step to our AI Council's workflow.

After the Extractor produces a claim, but BEFORE the Critic evaluates it, require the Extractor to write a brief self-critique:

```
Claim: "The LOD was 0.8 fM"
Self-critique: "I am confident about the value (0.8 fM) because it appears
in Table 2 with clear units. I am less confident about whether this is a
universal or condition-specific finding — the paper says 'in 10 mM PBS',
which I've preserved as a qualifier."
```

This self-critique serves two purposes:
1. It makes the Extractor's uncertainty **explicit and auditable** (the researcher can read WHY the AI is uncertain)
2. It provides **structured input** for our confidence formula (the self-critique mentions preserved qualifiers, which maps to our qualifier_strength score)

**The analogy**: A student who writes "I think this is right but I'm not sure about step 3" is a BETTER student than one who just writes the answer. The self-doubt IS the learning signal.

**Layer affected**: Layer 2 (Qualified Extraction — council workflow) + Layer 5 (Scoring input)
**Priority**: 🟡 Medium — improves calibration
**Effort**: 1 week

---
## 3. Indirect Inspirations — Learn From These Approaches {#3-indirect-inspirations}

These aren't things we should build directly. They're philosophical and architectural lessons.

---

### 🟤 IN-1: PaperQA2's "Superhuman Benchmark" Philosophy

**Lesson**: PaperQA2 didn't just build a system — they proved it beats humans on specific, measurable tasks. They created the LitQA2 benchmark, ran their system against PhD-level scientists, and published the numbers.

**What we should learn**: We need to define OUR equivalent of LitQA2 — a specific task where we can measure: "Does our epistemic classification match expert human classification?" This requires:
1. 20-30 papers annotated by domain experts
2. Every claim labeled (Fact / Interpretation / Hypothesis) by at least 2 experts
3. Inter-annotator agreement measured
4. Our system evaluated against the same papers

**The goal isn't to beat humans** — it's to know WHERE we agree with humans and WHERE we disagree. The disagreements are the most interesting findings.

---

### 🟤 IN-2: KGX3's "Code Over AI" Philosophy

**Lesson**: KGX3 deliberately chose deterministic rules over stochastic AI for epistemic classification. Their paper argues that epistemic decisions should be REPRODUCIBLE — the same paper should always get the same classification. They achieved this by defining specific linguistic triggers with fixed activation thresholds.

**What we should learn**: We already share this philosophy ("AI provides components, code computes scores"). But we should go further. Every epistemic decision in our system should be decomposable into:
1. **AI-provided inputs** (what the model observed)
2. **Code-computed outputs** (what the formula calculated)
3. **Auditable reasoning** (why the formula gave this result)

If a researcher asks "why is this claim labeled Interpretation?", the answer should never be "because the AI said so." It should be: "Because the claim appeared in the Discussion section (0.75× modifier), contains the qualifier 'suggests' (-0.1 penalty), and the trigger-word analysis scored Interpretation at 0.6 vs Fact at 0.3."

---

### 🟤 IN-3: AgentSLR's "Speed × Quality" Proof of Concept

**Lesson**: AgentSLR proved that AI can do literature reviews at human quality with a 58× speedup. This isn't theoretical — they applied it to 9 WHO-priority pathogens and published the comparison.

**What we should learn**: Speed matters for adoption. If our system takes 10 hours to process 100 papers, few researchers will use it. If it takes 20 minutes, everyone will. The architecture decisions we make now (local model size, batch processing, caching) directly affect adoption.

**Specific implication**: Consider processing papers in two passes:
- **Fast pass** (minutes): Lightweight extraction using trigger words + SciBERT. Gives researchers immediate, approximate results.
- **Deep pass** (hours): Full AI Council extraction with confidence scoring. Refines the fast-pass results.

This way, researchers get SOMETHING useful immediately and the system keeps improving in the background.

---

### 🟤 IN-4: Paper Circle's "Multi-Agent Specialization" Pattern

**Lesson**: Paper Circle assigns different agent roles based on CONTENT TYPE — one agent for concepts, one for methods, one for experiments. Our Council assigns roles based on FUNCTION — planner, extractor, critic, chairman. Both patterns work.

**What we should learn**: We should combine both. Our Council should have function-specialized roles (planner, extractor, critic, chairman) AND content-specialized routing (text goes to the language model, figures go to the VLM, tables go to the structured parser, equations go to Nougat).

The content router decides WHO processes each chunk. The Council decides HOW to interpret the results.

---

### 🟤 IN-5: FactReview's "Execution as Evidence" Insight

**Lesson**: FactReview verifies empirical claims by RUNNING CODE. If a paper says "our model achieves 95% accuracy," FactReview actually executes the code to check. This is a profound epistemic innovation — it adds a non-AI evidence source that grounds claims in physical reality.

**What we should learn**: Not all evidence is text. For papers that include code, data, or computational results, we should consider:
1. **Code availability check**: Does the paper link to a working repository? (Our Layer 1 already plans this)
2. **Reproducibility flag**: Has anyone replicated the results? (CrossRef + Semantic Scholar APIs)
3. **Data availability check**: Is the underlying data accessible? (Zenodo, Figshare, Dryad checks)

We can't execute arbitrary code (too dangerous), but we CAN check whether the infrastructure for verification EXISTS. A paper whose code is on GitHub, whose data is on Zenodo, and whose results have been replicated gets a higher evidence_quality score than a paper with none of these.

---

### 🟤 IN-6: ORKG's "Human + Machine" Knowledge Graph

**Lesson**: ORKG (Open Research Knowledge Graph) at orkg.org takes a different approach from every AI system. Instead of having AI build the knowledge graph, they have HUMAN RESEARCHERS create structured comparison tables. The result is smaller but much more reliable than any AI-generated graph.

**What we should learn**: Our "human support" philosophy should include a mechanism for researchers to contribute structured knowledge BACK to the system:
1. When a researcher corrects a claim label → that correction becomes training data
2. When a researcher manually adds a graph edge → that edge has maximum provenance (human-verified)
3. When a researcher creates a comparison table → that table becomes structured input for future claims

ORKG proves that human-contributed knowledge, even in small amounts, dramatically improves graph quality. Our system already has the proposal mechanism (agents propose, humans approve). We should extend this so that human corrections flow back into both the knowledge graph AND the training pipeline.

---

### 🟤 IN-7: CLUE's "Explain the Uncertainty" Philosophy

**Source system**: CLUE
**Paper**: [arxiv:2505.17855](https://arxiv.org/abs/2505.17855)

**Lesson**: CLUE doesn't just say "confidence: 0.6". It explains WHERE the uncertainty comes from: "There are 3 supporting papers and 1 contradicting paper. The contradiction is about the dosage (0.5mg vs 1.0mg), not about the overall finding." This is dramatically more useful than a bare number.

**What we should learn**: Our 3-score system (evidence quality, truth likelihood, qualifier strength) is already a form of uncertainty decomposition. But we should go further and generate a **plain-English explanation** for every composite score:

```
Confidence: 0.72
WHY: Evidence is strong (0.90) — the paper reports n=12 with p<0.001
and clear methodology. But one other paper in the knowledge graph
contradicts this finding using a different method (truth likelihood: 0.65).
Also, the claim uses hedging language ("may reduce") suggesting the
authors themselves aren't fully certain (qualifier strength: 0.60).
```

This explanation should be auto-generated from the scoring formula's components. No extra AI calls needed — just template the decomposition.

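A minimal templating sketch. The weights and the `reasons` notes are illustrative assumptions; the real composite formula lives in Layer 5:

```python
def explain_confidence(scores: dict[str, float], reasons: dict[str, str]) -> str:
    """Template a plain-English explanation from the score components.
    `scores` holds the three dimensions; `reasons` maps each dimension to a
    note gathered during scoring. Both are hypothetical inputs."""
    # Illustrative weights; substitute the actual Layer 5 composite formula.
    composite = (0.4 * scores["evidence_quality"]
                 + 0.4 * scores["truth_likelihood"]
                 + 0.2 * scores["qualifier_strength"])
    lines = [f"Confidence: {composite:.2f}", "WHY:"]
    for dimension, value in scores.items():
        lines.append(f"- {dimension}: {value:.2f} ({reasons.get(dimension, 'no notes')})")
    return "\n".join(lines)
```
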
---

## 4. New Features Inspired by Prior Art {#4-new-features}

These are entirely new capabilities that don't exist in our current design, directly inspired by what we learned from other systems.

---

### 🆕 NF-1: Epistemic Velocity Tracking

**Inspired by**: CLAIRE's temporal analysis + PaperQA2's contradiction tracking

**What it is**: For every claim in the knowledge graph, track how its confidence has changed over time. Display this as a trend:

```
Claim: "Graphene FETs achieve LOD < 1 fM"

Confidence over time:
2022 ──── 0.50 ──── First reported, single paper
2023 ──── 0.65 ──── Replicated by 2 more labs
2024 ──── 0.80 ──── Meta-analysis confirms
2025 ──── 0.72 ──── One lab reports contradictory results
2026 ──── 0.75 ──── Contradiction explained by method difference

Status: STABLE (±0.05 over last 12 months)
Trend: Rising (0.50 → 0.75 over 4 years)
```

**Why it matters for truth-seeking**: A claim that has been steadily gaining confidence for 4 years is very different from a claim that keeps bouncing between 0.4 and 0.8. The TREND tells you more than the current number. Rising claims are being confirmed. Falling claims are being challenged. Volatile claims are actively contested — and therefore the most interesting for research.

**Implementation**: We already store timestamps on every claim and version history on canonical claims. Computing the velocity is just a query:

```python
from statistics import pstdev

def compute_epistemic_velocity(canonical_claim_id: str) -> dict:
    versions = db.get_version_history(canonical_claim_id)  # db = our claim store
    if len(versions) < 2:
        return {"trend": "insufficient_data", "stability": "unknown"}

    # Trend: least-squares slope of confidence over time, measured in years
    # so the ±0.01 thresholds below mean "one confidence point per year"
    times = [(v["date"] - versions[0]["date"]).days / 365.0 for v in versions]
    confidences = [v["confidence"] for v in versions]
    n = len(times)
    mean_t, mean_c = sum(times) / n, sum(confidences) / n
    slope = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, confidences)) / (
        sum((t - mean_t) ** 2 for t in times) or 1.0
    )

    # Stability: standard deviation of the last 3 confidence values
    stability = pstdev(confidences[-3:])

    return {
        "trend": "rising" if slope > 0.01 else "falling" if slope < -0.01 else "stable",
        "stability": "stable" if stability < 0.05 else "volatile",
        "current": confidences[-1],
        "history": list(zip([v["date"] for v in versions], confidences)),
    }
```

**Layer affected**: Layer 5 (Scoring — new temporal dimension)
**Priority**: 🟡 Medium — high value, low effort
**Effort**: 3-5 days

---

### 🆕 NF-2: Devil's Advocate Mode

**Inspired by**: CLAIRE's agentic contradiction detection + KGX3's challenge detection

**What it is**: A UI mode that actively CHALLENGES high-confidence claims.

For every claim with confidence > 0.8, the system automatically:
1. Searches the knowledge graph for the **strongest counter-evidence**
2. Finds the **nearest alternative explanation** (a different epistemic label that could fit the same data)
3. Identifies what would need to be true **for the claim to be wrong**

Display:
```
📋 Claim: "Graphene FETs achieve LOD < 1 fM" (Confidence: 0.88)

😈 Devil's Advocate:
Counter-evidence: Park 2024 reports LOD of 5 fM using identical
materials but different buffer conditions (comparability: 0.6)

Alternative label: This could be an INTERPRETATION rather than a FACT
if the LOD calculation method (3σ/slope) is disputed — two papers in
the graph use different LOD calculation methods.

What would make this wrong: If the buffer conditions (10 mM PBS) are
critical to the result, the claim should have the qualifier
"in 10 mM PBS only" — which would reduce qualifier_strength by 0.1.
```

**Why it matters for truth-seeking**: The most dangerous state for a researcher is high confidence with hidden weaknesses. Devil's Advocate mode forces the system to ACTIVELY LOOK for reasons to doubt its own outputs. This is the essence of scientific skepticism.

**Layer affected**: Cross-cutting (UI + Layer 4 + Layer 5)
**Priority**: 🟡 Medium — high value for truth-seeking
**Effort**: 1-2 weeks

---

### 🆕 NF-3: Epistemic Provenance Levels (Human Verification Tracking)

**Inspired by**: ORKG's human contribution model + Paper Circle's verification status

**What it is**: Every claim in the knowledge graph gets a human-verification level:

| Level | Icon | Meaning | How It Happens |
|-------|------|---------|----------------|
| 0 | 🤖 | Machine-extracted, unreviewed | Default when claim is first extracted |
| 1 | 🟡 | Machine-extracted, human-spot-checked | Researcher saw it in the dashboard, didn't flag an error |
| 2 | 🟢 | Human-verified claim text AND label | Researcher explicitly confirmed text + Fact/Interpretation/Hypothesis |
| 3 | ⭐ | Expert-verified with domain knowledge | Domain expert reviewed with full context |
| 4 | 📄 | Published / peer-reviewed | Claim appears in a peer-reviewed paper's verified dataset |

**Rules** (enum and gates sketched after this list):
- Level 0 claims CANNOT be cited in exports without a warning banner
- Level 0 claims CANNOT generate "supports" edges in the knowledge graph (they can only be the TARGET of edges, not the SOURCE)
- Level 2+ claims can be used as training data for the next model version
- Level 3+ claims become part of the gold standard evaluation dataset

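A minimal encoding of the levels and the gating rules (the dict-based claim schema is illustrative):

```python
from enum import IntEnum

class Provenance(IntEnum):
    MACHINE = 0          # 🤖 machine-extracted, unreviewed
    SPOT_CHECKED = 1     # 🟡 seen by a researcher, not flagged
    HUMAN_VERIFIED = 2   # 🟢 text AND label explicitly confirmed
    EXPERT_VERIFIED = 3  # ⭐ domain expert review with full context
    PEER_REVIEWED = 4    # 📄 part of a peer-reviewed verified dataset

def can_source_edges(claim: dict) -> bool:
    """Level 0 claims may only be edge TARGETS, never SOURCES."""
    return claim["provenance"] > Provenance.MACHINE

def usable_for_training(claim: dict) -> bool:
    """Only Level 2+ claims feed the next model version."""
    return claim["provenance"] >= Provenance.HUMAN_VERIFIED

def in_gold_standard(claim: dict) -> bool:
    """Level 3+ claims join the gold standard evaluation dataset."""
    return claim["provenance"] >= Provenance.EXPERT_VERIFIED
```
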
**Why it matters for truth-seeking**: A system that tells you "this claim was extracted 30 seconds ago and no human has looked at it" is much more honest than one that presents all claims with equal visual weight. The verification level IS the epistemic honesty.

**Layer affected**: Cross-cutting (all layers + UI)
**Priority**: 🟡 Medium — fundamental for trust
**Effort**: 1 week

---

### 🆕 NF-4: Confidence Decomposition Display

**Inspired by**: CLUE's uncertainty explanation + our own 3-score system

**What it is**: When showing a claim's confidence, display the full decomposition visually:

```
Claim: "The treatment reduced tumor size by 40%"
Epistemic Tag: Fact
Composite Confidence: 0.72

Evidence Quality:   0.90  ████████████████████░░
   (n=12, p<0.001, Results section, T1 journal)

Truth Likelihood:   0.65  ██████████████░░░░░░░░
   (one contradicting study: Kim 2024 found only 15% reduction)

Qualifier Strength: 0.60  █████████████░░░░░░░░░
   (hedged: "in the murine model", "under these conditions")
────────────────────────────
Composite:          0.72  ████████████████░░░░░░

⚠️ The truth likelihood is dragged down by Kim 2024.
   Click to see the conflict in the Courtroom UI.
```

**Why it matters for truth-seeking**: The number 0.72 by itself is meaningless. The decomposition tells the researcher WHAT to investigate: "The evidence is strong, but there's a contradicting study, and the authors used hedging language." Each dimension points to a different action.

**Layer affected**: Layer 7 (UI — Obsidian export + Gradio display)
**Priority**: 🟡 Medium — already in our design, needs implementation
**Effort**: 3-5 days (UI components)

---

## 5. Integration Map — Where Each Inspiration Fits {#5-integration-map}

```
PDF Bundle arrives
      │
      ▼
LAYER 0: Structural Ingestion
  ├── 🔵 DA-4: Nougat for equations
  ├── (Marker + GROBID already planned)
  └── Output: structured regions with bbox
      │
      ▼
LAYER 1: Entity Resolution
  ├── 🟤 IN-5: FactReview's code/data availability checks
  └── (existing citation + retraction checks)
      │
      ▼
LAYER 2: Qualified Extraction
  ├── 🟣 AD-1: PaperQA2's RCS → Pre-Extraction Filter (before Council)
  ├── 🟣 AD-7: CritiCal's Critique → Self-Critique step (during Council)
  ├── 🟣 AD-3: KGX3's Language Filters → Epistemic Trigger Words (validation)
  ├── 🟣 AD-4: Paper Circle's Coverage Checker → Completeness Auditor (after Council)
  ├── 🟣 AD-5: PaperQA2's Refuse → Low Confidence Quarantine (output filtering)
  └── 🆕 NF-3: Epistemic Provenance Levels (applied to every claim)
      │
      ▼
LAYER 3: Canonicalization
  ├── 🔵 DA-1: SPECTER2 for claim embeddings (replaces word overlap)
  └── (existing temporal versioning)
      │
      ▼
LAYER 4: Knowledge Graph
  ├── 🔵 DA-5: SciBERT-NLI as fast contradiction pre-filter
  ├── 🟣 AD-2: FactReview's Dual Evidence → Cross-Reference Verification
  ├── 🟣 AD-6: CLAIRE's Investigation Loop → Conflict Investigation Protocol
  └── 🆕 NF-2: Devil's Advocate Mode (cross-cutting with UI)
      │
      ▼
LAYER 5: Calibrated Scoring
  ├── 🆕 NF-1: Epistemic Velocity Tracking (temporal confidence trends)
  ├── 🆕 NF-4: Confidence Decomposition Display
  └── 🟤 IN-7: CLUE's uncertainty explanation → auto-generated score explanations
      │
      ▼
LAYER 6: Evaluation
  ├── 🔵 DA-2: SciFact as evaluation benchmark
  ├── 🔵 DA-3: SciRIFF for training data
  └── 🟤 IN-1: PaperQA2's superhuman benchmark philosophy → build our own epistemic benchmark
      │
      ▼
LAYER 7: Provenance & Export
  └── 🟤 IN-6: ORKG's human contribution model → corrections flow back to training
```

---

## 6. Priority Execution Order {#6-priority-order}

### Phase I: Foundations (Do These First — Everything Else Depends on Them)

| # | Item | Type | Source System | Effort | Impact |
|---|------|------|---------------|--------|--------|
| 1 | DA-1: SPECTER2 for embeddings | 🔵 Direct | AllenAI | 1-2 days | Fixes deduplication completely |
| 2 | DA-2: SciFact benchmark | 🔵 Direct | AllenAI | 1 day | Gives us quality measurement |
| 3 | DA-3: SciRIFF training data | 🔵 Direct | AllenAI | 2-3 days | 72× more training examples |
| 4 | DA-4: Nougat integration | 🔵 Direct | Meta | 2-3 days | Fixes equation parsing |

### Phase II: Quality Improvements (Build These Next)

| # | Item | Type | Source System | Effort | Impact |
|---|------|------|---------------|--------|--------|
| 5 | DA-5: SciBERT-NLI pre-filter | 🔵 Direct | AllenAI + Sarti | 1-2 days | Speeds up conflict detection 100× |
| 6 | AD-1: Pre-Extraction Filter (RCS) | 🟣 Adapt | PaperQA2 | 1 week | Reduces noise in extraction |
| 7 | AD-3: Epistemic Trigger Words | 🟣 Adapt | KGX3 | 1 week | Code-based epistemic validation |
| 8 | AD-5: Low Confidence Quarantine | 🟣 Adapt | PaperQA2 | 3-5 days | Prevents false confidence |

### Phase III: Deep Improvements (Build These When Core Is Stable)

| # | Item | Type | Source System | Effort | Impact |
|---|------|------|---------------|--------|--------|
| 9 | AD-2: Cross-Reference Verification | 🟣 Adapt | FactReview | 1-2 weeks | Reduces hallucination |
| 10 | AD-4: Completeness Auditor | 🟣 Adapt | Paper Circle | 1 week | Prevents silent omission |
| 11 | AD-7: Self-Critique step | 🟣 Adapt | CritiCal | 1 week | Improves calibration |
| 12 | NF-3: Epistemic Provenance Levels | 🆕 New | ORKG + Paper Circle | 1 week | Human verification tracking |

### Phase IV: Advanced Features (Build These for PhD Year 2+)

| # | Item | Type | Source System | Effort | Impact |
|---|------|------|---------------|--------|--------|
| 13 | AD-6: Conflict Investigation Protocol | 🟣 Adapt | CLAIRE | 2-3 weeks | Thorough contradiction analysis |
| 14 | NF-1: Epistemic Velocity | 🆕 New | CLAIRE + PaperQA2 | 3-5 days | Temporal confidence trends |
| 15 | NF-2: Devil's Advocate Mode | 🆕 New | CLAIRE + KGX3 | 1-2 weeks | Active skepticism |
| 16 | NF-4: Confidence Decomposition Display | 🆕 New | CLUE | 3-5 days | Transparent scoring |

### Total Estimated Effort: ~12-16 weeks (spread across the PhD timeline)

---

## Summary: The Steal Sheet

| What We're Stealing | From Whom | Why It's Fair |
|---------------------|-----------|---------------|
| SPECTER2 embeddings | AllenAI | Open-source, Apache 2.0 |
| SciFact benchmark | AllenAI | Open-source, CC-BY |
| SciRIFF training data | AllenAI | Open-source, ODC-BY |
| Nougat PDF parser | Meta Research | Open-source, MIT |
| SciBERT-NLI model | Gabriele Sarti | Open-source |
| RCS pattern | PaperQA2 | Published technique, not code |
| Language filter approach | KGX3 | Published methodology |
| Coverage checker pattern | Paper Circle | Published architecture |
| Refuse-to-answer pattern | PaperQA2 | Published behavior |
| Investigation loop pattern | CLAIRE | Published methodology |
| Critique calibration technique | CritiCal | Published technique |
| Human contribution philosophy | ORKG | Published approach |
| Uncertainty explanation approach | CLUE | Published philosophy |

**Every adoption is either open-source with permissive licensing or a published technique (not code).** We're standing on the shoulders of giants — and giving them credit.

---

*Document compiled 2026-04-23.*
*All source systems, papers, and resources have been verified as available.*
*This document is a living roadmap — update as new systems appear.*