
PhD Research OS β€” System Inspirations

What to Steal, What to Adapt, and What to Learn From

Date: 2026-04-23
Status: ACTIONABLE — Every item maps to a specific layer, priority, and source system
Written for: Anyone with a high school education
Purpose: Turn the Prior Art Analysis into concrete things we should actually DO


What Is This Document?

The PRIOR_ART_ANALYSIS.md tells you WHO has built what. This document tells you WHAT we should do about it. Every item here is either:

  • πŸ”΅ DIRECT ADOPTION β€” Use this system/tool/dataset exactly as-is, plugged into our pipeline
  • 🟣 ADAPTATION β€” Take the idea or technique and rebuild it to fit our specific needs
  • 🟀 INDIRECT INSPIRATION β€” Learn from their approach without copying their code

Think of it like cooking. Direct adoption = buying a jar of pasta sauce. Adaptation = taking someone's recipe but adjusting the spices. Inspiration = visiting a great restaurant and getting ideas for your own menu.


Table of Contents

  1. Direct Adoptions β€” Plug These In
  2. Adaptations β€” Rebuild These for Our Needs
  3. Indirect Inspirations β€” Learn From These Approaches
  4. New Features Inspired by Prior Art
  5. Integration Map β€” Where Each Inspiration Fits
  6. Priority Execution Order

1. Direct Adoptions β€” Plug These In {#1-direct-adoptions}

These are tools, datasets, and models that we should use EXACTLY as they are. No modification needed. Just install and connect.


πŸ”΅ DA-1: SPECTER2 for Claim Embeddings (Layer 3: Deduplication)

Source system: AllenAI / Semantic Scholar What to adopt: allenai/specter2_base with the proximity adapter

The problem we have: Our deduplication system uses Jaccard word overlap. "The LOD was 0.8 fM" and "A detection limit of 800 attomolar was achieved" share almost no words but mean the SAME thing. Our current system thinks they're different. That's like a librarian who thinks "dog" and "canine" are different animals.

The fix: SPECTER2 converts any scientific sentence into a list of 768 numbers (an "embedding"). Sentences that mean the same thing get similar numbers, even if they use completely different words. It was trained on millions of scientific papers, so it actually understands scientific vocabulary.

How to plug it in:

# Replace word-overlap deduplication with embedding-based similarity
from transformers import AutoTokenizer
from adapters import AutoAdapterModel
import torch

# Load once at startup
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
model = AutoAdapterModel.from_pretrained('allenai/specter2_base')
model.load_adapter("allenai/specter2", source="hf", set_active=True)

def embed_claim(text: str) -> list[float]:
    """Convert a claim into a 768-dimensional vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :].squeeze().tolist()

def are_duplicates(claim_a: str, claim_b: str, threshold: float = 0.85) -> bool:
    """Check if two claims are semantically duplicate."""
    emb_a = embed_claim(claim_a)
    emb_b = embed_claim(claim_b)
    cosine_sim = sum(a*b for a,b in zip(emb_a, emb_b)) / (
        sum(a**2 for a in emb_a)**0.5 * sum(b**2 for b in emb_b)**0.5
    )
    return cosine_sim > threshold

Analogy: This is like upgrading from a librarian who only matches exact book titles to a librarian who actually reads the books and knows which ones are about the same topic.

Layer affected: Layer 3 (Canonicalization) Priority: πŸ”΄ Critical β€” do this first Effort: 1-2 days (model download + integration)


πŸ”΅ DA-2: SciFact as Our Evaluation Benchmark (Layer 6: Evaluation)

Source system: AllenAI What to adopt: bigbio/scifact

The problem we have: We have 143 tests that check if the code runs. Zero tests that check if the SCIENCE is right. That's like testing a calculator by checking if the buttons click, but never checking if 2+2 actually equals 4.

The fix: SciFact contains 1,400 scientific claims with expert labels (SUPPORTS / REFUTES / NOINFO) and rationale sentences. Feed our extraction pipeline the same claims and check if our labels match the expert labels. Instant quality measurement.

How to plug it in:

from datasets import load_dataset

# Load the SciFact benchmark (pass a specific config name if the loader asks for one)
scifact = load_dataset("bigbio/scifact")

# Score our pipeline against the expert labels on the test split
results = []
for example in scifact["test"]:
    claim_text = example["text_1"]     # The scientific claim
    evidence_text = example["text_2"]  # The abstract/evidence
    expert_label = example["label"]    # SUPPORTS, REFUTES, or NOINFO

    # Run our pipeline on the same inputs
    our_label = our_pipeline.extract_and_classify(claim_text, evidence_text)

    # Compare: do we agree with the expert?
    results.append(our_label == expert_label)

accuracy = sum(results) / len(results)
print(f"Agreement with expert labels: {accuracy:.1%}")

Analogy: SciFact is like the answer key for a standardized test. We can't know how well our system works until we check it against the answer key.

Layer affected: Layer 6 (Evaluation) Priority: πŸ”΄ Critical β€” do this alongside SPECTER2 Effort: 1 day (dataset download + evaluation script)


πŸ”΅ DA-3: SciRIFF for Training Data (Training Pipeline)

Source system: AllenAI What to adopt: allenai/SciRIFF

The problem we have: Our training data is 1,900 examples generated by a Python template script. SciRIFF has 137,000 examples written by human experts across 54 scientific tasks. That's 72Γ— more data, and it's real human expertise, not computer-generated patterns.

The fix: Use SciRIFF as supplementary training data alongside our existing dataset. The SciRIFF paper specifically found that expert-written examples outperform GPT-4-generated examples β€” confirming that our synthetic data is a weakness.

Key tasks from SciRIFF that map to our pipeline:

| SciRIFF Task | Our Pipeline Layer | How It Helps |
|---|---|---|
| Claim verification (from SciFact) | Layer 2 + Layer 4 | Teaches claim extraction + supports/refutes labeling |
| Information extraction (from SciERC) | Layer 2 | Teaches entity + relation extraction from papers |
| NER (from various sources) | Layer 1 | Teaches entity recognition |
| Summarization | Layer 2 | Teaches faithful text compression |
| Question answering (from QASPER) | Cross-cutting | Teaches reading comprehension of full papers |

Analogy: We're currently training our student using a workbook where all the practice problems were invented by a computer. SciRIFF is like getting 137,000 practice problems written by real teachers. The student will learn much better.

Layer affected: Training pipeline (all stages) Priority: πŸ”΄ Critical β€” do this before next training run Effort: 2-3 days (dataset conversion to our format + integration)


πŸ”΅ DA-4: Nougat for Equation Parsing (Layer 0: Ingestion)

Source system: Meta/Facebook Research What to adopt: facebook/nougat-base

The problem we have: Our parser turns equations into garbage. A paper that says "E = mcΒ²" gets turned into "E mc 2" or worse. For papers in physics, chemistry, and engineering, equations ARE the findings.

The fix: Nougat was specifically trained to convert scientific PDFs into markdown WITH proper LaTeX equations. It reads the visual layout of the PDF and outputs structured text that preserves math symbols, subscripts, superscripts, and equation structure.

Analogy: Our current parser is like a translator who speaks English but not math. Nougat speaks both fluently.
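How it might plug in, as a minimal sketch using the Hugging Face transformers port of Nougat. The pdf2image rasterization step, the page-at-a-time routing, and the generation length are assumptions; Layer 0's own region detection would decide which pages or regions actually get sent here.

# Minimal sketch: transcribe one PDF page to markdown-with-LaTeX via Nougat
# (pdf2image is one way to rasterize a page; swap in whatever renderer Layer 0 uses)
from pdf2image import convert_from_path
from transformers import NougatProcessor, VisionEncoderDecoderModel
import torch

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

def pdf_page_to_markdown(pdf_path: str, page_number: int = 1) -> str:
    """Render one PDF page to an image, then let Nougat transcribe it."""
    image = convert_from_path(pdf_path, first_page=page_number, last_page=page_number)[0]
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        outputs = model.generate(pixel_values, max_new_tokens=1024)  # length is a placeholder
    text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return processor.post_process_generation(text, fix_markdown=True)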

Layer affected: Layer 0 (Structural Ingestion) Priority: πŸ”΄ Critical β€” already in our design, needs actual integration Effort: 2-3 days (model integration + routing logic for equation regions)


πŸ”΅ DA-5: SciBERT-NLI as a Fast Contradiction Pre-Filter (Layer 4: Knowledge Graph)

Source system: AllenAI (base) + Gabriele Sarti (NLI fine-tuning) What to adopt: gsarti/scibert-nli

The problem we have: Our conflict detection checks all claim pairs using expensive LLM calls. With 1,000 claims, that's 499,500 pairs. Even checking only the top 500 claims is slow.

The fix: SciBERT-NLI is a small, fast model specifically trained to detect entailment and contradiction in scientific text. Use it as a pre-filter: run all pairs through SciBERT-NLI (fast, cheap), then only send the pairs flagged as potential contradictions to the expensive main model.

Analogy: Instead of having a doctor examine every person in a city (expensive), first have nurses do quick screenings (cheap), then only send people with concerning results to the doctor.
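A minimal sketch of the pre-filter routing. The nli_contradiction_score helper is a placeholder for however the SciBERT-NLI checkpoint ends up being invoked, and the 0.5 cutoff is an assumption to tune; the point is the cheap-screen-then-escalate structure.

# Sketch of the pre-filter: screen every pair cheaply, escalate only the suspicious ones
from itertools import combinations

CONTRADICTION_THRESHOLD = 0.5  # assumed cutoff; tune against labelled conflict pairs

def find_candidate_conflicts(claims: list[dict]) -> list[tuple[dict, dict]]:
    """Run every claim pair through the cheap NLI model; keep only suspicious pairs."""
    candidates = []
    for claim_a, claim_b in combinations(claims, 2):
        # nli_contradiction_score is a placeholder for the fast local SciBERT-NLI call
        score = nli_contradiction_score(claim_a["text"], claim_b["text"])
        if score >= CONTRADICTION_THRESHOLD:
            candidates.append((claim_a, claim_b))
    return candidates

# Only the survivors go to the expensive Council-level check, e.g.:
# conflicts = [pair for pair in find_candidate_conflicts(all_claims)
#              if council.confirm_contradiction(*pair)]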

Layer affected: Layer 4 (Knowledge Graph β€” conflict detection) Priority: 🟠 High β€” significant speed improvement Effort: 1-2 days (model download + pre-filter pipeline)


2. Adaptations β€” Rebuild These for Our Needs {#2-adaptations}

These are techniques and patterns from other systems that we should adapt and rebuild, not copy directly.


🟣 AD-1: PaperQA2's RCS Pattern β†’ Our "Pre-Extraction Filter"

Source system: PaperQA2 (FutureHouse) Paper: arxiv:2409.13740

What they do: Before generating an answer, PaperQA2 runs RCS (Rerank + Contextual Summarize):

  1. Retrieve many text chunks from papers
  2. Have the AI rerank them by relevance (throw away the irrelevant ones)
  3. Have the AI write a contextual summary of each remaining chunk
  4. Generate the final answer using ONLY the summaries

Why this matters: PaperQA2 found that the RCS step is the single biggest contributor to answer quality. Without it, the model drowns in irrelevant text. With it, the model sees only the information that matters.

How we adapt it: Before our AI Council processes a paper section, add a pre-filtering step:

  1. After Layer 0 parsing, get all text chunks from a section
  2. Have a lightweight model (SciBERT) score each chunk for "claim density" β€” how many extractable findings does this chunk contain?
  3. Only send high-density chunks to the expensive Council extraction
  4. Low-density chunks (boilerplate methodology, acknowledgments, generic intro text) get a lightweight extraction or skip

The analogy: When studying for an exam, you don't re-read the entire textbook. You first skim to find the important sections, highlight them, then study the highlights deeply. RCS is the highlighting step.

Our adaptation vs. their original: PaperQA2 uses RCS for question-answering. We adapt it for claim extraction. Different task, same principle β€” filter before you analyze.
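A minimal sketch of the filter, assuming a score_claim_density helper (the lightweight SciBERT-style scorer from step 2 above) and an arbitrary 0.4 threshold:

# Sketch of the pre-extraction filter: split Layer 0 chunks into two paths
DENSITY_THRESHOLD = 0.4  # assumed cutoff; calibrate on a handful of annotated papers

def filter_chunks_for_council(chunks: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route dense chunks to the full Council, sparse chunks to the lightweight path."""
    dense, sparse = [], []
    for chunk in chunks:
        # score_claim_density is a placeholder for the lightweight scorer
        density = score_claim_density(chunk["text"])
        if density >= DENSITY_THRESHOLD:
            dense.append(chunk)    # full AI Council extraction
        else:
            sparse.append(chunk)   # lightweight extraction or skip
    return dense, sparse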

Layer affected: Layer 2 (Qualified Extraction β€” pre-processing step) Priority: 🟠 High β€” significant quality improvement Effort: 1 week


🟣 AD-2: FactReview's Dual Evidence Pattern β†’ Our "Cross-Reference Verification"

Source system: FactReview Paper: arxiv:2604.04074

What they do: For every claim in a paper, FactReview checks it against TWO sources:

  1. The paper's own evidence (does the data in the paper actually support this claim?)
  2. External literature (do other papers agree or disagree?)

Why this matters: A claim might look strong within its own paper but be contradicted by 10 other papers. Or a claim might look weak in isolation but be supported by a mountain of external evidence. You need both perspectives.

How we adapt it: After Layer 2 extraction produces claims from a single paper, add a cross-reference step before Layer 4 scoring:

  1. For each extracted claim, check: does the claim's evidence (source quote, table reference, figure reference) actually support what the claim says? (Internal check)
  2. Then check: does the claim align with or contradict the existing knowledge graph? (External check)
  3. If internal and external evidence disagree, flag for human review

The analogy: FactReview is like a detective who interviews both the suspect (the paper) AND the witnesses (other papers). We should do the same β€” don't trust what a paper says about itself without checking the neighbors.
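A minimal sketch of the dual check. The evidence_supports_claim and graph_agreement_score helpers are placeholders for the internal and external checks described in the steps above:

# Sketch of the dual evidence check: paper-internal evidence vs. the knowledge graph
def cross_reference_verify(claim: dict, knowledge_graph) -> dict:
    """Check a claim against its own evidence AND the existing knowledge graph."""
    # Internal check: does the quoted evidence actually support the claim text?
    internal_ok = evidence_supports_claim(claim["text"], claim["evidence"])
    # External check: does the graph broadly agree (+1.0) or contradict (-1.0)?
    external_score = graph_agreement_score(claim, knowledge_graph)

    # Disagreement between the two evidence sources is the review trigger
    disagreement = (internal_ok and external_score < 0) or (not internal_ok and external_score > 0)
    return {
        "internal_supported": internal_ok,
        "external_agreement": external_score,
        "flag_for_human_review": disagreement,
    }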

Layer affected: Between Layer 2 and Layer 4 (new verification step) Priority: 🟠 High β€” reduces hallucination Effort: 1-2 weeks


🟣 AD-3: KGX3's Deterministic Language Filters β†’ Our "Epistemic Trigger Words"

Source system: KGX3 / iKuhn Paper: arxiv:2002.03531

What they do: KGX3 doesn't ask an AI "is this a confirmation or a challenge?" Instead, they have specific word patterns (called "language-game filters") that detect each epistemic category:

  • "consistent with previous findings" β†’ Confirms
  • "contrary to expectations" β†’ Challenges
  • "first time that" β†’ Novel
  • "replication of" β†’ Confirms

Each filter has a fixed activation threshold (ΞΈ=0.7), validated across multiple scientific disciplines.

Why this matters: Rules-based classification is deterministic β€” it gives the same answer every time, and you can audit exactly WHY a claim was classified a certain way. AI-based classification is stochastic β€” it might give different answers on different runs, and you can't fully explain why.

How we adapt it: Build a set of "epistemic trigger word" dictionaries for our three categories:

FACT_TRIGGERS = {
    "strong": ["demonstrated", "measured", "observed", "detected", "confirmed",
               "showed that", "resulted in", "was found to be", "achieved"],
    "moderate": ["correlated with", "associated with", "was consistent with"],
}

INTERPRETATION_TRIGGERS = {
    "strong": ["suggests that", "indicates that", "implies", "may be attributed to",
               "could be explained by", "appears to", "is likely due to"],
    "moderate": ["consistent with", "in line with", "supports the hypothesis"],
}

HYPOTHESIS_TRIGGERS = {
    "strong": ["may", "might", "could potentially", "we hypothesize", 
               "it is possible that", "remains to be determined", "future work"],
    "moderate": ["we propose", "we speculate", "it is conceivable"],
}

# Map each category to its trigger dictionary so we can loop over all three
TRIGGER_SETS = {
    "Fact": FACT_TRIGGERS,
    "Interpretation": INTERPRETATION_TRIGGERS,
    "Hypothesis": HYPOTHESIS_TRIGGERS,
}

def compute_epistemic_trigger_score(claim_text: str, source_section: str) -> tuple[str, dict]:
    """
    Code-computed epistemic classification based on trigger words.
    This runs ALONGSIDE the AI classification as a validator.
    If the AI says 'Fact' but the trigger words say 'Hypothesis',
    flag for human review.
    """
    scores = {"Fact": 0.0, "Interpretation": 0.0, "Hypothesis": 0.0}
    text_lower = claim_text.lower()

    for category, triggers in TRIGGER_SETS.items():
        for trigger in triggers["strong"]:
            if trigger in text_lower:
                scores[category] += 0.3
        for trigger in triggers["moderate"]:
            if trigger in text_lower:
                scores[category] += 0.15

    # Section modifier (from Epistemic Separation Engine)
    if source_section == "abstract":
        scores["Interpretation"] += 0.2  # Bias toward Interpretation for abstracts
    elif source_section == "results":
        scores["Fact"] += 0.2  # Bias toward Fact for results

    # Return the highest-scoring category plus the full score breakdown
    return max(scores, key=scores.get), scores

Use this as a validator β€” if the AI Council says "Fact" but the trigger analysis says "Hypothesis," the disagreement IS the signal. Flag it for human review.
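A small sketch of that validator hook, where council_label is whatever label the AI Council produced:

# Using the trigger analysis as a validator against the Council's label
def validate_council_label(claim_text: str, source_section: str, council_label: str) -> dict:
    trigger_label, trigger_scores = compute_epistemic_trigger_score(claim_text, source_section)
    return {
        "council_label": council_label,
        "trigger_label": trigger_label,
        "trigger_scores": trigger_scores,
        # The disagreement IS the signal: route to human review
        "flag_for_human_review": council_label != trigger_label,
    }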

The analogy: KGX3's approach is like a grammar checker β€” it uses specific rules to flag errors. Our AI Council is like a writing tutor β€” it uses judgment. We should use BOTH. The grammar checker catches things the tutor misses, and vice versa.

Layer affected: Layer 2 (Qualified Extraction β€” validation step) + Layer 5 (Scoring) Priority: 🟠 High β€” improves epistemic accuracy Effort: 1 week


🟣 AD-4: Paper Circle's Coverage Checker β†’ Our "Completeness Auditor"

Source system: Paper Circle Paper: arxiv:2604.06170

What they do: After all extraction agents have finished, a "Coverage Checker" agent reviews the results and asks: "Did we miss anything? Are there figures without linked concepts? Are there methods in the text that we didn't extract? Are there experimental results that nobody captured?"

Why this matters: The most dangerous failure in any extraction system is silent omission β€” the system looks like it's done, but it actually missed the most important finding in the paper. Without a coverage check, you'd never know.

How we adapt it: After the AI Council finishes extraction (end of Layer 2), add a completeness audit:

  1. Section coverage: Did we extract at least one claim from every Results subsection? If not, why not? Flag unextracted sections.
  2. Figure/table coverage: Every figure and table should be referenced by at least one claim. Orphaned figures/tables suggest missed findings.
  3. Statistical coverage: If the paper reports N, p-values, effect sizes, or confidence intervals, at least one claim should capture them. Unreferenced statistics suggest missed quantitative findings.
  4. Citation coverage: Claims that cite other papers should be tagged as inherited citations. Untagged citations suggest the system confused someone else's finding for this paper's finding.

The analogy: After a team of movers loads a truck, a supervisor walks through the house checking every room. "Did we get the bedroom? Kitchen? Garage? Anything left behind?" The Coverage Checker is that supervisor.
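A minimal sketch covering the first two audit checks (the paper and claim field names are assumptions):

# Sketch of the completeness audit: return human-readable warnings, never silently pass
def audit_coverage(paper: dict, claims: list[dict]) -> list[str]:
    warnings = []

    # 1. Section coverage: every Results subsection should yield at least one claim
    extracted_sections = {c["source_section"] for c in claims}
    for section in paper["results_subsections"]:
        if section not in extracted_sections:
            warnings.append(f"No claims extracted from Results subsection '{section}'")

    # 2. Figure/table coverage: orphaned figures and tables suggest missed findings
    referenced = {ref for c in claims for ref in c.get("figure_table_refs", [])}
    for fig in paper["figures_and_tables"]:
        if fig not in referenced:
            warnings.append(f"{fig} is not referenced by any claim")

    return warnings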

Layer affected: Layer 2 (Qualified Extraction β€” post-processing step) Priority: 🟑 Medium β€” prevents silent omission Effort: 1 week


🟣 AD-5: PaperQA2's "Refuse to Answer" β†’ Our "Low Confidence Quarantine"

Source system: PaperQA2 (FutureHouse) Paper: arxiv:2409.13740

What they do: When PaperQA2 doesn't have enough evidence to answer a question, it says "I cannot answer this question with the available evidence." It literally refuses rather than making something up.

Why this matters: A truth-seeking system that ALWAYS produces output is lying. Real scientists say "I don't know" constantly β€” it's the most honest and most important thing they say. Our system should too.

How we adapt it: Add a confidence quarantine zone:

  1. At extraction time: If the AI Council's composite confidence for a claim is below 0.3, the claim goes into a "Low Confidence Quarantine" queue instead of the main knowledge graph.
  2. In the UI: Quarantined claims are visible in a separate tab labeled "⚠️ Uncertain β€” Needs Evidence." They are NOT shown in the main dashboard.
  3. In exports: Quarantined claims are excluded from Obsidian exports, CSV downloads, and BibTeX files by default. A user can manually include them with a checkbox: "☐ Include uncertain claims (confidence < 0.3)."
  4. In the knowledge graph: Quarantined claims are NOT connected to other claims via supports/refutes edges. They exist as isolated nodes in a "pending" state.

The analogy: A weather forecaster who says "it might rain tomorrow" is more honest than one who says "it WILL rain tomorrow" when they're not sure. Our system should be the honest forecaster.
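A minimal sketch of the routing rule, using the 0.3 threshold from the list above; the claim field names and graph/queue methods are assumptions standing in for existing components.

# Sketch of the quarantine routing rule
QUARANTINE_THRESHOLD = 0.3

def route_claim(claim: dict, knowledge_graph, quarantine_queue) -> None:
    """Low-confidence claims become isolated 'pending' nodes instead of graph citizens."""
    if claim["composite_confidence"] < QUARANTINE_THRESHOLD:
        claim["status"] = "pending"
        quarantine_queue.append(claim)               # visible only in the Uncertain tab
        knowledge_graph.add_isolated_node(claim)     # no supports/refutes edges
    else:
        knowledge_graph.add_claim_with_edges(claim)  # normal path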

Layer affected: Layer 2, Layer 4, Layer 7 (cross-cutting) Priority: 🟠 High β€” prevents false confidence Effort: 3-5 days


🟣 AD-6: CLAIRE's Agentic Contradiction Loop β†’ Our "Conflict Investigation Protocol"

Source system: CLAIRE (Stanford) Paper: arxiv:2509.23233

What they do: Instead of just comparing two claims and saying "these contradict," CLAIRE runs an investigation:

  1. Extract a claim
  2. Search the entire corpus for related claims
  3. For each related claim, use AI reasoning to determine: agree, disagree, or unrelated?
  4. If disagree, retrieve MORE evidence from surrounding context to confirm the contradiction isn't just a misunderstanding
  5. Only then flag it as a real contradiction

How we adapt it: Upgrade our conflict detection from simple pairwise comparison to an investigation protocol:

  1. Pre-filter (DA-5: SciBERT-NLI): Quick check β€” do these two claims even potentially disagree?
  2. Context retrieval: For both claims, retrieve the full paragraph they came from (not just the claim text)
  3. Method check: Do both claims use comparable methods? (Our existing method-compatibility layer)
  4. Evidence deepening: Search for other claims in the knowledge graph that relate to EITHER claim. Do they provide additional evidence for or against the contradiction?
  5. Confidence assessment: How confident are we that this is a REAL contradiction vs. a difference in methods, definitions, or scope?
  6. Output: A conflict report with evidence from multiple sources, not just two isolated claims

The analogy: CLAIRE's approach is like a detective who doesn't just interview two witnesses separately β€” they interview 10 witnesses, check the physical evidence, review the security footage, and THEN decide if there's a real disagreement. We should be that thorough.
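A minimal sketch of the protocol as a chain of the components above; every helper name here is a placeholder for a piece described elsewhere in this document, and the 0.7 verdict cutoff is an assumption.

# Sketch of the conflict investigation protocol, escalating from cheap to thorough
def investigate_conflict(claim_a: dict, claim_b: dict, graph) -> dict:
    if not nli_prefilter_flags_conflict(claim_a, claim_b):            # step 1 (DA-5)
        return {"verdict": "no_conflict"}

    context_a = retrieve_source_paragraph(claim_a)                     # step 2
    context_b = retrieve_source_paragraph(claim_b)

    if not methods_comparable(claim_a, claim_b):                       # step 3
        return {"verdict": "method_difference", "contexts": (context_a, context_b)}

    related = graph.related_claims(claim_a) + graph.related_claims(claim_b)       # step 4
    confidence = assess_contradiction_confidence(context_a, context_b, related)   # step 5

    return {                                                           # step 6
        "verdict": "contradiction" if confidence > 0.7 else "uncertain",
        "confidence": confidence,
        "supporting_evidence": related,
    }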

Layer affected: Layer 4 (Knowledge Graph β€” conflict detection upgrade) Priority: 🟑 Medium β€” significant quality improvement for contradiction detection Effort: 2-3 weeks


🟣 AD-7: CritiCal's Critique-Based Calibration β†’ Our "Confidence Explanation Engine"

Source system: CritiCal Paper: arxiv:2510.24505

What they do: Instead of just asking an AI "how confident are you?", CritiCal asks the AI to WRITE a critique of its own answer first, then derives confidence from the critique. An AI that writes "I'm not sure about the denominator" naturally produces a lower confidence than one that writes "this is clearly supported by Figure 3."

Why this matters: Their key finding is that the ACT of critiquing improves calibration. The AI becomes more honest about uncertainty when forced to articulate where the uncertainty comes from.

How we adapt it: Add a critique step to our AI Council's workflow:

After the Extractor produces a claim, but BEFORE the Critic evaluates it, require the Extractor to write a brief self-critique:

Claim: "The LOD was 0.8 fM"
Self-critique: "I am confident about the value (0.8 fM) because it appears 
in Table 2 with clear units. I am less confident about whether this is a 
universal or condition-specific finding β€” the paper says 'in 10 mM PBS' 
which I've preserved as a qualifier."

This self-critique serves two purposes:

  1. It makes the Extractor's uncertainty explicit and auditable (the researcher can read WHY the AI is uncertain)
  2. It provides structured input for our confidence formula (the self-critique mentions preserved qualifiers, which maps to our qualifier_strength score)

The analogy: A student who writes "I think this is right but I'm not sure about step 3" is a BETTER student than one who just writes the answer. The self-doubt IS the learning signal.

Layer affected: Layer 2 (Qualified Extraction β€” council workflow) + Layer 5 (Scoring input) Priority: 🟑 Medium β€” improves calibration Effort: 1 week


3. Indirect Inspirations β€” Learn From These Approaches {#3-indirect-inspirations}

These aren't things we should build directly. They're philosophical and architectural lessons.


🟀 IN-1: PaperQA2's "Superhuman Benchmark" Philosophy

Lesson: PaperQA2 didn't just build a system β€” they proved it beats humans on specific, measurable tasks. They created the LitQA2 benchmark, ran their system against PhD-level scientists, and published the numbers.

What we should learn: We need to define OUR equivalent of LitQA2 β€” a specific task where we can measure: "Does our epistemic classification match expert human classification?" This requires:

  1. 20-30 papers annotated by domain experts
  2. Every claim labeled (Fact / Interpretation / Hypothesis) by at least 2 experts
  3. Inter-annotator agreement measured
  4. Our system evaluated against the same papers

The goal isn't to beat humans β€” it's to know WHERE we agree with humans and WHERE we disagree. The disagreements are the most interesting findings.


🟀 IN-2: KGX3's "Code Over AI" Philosophy

Lesson: KGX3 deliberately chose deterministic rules over stochastic AI for epistemic classification. Their paper argues that epistemic decisions should be REPRODUCIBLE β€” the same paper should always get the same classification. They achieved this by defining specific linguistic triggers with fixed activation thresholds.

What we should learn: We already share this philosophy ("AI provides components, code computes scores"). But we should go further. Every epistemic decision in our system should be decomposable into:

  1. AI-provided inputs (what the model observed)
  2. Code-computed outputs (what the formula calculated)
  3. Auditable reasoning (why the formula gave this result)

If a researcher asks "why is this claim labeled Interpretation?", the answer should never be "because the AI said so." It should be: "Because the claim appeared in the Discussion section (0.75Γ— modifier), contains the qualifier 'suggests' (-0.1 penalty), and the trigger-word analysis scored Interpretation at 0.6 vs Fact at 0.3."


🟀 IN-3: AgentSLR's "Speed Γ— Quality" Proof of Concept

Lesson: AgentSLR proved that AI can do literature reviews at human quality with 58Γ— speedup. This isn't theoretical β€” they applied it to 9 WHO-priority pathogens and published the comparison.

What we should learn: Speed matters for adoption. If our system takes 10 hours to process 100 papers, few researchers will use it. If it takes 20 minutes, everyone will. The architecture decisions we make now (local model size, batch processing, caching) directly affect adoption.

Specific implication: Consider processing papers in two passes:

  • Fast pass (minutes): Lightweight extraction using trigger words + SciBERT. Gives researchers immediate, approximate results.
  • Deep pass (hours): Full AI Council extraction with confidence scoring. Refines the fast-pass results.

This way, researchers get SOMETHING useful immediately and the system keeps improving in the background.


🟀 IN-4: Paper Circle's "Multi-Agent Specialization" Pattern

Lesson: Paper Circle assigns different agent roles based on CONTENT TYPE β€” one agent for concepts, one for methods, one for experiments. Our Council assigns roles based on FUNCTION β€” planner, extractor, critic, chairman. Both patterns work.

What we should learn: We should combine both. Our Council should have function-specialized roles (planner, extractor, critic, chairman) AND content-specialized routing (text goes to the language model, figures go to the VLM, tables go to the structured parser, equations go to Nougat).

The content router decides WHO processes each chunk. The Council decides HOW to interpret the results.
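A minimal sketch of that routing table; the handler names are placeholders for components that already exist elsewhere in the pipeline.

# Sketch of the content router: WHO processes each chunk, before the Council decides HOW
CONTENT_ROUTES = {
    "text": extract_with_language_model,       # placeholder: Council language-model path
    "figure": extract_with_vlm,                # placeholder: vision-language model path
    "table": extract_with_structured_parser,   # placeholder: structured table parser
    "equation": extract_with_nougat,           # placeholder: Nougat path (DA-4)
}

def route_chunk(chunk: dict) -> dict:
    """Dispatch a Layer 0 chunk to the right specialist based on its content type."""
    handler = CONTENT_ROUTES.get(chunk["content_type"], extract_with_language_model)
    return handler(chunk)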


🟀 IN-5: FactReview's "Execution as Evidence" Insight

Lesson: FactReview verifies empirical claims by RUNNING CODE. If a paper says "our model achieves 95% accuracy," FactReview actually executes the code to check. This is a profound epistemic innovation β€” it adds a non-AI evidence source that grounds claims in physical reality.

What we should learn: Not all evidence is text. For papers that include code, data, or computational results, we should consider:

  1. Code availability check: Does the paper link to a working repository? (Our Layer 1 already plans this)
  2. Reproducibility flag: Has anyone replicated the results? (CrossRef + Semantic Scholar APIs)
  3. Data availability check: Is the underlying data accessible? (Zenodo, Figshare, Dryad checks)

We can't execute arbitrary code (too dangerous), but we CAN check whether the infrastructure for verification EXISTS. A paper whose code is on GitHub, whose data is on Zenodo, and whose results have been replicated gets a higher evidence_quality score than a paper with none of these.
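A minimal sketch of the infrastructure check, using simple URL pattern matching on the paper text; the domain list is an assumption that can grow, and the evidence_quality bonus values are placeholders.

# Sketch of the infrastructure-exists check: pure pattern matching, no code execution
import re

REPO_PATTERNS = [r"github\.com/[\w.-]+/[\w.-]+", r"gitlab\.com/[\w.-]+/[\w.-]+"]
DATA_PATTERNS = [r"zenodo\.org", r"figshare\.com", r"datadryad\.org"]

def verification_infrastructure(paper_text: str) -> dict:
    """Flag whether code and data repositories are at least linked from the paper."""
    has_code = any(re.search(p, paper_text) for p in REPO_PATTERNS)
    has_data = any(re.search(p, paper_text) for p in DATA_PATTERNS)
    return {
        "code_linked": has_code,
        "data_linked": has_data,
        # Feeds evidence_quality as a small bonus, not a guarantee of correctness
        "evidence_quality_bonus": 0.05 * has_code + 0.05 * has_data,
    }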


🟀 IN-6: ORKG's "Human + Machine" Knowledge Graph

Lesson: ORKG (Open Research Knowledge Graph) at orkg.org takes a different approach than every AI system. Instead of having AI build the knowledge graph, they have HUMAN RESEARCHERS create structured comparison tables. The result is smaller but much more reliable than any AI-generated graph.

What we should learn: Our "human support" philosophy should include a mechanism for researchers to contribute structured knowledge BACK to the system:

  1. When a researcher corrects a claim label β†’ that correction becomes training data
  2. When a researcher manually adds a graph edge β†’ that edge has maximum provenance (human-verified)
  3. When a researcher creates a comparison table β†’ that table becomes structured input for future claims

ORKG proves that human-contributed knowledge, even in small amounts, dramatically improves graph quality. Our system already has the proposal mechanism (agents propose, humans approve). We should extend this so that human corrections flow back into both the knowledge graph AND the training pipeline.


🟀 IN-7: CLUE's "Explain the Uncertainty" Philosophy

Source system: CLUE Paper: arxiv:2505.17855

Lesson: CLUE doesn't just say "confidence: 0.6". It explains WHERE the uncertainty comes from: "There are 3 supporting papers and 1 contradicting paper. The contradiction is about the dosage (0.5mg vs 1.0mg), not about the overall finding." This is dramatically more useful than a bare number.

What we should learn: Our 3-score system (evidence quality, truth likelihood, qualifier strength) is already a form of uncertainty decomposition. But we should go further and generate a plain-English explanation for every composite score:

Confidence: 0.72
WHY: Evidence is strong (0.90) β€” the paper reports n=12 with p<0.001 
     and clear methodology. But one other paper in the knowledge graph 
     contradicts this finding using a different method (truth likelihood: 0.65). 
     Also, the claim uses hedging language ("may reduce") suggesting the 
     authors themselves aren't fully certain (qualifier strength: 0.60).

This explanation should be auto-generated from the scoring formula's components. No extra AI calls needed β€” just template the decomposition.
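A minimal sketch of that templating, assuming the three component scores plus short code-generated notes are already available from the scoring layer:

# Sketch of the explanation template: pure string assembly, no extra AI calls
def explain_confidence(scores: dict, notes: dict) -> str:
    """scores: component + composite values; notes: short code-generated phrases per component."""
    return (
        f"Confidence: {scores['composite']:.2f}\n"
        f"WHY: Evidence quality is {scores['evidence_quality']:.2f} ({notes['evidence_quality']}). "
        f"Truth likelihood is {scores['truth_likelihood']:.2f} ({notes['truth_likelihood']}). "
        f"Qualifier strength is {scores['qualifier_strength']:.2f} ({notes['qualifier_strength']})."
    )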


4. New Features Inspired by Prior Art {#4-new-features}

These are entirely new capabilities that don't exist in our current design, directly inspired by what we learned from other systems.


πŸ†• NF-1: Epistemic Velocity Tracking

Inspired by: CLAIRE's temporal analysis + PaperQA2's contradiction tracking

What it is: For every claim in the knowledge graph, track how its confidence has changed over time. Display this as a trend:

Claim: "Graphene FETs achieve LOD < 1 fM"

Confidence over time:
  2022 ──── 0.50 ──── First reported, single paper
  2023 ──── 0.65 ──── Replicated by 2 more labs
  2024 ──── 0.80 ──── Meta-analysis confirms
  2025 ──── 0.72 ──── One lab reports contradictory results
  2026 ──── 0.75 ──── Contradiction explained by method difference
  
  Status: STABLE (Β±0.05 over last 12 months)
  Trend: Rising (0.50 β†’ 0.75 over 4 years)

Why it matters for truth-seeking: A claim that has been steadily gaining confidence for 4 years is very different from a claim that keeps bouncing between 0.4 and 0.8. The TREND tells you more than the current number. Rising claims are being confirmed. Falling claims are being challenged. Volatile claims are actively contested β€” and therefore the most interesting for research.

Implementation: We already store timestamps on every claim and version history on canonical claims. Computing the velocity is just a query:

import statistics

def _linear_slope(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of ys over xs (confidence change per unit time)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom

def compute_epistemic_velocity(canonical_claim_id: str) -> dict:
    versions = db.get_version_history(canonical_claim_id)  # existing version store
    if len(versions) < 2:
        return {"trend": "insufficient_data", "stability": "unknown"}

    # Compute trend: slope of confidence over time, measured in years,
    # so the 0.01 band below means "roughly flat on a per-year basis"
    times = [(v["date"] - versions[0]["date"]).days / 365.25 for v in versions]
    confidences = [v["confidence"] for v in versions]
    slope = _linear_slope(times, confidences)

    # Compute stability: standard deviation of the last 3 confidence values
    recent = confidences[-3:]
    stability = statistics.pstdev(recent)

    return {
        "trend": "rising" if slope > 0.01 else "falling" if slope < -0.01 else "stable",
        "stability": "stable" if stability < 0.05 else "volatile",
        "current": confidences[-1],
        "history": list(zip([v["date"] for v in versions], confidences)),
    }

Layer affected: Layer 5 (Scoring β€” new temporal dimension) Priority: 🟑 Medium β€” high value, low effort Effort: 3-5 days


πŸ†• NF-2: Devil's Advocate Mode

Inspired by: CLAIRE's agentic contradiction detection + KGX3's challenge detection

What it is: A UI mode that actively CHALLENGES high-confidence claims:

For every claim with confidence > 0.8, the system automatically:

  1. Searches the knowledge graph for the strongest counter-evidence
  2. Finds the nearest alternative explanation (different epistemic label that could fit the same data)
  3. Identifies what would need to be true for the claim to be wrong

Display:

πŸ“‹ Claim: "Graphene FETs achieve LOD < 1 fM" (Confidence: 0.88)

😈 Devil's Advocate:
   Counter-evidence: Park 2024 reports LOD of 5 fM using identical 
   materials but different buffer conditions (comparability: 0.6)
   
   Alternative label: This could be an INTERPRETATION rather than a FACT 
   if the LOD calculation method (3Οƒ/slope) is disputed β€” two papers in 
   the graph use different LOD calculation methods.
   
   What would make this wrong: If the buffer conditions (10 mM PBS) are 
   critical to the result, the claim should have the qualifier 
   "in 10 mM PBS only" β€” which would reduce qualifier_strength by 0.1.

Why it matters for truth-seeking: The most dangerous state for a researcher is high confidence with hidden weaknesses. Devil's Advocate mode forces the system to ACTIVELY LOOK for reasons to doubt its own outputs. This is the essence of scientific skepticism.

Layer affected: Cross-cutting (UI + Layer 4 + Layer 5) Priority: 🟑 Medium β€” high value for truth-seeking Effort: 1-2 weeks


πŸ†• NF-3: Epistemic Provenance Levels (Human Verification Tracking)

Inspired by: ORKG's human contribution model + Paper Circle's verification status

What it is: Every claim in the knowledge graph gets a human-verification level:

| Level | Icon | Meaning | How It Happens |
|---|---|---|---|
| 0 | 🤖 | Machine-extracted, unreviewed | Default when claim is first extracted |
| 1 | 🟡 | Machine-extracted, human-spot-checked | Researcher saw it in the dashboard, didn't flag an error |
| 2 | 🟢 | Human-verified claim text AND label | Researcher explicitly confirmed text + Fact/Interpretation/Hypothesis |
| 3 | ⭐ | Expert-verified with domain knowledge | Domain expert reviewed with full context |
| 4 | 📄 | Published / peer-reviewed | Claim appears in a peer-reviewed paper's verified dataset |

Rules:

  • Level 0 claims CANNOT be cited in exports without a warning banner
  • Level 0 claims CANNOT generate "supports" edges in the knowledge graph (they can only be the TARGET of edges, not the SOURCE)
  • Level 2+ claims can be used as training data for the next model version
  • Level 3+ claims become part of the gold standard evaluation dataset

Why it matters for truth-seeking: A system that tells you "this claim was extracted 30 seconds ago and no human has looked at it" is much more honest than one that presents all claims with equal visual weight. The verification level IS the epistemic honesty.
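A minimal sketch of how the export and training rules above might be enforced; the level numbers mirror the table, and the claim field names are assumptions.

# Sketch of enforcing provenance rules at export and training time
from enum import IntEnum

class ProvenanceLevel(IntEnum):
    MACHINE = 0
    SPOT_CHECKED = 1
    HUMAN_VERIFIED = 2
    EXPERT_VERIFIED = 3
    PUBLISHED = 4

def export_claims(claims: list[dict]) -> list[dict]:
    """Level-0 claims only leave the system wrapped in a warning banner."""
    exported = []
    for claim in claims:
        if ProvenanceLevel(claim["provenance_level"]) == ProvenanceLevel.MACHINE:
            claim = {**claim, "warning": "Machine-extracted, not yet reviewed by a human"}
        exported.append(claim)
    return exported

def training_candidates(claims: list[dict]) -> list[dict]:
    """Only Level 2+ claims feed the next training run."""
    return [c for c in claims if c["provenance_level"] >= ProvenanceLevel.HUMAN_VERIFIED]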

Layer affected: Cross-cutting (all layers + UI) Priority: 🟑 Medium β€” fundamental for trust Effort: 1 week


πŸ†• NF-4: Confidence Decomposition Display

Inspired by: CLUE's uncertainty explanation + our own 3-score system

What it is: When showing a claim's confidence, display the full decomposition visually:

Claim: "The treatment reduced tumor size by 40%"
Epistemic Tag: Fact
Composite Confidence: 0.72

Evidence Quality:    0.90 ██████████████████░░
  (n=12, p<0.001, Results section, T1 journal)

Truth Likelihood:    0.65 █████████████░░░░░░░
  (one contradicting study: Kim 2024 found only 15% reduction)

Qualifier Strength:  0.60 ████████████░░░░░░░░
  (hedged: "in the murine model", "under these conditions")
                     ────────────────────────────
Composite:           0.72 ██████████████░░░░░░

⚠️ The truth likelihood is dragged down by Kim 2024. 
   Click to see the conflict in the Courtroom UI.

Why it matters for truth-seeking: The number 0.72 by itself is meaningless. The decomposition tells the researcher WHAT to investigate: "The evidence is strong, but there's a contradicting study, and the authors used hedging language." Each dimension points to a different action.

Layer affected: Layer 7 (UI β€” Obsidian export + Gradio display) Priority: 🟑 Medium β€” already in our design, needs implementation Effort: 3-5 days (UI components)


5. Integration Map β€” Where Each Inspiration Fits {#5-integration-map}

PDF Bundle arrives
  β”‚
  β–Ό
LAYER 0: Structural Ingestion
  β”œβ”€β”€ πŸ”΅ DA-4: Nougat for equations
  β”œβ”€β”€ (Marker + GROBID already planned)
  └── Output: structured regions with bbox
  β”‚
  β–Ό
LAYER 1: Entity Resolution
  β”œβ”€β”€ 🟀 IN-5: FactReview's code/data availability checks
  └── (existing citation + retraction checks)
  β”‚
  β–Ό
LAYER 2: Qualified Extraction
  β”œβ”€β”€ 🟣 AD-1: PaperQA2's RCS β†’ Pre-Extraction Filter (before Council)
  β”œβ”€β”€ 🟣 AD-7: CritiCal's Critique β†’ Self-Critique step (during Council)
  β”œβ”€β”€ 🟣 AD-3: KGX3's Language Filters β†’ Epistemic Trigger Words (validation)
  β”œβ”€β”€ 🟣 AD-4: Paper Circle's Coverage Checker β†’ Completeness Auditor (after Council)
  β”œβ”€β”€ 🟣 AD-5: PaperQA2's Refuse β†’ Low Confidence Quarantine (output filtering)
  └── πŸ†• NF-3: Epistemic Provenance Levels (applied to every claim)
  β”‚
  β–Ό
LAYER 3: Canonicalization
  β”œβ”€β”€ πŸ”΅ DA-1: SPECTER2 for claim embeddings (replaces word overlap)
  └── (existing temporal versioning)
  β”‚
  β–Ό
LAYER 4: Knowledge Graph
  β”œβ”€β”€ πŸ”΅ DA-5: SciBERT-NLI as fast contradiction pre-filter
  β”œβ”€β”€ 🟣 AD-2: FactReview's Dual Evidence β†’ Cross-Reference Verification
  β”œβ”€β”€ 🟣 AD-6: CLAIRE's Investigation Loop β†’ Conflict Investigation Protocol
  └── πŸ†• NF-2: Devil's Advocate Mode (cross-cutting with UI)
  β”‚
  β–Ό
LAYER 5: Calibrated Scoring
  β”œβ”€β”€ πŸ†• NF-1: Epistemic Velocity Tracking (temporal confidence trends)
  β”œβ”€β”€ πŸ†• NF-4: Confidence Decomposition Display
  └── 🟀 IN-7: CLUE's uncertainty explanation β†’ auto-generated score explanations
  β”‚
  β–Ό
LAYER 6: Evaluation
  β”œβ”€β”€ πŸ”΅ DA-2: SciFact as evaluation benchmark
  β”œβ”€β”€ πŸ”΅ DA-3: SciRIFF for training data
  └── 🟀 IN-1: PaperQA2's superhuman benchmark philosophy β†’ build our own epistemic benchmark
  β”‚
  β–Ό
LAYER 7: Provenance & Export
  └── 🟀 IN-6: ORKG's human contribution model β†’ corrections flow back to training

6. Priority Execution Order {#6-priority-order}

Phase I: Foundations (Do These First β€” Everything Else Depends on Them)

| # | Item | Type | Source System | Effort | Impact |
|---|---|---|---|---|---|
| 1 | DA-1: SPECTER2 for embeddings | 🔵 Direct | AllenAI | 1-2 days | Fixes deduplication completely |
| 2 | DA-2: SciFact benchmark | 🔵 Direct | AllenAI | 1 day | Gives us quality measurement |
| 3 | DA-3: SciRIFF training data | 🔵 Direct | AllenAI | 2-3 days | 72× more training examples |
| 4 | DA-4: Nougat integration | 🔵 Direct | Meta | 2-3 days | Fixes equation parsing |

Phase II: Quality Improvements (Build These Next)

| # | Item | Type | Source System | Effort | Impact |
|---|---|---|---|---|---|
| 5 | DA-5: SciBERT-NLI pre-filter | 🔵 Direct | AllenAI + Sarti | 1-2 days | Speeds up conflict detection 100× |
| 6 | AD-1: Pre-Extraction Filter (RCS) | 🟣 Adapt | PaperQA2 | 1 week | Reduces noise in extraction |
| 7 | AD-3: Epistemic Trigger Words | 🟣 Adapt | KGX3 | 1 week | Code-based epistemic validation |
| 8 | AD-5: Low Confidence Quarantine | 🟣 Adapt | PaperQA2 | 3-5 days | Prevents false confidence |

Phase III: Deep Improvements (Build These When Core Is Stable)

| # | Item | Type | Source System | Effort | Impact |
|---|---|---|---|---|---|
| 9 | AD-2: Cross-Reference Verification | 🟣 Adapt | FactReview | 1-2 weeks | Reduces hallucination |
| 10 | AD-4: Completeness Auditor | 🟣 Adapt | Paper Circle | 1 week | Prevents silent omission |
| 11 | AD-7: Self-Critique step | 🟣 Adapt | CritiCal | 1 week | Improves calibration |
| 12 | NF-3: Epistemic Provenance Levels | 🆕 New | ORKG + Paper Circle | 1 week | Human verification tracking |

Phase IV: Advanced Features (Build These for PhD Year 2+)

| # | Item | Type | Source System | Effort | Impact |
|---|---|---|---|---|---|
| 13 | AD-6: Conflict Investigation Protocol | 🟣 Adapt | CLAIRE | 2-3 weeks | Thorough contradiction analysis |
| 14 | NF-1: Epistemic Velocity | 🆕 New | CLAIRE + PaperQA2 | 3-5 days | Temporal confidence trends |
| 15 | NF-2: Devil's Advocate Mode | 🆕 New | CLAIRE + KGX3 | 1-2 weeks | Active skepticism |
| 16 | NF-4: Confidence Decomposition Display | 🆕 New | CLUE | 3-5 days | Transparent scoring |

Total Estimated Effort: ~12-16 weeks (spread across PhD timeline)


Summary: The Steal Sheet

| What We're Stealing | From Whom | Why It's Fair |
|---|---|---|
| SPECTER2 embeddings | AllenAI | Open-source, Apache 2.0 |
| SciFact benchmark | AllenAI | Open-source, CC-BY |
| SciRIFF training data | AllenAI | Open-source, ODC-BY |
| Nougat PDF parser | Meta Research | Open-source, MIT |
| SciBERT-NLI model | Gabriele Sarti | Open-source |
| RCS pattern | PaperQA2 | Published technique, not code |
| Language filter approach | KGX3 | Published methodology |
| Coverage checker pattern | Paper Circle | Published architecture |
| Refuse-to-answer pattern | PaperQA2 | Published behavior |
| Investigation loop pattern | CLAIRE | Published methodology |
| Critique calibration technique | CritiCal | Published technique |
| Human contribution philosophy | ORKG | Published approach |
| Uncertainty explanation approach | CLUE | Published philosophy |

Every adoption is either open-source with permissive licensing or a published technique (not code). We're standing on the shoulders of giants β€” and giving them credit.


Document compiled 2026-04-23. All source systems, papers, and resources have been verified as available. This document is a living roadmap β€” update as new systems appear.