
PhD Research OS: Prior Art Analysis

Every System That Has Tried Something Similar, and How We Compare

Date: 2026-04-23
Status: COMPREHENSIVE (15 systems analyzed across 6 capability areas)
Written for: Anyone with a high school education
Purpose: Honest accounting of who has built what, so we know exactly where we stand


What Is This Document?

Before you build anything, you should check if someone already built it. That's not pessimism; it's good science. If someone already solved a problem, you don't need to solve it again. You can use their solution and spend your time on the parts nobody has cracked yet.

We searched through research papers, open-source code, commercial products, and HuggingFace repositories to find every system that does something similar to PhD Research OS. We found 15 significant systems. Here is what each one does, where it beats us, where we beat it, and what we should steal (with credit).

The bottom line: Nobody has built the complete system we're building. But every piece of our system exists somewhere else in isolation. The novel part is the combination, plus three specific capabilities that nobody else has published.


Table of Contents

  1. The Scoreboard: How Every System Compares
  2. Tier 1: The Closest Rivals
  3. Tier 2: Best-in-Class Components
  4. Tier 3: Specialist Tools We Should Use
  5. Tier 4: Commercial Systems for Context
  6. What Nobody Has Built Yet (Our Innovations)
  7. The Honest Comparison Table
  8. Essential Resources and Links

1. The Scoreboard: How Every System Compares {#1-the-scoreboard}

Our system has 5 core capabilities. Here's who else has each one:

| Capability | What It Means | Who Has It | Who Does It Best |
|---|---|---|---|
| 1. Extract claims from papers | Read a PDF and pull out the key findings | PaperQA2, FactReview, SciEx, AgentSLR | PaperQA2 (superhuman on benchmarks) |
| 2. Label each claim epistemically | Is this a Fact? An Interpretation? A Hypothesis? | KGX3 (paper-level only) | Nobody does claim-level ← Us |
| 3. Build a typed knowledge graph | Connect claims with labeled edges (supports, refutes, extends) | Paper Circle, SciIE/SciERC, ORKG | Paper Circle (multi-agent KG) |
| 4. Detect contradictions across papers | Find when Paper A disagrees with Paper B | PaperQA2, CLAIRE, FactReview | PaperQA2 (2.34× better than PhDs) |
| 5. Compute calibrated confidence scores | Use math formulas (not AI guessing) to rate trustworthiness | Nobody does formula-based | Nobody ← Us |

The key insight: Lots of systems can do capabilities 1, 3, and 4. A few can do capability 2 at the paper level. Nobody combines all 5 at the claim level. And nobody does capability 5 the way we do (code-computed, not LLM-guessed).


2. Tier 1: The Closest Rivals {#2-tier-1-closest-rivals}

These are the systems that come closest to what we're building. If you only read one section, read this one.


🥇 PaperQA2: The Gold Standard for Scientific Literature AI

Paper: Language Agents Achieve Superhuman Synthesis of Scientific Knowledge (2024)
Earlier version: PaperQA: Retrieval-Augmented Generative Agent (2023)
Code: github.com/Future-House/paper-qa (open source, pip install paper-qa)
Built by: FutureHouse (well-funded AI research lab)

What It Does (The Analogy)

Imagine you're writing a research paper and you have a brilliant research assistant. You ask them a question like "Does protein X bind to receptor Y?" They go to the library, pull out 50 relevant papers, read the important sections, highlight the key sentences, and come back with a summary that includes page numbers for everything they claim. If two papers disagree, they tell you about the contradiction.

That's PaperQA2. It's a multi-step AI agent that:

  1. Searches for relevant papers (like a librarian)
  2. Retrieves the most important chunks from each paper (like highlighting)
  3. Reranks those chunks by relevance using AI (like sorting your highlights)
  4. Summarizes each chunk in context (like writing margin notes)
  5. Generates a final answer with citations (like writing a paragraph with footnotes)
  6. Detects contradictions across papers (like spotting when two witnesses disagree)

How Good Is It?

This is the scary part. PaperQA2 has been benchmarked against real PhD-level scientists:

| Task | PaperQA2 Score | Human PhD Score | Winner |
|---|---|---|---|
| Answer questions about papers (LitQA2) | 85.2% precision | 73.8% precision | 🤖 PaperQA2 |
| Find contradictions in protein biology | 2.34× more found | Baseline | 🤖 PaperQA2 |
| Write Wikipedia-quality articles (WikiCrow) | 86.1% precision | 71.2% precision | 🤖 PaperQA2 |

Analogy: If paper-reading were a sport, PaperQA2 would be the Olympic champion. We're not competing directly; we're building a different sport (epistemic analysis), but we should learn from the champion's training methods.

What It Does Better Than Us

| Feature | PaperQA2 | PhD Research OS | What This Means |
|---|---|---|---|
| PDF parsing | GROBID (professional, battle-tested) | PyMuPDF (basic) | Their input quality is much higher |
| Retrieval quality | Dense retrieval + LLM reranking (RCS) | No retrieval system | They find the right sections faster |
| Contradiction detection | Benchmarked, superhuman | Designed but not built | They can prove it works; we can't yet |
| "I don't know" ability | Refuses when evidence is insufficient | No refusal mechanism | They're honest about gaps; we aren't yet |
| Citation traversal | Follows reference chains automatically | Citation resolution designed | They do it; we plan to do it |

What We Do Better Than PaperQA2

| Feature | PhD Research OS | PaperQA2 | What This Means |
|---|---|---|---|
| Epistemic labels | Fact / Interpretation / Hypothesis | None (flat claims) | We tell you WHAT KIND of claim it is |
| Knowledge graph | Typed edges (supports/refutes/extends) | No persistent graph | We build a web of connected knowledge |
| Calibrated scoring | 3-score code-computed formula | LLM-stated confidence | Our confidence numbers are auditable |
| Local-first privacy | Paper data never leaves your computer | Requires cloud API calls | We protect unpublished research |
| Human oversight design | Council votes, proposals, manual mode | Answer generation only | We're built for verification, not just answers |

The Big Lesson from PaperQA2

Their secret weapon is RCS (Rerank + Contextual Summarize). Before answering a question, they have the AI rerank all retrieved chunks by relevance, then write a contextual summary of each chunk. This two-step filter removes noise before the final answer is generated.

Analogy: Imagine you're studying for an exam. Method A: read all your notes and try to answer from memory. Method B: first sort your notes by relevance to the question, then write a one-sentence summary of each relevant note, then answer using only the summaries. Method B is dramatically better. That's RCS.

We should adopt this. Our Layer 2 extraction pipeline would benefit enormously from an RCS-style pre-filtering step before the AI Council processes each section.
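
To make this concrete, here is a minimal sketch of what an RCS-style pre-filter could look like in our Layer 2 pipeline. It assumes a generic `llm(prompt) -> str` callable and illustrative prompts; this is our sketch of the idea, not PaperQA2's actual implementation.

```python
# RCS-style (Rerank + Contextual Summarize) pre-filter sketch.
# `llm(prompt) -> str` is a placeholder for whatever model call the
# pipeline uses; the prompts are illustrative, not PaperQA2's.

def rcs_filter(question: str, chunks: list[str], llm, top_k: int = 5) -> list[str]:
    """Rerank chunks by relevance, then contextually summarize the survivors."""
    # Step 1 (rerank): ask the model to score each chunk's relevance 0-10.
    scored = []
    for chunk in chunks:
        reply = llm(f"Rate 0-10 how relevant this passage is to the question.\n"
                    f"Question: {question}\nPassage: {chunk}\nScore:")
        try:
            score = float(reply.strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0  # unparseable reply counts as irrelevant
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    survivors = [chunk for _, chunk in scored[:top_k]]

    # Step 2 (contextual summarize): distill each survivor relative to the
    # question, so downstream steps see focused notes instead of raw text.
    return [llm(f"Summarize this passage only as it relates to the question.\n"
                f"Question: {question}\nPassage: {chunk}")
            for chunk in survivors]
```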


🥈 FactReview: The Closest to Claim-Level Verification

Paper: FactReview: Evidence-Grounded Reviews with Literature Positioning (2026)

What It Does (The Analogy)

Imagine a very thorough peer reviewer. They don't just read your paper; they go find all the papers you cited, plus papers you DIDN'T cite, and check: "Does the existing literature actually support what this paper claims?" For every claim, they write: "SUPPORTS: three papers agree" or "REFUTES: this contradicts Smith 2023" or "RELATED: similar work, but not directly comparable."

And here's the clever part: if your paper says "our model achieves 95% accuracy," FactReview actually runs your code to check if that's true. It does what most human reviewers can't do: it verifies empirical claims by execution.

What It Does Better Than Us

  • Execution-based verification: It can actually run code to check if claimed results are real. We have no equivalent. This is like a teacher not just reading your math homework but re-doing the calculations.
  • Literature positioning: For each claim, it retrieves relevant prior work and labels the relationship (SUPPORTS / REFUTES / RELATED). This is close to our knowledge graph edges.

What We Do Better

  • Persistent knowledge graph: FactReview checks one paper at a time. It doesn't build a lasting web of knowledge that grows as you add more papers.
  • Epistemic classification: FactReview labels relationships between claims (supports/refutes), but doesn't label the claims themselves (Fact vs Hypothesis).
  • Calibrated scoring: FactReview says "supports" or "refutes" (binary). We give a 3-dimensional confidence score.

The Big Lesson from FactReview

Dual evidence is better than single evidence. They check claims against BOTH the manuscript itself AND external literature. We should adopt this: when extracting a claim, check it against both (a) the specific paper and (b) the existing knowledge graph.
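
A minimal sketch of that dual-evidence idea, assuming two hypothetical lookup hooks (`find_in_paper` standing in for our Layer 2 extraction, `find_in_graph` for the knowledge graph); the verdict labels are illustrative, not FactReview's:

```python
def dual_evidence_verdict(claim: str, paper_id: str,
                          find_in_paper, find_in_graph) -> str:
    """Combine manuscript-internal and corpus-external evidence for one claim."""
    internal = find_in_paper(paper_id, claim)  # e.g., a table or figure backing it
    external = find_in_graph(claim)            # e.g., supporting claims from prior papers
    if internal and external:
        return "corroborated"   # backed by the paper AND the existing graph
    if internal:
        return "novel"          # backed internally, no prior-art signal yet
    if external:
        return "imported"       # asserted without in-paper evidence
    return "unsupported"        # flag for the AI Council / human review
```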


🥉 KGX3 / iKuhn: The Closest Epistemic System

Paper: A Novel Kuhnian Ontology for Epistemic Classification (2020)
System: Formerly called Soph.io, then iKuhn, now KGX3 / Preprint Watch

What It Does (The Analogy)

Imagine a very opinionated professor who reads every paper and immediately says: "This paper confirms what we already knew," or "This paper challenges the established theory," or "This paper introduces a completely new method." They don't just read the claims; they classify the paper's ROLE in the field's history.

KGX3 does this automatically using a philosophy framework from Thomas Kuhn (a famous historian of science). It classifies every paper using a three-part code:

  • M (Methodological): Did they reuse an old method, adapt one, or invent a new one?
  • N (Observational): Were the results expected, unexpected, or truly anomalous?
  • P (Positional): Does this confirm, extend, or challenge existing knowledge?

So a paper might be classified as (Novel, Unexpected, Challenges): a new method found something nobody predicted that contradicts the consensus. That's a paradigm-shifting paper.

Why This Matters for Us

KGX3 is the ONLY system we found that does formal epistemic classification: labeling the knowledge-status of scientific content using rules and formulas, not AI guessing. That's exactly our philosophy!

But there's a critical difference: KGX3 classifies whole papers. We classify individual claims within papers. This is like the difference between rating a whole restaurant (KGX3) versus rating each dish on the menu (us). Our approach is much more granular and much harder.

The Big Lesson from KGX3

Their deterministic language-game filters are brilliant. Instead of asking an AI "is this a confirmation or a challenge?", they have specific word patterns and rules that detect each epistemic category. For example, phrases like "consistent with previous findings" trigger the "Confirms" filter, while "contrary to expectations" triggers the "Challenges" filter.

The activation threshold (θ = 0.7) is validated across multiple scientific disciplines. This is exactly the kind of empirically grounded rule that our scoring formulas need.

We should adopt the principle: define specific linguistic triggers for each epistemic category (Fact / Interpretation / Hypothesis) and use them as code-based validators, not just AI prompts.
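
A minimal sketch of that principle applied to our three labels. The trigger phrases, weights, and abstention behavior are illustrative placeholders; only the idea of code-based filters with an activation threshold is borrowed from KGX3:

```python
import re

# Deterministic "language-game" filters for claim-level labels (sketch).
# Phrases and weights below are illustrative, not KGX3's actual rule set.
TRIGGERS = {
    "fact":           [(r"\bwe measured\b", 0.9), (r"\bp\s*[<=]\s*0?\.\d+", 0.8)],
    "interpretation": [(r"\bsuggests?\b", 0.8), (r"\bconsistent with\b", 0.75)],
    "hypothesis":     [(r"\bwe hypothesi[sz]e\b", 0.9), (r"\bmight\b", 0.7)],
}

def classify_claim(sentence: str, theta: float = 0.7) -> str | None:
    """Return the label of the strongest firing trigger, or None below theta."""
    text = sentence.lower()
    best_label, best_score = None, 0.0
    for label, patterns in TRIGGERS.items():
        for pattern, weight in patterns:
            if re.search(pattern, text) and weight > best_score:
                best_label, best_score = label, weight
    # Below the activation threshold we abstain and defer to the AI Council.
    return best_label if best_score >= theta else None
```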


Paper Circle: The Closest Multi-Agent KG System

Paper: Paper Circle: An Open-source Multi-agent Research Discovery Framework (2025)
Code: github.com/MAXNORM8650/papercircle

What It Does (The Analogy)

Imagine a team of research assistants, each with a different specialty:

  • Assistant 1 reads papers and identifies key concepts
  • Assistant 2 identifies the methods used
  • Assistant 3 extracts experimental results
  • Assistant 4 links figures and tables to the right concepts
  • A manager coordinates all four and checks that nothing was missed

Together, they build a knowledge graph: a web of connected concepts, methods, experiments, and datasets.

Paper Circle does exactly this with AI agents. Their architecture looks like this:

CodeAgent (the manager)
  → Concept Extractor (finds ideas and theories)
  → Method Extractor (finds algorithms and techniques)
  → Experiment Extractor (finds setups, datasets, metrics, results)
  → Linkage Agent (connects figures/tables to concepts/methods)
  → Coverage Checker (makes sure nothing important was missed)

What We Should Learn

  1. The Coverage Checker is genius. After all agents have done their work, a final agent checks: "Did we miss any important sections? Are there figures without linked concepts? Are there methods mentioned in the text that we didn't extract?" This prevents the silent omission problem, where the system looks complete but actually missed something important.

  2. Their provenance model is solid. Every node in their knowledge graph carries: source chunk IDs, page numbers, verification status, confidence scores, and timestamps. This is very close to our Layer 7 design.
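
A minimal sketch of a provenance-carrying node in that spirit; the field names are ours (and hypothetical), not Paper Circle's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ClaimNode:
    """A knowledge-graph node that always knows where it came from."""
    claim_id: str
    text: str
    source_chunk_ids: list[str]               # which extracted chunks produced it
    page_numbers: list[int]                   # where in the PDF it appears
    verification_status: str = "unverified"   # unverified / council-approved / human-approved
    confidence: float | None = None           # filled in later by the scoring layer
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```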

Where We Differ

Paper Circle builds a structural knowledge graph (concepts, methods, experiments connected by "used-in" and "part-of" edges). We build an epistemic knowledge graph (claims connected by "supports," "refutes," and "extends" edges). Think of it this way: Paper Circle maps the territory of science (what exists). We map the arguments of science (what agrees, what disagrees, what's uncertain).


AgentSLR: The Best End-to-End Literature Review System

Paper: AgentSLR: Automating Systematic Literature Reviews with Agentic AI (2025)
Code: github.com/oxrml/agentslr

What It Does (The Analogy)

Imagine you're a doctor who needs to know: "Does drug X help with disease Y?" To answer properly, you need to do a systematic review: find EVERY study about this question, check which ones are good quality, extract the key numbers from each one, and synthesize a conclusion.

This normally takes a team of researchers 7 weeks. AgentSLR does it in 20 hours. And it performs at human level.

The Numbers

  • 58× faster than human systematic reviews
  • Applied to 9 WHO-priority pathogens (real-world medical task)
  • Open-source with human-in-the-loop hooks

What We Should Learn

AgentSLR is the proof that AI can process scientific literature at scale AND match human quality. But it doesn't build knowledge graphs, doesn't classify epistemic status, and doesn't detect contradictions. It's a pipeline (find → screen → extract → report), not an engine (continuously building and updating a knowledge web).

Our role is different: AgentSLR answers one question at a time. We build a growing knowledge base that answers every future question about the papers you've already processed. Think of AgentSLR as a very fast research assistant. We're building a very smart filing cabinet.


CLAIRE (Stanford): The Best Contradiction Detector

Paper: Detecting Corpus-Level Knowledge Inconsistencies with LLMs (2024)
Code: github.com/stanford-oval/inconsistency-detection

What It Does (The Analogy)

Wikipedia has millions of articles. Some of them contradict each other: Article A says one thing, Article B says the opposite. Nobody has time to read all of Wikipedia looking for contradictions. CLAIRE does.

It works like a detective:

  1. Extract claims from one article
  2. Search the rest of the corpus for related content
  3. Compare using AI reasoning: do these agree or disagree?
  4. Flag contradictions for human editors to review

When tested with real Wikipedia editors, the editors said they were more confident and found more issues when using CLAIRE than working alone.

What We Should Learn

CLAIRE's agentic RAG loop (extract → retrieve → compare → flag) maps directly to our conflict detection pipeline. The key insight is that you should retrieve evidence BEFORE making a contradiction judgment: don't just compare two claims in isolation, but also look at what other sources say about both claims.

Their domain is Wikipedia, not science. Scientific contradictions are harder because methods matter: two papers might look contradictory but actually used different experimental conditions. We handle this with our method-compatibility layer. CLAIRE doesn't have this.
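
A minimal sketch of that loop with our method-compatibility gate bolted on. `retrieve`, `methods_compatible`, and `nli_verdict` are hypothetical hooks into our retrieval, metadata, and NLI components; this is our reading of CLAIRE's pattern, not its code:

```python
def find_contradictions(claims, retrieve, methods_compatible, nli_verdict):
    """Extract -> retrieve -> compare -> flag, with a method-compatibility gate."""
    flags = []
    for claim in claims:                          # 1. extract (done upstream)
        for candidate in retrieve(claim.text):    # 2. retrieve related claims + context
            # Gate: claims from incompatible experimental setups are not a
            # real contradiction, just different conditions.
            if not methods_compatible(claim, candidate):
                continue
            # 3. compare: judgment made WITH retrieved context, not in isolation.
            if nli_verdict(claim.text, candidate.text) == "contradiction":
                flags.append((claim, candidate))  # 4. flag for human review
    return flags
```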


3. Tier 2: Best-in-Class Components {#3-tier-2-best-in-class-components}

These aren't competing systems; they're the best tools for specific sub-problems. We should use them directly.


SciFact + MultiVerS: The Standard for Claim Verification

Paper (SciFact): Fact or Fiction: Verifying Scientific Claims (2020)
Paper (MultiVerS): MultiVerS: Improving scientific claim verification (2021)
Dataset: bigbio/scifact on HuggingFace
Code: github.com/allenai/multivers
Built by: AllenAI (the Allen Institute for AI, one of the best AI research labs in the world)

What It Is (The Analogy)

SciFact is like a standardized test for claim-checking AI. It contains 1,400 scientific claims written by experts, and for each claim, the "answer key" tells you which research abstracts SUPPORT it, REFUTE it, or provide NO INFORMATION.

MultiVerS is the best model for taking this test. It reads a claim AND a full research abstract at the same time, then outputs: (a) does this abstract support, refute, or have nothing to say about this claim? and (b) which specific sentences in the abstract are the evidence?

Why This Matters for Us

SciFact defines the SUPPORTS / REFUTES / NOINFO label scheme that our knowledge graph edges are based on. It's the industry standard. Our system should be benchmarkable against SciFact: if we can't match MultiVerS scores on this standard test, our extraction pipeline isn't good enough.

How to use it: bigbio/scifact is available on HuggingFace Hub. Each example has a claim text, evidence documents, labels (SUPPORTS/REFUTES/NOINFO), and rationale sentences.
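
A minimal loading sketch with the `datasets` library. The config name below is an assumption (bigbio datasets usually expose several configs); check the dataset card before running:

```python
from datasets import load_dataset

# Config name is an assumption; see the bigbio/scifact dataset card.
claims = load_dataset("bigbio/scifact", name="scifact_claims_source",
                      split="train", trust_remote_code=True)

for example in claims.select(range(3)):
    print(example)  # claim text, evidence docs, SUPPORTS/REFUTES/NOINFO labels
```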


SciRIFF: The Best Training Data for Scientific Tasks

Paper: SciRIFF: Enhancing LM Instruction-Following for Scientific Literature (2024)
Dataset: allenai/SciRIFF on HuggingFace
Built by: AllenAI

What It Is (The Analogy)

Imagine you want to teach someone to be a research assistant. SciRIFF is the textbook. It contains 137,000 practice exercises across 54 different scientific tasks: extracting information, verifying claims, summarizing papers, answering questions, classifying text.

Each exercise was written by human experts, not generated by AI. This matters because AI-generated training data teaches the model AI-style shortcuts, not real scientific reasoning.

Why This Matters for Us

Our current training data is 1,900 synthetic examples generated by a Python script. SciRIFF has 137,000 expert-written examples. That's 72× more data, and it's real, not fake.

Key finding from the SciRIFF paper: Expert-written templates perform significantly better than GPT-4-generated templates. This validates our concern about synthetic training data β€” real human-written examples are genuinely better.

Training recipe from the paper: 5 epochs, learning rate 2e-5, BF16, batch size 128, sequence length 4096. 7B models reach ~70% on scientific tasks; you need 70B for 77%+.
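
That recipe, expressed as HuggingFace `TrainingArguments` (a sketch: the per-device/accumulation split is our assumption for reaching batch size 128, and sequence length 4096 is enforced at tokenization, not here):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sciriff-finetune",
    num_train_epochs=5,               # SciRIFF recipe: 5 epochs
    learning_rate=2e-5,               # SciRIFF recipe: 2e-5
    bf16=True,                        # SciRIFF recipe: BF16
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = effective batch size 128 on one GPU
)
```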


SPECTER2: The Best Scientific Paper Embeddings

Models: allenai/specter2_base + task-specific adapters on HuggingFace
Built by: AllenAI

What It Is (The Analogy)

When you hear a song, your brain instantly converts it into a "feeling": happy, sad, energetic. SPECTER2 does the same thing for scientific text. It converts any scientific sentence, abstract, or paper into a list of numbers (called an "embedding") that captures its meaning.

Two sentences that mean the same thing, even if they use completely different words, will get similar numbers. "The detection limit was 0.8 fM" and "We achieved femtomolar-level sensitivity" would have very similar embeddings.

Why This Matters for Us

Our deduplication system (Layer 3) currently uses word overlap, which misses paraphrases completely. SPECTER2 would fix this instantly. It's specifically trained on scientific text, so it understands domain-specific terminology.

Available adapters:

| Adapter | HuggingFace ID | Use Case |
|---|---|---|
| Proximity/Retrieval | allenai/specter2 | Finding similar claims (deduplication) |
| Classification | allenai/specter2_classification | Topic/field classification |
| Ad-hoc Search | allenai/specter2_adhoc_query | Searching claims by query |
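
A deduplication sketch using the proximity adapter, following the usage shown on the SPECTER2 model card (it assumes the `adapters` package, the successor to adapter-transformers; verify the calls against the card for your installed version):

```python
import torch
from transformers import AutoTokenizer
from adapters import AutoAdapterModel  # pip install adapters

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
model.load_adapter("allenai/specter2", source="hf",
                   load_as="proximity", set_active=True)

claims = ["The detection limit was 0.8 fM.",
          "We achieved femtomolar-level sensitivity."]
inputs = tokenizer(claims, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0]  # CLS vector per claim

similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity:.3f}")  # high score -> likely duplicates
```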

SciIE / SciERC: The Standard for Scientific KG Construction

Paper: Multi-Task IE for Scientific Knowledge Graph Construction (2018)
Dataset: SciERC (500 annotated abstracts with entities, relations, and coreference)
Built by: AllenAI

What It Is (The Analogy)

SciERC is like a labeled map of how scientific concepts connect to each other. In 500 research paper abstracts, human experts marked:

  • Entities: What things are mentioned? (Tasks, Methods, Metrics, Materials)
  • Relations: How do they connect? (used-for, feature-of, part-of, compare)
  • Coreference: When do two mentions refer to the same thing?

SciIE is the AI model that learned from this map and can now automatically build knowledge graphs from new papers.

Why This Matters for Us

The SciERC relation types (used-for, feature-of, part-of) are structural: they describe how things relate physically or functionally. Our edge types (supports, refutes, extends) are epistemic: they describe how claims relate evidentially.

We need both. A complete knowledge graph should know that "Method A was used-for Task B" (structural) AND "Claim X from Paper 1 supports Claim Y from Paper 2" (epistemic). We should use SciERC's taxonomy as the starting point for our structural edges, alongside our existing epistemic edges.
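
A minimal sketch of a graph holding both layers at once, using networkx; node names, attributes, and scores are illustrative only:

```python
import networkx as nx

G = nx.MultiDiGraph()

# Structural layer (SciERC taxonomy): how things relate functionally.
G.add_edge("Method A", "Task B", relation="used-for", layer="structural")

# Epistemic layer (ours): how claims relate evidentially.
G.add_edge("claim:paper1/x", "claim:paper2/y",
           relation="supports", layer="epistemic", confidence=0.82)
G.add_edge("claim:paper3/z", "claim:paper2/y",
           relation="refutes", layer="epistemic", confidence=0.64)

# Query just the disagreements:
refutations = [(u, v) for u, v, d in G.edges(data=True)
               if d["relation"] == "refutes"]
print(refutations)  # [('claim:paper3/z', 'claim:paper2/y')]
```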


SciBERT & BiomedBERT: Domain-Specific Language Models

| Model | HuggingFace ID | What It's Good For |
|---|---|---|
| SciBERT | allenai/scibert_scivocab_uncased | General scientific NLP (NER, classification) |
| BiomedBERT | microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext | Biomedical text (best on BLURB benchmark) |
| SciBERT-NLI | gsarti/scibert-nli | Scientific entailment/contradiction detection |

Why These Matter for Us

These are smaller, faster models trained specifically on scientific text. They won't replace our main Qwen brain, but they can serve as lightweight validators. For example:

  • Use SciBERT-NLI as a fast pre-filter for contradiction detection (check if two claims even MIGHT contradict before sending to the expensive main model); see the sketch after this list
  • Use SciBERT for scientific NER to help with entity resolution (Layer 1)
  • Use SPECTER2 + SciBERT together for claim embedding (Layer 3)
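
A sketch of that pre-filter. Note the stand-in: we use a general-purpose NLI cross-encoder here because its input/output contract is well documented; wiring in gsarti/scibert-nli itself may need a different head or setup, so check its model card first:

```python
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
LABELS = ["contradiction", "entailment", "neutral"]  # this model's documented order

def might_contradict(claim_a: str, claim_b: str) -> bool:
    """Cheap gate: only pairs flagged here go to the expensive main model."""
    scores = nli.predict([(claim_a, claim_b)])[0]
    return LABELS[scores.argmax()] == "contradiction"

print(might_contradict("The detection limit was 0.8 fM.",
                       "The method could not detect femtomolar concentrations."))
```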

4. Tier 3: Specialist Tools We Should Use {#4-tier-3-specialist-tools}

PDF Parsing

| Tool | What It Does | Analogy |
|---|---|---|
| Nougat (facebook/nougat-base) | Converts scientific PDFs to markdown with proper LaTeX equations | A translator who speaks both "PDF" and "structured text" and is especially good at math |
| GROBID (github.com/kermitt2/grobid) | Extracts authors, citations, references, and paper structure | A librarian who catalogs every detail about a book |
| Marker (github.com/VikParuchuri/marker) | ML-based PDF → markdown with layout awareness | A reader who understands that columns, captions, and footnotes are different things |

Our plan already includes all three. The SYSTEM_DESIGN.md correctly specifies Marker + Nougat + GROBID. What we don't have yet is the actual integration code.

Scientific NER (Named Entity Recognition)

| Tool | Paper | What It Does |
|---|---|---|
| GLiNER-biomed | arxiv:2504.00676 | Zero-shot biomedical entity recognition; can find ANY type of entity without being specifically trained for it |
| SciER | arxiv:2410.21155 | Extracts datasets, methods, and tasks from full-text papers; directly feeds into knowledge graph nodes |

Fact-Checking Systems

| System | Paper | Key Innovation |
|---|---|---|
| ClaimVer | arxiv:2403.09724 | KG-backed claim verification with natural language explanations; generates human-readable reasoning for why a claim is true/false |
| CLUE | arxiv:2505.17855 | Explains WHERE uncertainty comes from (inter-evidence conflicts vs. agreement); exactly what our 3-score system tries to capture |
| PubHealth | arxiv:2010.09926 | Explainable fact-checking for health claims; proof that domain-specific fact-checking outperforms general-purpose |

Confidence Calibration

| System | Paper | Key Innovation |
|---|---|---|
| CritiCal | arxiv:2510.24505 | Natural language critiques improve LLM calibration: the AI writes WHY it's uncertain, and this improves the accuracy of its confidence scores |
| MetaFaith | arxiv:2505.24858 | Prompt-based faithful uncertainty: teaches the model to express uncertainty with hedging words that actually correspond to its real confidence |
| Calibration-Tuning | arxiv:2406.08391 | Fine-tune on correct/incorrect answer pairs so the model learns when it's likely to be wrong |

Systematic Review Tools

| Tool | Type | Best For |
|---|---|---|
| ASReview (github.com/asreview/asreview) | Open-source | Active learning for paper screening; learns your preferences to surface the most relevant papers first |
| LatteReview (arxiv:2501.05468) | Open-source | Multi-agent systematic review framework with Pydantic validation |
| Bio-SIEVE (arxiv:2308.06610) | Open-source | Instruction-tuned LLaMA for abstract screening |

5. Tier 4: Commercial Systems for Context {#5-tier-4-commercial-systems}

These are commercial products that researchers actually use today. We're not competing with them directly (they answer questions, we build knowledge maps), but we should understand what researchers are already used to.

| System | What It Does | Strengths | What It Can't Do (That We Can) |
|---|---|---|---|
| Semantic Scholar (semanticscholar.org) | Citation graph, paper search, TLDR summaries, SPECTER embeddings | Massive database (200M+ papers), free API, open data | No claim extraction, no epistemic labels, no contradiction detection |
| Elicit (elicit.com) | AI research assistant: finds papers, extracts data, answers questions | Clean UI, popular with researchers | Benchmarked as inferior to PaperQA2; no knowledge graph, no epistemic classification |
| Consensus (consensus.app) | "Consensus meter" showing if papers agree | Good at showing agreement/disagreement at topic level | No claim-level analysis, no confidence decomposition, no public API |
| ResearchRabbit (researchrabbit.ai) | Citation network visualization | Beautiful graph visualization, free | No claim analysis, no extraction, just citation mapping |
| Litmaps (litmaps.com) | Citation mapping with timeline | Good discovery tool | No analysis beyond citation connections |
| ORKG (orkg.org) | Open Research Knowledge Graph: human-contributed structured paper comparisons | Structured comparison tables created by researchers | Manual input only; no automated extraction |
| SciScore (sciscore.com) | Methods rigor assessment: checks if papers report blinding, randomization, statistics properly | Addresses an important research quality dimension | Very narrow focus: just methods reporting, not claims or knowledge |
| Iris.ai (iris.ai) | Evidence mapping, relevance scoring | Nice relevance visualizations | No epistemic classification, no public API |

6. What Nobody Has Built Yet (Our Innovations) {#6-what-nobody-has-built-yet}

After analyzing all 15 systems, here are the things that exist nowhere else:

Innovation 1: Claim-Level Epistemic Classification

What it is: Labeling EACH CLAIM (not each paper) as Fact, Interpretation, or Hypothesis.

Who comes close: KGX3 classifies whole papers using a Kuhnian triplet (M, N, P). But a single paper contains dozens of claims: some are facts (direct measurements), some are interpretations (the authors' explanation of the measurements), and some are hypotheses (what the authors think might be true). KGX3 can't distinguish these within a single paper.

Analogy: KGX3 rates a whole restaurant (4 stars). We rate each dish on the menu (appetizer: 5 stars, soup: 3 stars, dessert: 4 stars). Both are useful, but ours gives you much more detail.

Why nobody has done this: It's HARD. The same sentence can be a Fact in one context and an Interpretation in another. "The treatment worked" is a Fact in the Results section (because it's backed by data) and an Interpretation in the Abstract (because it might overstate what the data actually shows). You need section-awareness, qualifier detection, statistical evidence extraction, and a rules-based engine to do this correctly.
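
A toy sketch of why the rules engine needs section-awareness; every rule and threshold here is illustrative, not our actual Layer 2 logic:

```python
HEDGES = ("may", "might", "could", "possibly", "we hypothesize")

def label_claim(sentence: str, section: str, has_statistics: bool) -> str:
    """Same sentence, different label depending on context (sketch)."""
    text = sentence.lower()
    if any(h in text for h in HEDGES):
        return "Hypothesis"        # explicit hedging wins everywhere
    if section == "results" and has_statistics:
        return "Fact"              # data-backed statement in the Results section
    return "Interpretation"        # same words elsewhere carry less evidence

# "The treatment worked": Fact in Results (with stats), Interpretation in Abstract.
print(label_claim("The treatment worked", "results", has_statistics=True))
print(label_claim("The treatment worked", "abstract", has_statistics=False))
```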

Innovation 2: Code-Computed Calibrated Confidence

What it is: Using math formulas in code (not AI guessing) to compute how much you should trust each claim. Our 3-score system (evidence quality × truth likelihood × qualifier strength) is computed by Python code, not by asking the AI "how confident are you?"

Who comes close: SAFE (Google's factuality system) checks claims against search results, but its scores are binary (supported/not supported). CritiCal and MetaFaith improve LLM calibration, but they still rely on the LLM's self-assessed confidence. KGX3 uses deterministic rules but only at paper level.

Analogy: Other systems ask a student "how well do you think you did on the test?" and trust their answer. We give the test to 3 different graders, multiply their scores using a fixed formula, and add penalty points for unclear handwriting. The student never touches the grade book.

Why nobody has done this: Most AI systems want to be end-to-end (input goes in, output comes out, no human-designed formulas in between). Our approach deliberately breaks this by inserting code-computed gates between AI outputs and final confidence numbers. This requires more engineering but produces auditable, reproducible scores.
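
A minimal sketch of the principle: sub-scores go into a fixed formula, and the model never writes its own grade. The weights and hedging penalty are illustrative, not our production formulas:

```python
def confidence(evidence_quality: float, truth_likelihood: float,
               qualifier_strength: float, n_hedges: int = 0) -> float:
    """All inputs in [0, 1]; each hedging word deducts a fixed penalty."""
    base = evidence_quality * truth_likelihood * qualifier_strength
    penalty = 0.05 * n_hedges          # every "may"/"might" costs 0.05
    return round(max(0.0, base - penalty), 3)

# Auditable: same inputs, same score, every time.
print(confidence(0.9, 0.8, 0.7, n_hedges=1))  # 0.454
```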

Innovation 3: The Integrated 7-Layer Local-First Pipeline

What it is: A complete system that does parse → resolve → extract → dedup → graph → score → evaluate → export, all running locally on your laptop without sending data to the cloud.

Who comes close: PaperQA2 does search → retrieve → summarize → answer, but requires cloud APIs and doesn't build a persistent knowledge graph. Paper Circle builds a knowledge graph but doesn't score claims or detect contradictions. AgentSLR does systematic reviews end-to-end but doesn't build any persistent knowledge structure.

Analogy: Other systems are like food delivery apps: each one brings you a different dish. PaperQA2 brings you answers. SciFact brings you verification. SPECTER2 brings you similarity scores. We're building a kitchen that can cook all these dishes and remember every recipe forever.


7. The Honest Comparison Table {#7-the-honest-comparison-table}

Here's the full truth. Green means better, yellow means comparable, red means worse.

| Feature | PhD Research OS | PaperQA2 | FactReview | KGX3 | Paper Circle | AgentSLR |
|---|---|---|---|---|---|---|
| PDF parsing quality | 🔴 Basic | 🟢 GROBID | 🟡 Standard | N/A | 🟡 Standard | 🟡 Standard |
| Claim extraction | 🟡 Designed | 🟢 Proven superhuman | 🟢 Literature-grounded | 🔴 Paper-level only | 🟡 Concept-level | 🟡 Data extraction |
| Epistemic classification | 🟢 Unique: claim-level | 🔴 None | 🟡 Binary | 🟢 Paper-level | 🔴 None | 🔴 None |
| Knowledge graph | 🟢 Typed epistemic edges | 🔴 None | 🔴 None | 🔴 None | 🟢 Structural KG | 🔴 None |
| Contradiction detection | 🟡 Designed | 🟢 Benchmarked superhuman | 🟢 Literature-grounded | 🟡 Paradigm drift | 🔴 None | 🔴 None |
| Confidence scoring | 🟢 Formula-based, 3-score | 🔴 LLM-stated | 🔴 Binary | 🟡 Deterministic rules | 🟡 Basic confidence | 🔴 None |
| Human oversight | 🟢 Council + proposals | 🔴 None | 🟡 Reviewer-facing | 🔴 None | 🔴 None | 🟡 Human-in-loop |
| Privacy (local-first) | 🟢 All local | 🔴 Requires cloud | 🔴 Requires cloud | 🟡 Unknown | 🟡 Partial | 🔴 Requires cloud |
| Actually working today? | 🔴 Prototype | 🟢 Production | 🟡 Research | 🟡 Proprietary | 🟡 Open-source | 🟢 Open-source |
| Benchmarked against humans? | 🔴 No | 🟢 Yes, superhuman | 🟡 Partial | 🟡 Yes | 🔴 No | 🟢 Yes |

The honest reading: We have the best design for epistemic analysis and human oversight. PaperQA2 has the best working system. Our job is to close that gap: build what we've designed, using the best components from everyone else.


8. Essential Resources and Links {#8-essential-resources}

Systems to Study (Code Available)

| System | Install / Repository | Priority |
|---|---|---|
| PaperQA2 | pip install paper-qa · github.com/Future-House/paper-qa | 🔴 Study immediately |
| Paper Circle | github.com/MAXNORM8650/papercircle | 🟠 Study for KG pattern |
| AgentSLR | github.com/oxrml/agentslr | 🟡 Study for pipeline pattern |
| CLAIRE | github.com/stanford-oval/inconsistency-detection | 🟡 Study for contradiction detection |
| ASReview | github.com/asreview/asreview | 🟡 Study for screening UX |

Datasets to Use (Available on HuggingFace)

| Dataset | HuggingFace ID | What It's For | Priority |
|---|---|---|---|
| SciFact | bigbio/scifact | Benchmark for claim verification | 🔴 Use as evaluation benchmark |
| SciRIFF | allenai/SciRIFF | 137K expert training examples | 🔴 Use for training data |
| Evidence Inference | bigbio/evidence_inference | RCT outcome extraction | 🟠 Use for clinical paper training |
| QASPER | allenai/qasper | Full-text scientific QA | 🟡 Use for extraction training |
| SciRepEval | allenai/scirepeval | 6M+ paper embedding triplets | 🟡 Use for embedding fine-tuning |
| CiteWorth | copenlu/citeworth | Citation-worthy sentence detection | 🟡 Use for claim identification |

Models to Use (Available on HuggingFace)

| Model | HuggingFace ID | What It's For | Priority |
|---|---|---|---|
| SPECTER2 | allenai/specter2_base + adapters | Scientific text embeddings (Layer 3 dedup) | 🔴 Integrate immediately |
| Nougat | facebook/nougat-base | PDF → markdown with LaTeX (Layer 0) | 🔴 Integrate immediately |
| SciBERT | allenai/scibert_scivocab_uncased | Scientific NER and classification | 🟠 Use for entity resolution |
| SciBERT-NLI | gsarti/scibert-nli | Fast contradiction pre-filter | 🟠 Use in conflict detection |
| BiomedBERT | microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext | Biomedical text processing | 🟡 Use for bio-domain papers |

Key Papers to Read

| Paper | ArXiv ID | Why It Matters |
|---|---|---|
| PaperQA2 | 2409.13740 | Superhuman scientific QA; the benchmark to beat |
| SciFact | 2004.14974 | Defines the claim verification task |
| SciRIFF | 2406.07835 | Best training data for scientific extraction |
| KGX3 / iKuhn | 2002.03531 | Only published epistemic classification system |
| Paper Circle | 2604.06170 | Multi-agent KG construction pattern |
| FactReview | 2604.04074 | Execution-based claim verification |
| CLAIRE | 2509.23233 | Corpus-level contradiction detection |
| AgentSLR | 2603.22327 | End-to-end systematic review pipeline |
| MultiVerS | 2112.01640 | Best model for SUPPORTS/REFUTES |
| SciERC | 1808.09602 | Scientific entity + relation extraction |
| CritiCal | 2510.24505 | Critique-based confidence calibration |
| CLUE | 2505.17855 | Uncertainty source explanation |

How This Document Was Created

  1. Searched academic databases (arXiv, Semantic Scholar, HuggingFace Papers) for every system matching our 5 core capabilities
  2. Read the papers (not just abstracts, but methodology sections) to understand what each system actually does vs. what it claims to do
  3. Verified code availability: checked GitHub repos and HuggingFace models to confirm they're real, not vaporware
  4. Compared feature-by-feature against our SYSTEM_DESIGN.md
  5. Ranked by closeness to our system and quality of implementation
  6. Wrote everything in plain language so anyone can follow

This is a living document. As new systems appear, they should be added here.


Analysis conducted 2026-04-23 using research across arXiv, HuggingFace Hub, GitHub, and Semantic Scholar. Every system and resource link has been verified as of this date.