
PhD Research OS: Prior Art Analysis

Every System That Has Tried Something Similar, and How We Compare

Date: 2026-04-23
Status: COMPREHENSIVE (15 systems analyzed across 6 capability areas)
Written for: Anyone with a high school education
Purpose: Honest accounting of who has built what, so we know exactly where we stand


What Is This Document?

Before you build anything, you should check if someone already built it. That's not pessimism; it's good science. If someone already solved a problem, you don't need to solve it again. You can use their solution and spend your time on the parts nobody has cracked yet.

We searched through research papers, open-source code, commercial products, and HuggingFace repositories to find every system that does something similar to PhD Research OS. We found 15 significant systems. Here is what each one does, where it beats us, where we beat it, and what we should steal (with credit).

The bottom line: Nobody has built the complete system we're building. But every piece of our system exists somewhere else in isolation. The novel part is the combination, plus three specific capabilities that nobody else has published.


Table of Contents

  1. The Scoreboard: How Every System Compares
  2. Tier 1: The Closest Rivals
  3. Tier 2: Best-in-Class Components
  4. Tier 3: Specialist Tools We Should Use
  5. Tier 4: Commercial Systems for Context
  6. What Nobody Has Built Yet (Our Innovations)
  7. The Honest Comparison Table
  8. Essential Resources and Links

1. The Scoreboard: How Every System Compares {#1-the-scoreboard}

Our system has 5 core capabilities. Here's who else has each one:

| Capability | What It Means | Who Has It | Who Does It Best |
|---|---|---|---|
| 1. Extract claims from papers | Read a PDF and pull out the key findings | PaperQA2, FactReview, SciEx, AgentSLR | PaperQA2 (superhuman on benchmarks) |
| 2. Label each claim epistemically | Is this a Fact? An Interpretation? A Hypothesis? | KGX3 (paper-level only) | Nobody does claim-level ← Us |
| 3. Build a typed knowledge graph | Connect claims with labeled edges (supports, refutes, extends) | Paper Circle, SciIE/SciERC, ORKG | Paper Circle (multi-agent KG) |
| 4. Detect contradictions across papers | Find when Paper A disagrees with Paper B | PaperQA2, CLAIRE, FactReview | PaperQA2 (2.34× better than PhDs) |
| 5. Compute calibrated confidence scores | Use math formulas (not AI guessing) to rate trustworthiness | Nobody does formula-based | Nobody ← Us |

The key insight: Lots of systems can do capabilities 1, 3, and 4. A few can do capability 2 at the paper level. Nobody combines all 5 at the claim level. And nobody does capability 5 the way we do (code-computed, not LLM-guessed).


2. Tier 1: The Closest Rivals {#2-tier-1-closest-rivals}

These are the systems that come closest to what we're building. If you only read one section, read this one.


🥇 PaperQA2: The Gold Standard for Scientific Literature AI

Paper: Language Agents Achieve Superhuman Synthesis of Scientific Knowledge (2024)
Earlier version: PaperQA: Retrieval-Augmented Generative Agent (2023)
Code: github.com/Future-House/paper-qa (open source, pip install paper-qa)
Built by: FutureHouse (well-funded AI research lab)

What It Does (The Analogy)

Imagine you're writing a research paper and you have a brilliant research assistant. You ask them a question like "Does protein X bind to receptor Y?" They go to the library, pull out 50 relevant papers, read the important sections, highlight the key sentences, and come back with a summary that includes page numbers for everything they claim. If two papers disagree, they tell you about the contradiction.

That's PaperQA2. It's a multi-step AI agent that:

  1. Searches for relevant papers (like a librarian)
  2. Retrieves the most important chunks from each paper (like highlighting)
  3. Reranks those chunks by relevance using AI (like sorting your highlights)
  4. Summarizes each chunk in context (like writing margin notes)
  5. Generates a final answer with citations (like writing a paragraph with footnotes)
  6. Detects contradictions across papers (like spotting when two witnesses disagree)

How Good Is It?

This is the scary part. PaperQA2 has been benchmarked against real PhD-level scientists:

| Task | PaperQA2 Score | Human PhD Score | Winner |
|---|---|---|---|
| Answer questions about papers (LitQA2) | 85.2% precision | 73.8% precision | 🤖 PaperQA2 |
| Find contradictions in protein biology | 2.34× more found | Baseline | 🤖 PaperQA2 |
| Write Wikipedia-quality articles (WikiCrow) | 86.1% precision | 71.2% precision | 🤖 PaperQA2 |

Analogy: If paper-reading were a sport, PaperQA2 would be the Olympic champion. We're not competing directly; we're building a different sport (epistemic analysis), but we should learn from the champion's training methods.

What It Does Better Than Us

| Feature | PaperQA2 | PhD Research OS | What This Means |
|---|---|---|---|
| PDF parsing | GROBID (professional, battle-tested) | PyMuPDF (basic) | Their input quality is much higher |
| Retrieval quality | Dense retrieval + LLM reranking (RCS) | No retrieval system | They find the right sections faster |
| Contradiction detection | Benchmarked, superhuman | Designed but not built | They can prove it works; we can't yet |
| "I don't know" ability | Refuses when evidence is insufficient | No refusal mechanism | They're honest about gaps; we aren't yet |
| Citation traversal | Follows reference chains automatically | Citation resolution designed | They do it; we plan to do it |

What We Do Better Than PaperQA2

| Feature | PhD Research OS | PaperQA2 | What This Means |
|---|---|---|---|
| Epistemic labels | Fact / Interpretation / Hypothesis | None (flat claims) | We tell you WHAT KIND of claim it is |
| Knowledge graph | Typed edges (supports/refutes/extends) | No persistent graph | We build a web of connected knowledge |
| Calibrated scoring | 3-score code-computed formula | LLM-stated confidence | Our confidence numbers are auditable |
| Local-first privacy | Paper data never leaves your computer | Requires cloud API calls | We protect unpublished research |
| Human oversight design | Council votes, proposals, manual mode | Answer generation only | We're built for verification, not just answers |

The Big Lesson from PaperQA2

Their secret weapon is RCS (Rerank + Contextual Summarize). Before answering a question, they have the AI rerank all retrieved chunks by relevance, then write a contextual summary of each chunk. This two-step filter removes noise before the final answer is generated.

Analogy: Imagine you're studying for an exam. Method A: read all your notes and try to answer from memory. Method B: first sort your notes by relevance to the question, then write a one-sentence summary of each relevant note, then answer using only the summaries. Method B is dramatically better. That's RCS.

We should adopt this. Our Layer 2 extraction pipeline would benefit enormously from an RCS-style pre-filtering step before the AI Council processes each section.
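
To make this concrete, here is a minimal sketch of what an RCS-style pre-filter could look like in our Layer 2 pipeline. It assumes a generic `llm(prompt) -> str` callable and illustrative prompts; this is our sketch of the idea, not PaperQA2's actual implementation.

```python
# RCS-style (Rerank + Contextual Summarize) pre-filter sketch.
# `llm(prompt) -> str` is a placeholder for whatever model call the
# pipeline uses; the prompts are illustrative, not PaperQA2's.

def rcs_filter(question: str, chunks: list[str], llm, top_k: int = 5) -> list[str]:
    """Rerank chunks by relevance, then contextually summarize the survivors."""
    # Step 1 (rerank): ask the model to score each chunk's relevance 0-10.
    scored = []
    for chunk in chunks:
        reply = llm(f"Rate 0-10 how relevant this passage is to the question.\n"
                    f"Question: {question}\nPassage: {chunk}\nScore:")
        try:
            score = float(reply.strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0  # unparseable reply counts as irrelevant
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    survivors = [chunk for _, chunk in scored[:top_k]]

    # Step 2 (contextual summarize): distill each survivor relative to the
    # question, so downstream steps see focused notes instead of raw text.
    return [llm(f"Summarize this passage only as it relates to the question.\n"
                f"Question: {question}\nPassage: {chunk}")
            for chunk in survivors]
```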


🥈 FactReview: The Closest to Claim-Level Verification

Paper: FactReview: Evidence-Grounded Reviews with Literature Positioning (2026)

What It Does (The Analogy)

Imagine a very thorough peer reviewer. They don't just read your paper; they go find all the papers you cited, plus papers you DIDN'T cite, and check: "Does the existing literature actually support what this paper claims?" For every claim, they write: "SUPPORTS: three papers agree" or "REFUTES: this contradicts Smith 2023" or "RELATED: similar work, but not directly comparable."

And here's the clever part: if your paper says "our model achieves 95% accuracy," FactReview actually runs your code to check if that's true. It does what most human reviewers can't do: it verifies empirical claims by execution.

What It Does Better Than Us

  • Execution-based verification: It can actually run code to check if claimed results are real. We have no equivalent. This is like a teacher not just reading your math homework but re-doing the calculations.
  • Literature positioning: For each claim, it retrieves relevant prior work and labels the relationship (SUPPORTS / REFUTES / RELATED). This is close to our knowledge graph edges.

What We Do Better

  • Persistent knowledge graph: FactReview checks one paper at a time. It doesn't build a lasting web of knowledge that grows as you add more papers.
  • Epistemic classification: FactReview labels relationships between claims (supports/refutes), but doesn't label the claims themselves (Fact vs Hypothesis).
  • Calibrated scoring: FactReview says "supports" or "refutes" (binary). We give a 3-dimensional confidence score.

The Big Lesson from FactReview

Dual evidence is better than single evidence. They check claims against BOTH the manuscript itself AND external literature. We should adopt this: when extracting a claim, check it against both (a) the specific paper and (b) the existing knowledge graph.
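
A minimal sketch of that dual-evidence idea, assuming two hypothetical lookup hooks (`find_in_paper` standing in for our Layer 2 extraction, `find_in_graph` for the knowledge graph); the verdict labels are illustrative, not FactReview's:

```python
def dual_evidence_verdict(claim: str, paper_id: str,
                          find_in_paper, find_in_graph) -> str:
    """Combine manuscript-internal and corpus-external evidence for one claim."""
    internal = find_in_paper(paper_id, claim)  # e.g., a table or figure backing it
    external = find_in_graph(claim)            # e.g., supporting claims from prior papers
    if internal and external:
        return "corroborated"   # backed by the paper AND the existing graph
    if internal:
        return "novel"          # backed internally, no prior-art signal yet
    if external:
        return "imported"       # asserted without in-paper evidence
    return "unsupported"        # flag for the AI Council / human review
```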


🥉 KGX3 / iKuhn: The Closest Epistemic System

Paper: A Novel Kuhnian Ontology for Epistemic Classification (2020)
System: Formerly called Soph.io, then iKuhn, now KGX3 / Preprint Watch

What It Does (The Analogy)

Imagine a very opinionated professor who reads every paper and immediately says: "This paper confirms what we already knew," or "This paper challenges the established theory," or "This paper introduces a completely new method." They don't just read the claims; they classify the paper's ROLE in the field's history.

KGX3 does this automatically using a philosophy framework from Thomas Kuhn (a famous historian of science). It classifies every paper using a three-part code:

  • M (Methodological): Did they reuse an old method, adapt one, or invent a new one?
  • N (Observational): Were the results expected, unexpected, or truly anomalous?
  • P (Positional): Does this confirm, extend, or challenge existing knowledge?

So a paper might be classified as (Novel, Unexpected, Challenges): a new method found something nobody predicted that contradicts the consensus. That's a paradigm-shifting paper.

Why This Matters for Us

KGX3 is the ONLY system we found that does formal epistemic classification: labeling the knowledge-status of scientific content using rules and formulas, not AI guessing. That's exactly our philosophy!

But there's a critical difference: KGX3 classifies whole papers. We classify individual claims within papers. This is like the difference between rating a whole restaurant (KGX3) versus rating each dish on the menu (us). Our approach is much more granular and much harder.

The Big Lesson from KGX3

Their deterministic language-game filters are brilliant. Instead of asking an AI "is this a confirmation or a challenge?", they have specific word patterns and rules that detect each epistemic category. For example, phrases like "consistent with previous findings" trigger the "Confirms" filter, while "contrary to expectations" triggers the "Challenges" filter.

The activation threshold (θ = 0.7) is validated across multiple scientific disciplines. This is exactly the kind of empirically grounded rule that our scoring formulas need.

We should adopt the principle: define specific linguistic triggers for each epistemic category (Fact / Interpretation / Hypothesis) and use them as code-based validators, not just AI prompts.
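
A minimal sketch of that principle applied to our three labels. The trigger phrases, weights, and abstention behavior are illustrative placeholders; only the idea of code-based filters with an activation threshold is borrowed from KGX3:

```python
import re

# Deterministic "language-game" filters for claim-level labels (sketch).
# Phrases and weights below are illustrative, not KGX3's actual rule set.
TRIGGERS = {
    "fact":           [(r"\bwe measured\b", 0.9), (r"\bp\s*[<=]\s*0?\.\d+", 0.8)],
    "interpretation": [(r"\bsuggests?\b", 0.8), (r"\bconsistent with\b", 0.75)],
    "hypothesis":     [(r"\bwe hypothesi[sz]e\b", 0.9), (r"\bmight\b", 0.7)],
}

def classify_claim(sentence: str, theta: float = 0.7) -> str | None:
    """Return the label of the strongest firing trigger, or None below theta."""
    text = sentence.lower()
    best_label, best_score = None, 0.0
    for label, patterns in TRIGGERS.items():
        for pattern, weight in patterns:
            if re.search(pattern, text) and weight > best_score:
                best_label, best_score = label, weight
    # Below the activation threshold we abstain and defer to the AI Council.
    return best_label if best_score >= theta else None
```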


Paper Circle: The Closest Multi-Agent KG System

Paper: Paper Circle: An Open-source Multi-agent Research Discovery Framework (2025)
Code: github.com/MAXNORM8650/papercircle

What It Does (The Analogy)

Imagine a team of research assistants, each with a different specialty:

  • Assistant 1 reads papers and identifies key concepts
  • Assistant 2 identifies the methods used
  • Assistant 3 extracts experimental results
  • Assistant 4 links figures and tables to the right concepts
  • A manager coordinates all four and checks that nothing was missed

Together, they build a knowledge graph: a web of connected concepts, methods, experiments, and datasets.

Paper Circle does exactly this with AI agents. Their architecture looks like this:

CodeAgent (the manager)
  → Concept Extractor (finds ideas and theories)
  → Method Extractor (finds algorithms and techniques)
  → Experiment Extractor (finds setups, datasets, metrics, results)
  → Linkage Agent (connects figures/tables to concepts/methods)
  → Coverage Checker (makes sure nothing important was missed)

What We Should Learn

  1. The Coverage Checker is genius. After all agents have done their work, a final agent checks: "Did we miss any important sections? Are there figures without linked concepts? Are there methods mentioned in the text that we didn't extract?" This prevents the silent omission problem, where the system looks complete but actually missed something important.

  2. Their provenance model is solid. Every node in their knowledge graph carries: source chunk IDs, page numbers, verification status, confidence scores, and timestamps. This is very close to our Layer 7 design.
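
A minimal sketch of a provenance-carrying node in that spirit; the field names are ours (and hypothetical), not Paper Circle's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ClaimNode:
    """A knowledge-graph node that always knows where it came from."""
    claim_id: str
    text: str
    source_chunk_ids: list[str]               # which extracted chunks produced it
    page_numbers: list[int]                   # where in the PDF it appears
    verification_status: str = "unverified"   # unverified / council-approved / human-approved
    confidence: float | None = None           # filled in later by the scoring layer
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```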

Where We Differ

Paper Circle builds a structural knowledge graph (concepts, methods, experiments connected by "used-in" and "part-of" edges). We build an epistemic knowledge graph (claims connected by "supports," "refutes," and "extends" edges). Think of it this way: Paper Circle maps the territory of science (what exists). We map the arguments of science (what agrees, what disagrees, what's uncertain).


AgentSLR: The Best End-to-End Literature Review System

Paper: AgentSLR: Automating Systematic Literature Reviews with Agentic AI (2025)
Code: github.com/oxrml/agentslr

What It Does (The Analogy)

Imagine you're a doctor who needs to know: "Does drug X help with disease Y?" To answer properly, you need to do a systematic review: find EVERY study about this question, check which ones are good quality, extract the key numbers from each one, and synthesize a conclusion.

This normally takes a team of researchers 7 weeks. AgentSLR does it in 20 hours. And it performs at human level.

The Numbers

  • 58× faster than human systematic reviews
  • Applied to 9 WHO-priority pathogens (real-world medical task)
  • Open-source with human-in-the-loop hooks

What We Should Learn

AgentSLR is the proof that AI can process scientific literature at scale AND match human quality. But it doesn't build knowledge graphs, doesn't classify epistemic status, and doesn't detect contradictions. It's a pipeline (find → screen → extract → report), not an engine (continuously building and updating a knowledge web).

Our role is different: AgentSLR answers one question at a time. We build a growing knowledge base that answers every future question about the papers you've already processed. Think of AgentSLR as a very fast research assistant. We're building a very smart filing cabinet.


CLAIRE (Stanford): The Best Contradiction Detector

Paper: Detecting Corpus-Level Knowledge Inconsistencies with LLMs (2024)
Code: github.com/stanford-oval/inconsistency-detection

What It Does (The Analogy)

Wikipedia has millions of articles. Some of them contradict each other: Article A says one thing, Article B says the opposite. Nobody has time to read all of Wikipedia looking for contradictions. CLAIRE does.

It works like a detective:

  1. Extract claims from one article
  2. Search the rest of the corpus for related content
  3. Compare using AI reasoning: do these agree or disagree?
  4. Flag contradictions for human editors to review

When tested with real Wikipedia editors, the editors said they were more confident and found more issues when using CLAIRE than working alone.

What We Should Learn

CLAIRE's agentic RAG loop (extract → retrieve → compare → flag) maps directly to our conflict detection pipeline. The key insight is that you should retrieve evidence BEFORE making a contradiction judgment: don't just compare two claims in isolation, but also look at what other sources say about both claims.

Their domain is Wikipedia, not science. Scientific contradictions are harder because methods matter: two papers might look contradictory but actually used different experimental conditions. We handle this with our method-compatibility layer. CLAIRE doesn't have this.
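
A minimal sketch of that loop with our method-compatibility gate bolted on. `retrieve`, `methods_compatible`, and `nli_verdict` are hypothetical hooks into our retrieval, metadata, and NLI components; this is our reading of CLAIRE's pattern, not its code:

```python
def find_contradictions(claims, retrieve, methods_compatible, nli_verdict):
    """Extract -> retrieve -> compare -> flag, with a method-compatibility gate."""
    flags = []
    for claim in claims:                          # 1. extract (done upstream)
        for candidate in retrieve(claim.text):    # 2. retrieve related claims + context
            # Gate: claims from incompatible experimental setups are not a
            # real contradiction, just different conditions.
            if not methods_compatible(claim, candidate):
                continue
            # 3. compare: judgment made WITH retrieved context, not in isolation.
            if nli_verdict(claim.text, candidate.text) == "contradiction":
                flags.append((claim, candidate))  # 4. flag for human review
    return flags
```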


3. Tier 2: Best-in-Class Components {#3-tier-2-best-in-class-components}

These aren't competing systems; they're the best tools for specific sub-problems. We should use them directly.


SciFact + MultiVerS: The Standard for Claim Verification

Paper (SciFact): Fact or Fiction: Verifying Scientific Claims (2020)
Paper (MultiVerS): MultiVerS: Improving scientific claim verification (2021)
Dataset: bigbio/scifact on HuggingFace
Code: github.com/allenai/multivers
Built by: AllenAI (the Allen Institute for AI, one of the best AI research labs in the world)

What It Is (The Analogy)

SciFact is like a standardized test for claim-checking AI. It contains 1,400 scientific claims written by experts, and for each claim, the "answer key" tells you which research abstracts SUPPORT it, REFUTE it, or provide NO INFORMATION.

MultiVerS is the best model for taking this test. It reads a claim AND a full research abstract at the same time, then outputs: (a) does this abstract support, refute, or have nothing to say about this claim? and (b) which specific sentences in the abstract are the evidence?

Why This Matters for Us

SciFact defines the SUPPORTS / REFUTES / NOINFO label scheme that our knowledge graph edges are based on. It's the industry standard. Our system should be benchmarkable against SciFact: if we can't match MultiVerS scores on this standard test, our extraction pipeline isn't good enough.

How to use it: bigbio/scifact is available on HuggingFace Hub. Each example has a claim text, evidence documents, labels (SUPPORTS/REFUTES/NOINFO), and rationale sentences.
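
A minimal loading sketch with the `datasets` library. The config name below is an assumption (bigbio datasets usually expose several configs); check the dataset card before running:

```python
from datasets import load_dataset

# Config name is an assumption; see the bigbio/scifact dataset card.
claims = load_dataset("bigbio/scifact", name="scifact_claims_source",
                      split="train", trust_remote_code=True)

for example in claims.select(range(3)):
    print(example)  # claim text, evidence docs, SUPPORTS/REFUTES/NOINFO labels
```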


SciRIFF: The Best Training Data for Scientific Tasks

Paper: SciRIFF: Enhancing LM Instruction-Following for Scientific Literature (2024)
Dataset: allenai/SciRIFF on HuggingFace
Built by: AllenAI

What It Is (The Analogy)

Imagine you want to teach someone to be a research assistant. SciRIFF is the textbook. It contains 137,000 practice exercises across 54 different scientific tasks: extracting information, verifying claims, summarizing papers, answering questions, classifying text.

Each exercise was written by human experts, not generated by AI. This matters because AI-generated training data teaches the model AI-style shortcuts, not real scientific reasoning.

Why This Matters for Us

Our current training data is 1,900 synthetic examples generated by a Python script. SciRIFF has 137,000 expert-written examples. That's 72× more data, and it's real, not fake.

Key finding from the SciRIFF paper: Expert-written templates perform significantly better than GPT-4-generated templates. This validates our concern about synthetic training data β€” real human-written examples are genuinely better.

Training recipe from the paper: 5 epochs, learning rate 2e-5, BF16, batch size 128, sequence length 4096. 7B models reach ~70% on scientific tasks; you need 70B for 77%+.
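
That recipe, expressed as HuggingFace `TrainingArguments` (a sketch: the per-device/accumulation split is our assumption for reaching batch size 128, and sequence length 4096 is enforced at tokenization, not here):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sciriff-finetune",
    num_train_epochs=5,               # SciRIFF recipe: 5 epochs
    learning_rate=2e-5,               # SciRIFF recipe: 2e-5
    bf16=True,                        # SciRIFF recipe: BF16
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = effective batch size 128 on one GPU
)
```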


SPECTER2: The Best Scientific Paper Embeddings

Models: allenai/specter2_base + task-specific adapters on HuggingFace
Built by: AllenAI

What It Is (The Analogy)

When you hear a song, your brain instantly converts it into a "feeling": happy, sad, energetic. SPECTER2 does the same thing for scientific text. It converts any scientific sentence, abstract, or paper into a list of numbers (called an "embedding") that captures its meaning.

Two sentences that mean the same thing, even if they use completely different words, will get similar numbers. "The detection limit was 0.8 fM" and "We achieved femtomolar-level sensitivity" would have very similar embeddings.

Why This Matters for Us

Our deduplication system (Layer 3) currently uses word overlap, which misses paraphrases completely. SPECTER2 would fix this instantly. It's specifically trained on scientific text, so it understands domain-specific terminology.

Available adapters:

| Adapter | HuggingFace ID | Use Case |
|---|---|---|
| Proximity/Retrieval | allenai/specter2 | Finding similar claims (deduplication) |
| Classification | allenai/specter2_classification | Topic/field classification |
| Ad-hoc Search | allenai/specter2_adhoc_query | Searching claims by query |
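
A deduplication sketch using the proximity adapter, following the usage shown on the SPECTER2 model card (it assumes the `adapters` package, the successor to adapter-transformers; verify the calls against the card for your installed version):

```python
import torch
from transformers import AutoTokenizer
from adapters import AutoAdapterModel  # pip install adapters

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
model.load_adapter("allenai/specter2", source="hf",
                   load_as="proximity", set_active=True)

claims = ["The detection limit was 0.8 fM.",
          "We achieved femtomolar-level sensitivity."]
inputs = tokenizer(claims, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0]  # CLS vector per claim

similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity:.3f}")  # high score -> likely duplicates
```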

SciIE / SciERC: The Standard for Scientific KG Construction

Paper: Multi-Task IE for Scientific Knowledge Graph Construction (2018)
Dataset: SciERC (500 annotated abstracts with entities, relations, and coreference)
Built by: AllenAI

What It Is (The Analogy)

SciERC is like a labeled map of how scientific concepts connect to each other. In 500 research paper abstracts, human experts marked:

  • Entities: What things are mentioned? (Tasks, Methods, Metrics, Materials)
  • Relations: How do they connect? (used-for, feature-of, part-of, compare)
  • Coreference: When do two mentions refer to the same thing?

SciIE is the AI model that learned from this map and can now automatically build knowledge graphs from new papers.

Why This Matters for Us

The SciERC relation types (used-for, feature-of, part-of) are structural: they describe how things relate physically or functionally. Our edge types (supports, refutes, extends) are epistemic: they describe how claims relate evidentially.

We need both. A complete knowledge graph should know that "Method A was used-for Task B" (structural) AND "Claim X from Paper 1 supports Claim Y from Paper 2" (epistemic). We should use SciERC's taxonomy as the starting point for our structural edges, alongside our existing epistemic edges.
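
A minimal sketch of a graph holding both layers at once, using networkx; node names, attributes, and scores are illustrative only:

```python
import networkx as nx

G = nx.MultiDiGraph()

# Structural layer (SciERC taxonomy): how things relate functionally.
G.add_edge("Method A", "Task B", relation="used-for", layer="structural")

# Epistemic layer (ours): how claims relate evidentially.
G.add_edge("claim:paper1/x", "claim:paper2/y",
           relation="supports", layer="epistemic", confidence=0.82)
G.add_edge("claim:paper3/z", "claim:paper2/y",
           relation="refutes", layer="epistemic", confidence=0.64)

# Query just the disagreements:
refutations = [(u, v) for u, v, d in G.edges(data=True)
               if d["relation"] == "refutes"]
print(refutations)  # [('claim:paper3/z', 'claim:paper2/y')]
```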


SciBERT & BiomedBERT: Domain-Specific Language Models

| Model | HuggingFace ID | What It's Good For |
|---|---|---|
| SciBERT | allenai/scibert_scivocab_uncased | General scientific NLP (NER, classification) |
| BiomedBERT | microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext | Biomedical text (best on BLURB benchmark) |
| SciBERT-NLI | gsarti/scibert-nli | Scientific entailment/contradiction detection |

Why These Matter for Us

These are smaller, faster models trained specifically on scientific text. They won't replace our main Qwen brain, but they can serve as lightweight validators. For example:

  • Use SciBERT-NLI as a fast pre-filter for contradiction detection (check if two claims even MIGHT contradict before sending to the expensive main model); see the sketch after this list
  • Use SciBERT for scientific NER to help with entity resolution (Layer 1)
  • Use SPECTER2 + SciBERT together for claim embedding (Layer 3)
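
A sketch of that pre-filter. Note the stand-in: we use a general-purpose NLI cross-encoder here because its input/output contract is well documented; wiring in gsarti/scibert-nli itself may need a different head or setup, so check its model card first:

```python
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
LABELS = ["contradiction", "entailment", "neutral"]  # this model's documented order

def might_contradict(claim_a: str, claim_b: str) -> bool:
    """Cheap gate: only pairs flagged here go to the expensive main model."""
    scores = nli.predict([(claim_a, claim_b)])[0]
    return LABELS[scores.argmax()] == "contradiction"

print(might_contradict("The detection limit was 0.8 fM.",
                       "The method could not detect femtomolar concentrations."))
```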

4. Tier 3: Specialist Tools We Should Use {#4-tier-3-specialist-tools}

PDF Parsing

| Tool | What It Does | Analogy |
|---|---|---|
| Nougat (facebook/nougat-base) | Converts scientific PDFs to markdown with proper LaTeX equations | A translator who speaks both "PDF" and "structured text" and is especially good at math |
| GROBID (github.com/kermitt2/grobid) | Extracts authors, citations, references, and paper structure | A librarian who catalogs every detail about a book |
| Marker (github.com/VikParuchuri/marker) | ML-based PDF → markdown with layout awareness | A reader who understands that columns, captions, and footnotes are different things |

Our plan already includes all three. The SYSTEM_DESIGN.md correctly specifies Marker + Nougat + GROBID. What we don't have yet is the actual integration code.

Scientific NER (Named Entity Recognition)

| Tool | Paper | What It Does |
|---|---|---|
| GLiNER-biomed | arxiv:2504.00676 | Zero-shot biomedical entity recognition; can find ANY type of entity without being specifically trained for it |
| SciER | arxiv:2410.21155 | Extracts datasets, methods, and tasks from full-text papers; directly feeds into knowledge graph nodes |

Fact-Checking Systems

| System | Paper | Key Innovation |
|---|---|---|
| ClaimVer | arxiv:2403.09724 | KG-backed claim verification with natural language explanations; generates human-readable reasoning for why a claim is true/false |
| CLUE | arxiv:2505.17855 | Explains WHERE uncertainty comes from (inter-evidence conflicts vs. agreement); exactly what our 3-score system tries to capture |
| PubHealth | arxiv:2010.09926 | Explainable fact-checking for health claims; proof that domain-specific fact-checking outperforms general-purpose |

Confidence Calibration

| System | Paper | Key Innovation |
|---|---|---|
| CritiCal | arxiv:2510.24505 | Natural language critiques improve LLM calibration: the AI writes WHY it's uncertain, and this improves the accuracy of its confidence scores |
| MetaFaith | arxiv:2505.24858 | Prompt-based faithful uncertainty: teaches the model to express uncertainty with hedging words that actually correspond to its real confidence |
| Calibration-Tuning | arxiv:2406.08391 | Fine-tune on correct/incorrect answer pairs so the model learns when it's likely to be wrong |

Systematic Review Tools

| Tool | Type | Best For |
|---|---|---|
| ASReview (github.com/asreview/asreview) | Open-source | Active learning for paper screening; learns your preferences to surface the most relevant papers first |
| LatteReview (arxiv:2501.05468) | Open-source | Multi-agent systematic review framework with Pydantic validation |
| Bio-SIEVE (arxiv:2308.06610) | Open-source | Instruction-tuned LLaMA for abstract screening |

5. Tier 4: Commercial Systems for Context {#5-tier-4-commercial-systems}

These are commercial products that researchers actually use today. We're not competing with them directly (they answer questions, we build knowledge maps), but we should understand what researchers are already used to.

| System | What It Does | Strengths | What It Can't Do (That We Can) |
|---|---|---|---|
| Semantic Scholar (semanticscholar.org) | Citation graph, paper search, TLDR summaries, SPECTER embeddings | Massive database (200M+ papers), free API, open data | No claim extraction, no epistemic labels, no contradiction detection |
| Elicit (elicit.com) | AI research assistant: finds papers, extracts data, answers questions | Clean UI, popular with researchers | Benchmarked as inferior to PaperQA2; no knowledge graph, no epistemic classification |
| Consensus (consensus.app) | "Consensus meter" showing if papers agree | Good at showing agreement/disagreement at topic level | No claim-level analysis, no confidence decomposition, no public API |
| ResearchRabbit (researchrabbit.ai) | Citation network visualization | Beautiful graph visualization, free | No claim analysis, no extraction, just citation mapping |
| Litmaps (litmaps.com) | Citation mapping with timeline | Good discovery tool | No analysis beyond citation connections |
| ORKG (orkg.org) | Open Research Knowledge Graph: human-contributed structured paper comparisons | Structured comparison tables created by researchers | Manual input only; no automated extraction |
| SciScore (sciscore.com) | Methods rigor assessment: checks if papers report blinding, randomization, statistics properly | Addresses an important research quality dimension | Very narrow focus: just methods reporting, not claims or knowledge |
| Iris.ai (iris.ai) | Evidence mapping, relevance scoring | Nice relevance visualizations | No epistemic classification, no public API |

6. What Nobody Has Built Yet (Our Innovations) {#6-what-nobody-has-built-yet}

After analyzing all 15 systems, here are the things that exist nowhere else:

Innovation 1: Claim-Level Epistemic Classification

What it is: Labeling EACH CLAIM (not each paper) as Fact, Interpretation, or Hypothesis.

Who comes close: KGX3 classifies whole papers using a Kuhnian triplet (M, N, P). But a single paper contains dozens of claims: some are facts (direct measurements), some are interpretations (the authors' explanation of the measurements), and some are hypotheses (what the authors think might be true). KGX3 can't distinguish these within a single paper.

Analogy: KGX3 rates a whole restaurant (4 stars). We rate each dish on the menu (appetizer: 5 stars, soup: 3 stars, dessert: 4 stars). Both are useful, but ours gives you much more detail.

Why nobody has done this: It's HARD. The same sentence can be a Fact in one context and an Interpretation in another. "The treatment worked" is a Fact in the Results section (because it's backed by data) and an Interpretation in the Abstract (because it might overstate what the data actually shows). You need section-awareness, qualifier detection, statistical evidence extraction, and a rules-based engine to do this correctly.
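
A toy sketch of why the rules engine needs section-awareness; every rule and threshold here is illustrative, not our actual Layer 2 logic:

```python
HEDGES = ("may", "might", "could", "possibly", "we hypothesize")

def label_claim(sentence: str, section: str, has_statistics: bool) -> str:
    """Same sentence, different label depending on context (sketch)."""
    text = sentence.lower()
    if any(h in text for h in HEDGES):
        return "Hypothesis"        # explicit hedging wins everywhere
    if section == "results" and has_statistics:
        return "Fact"              # data-backed statement in the Results section
    return "Interpretation"        # same words elsewhere carry less evidence

# "The treatment worked": Fact in Results (with stats), Interpretation in Abstract.
print(label_claim("The treatment worked", "results", has_statistics=True))
print(label_claim("The treatment worked", "abstract", has_statistics=False))
```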

Innovation 2: Code-Computed Calibrated Confidence

What it is: Using math formulas in code (not AI guessing) to compute how much you should trust each claim. Our 3-score system (evidence quality × truth likelihood × qualifier strength) is computed by Python code, not by asking the AI "how confident are you?"

Who comes close: SAFE (Google's factuality system) checks claims against search results, but its scores are binary (supported/not supported). CritiCal and MetaFaith improve LLM calibration, but they still rely on the LLM's self-assessed confidence. KGX3 uses deterministic rules but only at paper level.

Analogy: Other systems ask a student "how well do you think you did on the test?" and trust their answer. We give the test to 3 different graders, multiply their scores using a fixed formula, and add penalty points for unclear handwriting. The student never touches the grade book.

Why nobody has done this: Most AI systems want to be end-to-end (input goes in, output comes out, no human-designed formulas in between). Our approach deliberately breaks this by inserting code-computed gates between AI outputs and final confidence numbers. This requires more engineering but produces auditable, reproducible scores.
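
A minimal sketch of the principle: sub-scores go into a fixed formula, and the model never writes its own grade. The weights and hedging penalty are illustrative, not our production formulas:

```python
def confidence(evidence_quality: float, truth_likelihood: float,
               qualifier_strength: float, n_hedges: int = 0) -> float:
    """All inputs in [0, 1]; each hedging word deducts a fixed penalty."""
    base = evidence_quality * truth_likelihood * qualifier_strength
    penalty = 0.05 * n_hedges          # every "may"/"might" costs 0.05
    return round(max(0.0, base - penalty), 3)

# Auditable: same inputs, same score, every time.
print(confidence(0.9, 0.8, 0.7, n_hedges=1))  # 0.454
```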

Innovation 3: The Integrated 7-Layer Local-First Pipeline

What it is: A complete system that does parse → resolve → extract → dedup → graph → score → evaluate → export, all running locally on your laptop without sending data to the cloud.

Who comes close: PaperQA2 does search → retrieve → summarize → answer, but requires cloud APIs and doesn't build a persistent knowledge graph. Paper Circle builds a knowledge graph but doesn't score claims or detect contradictions. AgentSLR does systematic reviews end-to-end but doesn't build any persistent knowledge structure.

Analogy: Other systems are like food delivery apps: each one brings you a different dish. PaperQA2 brings you answers. SciFact brings you verification. SPECTER2 brings you similarity scores. We're building a kitchen that can cook all these dishes and remember every recipe forever.


7. The Honest Comparison Table {#7-the-honest-comparison-table}

Here's the full truth. Green means better, yellow means comparable, red means worse.

| Feature | PhD Research OS | PaperQA2 | FactReview | KGX3 | Paper Circle | AgentSLR |
|---|---|---|---|---|---|---|
| PDF parsing quality | 🔴 Basic | 🟢 GROBID | 🟡 Standard | N/A | 🟡 Standard | 🟡 Standard |
| Claim extraction | 🟡 Designed | 🟢 Proven superhuman | 🟢 Literature-grounded | 🔴 Paper-level only | 🟡 Concept-level | 🟡 Data extraction |
| Epistemic classification | 🟢 Unique: claim-level | 🔴 None | 🟡 Binary | 🟢 Paper-level | 🔴 None | 🔴 None |
| Knowledge graph | 🟢 Typed epistemic edges | 🔴 None | 🔴 None | 🔴 None | 🟢 Structural KG | 🔴 None |
| Contradiction detection | 🟡 Designed | 🟢 Benchmarked superhuman | 🟢 Literature-grounded | 🟡 Paradigm drift | 🔴 None | 🔴 None |
| Confidence scoring | 🟢 Formula-based, 3-score | 🔴 LLM-stated | 🔴 Binary | 🟡 Deterministic rules | 🟡 Basic confidence | 🔴 None |
| Human oversight | 🟢 Council + proposals | 🔴 None | 🟡 Reviewer-facing | 🔴 None | 🔴 None | 🟡 Human-in-loop |
| Privacy (local-first) | 🟢 All local | 🔴 Requires cloud | 🔴 Requires cloud | 🟡 Unknown | 🟡 Partial | 🔴 Requires cloud |
| Actually working today? | 🔴 Prototype | 🟢 Production | 🟡 Research | 🟡 Proprietary | 🟡 Open-source | 🟢 Open-source |
| Benchmarked against humans? | 🔴 No | 🟢 Yes, superhuman | 🟡 Partial | 🟡 Yes | 🔴 No | 🟢 Yes |

The honest reading: We have the best design for epistemic analysis and human oversight. PaperQA2 has the best working system. Our job is to close that gap: build what we've designed, using the best components from everyone else.


8. Essential Resources and Links {#8-essential-resources}

Systems to Study (Code Available)

| System | Install / Repository | Priority |
|---|---|---|
| PaperQA2 | pip install paper-qa · github.com/Future-House/paper-qa | 🔴 Study immediately |
| Paper Circle | github.com/MAXNORM8650/papercircle | 🟠 Study for KG pattern |
| AgentSLR | github.com/oxrml/agentslr | 🟡 Study for pipeline pattern |
| CLAIRE | github.com/stanford-oval/inconsistency-detection | 🟡 Study for contradiction detection |
| ASReview | github.com/asreview/asreview | 🟡 Study for screening UX |

Datasets to Use (Available on HuggingFace)

| Dataset | HuggingFace ID | What It's For | Priority |
|---|---|---|---|
| SciFact | bigbio/scifact | Benchmark for claim verification | 🔴 Use as evaluation benchmark |
| SciRIFF | allenai/SciRIFF | 137K expert training examples | 🔴 Use for training data |
| Evidence Inference | bigbio/evidence_inference | RCT outcome extraction | 🟠 Use for clinical paper training |
| QASPER | allenai/qasper | Full-text scientific QA | 🟡 Use for extraction training |
| SciRepEval | allenai/scirepeval | 6M+ paper embedding triplets | 🟡 Use for embedding fine-tuning |
| CiteWorth | copenlu/citeworth | Citation-worthy sentence detection | 🟡 Use for claim identification |

Models to Use (Available on HuggingFace)

| Model | HuggingFace ID | What It's For | Priority |
|---|---|---|---|
| SPECTER2 | allenai/specter2_base + adapters | Scientific text embeddings (Layer 3 dedup) | 🔴 Integrate immediately |
| Nougat | facebook/nougat-base | PDF → markdown with LaTeX (Layer 0) | 🔴 Integrate immediately |
| SciBERT | allenai/scibert_scivocab_uncased | Scientific NER and classification | 🟠 Use for entity resolution |
| SciBERT-NLI | gsarti/scibert-nli | Fast contradiction pre-filter | 🟠 Use in conflict detection |
| BiomedBERT | microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext | Biomedical text processing | 🟡 Use for bio-domain papers |

Key Papers to Read

| Paper | ArXiv ID | Why It Matters |
|---|---|---|
| PaperQA2 | 2409.13740 | Superhuman scientific QA; the benchmark to beat |
| SciFact | 2004.14974 | Defines the claim verification task |
| SciRIFF | 2406.07835 | Best training data for scientific extraction |
| KGX3 / iKuhn | 2002.03531 | Only published epistemic classification system |
| Paper Circle | 2604.06170 | Multi-agent KG construction pattern |
| FactReview | 2604.04074 | Execution-based claim verification |
| CLAIRE | 2509.23233 | Corpus-level contradiction detection |
| AgentSLR | 2603.22327 | End-to-end systematic review pipeline |
| MultiVerS | 2112.01640 | Best model for SUPPORTS/REFUTES |
| SciERC | 1808.09602 | Scientific entity + relation extraction |
| CritiCal | 2510.24505 | Critique-based confidence calibration |
| CLUE | 2505.17855 | Uncertainty source explanation |

How This Document Was Created

  1. Searched academic databases (arXiv, Semantic Scholar, HuggingFace Papers) for every system matching our 5 core capabilities
  2. Read the papers (not just abstracts, but methodology sections) to understand what each system actually does vs. what it claims to do
  3. Verified code availability: checked GitHub repos and HuggingFace models to confirm they're real, not vaporware
  4. Compared feature-by-feature against our SYSTEM_DESIGN.md
  5. Ranked by closeness to our system and quality of implementation
  6. Wrote everything in plain language so anyone can follow

This is a living document. As new systems appear, they should be added here.


Analysis conducted 2026-04-23 using research across arXiv, HuggingFace Hub, GitHub, and Semantic Scholar. Every system and resource link has been verified as of this date.