---
license: apache-2.0
tags:
  - scientific-research
  - claim-extraction
  - epistemic-classification
  - contradiction-detection
  - structured-output
  - research-assistant
  - phd-tools
  - multi-agent
  - ecc-harness
  - knowledge-graph
  - calibrated-scoring
  - transformer
  - causal-lm
  - qlora
  - sft
language:
  - en
base_model: Qwen/Qwen2.5-3B-Instruct
datasets:
  - nkshirsa/phd-research-os-sft-data
pipeline_tag: text-generation
model-index:
  - name: phd-research-os-brain
    results: []
---

PhD Research OS v2.0 — The Epistemic Engine 🧠

A local-first AI system that reads science papers, extracts findings, labels how trustworthy each one is, and builds a knowledge graph that spots contradictions across papers.

60 files | 563KB | 143 tests | 87 blindspots audited | 78 future improvements mapped


πŸ—οΈ Transformer Architecture

The entire system is built on top of a causal decoder-only Transformer — the same architecture family behind GPT, LLaMA, and Qwen. Here is exactly how it works, explained so anyone can follow:

                    TRANSFORMER ARCHITECTURE
                    (Decoder-Only, Causal LM)

┌────────────────────────────────────────────────────────────────
│  INPUT LAYER
│
│  Scientific text ──→ Tokenizer (Qwen BPE, 151,936 vocab)
│  "The LOD was 0.8 fM" ──→ [The][_LOD][_was][_0][.8][_fM]
│
│  Token IDs ──→ Token Embeddings (3072-dim vectors)
│            + Rotary Position Embeddings (RoPE)
└───────────────────────────────┬────────────────────────────────
                                ▼
┌────────────────────────────────────────────────────────────────
│  36 TRANSFORMER DECODER BLOCKS
│  (repeated 36 times, same structure)
│
│  For each block:
│
│  1. RMSNorm (normalize values to a stable range)
│
│  2. Grouped-Query Self-Attention (GQA)
│     • 16 query heads (each 128-dim)
│     • 2 key-value heads (shared across queries)
│     • Causal mask: can only look at PAST tokens
│     • RoPE: encodes where each token sits
│     • This is HOW the model reads and understands
│
│  3. RMSNorm (normalize again)
│
│  4. SwiGLU Feed-Forward Network
│     • gate_proj: 3072 → 8640
│     • up_proj:   3072 → 8640
│     • SiLU activation × gate
│     • down_proj: 8640 → 3072
│     • This is WHERE the model stores knowledge
│
│  5. Residual connections (skip connections)
│     • Add the input back to the output at each step
│     • Prevents information from getting lost
│
│  × 36 blocks = 36 layers of reading comprehension
└───────────────────────────────┬────────────────────────────────
                                ▼
┌────────────────────────────────────────────────────────────────
│  OUTPUT LAYER
│
│  Final RMSNorm ──→ Language Model Head (3072 → 151,936)
│  ──→ Probability over all possible next tokens
│  ──→ Pick the most likely token ──→ Repeat until done
│
│  Output: Structured JSON with claims, tags, confidence
└────────────────────────────────────────────────────────────────

TOTAL: 3.09 billion parameters
TRAINABLE (QLoRA r=64): ~160 million (5.2% of total)
QUANTIZATION: 4-bit NF4 (fits in ~2.5 GB VRAM)

What Each Part Does (Plain English)

| Component | What It Does | Analogy |
|---|---|---|
| Tokenizer | Chops text into small pieces the model can process | Cutting a sentence into word-cards |
| Embeddings | Turns each word-card into a list of 3,072 numbers | Giving each card a GPS coordinate in meaning-space |
| RoPE | Tells the model where each word sits in the sentence | Page numbers on the word-cards |
| Self-Attention | Each word looks at all previous words to understand context | Reading a sentence and remembering what came before |
| GQA (Grouped-Query) | A memory-efficient version of attention — shares key/value across heads | 16 readers sharing 2 sets of notes instead of each having their own |
| Causal Mask | Prevents looking at future words (only past → present) | No peeking at the next page |
| Feed-Forward Network | Where factual knowledge is stored (patterns, associations) | The brain's long-term memory |
| SwiGLU | A smart activation function that helps the network learn better | A filter that keeps important signals and blocks noise |
| Residual Connections | Adds the input back to the output so information isn't lost | A safety rope — even if a layer learns nothing useful, the original info passes through |
| RMSNorm | Keeps numbers in a stable range so training doesn't explode | A volume knob that prevents the speakers from blowing out |
| LM Head | Converts the final representation into a probability for each word in the vocabulary | The final vote — "what word comes next?" |

Model Specifications

| Specification | Value |
|---|---|
| Architecture | Transformer (decoder-only, causal language model) |
| Base Model | Qwen/Qwen2.5-3B-Instruct |
| Parameters | 3.09 billion total |
| Hidden Size | 3,072 |
| Layers | 36 transformer blocks |
| Attention Heads | 16 query heads, 2 key-value heads (GQA) |
| Head Dimension | 128 |
| Intermediate Size | 8,640 (SwiGLU FFN) |
| Vocabulary | 151,936 tokens (BPE) |
| Max Context | 32,768 tokens (~25 pages of text) |
| Position Encoding | Rotary Position Embedding (RoPE) |
| Normalization | RMSNorm (pre-norm) |
| Activation | SiLU (in SwiGLU gate) |
| Precision | BF16 (training), 4-bit NF4 (inference via QLoRA) |
| Fine-Tuning Method | QLoRA (r=64, α=16, all-linear targets) |
| Trainable Parameters | ~160M of 3.09B (5.2%) |
| Training Data | nkshirsa/phd-research-os-sft-data (1,900 examples, 6 tasks) |
| License | Apache 2.0 |

How Fine-Tuning Works (QLoRA)

Instead of updating all 3 billion parameters (which would need a huge GPU), we use QLoRA — a technique that:

  1. Quantizes the base model to 4-bit (shrinks it from ~6GB to ~2.5GB in memory)
  2. Freezes all original parameters (they don't change)
  3. Adds tiny adapter layers (LoRA) to every linear layer in the model
  4. Only trains the adapters (~160 million parameters, 5.2% of total)

Think of it like this: the base model is a textbook that's already printed. QLoRA adds sticky notes to every page. During training, we only write on the sticky notes — the textbook stays unchanged. At the end, reading the textbook + sticky notes together gives you a model that knows scientific claim extraction.

Base Model (frozen, 4-bit quantized):
  ┌───────────────────────────────┐
  │  Original Qwen2.5-3B weights  │  ← 3.09B params, NOT updated
  │  (general knowledge)          │
  └───────────────┬───────────────┘
                  │
LoRA Adapters (trainable):
  ┌───────────────┴───────────────┐
  │  Low-rank matrices (r=64)     │  ← ~160M params, UPDATED
  │  Added to every linear layer  │
  │  (scientific claim knowledge) │
  └───────────────┬───────────────┘
                  │
                  ▼
  Combined output: general knowledge + scientific specialization
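
For readers who want to see the mechanics, here is a minimal sketch of how this setup is typically wired with the Hugging Face transformers, peft, and bitsandbytes libraries. The hyperparameters mirror the spec table above; the exact arguments in this repo's train.py may differ.

# Illustrative sketch only: typical QLoRA wiring with transformers + peft + bitsandbytes.
# Values mirror the spec table (r=64, alpha=16, all-linear, 4-bit NF4); the repo's
# actual training script may use different settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-3B-Instruct"

# 1. Quantize the frozen base model to 4-bit NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base)

# 2-3. Freeze the base weights and attach low-rank adapters to every linear layer.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 4. Only the adapters train (~160M of 3.09B parameters).
model.print_trainable_parameters()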

🔬 What This System Does

┌────────────────────────────────────────────────────────────────
│  PhD Research OS Pipeline
│
│  📄 PDF Paper
│    │
│    ▼
│  Layer 0: PARSE ──── PDF → sections, tables, figures, equations
│    │
│    ▼
│  Layer 1: RESOLVE ── Normalize names, resolve citations, check retractions
│    │
│    ▼
│  Layer 2: EXTRACT ── AI Council pulls out claims with labels
│    │                 (Fact / Interpretation / Hypothesis)
│    │                 Preserves qualifiers ("may", "suggests")
│    ▼
│  Layer 3: DEDUP ──── Find duplicate claims across papers
│    │                 Merge same findings, track all sources
│    ▼
│  Layer 4: GRAPH ──── Build knowledge graph with typed edges
│    │                 (supports / refutes / extends)
│    │                 Detect contradictions, find research gaps
│    ▼
│  Layer 5: SCORE ──── Code-computed confidence (NOT an AI guess)
│    │                 3 separate scores per claim
│    │                 Parser quality caps confidence
│    ▼
│  Layer 6: EVALUATE ─ Check against the gold standard
│    │                 Regression gate blocks bad models
│    ▼
│  Layer 7: EXPORT ─── Obsidian vault, CSV, BibTeX, courtroom UI
│                      Full provenance (page, bbox, quote)
└────────────────────────────────────────────────────────────────

The 3-Score System (Code-Computed, Not AI-Guessed)

The AI provides components. The code computes the final scores. The AI never directly sets confidence.

| Score | Formula | What It Measures |
|---|---|---|
| Evidence Quality | evidence × study_quality × journal_tier × completeness × section_mod | How strong is the evidence behind this claim? |
| Truth Likelihood | evidence_quality + corroboration − conflict_penalty − null_penalty | How likely is this claim to be true? |
| Qualifier Strength | 1.0 − qualifier_count×0.1 − null_penalty − inherited_penalty | How definitive is the language? |

Key gates (hard rules the AI cannot override):

  • Parser confidence caps claim confidence β€” bad parsing = low trust
  • Large sample + tiny effect β†’ capped at 0.4 (statistically significant β‰  meaningful)
  • Abstract claims β†’ forced to Interpretation with 0.7Γ— penalty

AI Model Council

Four AI "members" process each paper section independently, then debate:

| Member | Job | Analogy |
|---|---|---|
| Query Planner | Breaks questions into searchable sub-queries | The librarian who knows which shelf to check |
| Extractor | Pulls out claims with epistemic tags + qualifiers | The note-taker who writes down key findings |
| Critic | Challenges the extractor — flags errors, missing qualifiers | The skeptic who asks "are you sure about that?" |
| Chairman | Synthesizes final claims with documented reasoning | The judge who makes the final call |

📖 The Complete Design Document — Explained from Top to Bottom

For high school students, curious researchers, and anyone who wants to understand WHY every piece of this system exists.

Every system is built from choices. This section explains what each part of the SYSTEM_DESIGN.md does, why we chose it, and what problem it solves. If you read nothing else, read this.


Part 1: The System Overview (The Big Picture)

The very first thing in the design doc is a giant box diagram showing 7 layers with inputs flowing in at the top and outputs coming out at the bottom.

Why this exists: Before anyone writes code, they need to see the whole journey. A PDF enters at the top. An Obsidian vault comes out at the bottom. The diagram shows every stop along the way so no one gets lost.

Why 7 layers specifically: Each layer does ONE job and ONE job only. This is the "single responsibility principle" — the same reason kitchens have separate stations for prep, cooking, and plating. If Layer 4 (the graph) breaks, Layer 2 (extraction) keeps working. You can fix one without touching the others.

| Layer | Name | One-Sentence Purpose |
|---|---|---|
| 0 | Structural Ingestion | Turn a messy PDF into clean, structured text with section labels |
| 1 | Entity Resolution | Figure out what things mean — "NYU" is a university, "0.8 fM" is a measurement |
| 2 | Qualified Extraction | Find every claim and label it Fact, Interpretation, or Hypothesis |
| 3 | Canonicalization | If the same finding appears in 3 papers, merge them into one master record |
| 4 | Knowledge Graph | Connect claims into a web — this supports that, this contradicts this |
| 5 | Calibrated Scoring | Compute a trust score using math formulas, not AI guessing |
| 6 | Evaluation | Test the system against expert-labeled papers to make sure it works |
| 7 | Provenance | Remember exactly which paper, page, and sentence every claim came from |

The cross-cutting box at the bottom (AI Model Council, Meta-Improver, etc.) exists because some jobs span ALL layers. The Council is used in extraction (Layer 2) AND in conflict detection (Layer 4). Keeping them separate from the main pipeline means they can be upgraded independently.


Part 2: The Two-Model Strategy (Why We Run TWO Brains)

The design doc specifies that the system runs TWO models, not one. This might seem wasteful. It is not. Here is why.

The Primary Brain stays on your local computer and NEVER touches the internet. This is the brain that reads your actual paper PDFs. It is a Qwen3-8B model running at 4-bit quantization.

Why local? Science papers contain unpublished data, patent ideas, and student thesis drafts. If you upload that to a hosted chatbot, you lose control over where it goes. Local models stay private forever.

Why Qwen3-8B? We compared Qwen2.5-3B and Qwen3-8B:

| Test | Qwen2.5-3B | Qwen3-8B | What This Means |
|---|---|---|---|
| Math reasoning (AIME) | ~15% | ~45% | Qwen3 can handle equations and statistics 3× better |
| JSON accuracy | ~65% | ~80% | Qwen3 produces valid structured output more reliably |
| Context window | 32K tokens | 128K tokens | Qwen3 can read entire 100-page supplements at once |

The 8B model needs ~5GB for weights + ~4GB for KV cache at 128K context. This fits a 16GB consumer GPU. The 3B model would use less memory but makes too many errors to be trustworthy for science.

The Companion Brain lives on the internet (Claude API, GPT-4o-mini, or OpenRouter). It never sees raw paper text. It only gets metadata, queries, and anonymized claims.

Why two brains? The companion brain checks for paper retractions on CrossRef, searches arXiv for new papers, and validates repository URLs. These tasks need the internet. The primary brain reads sensitive unpublished data. These tasks must stay offline. One brain cannot do both jobs safely.


Part 3: Training Pipeline — The 4-Stage Journey

The model does not start smart. It must be taught. The design doc specifies a 4-stage training pipeline. Here is why each stage exists and why you cannot skip any.

Stage 1: SFT (Supervised Fine-Tuning)

  • What it does: Show the model thousands of examples of paper text paired with correct claim extractions. Like showing a student thousands of solved math problems.
  • Why it exists first: The model must learn the BASIC pattern β€” "given this paragraph, produce this JSON with these fields" β€” before it can learn anything subtle.
  • Why QLoRA: Full fine-tuning of 3B parameters needs a $10,000 GPU. QLoRA trains 160M adapter parameters on a $500 GPU. The adapters are "sticky notes" on the original model.

Stage 2: DPO (Direct Preference Optimization)

  • What it does: Show the model PAIRS β€” one correct extraction and one slightly wrong extraction β€” and teach it to prefer the correct one.
  • Why it comes AFTER SFT: A student who has never seen a correct answer cannot judge between two answers. SFT teaches "what is correct." DPO teaches "why A is better than B."
  • Why it matters: DPO fixes the "almost right" problem. SFT might produce JSON with all fields present but the wrong epistemic tag. DPO specifically penalizes tag errors.

Stage 3: GRPO (Group Relative Policy Optimization)

  • What it does: The model generates 8 different answers for the same prompt. A reward function scores each one. The model learns to generate higher-scoring answers.
  • Why it comes AFTER DPO: GRPO needs the model to already understand the task well enough to generate plausible variations. A random model would generate garbage that all score zero.
  • Why it matters: GRPO bakes in JSON validity, schema compliance, and qualifier preservation using MATH-BASED reward functions β€” not human judgment. The reward function is CODE, not a person.

Stage 4: ConfTuner (Calibration Tuning)

  • What it does: After the model produces correct answers, fix its confidence calibration. A model that says "95% confident" for correct answers and "95% confident" for wrong answers is broken.
  • Why it comes LAST: You cannot calibrate confidence until the model is already producing correct answers. Calibrating garbage is meaningless.
  • Why it matters: Science requires honest uncertainty. If the model is 50% confident, the user must know that. If the model lies and says 90% confident, bad decisions follow.

Part 4: Layer 0 — Structural Ingestion (Why PDFs Are Hard)

The problem: PDFs are not documents. They are instructions for where to place ink on paper. The computer sees "put letter T at position (72, 340)" not "this is a paragraph about detection limits."

Why we need 5 different tools:

| Tool | What It Does | Why We Chose It |
|---|---|---|
| Marker | PDF → structured markdown with layout awareness | Knows the difference between a heading, a paragraph, and a caption |
| Nougat | Scientific PDFs → LaTeX equations | Can read math formulas that other tools turn into gibberish |
| GROBID | Headers, authors, citations, references | The academic standard — extracts bibliographic metadata |
| LayoutLMv3 | Classifies page regions (text, table, figure, equation) | Machine vision for documents — "this region is a table" |
| PlotDigitizer | Quantitative plots → (x,y) CSV coordinates | Turns images of graphs into actual numbers |

Why not just use PyMuPDF? PyMuPDF reads text in reading order. But scientific papers have multi-column layouts, floating figures, and tables that span pages. PyMuPDF often merges table cells with body text. Marker uses machine learning to understand document LAYOUT, not just text sequence.

Why section-aware chunking: If you split a paper at page 5, you might cut a table in half. The design specifies chunking by SECTION (Introduction, Methods, Results) with 1-paragraph overlap. This means every chunk has complete context. A Results chunk includes the paragraph before and after, so the AI sees "As shown in Figure 3..." and knows Figure 3 exists.
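
A minimal sketch of section-aware chunking with one paragraph of overlap, assuming the parser hands back a list of (section name, paragraphs) pairs; the real Layer 0 data structures may differ.

# Illustrative section-aware chunking with one-paragraph overlap on each side.
def chunk_by_section(sections, max_paragraphs=8):
    chunks = []
    for name, paragraphs in sections:
        for start in range(0, len(paragraphs), max_paragraphs):
            lo = max(0, start - 1)                                  # leading overlap
            hi = min(len(paragraphs), start + max_paragraphs + 1)   # trailing overlap
            chunks.append({"section": name, "text": "\n\n".join(paragraphs[lo:hi])})
    return chunks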

Why per-region quality scoring: Not every part of a PDF is readable equally. A scanned figure might have OCR errors. A handwritten annotation might be noise. The system assigns a parse_confidence score to every region. Low confidence regions get flagged so the user knows not to trust claims from that region.


Part 5: Layer 1 — Entity Resolution (What Is This Thing?)

The problem: "The NYU team used 0.8 fM" contains four entities: NYU (institution), team (group), 0.8 fM (measurement), and an implied method. The computer does not know any of this without help.

Why we normalize entities: If Paper A says "University of New York" and Paper B says "NYU," the system must know these are the same institution. Otherwise the knowledge graph treats them as unrelated nodes and misses connections.

Why we resolve citations: A paper might say "as shown in [32]." The system must find reference 32 in the bibliography, get its DOI, and look up the actual paper. Only then can it tag a claim as "inherited citation" (this paper is repeating someone else's finding, not producing a new one).

Why we check retractions: A paper might be retracted because the data was fraudulent. If the system does not know this, it treats fraudulent claims as trustworthy. The design specifies periodic checks against CrossRef and Retraction Watch.

Why Version of Record (VoR) lineage: Scientists often publish a preprint on arXiv, then a final version in a journal. The preprint might have errors that the journal version fixes. The system must track: preprint → journal version → erratum → retraction. Each version supersedes the previous one.


Part 6: Layer 2 — Qualified Extraction (The AI Council)

The problem: AI models make mistakes. They hallucinate numbers. They drop the word "may" and turn a tentative finding into a fact. They miss qualifiers like "in 10 mM PBS only" that limit where a result applies.

Why we use a COUNCIL instead of one model: One model is one opinion. Four models debating produces better answers for the same reason that four doctors reviewing a case produces a better diagnosis than one doctor. The design specifies:

  1. Query Planner breaks the paper into sub-questions — "find the detection limit," "find the sample size," "find the p-value"
  2. Two Extractors independently pull claims. If they disagree, that disagreement IS information.
  3. Critic reviews both extractions and flags errors
  4. Chairman synthesizes the final answer, resolving disagreements with documented reasoning

Why Round 1 is PARALLEL: The extractors must not see each other's work. If Extractor B sees Extractor A's answer, B might anchor on A's mistakes. Parallel execution guarantees independent judgment.

Why Round 2 shares TAGS but NOT confidence: If Extractor A tags a claim as "Fact" and Extractor B tags it as "Interpretation," the Council must debate WHY. But if they share confidence scores, the lower-confidence extractor might cave to the higher-confidence one. Confidence is hidden to prevent anchoring.
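
A simplified sketch of the two-round flow. The function and class names are hypothetical; it only illustrates that Round 1 runs in parallel and Round 2 shares tags while hiding confidence.

# Sketch of the Council's two rounds. Names are illustrative, not the repo's classes.
from concurrent.futures import ThreadPoolExecutor

def run_council(section_text, extractor_a, extractor_b, critic, chairman):
    # Round 1: independent, parallel extraction (no anchoring on each other's answers).
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(extractor_a, section_text)
        future_b = pool.submit(extractor_b, section_text)
        claims_a, claims_b = future_a.result(), future_b.result()

    # Round 2: share epistemic tags only; confidence stays hidden to prevent anchoring.
    shared = [{"claim": c["claim"], "epistemic_tag": c["epistemic_tag"]}
              for c in claims_a + claims_b]
    critique = critic(section_text, shared)

    # Chairman resolves disagreements with documented reasoning.
    return chairman(section_text, shared, critique)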

Why we need constrained decoding (Guidance engine): The AI is asked to output JSON. Without constraints, it might produce:

The detection limit was 0.8 fM {
  "claim": ...
}

The Guidance library GUARANTEES that epistemic_tag is one of {Fact, Interpretation, Hypothesis, Conflict_Hypothesis}. It does this by controlling which tokens the model is allowed to generate at each position. The model literally CANNOT output an invalid tag.

Why we preserve qualifiers: "The method MAY reduce costs" and "The method reduces costs" are completely different claims. Dropping "may" turns an uncertain hypothesis into a confident fact. The system has a specific qualifier preservation reward function that checks whether hedging words from the source appear in the extraction.


Part 7: Layer 3 — Canonicalization (Deduplication)

The problem: If 50 papers study COVID-19 vaccines, the phrase "vaccines are effective" might appear in all 50. Without deduplication, the knowledge graph has 50 identical nodes. The user sees 50 copies of the same claim.

Why embedding-based similarity: Word-based deduplication fails. "The vaccine showed 95% efficacy" and "Efficacy of 95% was demonstrated by the vaccine" share almost no words but mean the SAME thing. The design specifies embedding models that convert sentences to number vectors. Similar meaning = similar vectors, regardless of wording.

Why cosine similarity threshold of 0.85: Below 0.70, claims are definitely different. Above 0.85, they are probably the same. Between 0.70 and 0.85, the system flags them for human review. This threshold was chosen empirically — it is not a magic number.
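
A minimal sketch of the embedding approach using the sentence-transformers library, with the 0.70 and 0.85 thresholds from above. The embedding model shown is a generic placeholder; the design leans toward SPECTER2 (see Prior Art below).

# Sketch of embedding-based deduplication; model name and thresholds are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def dedup_decision(claim_a: str, claim_b: str) -> str:
    emb = model.encode([claim_a, claim_b], normalize_embeddings=True)
    sim = float(util.cos_sim(emb[0], emb[1]))
    if sim >= 0.85:
        return "merge"           # probably the same finding
    if sim >= 0.70:
        return "human_review"    # ambiguous, flag for the researcher
    return "keep_separate"       # definitely different claims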

Why temporal versioning: A claim from 2020 might say "vaccine efficacy unknown." The same claim from 2021 might say "vaccine efficacy 95%." These are NOT duplicates. They are sequential versions of the same scientific story. The system stores a version history: version 1: preprint_2020 → version 2: journal_2021.


Part 8: Layer 4 — Knowledge Graph (The Web of Science)

The problem: Science is not a list of facts. It is a web of relationships. Finding A supports Finding B. Finding C contradicts Finding B. Method X was used in both Study Y and Study Z. A list cannot represent this.

Why SQLite instead of Neo4j: Neo4j is a professional graph database. It is also external software that requires installation, licensing, and maintenance. SQLite is built into Python. It requires zero setup. For a PhD student running this on a laptop, "works immediately" beats "theoretically better."

Why typed edges: An edge without a type is just "connected." The design specifies 7 edge types:

| Edge Type | What It Means | Why It Matters |
|---|---|---|
| supports | A provides evidence for B | The strongest positive relationship |
| refutes | A contradicts B | Triggers conflict detection in the Courtroom UI |
| extends | A adds conditions to B | "Works in water" extends "Works in ideal conditions" |
| depends_on | A assumes B is true | If B is retracted, A's validity is questioned |
| supersedes | A replaces B (newer data) | Old claims are marked outdated |
| blocks | A is a null finding (no effect) | Prevents false connections |
| investigative_hypothesis | Inferred (not directly observed) | Marked differently because it is weaker evidence |

Why we NEVER auto-generate supports across multiple hops: If A supports B, and B supports C, it does NOT follow that A supports C. This is the "transitivity fallacy" — common in science but logically invalid. The design explicitly bans this.

Why gap analysis: The system looks for pairs of well-connected entities with NO edge between them. If "graphene" connects to 50 methods and "biosensor" connects to 50 methods, but NO paper connects graphene to biosensors, that is a research gap. It is a publishable finding.
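
A sketch of what typed edges look like in plain SQLite. Table and column names are illustrative, and the gap query omits the "well-connected" degree filter for brevity.

# Sketch of the typed-edge store in SQLite (no external graph database needed).
import sqlite3

con = sqlite3.connect("knowledge_graph.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS claims (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE IF NOT EXISTS edges (
    src INTEGER REFERENCES claims(id),
    dst INTEGER REFERENCES claims(id),
    edge_type TEXT CHECK (edge_type IN
        ('supports','refutes','extends','depends_on',
         'supersedes','blocks','investigative_hypothesis'))
);
""")

# Gap analysis (simplified): pairs of claims with no edge between them in either direction.
gap_query = """
SELECT a.id, b.id
FROM claims a JOIN claims b ON a.id < b.id
WHERE NOT EXISTS (
    SELECT 1 FROM edges e
    WHERE (e.src = a.id AND e.dst = b.id) OR (e.src = b.id AND e.dst = a.id)
);
"""
gaps = con.execute(gap_query).fetchall()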


Part 9: Layer 5 — Calibrated Scoring (Why Code Computes Trust)

The problem: If you ask an AI "how confident are you?" it will make up a number. GPT-4 might say "95% confident" for a completely hallucinated claim. Humans trust numbers. Fake numbers are dangerous.

Why the AI provides COMPONENTS, not scores: The AI outputs evidence_strength: 900, study_quality_weight: 1000, journal_tier_weight: 1000. The CODE multiplies these together using fixed formulas. The AI never touches the final confidence number.

Why fixed-point math (×1000): Floating-point numbers have rounding errors. 0.1 + 0.2 ≠ 0.3 in computers. The design stores all probabilities as integers from 0 to 1000. 950 means "95.0% confident." This is exact. No rounding. No drift.
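
A two-line demonstration of the problem and the fix:

# Why integers: floating point drifts, fixed-point does not.
print(0.1 + 0.2 == 0.3)   # False: classic float rounding error
print(100 + 200 == 300)   # True: probabilities stored as 0..1000 integers
confidence = 950          # means 95.0%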

Why 3 separate scores instead of 1: One number hides what is wrong. A claim might have perfect evidence (Score 1 = 1000) but terrible language qualifiers (Score 3 = 300). A single average of 650 hides the problem. Three scores show: "the evidence is strong, but the claim is hedged."

Why parser confidence CAPS the claim score: If the PDF parser was only 60% confident it read the table correctly, the claim CANNOT score above 600. This prevents the classic failure: "garbage in, gospel out."

Why the statistical gate exists: A study with N=10,000 and effect size d=0.01 is "statistically significant" (p<0.001) but practically meaningless. The effect is too small to matter. The design has a hard rule: if N>1000 and |effect|<0.1, cap confidence at 400. This prevents the system from treating "statistically significant" as "important."


Part 10: Layer 6 — Evaluation (How Do You Know It Works?)

The problem: A system that processes science papers must be TESTED. But testing AI is hard because the answers are not binary right/wrong. "Is this claim well-extracted?" is a judgment call.

Why we need a GOLD STANDARD: The design specifies 10 papers where human experts have labeled every claim, tag, qualifier, and conflict. This is the ground truth. Without it, you are driving without a map.

Why paper-level test splits: If you randomly split claims into train/test, claims from the SAME paper might appear in both. The model memorizes paper-specific phrases instead of learning general extraction. The design specifies splitting by PAPER — all claims from Paper A go to training, all claims from Paper B go to testing.
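
A minimal sketch of a paper-level split; field names are illustrative.

# Every claim from a given paper lands entirely in train or entirely in test,
# so the model cannot memorize paper-specific phrasing.
import random

def split_by_paper(claims, test_fraction=0.2, seed=42):
    papers = sorted({c["paper_id"] for c in claims})
    random.Random(seed).shuffle(papers)
    n_test = max(1, int(len(papers) * test_fraction))
    test_papers = set(papers[:n_test])
    train = [c for c in claims if c["paper_id"] not in test_papers]
    test = [c for c in claims if c["paper_id"] in test_papers]
    return train, test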

Why LLM-as-Judge: For fuzzy criteria like "is this claim faithful to the source text?" there is no code formula. The design uses a separate LLM (not the one being tested) as a judge. But the judge is also imperfect, so results are treated as estimates, not truth.

Why stochastic testing: LLMs give different answers each time. Running evaluation once gives one number that might be 5% higher or lower next time. The design specifies running evaluations 3-5 times and reporting the range.

Why hidden holdout: The team developing the system might unconsciously optimize for the test set. A truly hidden holdout — 3 papers that NO ONE on the team has seen during development — catches this. Evaluated quarterly.


Part 11: Layer 7 — Provenance (Where Did This Come From?)

The problem: A claim in the knowledge graph might have been extracted 6 months ago using an old model version. If the model has since improved, the old claim might be wrong. But you cannot know unless you track EVERYTHING.

Why every claim stores a full lineage:

{
  "pipeline_version": "2.1.0",
  "model_checkpoint": "research-os-grpo-v2-step-5000",
  "parser_version": "marker-1.2.0",
  "taxonomy_version": "quantum_bio_v1",
  "prompt_hash": "sha256:a3b4c5...",
  "extraction_timestamp": "2026-04-23T10:30:00Z"
}

This is like a food label that tells you which farm, factory, and truck touched your apple. If a bad batch of apples makes people sick, you trace the batch number. If a model checkpoint produces bad extractions, you trace the checkpoint hash.
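
A sketch of how such a lineage record can be stamped onto every extraction; the version strings are passed in by the caller, and the prompt hash is an ordinary SHA-256.

# Illustrative provenance stamp; the JSON above shows the real fields.
import hashlib
from datetime import datetime, timezone

def provenance(prompt: str, pipeline_version: str, model_checkpoint: str) -> dict:
    return {
        "pipeline_version": pipeline_version,
        "model_checkpoint": model_checkpoint,
        "prompt_hash": "sha256:" + hashlib.sha256(prompt.encode()).hexdigest(),
        "extraction_timestamp": datetime.now(timezone.utc).isoformat(),
    }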

Why the security sandbox: The system validates repository URLs and downloads code. A malicious URL could execute harmful code. The sandbox runs these checks in isolation: HTTP GET only, 60-second timeout, 100MB download limit, no code execution.

Why the epistemic embargo: A researcher analyzing unpublished data needs privacy. The embargo mode stores claims in a private subgraph that NO other user can see. After paper submission, one click moves them to the shared lab graph.


Part 12: The UI — Progressive Disclosure

The problem: Science produces overwhelming output. 100 papers → 3,000+ claims. Showing everything at once paralyzes the user.

Why progressive disclosure levels: The design specifies 5 levels of detail, like peeling an onion:

| Level | What You See | When You Need It |
|---|---|---|
| 0 | Dashboard — health scores and today's queue | First thing every morning |
| 1 | Claim text + tag + confidence + source | Deciding if a claim is worth reading |
| 2 | The 3 separate scores (evidence, truth, qualifier) | Understanding WHY a claim scored low |
| 3 | Source quote + page + bbox + council votes | Verifying the claim against the original paper |
| 4 | 2-hop subgraph (neighbors and edges) | Understanding how this claim connects to others |
| 5 | Raw LLM outputs + token distributions | Debugging when something goes wrong |

Why the Courtroom UI: When 3 papers contradict each other, the system displays them side by side — like a courtroom with 3 witnesses. The user sees the methods, sample sizes, and p-values for each. The system does NOT pick a winner. It presents the evidence and lets the researcher decide.

Why Manual Synthesis Mode: After using the system for months, researchers might start trusting it blindly. Manual mode hides ALL AI output — no scores, no conflict flags, no suggestions. The researcher draws connections manually, then switches back to compare. This built-in friction prevents over-reliance.


Part 13: Appendix A — Future Architecture (Why We Are Planning Ahead)

The SYSTEM_DESIGN.md includes an Appendix A that describes improvements validated by peer-reviewed research. These are NOT implemented yet. They are roadmap items for Phase F and beyond.

Why MAGMA (Multi-Graph Memory) matters: The current knowledge graph mixes temporal, causal, semantic, and entity relationships into one edge space. This is like storing all your files in one giant folder. MAGMA creates separate folders (graphs) for each relationship type. When you ask "Why did the 2023 paper disagree?" the system only searches the CAUSAL graph, not everything.

Why post-Transformer architectures matter: The current Transformer (Qwen3-8B) has a "quadratic attention bottleneck." At 128K tokens, every new token must look at ALL previous tokens. This is like a student re-reading the entire textbook before every new page. It is slow and memory-hungry.

New architectures solve this:

  • Mamba (SSM): Compresses history into a hidden state. Like a speed-reader who keeps a mental summary instead of re-reading every page. 5Γ— faster on long sequences.
  • RWKV: Constant memory during inference. Like having a notebook where you only write the latest summary, not every note ever taken.
  • Jamba/Griffin (Hybrid): Keep a few Transformer layers for precise short-range attention, add SSM layers for cheap long-range memory. Best of both worlds.
  • MoE (Mixture of Experts): 30B total parameters, only 3B active per token. Like a hospital with 30 specialists where each patient sees only the relevant 3.

Why CompreSSM matters: Training SSMs from scratch at small dimension produces weak models. CompreSSM starts LARGE (state dimension 256), trains for a while, then uses control theory (Hankel singular values) to identify which dimensions matter and surgically removes the rest. Result: a compact model that outperforms one trained small from the start.

Why gradual migration, not a hard switch: The Transformer ecosystem (AWQ quantization, vLLM serving, GRPO in TRL) is mature. SSM ecosystems are newer. The design recommends:

  1. Now: Keep Transformer (stable, tested, ecosystem)
  2. Phase F: Test hybrid model on holdout data, compare metrics
  3. Year 2+: If hybrids win, migrate ingestion pipeline to SSM backbone
  4. Always: Companion brain stays on frontier APIs (Claude, GPT-4o) — only the local primary brain migrates

📊 Resources

| Resource | Link | What's Inside |
|---|---|---|
| Model + Code | nkshirsa/phd-research-os-brain | This repo — 60 files of code, design docs, tests |
| Training Data | nkshirsa/phd-research-os-sft-data | 1,900 multi-task examples across 6 tasks |
| Taxonomy GUI | nkshirsa/phd-research-os-taxonomy | Live Gradio Space — explore the epistemic taxonomy |
| Training Space | nkshirsa/phd-research-os-train | ZeroGPU micro-batch training interface |
| Blindspot Audit | BLINDSPOT_AUDIT_COMPLETE.md | 87 failure modes across 4 discovery epochs |
| System Design | SYSTEM_DESIGN.md | Complete 7-layer architecture + Appendix A (MAGMA, post-Transformer architectures, CompreSSM) |
| Future Improvements | FUTURE_IMPROVEMENTS.md | 78 blindspots found by reading every code file (high-school readable) |

🚀 Quick Start

git clone https://huggingface.co/nkshirsa/phd-research-os-brain
cd phd-research-os-brain
pip install gradio pymupdf
python -m phd_research_os_v2.app
# Open http://localhost:7860

Works immediately with heuristic extraction. Add an API key for AI-powered extraction:

export ANTHROPIC_API_KEY=sk-...  # or OPENAI_API_KEY

🧬 Epistemic Separation Engine

The same finding means different things depending on where it appears in a paper:

| Source Section | Confidence Modifier | Why |
|---|---|---|
| Results (with statistics) | 1.0× (full weight) | Direct data — the strongest evidence |
| Supplement | 1.0× (full weight) | Same as results, just in a different file |
| Methods | 1.0× (protocol, not a claim) | Describes what was done, not what was found |
| Discussion | 0.75× (penalized) | Author's interpretation — goes beyond the data |
| Abstract | 0.7× (forced to Interpretation) | Often overstates what the results actually show |
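
A minimal sketch of how these modifiers would be applied in code; the dictionary simply mirrors the table above, and the function name is illustrative.

# Section-based confidence modifier, in fixed-point (0..1000) confidence units.
SECTION_MODIFIERS = {
    "results": 1.0,
    "supplement": 1.0,
    "methods": 1.0,        # protocol, not a claim
    "discussion": 0.75,
    "abstract": 0.7,       # also forced to Interpretation
}

def apply_section_modifier(confidence: int, section: str) -> int:
    return int(confidence * SECTION_MODIFIERS.get(section.lower(), 1.0))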

🤖 ECC Harness — Companion AI System

from phd_research_os.agent_os import AgentOS

aos = AgentOS()
agent = aos.spawn_companion("DataQualityAuditor")
task = aos.assign_task(agent, "Audit last 50 claims for hallucination patterns")
result = aos.run_task(task)
proposals = aos.get_proposals(agent)  # Review what the agent found
aos.approve_proposal(proposals[0])   # Human approves changes

5 built-in companion types:

| Agent | What It Does |
|---|---|
| DataQualityAuditor | Checks claims for hallucinations and suspicious confidence |
| PromptOptimizer | Improves system prompts via A/B testing |
| DomainExpander | Generates training examples for new scientific fields |
| CalibrationAnalyst | Analyzes Brier scores, recommends weight adjustments |
| CitationChaser | Finds new papers that support or contradict existing claims |

All companion output goes through Proposals — the agent can suggest changes but a human must approve them.


πŸ“ Quantum-Bio Taxonomy V2

Evidence from different study types gets different weights:

| Study Type | Weight | Example |
|---|---|---|
| In vivo | 1.00 | Animal/human experiment |
| Direct physical measurement | 1.00 | Electron microscopy image |
| Mathematical proof | 0.95 | Formal theorem |
| In vitro | 0.85 | Cell culture experiment |
| First-principles simulation | 0.80 | Density functional theory |
| Phenomenological simulation | 0.60 | Curve-fitting model |
| Review | 0.40 | Literature survey |
| Perspective | 0.20 | Expert opinion piece |

✅ Tests

test_v2_integration.py  — 24 ✅  (full pipeline)
test_db.py              — 22 ✅  (data layer)
test_agent_os.py        — 21 ✅  (ECC harness)
test_taxonomy.py        — 27 ✅  (taxonomy)
test_skills_and_meta.py — 30 ✅  (skills + meta)
test_council.py         — 19 ✅  (AI council)
Total: 143 passing

📈 Training Pipeline

Current: SFT (Supervised Fine-Tuning) with QLoRA on Qwen2.5-3B-Instruct

Planned 4-stage pipeline:

  1. SFT — "Here's the right answer, learn it" ✅ Built
  2. DPO — "This answer is better than that one" 📋 Designed
  3. GRPO — "Reward for good JSON, correct tags, preserved qualifiers" 📋 Designed
  4. ConfTuner — "Make your confidence scores match reality" 📋 Designed

Training options:

  • ZeroGPU: nkshirsa/phd-research-os-train β€” click Train repeatedly
  • Local (needs GPU): python train.py
  • HF Jobs: Single continuous run on dedicated GPU (recommended)

πŸ” Honest Status β€” What's Built vs. Planned

| Feature | Status | Details |
|---|---|---|
| Database (all 7 layers) | ✅ Implemented | SQLite with full schema for claims, graph, conflicts |
| PDF parser | 🚧 Basic | PyMuPDF/pdfplumber work; Marker (ML-based) not yet integrated |
| Claim extraction | 🚧 Mock + AI | Heuristic fallback works; real AI extraction needs an API key |
| Deduplication | 🚧 Basic | Jaccard word overlap; embedding-based similarity not yet added |
| Knowledge graph | 🚧 Basic | Structure and queries work; conflict detection uses word overlap |
| Scoring engine | ✅ Implemented | 3-score formula with all gates and caps |
| Evaluation harness | 🚧 Basic | Metric counting works; gold standard comparison not yet built |
| Obsidian export | ✅ Implemented | Claims, canonicals, conflicts, dashboard |
| AI Council | 🚧 Sequential | Prompts defined; parallel debate not yet implemented |
| Agent OS / ECC Harness | ✅ Implemented | Full lifecycle, audit trail, proposals |
| Constrained decoding | 📋 Designed | Guidance library integration specified but not built |
| 4-stage training | 🚧 Stage 1 only | SFT built; DPO, GRPO, ConfTuner designed |

📋 Documentation

| Document | Size | What It Contains |
|---|---|---|
| SYSTEM_DESIGN.md | 71KB | Complete 7-layer architecture + Appendix A (MAGMA, post-Transformer architectures, CompreSSM) with training pipeline |
| BLINDSPOT_AUDIT_COMPLETE.md | 45KB | 87 failure modes found by adversarial thinking |
| FUTURE_IMPROVEMENTS.md | 39KB | 78 practical problems found by reading every code file |
| DEPLOYMENT_ARM.md | 3KB | ARM/local deployment guide |

License

Apache 2.0 — use it, modify it, build on it. Just include the license.


🔬 Prior Art & Inspirations — Standing on the Shoulders of Giants

We searched every relevant system, paper, and tool. Nobody has built the complete system we're building. But we found 15 systems that each do pieces of what we do — and we're adopting the best parts.

We analyzed 15 published systems, 12 open-source tools, 8 HuggingFace datasets, and 6 HuggingFace models that do something similar to PhD Research OS. Here's the honest picture.

The Big Scoreboard — Who Has What

| Capability | Our System | Best Rival | Gap |
|---|---|---|---|
| Claim extraction from papers | 🟑 Designed | 🟒 PaperQA2 (superhuman) | They're proven; we're not yet |
| Epistemic labels (Fact/Interpretation/Hypothesis) | 🟒 Unique | 🟑 KGX3 (paper-level only) | Nobody does claim-level |
| Typed knowledge graph | 🟒 Unique | 🟑 Paper Circle (structural, not epistemic) | Our edges carry epistemic meaning |
| Contradiction detection | 🟑 Designed | 🟒 PaperQA2 (2.34× better than PhDs) | They can prove it works; we can't yet |
| Code-computed confidence scores | 🟒 Unique | ❌ Nobody | Nobody does formula-based scoring |

Analogy: If AI research systems were restaurants, PaperQA2 would be the Michelin-starred place that's been open for years. We're the ambitious new restaurant with a unique menu that nobody else offers — but we haven't opened our doors yet. The plan is to learn from the best while cooking dishes nobody else serves.

What We're Adopting (The Short Version)

| What | From Where | Why | Type |
|---|---|---|---|
| SPECTER2 embeddings | AllenAI | Fix our word-overlap deduplication | 🔵 Direct plug-in |
| SciFact benchmark | AllenAI | Measure our extraction quality | 🔵 Direct plug-in |
| SciRIFF training data (137K examples) | AllenAI | Replace/supplement our 1,900 synthetic examples | 🔵 Direct plug-in |
| Nougat equation parser | Meta | Fix garbled equations | 🔵 Direct plug-in |
| RCS pre-filtering technique | PaperQA2 | Filter noise before extraction | 🟣 Adapted technique |
| Deterministic language filters | KGX3 | Code-based epistemic validation | 🟣 Adapted technique |
| Coverage checking pattern | Paper Circle | Prevent silent omissions | 🟣 Adapted pattern |
| "Refuse to answer" behavior | PaperQA2 | Know when to say "I don't know" | 🟣 Adapted behavior |
| Epistemic Velocity (new feature) | CLAIRE + PaperQA2 | Track confidence trends over time | 🆕 Inspired by |
| Devil's Advocate Mode (new feature) | CLAIRE + KGX3 | Actively challenge high-confidence claims | 🆕 Inspired by |

Full Documentation

| Document | Size | What It Contains |
|---|---|---|
| PRIOR_ART_ANALYSIS.md | 40KB+ | 15 systems analyzed — what they do, how they compare, honest scoreboard |
| SYSTEM_INSPIRATIONS.md | 35KB+ | 16 concrete adoption/adaptation items with code examples, integration map, priority order |

Our 3 Genuine Innovations (Not Found Anywhere Else)

  1. Claim-Level Epistemic Classification — KGX3 rates whole papers; we rate individual claims within papers. That's like the difference between rating a restaurant (them) vs rating each dish (us).

  2. Code-Computed Calibrated Confidence — Other systems ask the AI "how confident are you?" and trust the answer. We have the AI provide raw measurements, then Python code computes the score using fixed formulas the AI can never override. Like a teacher who grades the test herself instead of asking the student "what grade do you think you deserve?"

  3. The Integrated 7-Layer Local-First Pipeline — Other systems are like food delivery apps (each brings one dish). We're building a kitchen that cooks everything and remembers every recipe forever — all on your laptop, never sending your data to the cloud.