nkshirsa committed — commit 75cfa3a (verified) · parent a13a052

Append Prior Art & Inspirations section to README landing page — links to new PRIOR_ART_ANALYSIS.md and SYSTEM_INSPIRATIONS.md

Files changed: README.md (+725 −0)
---
license: apache-2.0
tags:
- scientific-research
- claim-extraction
- epistemic-classification
- contradiction-detection
- structured-output
- research-assistant
- phd-tools
- multi-agent
- ecc-harness
- knowledge-graph
- calibrated-scoring
- transformer
- causal-lm
- qlora
- sft
language:
- en
base_model: Qwen/Qwen2.5-3B-Instruct
datasets:
- nkshirsa/phd-research-os-sft-data
pipeline_tag: text-generation
model-index:
- name: phd-research-os-brain
  results: []
---

# PhD Research OS v2.0 — The Epistemic Engine 🧠

> **A local-first AI system that reads science papers, extracts findings, labels how trustworthy each one is, and builds a knowledge graph that spots contradictions across papers.**

**60 files | 563KB | 143 tests | 87 blindspots audited | 78 future improvements mapped**

---

## 🏗️ Transformer Architecture

The entire system is built on top of a **causal decoder-only Transformer** — the same family of architectures behind GPT, LLaMA, and Qwen. Here is exactly how it works, explained so anyone can follow:

```
                 TRANSFORMER ARCHITECTURE
                  (Decoder-Only, Causal LM)

INPUT LAYER
  Scientific text ──→ Tokenizer (Qwen BPE, 151,936 vocab)
  "The LOD was 0.8 fM" ──→ [The][_LOD][_was][_0][.8][_fM]
  Token IDs ──→ Token Embeddings (3072-dim vectors)
                + Rotary Position Embeddings (RoPE)
                         │
                         ▼
36 TRANSFORMER DECODER BLOCKS (same structure, repeated)
  For each block:
    1. RMSNorm (normalize values to a stable range)
    2. Grouped-Query Self-Attention (GQA)
       • 16 query heads (each 128-dim)
       • 2 key-value heads (shared across queries)
       • Causal mask: can only look at PAST tokens
       • RoPE: encodes where each token sits
       • This is HOW the model reads and understands
    3. RMSNorm (normalize again)
    4. SwiGLU Feed-Forward Network
       • gate_proj: 3072 → 8640
       • up_proj:   3072 → 8640
       • SiLU activation × gate
       • down_proj: 8640 → 3072
       • This is WHERE the model stores knowledge
    5. Residual connections (skip connections)
       • Add the input back to the output at each step
       • Prevents information from getting lost
  × 36 blocks = 36 layers of reading comprehension
                         │
                         ▼
OUTPUT LAYER
  Final RMSNorm ──→ Language Model Head (3072 → 151,936)
  ──→ Probability over all possible next tokens
  ──→ Pick the most likely token ──→ repeat until done
  Output: structured JSON with claims, tags, confidence

TOTAL:                  3.09 billion parameters
TRAINABLE (QLoRA r=64): ~160 million (5.2% of total)
QUANTIZATION:           4-bit NF4 (fits in ~2.5 GB VRAM)
```

### What Each Part Does (Plain English)

| Component | What It Does | Analogy |
|-----------|-------------|---------|
| **Tokenizer** | Chops text into small pieces the model can process | Cutting a sentence into word-cards |
| **Embeddings** | Turns each word-card into a list of 3,072 numbers | Giving each card a GPS coordinate in meaning-space |
| **RoPE** | Tells the model where each word sits in the sentence | Page numbers on the word-cards |
| **Self-Attention** | Each word looks at all previous words to understand context | Reading a sentence and remembering what came before |
| **GQA (Grouped-Query)** | A memory-efficient version of attention — shares key/value across heads | 16 readers sharing 2 sets of notes instead of each having their own |
| **Causal Mask** | Prevents looking at future words (only past → present) | No peeking at the next page |
| **Feed-Forward Network** | Where factual knowledge is stored (patterns, associations) | The brain's long-term memory |
| **SwiGLU** | A smart activation function that helps the network learn better | A filter that keeps important signals and blocks noise |
| **Residual Connections** | Adds the input back to the output so information isn't lost | A safety rope — even if a layer learns nothing useful, the original info passes through |
| **RMSNorm** | Keeps numbers in a stable range so training doesn't explode | A volume knob that prevents the speakers from blowing out |
| **LM Head** | Converts the final representation into a probability for each word in the vocabulary | The final vote — "what word comes next?" |

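
For readers who want to see the block math, here is a minimal pure-Python sketch of RMSNorm, the SwiGLU feed-forward, and the residual add. The 2 → 3 → 2 dimensions and the weight values are made up so it fits on screen; the real projections are 3072 → 8640 → 3072.

```python
import math

def silu(x):
    # SiLU: x * sigmoid(x) — lets strong signals through, damps weak ones
    return x * (1.0 / (1.0 + math.exp(-x)))

def rmsnorm(x, eps=1e-6):
    # RMSNorm: divide by the root-mean-square so values stay in a stable range
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down_proj( silu(gate_proj(x)) * up_proj(x) ).
    Each w_* is a list of per-output-neuron weight rows."""
    gate = [silu(sum(xi * wi for xi, wi in zip(x, row))) for row in w_gate]
    up = [sum(xi * wi for xi, wi in zip(x, row)) for row in w_up]
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise "gate × up"
    return [sum(hi * wi for hi, wi in zip(hidden, row)) for row in w_down]

x_in = [3.0, 4.0]
x = rmsnorm(x_in)                     # pre-norm, as in the diagram
w_gate = [[1, 0], [0, 1], [1, 1]]     # toy 2 → 3 projection
w_up = [[1, 0], [0, 1], [1, 1]]
w_down = [[1, 0, 0], [0, 1, 0]]       # toy 3 → 2 projection
out = [a + b for a, b in zip(x_in, swiglu_ffn(x, w_gate, w_up, w_down))]  # residual add
```

Note how the residual add at the end guarantees the original `x_in` survives even if the FFN output were all zeros.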

### Model Specifications

| Specification | Value |
|--------------|-------|
| **Architecture** | Transformer (decoder-only, causal language model) |
| **Base Model** | [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) |
| **Parameters** | 3.09 billion total |
| **Hidden Size** | 3,072 |
| **Layers** | 36 transformer blocks |
| **Attention Heads** | 16 query heads, 2 key-value heads (GQA) |
| **Head Dimension** | 128 |
| **Intermediate Size** | 8,640 (SwiGLU FFN) |
| **Vocabulary** | 151,936 tokens (BPE) |
| **Max Context** | 32,768 tokens (~25 pages of text) |
| **Position Encoding** | Rotary Position Embedding (RoPE) |
| **Normalization** | RMSNorm (pre-norm) |
| **Activation** | SiLU (in SwiGLU gate) |
| **Precision** | BF16 (training), 4-bit NF4 (inference via QLoRA) |
| **Fine-Tuning Method** | QLoRA (r=64, α=16, all-linear targets) |
| **Trainable Parameters** | ~160M of 3.09B (5.2%) |
| **Training Data** | [nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data) (1,900 examples, 6 tasks) |
| **License** | Apache 2.0 |

### How Fine-Tuning Works (QLoRA)

Instead of updating all 3 billion parameters (which would need a huge GPU), we use **QLoRA** — a technique that:

1. **Quantizes** the base model to 4-bit (shrinks it from ~6GB to ~2.5GB in memory)
2. **Freezes** all original parameters (they don't change)
3. **Adds tiny adapter layers** (LoRA) to every linear layer in the model
4. **Only trains the adapters** (~160 million parameters, 5.2% of total)

Think of it like this: the base model is a textbook that's already printed. QLoRA adds sticky notes to every page. During training, we only write on the sticky notes — the textbook stays unchanged. At the end, reading the textbook + sticky notes together gives you a model that knows scientific claim extraction.

```
Base Model (frozen, 4-bit quantized):
┌──────────────────────────────┐
│ Original Qwen2.5-3B weights  │ ← 3.09B params, NOT updated
│ (general knowledge)          │
└──────────────┬───────────────┘
               │
LoRA Adapters (trainable):
┌──────────────┴───────────────┐
│ Low-rank matrices (r=64)     │ ← ~160M params, UPDATED
│ Added to every linear layer  │
│ (scientific claim knowledge) │
└──────────────┬───────────────┘
               │
               ▼
Combined output: general knowledge + scientific specialization
```
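
The sticky-note idea has a one-line mathematical core: the effective weight is the frozen weight plus a scaled low-rank product, `W_eff = W + (α/r) · B · A`. A toy sketch with made-up dimensions (the real model uses r=64 on much larger matrices):

```python
def matmul(A, B):
    # Plain list-of-lists matrix multiply, enough for a toy example
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

d, r, alpha = 4, 1, 16          # toy sizes; the real adapters use r=64, alpha=16
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
B = [[0.5] for _ in range(d)]   # d x r trainable matrix (the "sticky note")
A = [[0.1] * d]                 # r x d trainable matrix

delta = matmul(B, A)            # low-rank update: rank <= r
scale = alpha / r
W_eff = [[W[i][j] + scale * delta[i][j] for j in range(d)] for i in range(d)]

# Parameter count: full update = d*d values, LoRA update = only 2*d*r values
print(d * d, 2 * d * r)  # 16 8
```

The gap between `d*d` and `2*d*r` is tiny here, but at d≈3072 and r=64 it is what turns a $10,000-GPU job into a consumer-GPU job.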

---

## 🔬 What This System Does

```
PhD Research OS Pipeline

📄 PDF Paper
   │
   ▼
Layer 0: PARSE ───── PDF → sections, tables, figures, equations
   │
   ▼
Layer 1: RESOLVE ─── Normalize names, resolve citations, check retractions
   │
   ▼
Layer 2: EXTRACT ─── AI Council pulls out claims with labels
   │                 (Fact / Interpretation / Hypothesis)
   │                 Preserves qualifiers ("may", "suggests")
   ▼
Layer 3: DEDUP ───── Find duplicate claims across papers
   │                 Merge same findings, track all sources
   ▼
Layer 4: GRAPH ───── Build knowledge graph with typed edges
   │                 (supports / refutes / extends)
   │                 Detect contradictions, find research gaps
   ▼
Layer 5: SCORE ───── Code-computed confidence (NOT an AI guess)
   │                 3 separate scores per claim
   │                 Parser quality caps confidence
   ▼
Layer 6: EVALUATE ── Check against gold standard
   │                 Regression gate blocks bad models
   ▼
Layer 7: EXPORT ──── Obsidian vault, CSV, BibTeX, courtroom UI
                     Full provenance (page, bbox, quote)
```

### The 3-Score System (Code-Computed, Not AI-Guessed)

The AI provides components. **The code computes the final scores.** The AI never directly sets confidence.

| Score | Formula | What It Measures |
|-------|---------|-----------------|
| **Evidence Quality** | evidence × study_quality × journal_tier × completeness × section_mod | How strong is the evidence behind this claim? |
| **Truth Likelihood** | evidence_quality + corroboration − conflict_penalty − null_penalty | How likely is this claim to be true? |
| **Qualifier Strength** | 1.0 − qualifier_count×0.1 − null_penalty − inherited_penalty | How definitive is the language? |

**Key gates** (hard rules the AI cannot override):
- Parser confidence **caps** claim confidence — bad parsing = low trust
- Large sample + tiny effect → **capped at 0.4** (statistically significant ≠ meaningful)
- Abstract claims → **forced to Interpretation** with 0.7× penalty

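
One way the formula table could be wired up in code. This is a float sketch with illustrative component values; the real system uses fixed-point integers and more inputs:

```python
def evidence_quality(evidence, study_quality, journal_tier, completeness, section_mod):
    # Score 1 is multiplicative: any single weak factor drags the whole score down
    return evidence * study_quality * journal_tier * completeness * section_mod

def truth_likelihood(eq, corroboration, conflict_penalty, null_penalty):
    # Score 2: evidence quality adjusted by agreement/disagreement across papers
    return max(0.0, min(1.0, eq + corroboration - conflict_penalty - null_penalty))

def qualifier_strength(qualifier_count, null_penalty=0.0, inherited_penalty=0.0):
    # Score 3: each hedging word ("may", "suggests") costs 0.1
    return max(0.0, 1.0 - qualifier_count * 0.1 - null_penalty - inherited_penalty)

# Illustrative claim: good evidence, slightly incomplete reporting, two hedges
eq = evidence_quality(0.9, 1.0, 1.0, 0.8, 1.0)                         # 0.72
tl = truth_likelihood(eq, corroboration=0.1,
                      conflict_penalty=0.0, null_penalty=0.0)          # 0.82
qs = qualifier_strength(qualifier_count=2)                             # 0.80
```

Reporting three numbers instead of one is deliberate: here the user sees "evidence decent, claim likely true, but hedged language", which a single average would hide.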
229
+ ### AI Model Council
230
+
231
+ Four AI "members" process each paper section independently, then debate:
232
+
233
+ | Member | Job | Analogy |
234
+ |--------|-----|---------|
235
+ | **Query Planner** | Breaks questions into searchable sub-queries | The librarian who knows which shelf to check |
236
+ | **Extractor** | Pulls out claims with epistemic tags + qualifiers | The note-taker who writes down key findings |
237
+ | **Critic** | Challenges the extractor — flags errors, missing qualifiers | The skeptic who asks "are you sure about that?" |
238
+ | **Chairman** | Synthesizes final claims with documented reasoning | The judge who makes the final call |
239
+
240
+ ---
241
+
242
+ ## 📖 The Complete Design Document — Explained from Top to Bottom
243
+
244
+ > **For high school students, curious researchers, and anyone who wants to understand WHY every piece of this system exists.**
245
+
246
+ Every system is built from choices. This section explains what each part of the [SYSTEM_DESIGN.md](SYSTEM_DESIGN.md) does, why we chose it, and what problem it solves. If you read nothing else, read this.
247
+
248
+ ---
249
+
250
+ ### Part 1: The System Overview (The Big Picture)
251
+
252
+ The very first thing in the design doc is a giant box diagram showing 7 layers with inputs flowing in at the top and outputs coming out at the bottom.
253
+
254
+ **Why this exists**: Before anyone writes code, they need to see the whole journey. A PDF enters at the top. An Obsidian vault comes out at the bottom. The diagram shows every stop along the way so no one gets lost.
255
+
256
+ **Why 7 layers specifically**: Each layer does ONE job and ONE job only. This is the "single responsibility principle" — the same reason kitchens have separate stations for prep, cooking, and plating. If Layer 4 (the graph) breaks, Layer 2 (extraction) keeps working. You can fix one without touching the others.
257
+
258
+ | Layer | Name | One-Sentence Purpose |
259
+ |-------|------|---------------------|
260
+ | 0 | Structural Ingestion | Turn a messy PDF into clean, structured text with section labels |
261
+ | 1 | Entity Resolution | Figure out what things mean — "NYU" is a university, "0.8 fM" is a measurement |
262
+ | 2 | Qualified Extraction | Find every claim and label it Fact, Interpretation, or Hypothesis |
263
+ | 3 | Canonicalization | If the same finding appears in 3 papers, merge them into one master record |
264
+ | 4 | Knowledge Graph | Connect claims into a web — this supports that, this contradicts this |
265
+ | 5 | Calibrated Scoring | Compute a trust score using math formulas, not AI guessing |
266
+ | 6 | Evaluation | Test the system against expert-labeled papers to make sure it works |
267
+ | 7 | Provenance | Remember exactly which paper, page, and sentence every claim came from |
268
+
269
+ **The cross-cutting box** at the bottom (AI Model Council, Meta-Improver, etc.) exists because some jobs span ALL layers. The Council is used in extraction (Layer 2) AND in conflict detection (Layer 4). Keeping them separate from the main pipeline means they can be upgraded independently.
270
+
271
+ ---
272
+
273
+ ### Part 2: The Two-Model Strategy (Why We Run TWO Brains)
274
+
275
+ The design doc specifies that the system runs TWO models, not one. This might seem wasteful. It is not. Here is why.
276
+
277
+ **The Primary Brain** stays on your local computer and NEVER touches the internet. This is the brain that reads your actual paper PDFs. It is a Qwen3-8B model running at 4-bit quantization.
278
+
279
+ **Why local?** Science papers contain unpublished data, patent ideas, and student thesis drafts. If you upload that to ChatGPT, it becomes training data for OpenAI. Local models stay private forever.
280
+
281
+ **Why Qwen3-8B?** We compared Qwen2.5-3B and Qwen3-8B:
282
+
283
+ | Test | Qwen2.5-3B | Qwen3-8B | What This Means |
284
+ |------|-----------|----------|----------------|
285
+ | Math reasoning (AIME) | ~15% | ~45% | Qwen3 can handle equations and statistics 3× better |
286
+ | JSON accuracy | ~65% | ~80% | Qwen3 produces valid structured output more reliably |
287
+ | Context window | 32K tokens | 128K tokens | Qwen3 can read entire 100-page supplements at once |
288
+
289
+ The 8B model needs ~5GB for weights + ~4GB for KV cache at 128K context. This fits a 16GB consumer GPU. The 3B model would use less memory but makes too many errors to be trustworthy for science.
290
+
291
+ **The Companion Brain** lives on the internet (Claude API, GPT-4o-mini, or OpenRouter). It never sees raw paper text. It only gets metadata, queries, and anonymized claims.
292
+
293
+ **Why two brains?** The companion brain checks for paper retractions on CrossRef, searches arXiv for new papers, and validates repository URLs. These tasks need the internet. The primary brain reads sensitive unpublished data. These tasks must stay offline. One brain cannot do both jobs safely.
294
+
295
+ ---
296
+
297
+ ### Part 3: Training Pipeline — The 4-Stage Journey
298
+
299
+ The model does not start smart. It must be taught. The design doc specifies a 4-stage training pipeline. Here is why each stage exists and why you cannot skip any.
300
+
301
+ **Stage 1: SFT (Supervised Fine-Tuning)**
302
+ - **What it does**: Show the model thousands of examples of paper text paired with correct claim extractions. Like showing a student thousands of solved math problems.
303
+ - **Why it exists first**: The model must learn the BASIC pattern — "given this paragraph, produce this JSON with these fields" — before it can learn anything subtle.
304
+ - **Why QLoRA**: Full fine-tuning of 3B parameters needs a $10,000 GPU. QLoRA trains 160M adapter parameters on a $500 GPU. The adapters are "sticky notes" on the original model.
305
+
306
+ **Stage 2: DPO (Direct Preference Optimization)**
307
+ - **What it does**: Show the model PAIRS — one correct extraction and one slightly wrong extraction — and teach it to prefer the correct one.
308
+ - **Why it comes AFTER SFT**: A student who has never seen a correct answer cannot judge between two answers. SFT teaches "what is correct." DPO teaches "why A is better than B."
309
+ - **Why it matters**: DPO fixes the "almost right" problem. SFT might produce JSON with all fields present but the wrong epistemic tag. DPO specifically penalizes tag errors.
310
+
311
+ **Stage 3: GRPO (Group Relative Policy Optimization)**
312
+ - **What it does**: The model generates 8 different answers for the same prompt. A reward function scores each one. The model learns to generate higher-scoring answers.
313
+ - **Why it comes AFTER DPO**: GRPO needs the model to already understand the task well enough to generate plausible variations. A random model would generate garbage that all score zero.
314
+ - **Why it matters**: GRPO bakes in JSON validity, schema compliance, and qualifier preservation using MATH-BASED reward functions — not human judgment. The reward function is CODE, not a person.
315
+
316
+ **Stage 4: ConfTuner (Calibration Tuning)**
317
+ - **What it does**: After the model produces correct answers, fix its confidence calibration. A model that says "95% confident" for correct answers and "95% confident" for wrong answers is broken.
318
+ - **Why it comes LAST**: You cannot calibrate confidence until the model is already producing correct answers. Calibrating garbage is meaningless.
319
+ - **Why it matters**: Science requires honest uncertainty. If the model is 50% confident, the user must know that. If the model lies and says 90% confident, bad decisions follow.
320
+
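
A Stage 3 reward function of this kind might look as follows. The point values, JSON field names, and hedge list here are hypothetical, not the real harness; the point is that every check is plain code:

```python
import json

HEDGES = {"may", "might", "suggests", "appears", "likely"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}

def reward(source_text, model_output):
    """Score one generated answer: 0..10, purely rule-based."""
    points = 0
    try:
        claim = json.loads(model_output)        # JSON validity: 4 points
        points += 4
    except json.JSONDecodeError:
        return 0                                # invalid JSON earns nothing
    if claim.get("epistemic_tag") in VALID_TAGS:
        points += 3                             # schema compliance: 3 points
    src_hedges = HEDGES & set(source_text.lower().split())
    out_hedges = HEDGES & set(str(claim.get("claim", "")).lower().split())
    if src_hedges <= out_hedges:
        points += 3                             # qualifier preservation: 3 points
    return points

good = '{"claim": "The method may reduce costs", "epistemic_tag": "Hypothesis"}'
bad = '{"claim": "The method reduces costs", "epistemic_tag": "Hypothesis"}'
print(reward("The method may reduce costs.", good))  # 10
print(reward("The method may reduce costs.", bad))   # 7 — dropped "may"
```

GRPO then pushes the model toward whichever of its 8 sampled answers scored highest, so dropping a "may" becomes a trainable penalty rather than a style quibble.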

---

### Part 4: Layer 0 — Structural Ingestion (Why PDFs Are Hard)

**The problem**: PDFs are not documents. They are instructions for where to place ink on paper. The computer sees "put letter T at position (72, 340)" not "this is a paragraph about detection limits."

**Why we need 5 different tools**:

| Tool | What It Does | Why We Chose It |
|------|-------------|-----------------|
| **Marker** | PDF → structured markdown with layout awareness | Knows the difference between a heading, a paragraph, and a caption |
| **Nougat** | Scientific PDFs → LaTeX equations | Can read math formulas that other tools turn into gibberish |
| **GROBID** | Headers, authors, citations, references | The academic standard — extracts bibliographic metadata |
| **LayoutLMv3** | Classifies page regions (text, table, figure, equation) | Machine vision for documents — "this region is a table" |
| **PlotDigitizer** | Quantitative plots → (x,y) CSV coordinates | Turns images of graphs into actual numbers |

**Why not just use PyMuPDF?** PyMuPDF reads text in reading order. But scientific papers have multi-column layouts, floating figures, and tables that span pages. PyMuPDF often merges table cells with body text. Marker uses machine learning to understand document LAYOUT, not just text sequence.

**Why section-aware chunking**: If you split a paper at page 5, you might cut a table in half. The design specifies chunking by SECTION (Introduction, Methods, Results) with 1-paragraph overlap. This means every chunk has complete context. A Results chunk includes the paragraph before and after, so the AI sees "As shown in Figure 3..." and knows Figure 3 exists.

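
A minimal sketch of section-aware chunking with 1-paragraph overlap, assuming paragraphs arrive as (section, text) pairs; real chunking would also respect token budgets:

```python
def chunk_by_section(paragraphs):
    """paragraphs: list of (section_name, text) in document order.
    Returns one chunk per section, padded with one paragraph of
    context on each side (the "1-paragraph overlap")."""
    by_section = {}
    for i, (section, _) in enumerate(paragraphs):
        by_section.setdefault(section, []).append(i)
    chunks = []
    for section, idxs in by_section.items():
        lo = max(0, idxs[0] - 1)                  # one paragraph before
        hi = min(len(paragraphs), idxs[-1] + 2)   # one paragraph after
        chunks.append((section, [paragraphs[j][1] for j in range(lo, hi)]))
    return chunks

paras = [("Intro", "p1"), ("Methods", "p2"), ("Methods", "p3"), ("Results", "p4")]
chunks = chunk_by_section(paras)
# Results chunk carries "p3" too, so cross-references still resolve
print(chunks[-1])  # ('Results', ['p3', 'p4'])
```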

**Why per-region quality scoring**: Not every part of a PDF is equally readable. A scanned figure might have OCR errors. A handwritten annotation might be noise. The system assigns a `parse_confidence` score to every region. Low-confidence regions get flagged so the user knows not to trust claims from that region.

---

### Part 5: Layer 1 — Entity Resolution (What Is This Thing?)

**The problem**: "The NYU team used 0.8 fM" contains four entities: NYU (institution), team (group), 0.8 fM (measurement), and an implied method. The computer does not know any of this without help.

**Why we normalize entities**: If Paper A says "New York University" and Paper B says "NYU," the system must know these are the same institution. Otherwise the knowledge graph treats them as unrelated nodes and misses connections.

**Why we resolve citations**: A paper might say "as shown in [32]." The system must find reference 32 in the bibliography, get its DOI, and look up the actual paper. Only then can it tag a claim as "inherited citation" (this paper is repeating someone else's finding, not producing a new one).

**Why we check retractions**: A paper might be retracted because the data was fraudulent. If the system does not know this, it treats fraudulent claims as trustworthy. The design specifies periodic checks against CrossRef and Retraction Watch.

**Why Version of Record (VoR) lineage**: Scientists often publish a preprint on arXiv, then a final version in a journal. The preprint might have errors that the journal version fixes. The system must track: preprint → journal version → erratum → retraction. Each version supersedes the previous one.

---

### Part 6: Layer 2 — Qualified Extraction (The AI Council)

**The problem**: AI models make mistakes. They hallucinate numbers. They drop the word "may" and turn a tentative finding into a fact. They miss qualifiers like "in 10 mM PBS only" that limit where a result applies.

**Why we use a COUNCIL instead of one model**: One model is one opinion. Four models debating produces better answers for the same reason that four doctors reviewing a case produces a better diagnosis than one doctor. The design specifies:

1. **Query Planner** breaks the paper into sub-questions — "find the detection limit," "find the sample size," "find the p-value"
2. **Two Extractors** independently pull claims. If they disagree, that disagreement IS information.
3. **Critic** reviews both extractions and flags errors
4. **Chairman** synthesizes the final answer, resolving disagreements with documented reasoning

**Why Round 1 is PARALLEL**: The extractors must not see each other's work. If Extractor B sees Extractor A's answer, B might anchor on A's mistakes. Parallel execution guarantees independent judgment.

**Why Round 2 shares TAGS but NOT confidence**: If Extractor A tags a claim as "Fact" and Extractor B tags it as "Interpretation," the Council must debate WHY. But if they share confidence scores, the lower-confidence extractor might cave to the higher-confidence one. Confidence is hidden to prevent anchoring.

**Why we need constrained decoding (Guidance engine)**: The AI is asked to output JSON. Without constraints, it might produce:
```
The detection limit was 0.8 fM {
  "claim": ...
}
```
The Guidance library GUARANTEES that `epistemic_tag` is one of {Fact, Interpretation, Hypothesis, Conflict_Hypothesis}. It does this by controlling which tokens the model is allowed to generate at each position. The model literally CANNOT output an invalid tag.

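
The mechanism can be illustrated with a toy mask over whole words. Guidance itself operates on tokenizer tokens, and the vocabulary entries below are invented, but the principle is the same: before each generation step, everything that cannot extend a valid value is removed from play.

```python
ALLOWED_TAGS = ["Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"]

def mask_invalid(prefix, vocab):
    """Keep only vocabulary items that keep the tag field on a valid path."""
    return [t for t in vocab
            if any(tag.startswith(prefix + t) for tag in ALLOWED_TAGS)]

# Toy "vocabulary" mixing tag fragments with free text:
vocab = ["Fact", "Interp", "banana", "retation", "Hypo"]

print(mask_invalid("", vocab))        # ['Fact', 'Interp', 'Hypo'] — 'banana' masked out
print(mask_invalid("Interp", vocab))  # ['retation'] — forced to complete "Interpretation"
```

Because sampling only ever happens over the masked set, an invalid tag is not merely unlikely: it is unreachable.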

**Why we preserve qualifiers**: "The method MAY reduce costs" and "The method reduces costs" are completely different claims. Dropping "may" turns an uncertain hypothesis into a confident fact. The system has a specific qualifier preservation reward function that checks whether hedging words from the source appear in the extraction.

---

### Part 7: Layer 3 — Canonicalization (Deduplication)

**The problem**: If 50 papers study COVID-19 vaccines, the phrase "vaccines are effective" might appear in all 50. Without deduplication, the knowledge graph has 50 identical nodes. The user sees 50 copies of the same claim.

**Why embedding-based similarity**: Word-based deduplication fails. "The vaccine showed 95% efficacy" and "Efficacy of 95% was demonstrated by the vaccine" share almost no words but mean the SAME thing. The design specifies embedding models that convert sentences to number vectors. Similar meaning = similar vectors, regardless of wording.

**Why a cosine similarity threshold of 0.85**: Below 0.70, claims are definitely different. Above 0.85, they are probably the same. Between 0.70 and 0.85, the system flags them for human review. This threshold was chosen empirically — it is not a magic number.

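
The decision rule is a few lines of code. The 3-dimensional "embeddings" below are invented for illustration; real sentence embeddings have hundreds of dimensions:

```python
import math

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction in meaning-space, 0.0 = unrelated
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dedup_decision(sim):
    # Thresholds from the design: >= 0.85 merge, 0.70-0.85 human review
    if sim >= 0.85:
        return "merge"
    if sim >= 0.70:
        return "human_review"
    return "distinct"

a = [0.90, 0.10, 0.00]  # "The vaccine showed 95% efficacy"
b = [0.88, 0.12, 0.05]  # "Efficacy of 95% was demonstrated by the vaccine"
print(dedup_decision(cosine(a, b)))  # merge
```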

**Why temporal versioning**: A claim from 2020 might say "vaccine efficacy unknown." The same claim from 2021 might say "vaccine efficacy 95%." These are NOT duplicates. They are sequential versions of the same scientific story. The system stores a version history: `version 1: preprint_2020` → `version 2: journal_2021`.

---

### Part 8: Layer 4 — Knowledge Graph (The Web of Science)

**The problem**: Science is not a list of facts. It is a web of relationships. Finding A supports Finding B. Finding C contradicts Finding B. Method X was used in both Study Y and Study Z. A list cannot represent this.

**Why SQLite instead of Neo4j**: Neo4j is a professional graph database. It is also external software that requires installation, licensing, and maintenance. SQLite is built into Python. It requires zero setup. For a PhD student running this on a laptop, "works immediately" beats "theoretically better."

**Why typed edges**: An edge without a type is just "connected." The design specifies 7 edge types:

| Edge Type | What It Means | Why It Matters |
|-----------|-------------|---------------|
| `supports` | A provides evidence for B | The strongest positive relationship |
| `refutes` | A contradicts B | Triggers conflict detection in the Courtroom UI |
| `extends` | A adds conditions to B | "Works in water" extends "Works in ideal conditions" |
| `depends_on` | A assumes B is true | If B is retracted, A's validity is questioned |
| `supersedes` | A replaces B (newer data) | Old claims are marked outdated |
| `blocks` | A is a null finding (no effect) | Prevents false connections |
| `investigative_hypothesis` | Inferred (not directly observed) | Marked differently because it is weaker evidence |

**Why we NEVER auto-generate `supports` across multiple hops**: If A supports B, and B supports C, it does NOT follow that A supports C. This is the "transitivity fallacy" — common in science but logically invalid. The design explicitly bans this.

**Why gap analysis**: The system looks for pairs of well-connected entities with NO edge between them. If "graphene" connects to 50 methods and "biosensor" connects to 50 methods, but NO paper connects graphene to biosensors, that is a research gap. It is a publishable finding.

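
A minimal sketch of such a store using Python's built-in `sqlite3` (table and column names are illustrative, not the real schema). A `CHECK` constraint enforces the 7 edge types at the database level, and the gap query is a plain anti-join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE claims (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE edges (
    src INTEGER REFERENCES claims(id),
    dst INTEGER REFERENCES claims(id),
    -- only the 7 typed relationships are storable at all:
    type TEXT CHECK (type IN ('supports','refutes','extends','depends_on',
                              'supersedes','blocks','investigative_hypothesis'))
);
""")
con.executemany("INSERT INTO claims VALUES (?, ?)",
                [(1, "LOD 0.8 fM in buffer"), (2, "LOD 1.0 fM in serum"),
                 (3, "No effect observed")])
con.execute("INSERT INTO edges VALUES (1, 2, 'extends')")

try:  # an untyped/invalid edge is rejected by the schema itself
    con.execute("INSERT INTO edges VALUES (2, 3, 'related')")
except sqlite3.IntegrityError:
    print("invalid edge type rejected")

# Gap analysis sketch: claim pairs with no edge in either direction
gaps = con.execute("""
    SELECT a.id, b.id FROM claims a JOIN claims b ON a.id < b.id
    WHERE NOT EXISTS (SELECT 1 FROM edges e
                      WHERE (e.src = a.id AND e.dst = b.id)
                         OR (e.src = b.id AND e.dst = a.id))
    ORDER BY a.id, b.id
""").fetchall()
print(gaps)  # [(1, 3), (2, 3)]
```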

---

### Part 9: Layer 5 — Calibrated Scoring (Why Code Computes Trust)

**The problem**: If you ask an AI "how confident are you?" it will make up a number. GPT-4 might say "95% confident" for a completely hallucinated claim. Humans trust numbers. Fake numbers are dangerous.

**Why the AI provides COMPONENTS, not scores**: The AI outputs `evidence_strength: 900`, `study_quality_weight: 1000`, `journal_tier_weight: 1000`. The CODE multiplies these together using fixed formulas. The AI never touches the final confidence number.

**Why fixed-point math (×1000)**: Floating-point numbers have rounding errors. `0.1 + 0.2 ≠ 0.3` in computers. The design stores all probabilities as integers from 0 to 1000. 950 means "95.0% confident." This is exact. No rounding. No drift.

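
A quick demonstration of both points: the float drift that motivates ×1000 fixed-point, and the hard gates written as plain code (function and argument names here are illustrative):

```python
# Float drift vs exact fixed-point: 0.100 + 0.200 becomes 100 + 200
print(0.1 + 0.2 == 0.3)    # False — binary floats cannot represent 0.1 exactly
print(100 + 200 == 300)    # True — integer fixed-point is exact

def apply_gates(score, parse_confidence, n, effect_size):
    """All values are integers 0..1000 except n and effect_size.
    These are the hard rules the AI cannot override."""
    score = min(score, parse_confidence)       # parser confidence CAPS the claim
    if n > 1000 and abs(effect_size) < 0.1:    # significant but practically tiny
        score = min(score, 400)                # "capped at 0.4"
    return score

print(apply_gates(950, parse_confidence=600, n=50, effect_size=0.5))       # 600
print(apply_gates(950, parse_confidence=1000, n=10000, effect_size=0.01))  # 400
```

Because the gates run after whatever the model reports, a confidently-worded claim from a badly-parsed table still comes out capped.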

**Why 3 separate scores instead of 1**: One number hides what is wrong. A claim might have perfect evidence (Score 1 = 1000) but terrible language qualifiers (Score 3 = 300). A single average of 650 hides the problem. Three scores show: "the evidence is strong, but the claim is hedged."

**Why parser confidence CAPS the claim score**: If the PDF parser was only 60% confident it read the table correctly, the claim CANNOT score above 600. This prevents the classic failure: "garbage in, gospel out."

**Why the statistical gate exists**: A study with N=10,000 and effect size d=0.01 is "statistically significant" (p<0.001) but practically meaningless. The effect is too small to matter. The design has a hard rule: if N>1000 and |effect|<0.1, cap confidence at 400. This prevents the system from treating "statistically significant" as "important."

---

### Part 10: Layer 6 — Evaluation (How Do You Know It Works?)

**The problem**: A system that processes science papers must be TESTED. But testing AI is hard because the answers are not binary right/wrong. "Is this claim well-extracted?" is a judgment call.

**Why we need a GOLD STANDARD**: The design specifies 10 papers where human experts have labeled every claim, tag, qualifier, and conflict. This is the ground truth. Without it, you are driving without a map.

**Why paper-level test splits**: If you randomly split claims into train/test, claims from the SAME paper might appear in both. The model memorizes paper-specific phrases instead of learning general extraction. The design specifies splitting by PAPER — all claims from Paper A go to training, all claims from Paper B go to testing.

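
A sketch of such a split (field names illustrative). Shuffling happens over *papers*, so every claim from a given paper lands entirely on one side:

```python
import random

def split_by_paper(claims, test_fraction=0.2, seed=0):
    """claims: list of dicts with a 'paper_id' key. Splits by paper, not by claim."""
    papers = sorted({c["paper_id"] for c in claims})
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    rng.shuffle(papers)
    n_test = max(1, int(len(papers) * test_fraction))
    test_papers = set(papers[:n_test])
    train = [c for c in claims if c["paper_id"] not in test_papers]
    test = [c for c in claims if c["paper_id"] in test_papers]
    return train, test

claims = [{"paper_id": p, "claim": f"claim {i}"}
          for i, p in enumerate(["A", "A", "B", "B", "C"])]
train, test = split_by_paper(claims)

# No paper appears on both sides — the leakage a random claim-level split allows:
assert not ({c["paper_id"] for c in train} & {c["paper_id"] for c in test})
```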
**Why LLM-as-Judge**: For fuzzy criteria like "is this claim faithful to the source text?" there is no code formula. The design uses a separate LLM (not the one being tested) as a judge. But the judge is also imperfect, so results are treated as estimates, not truth.

**Why stochastic testing**: LLMs give different answers each time. Running evaluation once gives one number that might be 5% higher or lower next time. The design specifies running evaluations 3-5 times and reporting the range.

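Reporting a range instead of a single number might look like this (a sketch; the toy score sequence stands in for a real, stochastic evaluation run):

```python
import statistics

def evaluate_with_spread(run_eval, n_runs=5):
    """Run a stochastic evaluation several times; report the range, not one number."""
    scores = [run_eval() for _ in range(n_runs)]
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }

# Stand-in for a real eval call: a fixed toy sequence of F1 scores.
fake_scores = iter([0.81, 0.78, 0.84, 0.80, 0.79])
report = evaluate_with_spread(lambda: next(fake_scores))
print(f"F1: {report['mean']:.2f} (range {report['min']:.2f}-{report['max']:.2f})")
```

A dashboard that shows "F1: 0.80 (range 0.78-0.84)" is harder to fool than one that shows a single lucky run.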
**Why hidden holdout**: The team developing the system might unconsciously optimize for the test set. A truly hidden holdout — 3 papers that NO ONE on the team has seen during development — catches this. Evaluated quarterly.

---

### Part 11: Layer 7 — Provenance (Where Did This Come From?)

**The problem**: A claim in the knowledge graph might have been extracted 6 months ago using an old model version. If the model has since improved, the old claim might be wrong. But you cannot know unless you track EVERYTHING.

**Why every claim stores a full lineage**:
```json
{
  "pipeline_version": "2.1.0",
  "model_checkpoint": "research-os-grpo-v2-step-5000",
  "parser_version": "marker-1.2.0",
  "taxonomy_version": "quantum_bio_v1",
  "prompt_hash": "sha256:a3b4c5...",
  "extraction_timestamp": "2026-04-23T10:30:00Z"
}
```

This is like a food label that tells you which farm, factory, and truck touched your apple. If a bad batch of apples makes people sick, you trace the batch number. If a model checkpoint produces bad extractions, you trace the checkpoint hash.

**Why the security sandbox**: The system validates repository URLs and downloads code. A malicious URL could execute harmful code. The sandbox runs these checks in isolation: HTTP GET only, 60-second timeout, 100MB download limit, no code execution.

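A sketch of what those rules could look like using only the standard library (the function name and constants are illustrative assumptions; this shows only the GET-only, timeout, and size rules — real isolation needs a separate process or container on top):

```python
import urllib.request
from urllib.parse import urlparse

MAX_BYTES = 100 * 1024 * 1024   # 100 MB download limit
TIMEOUT_S = 60                  # 60-second timeout

def sandboxed_fetch(url):
    """Validate a repository URL with a plain GET -- never execute what we fetch."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        # Reject file://, ftp://, etc. before any network or disk access happens.
        raise ValueError(f"refusing non-HTTP scheme: {scheme!r}")
    with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
        data = resp.read(MAX_BYTES + 1)   # read one extra byte to detect overflow
    if len(data) > MAX_BYTES:
        raise ValueError("download exceeds 100 MB limit")
    return data
```

The scheme check runs first, so a `file:///etc/passwd` URL is rejected before anything touches the filesystem.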
**Why the epistemic embargo**: A researcher analyzing unpublished data needs privacy. The embargo mode stores claims in a private subgraph that NO other user can see. After paper submission, one click moves them to the shared lab graph.

---

### Part 12: The UI — Progressive Disclosure

**The problem**: Science produces overwhelming output. 100 papers → 3,000+ claims. Showing everything at once paralyzes the user.

**Why progressive disclosure levels**: The design specifies six levels of detail (0-5), like peeling an onion:

| Level | What You See | When You Need It |
|-------|-------------|-----------------|
| 0 | Dashboard — health scores and today's queue | First thing every morning |
| 1 | Claim text + tag + confidence + source | Deciding if a claim is worth reading |
| 2 | The 3 separate scores (evidence, truth, qualifier) | Understanding WHY a claim scored low |
| 3 | Source quote + page + bbox + council votes | Verifying the claim against the original paper |
| 4 | 2-hop subgraph (neighbors and edges) | Understanding how this claim connects to others |
| 5 | Raw LLM outputs + token distributions | Debugging when something goes wrong |

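One way to implement cumulative disclosure, where level N reveals everything from levels 0 through N (field names here are invented for illustration):

```python
# Fields revealed at each disclosure level (cumulative); names are illustrative.
LEVEL_FIELDS = {
    0: ["health_score"],
    1: ["claim_text", "tag", "confidence", "source"],
    2: ["evidence_score", "truth_score", "qualifier_score"],
    3: ["source_quote", "page", "bbox", "council_votes"],
    4: ["subgraph_2hop"],
    5: ["raw_llm_output", "token_distributions"],
}

def visible_fields(level):
    """A claim view at level N shows everything from levels 0..N."""
    return [f for lvl in range(level + 1) for f in LEVEL_FIELDS[lvl]]

print(visible_fields(2))
```

Because levels are cumulative, drilling from level 1 to level 3 only ever adds detail; nothing the user was already looking at disappears.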
**Why the Courtroom UI**: When 3 papers contradict each other, the system displays them side by side — like a courtroom with 3 witnesses. The user sees the methods, sample sizes, and p-values for each. The system does NOT pick a winner. It presents the evidence and lets the researcher decide.

**Why Manual Synthesis Mode**: After using the system for months, researchers might start trusting it blindly. Manual mode hides ALL AI output — no scores, no conflict flags, no suggestions. The researcher draws connections manually, then switches back to compare. This built-in friction prevents over-reliance.

---

### Part 13: Appendix A — Future Architecture (Why We Are Planning Ahead)

The SYSTEM_DESIGN.md includes an Appendix A that describes improvements validated by peer-reviewed research. These are NOT implemented yet. They are roadmap items for Phase F and beyond.

**Why MAGMA (Multi-Graph Memory) matters**: The current knowledge graph mixes temporal, causal, semantic, and entity relationships into one edge space. This is like storing all your files in one giant folder. MAGMA creates separate folders (graphs) for each relationship type. When you ask "Why did the 2023 paper disagree?" the system searches only the CAUSAL graph, not everything.

**Why post-Transformer architectures matter**: The current Transformer (Qwen3-8B) has a "quadratic attention bottleneck." At 128K tokens, every new token must attend to ALL previous tokens. This is like a student re-reading the entire textbook before every new page. It is slow and memory-hungry.

New architectures solve this:
- **Mamba (SSM)**: Compresses history into a hidden state. Like a speed-reader who keeps a mental summary instead of re-reading every page. 5× faster on long sequences.
- **RWKV**: Constant memory during inference. Like having a notebook where you only write the latest summary, not every note ever taken.
- **Jamba/Griffin (Hybrid)**: Keep a few Transformer layers for precise short-range attention, add SSM layers for cheap long-range memory. Best of both worlds.
- **MoE (Mixture of Experts)**: 30B total parameters, only 3B active per token. Like a hospital with 30 specialists where each patient sees only the relevant 3.

**Why CompreSSM matters**: Training SSMs from scratch at small dimension produces weak models. CompreSSM starts LARGE (state dimension 256), trains for a while, then uses control theory (Hankel singular values) to identify which dimensions matter and surgically removes the rest. Result: a compact model that outperforms one trained small from the start.

**Why gradual migration, not a hard switch**: The Transformer ecosystem (AWQ quantization, vLLM serving, GRPO in TRL) is mature. SSM ecosystems are newer. The design recommends:
1. **Now**: Keep the Transformer (stable, tested, mature ecosystem)
2. **Phase F**: Test a hybrid model on holdout data, compare metrics
3. **Year 2+**: If hybrids win, migrate the ingestion pipeline to an SSM backbone
4. **Always**: The companion brain stays on frontier APIs (Claude, GPT-4o) — only the local primary brain migrates

---

## 📊 Resources

| Resource | Link | What's Inside |
|----------|------|---------------|
| **Model + Code** | [nkshirsa/phd-research-os-brain](https://huggingface.co/nkshirsa/phd-research-os-brain) | This repo — 60 files of code, design docs, tests |
| **Training Data** | [nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data) | 1,900 multi-task examples across 6 tasks |
| **Taxonomy GUI** | [nkshirsa/phd-research-os-taxonomy](https://huggingface.co/spaces/nkshirsa/phd-research-os-taxonomy) | Live Gradio Space — explore the epistemic taxonomy |
| **Training Space** | [nkshirsa/phd-research-os-train](https://huggingface.co/spaces/nkshirsa/phd-research-os-train) | ZeroGPU micro-batch training interface |
| **Blindspot Audit** | [BLINDSPOT_AUDIT_COMPLETE.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/BLINDSPOT_AUDIT_COMPLETE.md) | 87 failure modes across 4 discovery epochs |
| **System Design** | [SYSTEM_DESIGN.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/SYSTEM_DESIGN.md) | Complete 7-layer architecture + Appendix A (MAGMA, post-Transformer architectures, CompreSSM) |
| **Future Improvements** | [FUTURE_IMPROVEMENTS.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/FUTURE_IMPROVEMENTS.md) | 78 blindspots found by reading every code file (high-school readable) |

---

## 🚀 Quick Start

```bash
git clone https://huggingface.co/nkshirsa/phd-research-os-brain
cd phd-research-os-brain
pip install gradio pymupdf
python -m phd_research_os_v2.app
# Open http://localhost:7860
```

Works immediately with heuristic extraction. Add an API key for AI-powered extraction:

```bash
export ANTHROPIC_API_KEY=sk-...  # or OPENAI_API_KEY
```

---

## 🧬 Epistemic Separation Engine

The same finding means different things depending on where it appears in a paper:

| Source Section | Confidence Modifier | Why |
|---------------|-------------------|-----|
| Results (with statistics) | 1.0× (full weight) | Direct data — the strongest evidence |
| Supplement | 1.0× (full weight) | Same as results, just in a different file |
| Methods | 1.0× (protocol, not a claim) | Describes what was done, not what was found |
| Discussion | 0.75× (penalized) | Author's interpretation — goes beyond the data |
| Abstract | 0.7× (forced to Interpretation) | Often overstates what the results actually show |

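In code, the table above is just a lookup and a multiply. A minimal sketch (the dict keys and function name are illustrative, not the shipped module):

```python
# Confidence modifier per source section, mirroring the table above.
SECTION_MODIFIERS = {
    "results": 1.0,
    "supplement": 1.0,
    "methods": 1.0,
    "discussion": 0.75,
    "abstract": 0.7,
}

def adjust_confidence(base_score, section):
    """Scale a claim's base confidence by where it appeared in the paper."""
    return round(base_score * SECTION_MODIFIERS[section])

print(adjust_confidence(800, "results"))     # 800 -- full weight
print(adjust_confidence(800, "discussion"))  # 600 -- interpretation penalty
print(adjust_confidence(800, "abstract"))    # 560 -- abstracts overstate
```

The same extracted sentence thus scores differently depending only on which section it came from.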
---

## 🤖 ECC Harness — Companion AI System

```python
from phd_research_os.agent_os import AgentOS

aos = AgentOS()
agent = aos.spawn_companion("DataQualityAuditor")
task = aos.assign_task(agent, "Audit last 50 claims for hallucination patterns")
result = aos.run_task(task)
proposals = aos.get_proposals(agent)  # Review what the agent found
aos.approve_proposal(proposals[0])    # Human approves changes
```

5 built-in companion types:

| Agent | What It Does |
|-------|-------------|
| **DataQualityAuditor** | Checks claims for hallucinations and suspicious confidence |
| **PromptOptimizer** | Improves system prompts via A/B testing |
| **DomainExpander** | Generates training examples for new scientific fields |
| **CalibrationAnalyst** | Analyzes Brier scores, recommends weight adjustments |
| **CitationChaser** | Finds new papers that support or contradict existing claims |

All companion output goes through **Proposals** — the agent can suggest changes but a human must approve them.

---

## 📐 Quantum-Bio Taxonomy V2

Evidence from different study types gets different weights:

| Study Type | Weight | Example |
|-----------|--------|---------|
| In vivo | 1.00 | Animal/human experiment |
| Direct physical measurement | 1.00 | Electron microscopy image |
| Mathematical proof | 0.95 | Formal theorem |
| In vitro | 0.85 | Cell culture experiment |
| First-principles simulation | 0.80 | Density functional theory |
| Phenomenological simulation | 0.60 | Curve fitting model |
| Review | 0.40 | Literature survey |
| Perspective | 0.20 | Expert opinion piece |

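As code, the weights table becomes a simple lookup; summing weights across a claim's supporting studies gives one way to compare evidence bodies (keys and function name are illustrative):

```python
# Study-type weights, mirroring the table above.
STUDY_TYPE_WEIGHTS = {
    "in_vivo": 1.00,
    "direct_measurement": 1.00,
    "mathematical_proof": 0.95,
    "in_vitro": 0.85,
    "first_principles_sim": 0.80,
    "phenomenological_sim": 0.60,
    "review": 0.40,
    "perspective": 0.20,
}

def weighted_support(evidence):
    """Sum study-type weights for a claim's list of supporting studies."""
    return sum(STUDY_TYPE_WEIGHTS[study] for study in evidence)

# Two cell-culture studies outweigh one expert opinion piece.
print(round(weighted_support(["in_vitro", "in_vitro"]), 2))  # 1.7
print(round(weighted_support(["perspective"]), 2))           # 0.2
```

The point of the weighting: five perspective pieces (5 × 0.20 = 1.0) still only match a single in vivo experiment.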
---

## ✅ Tests

```
test_v2_integration.py — 24 ✅ (full pipeline)
test_db.py — 22 ✅ (data layer)
test_agent_os.py — 21 ✅ (ECC harness)
test_taxonomy.py — 27 ✅ (taxonomy)
test_skills_and_meta.py — 30 ✅ (skills + meta)
test_council.py — 19 ✅ (AI council)
Total: 143 passing
```

---

## 📈 Training Pipeline

**Current**: SFT (Supervised Fine-Tuning) with QLoRA on Qwen2.5-3B-Instruct

**Planned 4-stage pipeline**:
1. **SFT** — "Here's the right answer, learn it" ✅ Built
2. **DPO** — "This answer is better than that one" 📋 Designed
3. **GRPO** — "Reward for good JSON, correct tags, preserved qualifiers" 📋 Designed
4. **ConfTuner** — "Make your confidence scores match reality" 📋 Designed

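A toy version of the kind of reward function GRPO (stage 3) would optimize, rewarding valid JSON, a correct tag, and preserved qualifiers (the reward weights and JSON schema here are invented for illustration, not the designed spec):

```python
import json

def grpo_reward(model_output, gold_tag, gold_qualifiers):
    """Toy GRPO-style reward: valid JSON + correct tag + preserved qualifiers."""
    try:
        claim = json.loads(model_output)
    except json.JSONDecodeError:
        return -1.0                      # malformed JSON is the worst failure
    reward = 0.4                         # parseable structure
    if claim.get("tag") == gold_tag:
        reward += 0.3                    # correct epistemic tag
    text = claim.get("claim", "")
    kept = sum(q in text for q in gold_qualifiers)
    if gold_qualifiers:
        reward += 0.3 * kept / len(gold_qualifiers)   # hedges preserved
    return reward

out = '{"tag": "Hypothesis", "claim": "MAY increase coherence in vitro"}'
print(round(grpo_reward(out, "Hypothesis", ["MAY", "in vitro"]), 2))  # 1.0
```

Dropping a hedge like "MAY" costs reward but does not zero it, so the model is pushed toward faithful, fully qualified extractions rather than all-or-nothing gambles.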
**Training options**:
- **ZeroGPU**: [nkshirsa/phd-research-os-train](https://huggingface.co/spaces/nkshirsa/phd-research-os-train) — click Train repeatedly
- **Local** (needs GPU): `python train.py`
- **HF Jobs**: Single continuous run on dedicated GPU (recommended)

---

## 🔍 Honest Status — What's Built vs. Planned

| Feature | Status | Details |
|---------|--------|---------|
| Database (all 7 layers) | ✅ Implemented | SQLite with full schema for claims, graph, conflicts |
| PDF parser | 🚧 Basic | PyMuPDF/pdfplumber work; Marker (ML-based) not yet integrated |
| Claim extraction | 🚧 Mock + AI | Heuristic fallback works; real AI extraction needs API key |
| Deduplication | 🚧 Basic | Jaccard word overlap; embedding-based similarity not yet added |
| Knowledge graph | 🚧 Basic | Structure and queries work; conflict detection uses word overlap |
| Scoring engine | ✅ Implemented | 3-score formula with all gates and caps |
| Evaluation harness | 🚧 Basic | Metric counting works; gold standard comparison not yet built |
| Obsidian export | ✅ Implemented | Claims, canonicals, conflicts, dashboard |
| AI Council | 🚧 Sequential | Prompts defined; parallel debate not yet implemented |
| Agent OS / ECC Harness | ✅ Implemented | Full lifecycle, audit trail, proposals |
| Constrained decoding | 📋 Designed | Guidance library integration specified but not built |
| 4-stage training | 🚧 Stage 1 only | SFT built; DPO, GRPO, ConfTuner designed |

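For reference, the Jaccard word overlap that deduplication currently uses is only a few lines, which also shows why it misses paraphrases (toy sentences; the real module's name may differ):

```python
def jaccard(a, b):
    """Word-overlap similarity: |intersection| / |union| of the word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

c1 = "tunneling increases reaction rate at low temperature"
c2 = "reaction rate increases at low temperature via tunneling"
c3 = "quantum effects speed up cold reactions"
print(round(jaccard(c1, c2), 2))  # high: same words reordered, caught as duplicates
print(round(jaccard(c1, c3), 2))  # zero: same meaning, different words -- missed
```

This gap is exactly why embedding-based similarity (e.g. the SPECTER2 adoption listed under Prior Art below) is on the roadmap.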
---

## 📋 Documentation

| Document | Size | What It Contains |
|----------|------|-----------------|
| [SYSTEM_DESIGN.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/SYSTEM_DESIGN.md) | 71KB | Complete 7-layer architecture + Appendix A (MAGMA, post-Transformer architectures, CompreSSM) with training pipeline |
| [BLINDSPOT_AUDIT_COMPLETE.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/BLINDSPOT_AUDIT_COMPLETE.md) | 45KB | 87 failure modes found by adversarial thinking |
| [FUTURE_IMPROVEMENTS.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/FUTURE_IMPROVEMENTS.md) | 39KB | 78 practical problems found by reading every code file |
| [DEPLOYMENT_ARM.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/DEPLOYMENT_ARM.md) | 3KB | ARM/local deployment guide |

---

## License

Apache 2.0 — use it, modify it, build on it. Just include the license.

---

## 🔬 Prior Art & Inspirations — Standing on the Shoulders of Giants

> **We searched every relevant system, paper, and tool. Nobody has built the complete system we're building. But we found 15 systems that each do pieces of what we do — and we're adopting the best parts.**

We analyzed 15 published systems, 12 open-source tools, 8 HuggingFace datasets, and 6 HuggingFace models that do something similar to PhD Research OS. Here's the honest picture.

### The Big Scoreboard — Who Has What

| Capability | Our System | Best Rival | Gap |
|------------|:---------:|:----------:|:---:|
| **Claim extraction from papers** | 🟡 Designed | 🟢 PaperQA2 (superhuman) | They're proven; we're not yet |
| **Epistemic labels (Fact/Interpretation/Hypothesis)** | 🟢 **Unique** | 🟡 KGX3 (paper-level only) | **Nobody does claim-level** |
| **Typed knowledge graph** | 🟢 **Unique** | 🟡 Paper Circle (structural, not epistemic) | Our edges carry epistemic meaning |
| **Contradiction detection** | 🟡 Designed | 🟢 PaperQA2 (2.34× better than PhDs) | They can prove it works; we can't yet |
| **Code-computed confidence scores** | 🟢 **Unique** | ❌ Nobody | **Nobody does formula-based scoring** |

**Analogy**: If AI research systems were restaurants, PaperQA2 would be the Michelin-starred place that's been open for years. We're the ambitious new restaurant with a unique menu that nobody else offers — but we haven't opened our doors yet. The plan is to learn from the best while cooking dishes nobody else serves.

### What We're Adopting (The Short Version)

| What | From Where | Why | Type |
|------|-----------|-----|------|
| SPECTER2 embeddings | [AllenAI](https://huggingface.co/allenai/specter2_base) | Fix our word-overlap deduplication | 🔵 Direct plug-in |
| SciFact benchmark | [AllenAI](https://huggingface.co/datasets/bigbio/scifact) | Measure our extraction quality | 🔵 Direct plug-in |
| SciRIFF training data (137K examples) | [AllenAI](https://huggingface.co/datasets/allenai/SciRIFF) | Replace/supplement our 1,900 synthetic examples | 🔵 Direct plug-in |
| Nougat equation parser | [Meta](https://huggingface.co/facebook/nougat-base) | Fix garbled equations | 🔵 Direct plug-in |
| RCS pre-filtering technique | PaperQA2 | Filter noise before extraction | 🟣 Adapted technique |
| Deterministic language filters | KGX3 | Code-based epistemic validation | 🟣 Adapted technique |
| Coverage checking pattern | Paper Circle | Prevent silent omissions | 🟣 Adapted pattern |
| "Refuse to answer" behavior | PaperQA2 | Know when to say "I don't know" | 🟣 Adapted behavior |
| Epistemic Velocity (new feature) | CLAIRE + PaperQA2 | Track confidence trends over time | 🆕 Inspired by |
| Devil's Advocate Mode (new feature) | CLAIRE + KGX3 | Actively challenge high-confidence claims | 🆕 Inspired by |

### Full Documentation

| Document | Size | What It Contains |
|----------|------|-----------------|
| [PRIOR_ART_ANALYSIS.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/PRIOR_ART_ANALYSIS.md) | 40KB+ | 15 systems analyzed — what they do, how they compare, honest scoreboard |
| [SYSTEM_INSPIRATIONS.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/SYSTEM_INSPIRATIONS.md) | 35KB+ | 16 concrete adoption/adaptation items with code examples, integration map, priority order |

### Our 3 Genuine Innovations (Not Found Anywhere Else)

1. **Claim-Level Epistemic Classification** — KGX3 rates whole papers; we rate individual claims within papers. That's like the difference between rating a restaurant (them) vs. rating each dish (us).

2. **Code-Computed Calibrated Confidence** — Other systems ask the AI "how confident are you?" and trust the answer. We have the AI provide raw measurements, then Python code computes the score using fixed formulas the AI can never override. Like a teacher who grades the test herself instead of asking the student "what grade do you think you deserve?"

3. **The Integrated 7-Layer Local-First Pipeline** — Other systems are like food delivery apps (each brings one dish). We're building a kitchen that cooks everything and remembers every recipe forever — all on your laptop, never sending your data to the cloud.