nkshirsa committed · verified
Commit 7594eab · Parent: 7cee6e3

Upload README.md with huggingface_hub

Files changed (1): README.md (+281, -0)

README.md CHANGED
@@ -239,6 +239,287 @@ Four AI "members" process each paper section independently, then debate:

---

## 📖 The Complete Design Document — Explained from Top to Bottom

> **For high school students, curious researchers, and anyone who wants to understand WHY every piece of this system exists.**

Every system is built from choices. This section explains what each part of the [SYSTEM_DESIGN.md](SYSTEM_DESIGN.md) does, why we chose it, and what problem it solves. If you read nothing else, read this.

---

### Part 1: The System Overview (The Big Picture)

The very first thing in the design doc is a giant box diagram showing the layers (numbered 0-7) with inputs flowing in at the top and outputs coming out at the bottom.

**Why this exists**: Before anyone writes code, they need to see the whole journey. A PDF enters at the top. An Obsidian vault comes out at the bottom. The diagram shows every stop along the way so no one gets lost.

**Why so many layers**: Each layer does ONE job and ONE job only. This is the "single responsibility principle" — the same reason kitchens have separate stations for prep, cooking, and plating. If Layer 4 (the graph) breaks, Layer 2 (extraction) keeps working. You can fix one without touching the others.

| Layer | Name | One-Sentence Purpose |
|-------|------|---------------------|
| 0 | Structural Ingestion | Turn a messy PDF into clean, structured text with section labels |
| 1 | Entity Resolution | Figure out what things mean — "NYU" is a university, "0.8 fM" is a measurement |
| 2 | Qualified Extraction | Find every claim and label it Fact, Interpretation, or Hypothesis |
| 3 | Canonicalization | If the same finding appears in 3 papers, merge them into one master record |
| 4 | Knowledge Graph | Connect claims into a web — this supports that, this contradicts this |
| 5 | Calibrated Scoring | Compute a trust score using math formulas, not AI guessing |
| 6 | Evaluation | Test the system against expert-labeled papers to make sure it works |
| 7 | Provenance | Remember exactly which paper, page, and sentence every claim came from |

**The cross-cutting box** at the bottom (AI Model Council, Meta-Improver, etc.) exists because some jobs span ALL layers. The Council is used in extraction (Layer 2) AND in conflict detection (Layer 4). Keeping these components separate from the main pipeline means they can be upgraded independently.

---

### Part 2: The Two-Model Strategy (Why We Run TWO Brains)

The design doc specifies that the system runs TWO models, not one. This might seem wasteful. It is not. Here is why.

**The Primary Brain** stays on your local computer and NEVER touches the internet. This is the brain that reads your actual paper PDFs. It is a Qwen3-8B model running with 4-bit quantization.

**Why local?** Science papers contain unpublished data, patent ideas, and student thesis drafts. If you upload that to ChatGPT, it may become training data for OpenAI. Local models stay private forever.

**Why Qwen3-8B?** We compared Qwen2.5-3B and Qwen3-8B:

| Test | Qwen2.5-3B | Qwen3-8B | What This Means |
|------|-----------|----------|----------------|
| Math reasoning (AIME) | ~15% | ~45% | Qwen3 can handle equations and statistics 3× better |
| JSON accuracy | ~65% | ~80% | Qwen3 produces valid structured output more reliably |
| Context window | 32K tokens | 128K tokens | Qwen3 can read entire 100-page supplements at once |

The 8B model needs ~5GB for weights + ~4GB for KV cache at 128K context. This fits a 16GB consumer GPU. The 3B model would use less memory but makes too many errors to be trustworthy for science.
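
The back-of-envelope arithmetic behind that budget looks like this. The config values below (layer count, KV-head geometry, KV-cache precision) are assumptions for illustration, not confirmed Qwen3-8B internals; real numbers depend on the actual model config and serving setup.

```python
# Illustrative VRAM budget; all config values are assumptions.
params      = 8e9
weight_bits = 4                                   # 4-bit quantized weights
weights_gb  = params * weight_bits / 8 / 1e9      # ~4 GB before overhead

layers, kv_heads, head_dim = 36, 8, 128           # assumed GQA geometry
context, kv_bits = 128_000, 4                     # long context, quantized KV
kv_gb = 2 * layers * kv_heads * head_dim * context * kv_bits / 8 / 1e9

print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB")
```

The shape of the formula is the point: weights scale with parameter count, the KV cache scales with context length, and both must fit in VRAM at the same time.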

**The Companion Brain** lives on the internet (Claude API, GPT-4o-mini, or OpenRouter). It never sees raw paper text. It only gets metadata, queries, and anonymized claims.

**Why two brains?** The companion brain checks for paper retractions on CrossRef, searches arXiv for new papers, and validates repository URLs. These tasks need the internet. The primary brain reads sensitive unpublished data. Those tasks must stay offline. One brain cannot do both jobs safely.

---

### Part 3: Training Pipeline — The 4-Stage Journey

The model does not start smart. It must be taught. The design doc specifies a 4-stage training pipeline. Here is why each stage exists and why you cannot skip any.

**Stage 1: SFT (Supervised Fine-Tuning)**
- **What it does**: Show the model thousands of examples of paper text paired with correct claim extractions. Like showing a student thousands of solved math problems.
- **Why it exists first**: The model must learn the BASIC pattern — "given this paragraph, produce this JSON with these fields" — before it can learn anything subtle.
- **Why QLoRA**: Full fine-tuning of every parameter in a model this size needs a $10,000 GPU. QLoRA trains ~160M adapter parameters on a $500 GPU. The adapters are "sticky notes" on the original model (see the sketch below).
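
A minimal sketch of that setup with the `transformers` + `peft` stack. The rank, alpha, and target modules here are illustrative choices, not the project's published config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach small trainable adapters (the "sticky notes").
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # a tiny fraction of 8B is trainable
```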

**Stage 2: DPO (Direct Preference Optimization)**
- **What it does**: Show the model PAIRS — one correct extraction and one slightly wrong extraction — and teach it to prefer the correct one.
- **Why it comes AFTER SFT**: A student who has never seen a correct answer cannot judge between two answers. SFT teaches "what is correct." DPO teaches "why A is better than B."
- **Why it matters**: DPO fixes the "almost right" problem. SFT might produce JSON with all fields present but the wrong epistemic tag. DPO specifically penalizes tag errors — a preference pair is sketched below.
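
A DPO training example is just a prompt plus a preferred and a rejected completion. A hypothetical pair for the tag-error case (the field layout and JSON fields are illustrative):

```python
# One preference pair: identical extraction except for the epistemic tag.
# The source sentence hedges ("may"), so "Fact" is the wrong tag.
pair = {
    "prompt":   'Extract the claim: "The coating may reduce fouling."',
    "chosen":   '{"claim": "The coating may reduce fouling", '
                '"epistemic_tag": "Hypothesis"}',
    "rejected": '{"claim": "The coating reduces fouling", '
                '"epistemic_tag": "Fact"}',
}
```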

**Stage 3: GRPO (Group Relative Policy Optimization)**
- **What it does**: The model generates 8 different answers for the same prompt. A reward function scores each one. The model learns to generate higher-scoring answers.
- **Why it comes AFTER DPO**: GRPO needs the model to already understand the task well enough to generate plausible variations. A random model would generate garbage that all scores zero.
- **Why it matters**: GRPO bakes in JSON validity, schema compliance, and qualifier preservation using MATH-BASED reward functions — not human judgment. The reward function is CODE, not a person (see the sketch below).
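
What "the reward is code" means in practice. A minimal sketch of a deterministic reward; the field names and point weights are illustrative:

```python
import json

REQUIRED   = {"claim", "epistemic_tag"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}

def reward(completion: str) -> float:
    """Score one of the 8 sampled answers. Pure code: same input,
    same score, every time. No human, no judge model."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0                       # invalid JSON earns nothing
    score = 0.4                          # it parses
    if REQUIRED <= obj.keys():
        score += 0.3                     # schema compliance
    if obj.get("epistemic_tag") in VALID_TAGS:
        score += 0.3                     # tag from the closed vocabulary
    return score
```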

**Stage 4: ConfTuner (Calibration Tuning)**
- **What it does**: After the model produces correct answers, fix its confidence calibration. A model that says "95% confident" for correct answers and "95% confident" for wrong answers is broken.
- **Why it comes LAST**: You cannot calibrate confidence until the model is already producing correct answers. Calibrating garbage is meaningless.
- **Why it matters**: Science requires honest uncertainty. If the model is 50% confident, the user must know that. If the model lies and says 90% confident, bad decisions follow. The check below shows what "calibrated" means concretely.
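
One way to see whether confidence is honest: bucket claims by stated confidence and compare against how often those claims were actually correct. A sketch (the sample records are made up to show the shape of the check):

```python
def calibration_report(records, buckets=(0.5, 0.7, 0.9, 1.0)):
    """records: (stated_confidence, was_correct) pairs.
    Well calibrated = stated confidence tracks observed accuracy."""
    lo = 0.0
    for hi in buckets:
        inside = [ok for conf, ok in records if lo < conf <= hi]
        if inside:
            acc = sum(inside) / len(inside)
            print(f"stated ({lo:.1f}, {hi:.1f}]: observed accuracy {acc:.2f}")
        lo = hi

# A model that says 0.95 on everything, right or wrong, fails this check.
calibration_report([(0.95, True), (0.95, False), (0.6, True), (0.6, False)])
```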

---

### Part 4: Layer 0 — Structural Ingestion (Why PDFs Are Hard)

**The problem**: PDFs are not documents. They are instructions for where to place ink on paper. The computer sees "put letter T at position (72, 340)," not "this is a paragraph about detection limits."

**Why we need 5 different tools**:

| Tool | What It Does | Why We Chose It |
|------|-------------|-----------------|
| **Marker** | PDF → structured markdown with layout awareness | Knows the difference between a heading, a paragraph, and a caption |
| **Nougat** | Scientific PDFs → LaTeX equations | Can read math formulas that other tools turn into gibberish |
| **GROBID** | Headers, authors, citations, references | The academic standard — extracts bibliographic metadata |
| **LayoutLMv3** | Classifies page regions (text, table, figure, equation) | Machine vision for documents — "this region is a table" |
| **PlotDigitizer** | Quantitative plots → (x,y) CSV coordinates | Turns images of graphs into actual numbers |

**Why not just use PyMuPDF?** PyMuPDF reads text in reading order. But scientific papers have multi-column layouts, floating figures, and tables that span pages. PyMuPDF often merges table cells with body text. Marker uses machine learning to understand document LAYOUT, not just text sequence.

**Why section-aware chunking**: If you split a paper at page 5, you might cut a table in half. The design specifies chunking by SECTION (Introduction, Methods, Results) with 1-paragraph overlap. This means every chunk has complete context. A Results chunk includes the paragraph before and after, so the AI sees "As shown in Figure 3..." and knows Figure 3 exists.
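
Section-aware chunking with one paragraph of overlap is simple enough to show directly. A sketch, assuming the input is a list of `(section_name, paragraphs)` pairs in document order:

```python
def chunk_by_section(sections):
    """One chunk per section, padded with one paragraph of context
    from each neighbouring section."""
    chunks = []
    for i, (name, paras) in enumerate(sections):
        before = sections[i - 1][1][-1:] if i > 0 else []
        after  = sections[i + 1][1][:1]  if i + 1 < len(sections) else []
        chunks.append({"section": name,
                       "text": "\n\n".join(before + paras + after)})
    return chunks
```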

**Why per-region quality scoring**: Not every part of a PDF is equally readable. A scanned figure might have OCR errors. A handwritten annotation might be noise. The system assigns a `parse_confidence` score to every region. Low-confidence regions get flagged so the user knows not to trust claims from that region.

---

### Part 5: Layer 1 — Entity Resolution (What Is This Thing?)

**The problem**: "The NYU team used 0.8 fM" contains four entities: NYU (institution), team (group), 0.8 fM (measurement), and an implied method. The computer does not know any of this without help.

**Why we normalize entities**: If Paper A says "University of New York" and Paper B says "NYU," the system must know these are the same institution. Otherwise the knowledge graph treats them as unrelated nodes and misses connections.
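
At its simplest, normalization is a lookup from surface forms to one canonical ID. A toy sketch (the alias table and ID scheme are hypothetical; the real layer would also use fuzzy matching and authority files):

```python
# Map every surface form seen in papers to one canonical entity ID.
ALIASES = {
    "nyu": "inst:new_york_university",
    "new york university": "inst:new_york_university",
    "university of new york": "inst:new_york_university",
}

def resolve(mention: str) -> str:
    key = mention.strip().lower()
    return ALIASES.get(key, f"unresolved:{key}")

assert resolve("NYU") == resolve("New York University")  # same graph node
```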

**Why we resolve citations**: A paper might say "as shown in [32]." The system must find reference 32 in the bibliography, get its DOI, and look up the actual paper. Only then can it tag a claim as "inherited citation" (this paper is repeating someone else's finding, not producing a new one).

**Why we check retractions**: A paper might be retracted because the data was fraudulent. If the system does not know this, it treats fraudulent claims as trustworthy. The design specifies periodic checks against CrossRef and Retraction Watch.

**Why Version of Record (VoR) lineage**: Scientists often publish a preprint on arXiv, then a final version in a journal. The preprint might have errors that the journal version fixes. The system must track: preprint → journal version → erratum → retraction. Each version supersedes the previous one.

---

### Part 6: Layer 2 — Qualified Extraction (The AI Council)

**The problem**: AI models make mistakes. They hallucinate numbers. They drop the word "may" and turn a tentative finding into a fact. They miss qualifiers like "in 10 mM PBS only" that limit where a result applies.

**Why we use a COUNCIL instead of one model**: One model is one opinion. Four models debating produces better answers for the same reason that four doctors reviewing a case produces a better diagnosis than one doctor. The design specifies:

1. **Query Planner** breaks the paper into sub-questions — "find the detection limit," "find the sample size," "find the p-value"
2. **Two Extractors** independently pull claims. If they disagree, that disagreement IS information.
3. **Critic** reviews both extractions and flags errors
4. **Chairman** synthesizes the final answer, resolving disagreements with documented reasoning

**Why Round 1 is PARALLEL**: The extractors must not see each other's work. If Extractor B sees Extractor A's answer, B might anchor on A's mistakes. Parallel execution guarantees independent judgment.
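
A sketch of Round 1 orchestration. The extractor callables (`extract_a`, `extract_b` in the usage note) are hypothetical stand-ins for the two model calls:

```python
from concurrent.futures import ThreadPoolExecutor

def council_round1(chunk, extractors):
    """Run every extractor on the same chunk at the same time.
    No extractor ever sees another's output, so no anchoring."""
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = [pool.submit(ex, chunk) for ex in extractors]
        return [f.result() for f in futures]

# results = council_round1(chunk, [extract_a, extract_b])
```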

**Why Round 2 shares TAGS but NOT confidence**: If Extractor A tags a claim as "Fact" and Extractor B tags it as "Interpretation," the Council must debate WHY. But if they share confidence scores, the lower-confidence extractor might cave to the higher-confidence one. Confidence is hidden to prevent anchoring.

**Why we need constrained decoding (Guidance engine)**: The AI is asked to output JSON. Without constraints, it might produce:
```
The detection limit was 0.8 fM {
  "claim": ...
}
```
The Guidance library GUARANTEES that `epistemic_tag` is one of {Fact, Interpretation, Hypothesis, Conflict_Hypothesis}. It does this by controlling which tokens the model is allowed to generate at each position. The model literally CANNOT output an invalid tag.
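
The mechanism, reduced to a toy: mask every option outside the closed vocabulary before choosing, so the best-scoring *valid* option always wins. This illustrates the idea at the string level; Guidance itself works token by token, and this is not its actual API:

```python
import math

VALID_TAGS = ["Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"]

def constrained_tag(scores: dict) -> str:
    """scores: the model's preference for each candidate string here.
    Anything outside the closed set is masked to -inf, so an invalid
    tag can never be emitted, no matter how much the model wants it."""
    masked = {t: scores.get(t, -math.inf) for t in VALID_TAGS}
    return max(masked, key=masked.get)

# The model "prefers" an invalid tag, but the constraint wins:
print(constrained_tag({"Opinion": 3.5, "Fact": 1.2}))  # -> "Fact"
```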

**Why we preserve qualifiers**: "The method MAY reduce costs" and "The method reduces costs" are completely different claims. Dropping "may" turns an uncertain hypothesis into a confident fact. The system has a specific qualifier-preservation reward function that checks whether hedging words from the source appear in the extraction.
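
That check can be a few lines of code. The hedge list here is a small illustrative subset:

```python
HEDGES = {"may", "might", "could", "appears", "suggests", "preliminary"}

def qualifiers_preserved(source: str, extraction: str) -> bool:
    """Every hedging word in the source must survive into the claim."""
    wanted = {w for w in source.lower().split() if w in HEDGES}
    return wanted <= set(extraction.lower().split())

assert not qualifiers_preserved("The method may reduce costs",
                                "The method reduces costs")
```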

---

### Part 7: Layer 3 — Canonicalization (Deduplication)

**The problem**: If 50 papers study COVID-19 vaccines, the phrase "vaccines are effective" might appear in all 50. Without deduplication, the knowledge graph has 50 identical nodes. The user sees 50 copies of the same claim.

**Why embedding-based similarity**: Word-based deduplication fails. "The vaccine showed 95% efficacy" and "Efficacy of 95% was demonstrated by the vaccine" share almost no words but mean the SAME thing. The design specifies embedding models that convert sentences to number vectors. Similar meaning = similar vectors, regardless of wording.

**Why a cosine similarity threshold of 0.85**: Below 0.70, claims are definitely different. Above 0.85, they are probably the same. Between 0.70 and 0.85, the system flags them for human review. This threshold was chosen empirically — it is not a magic number.
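
The decision rule is compact enough to show. The vectors come from whatever embedding model the pipeline uses; the two inputs here are placeholders:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedup_decision(vec_a, vec_b):
    sim = cosine(vec_a, vec_b)
    if sim >= 0.85:
        return "merge"           # probably the same claim
    if sim >= 0.70:
        return "human_review"    # the gray zone
    return "keep_separate"       # definitely different
```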

**Why temporal versioning**: A claim from 2020 might say "vaccine efficacy unknown." The same claim from 2021 might say "vaccine efficacy 95%." These are NOT duplicates. They are sequential versions of the same scientific story. The system stores a version history: `version 1: preprint_2020` → `version 2: journal_2021`.

---

### Part 8: Layer 4 — Knowledge Graph (The Web of Science)

**The problem**: Science is not a list of facts. It is a web of relationships. Finding A supports Finding B. Finding C contradicts Finding B. Method X was used in both Study Y and Study Z. A list cannot represent this.

**Why SQLite instead of Neo4j**: Neo4j is a professional graph database. It is also external software that requires installation, licensing, and maintenance. SQLite is built into Python. It requires zero setup. For a PhD student running this on a laptop, "works immediately" beats "theoretically better."

**Why typed edges**: An edge without a type is just "connected." The design specifies 7 edge types (a schema sketch follows the table):

| Edge Type | What It Means | Why It Matters |
|-----------|-------------|---------------|
| `supports` | A provides evidence for B | The strongest positive relationship |
| `refutes` | A contradicts B | Triggers conflict detection in the Courtroom UI |
| `extends` | A adds conditions to B | "Works in water" extends "Works in ideal conditions" |
| `depends_on` | A assumes B is true | If B is retracted, A's validity is questioned |
| `supersedes` | A replaces B (newer data) | Old claims are marked outdated |
| `blocks` | A is a null finding (no effect) | Prevents false connections |
| `investigative_hypothesis` | Inferred (not directly observed) | Marked differently because it is weaker evidence |
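
In SQLite, the closed edge vocabulary can be enforced by the database itself. A minimal sketch; the table and column names are illustrative:

```python
import sqlite3

EDGE_TYPES = ("supports", "refutes", "extends", "depends_on",
              "supersedes", "blocks", "investigative_hypothesis")

con = sqlite3.connect("graph.db")
con.executescript(f"""
CREATE TABLE IF NOT EXISTS claims (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS edges (
    src  INTEGER REFERENCES claims(id),
    dst  INTEGER REFERENCES claims(id),
    type TEXT CHECK (type IN {EDGE_TYPES})  -- invalid types are rejected
);
""")
```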

**Why we NEVER auto-generate `supports` across multiple hops**: If A supports B, and B supports C, it does NOT follow that A supports C. This is the "transitivity fallacy" — common in science but logically invalid. The design explicitly bans this.

**Why gap analysis**: The system looks for pairs of well-connected entities with NO edge between them. If "graphene" connects to 50 methods and "biosensor" connects to 50 methods, but NO paper connects graphene to biosensors, that is a research gap. It is a publishable finding.
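
Gap analysis falls out of the schema above as one query: find pairs that share many neighbours but have no direct edge in either direction. A sketch reusing the `con` connection and `edges` table from the Part 8 schema; the threshold is illustrative:

```python
GAP_QUERY = """
SELECT a.src AS entity_1, b.src AS entity_2, COUNT(*) AS shared
FROM edges a
JOIN edges b ON a.dst = b.dst AND a.src < b.src
WHERE NOT EXISTS (
    SELECT 1 FROM edges e
    WHERE (e.src = a.src AND e.dst = b.src)
       OR (e.src = b.src AND e.dst = a.src)
)
GROUP BY a.src, b.src
HAVING shared >= 10          -- both well connected to the same things
ORDER BY shared DESC;
"""
gaps = con.execute(GAP_QUERY).fetchall()  # candidate research gaps
```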

---

### Part 9: Layer 5 — Calibrated Scoring (Why Code Computes Trust)

**The problem**: If you ask an AI "how confident are you?" it will make up a number. GPT-4 might say "95% confident" for a completely hallucinated claim. Humans trust numbers. Fake numbers are dangerous.

**Why the AI provides COMPONENTS, not scores**: The AI outputs `evidence_strength: 900`, `study_quality_weight: 1000`, `journal_tier_weight: 1000`. The CODE multiplies these together using fixed formulas. The AI never touches the final confidence number.

**Why fixed-point math (×1000)**: Floating-point numbers have rounding errors; `0.1 + 0.2 ≠ 0.3` in computers. The design stores all probabilities as integers from 0 to 1000. 950 means "95.0% confident." This is exact. No rounding. No drift.

**Why 3 separate scores instead of 1**: One number hides what is wrong. A claim might have perfect evidence (Score 1 = 1000) but terrible language qualifiers (Score 3 = 300). A single average of 650 hides the problem. Three scores show: "the evidence is strong, but the claim is hedged."

**Why parser confidence CAPS the claim score**: If the PDF parser was only 60% confident it read the table correctly, the claim CANNOT score above 600. This prevents the classic failure: "garbage in, gospel out."

**Why the statistical gate exists**: A study with N=10,000 and effect size d=0.01 is "statistically significant" (p<0.001) but practically meaningless. The effect is too small to matter. The design has a hard rule: if N>1000 and |effect|<0.1, cap confidence at 400. This prevents the system from treating "statistically significant" as "important." All of these rules fit in a few lines of code, sketched below.
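
A sketch of the formula layer. The exact combination formula is illustrative; the point is that it is deterministic integer math the model never touches:

```python
def claim_confidence(evidence, quality, journal, parse_conf,
                     n=None, effect=None):
    """All inputs except n/effect are fixed-point integers 0..1000
    (950 = 95.0%). The model supplies components; this code owns
    the final number."""
    score = evidence * quality // 1000      # integer math: no float drift
    score = score * journal // 1000
    score = min(score, parse_conf)          # parser confidence is a hard cap
    if n is not None and effect is not None and n > 1000 and abs(effect) < 0.1:
        score = min(score, 400)             # the statistical gate
    return score

print(claim_confidence(900, 1000, 1000, parse_conf=600))  # -> 600 (capped)
```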

---

### Part 10: Layer 6 — Evaluation (How Do You Know It Works?)

**The problem**: A system that processes science papers must be TESTED. But testing AI is hard because the answers are not binary right/wrong. "Is this claim well-extracted?" is a judgment call.

**Why we need a GOLD STANDARD**: The design specifies 10 papers where human experts have labeled every claim, tag, qualifier, and conflict. This is the ground truth. Without it, you are driving without a map.

**Why paper-level test splits**: If you randomly split claims into train/test, claims from the SAME paper might appear in both. The model memorizes paper-specific phrases instead of learning general extraction. The design specifies splitting by PAPER — all claims from Paper A go to training, all claims from Paper B go to testing.
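
A minimal sketch of a paper-level split, assuming each claim record carries a `paper_id`:

```python
import random

def split_by_paper(claims, test_fraction=0.2, seed=0):
    """Split by PAPER, not by claim: no paper ever contributes
    to both train and test."""
    papers = sorted({c["paper_id"] for c in claims})
    random.Random(seed).shuffle(papers)
    test_papers = set(papers[:int(len(papers) * test_fraction)])
    train = [c for c in claims if c["paper_id"] not in test_papers]
    test  = [c for c in claims if c["paper_id"] in test_papers]
    return train, test
```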

**Why LLM-as-Judge**: For fuzzy criteria like "is this claim faithful to the source text?" there is no code formula. The design uses a separate LLM (not the one being tested) as a judge. But the judge is also imperfect, so results are treated as estimates, not truth.

**Why stochastic testing**: LLMs give different answers each time. Running an evaluation once gives one number that might be 5% higher or lower next time. The design specifies running evaluations 3-5 times and reporting the range.

**Why hidden holdout**: The team developing the system might unconsciously optimize for the test set. A truly hidden holdout — 3 papers that NO ONE on the team has seen during development — catches this. It is evaluated quarterly.

---

### Part 11: Layer 7 — Provenance (Where Did This Come From?)

**The problem**: A claim in the knowledge graph might have been extracted 6 months ago using an old model version. If the model has since improved, the old claim might be wrong. But you cannot know unless you track EVERYTHING.

**Why every claim stores a full lineage**:
```json
{
  "pipeline_version": "2.1.0",
  "model_checkpoint": "research-os-grpo-v2-step-5000",
  "parser_version": "marker-1.2.0",
  "taxonomy_version": "quantum_bio_v1",
  "prompt_hash": "sha256:a3b4c5...",
  "extraction_timestamp": "2026-04-23T10:30:00Z"
}
```

This is like a food label that tells you which farm, factory, and truck touched your apple. If a bad batch of apples makes people sick, you trace the batch number. If a model checkpoint produces bad extractions, you trace the checkpoint hash.

**Why the security sandbox**: The system validates repository URLs and downloads code. A malicious URL could execute harmful code. The sandbox runs these checks in isolation: HTTP GET only, 60-second timeout, 100MB download limit, no code execution.
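
Those limits map directly onto code. A sketch with `requests` (illustrative; the real sandbox would also isolate the process itself):

```python
import requests

MAX_BYTES = 100 * 1024 * 1024     # 100 MB download limit

def safe_fetch(url: str) -> bytes:
    """HTTP GET only, 60 s timeout, hard size cap. The payload is
    returned for inspection and never executed."""
    resp = requests.get(url, timeout=60, stream=True)
    resp.raise_for_status()
    chunks, size = [], 0
    for chunk in resp.iter_content(chunk_size=65536):
        size += len(chunk)
        if size > MAX_BYTES:
            raise ValueError("download exceeds 100 MB limit")
        chunks.append(chunk)
    return b"".join(chunks)
```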

**Why the epistemic embargo**: A researcher analyzing unpublished data needs privacy. The embargo mode stores claims in a private subgraph that NO other user can see. After paper submission, one click moves them to the shared lab graph.

---

### Part 12: The UI — Progressive Disclosure

**The problem**: Science produces overwhelming output. 100 papers → 3,000+ claims. Showing everything at once paralyzes the user.

**Why progressive disclosure levels**: The design specifies six levels of detail (0-5), like peeling an onion:

| Level | What You See | When You Need It |
|-------|-------------|-----------------|
| 0 | Dashboard — health scores and today's queue | First thing every morning |
| 1 | Claim text + tag + confidence + source | Deciding if a claim is worth reading |
| 2 | The 3 separate scores (evidence, truth, qualifier) | Understanding WHY a claim scored low |
| 3 | Source quote + page + bbox + council votes | Verifying the claim against the original paper |
| 4 | 2-hop subgraph (neighbors and edges) | Understanding how this claim connects to others |
| 5 | Raw LLM outputs + token distributions | Debugging when something goes wrong |

**Why the Courtroom UI**: When 3 papers contradict each other, the system displays them side by side — like a courtroom with 3 witnesses. The user sees the methods, sample sizes, and p-values for each. The system does NOT pick a winner. It presents the evidence and lets the researcher decide.

**Why Manual Synthesis Mode**: After using the system for months, researchers might start trusting it blindly. Manual mode hides ALL AI output — no scores, no conflict flags, no suggestions. The researcher draws connections manually, then switches back to compare. This built-in friction prevents over-reliance.

---

### Part 13: Appendix A — Future Architecture (Why We Are Planning Ahead)

The SYSTEM_DESIGN.md includes an Appendix A that describes improvements validated by peer-reviewed research. These are NOT implemented yet. They are roadmap items for Phase F and beyond.

**Why MAGMA (Multi-Graph Memory) matters**: The current knowledge graph mixes temporal, causal, semantic, and entity relationships into one edge space. This is like storing all your files in one giant folder. MAGMA creates separate folders (graphs) for each relationship type. When you ask "Why did the 2023 paper disagree?" the system only searches the CAUSAL graph, not everything.

**Why post-Transformer architectures matter**: The current Transformer (Qwen3-8B) has a "quadratic attention bottleneck." At 128K tokens, every new token must look at ALL previous tokens. This is like a student re-reading the entire textbook before every new page. It is slow and memory-hungry.

New architectures solve this:
- **Mamba (SSM)**: Compresses history into a hidden state. Like a speed-reader who keeps a mental summary instead of re-reading every page. 5× faster on long sequences.
- **RWKV**: Constant memory during inference. Like having a notebook where you only write the latest summary, not every note ever taken.
- **Jamba/Griffin (Hybrid)**: Keep a few Transformer layers for precise short-range attention, add SSM layers for cheap long-range memory. Best of both worlds.
- **MoE (Mixture of Experts)**: 30B total parameters, only 3B active per token. Like a hospital with 30 specialists where each patient sees only the relevant 3.

**Why CompreSSM matters**: Training SSMs from scratch at small dimension produces weak models. CompreSSM starts LARGE (state dimension 256), trains for a while, then uses control theory (Hankel singular values) to identify which dimensions matter and surgically removes the rest. Result: a compact model that outperforms one trained small from the start.

**Why gradual migration, not a hard switch**: The Transformer ecosystem (AWQ quantization, vLLM serving, GRPO in TRL) is mature. SSM ecosystems are newer. The design recommends:
1. **Now**: Keep the Transformer (stable, tested, mature ecosystem)
2. **Phase F**: Test a hybrid model on holdout data, compare metrics
3. **Year 2+**: If hybrids win, migrate the ingestion pipeline to an SSM backbone
4. **Always**: The companion brain stays on frontier APIs (Claude, GPT-4o) — only the local primary brain migrates

---

## 📊 Resources

| Resource | Link | What's Inside |