Add FUTURE_IMPROVEMENTS.md v1.0 — 47 blindspots across 12 categories (highschool-readable)
FUTURE_IMPROVEMENTS.md (ADDED, +710 -0)
# PhD Research OS — Future Improvements Roadmap
## The Complete Guide to Making This System Actually Work

**Written so a high school student can understand every word.**

**Version**: 1.0
**Date**: 2026-04-23
**Status**: Living document — updated through iterative blindspot discovery
**Iteration count**: 1 of 100

---

## What Is This Document?

Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts. This document is a list of everything that could go wrong and how to fix it — written by looking at the actual code, not just the design plans.

Think of it like a car safety inspection. The car looks good from the outside, but we need to check under the hood, test the brakes, and make sure the airbags actually work.

---

## Table of Contents

1. [The Big Picture Problems](#1-the-big-picture-problems)
2. [Data Problems — Garbage In, Garbage Out](#2-data-problems)
3. [Reading Papers — The PDF Nightmare](#3-reading-papers)
4. [Understanding Science — The Brain Problems](#4-understanding-science)
5. [Remembering Things — The Memory Problems](#5-remembering-things)
6. [Scoring Confidence — The Trust Problems](#6-scoring-confidence)
7. [Testing — How Do We Know It Works?](#7-testing)
8. [Training — Teaching the AI](#8-training)
9. [Working Together — Multi-Model Problems](#9-working-together)
10. [Staying Honest Over Time](#10-staying-honest-over-time)
11. [The Human Side](#11-the-human-side)
12. [Security and Safety](#12-security-and-safety)
13. [What To Build First](#13-what-to-build-first)

---

## 1. The Big Picture Problems

### What's the main goal?

We want a computer that can:
1. Read a science paper (like a research study about a new cancer test)
2. Pull out the key findings ("This test detected cancer at 0.8 fM concentration")
3. Label each finding: Is it a proven fact? A guess? A theory?
4. Compare findings across many papers and spot contradictions
5. Tell the researcher how much to trust each finding

### What's actually built vs. what's designed?

Think of building a house. Here's where we are:

| Part | Status | Analogy |
|------|--------|---------|
| Database (where data lives) | ✅ Built and working | Foundation is poured |
| PDF reader (Layer 0) | ⚠️ Basic version works | Walls are up but no insulation |
| Claim extractor (Layer 2) | ⚠️ Works with mock data, needs real AI | Wiring is in but no electricity yet |
| Deduplicator (Layer 3) | ⚠️ Uses simple word matching, not smart matching | Front door is installed but it's a screen door |
| Knowledge graph (Layer 4) | ⚠️ Structure exists, conflict detection is basic | Plumbing pipes laid but not connected to water |
| Scoring engine (Layer 5) | ✅ Formula works correctly | Heating system works |
| Evaluation (Layer 6) | ⚠️ Counts things but doesn't check if they're right | Smoke detector is installed but has no battery |
| Export to Obsidian (Layer 7) | ✅ Works | Mailbox is installed |
| AI agent system | ✅ Framework works | Garage is built |
| Training data | ⚠️ 1,900 examples — need 10,000+ | You have some furniture but most rooms are empty |
| Trained model | ⚠️ Qwen2.5-3B — design says upgrade to Qwen3-8B | The car in the garage is a Honda Civic, plan calls for a Tesla |

### The Gap That Matters Most

**The system is designed to be evidence-centered, but the code is still model-centered.**

What does that mean?

- **Model-centered** means: "Hey AI, read this text and tell me what you think." The AI writes a paragraph about what it found. You hope it's right.
- **Evidence-centered** means: "Hey AI, find the exact sentence in this paper that says something important, highlight it, and classify it." The AI points to specific text. You can check it yourself.

Right now, the code generates claims as free text. It should be generating claims as *pointers into the document*. Every claim should be a highlighted sentence with a label, not a summary written by the AI.

**Why this matters for a high schooler**: Imagine your friend reads a book and tells you "The main character dies at the end." That's model-centered — you're trusting your friend. Evidence-centered would be: "Look at page 342, paragraph 2 — it says 'He took his last breath.'" Now you can check for yourself.

---
## 2. Data Problems — Garbage In, Garbage Out

### Blindspot D-1: Training Data Is Synthetic, Not Real 🔴

**What's wrong**: All 1,900 training examples were generated by a Python script (`generate_dataset.py`), not extracted from actual science papers. The fake data uses templates like "We investigated the effect of {param1} on {topic}..." — real papers don't write like that.

**Why it matters**: Imagine learning to cook by reading a cookbook written by someone who has never cooked. The recipes look right, but the timings are wrong, the ingredients are approximate, and the techniques are described theoretically. When you actually try to cook, nothing works quite right.

**The fix**: We need real examples. Take 100 actual science papers. Have a human expert read each one and manually label: "This sentence is a Fact," "This sentence is an Interpretation," "This qualifier word 'may' was important." Those hand-labeled examples become the gold standard.

**Cost**: About 200 hours of expert time ($10,000-$20,000 if hiring domain experts).

### Blindspot D-2: No Hard Negatives in Training Data 🔴

**What's wrong**: The training data only has correct examples. There are no examples of *almost correct but subtly wrong* outputs.

**Why it matters**: Think of studying for a multiple choice test. If you only study the right answers, you'll struggle with tricky wrong answers that look almost right. The AI needs to see examples like:
- "The LOD was 0.8 fM" → correct extraction ✅
- "The LOD was 0.8" (unit dropped) → wrong! ❌
- "The LOD improved dramatically" (number replaced with vague word) → wrong! ❌
- "The LOD was 0.8 µM" (wrong unit — fM ≠ µM, off by a billion) → wrong! ❌

**The fix**: Run the current model on papers, collect its mistakes, and use those mistakes as "here's what NOT to do" training examples. This is called hard-negative mining.

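
A minimal sketch of what that mining loop could look like, assuming a small gold-labeled set and a callable extractor (the function and field names here are placeholders, not the project's real API):

```python
# Hard-negative mining sketch: run the current extractor over gold-labeled papers
# and keep every near-miss as a "what NOT to do" example for preference training.
def mine_hard_negatives(gold_examples, run_extractor):
    hard_negatives = []
    for gold in gold_examples:
        predictions = run_extractor(gold["paper_text"])
        for pred in predictions:
            matches_gold = any(
                pred["text"] == ref["text"] and pred["epistemic_tag"] == ref["epistemic_tag"]
                for ref in gold["claims"]
            )
            if not matches_gold:
                hard_negatives.append({
                    "paper_text": gold["paper_text"],
                    "rejected": pred,           # the model's subtly wrong output
                    "chosen": gold["claims"],   # the expert-labeled correct output
                })
    return hard_negatives
```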

### Blindspot D-3: No Error Categories in Training Data 🟠

**What's wrong**: When the model makes a mistake, we just mark it as "wrong." We don't say WHY it's wrong. Was it because a qualifier was dropped? A number was changed? A unit was lost? A section was misidentified?

**Why it matters**: If a student keeps getting math problems wrong, a good teacher figures out WHY — is it multiplication? Fractions? Word problems? Just saying "wrong" over and over doesn't help. We need to say "you keep dropping the units" or "you keep missing the word 'not' in sentences."

**The fix**: Create a taxonomy of error types:

| Error Type | Example | How Common |
|-----------|---------|------------|
| Qualifier loss | "may reduce" → "reduces" | Very common |
| Unit drop | "0.8 fM" → "0.8" | Common |
| Negation flip | "did NOT improve" → "improved" | Dangerous |
| Section confusion | Abstract claim labeled as Results | Common |
| Number hallucination | Paper says 0.8, AI says 0.6 | Rare but devastating |
| Granularity mismatch | Table row treated as whole-paper finding | Common |
| Citation attribution | Claim from cited paper treated as this paper's finding | Very common |
| Causal overclaim | "correlated with" → "caused by" | Common |
| Missing statistical context | p-value or sample size dropped | Common |

Train the model to recognize its own error types. When it's uncertain, it should say "I might be dropping a qualifier here" instead of just guessing.

### Blindspot D-4: Teacher Signals Are Not Preserved 🟠

**What's wrong**: The system design talks about running 5 different AI models (teachers) on the same paper and keeping ALL their answers. But the current data pipeline just generates one answer per example.

**Why it matters**: Imagine 5 doctors examining the same patient. If 4 say "it's a cold" and 1 says "check for pneumonia," you don't want to throw away that 1 dissenting opinion. That minority view might save a life. Same with AI — the disagreement IS the signal.

**The fix**: For each training example, store:
- What Teacher 1 said (and how confident it was)
- What Teacher 2 said (and how confident it was)
- ... and so on for all teachers
- Where they agreed → high confidence training signal
- Where they disagreed → this is a "hard case" that teaches the student about uncertainty

### Blindspot D-5: No Counterfactual Data 🟡

**What's wrong**: No training examples where we deliberately change one thing and check if the model notices.

**Why it matters**: If a paper says "Treatment A did NOT reduce tumor size," and we change it to "Treatment A reduced tumor size," does the model catch the difference? If it gives the same answer for both, it's not actually reading — it's pattern matching on the other words.

**The fix**: For every 10th training example, create a "mirror" version where one critical word is changed (add/remove "not", swap a unit, change a number). The model must give a DIFFERENT answer for the mirror version. If it doesn't, that tells us it's using shortcuts instead of actually understanding.

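
A tiny sketch of the mirror idea, shown only for negation flips; unit swaps and number edits would be added the same way, and the phrase list is purely illustrative:

```python
import re

# Counterfactual "mirror" sketch: flip one critical phrase so the correct label changes.
FLIPS = [
    (r"\bdid not reduce\b", "reduced"),
    (r"\bnot significant\b", "significant"),
    (r"\bno significant difference\b", "a significant difference"),
]

def make_mirror(sentence: str):
    for pattern, replacement in FLIPS:
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            return re.sub(pattern, replacement, sentence, flags=re.IGNORECASE)
    return None  # no known flip applies to this sentence

print(make_mirror("Treatment A did NOT reduce tumor size."))
# "Treatment A reduced tumor size." -- the model must NOT give the same answer for both
```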

---

## 3. Reading Papers — The PDF Nightmare

### Blindspot P-1: No Actual ML-Based Parser Is Integrated 🔴

**What's wrong**: The code in `parser.py` uses PyMuPDF (fitz) or pdfplumber. These are basic text extractors — they just scrape text off the page. The system design says to use Marker (a machine learning-based parser that understands document layout), but Marker is not actually integrated.

**Why it matters**: Basic text extractors destroy the structure of a paper. They merge columns, lose table formatting, can't tell the difference between a heading and body text, and turn equations into garbage characters. It's like trying to read a newspaper by photographing it and running basic OCR — you get a jumble of text from different columns mixed together.

**The fix**: Actually install and integrate Marker. It's a Python library that uses deep learning to understand page layout. It knows that a two-column paper has separate columns, that a table has rows and cells, and that an equation is not regular text.

```
Current: PDF → pdfplumber → flat text (broken tables, merged columns)
Needed: PDF → Marker → structured markdown (tables preserved, columns separated)
```

### Blindspot P-2: Section Detection Is Fragile 🟠

**What's wrong**: The code detects sections (Abstract, Methods, Results) using simple text pattern matching — it looks for the word "Abstract" at the start of a line. But many papers:
- Number their sections ("2. Materials and Methods" or "III. Experimental Setup")
- Use non-standard names ("Findings" instead of "Results")
- Are in languages other than English
- Have combined sections ("Results and Discussion")
- Don't have clear section headers at all (short communications, letters)

**Why it matters**: The entire scoring system depends on knowing which section a claim comes from. A finding in the Results section gets full confidence (1.0×). The same finding in the Abstract gets penalized (0.7×). If we misidentify the section, we score the claim wrong.

**The fix**: Use Marker's built-in section detection (it's trained on millions of papers), and add a fallback classifier that looks at the CONTENT of a paragraph to guess which section it belongs to (e.g., paragraphs with many citations are probably Introduction/Discussion, paragraphs with numbers and p-values are probably Results).

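
A rough sketch of what that content-based fallback could look like (the cue words and thresholds are illustrative guesses, not tuned values):

```python
import re

def guess_section(paragraph: str) -> str:
    # Count citation-like patterns, statistics, and methods phrasing as weak cues.
    citations = len(re.findall(r"\[\d+\]|\([A-Z][a-z]+ et al\.,? \d{4}\)", paragraph))
    stats = len(re.findall(r"p\s*[<=>]\s*0?\.\d+|±", paragraph))
    methods_cues = len(re.findall(r"\b(was prepared|were incubated|purchased from|protocol)\b",
                                  paragraph, flags=re.IGNORECASE))
    if methods_cues >= 2:
        return "Methods"
    if stats >= 2:
        return "Results"
    if citations >= 3:
        return "Introduction/Discussion"
    return "Unknown"  # better to admit uncertainty than to guess
```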

### Blindspot P-3: Tables Are Extracted As Flat Text 🔴

**What's wrong**: When pdfplumber extracts a table, it stores it as pipe-separated text: `"0.8 | fM | PBS"`. The relationship between the header ("LOD") and the value ("0.8 fM") is lost. You can't tell which column header goes with which value.

**Why it matters**: Tables contain the most important evidence in a science paper. If you lose the header-value relationship, you lose the meaning. "0.8" means nothing. "LOD = 0.8 fM" means everything.

**The fix**: Store tables as structured data (like a spreadsheet), not flat text:

```json
{
  "table_id": "TAB_001",
  "headers": ["Parameter", "Value", "Buffer", "Method"],
  "rows": [
    ["LOD", "0.8 fM", "10 mM PBS", "3σ/slope"],
    ["Dynamic range", "1 fM - 100 nM", "10 mM PBS", "calibration curve"]
  ]
}
```

Now the AI knows that "0.8 fM" is the "LOD" (limit of detection), measured in "10 mM PBS" buffer, using the "3σ/slope" method.

### Blindspot P-4: Figures Are Completely Ignored 🟠

**What's wrong**: The parser detects image blocks and stores them as `"[Image detected — requires VLM processing]"`. No actual image analysis happens.

**Why it matters**: About 30-40% of the key evidence in a science paper is in figures. A scatter plot showing that sensitivity decreases at high ionic strength contains critical quantitative data. If we ignore all figures, we're reading the paper with one eye closed.

**The fix**: This is a multi-step process:
1. Detect if a figure is a chart (bar, scatter, line), a diagram, or a photograph (micrograph)
2. For charts: use a plot digitizer to extract the actual data points (like WebPlotDigitizer)
3. For diagrams: use a vision-language model (VLM) to describe what the diagram shows
4. For micrographs: use a VLM to identify what the image shows (cells, crystals, etc.)
5. Cross-check: does the data from the figure match what the text says about it?

### Blindspot P-5: Cross-References Are Detected But Never Verified 🟡

**What's wrong**: The code finds references like "see Figure 3" or "Table 2" in the text, but it never checks if Figure 3 actually exists in the parsed output, or if the table labeled as Table 2 is actually Table 2.

**Why it matters**: If the parser mislabels tables (assigns Table 1's caption to Table 2's data), every claim that references those tables will have wrong evidence attached to it. Wrong evidence is worse than no evidence.

### Blindspot P-6: No Supplement Handling 🟠

**What's wrong**: The system design describes "paper bundles" (main PDF + supplementary files), but the parser can only process one file at a time. There's no way to say "this supplement goes with that main paper."

**Why it matters**: In many fields (biology, chemistry), the most important data is in the supplementary materials. The main paper says "see Supplementary Table S3 for full results" — if we can't read the supplement, we're missing the best evidence.

### Blindspot P-7: No Language Detection or Handling 🟡

**What's wrong**: The system assumes all papers are in English. There's no detection of non-English text and no handling strategy for it.

**Why it matters**: Many important papers, especially in Chinese, Japanese, German, and French science journals, are not in English. If someone feeds a Chinese paper into the system, it will try to extract "claims" from text it can't understand, and it will produce confident-sounding garbage.

**The fix**: Before processing, detect the language. If it's not English, either: (a) refuse with a clear message, (b) translate first and flag reduced confidence, or (c) route to a multilingual model.

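
A small sketch of option (a), assuming the `langdetect` package (any language-identification library would do); the refusal message is illustrative:

```python
from langdetect import detect  # small, pure-Python language identifier

def check_language(text: str) -> str:
    try:
        lang = detect(text[:2000])  # a short sample is enough to identify the language
    except Exception:               # raised when the text has no usable features
        return "unknown"
    if lang != "en":
        print(f"Refusing to extract: detected language '{lang}', not English.")
    return lang
```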

---

## 4. Understanding Science — The Brain Problems

### Blindspot B-1: The Model Is Way Too Small 🔴

**What's wrong**: The current model is Qwen2.5-3B (3 billion parameters). The design calls for Qwen3-8B. But even 8B may not be enough for reliable scientific reasoning.

**Why it matters**: Think of brain size as vocabulary + reasoning ability. A 3B model is like a smart middle schooler — it can follow instructions and produce well-formatted output, but it doesn't deeply understand what it's writing. An 8B model is like a college freshman — much better, but still makes errors on complex reasoning. A 27B model is like a PhD student — it can genuinely reason about scientific claims.

**The trade-off**: Bigger models need more computer memory (GPU/VRAM). The design targets a 16-24GB consumer GPU. Here are the options:

| Model | Size | VRAM (quantized) | Reasoning Quality |
|-------|------|-------------------|-------------------|
| Qwen2.5-3B | 3B | ~2.5 GB | 🟡 Basic structure, frequent errors |
| Qwen3-8B | 8B | ~5 GB | 🟢 Good structure, occasional errors |
| Qwen3.5-9B | 9B | ~6 GB | 🟢 Better reasoning patterns |
| Qwen3-30B-A3B MoE | 30B (3B active) | ~6 GB | 🟢🟢 Dense-14B quality at 3B cost |
| Qwen3.5-27B | 27B | ~15 GB | 🟢🟢🟢 Best for this task |

### Blindspot B-2: No Specialist Heads — One Brain Doing Too Many Jobs 🟠

**What's wrong**: A single model does ALL tasks: claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want very different behaviors.

**Why it matters**: Think about a Swiss Army knife versus a chef's knife set. The Swiss Army knife can do everything but nothing well. Claim extraction wants to cast a wide net (find ALL the claims, even questionable ones). Qualifier detection wants surgical precision (keep the EXACT words). Statistical parsing wants numerical accuracy (get the numbers RIGHT). One model can't optimize for all of these simultaneously.

**The fix**: Use a shared base model but add specialized "heads" (output layers) for different tasks:
- Extraction head: optimized for recall (finding things)
- Qualifier head: optimized for precision (keeping exact words)
- Statistics head: optimized for numerical accuracy
- Conflict head: optimized for comparison across claims

This is like having one person who can speak 6 languages (the shared base) but wears different hats (the heads) depending on which task they're doing.

### Blindspot B-3: No Domain Pre-Training 🟠

**What's wrong**: The base model (Qwen2.5-3B or Qwen3-8B) was trained on general internet text. It knows about cooking, sports, history, and programming. But it doesn't have deep knowledge of scientific terminology, experimental methods, or statistical analysis.

**Why it matters**: If you asked someone who has never taken a chemistry class to read a chemistry paper, they'd struggle with terms like "LOD," "analyte," "electrochemical impedance spectroscopy," or "cyclic voltammetry." They might extract the sentence but misclassify it because they don't understand what the words mean. Our AI has the same problem.

**The fix**: Before fine-tuning on the Research OS data, do a "domain warm-up" — feed the model thousands of science papers (just reading them, not labeling) so it learns the vocabulary and writing style. Datasets like S2ORC (Semantic Scholar Open Research Corpus) have millions of papers we could use.

### Blindspot B-4: Mock Extraction Is the Default Path 🔴

**What's wrong**: The `QualifiedExtractor` class in `extractor.py` has a method called `_extract_mock()` that uses simple pattern matching (looking for words like "measured" or "suggests") to classify claims. When no AI model is connected, this is what runs. And based on the code, this is what actually runs in practice because the AI brain connection is optional.

**Why it matters**: The mock extractor is a toy. It assigns "Fact" to any sentence containing the word "measured" and "Hypothesis" to any sentence containing "may." This is like a doctor who diagnoses every patient with a cough as having a cold and every patient with a headache as having a migraine. The real world is much more complex.

**Example of mock failure**:
- "We measured the baseline but found no significant difference" → Mock says: Fact (because "measured") → Actually: Null result
- "The previously measured values suggest contamination" → Mock says: Fact → Actually: Interpretation about a past measurement
- "It may be the most reliable technique measured to date" → Mock says: both Fact AND Hypothesis → Which one wins? Depends on code ordering.

### Blindspot B-5: No Constrained Decoding 🟠

**What's wrong**: The design describes using the "Guidance" library to force the AI to output valid JSON with valid tags (only "Fact", "Interpretation", "Hypothesis", or "Conflict_Hypothesis"). But this is not implemented. The current code just asks the AI to output JSON and hopes for the best.

**Why it matters**: LLMs are unreliable at producing valid JSON, especially small ones. Without constrained decoding, the model might output:
- `{"epistemic_tag": "fact"}` (lowercase — not in the allowed list)
- `{"epistemic_tag": "Observation"}` (a category we don't have)
- Broken JSON with missing brackets
- A mix of JSON and natural language explanation

Constrained decoding GUARANTEES the output is valid. It's like fill-in-the-blank versus essay — fill-in-the-blank can't have format errors.

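
Until real constrained decoding is wired in, a validate-and-retry wrapper is a useful stopgap. It cannot guarantee valid output the way constrained decoding does, but it catches the failure modes listed above. A sketch, where `generate_json` stands in for whatever function actually calls the model:

```python
import json

ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}

def extract_with_validation(generate_json, prompt: str, max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        raw = generate_json(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # broken JSON, ask again
        if isinstance(parsed, dict) and parsed.get("epistemic_tag") in ALLOWED_TAGS:
            return parsed  # valid structure AND an allowed tag
    return {"epistemic_tag": None, "error": "no valid output after retries"}
```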

### Blindspot B-6: No Out-of-Distribution Detection 🟠

**What's wrong**: The system has no way to detect when it's being asked to process a paper from a field it doesn't understand.

**Why it matters**: The training data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. If someone feeds it a sociology paper, a legal document, or a paper about ancient Egyptian archaeology, the model will still produce confident-looking scientific claim extractions. They'll be nonsense, but they'll look perfectly formatted.

**The fix**: Before the AI processes anything, check if the paper's vocabulary and topic are similar to the training data. If they're too different, say "I don't know this field well enough — my results may not be reliable" instead of producing garbage with high confidence.

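
One simple version of that check, sketched with a sentence-embedding model; the 0.35 threshold is an illustrative guess that would need tuning on real data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_out_of_domain(abstract: str, training_abstracts: list, threshold: float = 0.35) -> bool:
    # Compare the new paper's abstract to the "centroid" of abstracts we trained on.
    centroid = np.mean(model.encode(training_abstracts), axis=0)
    similarity = util.cos_sim(model.encode(abstract), centroid).item()
    return similarity < threshold  # too far from anything the model has seen
```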

---

## 5. Remembering Things — The Memory Problems

### Blindspot M-1: Deduplication Uses Jaccard Similarity (Word Overlap), Not Semantic Similarity 🔴

**What's wrong**: When the system checks if two claims are duplicates, it compares them by counting how many words they share. This is the `jaccard_similarity` function in `canonicalizer.py`.

**Why it matters**: These two claims say the same thing but share almost no words:
- "The LOD was 0.8 fM for the GFET biosensor"
- "A detection limit of eight-tenths of a femtomolar was achieved using the graphene transistor"

Jaccard similarity would rate these as very different (almost no shared words). But they're the same finding! The system would create two separate canonical claims for the same discovery.

**The fix**: Use embedding-based similarity. Convert each claim into a number vector using a sentence embedding model (like `all-MiniLM-L6-v2`). Claims that mean the same thing will have similar vectors, even if they use different words.

### Blindspot M-2: No Embedding Model Is Connected 🟠

**What's wrong**: The design mentions using embeddings for deduplication, but the code doesn't import or use any embedding model. There's no ChromaDB connection, no sentence-transformers, nothing.

**The fix**: Add a lightweight embedding model (runs on CPU, very fast):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, runs on CPU
emb_a = model.encode("The LOD was 0.8 fM for the GFET biosensor")
emb_b = model.encode("A detection limit of eight-tenths of a femtomolar was achieved using the graphene transistor")
# Compare embeddings with cosine similarity instead of word overlap:
# paraphrases like these score much higher than their word overlap would suggest.
similarity = util.cos_sim(emb_a, emb_b).item()
```
### Blindspot M-3: Knowledge Graph Conflict Detection Is Based on Word Overlap 🟠

**What's wrong**: The `find_conflicts()` method in `graph.py` finds conflicts by checking if two claims share enough words. Same problem as deduplication — semantically related claims with different vocabulary will be missed.

**Additionally**: The method only compares the top 500 claims by confidence. If claim #501 contradicts claim #3, it will never be found.

### Blindspot M-4: No Temporal Reasoning 🟡

**What's wrong**: The knowledge graph has no concept of time. A claim from 2015 and a claim from 2025 about the same topic are treated equally. But the 2025 paper might specifically address and overturn the 2015 finding.

**The fix**: Every claim needs a publication year. When comparing claims, newer evidence should be weighted more heavily. And when a newer paper directly cites and contradicts an older one, the graph should note this as a "supersedes" relationship.

### Blindspot M-5: Gap Analysis Only Compares Same-Type Nodes 🟡

**What's wrong**: In `graph.py`, the `find_gaps()` method only looks for gaps between nodes of the same type (`if a["node_type"] != b["node_type"]: continue`). But interesting gaps can exist between different types — for example, between a claim node and a method node that have never been connected.

---

## 6. Scoring Confidence — The Trust Problems

### Blindspot S-1: Qualifier Penalty Is Too Simple 🟠

**What's wrong**: Each qualifier reduces confidence by exactly 0.1. So "may" (-0.1), "suggests" (-0.1), and "under controlled conditions" (-0.1) all get the same penalty. But "may" weakens a claim far more than "under controlled conditions" does.

**Why it matters**: Not all hedging words are equal:
- "may" = very uncertain (should be -0.2)
- "appears to" = moderately uncertain (should be -0.15)
- "under these specific conditions" = limits scope but not uncertain (should be -0.05)
- "demonstrates" = actually strengthens the claim (should be +0.05)

**The fix**: Weight qualifiers by type (a code sketch follows the table):

| Qualifier Type | Example Words | Penalty |
|---------------|---------------|---------|
| Strong hedge | may, might, could, possibly | -0.20 |
| Moderate hedge | suggests, indicates, appears | -0.15 |
| Weak hedge | likely, tends to, generally | -0.10 |
| Scope limiter | under these conditions, in vitro | -0.05 |
| Strengthener | demonstrates, confirms, proves | +0.05 |

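
A sketch of how those weights could be applied in the scorer; the word lists simply mirror the table above, and a real list would come from annotation:

```python
QUALIFIER_WEIGHTS = {
    "strong_hedge":   (-0.20, ["may", "might", "could", "possibly"]),
    "moderate_hedge": (-0.15, ["suggests", "indicates", "appears"]),
    "weak_hedge":     (-0.10, ["likely", "tends to", "generally"]),
    "scope_limiter":  (-0.05, ["under these conditions", "in vitro"]),
    "strengthener":   (+0.05, ["demonstrates", "confirms", "proves"]),
}

def qualifier_adjustment(claim_text: str) -> float:
    # Naive substring matching for brevity; a real version would tokenize first.
    text = claim_text.lower()
    total = 0.0
    for weight, phrases in QUALIFIER_WEIGHTS.values():
        total += sum(weight for phrase in phrases if phrase in text)
    return total

print(qualifier_adjustment("The coating may improve stability under these conditions"))  # -0.25
```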

### Blindspot S-2: No Multi-Dimensional Calibration 🟠

**What's wrong**: The system tracks one Brier score for overall confidence calibration. But confidence means different things in different contexts:
- "I'm confident this claim EXISTS in the paper" (extraction confidence)
- "I'm confident this qualifier is important" (qualifier confidence)
- "I'm confident these two claims contradict each other" (conflict confidence)
- "I'm confident this claim came from the Results section" (section confidence)

A single Brier score hides all of these behind one number.

**The fix**: Track 6 separate calibration curves. A model can be brilliantly calibrated on extraction ("I found a claim" is right 85% of the time when it says 85%) but terribly calibrated on conflict detection ("I found a contradiction" is right only 40% of the time when it says 80%).

### Blindspot S-3: Source Count Bonus Is Not In The Code 🟡

**What's wrong**: The SYSTEM_DESIGN.md describes a "source_count_bonus" where claims supported by multiple papers get up to +0.2 confidence. But in the actual `scorer.py` code, this bonus doesn't exist. The scorer only uses information from a single claim, not the canonical claim's evidence count.

### Blindspot S-4: Conflict Penalty Is Not In The Code 🟡

**What's wrong**: Similarly, the design says claims with active conflicts get a -0.3 penalty. The scorer code doesn't check for conflicts.

### Blindspot S-5: The Composite Score Is Just An Average 🟡

**What's wrong**: The composite confidence is `(evidence_quality + truth_likelihood + qualifier_strength) / 3`. This means a claim with evidence_quality=1.0 and qualifier_strength=0.0 gets the same score as a claim with both at 0.5. But the first claim is STRONG evidence that's completely hedged, while the second is mediocre evidence that's moderately hedged — very different situations.

**The fix**: Use the MINIMUM of the three scores, not the average. A chain is only as strong as its weakest link. If any one dimension is terrible, the composite should be terrible.

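
In code the change is tiny. A sketch, with argument names matching the formula above:

```python
def composite_confidence(evidence_quality: float, truth_likelihood: float, qualifier_strength: float) -> float:
    # Weakest link wins: one terrible dimension should sink the whole score.
    return min(evidence_quality, truth_likelihood, qualifier_strength)

print(composite_confidence(1.0, 0.8, 0.0))  # 0.0 -- strong but fully hedged evidence
print(composite_confidence(0.5, 0.8, 0.5))  # 0.5 -- mediocre but not fatally weak
```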

---

## 7. Testing — How Do We Know It Works?

### Blindspot T-1: No Gold Standard Test Set 🔴

**What's wrong**: The evaluation system (`evaluator.py`) counts things (how many claims, how many with qualifiers, how many null results) but doesn't compare against a known correct answer. It's like grading a test by checking if every question was answered, not if the answers are correct.

**Why it matters**: The system could extract 500 claims from 10 papers and report "92% have qualifiers, 15% are null results, average confidence 0.65" — and ALL 500 claims could be wrong. We'd never know because we never checked against human-labeled ground truth.

**The fix**: Create a gold standard test set:
1. Take 10 papers from your actual research domain
2. Have an expert manually label every claim, qualifier, epistemic tag, and conflict
3. Run the system on those same papers
4. Compare: How many real claims did it find? How many fake claims did it invent? How many tags did it get right?

### Blindspot T-2: No Counterfactual Robustness Tests 🟠

**What's wrong**: We never test if the model changes its answer for the RIGHT reason.

**Why it matters**: The model might get the right answer for the wrong reason. For example, it might learn that papers published in Nature usually contain "Facts" and papers from arXiv usually contain "Hypotheses." This is a shortcut, not understanding.

**Test suite we need**:

| Test | What We Do | What Should Happen |
|------|-----------|-------------------|
| Remove table header | Delete the column labels from a table | Claims from that table should become "Incomplete" |
| Swap unit | Change "fM" to "µM" | The extracted value should change accordingly |
| Flip negation | "significant" → "not significant" | Epistemic tag should change from Fact to null result |
| Remove figure | Delete a figure and its caption | Claims that relied on that figure should lose confidence |
| Change journal | Pretend the paper was published in a lower-tier journal | Journal tier weight should decrease |

### Blindspot T-3: No Paper-Level Evaluation Splits 🔴

**What's wrong**: The train/test split in the dataset is random — individual examples are shuffled. This means the same paper's claims might appear in both train and test. The model could memorize "when I see biosensor + LOD, output Fact with 0.85 confidence" instead of actually understanding.

**The fix**: Split by PAPER, not by example. All claims from Paper A go in training. All claims from Paper B go in testing. This way, the test set truly measures: "Can the model handle a paper it's never seen?"

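
A sketch of that split using scikit-learn's group-aware splitter, assuming each training example carries a `paper_id` field:

```python
from sklearn.model_selection import GroupShuffleSplit

def paper_level_split(examples, test_size=0.2, seed=42):
    # Group by paper so no paper contributes examples to both sides of the split.
    groups = [ex["paper_id"] for ex in examples]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(examples, groups=groups))
    return [examples[i] for i in train_idx], [examples[i] for i in test_idx]
```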

Better yet, also split by:
- **Lab** — all papers from one research group in holdout
- **Journal** — all papers from one journal in holdout
- **Year** — all papers from 2025+ in holdout
- **Field** — all materials science papers in holdout

### Blindspot T-4: The Regression Gate Doesn't Actually Gate Anything 🟡

**What's wrong**: The `run_regression_gate()` function checks metrics and returns pass/fail, but nothing in the system actually prevents a failing model from being deployed. There's no CI/CD pipeline that blocks a model update when the regression gate fails.

### Blindspot T-5: No Stochastic Testing 🟡

**What's wrong**: LLMs are not deterministic — ask the same question twice and you might get different answers. The evaluation runs once and reports a single number. But that number could be 72% one run and 68% the next.

**The fix**: Run every evaluation 5 times. Report the mean AND the range. If the range is too wide (more than 5% spread), that's a problem — the model is unreliable.

---

## 8. Training — Teaching the AI

### Blindspot TR-1: Only SFT Is Implemented — No DPO, GRPO, or ConfTuner 🔴

**What's wrong**: The design describes a 4-stage training pipeline:
1. **SFT** (Supervised Fine-Tuning): "Here's the right answer, learn it" ← Only this is built
2. **DPO** (Direct Preference Optimization): "This answer is better than that one" ← Not built
3. **GRPO** (Group Relative Policy Optimization): "Here's a reward for good JSON, good tags, and preserved qualifiers" ← Not built
4. **ConfTuner**: "Make your confidence scores match reality" ← Not built

**Why it matters**: SFT alone gets you to maybe 70% quality. DPO pushes it to 80%. GRPO (with the custom reward functions) pushes it to 85-90%. ConfTuner ensures the confidence numbers are meaningful. Stopping at Stage 1 is like running a marathon but stopping at mile 7.

### Blindspot TR-2: ZeroGPU Training Space Has Fundamental Limitations 🟠

**What's wrong**: The training Space (`phd-research-os-train`) uses ZeroGPU with 60-second time limits per call. It trains in micro-batches of ~20 steps. This means:
- The learning rate scheduler resets every micro-batch (cosine schedule restarts)
- The model is loaded from scratch every call (no persistent GPU state)
- Total training time is limited by how many times you click "Train"

**Why it matters**: Good fine-tuning needs continuous training with a smooth learning rate schedule. Chopping it into 20-step micro-batches is like studying for an exam in 2-minute bursts with a nap between each burst — you lose context and momentum.

**The fix**: Use a proper training job (like HF Jobs) with a single continuous run. 3 epochs on 2,000 examples with Qwen2.5-3B takes about 1-2 hours on a single GPU. That's one uninterrupted training run, not 100 button clicks.

### Blindspot TR-3: No Curriculum Learning 🟡

**What's wrong**: All training examples are shuffled randomly. The model sees hard cases (complex multi-claim extractions with conflicts) at the same time as easy cases (single fact with clear evidence).

**Why it matters**: Humans learn better when material is presented from easy to hard. First learn to identify a single Fact. Then learn to distinguish Fact from Interpretation. Then handle multi-claim papers. Then handle contradictions. This is called curriculum learning, and it leads to more stable training and better handling of rare, difficult cases.

### Blindspot TR-4: No Evaluation During Training 🟡

**What's wrong**: The `train.py` script does eval every 50 steps (reporting eval_loss), but eval_loss doesn't tell you if the model is actually getting better at the TASK. A model might have low eval_loss but still produce invalid JSON, drop qualifiers, or misclassify epistemic tags.

**The fix**: During training, periodically run the model on a small "canary" set of 10 examples and check:
- Is the JSON valid?
- Are the epistemic tags correct?
- Are qualifiers preserved?
- Are the confidence scores reasonable?

This is called task-specific evaluation, and it's much more informative than just watching the loss number go down.

---
## 9. Working Together — Multi-Model Problems

### Blindspot W-1: The Council Has No Real Disagreement Mechanism 🟠

**What's wrong**: The design describes a multi-round council (Extractor → Critic → Chairman) where members debate. But in the code, the council is sequential: one call to the Extractor, one call to the Critic, one call to the Chairman. There's no actual back-and-forth.

**Why it matters**: The whole point of the council is that disagreement reveals uncertainty. If the Extractor says "Fact" and the Critic says "Actually, this is an Interpretation because the author used the word 'suggests'," that disagreement is incredibly valuable information. But in a sequential pipeline, the Chairman just picks a winner — there's no real debate.

### Blindspot W-2: No Router Exists 🟠

**What's wrong**: The system design describes a "Dynamic Task Router" that decides which model should handle each piece of content. The code has no router. Everything goes to the same model (or the mock extractor).

**The fix**: Build a simple classifier that looks at each content chunk and decides:
- Regular text → primary extraction model
- Table → specialized table parser
- Figure → VLM or plot digitizer
- Equation → LaTeX parser
- Garbage/low quality → skip with "Unextractable" flag
- Out of domain → flag for human review

### Blindspot W-3: No "I Don't Know" Option 🟠

**What's wrong**: The system always produces an answer. There's no mechanism for any component to say "I can't handle this — pass it to something else or ask a human."

**Why it matters**: The worst possible failure is confident garbage. A system that says "I don't know" is infinitely better than a system that says "I'm 85% confident" when it's actually guessing.

**What "I don't know" looks like in practice**:
- Parser confidence < 0.4 → "I couldn't read this page clearly. Don't trust my extraction."
- Model uncertainty is high → "I found what might be a claim, but I'm not sure if it's a Fact or Interpretation."
- Content is out of domain → "This doesn't look like the kind of science I was trained on."
- Input is non-English → "This appears to be in a language I can't process reliably."

---

## 10. Staying Honest Over Time

### Blindspot H-1: No Drift Detection 🟠

**What's wrong**: If the model's performance gradually degrades (because it encounters papers from a new sub-field, or because the scientific vocabulary evolves), there's no alarm system.

**The fix**: Every week, run the model on the gold standard test set. If any metric drops by more than 5% compared to the previous week, send an alert. This is like a regular health check-up — catch problems before they become emergencies.

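
The weekly check itself is small. A sketch, with made-up metric names and the 5% threshold from above:

```python
def check_drift(current: dict, previous: dict, max_drop: float = 0.05) -> list:
    alerts = []
    for metric, value in current.items():
        baseline = previous.get(metric)
        if baseline is not None and baseline - value > max_drop:
            alerts.append(f"{metric} dropped from {baseline:.2f} to {value:.2f}")
    return alerts

last_week = {"tag_accuracy": 0.81, "qualifier_recall": 0.77}
this_week = {"tag_accuracy": 0.74, "qualifier_recall": 0.78}
for alert in check_drift(this_week, last_week):
    print("DRIFT ALERT:", alert)  # tag_accuracy dropped from 0.81 to 0.74
```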

### Blindspot H-2: No Human Feedback Integration 🟠

**What's wrong**: When a human uses the UI and corrects a claim (changes a tag from Fact to Interpretation, adds a missing qualifier), that correction is not stored in a way that can be used for future training.

**The fix**: Every human correction becomes a high-value training example. Over time, these corrections create a growing gold dataset tailored to YOUR specific research domain. This is the cheapest way to get high-quality labeled data — your daily work generates it automatically.

### Blindspot H-3: No Versioning of Taxonomy Changes 🟡

**What's wrong**: The taxonomy (which study types exist and what weights they get) can change over time. But old claims don't get re-scored when the taxonomy changes. A claim scored under the old taxonomy has a different meaning than a claim scored under the new one.

### Blindspot H-4: No Recovery Plan for Database Corruption 🟡

**What's wrong**: The system uses SQLite, which stores everything in a single file. If that file gets corrupted (power failure during a write, disk error, accidental deletion), everything is lost. There's no backup strategy and no periodic snapshots. (WAL mode is enabled, which helps the database survive a crash mid-write, but it is not a backup.)

---

## 11. The Human Side

### Blindspot HU-1: Information Overload 🟠

**What's wrong**: If you ingest 100 papers, you might get 3,000+ claims. The UI shows them all equally. There's no way to say "show me only the most important 20 claims" or "show me only the contradictions."

**The fix**: Priority-ranked views:
- **Dashboard**: Top 10 most surprising findings, Top 5 contradictions, Top 3 gaps
- **Deep dive**: Filterable by section, tag, confidence, paper, topic
- **Review queue**: Only items that need human attention (conflicts, low-confidence, out-of-domain)

### Blindspot HU-2: No Explanation of Scores 🟡

**What's wrong**: The UI shows "Confidence: 0.723" but doesn't explain how that number was calculated. A researcher can't tell if 0.723 is good or bad, or why it's 0.723 and not 0.85.

**The fix**: Show the breakdown:
- Evidence strength: 0.85 (strong direct measurement)
- Study quality: ×1.0 (primary experimental)
- Journal tier: ×0.85 (Tier 2 journal)
- Section: ×1.0 (Results)
- Completeness: ×1.0 (all fields present)
- = 0.723

### Blindspot HU-3: Automation Bias Risk 🟡

**What's wrong**: Once a researcher trusts the system, they might stop checking its work. If the system says "Fact, confidence 0.92," the researcher might accept it without reading the original paper.

**The fix**: Built-in friction:
- Randomly hide the confidence score for 10% of claims and ask the researcher to rate it themselves
- Periodically show a "quiz" — "The system classified this as a Fact. Do you agree?"
- Track how often the researcher overrides the system — if overrides drop to zero, they might be over-trusting

---

## 12. Security and Safety

### Blindspot SEC-1: No Input Validation 🟠

**What's wrong**: The parser accepts any file and tries to process it. There's no check for any of the following (a minimal validation sketch follows the list):
- Malicious PDFs (they exist — PDFs can contain JavaScript and exploits)
- Extremely large files (a 500MB PDF would crash the system)
- Non-PDF files disguised as PDFs
- Files with embedded malware

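
A minimal sketch covering the cheap checks on that list (file size and file type); scanning for malicious content and sandboxing the parser are separate, bigger jobs. The 50 MB cap is an illustrative number:

```python
from pathlib import Path

MAX_PDF_BYTES = 50 * 1024 * 1024  # illustrative 50 MB cap

def validate_pdf(path: str) -> None:
    pdf = Path(path)
    if pdf.stat().st_size > MAX_PDF_BYTES:
        raise ValueError(f"{pdf.name} is larger than the {MAX_PDF_BYTES // 2**20} MB limit")
    with pdf.open("rb") as handle:
        if not handle.read(5).startswith(b"%PDF-"):
            raise ValueError(f"{pdf.name} does not start with the PDF magic bytes")
```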

### Blindspot SEC-2: SQL Injection Risk 🟡

**What's wrong**: While most database queries use parameterized queries (safe), the `get_stats()` function in `database.py` uses f-strings to construct table names: `f"SELECT COUNT(*) FROM {table}"`. If an attacker could control the table name, they could inject SQL.

*Note*: This is low risk because the table names are hardcoded in the function, not from user input. But it's a bad pattern that could become dangerous if the code is modified later.

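
The safer pattern is small: table names cannot be bound as SQL parameters, so validate them against an explicit whitelist before formatting the query. A sketch (the table names are illustrative, not necessarily the real schema):

```python
import sqlite3

ALLOWED_TABLES = {"papers", "claims", "canonical_claims", "conflicts"}

def count_rows(conn: sqlite3.Connection, table: str) -> int:
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    cursor = conn.execute(f"SELECT COUNT(*) FROM {table}")  # safe: name was whitelisted
    return cursor.fetchone()[0]
```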

### Blindspot SEC-3: No Rate Limiting on API Calls 🟡

**What's wrong**: If the AI brain is connected to an external API (Claude, GPT), there's no limit on how many calls can be made. Processing 1,000 papers could generate thousands of API calls costing hundreds of dollars.

---

## 13. What To Build First

Based on impact and dependencies, here's the build order:

### Phase 1: Make It Work (Weeks 1-4)
1. **Integrate Marker parser** — Everything else depends on good PDF reading
2. **Add embedding model** — Needed for smart deduplication and conflict detection
3. **Create gold standard test set** — 10 expert-labeled papers. Can't improve what you can't measure
4. **Connect the real AI model** — Stop using mock extraction

### Phase 2: Make It Reliable (Weeks 5-8)
5. **Add constrained decoding** — Guarantee valid JSON output
6. **Build proper training pipeline** — One continuous training job, not micro-batches
7. **Implement weighted qualifier penalties** — Not all hedges are equal
8. **Add out-of-distribution detection** — Know when to say "I don't know"

### Phase 3: Make It Smart (Weeks 9-16)
9. **Expand training data to 10K+** — With real paper examples and hard negatives
10. **Implement DPO training** — Learn from preference pairs
11. **Implement GRPO training** — Custom reward functions for epistemic correctness
12. **Add multi-dimensional calibration** — 6 separate Brier scores

### Phase 4: Make It Trustworthy (Weeks 17-24)
13. **Build counterfactual evaluation suite** — Test for the right reasons
14. **Add paper-level evaluation splits** — Honest accuracy numbers
15. **Implement human feedback loop** — Corrections become training data
16. **Add drift detection and monitoring** — Catch problems early

### Phase 5: Make It Scale (Weeks 25-32)
17. **Add figure/table specialist models** — Handle 100% of a paper, not just text
18. **Build the router** — Right model for right content
19. **Implement supplement handling** — Paper bundles
20. **Add language detection** — Handle non-English papers safely

---
## Summary of All Blindspots

| ID | Category | Severity | Description |
|----|----------|----------|-------------|
| D-1 | Data | 🔴 Critical | Training data is synthetic, not from real papers |
| D-2 | Data | 🔴 Critical | No hard negative examples |
| D-3 | Data | 🟠 High | No error taxonomy in training |
| D-4 | Data | 🟠 High | Teacher distribution signals not preserved |
| D-5 | Data | 🟡 Medium | No counterfactual training examples |
| P-1 | Parser | 🔴 Critical | No ML-based parser (Marker) integrated |
| P-2 | Parser | 🟠 High | Section detection is fragile |
| P-3 | Parser | 🔴 Critical | Tables extracted as flat text |
| P-4 | Parser | 🟠 High | Figures completely ignored |
| P-5 | Parser | 🟡 Medium | Cross-references detected but not verified |
| P-6 | Parser | 🟠 High | No supplement handling |
| P-7 | Parser | 🟡 Medium | No language detection |
| B-1 | Brain | 🔴 Critical | Model too small (3B) |
| B-2 | Brain | 🟠 High | No specialist heads |
| B-3 | Brain | 🟠 High | No domain pre-training |
| B-4 | Brain | 🔴 Critical | Mock extraction is the default |
| B-5 | Brain | 🟠 High | No constrained decoding |
| B-6 | Brain | 🟠 High | No out-of-distribution detection |
| M-1 | Memory | 🔴 Critical | Dedup uses word overlap, not semantic matching |
| M-2 | Memory | 🟠 High | No embedding model connected |
| M-3 | Memory | 🟠 High | Conflict detection uses word overlap |
| M-4 | Memory | 🟡 Medium | No temporal reasoning |
| M-5 | Memory | 🟡 Medium | Gap analysis only same-type nodes |
| S-1 | Scoring | 🟠 High | Qualifier penalty too simple |
| S-2 | Scoring | 🟠 High | No multi-dimensional calibration |
| S-3 | Scoring | 🟡 Medium | Source count bonus missing from code |
| S-4 | Scoring | 🟡 Medium | Conflict penalty missing from code |
| S-5 | Scoring | 🟡 Medium | Composite is just an average |
| T-1 | Testing | 🔴 Critical | No gold standard test set |
| T-2 | Testing | 🟠 High | No counterfactual robustness tests |
| T-3 | Testing | 🔴 Critical | No paper-level evaluation splits |
| T-4 | Testing | 🟡 Medium | Regression gate doesn't block deployment |
| T-5 | Testing | 🟡 Medium | No stochastic testing |
| TR-1 | Training | 🔴 Critical | Only SFT built, not DPO/GRPO/ConfTuner |
| TR-2 | Training | 🟠 High | ZeroGPU micro-batch training is suboptimal |
| TR-3 | Training | 🟡 Medium | No curriculum learning |
| TR-4 | Training | 🟡 Medium | No task-specific eval during training |
| W-1 | Multi-Model | 🟠 High | Council has no real disagreement mechanism |
| W-2 | Multi-Model | 🟠 High | No router exists |
| W-3 | Multi-Model | 🟠 High | No "I don't know" option |
| H-1 | Longevity | 🟠 High | No drift detection |
| H-2 | Longevity | 🟠 High | No human feedback integration |
| H-3 | Longevity | 🟡 Medium | No versioning of taxonomy changes |
| H-4 | Longevity | 🟡 Medium | No database backup strategy |
| HU-1 | Human | 🟠 High | Information overload in UI |
| HU-2 | Human | 🟡 Medium | No score explanation |
| HU-3 | Human | 🟡 Medium | Automation bias risk |
| SEC-1 | Security | 🟠 High | No input validation |
| SEC-2 | Security | 🟡 Medium | SQL injection pattern |
| SEC-3 | Security | 🟡 Medium | No API rate limiting |

**Total: 50 blindspots** (10 Critical 🔴, 22 High 🟠, 18 Medium 🟡)

---

*This document will be updated iteratively. Each pass adds new blindspots discovered by re-reading with fresh eyes.*