nkshirsa committed on
Commit
c80f8b7
·
verified ·
1 Parent(s): d7a7933

FUTURE_IMPROVEMENTS.md v2.0 Final — 78 blindspots, 14 categories, highschool-readable, compiled from code review + audit + architecture review

Files changed (1)
  1. FUTURE_IMPROVEMENTS.md +436 -474
FUTURE_IMPROVEMENTS.md CHANGED
@@ -1,710 +1,672 @@
1
  # PhD Research OS — Future Improvements Roadmap
2
- ## The Complete Guide to Making This System Actually Work
3
 
4
  **Written so a high school student can understand every word.**
5
 
6
- **Version**: 1.0
7
  **Date**: 2026-04-23
8
- **Status**: Living document updated through iterative blindspot discovery
9
- **Iteration count**: 1 of 100
10
 
11
  ---
12
 
13
  ## What Is This Document?
14
 
15
- Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts. This document is a list of everything that could go wrong and how to fix it, written by looking at the actual code, not just the design plans.
16
 
17
- Think of it like a car safety inspection. The car looks good from the outside, but we need to check under the hood, test the brakes, and make sure the airbags actually work.
 
 
18
 
19
  ---
20
 
21
  ## Table of Contents
22
 
23
- 1. [The Big Picture Problems](#1-the-big-picture-problems)
24
- 2. [Data Problems — Garbage In, Garbage Out](#2-data-problems)
25
- 3. [Reading Papers — The PDF Nightmare](#3-reading-papers)
26
- 4. [Understanding Science — The Brain Problems](#4-understanding-science)
27
- 5. [Remembering Things — The Memory Problems](#5-remembering-things)
28
- 6. [Scoring Confidence — The Trust Problems](#6-scoring-confidence)
29
- 7. [Testing — How Do We Know It Works?](#7-testing)
30
- 8. [Training — Teaching the AI](#8-training)
31
- 9. [Working Together — Multi-Model Problems](#9-working-together)
32
- 10. [Staying Honest Over Time](#10-staying-honest-over-time)
33
- 11. [The Human Side](#11-the-human-side)
34
- 12. [Security and Safety](#12-security-and-safety)
35
- 13. [What To Build First](#13-what-to-build-first)
 
 
 
36
 
37
  ---
38
 
39
- ## 1. The Big Picture Problems
40
-
41
- ### What's the main goal?
42
-
43
- We want a computer that can:
44
- 1. Read a science paper (like a research study about a new cancer test)
45
- 2. Pull out the key findings ("This test detected cancer at 0.8 fM concentration")
46
- 3. Label each finding: Is it a proven fact? A guess? A theory?
47
- 4. Compare findings across many papers and spot contradictions
48
- 5. Tell the researcher how much to trust each finding
49
 
50
- ### What's actually built vs. what's designed?
51
-
52
- Think of building a house. Here's where we are:
53
 
54
- | Part | Status | Analogy |
55
- |------|--------|---------|
56
- | Database (where data lives) | ✅ Built and working | Foundation is poured |
57
- | PDF reader (Layer 0) | ⚠️ Basic version works | Walls are up but no insulation |
58
- | Claim extractor (Layer 2) | ⚠️ Works with mock data, needs real AI | Wiring is in but no electricity yet |
59
- | Deduplicator (Layer 3) | ⚠️ Uses simple word matching, not smart matching | Front door is installed but it's a screen door |
60
- | Knowledge graph (Layer 4) | ⚠️ Structure exists, conflict detection is basic | Plumbing pipes laid but not connected to water |
61
- | Scoring engine (Layer 5) | ✅ Formula works correctly | Heating system works |
62
- | Evaluation (Layer 6) | ⚠️ Counts things but doesn't check if they're right | Smoke detector is installed but has no battery |
63
- | Export to Obsidian (Layer 7) | ✅ Works | Mailbox is installed |
64
- | AI agent system | ✅ Framework works | Garage is built |
65
- | Training data | ⚠️ 1,900 examples — need 10,000+ | You have some furniture but most rooms are empty |
66
- | Trained model | ⚠️ Qwen2.5-3B — design says upgrade to Qwen3-8B | The car in the garage is a Honda Civic, plan calls for a Tesla |
67
 
68
- ### The Gap That Matters Most
69
 
70
- **The system is designed to be evidence-centered, but the code is still model-centered.**
71
 
72
- What does that mean?
 
 
 
73
 
74
- - **Model-centered** means: "Hey AI, read this text and tell me what you think." The AI writes a paragraph about what it found. You hope it's right.
75
- - **Evidence-centered** means: "Hey AI, find the exact sentence in this paper that says something important, highlight it, and classify it." The AI points to specific text. You can check it yourself.
 
 
76
 
77
- Right now, the code generates claims as free text. It should be generating claims as *pointers into the document*. Every claim should be a highlighted sentence with a label, not a summary written by the AI.
78
 
79
- **Why this matters for a high schooler**: Imagine your friend reads a book and tells you "The main character dies at the end." That's model-centered: you're trusting your friend. Evidence-centered would be: "Look at page 342, paragraph 2, where it says 'He took his last breath.'" Now you can check for yourself.
80
 
81
- ---
82
 
83
- ## 2. Data Problems — Garbage In, Garbage Out
84
 
85
- ### Blindspot D-1: Training Data Is Synthetic, Not Real 🔴
86
 
87
- **What's wrong**: All 1,900 training examples were generated by a Python script (`generate_dataset.py`), not extracted from actual science papers. The fake data uses templates like "We investigated the effect of {param1} on {topic}..." Real papers don't write like that.
88
 
89
- **Why it matters**: Imagine learning to cook by reading a cookbook written by someone who has never cooked. The recipes look right, but the timings are wrong, the ingredients are approximate, and the techniques are described theoretically. When you actually try to cook, nothing works quite right.
90
 
91
- **The fix**: We need real examples. Take 100 actual science papers. Have a human expert read each one and manually label: "This sentence is a Fact," "This sentence is an Interpretation," "This qualifier word 'may' was important." Those hand-labeled examples become the gold standard.
92
 
93
- **Cost**: About 200 hours of expert time ($10,000-$20,000 if hiring domain experts).
94
 
95
- ### Blindspot D-2: No Hard Negatives in Training Data 🔴
96
 
97
- **What's wrong**: The training data only has correct examples. There are no examples of *almost correct but subtly wrong* outputs.
98
 
99
- **Why it matters**: Think of studying for a multiple choice test. If you only study the right answers, you'll struggle with tricky wrong answers that look almost right. The AI needs to see examples like:
100
- - "The LOD was 0.8 fM" → correct extraction ✅
101
- - "The LOD was 0.8" (unit dropped) → wrong! ❌
102
- - "The LOD improved dramatically" (number replaced with vague word) → wrong! ❌
103
- - "The LOD was 0.8 µM" (wrong unit — fM ≠ µM, off by a billion) → wrong! ❌
104
 
105
- **The fix**: Run the current model on papers, collect its mistakes, and use those mistakes as "here's what NOT to do" training examples. This is called hard-negative mining.
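
- Here is a minimal sketch of what that mining loop could look like. The field names (`input`, `gold`) and the `model_predict` function are placeholders, not names from the actual pipeline:

- ```python
- # Minimal hard-negative mining sketch (all names here are placeholders).
- # Idea: keep each wrong answer next to the right one, which is exactly the
- # "chosen vs. rejected" pair format that preference training (DPO) expects.
- def mine_hard_negatives(examples, model_predict):
-     """examples: list of dicts with 'input' and 'gold' fields."""
-     pairs = []
-     for ex in examples:
-         predicted = model_predict(ex["input"])    # the current model's attempt
-         if predicted != ex["gold"]:               # a mistake becomes a hard negative
-             pairs.append({
-                 "prompt": ex["input"],
-                 "chosen": ex["gold"],       # what it should have said
-                 "rejected": predicted,      # what it actually said
-             })
-     return pairs
- ```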
106
 
107
- ### Blindspot D-3: No Error Categories in Training Data 🟠
108
 
109
- **What's wrong**: When the model makes a mistake, we just mark it as "wrong." We don't say WHY it's wrong. Was it because a qualifier was dropped? A number was changed? A unit was lost? A section was misidentified?
110
 
111
- **Why it matters**: If a student keeps getting math problems wrong, a good teacher figures out WHY — is it multiplication? Fractions? Word problems? Just saying "wrong" over and over doesn't help. We need to say "you keep dropping the units" or "you keep missing the word 'not' in sentences."
112
 
113
- **The fix**: Create a taxonomy of error types:
114
 
115
- | Error Type | Example | How Common |
116
- |-----------|---------|------------|
117
- | Qualifier loss | "may reduce" → "reduces" | Very common |
118
- | Unit drop | "0.8 fM" → "0.8" | Common |
119
- | Negation flip | "did NOT improve" → "improved" | Dangerous |
120
- | Section confusion | Abstract claim labeled as Results | Common |
121
- | Number hallucination | Paper says 0.8, AI says 0.6 | Rare but devastating |
122
- | Granularity mismatch | Table row treated as whole-paper finding | Common |
123
- | Citation attribution | Claim from cited paper treated as this paper's finding | Very common |
124
- | Causal overclaim | "correlated with" → "caused by" | Common |
125
- | Missing statistical context | p-value or sample size dropped | Common |
126
 
127
- Train the model to recognize its own error types. When it's uncertain, it should say "I might be dropping a qualifier here" instead of just guessing.
128
 
129
- ### Blindspot D-4: Teacher Signals Are Not Preserved 🟠
130
 
131
- **What's wrong**: The system design talks about running 5 different AI models (teachers) on the same paper and keeping ALL their answers. But the current data pipeline just generates one answer per example.
 
 
 
 
 
 
 
 
 
132
 
133
- **Why it matters**: Imagine 5 doctors examining the same patient. If 4 say "it's a cold" and 1 says "check for pneumonia," you don't want to throw away that 1 dissenting opinion. That minority view might save a life. Same with AI — the disagreement IS the signal.
134
 
135
- **The fix**: For each training example, store:
136
- - What Teacher 1 said (and how confident it was)
137
- - What Teacher 2 said (and how confident it was)
138
- - ... and so on for all teachers
139
- - Where they agreed → high confidence training signal
140
- - Where they disagreed → this is a "hard case" that teaches the student about uncertainty
141
 
142
- ### Blindspot D-5: No Counterfactual Data 🟡
143
 
144
- **What's wrong**: No training examples where we deliberately change one thing and check if the model notices.
145
 
146
- **Why it matters**: If a paper says "Treatment A did NOT reduce tumor size," and we change it to "Treatment A reduced tumor size," does the model catch the difference? If it gives the same answer for both, it's not actually reading — it's pattern matching on the other words.
147
 
148
- **The fix**: For every 10th training example, create a "mirror" version where one critical word is changed (add/remove "not", swap a unit, change a number). The model must give a DIFFERENT answer for the mirror version. If it doesn't, that tells us it's using shortcuts instead of actually understanding.
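
- As a quick illustration, here is one way such a mirror could be generated automatically. The flip rules shown are illustrative examples, not the project's actual perturbation list:

- ```python
- # Minimal sketch of building a "mirror" example by flipping one critical phrase.
- # The flip rules below are illustrative, not the project's actual perturbation list.
- import re
- NEGATION_FLIPS = [
-     (r"\bdid not\b", "did"),
-     (r"\bno significant\b", "a significant"),
- ]
- def make_mirror(sentence):
-     """Return a copy with one critical phrase flipped, or None if no rule applies."""
-     for pattern, replacement in NEGATION_FLIPS:
-         if re.search(pattern, sentence, flags=re.IGNORECASE):
-             return re.sub(pattern, replacement, sentence, count=1, flags=re.IGNORECASE)
-     return None
- original = "Treatment A did not reduce tumor size."
- mirror = make_mirror(original)   # "Treatment A did reduce tumor size."
- # The model must give DIFFERENT answers for the original and the mirror.
- ```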
149
 
150
  ---
151
 
152
- ## 3. Reading Papers — The PDF Nightmare
153
 
154
- ### Blindspot P-1: No Actual ML-Based Parser Is Integrated 🔴
155
 
156
- **What's wrong**: The code in `parser.py` uses PyMuPDF (fitz) or pdfplumber. These are basic text extractors: they just scrape text off the page. The system design says to use Marker (a machine learning-based parser that understands document layout), but Marker is not actually integrated.
157
 
158
- **Why it matters**: Basic text extractors destroy the structure of a paper. They merge columns, lose table formatting, can't tell the difference between a heading and body text, and turn equations into garbage characters. It's like trying to read a newspaper by photographing it and running basic OCR — you get a jumble of text from different columns mixed together.
159
 
160
- **The fix**: Actually install and integrate Marker. It's a Python library that uses deep learning to understand page layout. It knows that a two-column paper has separate columns, that a table has rows and cells, and that an equation is not regular text.
161
 
162
- ```
163
- Current: PDF → pdfplumber → flat text (broken tables, merged columns)
164
- Needed: PDF → Marker → structured markdown (tables preserved, columns separated)
165
- ```
166
 
167
- ### Blindspot P-2: Section Detection Is Fragile 🟠
168
 
169
- **What's wrong**: The code detects sections (Abstract, Methods, Results) using simple text pattern matching: it looks for the word "Abstract" at the start of a line. But many papers:
170
- - Number their sections ("2. Materials and Methods" or "III. Experimental Setup")
171
- - Use non-standard names ("Findings" instead of "Results")
172
- - Are in languages other than English
173
- - Have combined sections ("Results and Discussion")
174
- - Don't have clear section headers at all (short communications, letters)
175
 
176
- **Why it matters**: The entire scoring system depends on knowing which section a claim comes from. A finding in the Results section gets full confidence (1.0×). The same finding in the Abstract gets penalized (0.7×). If we misidentify the section, we score the claim wrong.
177
 
178
- **The fix**: Use Marker's built-in section detection (it's trained on millions of papers), and add a fallback classifier that looks at the CONTENT of a paragraph to guess which section it belongs to (e.g., paragraphs with many citations are probably Introduction/Discussion, paragraphs with numbers and p-values are probably Results).
179
 
180
- ### Blindspot P-3: Tables Are Extracted As Flat Text 🔴
181
 
182
- **What's wrong**: When pdfplumber extracts a table, it stores it as pipe-separated text: `"0.8 | fM | PBS"`. The relationship between the header ("LOD") and the value ("0.8 fM") is lost. You can't tell which column header goes with which value.
183
 
184
- **Why it matters**: Tables contain the most important evidence in a science paper. If you lose the header-value relationship, you lose the meaning. "0.8" means nothing. "LOD = 0.8 fM" means everything.
185
 
186
- **The fix**: Store tables as structured data (like a spreadsheet), not flat text:
187
 
188
- ```json
189
- {
190
-   "table_id": "TAB_001",
191
-   "headers": ["Parameter", "Value", "Buffer", "Method"],
192
-   "rows": [
193
-     ["LOD", "0.8 fM", "10 mM PBS", "3σ/slope"],
194
-     ["Dynamic range", "1 fM - 100 nM", "10 mM PBS", "calibration curve"]
195
-   ]
196
- }
197
- ```
198
 
199
- Now the AI knows that "0.8 fM" is the "LOD" (limit of detection), measured in "10 mM PBS" buffer, using the "3σ/slope" method.
200
 
201
- ### Blindspot P-4: Figures Are Completely Ignored 🟠
202
 
203
- **What's wrong**: The parser detects image blocks and stores them as `"[Image detected requires VLM processing]"`. No actual image analysis happens.
204
 
205
- **Why it matters**: About 30-40% of the key evidence in a science paper is in figures. A scatter plot showing that sensitivity decreases at high ionic strength contains critical quantitative data. If we ignore all figures, we're reading the paper with one eye closed.
206
 
207
- **The fix**: This is a multi-step process:
208
- 1. Detect if a figure is a chart (bar, scatter, line), a diagram, or a photograph (micrograph)
209
- 2. For charts: use a plot digitizer to extract the actual data points (like WebPlotDigitizer)
210
- 3. For diagrams: use a vision-language model (VLM) to describe what the diagram shows
211
- 4. For micrographs: use a VLM to identify what the image shows (cells, crystals, etc.)
212
- 5. Cross-check: does the data from the figure match what the text says about it?
213
 
214
- ### Blindspot P-5: Cross-References Are Detected But Never Verified 🟡
215
 
216
- **What's wrong**: The code finds references like "see Figure 3" or "Table 2" in the text, but it never checks if Figure 3 actually exists in the parsed output, or if the table labeled as Table 2 is actually Table 2.
217
 
218
- **Why it matters**: If the parser mislabels tables (assigns Table 1's caption to Table 2's data), every claim that references those tables will have wrong evidence attached to it. Wrong evidence is worse than no evidence.
219
 
220
- ### Blindspot P-6: No Supplement Handling 🟠
221
 
222
- **What's wrong**: The system design describes "paper bundles" (main PDF + supplementary files), but the parser can only process one file at a time. There's no way to say "this supplement goes with that main paper."
223
 
224
- **Why it matters**: In many fields (biology, chemistry), the most important data is in the supplementary materials. The main paper says "see Supplementary Table S3 for full results" — if we can't read the supplement, we're missing the best evidence.
225
 
226
- ### Blindspot P-7: No Language Detection or Handling 🟡
227
 
228
- **What's wrong**: The system assumes all papers are in English. There's no detection of non-English text and no handling strategy for it.
229
 
230
- **Why it matters**: Many important papers, especially in Chinese, Japanese, German, and French science journals, are not in English. If someone feeds a Chinese paper into the system, it will try to extract "claims" from text it can't understand, and it will produce confident-sounding garbage.
 
 
231
 
232
- **The fix**: Before processing, detect the language. If it's not English, either: (a) refuse with a clear message, (b) translate first and flag reduced confidence, or (c) route to a multilingual model.
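
- A minimal sketch of option (a), using the `langdetect` package (one of several libraries that could do this):

- ```python
- # Minimal sketch using the langdetect package (one of several options).
- from langdetect import detect
- def check_language(text: str) -> str:
-     try:
-         return detect(text[:5000])   # a few pages of text is plenty
-     except Exception:                # langdetect raises if text is empty or garbled
-         return "unknown"
- sample = "Wir haben die Nachweisgrenze des Biosensors bestimmt."
- if check_language(sample) != "en":
-     print("Non-English paper detected: refusing to extract claims.")
- ```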
233
 
234
- ---
235
 
236
- ## 4. Understanding Science — The Brain Problems
237
 
238
- ### Blindspot B-1: The Model Is Way Too Small 🔴
239
 
240
- **What's wrong**: The current model is Qwen2.5-3B (3 billion parameters). The design calls for Qwen3-8B. But even 8B may not be enough for reliable scientific reasoning.
241
 
242
- **Why it matters**: Think of brain size as vocabulary + reasoning ability. A 3B model is like a smart middle schooler — it can follow instructions and produce well-formatted output, but it doesn't deeply understand what it's writing. An 8B model is like a college freshman — much better, but still makes errors on complex reasoning. A 27B model is like a PhD student — it can genuinely reason about scientific claims.
243
 
244
- **The trade-off**: Bigger models need more computer memory (GPU/VRAM). The design targets a 16-24GB consumer GPU. Here are the options:
245
 
246
- | Model | Size | VRAM (quantized) | Reasoning Quality |
247
- |-------|------|-------------------|-------------------|
248
- | Qwen2.5-3B | 3B | ~2.5 GB | 🟡 Basic structure, frequent errors |
249
- | Qwen3-8B | 8B | ~5 GB | 🟢 Good structure, occasional errors |
250
- | Qwen3.5-9B | 9B | ~6 GB | 🟢 Better reasoning patterns |
251
- | Qwen3-30B-A3B MoE | 30B (3B active) | ~6 GB | 🟢🟢 Dense-14B quality at 3B cost |
252
- | Qwen3.5-27B | 27B | ~15 GB | 🟢🟢🟢 Best for this task |
253
 
254
- ### Blindspot B-2: No Specialist Heads — One Brain Doing Too Many Jobs 🟠
255
 
256
- **What's wrong**: A single model does ALL tasks: claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want very different behaviors.
257
 
258
- **Why it matters**: Think about a Swiss Army knife versus a chef's knife set. The Swiss Army knife can do everything but nothing well. Claim extraction wants to cast a wide net (find ALL the claims, even questionable ones). Qualifier detection wants surgical precision (keep the EXACT words). Statistical parsing wants numerical accuracy (get the numbers RIGHT). One model can't optimize for all of these simultaneously.
259
 
260
- **The fix**: Use a shared base model but add specialized "heads" (output layers) for different tasks:
261
- - Extraction head: optimized for recall (finding things)
262
- - Qualifier head: optimized for precision (keeping exact words)
263
- - Statistics head: optimized for numerical accuracy
264
- - Conflict head: optimized for comparison across claims
265
 
266
- This is like having one person who can speak 6 languages (the shared base) but wears different hats (the heads) depending on which task they're doing.
267
 
268
- ### Blindspot B-3: No Domain Pre-Training 🟠
269
 
270
- **What's wrong**: The base model (Qwen2.5-3B or Qwen3-8B) was trained on general internet text. It knows about cooking, sports, history, and programming. But it doesn't have deep knowledge of scientific terminology, experimental methods, or statistical analysis.
271
 
272
- **Why it matters**: If you asked someone who has never taken a chemistry class to read a chemistry paper, they'd struggle with terms like "LOD," "analyte," "electrochemical impedance spectroscopy," or "cyclic voltammetry." They might extract the sentence but misclassify it because they don't understand what the words mean. Our AI has the same problem.
273
 
274
- **The fix**: Before fine-tuning on the Research OS data, do a "domain warm-up" — feed the model thousands of science papers (just reading them, not labeling) so it learns the vocabulary and writing style. Datasets like S2ORC (Semantic Scholar Open Research Corpus) have millions of papers we could use.
275
 
276
- ### Blindspot B-4: Mock Extraction Is the Default Path 🔴
277
 
278
- **What's wrong**: The `QualifiedExtractor` class in `extractor.py` has a method called `_extract_mock()` that uses simple pattern matching (looking for words like "measured" or "suggests") to classify claims. When no AI model is connected, this is what runs. And based on the code, this is what actually runs in practice because the AI brain connection is optional.
279
 
280
- **Why it matters**: The mock extractor is a toy. It assigns "Fact" to any sentence containing the word "measured" and "Hypothesis" to any sentence containing "may." This is like a doctor who diagnoses every patient with a cough as having a cold and every patient with a headache as having a migraine. The real world is much more complex.
281
 
282
- **Example of mock failure**:
283
- - "We measured the baseline but found no significant difference" → Mock says: Fact (because "measured") → Actually: Null result
284
- - "The previously measured values suggest contamination" → Mock says: Fact → Actually: Interpretation about a past measurement
285
- - "It may be the most reliable technique measured to date" → Mock says: both Fact AND Hypothesis → Which one wins? Depends on code ordering.
286
 
287
- ### Blindspot B-5: No Constrained Decoding 🟠
288
 
289
- **What's wrong**: The design describes using the "Guidance" library to force the AI to output valid JSON with valid tags (only "Fact", "Interpretation", "Hypothesis", or "Conflict_Hypothesis"). But this is not implemented. The current code just asks the AI to output JSON and hopes for the best.
290
 
291
- **Why it matters**: LLMs are unreliable at producing valid JSON, especially small ones. Without constrained decoding, the model might output:
292
- - `{"epistemic_tag": "fact"}` (lowercase — not in the allowed list)
293
- - `{"epistemic_tag": "Observation"}` (a category we don't have)
294
- - Broken JSON with missing brackets
295
- - A mix of JSON and natural language explanation
296
 
297
- Constrained decoding GUARANTEES the output is valid. It's like fill-in-the-blank versus an essay: fill-in-the-blank can't have format errors.
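
- True constrained decoding happens while the model is generating tokens (the design names the Guidance library for this). Until that is wired in, a validate-and-retry wrapper is a rough stopgap; this sketch assumes a generic `generate` function and a retry count picked arbitrarily:

- ```python
- # Validate-and-retry stopgap (NOT true constrained decoding, which works at the
- # token level). 'generate' is any function that takes a prompt and returns text.
- import json
- ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}
- def extract_with_validation(generate, prompt, max_retries=3):
-     for _ in range(max_retries):
-         raw = generate(prompt)
-         try:
-             claim = json.loads(raw)
-         except json.JSONDecodeError:
-             continue                                   # broken JSON: try again
-         if claim.get("epistemic_tag") in ALLOWED_TAGS:
-             return claim                               # valid structure and tag
-     return None                                        # give up: flag for human review
- ```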
298
 
299
- ### Blindspot B-6: No Out-of-Distribution Detection 🟠
300
 
301
- **What's wrong**: The system has no way to detect when it's being asked to process a paper from a field it doesn't understand.
302
 
303
- **Why it matters**: The training data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. If someone feeds it a sociology paper, a legal document, or a paper about ancient Egyptian archaeology, the model will still produce confident-looking scientific claim extractions. They'll be nonsense, but they'll look perfectly formatted.
304
 
305
- **The fix**: Before the AI processes anything, check if the paper's vocabulary and topic are similar to the training data. If they're too different, say "I don't know this field well enough; my results may not be reliable" instead of producing garbage with high confidence.
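
- A minimal sketch of one way to do this check, assuming we keep a single averaged embedding of the training-domain abstracts (the threshold value here is a guess that would need tuning):

- ```python
- # Minimal out-of-domain check: compare a new abstract to the average embedding
- # of the training-domain abstracts. The 0.35 threshold is a guess to be tuned.
- import numpy as np
- from sentence_transformers import SentenceTransformer
- model = SentenceTransformer("all-MiniLM-L6-v2")
- def domain_centroid(training_abstracts):
-     return model.encode(training_abstracts).mean(axis=0)
- def is_out_of_domain(abstract, centroid, threshold=0.35):
-     v = model.encode(abstract)
-     cosine = float(np.dot(v, centroid) / (np.linalg.norm(v) * np.linalg.norm(centroid)))
-     return cosine < threshold   # low similarity to everything we trained on
- ```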
306
 
307
  ---
308
 
309
- ## 5. Remembering Things — The Memory Problems
310
-
311
- ### Blindspot M-1: Deduplication Uses Jaccard Similarity (Word Overlap), Not Semantic Similarity 🔴
312
 
313
- **What's wrong**: When the system checks if two claims are duplicates, it compares them by counting how many words they share. This is the `jaccard_similarity` function in `canonicalizer.py`.
314
 
315
- **Why it matters**: These two claims say the same thing but share almost no words:
316
- - "The LOD was 0.8 fM for the GFET biosensor"
317
- - "A detection limit of eight-tenths of a femtomolar was achieved using the graphene transistor"
318
 
319
- Jaccard similarity would rate these as very different (almost no shared words). But they're the same finding! The system would create two separate canonical claims for the same discovery.
320
 
321
- **The fix**: Use embedding-based similarity. Convert each claim into a number vector using a sentence embedding model (like `all-MiniLM-L6-v2`). Claims that mean the same thing will have similar vectors, even if they use different words.
322
 
323
- ### Blindspot M-2: No Embedding Model Is Connected 🟠
324
 
325
- **What's wrong**: The design mentions using embeddings for deduplication, but the code doesn't import or use any embedding model. There's no ChromaDB connection, no sentence-transformers, nothing.
326
 
327
- **The fix**: Add a lightweight embedding model (runs on CPU, very fast):
328
 
329
- ```python
330
- from sentence_transformers import SentenceTransformer
331
- model = SentenceTransformer('all-MiniLM-L6-v2') # 80MB, runs on CPU
332
- embedding = model.encode("The LOD was 0.8 fM")
333
- # Now compare embeddings with cosine similarity instead of word overlap
334
- ```
335
 
336
- ### Blindspot M-3: Knowledge Graph Conflict Detection Is Based on Word Overlap 🟠
337
 
338
- **What's wrong**: The `find_conflicts()` method in `graph.py` finds conflicts by checking if two claims share enough words. Same problem as deduplication — semantically related claims with different vocabulary will be missed.
339
 
340
- **Additionally**: The method only compares the top 500 claims by confidence. If claim #501 contradicts claim #3, it will never be found.
341
 
342
- ### Blindspot M-4: No Temporal Reasoning 🟡
343
 
344
- **What's wrong**: The knowledge graph has no concept of time. A claim from 2015 and a claim from 2025 about the same topic are treated equally. But the 2025 paper might specifically address and overturn the 2015 finding.
345
-
346
- **The fix**: Every claim needs a publication year. When comparing claims, newer evidence should be weighted more heavily. And when a newer paper directly cites and contradicts an older one, the graph should note this as a "supersedes" relationship.
347
-
348
- ### Blindspot M-5: Gap Analysis Only Compares Same-Type Nodes 🟡
349
-
350
- **What's wrong**: In `graph.py`, the `find_gaps()` method only looks for gaps between nodes of the same type (`if a["node_type"] != b["node_type"]: continue`). But interesting gaps can exist between different types — for example, between a claim node and a method node that have never been connected.
351
 
352
  ---
353
 
354
- ## 6. Scoring Confidence — The Trust Problems
355
 
356
- ### Blindspot S-1: Qualifier Penalty Is Too Simple 🟠
357
 
358
- **What's wrong**: Each qualifier reduces confidence by exactly 0.1. So "may" (-0.1), "suggests" (-0.1), and "under controlled conditions" (-0.1) all get the same penalty. But "may" is much weaker than "under controlled conditions."
359
 
360
- **Why it matters**: Not all hedging words are equal:
361
- - "may" = very uncertain (should be -0.2)
362
- - "appears to" = moderately uncertain (should be -0.15)
363
- - "under these specific conditions" = limits scope but not uncertain (should be -0.05)
364
- - "demonstrates" = actually strengthens the claim (should be +0.05)
365
 
366
- **The fix**: Weight qualifiers by type:
367
 
368
- | Qualifier Type | Example Words | Penalty |
369
- |---------------|---------------|---------|
370
- | Strong hedge | may, might, could, possibly | -0.20 |
371
- | Moderate hedge | suggests, indicates, appears | -0.15 |
372
- | Weak hedge | likely, tends to, generally | -0.10 |
373
- | Scope limiter | under these conditions, in vitro | -0.05 |
374
- | Strengthener | demonstrates, confirms, proves | +0.05 |
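
- A minimal sketch of how the table above could be applied in the scorer. The word lists are shortened and the clamping range is an assumption, not something from the design document:

- ```python
- # Type-weighted qualifier penalties, mirroring the table above.
- # Word lists are shortened and the clamp range is an assumption.
- QUALIFIER_WEIGHTS = {
-     "strong_hedge":   (-0.20, {"may", "might", "could", "possibly"}),
-     "moderate_hedge": (-0.15, {"suggests", "indicates", "appears"}),
-     "weak_hedge":     (-0.10, {"likely", "tends to", "generally"}),
-     "scope_limiter":  (-0.05, {"under these conditions", "in vitro"}),
-     "strengthener":   (+0.05, {"demonstrates", "confirms", "proves"}),
- }
- def qualifier_adjustment(qualifiers):
-     """Sum the penalty or bonus for each detected qualifier phrase."""
-     total = 0.0
-     for phrase in qualifiers:
-         for weight, words in QUALIFIER_WEIGHTS.values():
-             if phrase.lower() in words:
-                 total += weight
-                 break
-     return max(-0.5, min(0.1, total))   # clamp so one claim can't swing too far
- print(qualifier_adjustment(["may", "in vitro"]))   # -0.25
- ```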
375
 
376
- ### Blindspot S-2: No Multi-Dimensional Calibration 🟠
377
 
378
- **What's wrong**: The system tracks one Brier score for overall confidence calibration. But confidence means different things in different contexts:
379
- - "I'm confident this claim EXISTS in the paper" (extraction confidence)
380
- - "I'm confident this qualifier is important" (qualifier confidence)
381
- - "I'm confident these two claims contradict each other" (conflict confidence)
382
- - "I'm confident this claim came from the Results section" (section confidence)
383
 
384
- A single Brier score hides all of these behind one number.
385
 
386
- **The fix**: Track 6 separate calibration curves. A model can be brilliantly calibrated on extraction ("I found a claim" is right 85% of the time when it says 85%) but terribly calibrated on conflict detection ("I found a contradiction" is right only 40% of the time when it says 80%).
387
 
388
- ### Blindspot S-3: Source Count Bonus Is Not In The Code 🟡
389
 
390
- **What's wrong**: The SYSTEM_DESIGN.md describes a "source_count_bonus" where claims supported by multiple papers get up to +0.2 confidence. But in the actual `scorer.py` code, this bonus doesn't exist. The scorer only uses information from a single claim, not the canonical claim's evidence count.
391
 
392
- ### Blindspot S-4: Conflict Penalty Is Not In The Code 🟡
393
 
394
- **What's wrong**: Similarly, the design says claims with active conflicts get a -0.3 penalty. The scorer code doesn't check for conflicts.
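
- A minimal sketch of both missing adjustments, using the design document's numbers (+0.2 cap for extra sources, -0.3 for an active conflict); the 0.05-per-extra-paper ramp is an assumption:

- ```python
- # The two missing adjustments, using the design document's numbers
- # (+0.2 cap for extra sources, -0.3 for an active conflict).
- # The 0.05-per-extra-paper ramp is an assumption, not from the design.
- def source_count_bonus(n_supporting_papers: int) -> float:
-     return min(0.2, 0.05 * max(0, n_supporting_papers - 1))
- def conflict_penalty(has_active_conflict: bool) -> float:
-     return -0.3 if has_active_conflict else 0.0
- score = 0.72 + source_count_bonus(4) + conflict_penalty(False)   # 0.72 + 0.15 + 0.0
- ```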
395
 
396
- ### Blindspot S-5: The Composite Score Is Just An Average 🟡
397
 
398
- **What's wrong**: The composite confidence is `(evidence_quality + truth_likelihood + qualifier_strength) / 3`. This means a claim with evidence_quality=1.0 and qualifier_strength=0.0 gets the same score as a claim with both at 0.5. But the first claim is STRONG evidence that's completely hedged, while the second is mediocre evidence that's moderately hedged — very different situations.
399
 
400
- **The fix**: Use the MINIMUM of the three scores, not the average. A chain is only as strong as its weakest link. If any one dimension is terrible, the composite should be terrible.
401
 
402
  ---
403
 
404
- ## 7. Testing — How Do We Know It Works?
405
-
406
- ### Blindspot T-1: No Gold Standard Test Set 🔴
407
-
408
- **What's wrong**: The evaluation system (`evaluator.py`) counts things (how many claims, how many with qualifiers, how many null results) but doesn't compare against a known correct answer. It's like grading a test by checking if every question was answered, not if the answers are correct.
409
-
410
- **Why it matters**: The system could extract 500 claims from 10 papers and report "92% have qualifiers, 15% are null results, average confidence 0.65" — and ALL 500 claims could be wrong. We'd never know because we never checked against human-labeled ground truth.
411
-
412
- **The fix**: Create a gold standard test set:
413
- 1. Take 10 papers from your actual research domain
414
- 2. Have an expert manually label every claim, qualifier, epistemic tag, and conflict
415
- 3. Run the system on those same papers
416
- 4. Compare: How many real claims did it find? How many fake claims did it invent? How many tags did it get right?
417
-
418
- ### Blindspot T-2: No Counterfactual Robustness Tests 🟠
419
 
420
- **What's wrong**: We never test if the model changes its answer for the RIGHT reason.
421
 
422
- **Why it matters**: The model might get the right answer for the wrong reason. For example, it might learn that papers published in Nature usually contain "Facts" and papers from arXiv usually contain "Hypotheses." This is a shortcut, not understanding.
423
 
424
- **Test suite we need**:
 
 
 
425
 
426
- | Test | What We Do | What Should Happen |
427
- |------|-----------|-------------------|
428
- | Remove table header | Delete the column labels from a table | Claims from that table should become "Incomplete" |
429
- | Swap unit | Change "fM" to "µM" | The extracted value should change accordingly |
430
- | Flip negation | "significant" → "not significant" | Epistemic tag should change from Fact to null result |
431
- | Remove figure | Delete a figure and its caption | Claims that relied on that figure should lose confidence |
432
- | Change journal | Pretend the paper was published in a lower-tier journal | Journal tier weight should decrease |
433
 
434
- ### Blindspot T-3: No Paper-Level Evaluation Splits 🔴
435
 
436
- **What's wrong**: The train/test split in the dataset is random: individual examples are shuffled. This means the same paper's claims might appear in both train and test. The model could memorize "when I see biosensor + LOD, output Fact with 0.85 confidence" instead of actually understanding.
437
 
438
- **The fix**: Split by PAPER, not by example. All claims from Paper A go in training. All claims from Paper B go in testing. This way, the test set truly measures: "Can the model handle a paper it's never seen?"
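
- A minimal sketch using scikit-learn's `GroupShuffleSplit`, assuming each example records which paper it came from (the `paper_id` field name is hypothetical):

- ```python
- # Paper-level split with scikit-learn; assumes each example has a 'paper_id' field.
- from sklearn.model_selection import GroupShuffleSplit
- def split_by_paper(examples, test_size=0.2, seed=42):
-     groups = [ex["paper_id"] for ex in examples]
-     splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
-     train_idx, test_idx = next(splitter.split(examples, groups=groups))
-     train = [examples[i] for i in train_idx]
-     test = [examples[i] for i in test_idx]
-     return train, test   # no paper ever appears on both sides
- ```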
439
-
440
- Better yet, also split by:
441
- - **Lab** — all papers from one research group in holdout
442
- - **Journal** — all papers from one journal in holdout
443
- - **Year** — all papers from 2025+ in holdout
444
- - **Field** — all materials science papers in holdout
445
 
446
- ### Blindspot T-4: The Regression Gate Doesn't Actually Gate Anything 🟡
447
 
448
- **What's wrong**: The `run_regression_gate()` function checks metrics and returns pass/fail, but nothing in the system actually prevents a failing model from being deployed. There's no CI/CD pipeline that blocks a model update when the regression gate fails.
449
 
450
- ### Blindspot T-5: No Stochastic Testing 🟡
451
 
452
- **What's wrong**: LLMs are not deterministic: ask the same question twice and you might get different answers. The evaluation runs once and reports a single number. But that number could be 72% one run and 68% the next.
453
 
454
- **The fix**: Run every evaluation 5 times. Report the mean AND the range. If the range is too wide (more than 5% spread), that's a problem: the model is unreliable.
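
- A minimal sketch of that repeat-and-report loop; `run_evaluation` stands in for whatever function currently produces the accuracy number:

- ```python
- # Repeat the evaluation and report mean plus spread.
- # 'run_evaluation' stands in for whatever function produces the accuracy number.
- def stability_check(run_evaluation, n_runs=5, max_spread=0.05):
-     scores = [run_evaluation() for _ in range(n_runs)]
-     mean = sum(scores) / n_runs
-     spread = max(scores) - min(scores)
-     if spread > max_spread:
-         print(f"WARNING: results vary by {spread:.1%} across runs; model is unstable")
-     return mean, spread
- ```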
 
 
 
 
455
 
456
  ---
457
 
458
- ## 8. Training — Teaching the AI
459
-
460
- ### Blindspot TR-1: Only SFT Is Implemented — No DPO, GRPO, or ConfTuner 🔴
461
-
462
- **What's wrong**: The design describes a 4-stage training pipeline:
463
- 1. **SFT** (Supervised Fine-Tuning): "Here's the right answer, learn it" ← Only this is built
464
- 2. **DPO** (Direct Preference Optimization): "This answer is better than that one" ← Not built
465
- 3. **GRPO** (Group Relative Policy Optimization): "Here's a reward for good JSON, good tags, and preserved qualifiers" ← Not built
466
- 4. **ConfTuner**: "Make your confidence scores match reality" ← Not built
467
-
468
- **Why it matters**: SFT alone gets you to maybe 70% quality. DPO pushes it to 80%. GRPO (with the custom reward functions) pushes it to 85-90%. ConfTuner ensures the confidence numbers are meaningful. Stopping at Stage 1 is like running a marathon but stopping at mile 7.
469
 
470
- ### Blindspot TR-2: ZeroGPU Training Space Has Fundamental Limitations 🟠
471
 
472
- **What's wrong**: The training Space (`phd-research-os-train`) uses ZeroGPU with 60-second time limits per call. It trains in micro-batches of ~20 steps. This means:
473
- - The learning rate scheduler resets every micro-batch (cosine schedule restarts)
474
- - The model is loaded from scratch every call (no persistent GPU state)
475
- - Total training time is limited by how many times you click "Train"
476
 
477
- **Why it matters**: Good fine-tuning needs continuous training with a smooth learning rate schedule. Chopping it into 20-step micro-batches is like studying for an exam in 2-minute bursts with a nap between each burst — you lose context and momentum.
478
 
479
- **The fix**: Use a proper training job (like HF Jobs) with a single continuous run. 3 epochs on 2,000 examples with Qwen2.5-3B takes about 1-2 hours on a single GPU. That's one uninterrupted training run, not 100 button clicks.
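
- A minimal sketch of what that single run could look like with the `transformers` library. The hyperparameters shown are illustrative defaults, not tuned values:

- ```python
- # One continuous run with a single cosine schedule (hyperparameters illustrative).
- from transformers import TrainingArguments
- args = TrainingArguments(
-     output_dir="research-os-sft",
-     num_train_epochs=3,                  # one continuous pass over the dataset
-     per_device_train_batch_size=4,
-     gradient_accumulation_steps=4,
-     learning_rate=2e-4,
-     lr_scheduler_type="cosine",          # one smooth schedule, never restarted
-     warmup_ratio=0.03,
-     logging_steps=10,
-     save_strategy="epoch",
- )
- # Hand these to a single Trainer/SFTTrainer call and let it run to completion,
- # instead of chopping training into 20-step micro-batches.
- ```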
480
 
481
- ### Blindspot TR-3: No Curriculum Learning 🟡
482
 
483
- **What's wrong**: All training examples are shuffled randomly. The model sees hard cases (complex multi-claim extractions with conflicts) at the same time as easy cases (single fact with clear evidence).
484
 
485
- **Why it matters**: Humans learn better when material is presented from easy to hard. First learn to identify a single Fact. Then learn to distinguish Fact from Interpretation. Then handle multi-claim papers. Then handle contradictions. This is called curriculum learning, and it leads to more stable training and better handling of rare, difficult cases.
486
 
487
- ### Blindspot TR-4: No Evaluation During Training 🟡
488
 
489
- **What's wrong**: The `train.py` script does eval every 50 steps (reporting eval_loss), but eval_loss doesn't tell you if the model is actually getting better at the TASK. A model might have low eval_loss but still produce invalid JSON, drop qualifiers, or misclassify epistemic tags.
490
 
491
- **The fix**: During training, periodically run the model on a small "canary" set of 10 examples and check:
492
- - Is the JSON valid?
493
- - Are the epistemic tags correct?
494
- - Are qualifiers preserved?
495
- - Are the confidence scores reasonable?
496
 
497
- This is called task-specific evaluation, and it's much more informative than just watching the loss number go down.
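
- A minimal sketch of such a canary check (the field names and the helper `generate` function are hypothetical):

- ```python
- # Canary check run every few hundred training steps (names are hypothetical).
- import json
- ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}
- def canary_report(generate, canaries):
-     """canaries: list of dicts with 'prompt', 'gold_tag', and 'qualifiers' fields."""
-     valid_json = correct_tag = kept_qualifiers = 0
-     for ex in canaries:
-         raw = generate(ex["prompt"])
-         try:
-             out = json.loads(raw)
-         except json.JSONDecodeError:
-             continue                       # invalid JSON counts against valid_json
-         valid_json += 1
-         if out.get("epistemic_tag") == ex["gold_tag"]:
-             correct_tag += 1
-         if all(q in raw for q in ex["qualifiers"]):
-             kept_qualifiers += 1
-     n = len(canaries)
-     return {"valid_json": valid_json / n, "correct_tag": correct_tag / n,
-             "kept_qualifiers": kept_qualifiers / n}
- ```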
498
 
499
- ---
500
 
501
- ## 9. Working Together — Multi-Model Problems
502
 
503
- ### Blindspot W-1: The Council Has No Real Disagreement Mechanism 🟠
504
 
505
- **What's wrong**: The design describes a multi-round council (Extractor → Critic → Chairman) where members debate. But in the code, the council is sequential: one call to the Extractor, one call to the Critic, one call to the Chairman. There's no actual back-and-forth.
506
 
507
- **Why it matters**: The whole point of the council is that disagreement reveals uncertainty. If the Extractor says "Fact" and the Critic says "Actually, this is an Interpretation because the author used the word 'suggests'," that disagreement is incredibly valuable information. But in a sequential pipeline, the Chairman just picks a winner — there's no real debate.
508
 
509
- ### Blindspot W-2: No Router Exists 🟠
510
 
511
- **What's wrong**: The system design describes a "Dynamic Task Router" that decides which model should handle each piece of content. The code has no router. Everything goes to the same model (or the mock extractor).
512
 
513
- **The fix**: Build a simple classifier that looks at each content chunk and decides:
514
- - Regular text → primary extraction model
515
- - Table → specialized table parser
516
- - Figure → VLM or plot digitizer
517
- - Equation → LaTeX parser
518
- - Garbage/low quality → skip with "Unextractable" flag
519
- - Out of domain → flag for human review
520
 
521
- ### Blindspot W-3: No "I Don't Know" Option 🟠
522
 
523
- **What's wrong**: The system always produces an answer. There's no mechanism for any component to say "I can't handle this; pass it to something else or ask a human."
524
 
525
- **Why it matters**: The worst possible failure is confident garbage. A system that says "I don't know" is infinitely better than a system that says "I'm 85% confident" when it's actually guessing.
526
 
527
- **What "I don't know" looks like in practice**:
528
- - Parser confidence < 0.4 → "I couldn't read this page clearly. Don't trust my extraction."
529
- - Model uncertainty is high → "I found what might be a claim, but I'm not sure if it's a Fact or Interpretation."
530
- - Content is out of domain → "This doesn't look like the kind of science I was trained on."
531
- - Input is non-English → "This appears to be in a language I can't process reliably."
532
 
533
  ---
534
 
535
- ## 10. Staying Honest Over Time
536
 
537
- ### Blindspot H-1: No Drift Detection 🟠
 
 
538
 
539
- **What's wrong**: If the model's performance gradually degrades (because it encounters papers from a new sub-field, or because the scientific vocabulary evolves), there's no alarm system.
540
 
541
- **The fix**: Every week, run the model on the gold standard test set. If any metric drops by more than 5% compared to the previous week, send an alert. This is like a regular health check-up — catch problems before they become emergencies.
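
- A minimal sketch of that weekly comparison (the metric names are examples, not the system's actual metric keys):

- ```python
- # Weekly drift check against last week's numbers (metric names are examples).
- def drift_alert(this_week: dict, last_week: dict, tolerance=0.05):
-     alerts = []
-     for metric, value in this_week.items():
-         previous = last_week.get(metric)
-         if previous is not None and previous - value > tolerance:
-             alerts.append(f"{metric} dropped from {previous:.2f} to {value:.2f}")
-     return alerts
- print(drift_alert({"tag_accuracy": 0.78}, {"tag_accuracy": 0.85}))
- # ['tag_accuracy dropped from 0.85 to 0.78']
- ```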
542
 
543
- ### Blindspot H-2: No Human Feedback Integration 🟠
544
 
545
- **What's wrong**: When a human uses the UI and corrects a claim (changes a tag from Fact to Interpretation, adds a missing qualifier), that correction is not stored in a way that can be used for future training.
546
 
547
- **The fix**: Every human correction becomes a high-value training example. Over time, these corrections create a growing gold dataset tailored to YOUR specific research domain. This is the cheapest way to get high-quality labeled data — your daily work generates it automatically.
548
 
549
- ### Blindspot H-3: No Versioning of Taxonomy Changes 🟡
550
 
551
- **What's wrong**: The taxonomy (which study types exist and what weights they get) can change over time. But old claims don't get re-scored when the taxonomy changes. A claim scored under the old taxonomy has a different meaning than a claim scored under the new one.
552
 
553
- ### Blindspot H-4: No Recovery Plan for Database Corruption 🟡
554
 
555
- **What's wrong**: The system uses SQLite, which stores everything in a single file. If that file gets corrupted (power failure during a write, disk error, accidental deletion), everything is lost. WAL mode is enabled, which helps the database survive a crash mid-write, but there is no backup strategy and no periodic snapshots.
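
- A minimal fix sketch using the `sqlite3` module's built-in backup API; the file and folder names are placeholders, not the project's real paths:

- ```python
- # Periodic snapshot using sqlite3's built-in backup API (file names are placeholders).
- import sqlite3, datetime, pathlib
- def snapshot(db_path="research_os.db", backup_dir="backups"):
-     pathlib.Path(backup_dir).mkdir(exist_ok=True)
-     stamp = datetime.date.today().isoformat()
-     src = sqlite3.connect(db_path)
-     dst = sqlite3.connect(f"{backup_dir}/research_os_{stamp}.db")
-     src.backup(dst)          # safe copy of the whole database, even while in use
-     dst.close()
-     src.close()
- ```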
556
 
557
  ---
558
 
559
- ## 11. The Human Side
560
 
561
- ### Blindspot HU-1: Information Overload 🟠
562
-
563
- **What's wrong**: If you ingest 100 papers, you might get 3,000+ claims. The UI shows them all equally. There's no way to say "show me only the most important 20 claims" or "show me only the contradictions."
564
 
565
- **The fix**: Priority-ranked views:
566
- - **Dashboard**: Top 10 most surprising findings, Top 5 contradictions, Top 3 gaps
567
- - **Deep dive**: Filterable by section, tag, confidence, paper, topic
568
- - **Review queue**: Only items that need human attention (conflicts, low-confidence, out-of-domain)
569
 
570
- ### Blindspot HU-2: No Explanation of Scores 🟡
571
 
572
- **What's wrong**: The UI shows "Confidence: 0.723" but doesn't explain how that number was calculated. A researcher can't tell if 0.723 is good or bad, or why it's 0.723 and not 0.85.
573
 
574
- **The fix**: Show the breakdown:
575
- - Evidence strength: 0.85 (strong direct measurement)
576
- - Study quality: ×1.0 (primary experimental)
577
- - Journal tier: ×0.85 (Tier 2 journal)
578
- - Section: ×1.0 (Results)
579
- - Completeness: ×1.0 (all fields present)
580
- - = 0.723
581
 
582
- ### Blindspot HU-3: Automation Bias Risk 🟡
583
 
584
- **What's wrong**: Once a researcher trusts the system, they might stop checking its work. If the system says "Fact, confidence 0.92," the researcher might accept it without reading the original paper.
585
 
586
- **The fix**: Built-in friction:
587
- - Randomly hide the confidence score for 10% of claims and ask the researcher to rate it themselves
588
- - Periodically show a "quiz" — "The system classified this as a Fact. Do you agree?"
589
- - Track how often the researcher overrides the system — if overrides drop to zero, they might be over-trusting
590
 
591
  ---
592
 
593
- ## 12. Security and Safety
594
 
595
- ### Blindspot SEC-1: No Input Validation 🟠
 
 
596
 
597
- **What's wrong**: The parser accepts any file and tries to process it. There's no check for:
598
- - Malicious PDFs (they exist — PDFs can contain JavaScript and exploits)
599
- - Extremely large files (a 500MB PDF would crash the system)
600
- - Non-PDF files disguised as PDFs
601
- - Files with embedded malware
602
 
603
- ### Blindspot SEC-2: SQL Injection Risk 🟡
604
 
605
- **What's wrong**: While most database queries use parameterized queries (safe), the `get_stats()` function in `database.py` uses f-strings to construct table names: `f"SELECT COUNT(*) FROM {table}"`. If an attacker could control the table name, they could inject SQL.
606
 
607
- *Note*: This is low risk because the table names are hardcoded in the function, not from user input. But it's a bad pattern that could become dangerous if the code is modified later.
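
- The standard fix is a whitelist check, because table names cannot be passed as SQL parameters. A minimal sketch (the table names listed are placeholders, not the real schema):

- ```python
- # Whitelist the table name before formatting it into the query.
- # The table names listed here are placeholders, not the real schema.
- ALLOWED_TABLES = frozenset({"papers", "claims", "canonical_claims"})
- def count_rows(conn, table: str) -> int:
-     if table not in ALLOWED_TABLES:
-         raise ValueError(f"Refusing to query unexpected table: {table!r}")
-     return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
- ```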
608
 
609
- ### Blindspot SEC-3: No Rate Limiting on API Calls 🟡
610
 
611
- **What's wrong**: If the AI brain is connected to an external API (Claude, GPT), there's no limit on how many calls can be made. Processing 1,000 papers could generate thousands of API calls costing hundreds of dollars.
612
 
613
  ---
614
 
615
- ## 13. What To Build First
 
616
 
617
- Based on impact and dependencies, here's the build order:
618
 
619
- ### Phase 1: Make It Work (Weeks 1-4)
620
- 1. **Integrate Marker parser** — Everything else depends on good PDF reading
621
- 2. **Add embedding model** — Needed for smart deduplication and conflict detection
622
- 3. **Create gold standard test set** — 10 expert-labeled papers. Can't improve what you can't measure
623
- 4. **Connect the real AI model** — Stop using mock extraction
 
624
 
625
- ### Phase 2: Make It Reliable (Weeks 5-8)
626
- 5. **Add constrained decoding** — Guarantee valid JSON output
627
- 6. **Build proper training pipeline** — One continuous training job, not micro-batches
628
- 7. **Implement weighted qualifier penalties** — Not all hedges are equal
629
- 8. **Add out-of-distribution detection** — Know when to say "I don't know"
630
 
631
- ### Phase 3: Make It Smart (Weeks 9-16)
632
- 9. **Expand training data to 10K+** — With real paper examples and hard negatives
633
- 10. **Implement DPO training** — Learn from preference pairs
634
- 11. **Implement GRPO training** — Custom reward functions for epistemic correctness
635
- 12. **Add multi-dimensional calibration** — 6 separate Brier scores
636
 
637
- ### Phase 4: Make It Trustworthy (Weeks 17-24)
638
- 13. **Build counterfactual evaluation suite** — Test for the right reasons
639
- 14. **Add paper-level evaluation splits** — Honest accuracy numbers
640
- 15. **Implement human feedback loop** — Corrections become training data
641
- 16. **Add drift detection and monitoring** — Catch problems early
 
 
 
642
 
643
- ### Phase 5: Make It Scale (Weeks 25-32)
644
- 17. **Add figure/table specialist models** — Handle 100% of a paper, not just text
645
- 18. **Build the router** — Right model for right content
646
- 19. **Implement supplement handling** — Paper bundles
647
- 20. **Add language detection** — Handle non-English papers safely
648
 
649
- ---
 
 
 
 
650
 
651
- ## Summary of All Blindspots
652
-
653
- | ID | Category | Severity | Description |
654
- |----|----------|----------|-------------|
655
- | D-1 | Data | 🔴 Critical | Training data is synthetic, not from real papers |
656
- | D-2 | Data | 🔴 Critical | No hard negative examples |
657
- | D-3 | Data | 🟠 High | No error taxonomy in training |
658
- | D-4 | Data | 🟠 High | Teacher distribution signals not preserved |
659
- | D-5 | Data | 🟡 Medium | No counterfactual training examples |
660
- | P-1 | Parser | 🔴 Critical | No ML-based parser (Marker) integrated |
661
- | P-2 | Parser | 🟠 High | Section detection is fragile |
662
- | P-3 | Parser | 🔴 Critical | Tables extracted as flat text |
663
- | P-4 | Parser | 🟠 High | Figures completely ignored |
664
- | P-5 | Parser | 🟡 Medium | Cross-references detected but not verified |
665
- | P-6 | Parser | 🟠 High | No supplement handling |
666
- | P-7 | Parser | 🟡 Medium | No language detection |
667
- | B-1 | Brain | 🔴 Critical | Model too small (3B) |
668
- | B-2 | Brain | 🟠 High | No specialist heads |
669
- | B-3 | Brain | 🟠 High | No domain pre-training |
670
- | B-4 | Brain | 🔴 Critical | Mock extraction is the default |
671
- | B-5 | Brain | 🟠 High | No constrained decoding |
672
- | B-6 | Brain | 🟠 High | No out-of-distribution detection |
673
- | M-1 | Memory | 🔴 Critical | Dedup uses word overlap, not semantic matching |
674
- | M-2 | Memory | 🟠 High | No embedding model connected |
675
- | M-3 | Memory | 🟠 High | Conflict detection uses word overlap |
676
- | M-4 | Memory | 🟡 Medium | No temporal reasoning |
677
- | M-5 | Memory | 🟡 Medium | Gap analysis only same-type nodes |
678
- | S-1 | Scoring | 🟠 High | Qualifier penalty too simple |
679
- | S-2 | Scoring | 🟠 High | No multi-dimensional calibration |
680
- | S-3 | Scoring | 🟡 Medium | Source count bonus missing from code |
681
- | S-4 | Scoring | 🟡 Medium | Conflict penalty missing from code |
682
- | S-5 | Scoring | 🟡 Medium | Composite is just an average |
683
- | T-1 | Testing | 🔴 Critical | No gold standard test set |
684
- | T-2 | Testing | 🟠 High | No counterfactual robustness tests |
685
- | T-3 | Testing | 🔴 Critical | No paper-level evaluation splits |
686
- | T-4 | Testing | 🟡 Medium | Regression gate doesn't block deployment |
687
- | T-5 | Testing | 🟡 Medium | No stochastic testing |
688
- | TR-1 | Training | 🔴 Critical | Only SFT built, not DPO/GRPO/ConfTuner |
689
- | TR-2 | Training | 🟠 High | ZeroGPU micro-batch training is suboptimal |
690
- | TR-3 | Training | 🟡 Medium | No curriculum learning |
691
- | TR-4 | Training | 🟡 Medium | No task-specific eval during training |
692
- | W-1 | Multi-Model | 🟠 High | Council has no real disagreement mechanism |
693
- | W-2 | Multi-Model | 🟠 High | No router exists |
694
- | W-3 | Multi-Model | 🟠 High | No "I don't know" option |
695
- | H-1 | Longevity | 🟠 High | No drift detection |
696
- | H-2 | Longevity | 🟠 High | No human feedback integration |
697
- | H-3 | Longevity | 🟡 Medium | No taxonomy versioning recovery |
698
- | H-4 | Longevity | 🟡 Medium | No database backup strategy |
699
- | HU-1 | Human | 🟠 High | Information overload in UI |
700
- | HU-2 | Human | 🟡 Medium | No score explanation |
701
- | HU-3 | Human | 🟡 Medium | Automation bias risk |
702
- | SEC-1 | Security | 🟠 High | No input validation |
703
- | SEC-2 | Security | 🟡 Medium | SQL injection pattern |
704
- | SEC-3 | Security | 🟡 Medium | No API rate limiting |
705
-
706
- **Total: 50 blindspots** (10 Critical 🔴, 22 High 🟠, 18 Medium 🟡)
707
 
708
  ---
709
 
710
- *This document will be updated iteratively. Each pass adds new blindspots discovered by re-reading with fresh eyes.*
 
1
  # PhD Research OS — Future Improvements Roadmap
2
+ ## Everything That Needs to Be Fixed, and How
3
 
4
  **Written so a high school student can understand every word.**
5
 
6
+ **Version**: 2.0 (Final Compiled Edition)
7
  **Date**: 2026-04-23
8
+ **Status**: Comprehensive — compiled from 87 original audit findings, 12 architectural improvements, and iterative code review
9
+ **Total blindspots catalogued**: 78 (across 14 categories)
10
 
11
  ---
12
 
13
  ## What Is This Document?
14
 
15
+ Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts built. This document lists every problem we've found — from looking at the actual code, not just the plans — and explains how to fix each one.
16
 
17
+ **Who is this for?** Anyone who wants to understand, contribute to, or learn from this project. Every explanation uses everyday language. If you've ever written a school report, you already have the intuition for most of these problems.
18
+
19
+ **How was this made?** We read every file in the repository (60+ files), compared the actual code against the design document, re-read our findings multiple times looking for things we missed, and compiled everything into this single document.
20
 
21
  ---
22
 
23
  ## Table of Contents
24
 
25
+ 1. [The Big Picture](#1-the-big-picture)
26
+ 2. [The Core Philosophy Problem](#2-the-core-philosophy-problem)
27
+ 3. [Data Problems — Your Training Material Is Weak](#3-data-problems)
28
+ 4. [Reading Papers — The PDF Nightmare](#4-reading-papers)
29
+ 5. [The Brain — Your AI Is Underpowered](#5-the-brain)
30
+ 6. [Memory — How the System Remembers](#6-memory)
31
+ 7. [Scoring — Can You Trust the Numbers?](#7-scoring)
32
+ 8. [Testing — How Do You Know It Works?](#8-testing)
33
+ 9. [Training — Teaching the AI Properly](#9-training)
34
+ 10. [Teamwork — Multiple Models Working Together](#10-teamwork)
35
+ 11. [Staying Honest Over Time](#11-staying-honest)
36
+ 12. [The Human Experience](#12-the-human-experience)
37
+ 13. [Security and Safety](#13-security-and-safety)
38
+ 14. [The Companion Agent System](#14-companion-agents)
39
+ 15. [What To Build First](#15-what-to-build-first)
40
+ 16. [Master Blindspot Table](#16-master-table)
41
 
42
  ---
43
 
44
+ ## 1. The Big Picture
45
+
46
+ ### What does this system do?
47
+
48
+ It reads science papers and does 5 things:
49
+ 1. **Extracts** key findings ("This test detected cancer at 0.8 fM")
50
+ 2. **Labels** each finding (Is it a proven fact? An educated guess? A theory?)
51
+ 3. **Scores** how much you should trust each finding (using math formulas, not AI guesses)
52
+ 4. **Compares** findings across papers and spots contradictions
53
+ 5. **Exports** everything to tools researchers already use (Obsidian notes, CSV files)
54
+
55
+ ### What's built vs. what's planned?
56
+
57
+ | Part | Status | Plain English |
58
+ |------|--------|---------------|
59
+ | Database | ✅ Works | The filing cabinet is built |
60
+ | PDF reader | ⚠️ Basic | Can read papers but misses tables, figures, and equations |
61
+ | Claim extractor | ⚠️ Uses fake AI | Has a pretend brain that guesses using word patterns |
62
+ | Deduplicator | ⚠️ Primitive | Checks if two sentences share the same words (misses paraphrases) |
63
+ | Knowledge graph | ⚠️ Basic | Can store connections but can't find subtle contradictions |
64
+ | Scoring engine | ✅ Works | Math formulas work correctly |
65
+ | Evaluation | ⚠️ Incomplete | Counts claims but never checks if they're RIGHT |
66
+ | Obsidian export | ✅ Works | Can create notes for each finding |
67
+ | AI agent helpers | ✅ Framework works | The assistant robots exist but have no real brains yet |
68
+ | Training data | ⚠️ Too small | 1,900 fake examples (need 10,000+ real ones) |
69
+ | Trained model | ⚠️ Too small | Using a 3-billion parameter model (plan says 8-27 billion) |
70
 
71
+ ---
 
 
72
 
73
+ ## 2. The Core Philosophy Problem
 
 
 
 
 
 
 
 
 
 
 
 
74
 
75
+ ### Blindspot CORE-1: The System Is Model-Centered, Not Evidence-Centered 🔴
76
 
77
+ **This is the single most important problem in the entire project.**
78
 
79
+ Right now, the system works like this:
80
+ 1. Give text to AI
81
+ 2. AI writes a summary of what it found
82
+ 3. Hope the summary is correct
83
 
84
+ It should work like this:
85
+ 1. AI scans the text and points to specific sentences
86
+ 2. Each pointed-to sentence gets a label (Fact / Interpretation / Hypothesis)
87
+ 3. You can click on any finding and see the EXACT sentence in the original paper
88
 
89
+ **Analogy**: Imagine writing a school essay. Model-centered is like saying "Shakespeare wrote about love and death" with no page references. Evidence-centered is like saying "In Act 3, Scene 1, line 88, Romeo says '...' which shows the theme of love." Teachers always want the second kind — because they can CHECK it.
90
 
91
+ **What changes in the code**: Every claim in the database needs to be a POINTER to a specific text span (page number, character position, exact quote). The AI's job is to FIND important spans, not to WRITE about what it found. The claim text should be copied directly from the paper, not generated by the AI.
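
+ A minimal sketch of what such a pointer record could look like (the field names are illustrative, not the current database schema):

+ ```python
+ # A claim stored as a pointer into the source document (field names illustrative).
+ from dataclasses import dataclass
+ @dataclass
+ class EvidenceSpan:
+     paper_id: str
+     page: int
+     char_start: int        # offset into the parsed text of that page
+     char_end: int
+     exact_quote: str       # copied verbatim from the paper, never paraphrased
+     section: str           # e.g. "Results"
+     epistemic_tag: str     # "Fact", "Interpretation", "Hypothesis", or "Conflict_Hypothesis"
+ # A checker can re-slice the parsed page text with char_start:char_end and verify
+ # it matches exact_quote, so hallucinated claims are caught mechanically.
+ ```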
92
 
93
+ ### Blindspot CORE-2: Design-Code Gap Is Enormous 🔴
94
 
95
+ **What's wrong**: The SYSTEM_DESIGN.md describes a sophisticated 7-layer system with ML-based parsing, constrained decoding, multi-round council debates, embedding-based deduplication, and a 4-stage training pipeline. The actual code has basic text extraction, pattern-matching classification, word-overlap deduplication, and single-stage training.
96
 
97
+ **Why it matters**: Someone reading the design document would think this is a near-complete system. Someone reading the code would see it's a prototype. This gap can mislead contributors, reviewers, and users.
98
 
99
+ **The fix**: The README and documentation should clearly mark which features are "✅ Implemented," "🚧 Skeleton exists," and "📋 Designed only." Honesty about the current state is itself an epistemic virtue, and this project is all about epistemic honesty.
100
 
101
+ ### Blindspot CORE-3: No End-to-End Pipeline Has Ever Run Successfully 🔴
102
 
103
+ **What's wrong**: There's no evidence (in tests, logs, or documentation) that anyone has ever: taken a real PDF → parsed it → extracted claims → deduplicated them → stored them in the graph → scored them → exported to Obsidian. Each layer has been tested in isolation, but the full pipeline has never been verified end-to-end.
104
 
105
+ **Why it matters**: Individual parts working doesn't mean the whole system works. A car where the engine works, the wheels spin, and the steering turns is still broken if they're not connected properly.
106
 
107
+ **The fix**: Create ONE end-to-end test. Take a real paper. Run it through every layer. Check the output. This single test is worth more than 100 unit tests for building confidence that the system actually does what it claims.
108
 
109
+ ---
110
 
111
+ ## 3. Data Problems — Your Training Material Is Weak
 
 
 
 
112
 
113
+ ### Blindspot D-1: Training Data Is All Fake 🔴
114
 
115
+ All 1,900 training examples were generated by a Python script using templates. Real papers don't write like templates. The AI is learning to process fake text, not real science.
116
 
117
+ **Fix**: Get 100 real papers, have a human expert label them, use those as training data.
118
 
119
+ ### Blindspot D-2: No "Wrong Answer" Examples 🔴
120
 
121
+ The AI only sees correct examples. It never sees examples of ALMOST correct but subtly wrong outputs. It's like studying for a test using only the answer key — you won't recognize trick questions.
122
 
123
+ **Fix**: Run the current model, collect its mistakes, and use those as "here's what NOT to do" training examples (hard-negative mining).
 
 
 
 
 
 
 
 
 
 
124
 
125
+ ### Blindspot D-3: No Error Categories 🟠
126
 
127
+ When the model makes a mistake, we don't know WHY. Was it a dropped unit? A missed qualifier? A wrong section? Without categorizing errors, we can't target specific weaknesses.
128
 
129
+ **The 9 error types to track**:
130
+ 1. **Qualifier loss** — "may reduce" becomes "reduces"
131
+ 2. **Unit drop** — "0.8 fM" becomes "0.8"
132
+ 3. **Negation flip** — "did NOT work" becomes "worked"
133
+ 4. **Section confusion** — Abstract claim scored as Results
134
+ 5. **Number hallucination** — Paper says 0.8, AI says 0.6
135
+ 6. **Granularity mismatch** — One table row treated as a paper-level conclusion
136
+ 7. **Citation theft** — A finding from a cited paper treated as this paper's finding
137
+ 8. **Causal overclaim** — "correlated with" becomes "caused by"
138
+ 9. **Statistics loss** — p-value or sample size dropped
139
 
140
+ ### Blindspot D-4: Teacher Disagreement Not Preserved 🟠
141
 
142
+ The system design talks about running 5 AI models on the same paper and keeping ALL answers. The current pipeline just generates one answer. The disagreement between multiple models IS the most valuable training signal — it teaches the student model about uncertainty.
 
 
 
 
 
143
 
144
+ ### Blindspot D-5: No Counterfactual Examples 🟡
145
 
146
+ We never create "mirror" training examples where one key word is changed (add "not", swap a unit) to test if the model notices. Without these, we can't tell if the model is actually reading or just pattern-matching.
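
A tiny sketch of how mirror examples could be generated; the word substitutions and expected-behavior notes are illustrative only:

```python
def make_counterfactuals(sentence: str) -> list[tuple[str, str]]:
    """Return (mirrored sentence, what the correct output should do) pairs."""
    variants = []
    if " significant " in sentence:
        variants.append((sentence.replace(" significant ", " not significant "),
                         "the tag or confidence must change"))
    if " increased " in sentence:
        variants.append((sentence.replace(" increased ", " decreased "),
                         "the extracted direction must flip"))
    return variants


print(make_counterfactuals("The treatment produced a significant increase in yield."))
```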
147
 
148
+ ### Blindspot D-6: Training Data Only Covers 5 Domains 🟡
149
 
150
+ The data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. Feed it a paper about ecology, psychology, or economics and it will hallucinate confidently. The data needs to be broader, or the system needs to refuse out-of-domain papers.
151
 
152
  ---
153
 
154
+ ## 4. Reading Papers — The PDF Nightmare
155
 
156
+ ### Blindspot P-1: No Smart PDF Parser 🔴
157
 
158
+ The code uses basic text scrapers (PyMuPDF, pdfplumber) that destroy document structure. The design says to use Marker (ML-based), but it's not connected. This is the #1 dependency: everything downstream depends on good parsing.
159
 
160
+ ### Blindspot P-2: Tables Become Unreadable 🔴
161
 
162
+ Tables are extracted as flat pipe-separated text. The relationship between column headers and cell values is lost. "0.8" without knowing it's the "LOD" column is meaningless.
163
 
164
+ **Fix**: Store tables as structured data with headers and rows preserved.
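
A sketch of what a structured table record could look like; the format is an assumption, not the current parser output:

```python
# One table stored with its structure intact, so "0.8" can always be traced
# back to its column ("LOD (fM)") and row ("Sensor A").
table = {
    "caption": "Table 2. Analytical performance of the sensors.",
    "headers": ["Sensor", "LOD (fM)", "Linear range (fM)"],
    "rows": [
        ["Sensor A", "0.8", "1-1000"],
        ["Sensor B", "2.5", "10-5000"],
    ],
}


def cell(table: dict, row_label: str, column: str) -> str:
    """Look up a value by row label and column header."""
    col_idx = table["headers"].index(column)
    for row in table["rows"]:
        if row[0] == row_label:
            return row[col_idx]
    raise KeyError(f"{row_label} not found in table")


print(cell(table, "Sensor A", "LOD (fM)"))  # -> "0.8", and we know it's an LOD
```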
 
 
 
165
 
166
+ ### Blindspot P-3: Figures Are Completely Skipped 🟠
167
 
168
+ The parser detects images and stores "[Image detected requires VLM processing]". No analysis happens. 30-40% of key evidence in science papers is in figures.
 
 
 
 
 
169
 
170
+ ### Blindspot P-4: Section Detection Is Brittle 🟠
171
 
172
+ Sections are detected by looking for the word "Abstract" or "Results" at the start of a line. Papers that number sections ("2.1 Experimental Setup"), use non-standard names, or combine sections ("Results and Discussion") are misidentified. Since section identity directly affects confidence scores, this causes systematic scoring errors.
173
 
174
+ ### Blindspot P-5: Equations Become Garbage 🟠
175
 
176
+ Mathematical equations are not handled. LaTeX, inline math, and special symbols are either garbled or dropped. For papers in physics, chemistry, or engineering, equations ARE the findings.
177
 
178
+ ### Blindspot P-6: Supplements Can't Be Linked to Main Papers 🟠
179
 
180
+ The parser processes one file at a time. There's no way to say "this supplement belongs to that main paper." In biology and chemistry, the best data is often in supplements.
181
 
182
+ ### Blindspot P-7: No Language Detection 🟡
 
 
 
 
 
 
 
 
 
183
 
184
+ If someone uploads a Chinese or Japanese paper, the system produces English-looking garbage with high confidence. It should detect the language and refuse or flag reduced confidence.
185
 
186
+ ### Blindspot P-8: Cross-References Are Detected But Never Verified 🟡
187
 
188
+ The code finds "see Table 2" in the text but never checks if the parsed Table 2 is actually Table 2. If tables are mislabeled, every claim referencing them has wrong evidence.
189
 
190
+ ### Blindspot P-9: No File Safety Checks 🟠
191
 
192
+ The parser accepts any file with no checks for malicious PDFs (they exist), extremely large files, or corrupted data. A 500MB PDF would crash the system.
 
 
 
 
 
193
 
194
+ ### Blindspot P-10: Chunking Ignores Table/Caption Integrity 🟡
195
 
196
+ The section-aware chunker in `parser.py` merges consecutive body text regions, but tables and their captions can be split across chunks. A table in one chunk and its caption in another means the AI sees numbers without context.
197
 
198
+ ---
199
 
200
+ ## 5. The Brain — Your AI Is Underpowered
201
 
202
+ ### Blindspot B-1: Model Is Too Small 🔴
203
 
204
+ Qwen2.5-3B (current) is like a bright middle schooler: it can follow instructions but doesn't deeply understand scientific reasoning. The design says to upgrade to Qwen3-8B or Qwen3.5-27B. This hasn't happened.
205
 
206
+ ### Blindspot B-2: The Fake Brain Is the Default 🔴
207
 
208
+ When no real AI is connected (which is the default), a `_extract_mock()` function runs that classifies claims by keyword matching: any sentence with "measured" → Fact, any with "may" → Hypothesis. This is wildly inaccurate.
209
 
210
+ Example failures:
211
+ - "We measured nothing significant" → Mock says Fact → Actually a null result
212
+ - "The most reliable technique measured to date may revolutionize the field" → Mock says both Fact AND Hypothesis
213
 
214
+ ### Blindspot B-3: No Output Guarantees 🟠
215
 
216
+ The system asks the AI to output JSON and hopes for the best. Without constrained decoding (Guidance library), the model can produce broken JSON, invalid tags, or mixed text-and-JSON output.
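
Until real constrained decoding is wired in, a stopgap sketch like the one below could at least reject untrustworthy output instead of passing it downstream; the key names are hypothetical:

```python
import json

REQUIRED_KEYS = {"claim_text", "tag", "qualifiers", "confidence"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis"}


def parse_extraction(raw: str) -> dict | None:
    """Return the parsed claim dict, or None if the output can't be trusted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # broken JSON -> retry or flag, don't guess
    if not REQUIRED_KEYS.issubset(data) or data.get("tag") not in VALID_TAGS:
        return None                      # missing fields or an invalid tag
    return data
```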
217
 
218
+ ### Blindspot B-4: One Brain for Six Different Jobs 🟠
219
 
220
+ One model does: claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want opposite behaviors (extraction wants to catch everything; qualifier detection wants surgical precision).
221
 
222
+ **Fix**: Shared base model with specialized output heads for different tasks.
223
 
224
+ ### Blindspot B-5: No Scientific Vocabulary Training 🟠
225
 
226
+ The base model was trained on general internet text. It doesn't deeply know what "LOD," "analyte," "p-value," or "cyclic voltammetry" mean. It can extract the words but might misclassify their importance.
227
 
228
+ ### Blindspot B-6: Can't Detect "I Don't Know This Field" 🟠
 
 
 
 
 
 
229
 
230
+ No out-of-distribution detection. Feed it a sociology paper and it produces confident-looking scientific claims that are nonsense. The model should flag when content is too different from its training data.
231
 
232
+ ### Blindspot B-7: No Verification of Claims Against Source 🟠
233
 
234
+ The design describes a "chain-of-verification" approach where a second model re-reads the source and checks if the extracted claim is actually supported. This isn't built. The system trusts its first extraction with no double-checking.
235
 
236
+ ---
 
 
 
 
237
 
238
+ ## 6. Memory — How the System Remembers
239
 
240
+ ### Blindspot M-1: Deduplication Uses Word Overlap, Not Meaning 🔴
241
 
242
+ Two claims that say the same thing in different words are treated as different findings. "The LOD was 0.8 fM" and "A detection limit of 0.8 femtomolar was achieved" would NOT be recognized as duplicates because they share few words.
243
 
244
+ **Fix**: Use an embedding model that converts sentences into number vectors. Similar meanings = similar vectors, regardless of word choice.
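
A sketch of meaning-based duplicate detection using sentence-transformers; the model name and threshold are illustrative and would need tuning on real data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, not a recommendation

claim_a = "The LOD was 0.8 fM."
claim_b = "A detection limit of 0.8 femtomolar was achieved."

emb_a, emb_b = model.encode([claim_a, claim_b], convert_to_tensor=True)
similarity = util.cos_sim(emb_a, emb_b).item()

# The two sentences share almost no words, but their embeddings are close,
# so a simple threshold catches the paraphrase.
if similarity > 0.85:  # hypothetical threshold
    print("Likely duplicates:", round(similarity, 2))
```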
245
 
246
+ ### Blindspot M-2: No Embedding Model Exists in the Code 🟠
247
 
248
+ The design describes embedding-based similarity. The code imports no embedding model. No sentence-transformers, no ChromaDB, nothing. The entire deduplication and conflict detection system relies on word counting.
249
 
250
+ ### Blindspot M-3: Conflict Detection Is Limited to 500 Claims 🟠
251
 
252
+ The `find_conflicts()` method only checks the top 500 claims by confidence. If important claim #501 contradicts claim #10, it'll never be found. And it uses the same word-overlap approach, so semantically similar but differently-worded contradictions are invisible.
253
 
254
+ ### Blindspot M-4: No Concept of Time 🟡
 
 
 
255
 
256
+ A finding from 2015 and a finding from 2025 are weighted equally. But the 2025 paper might specifically disprove the 2015 one. The graph needs temporal reasoning.
257
 
258
+ ### Blindspot M-5: Gap Analysis Is Too Narrow 🟡
259
 
260
+ The gap-finding algorithm only looks for missing connections between same-type nodes. But the most interesting gaps are between different types, such as a method that's never been applied to a particular material.
 
 
 
 
261
 
262
+ ### Blindspot M-6: No Retraction Checking 🟠
263
 
264
+ If a paper gets retracted (pulled back because it was wrong or fraudulent), the system doesn't know. Claims from retracted papers sit in the knowledge graph with full confidence.
265
 
266
+ **Fix**: Check CrossRef/Retraction Watch before ingesting a paper. If a paper in the database gets retracted later, propagate that information to all its claims.
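
A sketch of how a retraction could be propagated once detected; the table and column names are hypothetical:

```python
import sqlite3


def mark_paper_retracted(db_path: str, paper_id: str) -> None:
    """Flag a retracted paper and push the flag down to every claim it produced."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("UPDATE papers SET retracted = 1 WHERE paper_id = ?", (paper_id,))
        conn.execute(
            "UPDATE claims SET retraction_flag = 1, confidence = confidence * 0.1 "
            "WHERE paper_id = ?",
            (paper_id,),
        )
```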
267
 
268
+ ### Blindspot M-7: Obsidian Export Doesn't Include Graph Relationships 🟡
269
 
270
+ The Obsidian exporter creates notes for claims, conflicts, and goals, but doesn't export the actual graph edges (supports/refutes/extends). A researcher looking at the vault sees individual findings but can't see how they connect.
271
 
272
  ---
273
 
274
+ ## 7. ScoringCan You Trust the Numbers?
 
 
275
 
276
+ ### Blindspot S-1: All Qualifiers Get the Same Penalty 🟠
277
 
278
+ Every qualifier reduces confidence by exactly 0.1. But "may" (very uncertain) and "under controlled laboratory conditions" (limits scope but is quite certain) are treated identically.
 
 
279
 
280
+ **Fix**: Weight qualifier types differently: strong hedges like "may" get -0.20; scope limiters like "in vitro" get -0.05.
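
A sketch of what weighted penalties could look like; the weights are illustrative, not calibrated:

```python
QUALIFIER_WEIGHTS = {
    "may": 0.20, "might": 0.20, "could": 0.15,              # strong hedges
    "suggests": 0.10, "appears to": 0.10,                   # moderate hedges
    "in vitro": 0.05, "under laboratory conditions": 0.05,  # scope limiters
}


def qualifier_penalty(qualifiers: list[str]) -> float:
    """Sum the per-qualifier penalties, capped at 0.5."""
    total = sum(QUALIFIER_WEIGHTS.get(q.lower(), 0.10) for q in qualifiers)
    return min(total, 0.5)


print(qualifier_penalty(["may"]))        # 0.2  -> very uncertain claim
print(qualifier_penalty(["in vitro"]))   # 0.05 -> limited scope, still solid
```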
281
 
282
+ ### Blindspot S-2: Only One Calibration Number 🟠
283
 
284
+ The system tracks one Brier score for overall calibration. But the model might be perfectly calibrated on "did I find a real claim?" and terribly calibrated on "are these two claims contradictory?" One number hides this.
285
 
286
+ **Fix**: Track 6 separate calibration curves: extraction confidence, qualifier confidence, statistical confidence, conflict confidence, section confidence, provenance confidence.
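
A sketch of per-task calibration tracking with made-up prediction data:

```python
def brier(predictions):
    """predictions = [(predicted probability, actual outcome 0 or 1), ...]; lower is better."""
    return sum((p - y) ** 2 for p, y in predictions) / len(predictions)


per_task = {
    "extraction": [(0.9, 1), (0.8, 1), (0.7, 0)],
    "conflict":   [(0.9, 0), (0.6, 0), (0.8, 1)],
}
# One overall number would hide that "conflict" is far worse calibrated here.
print({task: round(brier(preds), 3) for task, preds in per_task.items()})
```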
287
 
288
+ ### Blindspot S-3: Design Features Missing from Code 🟡
289
 
290
+ The SYSTEM_DESIGN.md describes:
291
+ - Source count bonus (+0.2 for claims backed by multiple papers) → NOT in scorer.py
292
+ - Conflict penalty (-0.3 for claims with active conflicts) → NOT in scorer.py
293
+ - Corroboration bonus → NOT in scorer.py
 
 
294
 
295
+ The scorer only uses single-claim information. It ignores the knowledge graph entirely.
296
 
297
+ ### Blindspot S-4: Average Instead of Minimum for Composite Score 🟡
298
 
299
+ Composite confidence = average of three scores. This means terrific evidence + terrible qualifier strength = mediocre composite. A chain is only as strong as its weakest link — use the minimum, not the average.
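
A tiny illustration with made-up scores:

```python
scores = {"evidence": 0.95, "quality": 0.90, "qualifier_strength": 0.20}

average = sum(scores.values()) / len(scores)  # 0.68 -> looks acceptable
weakest = min(scores.values())                # 0.20 -> tells the truth

print(f"average={average:.2f}, weakest_link={weakest:.2f}")
```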
300
 
301
+ ### Blindspot S-5: Parser Confidence Isn't Actually Meaningful 🟡
302
 
303
+ The `score_parse_quality()` function in parser.py assigns quality scores, but the thresholds are arbitrary. What does "garbled_ratio × 3000" actually correspond to? These numbers were chosen by gut feeling, not calibrated against human judgments of parse quality.
 
 
 
 
 
 
304
 
305
  ---
306
 
307
+ ## 8. TestingHow Do You Know It Works?
308
 
309
+ ### Blindspot T-1: No Gold Standard Test Set 🔴
310
 
311
+ The evaluation counts claims and checks distributions but never compares against human-labeled ground truth. The system could produce 500 completely wrong claims and report "all metrics look normal."
312
 
313
+ **Fix**: 10 expert-labeled papers where every claim, tag, qualifier, and conflict is manually annotated. This is the #1 requirement for any ML system — you can't improve what you can't measure.
 
 
 
 
314
 
315
+ ### Blindspot T-2: No Paper-Level Test Splits 🔴
316
 
317
+ Train and test sets are randomly shuffled. Claims from the same paper could be in both. The model might memorize patterns per paper rather than learning to generalize.
 
 
 
 
 
 
318
 
319
+ **Fix**: Split by paper, lab, journal, year, and field. Test on genuinely unseen papers.
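
A sketch of a paper-level split using scikit-learn's `GroupShuffleSplit`, so all claims from one paper stay on the same side of the split:

```python
from sklearn.model_selection import GroupShuffleSplit

claims    = ["claim 1", "claim 2", "claim 3", "claim 4", "claim 5", "claim 6"]
paper_ids = ["paperA", "paperA", "paperB", "paperB", "paperC", "paperC"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=42)
train_idx, test_idx = next(splitter.split(claims, groups=paper_ids))

# No paper appears on both sides, so test accuracy reflects real generalization.
print({paper_ids[i] for i in train_idx}, {paper_ids[i] for i in test_idx})
```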
320
 
321
+ ### Blindspot T-3: No Counterfactual Tests 🟠
 
 
 
 
322
 
323
+ We never check if the model changes its answer for the right reason. Remove a table header — does the claim become Incomplete? Flip "significant" to "not significant" — does the tag change? Without these tests, the model might use shortcuts instead of understanding.
324
 
325
+ ### Blindspot T-4: No Stochastic Testing 🟡
326
 
327
+ LLMs give different answers each time. Running evaluation once gives one number that might change by ±5% next time. Run evaluations 5 times and report the range.
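
A sketch of what repeated evaluation could look like; `run_evaluation` is a stand-in for whatever evaluation function the project ends up with:

```python
from statistics import mean, stdev


def evaluate_with_spread(run_evaluation, n_runs: int = 5) -> str:
    """Run the same evaluation several times and report the spread, not a point."""
    scores = [run_evaluation() for _ in range(n_runs)]
    return f"accuracy {mean(scores):.3f} ± {stdev(scores):.3f} over {n_runs} runs"
```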
328
 
329
+ ### Blindspot T-5: Regression Gate Has No Teeth 🟡
330
 
331
+ The `run_regression_gate()` returns pass/fail but nothing blocks a bad model from being used. It's like a fire alarm with no fire department.
332
 
333
+ ### Blindspot T-6: No Inter-Annotator Agreement Tracking 🟡
334
 
335
+ When multiple people label the gold standard, they'll disagree on some claims. That disagreement is important information — it tells you which categories are genuinely ambiguous. The system has no mechanism to measure or track this.
336
 
337
+ ### Blindspot T-7: 143 Tests, Zero of Them Test Science Quality 🟠
338
 
339
+ All 143 tests verify that the CODE runs correctly (functions don't crash, data is stored properly). Zero tests verify that the SCIENCE is right (extracted claims match what the paper actually says). Code tests and science tests are completely different things.
340
 
341
  ---
342
 
343
+ ## 9. TrainingTeaching the AI Properly
 
 
 
 
 
 
 
 
 
 
 
 
 
 
344
 
345
+ ### Blindspot TR-1: Only Stage 1 of 4 Is Built 🔴
346
 
347
+ The 4-stage pipeline: SFT → DPO → GRPO → ConfTuner. Only SFT exists. This is like building a race car but only installing first gear.
348
 
349
+ - **SFT** (Stage 1): "Here's the right answer" → ~70% quality
350
+ - **DPO** (Stage 2): "This answer is better than that one" → ~80% quality
351
+ - **GRPO** (Stage 3): "Reward for good JSON, correct tags, preserved qualifiers" → ~85-90% quality
352
+ - **ConfTuner** (Stage 4): "Make your confidence numbers match reality" → Calibrated confidence
353
 
354
+ ### Blindspot TR-2: ZeroGPU Micro-Batching Breaks Learning 🟠
 
 
 
 
 
 
355
 
356
+ The training Space uses 60-second GPU bursts. The learning rate resets, the model reloads, and there's no training momentum. It's like studying in 2-minute bursts with naps between each one.
357
 
358
+ **Fix**: One continuous training job on proper GPU (HF Jobs or local).
359
 
360
+ ### Blindspot TR-3: No Curriculum Learning 🟡
 
 
 
 
 
 
361
 
362
+ Easy and hard examples are randomly mixed. Humans learn better easy-to-hard. The model should first learn single facts, then fact vs. interpretation, then multi-claim papers, then contradictions.
363
 
364
+ ### Blindspot TR-4: Loss Number Doesn't Mean Task Success 🟡
365
 
366
+ Training tracks eval_loss, which tells you if the model is generating likely-looking text. It does NOT tell you if the JSON is valid, tags are correct, qualifiers are preserved, or numbers are accurate. Task-specific evaluation during training is essential.
367
 
368
+ ### Blindspot TR-5: No Training on Real Model Failures 🟠
369
 
370
+ Hard-negative mining (collecting the model's actual mistakes and training on them) is the single most efficient way to improve quality. It's described in the earlier expert review but not implemented anywhere. You need:
371
+ 1. Run current model on 100 papers
372
+ 2. Manually check which extractions are wrong
373
+ 3. Categorize WHY they're wrong (using the 9 error types)
374
+ 4. Train specifically on those failure cases
375
 
376
  ---
377
 
378
+ ## 10. TeamworkMultiple Models Working Together
 
 
 
 
 
 
 
 
 
 
379
 
380
+ ### Blindspot W-1: Council Is Sequential, Not Debating 🟠
381
 
382
+ The design describes a multi-round debate (Extractor → Critic → Chairman, with hidden confidence and revealed reasoning). The code is sequential: one call per member, no back-and-forth. The disagreement signal (which IS the whole point) is lost.
 
 
 
383
 
384
+ ### Blindspot W-2: No Router 🟠
385
 
386
+ Everything goes to the same model. Text, tables, figures, and equations are all handled the same way. A router should decide: regular text → language model, table → specialized parser, figure → vision model, garbage → skip.
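
A sketch of what a minimal router could look like; the handler names are placeholders:

```python
def route(region: dict) -> str:
    """Decide which component should handle one parsed region of a paper."""
    kind = region.get("type")
    if kind == "table":
        return "table_parser"      # deterministic structured parser
    if kind == "figure":
        return "vision_model"      # a VLM for plots and images
    if kind == "text" and region.get("garbled_ratio", 0.0) < 0.2:
        return "language_model"    # the normal extraction path
    return "skip_and_flag"         # unreadable content -> don't guess
```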
387
 
388
+ ### Blindspot W-3: System Can Never Say "I Don't Know" 🟠
389
 
390
+ Every input gets a confident-looking answer. There's no mechanism to say "I can't read this page," "I don't understand this field," or "I need a human to look at this." Confident garbage is the worst failure mode.
391
 
392
+ ### Blindspot W-4: Teachers Are Treated As Oracles 🟠
393
 
394
+ The design proposes using 5 frontier AI models as teachers. But these models share biases — they were all trained on similar internet text, they all struggle with the same types of negation, and they can all be wrong in the same direction.
395
 
396
+ **Fix**: Include at least one NON-AI anchor for each data type: deterministic table parser for tables, regex for statistics, CrossRef API for citations. Where the rules-based tool disagrees with all 5 AI models, investigate: the AIs might be harmonizing around an error.
397
 
398
+ ---
 
 
 
 
399
 
400
+ ## 11. Staying Honest Over Time
401
 
402
+ ### Blindspot H-1: No Drift Detection 🟠
403
 
404
+ If the model slowly gets worse (because it encounters new sub-fields or vocabulary evolves), nobody notices. Weekly re-evaluation against the gold standard is needed.
405
 
406
+ ### Blindspot H-2: Human Corrections Disappear 🟠
407
 
408
+ When a researcher fixes a wrong tag or adds a missing qualifier in the UI, that correction isn't saved for training. Those corrections are the most valuable data you can get — each one is an expert-labeled example from your exact domain.
409
 
410
+ ### Blindspot H-3: Taxonomy Changes Break Old Scores 🟡
411
 
412
+ If study type weights change (e.g., "in_vitro" goes from 0.85 to 0.80), old claims aren't re-scored. Claims from different taxonomy versions can't be compared meaningfully.
413
 
414
+ ### Blindspot H-4: No Backup Strategy 🟡
415
 
416
+ Everything is in one SQLite file. No automatic backups, no periodic snapshots, no way to recover from corruption or accidental deletion.
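
A sketch of an automated snapshot using Python's built-in SQLite backup API; the paths and naming scheme are illustrative:

```python
import os
import sqlite3
from datetime import datetime


def backup_database(db_path: str, backup_dir: str = "backups") -> str:
    """Write a timestamped snapshot of the research database."""
    os.makedirs(backup_dir, exist_ok=True)
    dest_path = os.path.join(backup_dir, f"research_os_{datetime.now():%Y%m%d_%H%M%S}.db")
    with sqlite3.connect(db_path) as src, sqlite3.connect(dest_path) as dest:
        src.backup(dest)  # consistent copy even while the database is in use
    return dest_path
```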
 
 
 
 
 
 
417
 
418
+ ### Blindspot H-5: No Retraction Monitoring 🟠
419
 
420
+ Once a paper is ingested, its claims live in the graph forever, even if the paper is retracted (pulled back for fraud or errors). The system needs to periodically check for retractions and propagate that information.
421
 
422
+ ### Blindspot H-6: v1 → v2 Migration Path Undefined 🟡
423
 
424
+ The repo has both `phd_research_os/` (v1) and `phd_research_os_v2/` (v2) with different database schemas. There's no migration script for anyone who has data in v1 format. Users could lose work when upgrading.
 
 
 
 
425
 
426
  ---
427
 
428
+ ## 12. The Human Experience
429
 
430
+ ### Blindspot HU-1: Information Overload 🟠
431
+
432
+ 100 papers → 3,000+ claims. The UI treats them all equally. Researchers need priority ranking: most surprising findings first, then contradictions, then gaps.
433
 
434
+ ### Blindspot HU-2: Scores Are Unexplained Numbers 🟡
435
 
436
+ "Confidence: 0.723" means nothing without a breakdown. Show the multiplication chain: evidence × quality × tier × section × completeness = 0.723.
437
 
438
+ ### Blindspot HU-3: Risk of Blind Trust 🟡
439
 
440
+ Once a researcher trusts the system, they stop checking. Built-in friction (hiding scores randomly, asking "do you agree?", tracking override rates) prevents over-reliance.
441
 
442
+ ### Blindspot HU-4: No "Fresh User" Onboarding 🟡
443
 
444
+ A new graduate student encountering the system for the first time sees a complex 7-layer architecture with 87 blindspots. There's no tutorial, no "start here" guide, no simplified workflow for beginners.
445
 
446
+ **Fix**: A "Getting Started" mode that processes one paper and walks the user through each step: "Here's what the parser found Here's what the AI extracted Here's how we scored it Here's what we're not sure about."
447
 
448
+ ### Blindspot HU-5: No Accessibility Considerations 🟡
449
 
450
+ The UI has no dark mode, no font size options, no screen reader support. Research tools should be accessible to everyone.
451
 
452
  ---
453
 
454
+ ## 13. Security and Safety
455
 
456
+ ### Blindspot SEC-1: No Input Validation 🟠
 
 
457
 
458
+ No file size limits, no malicious PDF detection, no format verification. A bad file could crash or compromise the system.
 
 
 
459
 
460
+ ### Blindspot SEC-2: Risky SQL Pattern 🟡
461
 
462
+ The `get_stats()` function uses f-strings for table names. Currently safe because the names are hardcoded, but it's a bad pattern that could become dangerous if someone modifies the code.
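
A sketch of a safer pattern: validate table names against a whitelist before they ever reach an f-string (the table names are illustrative):

```python
import sqlite3

ALLOWED_TABLES = {"papers", "claims", "conflicts"}  # hypothetical table names


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in a known table without trusting arbitrary strings."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Unknown table: {table!r}")
    # Table names can't be bound as SQL parameters, so the whitelist check
    # above is what keeps the f-string below safe.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```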
463
 
464
+ ### Blindspot SEC-3: No API Cost Controls 🟡
 
 
 
 
 
 
465
 
466
+ No rate limiting on external API calls. Processing 1,000 papers could generate thousands of expensive API requests with no spending cap.
467
 
468
+ ### Blindspot SEC-4: No Data Privacy for Unpublished Work 🟡
469
 
470
+ The design describes an "Epistemic Embargo" (private graphs for unpublished research), but it's not implemented. A researcher analyzing their unpublished data has no privacy guarantees.
 
 
 
471
 
472
  ---
473
 
474
+ ## 14. The Companion Agent System
475
 
476
+ ### Blindspot A-1: Agents Have No Real Brain 🟠
477
+
478
+ The companion agents (DataQualityAuditor, PromptOptimizer, etc.) are fully coded with lifecycle management, audit trails, and proposal systems. But when no AI model is connected (the default), they generate placeholder proposals that say "Brain not configured." They're robots with no intelligence.
479
 
480
+ ### Blindspot A-2: Agent Plans Are Hardcoded, Not Dynamic 🟡
 
 
 
 
481
 
482
+ The `_plan()` method assigns fixed step lists based on agent type. A DataQualityAuditor always gets the same 4 steps regardless of what task it's given. Real planning would look at the task, check what data is available, and decide what to do.
483
 
484
+ ### Blindspot A-3: No Agent Coordination 🟡
485
 
486
+ Multiple agents can run simultaneously but they can't talk to each other. If the DataQualityAuditor finds a problem and the PromptOptimizer could fix it, there's no mechanism for them to coordinate.
487
 
488
+ ### Blindspot A-4: Proposal Review Has No Urgency System 🟡
489
 
490
+ All proposals wait equally for human review. But some (like "this paper was retracted, so its claims should be removed") are urgent, while others (like "consider adding training examples for this domain") are routine. No priority system exists.
491
 
492
  ---
493
 
494
+ ## 15. What To Build First
495
+
496
+ Everything has dependencies. You can't build the roof before the walls. Here's the order that makes engineering sense:
497
+
498
+ ### 🏗️ Foundation Phase (Weeks 1-4) — Nothing works without this
499
+ | # | Task | Why First |
500
+ |---|------|-----------|
501
+ | 1 | Integrate Marker PDF parser | Every downstream layer depends on accurate parsing |
502
+ | 2 | Create gold standard test set (10 real papers, expert-labeled) | Can't measure improvement without ground truth |
503
+ | 3 | Add embedding model (sentence-transformers) | Needed for smart deduplication and conflict detection |
504
+ | 4 | Run one end-to-end pipeline test | Prove the layers actually connect |
505
+
506
+ ### 🔧 Reliability Phase (Weeks 5-8) — Make it work correctly
507
+ | # | Task | Why Next |
508
+ |---|------|----------|
509
+ | 5 | Connect real AI model (replace mock extractor) | The system needs a real brain |
510
+ | 6 | Add constrained decoding (Guidance library) | Guarantee valid JSON output |
511
+ | 7 | Build continuous training pipeline (replace ZeroGPU micro-batching) | Proper training produces proper models |
512
+ | 8 | Implement weighted qualifier penalties | Not all hedging words are equal |
513
+
514
+ ### 🧠 Intelligence Phase (Weeks 9-16) — Make it smart
515
+ | # | Task | Why This Order |
516
+ |---|------|----------------|
517
+ | 9 | Expand training data to 10K+ with hard negatives | More and better data = better model |
518
+ | 10 | Implement DPO training (preference learning) | Stage 2 of training pipeline |
519
+ | 11 | Implement GRPO training (reward functions for JSON quality, tag correctness, qualifier preservation) | Stage 3 — the biggest quality jump |
520
+ | 12 | Add out-of-distribution detection | Know when to say "I don't know" |
521
+
522
+ ### ✅ Trust Phase (Weeks 17-24) — Make it honest
523
+ | # | Task | Why Here |
524
+ |---|------|----------|
525
+ | 13 | Build counterfactual evaluation suite | Test that the model reasons correctly, not just accurately |
526
+ | 14 | Add paper-level evaluation splits | Honest accuracy numbers |
527
+ | 15 | Implement human feedback loop | Every correction becomes training data |
528
+ | 16 | Add multi-dimensional calibration (6 Brier scores) | Know which confidence types are trustworthy |
529
+ | 17 | Add drift detection and monitoring | Catch problems before they become crises |
530
+
531
+ ### 🚀 Scale Phase (Weeks 25-32) — Make it complete
532
+ | # | Task | Why Last |
533
+ |---|------|----------|
534
+ | 18 | Add figure/table specialist models | Handle the 30-40% of evidence in images |
535
+ | 19 | Build content router | Right model for right content type |
536
+ | 20 | Implement supplement handling (paper bundles) | Complete paper coverage |
537
+ | 21 | Add retraction checking | Keep the knowledge graph honest |
538
+ | 22 | Build verification pipeline (double-check claims against source) | Reduce hallucinations dramatically |
539
 
540
+ ---
541
 
542
+ ## 16. Master Blindspot Table
543
+
544
+ | ID | Category | Severity | Problem (one sentence) |
545
+ |----|----------|----------|----------------------|
546
+ | CORE-1 | Philosophy | 🔴 | System generates summaries instead of pointing to evidence spans |
547
+ | CORE-2 | Philosophy | 🔴 | Design document describes features the code doesn't have |
548
+ | CORE-3 | Philosophy | 🔴 | End-to-end pipeline has never been tested on a real paper |
549
+ | D-1 | Data | 🔴 | All training data is computer-generated, not from real papers |
550
+ | D-2 | Data | 🔴 | No examples of what wrong output looks like |
551
+ | D-3 | Data | 🟠 | Errors aren't categorized by type |
552
+ | D-4 | Data | 🟠 | Multiple-teacher disagreement signal is thrown away |
553
+ | D-5 | Data | 🟡 | No counterfactual (mirror) training examples |
554
+ | D-6 | Data | 🟡 | Training covers only 5 scientific fields |
555
+ | P-1 | Parser | 🔴 | No ML-based PDF parser is connected |
556
+ | P-2 | Parser | 🔴 | Tables lose their header-value relationships |
557
+ | P-3 | Parser | 🟠 | Figures are detected but never analyzed |
558
+ | P-4 | Parser | 🟠 | Section headers are identified by keyword matching only |
559
+ | P-5 | Parser | 🟠 | Equations are garbled or dropped |
560
+ | P-6 | Parser | 🟠 | Supplementary files can't be linked to main papers |
561
+ | P-7 | Parser | 🟡 | Non-English papers produce garbage silently |
562
+ | P-8 | Parser | 🟡 | "See Table 2" references are found but never verified correct |
563
+ | P-9 | Parser | 🟠 | No file size/safety checks before processing |
564
+ | P-10 | Parser | 🟡 | Tables and their captions can be split across chunks |
565
+ | B-1 | Brain | 🔴 | Model is 3B parameters (design says 8-27B) |
566
+ | B-2 | Brain | 🔴 | Default mode uses keyword matching instead of AI |
567
+ | B-3 | Brain | 🟠 | Output format is not guaranteed (no constrained decoding) |
568
+ | B-4 | Brain | 🟠 | Single model does 6 different tasks that want opposite behaviors |
569
+ | B-5 | Brain | 🟠 | No training on scientific vocabulary specifically |
570
+ | B-6 | Brain | 🟠 | Can't detect when content is outside its training domain |
571
+ | B-7 | Brain | 🟠 | No second-pass verification of extracted claims |
572
+ | M-1 | Memory | 🔴 | Deduplication checks word overlap, not meaning |
573
+ | M-2 | Memory | 🟠 | No embedding model exists anywhere in the code |
574
+ | M-3 | Memory | 🟠 | Conflict detection only checks 500 claims using word overlap |
575
+ | M-4 | Memory | 🟡 | Knowledge graph has no concept of time |
576
+ | M-5 | Memory | 🟡 | Gap analysis only looks within same node types |
577
+ | M-6 | Memory | 🟠 | Retracted papers aren't detected or flagged |
578
+ | M-7 | Memory | 🟡 | Obsidian export doesn't include graph edges |
579
+ | S-1 | Scoring | 🟠 | All qualifier types penalized equally |
580
+ | S-2 | Scoring | 🟠 | One calibration number hides per-task calibration failures |
581
+ | S-3 | Scoring | 🟡 | Design features (source bonus, conflict penalty) not in code |
582
+ | S-4 | Scoring | 🟡 | Composite = average (should be minimum of weakest dimension) |
583
+ | S-5 | Scoring | 🟡 | Parse quality scores are arbitrary, not calibrated |
584
+ | T-1 | Testing | 🔴 | No human-labeled gold standard test set |
585
+ | T-2 | Testing | 🔴 | Train/test split is random, not paper-level |
586
+ | T-3 | Testing | 🟠 | No counterfactual robustness tests |
587
+ | T-4 | Testing | 🟡 | Evaluations run once (should run 5× to check consistency) |
588
+ | T-5 | Testing | 🟡 | Regression gate returns pass/fail but nothing enforces it |
589
+ | T-6 | Testing | 🟡 | No tracking of human annotator agreement |
590
+ | T-7 | Testing | 🟠 | 143 code tests, zero science quality tests |
591
+ | TR-1 | Training | 🔴 | Only 1 of 4 training stages exists |
592
+ | TR-2 | Training | 🟠 | ZeroGPU micro-batching breaks learning continuity |
593
+ | TR-3 | Training | 🟡 | Examples aren't ordered easy → hard |
594
+ | TR-4 | Training | 🟡 | Training eval is loss-based, not task-based |
595
+ | TR-5 | Training | 🟠 | No training on actual model failures |
596
+ | W-1 | Teamwork | 🟠 | Council members don't actually debate |
597
+ | W-2 | Teamwork | 🟠 | No router to send content to appropriate model |
598
+ | W-3 | Teamwork | 🟠 | System can never say "I don't know" |
599
+ | W-4 | Teamwork | 🟠 | Teacher ensemble assumed unbiased (they share biases) |
600
+ | H-1 | Longevity | 🟠 | No automatic detection of model performance degradation |
601
+ | H-2 | Longevity | 🟠 | Human corrections aren't saved for future training |
602
+ | H-3 | Longevity | 🟡 | Taxonomy changes don't trigger re-scoring of old claims |
603
+ | H-4 | Longevity | 🟡 | No database backup automation |
604
+ | H-5 | Longevity | 🟠 | No retraction monitoring for ingested papers |
605
+ | H-6 | Longevity | 🟡 | No migration path from v1 to v2 database schema |
606
+ | HU-1 | Human | 🟠 | 3,000 claims displayed with no priority ranking |
607
+ | HU-2 | Human | 🟡 | Confidence scores show number but no breakdown |
608
+ | HU-3 | Human | 🟡 | No safeguards against over-trusting the system |
609
+ | HU-4 | Human | 🟡 | No beginner-friendly onboarding experience |
610
+ | HU-5 | Human | 🟡 | No accessibility features (dark mode, screen reader, font size) |
611
+ | SEC-1 | Security | 🟠 | No file validation or safety checks |
612
+ | SEC-2 | Security | 🟡 | SQL construction uses f-strings (risky pattern) |
613
+ | SEC-3 | Security | 🟡 | No spending limits on API calls |
614
+ | SEC-4 | Security | 🟡 | No privacy controls for unpublished research |
615
+ | A-1 | Agents | 🟠 | Companion agents have no AI brain connected |
616
+ | A-2 | Agents | 🟡 | Agent plans are fixed templates, not dynamic |
617
+ | A-3 | Agents | 🟡 | Multiple agents can't coordinate with each other |
618
+ | A-4 | Agents | 🟡 | No urgency system for proposal review |
619
+
620
+ ### Summary by Severity
621
+
622
+ | Severity | Count | Meaning |
623
+ |----------|-------|---------|
624
+ | 🔴 Critical | 14 | System fundamentally broken without this |
625
+ | 🟠 High | 33 | Significant quality or reliability problem |
626
+ | 🟡 Medium | 31 | Important but not blocking |
627
+ | **Total** | **78** | |
628
+
629
+ ### Summary by Category
630
+
631
+ | Category | Count | Most Critical Issue |
632
+ |----------|-------|-------------------|
633
+ | Philosophy | 3 | Evidence-centered vs model-centered |
634
+ | Data | 6 | All training data is synthetic |
635
+ | Parser | 10 | No ML-based parser connected |
636
+ | Brain | 7 | Model too small + mock is default |
637
+ | Memory | 7 | Deduplication uses word counting |
638
+ | Scoring | 5 | All qualifiers penalized equally |
639
+ | Testing | 7 | No gold standard test set |
640
+ | Training | 5 | Only stage 1 of 4 built |
641
+ | Teamwork | 4 | Council doesn't debate + no router |
642
+ | Longevity | 6 | No drift detection |
643
+ | Human | 5 | Information overload |
644
+ | Security | 4 | No input validation |
645
+ | Agents | 4 | Agents have no brain |
646
 
647
+ ---
 
 
 
 
648
 
649
+ ## How This Document Was Created
 
 
 
 
650
 
651
+ 1. **Read every file** in the repository (60+ files, 545KB of code and documentation)
652
+ 2. **Compared code against design** — for each feature in SYSTEM_DESIGN.md, checked if it exists in the code
653
+ 3. **Incorporated 87 original blindspots** from BLINDSPOT_AUDIT_COMPLETE.md
654
+ 4. **Incorporated 12 architectural improvements** from previous expert review session
655
+ 5. **Wrote first draft** with 47 blindspots
656
+ 6. **Re-read with fresh eyes** and found 31 additional blindspots
657
+ 7. **Deduplicated and organized** into 78 unique findings across 14 categories
658
+ 8. **Rewrote everything** in high-school-readable language
659
 
660
+ ### Relationship to Other Documents
 
 
 
 
661
 
662
+ | Document | What It Contains | How This One Is Different |
663
+ |----------|-----------------|--------------------------|
664
+ | BLINDSPOT_AUDIT_COMPLETE.md | 87 theoretical failure modes found by adversarial critique | Theoretical — found by thinking about what COULD go wrong |
665
+ | SYSTEM_DESIGN.md | The dream architecture — what the system SHOULD be | Aspirational — describes the finished product |
666
+ | **This document** | **78 practical problems found by reading the actual code** | **Practical — found by comparing code to design** |
667
 
668
+ The key difference: the audit found problems by THINKING. This document found problems by READING THE CODE. Many overlap, but this document catches things the audit missed (like the mock extractor being the default, or the embedding model not existing), and skips things the audit included that are already partially addressed in code.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
669
 
670
  ---
671
 
672
+ *This is the compiled final edition. Each finding is grounded in specific code files and can be verified by reading the source.*