PhD Research OS — Future Improvements Roadmap

Everything That Needs to Be Fixed, and How

Written so a high school student can understand every word.

Version: 2.0 (Final Compiled Edition)
Date: 2026-04-23
Status: Comprehensive — compiled from 87 original audit findings, 12 architectural improvements, and iterative code review
Total blindspots catalogued: 94 (78 original across 14 categories, plus the 16 prior-art items in the appendix)


What Is This Document?

Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts built. This document lists every problem we've found — from looking at the actual code, not just the plans — and explains how to fix each one.

Who is this for? Anyone who wants to understand, contribute to, or learn from this project. Every explanation uses everyday language. If you've ever written a school report, you already have the intuition for most of these problems.

How was this made? We read every file in the repository (60+ files), compared the actual code against the design document, re-read our findings multiple times looking for things we missed, and compiled everything into this single document.


Table of Contents

  1. The Big Picture
  2. The Core Philosophy Problem
  3. Data Problems — Your Training Material Is Weak
  4. Reading Papers — The PDF Nightmare
  5. The Brain — Your AI Is Underpowered
  6. Memory — How the System Remembers
  7. Scoring — Can You Trust the Numbers?
  8. Testing — How Do You Know It Works?
  9. Training — Teaching the AI Properly
  10. Teamwork — Multiple Models Working Together
  11. Staying Honest Over Time
  12. The Human Experience
  13. Security and Safety
  14. The Companion Agent System
  15. What To Build First
  16. Master Blindspot Table

1. The Big Picture

What does this system do?

It reads science papers and does 5 things:

  1. Extracts key findings ("This test detected cancer at 0.8 fM")
  2. Labels each finding (Is it a proven fact? An educated guess? A theory?)
  3. Scores how much you should trust each finding (using math formulas, not AI guesses)
  4. Compares findings across papers and spots contradictions
  5. Exports everything to tools researchers already use (Obsidian notes, CSV files)

What's built vs. what's planned?

| Part | Status | Plain English |
|------|--------|---------------|
| Database | ✅ Works | The filing cabinet is built |
| PDF reader | ⚠️ Basic | Can read papers but misses tables, figures, and equations |
| Claim extractor | ⚠️ Uses fake AI | Has a pretend brain that guesses using word patterns |
| Deduplicator | ⚠️ Primitive | Checks if two sentences share the same words (misses paraphrases) |
| Knowledge graph | ⚠️ Basic | Can store connections but can't find subtle contradictions |
| Scoring engine | ✅ Works | Math formulas work correctly |
| Evaluation | ⚠️ Incomplete | Counts claims but never checks if they're RIGHT |
| Obsidian export | ✅ Works | Can create notes for each finding |
| AI agent helpers | ✅ Framework works | The assistant robots exist but have no real brains yet |
| Training data | ⚠️ Too small | 1,900 fake examples (need 10,000+ real ones) |
| Trained model | ⚠️ Too small | Using a 3-billion parameter model (plan says 8-27 billion) |

2. The Core Philosophy Problem

Blindspot CORE-1: The System Is Model-Centered, Not Evidence-Centered 🔴

This is the single most important problem in the entire project.

Right now, the system works like this:

  1. Give text to AI
  2. AI writes a summary of what it found
  3. Hope the summary is correct

It should work like this:

  1. AI scans the text and points to specific sentences
  2. Each pointed-to sentence gets a label (Fact / Interpretation / Hypothesis)
  3. You can click on any finding and see the EXACT sentence in the original paper

Analogy: Imagine writing a school essay. Model-centered is like saying "Shakespeare wrote about love and death" with no page references. Evidence-centered is like saying "In Act 3, Scene 1, line 88, Romeo says '...' which shows the theme of love." Teachers always want the second kind — because they can CHECK it.

What changes in the code: Every claim in the database needs to be a POINTER to a specific text span (page number, character position, exact quote). The AI's job is to FIND important spans, not to WRITE about what it found. The claim text should be copied directly from the paper, not generated by the AI.
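A minimal sketch of what such a pointer record could look like (the class and field names are hypothetical, not the repository's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceSpan:
    """A claim is a pointer into the source paper, never generated text."""
    paper_id: str     # which paper this span comes from
    page: int         # page number in the PDF
    char_start: int   # character offset where the span begins
    char_end: int     # character offset where the span ends
    quote: str        # exact text copied verbatim from the paper
    tag: str          # "fact", "interpretation", or "hypothesis"

    def verify(self, parsed_text: str) -> bool:
        # If the stored quote doesn't match the source exactly, the pointer is invalid.
        return parsed_text[self.char_start:self.char_end] == self.quote
```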

Blindspot CORE-2: Design-Code Gap Is Enormous 🔴

What's wrong: The SYSTEM_DESIGN.md describes a sophisticated 7-layer system with ML-based parsing, constrained decoding, multi-round council debates, embedding-based deduplication, and a 4-stage training pipeline. The actual code has basic text extraction, pattern-matching classification, word-overlap deduplication, and single-stage training.

Why it matters: Someone reading the design document would think this is a near-complete system. Someone reading the code would see it's a prototype. This gap can mislead contributors, reviewers, and users.

The fix: The README and documentation should clearly mark which features are "✅ Implemented," "🚧 Skeleton exists," and "📋 Designed only." Honesty about the current state is itself an epistemic virtue — and this project is all about epistemic honesty.

Blindspot CORE-3: No End-to-End Pipeline Has Ever Run Successfully 🔴

What's wrong: There's no evidence (in tests, logs, or documentation) that anyone has ever: taken a real PDF → parsed it → extracted claims → deduplicated them → stored them in the graph → scored them → exported to Obsidian. Each layer has been tested in isolation, but the full pipeline has never been verified end-to-end.

Why it matters: Individual parts working doesn't mean the whole system works. A car where the engine works, the wheels spin, and the steering turns is still broken if they're not connected properly.

The fix: Create ONE end-to-end test. Take a real paper. Run it through every layer. Check the output. This single test is worth more than 100 unit tests for building confidence that the system actually does what it claims.
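A sketch of what that single test could look like. The imported names are stand-ins for whatever the real modules expose, not the repository's actual API:

```python
# Hypothetical pytest sketch; every imported name below is an assumption.
from pathlib import Path

from phd_research_os_v2 import (  # assumed import path, adjust to the real modules
    parse_pdf, extract_claims, deduplicate, store_in_graph,
    score_all, export_obsidian,
)

def test_pipeline_end_to_end(tmp_path: Path):
    doc = parse_pdf(Path("tests/fixtures/real_paper.pdf"))  # one real, checked-in paper
    claims = deduplicate(extract_claims(doc))
    graph = store_in_graph(claims)
    scored = score_all(graph)
    export_obsidian(scored, tmp_path)

    assert claims, "no claims extracted"
    assert all(0.0 <= c.confidence <= 1.0 for c in scored)
    assert any(tmp_path.glob("*.md")), "no Obsidian notes written"
```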


3. Data Problems — Your Training Material Is Weak

Blindspot D-1: Training Data Is All Fake 🔴

All 1,900 training examples were generated by a Python script using templates. Real papers don't write like templates. The AI is learning to process fake text, not real science.

Fix: Get 100 real papers, have a human expert label them, use those as training data.

Blindspot D-2: No "Wrong Answer" Examples 🔴

The AI only sees correct examples. It never sees examples of ALMOST correct but subtly wrong outputs. It's like studying for a test using only the answer key — you won't recognize trick questions.

Fix: Run the current model, collect its mistakes, and use those as "here's what NOT to do" training examples (hard-negative mining).

Blindspot D-3: No Error Categories 🟠

When the model makes a mistake, we don't know WHY. Was it a dropped unit? A missed qualifier? A wrong section? Without categorizing errors, we can't target specific weaknesses.

The 9 error types to track:

  1. Qualifier loss — "may reduce" becomes "reduces"
  2. Unit drop — "0.8 fM" becomes "0.8"
  3. Negation flip — "did NOT work" becomes "worked"
  4. Section confusion — Abstract claim scored as Results
  5. Number hallucination — Paper says 0.8, AI says 0.6
  6. Granularity mismatch — One table row treated as a paper-level conclusion
  7. Citation theft — A finding from a cited paper treated as this paper's finding
  8. Causal overclaim — "correlated with" becomes "caused by"
  9. Statistics loss — p-value or sample size dropped
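A minimal way to make these categories concrete in code (the enum and its string values are a suggestion, not existing repository code):

```python
from enum import Enum

class ErrorType(Enum):
    """The 9 extraction error categories to tag during failure review."""
    QUALIFIER_LOSS = "qualifier_loss"                  # "may reduce" -> "reduces"
    UNIT_DROP = "unit_drop"                            # "0.8 fM" -> "0.8"
    NEGATION_FLIP = "negation_flip"                    # "did NOT work" -> "worked"
    SECTION_CONFUSION = "section_confusion"            # Abstract scored as Results
    NUMBER_HALLUCINATION = "number_hallucination"      # paper says 0.8, model says 0.6
    GRANULARITY_MISMATCH = "granularity_mismatch"      # table row as paper-level conclusion
    CITATION_THEFT = "citation_theft"                  # cited paper's finding claimed as this paper's
    CAUSAL_OVERCLAIM = "causal_overclaim"              # "correlated with" -> "caused by"
    STATISTICS_LOSS = "statistics_loss"                # p-value or sample size dropped
```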

Blindspot D-4: Teacher Disagreement Not Preserved 🟠

The system design talks about running 5 AI models on the same paper and keeping ALL answers. The current pipeline just generates one answer. The disagreement between multiple models IS the most valuable training signal β€” it teaches the student model about uncertainty.

Blindspot D-5: No Counterfactual Examples 🟑

We never create "mirror" training examples where one key word is changed (add "not", swap a unit) to test if the model notices. Without these, we can't tell if the model is actually reading or just pattern-matching.

Blindspot D-6: Training Data Only Covers 5 Domains 🟑

The data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. Feed it a paper about ecology, psychology, or economics and it will hallucinate confidently. The data needs to be broader, or the system needs to refuse out-of-domain papers.


4. Reading Papers — The PDF Nightmare

Blindspot P-1: No Smart PDF Parser 🔴

The code uses basic text scrapers (PyMuPDF, pdfplumber) that destroy document structure. The design says to use Marker (ML-based), but it's not connected. This is the #1 dependency — everything downstream depends on good parsing.

Blindspot P-2: Tables Become Unreadable 🔴

Tables are extracted as flat pipe-separated text. The relationship between column headers and cell values is lost. "0.8" without knowing it's the "LOD" column is meaningless.

Fix: Store tables as structured data with headers and rows preserved.
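One possible shape for that structured form (a sketch; the class name and fields are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ParsedTable:
    """Keeps header-to-cell relationships instead of flattening to pipe text."""
    caption: str
    headers: list[str]     # e.g. ["Analyte", "LOD", "Linear range"]
    rows: list[list[str]]  # one inner list per table row

    def cell(self, row: int, column: str) -> str:
        # "0.8" is only meaningful if we know it sits in the "LOD" column.
        return self.rows[row][self.headers.index(column)]
```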

Blindspot P-3: Figures Are Completely Skipped 🟠

The parser detects images and stores "[Image detected — requires VLM processing]". No analysis happens. 30-40% of key evidence in science papers is in figures.

Blindspot P-4: Section Detection Is Brittle 🟠

Sections are detected by looking for the word "Abstract" or "Results" at the start of a line. Papers that number sections ("2.1 Experimental Setup"), use non-standard names, or combine sections ("Results and Discussion") are misidentified. Since section identity directly affects confidence scores, this causes systematic scoring errors.

Blindspot P-5: Equations Become Garbage 🟠

Mathematical equations are not handled. LaTeX, inline math, and special symbols are either garbled or dropped. For papers in physics, chemistry, or engineering, equations ARE the findings.

Blindspot P-6: Supplements Can't Be Linked to Main Papers 🟠

The parser processes one file at a time. There's no way to say "this supplement belongs to that main paper." In biology and chemistry, the best data is often in supplements.

Blindspot P-7: No Language Detection 🟑

If someone uploads a Chinese or Japanese paper, the system produces English-looking garbage with high confidence. It should detect the language and refuse or flag reduced confidence.

Blindspot P-8: Cross-References Are Detected But Never Verified 🟑

The code finds "see Table 2" in the text but never checks if the parsed Table 2 is actually Table 2. If tables are mislabeled, every claim referencing them has wrong evidence.

Blindspot P-9: No File Safety Checks 🟠

The parser accepts any file with no checks for malicious PDFs (they exist), extremely large files, or corrupted data. A 500MB PDF would crash the system.

Blindspot P-10: Chunking Ignores Table/Caption Integrity 🟑

The section-aware chunker in parser.py merges consecutive body text regions, but tables and their captions can be split across chunks. A table in one chunk and its caption in another means the AI sees numbers without context.


5. The Brain — Your AI Is Underpowered

Blindspot B-1: Model Is Too Small 🔴

Qwen2.5-3B (current) is like a bright middle schooler — can follow instructions but doesn't deeply understand scientific reasoning. The design says upgrade to Qwen3-8B or Qwen3.5-27B. This hasn't happened.

Blindspot B-2: The Fake Brain Is the Default 🔴

When no real AI is connected (which is the default), an _extract_mock() function runs that classifies claims by keyword matching: any sentence with "measured" → Fact, any with "may" → Hypothesis. This is wildly inaccurate.

Example failures:

  • "We measured nothing significant" → Mock says Fact → Actually a null result
  • "The most reliable technique measured to date may revolutionize the field" → Mock says both Fact AND Hypothesis

Blindspot B-3: No Output Guarantees 🟠

The system asks the AI to output JSON and hopes for the best. Without constrained decoding (Guidance library), the model can produce broken JSON, invalid tags, or mixed text-and-JSON output.
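Until constrained decoding is wired in, even a simple validate-and-retry loop beats hoping. A hedged stopgap sketch (call_model is a hypothetical stand-in for the real inference call, and the required keys are illustrative):

```python
import json

REQUIRED_KEYS = {"claim", "tag", "qualifiers"}  # assumed schema, adjust to the real one

def extract_json(prompt: str, call_model, max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # broken JSON: ask again
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data  # parseable and all required fields present
    raise ValueError("model never produced valid JSON")
```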

Blindspot B-4: One Brain for Six Different Jobs 🟠

One model does: claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want opposite behaviors (extraction wants to catch everything; qualifier detection wants surgical precision).

Fix: Shared base model with specialized output heads for different tasks.

Blindspot B-5: No Scientific Vocabulary Training 🟠

The base model was trained on general internet text. It doesn't deeply know what "LOD," "analyte," "p-value," or "cyclic voltammetry" mean. It can extract the words but might misclassify their importance.

Blindspot B-6: Can't Detect "I Don't Know This Field" 🟠

No out-of-distribution detection. Feed it a sociology paper and it produces confident-looking scientific claims that are nonsense. The model should flag when content is too different from its training data.

Blindspot B-7: No Verification of Claims Against Source 🟠

The design describes a "chain-of-verification" approach where a second model re-reads the source and checks if the extracted claim is actually supported. This isn't built. The system trusts its first extraction with no double-checking.


6. Memory β€” How the System Remembers

Blindspot M-1: Deduplication Uses Word Overlap, Not Meaning 🔴

Two claims that say the same thing in different words are treated as different findings. "The LOD was 0.8 fM" and "A detection limit of 0.8 femtomolar was achieved" would NOT be recognized as duplicates because they share few words.

Fix: Use an embedding model that converts sentences into number vectors. Similar meanings = similar vectors, regardless of word choice.
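A minimal sketch using the sentence-transformers library (the model name and the 0.85 threshold are illustrative and would need calibration on real claim pairs):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

a = "The LOD was 0.8 fM"
b = "A detection limit of 0.8 femtomolar was achieved"

emb = model.encode([a, b], normalize_embeddings=True)
similarity = util.cos_sim(emb[0], emb[1]).item()

# Word overlap between these two sentences is near zero,
# but their embedding similarity is high.
if similarity > 0.85:
    print("duplicate candidates:", round(similarity, 3))
```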

Blindspot M-2: No Embedding Model Exists in the Code 🟠

The design describes embedding-based similarity. The code imports no embedding model. No sentence-transformers, no ChromaDB, nothing. The entire deduplication and conflict detection system relies on word counting.

Blindspot M-3: Conflict Detection Is Limited to 500 Claims 🟠

The find_conflicts() method only checks the top 500 claims by confidence. If important claim #501 contradicts claim #10, it'll never be found. And it uses the same word-overlap approach, so semantically similar but differently-worded contradictions are invisible.

Blindspot M-4: No Concept of Time 🟑

A finding from 2015 and a finding from 2025 are weighted equally. But the 2025 paper might specifically disprove the 2015 one. The graph needs temporal reasoning.

Blindspot M-5: Gap Analysis Is Too Narrow 🟑

The gap-finding algorithm only looks for missing connections between same-type nodes. But the most interesting gaps are between different types — like a method that's never been applied to a particular material.

Blindspot M-6: No Retraction Checking 🟠

If a paper gets retracted (pulled back because it was wrong or fraudulent), the system doesn't know. Claims from retracted papers sit in the knowledge graph with full confidence.

Fix: Check CrossRef/Retraction Watch before ingesting a paper. If a paper in the database gets retracted later, propagate that information to all its claims.
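A hedged sketch of such a check against the Crossref REST API (the filter syntax and response fields should be verified against Crossref's documentation before relying on this):

```python
import requests

def is_retracted(doi: str) -> bool:
    """Ask Crossref whether any retraction notice updates this DOI."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"filter": f"updates:{doi}"},  # items that update this DOI
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for update in item.get("update-to", []):
            if (update.get("type") == "retraction"
                    and update.get("DOI", "").lower() == doi.lower()):
                return True
    return False
```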

Blindspot M-7: Obsidian Export Doesn't Include Graph Relationships 🟑

The Obsidian exporter creates notes for claims, conflicts, and goals, but doesn't export the actual graph edges (supports/refutes/extends). A researcher looking at the vault sees individual findings but can't see how they connect.


7. Scoring — Can You Trust the Numbers?

Blindspot S-1: All Qualifiers Get the Same Penalty 🟠

Every qualifier reduces confidence by exactly 0.1. But "may" (very uncertain) and "under controlled laboratory conditions" (limits scope but is quite certain) are treated identically.

Fix: Weighted qualifier types β€” strong hedges like "may" get -0.20, scope limiters like "in vitro" get -0.05.

Blindspot S-2: Only One Calibration Number 🟠

The system tracks one Brier score for overall calibration. But the model might be perfectly calibrated on "did I find a real claim?" and terribly calibrated on "are these two claims contradictory?" One number hides this.

Fix: Track 6 separate calibration curves: extraction confidence, qualifier confidence, statistical confidence, conflict confidence, section confidence, provenance confidence.
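A minimal sketch of per-task calibration tracking. The Brier score here is just the mean squared gap between stated confidence and actual correctness; task names follow the list above:

```python
from collections import defaultdict

brier = defaultdict(lambda: [0.0, 0])  # task -> [sum of squared errors, count]

def record(task: str, confidence: float, was_correct: bool) -> None:
    brier[task][0] += (confidence - (1.0 if was_correct else 0.0)) ** 2
    brier[task][1] += 1

def brier_score(task: str) -> float:
    total, n = brier[task]
    return total / n if n else float("nan")

record("extraction", 0.9, True)   # confident and right: adds little error
record("conflict", 0.9, False)    # confident and wrong: adds a lot
```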

Blindspot S-3: Design Features Missing from Code 🟑

The SYSTEM_DESIGN.md describes:

  • Source count bonus (+0.2 for claims backed by multiple papers) → NOT in scorer.py
  • Conflict penalty (-0.3 for claims with active conflicts) → NOT in scorer.py
  • Corroboration bonus → NOT in scorer.py

The scorer only uses single-claim information. It ignores the knowledge graph entirely.

Blindspot S-4: Average Instead of Minimum for Composite Score 🟑

Composite confidence = average of three scores. This means terrific evidence + terrible qualifier strength = mediocre composite. A chain is only as strong as its weakest link — use the minimum, not the average.

Blindspot S-5: Parser Confidence Isn't Actually Meaningful 🟑

The score_parse_quality() function in parser.py assigns quality scores, but the thresholds are arbitrary. What does "garbled_ratio × 3000" actually correspond to? These numbers were chosen by gut feeling, not calibrated against human judgments of parse quality.


8. Testing — How Do You Know It Works?

Blindspot T-1: No Gold Standard Test Set 🔴

The evaluation counts claims and checks distributions but never compares against human-labeled ground truth. The system could produce 500 completely wrong claims and report "all metrics look normal."

Fix: 10 expert-labeled papers where every claim, tag, qualifier, and conflict is manually annotated. This is the #1 requirement for any ML system — you can't improve what you can't measure.

Blindspot T-2: No Paper-Level Test Splits 🔴

Train and test sets are randomly shuffled. Claims from the same paper could be in both. The model might memorize patterns per paper rather than learning to generalize.

Fix: Split by paper, lab, journal, year, and field. Test on genuinely unseen papers.
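A sketch of the paper-level part of that split using scikit-learn's GroupShuffleSplit, assuming each claim record carries a paper_id field:

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_paper(claims: list[dict], test_size: float = 0.2):
    """All claims from one paper land on the same side of the split."""
    groups = [c["paper_id"] for c in claims]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
    train_idx, test_idx = next(splitter.split(claims, groups=groups))
    return [claims[i] for i in train_idx], [claims[i] for i in test_idx]
```

The same pattern extends to lab, journal, year, and field by changing the grouping key.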

Blindspot T-3: No Counterfactual Tests 🟠

We never check if the model changes its answer for the right reason. Remove a table header — does the claim become Incomplete? Flip "significant" to "not significant" — does the tag change? Without these tests, the model might use shortcuts instead of understanding.

Blindspot T-4: No Stochastic Testing 🟑

LLMs give different answers each time. Running evaluation once gives one number that might change by ±5% next time. Run evaluations 5 times and report the range.
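A minimal sketch (run_eval is a stand-in for whatever evaluation entry point exists):

```python
import statistics

def evaluate_n_times(run_eval, n: int = 5) -> str:
    """Report the spread of repeated evaluation runs, not a single number."""
    scores = [run_eval() for _ in range(n)]
    spread = (max(scores) - min(scores)) / 2
    return f"accuracy {statistics.mean(scores):.3f} (±{spread:.3f} over {n} runs)"
```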

Blindspot T-5: Regression Gate Has No Teeth 🟑

The run_regression_gate() returns pass/fail but nothing blocks a bad model from being used. It's like a fire alarm with no fire department.

Blindspot T-6: No Inter-Annotator Agreement Tracking 🟑

When multiple people label the gold standard, they'll disagree on some claims. That disagreement is important information — it tells you which categories are genuinely ambiguous. The system has no mechanism to measure or track this.

Blindspot T-7: 143 Tests, Zero of Them Test Science Quality 🟠

All 143 tests verify that the CODE runs correctly (functions don't crash, data is stored properly). Zero tests verify that the SCIENCE is right (extracted claims match what the paper actually says). Code tests and science tests are completely different things.


9. Training — Teaching the AI Properly

Blindspot TR-1: Only Stage 1 of 4 Is Built 🔴

The 4-stage pipeline: SFT → DPO → GRPO → ConfTuner. Only SFT exists. This is like building a race car but only installing first gear.

  • SFT (Stage 1): "Here's the right answer" → ~70% quality
  • DPO (Stage 2): "This answer is better than that one" → ~80% quality
  • GRPO (Stage 3): "Reward for good JSON, correct tags, preserved qualifiers" → ~85-90% quality
  • ConfTuner (Stage 4): "Make your confidence numbers match reality" → Calibrated confidence

Blindspot TR-2: ZeroGPU Micro-Batching Breaks Learning 🟠

The training Space uses 60-second GPU bursts. The learning rate resets, the model reloads, and there's no training momentum. It's like studying in 2-minute bursts with naps between each one.

Fix: One continuous training job on proper GPU (HF Jobs or local).

Blindspot TR-3: No Curriculum Learning 🟑

Easy and hard examples are randomly mixed. Humans learn better easy-to-hard. The model should first learn single facts, then fact vs. interpretation, then multi-claim papers, then contradictions.

Blindspot TR-4: Loss Number Doesn't Mean Task Success 🟑

Training tracks eval_loss, which tells you if the model is generating likely-looking text. It does NOT tell you if the JSON is valid, tags are correct, qualifiers are preserved, or numbers are accurate. Task-specific evaluation during training is essential.

Blindspot TR-5: No Training on Real Model Failures 🟠

Hard-negative mining (collecting the model's actual mistakes and training on them) is the single most efficient way to improve quality. It's described in the previous expert review session but not implemented anywhere. You need:

  1. Run current model on 100 papers
  2. Manually check which extractions are wrong
  3. Categorize WHY they're wrong (using the 9 error types)
  4. Train specifically on those failure cases

10. Teamwork — Multiple Models Working Together

Blindspot W-1: Council Is Sequential, Not Debating 🟠

The design describes a multi-round debate (Extractor → Critic → Chairman, with hidden confidence and revealed reasoning). The code is sequential: one call per member, no back-and-forth. The disagreement signal (which IS the whole point) is lost.

Blindspot W-2: No Router 🟠

Everything goes to the same model. Text, tables, figures, equations — all handled the same way. A router should decide: regular text → language model, table → specialized parser, figure → vision model, garbage → skip.

Blindspot W-3: System Can Never Say "I Don't Know" 🟠

Every input gets a confident-looking answer. There's no mechanism to say "I can't read this page," "I don't understand this field," or "I need a human to look at this." Confident garbage is the worst failure mode.

Blindspot W-4: Teachers Are Treated As Oracles 🟠

The design proposes using 5 frontier AI models as teachers. But these models share biases β€” they were all trained on similar internet text, they all struggle with the same types of negation, and they can all be wrong in the same direction.

Fix: Include at least one NON-AI anchor for each data type: deterministic table parser for tables, regex for statistics, CrossRef API for citations. Where the rules-based tool disagrees with all 5 AI models, investigate — the AIs might be harmonizing around an error.
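A sketch of what one such non-AI anchor could look like for statistics (the regexes are illustrative and would need hardening for real papers):

```python
import re

P_VALUE = re.compile(r"p\s*[<=>]\s*0?\.\d+", re.IGNORECASE)
SAMPLE_SIZE = re.compile(r"\bn\s*=\s*\d+", re.IGNORECASE)

def extract_stats(sentence: str) -> dict:
    """Deterministic extraction of p-values and sample sizes."""
    return {
        "p_values": P_VALUE.findall(sentence),
        "sample_sizes": SAMPLE_SIZE.findall(sentence),
    }

# If all five AI teachers drop "p < 0.05" but the regex finds it,
# flag the example for review.
print(extract_stats("The effect was significant (p < 0.05, n = 42)."))
```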


11. Staying Honest Over Time

Blindspot H-1: No Drift Detection 🟠

If the model slowly gets worse (because it encounters new sub-fields or vocabulary evolves), nobody notices. Weekly re-evaluation against the gold standard is needed.

Blindspot H-2: Human Corrections Disappear 🟠

When a researcher fixes a wrong tag or adds a missing qualifier in the UI, that correction isn't saved for training. Those corrections are the most valuable data you can get — each one is an expert-labeled example from your exact domain.

Blindspot H-3: Taxonomy Changes Break Old Scores 🟑

If study type weights change (e.g., "in_vitro" goes from 0.85 to 0.80), old claims aren't re-scored. Claims from different taxonomy versions can't be compared meaningfully.

Blindspot H-4: No Backup Strategy 🟑

Everything is in one SQLite file. No automatic backups, no periodic snapshots, no way to recover from corruption or accidental deletion.

Blindspot H-5: No Retraction Monitoring 🟠

Once a paper is ingested, its claims live in the graph forever — even if the paper is retracted (pulled back for fraud or errors). The system needs to periodically check for retractions and propagate that information.

Blindspot H-6: v1 → v2 Migration Path Undefined 🟡

The repo has both phd_research_os/ (v1) and phd_research_os_v2/ (v2) with different database schemas. There's no migration script for anyone who has data in v1 format. Users could lose work when upgrading.


12. The Human Experience

Blindspot HU-1: Information Overload 🟠

100 papers → 3,000+ claims. The UI treats them all equally. Researchers need priority ranking: most surprising findings first, then contradictions, then gaps.

Blindspot HU-2: Scores Are Unexplained Numbers 🟡

"Confidence: 0.723" means nothing without a breakdown. Show the multiplication chain: evidence × quality × tier × section × completeness = 0.723.

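A minimal sketch of such a breakdown (the factor names mirror the scoring description, but the exact set is an assumption):

```python
def explain_confidence(factors: dict[str, float]) -> str:
    """Render the multiplication chain behind a composite confidence score."""
    chain = " × ".join(f"{name} {value:.2f}" for name, value in factors.items())
    composite = 1.0
    for value in factors.values():
        composite *= value
    return f"{chain} = {composite:.3f}"

print(explain_confidence({
    "evidence": 0.90, "quality": 0.95, "tier": 0.92,
    "section": 1.00, "completeness": 0.92,
}))
```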
Blindspot HU-3: Risk of Blind Trust 🟑

Once a researcher trusts the system, they stop checking. Built-in friction (hiding scores randomly, asking "do you agree?", tracking override rates) prevents over-reliance.

Blindspot HU-4: No "Fresh User" Onboarding 🟑

A new graduate student encountering the system for the first time sees a complex 7-layer architecture with 87 blindspots. There's no tutorial, no "start here" guide, no simplified workflow for beginners.

Fix: A "Getting Started" mode that processes one paper and walks the user through each step: "Here's what the parser found → Here's what the AI extracted → Here's how we scored it → Here's what we're not sure about."

Blindspot HU-5: No Accessibility Considerations 🟑

The UI has no dark mode, no font size options, no screen reader support. Research tools should be accessible to everyone.


13. Security and Safety

Blindspot SEC-1: No Input Validation 🟠

No file size limits, no malicious PDF detection, no format verification. A bad file could crash or compromise the system.

Blindspot SEC-2: Risky SQL Pattern 🟑

The get_stats() function uses f-strings for table names. Currently safe because names are hardcoded, but a bad pattern that could become dangerous if someone modifies the code.
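A sketch of the safer pattern: SQLite placeholders can't parameterize identifiers like table names, so whitelist them instead (the table names here are hypothetical):

```python
# Identifiers can't go through "?" placeholders, so validate them explicitly.
ALLOWED_TABLES = {"papers", "claims", "conflicts"}  # hypothetical names

def count_rows(conn, table: str) -> int:
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    # Safe: the name comes from a fixed whitelist, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```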

Blindspot SEC-3: No API Cost Controls 🟑

No rate limiting on external API calls. Processing 1,000 papers could generate thousands of expensive API requests with no spending cap.

Blindspot SEC-4: No Data Privacy for Unpublished Work 🟑

The design describes an "Epistemic Embargo" (private graphs for unpublished research), but it's not implemented. A researcher analyzing their unpublished data has no privacy guarantees.


14. The Companion Agent System

Blindspot A-1: Agents Have No Real Brain 🟠

The companion agents (DataQualityAuditor, PromptOptimizer, etc.) are fully coded with lifecycle management, audit trails, and proposal systems. But when no AI model is connected (the default), they generate placeholder proposals that say "Brain not configured." They're robots with no intelligence.

Blindspot A-2: Agent Plans Are Hardcoded, Not Dynamic 🟑

The _plan() method assigns fixed step lists based on agent type. A DataQualityAuditor always gets the same 4 steps regardless of what task it's given. Real planning would look at the task, check what data is available, and decide what to do.

Blindspot A-3: No Agent Coordination 🟑

Multiple agents can run simultaneously but they can't talk to each other. If the DataQualityAuditor finds a problem and the PromptOptimizer could fix it, there's no mechanism for them to coordinate.

Blindspot A-4: Proposal Review Has No Urgency System 🟑

All proposals wait equally for human review. But some (like "this paper was retracted — its claims should be removed") are urgent, while others (like "consider adding training examples for this domain") are routine. No priority system exists.


15. What To Build First

Everything has dependencies. You can't build the roof before the walls. Here's the order that makes engineering sense:

πŸ—οΈ Foundation Phase (Weeks 1-4) β€” Nothing works without this

# Task Why First
1 Integrate Marker PDF parser Every downstream layer depends on accurate parsing
2 Create gold standard test set (10 real papers, expert-labeled) Can't measure improvement without ground truth
3 Add embedding model (sentence-transformers) Needed for smart deduplication and conflict detection
4 Run one end-to-end pipeline test Prove the layers actually connect

πŸ”§ Reliability Phase (Weeks 5-8) β€” Make it work correctly

# Task Why Next
5 Connect real AI model (replace mock extractor) The system needs a real brain
6 Add constrained decoding (Guidance library) Guarantee valid JSON output
7 Build continuous training pipeline (replace ZeroGPU micro-batching) Proper training produces proper models
8 Implement weighted qualifier penalties Not all hedging words are equal

🧠 Intelligence Phase (Weeks 9-16) β€” Make it smart

# Task Why This Order
9 Expand training data to 10K+ with hard negatives More and better data = better model
10 Implement DPO training (preference learning) Stage 2 of training pipeline
11 Implement GRPO training (reward functions for JSON quality, tag correctness, qualifier preservation) Stage 3 β€” the biggest quality jump
12 Add out-of-distribution detection Know when to say "I don't know"

βœ… Trust Phase (Weeks 17-24) β€” Make it honest

# Task Why Here
13 Build counterfactual evaluation suite Test that the model reasons correctly, not just accurately
14 Add paper-level evaluation splits Honest accuracy numbers
15 Implement human feedback loop Every correction becomes training data
16 Add multi-dimensional calibration (6 Brier scores) Know which confidence types are trustworthy
17 Add drift detection and monitoring Catch problems before they become crises

πŸš€ Scale Phase (Weeks 25-32) β€” Make it complete

# Task Why Last
18 Add figure/table specialist models Handle the 30-40% of evidence in images
19 Build content router Right model for right content type
20 Implement supplement handling (paper bundles) Complete paper coverage
21 Add retraction checking Keep the knowledge graph honest
22 Build verification pipeline (double-check claims against source) Reduce hallucinations dramatically

16. Master Blindspot Table

| ID | Category | Severity | Problem (one sentence) |
|----|----------|----------|------------------------|
| CORE-1 | Philosophy | 🔴 | System generates summaries instead of pointing to evidence spans |
| CORE-2 | Philosophy | 🔴 | Design document describes features the code doesn't have |
| CORE-3 | Philosophy | 🔴 | End-to-end pipeline has never been tested on a real paper |
| D-1 | Data | 🔴 | All training data is computer-generated, not from real papers |
| D-2 | Data | 🔴 | No examples of what wrong output looks like |
| D-3 | Data | 🟠 | Errors aren't categorized by type |
| D-4 | Data | 🟠 | Multiple-teacher disagreement signal is thrown away |
| D-5 | Data | 🟡 | No counterfactual (mirror) training examples |
| D-6 | Data | 🟡 | Training covers only 5 scientific fields |
| P-1 | Parser | 🔴 | No ML-based PDF parser is connected |
| P-2 | Parser | 🔴 | Tables lose their header-value relationships |
| P-3 | Parser | 🟠 | Figures are detected but never analyzed |
| P-4 | Parser | 🟠 | Section headers are identified by keyword matching only |
| P-5 | Parser | 🟠 | Equations are garbled or dropped |
| P-6 | Parser | 🟠 | Supplementary files can't be linked to main papers |
| P-7 | Parser | 🟡 | Non-English papers produce garbage silently |
| P-8 | Parser | 🟡 | "See Table 2" references are found but never verified correct |
| P-9 | Parser | 🟠 | No file size/safety checks before processing |
| P-10 | Parser | 🟡 | Tables and their captions can be split across chunks |
| B-1 | Brain | 🔴 | Model is 3B parameters (design says 8-27B) |
| B-2 | Brain | 🔴 | Default mode uses keyword matching instead of AI |
| B-3 | Brain | 🟠 | Output format is not guaranteed (no constrained decoding) |
| B-4 | Brain | 🟠 | Single model does 6 different tasks that want opposite behaviors |
| B-5 | Brain | 🟠 | No training on scientific vocabulary specifically |
| B-6 | Brain | 🟠 | Can't detect when content is outside its training domain |
| B-7 | Brain | 🟠 | No second-pass verification of extracted claims |
| M-1 | Memory | 🔴 | Deduplication checks word overlap, not meaning |
| M-2 | Memory | 🟠 | No embedding model exists anywhere in the code |
| M-3 | Memory | 🟠 | Conflict detection only checks 500 claims using word overlap |
| M-4 | Memory | 🟡 | Knowledge graph has no concept of time |
| M-5 | Memory | 🟡 | Gap analysis only looks within same node types |
| M-6 | Memory | 🟠 | Retracted papers aren't detected or flagged |
| M-7 | Memory | 🟡 | Obsidian export doesn't include graph edges |
| S-1 | Scoring | 🟠 | All qualifier types penalized equally |
| S-2 | Scoring | 🟠 | One calibration number hides per-task calibration failures |
| S-3 | Scoring | 🟡 | Design features (source bonus, conflict penalty) not in code |
| S-4 | Scoring | 🟡 | Composite = average (should be minimum of weakest dimension) |
| S-5 | Scoring | 🟡 | Parse quality scores are arbitrary, not calibrated |
| T-1 | Testing | 🔴 | No human-labeled gold standard test set |
| T-2 | Testing | 🔴 | Train/test split is random, not paper-level |
| T-3 | Testing | 🟠 | No counterfactual robustness tests |
| T-4 | Testing | 🟡 | Evaluations run once (should run 5× to check consistency) |
| T-5 | Testing | 🟡 | Regression gate returns pass/fail but nothing enforces it |
| T-6 | Testing | 🟡 | No tracking of human annotator agreement |
| T-7 | Testing | 🟠 | 143 code tests, zero science quality tests |
| TR-1 | Training | 🔴 | Only 1 of 4 training stages exists |
| TR-2 | Training | 🟠 | ZeroGPU micro-batching breaks learning continuity |
| TR-3 | Training | 🟡 | Examples aren't ordered easy → hard |
| TR-4 | Training | 🟡 | Training eval is loss-based, not task-based |
| TR-5 | Training | 🟠 | No training on actual model failures |
| W-1 | Teamwork | 🟠 | Council members don't actually debate |
| W-2 | Teamwork | 🟠 | No router to send content to appropriate model |
| W-3 | Teamwork | 🟠 | System can never say "I don't know" |
| W-4 | Teamwork | 🟠 | Teacher ensemble assumed unbiased (they share biases) |
| H-1 | Longevity | 🟠 | No automatic detection of model performance degradation |
| H-2 | Longevity | 🟠 | Human corrections aren't saved for future training |
| H-3 | Longevity | 🟡 | Taxonomy changes don't trigger re-scoring of old claims |
| H-4 | Longevity | 🟡 | No database backup automation |
| H-5 | Longevity | 🟠 | No retraction monitoring for ingested papers |
| H-6 | Longevity | 🟡 | No migration path from v1 to v2 database schema |
| HU-1 | Human | 🟠 | 3,000 claims displayed with no priority ranking |
| HU-2 | Human | 🟡 | Confidence scores show number but no breakdown |
| HU-3 | Human | 🟡 | No safeguards against over-trusting the system |
| HU-4 | Human | 🟡 | No beginner-friendly onboarding experience |
| HU-5 | Human | 🟡 | No accessibility features (dark mode, screen reader, font size) |
| SEC-1 | Security | 🟠 | No file validation or safety checks |
| SEC-2 | Security | 🟡 | SQL construction uses f-strings (risky pattern) |
| SEC-3 | Security | 🟡 | No spending limits on API calls |
| SEC-4 | Security | 🟡 | No privacy controls for unpublished research |
| A-1 | Agents | 🟠 | Companion agents have no AI brain connected |
| A-2 | Agents | 🟡 | Agent plans are fixed templates, not dynamic |
| A-3 | Agents | 🟡 | Multiple agents can't coordinate with each other |
| A-4 | Agents | 🟡 | No urgency system for proposal review |

Summary by Severity

| Severity | Count | Meaning |
|----------|-------|---------|
| 🔴 Critical | 14 | System fundamentally broken without this |
| 🟠 High | 33 | Significant quality or reliability problem |
| 🟡 Medium | 31 | Important but not blocking |
| Total | 78 | |

Summary by Category

| Category | Count | Most Critical Issue |
|----------|-------|---------------------|
| Philosophy | 3 | Evidence-centered vs model-centered |
| Data | 6 | All training data is synthetic |
| Parser | 10 | No ML-based parser connected |
| Brain | 7 | Model too small + mock is default |
| Memory | 7 | Deduplication uses word counting |
| Scoring | 5 | All qualifiers penalized equally |
| Testing | 7 | No gold standard test set |
| Training | 5 | Only stage 1 of 4 built |
| Teamwork | 4 | Council doesn't debate + no router |
| Longevity | 6 | No drift detection |
| Human | 5 | Information overload |
| Security | 4 | No input validation |
| Agents | 4 | Agents have no brain |

How This Document Was Created

  1. Read every file in the repository (60+ files, 545KB of code and documentation)
  2. Compared code against design — for each feature in SYSTEM_DESIGN.md, checked if it exists in the code
  3. Incorporated 87 original blindspots from BLINDSPOT_AUDIT_COMPLETE.md
  4. Incorporated 12 architectural improvements from previous expert review session
  5. Wrote first draft with 47 blindspots
  6. Re-read with fresh eyes and found 31 additional blindspots
  7. Deduplicated and organized into 78 unique findings across 14 categories
  8. Rewrote everything in high-school-readable language

Relationship to Other Documents

| Document | What It Contains | How This One Is Different |
|----------|------------------|---------------------------|
| BLINDSPOT_AUDIT_COMPLETE.md | 87 theoretical failure modes found by adversarial critique | Theoretical — found by thinking about what COULD go wrong |
| SYSTEM_DESIGN.md | The dream architecture — what the system SHOULD be | Aspirational — describes the finished product |
| This document | 78 practical problems found by reading the actual code | Practical — found by comparing code to design |

The key difference: the audit found problems by THINKING. This document found problems by READING THE CODE. Many overlap, but this document catches things the audit missed (like the mock extractor being the default, or the embedding model not existing), and skips things the audit included that are already partially addressed in code.


This is the compiled final edition. Each finding is grounded in specific code files and can be verified by reading the source.


Appendix: New Improvements from Prior Art Analysis (2026-04-23)

Source: Comprehensive analysis of 15 similar systems (PRIOR_ART_ANALYSIS.md, SYSTEM_INSPIRATIONS.md)

After analyzing every system that does something similar to PhD Research OS, we found 16 concrete improvements to adopt, adapt, or build. These are ON TOP of the 78 blindspots already documented above. Each one comes from a real, published, working system.

New Blindspots Discovered from Prior Art

| ID | Category | Severity | Problem | Source System |
|----|----------|----------|---------|---------------|
| PA-1 | Memory | 🔴 | No scientific embedding model for dedup (word overlap misses paraphrases — "0.8 fM" ≠ "800 attomolar" to our system, but they're the same) | SPECTER2 (AllenAI) proves embeddings fix this |
| PA-2 | Testing | 🔴 | No standard benchmark for claim verification quality (we can't compare ourselves to anyone else) | SciFact (AllenAI) is the industry standard we should measure against |
| PA-3 | Data | 🔴 | 1,900 synthetic examples vs 137,000 expert-written ones available for free (72× gap) | SciRIFF (AllenAI) is sitting on HuggingFace waiting to be used |
| PA-4 | Brain | 🟠 | No pre-filtering before expensive extraction (AI Council processes boilerplate text and acknowledgments as seriously as Results) | PaperQA2's RCS technique filters noise before analysis |
| PA-5 | Brain | 🟠 | No code-based epistemic validation (AI classification has no deterministic cross-check) | KGX3's language-game filters provide rule-based validation |
| PA-6 | Brain | 🟠 | No completeness audit after extraction (system could silently miss the most important finding) | Paper Circle's Coverage Checker catches omissions |
| PA-7 | Brain | 🟠 | System never says "I don't know" (every input gets a confident-looking answer, even garbage) | PaperQA2 refuses to answer when evidence is insufficient |
| PA-8 | Memory | 🟠 | Contradiction detection has no fast pre-filter (500K candidate pairs would each require an expensive LLM call) | SciBERT-NLI can pre-screen in seconds, not hours |
| PA-9 | Memory | 🟠 | No cross-reference verification between paper claims and knowledge graph (claims checked in isolation) | FactReview checks against both the paper AND external literature |
| PA-10 | Scoring | 🟡 | No temporal confidence tracking (a claim that's been rising for 4 years looks the same as one that just appeared) | CLAIRE + PaperQA2 both track temporal patterns |
| PA-11 | Scoring | 🟡 | Confidence numbers shown without explanation (0.72 means nothing without knowing WHY) | CLUE generates plain-English uncertainty explanations |
| PA-12 | Human | 🟡 | No active skepticism mode (system only confirms, never challenges high-confidence claims) | CLAIRE actively looks for counter-evidence |
| PA-13 | Human | 🟡 | No human verification tracking (machine-extracted claims look identical to expert-verified ones) | ORKG + Paper Circle track verification status |
| PA-14 | Brain | 🟡 | No self-critique step in Council (Extractor produces claims without articulating uncertainty) | CritiCal proves that self-critique improves calibration |
| PA-15 | Memory | 🟡 | Conflict detection is shallow (compares two claims in isolation, doesn't investigate with surrounding evidence) | CLAIRE's investigation loop retrieves additional evidence before flagging |
| PA-16 | Data | 🟡 | No non-AI anchors in teacher ensemble (all 5 teacher models share similar biases from internet training) | FactReview uses CODE EXECUTION as a non-AI evidence source |

Updated Summary by Severity (Including Prior Art Findings)

| Severity | Original Count | New from Prior Art | Total |
|----------|----------------|--------------------|-------|
| 🔴 Critical | 14 | 3 | 17 |
| 🟠 High | 33 | 5 | 38 |
| 🟡 Medium | 31 | 8 | 39 |
| Total | 78 | 16 | 94 |

Priority Integration Order for Prior Art Items

Do These First (they fix Critical blindspots with minimal effort):

| Priority | Item | Effort | What It Fixes |
|----------|------|--------|---------------|
| 1 | PA-1: Integrate SPECTER2 | 1-2 days | Deduplication actually works |
| 2 | PA-2: Add SciFact benchmark | 1 day | We can measure quality |
| 3 | PA-3: Integrate SciRIFF data | 2-3 days | 72× more training data |

Do These Next (they fix High blindspots with moderate effort):

| Priority | Item | Effort | What It Fixes |
|----------|------|--------|---------------|
| 4 | PA-8: SciBERT-NLI pre-filter | 1-2 days | Conflict detection becomes 100× faster |
| 5 | PA-4: Pre-Extraction Filter | 1 week | Removes noise before expensive processing |
| 6 | PA-5: Epistemic Trigger Words | 1 week | Code-based validation of AI labels |
| 7 | PA-7: Low Confidence Quarantine | 3-5 days | System learns to say "I don't know" |
| 8 | PA-6: Completeness Auditor | 1 week | Catches silent omissions |
| 9 | PA-9: Cross-Reference Verification | 1-2 weeks | Claims checked against both paper and graph |

Do These When Core Is Stable (they add new capabilities):

| Priority | Item | Effort | What It Adds |
|----------|------|--------|--------------|
| 10 | PA-14: Council Self-Critique | 1 week | Better calibration |
| 11 | PA-13: Provenance Levels | 1 week | Human verification tracking |
| 12 | PA-15: Investigation Protocol | 2-3 weeks | Deep contradiction analysis |
| 13 | PA-10: Epistemic Velocity | 3-5 days | Temporal confidence trends |
| 14 | PA-12: Devil's Advocate Mode | 1-2 weeks | Active skepticism |
| 15 | PA-11: Confidence Explanations | 3-5 days | Plain-English scoring reasons |
| 16 | PA-16: Non-AI anchors | Ongoing | Deterministic validation alongside AI |

Appendix added 2026-04-23. These 16 items supplement the original 78 blindspots with findings from prior art analysis. See PRIOR_ART_ANALYSIS.md and SYSTEM_INSPIRATIONS.md for full details.