# PhD Research OS – Future Improvements Roadmap
## Everything That Needs to Be Fixed, and How

**Written so a high school student can understand every word.**

**Version**: 2.0 (Final Compiled Edition)
**Date**: 2026-04-23
**Status**: Comprehensive – compiled from 87 original audit findings, 12 architectural improvements, and iterative code review
**Total blindspots catalogued**: 78 (across 14 categories)

---
## What Is This Document?

Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts built. This document lists every problem we've found – by examining the actual code, not just the plans – and explains how to fix each one.

**Who is this for?** Anyone who wants to understand, contribute to, or learn from this project. Every explanation uses everyday language. If you've ever written a school report, you already have the intuition for most of these problems.

**How was this made?** We read every file in the repository (60+ files), compared the actual code against the design document, re-read our findings multiple times looking for things we missed, and compiled everything into this single document.

---
## Table of Contents

1. [The Big Picture](#1-the-big-picture)
2. [The Core Philosophy Problem](#2-the-core-philosophy-problem)
3. [Data Problems – Your Training Material Is Weak](#3-data-problems)
4. [Reading Papers – The PDF Nightmare](#4-reading-papers)
5. [The Brain – Your AI Is Underpowered](#5-the-brain)
6. [Memory – How the System Remembers](#6-memory)
7. [Scoring – Can You Trust the Numbers?](#7-scoring)
8. [Testing – How Do You Know It Works?](#8-testing)
9. [Training – Teaching the AI Properly](#9-training)
10. [Teamwork – Multiple Models Working Together](#10-teamwork)
11. [Staying Honest Over Time](#11-staying-honest)
12. [The Human Experience](#12-the-human-experience)
13. [Security and Safety](#13-security-and-safety)
14. [The Companion Agent System](#14-companion-agents)
15. [What To Build First](#15-what-to-build-first)
16. [Master Blindspot Table](#16-master-table)

---
## 1. The Big Picture

### What does this system do?

It reads science papers and does 5 things:
1. **Extracts** key findings ("This test detected cancer at 0.8 fM")
2. **Labels** each finding (Is it a proven fact? An educated guess? A theory?)
3. **Scores** how much you should trust each finding (using math formulas, not AI guesses)
4. **Compares** findings across papers and spots contradictions
5. **Exports** everything to tools researchers already use (Obsidian notes, CSV files)

### What's built vs. what's planned?

| Part | Status | Plain English |
|------|--------|---------------|
| Database | ✅ Works | The filing cabinet is built |
| PDF reader | ⚠️ Basic | Can read papers but misses tables, figures, and equations |
| Claim extractor | ⚠️ Uses fake AI | Has a pretend brain that guesses using word patterns |
| Deduplicator | ⚠️ Primitive | Checks if two sentences share the same words (misses paraphrases) |
| Knowledge graph | ⚠️ Basic | Can store connections but can't find subtle contradictions |
| Scoring engine | ✅ Works | Math formulas work correctly |
| Evaluation | ⚠️ Incomplete | Counts claims but never checks if they're RIGHT |
| Obsidian export | ✅ Works | Can create notes for each finding |
| AI agent helpers | ✅ Framework works | The assistant robots exist but have no real brains yet |
| Training data | ⚠️ Too small | 1,900 fake examples (need 10,000+ real ones) |
| Trained model | ⚠️ Too small | Using a 3-billion-parameter model (plan says 8-27 billion) |

---
## 2. The Core Philosophy Problem

### Blindspot CORE-1: The System Is Model-Centered, Not Evidence-Centered 🔴

**This is the single most important problem in the entire project.**

Right now, the system works like this:
1. Give text to AI
2. AI writes a summary of what it found
3. Hope the summary is correct

It should work like this:
1. AI scans the text and points to specific sentences
2. Each pointed-to sentence gets a label (Fact / Interpretation / Hypothesis)
3. You can click on any finding and see the EXACT sentence in the original paper

**Analogy**: Imagine writing a school essay. Model-centered is like saying "Shakespeare wrote about love and death" with no page references. Evidence-centered is like saying "In Act 3, Scene 1, line 88, Romeo says '...' which shows the theme of love." Teachers always want the second kind – because they can CHECK it.

**What changes in the code**: Every claim in the database needs to be a POINTER to a specific text span (page number, character position, exact quote). The AI's job is to FIND important spans, not to WRITE about what it found. The claim text should be copied directly from the paper, not generated by the AI.
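The pointer-style record described above can be sketched as a small dataclass. The field names and the `resolve` helper below are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of an evidence-centered claim record: the claim is a
# pointer into the source document, not free text written by the model.
@dataclass(frozen=True)
class ClaimSpan:
    paper_id: str
    page: int          # page the span appears on
    char_start: int    # character offset into the parsed page text
    char_end: int
    quote: str         # copied verbatim from the paper, never generated
    label: str         # "fact" | "interpretation" | "hypothesis"

def resolve(span, page_text):
    """Verify the stored quote still matches the source text it points to."""
    return page_text[span.char_start:span.char_end] == span.quote

page = "The sensor achieved a detection limit of 0.8 fM in serum."
span = ClaimSpan("paper-001", 3, 4, 57, page[4:57], "fact")
assert resolve(span, page)  # any claim can always be checked against its source
```

The payoff is that verification becomes mechanical: if the quote no longer matches the span it points to, something upstream changed and the claim is flagged rather than silently trusted.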
### Blindspot CORE-2: Design-Code Gap Is Enormous 🔴

**What's wrong**: The SYSTEM_DESIGN.md describes a sophisticated 7-layer system with ML-based parsing, constrained decoding, multi-round council debates, embedding-based deduplication, and a 4-stage training pipeline. The actual code has basic text extraction, pattern-matching classification, word-overlap deduplication, and single-stage training.

**Why it matters**: Someone reading the design document would think this is a near-complete system. Someone reading the code would see it's a prototype. This gap can mislead contributors, reviewers, and users.

**The fix**: The README and documentation should clearly mark which features are "✅ Implemented," "🚧 Skeleton exists," and "📋 Designed only." Honesty about the current state is itself an epistemic virtue – and this project is all about epistemic honesty.

### Blindspot CORE-3: No End-to-End Pipeline Has Ever Run Successfully 🔴

**What's wrong**: There's no evidence (in tests, logs, or documentation) that anyone has ever taken a real PDF → parsed it → extracted claims → deduplicated them → stored them in the graph → scored them → exported to Obsidian. Each layer has been tested in isolation, but the full pipeline has never been verified end-to-end.

**Why it matters**: Individual parts working doesn't mean the whole system works. A car where the engine works, the wheels spin, and the steering turns is still broken if they're not connected properly.

**The fix**: Create ONE end-to-end test. Take a real paper. Run it through every layer. Check the output. This single test is worth more than 100 unit tests for building confidence that the system actually does what it claims.
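The end-to-end check could start as a tiny harness like the sketch below. The five stage functions are trivial stand-ins (the real entry points and signatures may differ); the point is the contract that lets one test prove the layers actually connect:

```python
# Sketch of an end-to-end harness. Each function here is a stand-in for a
# real layer; replace the bodies with the real parser, extractor, etc.

def parse_pdf(path):          # stand-in for the Marker-based parser
    return {"sections": [{"name": "Results", "text": "LOD was 0.8 fM."}]}

def extract_claims(doc):      # stand-in for the claim extractor
    return [{"quote": "LOD was 0.8 fM.", "label": "fact", "confidence": 0.9}]

def deduplicate(claims):      # stand-in for embedding-based dedup
    return list({c["quote"]: c for c in claims}.values())

def score(claims):            # stand-in for the scoring engine
    return [{**c, "score": c["confidence"]} for c in claims]

def export_markdown(claims):  # stand-in for the Obsidian exporter
    return "\n".join(f"- {c['quote']} ({c['score']:.2f})" for c in claims)

def run_pipeline(path):
    return export_markdown(score(deduplicate(extract_claims(parse_pdf(path)))))

note = run_pipeline("real_paper.pdf")
assert "0.8 fM" in note   # the finding survives every layer intact
```

Once each stand-in is swapped for the real layer, this single assertion ("the number on page 3 appears in the exported note") is the end-to-end proof CORE-3 asks for.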
---

## 3. Data Problems – Your Training Material Is Weak

### Blindspot D-1: Training Data Is All Fake 🔴

All 1,900 training examples were generated by a Python script using templates. Real papers don't write like templates. The AI is learning to process fake text, not real science.

**Fix**: Get 100 real papers, have a human expert label them, and use those as training data.

### Blindspot D-2: No "Wrong Answer" Examples 🔴

The AI only sees correct examples. It never sees examples of almost-correct but subtly wrong outputs. It's like studying for a test using only the answer key – you won't recognize trick questions.

**Fix**: Run the current model, collect its mistakes, and use those as "here's what NOT to do" training examples (hard-negative mining).

### Blindspot D-3: No Error Categories 🟠

When the model makes a mistake, we don't know WHY. Was it a dropped unit? A missed qualifier? A wrong section? Without categorizing errors, we can't target specific weaknesses.

**The 9 error types to track**:
1. **Qualifier loss** – "may reduce" becomes "reduces"
2. **Unit drop** – "0.8 fM" becomes "0.8"
3. **Negation flip** – "did NOT work" becomes "worked"
4. **Section confusion** – Abstract claim scored as Results
5. **Number hallucination** – Paper says 0.8, AI says 0.6
6. **Granularity mismatch** – One table row treated as a paper-level conclusion
7. **Citation theft** – A finding from a cited paper treated as this paper's finding
8. **Causal overclaim** – "correlated with" becomes "caused by"
9. **Statistics loss** – p-value or sample size dropped
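The taxonomy above lends itself to automated checks. A minimal sketch, assuming a hand-picked hedge list (not the project's real lexicon), that flags the first error type:

```python
# The nine error types from the list above, as machine-readable labels.
ERROR_TYPES = (
    "qualifier_loss", "unit_drop", "negation_flip", "section_confusion",
    "number_hallucination", "granularity_mismatch", "citation_theft",
    "causal_overclaim", "statistics_loss",
)

# Illustrative hedge lexicon -- an assumption, not the project's actual list.
HEDGES = {"may", "might", "could", "suggests", "appears", "potentially"}

def detect_qualifier_loss(source_sentence, extracted_claim):
    """Flag claims that drop a hedging word present in the source sentence."""
    src = set(source_sentence.lower().split())
    out = set(extracted_claim.lower().split())
    return bool((src & HEDGES) - out)   # a hedge appeared, then vanished

assert detect_qualifier_loss("The drug may reduce inflammation.",
                             "The drug reduces inflammation.")
assert not detect_qualifier_loss("We measured an LOD of 0.8 fM.",
                                 "LOD of 0.8 fM was measured.")
```

Similar cheap detectors exist for unit drop (regex for number-plus-unit) and statistics loss (regex for p-values and n=); the harder types still need human labels.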
### Blindspot D-4: Teacher Disagreement Not Preserved 🟠

The system design talks about running 5 AI models on the same paper and keeping ALL answers. The current pipeline just generates one answer. The disagreement between multiple models IS the most valuable training signal – it teaches the student model about uncertainty.

### Blindspot D-5: No Counterfactual Examples 🟡

We never create "mirror" training examples where one key word is changed (add "not", swap a unit) to test whether the model notices. Without these, we can't tell if the model is actually reading or just pattern-matching.
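A minimal mirror generator might look like this sketch. The two perturbation rules shown (negation insertion, unit swap) are illustrative, and real mirrors would still need human review before entering the test set:

```python
import re

# Counterfactual ("mirror") generator sketch: flip one meaningful token and
# record what the correct model behavior would be for the mirrored input.
def make_mirrors(sentence):
    mirrors = []
    # Rule 1: negation flip -- "was" becomes "was not"
    if " was " in sentence:
        mirrors.append((sentence.replace(" was ", " was not ", 1),
                        "expect the tag or meaning to flip"))
    # Rule 2: unit swap -- fM to nM changes the result by six orders of magnitude
    if re.search(r"\bfM\b", sentence):
        mirrors.append((re.sub(r"\bfM\b", "nM", sentence, count=1),
                        "expect the extracted unit to change"))
    return mirrors

pairs = make_mirrors("The LOD was 0.8 fM.")
assert len(pairs) == 2
assert "was not" in pairs[0][0] and "nM" in pairs[1][0]
```

If the model gives the same output for a sentence and its mirror, it is pattern-matching, not reading – exactly the failure D-5 is meant to expose.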
### Blindspot D-6: Training Data Only Covers 5 Domains 🟡

The data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. Feed it a paper about ecology, psychology, or economics and it will hallucinate confidently. The data needs to be broader, or the system needs to refuse out-of-domain papers.

---
## 4. Reading Papers – The PDF Nightmare

### Blindspot P-1: No Smart PDF Parser 🔴

The code uses basic text scrapers (PyMuPDF, pdfplumber) that destroy document structure. The design says to use Marker (ML-based), but it's not connected. This is the #1 dependency – everything downstream depends on good parsing.

### Blindspot P-2: Tables Become Unreadable 🔴

Tables are extracted as flat pipe-separated text. The relationship between column headers and cell values is lost. "0.8" without knowing it's in the "LOD" column is meaningless.

**Fix**: Store tables as structured data with headers and rows preserved.
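A structure-preserving table record could look like the following sketch (the names are illustrative, not the project's schema). Keeping headers attached to cells is what makes "0.8" interpretable:

```python
from dataclasses import dataclass, field

# Illustrative structured-table record: headers and rows stay linked,
# so every cell keeps its meaning.
@dataclass
class ParsedTable:
    caption: str
    headers: list
    rows: list = field(default_factory=list)

    def cell(self, row, column):
        """Look a value up by header name instead of raw position."""
        return self.rows[row][self.headers.index(column)]

table = ParsedTable(
    caption="Table 2. Sensor performance.",
    headers=["Analyte", "LOD", "Linear range"],
    rows=[["miRNA-21", "0.8 fM", "1 fM - 10 nM"]],
)
assert table.cell(0, "LOD") == "0.8 fM"   # "0.8" keeps its meaning
```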
### Blindspot P-3: Figures Are Completely Skipped 🟠

The parser detects images and stores "[Image detected – requires VLM processing]". No analysis happens. 30-40% of the key evidence in science papers is in figures.

### Blindspot P-4: Section Detection Is Brittle 🟠

Sections are detected by looking for the word "Abstract" or "Results" at the start of a line. Papers that number their sections ("2.1 Experimental Setup"), use non-standard names, or combine sections ("Results and Discussion") are misidentified. Since section identity directly affects confidence scores, this causes systematic scoring errors.
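A more tolerant matcher is cheap to add even before an ML layout model exists. The pattern below is an illustrative sketch that accepts numbered and combined headings; it widens the brittle keyword check but is not a full fix:

```python
import re

# Illustrative header matcher: optional numbering ("2." / "2.1 ") followed by
# a known section name, including combined forms like "Results and Discussion".
SECTION_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"          # optional section numbering
    r"(abstract|introduction|methods?|materials and methods|"
    r"experimental(?: setup)?|results(?: and discussion)?|"
    r"discussion|conclusions?)\b",
    re.IGNORECASE,
)

def detect_section(line):
    """Return a normalized section name, or None if the line is body text."""
    m = SECTION_RE.match(line)
    return m.group(1).lower() if m else None

assert detect_section("2.1 Experimental Setup") == "experimental setup"
assert detect_section("Results and Discussion") == "results and discussion"
assert detect_section("The results were promising.") is None
```

Note the third assertion: anchoring at the start of the line keeps "results" inside a sentence from being misread as a heading.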
### Blindspot P-5: Equations Become Garbage 🟠

Mathematical equations are not handled. LaTeX, inline math, and special symbols are either garbled or dropped. For papers in physics, chemistry, or engineering, the equations ARE the findings.

### Blindspot P-6: Supplements Can't Be Linked to Main Papers 🟠

The parser processes one file at a time. There's no way to say "this supplement belongs to that main paper." In biology and chemistry, the best data is often in supplements.

### Blindspot P-7: No Language Detection 🟡

If someone uploads a Chinese or Japanese paper, the system produces English-looking garbage with high confidence. It should detect the language and either refuse or flag reduced confidence.

### Blindspot P-8: Cross-References Are Detected But Never Verified 🟡

The code finds "see Table 2" in the text but never checks whether the parsed Table 2 is actually Table 2. If tables are mislabeled, every claim referencing them has wrong evidence.

### Blindspot P-9: No File Safety Checks 🟠

The parser accepts any file, with no checks for malicious PDFs (they exist), extremely large files, or corrupted data. A 500MB PDF would crash the system.

### Blindspot P-10: Chunking Ignores Table/Caption Integrity 🟡

The section-aware chunker in `parser.py` merges consecutive body-text regions, but tables and their captions can be split across chunks. A table in one chunk and its caption in another means the AI sees numbers without context.

---
## 5. The Brain – Your AI Is Underpowered

### Blindspot B-1: Model Is Too Small 🔴

Qwen2.5-3B (current) is like a bright middle schooler – it can follow instructions but doesn't deeply understand scientific reasoning. The design says to upgrade to Qwen3-8B or Qwen3.5-27B. This hasn't happened.

### Blindspot B-2: The Fake Brain Is the Default 🔴

When no real AI is connected (which is the default), a `_extract_mock()` function runs that classifies claims by keyword matching: any sentence with "measured" → Fact, any with "may" → Hypothesis. This is wildly inaccurate.

Example failures:
- "We measured nothing significant" → Mock says Fact → actually a null result
- "The most reliable technique measured to date may revolutionize the field" → Mock says both Fact AND Hypothesis

### Blindspot B-3: No Output Guarantees 🟠

The system asks the AI to output JSON and hopes for the best. Without constrained decoding (Guidance library), the model can produce broken JSON, invalid tags, or mixed text-and-JSON output.
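Until constrained decoding is wired in, a validate-and-reject wrapper at least keeps broken output out of the database. This is a stopgap sketch (the schema keys are assumptions), not a substitute for Guidance-style generation, which prevents bad output instead of merely catching it:

```python
import json

REQUIRED_KEYS = {"quote", "label", "confidence"}        # assumed claim schema
VALID_LABELS = {"fact", "interpretation", "hypothesis"}

def parse_model_output(raw):
    """Return the claim list only if it is well-formed; otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # mixed text-and-JSON, truncation, etc.
    if not isinstance(data, list):
        return None
    for claim in data:
        if not isinstance(claim, dict) or not REQUIRED_KEYS <= set(claim):
            return None                  # missing a required field
        if claim["label"] not in VALID_LABELS:
            return None                  # invalid tag
        if not (isinstance(claim["confidence"], (int, float))
                and 0.0 <= claim["confidence"] <= 1.0):
            return None                  # confidence out of range
    return data

good = '[{"quote": "LOD was 0.8 fM.", "label": "fact", "confidence": 0.9}]'
assert parse_model_output(good) is not None
assert parse_model_output("Sure! Here is the JSON: [...]") is None
```

A rejected output would trigger a retry (or a human flag), so malformed generations never reach the knowledge graph.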
### Blindspot B-4: One Brain for Six Different Jobs 🟠

One model does claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want opposite behaviors (extraction wants to catch everything; qualifier detection wants surgical precision).

**Fix**: A shared base model with specialized output heads for the different tasks.

### Blindspot B-5: No Scientific Vocabulary Training 🟠

The base model was trained on general internet text. It doesn't deeply know what "LOD," "analyte," "p-value," or "cyclic voltammetry" mean. It can extract the words but might misclassify their importance.

### Blindspot B-6: Can't Detect "I Don't Know This Field" 🟠

No out-of-distribution detection. Feed it a sociology paper and it produces confident-looking scientific claims that are nonsense. The model should flag when content is too different from its training data.

### Blindspot B-7: No Verification of Claims Against Source 🟠

The design describes a "chain-of-verification" approach where a second model re-reads the source and checks whether the extracted claim is actually supported. This isn't built. The system trusts its first extraction with no double-checking.

---
## 6. Memory – How the System Remembers

### Blindspot M-1: Deduplication Uses Word Overlap, Not Meaning 🔴

Two claims that say the same thing in different words are treated as different findings. "The LOD was 0.8 fM" and "A detection limit of 0.8 femtomolar was achieved" would NOT be recognized as duplicates because they share few words.

**Fix**: Use an embedding model that converts sentences into number vectors. Similar meanings = similar vectors, regardless of word choice.
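The embedding fix can be prototyped with plain cosine similarity. In the sketch below, `embed` is a stand-in for a real sentence-embedding model (e.g. a sentence-transformers model); tiny hand-made vectors are used so the clustering logic is visible and testable offline:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def dedup(claims, embed, threshold=0.85):
    """Keep the first claim of every meaning-cluster."""
    kept, vectors = [], []
    for claim in claims:
        v = embed(claim)
        if all(cosine(v, seen) < threshold for seen in vectors):
            kept.append(claim)
            vectors.append(v)
    return kept

fake_vectors = {   # pretend embeddings: the two paraphrases point the same way
    "The LOD was 0.8 fM": [0.9, 0.1, 0.0],
    "A detection limit of 0.8 femtomolar was achieved": [0.88, 0.14, 0.02],
    "The sensor works in whole blood": [0.1, 0.2, 0.95],
}
unique = dedup(list(fake_vectors), fake_vectors.get)
assert len(unique) == 2   # the two paraphrases collapse into one
```

Swapping `fake_vectors.get` for a real model's encode function gives paraphrase-aware deduplication; the 0.85 threshold is a guess that should be tuned on the gold standard.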
### Blindspot M-2: No Embedding Model Exists in the Code 🟠

The design describes embedding-based similarity. The code imports no embedding model. No sentence-transformers, no ChromaDB, nothing. The entire deduplication and conflict detection system relies on word counting.

### Blindspot M-3: Conflict Detection Is Limited to 500 Claims 🟠

The `find_conflicts()` method only checks the top 500 claims by confidence. If important claim #501 contradicts claim #10, it'll never be found. And it uses the same word-overlap approach, so semantically similar but differently worded contradictions are invisible.
### Blindspot M-4: No Concept of Time 🟡

A finding from 2015 and a finding from 2025 are weighted equally. But the 2025 paper might specifically disprove the 2015 one. The graph needs temporal reasoning.

### Blindspot M-5: Gap Analysis Is Too Narrow 🟡

The gap-finding algorithm only looks for missing connections between same-type nodes. But the most interesting gaps are between different types – like a method that's never been applied to a particular material.

### Blindspot M-6: No Retraction Checking 🟠

If a paper gets retracted (pulled back because it was wrong or fraudulent), the system doesn't know. Claims from retracted papers sit in the knowledge graph with full confidence.

**Fix**: Check CrossRef/Retraction Watch before ingesting a paper. If a paper in the database gets retracted later, propagate that information to all its claims.

### Blindspot M-7: Obsidian Export Doesn't Include Graph Relationships 🟡

The Obsidian exporter creates notes for claims, conflicts, and goals, but doesn't export the actual graph edges (supports/refutes/extends). A researcher looking at the vault sees individual findings but can't see how they connect.

---
## 7. Scoring – Can You Trust the Numbers?

### Blindspot S-1: All Qualifiers Get the Same Penalty 🟠

Every qualifier reduces confidence by exactly 0.1. But "may" (very uncertain) and "under controlled laboratory conditions" (limits scope but is quite certain) are treated identically.

**Fix**: Weighted qualifier types – strong hedges like "may" get -0.20, scope limiters like "in vitro" get -0.05.
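A weighted scheme is a small change to the scorer. The weights below are illustrative guesses to be calibrated against the gold standard, not proposed final values:

```python
# Illustrative qualifier weights -- to be calibrated, not final values.
QUALIFIER_PENALTY = {
    "may": 0.20, "might": 0.20, "could": 0.15,             # strong hedges
    "suggests": 0.10, "appears": 0.10,                     # moderate hedges
    "in vitro": 0.05, "under laboratory conditions": 0.05, # scope limiters
}

def qualifier_score(claim_text, base=1.0):
    """Subtract a type-specific penalty for each qualifier found."""
    text = claim_text.lower()
    penalty = sum(p for q, p in QUALIFIER_PENALTY.items() if q in text)
    return max(0.0, base - penalty)

# "may" now costs four times what a scope limiter costs, instead of the
# same flat 0.1 for everything:
assert round(qualifier_score("The drug may reduce inflammation"), 2) == 0.8
assert round(qualifier_score("Effective in vitro"), 2) == 0.95
```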
### Blindspot S-2: Only One Calibration Number 🟠

The system tracks one Brier score for overall calibration. But the model might be perfectly calibrated on "did I find a real claim?" and terribly calibrated on "are these two claims contradictory?" One number hides this.

**Fix**: Track 6 separate calibration curves: extraction confidence, qualifier confidence, statistical confidence, conflict confidence, section confidence, provenance confidence.

### Blindspot S-3: Design Features Missing from Code 🟡

The SYSTEM_DESIGN.md describes:
- Source count bonus (+0.2 for claims backed by multiple papers) – NOT in scorer.py
- Conflict penalty (-0.3 for claims with active conflicts) – NOT in scorer.py
- Corroboration bonus – NOT in scorer.py

The scorer only uses single-claim information. It ignores the knowledge graph entirely.
### Blindspot S-4: Average Instead of Minimum for Composite Score 🟡

Composite confidence = the average of three scores. This means terrific evidence + terrible qualifier strength = a mediocre composite. A chain is only as strong as its weakest link – use the minimum, not the average.
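The difference is easy to see with three illustrative factor values:

```python
# Why the weakest-link rule matters (numbers are illustrative): strong
# evidence cannot rescue a heavily hedged claim.
evidence, qualifier, section = 0.95, 0.30, 0.90

average = (evidence + qualifier + section) / 3
weakest_link = min(evidence, qualifier, section)

assert round(average, 2) == 0.72   # looks respectable
assert weakest_link == 0.30        # tells the honest story
```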
### Blindspot S-5: Parser Confidence Isn't Actually Meaningful 🟡

The `score_parse_quality()` function in parser.py assigns quality scores, but the thresholds are arbitrary. What does "garbled_ratio × 3000" actually correspond to? These numbers were chosen by gut feeling, not calibrated against human judgments of parse quality.

---
## 8. Testing – How Do You Know It Works?

### Blindspot T-1: No Gold Standard Test Set 🔴

The evaluation counts claims and checks distributions but never compares against human-labeled ground truth. The system could produce 500 completely wrong claims and report "all metrics look normal."

**Fix**: 10 expert-labeled papers where every claim, tag, qualifier, and conflict is manually annotated. This is the #1 requirement for any ML system – you can't improve what you can't measure.

### Blindspot T-2: No Paper-Level Test Splits 🔴

Train and test sets are randomly shuffled, so claims from the same paper can land in both. The model might memorize patterns per paper rather than learning to generalize.

**Fix**: Split by paper, lab, journal, year, and field. Test on genuinely unseen papers.
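A leakage-free split is only a few lines of code: group claims by paper first, then split the papers. A minimal sketch, assuming each claim carries a `paper_id` field (the same idea extends to lab, journal, year, and field):

```python
import random

def split_by_paper(claims, test_fraction=0.2, seed=0):
    """Split so that no paper contributes claims to both train and test."""
    papers = sorted({c["paper_id"] for c in claims})
    random.Random(seed).shuffle(papers)
    n_test = max(1, int(len(papers) * test_fraction))
    test_papers = set(papers[:n_test])
    train = [c for c in claims if c["paper_id"] not in test_papers]
    test = [c for c in claims if c["paper_id"] in test_papers]
    return train, test

# 100 claims across 10 papers: the split keeps whole papers together.
claims = [{"paper_id": f"p{i // 10}", "text": f"claim {i}"} for i in range(100)]
train, test = split_by_paper(claims)
assert not {c["paper_id"] for c in train} & {c["paper_id"] for c in test}
```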
### Blindspot T-3: No Counterfactual Tests 🟠

We never check whether the model changes its answer for the right reason. Remove a table header – does the claim become Incomplete? Flip "significant" to "not significant" – does the tag change? Without these tests, the model might use shortcuts instead of understanding.

### Blindspot T-4: No Stochastic Testing 🟡

LLMs give different answers each time. Running the evaluation once gives one number that might change by ±5% next time. Run evaluations 5 times and report the range.

### Blindspot T-5: Regression Gate Has No Teeth 🟡

The `run_regression_gate()` returns pass/fail, but nothing blocks a bad model from being used. It's like a fire alarm with no fire department.

### Blindspot T-6: No Inter-Annotator Agreement Tracking 🟡

When multiple people label the gold standard, they'll disagree on some claims. That disagreement is important information – it tells you which categories are genuinely ambiguous. The system has no mechanism to measure or track this.

### Blindspot T-7: 143 Tests, Zero of Them Test Science Quality 🟠

All 143 tests verify that the CODE runs correctly (functions don't crash, data is stored properly). Zero tests verify that the SCIENCE is right (extracted claims match what the paper actually says). Code tests and science tests are completely different things.

---
## 9. Training – Teaching the AI Properly

### Blindspot TR-1: Only Stage 1 of 4 Is Built 🔴

The 4-stage pipeline: SFT → DPO → GRPO → ConfTuner. Only SFT exists. This is like building a race car but only installing first gear.

- **SFT** (Stage 1): "Here's the right answer" → ~70% quality
- **DPO** (Stage 2): "This answer is better than that one" → ~80% quality
- **GRPO** (Stage 3): "Reward for good JSON, correct tags, preserved qualifiers" → ~85-90% quality
- **ConfTuner** (Stage 4): "Make your confidence numbers match reality" → calibrated confidence

### Blindspot TR-2: ZeroGPU Micro-Batching Breaks Learning 🟠

The training Space uses 60-second GPU bursts. The learning rate resets, the model reloads, and there's no training momentum. It's like studying in 2-minute bursts with naps between each one.

**Fix**: One continuous training job on a proper GPU (HF Jobs or local).

### Blindspot TR-3: No Curriculum Learning 🟡

Easy and hard examples are randomly mixed. Humans learn better easy-to-hard. The model should first learn single facts, then fact vs. interpretation, then multi-claim papers, then contradictions.

### Blindspot TR-4: Loss Number Doesn't Mean Task Success 🟡

Training tracks eval_loss, which tells you if the model is generating likely-looking text. It does NOT tell you if the JSON is valid, tags are correct, qualifiers are preserved, or numbers are accurate. Task-specific evaluation during training is essential.

### Blindspot TR-5: No Training on Real Model Failures 🟠

Hard-negative mining (collecting the model's actual mistakes and training on them) is the single most efficient way to improve quality. It's described in earlier planning notes but not implemented anywhere. You need to:
1. Run the current model on 100 papers
2. Manually check which extractions are wrong
3. Categorize WHY they're wrong (using the 9 error types)
4. Train specifically on those failure cases

---
## 10. Teamwork – Multiple Models Working Together

### Blindspot W-1: Council Is Sequential, Not Debating 🟠

The design describes a multi-round debate (Extractor → Critic → Chairman, with hidden confidence and revealed reasoning). The code is sequential: one call per member, no back-and-forth. The disagreement signal (which IS the whole point) is lost.

### Blindspot W-2: No Router 🟠

Everything goes to the same model. Text, tables, figures, equations – all handled the same way. A router should decide: regular text → language model, table → specialized parser, figure → vision model, garbage → skip.

### Blindspot W-3: System Can Never Say "I Don't Know" 🟠

Every input gets a confident-looking answer. There's no mechanism to say "I can't read this page," "I don't understand this field," or "I need a human to look at this." Confident garbage is the worst failure mode.

### Blindspot W-4: Teachers Are Treated As Oracles 🟠

The design proposes using 5 frontier AI models as teachers. But these models share biases – they were all trained on similar internet text, they all struggle with the same types of negation, and they can all be wrong in the same direction.

**Fix**: Include at least one NON-AI anchor for each data type: a deterministic table parser for tables, regex for statistics, the CrossRef API for citations. Where the rules-based tool disagrees with all 5 AI models, investigate – the AIs might be harmonizing around an error.

---
## 11. Staying Honest Over Time

### Blindspot H-1: No Drift Detection 🟠

If the model slowly gets worse (because it encounters new sub-fields or vocabulary evolves), nobody notices. Weekly re-evaluation against the gold standard is needed.

### Blindspot H-2: Human Corrections Disappear 🟠

When a researcher fixes a wrong tag or adds a missing qualifier in the UI, that correction isn't saved for training. Those corrections are the most valuable data you can get – each one is an expert-labeled example from your exact domain.

### Blindspot H-3: Taxonomy Changes Break Old Scores 🟡

If study type weights change (e.g., "in_vitro" goes from 0.85 to 0.80), old claims aren't re-scored. Claims from different taxonomy versions can't be compared meaningfully.

### Blindspot H-4: No Backup Strategy 🟡

Everything is in one SQLite file. No automatic backups, no periodic snapshots, no way to recover from corruption or accidental deletion.

### Blindspot H-5: No Retraction Monitoring 🟠

Once a paper is ingested, its claims live in the graph forever – even if the paper is retracted (pulled back for fraud or errors). The system needs to periodically check for retractions and propagate that information.

### Blindspot H-6: v1 → v2 Migration Path Undefined 🟡

The repo has both `phd_research_os/` (v1) and `phd_research_os_v2/` (v2) with different database schemas. There's no migration script for anyone who has data in v1 format. Users could lose work when upgrading.

---
## 12. The Human Experience

### Blindspot HU-1: Information Overload 🟠

100 papers → 3,000+ claims. The UI treats them all equally. Researchers need priority ranking: most surprising findings first, then contradictions, then gaps.

### Blindspot HU-2: Scores Are Unexplained Numbers 🟡

"Confidence: 0.723" means nothing without a breakdown. Show the multiplication chain: evidence × quality × tier × section × completeness = 0.723.
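One way to make the number explainable is to store the factors, not just the product, so any score can always be unfolded into its multiplication chain. A sketch with illustrative factor names and values (not the scorer's actual ones):

```python
# Store the factors alongside the composite so the UI can always explain it.
factors = {   # illustrative names and values
    "evidence": 0.95, "parse_quality": 0.95, "study_tier": 0.90,
    "section": 0.95, "completeness": 0.93,
}

composite = 1.0
for value in factors.values():
    composite *= value   # same multiplication chain the scorer would use

explanation = " x ".join(f"{k}={v:.2f}" for k, v in factors.items())
print(f"Confidence: {composite:.3f}  ({explanation})")
```

Instead of a bare "0.723", the researcher sees which factor dragged the score down and can judge whether they agree with it.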
### Blindspot HU-3: Risk of Blind Trust 🟡

Once a researcher trusts the system, they stop checking. Built-in friction (hiding scores randomly, asking "do you agree?", tracking override rates) prevents over-reliance.

### Blindspot HU-4: No "Fresh User" Onboarding 🟡

A new graduate student encountering the system for the first time sees a complex 7-layer architecture with 87 blindspots. There's no tutorial, no "start here" guide, no simplified workflow for beginners.

**Fix**: A "Getting Started" mode that processes one paper and walks the user through each step: "Here's what the parser found → Here's what the AI extracted → Here's how we scored it → Here's what we're not sure about."

### Blindspot HU-5: No Accessibility Considerations 🟡

The UI has no dark mode, no font size options, no screen reader support. Research tools should be accessible to everyone.

---
## 13. Security and Safety

### Blindspot SEC-1: No Input Validation 🟠

No file size limits, no malicious-PDF detection, no format verification. A bad file could crash or compromise the system.

### Blindspot SEC-2: Risky SQL Pattern 🟡

The `get_stats()` function uses f-strings for table names. This is currently safe because the names are hardcoded, but it's a bad pattern that could become dangerous if someone modifies the code.
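The safe pattern: values always go through `?` placeholders, and identifiers (which SQL cannot parameterize) are checked against an allowlist. A sketch using the standard-library `sqlite3` module, with illustrative table names:

```python
import sqlite3

ALLOWED_TABLES = {"claims", "papers", "conflicts"}   # illustrative names

def count_rows(conn, table):
    if table not in ALLOWED_TABLES:        # identifiers can't be parameterized,
        raise ValueError(f"unknown table: {table}")   # so allowlist them
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id INTEGER PRIMARY KEY, text TEXT)")
# Values go through "?" placeholders, never through f-strings:
conn.execute("INSERT INTO claims (text) VALUES (?)", ("LOD was 0.8 fM",))
assert count_rows(conn, "claims") == 1
```

With the allowlist in place, a future refactor that wires `get_stats()` to user input fails loudly instead of becoming an injection point.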
### Blindspot SEC-3: No API Cost Controls 🟡

No rate limiting on external API calls. Processing 1,000 papers could generate thousands of expensive API requests with no spending cap.

### Blindspot SEC-4: No Data Privacy for Unpublished Work 🟡

The design describes an "Epistemic Embargo" (private graphs for unpublished research), but it's not implemented. A researcher analyzing their unpublished data has no privacy guarantees.

---
## 14. The Companion Agent System

### Blindspot A-1: Agents Have No Real Brain 🟠

The companion agents (DataQualityAuditor, PromptOptimizer, etc.) are fully coded with lifecycle management, audit trails, and proposal systems. But when no AI model is connected (the default), they generate placeholder proposals that say "Brain not configured." They're robots with no intelligence.

### Blindspot A-2: Agent Plans Are Hardcoded, Not Dynamic 🟡

The `_plan()` method assigns fixed step lists based on agent type. A DataQualityAuditor always gets the same 4 steps regardless of what task it's given. Real planning would look at the task, check what data is available, and decide what to do.

### Blindspot A-3: No Agent Coordination 🟡

Multiple agents can run simultaneously, but they can't talk to each other. If the DataQualityAuditor finds a problem and the PromptOptimizer could fix it, there's no mechanism for them to coordinate.

### Blindspot A-4: Proposal Review Has No Urgency System 🟡

All proposals wait equally for human review. But some (like "this paper was retracted – its claims should be removed") are urgent, while others (like "consider adding training examples for this domain") are routine. No priority system exists.

---
| ## 15. What To Build First |
|
|
| Everything has dependencies. You can't build the roof before the walls. Here's the order that makes engineering sense: |
|
|
| ### ποΈ Foundation Phase (Weeks 1-4) β Nothing works without this |
| | # | Task | Why First | |
| |---|------|-----------| |
| | 1 | Integrate Marker PDF parser | Every downstream layer depends on accurate parsing | |
| | 2 | Create gold standard test set (10 real papers, expert-labeled) | Can't measure improvement without ground truth | |
| | 3 | Add embedding model (sentence-transformers) | Needed for smart deduplication and conflict detection | |
| | 4 | Run one end-to-end pipeline test | Prove the layers actually connect | |
|
|
| ### π§ Reliability Phase (Weeks 5-8) β Make it work correctly |
| | # | Task | Why Next | |
| |---|------|----------| |
| | 5 | Connect real AI model (replace mock extractor) | The system needs a real brain | |
| | 6 | Add constrained decoding (Guidance library) | Guarantee valid JSON output | |
| | 7 | Build continuous training pipeline (replace ZeroGPU micro-batching) | Proper training produces proper models | |
| | 8 | Implement weighted qualifier penalties | Not all hedging words are equal | |
|
|
| ### π§ Intelligence Phase (Weeks 9-16) β Make it smart |
| | # | Task | Why This Order | |
| |---|------|----------------| |
| | 9 | Expand training data to 10K+ with hard negatives | More and better data = better model | |
| | 10 | Implement DPO training (preference learning) | Stage 2 of training pipeline | |
| | 11 | Implement GRPO training (reward functions for JSON quality, tag correctness, qualifier preservation) | Stage 3 β the biggest quality jump | |
| | 12 | Add out-of-distribution detection | Know when to say "I don't know" | |
|
|
| ### β Trust Phase (Weeks 17-24) β Make it honest |
| | # | Task | Why Here | |
| |---|------|----------| |
| | 13 | Build counterfactual evaluation suite | Test that the model reasons correctly, not just accurately | |
| | 14 | Add paper-level evaluation splits | Honest accuracy numbers | |
| | 15 | Implement human feedback loop | Every correction becomes training data | |
| | 16 | Add multi-dimensional calibration (6 Brier scores) | Know which confidence types are trustworthy | |
| | 17 | Add drift detection and monitoring | Catch problems before they become crises | |
|
|
| ### π Scale Phase (Weeks 25-32) β Make it complete |
| | # | Task | Why Last | |
| |---|------|----------| |
| | 18 | Add figure/table specialist models | Handle the 30-40% of evidence in images | |
| | 19 | Build content router | Right model for right content type | |
| | 20 | Implement supplement handling (paper bundles) | Complete paper coverage | |
| | 21 | Add retraction checking | Keep the knowledge graph honest | |
| | 22 | Build verification pipeline (double-check claims against source) | Reduce hallucinations dramatically | |
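
Task 22's cheapest layer can be prototyped directly: before a claim enters the knowledge graph, confirm its supporting quote actually occurs in the parsed source text. This is a sketch under the assumption that claims carry a verbatim quote field; a real pipeline would add fuzzy matching for OCR noise.

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and case so PDF line breaks don't break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_is_grounded(claim_quote: str, source_text: str) -> bool:
    """Reject claims whose supporting quote never appears in the source --
    the cheapest possible hallucination check."""
    return normalize(claim_quote) in normalize(source_text)
```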
|
|
| --- |
|
|
| ## 16. Master Blindspot Table |
|
|
| | ID | Category | Severity | Problem (one sentence) | |
| |----|----------|----------|----------------------| |
| | CORE-1 | Philosophy | π΄ | System generates summaries instead of pointing to evidence spans | |
| | CORE-2 | Philosophy | π΄ | Design document describes features the code doesn't have | |
| | CORE-3 | Philosophy | π΄ | End-to-end pipeline has never been tested on a real paper | |
| | D-1 | Data | π΄ | All training data is computer-generated, not from real papers | |
| | D-2 | Data | π΄ | No examples of what wrong output looks like | |
| | D-3 | Data | π | Errors aren't categorized by type | |
| | D-4 | Data | π | Multiple-teacher disagreement signal is thrown away | |
| | D-5 | Data | π‘ | No counterfactual (mirror) training examples | |
| | D-6 | Data | π‘ | Training covers only 5 scientific fields | |
| | P-1 | Parser | π΄ | No ML-based PDF parser is connected | |
| | P-2 | Parser | π΄ | Tables lose their header-value relationships | |
| | P-3 | Parser | π | Figures are detected but never analyzed | |
| | P-4 | Parser | π | Section headers are identified by keyword matching only | |
| | P-5 | Parser | π | Equations are garbled or dropped | |
| | P-6 | Parser | π | Supplementary files can't be linked to main papers | |
| | P-7 | Parser | π‘ | Non-English papers produce garbage silently | |
| P-8 | Parser | π‘ | "See Table 2" references are found but never verified as correct |
| | P-9 | Parser | π | No file size/safety checks before processing | |
| | P-10 | Parser | π‘ | Tables and their captions can be split across chunks | |
| | B-1 | Brain | π΄ | Model is 3B parameters (design says 8-27B) | |
| | B-2 | Brain | π΄ | Default mode uses keyword matching instead of AI | |
| | B-3 | Brain | π | Output format is not guaranteed (no constrained decoding) | |
| | B-4 | Brain | π | Single model does 6 different tasks that want opposite behaviors | |
| | B-5 | Brain | π | No training on scientific vocabulary specifically | |
| | B-6 | Brain | π | Can't detect when content is outside its training domain | |
| | B-7 | Brain | π | No second-pass verification of extracted claims | |
| | M-1 | Memory | π΄ | Deduplication checks word overlap, not meaning | |
| | M-2 | Memory | π | No embedding model exists anywhere in the code | |
| | M-3 | Memory | π | Conflict detection only checks 500 claims using word overlap | |
| | M-4 | Memory | π‘ | Knowledge graph has no concept of time | |
| | M-5 | Memory | π‘ | Gap analysis only looks within same node types | |
| | M-6 | Memory | π | Retracted papers aren't detected or flagged | |
| | M-7 | Memory | π‘ | Obsidian export doesn't include graph edges | |
| | S-1 | Scoring | π | All qualifier types penalized equally | |
| | S-2 | Scoring | π | One calibration number hides per-task calibration failures | |
| | S-3 | Scoring | π‘ | Design features (source bonus, conflict penalty) not in code | |
| S-4 | Scoring | π‘ | Composite score is an average (should be the minimum, so the weakest dimension sets the score) |
| | S-5 | Scoring | π‘ | Parse quality scores are arbitrary, not calibrated | |
| | T-1 | Testing | π΄ | No human-labeled gold standard test set | |
| | T-2 | Testing | π΄ | Train/test split is random, not paper-level | |
| | T-3 | Testing | π | No counterfactual robustness tests | |
| T-4 | Testing | π‘ | Evaluations run once (should run 5× to check consistency) |
| | T-5 | Testing | π‘ | Regression gate returns pass/fail but nothing enforces it | |
| | T-6 | Testing | π‘ | No tracking of human annotator agreement | |
| | T-7 | Testing | π | 143 code tests, zero science quality tests | |
| | TR-1 | Training | π΄ | Only 1 of 4 training stages exists | |
| | TR-2 | Training | π | ZeroGPU micro-batching breaks learning continuity | |
| TR-3 | Training | π‘ | Examples aren't ordered easy → hard |
| | TR-4 | Training | π‘ | Training eval is loss-based, not task-based | |
| | TR-5 | Training | π | No training on actual model failures | |
| | W-1 | Teamwork | π | Council members don't actually debate | |
| | W-2 | Teamwork | π | No router to send content to appropriate model | |
| | W-3 | Teamwork | π | System can never say "I don't know" | |
| | W-4 | Teamwork | π | Teacher ensemble assumed unbiased (they share biases) | |
| | H-1 | Longevity | π | No automatic detection of model performance degradation | |
| | H-2 | Longevity | π | Human corrections aren't saved for future training | |
| | H-3 | Longevity | π‘ | Taxonomy changes don't trigger re-scoring of old claims | |
| | H-4 | Longevity | π‘ | No database backup automation | |
| | H-5 | Longevity | π | No retraction monitoring for ingested papers | |
| | H-6 | Longevity | π‘ | No migration path from v1 to v2 database schema | |
| | HU-1 | Human | π | 3,000 claims displayed with no priority ranking | |
| | HU-2 | Human | π‘ | Confidence scores show number but no breakdown | |
| | HU-3 | Human | π‘ | No safeguards against over-trusting the system | |
| | HU-4 | Human | π‘ | No beginner-friendly onboarding experience | |
| | HU-5 | Human | π‘ | No accessibility features (dark mode, screen reader, font size) | |
| | SEC-1 | Security | π | No file validation or safety checks | |
| | SEC-2 | Security | π‘ | SQL construction uses f-strings (risky pattern) | |
| | SEC-3 | Security | π‘ | No spending limits on API calls | |
| | SEC-4 | Security | π‘ | No privacy controls for unpublished research | |
| | A-1 | Agents | π | Companion agents have no AI brain connected | |
| | A-2 | Agents | π‘ | Agent plans are fixed templates, not dynamic | |
| | A-3 | Agents | π‘ | Multiple agents can't coordinate with each other | |
| | A-4 | Agents | π‘ | No urgency system for proposal review | |
|
|
| ### Summary by Severity |
|
|
| | Severity | Count | Meaning | |
| |----------|-------|---------| |
| | π΄ Critical | 14 | System fundamentally broken without this | |
| | π High | 33 | Significant quality or reliability problem | |
| | π‘ Medium | 31 | Important but not blocking | |
| | **Total** | **78** | | |
|
|
| ### Summary by Category |
|
|
| | Category | Count | Most Critical Issue | |
| |----------|-------|-------------------| |
| | Philosophy | 3 | Evidence-centered vs model-centered | |
| | Data | 6 | All training data is synthetic | |
| | Parser | 10 | No ML-based parser connected | |
| | Brain | 7 | Model too small + mock is default | |
| | Memory | 7 | Deduplication uses word counting | |
| | Scoring | 5 | All qualifiers penalized equally | |
| | Testing | 7 | No gold standard test set | |
| | Training | 5 | Only stage 1 of 4 built | |
| | Teamwork | 4 | Council doesn't debate + no router | |
| | Longevity | 6 | No drift detection | |
| | Human | 5 | Information overload | |
| | Security | 4 | No input validation | |
| | Agents | 4 | Agents have no brain | |
|
|
| --- |
|
|
| ## How This Document Was Created |
|
|
| 1. **Read every file** in the repository (60+ files, 545KB of code and documentation) |
| 2. **Compared code against design** β for each feature in SYSTEM_DESIGN.md, checked if it exists in the code |
| 3. **Incorporated 87 original blindspots** from BLINDSPOT_AUDIT_COMPLETE.md |
| 4. **Incorporated 12 architectural improvements** from previous expert review session |
| 5. **Wrote first draft** with 47 blindspots |
| 6. **Re-read with fresh eyes** and found 31 additional blindspots |
| 7. **Deduplicated and organized** into 78 unique findings across 14 categories |
| 8. **Rewrote everything** in high-school-readable language |
| |
| ### Relationship to Other Documents |
| |
| | Document | What It Contains | How This One Is Different | |
| |----------|-----------------|--------------------------| |
| | BLINDSPOT_AUDIT_COMPLETE.md | 87 theoretical failure modes found by adversarial critique | Theoretical β found by thinking about what COULD go wrong | |
| | SYSTEM_DESIGN.md | The dream architecture β what the system SHOULD be | Aspirational β describes the finished product | |
| | **This document** | **78 practical problems found by reading the actual code** | **Practical β found by comparing code to design** | |
|
|
| The key difference: the audit found problems by THINKING. This document found problems by READING THE CODE. Many overlap, but this document catches things the audit missed (like the mock extractor being the default, or the embedding model not existing), and skips things the audit included that are already partially addressed in code. |
|
|
| --- |
|
|
| *This is the compiled final edition. Each finding is grounded in specific code files and can be verified by reading the source.* |
|
|
| --- |
|
|
| ## Appendix: New Improvements from Prior Art Analysis (2026-04-23) |
|
|
| **Source**: Comprehensive analysis of 15 similar systems ([PRIOR_ART_ANALYSIS.md](PRIOR_ART_ANALYSIS.md), [SYSTEM_INSPIRATIONS.md](SYSTEM_INSPIRATIONS.md)) |
|
|
| After analyzing every system that does something similar to PhD Research OS, we found 16 concrete improvements to adopt, adapt, or build. These are ON TOP of the 78 blindspots already documented above. Each one comes from a real, published, working system. |
|
|
| ### New Blindspots Discovered from Prior Art |
|
|
| | ID | Category | Severity | Problem | Source System | |
| |----|----------|----------|---------|---------------| |
| PA-1 | Memory | π΄ | No scientific embedding model for dedup (word overlap misses paraphrases: "0.8 fM" and "800 attomolar" share no words, so they look unrelated to our system, but they're the same quantity) | SPECTER2 (AllenAI) proves embeddings fix this |
| | PA-2 | Testing | π΄ | No standard benchmark for claim verification quality (we can't compare ourselves to anyone else) | SciFact (AllenAI) is the industry standard we should measure against | |
| PA-3 | Data | π΄ | 1,900 synthetic examples vs 137,000 expert-written ones available for free (72× gap) | SciRIFF (AllenAI) is sitting on HuggingFace waiting to be used |
| | PA-4 | Brain | π | No pre-filtering before expensive extraction (AI Council processes boilerplate text and acknowledgments as seriously as Results) | PaperQA2's RCS technique filters noise before analysis | |
| | PA-5 | Brain | π | No code-based epistemic validation (AI classification has no deterministic cross-check) | KGX3's language-game filters provide rule-based validation | |
| | PA-6 | Brain | π | No completeness audit after extraction (system could silently miss the most important finding) | Paper Circle's Coverage Checker catches omissions | |
| | PA-7 | Brain | π | System never says "I don't know" (every input gets a confident-looking answer, even garbage) | PaperQA2 refuses to answer when evidence is insufficient | |
| PA-8 | Memory | π | Contradiction detection has no fast pre-filter (checking all 500K pairs requires expensive LLM calls) | SciBERT-NLI can pre-screen in seconds, not hours |
| | PA-9 | Memory | π | No cross-reference verification between paper claims and knowledge graph (claims checked in isolation) | FactReview checks against both the paper AND external literature | |
| | PA-10 | Scoring | π‘ | No temporal confidence tracking (a claim that's been rising for 4 years looks the same as one that just appeared) | CLAIRE + PaperQA2 both track temporal patterns | |
| | PA-11 | Scoring | π‘ | Confidence numbers shown without explanation (0.72 means nothing without knowing WHY) | CLUE generates plain-English uncertainty explanations | |
| | PA-12 | Human | π‘ | No active skepticism mode (system only confirms, never challenges high-confidence claims) | CLAIRE actively looks for counter-evidence | |
| | PA-13 | Human | π‘ | No human verification tracking (machine-extracted claims look identical to expert-verified ones) | ORKG + Paper Circle track verification status | |
| | PA-14 | Brain | π‘ | No self-critique step in Council (Extractor produces claims without articulating uncertainty) | CritiCal proves that self-critique improves calibration | |
| | PA-15 | Memory | π‘ | Conflict detection is shallow (compares two claims in isolation, doesn't investigate with surrounding evidence) | CLAIRE's investigation loop retrieves additional evidence before flagging | |
| | PA-16 | Data | π‘ | No non-AI anchors in teacher ensemble (all 5 teacher models share similar biases from internet training) | FactReview uses CODE EXECUTION as a non-AI evidence source | |
|
|
| ### Updated Summary by Severity (Including Prior Art Findings) |
|
|
| | Severity | Original Count | New from Prior Art | Total | |
| |----------|---------------|-------------------|-------| |
| | π΄ Critical | 14 | 3 | **17** | |
| | π High | 33 | 5 | **38** | |
| | π‘ Medium | 31 | 8 | **39** | |
| | **Total** | **78** | **16** | **94** | |
|
|
| ### Priority Integration Order for Prior Art Items |
|
|
| **Do These First** (they fix Critical blindspots with minimal effort): |
|
|
| | Priority | Item | Effort | What It Fixes | |
| |----------|------|--------|---------------| |
| | 1 | PA-1: Integrate SPECTER2 | 1-2 days | Deduplication actually works | |
| | 2 | PA-2: Add SciFact benchmark | 1 day | We can measure quality | |
| 3 | PA-3: Integrate SciRIFF data | 2-3 days | 72× more training data |
|
|
| **Do These Next** (they fix High blindspots with moderate effort): |
|
|
| | Priority | Item | Effort | What It Fixes | |
| |----------|------|--------|---------------| |
| 4 | PA-8: SciBERT-NLI pre-filter | 1-2 days | Conflict detection becomes 100× faster |
| | 5 | PA-4: Pre-Extraction Filter | 1 week | Removes noise before expensive processing | |
| | 6 | PA-5: Epistemic Trigger Words | 1 week | Code-based validation of AI labels | |
| | 7 | PA-7: Low Confidence Quarantine | 3-5 days | System learns to say "I don't know" | |
| | 8 | PA-6: Completeness Auditor | 1 week | Catches silent omissions | |
| | 9 | PA-9: Cross-Reference Verification | 1-2 weeks | Claims checked against both paper and graph | |
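
The pipeline shape behind priority 4 (PA-8) can be sketched as a two-stage funnel: a cheap filter discards most claim pairs, and only survivors pay for the expensive check. The lexical filter and negation list here are crude stand-ins; in the real fix the cheap stage would be SciBERT-NLI and the expensive stage an LLM call.

```python
NEGATION_CUES = {"not", "no", "never", "without", "fails", "failed"}  # illustrative

def cheap_filter(claim_a: str, claim_b: str, min_shared: int = 3) -> bool:
    """Fast screen: only flag pairs that share enough content words and
    show a crude contradiction signal (a negation cue on either side)."""
    a, b = set(claim_a.lower().split()), set(claim_b.lower().split())
    return len(a & b) >= min_shared and bool((a | b) & NEGATION_CUES)

def find_conflicts(pairs, expensive_check):
    """Two-stage detection: expensive_check (an NLI model or LLM call)
    runs only on pairs that survive the cheap screen."""
    return [p for p in pairs if cheap_filter(*p) and expensive_check(*p)]
```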
|
|
| **Do These When Core Is Stable** (they add new capabilities): |
|
|
| | Priority | Item | Effort | What It Adds | |
| |----------|------|--------|-------------| |
| | 10 | PA-14: Council Self-Critique | 1 week | Better calibration | |
| | 11 | PA-13: Provenance Levels | 1 week | Human verification tracking | |
| | 12 | PA-15: Investigation Protocol | 2-3 weeks | Deep contradiction analysis | |
| | 13 | PA-10: Epistemic Velocity | 3-5 days | Temporal confidence trends | |
| | 14 | PA-12: Devil's Advocate Mode | 1-2 weeks | Active skepticism | |
| | 15 | PA-11: Confidence Explanations | 3-5 days | Plain-English scoring reasons | |
| | 16 | PA-16: Non-AI anchors | Ongoing | Deterministic validation alongside AI | |
|
|
| --- |
|
|
| *Appendix added 2026-04-23. These 16 items supplement the original 78 blindspots with findings from prior art analysis. See [PRIOR_ART_ANALYSIS.md](PRIOR_ART_ANALYSIS.md) and [SYSTEM_INSPIRATIONS.md](SYSTEM_INSPIRATIONS.md) for full details.* |
|
|