PhD Research OS — Future Improvements Roadmap

Everything That Needs to Be Fixed, and How

Written so a high school student can understand every word.

Version: 2.0 (Final Compiled Edition)
Date: 2026-04-23
Status: Comprehensive — compiled from 87 original audit findings, 12 architectural improvements, and iterative code review
Total blindspots catalogued: 94 (78 original across 14 categories, plus the 16 prior-art items in the appendix)


What Is This Document?

Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts built. This document lists every problem we've found — from looking at the actual code, not just the plans — and explains how to fix each one.

Who is this for? Anyone who wants to understand, contribute to, or learn from this project. Every explanation uses everyday language. If you've ever written a school report, you already have the intuition for most of these problems.

How was this made? We read every file in the repository (60+ files), compared the actual code against the design document, re-read our findings multiple times looking for things we missed, and compiled everything into this single document.


Table of Contents

  1. The Big Picture
  2. The Core Philosophy Problem
  3. Data Problems — Your Training Material Is Weak
  4. Reading Papers — The PDF Nightmare
  5. The Brain — Your AI Is Underpowered
  6. Memory — How the System Remembers
  7. Scoring — Can You Trust the Numbers?
  8. Testing — How Do You Know It Works?
  9. Training — Teaching the AI Properly
  10. Teamwork — Multiple Models Working Together
  11. Staying Honest Over Time
  12. The Human Experience
  13. Security and Safety
  14. The Companion Agent System
  15. What To Build First
  16. Master Blindspot Table

1. The Big Picture

What does this system do?

It reads science papers and does 5 things:

  1. Extracts key findings ("This test detected cancer at 0.8 fM")
  2. Labels each finding (Is it a proven fact? An educated guess? A theory?)
  3. Scores how much you should trust each finding (using math formulas, not AI guesses)
  4. Compares findings across papers and spots contradictions
  5. Exports everything to tools researchers already use (Obsidian notes, CSV files)

What's built vs. what's planned?

| Part | Status | Plain English |
|------|--------|---------------|
| Database | ✅ Works | The filing cabinet is built |
| PDF reader | ⚠️ Basic | Can read papers but misses tables, figures, and equations |
| Claim extractor | ⚠️ Uses fake AI | Has a pretend brain that guesses using word patterns |
| Deduplicator | ⚠️ Primitive | Checks if two sentences share the same words (misses paraphrases) |
| Knowledge graph | ⚠️ Basic | Can store connections but can't find subtle contradictions |
| Scoring engine | ✅ Works | Math formulas work correctly |
| Evaluation | ⚠️ Incomplete | Counts claims but never checks if they're RIGHT |
| Obsidian export | ✅ Works | Can create notes for each finding |
| AI agent helpers | ✅ Framework works | The assistant robots exist but have no real brains yet |
| Training data | ⚠️ Too small | 1,900 fake examples (need 10,000+ real ones) |
| Trained model | ⚠️ Too small | Using a 3-billion parameter model (plan says 8-27 billion) |

2. The Core Philosophy Problem

Blindspot CORE-1: The System Is Model-Centered, Not Evidence-Centered 🔴

This is the single most important problem in the entire project.

Right now, the system works like this:

  1. Give text to AI
  2. AI writes a summary of what it found
  3. Hope the summary is correct

It should work like this:

  1. AI scans the text and points to specific sentences
  2. Each pointed-to sentence gets a label (Fact / Interpretation / Hypothesis)
  3. You can click on any finding and see the EXACT sentence in the original paper

Analogy: Imagine writing a school essay. Model-centered is like saying "Shakespeare wrote about love and death" with no page references. Evidence-centered is like saying "In Act 3, Scene 1, line 88, Romeo says '...' which shows the theme of love." Teachers always want the second kind — because they can CHECK it.

What changes in the code: Every claim in the database needs to be a POINTER to a specific text span (page number, character position, exact quote). The AI's job is to FIND important spans, not to WRITE about what it found. The claim text should be copied directly from the paper, not generated by the AI.
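A minimal sketch of what such a pointer record could look like (the class and field names are hypothetical, not the repository's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceSpan:
    """A claim is a pointer into the source paper, never generated text."""
    paper_id: str     # which paper this span comes from
    page: int         # page number in the PDF
    char_start: int   # character offset where the span begins
    char_end: int     # character offset where the span ends
    quote: str        # exact text copied verbatim from the paper
    tag: str          # "fact", "interpretation", or "hypothesis"

    def verify(self, parsed_text: str) -> bool:
        # If the stored quote doesn't match the source exactly, the pointer is invalid.
        return parsed_text[self.char_start:self.char_end] == self.quote
```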

Blindspot CORE-2: Design-Code Gap Is Enormous 🔴

What's wrong: The SYSTEM_DESIGN.md describes a sophisticated 7-layer system with ML-based parsing, constrained decoding, multi-round council debates, embedding-based deduplication, and a 4-stage training pipeline. The actual code has basic text extraction, pattern-matching classification, word-overlap deduplication, and single-stage training.

Why it matters: Someone reading the design document would think this is a near-complete system. Someone reading the code would see it's a prototype. This gap can mislead contributors, reviewers, and users.

The fix: The README and documentation should clearly mark which features are "✅ Implemented," "🚧 Skeleton exists," and "📋 Designed only." Honesty about the current state is itself an epistemic virtue — and this project is all about epistemic honesty.

Blindspot CORE-3: No End-to-End Pipeline Has Ever Run Successfully 🔴

What's wrong: There's no evidence (in tests, logs, or documentation) that anyone has ever: taken a real PDF → parsed it → extracted claims → deduplicated them → stored them in the graph → scored them → exported to Obsidian. Each layer has been tested in isolation, but the full pipeline has never been verified end-to-end.

Why it matters: Individual parts working doesn't mean the whole system works. A car where the engine works, the wheels spin, and the steering turns is still broken if they're not connected properly.

The fix: Create ONE end-to-end test. Take a real paper. Run it through every layer. Check the output. This single test is worth more than 100 unit tests for building confidence that the system actually does what it claims.
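A sketch of what that single test could look like. The imported names are stand-ins for whatever the real modules expose, not the repository's actual API:

```python
# Hypothetical pytest sketch; every imported name below is an assumption.
from pathlib import Path

from phd_research_os_v2 import (  # assumed import path, adjust to the real modules
    parse_pdf, extract_claims, deduplicate, store_in_graph,
    score_all, export_obsidian,
)

def test_pipeline_end_to_end(tmp_path: Path):
    doc = parse_pdf(Path("tests/fixtures/real_paper.pdf"))  # one real, checked-in paper
    claims = deduplicate(extract_claims(doc))
    graph = store_in_graph(claims)
    scored = score_all(graph)
    export_obsidian(scored, tmp_path)

    assert claims, "no claims extracted"
    assert all(0.0 <= c.confidence <= 1.0 for c in scored)
    assert any(tmp_path.glob("*.md")), "no Obsidian notes written"
```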


3. Data Problems — Your Training Material Is Weak

Blindspot D-1: Training Data Is All Fake 🔴

All 1,900 training examples were generated by a Python script using templates. Real papers don't write like templates. The AI is learning to process fake text, not real science.

Fix: Get 100 real papers, have a human expert label them, use those as training data.

Blindspot D-2: No "Wrong Answer" Examples 🔴

The AI only sees correct examples. It never sees examples of ALMOST correct but subtly wrong outputs. It's like studying for a test using only the answer key — you won't recognize trick questions.

Fix: Run the current model, collect its mistakes, and use those as "here's what NOT to do" training examples (hard-negative mining).

Blindspot D-3: No Error Categories 🟠

When the model makes a mistake, we don't know WHY. Was it a dropped unit? A missed qualifier? A wrong section? Without categorizing errors, we can't target specific weaknesses.

The 9 error types to track:

  1. Qualifier loss — "may reduce" becomes "reduces"
  2. Unit drop — "0.8 fM" becomes "0.8"
  3. Negation flip — "did NOT work" becomes "worked"
  4. Section confusion — Abstract claim scored as Results
  5. Number hallucination — Paper says 0.8, AI says 0.6
  6. Granularity mismatch — One table row treated as a paper-level conclusion
  7. Citation theft — A finding from a cited paper treated as this paper's finding
  8. Causal overclaim — "correlated with" becomes "caused by"
  9. Statistics loss — p-value or sample size dropped
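A minimal way to make these categories concrete in code (the enum and its string values are a suggestion, not existing repository code):

```python
from enum import Enum

class ErrorType(Enum):
    """The 9 extraction error categories to tag during failure review."""
    QUALIFIER_LOSS = "qualifier_loss"                  # "may reduce" -> "reduces"
    UNIT_DROP = "unit_drop"                            # "0.8 fM" -> "0.8"
    NEGATION_FLIP = "negation_flip"                    # "did NOT work" -> "worked"
    SECTION_CONFUSION = "section_confusion"            # Abstract scored as Results
    NUMBER_HALLUCINATION = "number_hallucination"      # paper says 0.8, model says 0.6
    GRANULARITY_MISMATCH = "granularity_mismatch"      # table row as paper-level conclusion
    CITATION_THEFT = "citation_theft"                  # cited paper's finding claimed as this paper's
    CAUSAL_OVERCLAIM = "causal_overclaim"              # "correlated with" -> "caused by"
    STATISTICS_LOSS = "statistics_loss"                # p-value or sample size dropped
```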

Blindspot D-4: Teacher Disagreement Not Preserved 🟠

The system design talks about running 5 AI models on the same paper and keeping ALL answers. The current pipeline just generates one answer. The disagreement between multiple models IS the most valuable training signal β€” it teaches the student model about uncertainty.

Blindspot D-5: No Counterfactual Examples 🟑

We never create "mirror" training examples where one key word is changed (add "not", swap a unit) to test if the model notices. Without these, we can't tell if the model is actually reading or just pattern-matching.

Blindspot D-6: Training Data Only Covers 5 Domains 🟑

The data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. Feed it a paper about ecology, psychology, or economics and it will hallucinate confidently. The data needs to be broader, or the system needs to refuse out-of-domain papers.


4. Reading Papers — The PDF Nightmare

Blindspot P-1: No Smart PDF Parser 🔴

The code uses basic text scrapers (PyMuPDF, pdfplumber) that destroy document structure. The design says to use Marker (ML-based), but it's not connected. This is the #1 dependency — everything downstream depends on good parsing.

Blindspot P-2: Tables Become Unreadable 🔴

Tables are extracted as flat pipe-separated text. The relationship between column headers and cell values is lost. "0.8" without knowing it's the "LOD" column is meaningless.

Fix: Store tables as structured data with headers and rows preserved.
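One possible shape for that structured form (a sketch; the class name and fields are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ParsedTable:
    """Keeps header-to-cell relationships instead of flattening to pipe text."""
    caption: str
    headers: list[str]     # e.g. ["Analyte", "LOD", "Linear range"]
    rows: list[list[str]]  # one inner list per table row

    def cell(self, row: int, column: str) -> str:
        # "0.8" is only meaningful if we know it sits in the "LOD" column.
        return self.rows[row][self.headers.index(column)]
```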

Blindspot P-3: Figures Are Completely Skipped 🟠

The parser detects images and stores "[Image detected — requires VLM processing]". No analysis happens. 30-40% of key evidence in science papers is in figures.

Blindspot P-4: Section Detection Is Brittle 🟠

Sections are detected by looking for the word "Abstract" or "Results" at the start of a line. Papers that number sections ("2.1 Experimental Setup"), use non-standard names, or combine sections ("Results and Discussion") are misidentified. Since section identity directly affects confidence scores, this causes systematic scoring errors.

Blindspot P-5: Equations Become Garbage 🟠

Mathematical equations are not handled. LaTeX, inline math, and special symbols are either garbled or dropped. For papers in physics, chemistry, or engineering, equations ARE the findings.

Blindspot P-6: Supplements Can't Be Linked to Main Papers 🟠

The parser processes one file at a time. There's no way to say "this supplement belongs to that main paper." In biology and chemistry, the best data is often in supplements.

Blindspot P-7: No Language Detection 🟑

If someone uploads a Chinese or Japanese paper, the system produces English-looking garbage with high confidence. It should detect the language and refuse or flag reduced confidence.

Blindspot P-8: Cross-References Are Detected But Never Verified 🟑

The code finds "see Table 2" in the text but never checks if the parsed Table 2 is actually Table 2. If tables are mislabeled, every claim referencing them has wrong evidence.

Blindspot P-9: No File Safety Checks 🟠

The parser accepts any file with no checks for malicious PDFs (they exist), extremely large files, or corrupted data. A 500MB PDF would crash the system.

Blindspot P-10: Chunking Ignores Table/Caption Integrity 🟑

The section-aware chunker in parser.py merges consecutive body text regions, but tables and their captions can be split across chunks. A table in one chunk and its caption in another means the AI sees numbers without context.


5. The Brain — Your AI Is Underpowered

Blindspot B-1: Model Is Too Small 🔴

Qwen2.5-3B (current) is like a bright middle schooler — can follow instructions but doesn't deeply understand scientific reasoning. The design says upgrade to Qwen3-8B or Qwen3.5-27B. This hasn't happened.

Blindspot B-2: The Fake Brain Is the Default 🔴

When no real AI is connected (which is the default), an _extract_mock() function runs that classifies claims by keyword matching: any sentence with "measured" → Fact, any with "may" → Hypothesis. This is wildly inaccurate.

Example failures:

  • "We measured nothing significant" → Mock says Fact → Actually a null result
  • "The most reliable technique measured to date may revolutionize the field" → Mock says both Fact AND Hypothesis

Blindspot B-3: No Output Guarantees 🟠

The system asks the AI to output JSON and hopes for the best. Without constrained decoding (Guidance library), the model can produce broken JSON, invalid tags, or mixed text-and-JSON output.
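Until constrained decoding is wired in, even a simple validate-and-retry loop beats hoping. A hedged stopgap sketch (call_model is a hypothetical stand-in for the real inference call, and the required keys are illustrative):

```python
import json

REQUIRED_KEYS = {"claim", "tag", "qualifiers"}  # assumed schema, adjust to the real one

def extract_json(prompt: str, call_model, max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # broken JSON: ask again
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data  # parseable and all required fields present
    raise ValueError("model never produced valid JSON")
```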

Blindspot B-4: One Brain for Six Different Jobs 🟠

One model does: claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want opposite behaviors (extraction wants to catch everything; qualifier detection wants surgical precision).

Fix: Shared base model with specialized output heads for different tasks.

Blindspot B-5: No Scientific Vocabulary Training 🟠

The base model was trained on general internet text. It doesn't deeply know what "LOD," "analyte," "p-value," or "cyclic voltammetry" mean. It can extract the words but might misclassify their importance.

Blindspot B-6: Can't Detect "I Don't Know This Field" 🟠

No out-of-distribution detection. Feed it a sociology paper and it produces confident-looking scientific claims that are nonsense. The model should flag when content is too different from its training data.

Blindspot B-7: No Verification of Claims Against Source 🟠

The design describes a "chain-of-verification" approach where a second model re-reads the source and checks if the extracted claim is actually supported. This isn't built. The system trusts its first extraction with no double-checking.


6. Memory β€” How the System Remembers

Blindspot M-1: Deduplication Uses Word Overlap, Not Meaning 🔴

Two claims that say the same thing in different words are treated as different findings. "The LOD was 0.8 fM" and "A detection limit of 0.8 femtomolar was achieved" would NOT be recognized as duplicates because they share few words.

Fix: Use an embedding model that converts sentences into number vectors. Similar meanings = similar vectors, regardless of word choice.
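A minimal sketch using the sentence-transformers library (the model name and the 0.85 threshold are illustrative and would need calibration on real claim pairs):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

a = "The LOD was 0.8 fM"
b = "A detection limit of 0.8 femtomolar was achieved"

emb = model.encode([a, b], normalize_embeddings=True)
similarity = util.cos_sim(emb[0], emb[1]).item()

# Word overlap between these two sentences is near zero,
# but their embedding similarity is high.
if similarity > 0.85:
    print("duplicate candidates:", round(similarity, 3))
```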

Blindspot M-2: No Embedding Model Exists in the Code 🟠

The design describes embedding-based similarity. The code imports no embedding model. No sentence-transformers, no ChromaDB, nothing. The entire deduplication and conflict detection system relies on word counting.

Blindspot M-3: Conflict Detection Is Limited to 500 Claims 🟠

The find_conflicts() method only checks the top 500 claims by confidence. If important claim #501 contradicts claim #10, it'll never be found. And it uses the same word-overlap approach, so semantically similar but differently-worded contradictions are invisible.

Blindspot M-4: No Concept of Time 🟑

A finding from 2015 and a finding from 2025 are weighted equally. But the 2025 paper might specifically disprove the 2015 one. The graph needs temporal reasoning.

Blindspot M-5: Gap Analysis Is Too Narrow 🟑

The gap-finding algorithm only looks for missing connections between same-type nodes. But the most interesting gaps are between different types — like a method that's never been applied to a particular material.

Blindspot M-6: No Retraction Checking 🟠

If a paper gets retracted (pulled back because it was wrong or fraudulent), the system doesn't know. Claims from retracted papers sit in the knowledge graph with full confidence.

Fix: Check CrossRef/Retraction Watch before ingesting a paper. If a paper in the database gets retracted later, propagate that information to all its claims.
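A hedged sketch of such a check against the Crossref REST API (the filter syntax and response fields should be verified against Crossref's documentation before relying on this):

```python
import requests

def is_retracted(doi: str) -> bool:
    """Ask Crossref whether any retraction notice updates this DOI."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"filter": f"updates:{doi}"},  # items that update this DOI
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for update in item.get("update-to", []):
            if (update.get("type") == "retraction"
                    and update.get("DOI", "").lower() == doi.lower()):
                return True
    return False
```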

Blindspot M-7: Obsidian Export Doesn't Include Graph Relationships 🟑

The Obsidian exporter creates notes for claims, conflicts, and goals, but doesn't export the actual graph edges (supports/refutes/extends). A researcher looking at the vault sees individual findings but can't see how they connect.


7. Scoring — Can You Trust the Numbers?

Blindspot S-1: All Qualifiers Get the Same Penalty 🟠

Every qualifier reduces confidence by exactly 0.1. But "may" (very uncertain) and "under controlled laboratory conditions" (limits scope but is quite certain) are treated identically.

Fix: Weighted qualifier types β€” strong hedges like "may" get -0.20, scope limiters like "in vitro" get -0.05.

Blindspot S-2: Only One Calibration Number 🟠

The system tracks one Brier score for overall calibration. But the model might be perfectly calibrated on "did I find a real claim?" and terribly calibrated on "are these two claims contradictory?" One number hides this.

Fix: Track 6 separate calibration curves: extraction confidence, qualifier confidence, statistical confidence, conflict confidence, section confidence, provenance confidence.
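A minimal sketch of per-task calibration tracking. The Brier score here is just the mean squared gap between stated confidence and actual correctness; task names follow the list above:

```python
from collections import defaultdict

brier = defaultdict(lambda: [0.0, 0])  # task -> [sum of squared errors, count]

def record(task: str, confidence: float, was_correct: bool) -> None:
    brier[task][0] += (confidence - (1.0 if was_correct else 0.0)) ** 2
    brier[task][1] += 1

def brier_score(task: str) -> float:
    total, n = brier[task]
    return total / n if n else float("nan")

record("extraction", 0.9, True)   # confident and right: adds little error
record("conflict", 0.9, False)    # confident and wrong: adds a lot
```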

Blindspot S-3: Design Features Missing from Code 🟑

The SYSTEM_DESIGN.md describes:

  • Source count bonus (+0.2 for claims backed by multiple papers) → NOT in scorer.py
  • Conflict penalty (-0.3 for claims with active conflicts) → NOT in scorer.py
  • Corroboration bonus → NOT in scorer.py

The scorer only uses single-claim information. It ignores the knowledge graph entirely.

Blindspot S-4: Average Instead of Minimum for Composite Score 🟑

Composite confidence = average of three scores. This means terrific evidence + terrible qualifier strength = mediocre composite. A chain is only as strong as its weakest link — use the minimum, not the average.

Blindspot S-5: Parser Confidence Isn't Actually Meaningful 🟑

The score_parse_quality() function in parser.py assigns quality scores, but the thresholds are arbitrary. What does "garbled_ratio × 3000" actually correspond to? These numbers were chosen by gut feeling, not calibrated against human judgments of parse quality.


8. Testing — How Do You Know It Works?

Blindspot T-1: No Gold Standard Test Set 🔴

The evaluation counts claims and checks distributions but never compares against human-labeled ground truth. The system could produce 500 completely wrong claims and report "all metrics look normal."

Fix: 10 expert-labeled papers where every claim, tag, qualifier, and conflict is manually annotated. This is the #1 requirement for any ML system — you can't improve what you can't measure.

Blindspot T-2: No Paper-Level Test Splits 🔴

Train and test sets are randomly shuffled. Claims from the same paper could be in both. The model might memorize patterns per paper rather than learning to generalize.

Fix: Split by paper, lab, journal, year, and field. Test on genuinely unseen papers.
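A sketch of the paper-level part of that split using scikit-learn's GroupShuffleSplit, assuming each claim record carries a paper_id field:

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_paper(claims: list[dict], test_size: float = 0.2):
    """All claims from one paper land on the same side of the split."""
    groups = [c["paper_id"] for c in claims]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
    train_idx, test_idx = next(splitter.split(claims, groups=groups))
    return [claims[i] for i in train_idx], [claims[i] for i in test_idx]
```

The same pattern extends to lab, journal, year, and field by changing the grouping key.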

Blindspot T-3: No Counterfactual Tests 🟠

We never check if the model changes its answer for the right reason. Remove a table header — does the claim become Incomplete? Flip "significant" to "not significant" — does the tag change? Without these tests, the model might use shortcuts instead of understanding.

Blindspot T-4: No Stochastic Testing 🟑

LLMs give different answers each time. Running evaluation once gives one number that might change by ±5% next time. Run evaluations 5 times and report the range.
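A minimal sketch (run_eval is a stand-in for whatever evaluation entry point exists):

```python
import statistics

def evaluate_n_times(run_eval, n: int = 5) -> str:
    """Report the spread of repeated evaluation runs, not a single number."""
    scores = [run_eval() for _ in range(n)]
    spread = (max(scores) - min(scores)) / 2
    return f"accuracy {statistics.mean(scores):.3f} (±{spread:.3f} over {n} runs)"
```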

Blindspot T-5: Regression Gate Has No Teeth 🟑

The run_regression_gate() returns pass/fail but nothing blocks a bad model from being used. It's like a fire alarm with no fire department.

Blindspot T-6: No Inter-Annotator Agreement Tracking 🟑

When multiple people label the gold standard, they'll disagree on some claims. That disagreement is important information — it tells you which categories are genuinely ambiguous. The system has no mechanism to measure or track this.

Blindspot T-7: 143 Tests, Zero of Them Test Science Quality 🟠

All 143 tests verify that the CODE runs correctly (functions don't crash, data is stored properly). Zero tests verify that the SCIENCE is right (extracted claims match what the paper actually says). Code tests and science tests are completely different things.


9. Training — Teaching the AI Properly

Blindspot TR-1: Only Stage 1 of 4 Is Built 🔴

The 4-stage pipeline: SFT → DPO → GRPO → ConfTuner. Only SFT exists. This is like building a race car but only installing first gear.

  • SFT (Stage 1): "Here's the right answer" → ~70% quality
  • DPO (Stage 2): "This answer is better than that one" → ~80% quality
  • GRPO (Stage 3): "Reward for good JSON, correct tags, preserved qualifiers" → ~85-90% quality
  • ConfTuner (Stage 4): "Make your confidence numbers match reality" → Calibrated confidence

Blindspot TR-2: ZeroGPU Micro-Batching Breaks Learning 🟠

The training Space uses 60-second GPU bursts. The learning rate resets, the model reloads, and there's no training momentum. It's like studying in 2-minute bursts with naps between each one.

Fix: One continuous training job on proper GPU (HF Jobs or local).

Blindspot TR-3: No Curriculum Learning 🟑

Easy and hard examples are randomly mixed. Humans learn better easy-to-hard. The model should first learn single facts, then fact vs. interpretation, then multi-claim papers, then contradictions.

Blindspot TR-4: Loss Number Doesn't Mean Task Success 🟑

Training tracks eval_loss, which tells you if the model is generating likely-looking text. It does NOT tell you if the JSON is valid, tags are correct, qualifiers are preserved, or numbers are accurate. Task-specific evaluation during training is essential.

Blindspot TR-5: No Training on Real Model Failures 🟠

Hard-negative mining (collecting the model's actual mistakes and training on them) is the single most efficient way to improve quality. It's described in the previous expert review session but not implemented anywhere. You need:

  1. Run current model on 100 papers
  2. Manually check which extractions are wrong
  3. Categorize WHY they're wrong (using the 9 error types)
  4. Train specifically on those failure cases

10. Teamwork — Multiple Models Working Together

Blindspot W-1: Council Is Sequential, Not Debating 🟠

The design describes a multi-round debate (Extractor → Critic → Chairman, with hidden confidence and revealed reasoning). The code is sequential: one call per member, no back-and-forth. The disagreement signal (which IS the whole point) is lost.

Blindspot W-2: No Router 🟠

Everything goes to the same model. Text, tables, figures, equations — all handled the same way. A router should decide: regular text → language model, table → specialized parser, figure → vision model, garbage → skip.

Blindspot W-3: System Can Never Say "I Don't Know" 🟠

Every input gets a confident-looking answer. There's no mechanism to say "I can't read this page," "I don't understand this field," or "I need a human to look at this." Confident garbage is the worst failure mode.

Blindspot W-4: Teachers Are Treated As Oracles 🟠

The design proposes using 5 frontier AI models as teachers. But these models share biases β€” they were all trained on similar internet text, they all struggle with the same types of negation, and they can all be wrong in the same direction.

Fix: Include at least one NON-AI anchor for each data type: deterministic table parser for tables, regex for statistics, CrossRef API for citations. Where the rules-based tool disagrees with all 5 AI models, investigate — the AIs might be harmonizing around an error.
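A sketch of what one such non-AI anchor could look like for statistics (the regexes are illustrative and would need hardening for real papers):

```python
import re

P_VALUE = re.compile(r"p\s*[<=>]\s*0?\.\d+", re.IGNORECASE)
SAMPLE_SIZE = re.compile(r"\bn\s*=\s*\d+", re.IGNORECASE)

def extract_stats(sentence: str) -> dict:
    """Deterministic extraction of p-values and sample sizes."""
    return {
        "p_values": P_VALUE.findall(sentence),
        "sample_sizes": SAMPLE_SIZE.findall(sentence),
    }

# If all five AI teachers drop "p < 0.05" but the regex finds it,
# flag the example for review.
print(extract_stats("The effect was significant (p < 0.05, n = 42)."))
```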


11. Staying Honest Over Time

Blindspot H-1: No Drift Detection 🟠

If the model slowly gets worse (because it encounters new sub-fields or vocabulary evolves), nobody notices. Weekly re-evaluation against the gold standard is needed.

Blindspot H-2: Human Corrections Disappear 🟠

When a researcher fixes a wrong tag or adds a missing qualifier in the UI, that correction isn't saved for training. Those corrections are the most valuable data you can get — each one is an expert-labeled example from your exact domain.

Blindspot H-3: Taxonomy Changes Break Old Scores 🟑

If study type weights change (e.g., "in_vitro" goes from 0.85 to 0.80), old claims aren't re-scored. Claims from different taxonomy versions can't be compared meaningfully.

Blindspot H-4: No Backup Strategy 🟑

Everything is in one SQLite file. No automatic backups, no periodic snapshots, no way to recover from corruption or accidental deletion.

Blindspot H-5: No Retraction Monitoring 🟠

Once a paper is ingested, its claims live in the graph forever — even if the paper is retracted (pulled back for fraud or errors). The system needs to periodically check for retractions and propagate that information.

Blindspot H-6: v1 → v2 Migration Path Undefined 🟡

The repo has both phd_research_os/ (v1) and phd_research_os_v2/ (v2) with different database schemas. There's no migration script for anyone who has data in v1 format. Users could lose work when upgrading.


12. The Human Experience

Blindspot HU-1: Information Overload 🟠

100 papers → 3,000+ claims. The UI treats them all equally. Researchers need priority ranking: most surprising findings first, then contradictions, then gaps.

Blindspot HU-2: Scores Are Unexplained Numbers 🟡

"Confidence: 0.723" means nothing without a breakdown. Show the multiplication chain: evidence × quality × tier × section × completeness = 0.723.

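A minimal sketch of such a breakdown (the factor names mirror the scoring description, but the exact set is an assumption):

```python
def explain_confidence(factors: dict[str, float]) -> str:
    """Render the multiplication chain behind a composite confidence score."""
    chain = " × ".join(f"{name} {value:.2f}" for name, value in factors.items())
    composite = 1.0
    for value in factors.values():
        composite *= value
    return f"{chain} = {composite:.3f}"

print(explain_confidence({
    "evidence": 0.90, "quality": 0.95, "tier": 0.92,
    "section": 1.00, "completeness": 0.92,
}))
```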
Blindspot HU-3: Risk of Blind Trust 🟑

Once a researcher trusts the system, they stop checking. Built-in friction (hiding scores randomly, asking "do you agree?", tracking override rates) prevents over-reliance.

Blindspot HU-4: No "Fresh User" Onboarding 🟑

A new graduate student encountering the system for the first time sees a complex 7-layer architecture with 87 blindspots. There's no tutorial, no "start here" guide, no simplified workflow for beginners.

Fix: A "Getting Started" mode that processes one paper and walks the user through each step: "Here's what the parser found → Here's what the AI extracted → Here's how we scored it → Here's what we're not sure about."

Blindspot HU-5: No Accessibility Considerations 🟑

The UI has no dark mode, no font size options, no screen reader support. Research tools should be accessible to everyone.


13. Security and Safety

Blindspot SEC-1: No Input Validation 🟠

No file size limits, no malicious PDF detection, no format verification. A bad file could crash or compromise the system.

Blindspot SEC-2: Risky SQL Pattern 🟑

The get_stats() function uses f-strings for table names. Currently safe because names are hardcoded, but a bad pattern that could become dangerous if someone modifies the code.
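A sketch of the safer pattern: SQLite placeholders can't parameterize identifiers like table names, so whitelist them instead (the table names here are hypothetical):

```python
# Identifiers can't go through "?" placeholders, so validate them explicitly.
ALLOWED_TABLES = {"papers", "claims", "conflicts"}  # hypothetical names

def count_rows(conn, table: str) -> int:
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    # Safe: the name comes from a fixed whitelist, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```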

Blindspot SEC-3: No API Cost Controls 🟑

No rate limiting on external API calls. Processing 1,000 papers could generate thousands of expensive API requests with no spending cap.

Blindspot SEC-4: No Data Privacy for Unpublished Work 🟑

The design describes an "Epistemic Embargo" (private graphs for unpublished research), but it's not implemented. A researcher analyzing their unpublished data has no privacy guarantees.


14. The Companion Agent System

Blindspot A-1: Agents Have No Real Brain 🟠

The companion agents (DataQualityAuditor, PromptOptimizer, etc.) are fully coded with lifecycle management, audit trails, and proposal systems. But when no AI model is connected (the default), they generate placeholder proposals that say "Brain not configured." They're robots with no intelligence.

Blindspot A-2: Agent Plans Are Hardcoded, Not Dynamic 🟑

The _plan() method assigns fixed step lists based on agent type. A DataQualityAuditor always gets the same 4 steps regardless of what task it's given. Real planning would look at the task, check what data is available, and decide what to do.

Blindspot A-3: No Agent Coordination 🟑

Multiple agents can run simultaneously but they can't talk to each other. If the DataQualityAuditor finds a problem and the PromptOptimizer could fix it, there's no mechanism for them to coordinate.

Blindspot A-4: Proposal Review Has No Urgency System 🟑

All proposals wait equally for human review. But some (like "this paper was retracted — its claims should be removed") are urgent, while others (like "consider adding training examples for this domain") are routine. No priority system exists.


15. What To Build First

Everything has dependencies. You can't build the roof before the walls. Here's the order that makes engineering sense:

πŸ—οΈ Foundation Phase (Weeks 1-4) β€” Nothing works without this

# Task Why First
1 Integrate Marker PDF parser Every downstream layer depends on accurate parsing
2 Create gold standard test set (10 real papers, expert-labeled) Can't measure improvement without ground truth
3 Add embedding model (sentence-transformers) Needed for smart deduplication and conflict detection
4 Run one end-to-end pipeline test Prove the layers actually connect

πŸ”§ Reliability Phase (Weeks 5-8) β€” Make it work correctly

# Task Why Next
5 Connect real AI model (replace mock extractor) The system needs a real brain
6 Add constrained decoding (Guidance library) Guarantee valid JSON output
7 Build continuous training pipeline (replace ZeroGPU micro-batching) Proper training produces proper models
8 Implement weighted qualifier penalties Not all hedging words are equal

🧠 Intelligence Phase (Weeks 9-16) β€” Make it smart

# Task Why This Order
9 Expand training data to 10K+ with hard negatives More and better data = better model
10 Implement DPO training (preference learning) Stage 2 of training pipeline
11 Implement GRPO training (reward functions for JSON quality, tag correctness, qualifier preservation) Stage 3 β€” the biggest quality jump
12 Add out-of-distribution detection Know when to say "I don't know"

βœ… Trust Phase (Weeks 17-24) β€” Make it honest

# Task Why Here
13 Build counterfactual evaluation suite Test that the model reasons correctly, not just accurately
14 Add paper-level evaluation splits Honest accuracy numbers
15 Implement human feedback loop Every correction becomes training data
16 Add multi-dimensional calibration (6 Brier scores) Know which confidence types are trustworthy
17 Add drift detection and monitoring Catch problems before they become crises

πŸš€ Scale Phase (Weeks 25-32) β€” Make it complete

# Task Why Last
18 Add figure/table specialist models Handle the 30-40% of evidence in images
19 Build content router Right model for right content type
20 Implement supplement handling (paper bundles) Complete paper coverage
21 Add retraction checking Keep the knowledge graph honest
22 Build verification pipeline (double-check claims against source) Reduce hallucinations dramatically

16. Master Blindspot Table

| ID | Category | Severity | Problem (one sentence) |
|----|----------|----------|------------------------|
| CORE-1 | Philosophy | 🔴 | System generates summaries instead of pointing to evidence spans |
| CORE-2 | Philosophy | 🔴 | Design document describes features the code doesn't have |
| CORE-3 | Philosophy | 🔴 | End-to-end pipeline has never been tested on a real paper |
| D-1 | Data | 🔴 | All training data is computer-generated, not from real papers |
| D-2 | Data | 🔴 | No examples of what wrong output looks like |
| D-3 | Data | 🟠 | Errors aren't categorized by type |
| D-4 | Data | 🟠 | Multiple-teacher disagreement signal is thrown away |
| D-5 | Data | 🟡 | No counterfactual (mirror) training examples |
| D-6 | Data | 🟡 | Training covers only 5 scientific fields |
| P-1 | Parser | 🔴 | No ML-based PDF parser is connected |
| P-2 | Parser | 🔴 | Tables lose their header-value relationships |
| P-3 | Parser | 🟠 | Figures are detected but never analyzed |
| P-4 | Parser | 🟠 | Section headers are identified by keyword matching only |
| P-5 | Parser | 🟠 | Equations are garbled or dropped |
| P-6 | Parser | 🟠 | Supplementary files can't be linked to main papers |
| P-7 | Parser | 🟡 | Non-English papers produce garbage silently |
| P-8 | Parser | 🟡 | "See Table 2" references are found but never verified correct |
| P-9 | Parser | 🟠 | No file size/safety checks before processing |
| P-10 | Parser | 🟡 | Tables and their captions can be split across chunks |
| B-1 | Brain | 🔴 | Model is 3B parameters (design says 8-27B) |
| B-2 | Brain | 🔴 | Default mode uses keyword matching instead of AI |
| B-3 | Brain | 🟠 | Output format is not guaranteed (no constrained decoding) |
| B-4 | Brain | 🟠 | Single model does 6 different tasks that want opposite behaviors |
| B-5 | Brain | 🟠 | No training on scientific vocabulary specifically |
| B-6 | Brain | 🟠 | Can't detect when content is outside its training domain |
| B-7 | Brain | 🟠 | No second-pass verification of extracted claims |
| M-1 | Memory | 🔴 | Deduplication checks word overlap, not meaning |
| M-2 | Memory | 🟠 | No embedding model exists anywhere in the code |
| M-3 | Memory | 🟠 | Conflict detection only checks 500 claims using word overlap |
| M-4 | Memory | 🟡 | Knowledge graph has no concept of time |
| M-5 | Memory | 🟡 | Gap analysis only looks within same node types |
| M-6 | Memory | 🟠 | Retracted papers aren't detected or flagged |
| M-7 | Memory | 🟡 | Obsidian export doesn't include graph edges |
| S-1 | Scoring | 🟠 | All qualifier types penalized equally |
| S-2 | Scoring | 🟠 | One calibration number hides per-task calibration failures |
| S-3 | Scoring | 🟡 | Design features (source bonus, conflict penalty) not in code |
| S-4 | Scoring | 🟡 | Composite = average (should be minimum of weakest dimension) |
| S-5 | Scoring | 🟡 | Parse quality scores are arbitrary, not calibrated |
| T-1 | Testing | 🔴 | No human-labeled gold standard test set |
| T-2 | Testing | 🔴 | Train/test split is random, not paper-level |
| T-3 | Testing | 🟠 | No counterfactual robustness tests |
| T-4 | Testing | 🟡 | Evaluations run once (should run 5× to check consistency) |
| T-5 | Testing | 🟡 | Regression gate returns pass/fail but nothing enforces it |
| T-6 | Testing | 🟡 | No tracking of human annotator agreement |
| T-7 | Testing | 🟠 | 143 code tests, zero science quality tests |
| TR-1 | Training | 🔴 | Only 1 of 4 training stages exists |
| TR-2 | Training | 🟠 | ZeroGPU micro-batching breaks learning continuity |
| TR-3 | Training | 🟡 | Examples aren't ordered easy → hard |
| TR-4 | Training | 🟡 | Training eval is loss-based, not task-based |
| TR-5 | Training | 🟠 | No training on actual model failures |
| W-1 | Teamwork | 🟠 | Council members don't actually debate |
| W-2 | Teamwork | 🟠 | No router to send content to appropriate model |
| W-3 | Teamwork | 🟠 | System can never say "I don't know" |
| W-4 | Teamwork | 🟠 | Teacher ensemble assumed unbiased (they share biases) |
| H-1 | Longevity | 🟠 | No automatic detection of model performance degradation |
| H-2 | Longevity | 🟠 | Human corrections aren't saved for future training |
| H-3 | Longevity | 🟡 | Taxonomy changes don't trigger re-scoring of old claims |
| H-4 | Longevity | 🟡 | No database backup automation |
| H-5 | Longevity | 🟠 | No retraction monitoring for ingested papers |
| H-6 | Longevity | 🟡 | No migration path from v1 to v2 database schema |
| HU-1 | Human | 🟠 | 3,000 claims displayed with no priority ranking |
| HU-2 | Human | 🟡 | Confidence scores show number but no breakdown |
| HU-3 | Human | 🟡 | No safeguards against over-trusting the system |
| HU-4 | Human | 🟡 | No beginner-friendly onboarding experience |
| HU-5 | Human | 🟡 | No accessibility features (dark mode, screen reader, font size) |
| SEC-1 | Security | 🟠 | No file validation or safety checks |
| SEC-2 | Security | 🟡 | SQL construction uses f-strings (risky pattern) |
| SEC-3 | Security | 🟡 | No spending limits on API calls |
| SEC-4 | Security | 🟡 | No privacy controls for unpublished research |
| A-1 | Agents | 🟠 | Companion agents have no AI brain connected |
| A-2 | Agents | 🟡 | Agent plans are fixed templates, not dynamic |
| A-3 | Agents | 🟡 | Multiple agents can't coordinate with each other |
| A-4 | Agents | 🟡 | No urgency system for proposal review |

Summary by Severity

| Severity | Count | Meaning |
|----------|-------|---------|
| 🔴 Critical | 14 | System fundamentally broken without this |
| 🟠 High | 33 | Significant quality or reliability problem |
| 🟡 Medium | 31 | Important but not blocking |
| Total | 78 | |

Summary by Category

| Category | Count | Most Critical Issue |
|----------|-------|---------------------|
| Philosophy | 3 | Evidence-centered vs model-centered |
| Data | 6 | All training data is synthetic |
| Parser | 10 | No ML-based parser connected |
| Brain | 7 | Model too small + mock is default |
| Memory | 7 | Deduplication uses word counting |
| Scoring | 5 | All qualifiers penalized equally |
| Testing | 7 | No gold standard test set |
| Training | 5 | Only stage 1 of 4 built |
| Teamwork | 4 | Council doesn't debate + no router |
| Longevity | 6 | No drift detection |
| Human | 5 | Information overload |
| Security | 4 | No input validation |
| Agents | 4 | Agents have no brain |

How This Document Was Created

  1. Read every file in the repository (60+ files, 545KB of code and documentation)
  2. Compared code against design — for each feature in SYSTEM_DESIGN.md, checked if it exists in the code
  3. Incorporated 87 original blindspots from BLINDSPOT_AUDIT_COMPLETE.md
  4. Incorporated 12 architectural improvements from previous expert review session
  5. Wrote first draft with 47 blindspots
  6. Re-read with fresh eyes and found 31 additional blindspots
  7. Deduplicated and organized into 78 unique findings across 14 categories
  8. Rewrote everything in high-school-readable language

Relationship to Other Documents

| Document | What It Contains | How This One Is Different |
|----------|------------------|---------------------------|
| BLINDSPOT_AUDIT_COMPLETE.md | 87 theoretical failure modes found by adversarial critique | Theoretical — found by thinking about what COULD go wrong |
| SYSTEM_DESIGN.md | The dream architecture — what the system SHOULD be | Aspirational — describes the finished product |
| This document | 78 practical problems found by reading the actual code | Practical — found by comparing code to design |

The key difference: the audit found problems by THINKING. This document found problems by READING THE CODE. Many overlap, but this document catches things the audit missed (like the mock extractor being the default, or the embedding model not existing), and skips things the audit included that are already partially addressed in code.


This is the compiled final edition. Each finding is grounded in specific code files and can be verified by reading the source.


Appendix: New Improvements from Prior Art Analysis (2026-04-23)

Source: Comprehensive analysis of 15 similar systems (PRIOR_ART_ANALYSIS.md, SYSTEM_INSPIRATIONS.md)

After analyzing every system that does something similar to PhD Research OS, we found 16 concrete improvements to adopt, adapt, or build. These are ON TOP of the 78 blindspots already documented above. Each one comes from a real, published, working system.

New Blindspots Discovered from Prior Art

| ID | Category | Severity | Problem | Source System |
|----|----------|----------|---------|---------------|
| PA-1 | Memory | 🔴 | No scientific embedding model for dedup (word overlap misses paraphrases — "0.8 fM" ≠ "800 attomolar" to our system, but they're the same) | SPECTER2 (AllenAI) proves embeddings fix this |
| PA-2 | Testing | 🔴 | No standard benchmark for claim verification quality (we can't compare ourselves to anyone else) | SciFact (AllenAI) is the industry standard we should measure against |
| PA-3 | Data | 🔴 | 1,900 synthetic examples vs 137,000 expert-written ones available for free (72× gap) | SciRIFF (AllenAI) is sitting on HuggingFace waiting to be used |
| PA-4 | Brain | 🟠 | No pre-filtering before expensive extraction (AI Council processes boilerplate text and acknowledgments as seriously as Results) | PaperQA2's RCS technique filters noise before analysis |
| PA-5 | Brain | 🟠 | No code-based epistemic validation (AI classification has no deterministic cross-check) | KGX3's language-game filters provide rule-based validation |
| PA-6 | Brain | 🟠 | No completeness audit after extraction (system could silently miss the most important finding) | Paper Circle's Coverage Checker catches omissions |
| PA-7 | Brain | 🟠 | System never says "I don't know" (every input gets a confident-looking answer, even garbage) | PaperQA2 refuses to answer when evidence is insufficient |
| PA-8 | Memory | 🟠 | Contradiction detection has no fast pre-filter (500K candidate pairs would each require an expensive LLM call) | SciBERT-NLI can pre-screen in seconds, not hours |
| PA-9 | Memory | 🟠 | No cross-reference verification between paper claims and knowledge graph (claims checked in isolation) | FactReview checks against both the paper AND external literature |
| PA-10 | Scoring | 🟡 | No temporal confidence tracking (a claim that's been rising for 4 years looks the same as one that just appeared) | CLAIRE + PaperQA2 both track temporal patterns |
| PA-11 | Scoring | 🟡 | Confidence numbers shown without explanation (0.72 means nothing without knowing WHY) | CLUE generates plain-English uncertainty explanations |
| PA-12 | Human | 🟡 | No active skepticism mode (system only confirms, never challenges high-confidence claims) | CLAIRE actively looks for counter-evidence |
| PA-13 | Human | 🟡 | No human verification tracking (machine-extracted claims look identical to expert-verified ones) | ORKG + Paper Circle track verification status |
| PA-14 | Brain | 🟡 | No self-critique step in Council (Extractor produces claims without articulating uncertainty) | CritiCal proves that self-critique improves calibration |
| PA-15 | Memory | 🟡 | Conflict detection is shallow (compares two claims in isolation, doesn't investigate with surrounding evidence) | CLAIRE's investigation loop retrieves additional evidence before flagging |
| PA-16 | Data | 🟡 | No non-AI anchors in teacher ensemble (all 5 teacher models share similar biases from internet training) | FactReview uses CODE EXECUTION as a non-AI evidence source |

Updated Summary by Severity (Including Prior Art Findings)

| Severity | Original Count | New from Prior Art | Total |
|----------|----------------|--------------------|-------|
| 🔴 Critical | 14 | 3 | 17 |
| 🟠 High | 33 | 5 | 38 |
| 🟡 Medium | 31 | 8 | 39 |
| Total | 78 | 16 | 94 |

Priority Integration Order for Prior Art Items

Do These First (they fix Critical blindspots with minimal effort):

| Priority | Item | Effort | What It Fixes |
|----------|------|--------|---------------|
| 1 | PA-1: Integrate SPECTER2 | 1-2 days | Deduplication actually works |
| 2 | PA-2: Add SciFact benchmark | 1 day | We can measure quality |
| 3 | PA-3: Integrate SciRIFF data | 2-3 days | 72× more training data |

Do These Next (they fix High blindspots with moderate effort):

| Priority | Item | Effort | What It Fixes |
|----------|------|--------|---------------|
| 4 | PA-8: SciBERT-NLI pre-filter | 1-2 days | Conflict detection becomes 100× faster |
| 5 | PA-4: Pre-Extraction Filter | 1 week | Removes noise before expensive processing |
| 6 | PA-5: Epistemic Trigger Words | 1 week | Code-based validation of AI labels |
| 7 | PA-7: Low Confidence Quarantine | 3-5 days | System learns to say "I don't know" |
| 8 | PA-6: Completeness Auditor | 1 week | Catches silent omissions |
| 9 | PA-9: Cross-Reference Verification | 1-2 weeks | Claims checked against both paper and graph |

Do These When Core Is Stable (they add new capabilities):

| Priority | Item | Effort | What It Adds |
|----------|------|--------|--------------|
| 10 | PA-14: Council Self-Critique | 1 week | Better calibration |
| 11 | PA-13: Provenance Levels | 1 week | Human verification tracking |
| 12 | PA-15: Investigation Protocol | 2-3 weeks | Deep contradiction analysis |
| 13 | PA-10: Epistemic Velocity | 3-5 days | Temporal confidence trends |
| 14 | PA-12: Devil's Advocate Mode | 1-2 weeks | Active skepticism |
| 15 | PA-11: Confidence Explanations | 3-5 days | Plain-English scoring reasons |
| 16 | PA-16: Non-AI anchors | Ongoing | Deterministic validation alongside AI |

Appendix added 2026-04-23. These 16 items supplement the original 78 blindspots with findings from prior art analysis. See PRIOR_ART_ANALYSIS.md and SYSTEM_INSPIRATIONS.md for full details.