nkshirsa committed on
Commit
c80f8b7
·
verified ·
1 Parent(s): d7a7933

FUTURE_IMPROVEMENTS.md v2.0 Final — 78 blindspots, 14 categories, highschool-readable, compiled from code review + audit + architecture review

Files changed (1)
  1. FUTURE_IMPROVEMENTS.md +436 -474
FUTURE_IMPROVEMENTS.md CHANGED
@@ -1,710 +1,672 @@
1
  # PhD Research OS — Future Improvements Roadmap
2
- ## The Complete Guide to Making This System Actually Work
3
 
4
  **Written so a high school student can understand every word.**
5
 
6
- **Version**: 1.0
7
  **Date**: 2026-04-23
8
- **Status**: Living document updated through iterative blindspot discovery
9
- **Iteration count**: 1 of 100
10
 
11
  ---
12
 
13
  ## What Is This Document?
14
 
15
- Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts. This document is a list of everything that could go wrong and how to fix it, written by looking at the actual code, not just the design plans.
16
 
17
- Think of it like a car safety inspection. The car looks good from the outside, but we need to check under the hood, test the brakes, and make sure the airbags actually work.
 
 
18
 
19
  ---
20
 
21
  ## Table of Contents
22
 
23
- 1. [The Big Picture Problems](#1-the-big-picture-problems)
24
- 2. [Data Problems — Garbage In, Garbage Out](#2-data-problems)
25
- 3. [Reading Papers — The PDF Nightmare](#3-reading-papers)
26
- 4. [Understanding Science — The Brain Problems](#4-understanding-science)
27
- 5. [Remembering Things — The Memory Problems](#5-remembering-things)
28
- 6. [Scoring Confidence — The Trust Problems](#6-scoring-confidence)
29
- 7. [Testing — How Do We Know It Works?](#7-testing)
30
- 8. [Training — Teaching the AI](#8-training)
31
- 9. [Working Together — Multi-Model Problems](#9-working-together)
32
- 10. [Staying Honest Over Time](#10-staying-honest-over-time)
33
- 11. [The Human Side](#11-the-human-side)
34
- 12. [Security and Safety](#12-security-and-safety)
35
- 13. [What To Build First](#13-what-to-build-first)
 
 
 
36
 
37
  ---
38
 
39
- ## 1. The Big Picture Problems
40
-
41
- ### What's the main goal?
42
-
43
- We want a computer that can:
44
- 1. Read a science paper (like a research study about a new cancer test)
45
- 2. Pull out the key findings ("This test detected cancer at 0.8 fM concentration")
46
- 3. Label each finding: Is it a proven fact? A guess? A theory?
47
- 4. Compare findings across many papers and spot contradictions
48
- 5. Tell the researcher how much to trust each finding
49
 
50
- ### What's actually built vs. what's designed?
51
-
52
- Think of building a house. Here's where we are:
53
 
54
- | Part | Status | Analogy |
55
- |------|--------|---------|
56
- | Database (where data lives) | ✅ Built and working | Foundation is poured |
57
- | PDF reader (Layer 0) | ⚠️ Basic version works | Walls are up but no insulation |
58
- | Claim extractor (Layer 2) | ⚠️ Works with mock data, needs real AI | Wiring is in but no electricity yet |
59
- | Deduplicator (Layer 3) | ⚠️ Uses simple word matching, not smart matching | Front door is installed but it's a screen door |
60
- | Knowledge graph (Layer 4) | ⚠️ Structure exists, conflict detection is basic | Plumbing pipes laid but not connected to water |
61
- | Scoring engine (Layer 5) | ✅ Formula works correctly | Heating system works |
62
- | Evaluation (Layer 6) | ⚠️ Counts things but doesn't check if they're right | Smoke detector is installed but has no battery |
63
- | Export to Obsidian (Layer 7) | ✅ Works | Mailbox is installed |
64
- | AI agent system | ✅ Framework works | Garage is built |
65
- | Training data | ⚠️ 1,900 examples — need 10,000+ | You have some furniture but most rooms are empty |
66
- | Trained model | ⚠️ Qwen2.5-3B — design says upgrade to Qwen3-8B | The car in the garage is a Honda Civic, plan calls for a Tesla |
67
 
68
- ### The Gap That Matters Most
69
 
70
- **The system is designed to be evidence-centered, but the code is still model-centered.**
71
 
72
- What does that mean?
 
 
 
73
 
74
- - **Model-centered** means: "Hey AI, read this text and tell me what you think." The AI writes a paragraph about what it found. You hope it's right.
75
- - **Evidence-centered** means: "Hey AI, find the exact sentence in this paper that says something important, highlight it, and classify it." The AI points to specific text. You can check it yourself.
 
 
76
 
77
- Right now, the code generates claims as free text. It should be generating claims as *pointers into the document*. Every claim should be a highlighted sentence with a label, not a summary written by the AI.
78
 
79
- **Why this matters for a high schooler**: Imagine your friend reads a book and tells you "The main character dies at the end." That's model-centered: you're trusting your friend. Evidence-centered would be: "Look at page 342, paragraph 2, where it says 'He took his last breath.'" Now you can check for yourself.
80
 
81
- ---
82
 
83
- ## 2. Data Problems — Garbage In, Garbage Out
84
 
85
- ### Blindspot D-1: Training Data Is Synthetic, Not Real 🔴
86
 
87
- **What's wrong**: All 1,900 training examples were generated by a Python script (`generate_dataset.py`), not extracted from actual science papers. The fake data uses templates like "We investigated the effect of {param1} on {topic}..." Real papers don't write like that.
88
 
89
- **Why it matters**: Imagine learning to cook by reading a cookbook written by someone who has never cooked. The recipes look right, but the timings are wrong, the ingredients are approximate, and the techniques are described theoretically. When you actually try to cook, nothing works quite right.
90
 
91
- **The fix**: We need real examples. Take 100 actual science papers. Have a human expert read each one and manually label: "This sentence is a Fact," "This sentence is an Interpretation," "This qualifier word 'may' was important." Those hand-labeled examples become the gold standard.
92
 
93
- **Cost**: About 200 hours of expert time ($10,000-$20,000 if hiring domain experts).
94
 
95
- ### Blindspot D-2: No Hard Negatives in Training Data 🔴
96
 
97
- **What's wrong**: The training data only has correct examples. There are no examples of *almost correct but subtly wrong* outputs.
98
 
99
- **Why it matters**: Think of studying for a multiple choice test. If you only study the right answers, you'll struggle with tricky wrong answers that look almost right. The AI needs to see examples like:
100
- - "The LOD was 0.8 fM" → correct extraction ✅
101
- - "The LOD was 0.8" (unit dropped) → wrong! ❌
102
- - "The LOD improved dramatically" (number replaced with vague word) → wrong! ❌
103
- - "The LOD was 0.8 µM" (wrong unit — fM ≠ µM, off by a billion) → wrong! ❌
104
 
105
- **The fix**: Run the current model on papers, collect its mistakes, and use those mistakes as "here's what NOT to do" training examples. This is called hard-negative mining.
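
- Here is a minimal sketch of what that mining loop could look like. The field names (`input`, `gold`) and the `model_predict` function are placeholders, not names from the actual pipeline:

- ```python
- # Minimal hard-negative mining sketch (all names here are placeholders).
- # Idea: keep each wrong answer next to the right one, which is exactly the
- # "chosen vs. rejected" pair format that preference training (DPO) expects.
- def mine_hard_negatives(examples, model_predict):
-     """examples: list of dicts with 'input' and 'gold' fields."""
-     pairs = []
-     for ex in examples:
-         predicted = model_predict(ex["input"])    # the current model's attempt
-         if predicted != ex["gold"]:               # a mistake becomes a hard negative
-             pairs.append({
-                 "prompt": ex["input"],
-                 "chosen": ex["gold"],       # what it should have said
-                 "rejected": predicted,      # what it actually said
-             })
-     return pairs
- ```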
106
 
107
- ### Blindspot D-3: No Error Categories in Training Data 🟠
108
 
109
- **What's wrong**: When the model makes a mistake, we just mark it as "wrong." We don't say WHY it's wrong. Was it because a qualifier was dropped? A number was changed? A unit was lost? A section was misidentified?
110
 
111
- **Why it matters**: If a student keeps getting math problems wrong, a good teacher figures out WHY — is it multiplication? Fractions? Word problems? Just saying "wrong" over and over doesn't help. We need to say "you keep dropping the units" or "you keep missing the word 'not' in sentences."
112
 
113
- **The fix**: Create a taxonomy of error types:
114
 
115
- | Error Type | Example | How Common |
116
- |-----------|---------|------------|
117
- | Qualifier loss | "may reduce" → "reduces" | Very common |
118
- | Unit drop | "0.8 fM" → "0.8" | Common |
119
- | Negation flip | "did NOT improve" → "improved" | Dangerous |
120
- | Section confusion | Abstract claim labeled as Results | Common |
121
- | Number hallucination | Paper says 0.8, AI says 0.6 | Rare but devastating |
122
- | Granularity mismatch | Table row treated as whole-paper finding | Common |
123
- | Citation attribution | Claim from cited paper treated as this paper's finding | Very common |
124
- | Causal overclaim | "correlated with" → "caused by" | Common |
125
- | Missing statistical context | p-value or sample size dropped | Common |
126
 
127
- Train the model to recognize its own error types. When it's uncertain, it should say "I might be dropping a qualifier here" instead of just guessing.
128
 
129
- ### Blindspot D-4: Teacher Signals Are Not Preserved 🟠
130
 
131
- **What's wrong**: The system design talks about running 5 different AI models (teachers) on the same paper and keeping ALL their answers. But the current data pipeline just generates one answer per example.
 
 
 
 
 
 
 
 
 
132
 
133
- **Why it matters**: Imagine 5 doctors examining the same patient. If 4 say "it's a cold" and 1 says "check for pneumonia," you don't want to throw away that 1 dissenting opinion. That minority view might save a life. Same with AI — the disagreement IS the signal.
134
 
135
- **The fix**: For each training example, store:
136
- - What Teacher 1 said (and how confident it was)
137
- - What Teacher 2 said (and how confident it was)
138
- - ... and so on for all teachers
139
- - Where they agreed → high confidence training signal
140
- - Where they disagreed → this is a "hard case" that teaches the student about uncertainty
141
 
142
- ### Blindspot D-5: No Counterfactual Data 🟡
143
 
144
- **What's wrong**: No training examples where we deliberately change one thing and check if the model notices.
145
 
146
- **Why it matters**: If a paper says "Treatment A did NOT reduce tumor size," and we change it to "Treatment A reduced tumor size," does the model catch the difference? If it gives the same answer for both, it's not actually reading — it's pattern matching on the other words.
147
 
148
- **The fix**: For every 10th training example, create a "mirror" version where one critical word is changed (add/remove "not", swap a unit, change a number). The model must give a DIFFERENT answer for the mirror version. If it doesn't, that tells us it's using shortcuts instead of actually understanding.
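
- As a quick illustration, here is one way such a mirror could be generated automatically. The flip rules shown are illustrative examples, not the project's actual perturbation list:

- ```python
- # Minimal sketch of building a "mirror" example by flipping one critical phrase.
- # The flip rules below are illustrative, not the project's actual perturbation list.
- import re
- NEGATION_FLIPS = [
-     (r"\bdid not\b", "did"),
-     (r"\bno significant\b", "a significant"),
- ]
- def make_mirror(sentence):
-     """Return a copy with one critical phrase flipped, or None if no rule applies."""
-     for pattern, replacement in NEGATION_FLIPS:
-         if re.search(pattern, sentence, flags=re.IGNORECASE):
-             return re.sub(pattern, replacement, sentence, count=1, flags=re.IGNORECASE)
-     return None
- original = "Treatment A did not reduce tumor size."
- mirror = make_mirror(original)   # "Treatment A did reduce tumor size."
- # The model must give DIFFERENT answers for the original and the mirror.
- ```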
149
 
150
  ---
151
 
152
- ## 3. Reading Papers — The PDF Nightmare
153
 
154
- ### Blindspot P-1: No Actual ML-Based Parser Is Integrated 🔴
155
 
156
- **What's wrong**: The code in `parser.py` uses PyMuPDF (fitz) or pdfplumber. These are basic text extractors: they just scrape text off the page. The system design says to use Marker (a machine learning-based parser that understands document layout), but Marker is not actually integrated.
157
 
158
- **Why it matters**: Basic text extractors destroy the structure of a paper. They merge columns, lose table formatting, can't tell the difference between a heading and body text, and turn equations into garbage characters. It's like trying to read a newspaper by photographing it and running basic OCR — you get a jumble of text from different columns mixed together.
159
 
160
- **The fix**: Actually install and integrate Marker. It's a Python library that uses deep learning to understand page layout. It knows that a two-column paper has separate columns, that a table has rows and cells, and that an equation is not regular text.
161
 
162
- ```
163
- Current: PDF → pdfplumber → flat text (broken tables, merged columns)
164
- Needed: PDF → Marker → structured markdown (tables preserved, columns separated)
165
- ```
166
 
167
- ### Blindspot P-2: Section Detection Is Fragile 🟠
168
 
169
- **What's wrong**: The code detects sections (Abstract, Methods, Results) using simple text pattern matching: it looks for the word "Abstract" at the start of a line. But many papers:
170
- - Number their sections ("2. Materials and Methods" or "III. Experimental Setup")
171
- - Use non-standard names ("Findings" instead of "Results")
172
- - Are in languages other than English
173
- - Have combined sections ("Results and Discussion")
174
- - Don't have clear section headers at all (short communications, letters)
175
 
176
- **Why it matters**: The entire scoring system depends on knowing which section a claim comes from. A finding in the Results section gets full confidence (1.0×). The same finding in the Abstract gets penalized (0.7×). If we misidentify the section, we score the claim wrong.
177
 
178
- **The fix**: Use Marker's built-in section detection (it's trained on millions of papers), and add a fallback classifier that looks at the CONTENT of a paragraph to guess which section it belongs to (e.g., paragraphs with many citations are probably Introduction/Discussion, paragraphs with numbers and p-values are probably Results).
179
 
180
- ### Blindspot P-3: Tables Are Extracted As Flat Text 🔴
181
 
182
- **What's wrong**: When pdfplumber extracts a table, it stores it as pipe-separated text: `"0.8 | fM | PBS"`. The relationship between the header ("LOD") and the value ("0.8 fM") is lost. You can't tell which column header goes with which value.
183
 
184
- **Why it matters**: Tables contain the most important evidence in a science paper. If you lose the header-value relationship, you lose the meaning. "0.8" means nothing. "LOD = 0.8 fM" means everything.
185
 
186
- **The fix**: Store tables as structured data (like a spreadsheet), not flat text:
187
 
188
- ```json
189
- {
190
-   "table_id": "TAB_001",
191
-   "headers": ["Parameter", "Value", "Buffer", "Method"],
192
-   "rows": [
193
-     ["LOD", "0.8 fM", "10 mM PBS", "3σ/slope"],
194
-     ["Dynamic range", "1 fM - 100 nM", "10 mM PBS", "calibration curve"]
195
-   ]
196
- }
197
- ```
198
 
199
- Now the AI knows that "0.8 fM" is the "LOD" (limit of detection), measured in "10 mM PBS" buffer, using the "3σ/slope" method.
200
 
201
- ### Blindspot P-4: Figures Are Completely Ignored 🟠
202
 
203
- **What's wrong**: The parser detects image blocks and stores them as `"[Image detected requires VLM processing]"`. No actual image analysis happens.
204
 
205
- **Why it matters**: About 30-40% of the key evidence in a science paper is in figures. A scatter plot showing that sensitivity decreases at high ionic strength contains critical quantitative data. If we ignore all figures, we're reading the paper with one eye closed.
206
 
207
- **The fix**: This is a multi-step process:
208
- 1. Detect if a figure is a chart (bar, scatter, line), a diagram, or a photograph (micrograph)
209
- 2. For charts: use a plot digitizer to extract the actual data points (like WebPlotDigitizer)
210
- 3. For diagrams: use a vision-language model (VLM) to describe what the diagram shows
211
- 4. For micrographs: use a VLM to identify what the image shows (cells, crystals, etc.)
212
- 5. Cross-check: does the data from the figure match what the text says about it?
213
 
214
- ### Blindspot P-5: Cross-References Are Detected But Never Verified 🟡
215
 
216
- **What's wrong**: The code finds references like "see Figure 3" or "Table 2" in the text, but it never checks if Figure 3 actually exists in the parsed output, or if the table labeled as Table 2 is actually Table 2.
217
 
218
- **Why it matters**: If the parser mislabels tables (assigns Table 1's caption to Table 2's data), every claim that references those tables will have wrong evidence attached to it. Wrong evidence is worse than no evidence.
219
 
220
- ### Blindspot P-6: No Supplement Handling 🟠
221
 
222
- **What's wrong**: The system design describes "paper bundles" (main PDF + supplementary files), but the parser can only process one file at a time. There's no way to say "this supplement goes with that main paper."
223
 
224
- **Why it matters**: In many fields (biology, chemistry), the most important data is in the supplementary materials. The main paper says "see Supplementary Table S3 for full results" — if we can't read the supplement, we're missing the best evidence.
225
 
226
- ### Blindspot P-7: No Language Detection or Handling 🟡
227
 
228
- **What's wrong**: The system assumes all papers are in English. There's no detection of non-English text and no handling strategy for it.
229
 
230
- **Why it matters**: Many important papers, especially in Chinese, Japanese, German, and French science journals, are not in English. If someone feeds a Chinese paper into the system, it will try to extract "claims" from text it can't understand, and it will produce confident-sounding garbage.
 
 
231
 
232
- **The fix**: Before processing, detect the language. If it's not English, either: (a) refuse with a clear message, (b) translate first and flag reduced confidence, or (c) route to a multilingual model.
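
- A minimal sketch of option (a), using the `langdetect` package (one of several libraries that could do this):

- ```python
- # Minimal sketch using the langdetect package (one of several options).
- from langdetect import detect
- def check_language(text: str) -> str:
-     try:
-         return detect(text[:5000])   # a few pages of text is plenty
-     except Exception:                # langdetect raises if text is empty or garbled
-         return "unknown"
- sample = "Wir haben die Nachweisgrenze des Biosensors bestimmt."
- if check_language(sample) != "en":
-     print("Non-English paper detected: refusing to extract claims.")
- ```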
233
 
234
- ---
235
 
236
- ## 4. Understanding Science — The Brain Problems
237
 
238
- ### Blindspot B-1: The Model Is Way Too Small 🔴
239
 
240
- **What's wrong**: The current model is Qwen2.5-3B (3 billion parameters). The design calls for Qwen3-8B. But even 8B may not be enough for reliable scientific reasoning.
241
 
242
- **Why it matters**: Think of brain size as vocabulary + reasoning ability. A 3B model is like a smart middle schooler — it can follow instructions and produce well-formatted output, but it doesn't deeply understand what it's writing. An 8B model is like a college freshman — much better, but still makes errors on complex reasoning. A 27B model is like a PhD student — it can genuinely reason about scientific claims.
243
 
244
- **The trade-off**: Bigger models need more computer memory (GPU/VRAM). The design targets a 16-24GB consumer GPU. Here are the options:
245
 
246
- | Model | Size | VRAM (quantized) | Reasoning Quality |
247
- |-------|------|-------------------|-------------------|
248
- | Qwen2.5-3B | 3B | ~2.5 GB | 🟡 Basic structure, frequent errors |
249
- | Qwen3-8B | 8B | ~5 GB | 🟢 Good structure, occasional errors |
250
- | Qwen3.5-9B | 9B | ~6 GB | 🟢 Better reasoning patterns |
251
- | Qwen3-30B-A3B MoE | 30B (3B active) | ~6 GB | 🟢🟢 Dense-14B quality at 3B cost |
252
- | Qwen3.5-27B | 27B | ~15 GB | 🟢🟢🟢 Best for this task |
253
 
254
- ### Blindspot B-2: No Specialist Heads — One Brain Doing Too Many Jobs 🟠
255
 
256
- **What's wrong**: A single model does ALL tasks: claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want very different behaviors.
257
 
258
- **Why it matters**: Think about a Swiss Army knife versus a chef's knife set. The Swiss Army knife can do everything but nothing well. Claim extraction wants to cast a wide net (find ALL the claims, even questionable ones). Qualifier detection wants surgical precision (keep the EXACT words). Statistical parsing wants numerical accuracy (get the numbers RIGHT). One model can't optimize for all of these simultaneously.
259
 
260
- **The fix**: Use a shared base model but add specialized "heads" (output layers) for different tasks:
261
- - Extraction head: optimized for recall (finding things)
262
- - Qualifier head: optimized for precision (keeping exact words)
263
- - Statistics head: optimized for numerical accuracy
264
- - Conflict head: optimized for comparison across claims
265
 
266
- This is like having one person who can speak 6 languages (the shared base) but wears different hats (the heads) depending on which task they're doing.
267
 
268
- ### Blindspot B-3: No Domain Pre-Training 🟠
269
 
270
- **What's wrong**: The base model (Qwen2.5-3B or Qwen3-8B) was trained on general internet text. It knows about cooking, sports, history, and programming. But it doesn't have deep knowledge of scientific terminology, experimental methods, or statistical analysis.
271
 
272
- **Why it matters**: If you asked someone who has never taken a chemistry class to read a chemistry paper, they'd struggle with terms like "LOD," "analyte," "electrochemical impedance spectroscopy," or "cyclic voltammetry." They might extract the sentence but misclassify it because they don't understand what the words mean. Our AI has the same problem.
273
 
274
- **The fix**: Before fine-tuning on the Research OS data, do a "domain warm-up" — feed the model thousands of science papers (just reading them, not labeling) so it learns the vocabulary and writing style. Datasets like S2ORC (Semantic Scholar Open Research Corpus) have millions of papers we could use.
275
 
276
- ### Blindspot B-4: Mock Extraction Is the Default Path 🔴
277
 
278
- **What's wrong**: The `QualifiedExtractor` class in `extractor.py` has a method called `_extract_mock()` that uses simple pattern matching (looking for words like "measured" or "suggests") to classify claims. When no AI model is connected, this is what runs. And based on the code, this is what actually runs in practice because the AI brain connection is optional.
279
 
280
- **Why it matters**: The mock extractor is a toy. It assigns "Fact" to any sentence containing the word "measured" and "Hypothesis" to any sentence containing "may." This is like a doctor who diagnoses every patient with a cough as having a cold and every patient with a headache as having a migraine. The real world is much more complex.
281
 
282
- **Example of mock failure**:
283
- - "We measured the baseline but found no significant difference" → Mock says: Fact (because "measured") → Actually: Null result
284
- - "The previously measured values suggest contamination" → Mock says: Fact → Actually: Interpretation about a past measurement
285
- - "It may be the most reliable technique measured to date" → Mock says: both Fact AND Hypothesis → Which one wins? Depends on code ordering.
286
 
287
- ### Blindspot B-5: No Constrained Decoding 🟠
288
 
289
- **What's wrong**: The design describes using the "Guidance" library to force the AI to output valid JSON with valid tags (only "Fact", "Interpretation", "Hypothesis", or "Conflict_Hypothesis"). But this is not implemented. The current code just asks the AI to output JSON and hopes for the best.
290
 
291
- **Why it matters**: LLMs are unreliable at producing valid JSON, especially small ones. Without constrained decoding, the model might output:
292
- - `{"epistemic_tag": "fact"}` (lowercase — not in the allowed list)
293
- - `{"epistemic_tag": "Observation"}` (a category we don't have)
294
- - Broken JSON with missing brackets
295
- - A mix of JSON and natural language explanation
296
 
297
- Constrained decoding GUARANTEES the output is valid. It's like fill-in-the-blank versus an essay: fill-in-the-blank can't have format errors.
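
- True constrained decoding happens while the model is generating tokens (the design names the Guidance library for this). Until that is wired in, a validate-and-retry wrapper is a rough stopgap; this sketch assumes a generic `generate` function and a retry count picked arbitrarily:

- ```python
- # Validate-and-retry stopgap (NOT true constrained decoding, which works at the
- # token level). 'generate' is any function that takes a prompt and returns text.
- import json
- ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}
- def extract_with_validation(generate, prompt, max_retries=3):
-     for _ in range(max_retries):
-         raw = generate(prompt)
-         try:
-             claim = json.loads(raw)
-         except json.JSONDecodeError:
-             continue                                   # broken JSON: try again
-         if claim.get("epistemic_tag") in ALLOWED_TAGS:
-             return claim                               # valid structure and tag
-     return None                                        # give up: flag for human review
- ```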
298
 
299
- ### Blindspot B-6: No Out-of-Distribution Detection 🟠
300
 
301
- **What's wrong**: The system has no way to detect when it's being asked to process a paper from a field it doesn't understand.
302
 
303
- **Why it matters**: The training data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. If someone feeds it a sociology paper, a legal document, or a paper about ancient Egyptian archaeology, the model will still produce confident-looking scientific claim extractions. They'll be nonsense, but they'll look perfectly formatted.
304
 
305
- **The fix**: Before the AI processes anything, check if the paper's vocabulary and topic are similar to the training data. If they're too different, say "I don't know this field well enough; my results may not be reliable" instead of producing garbage with high confidence.
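
- A minimal sketch of one way to do this check, assuming we keep a single averaged embedding of the training-domain abstracts (the threshold value here is a guess that would need tuning):

- ```python
- # Minimal out-of-domain check: compare a new abstract to the average embedding
- # of the training-domain abstracts. The 0.35 threshold is a guess to be tuned.
- import numpy as np
- from sentence_transformers import SentenceTransformer
- model = SentenceTransformer("all-MiniLM-L6-v2")
- def domain_centroid(training_abstracts):
-     return model.encode(training_abstracts).mean(axis=0)
- def is_out_of_domain(abstract, centroid, threshold=0.35):
-     v = model.encode(abstract)
-     cosine = float(np.dot(v, centroid) / (np.linalg.norm(v) * np.linalg.norm(centroid)))
-     return cosine < threshold   # low similarity to everything we trained on
- ```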
306
 
307
  ---
308
 
309
- ## 5. Remembering Things — The Memory Problems
310
-
311
- ### Blindspot M-1: Deduplication Uses Jaccard Similarity (Word Overlap), Not Semantic Similarity 🔴
312
 
313
- **What's wrong**: When the system checks if two claims are duplicates, it compares them by counting how many words they share. This is the `jaccard_similarity` function in `canonicalizer.py`.
314
 
315
- **Why it matters**: These two claims say the same thing but share almost no words:
316
- - "The LOD was 0.8 fM for the GFET biosensor"
317
- - "A detection limit of eight-tenths of a femtomolar was achieved using the graphene transistor"
318
 
319
- Jaccard similarity would rate these as very different (almost no shared words). But they're the same finding! The system would create two separate canonical claims for the same discovery.
320
 
321
- **The fix**: Use embedding-based similarity. Convert each claim into a number vector using a sentence embedding model (like `all-MiniLM-L6-v2`). Claims that mean the same thing will have similar vectors, even if they use different words.
322
 
323
- ### Blindspot M-2: No Embedding Model Is Connected 🟠
324
 
325
- **What's wrong**: The design mentions using embeddings for deduplication, but the code doesn't import or use any embedding model. There's no ChromaDB connection, no sentence-transformers, nothing.
326
 
327
- **The fix**: Add a lightweight embedding model (runs on CPU, very fast):
328
 
329
- ```python
330
- from sentence_transformers import SentenceTransformer
331
- model = SentenceTransformer('all-MiniLM-L6-v2') # 80MB, runs on CPU
332
- embedding = model.encode("The LOD was 0.8 fM")
333
- # Now compare embeddings with cosine similarity instead of word overlap
334
- ```
335
 
336
- ### Blindspot M-3: Knowledge Graph Conflict Detection Is Based on Word Overlap 🟠
337
 
338
- **What's wrong**: The `find_conflicts()` method in `graph.py` finds conflicts by checking if two claims share enough words. Same problem as deduplication — semantically related claims with different vocabulary will be missed.
339
 
340
- **Additionally**: The method only compares the top 500 claims by confidence. If claim #501 contradicts claim #3, it will never be found.
341
 
342
- ### Blindspot M-4: No Temporal Reasoning 🟡
343
 
344
- **What's wrong**: The knowledge graph has no concept of time. A claim from 2015 and a claim from 2025 about the same topic are treated equally. But the 2025 paper might specifically address and overturn the 2015 finding.
345
-
346
- **The fix**: Every claim needs a publication year. When comparing claims, newer evidence should be weighted more heavily. And when a newer paper directly cites and contradicts an older one, the graph should note this as a "supersedes" relationship.
347
-
348
- ### Blindspot M-5: Gap Analysis Only Compares Same-Type Nodes 🟡
349
-
350
- **What's wrong**: In `graph.py`, the `find_gaps()` method only looks for gaps between nodes of the same type (`if a["node_type"] != b["node_type"]: continue`). But interesting gaps can exist between different types — for example, between a claim node and a method node that have never been connected.
351
 
352
  ---
353
 
354
- ## 6. Scoring Confidence — The Trust Problems
355
 
356
- ### Blindspot S-1: Qualifier Penalty Is Too Simple 🟠
357
 
358
- **What's wrong**: Each qualifier reduces confidence by exactly 0.1. So "may" (-0.1), "suggests" (-0.1), and "under controlled conditions" (-0.1) all get the same penalty. But "may" is much weaker than "under controlled conditions."
359
 
360
- **Why it matters**: Not all hedging words are equal:
361
- - "may" = very uncertain (should be -0.2)
362
- - "appears to" = moderately uncertain (should be -0.15)
363
- - "under these specific conditions" = limits scope but not uncertain (should be -0.05)
364
- - "demonstrates" = actually strengthens the claim (should be +0.05)
365
 
366
- **The fix**: Weight qualifiers by type:
367
 
368
- | Qualifier Type | Example Words | Penalty |
369
- |---------------|---------------|---------|
370
- | Strong hedge | may, might, could, possibly | -0.20 |
371
- | Moderate hedge | suggests, indicates, appears | -0.15 |
372
- | Weak hedge | likely, tends to, generally | -0.10 |
373
- | Scope limiter | under these conditions, in vitro | -0.05 |
374
- | Strengthener | demonstrates, confirms, proves | +0.05 |
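
- A minimal sketch of how the table above could be applied in the scorer. The word lists are shortened and the clamping range is an assumption, not something from the design document:

- ```python
- # Type-weighted qualifier penalties, mirroring the table above.
- # Word lists are shortened and the clamp range is an assumption.
- QUALIFIER_WEIGHTS = {
-     "strong_hedge":   (-0.20, {"may", "might", "could", "possibly"}),
-     "moderate_hedge": (-0.15, {"suggests", "indicates", "appears"}),
-     "weak_hedge":     (-0.10, {"likely", "tends to", "generally"}),
-     "scope_limiter":  (-0.05, {"under these conditions", "in vitro"}),
-     "strengthener":   (+0.05, {"demonstrates", "confirms", "proves"}),
- }
- def qualifier_adjustment(qualifiers):
-     """Sum the penalty or bonus for each detected qualifier phrase."""
-     total = 0.0
-     for phrase in qualifiers:
-         for weight, words in QUALIFIER_WEIGHTS.values():
-             if phrase.lower() in words:
-                 total += weight
-                 break
-     return max(-0.5, min(0.1, total))   # clamp so one claim can't swing too far
- print(qualifier_adjustment(["may", "in vitro"]))   # -0.25
- ```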
375
 
376
- ### Blindspot S-2: No Multi-Dimensional Calibration 🟠
377
 
378
- **What's wrong**: The system tracks one Brier score for overall confidence calibration. But confidence means different things in different contexts:
379
- - "I'm confident this claim EXISTS in the paper" (extraction confidence)
380
- - "I'm confident this qualifier is important" (qualifier confidence)
381
- - "I'm confident these two claims contradict each other" (conflict confidence)
382
- - "I'm confident this claim came from the Results section" (section confidence)
383
 
384
- A single Brier score hides all of these behind one number.
385
 
386
- **The fix**: Track 6 separate calibration curves. A model can be brilliantly calibrated on extraction ("I found a claim" is right 85% of the time when it says 85%) but terribly calibrated on conflict detection ("I found a contradiction" is right only 40% of the time when it says 80%).
387
 
388
- ### Blindspot S-3: Source Count Bonus Is Not In The Code 🟡
389
 
390
- **What's wrong**: The SYSTEM_DESIGN.md describes a "source_count_bonus" where claims supported by multiple papers get up to +0.2 confidence. But in the actual `scorer.py` code, this bonus doesn't exist. The scorer only uses information from a single claim, not the canonical claim's evidence count.
391
 
392
- ### Blindspot S-4: Conflict Penalty Is Not In The Code 🟡
393
 
394
- **What's wrong**: Similarly, the design says claims with active conflicts get a -0.3 penalty. The scorer code doesn't check for conflicts.
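
- A minimal sketch of both missing adjustments, using the design document's numbers (+0.2 cap for extra sources, -0.3 for an active conflict); the 0.05-per-extra-paper ramp is an assumption:

- ```python
- # The two missing adjustments, using the design document's numbers
- # (+0.2 cap for extra sources, -0.3 for an active conflict).
- # The 0.05-per-extra-paper ramp is an assumption, not from the design.
- def source_count_bonus(n_supporting_papers: int) -> float:
-     return min(0.2, 0.05 * max(0, n_supporting_papers - 1))
- def conflict_penalty(has_active_conflict: bool) -> float:
-     return -0.3 if has_active_conflict else 0.0
- score = 0.72 + source_count_bonus(4) + conflict_penalty(False)   # 0.72 + 0.15 + 0.0
- ```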
395
 
396
- ### Blindspot S-5: The Composite Score Is Just An Average 🟡
397
 
398
- **What's wrong**: The composite confidence is `(evidence_quality + truth_likelihood + qualifier_strength) / 3`. This means a claim with evidence_quality=1.0 and qualifier_strength=0.0 gets the same score as a claim with both at 0.5. But the first claim is STRONG evidence that's completely hedged, while the second is mediocre evidence that's moderately hedged — very different situations.
399
 
400
- **The fix**: Use the MINIMUM of the three scores, not the average. A chain is only as strong as its weakest link. If any one dimension is terrible, the composite should be terrible.
401
 
402
  ---
403
 
404
- ## 7. Testing — How Do We Know It Works?
405
-
406
- ### Blindspot T-1: No Gold Standard Test Set 🔴
407
-
408
- **What's wrong**: The evaluation system (`evaluator.py`) counts things (how many claims, how many with qualifiers, how many null results) but doesn't compare against a known correct answer. It's like grading a test by checking if every question was answered, not if the answers are correct.
409
-
410
- **Why it matters**: The system could extract 500 claims from 10 papers and report "92% have qualifiers, 15% are null results, average confidence 0.65" — and ALL 500 claims could be wrong. We'd never know because we never checked against human-labeled ground truth.
411
-
412
- **The fix**: Create a gold standard test set:
413
- 1. Take 10 papers from your actual research domain
414
- 2. Have an expert manually label every claim, qualifier, epistemic tag, and conflict
415
- 3. Run the system on those same papers
416
- 4. Compare: How many real claims did it find? How many fake claims did it invent? How many tags did it get right?
417
-
418
- ### Blindspot T-2: No Counterfactual Robustness Tests 🟠
419
 
420
- **What's wrong**: We never test if the model changes its answer for the RIGHT reason.
421
 
422
- **Why it matters**: The model might get the right answer for the wrong reason. For example, it might learn that papers published in Nature usually contain "Facts" and papers from arXiv usually contain "Hypotheses." This is a shortcut, not understanding.
423
 
424
- **Test suite we need**:
 
 
 
425
 
426
- | Test | What We Do | What Should Happen |
427
- |------|-----------|-------------------|
428
- | Remove table header | Delete the column labels from a table | Claims from that table should become "Incomplete" |
429
- | Swap unit | Change "fM" to "µM" | The extracted value should change accordingly |
430
- | Flip negation | "significant" → "not significant" | Epistemic tag should change from Fact to null result |
431
- | Remove figure | Delete a figure and its caption | Claims that relied on that figure should lose confidence |
432
- | Change journal | Pretend the paper was published in a lower-tier journal | Journal tier weight should decrease |
433
 
434
- ### Blindspot T-3: No Paper-Level Evaluation Splits 🔴
435
 
436
- **What's wrong**: The train/test split in the dataset is random: individual examples are shuffled. This means the same paper's claims might appear in both train and test. The model could memorize "when I see biosensor + LOD, output Fact with 0.85 confidence" instead of actually understanding.
437
 
438
- **The fix**: Split by PAPER, not by example. All claims from Paper A go in training. All claims from Paper B go in testing. This way, the test set truly measures: "Can the model handle a paper it's never seen?"
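
- A minimal sketch using scikit-learn's `GroupShuffleSplit`, assuming each example records which paper it came from (the `paper_id` field name is hypothetical):

- ```python
- # Paper-level split with scikit-learn; assumes each example has a 'paper_id' field.
- from sklearn.model_selection import GroupShuffleSplit
- def split_by_paper(examples, test_size=0.2, seed=42):
-     groups = [ex["paper_id"] for ex in examples]
-     splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
-     train_idx, test_idx = next(splitter.split(examples, groups=groups))
-     train = [examples[i] for i in train_idx]
-     test = [examples[i] for i in test_idx]
-     return train, test   # no paper ever appears on both sides
- ```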
439
-
440
- Better yet, also split by:
441
- - **Lab** — all papers from one research group in holdout
442
- - **Journal** — all papers from one journal in holdout
443
- - **Year** — all papers from 2025+ in holdout
444
- - **Field** — all materials science papers in holdout
445
 
446
- ### Blindspot T-4: The Regression Gate Doesn't Actually Gate Anything 🟡
447
 
448
- **What's wrong**: The `run_regression_gate()` function checks metrics and returns pass/fail, but nothing in the system actually prevents a failing model from being deployed. There's no CI/CD pipeline that blocks a model update when the regression gate fails.
449
 
450
- ### Blindspot T-5: No Stochastic Testing 🟡
451
 
452
- **What's wrong**: LLMs are not deterministic: ask the same question twice and you might get different answers. The evaluation runs once and reports a single number. But that number could be 72% one run and 68% the next.
453
 
454
- **The fix**: Run every evaluation 5 times. Report the mean AND the range. If the range is too wide (more than 5% spread), that's a problem: the model is unreliable.
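
- A minimal sketch of that repeat-and-report loop; `run_evaluation` stands in for whatever function currently produces the accuracy number:

- ```python
- # Repeat the evaluation and report mean plus spread.
- # 'run_evaluation' stands in for whatever function produces the accuracy number.
- def stability_check(run_evaluation, n_runs=5, max_spread=0.05):
-     scores = [run_evaluation() for _ in range(n_runs)]
-     mean = sum(scores) / n_runs
-     spread = max(scores) - min(scores)
-     if spread > max_spread:
-         print(f"WARNING: results vary by {spread:.1%} across runs; model is unstable")
-     return mean, spread
- ```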
 
 
 
 
455
 
456
  ---
457
 
458
- ## 8. Training — Teaching the AI
459
-
460
- ### Blindspot TR-1: Only SFT Is Implemented — No DPO, GRPO, or ConfTuner 🔴
461
-
462
- **What's wrong**: The design describes a 4-stage training pipeline:
463
- 1. **SFT** (Supervised Fine-Tuning): "Here's the right answer, learn it" ← Only this is built
464
- 2. **DPO** (Direct Preference Optimization): "This answer is better than that one" ← Not built
465
- 3. **GRPO** (Group Relative Policy Optimization): "Here's a reward for good JSON, good tags, and preserved qualifiers" ← Not built
466
- 4. **ConfTuner**: "Make your confidence scores match reality" ← Not built
467
-
468
- **Why it matters**: SFT alone gets you to maybe 70% quality. DPO pushes it to 80%. GRPO (with the custom reward functions) pushes it to 85-90%. ConfTuner ensures the confidence numbers are meaningful. Stopping at Stage 1 is like running a marathon but stopping at mile 7.
469
 
470
- ### Blindspot TR-2: ZeroGPU Training Space Has Fundamental Limitations 🟠
471
 
472
- **What's wrong**: The training Space (`phd-research-os-train`) uses ZeroGPU with 60-second time limits per call. It trains in micro-batches of ~20 steps. This means:
473
- - The learning rate scheduler resets every micro-batch (cosine schedule restarts)
474
- - The model is loaded from scratch every call (no persistent GPU state)
475
- - Total training time is limited by how many times you click "Train"
476
 
477
- **Why it matters**: Good fine-tuning needs continuous training with a smooth learning rate schedule. Chopping it into 20-step micro-batches is like studying for an exam in 2-minute bursts with a nap between each burst — you lose context and momentum.
478
 
479
- **The fix**: Use a proper training job (like HF Jobs) with a single continuous run. 3 epochs on 2,000 examples with Qwen2.5-3B takes about 1-2 hours on a single GPU. That's one uninterrupted training run, not 100 button clicks.
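
- A minimal sketch of what that single run could look like with the `transformers` library. The hyperparameters shown are illustrative defaults, not tuned values:

- ```python
- # One continuous run with a single cosine schedule (hyperparameters illustrative).
- from transformers import TrainingArguments
- args = TrainingArguments(
-     output_dir="research-os-sft",
-     num_train_epochs=3,                  # one continuous pass over the dataset
-     per_device_train_batch_size=4,
-     gradient_accumulation_steps=4,
-     learning_rate=2e-4,
-     lr_scheduler_type="cosine",          # one smooth schedule, never restarted
-     warmup_ratio=0.03,
-     logging_steps=10,
-     save_strategy="epoch",
- )
- # Hand these to a single Trainer/SFTTrainer call and let it run to completion,
- # instead of chopping training into 20-step micro-batches.
- ```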
480
 
481
- ### Blindspot TR-3: No Curriculum Learning 🟡
482
 
483
- **What's wrong**: All training examples are shuffled randomly. The model sees hard cases (complex multi-claim extractions with conflicts) at the same time as easy cases (single fact with clear evidence).
484
 
485
- **Why it matters**: Humans learn better when material is presented from easy to hard. First learn to identify a single Fact. Then learn to distinguish Fact from Interpretation. Then handle multi-claim papers. Then handle contradictions. This is called curriculum learning, and it leads to more stable training and better handling of rare, difficult cases.
486
 
487
- ### Blindspot TR-4: No Evaluation During Training 🟡
488
 
489
- **What's wrong**: The `train.py` script does eval every 50 steps (reporting eval_loss), but eval_loss doesn't tell you if the model is actually getting better at the TASK. A model might have low eval_loss but still produce invalid JSON, drop qualifiers, or misclassify epistemic tags.
490
 
491
- **The fix**: During training, periodically run the model on a small "canary" set of 10 examples and check:
492
- - Is the JSON valid?
493
- - Are the epistemic tags correct?
494
- - Are qualifiers preserved?
495
- - Are the confidence scores reasonable?
496
 
497
- This is called task-specific evaluation, and it's much more informative than just watching the loss number go down.
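
- A minimal sketch of such a canary check (the field names and the helper `generate` function are hypothetical):

- ```python
- # Canary check run every few hundred training steps (names are hypothetical).
- import json
- ALLOWED_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}
- def canary_report(generate, canaries):
-     """canaries: list of dicts with 'prompt', 'gold_tag', and 'qualifiers' fields."""
-     valid_json = correct_tag = kept_qualifiers = 0
-     for ex in canaries:
-         raw = generate(ex["prompt"])
-         try:
-             out = json.loads(raw)
-         except json.JSONDecodeError:
-             continue                       # invalid JSON counts against valid_json
-         valid_json += 1
-         if out.get("epistemic_tag") == ex["gold_tag"]:
-             correct_tag += 1
-         if all(q in raw for q in ex["qualifiers"]):
-             kept_qualifiers += 1
-     n = len(canaries)
-     return {"valid_json": valid_json / n, "correct_tag": correct_tag / n,
-             "kept_qualifiers": kept_qualifiers / n}
- ```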
498
 
499
- ---
500
 
501
- ## 9. Working Together — Multi-Model Problems
502
 
503
- ### Blindspot W-1: The Council Has No Real Disagreement Mechanism 🟠
504
 
505
- **What's wrong**: The design describes a multi-round council (Extractor → Critic → Chairman) where members debate. But in the code, the council is sequential: one call to the Extractor, one call to the Critic, one call to the Chairman. There's no actual back-and-forth.
506
 
507
- **Why it matters**: The whole point of the council is that disagreement reveals uncertainty. If the Extractor says "Fact" and the Critic says "Actually, this is an Interpretation because the author used the word 'suggests'," that disagreement is incredibly valuable information. But in a sequential pipeline, the Chairman just picks a winner — there's no real debate.
508
 
509
- ### Blindspot W-2: No Router Exists 🟠
510
 
511
- **What's wrong**: The system design describes a "Dynamic Task Router" that decides which model should handle each piece of content. The code has no router. Everything goes to the same model (or the mock extractor).
512
 
513
- **The fix**: Build a simple classifier that looks at each content chunk and decides:
514
- - Regular text → primary extraction model
515
- - Table → specialized table parser
516
- - Figure → VLM or plot digitizer
517
- - Equation → LaTeX parser
518
- - Garbage/low quality → skip with "Unextractable" flag
519
- - Out of domain → flag for human review
520
 
521
- ### Blindspot W-3: No "I Don't Know" Option 🟠
522
 
523
- **What's wrong**: The system always produces an answer. There's no mechanism for any component to say "I can't handle this; pass it to something else or ask a human."
524
 
525
- **Why it matters**: The worst possible failure is confident garbage. A system that says "I don't know" is infinitely better than a system that says "I'm 85% confident" when it's actually guessing.
526
 
527
- **What "I don't know" looks like in practice**:
528
- - Parser confidence < 0.4 → "I couldn't read this page clearly. Don't trust my extraction."
529
- - Model uncertainty is high → "I found what might be a claim, but I'm not sure if it's a Fact or Interpretation."
530
- - Content is out of domain → "This doesn't look like the kind of science I was trained on."
531
- - Input is non-English → "This appears to be in a language I can't process reliably."
532
 
533
  ---
534
 
535
- ## 10. Staying Honest Over Time
536
 
537
- ### Blindspot H-1: No Drift Detection 🟠
 
 
538
 
539
- **What's wrong**: If the model's performance gradually degrades (because it encounters papers from a new sub-field, or because the scientific vocabulary evolves), there's no alarm system.
540
 
541
- **The fix**: Every week, run the model on the gold standard test set. If any metric drops by more than 5% compared to the previous week, send an alert. This is like a regular health check-up — catch problems before they become emergencies.
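
- A minimal sketch of that weekly comparison (the metric names are examples, not the system's actual metric keys):

- ```python
- # Weekly drift check against last week's numbers (metric names are examples).
- def drift_alert(this_week: dict, last_week: dict, tolerance=0.05):
-     alerts = []
-     for metric, value in this_week.items():
-         previous = last_week.get(metric)
-         if previous is not None and previous - value > tolerance:
-             alerts.append(f"{metric} dropped from {previous:.2f} to {value:.2f}")
-     return alerts
- print(drift_alert({"tag_accuracy": 0.78}, {"tag_accuracy": 0.85}))
- # ['tag_accuracy dropped from 0.85 to 0.78']
- ```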
542
 
543
- ### Blindspot H-2: No Human Feedback Integration 🟠
544
 
545
- **What's wrong**: When a human uses the UI and corrects a claim (changes a tag from Fact to Interpretation, adds a missing qualifier), that correction is not stored in a way that can be used for future training.
546
 
547
- **The fix**: Every human correction becomes a high-value training example. Over time, these corrections create a growing gold dataset tailored to YOUR specific research domain. This is the cheapest way to get high-quality labeled data — your daily work generates it automatically.
548
 
549
- ### Blindspot H-3: No Versioning of Taxonomy Changes 🟡
550
 
551
- **What's wrong**: The taxonomy (which study types exist and what weights they get) can change over time. But old claims don't get re-scored when the taxonomy changes. A claim scored under the old taxonomy has a different meaning than a claim scored under the new one.
552
 
553
- ### Blindspot H-4: No Recovery Plan for Database Corruption 🟡
554
 
555
- **What's wrong**: The system uses SQLite, which stores everything in a single file. If that file gets corrupted (power failure during a write, disk error, accidental deletion), everything is lost. WAL mode is enabled, which helps the database survive a crash mid-write, but there is no backup strategy and no periodic snapshots.
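
- A minimal fix sketch using the `sqlite3` module's built-in backup API; the file and folder names are placeholders, not the project's real paths:

- ```python
- # Periodic snapshot using sqlite3's built-in backup API (file names are placeholders).
- import sqlite3, datetime, pathlib
- def snapshot(db_path="research_os.db", backup_dir="backups"):
-     pathlib.Path(backup_dir).mkdir(exist_ok=True)
-     stamp = datetime.date.today().isoformat()
-     src = sqlite3.connect(db_path)
-     dst = sqlite3.connect(f"{backup_dir}/research_os_{stamp}.db")
-     src.backup(dst)          # safe copy of the whole database, even while in use
-     dst.close()
-     src.close()
- ```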
556
 
557
  ---
558
 
559
- ## 11. The Human Side
560
 
561
- ### Blindspot HU-1: Information Overload 🟠
562
-
563
- **What's wrong**: If you ingest 100 papers, you might get 3,000+ claims. The UI shows them all equally. There's no way to say "show me only the most important 20 claims" or "show me only the contradictions."
564
 
565
- **The fix**: Priority-ranked views:
566
- - **Dashboard**: Top 10 most surprising findings, Top 5 contradictions, Top 3 gaps
567
- - **Deep dive**: Filterable by section, tag, confidence, paper, topic
568
- - **Review queue**: Only items that need human attention (conflicts, low-confidence, out-of-domain)
569
 
570
- ### Blindspot HU-2: No Explanation of Scores 🟡
571
 
572
- **What's wrong**: The UI shows "Confidence: 0.723" but doesn't explain how that number was calculated. A researcher can't tell if 0.723 is good or bad, or why it's 0.723 and not 0.85.
573
 
574
- **The fix**: Show the breakdown:
575
- - Evidence strength: 0.85 (strong direct measurement)
576
- - Study quality: ×1.0 (primary experimental)
577
- - Journal tier: ×0.85 (Tier 2 journal)
578
- - Section: ×1.0 (Results)
579
- - Completeness: ×1.0 (all fields present)
580
- - = 0.723
581
 
582
- ### Blindspot HU-3: Automation Bias Risk 🟡
583
 
584
- **What's wrong**: Once a researcher trusts the system, they might stop checking its work. If the system says "Fact, confidence 0.92," the researcher might accept it without reading the original paper.
585
 
586
- **The fix**: Built-in friction:
587
- - Randomly hide the confidence score for 10% of claims and ask the researcher to rate it themselves
588
- - Periodically show a "quiz" — "The system classified this as a Fact. Do you agree?"
589
- - Track how often the researcher overrides the system — if overrides drop to zero, they might be over-trusting
590
 
591
  ---
592
 
593
- ## 12. Security and Safety
594
 
595
- ### Blindspot SEC-1: No Input Validation 🟠
 
 
596
 
597
- **What's wrong**: The parser accepts any file and tries to process it. There's no check for:
598
- - Malicious PDFs (they exist — PDFs can contain JavaScript and exploits)
599
- - Extremely large files (a 500MB PDF would crash the system)
600
- - Non-PDF files disguised as PDFs
601
- - Files with embedded malware
602
 
603
- ### Blindspot SEC-2: SQL Injection Risk 🟡
604
 
605
- **What's wrong**: While most database queries use parameterized queries (safe), the `get_stats()` function in `database.py` uses f-strings to construct table names: `f"SELECT COUNT(*) FROM {table}"`. If an attacker could control the table name, they could inject SQL.
606
 
607
- *Note*: This is low risk because the table names are hardcoded in the function, not from user input. But it's a bad pattern that could become dangerous if the code is modified later.
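
- The standard fix is a whitelist check, because table names cannot be passed as SQL parameters. A minimal sketch (the table names listed are placeholders, not the real schema):

- ```python
- # Whitelist the table name before formatting it into the query.
- # The table names listed here are placeholders, not the real schema.
- ALLOWED_TABLES = frozenset({"papers", "claims", "canonical_claims"})
- def count_rows(conn, table: str) -> int:
-     if table not in ALLOWED_TABLES:
-         raise ValueError(f"Refusing to query unexpected table: {table!r}")
-     return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
- ```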
608
 
609
- ### Blindspot SEC-3: No Rate Limiting on API Calls 🟡
610
 
611
- **What's wrong**: If the AI brain is connected to an external API (Claude, GPT), there's no limit on how many calls can be made. Processing 1,000 papers could generate thousands of API calls costing hundreds of dollars.
612
 
613
  ---
614
 
615
- ## 13. What To Build First
 
616
 
617
- Based on impact and dependencies, here's the build order:
618
 
619
- ### Phase 1: Make It Work (Weeks 1-4)
620
- 1. **Integrate Marker parser** — Everything else depends on good PDF reading
621
- 2. **Add embedding model** — Needed for smart deduplication and conflict detection
622
- 3. **Create gold standard test set** — 10 expert-labeled papers. Can't improve what you can't measure
623
- 4. **Connect the real AI model** — Stop using mock extraction
 
624
 
625
- ### Phase 2: Make It Reliable (Weeks 5-8)
626
- 5. **Add constrained decoding** — Guarantee valid JSON output
627
- 6. **Build proper training pipeline** — One continuous training job, not micro-batches
628
- 7. **Implement weighted qualifier penalties** — Not all hedges are equal
629
- 8. **Add out-of-distribution detection** — Know when to say "I don't know"
630
 
631
- ### Phase 3: Make It Smart (Weeks 9-16)
632
- 9. **Expand training data to 10K+** — With real paper examples and hard negatives
633
- 10. **Implement DPO training** — Learn from preference pairs
634
- 11. **Implement GRPO training** — Custom reward functions for epistemic correctness
635
- 12. **Add multi-dimensional calibration** — 6 separate Brier scores
636
 
637
- ### Phase 4: Make It Trustworthy (Weeks 17-24)
638
- 13. **Build counterfactual evaluation suite** — Test for the right reasons
639
- 14. **Add paper-level evaluation splits** — Honest accuracy numbers
640
- 15. **Implement human feedback loop** — Corrections become training data
641
- 16. **Add drift detection and monitoring** — Catch problems early
 
 
 
642
 
643
- ### Phase 5: Make It Scale (Weeks 25-32)
644
- 17. **Add figure/table specialist models** — Handle 100% of a paper, not just text
645
- 18. **Build the router** — Right model for right content
646
- 19. **Implement supplement handling** — Paper bundles
647
- 20. **Add language detection** — Handle non-English papers safely
648
 
649
- ---
 
 
 
 
650
 
651
- ## Summary of All Blindspots
652
-
653
- | ID | Category | Severity | Description |
654
- |----|----------|----------|-------------|
655
- | D-1 | Data | 🔴 Critical | Training data is synthetic, not from real papers |
656
- | D-2 | Data | 🔴 Critical | No hard negative examples |
657
- | D-3 | Data | 🟠 High | No error taxonomy in training |
658
- | D-4 | Data | 🟠 High | Teacher distribution signals not preserved |
659
- | D-5 | Data | 🟡 Medium | No counterfactual training examples |
660
- | P-1 | Parser | 🔴 Critical | No ML-based parser (Marker) integrated |
661
- | P-2 | Parser | 🟠 High | Section detection is fragile |
662
- | P-3 | Parser | 🔴 Critical | Tables extracted as flat text |
663
- | P-4 | Parser | 🟠 High | Figures completely ignored |
664
- | P-5 | Parser | 🟡 Medium | Cross-references detected but not verified |
665
- | P-6 | Parser | 🟠 High | No supplement handling |
666
- | P-7 | Parser | 🟡 Medium | No language detection |
667
- | B-1 | Brain | 🔴 Critical | Model too small (3B) |
668
- | B-2 | Brain | 🟠 High | No specialist heads |
669
- | B-3 | Brain | 🟠 High | No domain pre-training |
670
- | B-4 | Brain | 🔴 Critical | Mock extraction is the default |
671
- | B-5 | Brain | 🟠 High | No constrained decoding |
672
- | B-6 | Brain | 🟠 High | No out-of-distribution detection |
673
- | M-1 | Memory | 🔴 Critical | Dedup uses word overlap, not semantic matching |
674
- | M-2 | Memory | 🟠 High | No embedding model connected |
675
- | M-3 | Memory | 🟠 High | Conflict detection uses word overlap |
676
- | M-4 | Memory | 🟡 Medium | No temporal reasoning |
677
- | M-5 | Memory | 🟡 Medium | Gap analysis only same-type nodes |
678
- | S-1 | Scoring | 🟠 High | Qualifier penalty too simple |
679
- | S-2 | Scoring | 🟠 High | No multi-dimensional calibration |
680
- | S-3 | Scoring | 🟡 Medium | Source count bonus missing from code |
681
- | S-4 | Scoring | 🟡 Medium | Conflict penalty missing from code |
682
- | S-5 | Scoring | 🟡 Medium | Composite is just an average |
683
- | T-1 | Testing | 🔴 Critical | No gold standard test set |
684
- | T-2 | Testing | 🟠 High | No counterfactual robustness tests |
685
- | T-3 | Testing | 🔴 Critical | No paper-level evaluation splits |
686
- | T-4 | Testing | 🟡 Medium | Regression gate doesn't block deployment |
687
- | T-5 | Testing | 🟡 Medium | No stochastic testing |
688
- | TR-1 | Training | 🔴 Critical | Only SFT built, not DPO/GRPO/ConfTuner |
689
- | TR-2 | Training | 🟠 High | ZeroGPU micro-batch training is suboptimal |
690
- | TR-3 | Training | 🟡 Medium | No curriculum learning |
691
- | TR-4 | Training | 🟡 Medium | No task-specific eval during training |
692
- | W-1 | Multi-Model | 🟠 High | Council has no real disagreement mechanism |
693
- | W-2 | Multi-Model | 🟠 High | No router exists |
694
- | W-3 | Multi-Model | 🟠 High | No "I don't know" option |
695
- | H-1 | Longevity | 🟠 High | No drift detection |
696
- | H-2 | Longevity | 🟠 High | No human feedback integration |
697
- | H-3 | Longevity | 🟡 Medium | No taxonomy versioning recovery |
698
- | H-4 | Longevity | 🟡 Medium | No database backup strategy |
699
- | HU-1 | Human | 🟠 High | Information overload in UI |
700
- | HU-2 | Human | 🟡 Medium | No score explanation |
701
- | HU-3 | Human | 🟡 Medium | Automation bias risk |
702
- | SEC-1 | Security | 🟠 High | No input validation |
703
- | SEC-2 | Security | 🟡 Medium | SQL injection pattern |
704
- | SEC-3 | Security | 🟡 Medium | No API rate limiting |
705
-
706
- **Total: 50 blindspots** (10 Critical 🔴, 22 High 🟠, 18 Medium 🟡)
707
 
708
  ---
709
 
710
- *This document will be updated iteratively. Each pass adds new blindspots discovered by re-reading with fresh eyes.*
 
1
  # PhD Research OS — Future Improvements Roadmap
2
+ ## Everything That Needs to Be Fixed, and How
3
 
4
  **Written so a high school student can understand every word.**
5
 
6
+ **Version**: 2.0 (Final Compiled Edition)
7
  **Date**: 2026-04-23
8
+ **Status**: Comprehensive — compiled from 87 original audit findings, 12 architectural improvements, and iterative code review
9
+ **Total blindspots catalogued**: 78 (across 14 categories)
10
 
11
  ---
12
 
13
  ## What Is This Document?
14
 
15
+ Imagine you're building a robot that reads science papers and tells you what they found. Right now, we have the blueprints and some of the parts built. This document lists every problem we've found — from looking at the actual code, not just the plans — and explains how to fix each one.
16
 
17
+ **Who is this for?** Anyone who wants to understand, contribute to, or learn from this project. Every explanation uses everyday language. If you've ever written a school report, you already have the intuition for most of these problems.
18
+
19
+ **How was this made?** We read every file in the repository (60+ files), compared the actual code against the design document, re-read our findings multiple times looking for things we missed, and compiled everything into this single document.
20
 
21
  ---
22
 
23
  ## Table of Contents
24
 
25
+ 1. [The Big Picture](#1-the-big-picture)
26
+ 2. [The Core Philosophy Problem](#2-the-core-philosophy-problem)
27
+ 3. [Data Problems — Your Training Material Is Weak](#3-data-problems)
28
+ 4. [Reading Papers — The PDF Nightmare](#4-reading-papers)
29
+ 5. [The Brain — Your AI Is Underpowered](#5-the-brain)
30
+ 6. [Memory — How the System Remembers](#6-memory)
31
+ 7. [Scoring — Can You Trust the Numbers?](#7-scoring)
32
+ 8. [Testing — How Do You Know It Works?](#8-testing)
33
+ 9. [Training — Teaching the AI Properly](#9-training)
34
+ 10. [Teamwork — Multiple Models Working Together](#10-teamwork)
35
+ 11. [Staying Honest Over Time](#11-staying-honest)
36
+ 12. [The Human Experience](#12-the-human-experience)
37
+ 13. [Security and Safety](#13-security-and-safety)
38
+ 14. [The Companion Agent System](#14-companion-agents)
39
+ 15. [What To Build First](#15-what-to-build-first)
40
+ 16. [Master Blindspot Table](#16-master-table)
41
 
42
  ---
43
 
44
+ ## 1. The Big Picture
45
+
46
+ ### What does this system do?
47
+
48
+ It reads science papers and does 5 things:
49
+ 1. **Extracts** key findings ("This test detected cancer at 0.8 fM")
50
+ 2. **Labels** each finding (Is it a proven fact? An educated guess? A theory?)
51
+ 3. **Scores** how much you should trust each finding (using math formulas, not AI guesses)
52
+ 4. **Compares** findings across papers and spots contradictions
53
+ 5. **Exports** everything to tools researchers already use (Obsidian notes, CSV files)
54
+
55
+ ### What's built vs. what's planned?
56
+
57
+ | Part | Status | Plain English |
58
+ |------|--------|---------------|
59
+ | Database | ✅ Works | The filing cabinet is built |
60
+ | PDF reader | ⚠️ Basic | Can read papers but misses tables, figures, and equations |
61
+ | Claim extractor | ⚠️ Uses fake AI | Has a pretend brain that guesses using word patterns |
62
+ | Deduplicator | ⚠️ Primitive | Checks if two sentences share the same words (misses paraphrases) |
63
+ | Knowledge graph | ⚠️ Basic | Can store connections but can't find subtle contradictions |
64
+ | Scoring engine | ✅ Works | Math formulas work correctly |
65
+ | Evaluation | ⚠️ Incomplete | Counts claims but never checks if they're RIGHT |
66
+ | Obsidian export | ✅ Works | Can create notes for each finding |
67
+ | AI agent helpers | ✅ Framework works | The assistant robots exist but have no real brains yet |
68
+ | Training data | ⚠️ Too small | 1,900 fake examples (need 10,000+ real ones) |
69
+ | Trained model | ⚠️ Too small | Using a 3-billion parameter model (plan says 8-27 billion) |
70
 
71
+ ---
 
 
72
 
73
+ ## 2. The Core Philosophy Problem
 
 
 
 
 
 
 
 
 
 
 
 
74
 
75
+ ### Blindspot CORE-1: The System Is Model-Centered, Not Evidence-Centered 🔴
76
 
77
+ **This is the single most important problem in the entire project.**
78
 
79
+ Right now, the system works like this:
80
+ 1. Give text to AI
81
+ 2. AI writes a summary of what it found
82
+ 3. Hope the summary is correct
83
 
84
+ It should work like this:
85
+ 1. AI scans the text and points to specific sentences
86
+ 2. Each pointed-to sentence gets a label (Fact / Interpretation / Hypothesis)
87
+ 3. You can click on any finding and see the EXACT sentence in the original paper
88
 
89
+ **Analogy**: Imagine writing a school essay. Model-centered is like saying "Shakespeare wrote about love and death" with no page references. Evidence-centered is like saying "In Act 3, Scene 1, line 88, Romeo says '...' which shows the theme of love." Teachers always want the second kind — because they can CHECK it.
90
 
91
+ **What changes in the code**: Every claim in the database needs to be a POINTER to a specific text span (page number, character position, exact quote). The AI's job is to FIND important spans, not to WRITE about what it found. The claim text should be copied directly from the paper, not generated by the AI.
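
+ A minimal sketch of what such a pointer record could look like (the field names are illustrative, not the current database schema):

+ ```python
+ # A claim stored as a pointer into the source document (field names illustrative).
+ from dataclasses import dataclass
+ @dataclass
+ class EvidenceSpan:
+     paper_id: str
+     page: int
+     char_start: int        # offset into the parsed text of that page
+     char_end: int
+     exact_quote: str       # copied verbatim from the paper, never paraphrased
+     section: str           # e.g. "Results"
+     epistemic_tag: str     # "Fact", "Interpretation", "Hypothesis", or "Conflict_Hypothesis"
+ # A checker can re-slice the parsed page text with char_start:char_end and verify
+ # it matches exact_quote, so hallucinated claims are caught mechanically.
+ ```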
92
 
93
+ ### Blindspot CORE-2: Design-Code Gap Is Enormous 🔴
94
 
95
+ **What's wrong**: The SYSTEM_DESIGN.md describes a sophisticated 7-layer system with ML-based parsing, constrained decoding, multi-round council debates, embedding-based deduplication, and a 4-stage training pipeline. The actual code has basic text extraction, pattern-matching classification, word-overlap deduplication, and single-stage training.
96
 
97
+ **Why it matters**: Someone reading the design document would think this is a near-complete system. Someone reading the code would see it's a prototype. This gap can mislead contributors, reviewers, and users.
98
 
99
+ **The fix**: The README and documentation should clearly mark which features are "✅ Implemented," "🚧 Skeleton exists," and "📋 Designed only." Honesty about the current state is itself an epistemic virtue, and this project is all about epistemic honesty.
100
 
101
+ ### Blindspot CORE-3: No End-to-End Pipeline Has Ever Run Successfully 🔴
102
 
103
+ **What's wrong**: There's no evidence (in tests, logs, or documentation) that anyone has ever: taken a real PDF → parsed it → extracted claims → deduplicated them → stored them in the graph → scored them → exported to Obsidian. Each layer has been tested in isolation, but the full pipeline has never been verified end-to-end.
104
 
105
+ **Why it matters**: Individual parts working doesn't mean the whole system works. A car where the engine works, the wheels spin, and the steering turns is still broken if they're not connected properly.
106
 
107
+ **The fix**: Create ONE end-to-end test. Take a real paper. Run it through every layer. Check the output. This single test is worth more than 100 unit tests for building confidence that the system actually does what it claims.
108
 
109
+ ---
110
 
111
+ ## 3. Data Problems — Your Training Material Is Weak
 
 
 
 
112
 
113
+ ### Blindspot D-1: Training Data Is All Fake 🔴
114
 
115
+ All 1,900 training examples were generated by a Python script using templates. Real papers don't write like templates. The AI is learning to process fake text, not real science.
116
 
117
+ **Fix**: Get 100 real papers, have a human expert label them, use those as training data.
118
 
119
+ ### Blindspot D-2: No "Wrong Answer" Examples 🔴
120
 
121
+ The AI only sees correct examples. It never sees examples of ALMOST correct but subtly wrong outputs. It's like studying for a test using only the answer key — you won't recognize trick questions.
122
 
123
+ **Fix**: Run the current model, collect its mistakes, and use those as "here's what NOT to do" training examples (hard-negative mining).
 
 
 
 
 
 
 
 
 
 
124
 
125
+ ### Blindspot D-3: No Error Categories 🟠
126
 
127
+ When the model makes a mistake, we don't know WHY. Was it a dropped unit? A missed qualifier? A wrong section? Without categorizing errors, we can't target specific weaknesses.
128
 
129
+ **The 9 error types to track**:
130
+ 1. **Qualifier loss** — "may reduce" becomes "reduces"
131
+ 2. **Unit drop** — "0.8 fM" becomes "0.8"
132
+ 3. **Negation flip** — "did NOT work" becomes "worked"
133
+ 4. **Section confusion** — Abstract claim scored as Results
134
+ 5. **Number hallucination** — Paper says 0.8, AI says 0.6
135
+ 6. **Granularity mismatch** — One table row treated as a paper-level conclusion
136
+ 7. **Citation theft** — A finding from a cited paper treated as this paper's finding
137
+ 8. **Causal overclaim** — "correlated with" becomes "caused by"
138
+ 9. **Statistics loss** — p-value or sample size dropped
139
 
140
+ ### Blindspot D-4: Teacher Disagreement Not Preserved 🟠
141
 
142
+ The system design talks about running 5 AI models on the same paper and keeping ALL answers. The current pipeline just generates one answer. The disagreement between multiple models IS the most valuable training signal — it teaches the student model about uncertainty.
 
 
 
 
 
143
 
144
+ ### Blindspot D-5: No Counterfactual Examples 🟡
145
 
146
+ We never create "mirror" training examples where one key word is changed (add "not", swap a unit) to test if the model notices. Without these, we can't tell if the model is actually reading or just pattern-matching.
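
A tiny sketch of how mirror examples could be generated; the word substitutions and expected-behavior notes are illustrative only:

```python
def make_counterfactuals(sentence: str) -> list[tuple[str, str]]:
    """Return (mirrored sentence, what the correct output should do) pairs."""
    variants = []
    if " significant " in sentence:
        variants.append((sentence.replace(" significant ", " not significant "),
                         "the tag or confidence must change"))
    if " increased " in sentence:
        variants.append((sentence.replace(" increased ", " decreased "),
                         "the extracted direction must flip"))
    return variants


print(make_counterfactuals("The treatment produced a significant increase in yield."))
```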
147
 
148
+ ### Blindspot D-6: Training Data Only Covers 5 Domains 🟡
149
 
150
+ The data covers biosensors, materials science, electrochemistry, computational biology, and quantum computing. Feed it a paper about ecology, psychology, or economics and it will hallucinate confidently. The data needs to be broader, or the system needs to refuse out-of-domain papers.
151
 
152
  ---
153
 
154
+ ## 4. Reading Papers — The PDF Nightmare
155
 
156
+ ### Blindspot P-1: No Smart PDF Parser 🔴
157
 
158
+ The code uses basic text scrapers (PyMuPDF, pdfplumber) that destroy document structure. The design says to use Marker (ML-based), but it's not connected. This is the #1 dependency: everything downstream depends on good parsing.
159
 
160
+ ### Blindspot P-2: Tables Become Unreadable 🔴
161
 
162
+ Tables are extracted as flat pipe-separated text. The relationship between column headers and cell values is lost. "0.8" without knowing it's the "LOD" column is meaningless.
163
 
164
+ **Fix**: Store tables as structured data with headers and rows preserved.
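
A sketch of what a structured table record could look like; the format is an assumption, not the current parser output:

```python
# One table stored with its structure intact, so "0.8" can always be traced
# back to its column ("LOD (fM)") and row ("Sensor A").
table = {
    "caption": "Table 2. Analytical performance of the sensors.",
    "headers": ["Sensor", "LOD (fM)", "Linear range (fM)"],
    "rows": [
        ["Sensor A", "0.8", "1-1000"],
        ["Sensor B", "2.5", "10-5000"],
    ],
}


def cell(table: dict, row_label: str, column: str) -> str:
    """Look up a value by row label and column header."""
    col_idx = table["headers"].index(column)
    for row in table["rows"]:
        if row[0] == row_label:
            return row[col_idx]
    raise KeyError(f"{row_label} not found in table")


print(cell(table, "Sensor A", "LOD (fM)"))  # -> "0.8", and we know it's an LOD
```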
 
 
 
165
 
166
+ ### Blindspot P-3: Figures Are Completely Skipped 🟠
167
 
168
+ The parser detects images and stores "[Image detected requires VLM processing]". No analysis happens. 30-40% of key evidence in science papers is in figures.
 
 
 
 
 
169
 
170
+ ### Blindspot P-4: Section Detection Is Brittle 🟠
171
 
172
+ Sections are detected by looking for the word "Abstract" or "Results" at the start of a line. Papers that number sections ("2.1 Experimental Setup"), use non-standard names, or combine sections ("Results and Discussion") are misidentified. Since section identity directly affects confidence scores, this causes systematic scoring errors.
173
 
174
+ ### Blindspot P-5: Equations Become Garbage 🟠
175
 
176
+ Mathematical equations are not handled. LaTeX, inline math, and special symbols are either garbled or dropped. For papers in physics, chemistry, or engineering, equations ARE the findings.
177
 
178
+ ### Blindspot P-6: Supplements Can't Be Linked to Main Papers 🟠
179
 
180
+ The parser processes one file at a time. There's no way to say "this supplement belongs to that main paper." In biology and chemistry, the best data is often in supplements.
181
 
182
+ ### Blindspot P-7: No Language Detection 🟡
 
 
 
 
 
 
 
 
 
183
 
184
+ If someone uploads a Chinese or Japanese paper, the system produces English-looking garbage with high confidence. It should detect the language and refuse or flag reduced confidence.
185
 
186
+ ### Blindspot P-8: Cross-References Are Detected But Never Verified 🟡
187
 
188
+ The code finds "see Table 2" in the text but never checks if the parsed Table 2 is actually Table 2. If tables are mislabeled, every claim referencing them has wrong evidence.
189
 
190
+ ### Blindspot P-9: No File Safety Checks 🟠
191
 
192
+ The parser accepts any file with no checks for malicious PDFs (they exist), extremely large files, or corrupted data. A 500MB PDF would crash the system.
 
 
 
 
 
193
 
194
+ ### Blindspot P-10: Chunking Ignores Table/Caption Integrity 🟡
195
 
196
+ The section-aware chunker in `parser.py` merges consecutive body text regions, but tables and their captions can be split across chunks. A table in one chunk and its caption in another means the AI sees numbers without context.
197
 
198
+ ---
199
 
200
+ ## 5. The Brain — Your AI Is Underpowered
201
 
202
+ ### Blindspot B-1: Model Is Too Small 🔴
203
 
204
+ Qwen2.5-3B (current) is like a bright middle schooler: it can follow instructions but doesn't deeply understand scientific reasoning. The design says to upgrade to Qwen3-8B or Qwen3.5-27B. This hasn't happened.
205
 
206
+ ### Blindspot B-2: The Fake Brain Is the Default 🔴
207
 
208
+ When no real AI is connected (which is the default), a `_extract_mock()` function runs that classifies claims by keyword matching: any sentence with "measured" → Fact, any with "may" → Hypothesis. This is wildly inaccurate.
209
 
210
+ Example failures:
211
+ - "We measured nothing significant" → Mock says Fact → Actually a null result
212
+ - "The most reliable technique measured to date may revolutionize the field" → Mock says both Fact AND Hypothesis
213
 
214
+ ### Blindspot B-3: No Output Guarantees 🟠
215
 
216
+ The system asks the AI to output JSON and hopes for the best. Without constrained decoding (Guidance library), the model can produce broken JSON, invalid tags, or mixed text-and-JSON output.
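
Until real constrained decoding is wired in, a stopgap sketch like the one below could at least reject untrustworthy output instead of passing it downstream; the key names are hypothetical:

```python
import json

REQUIRED_KEYS = {"claim_text", "tag", "qualifiers", "confidence"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis"}


def parse_extraction(raw: str) -> dict | None:
    """Return the parsed claim dict, or None if the output can't be trusted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # broken JSON -> retry or flag, don't guess
    if not REQUIRED_KEYS.issubset(data) or data.get("tag") not in VALID_TAGS:
        return None                      # missing fields or an invalid tag
    return data
```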
217
 
218
+ ### Blindspot B-4: One Brain for Six Different Jobs 🟠
219
 
220
+ One model does: claim extraction, qualifier detection, statistical parsing, conflict detection, query decomposition, and decision generation. These tasks want opposite behaviors (extraction wants to catch everything; qualifier detection wants surgical precision).
221
 
222
+ **Fix**: Shared base model with specialized output heads for different tasks.
223
 
224
+ ### Blindspot B-5: No Scientific Vocabulary Training 🟠
225
 
226
+ The base model was trained on general internet text. It doesn't deeply know what "LOD," "analyte," "p-value," or "cyclic voltammetry" mean. It can extract the words but might misclassify their importance.
227
 
228
+ ### Blindspot B-6: Can't Detect "I Don't Know This Field" 🟠
 
 
 
 
 
 
229
 
230
+ No out-of-distribution detection. Feed it a sociology paper and it produces confident-looking scientific claims that are nonsense. The model should flag when content is too different from its training data.
231
 
232
+ ### Blindspot B-7: No Verification of Claims Against Source 🟠
233
 
234
+ The design describes a "chain-of-verification" approach where a second model re-reads the source and checks if the extracted claim is actually supported. This isn't built. The system trusts its first extraction with no double-checking.
235
 
236
+ ---
 
 
 
 
237
 
238
+ ## 6. Memory — How the System Remembers
239
 
240
+ ### Blindspot M-1: Deduplication Uses Word Overlap, Not Meaning 🔴
241
 
242
+ Two claims that say the same thing in different words are treated as different findings. "The LOD was 0.8 fM" and "A detection limit of 0.8 femtomolar was achieved" would NOT be recognized as duplicates because they share few words.
243
 
244
+ **Fix**: Use an embedding model that converts sentences into number vectors. Similar meanings = similar vectors, regardless of word choice.
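
A sketch of meaning-based duplicate detection using sentence-transformers; the model name and threshold are illustrative and would need tuning on real data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, not a recommendation

claim_a = "The LOD was 0.8 fM."
claim_b = "A detection limit of 0.8 femtomolar was achieved."

emb_a, emb_b = model.encode([claim_a, claim_b], convert_to_tensor=True)
similarity = util.cos_sim(emb_a, emb_b).item()

# The two sentences share almost no words, but their embeddings are close,
# so a simple threshold catches the paraphrase.
if similarity > 0.85:  # hypothetical threshold
    print("Likely duplicates:", round(similarity, 2))
```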
245
 
246
+ ### Blindspot M-2: No Embedding Model Exists in the Code 🟠
247
 
248
+ The design describes embedding-based similarity. The code imports no embedding model. No sentence-transformers, no ChromaDB, nothing. The entire deduplication and conflict detection system relies on word counting.
249
 
250
+ ### Blindspot M-3: Conflict Detection Is Limited to 500 Claims 🟠
251
 
252
+ The `find_conflicts()` method only checks the top 500 claims by confidence. If important claim #501 contradicts claim #10, it'll never be found. And it uses the same word-overlap approach, so semantically similar but differently-worded contradictions are invisible.
253
 
254
+ ### Blindspot M-4: No Concept of Time 🟡
 
 
 
255
 
256
+ A finding from 2015 and a finding from 2025 are weighted equally. But the 2025 paper might specifically disprove the 2015 one. The graph needs temporal reasoning.
257
 
258
+ ### Blindspot M-5: Gap Analysis Is Too Narrow 🟡
259
 
260
+ The gap-finding algorithm only looks for missing connections between same-type nodes. But the most interesting gaps are between different types, such as a method that's never been applied to a particular material.
 
 
 
 
261
 
262
+ ### Blindspot M-6: No Retraction Checking 🟠
263
 
264
+ If a paper gets retracted (pulled back because it was wrong or fraudulent), the system doesn't know. Claims from retracted papers sit in the knowledge graph with full confidence.
265
 
266
+ **Fix**: Check CrossRef/Retraction Watch before ingesting a paper. If a paper in the database gets retracted later, propagate that information to all its claims.
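
A sketch of how a retraction could be propagated once detected; the table and column names are hypothetical:

```python
import sqlite3


def mark_paper_retracted(db_path: str, paper_id: str) -> None:
    """Flag a retracted paper and push the flag down to every claim it produced."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("UPDATE papers SET retracted = 1 WHERE paper_id = ?", (paper_id,))
        conn.execute(
            "UPDATE claims SET retraction_flag = 1, confidence = confidence * 0.1 "
            "WHERE paper_id = ?",
            (paper_id,),
        )
```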
267
 
268
+ ### Blindspot M-7: Obsidian Export Doesn't Include Graph Relationships 🟡
269
 
270
+ The Obsidian exporter creates notes for claims, conflicts, and goals, but doesn't export the actual graph edges (supports/refutes/extends). A researcher looking at the vault sees individual findings but can't see how they connect.
271
 
272
  ---
273
 
274
+ ## 7. ScoringCan You Trust the Numbers?
 
 
275
 
276
+ ### Blindspot S-1: All Qualifiers Get the Same Penalty 🟠
277
 
278
+ Every qualifier reduces confidence by exactly 0.1. But "may" (very uncertain) and "under controlled laboratory conditions" (limits scope but is quite certain) are treated identically.
 
 
279
 
280
+ **Fix**: Weight qualifier types differently: strong hedges like "may" get -0.20; scope limiters like "in vitro" get -0.05.
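
A sketch of what weighted penalties could look like; the weights are illustrative, not calibrated:

```python
QUALIFIER_WEIGHTS = {
    "may": 0.20, "might": 0.20, "could": 0.15,              # strong hedges
    "suggests": 0.10, "appears to": 0.10,                   # moderate hedges
    "in vitro": 0.05, "under laboratory conditions": 0.05,  # scope limiters
}


def qualifier_penalty(qualifiers: list[str]) -> float:
    """Sum the per-qualifier penalties, capped at 0.5."""
    total = sum(QUALIFIER_WEIGHTS.get(q.lower(), 0.10) for q in qualifiers)
    return min(total, 0.5)


print(qualifier_penalty(["may"]))        # 0.2  -> very uncertain claim
print(qualifier_penalty(["in vitro"]))   # 0.05 -> limited scope, still solid
```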
281
 
282
+ ### Blindspot S-2: Only One Calibration Number 🟠
283
 
284
+ The system tracks one Brier score for overall calibration. But the model might be perfectly calibrated on "did I find a real claim?" and terribly calibrated on "are these two claims contradictory?" One number hides this.
285
 
286
+ **Fix**: Track 6 separate calibration curves: extraction confidence, qualifier confidence, statistical confidence, conflict confidence, section confidence, provenance confidence.
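
A sketch of per-task calibration tracking with made-up prediction data:

```python
def brier(predictions):
    """predictions = [(predicted probability, actual outcome 0 or 1), ...]; lower is better."""
    return sum((p - y) ** 2 for p, y in predictions) / len(predictions)


per_task = {
    "extraction": [(0.9, 1), (0.8, 1), (0.7, 0)],
    "conflict":   [(0.9, 0), (0.6, 0), (0.8, 1)],
}
# One overall number would hide that "conflict" is far worse calibrated here.
print({task: round(brier(preds), 3) for task, preds in per_task.items()})
```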
287
 
288
+ ### Blindspot S-3: Design Features Missing from Code 🟡
289
 
290
+ The SYSTEM_DESIGN.md describes:
291
+ - Source count bonus (+0.2 for claims backed by multiple papers) → NOT in scorer.py
292
+ - Conflict penalty (-0.3 for claims with active conflicts) → NOT in scorer.py
293
+ - Corroboration bonus → NOT in scorer.py
 
 
294
 
295
+ The scorer only uses single-claim information. It ignores the knowledge graph entirely.
296
 
297
+ ### Blindspot S-4: Average Instead of Minimum for Composite Score 🟡
298
 
299
+ Composite confidence = average of three scores. This means terrific evidence + terrible qualifier strength = mediocre composite. A chain is only as strong as its weakest link — use the minimum, not the average.
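
A tiny illustration with made-up scores:

```python
scores = {"evidence": 0.95, "quality": 0.90, "qualifier_strength": 0.20}

average = sum(scores.values()) / len(scores)  # 0.68 -> looks acceptable
weakest = min(scores.values())                # 0.20 -> tells the truth

print(f"average={average:.2f}, weakest_link={weakest:.2f}")
```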
300
 
301
+ ### Blindspot S-5: Parser Confidence Isn't Actually Meaningful 🟡
302
 
303
+ The `score_parse_quality()` function in parser.py assigns quality scores, but the thresholds are arbitrary. What does "garbled_ratio × 3000" actually correspond to? These numbers were chosen by gut feeling, not calibrated against human judgments of parse quality.
 
 
 
 
 
 
304
 
305
  ---
306
 
307
+ ## 8. TestingHow Do You Know It Works?
308
 
309
+ ### Blindspot T-1: No Gold Standard Test Set 🔴
310
 
311
+ The evaluation counts claims and checks distributions but never compares against human-labeled ground truth. The system could produce 500 completely wrong claims and report "all metrics look normal."
312
 
313
+ **Fix**: 10 expert-labeled papers where every claim, tag, qualifier, and conflict is manually annotated. This is the #1 requirement for any ML system — you can't improve what you can't measure.
 
 
 
 
314
 
315
+ ### Blindspot T-2: No Paper-Level Test Splits 🔴
316
 
317
+ Train and test sets are randomly shuffled. Claims from the same paper could be in both. The model might memorize patterns per paper rather than learning to generalize.
 
 
 
 
 
 
318
 
319
+ **Fix**: Split by paper, lab, journal, year, and field. Test on genuinely unseen papers.
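
A sketch of a paper-level split using scikit-learn's `GroupShuffleSplit`, so all claims from one paper stay on the same side of the split:

```python
from sklearn.model_selection import GroupShuffleSplit

claims    = ["claim 1", "claim 2", "claim 3", "claim 4", "claim 5", "claim 6"]
paper_ids = ["paperA", "paperA", "paperB", "paperB", "paperC", "paperC"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=42)
train_idx, test_idx = next(splitter.split(claims, groups=paper_ids))

# No paper appears on both sides, so test accuracy reflects real generalization.
print({paper_ids[i] for i in train_idx}, {paper_ids[i] for i in test_idx})
```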
320
 
321
+ ### Blindspot T-3: No Counterfactual Tests 🟠
 
 
 
 
322
 
323
+ We never check if the model changes its answer for the right reason. Remove a table header — does the claim become Incomplete? Flip "significant" to "not significant" — does the tag change? Without these tests, the model might use shortcuts instead of understanding.
324
 
325
+ ### Blindspot T-4: No Stochastic Testing 🟡
326
 
327
+ LLMs give different answers each time. Running evaluation once gives one number that might change by ±5% next time. Run evaluations 5 times and report the range.
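
A sketch of what repeated evaluation could look like; `run_evaluation` is a stand-in for whatever evaluation function the project ends up with:

```python
from statistics import mean, stdev


def evaluate_with_spread(run_evaluation, n_runs: int = 5) -> str:
    """Run the same evaluation several times and report the spread, not a point."""
    scores = [run_evaluation() for _ in range(n_runs)]
    return f"accuracy {mean(scores):.3f} ± {stdev(scores):.3f} over {n_runs} runs"
```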
328
 
329
+ ### Blindspot T-5: Regression Gate Has No Teeth 🟡
330
 
331
+ The `run_regression_gate()` returns pass/fail but nothing blocks a bad model from being used. It's like a fire alarm with no fire department.
332
 
333
+ ### Blindspot T-6: No Inter-Annotator Agreement Tracking 🟡
334
 
335
+ When multiple people label the gold standard, they'll disagree on some claims. That disagreement is important information — it tells you which categories are genuinely ambiguous. The system has no mechanism to measure or track this.
336
 
337
+ ### Blindspot T-7: 143 Tests, Zero of Them Test Science Quality 🟠
338
 
339
+ All 143 tests verify that the CODE runs correctly (functions don't crash, data is stored properly). Zero tests verify that the SCIENCE is right (extracted claims match what the paper actually says). Code tests and science tests are completely different things.
340
 
341
  ---
342
 
343
+ ## 9. TrainingTeaching the AI Properly
 
 
 
 
 
 
 
 
 
 
 
 
 
 
344
 
345
+ ### Blindspot TR-1: Only Stage 1 of 4 Is Built 🔴
346
 
347
+ The 4-stage pipeline: SFT → DPO → GRPO → ConfTuner. Only SFT exists. This is like building a race car but only installing first gear.
348
 
349
+ - **SFT** (Stage 1): "Here's the right answer" → ~70% quality
350
+ - **DPO** (Stage 2): "This answer is better than that one" → ~80% quality
351
+ - **GRPO** (Stage 3): "Reward for good JSON, correct tags, preserved qualifiers" → ~85-90% quality
352
+ - **ConfTuner** (Stage 4): "Make your confidence numbers match reality" → Calibrated confidence
353
 
354
+ ### Blindspot TR-2: ZeroGPU Micro-Batching Breaks Learning 🟠
 
 
 
 
 
 
355
 
356
+ The training Space uses 60-second GPU bursts. The learning rate resets, the model reloads, and there's no training momentum. It's like studying in 2-minute bursts with naps between each one.
357
 
358
+ **Fix**: One continuous training job on proper GPU (HF Jobs or local).
359
 
360
+ ### Blindspot TR-3: No Curriculum Learning 🟡
 
 
 
 
 
 
361
 
362
+ Easy and hard examples are randomly mixed. Humans learn better easy-to-hard. The model should first learn single facts, then fact vs. interpretation, then multi-claim papers, then contradictions.
363
 
364
+ ### Blindspot TR-4: Loss Number Doesn't Mean Task Success 🟡
365
 
366
+ Training tracks eval_loss, which tells you if the model is generating likely-looking text. It does NOT tell you if the JSON is valid, tags are correct, qualifiers are preserved, or numbers are accurate. Task-specific evaluation during training is essential.
367
 
368
+ ### Blindspot TR-5: No Training on Real Model Failures 🟠
369
 
370
+ Hard-negative mining (collecting the model's actual mistakes and training on them) is the single most efficient way to improve quality. It's described in the earlier expert review but not implemented anywhere. You need:
371
+ 1. Run current model on 100 papers
372
+ 2. Manually check which extractions are wrong
373
+ 3. Categorize WHY they're wrong (using the 9 error types)
374
+ 4. Train specifically on those failure cases
375
 
376
  ---
377
 
378
+ ## 10. TeamworkMultiple Models Working Together
 
 
 
 
 
 
 
 
 
 
379
 
380
+ ### Blindspot W-1: Council Is Sequential, Not Debating 🟠
381
 
382
+ The design describes a multi-round debate (Extractor → Critic → Chairman, with hidden confidence and revealed reasoning). The code is sequential: one call per member, no back-and-forth. The disagreement signal (which IS the whole point) is lost.
 
 
 
383
 
384
+ ### Blindspot W-2: No Router 🟠
385
 
386
+ Everything goes to the same model. Text, tables, figures, and equations are all handled the same way. A router should decide: regular text → language model, table → specialized parser, figure → vision model, garbage → skip.
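
A sketch of what a minimal router could look like; the handler names are placeholders:

```python
def route(region: dict) -> str:
    """Decide which component should handle one parsed region of a paper."""
    kind = region.get("type")
    if kind == "table":
        return "table_parser"      # deterministic structured parser
    if kind == "figure":
        return "vision_model"      # a VLM for plots and images
    if kind == "text" and region.get("garbled_ratio", 0.0) < 0.2:
        return "language_model"    # the normal extraction path
    return "skip_and_flag"         # unreadable content -> don't guess
```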
387
 
388
+ ### Blindspot W-3: System Can Never Say "I Don't Know" 🟠
389
 
390
+ Every input gets a confident-looking answer. There's no mechanism to say "I can't read this page," "I don't understand this field," or "I need a human to look at this." Confident garbage is the worst failure mode.
391
 
392
+ ### Blindspot W-4: Teachers Are Treated As Oracles 🟠
393
 
394
+ The design proposes using 5 frontier AI models as teachers. But these models share biases — they were all trained on similar internet text, they all struggle with the same types of negation, and they can all be wrong in the same direction.
395
 
396
+ **Fix**: Include at least one NON-AI anchor for each data type: deterministic table parser for tables, regex for statistics, CrossRef API for citations. Where the rules-based tool disagrees with all 5 AI models, investigate: the AIs might be harmonizing around an error.
397
 
398
+ ---
 
 
 
 
399
 
400
+ ## 11. Staying Honest Over Time
401
 
402
+ ### Blindspot H-1: No Drift Detection 🟠
403
 
404
+ If the model slowly gets worse (because it encounters new sub-fields or vocabulary evolves), nobody notices. Weekly re-evaluation against the gold standard is needed.
405
 
406
+ ### Blindspot H-2: Human Corrections Disappear 🟠
407
 
408
+ When a researcher fixes a wrong tag or adds a missing qualifier in the UI, that correction isn't saved for training. Those corrections are the most valuable data you can get — each one is an expert-labeled example from your exact domain.
409
 
410
+ ### Blindspot H-3: Taxonomy Changes Break Old Scores 🟡
411
 
412
+ If study type weights change (e.g., "in_vitro" goes from 0.85 to 0.80), old claims aren't re-scored. Claims from different taxonomy versions can't be compared meaningfully.
413
 
414
+ ### Blindspot H-4: No Backup Strategy 🟡
415
 
416
+ Everything is in one SQLite file. No automatic backups, no periodic snapshots, no way to recover from corruption or accidental deletion.
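
A sketch of an automated snapshot using Python's built-in SQLite backup API; the paths and naming scheme are illustrative:

```python
import os
import sqlite3
from datetime import datetime


def backup_database(db_path: str, backup_dir: str = "backups") -> str:
    """Write a timestamped snapshot of the research database."""
    os.makedirs(backup_dir, exist_ok=True)
    dest_path = os.path.join(backup_dir, f"research_os_{datetime.now():%Y%m%d_%H%M%S}.db")
    with sqlite3.connect(db_path) as src, sqlite3.connect(dest_path) as dest:
        src.backup(dest)  # consistent copy even while the database is in use
    return dest_path
```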
 
 
 
 
 
 
417
 
418
+ ### Blindspot H-5: No Retraction Monitoring 🟠
419
 
420
+ Once a paper is ingested, its claims live in the graph forever, even if the paper is retracted (pulled back for fraud or errors). The system needs to periodically check for retractions and propagate that information.
421
 
422
+ ### Blindspot H-6: v1 → v2 Migration Path Undefined 🟡
423
 
424
+ The repo has both `phd_research_os/` (v1) and `phd_research_os_v2/` (v2) with different database schemas. There's no migration script for anyone who has data in v1 format. Users could lose work when upgrading.
 
 
 
 
425
 
426
  ---
427
 
428
+ ## 12. The Human Experience
429
 
430
+ ### Blindspot HU-1: Information Overload 🟠
431
+
432
+ 100 papers → 3,000+ claims. The UI treats them all equally. Researchers need priority ranking: most surprising findings first, then contradictions, then gaps.
433
 
434
+ ### Blindspot HU-2: Scores Are Unexplained Numbers 🟡
435
 
436
+ "Confidence: 0.723" means nothing without a breakdown. Show the multiplication chain: evidence × quality × tier × section × completeness = 0.723.
437
 
438
+ ### Blindspot HU-3: Risk of Blind Trust 🟡
439
 
440
+ Once a researcher trusts the system, they stop checking. Built-in friction (hiding scores randomly, asking "do you agree?", tracking override rates) prevents over-reliance.
441
 
442
+ ### Blindspot HU-4: No "Fresh User" Onboarding 🟡
443
 
444
+ A new graduate student encountering the system for the first time sees a complex 7-layer architecture with 87 blindspots. There's no tutorial, no "start here" guide, no simplified workflow for beginners.
445
 
446
+ **Fix**: A "Getting Started" mode that processes one paper and walks the user through each step: "Here's what the parser found Here's what the AI extracted Here's how we scored it Here's what we're not sure about."
447
 
448
+ ### Blindspot HU-5: No Accessibility Considerations 🟡
449
 
450
+ The UI has no dark mode, no font size options, no screen reader support. Research tools should be accessible to everyone.
451
 
452
  ---
453
 
454
+ ## 13. Security and Safety
455
 
456
+ ### Blindspot SEC-1: No Input Validation 🟠
 
 
457
 
458
+ No file size limits, no malicious PDF detection, no format verification. A bad file could crash or compromise the system.
 
 
 
459
 
460
+ ### Blindspot SEC-2: Risky SQL Pattern 🟡
461
 
462
+ The `get_stats()` function uses f-strings for table names. Currently safe because the names are hardcoded, but it's a bad pattern that could become dangerous if someone modifies the code.
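
A sketch of a safer pattern: validate table names against a whitelist before they ever reach an f-string (the table names are illustrative):

```python
import sqlite3

ALLOWED_TABLES = {"papers", "claims", "conflicts"}  # hypothetical table names


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in a known table without trusting arbitrary strings."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Unknown table: {table!r}")
    # Table names can't be bound as SQL parameters, so the whitelist check
    # above is what keeps the f-string below safe.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```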
463
 
464
+ ### Blindspot SEC-3: No API Cost Controls 🟡
 
 
 
 
 
 
465
 
466
+ No rate limiting on external API calls. Processing 1,000 papers could generate thousands of expensive API requests with no spending cap.
467
 
468
+ ### Blindspot SEC-4: No Data Privacy for Unpublished Work 🟡
469
 
470
+ The design describes an "Epistemic Embargo" (private graphs for unpublished research), but it's not implemented. A researcher analyzing their unpublished data has no privacy guarantees.
 
 
 
471
 
472
  ---
473
 
474
+ ## 14. The Companion Agent System
475
 
476
+ ### Blindspot A-1: Agents Have No Real Brain 🟠
477
+
478
+ The companion agents (DataQualityAuditor, PromptOptimizer, etc.) are fully coded with lifecycle management, audit trails, and proposal systems. But when no AI model is connected (the default), they generate placeholder proposals that say "Brain not configured." They're robots with no intelligence.
479
 
480
+ ### Blindspot A-2: Agent Plans Are Hardcoded, Not Dynamic 🟡
 
 
 
 
481
 
482
+ The `_plan()` method assigns fixed step lists based on agent type. A DataQualityAuditor always gets the same 4 steps regardless of what task it's given. Real planning would look at the task, check what data is available, and decide what to do.
483
 
484
+ ### Blindspot A-3: No Agent Coordination 🟡
485
 
486
+ Multiple agents can run simultaneously but they can't talk to each other. If the DataQualityAuditor finds a problem and the PromptOptimizer could fix it, there's no mechanism for them to coordinate.
487
 
488
+ ### Blindspot A-4: Proposal Review Has No Urgency System 🟡
489
 
490
+ All proposals wait equally for human review. But some (like "this paper was retracted, so its claims should be removed") are urgent, while others (like "consider adding training examples for this domain") are routine. No priority system exists.
491
 
492
  ---
493
 
494
+ ## 15. What To Build First
495
+
496
+ Everything has dependencies. You can't build the roof before the walls. Here's the order that makes engineering sense:
497
+
498
+ ### 🏗️ Foundation Phase (Weeks 1-4) — Nothing works without this
499
+ | # | Task | Why First |
500
+ |---|------|-----------|
501
+ | 1 | Integrate Marker PDF parser | Every downstream layer depends on accurate parsing |
502
+ | 2 | Create gold standard test set (10 real papers, expert-labeled) | Can't measure improvement without ground truth |
503
+ | 3 | Add embedding model (sentence-transformers) | Needed for smart deduplication and conflict detection |
504
+ | 4 | Run one end-to-end pipeline test | Prove the layers actually connect |
505
+
506
+ ### 🔧 Reliability Phase (Weeks 5-8) — Make it work correctly
507
+ | # | Task | Why Next |
508
+ |---|------|----------|
509
+ | 5 | Connect real AI model (replace mock extractor) | The system needs a real brain |
510
+ | 6 | Add constrained decoding (Guidance library) | Guarantee valid JSON output |
511
+ | 7 | Build continuous training pipeline (replace ZeroGPU micro-batching) | Proper training produces proper models |
512
+ | 8 | Implement weighted qualifier penalties | Not all hedging words are equal |
513
+
514
+ ### 🧠 Intelligence Phase (Weeks 9-16) — Make it smart
515
+ | # | Task | Why This Order |
516
+ |---|------|----------------|
517
+ | 9 | Expand training data to 10K+ with hard negatives | More and better data = better model |
518
+ | 10 | Implement DPO training (preference learning) | Stage 2 of training pipeline |
519
+ | 11 | Implement GRPO training (reward functions for JSON quality, tag correctness, qualifier preservation) | Stage 3 — the biggest quality jump |
520
+ | 12 | Add out-of-distribution detection | Know when to say "I don't know" |
521
+
522
+ ### ✅ Trust Phase (Weeks 17-24) — Make it honest
523
+ | # | Task | Why Here |
524
+ |---|------|----------|
525
+ | 13 | Build counterfactual evaluation suite | Test that the model reasons correctly, not just accurately |
526
+ | 14 | Add paper-level evaluation splits | Honest accuracy numbers |
527
+ | 15 | Implement human feedback loop | Every correction becomes training data |
528
+ | 16 | Add multi-dimensional calibration (6 Brier scores) | Know which confidence types are trustworthy |
529
+ | 17 | Add drift detection and monitoring | Catch problems before they become crises |
530
+
531
+ ### 🚀 Scale Phase (Weeks 25-32) — Make it complete
532
+ | # | Task | Why Last |
533
+ |---|------|----------|
534
+ | 18 | Add figure/table specialist models | Handle the 30-40% of evidence in images |
535
+ | 19 | Build content router | Right model for right content type |
536
+ | 20 | Implement supplement handling (paper bundles) | Complete paper coverage |
537
+ | 21 | Add retraction checking | Keep the knowledge graph honest |
538
+ | 22 | Build verification pipeline (double-check claims against source) | Reduce hallucinations dramatically |
539
 
540
+ ---
541
 
542
+ ## 16. Master Blindspot Table
543
+
544
+ | ID | Category | Severity | Problem (one sentence) |
545
+ |----|----------|----------|----------------------|
546
+ | CORE-1 | Philosophy | 🔴 | System generates summaries instead of pointing to evidence spans |
547
+ | CORE-2 | Philosophy | 🔴 | Design document describes features the code doesn't have |
548
+ | CORE-3 | Philosophy | 🔴 | End-to-end pipeline has never been tested on a real paper |
549
+ | D-1 | Data | 🔴 | All training data is computer-generated, not from real papers |
550
+ | D-2 | Data | 🔴 | No examples of what wrong output looks like |
551
+ | D-3 | Data | 🟠 | Errors aren't categorized by type |
552
+ | D-4 | Data | 🟠 | Multiple-teacher disagreement signal is thrown away |
553
+ | D-5 | Data | 🟡 | No counterfactual (mirror) training examples |
554
+ | D-6 | Data | 🟡 | Training covers only 5 scientific fields |
555
+ | P-1 | Parser | 🔴 | No ML-based PDF parser is connected |
556
+ | P-2 | Parser | 🔴 | Tables lose their header-value relationships |
557
+ | P-3 | Parser | 🟠 | Figures are detected but never analyzed |
558
+ | P-4 | Parser | 🟠 | Section headers are identified by keyword matching only |
559
+ | P-5 | Parser | 🟠 | Equations are garbled or dropped |
560
+ | P-6 | Parser | 🟠 | Supplementary files can't be linked to main papers |
561
+ | P-7 | Parser | 🟡 | Non-English papers produce garbage silently |
562
+ | P-8 | Parser | 🟡 | "See Table 2" references are found but never verified correct |
563
+ | P-9 | Parser | 🟠 | No file size/safety checks before processing |
564
+ | P-10 | Parser | 🟡 | Tables and their captions can be split across chunks |
565
+ | B-1 | Brain | 🔴 | Model is 3B parameters (design says 8-27B) |
566
+ | B-2 | Brain | 🔴 | Default mode uses keyword matching instead of AI |
567
+ | B-3 | Brain | 🟠 | Output format is not guaranteed (no constrained decoding) |
568
+ | B-4 | Brain | 🟠 | Single model does 6 different tasks that want opposite behaviors |
569
+ | B-5 | Brain | 🟠 | No training on scientific vocabulary specifically |
570
+ | B-6 | Brain | 🟠 | Can't detect when content is outside its training domain |
571
+ | B-7 | Brain | 🟠 | No second-pass verification of extracted claims |
572
+ | M-1 | Memory | 🔴 | Deduplication checks word overlap, not meaning |
573
+ | M-2 | Memory | 🟠 | No embedding model exists anywhere in the code |
574
+ | M-3 | Memory | 🟠 | Conflict detection only checks 500 claims using word overlap |
575
+ | M-4 | Memory | 🟡 | Knowledge graph has no concept of time |
576
+ | M-5 | Memory | 🟡 | Gap analysis only looks within same node types |
577
+ | M-6 | Memory | 🟠 | Retracted papers aren't detected or flagged |
578
+ | M-7 | Memory | 🟡 | Obsidian export doesn't include graph edges |
579
+ | S-1 | Scoring | 🟠 | All qualifier types penalized equally |
580
+ | S-2 | Scoring | 🟠 | One calibration number hides per-task calibration failures |
581
+ | S-3 | Scoring | 🟡 | Design features (source bonus, conflict penalty) not in code |
582
+ | S-4 | Scoring | 🟡 | Composite = average (should be minimum of weakest dimension) |
583
+ | S-5 | Scoring | 🟡 | Parse quality scores are arbitrary, not calibrated |
584
+ | T-1 | Testing | 🔴 | No human-labeled gold standard test set |
585
+ | T-2 | Testing | 🔴 | Train/test split is random, not paper-level |
586
+ | T-3 | Testing | 🟠 | No counterfactual robustness tests |
587
+ | T-4 | Testing | 🟡 | Evaluations run once (should run 5× to check consistency) |
588
+ | T-5 | Testing | 🟡 | Regression gate returns pass/fail but nothing enforces it |
589
+ | T-6 | Testing | 🟡 | No tracking of human annotator agreement |
590
+ | T-7 | Testing | 🟠 | 143 code tests, zero science quality tests |
591
+ | TR-1 | Training | 🔴 | Only 1 of 4 training stages exists |
592
+ | TR-2 | Training | 🟠 | ZeroGPU micro-batching breaks learning continuity |
593
+ | TR-3 | Training | 🟡 | Examples aren't ordered easy → hard |
594
+ | TR-4 | Training | 🟡 | Training eval is loss-based, not task-based |
595
+ | TR-5 | Training | 🟠 | No training on actual model failures |
596
+ | W-1 | Teamwork | 🟠 | Council members don't actually debate |
597
+ | W-2 | Teamwork | 🟠 | No router to send content to appropriate model |
598
+ | W-3 | Teamwork | 🟠 | System can never say "I don't know" |
599
+ | W-4 | Teamwork | 🟠 | Teacher ensemble assumed unbiased (they share biases) |
600
+ | H-1 | Longevity | 🟠 | No automatic detection of model performance degradation |
601
+ | H-2 | Longevity | 🟠 | Human corrections aren't saved for future training |
602
+ | H-3 | Longevity | 🟡 | Taxonomy changes don't trigger re-scoring of old claims |
603
+ | H-4 | Longevity | 🟡 | No database backup automation |
604
+ | H-5 | Longevity | 🟠 | No retraction monitoring for ingested papers |
605
+ | H-6 | Longevity | 🟡 | No migration path from v1 to v2 database schema |
606
+ | HU-1 | Human | 🟠 | 3,000 claims displayed with no priority ranking |
607
+ | HU-2 | Human | 🟡 | Confidence scores show number but no breakdown |
608
+ | HU-3 | Human | 🟡 | No safeguards against over-trusting the system |
609
+ | HU-4 | Human | 🟡 | No beginner-friendly onboarding experience |
610
+ | HU-5 | Human | 🟡 | No accessibility features (dark mode, screen reader, font size) |
611
+ | SEC-1 | Security | 🟠 | No file validation or safety checks |
612
+ | SEC-2 | Security | 🟡 | SQL construction uses f-strings (risky pattern) |
613
+ | SEC-3 | Security | 🟡 | No spending limits on API calls |
614
+ | SEC-4 | Security | 🟡 | No privacy controls for unpublished research |
615
+ | A-1 | Agents | 🟠 | Companion agents have no AI brain connected |
616
+ | A-2 | Agents | 🟡 | Agent plans are fixed templates, not dynamic |
617
+ | A-3 | Agents | 🟡 | Multiple agents can't coordinate with each other |
618
+ | A-4 | Agents | 🟡 | No urgency system for proposal review |
619
+
620
+ ### Summary by Severity
621
+
622
+ | Severity | Count | Meaning |
623
+ |----------|-------|---------|
624
+ | 🔴 Critical | 14 | System fundamentally broken without this |
625
+ | 🟠 High | 33 | Significant quality or reliability problem |
626
+ | 🟡 Medium | 31 | Important but not blocking |
627
+ | **Total** | **78** | |
628
+
629
+ ### Summary by Category
630
+
631
+ | Category | Count | Most Critical Issue |
632
+ |----------|-------|-------------------|
633
+ | Philosophy | 3 | Evidence-centered vs model-centered |
634
+ | Data | 6 | All training data is synthetic |
635
+ | Parser | 10 | No ML-based parser connected |
636
+ | Brain | 7 | Model too small + mock is default |
637
+ | Memory | 7 | Deduplication uses word counting |
638
+ | Scoring | 5 | All qualifiers penalized equally |
639
+ | Testing | 7 | No gold standard test set |
640
+ | Training | 5 | Only stage 1 of 4 built |
641
+ | Teamwork | 4 | Council doesn't debate + no router |
642
+ | Longevity | 6 | No drift detection |
643
+ | Human | 5 | Information overload |
644
+ | Security | 4 | No input validation |
645
+ | Agents | 4 | Agents have no brain |
646
 
647
+ ---
 
 
 
 
648
 
649
+ ## How This Document Was Created
 
 
 
 
650
 
651
+ 1. **Read every file** in the repository (60+ files, 545KB of code and documentation)
652
+ 2. **Compared code against design** — for each feature in SYSTEM_DESIGN.md, checked if it exists in the code
653
+ 3. **Incorporated 87 original blindspots** from BLINDSPOT_AUDIT_COMPLETE.md
654
+ 4. **Incorporated 12 architectural improvements** from previous expert review session
655
+ 5. **Wrote first draft** with 47 blindspots
656
+ 6. **Re-read with fresh eyes** and found 31 additional blindspots
657
+ 7. **Deduplicated and organized** into 78 unique findings across 14 categories
658
+ 8. **Rewrote everything** in high-school-readable language
659
 
660
+ ### Relationship to Other Documents
 
 
 
 
661
 
662
+ | Document | What It Contains | How This One Is Different |
663
+ |----------|-----------------|--------------------------|
664
+ | BLINDSPOT_AUDIT_COMPLETE.md | 87 theoretical failure modes found by adversarial critique | Theoretical — found by thinking about what COULD go wrong |
665
+ | SYSTEM_DESIGN.md | The dream architecture — what the system SHOULD be | Aspirational — describes the finished product |
666
+ | **This document** | **78 practical problems found by reading the actual code** | **Practical — found by comparing code to design** |
667
 
668
+ The key difference: the audit found problems by THINKING. This document found problems by READING THE CODE. Many overlap, but this document catches things the audit missed (like the mock extractor being the default, or the embedding model not existing), and skips things the audit included that are already partially addressed in code.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
669
 
670
  ---
671
 
672
+ *This is the compiled final edition. Each finding is grounded in specific code files and can be verified by reading the source.*