nkshirsa committed on
Commit eb2e95b · verified · 1 Parent(s): de24794

Append 16 new improvement items from prior art analysis to FUTURE_IMPROVEMENTS.md (total now 94)

Files changed (1):
  1. FUTURE_IMPROVEMENTS.md +76 -1

FUTURE_IMPROVEMENTS.md CHANGED
@@ -669,4 +669,79 @@ The key difference: the audit found problems by THINKING. This document found pr

---

*This is the compiled final edition. Each finding is grounded in specific code files and can be verified by reading the source.*
---

## Appendix: New Improvements from Prior Art Analysis (2026-04-23)

**Source**: Comprehensive analysis of 15 similar systems ([PRIOR_ART_ANALYSIS.md](PRIOR_ART_ANALYSIS.md), [SYSTEM_INSPIRATIONS.md](SYSTEM_INSPIRATIONS.md))

After analyzing every system that does something similar to PhD Research OS, we found 16 concrete improvements to adopt, adapt, or build. These are ON TOP of the 78 blindspots already documented above. Each one comes from a real, published, working system.
### New Blindspots Discovered from Prior Art

| ID | Category | Severity | Problem | Source System |
|----|----------|----------|---------|---------------|
| PA-1 | Memory | 🔴 | No scientific embedding model for dedup (word overlap misses paraphrases: "0.8 fM" and "800 attomolar" look unrelated to our system, but they are the same quantity) | SPECTER2 (AllenAI) proves embeddings fix this |
| PA-2 | Testing | 🔴 | No standard benchmark for claim-verification quality (we can't compare ourselves to anyone else) | SciFact (AllenAI) is the industry standard we should measure against |
| PA-3 | Data | 🔴 | 1,900 synthetic examples vs 137,000 expert-written ones available for free (72× gap) | SciRIFF (AllenAI) is sitting on HuggingFace waiting to be used |
| PA-4 | Brain | 🟠 | No pre-filtering before expensive extraction (the AI Council processes boilerplate text and acknowledgments as seriously as Results) | PaperQA2's RCS technique filters noise before analysis |
| PA-5 | Brain | 🟠 | No code-based epistemic validation (AI classification has no deterministic cross-check) | KGX3's language-game filters provide rule-based validation |
| PA-6 | Brain | 🟠 | No completeness audit after extraction (the system could silently miss the most important finding) | Paper Circle's Coverage Checker catches omissions |
| PA-7 | Brain | 🟠 | The system never says "I don't know" (every input gets a confident-looking answer, even garbage) | PaperQA2 refuses to answer when evidence is insufficient |
| PA-8 | Memory | 🟠 | Contradiction detection has no fast pre-filter (checking 500K pairs requires an expensive LLM call for every pair) | SciBERT-NLI can pre-screen in seconds, not hours |
| PA-9 | Memory | 🟠 | No cross-reference verification between paper claims and the knowledge graph (claims are checked in isolation) | FactReview checks against both the paper AND external literature |
| PA-10 | Scoring | 🟡 | No temporal confidence tracking (a claim that's been rising for 4 years looks the same as one that just appeared) | CLAIRE + PaperQA2 both track temporal patterns |
| PA-11 | Scoring | 🟡 | Confidence numbers shown without explanation (0.72 means nothing without knowing WHY) | CLUE generates plain-English uncertainty explanations |
| PA-12 | Human | 🟡 | No active skepticism mode (the system only confirms, never challenges high-confidence claims) | CLAIRE actively looks for counter-evidence |
| PA-13 | Human | 🟡 | No human verification tracking (machine-extracted claims look identical to expert-verified ones) | ORKG + Paper Circle track verification status |
| PA-14 | Brain | 🟡 | No self-critique step in the Council (the Extractor produces claims without articulating uncertainty) | CritiCal proves that self-critique improves calibration |
| PA-15 | Memory | 🟡 | Conflict detection is shallow (it compares two claims in isolation and doesn't investigate with surrounding evidence) | CLAIRE's investigation loop retrieves additional evidence before flagging |
| PA-16 | Data | 🟡 | No non-AI anchors in the teacher ensemble (all 5 teacher models share similar biases from internet training) | FactReview uses CODE EXECUTION as a non-AI evidence source |
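PA-1's failure mode can be made concrete. The sketch below is illustrative only: a Jaccard-style word-overlap scorer stands in for the current dedup, and a small hand-written unit table (an assumption, not real project code) acts as a deterministic canonicaliser. It shows why word overlap misses the "0.8 fM" / "800 attomolar" paraphrase and how unit normalisation catches this one class of paraphrase; a scientific embedding model such as SPECTER2 is the general fix.

```python
import re

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap on lowercased word sets (stand-in for the current dedup)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Illustrative subset of molar-concentration units; a real table would be larger.
_UNIT_TO_MOLAR = {
    "m": 1.0, "mm": 1e-3, "um": 1e-6, "nm": 1e-9, "pm": 1e-12,
    "fm": 1e-15, "am": 1e-18, "femtomolar": 1e-15, "attomolar": 1e-18,
}

def canonical_concentrations(text: str) -> set[str]:
    """Extract '<number> <unit>' mentions and normalise them to molar.

    Values are rendered as fixed-precision strings so that equal quantities
    compare equal despite floating-point rounding.
    """
    out = set()
    for num, unit in re.findall(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", text):
        factor = _UNIT_TO_MOLAR.get(unit.lower())
        if factor is not None:
            out.add(f"{float(num) * factor:.3e}")
    return out

a = "detection limit of 0.8 fM"
b = "detection limit of 800 attomolar"
print(word_overlap(a, b))                                          # ~0.43: looks different
print(canonical_concentrations(a) == canonical_concentrations(b))  # True: same quantity
```

The canonicaliser only covers quantities; paraphrases like "limit of detection" vs "LOD" still need embeddings, which is why PA-1 is the priority-1 item below.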
### Updated Summary by Severity (Including Prior Art Findings)

| Severity | Original Count | New from Prior Art | Total |
|----------|---------------|-------------------|-------|
| 🔴 Critical | 14 | 3 | **17** |
| 🟠 High | 33 | 6 | **39** |
| 🟡 Medium | 31 | 7 | **38** |
| **Total** | **78** | **16** | **94** |
### Priority Integration Order for Prior Art Items

**Do These First** (they fix Critical blindspots with minimal effort):

| Priority | Item | Effort | What It Fixes |
|----------|------|--------|---------------|
| 1 | PA-1: Integrate SPECTER2 | 1-2 days | Deduplication actually works |
| 2 | PA-2: Add SciFact benchmark | 1 day | We can measure quality |
| 3 | PA-3: Integrate SciRIFF data | 2-3 days | 72× more training data |
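For PA-2, the benchmark harness itself is small once predictions exist. A hypothetical scorer over SciFact-style three-way labels (SUPPORT / CONTRADICT / NOT_ENOUGH_INFO), assuming predictions and gold labels come as aligned lists; the function name and metric keys are our own, not from the SciFact tooling:

```python
LABELS = ("SUPPORT", "CONTRADICT", "NOT_ENOUGH_INFO")  # SciFact-style label set

def label_metrics(pred: list[str], gold: list[str]) -> dict[str, float]:
    """Overall accuracy plus per-label precision/recall for claim verification."""
    assert len(pred) == len(gold) and gold, "need aligned, non-empty label lists"
    metrics = {"accuracy": sum(p == g for p, g in zip(pred, gold)) / len(gold)}
    for lab in LABELS:
        tp = sum(p == g == lab for p, g in zip(pred, gold))  # true positives
        p_count = sum(p == lab for p in pred)                 # predicted as lab
        g_count = sum(g == lab for g in gold)                 # gold instances
        metrics[f"{lab}_precision"] = tp / p_count if p_count else 0.0
        metrics[f"{lab}_recall"] = tp / g_count if g_count else 0.0
    return metrics

# Toy run: one SUPPORT claim mislabelled as CONTRADICT.
m = label_metrics(
    pred=["SUPPORT", "CONTRADICT", "SUPPORT", "NOT_ENOUGH_INFO"],
    gold=["SUPPORT", "SUPPORT", "SUPPORT", "NOT_ENOUGH_INFO"],
)
print(m["accuracy"])  # 0.75
```

Tracking NOT_ENOUGH_INFO precision separately also gives a direct measurement for PA-7 (does the system abstain when it should?).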
**Do These Next** (they fix High blindspots with moderate effort):

| Priority | Item | Effort | What It Fixes |
|----------|------|--------|---------------|
| 4 | PA-8: SciBERT-NLI pre-filter | 1-2 days | Conflict detection becomes 100× faster |
| 5 | PA-4: Pre-Extraction Filter | 1 week | Removes noise before expensive processing |
| 6 | PA-5: Epistemic Trigger Words | 1 week | Code-based validation of AI labels |
| 7 | PA-7: Low Confidence Quarantine | 3-5 days | The system learns to say "I don't know" |
| 8 | PA-6: Completeness Auditor | 1 week | Catches silent omissions |
| 9 | PA-9: Cross-Reference Verification | 1-2 weeks | Claims checked against both paper and graph |
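PA-5's rule-based cross-check could start as a trigger-word lexicon. The word lists below are illustrative placeholders, not a curated lexicon; the point is that the check is deterministic, costs nothing, and can be run against every AI-assigned epistemic label:

```python
import re

# Hypothetical trigger lexicons; real lists would be curated from the corpus.
HEDGES = {"may", "might", "suggests", "appears", "possibly", "could"}
ASSERTIONS = {"demonstrates", "proves", "shows", "confirms", "establishes"}

def rule_based_stance(claim: str) -> str:
    """Deterministic epistemic signal: 'hedged', 'asserted', or 'unclear'."""
    words = set(re.findall(r"[a-z]+", claim.lower()))
    hedged, asserted = bool(words & HEDGES), bool(words & ASSERTIONS)
    if hedged and not asserted:
        return "hedged"
    if asserted and not hedged:
        return "asserted"
    return "unclear"  # no triggers, or conflicting triggers: rules abstain

def cross_check(ai_label: str, claim: str) -> bool:
    """True if the AI label is consistent with the rule-based signal."""
    rule = rule_based_stance(claim)
    return rule == "unclear" or rule == ai_label

print(rule_based_stance("The data suggests a possible link"))        # hedged
print(cross_check("asserted", "This result proves the mechanism"))   # True
print(cross_check("asserted", "The data suggests a possible link"))  # False: flag it
```

Disagreements would not overrule the AI; they would just be flagged for the quarantine or human-review paths above.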
**Do These When Core Is Stable** (they add new capabilities):

| Priority | Item | Effort | What It Adds |
|----------|------|--------|--------------|
| 10 | PA-14: Council Self-Critique | 1 week | Better calibration |
| 11 | PA-13: Provenance Levels | 1 week | Human verification tracking |
| 12 | PA-15: Investigation Protocol | 2-3 weeks | Deep contradiction analysis |
| 13 | PA-10: Epistemic Velocity | 3-5 days | Temporal confidence trends |
| 14 | PA-12: Devil's Advocate Mode | 1-2 weeks | Active skepticism |
| 15 | PA-11: Confidence Explanations | 3-5 days | Plain-English scoring reasons |
| 16 | PA-16: Non-AI Anchors | Ongoing | Deterministic validation alongside AI |
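Of the items above, PA-7's quarantine is mostly a routing decision and could land early. A minimal sketch, assuming a calibrated per-claim confidence score in [0, 1] and a hypothetical threshold of 0.4 (which would need tuning on held-out data):

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    confidence: float  # assumed calibrated score in [0, 1]

# Hypothetical cutoff; in practice, tune against abstention metrics.
QUARANTINE_THRESHOLD = 0.4

def route(claim: Claim) -> str:
    """Accept confident claims; quarantine the rest as 'dont_know'.

    Quarantined claims never enter the knowledge graph silently; they wait
    for more evidence or human review.
    """
    return "accept" if claim.confidence >= QUARANTINE_THRESHOLD else "dont_know"

print(route(Claim("Gene X regulates pathway Y", 0.85)))  # accept
print(route(Claim("Ambiguous OCR fragment", 0.12)))      # dont_know
```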
---

*Appendix added 2026-04-23. These 16 items supplement the original 78 blindspots with findings from prior art analysis. See [PRIOR_ART_ANALYSIS.md](PRIOR_ART_ANALYSIS.md) and [SYSTEM_INSPIRATIONS.md](SYSTEM_INSPIRATIONS.md) for full details.*