nkshirsa committed on
Commit 5d27056 · verified · 1 Parent(s): 198cde1

Update README to v2.0 — 7-layer architecture, 143 tests, 87 blindspots

Files changed (1):
  1. README.md +94 -196
@@ -10,6 +10,8 @@ tags:
   - phd-tools
   - multi-agent
   - ecc-harness
  language:
  - en
  base_model: Qwen/Qwen2.5-3B-Instruct
@@ -18,245 +20,141 @@ datasets:
  pipeline_tag: text-generation
  ---

- # PhD Research OS Brain 🧠
-
- **An AI model, companion agent framework, and complete software system for PhD-level STEM research**, implementing the [Research OS v11.0 (The Grounded OS)](https://github.com/nkshirsa/phd-research-os) specification with the ECC Harness (V-SINGULARITY) for continuous improvement.
-
- ## What This Is
-
- Two systems in one:
-
- 1. **The Brain** — A multi-task fine-tuned language model (Qwen2.5-3B-Instruct + QLoRA) that serves as the intelligent core: extracting claims from papers, classifying evidence, detecting contradictions, scoring confidence, and generating research decisions.
-
- 2. **The Agent OS** — An ECC Harness orchestrator (`agent_os.py`) that lets you spawn **companion AI agents** to continuously improve the Brain. Each companion follows a strict lifecycle (preflight → plan → execute → postflight), produces proposals that require human approval, and maintains an immutable audit trail.
-
- ## Architecture
-
- ```
- ┌─────────────────────────────────────────────────────────────┐
- │                       PhD Research OS                       │
- ├──────────────────────┬──────────────────────────────────────┤
- │ Core Brain           │ Agent OS (ECC Harness)               │
- │ (agents.py)          │ (agent_os.py)                        │
- │                      │                                      │
- │ 6 Agent Roles:       │ Companion Agents:                    │
- │ 1. Researcher        │ • DataQualityAuditor                 │
- │ 2. Epistemic         │ • PromptOptimizer                    │
- │ 3. Confidence        │ • DomainExpander                     │
- │ 4. Verifier          │ • CalibrationAnalyst                 │
- │ 5. Query Planner     │ • CitationChaser                     │
- │ 6. Decision Gen      │ • [Custom agents]                    │
- │                      │                                      │
- │ Provenance: Lv5      │ Output: Proposals (human approval)   │
- ├──────────────────────┴──────────────────────────────────────┤
- │ Data Layer (db.py) — SQLite + Fixed-Point Math              │
- │ Claims | Sources | Goals | Conflicts | Decisions            │
- │ Companions | Tasks | Proposals | Audit Log | Memory         │
- ├─────────────────────────────────────────────────────────────┤
- │ Pipeline (pipeline.py) → Obsidian (obsidian_export.py)      │
- │ Evaluation (evaluation.py) → Backup (backup.py)             │
- └─────────────────────────────────────────────────────────────┘
  ```

- ## ECC Harness Companion AI System
-
- The Agent OS implements the ECC Harness: Principal Architect Edition (V-SINGULARITY). Any time you want to improve the Research OS, you spawn a governed companion agent:
-
- ### Quick Start: Spawn a Companion
-
- ```python
- from phd_research_os.agent_os import AgentOS
-
- # Initialize the Agent OS
- aos = AgentOS()
-
- # Spawn a companion to audit data quality
- agent_id = aos.spawn_companion("DataQualityAuditor")
-
- # Assign it a task
- task_id = aos.assign_task(agent_id, "Audit the last 50 claims for hallucination patterns")
-
- # Run the full ECC lifecycle (preflight → plan → execute → postflight)
- result = aos.run_task(task_id)
- print(f"Status: {result['status']}")
- print(f"Proposals: {len(result['proposals'])}")
-
- # Review proposals (human-in-the-loop)
- for proposal in aos.get_proposals(agent_id):
-     print(f"  [{proposal['proposal_type']}] {proposal['description']}")
-     # Approve or reject
-     aos.approve_proposal(proposal['proposal_id'], reviewed_by="Dr. Smith")
-     # OR: aos.reject_proposal(proposal['proposal_id'], "Not relevant", "Dr. Smith")
  ```

- ### 5 Built-in Companion Types
-
- | Agent Type | Purpose | How It Improves the Brain |
- |------------|---------|---------------------------|
- | **DataQualityAuditor** | Audit claim extraction for drift and hallucination | Catches quality degradation over time |
- | **PromptOptimizer** | A/B test system prompts against golden dataset | Improves recall, precision, accuracy |
- | **DomainExpander** | Generate training data for new STEM fields | Expands model to new research areas |
- | **CalibrationAnalyst** | Analyze Brier scores, fix miscalibration | Reduces over/under-confidence |
- | **CitationChaser** | Find papers citing/contradicting current claims | Enriches knowledge base |
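
The CalibrationAnalyst row above mentions Brier scores; as a quick reference, a Brier score is just the mean squared gap between probabilistic forecasts and binary outcomes. A minimal sketch (function name and inputs are illustrative, not the companion's actual code):

```python
def brier_score(forecasts, outcomes):
    # Mean squared gap between predicted probability and the 0/1 outcome.
    # 0.0 is perfect; always guessing 0.5 earns 0.25.
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

confident_and_right = brier_score([0.9, 0.1], [1, 0])  # low score: well calibrated
overconfident       = brier_score([0.9, 0.9], [1, 0])  # high score: miscalibrated
```

A companion that watches this number over time can flag when the Brain's stated confidences stop tracking reality.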

- ### Custom Companions
-
- ```python
- agent_id = aos.spawn_companion(
-     "custom",
-     purpose="Identify claims that need replication studies",
-     system_prompt="You are a Replication Analyst. Find claims with high confidence but few supporting sources..."
- )
- ```
-
- ### ECC Lifecycle (Every Task)
-
- ```
- §1 PRE-FLIGHT  → Load ARCHITECTURE.md + AGENTS.md, validate DB, check agent state
- §2 PLANNING    → Obviousness test, build step list, classify reversibility
- §3 EXECUTION   → Bounded iterations (max 3), time budget with kill heuristic (50% over = HALT)
- §4 POST-FLIGHT → Validate proposals, check invariants, log meta-learning
- ```
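
The §3 kill heuristic reduces to a one-line check. A sketch, assuming budgets are tracked in seconds (names are illustrative, not the harness's actual API):

```python
def should_halt(elapsed_s: float, budget_s: float) -> bool:
    # Kill heuristic: auto-halt once a task runs 50% past its time budget
    return elapsed_s > budget_s * 1.5

# A 60s-budget task may run up to 90s before the harness halts it
assert not should_halt(89.0, 60.0)
assert should_halt(91.0, 60.0)
```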

- ### Key Safety Properties
-
- - **Proposals, Not Actions**: Companions NEVER modify claims, sources, or goals directly. They produce Proposals that require human approval.
- - **Immutable Audit Trail**: Every action is logged to the `agent_audit_log` table. Cannot be modified.
- - **Kill Heuristic**: If a task exceeds its time budget by 50%, it auto-halts.
- - **Iteration Budget**: Max 1 retry for patches, max 3 for architecture changes.
- - **Harness Evolution**: The rules themselves can be amended via `propose_harness_evolution()` — but amendments require human approval.
-
- ### State Files
-
- | File | Purpose |
- |------|---------|
- | `ARCHITECTURE.md` | Project map — file locations, API config, invariants (read FIRST) |
- | `AGENTS.md` | Agent registry — contracts, boundaries, proposal schema |
- | `MEMORY.md` | Persistent assumptions with "Last Validated" markers |
- | `plan.md` | Current task plan (mutable) |
- | `HARNESS_EVOLUTION.md` | Rule amendment log (append-only) |
-
- ## 6 Core Brain Tasks
-
- *(Same as before — the Brain powers the core pipeline)*
-
- ### Task 1: Scientific Claim Extraction
  ```python
- from phd_research_os.agents import ResearchOSBrain
- brain = ResearchOSBrain(backend="api")
- result = brain.extract_claims("Paper text here...")
  ```

- ### Tasks 2–6: Epistemic Classification, Confidence Scoring, Contradiction Detection, Query Decomposition, Decision Generation
-
- See the [Core Brain section](#6-core-tasks-detail) below.
-
- ## Research OS v11.0 Compliance
-
- | Rule | Implementation |
- |------|----------------|
- | **Provenance Hierarchy** | All AI outputs = Level 5 (LLM Hypothesis). Human verification required. |
- | **Anchor Divergence** | Agent output never overrides human-verified observations. |
- | **Shadow Archive** | Rejected proposals stored with reason. Can be resurrected with quorum. |
- | **Fixed-Point Math** | All probabilities stored as INTEGER × 1000. No floats in DB. |
- | **Causal Lineage** | Every claim traces to source DOI. Every proposal traces to agent_id + task_id. |
- | **Skeptic Thread** | Conflict detector examines existing data only — no simulation. |
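
The fixed-point rule can be sketched in a few lines (helper names are illustrative, not `db.py`'s actual API):

```python
SCALE = 1000  # probabilities stored as INTEGER × 1000 — no floats in the DB

def to_fixed(p: float) -> int:
    # 0.855 → 855; rounding guards against float representation error
    return int(round(p * SCALE))

def from_fixed(n: int) -> float:
    return n / SCALE

assert to_fixed(0.855) == 855
assert from_fixed(855) == 0.855
```

Storing integers keeps comparisons and sums in SQLite exact, which is the point of the rule.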

- ## File Structure
-
- ```
- phd-research-os-brain/
- ├── README.md                 # This file
- ├── train.py                  # SFT training script
- ├── generate_dataset.py       # Synthetic dataset generator
- ├── phd_research_os/
- │   ├── __init__.py           # v1.0.0
- │   ├── db.py                 # Core data layer (Phase 0)
- │   ├── agents.py             # AI brain — 6 agent roles
- │   ├── agent_os.py           # ECC Harness — companion AI factory
- │   ├── pipeline.py           # Paper ingestion (Phase 1+6)
- │   ├── obsidian_export.py    # Obsidian vault export (Phase 4)
- │   ├── evaluation.py         # Golden dataset eval (Phase 2)
- │   ├── conflict_detector.py  # Contradiction detection (Phase 5)
- │   ├── backup.py             # Backup & recovery (Phase 6)
- │   ├── ARCHITECTURE.md       # Project map (Wake-Up doc)
- │   ├── AGENTS.md             # Agent registry & contracts
- │   ├── MEMORY.md             # Persistent state
- │   ├── plan.md               # Current task plan
- │   └── HARNESS_EVOLUTION.md  # Rule amendments
- ├── tests/
- │   ├── test_db.py            # 22 unit tests (data layer)
- │   └── test_agent_os.py      # 21 integration tests (ECC harness)
- ```

- ## Test Results
-
- ```
- tests/test_db.py       — 22 passed ✅ (data layer, fixed-point math, CRUD, search)
- tests/test_agent_os.py — 21 passed ✅ (spawn, lifecycle, proposals, audit, memory, evolution)
- ─────────────────────────
- Total: 43 tests passing
- ```

- ## 6 Core Tasks (Detail)
-
- ### Task 1: Scientific Claim Extraction
- ```python
- result = brain.extract_claims("Paper text here...")
- # → {"claims": [{"text": "...", "epistemic_tag": "Fact", "confidence": 0.87, ...}]}
- ```
-
- ### Task 2: Epistemic Classification
- ```python
- result = brain.classify_epistemic("The measured ionic conductivity was 4.2 × 10⁻⁴ S/cm.")
- # → {"epistemic_tag": "Fact", "reasoning": "...", "confidence_in_classification": 0.95}
- ```
-
- ### Task 3: Confidence Scoring
- ```python
- result = brain.score_confidence("Claim text", "ACS Nano", "primary_experimental", 1)
- # → {"confidence": 0.855, ...formula_breakdown...}
- ```
-
- ### Task 4: Contradiction Detection
- ```python
- result = brain.detect_conflicts("Claim A", "Claim B")
- # → {"conflict_detected": true, "hypothesis_confidence": "low", ...}  # ALWAYS low
- ```
-
- ### Task 5: Query Decomposition
- ```python
- result = brain.decompose_query("Broad research question?")
- # → {"sub_queries": ["specific Q1", "specific Q2", ...]}
  ```
-
- ### Task 6: Decision Generation
- ```python
- result = brain.generate_decision("Goal", ["gap1", "gap2"], ["low-conf claim 1"])
- # → {"recommended_action": "experiment", "expected_information_gain": 0.72, ...}
  ```
  ## Training

- Dataset: [nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data) — 1,900 multi-task examples
-
- ```bash
- pip install torch transformers trl peft datasets bitsandbytes accelerate trackio
- python train.py  # Needs GPU: T4 minimum, A10G recommended
- ```
-
- **Recipe:** Qwen2.5-3B-Instruct + QLoRA (r=64, all-linear) + assistant-only loss, 3 epochs, lr=2e-4
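
The recipe line translates roughly into a `peft` adapter config. A sketch under stated assumptions: `lora_alpha` and `lora_dropout` are not given in the recipe, so the values below are guesses, not the repo's actual `train.py` settings:

```python
from peft import LoraConfig

# QLoRA adapter per the recipe: r=64 on all linear layers, causal-LM task
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,            # assumption: the common 2×r heuristic
    lora_dropout=0.05,         # assumption: not stated in the recipe
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```

Assistant-only loss is then handled on the trainer side (masking prompt tokens out of the loss), not in this adapter config.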

- ## Citation
-
- ```bibtex
- @software{phd_research_os_2026,
-   title={PhD Research OS Brain: Multi-Task AI for Scientific Research Management},
-   author={nkshirsa},
-   year={2026},
-   url={https://huggingface.co/nkshirsa/phd-research-os-brain}
- }
- ```

  ## License

  - phd-tools
  - multi-agent
  - ecc-harness
+ - knowledge-graph
+ - calibrated-scoring
  language:
  - en
  base_model: Qwen/Qwen2.5-3B-Instruct

  pipeline_tag: text-generation
  ---

+ # PhD Research OS v2.0 — The Epistemic Engine 🧠

+ A complete, local-first AI system for PhD-level STEM research. It extracts epistemic-tagged claims from scientific papers, builds a knowledge graph with typed edges, detects contradictions, identifies research gaps, and scores confidence using code-computed formulas, not LLM guesses.

+ **53 files | 545KB | 143 tests passing | 87 blindspots audited and addressed**

+ ## Resources

+ | Resource | URL | Description |
+ |----------|-----|-------------|
+ | **Model + Full Code** | [nkshirsa/phd-research-os-brain](https://huggingface.co/nkshirsa/phd-research-os-brain) | This repo — all code, design docs, tests |
+ | **Training Dataset** | [nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data) | 1,900 multi-task examples across 6 tasks |
+ | **Taxonomy GUI** | [nkshirsa/phd-research-os-taxonomy](https://huggingface.co/spaces/nkshirsa/phd-research-os-taxonomy) | Live Gradio Space with 6 tabs |
+ | **Training Space** | [nkshirsa/phd-research-os-train](https://huggingface.co/spaces/nkshirsa/phd-research-os-train) | ZeroGPU micro-batch training |
+ | **Blindspot Audit** | [BLINDSPOT_AUDIT_COMPLETE.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/BLINDSPOT_AUDIT_COMPLETE.md) | 87 failure modes across 4 epochs |
+ | **System Design** | [SYSTEM_DESIGN.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/SYSTEM_DESIGN.md) | Complete 7-layer architecture spec |

+ ## Quick Start

+ ```bash
+ git clone https://huggingface.co/nkshirsa/phd-research-os-brain
+ cd phd-research-os-brain
+ pip install gradio pymupdf
+ python -m phd_research_os_v2.app
+ # Open http://localhost:7860
  ```

+ Works immediately with heuristic extraction. Add an API key for AI-powered extraction:
+ ```bash
+ export ANTHROPIC_API_KEY=sk-...  # or OPENAI_API_KEY
  ```

+ ## Architecture

+ ```
+ PDF Bundle → Layer 0 (Structural Parse) → Layer 1 (Entity Resolution)
+   → Layer 2 (Qualified Extraction via AI Council)
+   → Layer 3 (Claim Canonicalization)
+   → Layer 4 (Knowledge Graph + Gap Analysis)
+   → Layer 5 (Code-Computed Calibrated Scoring)
+   → Layer 6 (Evaluation Harness)
+   → Layer 7 (Provenance & Reproducibility)
+   → Outputs: Obsidian Vault | Courtroom UI | Decision Objects
+ ```

+ | Layer | Module | Purpose |
+ |-------|--------|---------|
+ | **0** | `layer0/parser.py` | PDF → section-aware regions with bbox, quality scores, cross-refs |
+ | **2** | `layer2/extractor.py` | AI Council extracts claims; Epistemic Separation Engine penalizes Abstract spin |
+ | **4** | `layer4/graph.py` | SQLite knowledge graph; typed edges; Gap Analysis finds structural holes |
+ | **5** | `layer5/scorer.py` | Code-computed 3-score system; parser confidence caps claims |
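
A minimal sketch of what a typed-edge claim graph can look like in SQLite. The schema and edge types here are illustrative assumptions, not the actual `layer4/graph.py` DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (
  id   INTEGER PRIMARY KEY,
  text TEXT NOT NULL
);
CREATE TABLE edges (
  src       INTEGER NOT NULL REFERENCES claims(id),
  dst       INTEGER NOT NULL REFERENCES claims(id),
  -- hypothetical edge vocabulary; the repo's types may differ
  edge_type TEXT NOT NULL
            CHECK (edge_type IN ('supports', 'contradicts', 'extends', 'cites'))
);
""")
conn.execute("INSERT INTO claims VALUES (1, 'Conductivity is 4.2e-4 S/cm'), (2, 'Conductivity is 1.1e-5 S/cm')")
conn.execute("INSERT INTO edges VALUES (1, 2, 'contradicts')")

# Gap analysis then reduces to graph queries, e.g. claims with no supporting edge:
unsupported = conn.execute("""
  SELECT c.id FROM claims c
  WHERE NOT EXISTS (SELECT 1 FROM edges e
                    WHERE e.dst = c.id AND e.edge_type = 'supports')
""").fetchall()
```

With edges typed, "structural holes" become plain SQL: claims with no corroboration, contradiction clusters, orphan sources.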

+ ## The 3-Score System

+ The LLM **never** sets the final confidence. It provides components; the code computes:

+ | Score | What It Measures |
+ |-------|------------------|
+ | **Evidence Quality** | evidence × study_quality × journal_tier × completeness × section_modifier |
+ | **Truth Likelihood** | evidence_quality + corroboration - conflict_penalty - null_penalty |
+ | **Qualifier Strength** | 1.0 - qualifier_count × 0.1 - null_penalty - inherited_penalty |

+ Key gates: parser confidence caps claim confidence; large N with a tiny effect is capped; Abstract claims get a 0.7× penalty.
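
The formulas and gates above can be sketched in code. Function names and example numbers are illustrative, not the actual `layer5/scorer.py` API:

```python
def evidence_quality(evidence, study_quality, journal_tier, completeness, section_modifier):
    # Multiplicative, so any weak factor drags the whole score down
    return evidence * study_quality * journal_tier * completeness * section_modifier

def truth_likelihood(eq, corroboration, conflict_penalty, null_penalty):
    # Additive corrections on top of evidence quality, clamped to [0, 1]
    return max(0.0, min(1.0, eq + corroboration - conflict_penalty - null_penalty))

def qualifier_strength(qualifier_count, null_penalty=0.0, inherited_penalty=0.0):
    return max(0.0, 1.0 - qualifier_count * 0.1 - null_penalty - inherited_penalty)

def cap_by_parser(claim_confidence, parser_confidence):
    # Gate: a claim can never be more confident than the parse it came from
    return min(claim_confidence, parser_confidence)

# Same claim scored from Results (1.0×) vs. the Abstract (0.7×)
eq_results  = evidence_quality(0.9, 0.85, 0.9, 1.0, 1.0)
eq_abstract = evidence_quality(0.9, 0.85, 0.9, 1.0, 0.7)
assert eq_abstract < eq_results
```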

+ ## Epistemic Separation Engine

+ | Source Section | Confidence Modifier |
+ |----------------|---------------------|
+ | Results (with stats) | 1.0× |
+ | Abstract | 0.7× (forced to Interpretation) |
+ | Discussion | 0.75× |

+ ## AI Model Council

+ | Member | Role |
+ |--------|------|
+ | **Query Planner** | Breaks questions into search queries |
+ | **Extractor** | Extracts atomic claims with epistemic tags + qualifiers |
+ | **Critic** | Reviews claims against source, flags errors |
+ | **Chairman** | Synthesizes final claims with a 0.7 completeness penalty |

+ ## ECC Harness — Companion AI System

+ ```python
+ from phd_research_os.agent_os import AgentOS
+ aos = AgentOS()
+ agent = aos.spawn_companion("DataQualityAuditor")
+ task = aos.assign_task(agent, "Audit last 50 claims")
+ result = aos.run_task(task)
+ ```

+ 5 built-in types: DataQualityAuditor, PromptOptimizer, DomainExpander, CalibrationAnalyst, CitationChaser. All output goes through Proposals requiring human approval.

+ ## Superpowers Skill Tree

+ 7 skills enforcing Design → Plan → Execute → Verify: Brainstorming, Writing Plans, Git Worktrees, TDD, Systematic Debugging, Code Review, Security Review.

+ ## Meta-Improver AI

+ InternalMonitor (7 quality metrics) + ExternalScanner (arXiv, HF Hub, GitHub) + SelfReflector (learns from acceptance/rejection) + ImprovementEngine (ranked proposals).

+ ## Quantum-Bio Taxonomy V2

+ 8-tier study types: in_vivo (1.0) → direct_physical_measurement (1.0) → mathematical_proof (0.95) → in_vitro (0.85) → first_principles_simulation (0.80) → phenomenological_simulation (0.60) → review (0.40) → perspective (0.20). 5 pre-built domains + custom.
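
The tier list maps naturally onto a lookup table. A sketch — the dictionary name and the conservative fallback for unknown types are assumptions, not the repo's schema:

```python
# Study-type → quality weight, per the 8-tier list above
STUDY_TYPE_WEIGHTS = {
    "in_vivo": 1.0,
    "direct_physical_measurement": 1.0,
    "mathematical_proof": 0.95,
    "in_vitro": 0.85,
    "first_principles_simulation": 0.80,
    "phenomenological_simulation": 0.60,
    "review": 0.40,
    "perspective": 0.20,
}

def study_quality(study_type: str) -> float:
    # Assumption: unrecognized custom types fall back to the most skeptical weight
    return STUDY_TYPE_WEIGHTS.get(study_type, 0.20)
```

This weight is one of the multiplicative factors in the Evidence Quality score, so a perspective piece can never outrank an in vivo measurement on evidence alone.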

+ ## Blindspot Audit (87 findings)

+ | Epoch | Focus | Count |
+ |-------|-------|-------|
+ | I: Architectural | Model & Inference | 10 |
+ | II: Epistemic | Logic & Truth | 27 |
+ | III: Judgment | Conflict & UI | 19 |
+ | IV: Systemic | Time & Impact | 25 |

+ 81 addressed. 6 acknowledged as fundamental limitations. Full audit: [BLINDSPOT_AUDIT_COMPLETE.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/BLINDSPOT_AUDIT_COMPLETE.md)

+ ## Tests

  ```
+ test_v2_integration.py  — 24 ✅ (full pipeline)
+ test_db.py              — 22 ✅ (data layer)
+ test_agent_os.py        — 21 ✅ (ECC harness)
+ test_taxonomy.py        — 27 ✅ (taxonomy)
+ test_skills_and_meta.py — 30 ✅ (skills + meta)
+ test_council.py         — 19 ✅ (AI council)
+ Total: 143 passing
  ```
  ## Training

+ **ZeroGPU**: [nkshirsa/phd-research-os-train](https://huggingface.co/spaces/nkshirsa/phd-research-os-train) — set hardware to ZeroGPU, click Train repeatedly.

+ **Local** (needs GPU): `python train.py`

+ **Planned**: SFT → DPO → GRPO (epistemic rewards) → ConfTuner calibration.

  ## License