nkshirsa committed on
Commit 5d27056 · verified · 1 Parent(s): 198cde1

Update README to v2.0 — 7-layer architecture, 143 tests, 87 blindspots

Files changed (1):
  1. README.md +94 -196
@@ -10,6 +10,8 @@ tags:
   - phd-tools
   - multi-agent
   - ecc-harness
  language:
  - en
  base_model: Qwen/Qwen2.5-3B-Instruct
@@ -18,245 +20,141 @@ datasets:
  pipeline_tag: text-generation
  ---

- # PhD Research OS Brain 🧠
-
- **An AI model, companion agent framework, and complete software system for PhD-level STEM research**, implementing the [Research OS v11.0 (The Grounded OS)](https://github.com/nkshirsa/phd-research-os) specification with the ECC Harness (V-SINGULARITY) for continuous improvement.
-
- ## What This Is
-
- Two systems in one:
-
- 1. **The Brain** — A multi-task fine-tuned language model (Qwen2.5-3B-Instruct + QLoRA) that serves as the intelligent core: extracting claims from papers, classifying evidence, detecting contradictions, scoring confidence, and generating research decisions.
-
- 2. **The Agent OS** — An ECC Harness orchestrator (`agent_os.py`) that lets you spawn **companion AI agents** to continuously improve the Brain. Each companion follows a strict lifecycle (preflight → plan → execute → postflight), produces proposals that require human approval, and maintains an immutable audit trail.
-
- ## Architecture
-
- ```
- ┌─────────────────────────────────────────────────────────────┐
- │                       PhD Research OS                       │
- ├──────────────────────┬──────────────────────────────────────┤
- │ Core Brain           │ Agent OS (ECC Harness)               │
- │ (agents.py)          │ (agent_os.py)                        │
- │                      │                                      │
- │ 6 Agent Roles:       │ Companion Agents:                    │
- │ 1. Researcher        │ • DataQualityAuditor                 │
- │ 2. Epistemic         │ • PromptOptimizer                    │
- │ 3. Confidence        │ • DomainExpander                     │
- │ 4. Verifier          │ • CalibrationAnalyst                 │
- │ 5. Query Planner     │ • CitationChaser                     │
- │ 6. Decision Gen      │ • [Custom agents]                    │
- │                      │                                      │
- │ Provenance: Lv5      │ Output: Proposals (human approval)   │
- ├──────────────────────┴──────────────────────────────────────┤
- │ Data Layer (db.py) — SQLite + Fixed-Point Math              │
- │ Claims | Sources | Goals | Conflicts | Decisions            │
- │ Companions | Tasks | Proposals | Audit Log | Memory         │
- ├─────────────────────────────────────────────────────────────┤
- │ Pipeline (pipeline.py) → Obsidian (obsidian_export.py)      │
- │ Evaluation (evaluation.py) → Backup (backup.py)             │
- └─────────────────────────────────────────────────────────────┘
  ```

- ## ECC Harness Companion AI System
-
- The Agent OS implements the ECC Harness: Principal Architect Edition (V-SINGULARITY). Any time you want to improve the Research OS, you spawn a governed companion agent:
-
- ### Quick Start: Spawn a Companion
-
- ```python
- from phd_research_os.agent_os import AgentOS
-
- # Initialize the Agent OS
- aos = AgentOS()
-
- # Spawn a companion to audit data quality
- agent_id = aos.spawn_companion("DataQualityAuditor")
-
- # Assign it a task
- task_id = aos.assign_task(agent_id, "Audit the last 50 claims for hallucination patterns")
-
- # Run the full ECC lifecycle (preflight → plan → execute → postflight)
- result = aos.run_task(task_id)
- print(f"Status: {result['status']}")
- print(f"Proposals: {len(result['proposals'])}")
-
- # Review proposals (human-in-the-loop)
- for proposal in aos.get_proposals(agent_id):
-     print(f"  [{proposal['proposal_type']}] {proposal['description']}")
-     # Approve or reject
-     aos.approve_proposal(proposal['proposal_id'], reviewed_by="Dr. Smith")
-     # OR: aos.reject_proposal(proposal['proposal_id'], "Not relevant", "Dr. Smith")
  ```

- ### 5 Built-in Companion Types
-
- | Agent Type | Purpose | How It Improves the Brain |
- |------------|---------|---------------------------|
- | **DataQualityAuditor** | Audit claim extraction for drift and hallucination | Catches quality degradation over time |
- | **PromptOptimizer** | A/B test system prompts against golden dataset | Improves recall, precision, accuracy |
- | **DomainExpander** | Generate training data for new STEM fields | Expands model to new research areas |
- | **CalibrationAnalyst** | Analyze Brier scores, fix miscalibration | Reduces over/under-confidence |
- | **CitationChaser** | Find papers citing/contradicting current claims | Enriches knowledge base |
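
The CalibrationAnalyst row above mentions Brier scores; as a quick reference, a Brier score is just the mean squared gap between probabilistic forecasts and binary outcomes. A minimal sketch (function name and inputs are illustrative, not the companion's actual code):

```python
def brier_score(forecasts, outcomes):
    # Mean squared gap between predicted probability and the 0/1 outcome.
    # 0.0 is perfect; always guessing 0.5 earns 0.25.
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

confident_and_right = brier_score([0.9, 0.1], [1, 0])  # low score: well calibrated
overconfident       = brier_score([0.9, 0.9], [1, 0])  # high score: miscalibrated
```

A companion that watches this number over time can flag when the Brain's stated confidences stop tracking reality.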

- ### Custom Companions
-
- ```python
- agent_id = aos.spawn_companion(
-     "custom",
-     purpose="Identify claims that need replication studies",
-     system_prompt="You are a Replication Analyst. Find claims with high confidence but few supporting sources..."
- )
- ```
-
- ### ECC Lifecycle (Every Task)
-
- ```
- §1 PRE-FLIGHT  → Load ARCHITECTURE.md + AGENTS.md, validate DB, check agent state
- §2 PLANNING    → Obviousness test, build step list, classify reversibility
- §3 EXECUTION   → Bounded iterations (max 3), time budget with kill heuristic (50% over = HALT)
- §4 POST-FLIGHT → Validate proposals, check invariants, log meta-learning
- ```
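
The §3 kill heuristic reduces to a one-line check. A sketch, assuming budgets are tracked in seconds (names are illustrative, not the harness's actual API):

```python
def should_halt(elapsed_s: float, budget_s: float) -> bool:
    # Kill heuristic: auto-halt once a task runs 50% past its time budget
    return elapsed_s > budget_s * 1.5

# A 60s-budget task may run up to 90s before the harness halts it
assert not should_halt(89.0, 60.0)
assert should_halt(91.0, 60.0)
```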

- ### Key Safety Properties
-
- - **Proposals, Not Actions**: Companions NEVER modify claims, sources, or goals directly. They produce Proposals that require human approval.
- - **Immutable Audit Trail**: Every action is logged to the `agent_audit_log` table. Cannot be modified.
- - **Kill Heuristic**: If a task exceeds its time budget by 50%, it auto-halts.
- - **Iteration Budget**: Max 1 retry for patches, max 3 for architecture changes.
- - **Harness Evolution**: The rules themselves can be amended via `propose_harness_evolution()` — but amendments require human approval.
-
- ### State Files
-
- | File | Purpose |
- |------|---------|
- | `ARCHITECTURE.md` | Project map — file locations, API config, invariants (read FIRST) |
- | `AGENTS.md` | Agent registry — contracts, boundaries, proposal schema |
- | `MEMORY.md` | Persistent assumptions with "Last Validated" markers |
- | `plan.md` | Current task plan (mutable) |
- | `HARNESS_EVOLUTION.md` | Rule amendment log (append-only) |
-
- ## 6 Core Brain Tasks
-
- *(Same as before — the Brain powers the core pipeline)*
-
- ### Task 1: Scientific Claim Extraction
  ```python
- from phd_research_os.agents import ResearchOSBrain
- brain = ResearchOSBrain(backend="api")
- result = brain.extract_claims("Paper text here...")
  ```

- ### Tasks 2–6: Epistemic Classification, Confidence Scoring, Contradiction Detection, Query Decomposition, Decision Generation
-
- See the [Core Brain section](#6-core-tasks-detail) below.
-
- ## Research OS v11.0 Compliance
-
- | Rule | Implementation |
- |------|----------------|
- | **Provenance Hierarchy** | All AI outputs = Level 5 (LLM Hypothesis). Human verification required. |
- | **Anchor Divergence** | Agent output never overrides human-verified observations. |
- | **Shadow Archive** | Rejected proposals stored with reason. Can be resurrected with quorum. |
- | **Fixed-Point Math** | All probabilities stored as INTEGER × 1000. No floats in DB. |
- | **Causal Lineage** | Every claim traces to source DOI. Every proposal traces to agent_id + task_id. |
- | **Skeptic Thread** | Conflict detector examines existing data only — no simulation. |
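
The fixed-point rule can be sketched in a few lines (helper names are illustrative, not `db.py`'s actual API):

```python
SCALE = 1000  # probabilities stored as INTEGER × 1000 — no floats in the DB

def to_fixed(p: float) -> int:
    # 0.855 → 855; rounding guards against float representation error
    return int(round(p * SCALE))

def from_fixed(n: int) -> float:
    return n / SCALE

assert to_fixed(0.855) == 855
assert from_fixed(855) == 0.855
```

Storing integers keeps comparisons and sums in SQLite exact, which is the point of the rule.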

- ## File Structure
-
- ```
- phd-research-os-brain/
- ├── README.md                 # This file
- ├── train.py                  # SFT training script
- ├── generate_dataset.py       # Synthetic dataset generator
- ├── phd_research_os/
- │   ├── __init__.py           # v1.0.0
- │   ├── db.py                 # Core data layer (Phase 0)
- │   ├── agents.py             # AI brain — 6 agent roles
- │   ├── agent_os.py           # ECC Harness — companion AI factory
- │   ├── pipeline.py           # Paper ingestion (Phase 1+6)
- │   ├── obsidian_export.py    # Obsidian vault export (Phase 4)
- │   ├── evaluation.py         # Golden dataset eval (Phase 2)
- │   ├── conflict_detector.py  # Contradiction detection (Phase 5)
- │   ├── backup.py             # Backup & recovery (Phase 6)
- │   ├── ARCHITECTURE.md       # Project map (Wake-Up doc)
- │   ├── AGENTS.md             # Agent registry & contracts
- │   ├── MEMORY.md             # Persistent state
- │   ├── plan.md               # Current task plan
- │   └── HARNESS_EVOLUTION.md  # Rule amendments
- ├── tests/
- │   ├── test_db.py            # 22 unit tests (data layer)
- │   └── test_agent_os.py      # 21 integration tests (ECC harness)
- ```

- ## Test Results
-
- ```
- tests/test_db.py       — 22 passed ✅ (data layer, fixed-point math, CRUD, search)
- tests/test_agent_os.py — 21 passed ✅ (spawn, lifecycle, proposals, audit, memory, evolution)
- ─────────────────────────
- Total: 43 tests passing
- ```

- ## 6 Core Tasks (Detail)
-
- ### Task 1: Scientific Claim Extraction
- ```python
- result = brain.extract_claims("Paper text here...")
- # → {"claims": [{"text": "...", "epistemic_tag": "Fact", "confidence": 0.87, ...}]}
- ```
-
- ### Task 2: Epistemic Classification
- ```python
- result = brain.classify_epistemic("The measured ionic conductivity was 4.2 × 10⁻⁴ S/cm.")
- # → {"epistemic_tag": "Fact", "reasoning": "...", "confidence_in_classification": 0.95}
- ```
-
- ### Task 3: Confidence Scoring
- ```python
- result = brain.score_confidence("Claim text", "ACS Nano", "primary_experimental", 1)
- # → {"confidence": 0.855, ...formula_breakdown...}
- ```
-
- ### Task 4: Contradiction Detection
- ```python
- result = brain.detect_conflicts("Claim A", "Claim B")
- # → {"conflict_detected": true, "hypothesis_confidence": "low", ...}  # ALWAYS low
- ```
-
- ### Task 5: Query Decomposition
- ```python
- result = brain.decompose_query("Broad research question?")
- # → {"sub_queries": ["specific Q1", "specific Q2", ...]}
  ```
-
- ### Task 6: Decision Generation
- ```python
- result = brain.generate_decision("Goal", ["gap1", "gap2"], ["low-conf claim 1"])
- # → {"recommended_action": "experiment", "expected_information_gain": 0.72, ...}
  ```
  ## Training

- Dataset: [nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data) — 1,900 multi-task examples
-
- ```bash
- pip install torch transformers trl peft datasets bitsandbytes accelerate trackio
- python train.py  # Needs GPU: T4 minimum, A10G recommended
- ```
-
- **Recipe:** Qwen2.5-3B-Instruct + QLoRA (r=64, all-linear) + assistant-only loss, 3 epochs, lr=2e-4
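
The recipe line translates roughly into a `peft` adapter config. A sketch under stated assumptions: `lora_alpha` and `lora_dropout` are not given in the recipe, so the values below are guesses, not the repo's actual `train.py` settings:

```python
from peft import LoraConfig

# QLoRA adapter per the recipe: r=64 on all linear layers, causal-LM task
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,            # assumption: the common 2×r heuristic
    lora_dropout=0.05,         # assumption: not stated in the recipe
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```

Assistant-only loss is then handled on the trainer side (masking prompt tokens out of the loss), not in this adapter config.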

- ## Citation
-
- ```bibtex
- @software{phd_research_os_2026,
-   title={PhD Research OS Brain: Multi-Task AI for Scientific Research Management},
-   author={nkshirsa},
-   year={2026},
-   url={https://huggingface.co/nkshirsa/phd-research-os-brain}
- }
- ```

  ## License

  - phd-tools
  - multi-agent
  - ecc-harness
+ - knowledge-graph
+ - calibrated-scoring
  language:
  - en
  base_model: Qwen/Qwen2.5-3B-Instruct

  pipeline_tag: text-generation
  ---

+ # PhD Research OS v2.0 — The Epistemic Engine 🧠

+ A complete, local-first AI system for PhD-level STEM research. It extracts epistemic-tagged claims from scientific papers, builds a knowledge graph with typed edges, detects contradictions, identifies research gaps, and scores confidence using code-computed formulas, not LLM guesses.

+ **53 files | 545KB | 143 tests passing | 87 blindspots audited and addressed**

+ ## Resources

+ | Resource | URL | Description |
+ |----------|-----|-------------|
+ | **Model + Full Code** | [nkshirsa/phd-research-os-brain](https://huggingface.co/nkshirsa/phd-research-os-brain) | This repo — all code, design docs, tests |
+ | **Training Dataset** | [nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data) | 1,900 multi-task examples across 6 tasks |
+ | **Taxonomy GUI** | [nkshirsa/phd-research-os-taxonomy](https://huggingface.co/spaces/nkshirsa/phd-research-os-taxonomy) | Live Gradio Space with 6 tabs |
+ | **Training Space** | [nkshirsa/phd-research-os-train](https://huggingface.co/spaces/nkshirsa/phd-research-os-train) | ZeroGPU micro-batch training |
+ | **Blindspot Audit** | [BLINDSPOT_AUDIT_COMPLETE.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/BLINDSPOT_AUDIT_COMPLETE.md) | 87 failure modes across 4 epochs |
+ | **System Design** | [SYSTEM_DESIGN.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/SYSTEM_DESIGN.md) | Complete 7-layer architecture spec |

+ ## Quick Start

+ ```bash
+ git clone https://huggingface.co/nkshirsa/phd-research-os-brain
+ cd phd-research-os-brain
+ pip install gradio pymupdf
+ python -m phd_research_os_v2.app
+ # Open http://localhost:7860
  ```

+ Works immediately with heuristic extraction. Add an API key for AI-powered extraction:
+ ```bash
+ export ANTHROPIC_API_KEY=sk-...  # or OPENAI_API_KEY
  ```

+ ## Architecture

+ ```
+ PDF Bundle → Layer 0 (Structural Parse) → Layer 1 (Entity Resolution)
+   → Layer 2 (Qualified Extraction via AI Council)
+   → Layer 3 (Claim Canonicalization)
+   → Layer 4 (Knowledge Graph + Gap Analysis)
+   → Layer 5 (Code-Computed Calibrated Scoring)
+   → Layer 6 (Evaluation Harness)
+   → Layer 7 (Provenance & Reproducibility)
+   → Outputs: Obsidian Vault | Courtroom UI | Decision Objects
+ ```

+ | Layer | Module | Purpose |
+ |-------|--------|---------|
+ | **0** | `layer0/parser.py` | PDF → section-aware regions with bbox, quality scores, cross-refs |
+ | **2** | `layer2/extractor.py` | AI Council extracts claims; Epistemic Separation Engine penalizes Abstract spin |
+ | **4** | `layer4/graph.py` | SQLite knowledge graph; typed edges; Gap Analysis finds structural holes |
+ | **5** | `layer5/scorer.py` | Code-computed 3-score system; parser confidence caps claims |
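
A minimal sketch of what a typed-edge claim graph can look like in SQLite. The schema and edge types here are illustrative assumptions, not the actual `layer4/graph.py` DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (
  id   INTEGER PRIMARY KEY,
  text TEXT NOT NULL
);
CREATE TABLE edges (
  src       INTEGER NOT NULL REFERENCES claims(id),
  dst       INTEGER NOT NULL REFERENCES claims(id),
  -- hypothetical edge vocabulary; the repo's types may differ
  edge_type TEXT NOT NULL
            CHECK (edge_type IN ('supports', 'contradicts', 'extends', 'cites'))
);
""")
conn.execute("INSERT INTO claims VALUES (1, 'Conductivity is 4.2e-4 S/cm'), (2, 'Conductivity is 1.1e-5 S/cm')")
conn.execute("INSERT INTO edges VALUES (1, 2, 'contradicts')")

# Gap analysis then reduces to graph queries, e.g. claims with no supporting edge:
unsupported = conn.execute("""
  SELECT c.id FROM claims c
  WHERE NOT EXISTS (SELECT 1 FROM edges e
                    WHERE e.dst = c.id AND e.edge_type = 'supports')
""").fetchall()
```

With edges typed, "structural holes" become plain SQL: claims with no corroboration, contradiction clusters, orphan sources.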

+ ## The 3-Score System

+ The LLM **never** sets the final confidence. It provides components; the code computes:

+ | Score | What It Measures |
+ |-------|------------------|
+ | **Evidence Quality** | evidence × study_quality × journal_tier × completeness × section_modifier |
+ | **Truth Likelihood** | evidence_quality + corroboration - conflict_penalty - null_penalty |
+ | **Qualifier Strength** | 1.0 - qualifier_count × 0.1 - null_penalty - inherited_penalty |

+ Key gates: parser confidence caps claim confidence; large N with a tiny effect is capped; Abstract claims get a 0.7× penalty.
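
The formulas and gates above can be sketched in code. Function names and example numbers are illustrative, not the actual `layer5/scorer.py` API:

```python
def evidence_quality(evidence, study_quality, journal_tier, completeness, section_modifier):
    # Multiplicative, so any weak factor drags the whole score down
    return evidence * study_quality * journal_tier * completeness * section_modifier

def truth_likelihood(eq, corroboration, conflict_penalty, null_penalty):
    # Additive corrections on top of evidence quality, clamped to [0, 1]
    return max(0.0, min(1.0, eq + corroboration - conflict_penalty - null_penalty))

def qualifier_strength(qualifier_count, null_penalty=0.0, inherited_penalty=0.0):
    return max(0.0, 1.0 - qualifier_count * 0.1 - null_penalty - inherited_penalty)

def cap_by_parser(claim_confidence, parser_confidence):
    # Gate: a claim can never be more confident than the parse it came from
    return min(claim_confidence, parser_confidence)

# Same claim scored from Results (1.0×) vs. the Abstract (0.7×)
eq_results  = evidence_quality(0.9, 0.85, 0.9, 1.0, 1.0)
eq_abstract = evidence_quality(0.9, 0.85, 0.9, 1.0, 0.7)
assert eq_abstract < eq_results
```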

+ ## Epistemic Separation Engine

+ | Source Section | Confidence Modifier |
+ |----------------|---------------------|
+ | Results (with stats) | 1.0× |
+ | Abstract | 0.7× (forced to Interpretation) |
+ | Discussion | 0.75× |

+ ## AI Model Council

+ | Member | Role |
+ |--------|------|
+ | **Query Planner** | Breaks questions into search queries |
+ | **Extractor** | Extracts atomic claims with epistemic tags + qualifiers |
+ | **Critic** | Reviews claims against source, flags errors |
+ | **Chairman** | Synthesizes final claims with a 0.7 completeness penalty |

+ ## ECC Harness — Companion AI System

+ ```python
+ from phd_research_os.agent_os import AgentOS
+ aos = AgentOS()
+ agent = aos.spawn_companion("DataQualityAuditor")
+ task = aos.assign_task(agent, "Audit last 50 claims")
+ result = aos.run_task(task)
+ ```

+ 5 built-in types: DataQualityAuditor, PromptOptimizer, DomainExpander, CalibrationAnalyst, CitationChaser. All output goes through Proposals requiring human approval.

+ ## Superpowers Skill Tree

+ 7 skills enforcing Design → Plan → Execute → Verify: Brainstorming, Writing Plans, Git Worktrees, TDD, Systematic Debugging, Code Review, Security Review.

+ ## Meta-Improver AI

+ InternalMonitor (7 quality metrics) + ExternalScanner (arXiv, HF Hub, GitHub) + SelfReflector (learns from acceptance/rejection) + ImprovementEngine (ranked proposals).

+ ## Quantum-Bio Taxonomy V2

+ 8-tier study types: in_vivo (1.0) → direct_physical_measurement (1.0) → mathematical_proof (0.95) → in_vitro (0.85) → first_principles_simulation (0.80) → phenomenological_simulation (0.60) → review (0.40) → perspective (0.20). 5 pre-built domains + custom.
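
The tier list maps naturally onto a lookup table. A sketch — the dictionary name and the conservative fallback for unknown types are assumptions, not the repo's schema:

```python
# Study-type → quality weight, per the 8-tier list above
STUDY_TYPE_WEIGHTS = {
    "in_vivo": 1.0,
    "direct_physical_measurement": 1.0,
    "mathematical_proof": 0.95,
    "in_vitro": 0.85,
    "first_principles_simulation": 0.80,
    "phenomenological_simulation": 0.60,
    "review": 0.40,
    "perspective": 0.20,
}

def study_quality(study_type: str) -> float:
    # Assumption: unrecognized custom types fall back to the most skeptical weight
    return STUDY_TYPE_WEIGHTS.get(study_type, 0.20)
```

This weight is one of the multiplicative factors in the Evidence Quality score, so a perspective piece can never outrank an in vivo measurement on evidence alone.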

+ ## Blindspot Audit (87 findings)

+ | Epoch | Focus | Count |
+ |-------|-------|-------|
+ | I: Architectural | Model & Inference | 10 |
+ | II: Epistemic | Logic & Truth | 27 |
+ | III: Judgment | Conflict & UI | 19 |
+ | IV: Systemic | Time & Impact | 25 |

+ 81 addressed. 6 acknowledged as fundamental limitations. Full audit: [BLINDSPOT_AUDIT_COMPLETE.md](https://huggingface.co/nkshirsa/phd-research-os-brain/blob/main/BLINDSPOT_AUDIT_COMPLETE.md)

+ ## Tests

  ```
+ test_v2_integration.py  — 24 ✅ (full pipeline)
+ test_db.py              — 22 ✅ (data layer)
+ test_agent_os.py        — 21 ✅ (ECC harness)
+ test_taxonomy.py        — 27 ✅ (taxonomy)
+ test_skills_and_meta.py — 30 ✅ (skills + meta)
+ test_council.py         — 19 ✅ (AI council)
+ Total: 143 passing
  ```
  ## Training

+ **ZeroGPU**: [nkshirsa/phd-research-os-train](https://huggingface.co/spaces/nkshirsa/phd-research-os-train) — set hardware to ZeroGPU, click Train repeatedly.

+ **Local** (needs GPU): `python train.py`

+ **Planned**: SFT → DPO → GRPO (epistemic rewards) → ConfTuner calibration.

  ## License