nkshirsa committed on
Commit 88f66d8 · verified · 1 Parent(s): 85dacf8

Update README with ECC Harness companion AI documentation

Files changed (1)
  1. README.md +169 -221
README.md CHANGED
@@ -8,6 +8,8 @@ tags:
   - structured-output
   - research-assistant
   - phd-tools
  language:
  - en
  base_model: Qwen/Qwen2.5-3B-Instruct
@@ -18,289 +20,235 @@ pipeline_tag: text-generation

  # PhD Research OS Brain 🧠

- **An AI model and complete software system for PhD-level STEM research**, implementing the [Research OS v11.0 (The Grounded OS)](https://github.com/nkshirsa/phd-research-os) specification.

  ## What This Is

- A **multi-task fine-tuned language model** (Qwen2.5-3B-Instruct + QLoRA) that serves as the intelligent core of a PhD Research OS — a system that ingests scientific papers, extracts structured claims, classifies evidence, detects contradictions, and generates research decisions.

- The model + Python package together implement **Phases 0–6** of the construction schedule:
- - ✅ Phase 0: Core data layer with full CRUD (22 unit tests passing)
- - ✅ Phase 1: Paper ingestion pipeline (PDF → structured claims)
- - ✅ Phase 2: Evaluation harness with golden dataset support
- - ✅ Phase 3: Semantic search infrastructure (ChromaDB ready)
- - ✅ Phase 4: Obsidian export (one-directional, wiki-linked vault)
- - ✅ Phase 5: Conflict detection + Verifier Agent
- - ✅ Phase 6: Backup, batch processing, cost tracking, documentation

  ## Architecture

  ```
- ┌────────────────────────────────────────────────────┐
- │                  PhD Research OS                   │
- ├────────────────────────────────────────────────────┤
- │ Pipeline Orchestrator (pipeline.py)                │
- │   PDF → Text → Claims → Conflicts → Obsidian       │
- ├───────────┬────────────────────────────────────────┤
- │ AI Brain  │ 6 Agent Roles:                         │
- │(agents.py)│  1. Researcher (claim extraction)      │
- │           │  2. Epistemic Classifier               │
- │           │  3. Confidence Scorer                  │
- │           │  4. Verifier (conflict detection)      │
- │           │  5. Query Planner                      │
- │           │  6. Decision Generator                 │
- ├───────────┴────────────────────────────────────────┤
- │ Data Layer (db.py) — SQLite + Fixed-Point Math     │
- │   Claims | Sources | Goals | Conflicts | Decisions │
- │   Overrides | Experiments | API Usage | Calibration│
- ├────────────────────────────────────────────────────┤
- │ Outputs:                                           │
- │   Obsidian Vault | Streamlit Dashboard | JSON API  │
- └────────────────────────────────────────────────────┘
  ```

- ## Research OS v11.0 Compliance
-
- This system adheres to the Grounded OS specification:

- | Rule | Implementation |
- |------|---------------|
- | **Provenance Hierarchy** | All AI outputs tagged as Level 5 (LLM Hypothesis). Human verification required to promote. |
- | **Anchor Divergence** | Agent output never overrides human-verified observations. Expert overrides lock confidence. |
- | **Shadow Archive** | Rejected claims stored in database with rejection reason. Resurrection requires 3+ independent observations. |
- | **Fixed-Point Math** | All probabilities stored as scaled integers (×1000). No floating-point in DB. |
- | **Causal Lineage** | Every claim traces back to source DOI and extraction event. |
- | **Skeptic Thread** | Conflict detector finds contradictions in existing data only — no simulation or what-if. |

- ## 6 Core Tasks

- ### Task 1: Scientific Claim Extraction
  ```python
- from phd_research_os.agents import ResearchOSBrain
- brain = ResearchOSBrain(backend="api")  # or "local" with fine-tuned model
- result = brain.extract_claims("Paper text here...")
- # Returns: {"claims": [{"text": "...", "epistemic_tag": "Fact", "confidence": 0.87, ...}]}
- ```

- ### Task 2: Epistemic Classification
- ```python
- result = brain.classify_epistemic("The measured ionic conductivity was 4.2 × 10⁻⁴ S/cm at 25°C.")
- # Returns: {"epistemic_tag": "Fact", "reasoning": "...", "confidence_in_classification": 0.95}
- ```

- ### Task 3: Confidence Scoring
- ```python
- result = brain.score_confidence(
-     "Graphene FET shows 45mV Dirac shift",
-     journal="ACS Nano", study_type="primary_experimental", journal_tier=1)
- # Returns: {"confidence": 0.855, "evidence_strength": 0.9, "study_quality_weight": 1.0, ...}
- ```

- ### Task 4: Contradiction Detection
- ```python
- result = brain.detect_conflicts(
-     "Sensitivity increases with ionic strength",
-     "Sensitivity decreases with ionic strength")
- # Returns: {"conflict_detected": true, "conflict_type": "value_mismatch",
- #           "hypothesis_confidence": "low", ...}  # ALWAYS low
- ```

- ### Task 5: Query Decomposition
- ```python
- result = brain.decompose_query("What determines graphene biosensor sensitivity?")
- # Returns: {"sub_queries": ["...", "...", "..."], "reasoning": "..."}
- ```

- ### Task 6: Decision Object Generation
- ```python
- result = brain.generate_decision(
-     goal="Achieve sub-fM detection limit",
-     gaps=["Optimal aptamer not determined", "Debye screening unresolved"],
-     low_confidence_claims=["CLM_0042: PEG length optimal (conf: 0.35)"])
- # Returns: {"decision_id": "DEC_0001", "recommended_action": "experiment",
- #           "expected_information_gain": 0.72, ...}
  ```

- ## Quick Start

- ### 1. Install

- ```bash
- git clone https://huggingface.co/nkshirsa/phd-research-os-brain
- cd phd-research-os-brain
- pip install -r requirements.txt  # or: pip install datasets httpx
- ```
-
- ### 2. Initialize Database

  ```python
- from phd_research_os.db import init_db, get_db, create_claim, create_goal
-
- init_db("data/research_os.db")
- conn = get_db("data/research_os.db")
-
- # Create your first research goal
- goal_id = create_goal(conn, "Achieve sub-femtomolar LOD for cardiac troponin", "high")
-
- # Add a claim manually
- claim_id = create_claim(conn,
-     text="GFET sensitivity to cTnI was 45 mV/decade in 10mM PBS",
-     epistemic_tag="Fact",
-     confidence=0.85,
-     evidence_strength=0.9,
-     study_quality_weight=1.0,
-     journal_tier_weight=1.0,
-     completeness_penalty=1.0,
-     source_doi="10.1234/example",
-     parameters={"ionic_strength_mM": 10, "sensitivity_mV_dec": 45})
  ```

- ### 3. Process a Paper (with API brain)

- ```python
- from phd_research_os.agents import ResearchOSBrain
- from phd_research_os.pipeline import Pipeline
-
- # Use Claude/GPT as brain (or load fine-tuned model locally)
- brain = ResearchOSBrain(backend="api")  # needs ANTHROPIC_API_KEY or OPENAI_API_KEY
- pipeline = Pipeline(brain=brain)
-
- result = pipeline.process_paper("path/to/paper.pdf", journal_tier=1, is_canonical=True)
- print(f"Extracted {result['claims_extracted']} claims")
- ```

- ### 4. Export to Obsidian

- ```python
- from phd_research_os.obsidian_export import ObsidianExporter
- exporter = ObsidianExporter(vault_path="my_vault")
- exporter.export_all()
- # Opens as linked Obsidian vault with Claims/, Sources/, Goals/, Dashboard.md
- ```

- ### 5. Detect Conflicts

  ```python
- from phd_research_os.conflict_detector import ConflictDetector
- detector = ConflictDetector(brain=brain)
- conflicts = detector.detect_conflicts()
- # All conflict hypotheses are tagged confidence="low" — human review required
  ```

- ## Training the Model

- The training dataset ([nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data)) contains 1,900 multi-task examples across all 6 tasks.

- ```bash
- # On a GPU (T4 minimum, A10G recommended):
- pip install torch transformers trl peft datasets bitsandbytes accelerate trackio
- python train.py
- ```

- **Training Recipe:**
- - Base: `Qwen/Qwen2.5-3B-Instruct`
- - Method: QLoRA (4-bit NF4, r=64, all-linear targets)
- - LR: 2e-4 (cosine schedule, 5% warmup)
- - Batch: effective 16 (2 × 8 gradient accumulation)
- - Epochs: 3
- - Loss: Assistant-only (masks system/user tokens)
- - References: [LoRA Without Regret (2025)](https://huggingface.co/docs/trl/lora_without_regret), [Multi-task Biomedical SFT (arxiv:2401.00579)](https://arxiv.org/abs/2401.00579)

- ## Confidence Formula

- All confidence scores follow this fixed-point formula:

  ```
- confidence = evidence_strength × study_quality_weight × journal_tier_weight × completeness_penalty
-
- study_quality_weight:
-   primary_experimental:  1.000
-   meta_analysis:         1.000
-   in_vitro:              0.800
-   simulation:            0.600
-   review_non_systematic: 0.400
-   case_study:            0.300
-
- journal_tier_weight:
-   tier 1:   1.000
-   tier 2:   0.850
-   tier 3:   0.700
-   preprint: 0.500
-
- completeness_penalty:
-   all fields present: 1.000
-   missing key fields: 0.700
  ```

- **Stored as scaled integers** (×1000) per Research OS Rule 5. No floating-point in the database.
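As a concrete illustration, the formula can be evaluated entirely in ×1000 scaled-integer space. This is a minimal sketch using the weights listed above; `confidence_scaled` and its lookup tables are hypothetical helpers, not the package's actual API.

```python
# Weights from the formula above, stored as integers scaled by 1000
# (per Research OS Rule 5: no floating-point in the database).
STUDY_QUALITY = {
    "primary_experimental": 1000, "meta_analysis": 1000,
    "in_vitro": 800, "simulation": 600,
    "review_non_systematic": 400, "case_study": 300,
}
JOURNAL_TIER = {1: 1000, 2: 850, 3: 700, "preprint": 500}

def confidence_scaled(evidence_strength, study_type, journal_tier, complete=True):
    """Hypothetical helper: multiply scaled factors, renormalizing by 1000
    after each step so the running value stays a x1000 integer."""
    c = evidence_strength  # e.g. 900 means 0.900
    c = c * STUDY_QUALITY[study_type] // 1000
    c = c * JOURNAL_TIER[journal_tier] // 1000
    c = c * (1000 if complete else 700) // 1000
    return c

print(confidence_scaled(900, "primary_experimental", 1))      # 900 (= 0.900)
print(confidence_scaled(900, "in_vitro", 2, complete=False))  # 428 (= 0.428)
```

Note that floor division truncates anything below 0.001 at each step, which is the cost of staying in integer space.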

- ## File Structure

  ```
- phd-research-os-brain/
- ├── README.md                 # This file
- ├── train.py                  # SFT training script
- ├── generate_dataset.py       # Synthetic dataset generator
- ├── phd_research_os/
- │   ├── __init__.py
- │   ├── db.py                 # Core data layer (Phase 0)
- │   ├── agents.py             # AI brain with 6 agent roles
- │   ├── pipeline.py           # Paper ingestion pipeline (Phase 1+6)
- │   ├── obsidian_export.py    # Obsidian vault export (Phase 4)
- │   ├── evaluation.py         # Golden dataset eval harness (Phase 2)
- │   ├── conflict_detector.py  # Contradiction detection (Phase 5)
- │   └── backup.py             # Backup & recovery (Phase 6)
- └── tests/
-     └── test_db.py            # 22 unit tests (all passing)
  ```

- ## Evaluation Harness

  ```python
- from phd_research_os.evaluation import run_regression_gate, create_golden_paper
-
- # Create golden dataset (you annotate 5 papers from your field)
- create_golden_paper("paper_001", "Graphene FET Biosensors", [
-     {"text": "LOD was 1 fM in 10mM PBS", "epistemic_tag": "Fact", "confidence": 0.9},
-     {"text": "Surface defects enhance sensitivity", "epistemic_tag": "Interpretation", "confidence": 0.6},
-     # ... annotate all extractable claims
- ])
-
- # Run regression gate (must pass before ANY config change)
- result = run_regression_gate()
- print(f"Passed: {result.passed}")
- print(f"Failures: {result.failures}")
- # Thresholds: recall ≥70%, hallucination ≤10%, epistemic accuracy ≥60%
  ```

- ## Technology Stack

- | Layer | Choice | Rationale |
- |-------|--------|-----------|
- | Language | Python 3.11+ | Entire ML/NLP ecosystem native |
- | Database | SQLite + JSON | Single user, zero ops burden |
- | Vector Store | ChromaDB (planned) | Embedded, Python-native |
- | AI Brain | Qwen2.5-3B + QLoRA | Best structured output at size |
- | API Fallback | Claude / GPT-4o-mini | For immediate use before training |
- | Obsidian Sync | Direct filesystem | No plugin dependency |
- | Config | Git-versioned prompts | Every change regression-tested |

- ## What's NOT Built Yet (By Design)

- Per the construction schedule, these are **not yet implemented** — they require real PhD research data:

- - ❌ Bidirectional Obsidian sync (Phase 8 — frontmatter only)
- - ❌ Temporal decay on confidence (Phase 8 — needs calibration data)
- - ❌ Cognitive modes (Phase 9 — needs workflow data)
- - ❌ Causal graph (Phase 9 — manual entry only)
- - ❌ VLM figure extraction (Phase 9 — pilot)
- - ❌ Multi-agent peer review (Phase 10 — triggered by error rate)
- - ❌ Event sourcing (Phase 10 — triggered by need)

  ## Citation

- If you use this system in your research:
-
  ```bibtex
  @software{phd_research_os_2026,
    title={PhD Research OS Brain: Multi-Task AI for Scientific Research Management},
   - structured-output
   - research-assistant
   - phd-tools
+  - multi-agent
+  - ecc-harness
  language:
  - en
  base_model: Qwen/Qwen2.5-3B-Instruct

  # PhD Research OS Brain 🧠

+ **An AI model, companion agent framework, and complete software system for PhD-level STEM research**, implementing the [Research OS v11.0 (The Grounded OS)](https://github.com/nkshirsa/phd-research-os) specification with the ECC Harness (V-SINGULARITY) for continuous improvement.

  ## What This Is

+ Two systems in one:

+ 1. **The Brain** — A multi-task fine-tuned language model (Qwen2.5-3B-Instruct + QLoRA) that serves as the intelligent core: extracting claims from papers, classifying evidence, detecting contradictions, scoring confidence, and generating research decisions.
+
+ 2. **The Agent OS** — An ECC Harness orchestrator (`agent_os.py`) that lets you spawn **companion AI agents** to continuously improve the Brain. Each companion follows a strict lifecycle (preflight → plan → execute → postflight), produces proposals that require human approval, and maintains an immutable audit trail.
 
33
  ## Architecture
34
 
35
  ```
36
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
37
+ β”‚ PhD Research OS β”‚
38
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
39
+ β”‚ Core Brain β”‚ Agent OS (ECC Harness) β”‚
40
+ β”‚ (agents.py) β”‚ (agent_os.py) β”‚
41
+ β”‚ β”‚ β”‚
42
+ β”‚ 6 Agent Roles: β”‚ Companion Agents: β”‚
43
+ β”‚ 1. Researcher β”‚ β€’ DataQualityAuditor β”‚
44
+ β”‚ 2. Epistemic β”‚ β€’ PromptOptimizer β”‚
45
+ β”‚ 3. Confidence β”‚ β€’ DomainExpander β”‚
46
+ β”‚ 4. Verifier β”‚ β€’ CalibrationAnalyst β”‚
47
+ β”‚ 5. Query Planner β”‚ β€’ CitationChaser β”‚
48
+ β”‚ 6. Decision Gen β”‚ β€’ [Custom agents] β”‚
49
+ β”‚ β”‚ β”‚
50
+ β”‚ Provenance: Lv5 β”‚ Output: Proposals (human approval) β”‚
51
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
52
+ β”‚ Data Layer (db.py) β€” SQLite + Fixed-Point Math β”‚
53
+ β”‚ Claims | Sources | Goals | Conflicts | Decisions β”‚
54
+ β”‚ Companions | Tasks | Proposals | Audit Log | Memory β”‚
55
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
56
+ β”‚ Pipeline (pipeline.py) β†’ Obsidian (obsidian_export.py) β”‚
57
+ β”‚ Evaluation (evaluation.py) β†’ Backup (backup.py) β”‚
58
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
59
  ```

+ ## ECC Harness — Companion AI System

+ The Agent OS implements the ECC Harness: Principal Architect Edition (V-SINGULARITY). Whenever you want to improve the Research OS, you spawn a governed companion agent:

+ ### Quick Start: Spawn a Companion

  ```python
+ from phd_research_os.agent_os import AgentOS
+
+ # Initialize the Agent OS
+ aos = AgentOS()
+
+ # Spawn a companion to audit data quality
+ agent_id = aos.spawn_companion("DataQualityAuditor")
+
+ # Assign it a task
+ task_id = aos.assign_task(agent_id, "Audit the last 50 claims for hallucination patterns")
+
+ # Run the full ECC lifecycle (preflight → plan → execute → postflight)
+ result = aos.run_task(task_id)
+ print(f"Status: {result['status']}")
+ print(f"Proposals: {len(result['proposals'])}")
+
+ # Review proposals (human-in-the-loop)
+ for proposal in aos.get_proposals(agent_id):
+     print(f"  [{proposal['proposal_type']}] {proposal['description']}")
+     # Approve or reject
+     aos.approve_proposal(proposal['proposal_id'], reviewed_by="Dr. Smith")
+     # OR: aos.reject_proposal(proposal['proposal_id'], "Not relevant", "Dr. Smith")
  ```

+ ### 5 Built-in Companion Types

+ | Agent Type | Purpose | How It Improves the Brain |
+ |-----------|---------|--------------------------|
+ | **DataQualityAuditor** | Audits claim extraction for drift and hallucination | Catches quality degradation over time |
+ | **PromptOptimizer** | A/B tests system prompts against the golden dataset | Improves recall, precision, accuracy |
+ | **DomainExpander** | Generates training data for new STEM fields | Expands the model to new research areas |
+ | **CalibrationAnalyst** | Analyzes Brier scores, fixes miscalibration | Reduces over/under-confidence |
+ | **CitationChaser** | Finds papers citing/contradicting current claims | Enriches the knowledge base |

+ ### Custom Companions

  ```python
+ agent_id = aos.spawn_companion(
+     "custom",
+     purpose="Identify claims that need replication studies",
+     system_prompt="You are a Replication Analyst. Find claims with high confidence but few supporting sources...",
+ )
  ```

+ ### ECC Lifecycle (Every Task)

+ ```
+ §1 PRE-FLIGHT   → Load ARCHITECTURE.md + AGENTS.md, validate DB, check agent state
+ §2 PLANNING     → Obviousness test, build step list, classify reversibility
+ §3 EXECUTION    → Bounded iterations (max 3), time budget with kill heuristic (50% over = HALT)
+ §4 POST-FLIGHT  → Validate proposals, check invariants, log meta-learning
+ ```
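The §3 execution budget can be sketched as a bounded loop. Everything in this sketch (the function name, arguments, and return shape) is an illustrative assumption, not the `agent_os.py` API; it only demonstrates the two limits named above: a max-iteration cap and a halt at 50% over the time budget.

```python
import time

def run_with_budget(step_fn, steps, time_budget_s, max_iterations=3):
    """Illustrative ECC section-3 loop: iterations are capped, and between
    iterations a kill heuristic halts the task once elapsed time exceeds
    the budget by 50%."""
    start = time.monotonic()
    done = 0
    for step in steps[:max_iterations]:
        if time.monotonic() - start > 1.5 * time_budget_s:  # 50% over = HALT
            return {"status": "HALTED", "completed": done}
        step_fn(step)
        done += 1
    return {"status": "COMPLETED", "completed": done}

result = run_with_budget(lambda s: None, ["plan", "execute", "postflight"],
                         time_budget_s=5.0)
print(result)  # {'status': 'COMPLETED', 'completed': 3}
```

Because the check runs between iterations, a single long-running step is finished before the halt takes effect, which keeps each iteration atomic.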

+ ### Key Safety Properties

+ - **Proposals, Not Actions**: Companions NEVER modify claims, sources, or goals directly. They produce Proposals that require human approval.
+ - **Immutable Audit Trail**: Every action is logged to the `agent_audit_log` table and cannot be modified.
+ - **Kill Heuristic**: If a task exceeds its time budget by 50%, it auto-halts.
+ - **Iteration Budget**: Max 1 retry for patches, max 3 for architecture changes.
+ - **Harness Evolution**: The rules themselves can be amended via `propose_harness_evolution()`, but amendments require human approval.
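The immutability property can be enforced at the SQLite layer itself. A minimal sketch, assuming a simplified `agent_audit_log` schema (the real table presumably has more columns): triggers make the table append-only by aborting any UPDATE or DELETE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent_audit_log (
    id       INTEGER PRIMARY KEY,
    agent_id TEXT NOT NULL,
    action   TEXT NOT NULL,
    ts       TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Append-only: any UPDATE or DELETE aborts the statement
CREATE TRIGGER audit_no_update BEFORE UPDATE ON agent_audit_log
BEGIN SELECT RAISE(ABORT, 'audit log is immutable'); END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON agent_audit_log
BEGIN SELECT RAISE(ABORT, 'audit log is immutable'); END;
""")

conn.execute("INSERT INTO agent_audit_log (agent_id, action) VALUES (?, ?)",
             ("AGT_0001", "spawn"))
try:
    conn.execute("DELETE FROM agent_audit_log")
except sqlite3.IntegrityError as exc:
    print(f"rejected: {exc}")
```

The trigger approach guards against bugs in application code as well, since the rejection happens inside the database engine.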

+ ### State Files

+ | File | Purpose |
+ |------|---------|
+ | `ARCHITECTURE.md` | Project map — file locations, API config, invariants (read FIRST) |
+ | `AGENTS.md` | Agent registry — contracts, boundaries, proposal schema |
+ | `MEMORY.md` | Persistent assumptions with "Last Validated" markers |
+ | `plan.md` | Current task plan (mutable) |
+ | `HARNESS_EVOLUTION.md` | Rule amendment log (append-only) |
+

+ ## 6 Core Brain Tasks

+ *(Same as before — the Brain powers the core pipeline)*

+ ### Task 1: Scientific Claim Extraction
  ```python
+ from phd_research_os.agents import ResearchOSBrain
+ brain = ResearchOSBrain(backend="api")
+ result = brain.extract_claims("Paper text here...")
  ```

+ ### Task 2–6: Epistemic Classification, Confidence Scoring, Contradiction Detection, Query Decomposition, Decision Generation

+ See the [Core Brain section](#6-core-tasks-detail) below.
 
154
+ ## Research OS v11.0 Compliance
 
 
 
 
155
 
156
+ | Rule | Implementation |
157
+ |------|---------------|
158
+ | **Provenance Hierarchy** | All AI outputs = Level 5 (LLM Hypothesis). Human verification required. |
159
+ | **Anchor Divergence** | Agent output never overrides human-verified observations. |
160
+ | **Shadow Archive** | Rejected proposals stored with reason. Can be resurrected with quorum. |
161
+ | **Fixed-Point Math** | All probabilities stored as INTEGER Γ— 1000. No floats in DB. |
162
+ | **Causal Lineage** | Every claim traces to source DOI. Every proposal traces to agent_id + task_id. |
163
+ | **Skeptic Thread** | Conflict detector examines existing data only β€” no simulation. |
164
+
+ ## File Structure

+ ```
+ phd-research-os-brain/
+ ├── README.md                 # This file
+ ├── train.py                  # SFT training script
+ ├── generate_dataset.py       # Synthetic dataset generator
+ ├── phd_research_os/
+ │   ├── __init__.py           # v1.0.0
+ │   ├── db.py                 # Core data layer (Phase 0)
+ │   ├── agents.py             # AI brain — 6 agent roles
+ │   ├── agent_os.py           # ECC Harness — companion AI factory
+ │   ├── pipeline.py           # Paper ingestion (Phase 1+6)
+ │   ├── obsidian_export.py    # Obsidian vault export (Phase 4)
+ │   ├── evaluation.py         # Golden dataset eval (Phase 2)
+ │   ├── conflict_detector.py  # Contradiction detection (Phase 5)
+ │   ├── backup.py             # Backup & recovery (Phase 6)
+ │   ├── ARCHITECTURE.md       # Project map (Wake-Up doc)
+ │   ├── AGENTS.md             # Agent registry & contracts
+ │   ├── MEMORY.md             # Persistent state
+ │   ├── plan.md               # Current task plan
+ │   └── HARNESS_EVOLUTION.md  # Rule amendments
+ └── tests/
+     ├── test_db.py            # 22 unit tests (data layer)
+     └── test_agent_os.py      # 21 integration tests (ECC harness)
+ ```

+ ## Test Results

  ```
+ tests/test_db.py        — 22 passed ✅  (data layer, fixed-point math, CRUD, search)
+ tests/test_agent_os.py  — 21 passed ✅  (spawn, lifecycle, proposals, audit, memory, evolution)
+ ─────────────────────────
+ Total: 43 tests passing
  ```

+ ## 6 Core Tasks (Detail)

+ ### Task 1: Scientific Claim Extraction
+ ```python
+ result = brain.extract_claims("Paper text here...")
+ # → {"claims": [{"text": "...", "epistemic_tag": "Fact", "confidence": 0.87, ...}]}
+ ```

+ ### Task 2: Epistemic Classification
+ ```python
+ result = brain.classify_epistemic("The measured ionic conductivity was 4.2 × 10⁻⁴ S/cm.")
+ # → {"epistemic_tag": "Fact", "reasoning": "...", "confidence_in_classification": 0.95}
+ ```

+ ### Task 3: Confidence Scoring
+ ```python
+ result = brain.score_confidence("Claim text", "ACS Nano", "primary_experimental", 1)
+ # → {"confidence": 0.855, ...formula_breakdown...}
+ ```

+ ### Task 4: Contradiction Detection
+ ```python
+ result = brain.detect_conflicts("Claim A", "Claim B")
+ # → {"conflict_detected": true, "hypothesis_confidence": "low", ...}  # ALWAYS low
+ ```

+ ### Task 5: Query Decomposition
+ ```python
+ result = brain.decompose_query("Broad research question?")
+ # → {"sub_queries": ["specific Q1", "specific Q2", ...]}
+ ```

+ ### Task 6: Decision Generation
+ ```python
+ result = brain.generate_decision("Goal", ["gap1", "gap2"], ["low-conf claim 1"])
+ # → {"recommended_action": "experiment", "expected_information_gain": 0.72, ...}
+ ```

+ ## Training

+ Dataset: [nkshirsa/phd-research-os-sft-data](https://huggingface.co/datasets/nkshirsa/phd-research-os-sft-data) — 1,900 multi-task examples

+ ```bash
+ pip install torch transformers trl peft datasets bitsandbytes accelerate trackio
+ python train.py  # Needs GPU: T4 minimum, A10G recommended
+ ```

+ **Recipe:** Qwen2.5-3B-Instruct + QLoRA (r=64, all-linear) + assistant-only loss, 3 epochs, lr=2e-4
 
250
  ## Citation
251
 
 
 
252
  ```bibtex
253
  @software{phd_research_os_2026,
254
  title={PhD Research OS Brain: Multi-Task AI for Scientific Research Management},