Codette-Reasoning / docs /training /codette-training-labEVALUATION_FRAMEWORK_SUMMARY.md
Jonathan Harrison
Full Codette codebase sync β€” transparency release
74f2af5

Evaluation Framework: Ready for Sprint

Date: 2026-03-19 Status: Framework Complete, Ready to Execute


What Changed

We're shifting from implementation validation β†’ empirical validation.

Phase 6 Status

Aspect Status Notes
Code βœ… Complete 1,330 lines across 5 components
Unit Tests βœ… 14/14 Pass All components tested individually
Integration βœ… Verified ForgeEngine loads Phase 6 correctly
Empirical Validation ⚠️ Not Yet THIS IS WHAT WE'RE DOING NOW

Evaluation Framework (Created)

1. Test Suite: 25 Rigorous Questions

  • Physics: Factual, technical (speed of light, blue sky, entropy)
  • Ethics: Rubric-based, multiple valid frameworks (honesty, transparency, morality)
  • Consciousness: Hard problems (machine consciousness, mind-body, qualia)
  • Creativity: Definition-dependent (what makes something creative?)
  • Systems: Abstract (emergence, feedback, balance)
  • Interdisciplinary: Complex reasoning (free will, knowledge, time)

Key Property: Each question has ground truth (factual or rubric-based) that we can score.

2. Four Testing Conditions

BASELINE
β”œβ”€ Plain Llama-3.1-8B (no routing, no debate)
β”œβ”€ Single response in ~5 seconds
└─ Establishes floor (what does model do alone?)

PHASE 1-5
β”œβ”€ Multi-round debate, memory weighting
β”œβ”€ NO semantic tension (heuristic opposition only)
β”œβ”€ NO specialization tracking
β”œβ”€ NO preflight prediction
β”œβ”€ Establishes debate value (does debating help?)
└─ ~30 seconds

PHASE 6 FULL
β”œβ”€ Everything Phase 1-5 PLUS:
β”‚  β”œβ”€ Semantic tension (Llama embeddings)
β”‚  β”œβ”€ Specialization tracking
β”‚  └─ Pre-flight prediction
β”œβ”€ Establishes Phase 6 total value
└─ ~40 seconds

PHASE 6 -PREFLIGHT
β”œβ”€ Phase 6 full EXCEPT no preflight
β”œβ”€ Isolates pre-flight contribution
└─ ~35 seconds

3. Five Key Metrics

Metric What Why Red Flag
Correctness % right answers THE metric Phase 6 < Baseline
Reasoning Depth # perspectives identified Quality of debate All conditions same
Calibration Error |confidence - accuracy| Trust in system >0.3 for Phase 6
Adapter Convergence Similarity of outputs Monoculture risk >0.85
Debate Efficiency Rounds to convergence Compute waste Phase 6 worse than 1-5

4. Emergent Behavior Monitoring

Three Critical Alerts:

  1. False Consensus: High Ξ“ (0.8+) but low correctness (<0.5)

    • System confident in wrong answer
    • Symptom of gaming coherence metric
  2. Semantic Convergence: Adapter outputs >0.85 similar

    • Loss of perspective diversity
    • Specialization tracking failed
  3. Miscalibration: Reported confidence β‰  actual correctness

    • System can't distinguish right from wrong
    • Can't know when to ask for help

Evaluation Sprint Structure

Phase 1: Smoke Test (Week 1)

python evaluation/run_evaluation_sprint.py --questions 5
  • 5 Γ— 4 conditions = 20 debates
  • ~15 minutes
  • Goal: Verify harness works, see initial patterns

Phase 2: Full Evaluation (Week 2)

python evaluation/run_evaluation_sprint.py --questions 25
  • 25 Γ— 4 conditions = 100 debates
  • ~2-3 hours
  • Goal: Statistical power for real conclusions

Phase 3: Analysis (Week 3)

  • Compute statistics (mean, std deviation)
  • Check for red flags
  • Statistical significance tests (t-tests, effect sizes)
  • Ablation analysis (which Phase 6 component adds value?)

Phase 4: Decisions (Week 4)

  • Strong Results? β†’ Ship Phase 6
  • Weak Results? β†’ Refine (tune weights, debug)
  • Broken Results? β†’ Pivot to Phase 7

Expected Outcomes

Best Case Scenario

Phase 1-5:    65% mean correctness
Phase 6 Full: 76% mean correctness
Improvement:  +11 percentage points (statistically significant)
Conclusion:   Phase 6 is clearly better, ship it

Realistic Scenario

Phase 1-5:    68% mean correctness
Phase 6 Full: 75% mean correctness
Improvement:  +7 percentage points (borderline significant)
Conclusion:   Phase 6 helps, but marginal. Investigate bottlenecks

Worst Case Scenario

Phase 1-5:    70% mean correctness
Phase 6 Full: 68% mean correctness
Improvement:  -2 percentage points (worse!)
Conclusion:   Phase 6 breaks something. Debug and fix

Risk Scenario

Phase 6 Full:
  - Correctness: 75%
  - Gamma: 0.85 (high coherence)
  - Calibration error: 0.4 (miscalibrated)
Conclusion:   System gaming coherence. Need external ground truth signal.

Files Created

File Purpose
evaluation/test_suite_evaluation.py 25-question test suite + evaluation harness
evaluation/run_evaluation_sprint.py Runner script with CLI
EVALUATION_STRATEGY.md Detailed strategy document
EVALUATION_FRAMEWORK_SUMMARY.md This file

What This Answers

Right Now:

  • Code works βœ…
  • Components integrated βœ…
  • Unit tests pass βœ…

After Evaluation:

  • Is it actually better? ❓
  • Which Phase 6 components add value? ❓
  • Is the system gaming metrics? ❓
  • Should Phase 7 research begin? ❓

Key Insight

We've built something mathematically coherent and architecturally sound.

But we don't yet know if it works empirically.

This evaluation sprint will answer that question rigorously.

If Phase 6 helps: ship it and begin Phase 7 research If Phase 6 doesn't help: understand why and refine If Phase 6 breaks things: fix and retest

No more guessing. Just measurement.


Ready to Begin?

Smoke Test (Quick)

cd J:\codette-training-lab
python evaluation/run_evaluation_sprint.py --questions 5

Expected: ~15 minutes, initial patterns emerge

Full Evaluation (Comprehensive)

python evaluation/run_evaluation_sprint.py --questions 25

Expected: ~2-3 hours, statistically sound conclusions


Next Steps

  1. Run smoke test β†’ Verify evaluator works
  2. Check for implementation bugs β†’ Fix as needed
  3. Run full evaluation β†’ Collect 100 debates' worth of data
  4. Analyze results β†’ Understand which conditions win
  5. Make decision β†’ Ship, refine, or pivot

This is the bottleneck between "we built it" and "it actually works."

Let's break through it with measurement.