P1 PPO Smoke Note

This note records the current tiny low-fidelity PPO smoke pass on the repaired 4-knob P1 environment.

Purpose

This run is diagnostic-only.

It exists to answer:

  • can a small PPO policy interact with the low-fidelity environment without code-path failures?
  • can the smoke trainer quickly discover any real positive repair signal?
  • what is the first obvious remaining behavior problem before any broader training push?

It does not validate the high-fidelity submit contract.

Command

uv sync --extra training
uv run --extra training python training/ppo_smoke.py --eval-episodes 3 --seed 0

Artifact

  • ignored runtime artifact: training/artifacts/ppo_smoke/ppo_smoke_20260308T090723Z.json

Configuration

  • training mode: low-fidelity only
  • training reset seeds: easy-seed curriculum (seed 2 only)
  • evaluation reset seeds: frozen seeds 0, 1, and 2
  • action space: 2 diagnostic repair actions
    • rotational_transform increase medium
    • triangularity_scale increase medium
  • submit: intentionally excluded from the smoke loop
  • total timesteps: 32
  • evaluation episodes: 3
  • device: cpu
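
The configuration above can be written down as one small, immutable object. This is a minimal sketch only: `SmokeConfig`, `action_for_index`, and every field name are hypothetical and are not taken from `training/ppo_smoke.py`; only the values come from this note.

```python
from dataclasses import dataclass

# Hypothetical container; the real trainer may organize these settings differently.
@dataclass(frozen=True)
class SmokeConfig:
    train_seeds: tuple = (2,)           # easy-seed curriculum
    eval_seeds: tuple = (0, 1, 2)       # frozen evaluation seeds
    actions: tuple = (                  # the two diagnostic repair actions
        ("rotational_transform", "increase", "medium"),
        ("triangularity_scale", "increase", "medium"),
    )
    include_submit: bool = False        # submit stays out of the smoke loop
    total_timesteps: int = 32
    eval_episodes: int = 3
    device: str = "cpu"


def action_for_index(config, index):
    """Map a discrete policy output to a (parameter, direction, magnitude) triple."""
    return config.actions[index]


config = SmokeConfig()
```

Keeping the action list inside the config makes the size of the discrete action space (`len(config.actions)`) follow directly from the listed repair actions.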

Result

  • the smoke path executed successfully and wrote a trajectory artifact
  • the trained policy discovered a real positive repair move and reached feasibility on the easy evaluation seed
  • the trained policy does not yet generalize across the frozen evaluation seeds: seeds 0 and 1 stay infeasible
  • summary metrics:
    • mean_eval_reward = -0.1204
    • constraint_satisfaction_rate = 0.3333 (1 of 3 evaluation episodes reached feasibility)
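
As a sanity check on the two summary metrics, the arithmetic can be sketched as follows. It assumes one evaluation episode per frozen seed and a plain mean over episodes; neither assumption is confirmed by the trainer code, and `summarize` is a hypothetical helper. The seed 2 reward is the value reported under Current Behavior in this note.

```python
def summarize(rewards, feasible_flags):
    """Return (mean reward, fraction of episodes that ended feasible)."""
    n = len(rewards)
    return sum(rewards) / n, sum(feasible_flags) / n


# Feasibility per frozen seed 0, 1, 2, as reported in this note.
feasible_flags = [False, False, True]
rate = sum(feasible_flags) / len(feasible_flags)

# Reported values.
mean_eval_reward = -0.1204
seed2_reward = 3.1533

# ASSUMPTION: plain mean over three episodes, one per seed.
# Under that assumption, seeds 0 and 1 together contribute roughly this much reward.
implied_sum_seeds_0_and_1 = mean_eval_reward * 3 - seed2_reward
```

Under these assumptions the rate rounds to 0.3333, and the two failing seeds together account for about -3.51 of total reward, which is consistent with both of them timing out with negative reward.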

What Changed

The original smoke trainer was asking PPO to solve too much at once:

  • a 25-action search space
  • mixed one-step and two-step repair behavior across seeds
  • no direct access to current control parameters in the observation
  • no early stop on first success or first failure

The current smoke trainer is now intentionally narrower:

  • observation includes current control parameters and explicit low-fidelity state fields
  • episodes stop on first feasible crossing or first failed evaluation
  • the action space is narrowed to the two known repair actions
  • training uses the easiest frozen seed so the smoke question stays diagnostic
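
The early-stop rule in the list above can be sketched as a single pure function. This is an illustration, not the trainer's actual code: `episode_should_end`, its arguments, and the reason strings are hypothetical, and `max_steps` stands in for whatever step budget the smoke loop uses.

```python
def episode_should_end(is_feasible, evaluation_failed, step, max_steps):
    """Decide whether a smoke episode ends after the current step.

    Returns (done, reason), where reason is None while the episode continues.
    """
    if evaluation_failed:
        # First failed evaluation: stop rather than keep stepping a broken design.
        return True, "failed_evaluation"
    if is_feasible:
        # First feasible crossing: the repair succeeded, so stop here.
        return True, "feasible"
    if step >= max_steps:
        # Step budget exhausted without success or failure.
        return True, "timeout"
    return False, None
```

Checking the failure case first means a step that both fails evaluation and reports feasibility is never counted as a success.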

Current Behavior

Deterministic evaluation now repeats a single repair-seeking action instead of falling into the older crashing loop:

  • policy action: triangularity_scale increase medium
  • seed 2: reaches feasibility in one step with reward +3.1533
  • seeds 0 and 1: the policy keeps pushing triangularity in the same direction, yields no further improvement after the first few steps, and times out with negative reward
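
The failure pattern on seeds 0 and 1 (one action repeated with no further gain) can be flagged mechanically when inspecting a trajectory. The sketch below is hypothetical: `is_stalled`, its arguments, and the idea of a per-step score where higher is better are illustrative assumptions, not fields of the real artifact.

```python
def is_stalled(actions, scores, patience=3, tol=1e-9):
    """True if the last `patience` steps reuse one action and the score never improves.

    actions[i] is the action taken at step i; scores[i] is the score after that step
    (higher is better). Both are hypothetical trajectory fields.
    """
    if len(actions) != len(scores):
        raise ValueError("actions and scores must have the same length")
    if len(actions) <= patience:
        return False  # too short to judge
    if len(set(actions[-patience:])) != 1:
        return False  # the policy is still varying its action
    baseline = scores[-patience - 1]
    return max(scores[-patience:]) <= baseline + tol
```

A one-step success such as seed 2 is never flagged, because the trajectory is shorter than the patience window.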

This is useful smoke evidence because it shows:

  • the PPO training path is wired correctly enough to produce trajectories
  • the smoke trainer can now discover a real positive repair signal instead of only collapsing into a bad local action
  • the next remaining issue is not plumbing; it is limited cross-seed generalization even inside the smoke action subset

Current Conclusion

The smoke pass is now doing the right job:

  • it is still diagnostic-only
  • it proves the low-fidelity PPO path can find at least one real repair action
  • it still leaves broader training work open, because the policy does not yet solve all frozen seeds