# P1 PPO Smoke Note

This note records the current tiny low-fidelity PPO smoke pass on the repaired 4-knob `P1` environment.

## Purpose

This run is diagnostic-only. It exists to answer:

- can a small PPO policy interact with the low-fidelity environment without code-path failures
- can the smoke trainer discover any real positive repair signal quickly
- what is the first obvious remaining behavior problem before any broader training push

It does **not** validate the high-fidelity `submit` contract.

## Command

```bash
uv sync --extra training
uv run --extra training python training/ppo_smoke.py --eval-episodes 3 --seed 0
```

## Artifact

- ignored runtime artifact: `training/artifacts/ppo_smoke/ppo_smoke_20260308T090723Z.json`

## Configuration

- training mode: low-fidelity only
- training reset seeds: easy-seed curriculum (`seed 2` only)
- evaluation reset seeds: frozen seeds `0`, `1`, and `2`
- action space: 2 diagnostic repair actions
  - `rotational_transform increase medium`
  - `triangularity_scale increase medium`
- `submit`: intentionally excluded from the smoke loop
- total timesteps: `32`
- evaluation episodes: `3`
- device: `cpu`

## Result

- the smoke path executed successfully and wrote a trajectory artifact
- the trained policy discovered a real positive repair move and reached feasibility on the easy evaluation seed
- the trained policy did **not** generalize across all frozen evaluation seeds yet
- summary metrics:
  - `mean_eval_reward = -0.1204`
  - `constraint_satisfaction_rate = 0.3333`

## What Changed

The original smoke trainer was asking PPO to solve too much at once:

- a 25-action search space
- mixed one-step and two-step repair behavior across seeds
- no direct access to current control parameters in the observation
- no early stop on first success or first failure

The current smoke trainer is now intentionally narrower:

- observation includes current control parameters and explicit low-fidelity state fields
- episodes stop on first feasible crossing or first failed evaluation
- the action space is narrowed to the two known repair actions
- training uses the easiest frozen seed so the smoke question stays diagnostic

## Current Behavior

Deterministic evaluation now repeats a repair-seeking action instead of the older crashing loop:

- policy action: `triangularity_scale increase medium`
- seed `2`: reaches feasibility in one step with reward `+3.1533`
- seeds `0` and `1`: keeps pushing triangularity in the same direction, improves nothing after the first few steps, and times out with negative reward

This is useful smoke evidence because it shows:

- the PPO training path is wired correctly enough to produce trajectories
- the smoke trainer can now discover a real positive repair signal instead of only collapsing into a bad local action
- the next remaining issue is not plumbing; it is limited cross-seed generalization even inside the smoke action subset

## Current Conclusion

The smoke pass is now doing the right job:

- it is still diagnostic-only
- it proves the low-fidelity PPO path can find at least one real repair action
- it still leaves broader training work open, because the policy does not yet solve all frozen seeds
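
## Appendix: Summary Metric Check

The two summary metrics under Result can be cross-checked against the per-seed behavior with a short sketch. It assumes both metrics are plain averages over the three evaluation episodes (one per frozen seed); the note does not state this explicitly. The record fields (`total_reward`, `feasible`) and the helper `summarize_eval` are hypothetical names, not the schema of the JSON artifact. Only the seed `2` reward comes from this note; the rewards for seeds `0` and `1` are placeholders, split evenly so that the mean lands on the reported value.

```python
from statistics import mean


def summarize_eval(episodes):
    """Aggregate per-episode evaluation records into two summary metrics.

    Hypothetical helper: assumes each record carries a total episode reward
    and a flag for whether the episode ended in a feasible state.
    """
    if not episodes:
        raise ValueError("need at least one evaluation episode")
    return {
        "mean_eval_reward": mean(ep["total_reward"] for ep in episodes),
        "constraint_satisfaction_rate": mean(
            1.0 if ep["feasible"] else 0.0 for ep in episodes
        ),
    }


episodes = [
    # Placeholder rewards: the note only says seeds 0 and 1 time out
    # with negative reward; the actual per-seed values are not recorded here.
    {"seed": 0, "total_reward": -1.75725, "feasible": False},
    {"seed": 1, "total_reward": -1.75725, "feasible": False},
    # Reward for seed 2 as reported under Current Behavior.
    {"seed": 2, "total_reward": 3.1533, "feasible": True},
]

summary = summarize_eval(episodes)
print({name: round(value, 4) for name, value in summary.items()})
```

Under that averaging assumption, one feasible seed out of three accounts for `constraint_satisfaction_rate = 0.3333`, and `mean_eval_reward = -0.1204` would imply that the two timed-out episodes sum to roughly `-3.5145` in reward.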