P1 PPO Smoke Note
This note records the current tiny low-fidelity PPO smoke pass on the repaired 4-knob P1 environment.
Purpose
This run is diagnostic-only.
It exists to answer:
- can a small PPO policy interact with the low-fidelity environment without code-path failures
- can the smoke trainer discover any real positive repair signal quickly
- what is the first obvious remaining behavior problem before any broader training push
It does not validate the high-fidelity submit contract.
Command
uv sync --extra training
uv run --extra training python training/ppo_smoke.py --eval-episodes 3 --seed 0
Artifact
- ignored runtime artifact: training/artifacts/ppo_smoke/ppo_smoke_20260308T090723Z.json
Configuration
- training mode: low-fidelity only
- training reset seeds: easy-seed curriculum (seed 2 only)
- evaluation reset seeds: frozen seeds 0, 1, and 2
- action space: 2 diagnostic repair actions
  - rotational_transform increase medium
  - triangularity_scale increase medium
- submit: intentionally excluded from the smoke loop
- total timesteps: 32
- evaluation episodes: 3
- device: cpu
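For reference, the settings above can be mirrored in a small config object. This is an illustrative sketch only; the field names and the `SmokeConfig` class are hypothetical, not the smoke trainer's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeConfig:
    """Illustrative mirror of the smoke-run settings listed above (hypothetical names)."""
    training_mode: str = "low-fidelity"
    train_seeds: tuple = (2,)        # easy-seed curriculum: seed 2 only
    eval_seeds: tuple = (0, 1, 2)    # frozen evaluation seeds
    actions: tuple = (
        "rotational_transform increase medium",
        "triangularity_scale increase medium",
    )                                 # submit intentionally excluded
    total_timesteps: int = 32
    eval_episodes: int = 3
    device: str = "cpu"


config = SmokeConfig()
```

Freezing the config like this makes the diagnostic scope of the run explicit: anything not in this object is out of bounds for the smoke pass.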
Result
- the smoke path executed successfully and wrote a trajectory artifact
- the trained policy discovered a real positive repair move and reached feasibility on the easy evaluation seed
- the trained policy did not generalize across all frozen evaluation seeds yet
- summary metrics:
  - mean_eval_reward = -0.1204
  - constraint_satisfaction_rate = 0.3333
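The constraint satisfaction rate follows directly from the per-seed outcomes described under Current Behavior: feasible on seed 2, infeasible on seeds 0 and 1. A minimal check:

```python
# One feasible evaluation seed (seed 2) out of the three frozen
# evaluation seeds (0, 1, 2) yields the reported satisfaction rate.
eval_feasible = {0: False, 1: False, 2: True}

rate = sum(eval_feasible.values()) / len(eval_feasible)
print(round(rate, 4))  # 0.3333
```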
What Changed
The original smoke trainer was asking PPO to solve too much at once:
- a 25-action search space
- mixed one-step and two-step repair behavior across seeds
- no direct access to current control parameters in the observation
- no early stop on first success or first failure
The current smoke trainer is now intentionally narrower:
- observation includes current control parameters and explicit low-fidelity state fields
- episodes stop on first feasible crossing or first failed evaluation
- the action space is narrowed to the two known repair actions
- training uses the easiest frozen seed so the smoke question stays diagnostic
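The early-stop behavior above can be sketched as a rollout loop that terminates on the first feasible crossing or first failed evaluation. The environment here is a hypothetical stub, not the repo's actual env class; it exists only to show the stopping logic:

```python
class StubLowFidelityEnv:
    """Hypothetical stand-in: becomes feasible after two repair steps."""

    def __init__(self):
        self.steps = 0

    def step(self, action):
        self.steps += 1
        feasible = self.steps >= 2   # first feasible crossing
        failed = False               # this stub never fails an evaluation
        done = feasible or failed    # terminal on first success or failure
        reward = 3.0 if feasible else -0.1
        return reward, done


def run_episode(env, policy, max_steps=16):
    """Roll one episode, stopping at the first terminal signal."""
    total = 0.0
    for _ in range(max_steps):
        reward, done = env.step(policy())
        total += reward
        if done:                     # early stop instead of padding to max_steps
            break
    return total


env = StubLowFidelityEnv()
total = run_episode(env, lambda: "triangularity_scale increase medium")
print(total)  # -0.1 + 3.0 = 2.9, after exactly two steps
```

Terminating on the first success or failure keeps each smoke episode short, so the 32-timestep budget is spent on distinct repair attempts rather than post-success filler.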
Current Behavior
Deterministic evaluation now repeats a repair-seeking action instead of the older crashing loop:
- policy action: triangularity_scale increase medium
- seed 2: reaches feasibility in one step with reward +3.1533
- seeds 0 and 1: keeps pushing triangularity in the same direction, improves nothing after the first few steps, and times out with negative reward
This is useful smoke evidence because it shows:
- the PPO training path is wired correctly enough to produce trajectories
- the smoke trainer can now discover a real positive repair signal instead of only collapsing into a bad local action
- the next remaining issue is not plumbing; it is limited cross-seed generalization even inside the smoke action subset
Current Conclusion
The smoke pass is now doing the right job:
- it is still diagnostic-only
- it proves the low-fidelity PPO path can find at least one real repair action
- it still leaves broader training work open, because the policy does not yet solve all frozen seeds