P1 PPO Smoke Note
This note records the current tiny low-fidelity PPO smoke pass on the repaired 4-knob P1 environment.
Purpose
This run is diagnostic-only.
It exists to answer:
- can a small PPO policy interact with the low-fidelity environment without code-path failures
- can the smoke trainer discover any real positive repair signal quickly
- what is the first obvious remaining behavior problem before any broader training push
It does not validate the high-fidelity submit contract.
Command
uv sync --extra training
uv run --extra training python training/ppo_smoke.py --eval-episodes 3 --seed 0
Artifact
- ignored runtime artifact: training/artifacts/ppo_smoke/ppo_smoke_20260308T090723Z.json
Configuration
- training mode: low-fidelity only
- training reset seeds: easy-seed curriculum (seed 2 only)
- evaluation reset seeds: frozen seeds 0, 1, and 2
- action space: 2 diagnostic repair actions
  - rotational_transform increase medium
  - triangularity_scale increase medium
- submit: intentionally excluded from the smoke loop
- total timesteps: 32
- evaluation episodes: 3
- device: cpu
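For reference, the settings above can be mirrored in a small config object. This is an illustrative sketch only; the field names and the `SmokeConfig` class are hypothetical, not the smoke trainer's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeConfig:
    """Illustrative mirror of the smoke-run settings listed above (hypothetical names)."""
    training_mode: str = "low-fidelity"
    train_seeds: tuple = (2,)        # easy-seed curriculum: seed 2 only
    eval_seeds: tuple = (0, 1, 2)    # frozen evaluation seeds
    actions: tuple = (
        "rotational_transform increase medium",
        "triangularity_scale increase medium",
    )                                 # submit intentionally excluded
    total_timesteps: int = 32
    eval_episodes: int = 3
    device: str = "cpu"


config = SmokeConfig()
```

Freezing the config like this makes the diagnostic scope of the run explicit: anything not in this object is out of bounds for the smoke pass.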
Result
- the smoke path executed successfully and wrote a trajectory artifact
- the trained policy discovered a real positive repair move and reached feasibility on the easy evaluation seed
- the trained policy did not generalize across all frozen evaluation seeds yet
- summary metrics:
  - mean_eval_reward = -0.1204
  - constraint_satisfaction_rate = 0.3333
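The constraint satisfaction rate follows directly from the per-seed outcomes described under Current Behavior: feasible on seed 2, infeasible on seeds 0 and 1. A minimal check:

```python
# One feasible evaluation seed (seed 2) out of the three frozen
# evaluation seeds (0, 1, 2) yields the reported satisfaction rate.
eval_feasible = {0: False, 1: False, 2: True}

rate = sum(eval_feasible.values()) / len(eval_feasible)
print(round(rate, 4))  # 0.3333
```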
What Changed
The original smoke trainer was asking PPO to solve too much at once:
- a 25-action search space
- mixed one-step and two-step repair behavior across seeds
- no direct access to current control parameters in the observation
- no early stop on first success or first failure
The current smoke trainer is now intentionally narrower:
- observation includes current control parameters and explicit low-fidelity state fields
- episodes stop on first feasible crossing or first failed evaluation
- the action space is narrowed to the two known repair actions
- training uses the easiest frozen seed so the smoke question stays diagnostic
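The early-stop behavior above can be sketched as a rollout loop that terminates on the first feasible crossing or first failed evaluation. The environment here is a hypothetical stub, not the repo's actual env class; it exists only to show the stopping logic:

```python
class StubLowFidelityEnv:
    """Hypothetical stand-in: becomes feasible after two repair steps."""

    def __init__(self):
        self.steps = 0

    def step(self, action):
        self.steps += 1
        feasible = self.steps >= 2   # first feasible crossing
        failed = False               # this stub never fails an evaluation
        done = feasible or failed    # terminal on first success or failure
        reward = 3.0 if feasible else -0.1
        return reward, done


def run_episode(env, policy, max_steps=16):
    """Roll one episode, stopping at the first terminal signal."""
    total = 0.0
    for _ in range(max_steps):
        reward, done = env.step(policy())
        total += reward
        if done:                     # early stop instead of padding to max_steps
            break
    return total


env = StubLowFidelityEnv()
total = run_episode(env, lambda: "triangularity_scale increase medium")
print(total)  # -0.1 + 3.0 = 2.9, after exactly two steps
```

Terminating on the first success or failure keeps each smoke episode short, so the 32-timestep budget is spent on distinct repair attempts rather than post-success filler.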
Current Behavior
Deterministic evaluation now repeats a repair-seeking action instead of the older crashing loop:
- policy action: triangularity_scale increase medium
- seed 2: reaches feasibility in one step with reward +3.1533
- seeds 0 and 1: keeps pushing triangularity in the same direction, improves nothing after the first few steps, and times out with negative reward
This is useful smoke evidence because it shows:
- the PPO training path is wired correctly enough to produce trajectories
- the smoke trainer can now discover a real positive repair signal instead of only collapsing into a bad local action
- the next remaining issue is not plumbing; it is limited cross-seed generalization even inside the smoke action subset
Current Conclusion
The smoke pass is now doing the right job:
- it is still diagnostic-only
- it proves the low-fidelity PPO path can find at least one real repair action
- it still leaves broader training work open, because the policy does not yet solve all frozen seeds