Space: Pratyush-01/physix-live
Pratyush-01 committed about 1 month ago

Commit 12e2f97 (verified) · 1 parent: 33cfd0a

docs: explain cold small model context for SFT rationale

Files changed (1)

docs/writeup.md CHANGED

@@ -56,9 +56,11 @@ All reward is computed from `scipy.odeint` output — no model-in-the-loop scori
 ### Why SFT first
-With a cold base model, ~80% of completions are unparseable. GRPO needs variance across rollouts — if everything scores ~0, there's no gradient. SFT on 192 synthetic `(prompt, ground_truth)` pairs from the environment fixes the output format so GRPO has something to work with.
-After SFT: >90% of completions parse and simulate successfully.
+Qwen2.5-3B is a small, cold base model. Out of the box it has no concept of our ODE grammar or JSON output format — ~80% of its completions are LaTeX, prose, or malformed JSON that the verifier can't parse at all. GRPO needs *variance in reward* across rollouts to estimate advantages; if every rollout scores ~0 because nothing parses, the gradient is zero and nothing learns.
+SFT on 192 synthetic `(prompt, ground_truth)` pairs from the environment teaches the model the output format before RL begins. 3 epochs, ~5 min on L40S.
+After SFT: >90% of completions parse and simulate successfully — GRPO now has a signal to work with.
 ### GRPO
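
Note on the "variance in reward" argument in the added text: it can be illustrated with a minimal sketch, assuming the usual group-normalized GRPO advantage (each rollout's reward minus the group mean, divided by the group standard deviation). The function name `group_advantages`, the `eps` value, and the reward numbers are hypothetical and chosen only for illustration; the trainer used in this project may compute advantages differently.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the rollouts of a single prompt.

    Assumed formula (not taken from the project's code):
        A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Cold base model: nothing parses, so every rollout scores 0.
cold = group_advantages([0.0, 0.0, 0.0, 0.0])

# After SFT: most rollouts parse and earn different rewards.
# These reward values are made up for the example.
warm = group_advantages([0.0, 0.4, 0.9, 0.7])

print("cold advantages:", cold)  # all zeros, so no policy-gradient signal
print("warm advantages:", warm)  # mixed signs, so rollouts can be ranked
```

When every reward in the group is identical, the numerator is zero for every rollout, so each advantage (and with it the policy-gradient term) vanishes. Once SFT lets some completions parse and simulate, rewards differ within a group and the advantages become informative.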