Pratyush-01 committed on
Commit
12e2f97
·
verified ·
1 Parent(s): 33cfd0a

docs: explain cold small model context for SFT rationale

Files changed (1)
  1. docs/writeup.md +4 -2
docs/writeup.md CHANGED
@@ -56,9 +56,11 @@ All reward is computed from `scipy.odeint` output — no model-in-the-loop scori
 
  ### Why SFT first
 
- With a cold base model, ~80% of completions are unparseable. GRPO needs variance across rollouts if everything scores ~0, there's no gradient. SFT on 192 synthetic `(prompt, ground_truth)` pairs from the environment fixes the output format so GRPO has something to work with.
+ Qwen2.5-3B is a small, cold base model. Out of the box it has no concept of our ODE grammar or JSON output format — ~80% of its completions are LaTeX, prose, or malformed JSON that the verifier can't parse at all. GRPO needs *variance in reward* across rollouts to estimate advantages; if every rollout scores ~0 because nothing parses, the gradient is zero and nothing learns.
 
- After SFT: >90% of completions parse and simulate successfully.
+ SFT on 192 synthetic `(prompt, ground_truth)` pairs from the environment teaches the model the output format before RL begins. 3 epochs, ~5 min on L40S.
+
+ After SFT: >90% of completions parse and simulate successfully — GRPO now has a signal to work with.
 
  ### GRPO
 
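The zero-gradient argument in the "Why SFT first" paragraph can be checked numerically. The sketch below assumes the commonly described GRPO advantage, where each rollout's reward is normalized by the mean and standard deviation of its group; the writeup does not show the actual trainer configuration, and the function name `group_advantages`, the group size of 8, and the reward values are all hypothetical, chosen only for illustration.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    Hypothetical helper; assumes the standard GRPO-style normalization,
    not necessarily the exact formula used in this repository.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Cold base model: nothing parses, so every rollout in the group scores 0.
cold = group_advantages([0.0] * 8)

# After SFT: most rollouts parse, so rewards differ within the group.
# These reward values are made up for illustration.
after_sft = group_advantages([0.0, 0.9, 0.4, 0.0, 0.7, 0.2, 0.8, 0.5])

print("cold:     ", cold)
print("after SFT:", [round(a, 2) for a in after_sft])
```

With identical rewards every advantage is exactly zero, so the policy-gradient term vanishes regardless of learning rate. Once rewards vary within a group, the advantages are nonzero and rank the rollouts against each other, which is the signal GRPO needs.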