docs: explain cold small model context for SFT rationale
docs/writeup.md +4 -2
@@ -56,9 +56,11 @@ All reward is computed from `scipy.odeint` output — no model-in-the-loop scori
### Why SFT first

+Qwen2.5-3B is a small, cold base model. Out of the box it has no concept of our ODE grammar or JSON output format: ~80% of its completions are LaTeX, prose, or malformed JSON that the verifier can't parse at all. GRPO needs *variance in reward* across rollouts to estimate advantages; if every rollout scores ~0 because nothing parses, every group-relative advantage is ~0, the policy gradient vanishes, and the model learns nothing.
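
The zero-variance failure mode can be illustrated with a minimal sketch. It assumes the common group-relative normalization `(r - mean) / (std + eps)`; GRPO implementations differ in detail (sample vs. population std, or no std scaling at all), and the function name and reward values below are made up for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group where no completion parses: every reward is 0.
dead_group = group_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical group after SFT: some completions parse and earn reward.
live_group = group_advantages([0.0, 0.0, 0.8, 0.4])

print(dead_group)  # all zeros, so the policy-gradient term vanishes
print(live_group)  # mixed signs: some rollouts are pushed up, others down
```

With identical rewards every advantage is exactly zero, so the update carries no information regardless of group size.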
+SFT on 192 synthetic `(prompt, ground_truth)` pairs from the environment teaches the model the output format before RL begins. Training takes 3 epochs, ~5 min on an L40S.
+
+After SFT, >90% of completions parse and simulate successfully, so GRPO now has a reward signal to work with.
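
A parse rate like the one quoted above could be measured along these lines. This is only a sketch: it checks that a completion is a JSON object and does not run the simulation step, and the `"equations"` key and sample completions are hypothetical stand-ins, not the project's actual schema or data.

```python
import json


def parses_as_json_object(completion: str) -> bool:
    """Return True if the completion is valid JSON whose top level is an object."""
    try:
        return isinstance(json.loads(completion), dict)
    except json.JSONDecodeError:
        return False


def parse_rate(completions) -> float:
    """Fraction of completions that parse; 0.0 for an empty batch."""
    if not completions:
        return 0.0
    return sum(parses_as_json_object(c) for c in completions) / len(completions)


# Hypothetical completions: one valid object, then LaTeX, prose, truncated JSON.
samples = [
    '{"equations": ["dx/dt = -k*x"]}',
    r"\frac{dx}{dt} = -kx",
    "The system decays exponentially.",
    '{"equations": ["dx/dt = -k*x"',
]
rate = parse_rate(samples)
print(f"parse rate: {rate:.0%}")
```

Running the same check on a fixed prompt set before and after SFT gives a like-for-like comparison of the two rates.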
### GRPO