Update project report with 2B GRPO and absolute figures
docs/cadforge-openenv-project-report.md

SFT worked clearly. Both Qwen3.5 models learned the shape of editable CADQuery programs:

| Model | Train loss | Eval loss | Notes |
| --- | --- | --- | --- |
| Qwen3.5-2B SFT | `1.4480 -> 0.1658` | `0.4477 -> 0.2676` | Learned prompt-to-CAD and repair format |
| Qwen3.5-9B SFT | `2.6020 -> 0.1413` | `0.3650 -> 0.2398` | Stronger syntax/style learning |

### Act 3: First GRPO Run, Too Much Reward Too Soon
| Model | Mean reward | Best reward | Trend |
| --- | --- | --- | --- |
| Qwen3.5-2B GRPO | `0.3387` | `0.5303` | `+0.000887 / step` |
| Qwen3.5-9B GRPO | `0.4355` | `0.6828` | `+0.000475 / step` |

The 2B GRPO run is not the headline result, but it is important evidence. It proved that the tiny model could receive a real CADForge reward signal and move in the right direction. Its weakness was diagnostic: because the dense reward was too generous, the model could improve style and partial structure without reliably crossing the buildability gate.
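That failure mode follows from how GRPO assigns credit. A minimal sketch of the standard group-relative advantage (assumed here as the usual formulation; the report does not show the project's actual trainer code):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion against its own sampling group (standard GRPO).

    Each prompt yields a group of completions; a completion's advantage is
    its reward minus the group mean, scaled by the group's std deviation.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are relative to the group, a non-buildable completion that still scores above its group's mean under a forgiving dense reward receives positive advantage, which is exactly the weakness described above.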

But training exposed an environment bug in the reward design: the reward was still too forgiving when CadQuery execution failed. The model learned structure and editability signals, but too many completions still failed the real build gate.
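A hard build gate is the natural fix. The sketch below is illustrative only: the function name and the `result`-binding convention are assumptions, and the real environment would execute the generated CadQuery program and validate the resulting solid rather than merely `exec` it:

```python
def build_gated_reward(program: str, dense_score: float, gate_floor: float = 0.0) -> float:
    """Pay out the dense score only if the generated program actually builds.

    Illustrative stand-in for a real CadQuery build check: run the program
    and require it to bind a non-None `result`; any failure earns gate_floor.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)  # real env: run CadQuery, validate geometry
    except Exception:
        return gate_floor         # execution error -> fails the build gate
    if namespace.get("result") is None:
        return gate_floor         # no model produced -> fails the build gate
    return dense_score            # buildable -> full dense reward
```

With this gate, a completion that fails to build earns the floor no matter how plausible its structure looks, so style alone can no longer accumulate reward.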

The 2B model learned the basic grammar of CADForge traces. It moved from broad, uncertain outputs to compact CadQuery-style responses.

### Run 1.5: Qwen3.5-2B Dense GRPO
The 2B GRPO run was intentionally included as the small-model learning probe. It reached mean reward `0.3387`, best reward `0.5303`, and a positive trend of `+0.000887 / step`.

That result is useful but incomplete. It shows the small model can respond to CADForge reward, but it also exposed the flaw that the dense reward gave too much credit to non-buildable programs. We used that failure to redesign the environment into strict build-gated GRPO.


### Run 2: Qwen3.5-9B SFT
The 9B model learned faster and reached a better eval loss than the 2B model (`0.2398` vs. `0.2676`).

### Run 3: Dense GRPO
Dense GRPO improved reward, but still allowed too many non-buildable completions. This run exposed the reward-design flaw.

### Run 4: Strict Build-Gated GRPO
- best CADForge total score: `0.9352`
- mean per-step reward trend: `+0.003549 / step`
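The per-step trend figures quoted in this report read as the slope of reward over training steps. One plausible way to compute them — a plain least-squares fit, assumed here rather than taken from the project's logging code:

```python
def reward_trend(rewards):
    """Least-squares slope of reward against step index (reward units / step)."""
    n = len(rewards)
    mean_x = (n - 1) / 2                 # step indices are 0, 1, ..., n-1
    mean_y = sum(rewards) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rewards))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```

A positive slope such as `+0.003549 / step` then means the policy's average reward is still climbing at the end of the run rather than having plateaued.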


The remaining failures are useful curriculum targets: