Commit 095b2cc (verified) by sanjuhs · Parent: e35a000

Add CADForge project report

docs/cadforge-openenv-project-report.md ADDED (+278 lines)
---
title: CADForge RLVE
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8791
base_path: /
tags:
  - openenv
  - cadquery
  - grpo
  - rlhf
  - code-cad
---

# CADForge RLVE

### Can a tiny model learn editable CAD by interacting with a real CadQuery build-and-reward environment?

CADForge is a reinforcement learning environment for code-CAD. The agent receives a product/design request, writes executable CadQuery Python, gets a real compile/render/reward observation back, and learns to repair its CAD over long-horizon revision loops.

The goal is not just text-to-3D mesh generation. The goal is **editable, parametric CAD**: named dimensions, reusable helper functions, final fixture construction, stable topology, semantic parts, and buildable geometry that can survive a professional engineering workflow.

This project targets three OpenEnv themes:

- **Theme 2: Long-horizon planning**: CAD improves over repeated code edits and reward feedback.
- **Theme 3.1: Professional world modeling**: the model interacts with real CadQuery tooling, mesh exports, renders, task specs, and persistent artifacts.
- **Theme 4: Self-improvement**: teacher traces, reward failures, and adversarial task generation improve both the model and the environment.

---

## The Story: From Pretty Code to Buildable CAD

### Act 1: The Cold Start

The initial model can often write plausible-looking CadQuery code, but the environment quickly reveals a painful truth: plausible CAD code is not the same thing as executable CAD.

Early generations failed on simple but fatal issues:

- invalid Python syntax
- invented CadQuery APIs such as non-existent `Workplane` methods
- missing `fixture = ...`
- undefined helper variables
- too-long outputs clipped before the final fixture
- disconnected assemblies or oversized gaps

That is exactly why CADForge is an environment, not a static benchmark. The model has to survive a compiler, a mesh pipeline, semantic scoring, and visual/structural comparison.

### Act 2: SFT Teaches the Language of CadQuery

We generated teacher traces and prompt-to-CAD examples from:

- ideal Markus chair CadQuery code
- GPT-5.4/GPT-5.5 agentic repair traces
- prompt-to-CAD cold-start rows
- environment transcripts with previous code, reward JSON, and corrected code

SFT clearly worked: both Qwen3.5 models learned the shape of editable CadQuery programs.

| Model | Train Loss | Eval Loss | Result |
|---|---:|---:|---|
| Qwen3.5-2B SFT | `1.4480 -> 0.1658` | `0.4477 -> 0.2676` | Learned prompt-to-CAD and repair format |
| Qwen3.5-9B SFT | `2.6020 -> 0.1413` | `0.3650 -> 0.2398` | Stronger syntax/style learning |

![Qwen3.5-9B SFT loss](../training/reports/qwen35-9b-sft-final/sft_loss_curve.png)

### Act 3: First GRPO Run, Too Much Reward Too Soon

The first GRPO reward was intentionally dense. It rewarded build attempts, topology, semantic parts, reference similarity, contact/gap structure, and editability.

That produced positive reward movement:

| Model | Mean Reward | Best Reward | Trend |
|---|---:|---:|---:|
| Qwen3.5-2B GRPO | `0.3387` | `0.5303` | `+0.000887 / step` |
| Qwen3.5-9B GRPO | `0.4355` | `0.6828` | `+0.000475 / step` |

![Qwen3.5-9B GRPO reward](../training/reports/qwen35-9b-grpo-final/grpo_reward_curve.png)

But training exposed a bug in the reward design: the reward was still too forgiving when CadQuery execution failed. The model learned structure and editability signals, but too many completions still failed the real build gate.

This is the same pattern seen in strong OpenEnv projects: the model's failures teach us where the environment is wrong.

### Act 4: The Environment Fights Back

We tightened GRPO into a **strict build-gated reward**:

- an executable CadQuery build is now the first gate
- invalid builds get negative rewards
- syntax errors, missing fixtures, undefined variables, and invented APIs are explicitly penalized
- successful builds recover positive dense rewards for topology, semantics, contact, reference similarity, and editability
- debug logs now store parsed reward JSON directly, instead of inferring build status from truncated stdout

A strict smoke test produced the intended signal:

| Failure | Reward Behavior |
|---|---:|
| Missing final fixture | negative |
| TypeError in helper call | negative |
| SyntaxError | more negative |
| NameError on an undefined variable | negative |

This gives GRPO useful variance: buildable CAD should separate sharply from pretty-but-broken CAD.
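
The variance point can be made concrete with GRPO's group baseline: each completion's advantage is its reward relative to the group of rollouts for the same prompt. A toy sketch (pure Python, not the training code; reward values are invented):

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center on the group mean, scale by group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:  # identical rewards -> no learning signal at all
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Forgiving reward: a non-buildable completion (first entry) can still land
# above the group mean and get pushed *up*.
forgiving = group_advantages([0.44, 0.40, 0.42, 0.41])

# Strict build gate: failures are negative, builds positive, so the sign of
# each advantage now tracks buildability.
gated = group_advantages([-0.8, -1.0, 0.71, 0.65])
```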

The strict 9B run completed on an H200 and produced exactly that separation:

| Run | Completions | Buildable CAD | Build Rate | Best Candidate Reward | Best CADForge Score |
|---|---:|---:|---:|---:|---:|
| Qwen3.5-9B strict GRPO | `320` | `96` | `30.0%` | `0.9449` | `0.9352` |

The per-step GRPO reward mean stayed lower because failed builds are now intentionally negative. That is good: GRPO now sees the contrast between broken syntax and real executable CAD instead of rewarding both.

---

## How It Works

```
┌──────────────────────────────────────────────────────────────┐
│                      CADFORGE RLVE LOOP                      │
│                                                              │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐  │
│  │ CAD Task     │────►│ Qwen / GPT   │────►│ CadQuery Code│  │
│  │ prompt + ref │     │ CAD Agent    │     │ candidate    │  │
│  └──────┬───────┘     └──────────────┘     └──────┬───────┘  │
│         │                                         │          │
│         │ real tools                              ▼          │
│         │  ┌───────────────────────────────────────────────┐ │
│         └─►│ CadQuery build → STL/mesh → normalize → score │ │
│            └────────────────────────┬──────────────────────┘ │
│                                     │ reward JSON + artifacts│
│                                     ▼                        │
│                      SFT traces + GRPO                       │
└──────────────────────────────────────────────────────────────┘
```

The environment writes persistent artifacts under each episode:

- candidate code
- build logs
- STL/mesh exports
- normalized mesh metrics
- rendered views
- reward JSON
- markdown reports
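
A sketch of what that per-episode persistence can look like, assuming one directory per episode (file names and reward fields here are illustrative, not the environment's actual layout):

```python
import json
import tempfile
from pathlib import Path

def save_episode(root: Path, episode_id: str, code: str, reward: dict) -> Path:
    """Write candidate code and parsed reward JSON under one episode dir."""
    episode_dir = root / episode_id
    episode_dir.mkdir(parents=True, exist_ok=True)
    (episode_dir / "candidate.py").write_text(code)
    # store parsed reward components, not a truncated stdout tail
    (episode_dir / "reward.json").write_text(json.dumps(reward, indent=2))
    return episode_dir

# Demo episode in a throwaway temp directory
demo = save_episode(Path(tempfile.mkdtemp()), "ep-0001",
                    "fixture = 1", {"build": 1.0, "total": 0.42})
```

Keeping the reward as structured JSON next to the code is what lets later tooling (SFT trace extraction, failure triage) reread episodes without re-running builds.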

---

## Reward Function

CADForge uses layered rewards so the model cannot win with a single shortcut.

| Dimension | What It Checks | Why It Matters |
|---|---|---|
| Build | CadQuery executes and exports geometry | CAD must compile before anything else matters |
| Topology | component count, volume, watertightness, bounds | prevents empty or broken geometry |
| Contact/gaps | disconnected bodies and large separations | chairs/fixtures need plausible physical assembly |
| Semantic parts | task-specific part hints in code and geometry | asks for the requested object, not generic blobs |
| Reference similarity | bbox/profile/silhouette/mesh comparison when a GLB exists | aligns generated CAD with the target object |
| Editability | named dimensions, helper functions, final fixture, clean structure | rewards reusable engineering CAD |
| Efficiency | compact code and a stable build path | discourages bloated or brittle outputs |

Strict GRPO changes the order:

1. Build first.
2. If the build fails, return a negative reward with small code-structure shaping.
3. If the build succeeds, use the dense CADForge reward.

This is the important lesson from the first GRPO run: **dense rewards are useful only after the build gate is respected.**

---

## Results

### Run 1: Qwen3.5-2B SFT

The 2B model learned the basic grammar of CADForge traces. It moved from broad, uncertain outputs to compact CadQuery-style responses.

![2B SFT loss](../training/reports/qwen35-2b-sft-final/sft_loss_curve.png)

### Run 2: Qwen3.5-9B SFT

The 9B model learned faster and reached a better eval loss than the 2B.

![9B SFT loss](../training/reports/qwen35-9b-sft-final/sft_loss_curve.png)

### Run 3: Dense GRPO

Dense GRPO improved reward, but still allowed too many non-buildable completions. This run exposed the reward-design flaw.

![9B dense GRPO code health](../training/reports/qwen35-9b-grpo-final/grpo_code_health.png)

### Run 4: Strict Build-Gated GRPO

This run was designed to fix the first GRPO issue. It starts from the 9B SFT checkpoint, not the forgiving GRPO checkpoint, and trains with buildability as the first reward gate.

Completed result:

- `320` completions scored through the real CADForge environment
- `96` completions built successfully
- `30.0%` strict build rate during GRPO
- best individual candidate reward: `0.9449`
- best CADForge total score: `0.9352`
- mean per-step reward trend: `+0.003549 / step`

![Strict 9B GRPO reward](../training/reports/qwen35-9b-grpo-strict-build-20260426-strict-build/grpo_reward_curve.png)

![Strict 9B GRPO code health](../training/reports/qwen35-9b-grpo-strict-build-20260426-strict-build/grpo_code_health.png)

The remaining failures are useful curriculum targets:

| Outcome | Count |
|---|---:|
| SyntaxError | `109` |
| build_ok | `96` |
| TypeError | `25` |
| ValueError | `24` |
| NameError | `24` |
| AttributeError | `21` |

### Quick Held-Out Eval

After upload, the strict adapter was evaluated on three held-out prompts:

| Task | Reward | Build | Semantic | Editability |
|---|---:|---:|---:|---:|
| axial_motor_stator_12_slot | `0.708` | `1.0` | `0.300` | `0.825` |
| caster_wheel_fork | `0.738` | `1.0` | `0.452` | `0.942` |
| four_leg_chair_700n | `-1.000` | `0.0` | `0.000` | `0.000` |

Eval summary: **2/3 buildable**, a `66.7%` build rate, best reward `0.738`. The failed chair output was clipped before the final union closed, which tells us the next curriculum should target shorter, valid finalization and syntax closure.

---

## Model Artifacts

- [Qwen3.5-2B CADForge SFT LoRA](https://huggingface.co/sanjuhs/qwen35-2b-cadforge-sft-lora)
- [Qwen3.5-2B CADForge GRPO LoRA](https://huggingface.co/sanjuhs/qwen35-2b-cadforge-grpo-lora)
- [Qwen3.5-9B CADForge SFT LoRA](https://huggingface.co/sanjuhs/qwen35-9b-cadforge-sft-lora)
- [Qwen3.5-9B CADForge GRPO LoRA](https://huggingface.co/sanjuhs/qwen35-9b-cadforge-grpo-lora)
- [Qwen3.5-9B CADForge Strict Build-Gated GRPO LoRA](https://huggingface.co/sanjuhs/qwen35-9b-cadforge-grpo-strict-build-lora)

---

## Training on RunPod H200

```bash
cd /workspace/open-env-meta-final
export HF_TOKEN=...
export OPENAI_API_KEY=...
export FAL_AI_API_KEY=...

# Full overnight chain used for SFT + GRPO
./training/continue_after_2b_grpo.sh

# Strict build-gated 9B GRPO follow-up
./training/run_strict_9b_grpo.sh
```

---

## What We Learned

1. SFT is necessary for CadQuery. Tiny models need to learn the program shape before RL can help.
2. Dense reward alone is too easy to exploit. Buildability must be the first gate.
3. Reward logs need parsed components, not only human-readable stdout tails.
4. Output clipping is a real CAD failure mode. If the model never reaches `fixture = ...`, the geometry cannot build.
5. CADForge should train on repair loops, not only one-shot prompt-to-code. The real skill is diagnosis and correction.
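
The repair-loop point implies training rows should look like revision steps. A sketch of how such a prompt can be assembled from the last attempt and its parsed reward (the format is illustrative, not the exact trace template):

```python
import json

def repair_prompt(task: str, previous_code: str, reward: dict) -> str:
    """Build a revision-loop prompt from the prior attempt and its reward."""
    return (
        f"Task: {task}\n\n"
        "Previous CadQuery attempt:\n"
        f"```python\n{previous_code}\n```\n\n"
        "Environment feedback (reward JSON):\n"
        f"{json.dumps(reward, indent=2)}\n\n"
        "Revise the code. Keep named dimensions and helper functions, "
        "and end with `fixture = ...`."
    )
```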

---

## Next Steps

- Add AST-level syntax checks before CadQuery execution for faster reward.
- Add targeted repair curricula for common failures: missing fixture, undefined variable, invalid Workplane API, clipped code.
- Run strict GRPO with more generations per prompt once the build rate starts moving.
- Add vLLM server mode for higher-throughput Qwen rollouts.
- Evaluate trained adapters against held-out GLB-backed tasks and Markus chair reference similarity.
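
For the invalid-`Workplane`-API item, one cheap detector is an AST walk that flags fluent-chain method calls missing from a known-good whitelist. The whitelist below is a tiny illustrative subset, not CadQuery's real surface, and the walk over-flags attribute calls on non-Workplane objects, so it suits triage rather than hard gating:

```python
import ast

# Illustrative subset; a real check would enumerate cadquery.Workplane's API.
KNOWN_METHODS = {
    "Workplane", "box", "circle", "rect", "extrude", "translate",
    "union", "cut", "faces", "edges", "workplane", "hole", "fillet",
}

def suspicious_calls(source: str) -> set[str]:
    """Return attribute-call names in the source that are not whitelisted."""
    tree = ast.parse(source)
    return {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr not in KNOWN_METHODS
    }
```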