
CADForge: Training Tiny Models to Write Buildable CadQuery

A detailed judge-facing blog with screenshots, diagrams, references, and full interpretation of every run lives at:

docs/detailed-blog/cadforge-detailed-blog.md

Judge Reproducibility Links

The full SFT and GRPO runs were executed on a RunPod H200 via dedicated production scripts. The Colab notebook is the public, judge-runnable smoke path: it validates the same OpenEnv, dataset, reward backend, and training entry points on a tiny run.

The Problem

LLMs can describe a chair, bracket, hook, or motor stator, but tiny models are still unreliable at producing CAD that an engineer can actually use.

The common failure mode is not imagination. It is execution:

  • the code has a syntax error
  • the model invents a CadQuery API
  • the final fixture is missing
  • the geometry exports but contains disconnected parts
  • the shape looks roughly correct but has no editable dimensions or helper structure
  • the answer is clipped before the final assembly is complete

CADForge turns those failures into a reinforcement learning environment.

The agent receives a professional design prompt and must output a complete executable CadQuery Python file. CADForge compiles it, exports geometry, scores topology and task semantics, writes artifacts, and returns reward JSON. The goal is not a pretty one-shot mesh. The goal is editable, buildable code-CAD.

Why This Fits OpenEnv

CADForge targets three hackathon themes.

Theme 2: Long-horizon planning. A CAD model is naturally built through steps: make a base shape, add supports, add constraints, connect parts, repair broken operations, simplify brittle features, and finalize the object. The environment supports long-horizon repair traces where the model sees previous code, reward JSON, and verifier notes before producing the next revision.
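A long-horizon repair observation could be rendered as a prompt roughly like this. This is a minimal sketch; the field layout is illustrative, and the real transcript schema lives in the published dataset:

```python
# Illustrative repair-turn prompt; field names here are assumptions,
# not the exact CADForge transcript schema.
REPAIR_PROMPT = """You previously wrote this CadQuery file:

{previous_code}

The environment returned this reward JSON:

{reward_json}

Verifier notes: {verifier_notes}

Write a corrected, complete CadQuery Python file."""
```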

Theme 3.1: Professional world modeling. The world is not a toy simulator. The model interacts with real CadQuery execution, mesh export, reference objects, task specs, and persistent artifacts. If the code does not build, the environment says so.

Theme 4: Self-improvement. The first reward design was too forgiving. Training exposed that flaw, so we tightened the environment: buildability became the first gate, broken code became negative reward, and individual failure types became curriculum targets.

The Environment Loop

design prompt
  -> LLM writes CadQuery Python
  -> CADForge runs CadQuery
  -> STL/mesh/artifacts are produced
  -> reward JSON scores build, topology, contact, semantics, reference similarity, editability
  -> model trains through SFT and GRPO
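The compile-and-export gate at the center of that loop can be sketched in a few lines. This is a minimal illustration, not the production verifier; the `fixture` variable convention is borrowed from the repair examples later in this post:

```python
import cadquery as cq

def try_build(code: str, stl_path: str = "episode.stl") -> dict:
    """Execute generated CadQuery code and report buildability (sketch)."""
    namespace: dict = {}
    try:
        exec(code, namespace)               # run the generated file
        fixture = namespace.get("fixture")  # final-fixture convention (assumed)
        if fixture is None:
            return {"build": 0.0, "error": "missing final fixture"}
        cq.exporters.export(fixture, stl_path)  # geometry must actually export
        return {"build": 1.0, "stl": stl_path}
    except Exception as exc:
        return {"build": 0.0, "error": f"{type(exc).__name__}: {exc}"}
```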

Each episode writes persistent artifacts:

  • generated CadQuery code
  • build logs
  • STL/mesh outputs
  • rendered views when available
  • reward JSON
  • verifier notes
  • markdown summaries

Reward Design

CADForge uses layered reward instead of a single pass/fail bit.

| Dimension | What It Checks | Why It Matters |
|---|---|---|
| Build | CadQuery executes and exports geometry | no professional CAD workflow starts from broken code |
| Topology | volume, bounds, connectedness, watertight proxy | prevents empty or incoherent geometry |
| Contact | disconnected assemblies and large gaps | chairs, fixtures, and hooks must be physically plausible |
| Semantic parts | task-specific shape hints | a chair should have chair-like structure, not a generic block |
| Reference similarity | bbox/silhouette/mesh comparison when a GLB exists | supports object-specific CAD imitation |
| Editability | named dimensions, helper functions, final fixture | rewards real code-CAD, not opaque mesh blobs |
| Efficiency | compact, stable output | discourages bloated brittle programs |
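One simple way to fold those dimensions into a scalar is a weighted sum. The weights below are placeholders for illustration, not the values CADForge actually uses:

```python
# Placeholder weights; the real mix is defined by the environment.
WEIGHTS = {
    "build": 0.25, "topology": 0.15, "contact": 0.15,
    "semantic": 0.15, "reference": 0.10, "editability": 0.15,
    "efficiency": 0.05,
}

def total_score(scores: dict) -> float:
    """Weighted sum over per-dimension scores in [0, 1]."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)
```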

The final strict GRPO run changed the reward order:

  1. Build first.
  2. If build fails, return negative reward with diagnostics.
  3. If build succeeds, unlock the dense CADForge score.

This is the environment "fighting back." The model can no longer get meaningful reward for pretty-looking text that does not compile.
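In code, the strict gate is just an early return around the dense score. A sketch, reusing the `try_build` helper from the loop section; `dense_cadforge_score` is a hypothetical stand-in for the layered scorer:

```python
def strict_reward(code: str) -> float:
    result = try_build(code)
    if result["build"] < 1.0:
        # Broken code never earns positive reward; diagnostics travel
        # back to the agent in the reward JSON.
        return -1.0
    # Only buildable code unlocks the dense layered score.
    return dense_cadforge_score(code)  # hypothetical dense scorer
```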

Training Data

We used:

  • prompt-to-CadQuery cold-start examples
  • GPT-5.4/GPT-5.5 teacher repair traces
  • ideal Markus chair CadQuery code
  • environment transcripts with previous code, reward JSON, and corrected code
  • generated CAD prompts across easy, medium, and hard mechanical objects

Dataset: sanjuhs/cadforge-cadquery-agentic-traces

The SFT mix upsampled cold-start rows because repair traces outnumbered fresh prompt-to-CAD examples. Upsampling means repeating those rows so the model still learns to create the first complete CAD file, not only repair an existing one.
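With Hugging Face `datasets`, that upsampling can be sketched as repeating the cold-start split before shuffling; the 3x multiplier and split names are assumptions for illustration:

```python
from datasets import concatenate_datasets

# Repeat cold-start rows so fresh prompt-to-CAD examples keep pace with
# the more numerous repair traces; the multiplier here is an assumption.
sft_mix = concatenate_datasets(
    [repair_traces] + [cold_start] * 3
).shuffle(seed=42)
```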

Training Pipeline

The runnable notebook is here:

training/cadforge_openenv_training_colab.ipynb

The production scripts are:

  • training/train_sft_unsloth.py
  • training/train_grpo_cadforge.py
  • training/evaluate_cadforge_model.py
  • training/make_training_report.py
  • training/run_strict_9b_grpo.sh

The real run used Unsloth for LoRA SFT and TRL GRPO for environment reward training on a RunPod H200.
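Wiring the environment reward into TRL's GRPO trainer looks roughly like this. The model handle and hyperparameters are illustrative, and `strict_reward` is the gate sketched above:

```python
from trl import GRPOConfig, GRPOTrainer

def cadforge_reward(completions, **kwargs):
    # One scalar per sampled completion (assuming plain-text prompts,
    # where TRL passes completions as strings).
    return [strict_reward(c) for c in completions]

trainer = GRPOTrainer(
    model=sft_checkpoint,          # the LoRA SFT model (assumption)
    reward_funcs=cadforge_reward,
    args=GRPOConfig(output_dir="grpo-cadforge", num_generations=8),
    train_dataset=prompt_dataset,  # design prompts only
)
trainer.train()
```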

Training Logs and Evidence

The raw logs are public so the training claims are inspectable, not just summarized:

  • Training evidence dataset: sanjuhs/cadforge-training-evidence
  • Compressed archive on that dataset: archives/cadforge-training-evidence-20260426.tar.gz
  • Per-sample reward traces: training/logs/*completions.jsonl
  • Generated plots and parsed metrics: training/reports/*

The most important thing the logs show is that reward alone was not enough. The dense GRPO runs had positive-looking scalar reward but 0% buildability. The strict build-gated run changed the environment so broken CAD stayed negative and buildable CAD unlocked dense reward. The final adaptive run then mined failures from strict GRPO and trained directly on repair prompts.
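Because the per-sample traces are plain JSONL, the headline build rates can be recomputed in a few lines; the `build` field name is an assumption about the log schema:

```python
import glob
import json

built = total = 0
for path in glob.glob("training/logs/*completions.jsonl"):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            built += record.get("build", 0) > 0  # field name is an assumption
print(f"build rate: {built}/{total} ({built / max(total, 1):.1%})")
```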

*Figure: training evidence build-rate summary.*

Results

| Run | Result |
|---|---|
| Qwen3.5-2B SFT | train loss 1.4480 -> 0.1658, eval loss 0.4477 -> 0.2676 |
| Qwen3.5-2B dense GRPO | mean reward 0.3387, best 0.5303; useful reward signal but too forgiving on broken builds |
| Qwen3.5-9B SFT | train loss 2.6020 -> 0.1413, eval loss 0.3650 -> 0.2398 |
| Qwen3.5-9B strict GRPO | 320 completions, 96 buildable, 30.0% build rate |
| Qwen3.5-9B adaptive repair v1 | 120 repair completions, 0 buildable; exposed clipped completions and bad curriculum ordering |
| Qwen3.5-9B adaptive repair final 8192 | 180 repair completions, 53 buildable, 0 clipped completions, best reward 0.882 |
| Strict 9B quick eval | 2/3 held-out prompts built successfully |

The strict GRPO run produced:

  • best individual reward: 0.9449
  • best CADForge total score: 0.9352
  • mean reward trend: +0.003549 per step
  • held-out eval build rate: 66.7%

*Figure: strict GRPO reward curve.*

*Figure: strict GRPO code health.*

*Figure: final adaptive repair chunk metrics.*

What The Model Learned

SFT taught the model the language of editable CadQuery (a minimal example follows this list):

  • named dimensions
  • helper builders
  • final fixture
  • compact Python-only output
  • simple robust CadQuery operations
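A minimal file in that learned style might look like this. It is a hand-written illustration of the conventions, not model output; the `fixture` name follows the repair examples below:

```python
import cadquery as cq

# Named dimensions keep the part editable.
LEG_SIDE = 40.0
LEG_HEIGHT = 430.0
SEAT_SIDE = 420.0
SEAT_THICK = 25.0

def leg(x: float, y: float) -> cq.Workplane:
    """Helper builder: one square leg centered at (x, y)."""
    return (
        cq.Workplane("XY")
        .center(x, y)
        .rect(LEG_SIDE, LEG_SIDE)
        .extrude(LEG_HEIGHT)
    )

# Seat slab sitting on top of the legs.
seat = (
    cq.Workplane("XY")
    .workplane(offset=LEG_HEIGHT)
    .rect(SEAT_SIDE, SEAT_SIDE)
    .extrude(SEAT_THICK)
)

off = SEAT_SIDE / 2 - LEG_SIDE / 2
legs = leg(off, off).union(leg(-off, off)).union(leg(off, -off)).union(leg(-off, -off))

# Final fixture: a single connected solid the verifier can export.
fixture = legs.union(seat)
```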

Strict GRPO then taught it the first hard constraint: buildable CAD is categorically better than broken CAD.

The quick eval built two held-out objects:

| Task | Reward | Build | Editability |
|---|---|---|---|
| axial_motor_stator_12_slot | 0.708 | 1.0 | 0.825 |
| caster_wheel_fork | 0.738 | 1.0 | 0.942 |
| four_leg_chair_700n | -1.000 | 0.0 | 0.000 |

The failed chair output was clipped before closing the final union. That failure is useful: it gives the next curriculum a concrete target.

How CADForge Improves Itself

The self-improvement loop is automatic: the environment uses failed rollouts to create the next training distribution.

rollout batch
  -> reward JSON
  -> failure classifier
  -> targeted repair task generator
  -> adaptive sampler
  -> next SFT / GRPO batch

Examples (a classifier sketch follows the list):

  • `SyntaxError: '(' was never closed` becomes: "preserve the current chair structure, close the final union, and assign `fixture`."
  • `AttributeError: Workplane has no cone` becomes: "replace the invented API with cylinders, boxes, lofts, or cuts that CadQuery supports."
  • a disconnected caster assembly becomes: "bridge the wheel, axle, fork, and top plate so contact reward improves."
  • weak Markus similarity becomes: "adjust the backrest height, armrests, gas cylinder, star base, and caster proportions."
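A classifier that routes build errors to those repair templates can be sketched as string matching on the diagnostics. The patterns are illustrative; the real classifier lives in the environment:

```python
def classify_failure(error: str) -> str:
    # Route raw diagnostics to curriculum failure classes (sketch).
    if "was never closed" in error or "SyntaxError" in error:
        return "syntax closure"
    if "has no attribute" in error:
        return "invented API"
    if "is not defined" in error:
        return "undefined names"
    if "missing final fixture" in error:
        return "missing fixture"
    return "unknown build failure"
```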

The next action is scored by reward delta:

delta_reward = new_total_reward - previous_total_reward

This prevents the model from merely restating plausible CAD. It must repair the concrete failure the environment found.

The curriculum controller then samples more from weak failure types:

weight = base_weight * (1 - recent_mastery)^2 * difficulty_multiplier

So if the model keeps failing syntax closure, syntax repair appears more often. Once build rate rises, the environment shifts toward harder semantic and reference-similarity tasks.
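That weighting drops a failure type's sampling probability quadratically as mastery rises. A sketch of the sampler, with illustrative bookkeeping fields:

```python
import random

def sample_failure_type(stats: dict, base_weight: float = 1.0) -> str:
    # weight = base_weight * (1 - recent_mastery)^2 * difficulty_multiplier
    weights = {
        ftype: base_weight * (1.0 - s["recent_mastery"]) ** 2 * s["difficulty"]
        for ftype, s in stats.items()
    }
    types = list(weights)
    return random.choices(types, weights=[weights[t] for t in types], k=1)[0]
```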

That turns CADForge into a curriculum engine: the environment watches where the model fails and creates the next training distribution from those failures.

Detailed plan: docs/cadforge-self-improving-curriculum.md


Adaptive Repair Curriculum

After strict GRPO, CADForge mined the 320 completion traces and generated a new adaptive repair curriculum. The curriculum found the dominant failure class automatically:

| Failure class | Count |
|---|---|
| syntax closure | 110 |
| type/value/CAD kernel errors | 47 |
| disconnected or weak geometry | 26 |
| undefined names | 20 |
| invented API | 17 |
| missing fixture | 15 |
| unknown build failure | 15 |
| low editability | 4 |
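That table is essentially a counter over the failed traces. With the `classify_failure` sketch from above, mining looks like this (the parsed-record shape is assumed):

```python
from collections import Counter

failure_counts = Counter(
    classify_failure(rec.get("error", ""))
    for rec in records            # parsed *completions.jsonl rows (assumed)
    if rec.get("build", 0) <= 0   # failed completions only
)
print(failure_counts.most_common())
```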

The first adaptive run was useful because it failed loudly: 120 repair completions produced 0 builds and showed clipped outputs. We fixed the curriculum and completion budget, reran it as 20260426-adaptive-repair-final-8192, and got 53/180 buildable repairs with 0 clipped completions. That is the self-improvement loop in miniature: the environment found a concrete weakness, generated a targeted repair distribution, and the next run improved that failure mode.