metadata
title: CADForge CadQuery
emoji: 🪑
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - cadquery
  - reinforcement-learning

CADForge Experiment 2

CADForge is an OpenEnv environment for training LLMs to produce editable, buildable CadQuery CAD.

The agent receives a design request, writes a complete CadQuery Python file, and the environment runs real CAD tooling: CadQuery build, STL export, topology checks, semantic scoring, reference similarity, editability scoring, and persistent artifact logging.

Judge-Facing Links

RunPod/H200 clarification: the full 2B and 9B SFT and GRPO runs were executed on RunPod H200 GPUs with standalone production scripts. The Colab notebook is the judge-runnable smoke path: it validates OpenEnv, the public dataset, and the CadQuery reward backend, then launches tiny SFT/GRPO runs with those same scripts.

Results Snapshot

| Run | Result |
| --- | --- |
| Qwen3.5-2B SFT | train loss 1.4480 -> 0.1658, eval loss 0.4477 -> 0.2676 |
| Qwen3.5-2B dense GRPO | mean reward 0.3387, best 0.5303; useful reward signal but too forgiving on broken builds |
| Qwen3.5-9B SFT | train loss 2.6020 -> 0.1413, eval loss 0.3650 -> 0.2398 |
| Qwen3.5-9B strict GRPO | 320 completions, 96 buildable, best CADForge score 0.9352 |
| Qwen3.5-9B adaptive repair GRPO | 180 repair completions, 53 buildable, 0 clipped completions |
| Strict 9B quick eval | 2/3 held-out prompts built successfully |
| Stator inference comparison | base Qwen failed to build; RL-tuned Qwen built a stator scoring 0.654; GPT-5.4 built one scoring 0.709 |

[Figure: Strict GRPO reward curve]

[Figure: Strict GRPO code health]

[Figure: Base Qwen vs RL-tuned Qwen vs GPT-5.4 stator comparison]

Training Logs

The raw logs are backed up separately so judges can inspect the training evidence without relying on screenshots:

  • Evidence dataset: sanjuhs/cadforge-training-evidence
  • Compressed archive: archives/cadforge-training-evidence-20260426.tar.gz
  • Key JSONL traces: training/logs/*completions.jsonl

The logs show the core result: dense GRPO had positive-looking reward but 0% buildability; strict build-gating produced 96/320 buildable completions (30%); adaptive repair fixed clipped outputs and produced 53/180 buildable repairs (about 29%).
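You can recompute these counts from the archived traces by tallying buildable lines in each JSONL file. The script below is a minimal sketch. It assumes that each line is a JSON object with a boolean build flag. The field name `build_ok` and the helper `buildability_rate` are hypothetical, so check the real trace schema before trusting the output.

```python
import glob
import json
from pathlib import Path


def buildability_rate(path: str, key: str = "build_ok") -> tuple[int, int]:
    """Return (buildable, total) for one JSONL completions trace.

    Assumption: every non-blank line is a JSON object, and the boolean field
    named by `key` marks a successful build ("build_ok" is hypothetical).
    """
    buildable = total = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            if record.get(key):
                buildable += 1
    return buildable, total


if __name__ == "__main__":
    # Pattern taken from the "Key JSONL traces" path listed above.
    for trace in sorted(glob.glob("training/logs/*completions.jsonl")):
        ok, n = buildability_rate(trace)
        print(f"{trace}: {ok}/{n} buildable ({ok / n:.1%})" if n else f"{trace}: empty")
```

Run it from the directory that contains `training/logs/`, after extracting the evidence archive.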

Hackathon Theme Alignment

  • Theme 2 (Long-horizon planning): CAD improves through repeated code edits and reward feedback.
  • Theme 3.1 (Professional world modeling): the agent must use real CadQuery tools and survive compiler, export, and mesh checks.
  • Theme 4 (Self-improvement): environment failures become new curriculum. The strict build-gated reward was created because the first dense reward was too forgiving.
  • Theme 5 (Wild Card): editable CAD generation is a practical, underexplored RLVE target.

The Environment Fights Back

The first dense GRPO reward gave useful shape feedback, but it still rewarded some non-buildable CAD. CADForge responded by tightening the rules:

  1. Buildability became the first gate.
  2. Failed CadQuery code receives a negative reward.
  3. Syntax errors, a missing fixture, undefined variables, and invented APIs are tracked as separate failure types.
  4. Successful builds unlock dense rewards for topology, semantics, reference similarity, contact, editability, and efficiency.

This produced useful GRPO variance: buildable CAD separated from pretty-but-broken code.
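The sketch below shows the gate and the GRPO variance in Python. It is illustrative and is not CADForge's runner. The failure categories mirror the list above. "Missing fixture" is read as the script not binding its result to a variable named `fixture`, which is an assumption. The penalty value and the equal dense weights are hypothetical. `group_advantages` is the standard GRPO group normalization: each reward minus the group mean, divided by the group standard deviation. It is included to show why gating produces useful variance.

```python
from __future__ import annotations

import ast
from statistics import mean, pstdev

# Hypothetical values; CADForge's actual penalty and weights are not stated here.
BUILD_FAILURE_REWARD = -1.0
DENSE_WEIGHTS = {"topology": 1.0, "semantics": 1.0, "reference": 1.0,
                 "contact": 1.0, "editability": 1.0, "efficiency": 1.0}


def classify_failure(source: str, result_name: str = "fixture") -> str | None:
    """Return a failure category for a candidate script, or None if it runs.

    WARNING: exec() is not a sandbox. A real runner should execute untrusted
    code in an isolated process with time and memory limits.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "syntax_error"
    namespace: dict = {}
    try:
        exec(compile(tree, "<candidate>", "exec"), namespace)
    except NameError:
        return "undefined_variable"
    except AttributeError:
        # Approximation: most often a method or attribute the library lacks.
        return "invented_api"
    except Exception:
        return "runtime_error"
    if result_name not in namespace:
        return "missing_fixture"  # assumed: result must be bound to `fixture`
    return None


def gated_reward(failure: str | None, dense_scores: dict[str, float],
                 weights: dict[str, float] = DENSE_WEIGHTS) -> float:
    """Buildability is the first gate; dense terms count only after a build."""
    if failure is not None:
        return BUILD_FAILURE_REWARD
    total = sum(weights.values())
    return sum(w * dense_scores.get(k, 0.0) for k, w in weights.items()) / total


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style group-relative advantages: (r - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Take an illustrative group of five completions in which only one builds, with rewards `[-1.0, -1.0, 0.6, -1.0, -1.0]`. The buildable completion gets an advantage of about 2.0, and each broken one gets about -0.5. The policy update therefore pushes toward the code that builds, not toward code that merely looks plausible.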


Legacy prototype notes follow.

Local prototype for a multi-step CADForge environment: prompt -> CSG/CAD actions -> geometry validation -> structural household part scoring.

Experiment 1 focuses on prompt-to-mechanical-design plus coarse 3D FEA. Experiment 2 keeps that renderer/verifier base but reframes the loop around reliable code-CAD behavior:

  • the agent plans small CAD operations,
  • the trace is treated like an AST/feature-tree construction episode,
  • the verifier reports CADForge metrics such as AST nodes, connected components, watertight/manifold proxy, editability proxy, and pseudo-OpenSCAD output,
  • structural MechForge feedback remains as the first physical reward suite.

Why This Exists

LLMs can often describe a chair, hook, or bracket, but they are unreliable at making CAD that builds, edits, exports, and stays physically coherent. CADForge turns those failure modes into reward:

  • no floating parts,
  • connected CSG/feature tree,
  • watertight/manifold exported geometry,
  • clean editable parameters,
  • manufacturable features,
  • structural safety under load.
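The "no floating parts" check comes down to counting the connected pieces of the exported mesh. Below is a minimal pure-Python sketch using union-find. It takes triangles as vertex-index triples and assumes duplicate vertices are already merged. Raw STL exports repeat vertices per triangle, so they must be welded first. On real meshes, trimesh's `Trimesh.split()` does the same job. The function name `count_components` is illustrative.

```python
from __future__ import annotations


def count_components(faces: list[tuple[int, int, int]]) -> int:
    """Count connected pieces of a triangle mesh.

    Triangles that share a vertex index are treated as connected, so vertices
    must already be merged. A result above 1 means floating geometry.
    """
    parent: dict[int, int] = {}

    def find(v: int) -> int:
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for a, b, c in faces:
        union(a, b)
        union(b, c)
    return len({find(v) for v in list(parent)})
```

For example, `count_components([(0, 1, 2), (3, 4, 5)])` returns 2, because two triangles with no shared vertex count as separate parts.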

The long-term target is an OpenEnv-compatible RLVE environment where an agent can take 100-300 CAD actions before committing a valid part.

OpenEnv Space

This directory is now a deployable OpenEnv environment named cadforge_cadquery. The action is a complete CadQuery Python file. The environment runs it through a constrained CadQuery runner, exports STL, scores build/topology/contact/task semantics/reference similarity/editability, and returns reward JSON plus verifier notes.

Local validation:

../.venv/bin/openenv validate .
PYTHONPATH=python_tools ../.venv/bin/uvicorn server.app:app --host 0.0.0.0 --port 8000
OPENENV_BASE_URL=http://localhost:8000 ../.venv/bin/python inference.py

Push to Hugging Face Spaces:

set -a; source ../.env; set +a
../.venv/bin/openenv push . --repo-id sanjuhs/cadforge-cadquery-openenv --interface

Setup

cp .env.example .env
# Either paste your OpenAI key into this .env, or keep it in the repo-root .env.
npm install
npm run dev

Open:

http://localhost:5177

The API listens on:

http://localhost:8791

What To Try

Chair benchmark:

Build a simple four-legged chair as editable code-CAD. It must support a 700 N seated load, include a seat panel, four connected legs, lower crossbars, and a backrest, fit inside a 500 mm x 500 mm x 900 mm envelope, and avoid floating parts.

Truss benchmark:

Build a simple lightweight truss support as code-CAD. Use connected triangular load paths, two fixed mounting holes on the left, a load boss on the right, and enough ribs/cross-members to carry a 250 N downward load with safety factor above 2.0.

Wall hook benchmark:

Build a wall-mounted J hook as code-CAD. It needs two screw holes, one connected curved hook arm, a rounded tip lip, and support ribs at the root. It must carry a 120 N hanging load and avoid floating or disconnected geometry.
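The MechForge solver checks these loads with coarse FEA. For a quick hand estimate of what a safety factor means, model the loaded arm as a straight rectangular cantilever. Its root bending stress is σ = M/Z = 6FL/(bh²). The hook dimensions and the 40 MPa yield strength below are hypothetical. The model also ignores hook curvature, stress concentrations at the screw holes, and the support ribs, so treat it as an order-of-magnitude check, not a substitute for the FEA.

```python
def cantilever_check(force_n: float, arm_mm: float, width_mm: float,
                     thickness_mm: float, yield_mpa: float) -> tuple[float, float]:
    """Root bending stress and safety factor for a tip-loaded rectangular cantilever.

    sigma = M / Z with M = F * L and Z = b * h**2 / 6. With newtons and
    millimetres, sigma comes out in N/mm^2, which equals MPa.
    """
    moment = force_n * arm_mm                            # N*mm at the root
    section_modulus = width_mm * thickness_mm**2 / 6.0   # mm^3
    stress = moment / section_modulus                    # MPa
    return stress, yield_mpa / stress


# Hypothetical hook: 120 N at 60 mm, 20 mm x 8 mm root section, 40 MPa yield.
stress, safety_factor = cantilever_check(120, 60, 20, 8, 40)
print(f"stress = {stress:.2f} MPa, safety factor = {safety_factor:.2f}")
```

With these numbers the stress is 33.75 MPa and the safety factor is about 1.19. This hypothetical arm would need a thicker root or ribs to reach a factor of 2.0.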

OpenSCAD Rendering

The UI includes an OpenSCAD code panel with:

  • Generate SCAD
  • Iterate SCAD
  • Render SCAD
  • Load Example

This is a real browser-side CSG renderer for a constrained OpenSCAD subset. It currently supports:

  • cube
  • sphere
  • cylinder
  • translate
  • rotate
  • scale
  • union
  • difference
  • intersection
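The renderer only accepts this subset, so it helps to reject unsupported calls before rendering. Below is a rough lexical sketch that treats any identifier followed by `(` as a call. It is not a parser: strings are not handled, and user-defined modules or functions are also flagged. The name `unsupported_calls` is illustrative.

```python
import re

# Calls the browser renderer supports, per the list above.
SUPPORTED = {"cube", "sphere", "cylinder", "translate", "rotate", "scale",
             "union", "difference", "intersection"}

CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")         # identifier followed by "("
COMMENTS = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)   # line and block comments


def unsupported_calls(scad: str) -> set[str]:
    """Return call names in `scad` that fall outside the supported subset."""
    code = COMMENTS.sub("", scad)
    return set(CALL.findall(code)) - SUPPORTED
```

For example, `unsupported_calls("linear_extrude(height=3) square(5);")` returns `{"linear_extrude", "square"}`.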

The renderer parses SCAD text and builds an actual Three.js mesh. Boolean operations use three-csg-ts.

Full OpenSCAD CLI rendering is not enabled yet because the openscad binary is not installed on the development machine. The UI and README should not claim full OpenSCAD compatibility until that dependency is available.

The server endpoints are:

POST /api/scad-generate
POST /api/scad-iterate

Both require the configured model API key and never return fallback or mock SCAD when the key is missing.

Current CADForge Metrics

The current prototype adds a cadforge block to each analysis result:

  • ast_nodes
  • connected_components
  • floating_parts
  • watertight_proxy
  • manifold_proxy
  • clean_feature_tree_proxy
  • named_parameter_count
  • editability_score
  • chair_core_features_passed
  • pseudo_openscad
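Two of these proxies can be computed from triangle connectivity alone. The sketch below uses common edge-counting definitions:

  • a closed surface uses every edge exactly twice;
  • an edge-manifold surface never uses an edge more than twice.

These definitions are assumptions and may differ from the ones CADForge uses. They also assume welded vertices. On real meshes, trimesh exposes a comparable `is_watertight` property.

```python
from __future__ import annotations

from collections import Counter


def edge_use_counts(faces: list[tuple[int, int, int]]) -> Counter:
    """Count how many triangles use each undirected edge."""
    counts: Counter = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts


def watertight_proxy(faces: list[tuple[int, int, int]]) -> bool:
    """Closed surface: every edge is shared by exactly two triangles."""
    counts = edge_use_counts(faces)
    return bool(counts) and all(n == 2 for n in counts.values())


def manifold_proxy(faces: list[tuple[int, int, int]]) -> bool:
    """Edge-manifold: no edge is shared by more than two triangles."""
    return all(n <= 2 for n in edge_use_counts(faces).values())
```

A closed tetrahedron, `[(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]`, passes both checks. Dropping any one face leaves three edges used only once, so `watertight_proxy` returns False while `manifold_proxy` still returns True.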

The cadforge metrics listed above are MVP proxies, not yet backed by a full OpenSCAD/trimesh compile. The next step is to replace the analysis proxies with a real pipeline:

CSG AST -> OpenSCAD/CadQuery -> STL/STEP -> trimesh/solid validation -> reward
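The first stage of that pipeline turns a CSG AST into OpenSCAD text, and a small tree type is enough to sketch it. The `Node` class below is hypothetical and is not the prototype's data structure. It also shows how an `ast_nodes`-style count falls out of the same tree.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical CSG AST node: an OpenSCAD-style call plus children."""
    op: str                                   # e.g. "cube", "translate", "difference"
    args: str = ""                            # argument text, emitted verbatim
    children: list[Node] = field(default_factory=list)

    def count(self) -> int:
        """Number of AST nodes in this subtree."""
        return 1 + sum(child.count() for child in self.children)

    def to_scad(self, indent: int = 0) -> str:
        """Emit OpenSCAD text for this subtree, two spaces per level."""
        pad = "  " * indent
        if not self.children:
            return f"{pad}{self.op}({self.args});"
        body = "\n".join(child.to_scad(indent + 1) for child in self.children)
        return f"{pad}{self.op}({self.args}) {{\n{body}\n{pad}}}"


# A plate with one mounting hole.
plate = Node("difference", children=[
    Node("cube", "[40, 20, 4]"),
    Node("translate", "[10, 10, 0]", [Node("cylinder", "h=4, r=2.5")]),
])
print(plate.count())
print(plate.to_scad())
```

Because the tree is kept separate from the emitted text, the same AST could later feed a CadQuery emitter instead of OpenSCAD.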

OpenEnv Direction

The final environment should expose actions such as:

  • add_cube
  • add_cylinder
  • translate
  • rotate
  • union
  • difference
  • add_mount_hole
  • add_rib
  • compile_cad
  • check_connected_components
  • check_watertight
  • check_editability
  • run_structural_check
  • commit_design

This gives judges the story they want:

The agent improves on a long-horizon world-modeling task where every CAD operation changes the physical world, and rewards come from objective geometric and structural checks.
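The action list above could be modeled as a validated action vocabulary plus an episode trace. The sketch below is hypothetical. The 300-step cap comes from the 100-300 action target stated earlier. The rule that `commit_design` requires prior compile and validation actions is an assumed policy, not a documented one. A fuller version would also require those checks to come after the last geometry edit.

```python
from dataclasses import dataclass, field

# Action names copied from the list above; their parameters are hypothetical.
ACTIONS = {
    "add_cube", "add_cylinder", "translate", "rotate", "union", "difference",
    "add_mount_hole", "add_rib", "compile_cad", "check_connected_components",
    "check_watertight", "check_editability", "run_structural_check",
    "commit_design",
}
# Assumed policy: these must appear in the trace before a commit is accepted.
REQUIRED_BEFORE_COMMIT = ("compile_cad", "check_connected_components", "check_watertight")


@dataclass
class Episode:
    """Minimal trace of one long-horizon CAD episode (illustrative only)."""
    max_steps: int = 300
    trace: list[tuple[str, dict]] = field(default_factory=list)
    committed: bool = False

    def step(self, name: str, **params) -> bool:
        """Record an action; return False if it is rejected."""
        if self.committed or name not in ACTIONS or len(self.trace) >= self.max_steps:
            return False
        if name == "commit_design":
            done = {action for action, _ in self.trace}
            if not all(req in done for req in REQUIRED_BEFORE_COMMIT):
                return False  # compile and validate before committing
            self.committed = True
        self.trace.append((name, params))
        return True
```

Rejecting actions outside the vocabulary keeps the action space closed. That makes each step in a long trace something the verifier can score.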

Python Solver

This copy still includes the MechForge Python solver under python_tools/mechforge. Prefer the repo-level Python 3.12 virtual environment:

UV_CACHE_DIR=.uv-cache uv venv --python python3.12 .venv
UV_CACHE_DIR=.uv-cache uv pip install numpy scipy pydantic fastapi uvicorn meshio gmsh scikit-fem cadquery openmdao openenv-core openai trimesh
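After installing, you can confirm that the packages resolve in the new environment with `importlib.util.find_spec`, which locates a module without importing it. The module list maps the install line to import names. `openenv-core` is left out because its import name is not given here. A found spec does not guarantee that native dependencies load cleanly; CadQuery's OpenCascade bindings are one example. The filename you save this under is up to you.

```python
import importlib.util

# Import names for the packages in the uv install line above
# (scikit-fem imports as "skfem").
MODULES = ["numpy", "scipy", "pydantic", "fastapi", "uvicorn", "meshio", "gmsh",
           "skfem", "cadquery", "openmdao", "openai", "trimesh"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the names the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(MODULES)
    print("all importable" if not missing else f"missing: {', '.join(missing)}")
```

Run it with `.venv/bin/python` so the check targets the same interpreter the solver uses.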

Headless smoke test:

PYTHONPATH=experiment-2-cadforge/python_tools .venv/bin/python -m mechforge.cli sample