---
title: CADForge CadQuery
emoji: 🪑
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- cadquery
- reinforcement-learning
---
# CADForge Experiment 2
CADForge is an OpenEnv environment for training LLMs to produce **editable, buildable CadQuery CAD**.
The agent receives a design request, writes a complete CadQuery Python file, and the environment runs real CAD tooling: CadQuery build, STL export, topology checks, semantic scoring, reference similarity, editability scoring, and persistent artifact logging.
## Judge-Facing Links
- **GitHub repo:** [sanjuhs/open-env-meta-final-hackathon](https://github.com/sanjuhs/open-env-meta-final-hackathon)
- **GitHub Gist: training scripts:** [CADForge OpenEnv SFT/GRPO scripts](https://gist.github.com/sanjuhs/10596f688e8b4560910a3b1b137bfeeb)
- **Raw training logs and evidence:** [sanjuhs/cadforge-training-evidence](https://huggingface.co/datasets/sanjuhs/cadforge-training-evidence)
- Training notebook on this HF Space: [training/cadforge_openenv_training_colab.ipynb](training/cadforge_openenv_training_colab.ipynb)
- Open the same notebook in Google Colab: [Colab training notebook](https://colab.research.google.com/github/sanjuhs/open-env-meta-final-hackathon/blob/main/training/cadforge_openenv_training_colab.ipynb)
- Mini-blog: [CADFORGE_BLOG.md](CADFORGE_BLOG.md)
- Detailed technical blog: [docs/detailed-blog/cadforge-detailed-blog.md](docs/detailed-blog/cadforge-detailed-blog.md)
- Full project report: [docs/cadforge-openenv-project-report.md](docs/cadforge-openenv-project-report.md)
- Self-improving RLVE design: [docs/brainstorm/21-cadforge-self-improving-rlve.md](docs/brainstorm/21-cadforge-self-improving-rlve.md)
- Strict GRPO training report: [training/reports/qwen35-9b-grpo-strict-build-20260426-strict-build/training_curve_report.md](https://huggingface.co/spaces/sanjuhs/cadforge-cadquery-openenv/blob/main/training/reports/qwen35-9b-grpo-strict-build-20260426-strict-build/training_curve_report.md)
- Strict GRPO eval report: [training/eval/qwen35-9b-cadforge-grpo-strict-build-20260426-strict-build/eval_report.md](training/eval/qwen35-9b-cadforge-grpo-strict-build-20260426-strict-build/eval_report.md)
- Inference comparison: [inference/results/stator-qwen-vs-frontier/report.md](https://huggingface.co/spaces/sanjuhs/cadforge-cadquery-openenv/blob/main/inference/results/stator-qwen-vs-frontier/report.md)
- Training dataset: [sanjuhs/cadforge-cadquery-agentic-traces](https://huggingface.co/datasets/sanjuhs/cadforge-cadquery-agentic-traces)
- Strict 9B GRPO LoRA: [sanjuhs/qwen35-9b-cadforge-grpo-strict-build-lora](https://huggingface.co/sanjuhs/qwen35-9b-cadforge-grpo-strict-build-lora)
- Adaptive repair GRPO LoRA: [sanjuhs/qwen35-9b-cadforge-grpo-adaptive-repair-lora](https://huggingface.co/sanjuhs/qwen35-9b-cadforge-grpo-adaptive-repair-lora)
**RunPod/H200 clarification:** the full 2B/9B SFT and GRPO runs were executed on a RunPod H200 as separate production scripts. The Colab notebook is the judge-runnable smoke path: it validates OpenEnv, the public dataset, the CadQuery reward backend, and tiny SFT/GRPO launches using those same scripts.
## Results Snapshot
| Run | Result |
|---|---|
| Qwen3.5-2B SFT | train loss `1.4480 -> 0.1658`, eval loss `0.4477 -> 0.2676` |
| Qwen3.5-2B dense GRPO | mean reward `0.3387`, best `0.5303`; useful reward signal but too forgiving on broken builds |
| Qwen3.5-9B SFT | train loss `2.6020 -> 0.1413`, eval loss `0.3650 -> 0.2398` |
| Qwen3.5-9B strict GRPO | `320` completions, `96` buildable, best CADForge score `0.9352` |
| Qwen3.5-9B adaptive repair GRPO | `180` repair completions, `53` buildable, `0` clipped completions |
| Strict 9B quick eval | `2/3` held-out prompts built successfully |
| Stator inference comparison | base Qwen failed build; RL-tuned Qwen built a `0.654` stator; GPT-5.4 built a `0.709` stator |
![Strict GRPO reward curve](https://huggingface.co/spaces/sanjuhs/cadforge-cadquery-openenv/resolve/main/training/reports/qwen35-9b-grpo-strict-build-20260426-strict-build/grpo_reward_curve.png)
![Strict GRPO code health](https://huggingface.co/spaces/sanjuhs/cadforge-cadquery-openenv/resolve/main/training/reports/qwen35-9b-grpo-strict-build-20260426-strict-build/grpo_code_health.png)
![Base Qwen vs RL-tuned Qwen vs GPT-5.4 stator comparison](https://huggingface.co/spaces/sanjuhs/cadforge-cadquery-openenv/resolve/main/inference/results/stator-qwen-vs-frontier/comparison.png)
## Training Logs
The raw logs are backed up separately so judges can inspect the training evidence without relying on screenshots:
- Evidence dataset: [sanjuhs/cadforge-training-evidence](https://huggingface.co/datasets/sanjuhs/cadforge-training-evidence)
- Compressed archive: `archives/cadforge-training-evidence-20260426.tar.gz`
- Key JSONL traces: `training/logs/*completions.jsonl`
The logs show the core result: dense GRPO had positive-looking reward but `0%` buildability; strict build-gating produced `96/320` buildable completions; adaptive repair fixed clipped outputs and produced `53/180` buildable repairs.
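Headline numbers like `96/320` buildable can be recomputed directly from the completion JSONL files. A minimal sketch, assuming each record carries a boolean `buildable` field and a float `reward` field (the real log schema may use different keys):

```python
import json

def buildability_stats(jsonl_lines):
    """Summarize build success over GRPO completion logs.

    Assumes `buildable` (bool) and `reward` (float) fields per record;
    adjust the keys to match the actual log schema.
    """
    total = built = 0
    rewards = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        total += 1
        built += bool(record.get("buildable"))
        rewards.append(float(record.get("reward", 0.0)))
    rate = built / total if total else 0.0
    mean_reward = sum(rewards) / len(rewards) if rewards else 0.0
    return {"total": total, "buildable": built,
            "build_rate": rate, "mean_reward": mean_reward}

sample = [
    '{"buildable": true, "reward": 0.93}',
    '{"buildable": false, "reward": -0.25}',
    '{"buildable": true, "reward": 0.51}',
]
print(buildability_stats(sample))
```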
## Hackathon Theme Alignment
- **Theme 2: Long-horizon planning**: CAD improves through repeated code edits and reward feedback.
- **Theme 3.1: Professional world modeling**: the agent must use real CadQuery tools and survive compiler/export/mesh checks.
- **Theme 4: Self-improvement**: environment failures become new curriculum. The strict build-gated reward was created because the first dense reward was too forgiving.
- **Theme 5: Wild Card**: editable CAD generation is a practical, underexplored RLVE target.
## The Environment Fights Back
The first dense GRPO reward gave useful shape feedback, but it still rewarded some non-buildable CAD. CADForge responded by tightening the rules:
1. Buildability became the first gate.
2. Failed CadQuery code receives negative reward.
3. Syntax errors, missing `fixture`, undefined variables, and invented APIs are tracked separately.
4. Successful builds unlock dense rewards for topology, semantics, reference similarity, contact, editability, and efficiency.
This produced useful GRPO variance: buildable CAD separated from pretty-but-broken code.
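The gating logic above can be sketched in a few lines. The component names and weights here are illustrative, not the production values:

```python
def strict_build_gated_reward(built, dense_scores, failure_penalty=-0.5):
    """Build-gated reward sketch: dense terms only apply after a
    successful build; a failed build gets a flat negative reward.

    `dense_scores` maps component name -> score in [0, 1]. The weights
    below are hypothetical placeholders for the real CADForge scorer.
    """
    if not built:
        return failure_penalty
    weights = {
        "topology": 0.25, "semantics": 0.25, "reference": 0.2,
        "contact": 0.1, "editability": 0.1, "efficiency": 0.1,
    }
    return sum(weights[k] * dense_scores.get(k, 0.0) for k in weights)

print(strict_build_gated_reward(False, {}))  # failed build -> -0.5
print(strict_build_gated_reward(True, {"topology": 1.0, "semantics": 0.8}))
```

Because broken builds land strictly below every successful build, GRPO advantage estimates separate the two groups instead of averaging them together.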
---
Legacy prototype notes follow.
Local prototype for a multi-step CADForge environment: prompt -> CSG/CAD actions -> geometry validation -> structural scoring of household parts.
Experiment 1 focuses on prompt-to-mechanical-design plus coarse 3D FEA. Experiment 2 keeps that renderer/verifier base, but reframes the loop around reliable code-CAD behavior:
- the agent plans small CAD operations,
- the trace is treated like an AST/feature-tree construction episode,
- the verifier reports CADForge metrics such as AST nodes, connected components, watertight/manifold proxy, editability proxy, and pseudo-OpenSCAD output,
- structural MechForge feedback remains as the first physical reward suite.
## Why This Exists
LLMs can often describe a chair, hook, or bracket, but they are unreliable at producing CAD that builds, edits cleanly, exports, and stays physically coherent. CADForge turns those failure modes into reward:
- no floating parts,
- connected CSG/feature tree,
- watertight/manifold exported geometry,
- clean editable parameters,
- manufacturable features,
- structural safety under load.
The long-term target is an OpenEnv-compatible RLVE environment where an agent can take 100-300 CAD actions before committing a valid part.
## OpenEnv Space
This directory is now a deployable OpenEnv environment named `cadforge_cadquery`.
The action is a complete CadQuery Python file. The environment runs it through a constrained CadQuery runner, exports STL, scores build, topology, contact, task semantics, reference similarity, and editability, and returns reward JSON plus verifier notes.
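A minimal sketch of what such an action looks like on the wire. The payload field names (`action`, `code`) are assumptions; check the Space's OpenEnv interface for the real schema, and note the CadQuery source is sent as one string:

```python
import json

# Hypothetical action payload: the full source of a CadQuery file as a
# string. Field names are assumptions, not the confirmed wire format.
CADQUERY_ACTION = '''\
import cadquery as cq

# Parametric bracket: named parameters keep the part editable.
length, width, thickness, hole_d = 80.0, 40.0, 6.0, 5.0

result = (
    cq.Workplane("XY")
    .box(length, width, thickness)
    .faces(">Z")
    .workplane()
    .pushPoints([(-30, 0), (30, 0)])
    .hole(hole_d)
)
'''

payload = json.dumps({"action": {"code": CADQUERY_ACTION}})
decoded = json.loads(payload)
print(decoded["action"]["code"].splitlines()[0])  # import cadquery as cq
```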
Local validation:
```bash
../.venv/bin/openenv validate .
PYTHONPATH=python_tools ../.venv/bin/uvicorn server.app:app --host 0.0.0.0 --port 8000
OPENENV_BASE_URL=http://localhost:8000 ../.venv/bin/python inference.py
```
Push to Hugging Face Spaces:
```bash
set -a; source ../.env; set +a
../.venv/bin/openenv push . --repo-id sanjuhs/cadforge-cadquery-openenv --interface
```
## Setup
```bash
cp .env.example .env
# Either paste your OpenAI key into this .env, or keep it in the repo-root .env.
npm install
npm run dev
```
Open:
```text
http://localhost:5177
```
The API listens on:
```text
http://localhost:8791
```
## What To Try
Chair benchmark:
```text
Build a simple four-legged chair as editable code-CAD. It must support a 700 N seated load, include a seat panel, four connected legs, lower crossbars, and a backrest, fit inside a 500 mm x 500 mm x 900 mm envelope, and avoid floating parts.
```
Truss benchmark:
```text
Build a simple lightweight truss support as code-CAD. Use connected triangular load paths, two fixed mounting holes on the left, a load boss on the right, and enough ribs/cross-members to carry a 250 N downward load with safety factor above 2.0.
```
Wall hook benchmark:
```text
Build a wall-mounted J hook as code-CAD. It needs two screw holes, one connected curved hook arm, a rounded tip lip, and support ribs at the root. It must carry a 120 N hanging load and avoid floating or disconnected geometry.
```
## OpenSCAD Rendering
The UI includes an OpenSCAD code panel with:
- `Generate SCAD`
- `Iterate SCAD`
- `Render SCAD`
- `Load Example`
This is a real browser-side CSG renderer for a constrained OpenSCAD subset. It currently supports:
- `cube`
- `sphere`
- `cylinder`
- `translate`
- `rotate`
- `scale`
- `union`
- `difference`
- `intersection`
The renderer parses SCAD text and builds an actual Three.js mesh. Boolean operations use `three-csg-ts`.
Full OpenSCAD CLI rendering is not enabled yet because `openscad` is not installed on this machine. The UI and README should not claim full OpenSCAD compatibility until that real dependency is available.
The server endpoints are:
```text
POST /api/scad-generate
POST /api/scad-iterate
```
Both use the configured model API key. They do not return fallback or mock SCAD when the key is missing.
## Current CADForge Metrics
The current prototype adds a `cadforge` block to each analysis result:
- `ast_nodes`
- `connected_components`
- `floating_parts`
- `watertight_proxy`
- `manifold_proxy`
- `clean_feature_tree_proxy`
- `named_parameter_count`
- `editability_score`
- `chair_core_features_passed`
- `pseudo_openscad`
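To make proxies like `named_parameter_count` and `editability_score` concrete, here is one way such a proxy could be computed with Python's `ast` module. This is a hypothetical sketch, not the actual CADForge scorer, which is richer:

```python
import ast

def editability_proxy(source):
    """Rough editability proxy: count top-level `name = <number>`
    assignments, then saturate into a [0, 1] score. Illustrative only;
    the real scorer weighs more than named numeric parameters.
    """
    tree = ast.parse(source)
    named = 0
    for node in tree.body:
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            target, value = node.targets[0], node.value
            if (isinstance(target, ast.Name)
                    and isinstance(value, ast.Constant)
                    and isinstance(value.value, (int, float))):
                named += 1
    # Assumed threshold: 5+ named parameters counts as fully editable.
    return named, min(1.0, named / 5.0)

code = "leg_height = 450\nseat_w = 400.0\nresult = leg_height + seat_w\n"
print(editability_proxy(code))  # (2, 0.4)
```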
These are MVP proxies, not a full OpenSCAD/trimesh compile yet. The next step is to replace the analysis proxies with:
```text
CSG AST -> OpenSCAD/CadQuery -> STL/STEP -> trimesh/solid validation -> reward
```
## OpenEnv Direction
The final environment should expose actions such as:
- `add_cube`
- `add_cylinder`
- `translate`
- `rotate`
- `union`
- `difference`
- `add_mount_hole`
- `add_rib`
- `compile_cad`
- `check_connected_components`
- `check_watertight`
- `check_editability`
- `run_structural_check`
- `commit_design`
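The action set above can be sketched as a toy episode state: parts are nodes, boolean ops connect them, and `check_connected_components` reduces to a union-find query. This is an illustrative stand-in; the real environment checks actual geometric overlap, not declared links:

```python
class DesignState:
    """Toy episode state: `add_*` actions create parts, `union` welds
    them, and connected-component count flags floating geometry."""

    def __init__(self):
        self.parts = []    # part ids in creation order
        self.parent = {}   # union-find forest over part ids

    def add_part(self, kind):
        pid = f"{kind}_{len(self.parts)}"
        self.parts.append(pid)
        self.parent[pid] = pid
        return pid

    def _find(self, p):
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]  # path halving
            p = self.parent[p]
        return p

    def union(self, a, b):
        self.parent[self._find(a)] = self._find(b)

    def connected_components(self):
        return len({self._find(p) for p in self.parts})

state = DesignState()
seat = state.add_part("cube")
leg1 = state.add_part("cylinder")
leg2 = state.add_part("cylinder")
state.union(seat, leg1)              # weld one leg to the seat
print(state.connected_components())  # leg2 still floating -> 2
```

A `commit_design` action would then gate on `connected_components() == 1` before handing off to the watertight and structural checks.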
This gives judges the story they want:
> The agent improves on a long-horizon world-modeling task where every CAD operation changes the physical world, and rewards come from objective geometric and structural checks.
## Python Solver
This copy still includes the MechForge Python solver under `python_tools/mechforge`. Prefer the repo-level Python 3.12 virtual environment:
```bash
UV_CACHE_DIR=.uv-cache uv venv --python python3.12 .venv
UV_CACHE_DIR=.uv-cache uv pip install numpy scipy pydantic fastapi uvicorn meshio gmsh scikit-fem cadquery openmdao openenv-core openai trimesh
```
Headless smoke test:
```bash
PYTHONPATH=experiment-2-cadforge/python_tools .venv/bin/python -m mechforge.cli sample
```