# CADForge Inference Comparisons
This folder contains local inference/evaluation scripts for comparing generated CadQuery outputs.
The main benchmark is:

```bash
.venv/bin/python inference/compare_cadquery_models.py --baseline-source ollama
```
It compares three candidates on the same axial_motor_stator_12_slot task:

- Base Qwen: generated live through local Ollama, default `qwen3.5:9b` (see the generation sketch after this list).
- RL-tuned Qwen: saved strict build-gated GRPO held-out stator artifact.
- GPT-5.4: saved frontier baseline artifact by default, or live OpenAI generation with `--gpt-source openai` and `OPENAI_API_KEY`.
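As a rough illustration of the live base-Qwen path, here is a minimal sketch of a generation call against Ollama's default local REST endpoint. The prompt string is a placeholder and the response handling is simplified; the real prompt and parsing live in `inference/compare_cadquery_models.py`.

```python
# Minimal sketch of a live Ollama generation call. Assumes Ollama is running
# locally on its default port; prompt and parsing here are placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def generate_candidate(prompt: str, model: str = "qwen3.5:9b") -> str:
    """Ask the local Ollama server for a single raw completion."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # raw model text; candidate.py is extracted from this


code = generate_candidate(
    "Write CadQuery code for a 12-slot axial motor stator."  # placeholder prompt
)
```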
Outputs are written under `inference/results/<run-id>/`:

- `report.md`
- `comparison.png`
- `results.json` (a loading sketch follows this list)
- per-model `candidate.py`, `reward.json`, STL files, and render images
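To inspect a finished run programmatically, something like the following works. Both the `results.json` schema and the per-model directory layout are assumptions here; check a real run's output for the actual structure.

```python
# Minimal sketch for inspecting a finished run. The results.json schema and
# the per-model subdirectory layout are assumptions; verify against a real run.
import json
from pathlib import Path

run_dir = Path("inference/results") / "stator-qwen-vs-frontier"  # run id from below

results = json.loads((run_dir / "results.json").read_text())
print(json.dumps(results, indent=2))  # dump whatever structure the run wrote

# Per-model artifacts are assumed to sit in one subdirectory per candidate.
for reward_file in run_dir.glob("*/reward.json"):
    print(reward_file, reward_file.read_text()[:200])
```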
Important: the default run is a reproducible local comparison using one live base-Qwen generation plus saved trained/frontier artifacts. It is not a broad benchmark. The right claim is that CADForge makes a small Qwen model competitive on buildable, editable code-CAD behavior for a medium-difficulty part family, not that it beats frontier models globally.
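The score columns in the table below follow this build-gated shape: a candidate that fails to build bottoms out at -1.0, and only buildable code earns semantic and editability credit. A minimal sketch of that shape, with hypothetical weights (the repo's actual weighting may differ, so this will not reproduce the table's totals exactly):

```python
# Sketch of a strict build-gated reward, mirroring the table below: code that
# fails to build scores -1.0 outright; buildable code earns weighted credit.
# WEIGHTS are hypothetical, not the repo's actual values.
WEIGHTS = {"build": 0.2, "semantic": 0.4, "editability": 0.4}  # assumption


def total_reward(built: bool, semantic: float, editability: float) -> float:
    if not built:
        return -1.0  # strict gate: unbuildable code gets the floor score
    return (
        WEIGHTS["build"] * 1.0
        + WEIGHTS["semantic"] * semantic
        + WEIGHTS["editability"] * editability
    )


# Illustrative only: a buildable candidate with the RL-tuned Qwen's components.
print(total_reward(True, semantic=0.300, editability=0.825))
```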
## Current Stator Result
Latest local run:
- Report: results/stator-qwen-vs-frontier/report.md
- Comparison image: results/stator-qwen-vs-frontier/comparison.png
| Model | Total | Build | Semantic | Editability |
|---|---|---|---|---|
| Base Qwen | -1.000 | 0.0 | 0.000 | 0.000 |
| RL-tuned Qwen | 0.654 | 1.0 | 0.300 | 0.825 |
| GPT-5.4 | 0.709 | 1.0 | 0.638 | 0.825 |
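For a sense of the part family, here is a minimal, illustrative CadQuery sketch of a 12-slot stator-like disc. It is not the benchmark's task spec and not any candidate's output; all dimensions are arbitrary.

```python
# Illustrative CadQuery sketch of a 12-slot stator-like disc. NOT the benchmark
# task spec or any model's output; every dimension below is arbitrary.
import cadquery as cq

OUTER_D = 100.0  # stator outer diameter
BORE_D = 30.0    # central bore diameter
HEIGHT = 15.0    # axial height
SLOTS = 12       # slot count, matching the task family name

stator = (
    cq.Workplane("XY")
    .circle(OUTER_D / 2)
    .extrude(HEIGHT)
    .faces(">Z")
    .workplane()
    .hole(BORE_D)  # central bore, cut through the full height
    .faces(">Z")
    .workplane()
    .polarArray(radius=OUTER_D / 2 - 12, startAngle=0, angle=360, count=SLOTS)
    .rect(6.0, 14.0)
    .cutThruAll()  # twelve radial slots arranged around the disc
)

cq.exporters.export(stator, "stator_sketch.stl")
```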
