sanjuhs committed on
Commit ca16fdf · verified · 1 Parent(s): 00580e9

Upload CADForge judge evidence docs

Files changed (36)
  1. docs/best-example-project.md +374 -0
  2. docs/brainstorm/00-hackathon-readout.md +75 -0
  3. docs/brainstorm/01-idea-scorecard.md +59 -0
  4. docs/brainstorm/02-recommended-idea-regulatory-dossier-control-room.md +272 -0
  5. docs/brainstorm/03-rapid-build-plan.md +164 -0
  6. docs/brainstorm/04-physics-design-environments.md +258 -0
  7. docs/brainstorm/05-mechforge-rendering-stack.md +85 -0
  8. docs/brainstorm/06-production-simulation-stack.md +207 -0
  9. docs/brainstorm/07-mechforge-domain-choice.md +169 -0
  10. docs/brainstorm/08-agentic-3d-engineering-environment.md +249 -0
  11. docs/brainstorm/09-cad-rlve-structural-household-parts.md +472 -0
  12. docs/brainstorm/10-cadforge-rlve-environment.md +564 -0
  13. docs/brainstorm/11-reference-model-reward-pipeline.md +192 -0
  14. docs/brainstorm/12-markus-chair-scope-grpo-rlve.md +161 -0
  15. docs/brainstorm/13-markus-chair-cadquery-grpo-rlve-plan.md +799 -0
  16. docs/brainstorm/14-cadquery-sft-grpo-rlve-training-plan.md +295 -0
  17. docs/brainstorm/15-cadquery-agentic-traces-sft-grpo-plan.md +246 -0
  18. docs/brainstorm/16-tonight-execution-plan.md +140 -0
  19. docs/brainstorm/17-cadquery-reward-functions-deep-dive.md +272 -0
  20. docs/brainstorm/18-how-sft-and-grpo-data-works.md +192 -0
  21. docs/brainstorm/19-qwen35-2b-9b-cadforge-sft-grpo-runpod-plan.md +224 -0
  22. docs/brainstorm/20-cadforge-qwen-training-runbook.md +380 -0
  23. docs/cadforge-openenv-project-report.md +1 -1
  24. docs/cadforge-submission-checklist.md +71 -0
  25. docs/competiton-round1/COMPETITION_REQUIREMENTS.md +69 -0
  26. docs/competiton-round1/inference-script-example.md +189 -0
  27. docs/competiton-round1/objective.md +581 -0
  28. docs/competiton-round1/pre-vaidationscript-example.md +185 -0
  29. docs/detailed-blog/cadforge-detailed-blog.md +1 -1
  30. docs/doc-edit-game-v2.md +149 -0
  31. docs/docs-guide.md +1 -0
  32. docs/final-postmortem-round1.md +240 -0
  33. docs/hackathon_help_guide.md +425 -0
  34. docs/judging_criteria.md +166 -0
  35. docs/project-setup.md +3 -0
  36. docs/round1-corrections.md +32 -0
docs/best-example-project.md ADDED
@@ -0,0 +1,374 @@
https://github.com/sid-rp/kube-sre-gym


---
title: Kube SRE Gym
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---

# Kube SRE Gym

### Can a 0.6B model learn to be an on-call SRE — from scratch?

We gave a tiny language model a pager, a live Kubernetes cluster, and zero knowledge of what a pod even is. No pre-training on DevOps docs. No few-shot examples. Just a PagerDuty alert and a `kubectl` prompt.

Within 8 episodes, it learned to discover namespaces, read pod statuses, identify OOMKills from CrashLoopBackOffs, and apply the correct fix. By episode 4, it was resolving incidents faster than our hand-written baselines.

**This is Kube SRE Gym** — a self-improving environment where an RL agent learns to diagnose and fix real production Kubernetes failures through adversarial self-play, curriculum-driven difficulty, and GRPO.

> **1st Place, OpenEnv Hackathon** (PyTorch + Cerebral Valley, $15K prize) | Built with [OpenEnv v0.2.1](https://github.com/meta-pytorch/OpenEnv/tree/v0.2.1) | Deployed on [HF Spaces](https://huggingface.co/spaces/openenv-community/kube-sre-gym) | Training via [HF TRL](https://github.com/huggingface/trl) in [Colab](kube_sre_gym_colab.ipynb)

[![Hackathon Winner](https://raw.githubusercontent.com/sid-rp/kube-sre-gym/main/assets/hackathon_winner.png)](https://cerebralvalley.ai/e/openenv-hackathon-sf/hackathon/gallery)

---

## The Story: From Blind to On-Call

### Act 1: The Cold Start

Episode 1. The agent receives its first alert: *"CRITICAL: payment-gateway pods OOMKilled in payments namespace."*

It has never seen Kubernetes before. It doesn't know what namespaces are, what pods look like, or that `kubectl` even exists. It tries random commands. Everything fails. Reward: **-2.0**.

### Act 2: First Light

Episode 4. Something clicks. The agent discovers `kubectl get pods -A` — a single command that reveals the entire cluster. It sees `OOMKilled` in the STATUS column. It connects this to the alert. It runs `kubectl set resources deployment/payment-gateway --limits=memory=128Mi -n payments`.

The pod restarts. The health check passes. The LLM judge confirms resolution. Reward: **+3.95**.

### Act 3: The Environment Fights Back

As the agent masters simple faults, the **Adversarial Designer** (Claude) notices. It starts creating compound incidents — an OOMKill in `payments` *and* a bad image in `frontend` simultaneously. Red herrings appear. The agent must learn to triage, not just react.

The **Curriculum Controller** tracks per-fault-type mastery and escalates: warmup → beginner → intermediate → advanced → expert. The training distribution adapts in real-time. No scenario is ever repeated.

### Act 4: The Environment Improves Itself

Here's what made this project different from what we planned: **the environment itself had bugs that training exposed.**

During training, we discovered our kubectl command parser only accepted `deployment/name` format (with a slash). The model kept sending perfectly valid `kubectl scale deployment frontend-cache --replicas=1` — and the environment rejected it every time. The model was right. Our environment was wrong.

We also found the LLM judge was truncating cluster snapshots at 2000 chars, cutting off pods alphabetically after `payment-*`. And a race condition between health checks and judge API calls was causing false negatives — pods would appear healthy during the health check but unhealthy by the time the judge snapshot ran.

**The agent's failures taught us to fix the environment.** This is the self-improvement loop we didn't expect — not just the model getting better, but the training infrastructure co-evolving with it.

---

## Problem Statements Addressed

### Primary: Statement 4 — Self-Improvement

Kube SRE Gym is an environment where the agent **generates its own challenges, escalates difficulty, and improves through adaptive curricula** — exactly the recursive skill amplification described in Statement 4.

- **Adversarial self-play**: Claude designs incidents that target the agent's tracked weaknesses
- **Automatic curriculum**: Difficulty escalates as per-fault-type mastery improves (warmup → beginner → intermediate → advanced → expert)
- **No manual authoring**: The training distribution adapts as the agent learns — infinite novel scenarios
- **Co-evolutionary improvement**: Training runs exposed environment bugs, making the platform itself better

### Secondary: Statement 3.1 — World Modeling / Professional Tasks

The agent interacts with **real Kubernetes tools and APIs** — not mocked responses or shortcuts. It must maintain internal state across multi-step kubectl workflows and reason about causal effects of its actions on a live cluster.

- **Real tool interaction**: Every `kubectl` command executes against a live GKE cluster
- **Multi-step workflows**: Triage → investigate → fix → verify, with no shortcuts
- **Persistent world state**: Pod restarts, OOM events, and cascading failures are real K8s events

### Partner Sub-Theme: Snorkel AI — Simulated Experts-in-the-Loop

The LLM judge uses **three expert personas** (Junior, Senior, Principal) with progressively stricter evaluation criteria, simulating interaction with subject-matter experts whose requirements change as the agent improves:

- **Junior**: Lenient scoring, partial credit, provides hints
- **Senior**: Standard SRE expectations, rewards systematic diagnosis
- **Principal**: High standards, penalizes inefficiency, rewards elegant fixes

---

## How It Works

```
┌─────────────────────────────────────────────────────────────────────┐
│                         SELF-IMPROVING LOOP                         │
│                                                                     │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌────────────┐     │
│  │Adversarial│──►│ Real GKE  │──►│   Agent   │──►│ LLM Judge  │     │
│  │ Designer  │   │  Cluster  │   │(Qwen 1.7B │   │(Claude/    │     │
│  │ (Claude)  │   │           │   │ + LoRA)   │   │ Qwen 14B)  │     │
│  └─────▲─────┘   └───────────┘   └─────┬─────┘   └─────┬──────┘     │
│        │                               │               │            │
│        │        ┌──────────────┐      │ reward        │            │
│        │        │  Curriculum  │◄─────┴───────────────┘            │
│        └────────│  Controller  │                                   │
│   weak spots    │   (mastery   │──► GRPO gradient update           │
│   & difficulty  │   tracking)  │    (TRL + vLLM on H100)           │
│                 └──────────────┘                                   │
└─────────────────────────────────────────────────────────────────────┘
```

### The Loop

1. **Adversarial Designer** (Claude) creates targeted incidents based on the agent's weak spots — single faults for warmup, multi-fault cascading failures for harder tiers
2. **Fault Injection** executes real `kubectl` commands against a live GKE cluster (set memory to 4Mi, inject bad images, corrupt env vars, scale to zero)
3. **Agent** (Qwen3-1.7B + LoRA) receives a PagerDuty-style alert and must diagnose + fix using only kubectl commands — no hints about cluster topology
4. **LLM Judge** scores each action for SRE workflow correctness (triage → investigate → fix → verify) and verifies resolution by checking actual cluster state
5. **Curriculum Controller** tracks per-fault-type mastery and escalates difficulty — the agent gets harder scenarios as it improves
6. **GRPO** computes advantages across 8 parallel rollouts and updates the policy — the agent gets better at fixing incidents it previously failed
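
Step 6 compresses the core of GRPO into one sentence, so here is a minimal sketch of the group-relative advantage computation it refers to. This illustrates the idea only; the repository's actual training loop lives in `train.py`, and TRL's `GRPOTrainer` performs this normalization internally.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: score each rollout against its own group.

    With 8 rollouts per incident, a rollout's advantage is its reward
    standardized by the group mean and standard deviation, so no learned
    value function is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mean) / std for r in rewards]

# Example: 8 rollouts on the same alert; only some resolve the incident.
print(group_relative_advantages([6.58, -2.0, 5.45, -2.5, 6.79, 1.8, 6.35, 5.38]))
```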

### What Makes This Different

- **Real cluster, not a simulator** — kubectl commands execute against live GKE pods. OOMKills, CrashLoopBackOffs, and ImagePullBackOffs are real Kubernetes events
- **Self-generating scenarios** — the adversarial designer creates new incident types targeting the agent's weaknesses, so the training distribution adapts as the agent learns
- **Multi-layer verification** — programmatic health checks (expected pod count, restart tracking, OOM detection) + LLM judge verification prevents false resolution
- **No hardcoded knowledge** — the agent prompt contains zero information about cluster topology, namespace names, or deployment details. It must discover everything via `kubectl get pods -A`
- **Environment co-evolution** — training revealed bugs in our own infrastructure, making the platform better alongside the agent

---

## Architecture

```
H100 GPU (80GB)                                  GKE Cluster (3 namespaces)
┌──────────────────────────────────┐             ┌─────────────────────────┐
│                                  │             │ payments/               │
│ OpenEnv Server :8000             │   K8s API   │   payment-api (Flask)   │
│ ├─ Environment (reset/step)      │◄───────────►│   payment-gateway       │
│ ├─ Fault Injector                │             │   payment-worker        │
│ ├─ Curriculum Controller         │             │                         │
│ ├─ Adversarial Designer ─────────┼──► Claude   │ frontend/               │
│ └─ LLM Judge ────────────────────┼──► Claude   │   web-app (nginx)       │
│                                  │             │   frontend-cache        │
│ GRPO Trainer (TRL 0.29.0)        │             │                         │
│ ├─ Qwen3-1.7B + LoRA (BF16)      │             │ auth/                   │
│ ├─ vLLM colocate (inference)     │             │   auth-service          │
│ └─ 8 rollouts × grad_accum=8     │             └─────────────────────────┘
│                                  │
└──────────────────────────────────┘
```

## Failure Types

| Type | What Gets Injected | What Agent Must Do |
|------|--------------------|---------------------|
| `oom_kill` | Memory limit set to 4Mi | Increase to 128Mi via `kubectl set resources` |
| `crashloop` | Container command set to `exit 1` | Remove bad command via `kubectl patch` |
| `image_pull` | Image set to `nginx:nonexistent-tag-99999` | Fix image tag via `kubectl set image` |
| `bad_config` | DATABASE_URL pointed to `wrong-host.invalid` | Correct env var via `kubectl set env` |
| `scale_zero` | Replicas set to 0 | Scale back up via `kubectl scale` |
| `liveness_probe` | Probe path set to `/nonexistent` | Fix probe via `kubectl patch` |
| `multi-fault` | 2-3 faults across different namespaces | Find and fix ALL faults |

## Training Signal

The reward function has multiple layers to ensure clean GRPO signal:

- **Per-step LLM judge score** (-1.0 to +1.0) — evaluates SRE workflow quality (phase-aware: triage, investigate, fix, verify)
- **Repeat penalty** — -0.15 per repeated command (teaches exploration over repetition)
- **Resolution bonus** — +1.0 to +5.0 for confirmed fixes (efficiency-scaled: faster fixes get higher bonuses)
- **Timeout penalty** — failed episodes wiped to net -2.0 total reward
- **Judge verification** — LLM confirms fix is real by reviewing cluster state + action history
- **Phase-order bonus** — +0.2 for following correct SRE workflow, -0.3 for skipping phases

This produces clear separation: successful episodes score +3 to +8, failed episodes score -2.0. GRPO needs this variance to compute meaningful advantages.
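
For concreteness, here is a sketch of how the layers above could compose into one episode return. The real computation happens server-side in the environment; this function only mirrors the numbers quoted in this section, and its name, signature, and exact scaling are assumptions.

```python
def episode_reward(judge_scores: list[float], repeats: int, resolved: bool,
                   steps_used: int, max_steps: int = 16,
                   phase_order_ok: bool = True) -> float:
    """Illustrative composition of the reward layers listed above."""
    if not resolved and steps_used >= max_steps:
        return -2.0                              # timeout: episode wiped to net -2.0

    total = sum(judge_scores)                    # per-step judge scores, each in [-1, +1]
    total -= 0.15 * repeats                      # repeat-command penalty
    total += 0.2 if phase_order_ok else -0.3     # phase-order bonus / skip penalty
    if resolved:
        efficiency = 1.0 - steps_used / max_steps
        total += 1.0 + 4.0 * efficiency          # resolution bonus scaled into [+1, +5]
    return total
```

Under these assumptions a fast confirmed fix lands in the +3 to +8 band while timeouts pin at -2.0, which is exactly the separation described above.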

---

## Results

### Training Run 1: Qwen2.5-1.5B — The Cold Start

![Qwen2.5-1.5B Reward Curve](https://raw.githubusercontent.com/sid-rp/kube-sre-gym/main/assets/reward_curve_qwen2.5_1.5b.png)

Our first attempt. 12 episodes, massive variance swinging between -7.5 and +3.7. The upward trend (+0.447/ep) was encouraging — the model *was* learning — but the signal was too noisy. We traced this to **environment bugs**: our command parser rejected valid kubectl syntax, the error penalty override was masking real progress, and the judge was truncating cluster snapshots.

The model was fighting two battles: learning Kubernetes AND working around our broken environment.

### Training Run 2: Qwen3-1.7B — Too Much Reward, Too Soon

![Qwen3-1.7B Reward Curve](https://raw.githubusercontent.com/sid-rp/kube-sre-gym/main/assets/reward_curve_qwen3_1.7b.png)

After fixing the environment bugs, we switched to Qwen3-1.7B. It started strong (avg ~5.0) but the reward signal was *too generous* — the model found a plateau at 3.0-3.5 and stopped improving. The slight downward trend (-0.073/ep) over 29 episodes told us the curriculum wasn't pushing hard enough.

This run taught us that **a good environment needs to fight back**. We tightened the reward function, added repeat-command penalties, and activated adversarial mode.

### Training Run 3: Qwen3-1.7B — Environment Fights Back (Ongoing)

Current run with all fixes applied — adversarial scenarios, tighter rewards, repeat-command circuit breaker:

| Episode | Reward | Diagnosis | Fix |
|---------|--------|-----------|-----|
| 1 | +1.80 | 0.30 | -0.10 |
| 2 | +5.38 | 0.30 | +0.10 |
| 3 | -2.50 | 0.70 | 0.00 |
| 4 | **+6.58** | 0.70 | -0.60 |
| 5 | +5.45 | 0.70 | 0.00 |
| 6 | -2.00 | 0.55 | -0.60 |
| 7 | **+6.79** | 0.70 | +0.50 |
| 8 | +6.35 | 0.20 | +0.40 |

**Mean: 3.48 | Best: 6.79** — with real adversarial difficulty. The high-variance episodes (ep3 and ep6 are negative; ep4 and ep7 are above +6.5) show GRPO is getting the signal variance it needs to compute meaningful advantages.

### What the agent learned (from reward signal alone)

1. Run `kubectl get pods -A` to discover cluster topology
2. Identify fault types from pod STATUS column (OOMKilled, ImagePullBackOff, CrashLoopBackOff)
3. Map fault types to correct fix commands (`set resources`, `set image`, `patch`, `scale`)
4. Check ALL namespaces after each fix — there may be multiple faults
5. Never repeat a failed command — try a different approach

### What we learned (from the agent's failures)

1. Our command parser was too strict — valid kubectl syntax was being rejected
2. Judge snapshot truncation hid pods alphabetically after `payment-*`
3. Error penalty override was masking real progress with false negatives
4. Too-generous rewards cause plateaus — the environment must fight back
5. The environment needs to evolve alongside the agent — static environments miss bugs

---

## Training with HF TRL (Colab)

A complete training notebook is provided at [`kube_sre_gym_colab.ipynb`](kube_sre_gym_colab.ipynb) using **HF TRL's GRPO** implementation. The notebook covers:

1. Connect to the OpenEnv server on HF Spaces
2. Configure GRPO training with TRL (`GRPOConfig`, `GRPOTrainer`)
3. Run training episodes against the live environment
4. Save checkpoints to HuggingFace Hub

Training uses TRL's experimental OpenEnv integration (`trl.experimental.openenv.generate_rollout_completions`) for seamless environment-trainer communication.

## Quick Start

```python
from kube_sre_gym import KubeSreGymAction, KubeSreGymEnv

with KubeSreGymEnv(base_url="http://localhost:8000") as client:
    obs = client.reset()
    print(obs.observation.command_output)  # PagerDuty alert

    obs = client.step(KubeSreGymAction(command="kubectl get pods -A"))
    obs = client.step(KubeSreGymAction(command="kubectl describe pod payment-api-xxx -n payments"))
    obs = client.step(KubeSreGymAction(command="fix: kubectl set resources deployment/payment-api --limits=memory=128Mi -n payments"))
    # reward > 0 if fix is correct, episode done
```

## Deployment on HF Spaces

The environment is deployed as a Docker-based HF Space using OpenEnv v0.2.1:

```dockerfile
# Dockerfile uses openenv-base image
FROM ghcr.io/meta-pytorch/openenv-base:latest
# Serves OpenEnv HTTP/WebSocket API on port 8000
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Configuration in `openenv.yaml`:
```yaml
spec_version: 1
name: kube_sre_gym
type: space
runtime: fastapi
app: server.app:app
port: 8000
```

## Training on H100

**Install**
```bash
git clone https://huggingface.co/spaces/openenv-community/kube-sre-gym && cd kube-sre-gym
pip install -e ".[train]"
```

**Set credentials**
```bash
export K8S_TOKEN=<gke-bearer-token>
export K8S_ENDPOINT=<gke-api-url>
export K8S_CA_CERT=<base64-ca-cert>
export ANTHROPIC_API_KEY=<key>   # for adversarial designer + judge
export HF_TOKEN=<token>          # for pushing checkpoints
```

**Launch (2 terminals)**
```bash
# Terminal 1: Environment server
GYM_MODE=adversarial LLM_BACKEND=anthropic uv run server

# Terminal 2: GRPO training
python train.py --vllm-mode colocate --num-generations 8 --max-steps 8 --save-steps 1 \
    --push-to-hub --hub-repo your-name/k8s-sre-agent
```

The curriculum automatically progresses: warmup (single faults) → intermediate (harder faults) → expert (multi-fault adversarial scenarios designed by Claude).
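
The gate behind that progression is simple to state. The sketch below is hypothetical (the real logic in `server/curriculum.py` tracks per-fault-type mastery, which this simplification collapses into a single resolution rate), using the five tiers named earlier in this README:

```python
TIERS = ["warmup", "beginner", "intermediate", "advanced", "expert"]

def next_tier(current: str, resolutions: list[bool],
              window: int = 5, threshold: float = 0.8) -> str:
    """Escalate one tier once the recent resolution rate clears a threshold."""
    recent = resolutions[-window:]
    mastered = len(recent) == window and sum(recent) / window >= threshold
    idx = TIERS.index(current)
    return TIERS[min(idx + int(mastered), len(TIERS) - 1)]
```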

## Evaluation

```bash
# Compare base model vs trained checkpoint
python eval.py
```

Runs both models through random adversarial scenarios and reports resolution rate, average reward, and steps-to-fix.

## Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| `K8S_TOKEN` | Bearer token for GKE | required |
| `K8S_ENDPOINT` | GKE API endpoint | required |
| `K8S_CA_CERT` | Base64 CA cert | required |
| `GYM_MODE` | `standard` or `adversarial` | `standard` |
| `LLM_BACKEND` | `openai`, `hf`, or `anthropic` | `openai` |
| `ANTHROPIC_API_KEY` | For adversarial designer + judge | required in adversarial mode |
| `MAX_STEPS` | Max commands per episode | `16` |
| `EVAL_MIN_DIFFICULTY` | Override min difficulty for eval | `0.0` |

## Project Structure

```
kube-sre-gym/
├── train.py                    # GRPO training (TRL 0.29.0 + vLLM colocate)
├── eval.py                     # Base vs trained model comparison
├── kube_sre_gym_colab.ipynb    # Google Colab training notebook (HF TRL)
├── plot_rewards.py             # Reward curve visualization
├── models.py                   # Action, Observation, State dataclasses
├── client.py                   # KubeSreGymEnv sync client
├── Dockerfile                  # HF Spaces deployment (OpenEnv base image)
├── openenv.yaml                # OpenEnv v0.2.1 Space config
├── server/
│   ├── kube_sre_gym_environment.py  # Core env: reset → inject → step → judge → reward
│   ├── k8s_backend.py               # K8s auth, execute, reset, health checks
│   ├── k8s_commands.py              # kubectl command handlers (get/describe/logs/set/patch)
│   ├── k8s_injectors.py             # Real fault injection via K8s API
│   ├── adversarial_designer.py      # LLM designs multi-step incidents
│   ├── judge.py                     # LLMJudge + AdversarialJudge (phase-aware SRE scoring)
│   ├── curriculum.py                # Progressive difficulty + mastery tracking
│   ├── scenario_generator.py        # Fault scenario pool
│   ├── llm_client.py                # OpenAI/HF/Anthropic wrapper
│   ├── constants.py                 # Cluster topology, healthy state definitions
│   └── app.py                       # FastAPI + WebSocket server
└── sample_app/
    ├── namespaces.yaml              # payments, frontend, auth
    └── base/                        # Healthy deployment manifests
```

## Key Design Decisions

1. **Real cluster over simulator** — Simulators can't reproduce the timing, state transitions, and failure modes of real Kubernetes. OOM kills happen when the kernel actually runs out of memory, not when a flag is set.

2. **Adversarial self-play** — The designer targets the agent's weaknesses (tracked by curriculum), creating an automatic curriculum that gets harder as the agent improves. No manual scenario authoring needed.

3. **Multi-layer resolution check** — Programmatic (expected pod count + restart tracking + OOM detection) + LLM judge verification. This prevents false resolution from OOM-flapping pods or partial fixes in multi-fault scenarios.

4. **No topology in prompt** — The agent receives zero information about namespaces, deployment names, or images. It must learn to discover the cluster layout via `kubectl get pods -A`, making the learned policy transferable to any cluster.

5. **GRPO over PPO** — GRPO compares multiple rollouts of the same prompt, producing stable advantages without a value function. Better suited for sparse, delayed rewards (most reward comes at episode end).

6. **Environment co-evolution** — We intentionally treat environment bugs as part of the story. When training exposed issues in our command parser, judge, and health checks, we fixed them — making the environment better alongside the agent. This is recursive self-improvement at the platform level.
docs/brainstorm/00-hackathon-readout.md ADDED
@@ -0,0 +1,75 @@
# OpenEnv Hackathon Readout

Date: 2026-04-24

## What The Hackathon Wants

The winning submission should be an OpenEnv-compliant environment where an LLM acts step by step, receives programmatic feedback, and measurably improves through RL or RL-style training.

The most important judging weights are:

| Criterion | Weight | Practical meaning |
|---|---:|---|
| Environment innovation | 40% | Novel, challenging, meaningful agent behavior, not a clone of common games or toy tasks. |
| Storytelling | 30% | A judge should understand the world, the agent, what it learned, and why it matters in 3 to 5 minutes. |
| Showing improvement | 20% | Reward curves, before/after runs, baseline comparison, actual training evidence. |
| Reward/training pipeline | 10% | Coherent rubrics, TRL or Unsloth script, reproducible pipeline. |

Minimum gates:

- Use latest OpenEnv.
- Hosted Hugging Face Space.
- OpenEnv-compliant `reset`, `step`, `state`, typed models, `openenv.yaml`.
- Training script using Unsloth or HF TRL, ideally Colab.
- Evidence of real training, including reward/loss plots.
- README with problem, environment, actions, observations, tasks, setup, results.

## Strategic Lessons From The Docs

1. Pick a task where success can be verified programmatically.
2. Make the environment ambitious but keep the first curriculum levels easy enough for non-zero reward.
3. Use multiple reward signals, not one monolithic score.
4. Build the environment and verifier before training.
5. Show a before/after behavior difference, not only a training script.
6. Avoid a static benchmark. Adaptive curriculum and self-play read as much more ambitious.
7. The story matters almost as much as the engineering.

## Lessons From The Prior DocEdit Work

The old DocEdit environment passed because it was:

- Real-world, not a game.
- OpenEnv compliant.
- Lightweight enough for the constraints.
- Deterministically graded.
- Easy to explain.

The later Qwen SFT + GRPO postmortem proved that document repair can improve with training, but it also exposed a strategic limitation: full-document rewrite policies are probably not the best final design. A stronger next step is a planner/executor setup with structured edit actions and verifier feedback.

## Lessons From The Winning Kube SRE Example

The winning pattern was not just "Kubernetes environment." It was:

- A vivid professional world: a tiny model learns to be on-call.
- Real or realistic tools.
- Multi-step investigation and repair.
- Adaptive curriculum.
- Adversarial scenario generation.
- Multi-layer rewards.
- A story where the agent and environment co-evolve.

The key insight to borrow:

> The environment should fight back as the agent improves.

## Our Target Shape

To maximize win probability, the idea should combine:

- Theme 2: long-horizon planning, ideally up to 300 actions.
- Theme 3.1: professional world modeling with realistic tools and persistent state.
- Theme 4: self-improvement through adaptive scenario generation.
- Existing leverage from DocEdit so we can build fast.

The strongest direction is therefore not "another document editor." It is a long-horizon professional control room where document edits are one part of a larger verified workflow.
docs/brainstorm/01-idea-scorecard.md ADDED
@@ -0,0 +1,59 @@
# Idea Scorecard

## Scoring Rubric

Scores are out of 10 and weighted roughly by the hackathon criteria.

| Field | Meaning |
|---|---|
| Innovation | Would judges find the environment fresh and research-worthy? |
| Story | Can the demo be explained clearly and memorably? |
| Trainability | Can we show reward improvement in the available time? |
| Verifiability | Can rewards be objective and hard to game? |
| Build speed | Can we build a credible OpenEnv environment quickly? |

## Candidate Ideas

| Rank | Idea | Innovation | Story | Trainability | Verifiability | Build speed | Verdict |
|---:|---|---:|---:|---:|---:|---:|---|
| 1 | Regulatory Dossier Control Room | 9 | 9 | 8 | 9 | 8 | Best overall. Uses DocEdit leverage but expands into long-horizon professional world modeling. |
| 2 | Personal Chief of Staff Simulator | 8 | 9 | 7 | 7 | 6 | Excellent theme fit, but personalization and reward design may get fuzzy. |
| 3 | Codebase Migration Gym | 7 | 7 | 8 | 9 | 6 | Verifiable with tests, but code agents are crowded and less novel. |
| 4 | Research Reproduction Lab | 9 | 8 | 5 | 7 | 4 | Very ambitious, likely too hard to build and train under time pressure. |
| 5 | Multi-Agent Procurement Negotiation | 8 | 8 | 6 | 6 | 5 | Good multi-agent story, but objective grading and RL loop are harder. |
| 6 | Supply Chain Crisis Planner | 7 | 8 | 7 | 8 | 6 | Solid simulator, but can feel like an operations game if not grounded enough. |

## Recommended Winner Candidate

Build **Regulatory Dossier Control Room**.

One-line pitch:

> Train an agent to manage a 300-step regulatory document crisis: inspect a simulated pharma submission, discover scattered inconsistencies, apply precise cross-document edits, validate the dossier, and improve through adversarially generated new compliance failures.

Why this is the best fit:

- It hits long-horizon planning directly.
- It is professional and high-value.
- It has crisp verification via hidden canonical facts and compliance rules.
- It extends prior DocEdit work instead of restarting from zero.
- It creates a very strong story: "Can a small model learn to behave like a regulatory operations associate?"
- It can show training improvement without requiring a real external system like Kubernetes.
- It can scale from easy 10-step tasks to hard 300-step tasks through curriculum.

## Why Not Just Continue DocEdit V2?

DocEdit V2 is useful but too narrow for this round's themes. It is mostly local edit application. The judging criteria now heavily reward long-horizon behavior, self-improvement, and world modeling.

We should reuse DocEdit-style document generation, corruption, chunking, and grading, but wrap it inside a larger workflow:

- Multiple documents.
- Persistent investigation state.
- Hidden facts.
- Cross-document dependencies.
- Validation loops.
- Audit notes.
- Adaptive scenario generator.

That gives the old strength a much bigger judging surface.
docs/brainstorm/02-recommended-idea-regulatory-dossier-control-room.md ADDED
@@ -0,0 +1,272 @@
# Recommended Idea: Regulatory Dossier Control Room

## Short Pitch

**Regulatory Dossier Control Room** is an OpenEnv environment where an LLM acts as a regulatory operations agent during a simulated pharma submission crisis.

The agent receives a high-level change request, such as a dosage update, safety warning, manufacturing site change, or adverse-event correction. The change is scattered across a dossier of many interlinked documents: drug label, clinical study report, investigator brochure, patient leaflet, quality summary, cover letter, amendment log, and internal review notes.

The agent has up to 300 tool steps to inspect, search, edit, validate, and audit the dossier. Rewards come from objective checks against a hidden consistency graph and regulatory rules.

## Why Judges Should Care

Real regulatory work is long-horizon, high-stakes, and brutally detail-sensitive. A single inconsistent dosage, date, or contraindication across documents can delay a submission.

Current LLMs are good at explaining documents, but they struggle with:

- Tracking facts across many files.
- Applying the same change consistently.
- Avoiding collateral damage.
- Remembering decisions over long sessions.
- Recovering from early mistakes.
- Knowing when to validate and when to stop.

This environment trains exactly that behavior.

## Theme Fit

Primary theme:

- Theme 2: Super long-horizon planning and instruction following.

Secondary themes:

- Theme 3.1: Professional task world modeling.
- Theme 4: Self-improvement through adaptive curricula.
- Theme 5: Wild card, because it turns document editing into a realistic compliance control room.

## The 300-Step Task

Hard episodes have:

- 20 to 60 dossier files.
- 40 to 150 hidden obligations.
- 100 to 300 possible action steps.
- Cross-document dependencies.
- Red herrings and stale memo fragments.
- Validation reports that reveal partial but not complete truth.

Example hard prompt:

> A late safety update changes the maximum daily dose from 40 mg to 30 mg for renal impairment patients, adds a contraindication for severe hepatic impairment, removes an outdated trial endpoint from Study RX-204, and requires all patient-facing materials to use plain-language wording. Update the dossier, preserve unrelated content, and leave an audit trail.

The agent must discover that this affects:

- Drug label dosage section.
- Contraindications section.
- Patient leaflet.
- Clinical study report summary table.
- Investigator brochure safety section.
- Cover letter.
- Amendment log.
- Cross-reference table.
- Internal review checklist.

## Action Space

Potential actions:

```json
{"tool": "search", "query": "renal impairment 40 mg"}
{"tool": "open_file", "path": "label/section_4_2_dosage.xml"}
{"tool": "inspect_window", "path": "csr/rx_204_summary.xml", "start": 120, "length": 40}
{"tool": "replace_text", "path": "label/section_4_2_dosage.xml", "target": "40 mg", "replacement": "30 mg"}
{"tool": "patch_section", "path": "patient_leaflet.xml", "section_id": "dose_warning", "content": "..."}
{"tool": "add_audit_note", "document": "amendment_log.xml", "note": "..."}
{"tool": "run_validator", "validator": "dose_consistency"}
{"tool": "commit_episode"}
```

Optional later actions:

```json
{"tool": "assign_subtask", "agent": "safety_reviewer", "objective": "..."}
{"tool": "resolve_conflict", "fact": "max_daily_dose", "value": "30 mg", "evidence": ["..."]}
```

## Observation Space

The agent sees:

- Current task brief.
- Current file/window content.
- Search results.
- Known facts discovered so far.
- Validation warnings.
- Edit history.
- Remaining step budget.
- Reward components from the last action.

The agent does not see:

- The full hidden canonical answer.
- All affected files upfront.
- The complete dependency graph.

## Reward Design

Use multiple independent reward components:

| Reward component | Purpose |
|---|---|
| Fact correction reward | Correctly updates canonical facts like dosage, dates, safety claims, study endpoints. |
| Cross-document consistency reward | Same fact is consistent across all required files. |
| Coverage reward | Agent discovers and touches all impacted nodes in the hidden dependency graph. |
| Collateral damage penalty | Penalize changing unrelated text or breaking valid facts. |
| Audit reward | Correctly records what changed and why. |
| Validation reward | Reward using validators and resolving their warnings. |
| Efficiency reward | Encourage completion before 300 steps. |
| Anti-hacking penalty | Penalize invalid paths, repeated no-ops, format-breaking edits, or validator spam. |

Suggested total:

```text
reward =
  delta_fact_score
  + delta_consistency_score
  + 0.2 * delta_coverage
  + 0.1 * audit_score_delta
  + validator_resolution_bonus
  - collateral_damage_penalty
  - repeat_action_penalty
  - invalid_action_penalty
```

Final success score:

```text
final_score =
  0.35 * fact_accuracy
  + 0.25 * cross_doc_consistency
  + 0.15 * affected_file_coverage
  + 0.10 * audit_quality
  + 0.10 * structural_validity
  + 0.05 * efficiency
  - collateral_damage
```
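
Translated directly into code, the scoring rule reads as follows. A sketch only; it assumes every component except collateral damage has already been normalized to [0, 1] by the validators.

```python
def final_score(m: dict[str, float]) -> float:
    # Weighted dossier score; weights mirror the formula above.
    return (0.35 * m["fact_accuracy"]
            + 0.25 * m["cross_doc_consistency"]
            + 0.15 * m["affected_file_coverage"]
            + 0.10 * m["audit_quality"]
            + 0.10 * m["structural_validity"]
            + 0.05 * m["efficiency"]
            - m["collateral_damage"])
```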

## Self-Improvement Loop

The environment includes an **Adversarial Compliance Designer**.

It tracks the agent's weaknesses:

- Misses patient-facing documents.
- Fixes label but forgets clinical study report tables.
- Over-edits unrelated sections.
- Fails to write audit notes.
- Repeats search actions.
- Stops before running validators.

Then it generates harder future episodes:

- More files.
- More cross-references.
- More red herrings.
- More subtle wording differences.
- Compound changes.
- Longer dependency chains.

Curriculum levels:

| Level | Episode shape | Expected horizon |
|---|---|---:|
| 1 | One file, one fact | 5 to 15 steps |
| 2 | Three files, one fact | 15 to 35 steps |
| 3 | Ten files, two facts | 35 to 80 steps |
| 4 | Twenty files, compound update | 80 to 160 steps |
| 5 | Full dossier crisis with red herrings | 160 to 300 steps |

This gives us non-zero reward early, then a path to the 300-step headline.

## What We Train

Start with a small instruct model and train it to:

- Search before editing.
- Build a working memory of discovered facts.
- Use validators.
- Apply narrow patches instead of broad rewrites.
- Maintain consistency across files.
- Stop only after validation passes.

Training recipe:

1. Baseline inference with frontier or small model.
2. Optional light SFT on synthetic tool traces from an oracle policy.
3. GRPO or RLVR using the verifier reward.
4. Compare base vs trained on held-out dossier seeds.

## Demo Story

The demo can be extremely clear:

1. Show the crisis brief.
2. Show a baseline model making local edits but missing cross-document consequences.
3. Show the validator catching unresolved inconsistencies.
4. Show reward curve improving during training.
5. Show trained agent: search, patch, validate, audit, commit.
6. Show final score breakdown and affected-file map.

Tagline:

> From "edit this paragraph" to "manage a 300-step regulatory crisis."

## Why This Can Win

It has the same strengths as Kube SRE Gym without copying it:

- Professional task.
- Tool-based world.
- Multi-step investigation.
- Adaptive curriculum.
- Agent learns from verifier feedback.
- Strong before/after story.

But it is more directly aligned with the user's existing assets:

- Existing DocEdit document generation.
- Existing structured edit actions.
- Existing similarity/collateral grading idea.
- Existing proof that small-model training can improve document repair.

## MVP Scope

Minimum credible hackathon version:

- 8 to 12 document templates.
- 5 scenario families.
- 3 difficulty tiers.
- 8 to 10 tools.
- Hidden consistency graph.
- Programmatic validators.
- OpenEnv server.
- Baseline inference.
- TRL/Unsloth training script.
- Reward plots from at least one short training run.
- README plus 2-minute pitch video or mini-blog.

Scenario families:

1. Dosage update.
2. Contraindication update.
3. Clinical endpoint correction.
4. Manufacturing site change.
5. Patient-language simplification.

## Risk And Mitigation

| Risk | Mitigation |
|---|---|
| 300-step tasks are too hard for training | Use curriculum. Train on 5 to 80 steps first, show hard eval as stretch. |
| Reward is too complex | Keep hidden graph simple: facts, required files, forbidden changes. |
| Judges think it is just DocEdit V2 | Pitch it as dossier-level world modeling, not local editing. |
| Training takes too long | Train a tiny model or run short GRPO over easy/medium levels and show upward reward. |
| LLM outputs invalid JSON | Constrain action schema and give format rewards/penalties. |

## Decision

This is the idea I would pick.

It is ambitious enough to impress, grounded enough to build, and close enough to existing work that we have a realistic path to shipping evidence rather than just slides.
docs/brainstorm/03-rapid-build-plan.md ADDED
@@ -0,0 +1,164 @@
# Rapid Build Plan For The Recommended Idea

## Goal

Build a convincing OpenEnv submission around **Regulatory Dossier Control Room** with long-horizon tasks, adaptive curriculum, objective rewards, and visible training improvement.

## First 45 Minutes

Decision checkpoint:

- Commit to Regulatory Dossier Control Room unless a better idea beats it on build speed and judge impact.
- Define the MVP around 5 scenario families and 3 difficulty levels.
- Keep the first implementation deterministic and lightweight.

Immediate choices:

- Python 3.12 + uv.
- OpenEnv latest release.
- FastAPI server.
- Simple HTML/JS or Gradio demo if time allows.
- Store generated dossier as in-memory structured files, with optional JSON fixtures.

## Build Architecture

```text
regulatory_dossier_control_room/
  openenv.yaml
  pyproject.toml
  README.md
  inference.py
  train_grpo.py
  server/
    app.py
    environment.py
    models.py
    scenario_generator.py
    dossier.py
    tools.py
    validators.py
    rewards.py
    curriculum.py
  assets/
    reward_curve.png
    baseline_vs_trained.png
```

## Core Environment

State:

- Task seed.
- Difficulty.
- Dossier files.
- Hidden canonical facts.
- Hidden affected file graph.
- Current file/window.
- Search history.
- Edit history.
- Validator history.
- Step count.
- Score components.

Actions:

- `search`
- `open_file`
- `inspect_window`
- `replace_text`
- `patch_section`
- `add_audit_note`
- `run_validator`
- `commit_episode`

Observations:

- Task brief.
- Current file/window.
- Search or validator output.
- Last reward breakdown.
- Known discovered facts.
- Remaining steps.

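A typed first cut of the `models.py` implied by the lists above might look like this sketch. Field names are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field

@dataclass
class DossierAction:
    tool: str                        # one of: search, open_file, inspect_window,
                                     # replace_text, patch_section, add_audit_note,
                                     # run_validator, commit_episode
    args: dict = field(default_factory=dict)

@dataclass
class DossierObservation:
    task_brief: str
    window: str                      # current file/window content
    tool_output: str                 # search or validator output
    known_facts: list[str]           # facts discovered so far
    last_reward_breakdown: dict[str, float]
    steps_remaining: int
```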
## Scenario Families

1. Dosage update:
   - Change max dose across label, patient leaflet, CSR, investigator brochure.
2. Contraindication update:
   - Add/remove safety contraindications across medical and patient documents.
3. Clinical endpoint correction:
   - Correct endpoint wording and tables across CSR, abstract, briefing doc.
4. Manufacturing site change:
   - Update site names, IDs, certificates, cover letter, audit trail.
5. Patient-language simplification:
   - Convert technical warnings to patient-facing plain language without changing meaning.

## Difficulty Tiers

| Tier | Files | Hidden obligations | Max steps | Use |
|---|---:|---:|---:|---|
| Easy | 2 to 4 | 3 to 8 | 30 | Fast learning signal. |
| Medium | 8 to 15 | 12 to 30 | 100 | Main training target. |
| Hard | 20 to 60 | 40 to 150 | 300 | Headline long-horizon demo. |

## Training Evidence Plan

Minimum viable evidence:

- Run baseline over 20 seeds.
- Run short GRPO or SFT+GRPO over easy/medium curriculum.
- Save reward curve.
- Evaluate base vs trained on 20 held-out seeds.

Metrics:

- Mean episode reward.
- Final dossier score.
- Fact accuracy.
- Cross-document consistency.
- Affected file coverage.
- Collateral damage.
- Validator warnings remaining.
- Steps to completion.

## Demo Output

README should include:

- One-paragraph pitch.
- Why long-horizon dossier management matters.
- Action/observation space.
- Reward breakdown.
- Curriculum/self-improvement loop.
- Baseline vs trained table.
- Reward plot.
- One trace excerpt showing better behavior after training.

Video or mini-blog story:

1. "A safety change arrives 12 hours before submission."
2. Baseline fixes only the obvious label line.
3. Validator reveals missed patient leaflet and CSR table.
4. Training reward improves.
5. Trained agent searches, patches, validates, audits, and commits.

## What To Avoid

- Do not market this as "DocEdit V3." That undersells it.
- Do not start with full 300-step training. Build 30-step and 100-step curricula first.
- Do not rely on an LLM judge as the only reward.
- Do not make the UI the main project. The environment and training evidence are the submission.
- Do not overfit to static hand-authored tasks. Procedural seeds matter.

## Final Recommendation

Start implementation with a narrow MVP:

- Dosage update family only.
- 6 documents.
- 3 difficulty settings.
- Hidden consistency graph.
- Search/open/replace/validate/audit/commit tools.

Once this works, add the other scenario families and the adversarial curriculum.
docs/brainstorm/04-physics-design-environments.md ADDED
@@ -0,0 +1,258 @@
# Physics, CAD, Chip, And Media Environment Brainstorm

Date: 2026-04-24

## Core Question

Could we build an OpenEnv environment where an LLM improves at designing objects, systems, or artifacts that can be verified by simulation?

Short answer:

**Yes. This is a strong hackathon direction, but only if we constrain the design language and simulator.**

The best version is not "LLM generates arbitrary 3D geometry from scratch." That is too broad and brittle. The best version is:

> The agent edits a parametric engineering design through a small set of meaningful actions, runs a verifier/simulator, and learns to optimize objective tradeoffs like stiffness, mass, stress, torque, loss, cost, manufacturability, or timing.

## Current Reality Of AI For CAD

Frontier models are becoming surprisingly good at simple parametric CAD, especially when the output is code in libraries like CadQuery. CAD Arena's early 2026 benchmark shows frontier and commercial systems producing many valid executable CAD outputs on simple to medium prompts, but failures still appear on complex functional parts.
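
As a reference point for what "simple parametric CAD" means here, the snippet below is roughly that difficulty level: a bolted plate written against CadQuery's fluent API. Dimensions are arbitrary illustration values.

```python
import cadquery as cq

# Parametric mounting plate with four corner bolt holes.
thickness, width, height, hole_d = 4.0, 60.0, 40.0, 5.0

plate = (
    cq.Workplane("XY")
    .box(width, height, thickness)
    .faces(">Z").workplane()
    .rect(width - 10, height - 10, forConstruction=True)  # hole-center layout
    .vertices()
    .hole(hole_d)
)

cq.exporters.export(plate, "plate.step")
```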
20
+
21
+ This means the opportunity is not "can an LLM make a cube or bracket?" The opportunity is:
22
+
23
+ - Can a small or open model learn engineering design behavior from simulator feedback?
24
+ - Can it iterate over many design steps without losing constraints?
25
+ - Can it trade off mass, stiffness, stress, manufacturability, and safety margin?
26
+ - Can it recover from bad simulations?
27
+ - Can it learn design heuristics through RL rather than only prompt engineering?
28
+
29
+ That is very aligned with OpenEnv.
30
+
31
+ ## Useful Tooling
32
+
33
+ | Tool | Use | Notes |
34
+ |---|---|---|
35
+ | CadQuery | Parametric 3D CAD from Python | Good for generating STEP/STL-style geometry through code. |
36
+ | MuJoCo | Fast rigid-body/contact simulation | Excellent for mechanisms and robotics, not the right core tool for structural FEA. |
37
+ | FEniCSx | Finite element PDE solving | Powerful but heavier; risky if we need a polished 2-day build. |
38
+ | topoptlab | Topology optimization research/benchmarking | Very relevant, but we should verify install/runtime before betting on it. |
39
+ | OpenMDAO | Multidisciplinary design optimization | Strong for system-level optimization, design variables, constraints, analytic derivatives. |
40
+ | Pyleecan | Electrical machine and drive simulation | Very relevant to motors; FEMM coupling is Windows-only right now, which is a Mac/HF risk. |
41
+ | cocotb/Yosys/OpenROAD | Chip design and verification | Very verifiable and compelling, but a crowded/coding-adjacent domain. |
42
+ | FFmpeg/MoviePy | Programmatic video editing | Buildable and verifiable, but reward quality is less objective unless tasks are synthetic. |
43
+
44
+ ## Candidate A: MechForge Gym
45
+
46
+ One-line pitch:
47
+
48
+ > Train an LLM to act as a mechanical design engineer: iteratively design a lightweight bracket, bridge, clamp, or motor mount, run simulation, and improve stiffness-to-weight while respecting stress and manufacturability constraints.
49
+
50
+ ### Environment
51
+
52
+ The agent receives:
53
+
54
+ - Design brief.
55
+ - Load cases.
56
+ - Mounting constraints.
57
+ - Forbidden zones.
58
+ - Material.
59
+ - Manufacturing process.
60
+ - Current design parameters.
61
+ - Simulation report.
62
+
63
+ The agent acts through constrained tools:
64
+
65
+ ```json
66
+ {"tool": "set_dimension", "part": "base_plate", "parameter": "thickness", "value": 4.0}
67
+ {"tool": "add_rib", "from": "mount_a", "to": "load_point", "width": 5.0, "height": 12.0}
68
+ {"tool": "add_lightening_hole", "center": [20, 15], "radius": 4.0}
69
+ {"tool": "change_material", "material": "aluminum_6061"}
70
+ {"tool": "run_simulation"}
71
+ {"tool": "commit_design"}
72
+ ```
73
+
74
+ ### Reward
75
+
76
+ ```text
77
+ reward =
78
+ + stiffness_score
79
+ + safety_factor_score
80
+ + manufacturability_score
81
+ - mass_penalty
82
+ - stress_violation_penalty
83
+ - invalid_geometry_penalty
84
+ - repeated_failed_sim_penalty
85
+ ```
86
+
87
+ ### Why It Could Win
88
+
89
+ - Very visual.
90
+ - Very easy to explain.
91
+ - Verifier is real math, not vibes.
92
+ - Mechanical engineering angle is distinctive.
93
+ - Long-horizon optimization loop is natural.
94
+ - Can show before/after designs and reward curves.
95
+
96
+ ### Main Risk
97
+
98
+ Full 3D FEA is hard to make fast and robust in two days. The MVP should use a simplified finite-element/truss/beam solver first, then render the result as CAD. That is credible if we are honest:
99
+
100
+ > The environment trains engineering design behavior with a fast verifier; high-fidelity FEA is a stretch backend.
101
+
102
+ ## Candidate B: Axial Flux Motor Design Gym
103
+
104
+ One-line pitch:
105
+
106
+ > Train an LLM to design axial-flux motor variants by choosing rotor/stator geometry, magnet layout, winding parameters, and cooling assumptions, then score torque, efficiency, mass, thermal margin, and manufacturability.
107
+
108
+ ### Why This Is Exciting
109
+
110
+ This is the most personally differentiated idea because of mechanical/electrical design expertise. It sounds like real R&D, not a toy. It also gives a good story:
111
+
112
+ > Can a small model learn the design instincts of an electric motor engineer?
113
+
114
+ ### Possible Action Space
115
+
116
+ ```json
117
+ {"tool": "set_slot_count", "value": 12}
118
+ {"tool": "set_pole_pairs", "value": 10}
119
+ {"tool": "set_airgap_mm", "value": 0.8}
120
+ {"tool": "set_magnet_thickness_mm", "value": 4.0}
121
+ {"tool": "set_winding_turns", "value": 38}
122
+ {"tool": "run_electromagnetic_sim"}
123
+ {"tool": "run_thermal_check"}
124
+ {"tool": "commit_design"}
125
+ ```
126
+
127
+ ### Reward
128
+
129
+ - Torque density.
130
+ - Efficiency.
131
+ - Cogging torque penalty.
132
+ - Thermal margin.
133
+ - Current density limit.
134
+ - Magnet mass/cost.
135
+ - Manufacturability constraints.
136
+
137
+ ### Main Risk
138
+
139
+ The real simulation stack is not trivial. Pyleecan is exactly in the domain, but its strongest FEMM coupling is currently Windows-only, which is awkward for a MacBook and HF Space. A simplified analytic motor model is feasible, but judges may ask whether it is too toy-like unless we present it as a curriculum level.
140
+
141
+ ### Verdict
142
+
143
+ Extremely cool, but I would not choose this as the first build unless we intentionally scope it as **MotorBench Lite**:
144
+
145
+ - Analytic/equivalent-circuit verifier for the hackathon.
146
+ - Pyleecan/FEMM as stretch or future backend.
147
+ - CAD render as a bonus, not core.
148
+
+ ## Candidate C: Chip Design / EDA Gym
+
+ One-line pitch:
+
+ > Train an LLM to design and optimize small digital circuits through Verilog, simulation, synthesis, formal tests, and area/timing/power metrics.
+
+ ### Why It Is Strong
+
+ - Verifiability is excellent.
+ - Tools exist: cocotb for Python verification, Yosys for synthesis, OpenROAD for physical design.
+ - Rewards are crisp: tests pass, area, timing slack, DRC count, power proxy.
+ - Long-horizon flow is real: design, simulate, synthesize, place, route, inspect metrics, revise.
+
+ ### Main Risk
+
+ This space is closer to coding benchmarks, so it may feel less novel. Also, full OpenROAD flows can be slow/heavy. But a small RTL-to-synthesis environment could be highly shippable.
+
+ ### Verdict
+
+ Very viable, especially if we focus on **hardware optimization**, not generic coding:
+
+ > "The agent learns to trade timing, area, and correctness under a real synthesis verifier."
+
+ ## Candidate D: Video Editing Gym
+
+ One-line pitch:
+
+ > Train an LLM to assemble a coherent video from clips using FFmpeg/MoviePy tools, optimized against objective timeline, audio, caption, and narrative constraints.
+
+ ### Why It Is Interesting
+
+ - Very demo-friendly.
+ - Easy to render before/after.
+ - Tool use is realistic.
+ - Long-horizon timeline assembly is possible.
+
+ ### Main Risk
+
+ Quality is hard to verify objectively. We can make synthetic tasks with objective constraints, but "good narrative" will need an LLM judge or a weak proxy. That is less clean than physics/code verification.
+
+ ### Verdict
+
+ Good product demo, weaker OpenEnv winner candidate unless we make the task highly structured:
+
+ - Given transcript and clips, align exact semantic beats.
+ - Reward caption timing, shot coverage, audio loudness, no black frames, no forbidden clips.
+
+ ## My Updated Ranking
+
+ | Rank | Idea | Innovation | Story | Trainability | Verifiability | Build speed | Verdict |
+ |---:|---|---:|---:|---:|---:|---:|---|
+ | 1 | MechForge Gym: simulated mechanical design optimization | 10 | 10 | 7 | 8 | 7 | Best new contender. More visually compelling than regulatory if scoped well. |
+ | 2 | Regulatory Dossier Control Room | 9 | 9 | 8 | 9 | 8 | Still safest high-scoring option. Less spectacular, more shippable. |
+ | 3 | RTL/Chip Optimization Gym | 8 | 8 | 8 | 10 | 6 | Strong verifier; risk is looking like code benchmark. |
+ | 4 | Axial Flux Motor Design Gym | 10 | 10 | 5 | 7 | 4 | Most exciting personally, but risky for two days unless simplified hard. |
+ | 5 | Video Editing Gym | 8 | 9 | 6 | 5 | 8 | Great demo, weaker reward objectivity. |
+
+ ## Recommended Physics Build
+
+ If choosing the physics route, build **MechForge Gym**, not full arbitrary generative design and not full motor design first.
+
+ ### MVP
+
+ Build a 2D/2.5D structural design environment:
+
+ - Agent edits a bracket/bridge/motor-mount design through parametric actions.
+ - Fast internal solver computes stress/compliance/mass.
+ - CadQuery renders a 3D preview/STL from the parameterized design.
+ - Curriculum grows from 5-step changes to 300-step design campaigns.
+ - Adversarial scenario generator creates new load cases and manufacturing constraints.
+
+ ### Why This Is The Sweet Spot
+
+ It preserves the magic of generative design:
+
+ - simulation-verifiable,
+ - visual,
+ - engineering-real,
+ - optimization-driven,
+ - self-improving.
+
+ But it avoids the two big traps:
+
+ - arbitrary geometry generation,
+ - slow, brittle high-fidelity simulation.
+
+ ## Buildable Story
+
+ Demo story:
+
+ 1. "A drone arm bracket must hold 120 N at the tip but weigh under 30 g."
+ 2. Baseline model adds material everywhere and passes stress but is overweight.
+ 3. The environment runs simulation and shows mass/stress/compliance breakdown.
+ 4. After training, the agent learns ribs, fillets, and lightening holes.
+ 5. The trained design is lighter, still safe, and has fewer invalid simulations.
+
+ Tagline:
+
+ > From text prompts to simulation-trained design instincts.
+
+ ## Final Thought
+
+ This may be the most emotionally convincing idea in the set. Judges will remember a model that learns to design a lighter bracket from simulation feedback.
+
+ The key discipline is scope:
+
+ - Do not promise "full CAD/FEA/motor design from scratch."
+ - Promise "a verifiable OpenEnv for engineering design behavior."
+ - Show an actual reward curve and visible before/after geometry.
+
docs/brainstorm/05-mechforge-rendering-stack.md ADDED
@@ -0,0 +1,85 @@
+ # MechForge Rendering And Simulation Stack
+
+ Date: 2026-04-24
+
+ ## The Confusion To Resolve
+
+ For MechForge there are four separate jobs:
+
+ 1. Generate or modify a design.
+ 2. Render the design so humans can inspect it.
+ 3. Simulate or verify the design.
+ 4. Export the design to real CAD/manufacturing formats.
+
+ One tool does not need to do all four.
+
+ ## Recommended MVP Stack
+
+ | Layer | MVP choice | Why |
+ |---|---|---|
+ | Design representation | Structured parametric JSON | Easy for LLMs, easy to validate, easy to convert. |
+ | Browser renderer | Three.js | Fast, visual, interactive, works inside a web demo. |
+ | Fast verifier | Custom beam/truss-style solver | Good enough for reward curves and RL feedback. |
+ | Export | STL from Three.js mesh | Immediate tangible artifact. |
+ | Future CAD backend | CadQuery first, OpenSCAD second | CadQuery is Python-native and more flexible for OpenEnv. |
+ | Future simulation backend | simplified FEM, FEniCSx, or specialized solver | Swap in after the environment loop works. |
+
+ ## Why Not OpenSCAD First?
+
+ OpenSCAD is good for deterministic programmatic CAD. It is available on macOS and can generate real geometry, but it is not the fastest path for a live web app.
+
+ Use OpenSCAD later if we want:
+
+ - scriptable constructive solid geometry,
+ - reproducible `.scad` artifacts,
+ - STL export through the OpenSCAD CLI,
+ - simple parts made from unions/differences.
+
+ For the first experiment, Three.js is better because it gives immediate visual feedback in the browser.
+
+ ## Why Not Full FEA First?
+
+ Full FEA is the wrong first milestone. It risks spending the hackathon on meshing, solver stability, and packaging instead of the OpenEnv loop.
+
+ Better:
+
+ 1. Start with a simplified verifier that produces a reward.
+ 2. Show that LLM behavior improves under that reward.
+ 3. Add higher-fidelity simulation only after the loop is stable.
+
+ The judges care most that the environment trains meaningful behavior and shows improvement. A simple but coherent verifier is acceptable if we explain the limitations honestly.
+
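+ As a concrete example of that simplified verifier, here is a minimal cantilever check using the standard beam formulas (sigma = M c / I, delta = F L^3 / 3 E I). The rectangular cross-section and the PLA-like material numbers are illustrative assumptions:
+
+ ```python
+ # Minimal cantilever verifier sketch (rectangular section, illustrative values).
+ def verify_cantilever(force_n, length_mm, width_mm, height_mm,
+                       e_mpa=2300.0, yield_mpa=50.0):  # rough PLA-like assumptions
+     i_mm4 = width_mm * height_mm ** 3 / 12.0           # second moment of area
+     moment_nmm = force_n * length_mm                   # max bending moment at the root
+     stress_mpa = moment_nmm * (height_mm / 2.0) / i_mm4           # sigma = M c / I
+     deflection_mm = force_n * length_mm ** 3 / (3.0 * e_mpa * i_mm4)  # F L^3 / 3 E I
+     return {
+         "stress_mpa": stress_mpa,
+         "deflection_mm": deflection_mm,
+         "safety_factor": yield_mpa / stress_mpa if stress_mpa > 0 else float("inf"),
+     }
+ ```
+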
+ ## Benchmark Plan
+
+ Before committing to the full environment, run GPT-5.4 through a small prompt-to-design benchmark:
+
+ - Prompt asks for a lightweight bracket under a load case.
+ - Model returns structured design JSON.
+ - Renderer shows the part.
+ - Verifier scores mass, stress proxy, deflection proxy, safety factor, and manufacturability.
+ - We inspect whether the model uses real design patterns like ribs, load paths, holes in low-stress areas, and avoids invalid geometry.
+
+ This tells us whether current frontier models already solve the task or whether there is room for RL improvement.
+
+ ## What The Experiment App Does
+
+ The app in `experiment-mechanical-idea/` implements this benchmark:
+
+ - Frontend: Vite + Three.js.
+ - Backend: Express + OpenAI Responses API.
+ - Input: natural-language mechanical design prompt.
+ - Output: structured parametric design JSON.
+ - Render: plate, ribs, holes, bosses, fixed holes, load arrow.
+ - Verifier: fast beam-style estimate.
+ - Export: STL from the rendered mesh.
+
+ ## Final Recommendation
+
+ For the OpenEnv version:
+
+ 1. Keep the agent action space constrained.
+ 2. Use Three.js for the judge-facing demo.
+ 3. Use Python/CadQuery later for real CAD export.
+ 4. Keep simulation/verifier independent from the renderer.
+ 5. Do not let the LLM generate arbitrary meshes in the first version.
+
docs/brainstorm/06-production-simulation-stack.md ADDED
@@ -0,0 +1,207 @@
+ # Production Simulation Stack For MechForge
+
+ Date: 2026-04-24
+
+ ## Short Answer
+
+ For a production-quality MechForge, do **not** use MuJoCo as the main solver for stress, heat, or electromagnetics.
+
+ Use MuJoCo only if the environment is about:
+
+ - mechanism motion,
+ - contact,
+ - robotics,
+ - actuated joints,
+ - dynamic control,
+ - impact-like rigid-body behavior.
+
+ For a full-stack engineering design simulator, the better architecture is:
+
+ ```text
+ LLM / Agent
+ -> constrained design actions
+ -> CAD kernel
+ -> meshing
+ -> multiphysics solvers
+ -> post-processing
+ -> reward + visual trace
+ ```
+
+ ## Recommended Production Stack
+
+ | Layer | Recommended tool | Why |
+ |---|---|---|
+ | Agent orchestration | OpenAI Responses API now, Agents SDK later | Responses is enough for the benchmark; the Agents SDK is useful when tool traces and multi-agent workflows become first-class. |
+ | Design representation | Parametric feature graph | Better than arbitrary mesh generation; supports CAD, constraints, versioning, and RL actions. |
+ | CAD kernel | CadQuery / OpenCASCADE | Python-native CAD generation, real B-rep/STEP export, deterministic parametric geometry. |
+ | Meshing | Gmsh | Mature, scriptable 2D/3D mesh generator with OpenCASCADE geometry support. |
+ | Structural FEA | FEniCSx or scikit-fem | FEniCSx is stronger for serious PDE work; scikit-fem is lighter and easier for hackathon packaging. |
+ | Thermal FEA | FEniCSx / scikit-fem / Elmer | Heat equation is straightforward in finite element tools. |
+ | Electromagnetic FEA | Elmer FEM, GetDP, MFEM, or FEMM/Pyleecan for motor-specific workflows | Motors need magnetic vector potential, materials, windings, airgap, torque extraction. |
+ | Visualization | Three.js in UI, ParaView/VTK for heavy post-processing | Three.js is judge/demo-facing; VTK/ParaView is engineering-facing. |
+ | Optimization | OpenMDAO or custom curriculum/RL loop | OpenMDAO is excellent for deterministic design-variable optimization; OpenEnv/RL is the hackathon learning loop. |
+ | Artifact storage | per-iteration JSON + STEP/STL + mesh + VTK + screenshot | Enables side-by-side version comparison. |
+
+ ## Exact Production Loop
+
+ A single design episode should look like this:
+
+ ```text
+ 1. reset()
+    - Generate design brief.
+    - Define load cases, fixtures, materials, constraints, objective.
+
+ 2. agent action: edit_design
+    - Add rib, change thickness, move hole, set magnet width, change winding turns, etc.
+
+ 3. geometry build
+    - Build CAD from parametric feature graph through CadQuery/OpenCASCADE.
+    - Export STEP/STL.
+
+ 4. mesh build
+    - Run Gmsh.
+    - Tag boundaries: fixed faces, load faces, heat sources, winding regions, magnets, airgap.
+
+ 5. solver run
+    - Structural: displacement, stress, strain, safety factor.
+    - Thermal: temperature field, hot spot, thermal margin.
+    - Electromagnetic: flux density, torque, losses, cogging proxy.
+
+ 6. post-process
+    - Save VTK/VTU fields.
+    - Produce scalar metrics.
+    - Render screenshot or send mesh fields to Three.js.
+
+ 7. reward
+    - Score constraints and objective tradeoffs.
+
+ 8. next observation
+    - Return metrics, failed constraints, top stress/thermal/EM hotspots, and visual artifacts.
+ ```
+
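+ A thin Python orchestration of that episode might look like the following sketch. Every helper here (`make_brief`, `init_design`, `apply_action`, `build_cad`, `mesh_geometry`, `run_solvers`, `score_design`) is a placeholder for the corresponding stage above, not an existing API:
+
+ ```python
+ # Episode-loop sketch; all helper names are hypothetical stage placeholders.
+ def run_episode(agent, task_seed, max_steps=300):
+     brief = make_brief(task_seed)                      # 1. reset
+     design, history = init_design(brief), []
+     for _ in range(max_steps):
+         action = agent.act(brief, design, history)     # 2. agent edit
+         apply_action(design, action)
+         if action.get("tool") == "run_simulation":
+             cad = build_cad(design)                    # 3. geometry build
+             mesh = mesh_geometry(cad)                  # 4. mesh build
+             metrics = run_solvers(mesh, design)        # 5. solver run
+             reward = score_design(metrics, brief)      # 6-7. post-process + reward
+             history.append({"action": action, "metrics": metrics, "reward": reward})
+         if action.get("tool") == "commit_design":      # episode end
+             break
+     return history
+ ```
+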
+ ## What The Agent Should Output
+
+ Do not ask the LLM to output an entire arbitrary CAD file as the main action.
+
+ For serious parts, use tool calls like:
+
+ ```json
+ {"tool": "set_parameter", "name": "base_thickness_mm", "value": 4.5}
+ {"tool": "add_rib", "start": [12, -18, 4], "end": [92, -4, 20], "width_mm": 5}
+ {"tool": "move_lightening_hole", "id": "hole_2", "center": [54, 12, 0], "radius_mm": 4}
+ {"tool": "set_boundary_condition", "face": "left_mount_faces", "type": "fixed"}
+ {"tool": "set_load", "face": "tip_boss", "vector_n": [0, 0, -120]}
+ {"tool": "run_simulation", "physics": ["structural", "thermal"]}
+ ```
+
+ Why:
+
+ - Tool calls are inspectable.
+ - Invalid actions can be rejected.
+ - The environment can apply partial progress rewards.
+ - The CAD remains valid more often.
+ - The same action sequence becomes training data.
+
+ The current experiment returns a full structured design JSON because it is a fast benchmark. The OpenEnv version should move toward smaller incremental design actions.
+
+ ## 3D Structural FEA Path
+
+ For full 3D structural FEA, I would implement:
+
+ ```text
+ CadQuery/OpenCASCADE -> STEP/B-rep -> Gmsh tetra mesh -> FEniCSx or scikit-fem -> VTU fields -> Three.js/VTK viewer
+ ```
+
+ ### Fastest hackathon path
+
+ - Use `scikit-fem` for 3D linear elasticity on simple tetrahedral meshes.
+ - Use Gmsh for meshing simple CAD (a minimal sketch follows this list).
+ - Use meshio to bridge Gmsh meshes into Python/VTK outputs.
+
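+ Here is a minimal Gmsh Python API sketch. The box stands in for an imported STEP part (a real part would come through `gmsh.model.occ.importShapes`); the dimensions, mesh-size option value, and file names are illustrative assumptions:
+
+ ```python
+ import gmsh
+
+ # Minimal tetrahedral meshing sketch with the Gmsh Python API.
+ gmsh.initialize()
+ gmsh.model.add("bracket")
+ gmsh.model.occ.addBox(0, 0, 0, 100, 45, 30)     # placeholder geometry in mm
+ gmsh.model.occ.synchronize()
+ gmsh.option.setNumber("Mesh.MeshSizeMax", 5.0)  # target element size in mm
+ gmsh.model.mesh.generate(3)                     # 3D tetrahedral mesh
+ gmsh.write("mesh.msh")
+ gmsh.finalize()
+ ```
+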
+ ### More serious production path
+
+ - Use FEniCSx for PDE solves and scalable linear algebra.
+ - Use PETSc-backed solvers.
+ - Store post-processing fields as VTK/VTU.
+
+ ## Electromagnetic + Thermal Motor Path
+
+ For motor design, do **not** start with arbitrary 3D motor FEA.
+
+ Production path:
+
+ ```text
+ parametric motor template
+ -> 2D cross-section CAD
+ -> Gmsh mesh with material regions
+ -> EM solver for magnetic vector potential
+ -> torque / B-field / losses
+ -> thermal network or thermal FEA
+ -> reward
+ ```
+
+ Candidate solvers:
+
+ - Elmer FEM: multiphysics, includes heat transfer and electromagnetics.
+ - GetDP: finite element solver often used with Gmsh for EM problems.
+ - Pyleecan: motor-specific design framework, but deployment constraints need checking.
+ - FEMM: common motor workflow but Windows-centric, not ideal for HF/Linux deployment.
+
+ ## Visual Versioning
+
+ Every iteration should save:
+
+ ```text
+ runs/{run_id}/
+   iter_001/
+     design.json
+     actions.jsonl
+     geometry.step
+     geometry.stl
+     mesh.msh
+     structural.vtu
+     thermal.vtu
+     electromagnetic.vtu
+     screenshot.png
+     metrics.json
+   iter_002/
+     ...
+ ```
+
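+ A thin artifact writer matching that layout might look like this sketch; the heavy files (STEP/STL/mesh/VTU/screenshot) are assumed to be written by their own pipeline stages:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # Per-iteration artifact writer sketch; mirrors the runs/ tree above.
+ def save_iteration(run_id, iteration, design, actions, metrics, root="runs"):
+     out = Path(root) / run_id / f"iter_{iteration:03d}"
+     out.mkdir(parents=True, exist_ok=True)
+     (out / "design.json").write_text(json.dumps(design, indent=2))
+     with open(out / "actions.jsonl", "w") as f:
+         for action in actions:                 # one tool call per line
+             f.write(json.dumps(action) + "\n")
+     (out / "metrics.json").write_text(json.dumps(metrics, indent=2))
+ ```
+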
+ The UI should show:
+
+ - version timeline,
+ - side-by-side geometry,
+ - stress heatmap,
+ - deformation magnification slider,
+ - thermal heatmap,
+ - EM flux density heatmap for motor tasks,
+ - tool-call/action trace,
+ - score curve over iterations.
+
+ ## What To Build First
+
+ Best next implementation step:
+
+ 1. Replace the current JS frame FEA with a Python simulator service.
+ 2. Start with scikit-fem 3D structural FEA for a simple cantilever bracket template.
+ 3. Add Gmsh meshing.
+ 4. Add VTK/VTU export.
+ 5. Keep Three.js for browser rendering.
+ 6. Only then add thermal.
+ 7. Add electromagnetics only if we pick motor design as the final domain.
+
+ ## Installation Note
+
+ The local environment currently does not have the heavy solver packages installed:
+
+ - `scikit-fem`
+ - `dolfinx`
+ - `gmsh`
+ - `meshio`
+ - `cadquery`
+ - `mujoco`
+ - `openmdao`
+
+ Installing those packages is a real environment change. Do it intentionally once we pick the stack.
+
docs/brainstorm/07-mechforge-domain-choice.md ADDED
@@ -0,0 +1,169 @@
+ # MechForge Domain Choice
+
+ Date: 2026-04-24
+
+ ## The Decision
+
+ We need to decide what the OpenEnv task is actually about:
+
+ 1. Cantilever/bracket/mount structural optimization.
+ 2. Motor design.
+ 3. Mechanism/dynamics design.
+ 4. Chip/EDA optimization.
+ 5. Video or media editing.
+
+ The strongest physics candidates are structural design and motor design.
+
+ ## Option A: Cantilever / Bracket / Mount Design
+
+ Pitch:
+
+ > Train an agent to design lightweight structural parts that survive real load cases.
+
+ Examples:
+
+ - drone arm bracket,
+ - motor mount,
+ - shelf bracket,
+ - lightweight bridge segment,
+ - 3D-printable fixture,
+ - robotic gripper finger.
+
+ Pros:
+
+ - Fastest to build.
+ - Easy to visualize.
+ - Structural FEA is simpler than EM.
+ - Clear rewards: mass, stress, strain, deflection, safety factor.
+ - Easy curriculum: 2D frame -> 3D linear elasticity -> multi-load cases.
+ - Excellent for showing iteration screenshots and STL versions.
+
+ Cons:
+
+ - Less exotic than motor design.
+ - Need good problem framing to avoid feeling like a simple topology optimization toy.
+
+ Verdict:
+
+ **Best hackathon target.** It is the strongest balance of spectacle, feasibility, and real verification.
+
+ ## Option B: Axial Flux Motor Design
+
+ Pitch:
+
+ > Train an agent to design axial-flux motor variants under torque, efficiency, thermal, mass, and manufacturability constraints.
+
+ Examples:
+
+ - pole/slot selection,
+ - magnet geometry,
+ - airgap,
+ - winding turns,
+ - rotor/stator dimensions,
+ - cooling features,
+ - torque ripple/cogging reduction.
+
+ Pros:
+
+ - Most personally differentiated for a mechanical/electrical engineer.
+ - Very impressive story.
+ - Naturally multiphysics: EM + thermal + structural.
+ - Strong R&D flavor.
+
+ Cons:
+
+ - Hardest to implement correctly.
+ - EM FEA and motor post-processing are not trivial.
+ - 3D axial-flux simulation is expensive.
+ - Faster practical path is 2D/axisymmetric/analytical first, which may disappoint if pitched as full 3D.
+
+ Verdict:
+
+ **Best long-term product/research direction, risky for this hackathon.** Use as a stretch or second environment if structural MechForge works.
+
+ ## Option C: Mechanism / Dynamics Design
+
+ Pitch:
+
+ > Train an agent to design mechanisms that move correctly under physical simulation.
+
+ Examples:
+
+ - linkage design,
+ - gripper mechanism,
+ - passive walker,
+ - robot end-effector,
+ - compliant-ish mechanism approximated as rigid joints.
+
+ Pros:
+
+ - MuJoCo is actually a great fit.
+ - Visual and interactive.
+ - Rewards are measurable: trajectory error, contact stability, energy, joint limits.
+
+ Cons:
+
+ - Not FEA.
+ - Less aligned with stress/thermal/electromagnetics.
+ - Could drift into robotics control instead of engineering design.
+
+ Verdict:
+
+ Good if we want MuJoCo. Not the right answer if we want structural/thermal/EM.
+
+ ## Option D: Full Multiphysics MotorBench
+
+ Pitch:
+
+ > A multiphysics OpenEnv where agents design electric machines and learn from EM, thermal, and structural simulation.
+
+ Pros:
+
+ - Huge wow factor.
+ - Most ambitious.
+ - Strong self-improvement story.
+
+ Cons:
+
+ - Too much for a two-day MVP unless heavily constrained.
+ - Many solvers and file formats.
+ - Risk of spending the whole hackathon packaging tools.
+
+ Verdict:
+
+ Great final vision, not the first build.
+
+ ## Recommendation
+
+ Build:
+
+ > **MechForge Structural 3D: motor-mount/bracket optimization with real 3D FEA.**
+
+ Frame it as the first task family in a larger MechForge platform:
+
+ - structural bracket/motor mount now,
+ - thermal add-on next,
+ - motor EM design later.
+
+ This preserves the big dream while keeping the first submission shippable.
+
+ ## Why Motor Mount Is The Sweet Spot
+
+ A motor mount bridges both worlds:
+
+ - It is structurally verifiable.
+ - It is visually clear.
+ - It can later connect to motor design.
+ - It supports heat and vibration extensions.
+ - It feels more interesting than a generic cantilever.
+
+ Suggested final prompt family:
+
+ > Design a lightweight motor mount for a drone/EV test rig. It must support thrust/load, keep shaft alignment under deflection, avoid high stress near bolt holes, and optionally dissipate heat from the motor face.
+
+ That gives us:
+
+ - structural FEA now,
+ - thermal next,
+ - motor design story later.
+
docs/brainstorm/08-agentic-3d-engineering-environment.md ADDED
@@ -0,0 +1,249 @@
+ # Document 8: Agentic 3D Engineering Environment
+
+ Date: 2026-04-24
+
+ ## Honest Product Definition
+
+ The winning product is not:
+
+ > LLM generates a 3D model.
+
+ The winning product is:
+
+ > An AI engineering agent that turns natural-language physical requirements into parameterized CAD, runs simulation and manufacturability checks, optimizes the design, and outputs manufacturable files with a safety-factor report.
+
+ For fixtures:
+
+ > Given load, material, envelope, mounting constraints, print/manufacturing process, and target safety factor, generate a manufacturable bracket/fixture and prove it with FEA.
+
+ For motors:
+
+ > Given magnets, bearings, shaft, voltage/current limits, printer/process, and target torque/RPM, generate a printable axial-flux BLDC motor kit and prove it with EM/thermal/structural simulation.
+
+ The fixture path is more commercially useful and shippable.
+ The motor path is more fun and demo-worthy.
+ Together, they are a strong long-term research/product direction.
+
+ ## Hackathon Choice
+
+ For the next 24 to 26 effective hours, pick:
+
+ > **Option A: 3D structural motor-mount/bracket/fixture design with real 3D linear elasticity.**
+
+ Why:
+
+ - It is doable fast.
+ - It is visual.
+ - It is objectively verifiable.
+ - It supports 300+ step agentic loops.
+ - It can expand later to thermal, dynamics, NVH, and motor design.
+
+ Do not start with full axial-flux motor EM unless structural MechForge is already working. MotorBench should be the long-term second stage.
+
+ ## Agent Loop
+
+ The environment should run this loop:
+
+ ```text
+ prompt
+ -> parse requirements
+ -> ask/assume missing boundary conditions
+ -> generate 3-5 design families
+ -> create parametric CAD/actions
+ -> run 3D FEA
+ -> identify stress concentrations and deflection
+ -> add fillets/ribs/thickness or move holes
+ -> re-run FEA
+ -> optimize mass/safety/deflection/manufacturability
+ -> export CAD/STL + simulation report
+ ```
+
+ In OpenEnv terms:
+
+ ```text
+ reset(task_seed)
+   -> observation: design brief, constraints, materials, available tools
+
+ step(action)
+   -> environment applies CAD/tool action
+   -> if requested, simulator runs
+   -> reward components returned
+   -> observation includes metrics, warnings, hotspots, artifacts
+
+ done
+   -> when agent commits a design or exceeds step budget
+ ```
+
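+ A minimal skeleton of that interface is sketched below in a generic reset/step shape (not the exact OpenEnv API); `make_brief`, `apply_action`, `run_fea`, and `score` are hypothetical placeholders:
+
+ ```python
+ # Environment skeleton sketch; helper names are hypothetical placeholders.
+ class MechForgeEnv:
+     def __init__(self, max_steps=300):
+         self.max_steps = max_steps
+
+     def reset(self, task_seed=0):
+         self.steps = 0
+         self.design = {"features": [], "loads": []}    # parametric state
+         self.brief = make_brief(task_seed)             # hypothetical task generator
+         return {"brief": self.brief, "constraints": self.brief["constraints"]}
+
+     def step(self, action):
+         self.steps += 1
+         valid = apply_action(self.design, action)      # hypothetical CAD/tool apply
+         reward, metrics = 0.0, {}
+         if action.get("tool") == "run_fea":
+             metrics = run_fea(self.design)             # hypothetical solver call
+             reward = score(metrics, self.brief)        # hypothetical reward terms
+         done = action.get("tool") == "commit_design" or self.steps >= self.max_steps
+         return {"valid": valid, "metrics": metrics}, reward, done, {}
+ ```
+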
+ ## Missing-Information Handling
+
+ A naive LLM will fail because it will not know boundary conditions.
+
+ The environment must either ask for or infer:
+
+ - Is 120 Nm a torque around a shaft?
+ - Is it a force through a hook/tip/load face?
+ - Where is the fixture mounted?
+ - What are the bolt holes?
+ - Static, cyclic, or impact load?
+ - Desired safety factor?
+ - Material ambiguity, e.g. "601 aluminum" likely means 6061 aluminum.
+ - Manufacturing process: 3D print, CNC, sheet metal, casting.
+ - Temperature or thermal expansion constraints.
+
+ For hackathon speed, the environment should include default load templates:
+
+ 1. **Cantilever tip load**: fixed face at x=0, downward force at free tip.
+ 2. **Motor mount**: bolt holes fixed, radial/axial motor load at boss, optional torque couple.
+ 3. **Chair/seat support**: fixed feet or base, downward distributed load on seat surface.
+ 4. **Torque fixture**: equal/opposite force couple around a shaft axis.
+
+ If the user explicitly states where the load goes, that overrides the defaults.
+
+ ## Tool Calls
+
+ Use incremental tool calls rather than one huge CAD file.
+
+ Design tools:
+
+ ```json
+ {"tool": "create_design_family", "family": "ribbed_cantilever_bracket"}
+ {"tool": "set_material", "material": "aluminum_6061"}
+ {"tool": "set_envelope", "length_mm": 100, "width_mm": 45, "height_mm": 30}
+ {"tool": "add_mount_hole", "id": "m1", "center": [10, -15, 0], "radius_mm": 2.6}
+ {"tool": "add_rib", "id": "r1", "start": [12, -15, 4], "end": [92, -4, 22], "width_mm": 5}
+ {"tool": "add_lightening_hole", "id": "h1", "center": [55, 12, 2], "radius_mm": 4}
+ {"tool": "set_base_thickness", "value_mm": 4.5}
+ ```
+
+ Load/boundary tools:
+
+ ```json
+ {"tool": "set_fixed_region", "region": "left_face"}
+ {"tool": "set_fixed_region", "region": "mounting_holes"}
+ {"tool": "set_force", "region": "tip_boss", "vector_n": [0, 0, -120]}
+ {"tool": "set_torque", "axis": "x", "origin": [90, 0, 10], "torque_nm": 120}
+ {"tool": "set_temperature", "region": "motor_face", "temperature_c": 80}
+ {"tool": "set_heat_source", "region": "motor_face", "power_w": 12}
+ ```
+
+ Simulation tools:
+
+ ```json
+ {"tool": "build_cad"}
+ {"tool": "mesh_geometry", "target_size_mm": 5}
+ {"tool": "run_fea", "physics": "linear_elasticity"}
+ {"tool": "run_thermal", "physics": "steady_state_heat"}
+ {"tool": "inspect_hotspots", "field": "von_mises_stress"}
+ {"tool": "export_artifacts", "formats": ["json", "stl", "step", "vtu", "png"]}
+ ```
+
+ Optimization tools:
+
+ ```json
+ {"tool": "propose_revision", "objective": "reduce_mass_keep_sf_above_2"}
+ {"tool": "sweep_parameter", "name": "base_thickness_mm", "values": [3.5, 4, 4.5, 5]}
+ {"tool": "optimize_parameters", "method": "cma_es", "budget": 40}
+ {"tool": "commit_design"}
+ ```
+
+ ## Environment Responses
+
+ After a design action:
+
+ ```json
+ {
+   "valid": true,
+   "changed_parameters": ["base_thickness_mm"],
+   "geometry_status": "buildable",
+   "warnings": []
+ }
+ ```
+
+ After FEA:
+
+ ```json
+ {
+   "method": "3D linear tetrahedral elasticity",
+   "nodes": 240,
+   "elements": 620,
+   "max_von_mises_mpa": 138.4,
+   "max_principal_strain": 0.0021,
+   "max_displacement_mm": 2.7,
+   "safety_factor": 2.0,
+   "mass_g": 51.2,
+   "hotspots": [
+     {"region": "fixed_root", "severity": 0.50},
+     {"region": "rib_root_r1", "severity": 0.42}
+   ],
+   "constraints": {
+     "safety_factor_above_2": true,
+     "mass_below_45g": false,
+     "tip_deflection_below_2mm": false
+   }
+ }
+ ```
+
+ After commit:
+
+ ```json
+ {
+   "final_score": 0.82,
+   "reward_breakdown": {
+     "safety": 0.30,
+     "stiffness": 0.22,
+     "mass": 0.12,
+     "manufacturability": 0.10,
+     "invalid_action_penalty": 0.0
+   },
+   "artifacts": {
+     "design_json": "runs/.../design.json",
+     "stl": "runs/.../geometry.stl",
+     "report": "runs/.../report.md"
+   }
+ }
+ ```
+
+ ## Long-Horizon Step Design
+
+ A 300-500 step episode is plausible if the environment exposes a detailed action space:
+
+ - 20-50 requirement parsing and constraint-confirmation actions.
+ - 30-80 design family generation and selection actions.
+ - 100-250 geometry edits.
+ - 50-100 simulation/inspection actions.
+ - 50-100 optimization sweeps and revisions.
+ - 10-20 final export/report actions.
+
+ Training goal:
+
+ - Baseline model takes many invalid or inefficient steps.
+ - Trained model learns good action order:
+   - clarify/assume loads,
+   - create feasible design family,
+   - run simulation early,
+   - inspect hotspots,
+   - revise local features,
+   - avoid over-lightening,
+   - commit only after constraints pass.
+
+ The story becomes:
+
+ > RL teaches engineering workflow discipline, not just CAD generation.
+
+ ## 24-Hour MVP
+
+ Build in order:
+
+ 1. Current experiment: GPT-5.4 structured design + Three.js viewer.
+ 2. Add 3D tetrahedral linear elasticity solver.
+ 3. Add load manager for cantilever/motor-mount/torque defaults.
+ 4. Add trace view for tool calls and simulator responses.
+ 5. Save per-iteration artifacts.
+ 6. Convert to OpenEnv API.
+ 7. Add baseline inference.
+ 8. Run short training or at least repeated benchmark showing improvement.
+
+ ## What To Say In The Pitch
+
+ > We built an OpenEnv for engineering design agents. The agent receives physical requirements, infers boundary conditions, creates parametric CAD actions, runs real 3D FEA, reads stress/deformation feedback, and iteratively improves the design. This is a foundation for simulation-trained engineering models across fixtures, mounts, thermal constraints, and eventually motors.
+
docs/brainstorm/09-cad-rlve-structural-household-parts.md ADDED
@@ -0,0 +1,472 @@
+ # Document 9: CAD RLVE For Structural Household Mechanical Parts
+
+ Date: 2026-04-25
+
+ ## Core Realization
+
+ Current AI is not primarily bad at "imagining 3D objects."
+
+ It is bad at:
+
+ - making valid CAD,
+ - making the right geometric edit at the right time,
+ - preserving design intent,
+ - keeping features clean and editable,
+ - avoiding broken booleans and non-manifold geometry,
+ - producing parts that survive physical checks.
+
+ That is exactly why a CAD-focused RLVE environment is interesting.
+
+ The product should not be:
+
+ > Generate a cool-looking mesh.
+
+ The product should be:
+
+ > Train an agent to create and revise parametric CAD code until the part is valid, editable, manufacturable, and structurally correct.
+
+ This is a stronger and more general version of MechForge. MechForge becomes the first structural benchmark suite inside a larger CAD-agent training environment.
+
+ ## Name
+
+ Working names:
+
+ - **CADForge**
+ - **MechForge CAD**
+ - **OpenCAD Gym**
+ - **ParametricCAD RLVE**
+ - **FeatureForge**
+
+ Best current framing:
+
+ > **CADForge: an RLVE environment where agents learn reliable parametric CAD creation and editing for functional mechanical parts.**
+
+ ## Why Code-CAD Is The Right Medium
+
+ The agent should write or edit code-CAD, not arbitrary meshes.
+
+ Good backend candidates:
+
+ - OpenSCAD-style constructive solid geometry,
+ - CadQuery,
+ - build123d,
+ - a constrained internal DSL that compiles to CadQuery/OpenSCAD/STEP/STL.
+
+ The user said "OpenCADD"; if this means OpenSCAD or a similar code-CAD tool, the important idea is the same:
+
+ > CAD should be represented as executable, inspectable, deterministic code.
+
+ That gives the environment objective checks:
+
+ - Did the code run?
+ - Did the model build?
+ - Did operations apply in the right order?
+ - Is the feature tree clean?
+ - Are dimensions parameterized?
+ - Can a downstream edit change the part without breaking it?
+ - Does export work?
+ - Is the final geometry watertight and manifold?
+
+ This is much better for RLVE than asking the model to emit a raw mesh.
+
+ ## Target Domain
+
+ Start with structural household and mechanical parts:
+
+ - wall hook,
+ - shelf bracket,
+ - cantilever support,
+ - chair seat support,
+ - stool leg joint,
+ - phone stand,
+ - desk clamp,
+ - handle,
+ - hinge plate,
+ - simple enclosure mount,
+ - motor mount,
+ - cable guide,
+ - pegboard fixture,
+ - plant hanger,
+ - small appliance bracket.
+
+ These are ideal because they are:
+
+ - easy to understand visually,
+ - mechanically meaningful,
+ - structurally verifiable,
+ - small enough to simulate quickly,
+ - familiar enough that judges understand failures,
+ - broad enough to expose many CAD operations.
+
+ ## Agent Task
+
+ The agent receives a physical design brief:
+
+ ```text
+ Design a wall hook that mounts with two screws, fits inside an 80 mm x 60 mm x 45 mm envelope,
+ holds a 5 kg load with safety factor above 2, is printable without support where possible,
+ and has rounded edges for safe household use.
+ ```
+
+ The agent must produce code-CAD:
+
+ ```python
+ part = create_base_plate(width=60, height=80, thickness=6)
+ part = add_mounting_holes(part, spacing=48, diameter=5)
+ part = add_hook_arm(part, length=42, thickness=8, root_fillet=6)
+ part = add_tip_lip(part, height=8, radius=4)
+ part = add_ribs(part, count=2, thickness=4)
+ part = fillet_edges(part, radius=2)
+ ```
+
+ Then the environment builds, validates, simulates, and scores the result.
+
+ ## Action Space
+
+ Use incremental tool calls rather than one giant CAD program.
+
+ Sketch and feature tools:
+
+ ```json
+ {"tool": "create_sketch", "plane": "XY", "id": "base_profile"}
+ {"tool": "add_rectangle", "sketch": "base_profile", "width_mm": 60, "height_mm": 80}
+ {"tool": "constrain_symmetric", "sketch": "base_profile", "axis": "Y"}
+ {"tool": "extrude", "sketch": "base_profile", "distance_mm": 6, "id": "base_plate"}
+ {"tool": "add_hole", "target": "base_plate", "center_mm": [0, 24], "diameter_mm": 5}
+ {"tool": "add_hole", "target": "base_plate", "center_mm": [0, -24], "diameter_mm": 5}
+ {"tool": "add_hook_arm", "length_mm": 42, "thickness_mm": 8, "angle_deg": 12}
+ {"tool": "add_rib", "from": "base_plate", "to": "hook_arm", "thickness_mm": 4}
+ {"tool": "fillet", "target": "root_edges", "radius_mm": 5}
+ ```
+
+ Validation tools:
+
+ ```json
+ {"tool": "build_cad"}
+ {"tool": "check_feature_tree"}
+ {"tool": "check_constraints"}
+ {"tool": "check_manifold"}
+ {"tool": "check_watertight"}
+ {"tool": "check_min_wall_thickness", "min_mm": 2.5}
+ {"tool": "check_overhangs", "process": "fdm_3d_printing", "max_angle_deg": 45}
+ {"tool": "export_artifacts", "formats": ["step", "stl", "json", "png"]}
+ ```
+
+ Simulation tools:
+
+ ```json
+ {"tool": "set_material", "material": "pla"}
+ {"tool": "set_fixed_region", "region": "mounting_hole_faces"}
+ {"tool": "set_force", "region": "hook_tip", "vector_n": [0, 0, -50]}
+ {"tool": "mesh_geometry", "target_size_mm": 3}
+ {"tool": "run_structural_check", "physics": "linear_elasticity"}
+ {"tool": "inspect_hotspots", "field": "von_mises_stress"}
+ ```
+
+ Revision tools:
+
+ ```json
+ {"tool": "increase_parameter", "name": "root_fillet_mm", "delta": 1}
+ {"tool": "move_feature", "feature": "mount_hole_1", "delta_mm": [0, 4, 0]}
+ {"tool": "add_support_rib", "region": "hook_root", "thickness_mm": 3}
+ {"tool": "reduce_mass", "strategy": "lightening_holes_low_stress_regions"}
+ {"tool": "commit_design"}
+ ```
+
+ ## Reward Design
+
+ Reward should be multi-layered. The early curriculum rewards basic CAD validity; later stages reward engineering quality.
+
+ Suggested reward:
+
+ ```text
+ valid_code_execution: 0.10
+ cad_build_success: 0.15
+ clean_feature_tree: 0.10
+ editability_test_passed: 0.10
+ manifold_watertight_mesh: 0.10
+ constraint_satisfaction: 0.10
+ manufacturability: 0.10
+ structural_safety: 0.15
+ mass_efficiency: 0.05
+ revision_efficiency: 0.05
+ ```
+
+ Penalties:
+
+ ```text
+ syntax_error: -0.20
+ failed_boolean_operation: -0.15
+ non_manifold_geometry: -0.15
+ self_intersection: -0.15
+ unparameterized_magic_values: -0.05
+ uneditable_feature_chain: -0.10
+ unsafe_stress_or_deflection: -0.20
+ invalid_final_export: -0.20
+ ```
+
+ The important thing:
+
+ > The agent is not rewarded for a pretty model. It is rewarded for reliable CAD behavior.
+
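+ Wiring the weight and penalty tables above into a scorer is straightforward; here is a minimal sketch where `checks` maps each named check to a bool, `violations` lists the penalty names that fired, and the clamp range is an assumption:
+
+ ```python
+ # Reward sketch combining the weight and penalty tables above.
+ WEIGHTS = {
+     "valid_code_execution": 0.10, "cad_build_success": 0.15,
+     "clean_feature_tree": 0.10, "editability_test_passed": 0.10,
+     "manifold_watertight_mesh": 0.10, "constraint_satisfaction": 0.10,
+     "manufacturability": 0.10, "structural_safety": 0.15,
+     "mass_efficiency": 0.05, "revision_efficiency": 0.05,
+ }
+ PENALTIES = {
+     "syntax_error": -0.20, "failed_boolean_operation": -0.15,
+     "non_manifold_geometry": -0.15, "self_intersection": -0.15,
+     "unparameterized_magic_values": -0.05, "uneditable_feature_chain": -0.10,
+     "unsafe_stress_or_deflection": -0.20, "invalid_final_export": -0.20,
+ }
+
+ def score(checks, violations):
+     reward = sum(w for name, w in WEIGHTS.items() if checks.get(name))
+     reward += sum(PENALTIES[v] for v in violations if v in PENALTIES)
+     return max(-1.0, min(1.0, reward))  # clamp range is an assumption
+ ```
+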
+ ## Editability Tests
+
+ This is the key differentiator.
+
+ After the agent submits a part, the environment should mutate requirements and test whether the CAD remains editable:
+
+ - increase load by 20%,
+ - change screw spacing,
+ - change material,
+ - change envelope,
+ - increase minimum wall thickness,
+ - require a larger fillet,
+ - move mounting holes,
+ - change manufacturing process from FDM to CNC or vice versa.
+
+ Example:
+
+ ```json
+ {
+   "edit_test": "change_mount_hole_spacing",
+   "old_spacing_mm": 48,
+   "new_spacing_mm": 56,
+   "expected": "model_rebuilds_without_manual_rewrite"
+ }
+ ```
+
+ This catches fake CAD solutions that only work once.
+
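+ Mechanically, an edit test is just a mutate-and-rebuild check; here is a sketch where `build_part` and `is_valid_solid` are hypothetical entry points into the parametric rebuild and geometry validation stages:
+
+ ```python
+ # Editability-test sketch: mutate one named parameter and rebuild.
+ def edit_test(params, name, new_value):
+     mutated = dict(params, **{name: new_value})
+     try:
+         part = build_part(mutated)   # hypothetical parametric rebuild
+     except Exception:
+         return False                 # brittle CAD: the edit broke the rebuild
+     return is_valid_solid(part)     # hypothetical watertight/manifold check
+ ```
+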
+ The trained behavior we want:
+
+ - define named parameters,
+ - reference parameters consistently,
+ - avoid brittle coordinate hacks,
+ - keep sketches constrained,
+ - isolate features cleanly,
+ - choose operations that survive downstream edits.
+
+ This is where AI currently fails badly, which makes it a strong RLVE target.
+
+ ## Geometry Quality Checks
+
+ The environment should check:
+
+ - watertight mesh,
+ - no holes or gaps,
+ - manifold edges,
+ - no self-intersections,
+ - no zero-thickness faces,
+ - no tiny sliver faces,
+ - no duplicate coincident geometry,
+ - acceptable triangle quality for exported mesh,
+ - consistent normals,
+ - minimum feature size,
+ - minimum wall thickness,
+ - proper contact/union between features.
+
+ For code-CAD, this can be done after export:
+
+ ```text
+ CAD code -> solid model -> STEP/STL -> mesh/solid validation -> reward
+ ```
+
+ The "tight mesh" requirement should not mean the agent directly optimizes mesh triangles first. It should mean:
+
+ > The CAD-generated solid exports to a clean, watertight, simulation-ready mesh.
+
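+ A minimal post-export validator can be built on real `trimesh` calls; the volume threshold and the helper name in this sketch are assumptions:
+
+ ```python
+ import trimesh
+
+ # Post-export mesh validation sketch using real trimesh calls.
+ def validate_export(stl_path, min_volume_mm3=100.0):
+     mesh = trimesh.load(stl_path)
+     pieces = mesh.split(only_watertight=False)  # count open fragments too
+     return {
+         "watertight": bool(mesh.is_watertight),
+         "components": len(pieces),
+         "volume_ok": bool(mesh.is_watertight and mesh.volume > min_volume_mm3),
+     }
+ ```
+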
+ ## Structural Verification
+
+ For MechForge, start with fast structural checks:
+
+ - cantilever beam approximation,
+ - plate/rib stress proxies,
+ - simple linear elasticity,
+ - later tetrahedral FEA.
+
+ Each part gets a load template:
+
+ | Part | Boundary condition | Load |
+ |---|---|---|
+ | Wall hook | screw holes fixed | downward load at hook tip |
+ | Shelf bracket | wall plate fixed | distributed shelf load |
+ | Chair support | feet fixed | downward seat load |
+ | Phone stand | base contact fixed | device weight and tipping check |
+ | Clamp | screw pad contact | clamping force and jaw bending |
+ | Motor mount | bolt holes fixed | radial/axial force and torque |
+
+ The environment returns feedback:
+
+ ```json
+ {
+   "build": "success",
+   "geometry": {
+     "watertight": true,
+     "manifold": true,
+     "min_wall_thickness_mm": 3.2,
+     "self_intersections": 0
+   },
+   "feature_tree": {
+     "named_parameters": 12,
+     "editable": true,
+     "failed_edit_tests": []
+   },
+   "structural": {
+     "max_stress_mpa": 31.4,
+     "max_displacement_mm": 1.8,
+     "safety_factor": 2.4,
+     "hotspots": ["hook_root"]
+   },
+   "reward": 0.86
+ }
+ ```
+
+ ## Curriculum
+
+ Stage 1: Valid code-CAD
+
+ - simple plates,
+ - holes,
+ - extrusions,
+ - fillets,
+ - no physical simulation yet.
+
+ Stage 2: Editable parametric parts
+
+ - change dimensions,
+ - move holes,
+ - alter thickness,
+ - regenerate cleanly.
+
+ Stage 3: Manufacturable household parts
+
+ - wall hook,
+ - phone stand,
+ - shelf bracket,
+ - clamp,
+ - hinge plate.
+
+ Stage 4: Structural MechForge
+
+ - loads,
+ - fixed regions,
+ - stress proxy,
+ - displacement proxy,
+ - safety factor,
+ - mass efficiency.
+
+ Stage 5: Multi-step engineering revision
+
+ - inspect hotspots,
+ - add ribs,
+ - change fillets,
+ - move holes,
+ - reduce mass,
+ - rerun checks,
+ - commit design.
+
+ Stage 6: Higher-fidelity CAD and FEA
+
+ - STEP export,
+ - tetrahedral meshing,
+ - linear elasticity,
+ - thermal add-on,
+ - multi-load cases.
+
+ ## Why This Is Reliable
+
+ This environment has unusually objective feedback.
+
+ The verifier does not need to understand aesthetics or taste. It can simply check:
+
+ - code runs,
+ - CAD builds,
+ - geometry is closed,
+ - edits survive,
+ - features are named,
+ - constraints are satisfied,
+ - physical load cases pass,
+ - artifacts export.
+
+ That gives a clean training signal.
+
+ It also directly targets a frontier-model weakness:
+
+ > Models can often write one plausible CAD script, but they are unreliable at iterative geometric repair and robust parametric editing.
+
+ That gap is where RLVE can show measurable improvement.
+
+ ## 24-Hour MVP
+
+ Build the MVP around one or two part families:
+
+ 1. Wall hook.
+ 2. Shelf bracket or motor mount.
+
+ Minimum viable loop:
+
+ ```text
+ prompt
+ -> agent emits constrained CAD JSON or code-CAD
+ -> environment builds part
+ -> exports STL
+ -> validates watertight/manifold geometry
+ -> runs simple structural score
+ -> applies one editability mutation
+ -> returns reward
+ ```
+
+ Artifacts:
+
+ - CAD code,
+ - design JSON,
+ - STL,
+ - PNG render,
+ - validation report,
+ - reward breakdown,
+ - trace of actions.
+
+ The demo story:
+
+ > At first, the agent creates CAD that looks plausible but breaks under edits or produces bad geometry. After training, it learns to create clean, editable, watertight parametric parts that survive structural checks.
+
+ ## Relationship To MechForge
+
+ MechForge should become the structural subset of CADForge.
+
+ Old framing:
+
+ > Train an agent to design lightweight brackets/mounts under load.
+
+ New framing:
+
+ > Train an agent to create reliable parametric CAD for functional mechanical parts, with MechForge structural checks as the first reward suite.
+
+ This is better because it solves the deeper problem.
+
+ Structural optimization is valuable, but CAD reliability is the bottleneck. If an agent cannot make clean editable CAD, it cannot become a useful engineering agent.
+
+ ## Rating
+
+ Score: **9/10**
+
+ Why:
+
+ - Very strong pain point.
+ - Easy to explain.
+ - Objective rewards.
+ - Strong long-horizon action space.
+ - Clear frontier-model weakness.
+ - Good bridge between CAD, simulation, manufacturability, and agent training.
+ - More generally useful than a pure mesh-generation environment.
+
+ Main risk:
+
+ - Real CAD kernels can be annoying to package and debug.
+ - The MVP should avoid arbitrary free-form CAD at first.
+ - Start with a constrained DSL and only later expose full OpenSCAD/CadQuery code.
+
+ Best near-term choice:
+
+ > Build a constrained code-CAD environment for wall hooks and brackets, validate clean geometry and editability, then add structural MechForge rewards.
+
docs/brainstorm/10-cadforge-rlve-environment.md ADDED
@@ -0,0 +1,564 @@
+ # Document 10: CADForge RLVE Environment
+
+ Date: 2026-04-25
+
+ ## Thesis
+
+ CADForge should be an RLVE environment for training agents to create and revise constructive solid geometry code.
+
+ The strongest medium is code-CAD:
+
+ - OpenSCAD-style CSG,
+ - CadQuery/build123d feature scripts,
+ - or a constrained AST/DSL that compiles to OpenSCAD/CadQuery.
+
+ The important move is to treat CAD code as an interactive, verifiable environment, not a one-shot file format.
+
+ > CADForge is a REPL for mechanical geometry. The agent proposes a small CAD action, the environment builds and verifies the resulting object, then returns geometry, manufacturability, editability, and structural rewards.
+
+ This directly fits the hackathon themes:
+
+ - **Long-horizon planning:** 300+ small CAD/tool actions to reach a valid final object.
+ - **World modeling:** the agent must maintain a mental model of the evolving geometry and consequences of each operation.
+ - **Self-improvement:** adaptive curricula can generate harder CAD briefs and edit tests.
+ - **Wild card:** reliable CAD creation is an underexplored and high-value frontier for LLM training.
+
+ ## Why This Is Better Than Raw CAD Text
+
+ Do not train the agent to emit raw OpenSCAD text character by character.
+
+ That wastes most of the learning signal on syntax:
+
+ - semicolons,
+ - braces,
+ - parameter order,
+ - missing parentheses,
+ - malformed module calls.
+
+ Instead, constrain the action space to AST operations. The policy chooses valid grammar moves:
+
+ ```json
+ {"action": "add_primitive", "type": "cube", "size_mm": [80, 40, 6]}
+ {"action": "apply_transform", "type": "translate", "vector_mm": [0, 0, 6]}
+ {"action": "apply_boolean", "type": "union", "children": ["base", "rib_1"]}
+ {"action": "apply_boolean", "type": "difference", "target": "base", "tool": "mount_hole_1"}
+ ```
+
+ Then the environment renders or compiles that AST into OpenSCAD/CadQuery.
+
+ This gives:
+
+ - syntactic validity by construction,
+ - clean action traces,
+ - easier reward attribution,
+ - interpretable failure modes,
+ - curriculum control over which operations are unlocked.
+
+ ## Environment Loop
+
+ The RL step loop should look like:
+
+ ```text
+ reset(task_seed)
+   -> returns design brief, target constraints, allowed grammar/actions
+
+ step(action)
+   -> updates CSG AST / feature tree
+   -> optionally compiles the current CAD
+   -> validates syntax, topology, editability, manufacturability, structure
+   -> returns observation, reward, done, artifacts
+
+ done
+   -> when agent commits design, exceeds step budget, or creates unrecoverable invalid geometry
+ ```
+
+ The environment is effectively a CAD REPL:
+
+ ```text
+ agent action
+ -> AST update
+ -> generated OpenSCAD/CadQuery
+ -> headless CAD build
+ -> STL/STEP export
+ -> trimesh/solid validation
+ -> simple structural check
+ -> reward + warnings
+ ```
+
+ ## Action Space
+
+ Start with a small grammar.
+
+ Primitive actions:
+
+ ```json
+ {"tool": "add_cube", "id": "seat", "size_mm": [420, 380, 35]}
+ {"tool": "add_cylinder", "id": "leg_1", "height_mm": 450, "radius_mm": 18}
+ {"tool": "add_sphere", "id": "edge_round_proxy", "radius_mm": 12}
+ ```
+
+ Transform actions:
+
+ ```json
+ {"tool": "translate", "target": "leg_1", "vector_mm": [-170, -140, -225]}
+ {"tool": "rotate", "target": "back_leg_1", "axis": "x", "degrees": -8}
+ {"tool": "scale", "target": "rib_1", "factor": [1, 1, 1.2]}
+ ```
+
+ Boolean actions:
+
+ ```json
+ {"tool": "union", "id": "chair_frame", "children": ["seat", "leg_1", "leg_2", "leg_3", "leg_4"]}
+ {"tool": "difference", "target": "seat", "cutter": "lightening_cutout_1"}
+ {"tool": "intersection", "id": "trimmed_backrest", "children": ["backrest", "envelope_box"]}
+ ```
+
+ Feature actions:
+
+ ```json
+ {"tool": "add_mount_hole", "target": "wall_plate", "diameter_mm": 5, "center_mm": [0, 24, 0]}
+ {"tool": "add_fillet", "target": "load_path_edges", "radius_mm": 4}
+ {"tool": "add_rib", "from": "seat", "to": "leg_1", "thickness_mm": 8}
+ {"tool": "add_crossbar", "between": ["leg_1", "leg_2"], "radius_mm": 10}
+ ```
+
+ Validation and simulation actions:
+
+ ```json
+ {"tool": "compile_cad"}
+ {"tool": "check_connected_components"}
+ {"tool": "check_watertight"}
+ {"tool": "check_manifold"}
+ {"tool": "check_editability"}
+ {"tool": "run_structural_check"}
+ {"tool": "commit_design"}
+ ```
+
+ ## Verifiable REPL Implementation
+
+ Use Python as the bridge.
+
+ MVP stack:
+
+ ```text
+ Gymnasium/OpenEnv API
+ -> Python CSG AST
+ -> SolidPython or direct OpenSCAD text emitter
+ -> OpenSCAD CLI headless compile
+ -> STL output
+ -> trimesh validation
+ -> reward
+ ```
+
+ Headless compile:
+
+ ```bash
+ openscad -o temp.stl generated_script.scad
+ ```
+
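+ A Python wrapper around that compile step might look like this sketch; the timeout, file names, and the convention of returning STL bytes plus an error string are assumptions:
+
+ ```python
+ import pathlib
+ import subprocess
+ import tempfile
+
+ # Headless OpenSCAD compile wrapper sketch.
+ def compile_scad(scad_source: str, timeout_s: int = 60):
+     with tempfile.TemporaryDirectory() as tmp:
+         src = pathlib.Path(tmp) / "part.scad"
+         out = pathlib.Path(tmp) / "part.stl"
+         src.write_text(scad_source)
+         try:
+             proc = subprocess.run(
+                 ["openscad", "-o", str(out), str(src)],
+                 capture_output=True, text=True, timeout=timeout_s,
+             )
+         except subprocess.TimeoutExpired:
+             return None, "compile timed out"
+         if proc.returncode != 0 or not out.exists():
+             return None, proc.stderr       # feed back as a hard penalty
+         return out.read_bytes(), ""        # STL bytes for trimesh validation
+ ```
+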
+ For speed:
+
+ - compile every N actions during long episodes,
+ - compile immediately after high-risk boolean/edit operations,
+ - run multiple environments in parallel,
+ - write temporary files to a RAM disk when available,
+ - cache compiled subtrees if the AST supports stable node IDs.
+
+ ## Reward Function
+
+ The reward should combine code validity, geometry coherence, editability, manufacturability, and structural performance.
+
+ ```text
+ R_total =
+   + w_build * build_success
+   + w_connected * single_connected_component
+   + w_manifold * watertight_manifold_mesh
+   + w_contact * required_parts_touch_and_align
+   + w_editable * editability_tests_passed
+   + w_constraints * task_constraints_satisfied
+   + w_structure * safety_factor_score
+   + w_efficiency * mass_efficiency
+   - w_nodes * ast_node_count
+   - w_invalid * invalid_operation_count
+   - w_floating * disconnected_component_count
+ ```
+
+ Suggested first weights:
+
+ ```text
+ build_success: 0.20
+ single_connected_component: 0.20
+ watertight_manifold_mesh: 0.20
+ part_contact_alignment: 0.10
+ editability_tests_passed: 0.10
+ constraints_satisfied: 0.05
+ structural_safety: 0.10
+ mass_efficiency: 0.025
+ manufacturability: 0.025
+ ```
+
+ Big penalties:
+
+ ```text
+ syntax_or_compile_error: -1.00 and terminate
+ floating_parts: -0.60 to -1.00 and terminate on final commit
+ non_manifold_mesh: -0.50
+ self_intersection: -0.50
+ unjoined_touching_failure: -0.35
+ edge_misalignment: -0.25
+ zero_thickness_geometry: -0.20
+ failed_required_edit: -0.20
+ unsafe_final_design: -0.30
+ ```
+
+ The first principle:
+
+ > A pretty shape that does not compile, is not watertight, or contains floating parts should receive a near-zero score.
+
+ For CADForge, topology is not a secondary check. It is the gate that decides whether the object is even a valid candidate.
+
+ ## Floating Parts And Coherence
+
+ This is a major reward term.
+
+ After the CAD compiles to STL, load it with `trimesh`:
+
+ ```python
+ import trimesh
+
+ mesh = trimesh.load("generated.stl")
+ components = mesh.split(only_watertight=False)  # include open fragments, not just watertight bodies
+ floating_count = max(0, len(components) - 1)
+ ```
231
+
232
+ If `len(components) > 1`, the part contains disconnected/floating geometry.
233
+
234
+ Policy:
235
+
236
+ - small intermediate penalty while exploring,
237
+ - large penalty after compile checkpoints,
238
+ - immediate episode termination if the final committed design has floating parts.
239
+
240
+ The reward should strongly prefer one physically coherent object.
241
+
242
+ For multi-part everyday objects such as chairs, tables, hooks, clamps, and trusses, "close enough visually" is not enough. The environment should verify that load-bearing parts actually touch or overlap:
243
+
244
+ - legs must contact or penetrate the underside of the seat enough to form a union,
245
+ - crossbars must touch both legs they claim to connect,
246
+ - backrests must connect to the seat or rear legs,
247
+ - hook arms must connect to the mounting plate,
248
+ - truss members must meet at nodes,
249
+ - holes/cutouts must pass through the intended parent solid, not float in empty space.
250
+
251
+ This needs edge/contact and alignment checks:
252
+
253
+ - bounding-box contact tests between named features,
254
+ - nearest-surface distance between intended mating faces,
255
+ - shared-node or overlap checks after boolean union,
256
+ - non-manifold edge count,
257
+ - boundary/open edge count,
258
+ - connected-component count,
259
+ - semantic contact graph pass/fail.
260
+
261
+ The contact graph can be part of the observation:
262
+
263
+ ```json
264
+ {
265
+ "contact_graph": {
266
+ "seat": ["front_left_leg", "front_right_leg", "rear_left_leg", "rear_right_leg", "backrest"],
267
+ "front_crossbar": ["front_left_leg", "front_right_leg"],
268
+ "rear_crossbar": ["rear_left_leg", "rear_right_leg"]
269
+ },
270
+ "missing_contacts": [],
271
+ "floating_features": []
272
+ }
273
+ ```
274
+
275
+ If the contact graph fails, the episode should continue only if the agent still has repair budget. A final committed design with missing required contacts should fail hard.
276
+
277
+ ## Mesh And Solid Quality
278
+
279
+ The "tight mesh" requirement means:
280
+
281
+ > The CAD-generated solid exports to a closed, watertight, manifold, simulation-ready mesh.
282
+
283
+ Checks:
284
+
285
+ - `mesh.is_watertight`,
286
+ - one connected component,
287
+ - no zero-area faces,
288
+ - no duplicate faces,
289
+ - no inverted normals,
290
+ - no obvious self-intersections,
291
+ - minimum wall thickness,
292
+ - minimum feature size,
293
+ - acceptable triangle aspect ratios,
294
+ - volume above a small threshold,
295
+ - bounding box inside the task envelope.
296
+
297
+ These checks are objective and judge-friendly.
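+
+ As a concrete sketch, most of these checks reduce to a few `trimesh` properties. This is a minimal example, not the full verifier; the helper name and the envelope default are illustrative.
+
+ ```python
+ import trimesh
+
+ def mesh_quality_report(stl_path, envelope_mm=(500.0, 500.0, 900.0)):
+     """Hedged sketch of the objective solid-quality checks."""
+     mesh = trimesh.load(stl_path)
+     return {
+         "watertight": bool(mesh.is_watertight),
+         # only_watertight=False keeps open fragments, so nothing is silently dropped
+         "components": len(mesh.split(only_watertight=False)),
+         "zero_area_faces": int((mesh.area_faces < 1e-9).sum()),
+         # volume is only meaningful for a closed mesh
+         "volume_ok": bool(mesh.is_watertight and mesh.volume > 1e-3),
+         "inside_envelope": bool(
+             all(e <= lim for e, lim in zip(mesh.extents, envelope_mm))
+         ),
+     }
+ ```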
298
+
299
+ For browser-side MVP rendering, the same checks can be computed directly on the generated mesh:
300
+
301
+ - count connected components from triangle/vertex adjacency,
302
+ - count boundary edges,
303
+ - count non-manifold edges,
304
+ - mark watertight only if every edge has exactly two incident faces,
305
+ - expose the failure as a reward penalty and UI metric.
306
+
307
+ This is not a mock reward. It is a real topology check on the actual rendered mesh, even if the renderer initially supports only a constrained OpenSCAD subset.
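+
+ A minimal sketch of that edge rule, written here in Python for clarity (the same counting ports directly to the browser build): an edge with one incident face is a boundary edge, exactly two is manifold, three or more is non-manifold. `faces` is assumed to be an (n, 3) array of triangle vertex indices.
+
+ ```python
+ from collections import Counter
+
+ def edge_report(faces):
+     counts = Counter()
+     for a, b, c in faces:
+         # undirected edges, so sort each vertex pair
+         for u, v in ((a, b), (b, c), (c, a)):
+             counts[(min(u, v), max(u, v))] += 1
+     boundary = sum(1 for n in counts.values() if n == 1)
+     non_manifold = sum(1 for n in counts.values() if n > 2)
+     return {
+         "boundary_edges": boundary,
+         "non_manifold_edges": non_manifold,
+         # watertight only if every edge has exactly two incident faces
+         "watertight": boundary == 0 and non_manifold == 0,
+     }
+ ```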
308
+
309
+ ## Reference Model Reward
310
+
311
+ A second reward channel can compare the generated CAD mesh against a reference object.
312
+
313
+ For the chair benchmark, the first reference asset is:
314
+
315
+ ```text
316
+ 3d-models/ikea_markus_office_chair.glb
317
+ ```
318
+
319
+ This should not replace topology rewards. It should sit after them:
320
+
321
+ ```text
322
+ if not build_success or not watertight or floating_parts > 0:
323
+ reward = near_zero
324
+ else:
325
+ reward = topology_reward + structural_reward + reference_similarity_reward
326
+ ```
327
+
328
+ Reference similarity can use:
329
+
330
+ - normalized bounding-box dimensions,
331
+ - oriented bounding-box alignment,
332
+ - voxel IoU,
333
+ - Chamfer distance between sampled surface points,
334
+ - silhouette overlap from canonical views,
335
+ - part-presence classifiers for seat/legs/back/arms/headrest,
336
+ - center-of-mass and support-polygon checks.
337
+
338
+ For chairs:
339
+
340
+ - CAD output should have a seat-like horizontal surface,
341
+ - support structures should reach the floor,
342
+ - backrest should rise behind the seat,
343
+ - optional armrests/headrest should align with the reference,
344
+ - dimensions should fit a plausible chair envelope.
345
+
346
+ This gives the model a shape target without letting it cheat by creating a raw mesh. The final artifact still needs to be editable SCAD/CAD code.
347
+
348
+ ## Prompt-To-Reference-To-CAD Pipeline
349
+
350
+ The general everyday-object pipeline:
351
+
352
+ ```text
353
+ input prompt
354
+ -> generate or retrieve reference image
355
+ -> generate/retrieve watertight reference mesh or GLB
356
+ -> normalize reference mesh scale/orientation
357
+ -> agent creates SCAD/CAD through constrained actions
358
+ -> compile/render candidate mesh
359
+ -> reject if uncompiled, non-watertight, or floating
360
+ -> compare candidate mesh to reference mesh
361
+ -> run object-specific structural/contact checks
362
+ -> return reward and repair hints
363
+ ```
364
+
365
+ Possible reference sources:
366
+
367
+ - curated GLB files for common objects,
368
+ - image-to-3D systems such as Tripo-style services,
369
+ - generated image followed by image-to-3D,
370
+ - public model libraries when licensing permits,
371
+ - procedural target generators for simple brackets, hooks, tables, and trusses.
372
+
373
+ Reward order matters:
374
+
375
+ 1. **Compile validity**: code must parse and render.
376
+ 2. **Topology validity**: one watertight connected component.
377
+ 3. **Semantic contact graph**: required parts touch and align.
378
+ 4. **Reference similarity**: looks like the target/reference.
379
+ 5. **Structural validity**: load path and safety checks.
380
+ 6. **Editability**: parameters survive changes.
381
+
382
+ This attacks a core AI CAD weakness:
383
+
384
+ > Models can generate objects that look plausible in one view but are topologically broken, uneditable, disconnected, or mechanically meaningless.
385
+
386
+ CADForge should train against exactly those failures.
387
+
388
+ ## Editability Tests
389
+
390
+ This is where CADForge becomes much stronger than ordinary shape generation.
391
+
392
+ After a final design is produced, mutate the design brief:
393
+
394
+ - make the chair seat 10% wider,
395
+ - increase chair load from 700 N to 900 N,
396
+ - move screw spacing from 48 mm to 56 mm,
397
+ - increase hook tip load by 20%,
398
+ - change material from PLA to PETG,
399
+ - require all fillets above 3 mm,
400
+ - shrink the envelope,
401
+ - switch from FDM printing to CNC constraints.
402
+
403
+ The environment recompiles after the mutation.
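+
+ A minimal sketch of one such mutation check, assuming the design exposes a named top-level OpenSCAD parameter and the `openscad` CLI is on the PATH; the parameter name and file paths are illustrative.
+
+ ```python
+ import re
+ import subprocess
+
+ def rebuilds_after_mutation(scad_src, param="seat_width", factor=1.10):
+     # scale the named parameter, e.g. "seat_width = 400;" -> "seat_width = 440;"
+     pattern = rf"(?m)^({param}\s*=\s*)([0-9.]+)(\s*;)"
+     mutated, hits = re.subn(
+         pattern,
+         lambda m: f"{m.group(1)}{float(m.group(2)) * factor:g}{m.group(3)}",
+         scad_src,
+     )
+     if hits != 1:
+         return False  # parameter missing or ambiguous: the edit test fails
+     with open("mutated.scad", "w") as f:
+         f.write(mutated)
+     proc = subprocess.run(
+         ["openscad", "-o", "mutated.stl", "mutated.scad"], capture_output=True
+     )
+     # a real check would also re-run the topology gates on mutated.stl
+     return proc.returncode == 0
+ ```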
404
+
405
+ Reward the agent only if the design survives without a full rewrite. This encourages:
406
+
407
+ - named parameters,
408
+ - stable feature IDs,
409
+ - constrained sketches,
410
+ - reusable modules,
411
+ - clean boolean ordering,
412
+ - avoiding brittle one-off coordinates.
413
+
414
+ ## First Benchmark: Chairs
415
+
416
+ A chair is a surprisingly good first benchmark if the grammar is constrained.
417
+
418
+ Why:
419
+
420
+ - everyone understands what a chair should contain,
421
+ - it requires multiple connected components to become one coherent assembly,
422
+ - it exposes floating-part failures clearly,
423
+ - it has real load-bearing structure,
424
+ - it needs symmetry, legs, bracing, seat, and backrest,
425
+ - it can support long-horizon 100-300 step episodes.
426
+
427
+ Start simple:
428
+
429
+ ```text
430
+ Build a four-legged chair that supports a 700 N seated load.
431
+ It must include a seat panel, four legs, lower crossbars, and a backrest.
432
+ All parts must be connected into one watertight/manifold solid.
433
+ Keep the bounding box under 500 mm x 500 mm x 900 mm.
434
+ ```
435
+
436
+ Then increase difficulty:
437
+
438
+ - ergonomic curved chair,
439
+ - armrests,
440
+ - headrest,
441
+ - 1000 N seat load plus 100 N backrest load,
442
+ - lightweighting,
443
+ - printability constraints,
444
+ - edit test: widen seat and raise load.
445
+
446
+ Other early benchmark tasks:
447
+
448
+ - wall hook,
449
+ - shelf bracket,
450
+ - table,
451
+ - truss,
452
+ - clamp,
453
+ - motor mount.
454
+
455
+ ## Training Path
456
+
457
+ Do not start with full free-form CAD.
458
+
459
+ Curriculum:
460
+
461
+ 1. **2D primitives:** square, circle, translate, union.
462
+ 2. **2D booleans:** difference and intersection.
463
+ 3. **2.5D extrusion:** plate with holes.
464
+ 4. **Simple 3D CSG:** cube/cylinder compositions.
465
+ 5. **Connectivity tasks:** all parts must touch or union.
466
+ 6. **Household parts:** hook, bracket, table, chair.
467
+ 7. **Structural tasks:** loads, fixed regions, stress proxies.
468
+ 8. **Editability tasks:** mutate dimensions and rebuild.
469
+ 9. **Long-horizon tasks:** 300-step chair/bracket/truss episodes.
470
+
471
+ This makes the learning curve visible:
472
+
473
+ ```text
474
+ random/untrained agent:
475
+ compiles sometimes, often disconnected, weak structure
476
+
477
+ trained agent:
478
+ valid AST, connected geometry, cleaner feature tree, passes structural checks
479
+ ```
480
+
481
+ ## Model And Algorithm Notes
482
+
483
+ For the hackathon, the minimum is not a perfect RL algorithm. The key is showing improvement.
484
+
485
+ Practical options:
486
+
487
+ - collect successful traces from a scripted expert,
488
+ - train/fine-tune an LLM with TRL/Unsloth on action traces,
489
+ - evaluate before/after in the environment,
490
+ - optionally run PPO or GRPO over tool-action rewards.
491
+
492
+ For a deeper version:
493
+
494
+ - PPO works well for discrete grammar actions,
495
+ - GRPO may be attractive for LLM tool-action policies,
496
+ - state encoder can combine text brief, AST history, and geometry metrics,
497
+ - later geometry encoders can use voxel grids, PointNet, or compact mesh summaries.
498
+
499
+ Do not overbuild the neural architecture for the MVP. The environment and reward quality matter more.
500
+
501
+ ## Experiment 2 Scope
502
+
503
+ Experiment 2 should be the CADForge prototype:
504
+
505
+ ```text
506
+ Experiment 1:
507
+ prompt -> structured mechanical design -> Three.js render -> coarse FEA
508
+
509
+ Experiment 2:
510
+ prompt -> multi-step CSG/CAD actions -> CAD validity checks -> connected/watertight reward -> structural household part score
511
+ ```
512
+
513
+ The browser should show:
514
+
515
+ - design prompt,
516
+ - system prompt,
517
+ - CSG action trace,
518
+ - generated pseudo-OpenSCAD/code-CAD,
519
+ - geometry validation metrics,
520
+ - structural/checkpoint reward,
521
+ - 3D viewer,
522
+ - before/after or untrained/trained comparison.
523
+
524
+ The initial fun demo:
525
+
526
+ > Ask the agent to build a chair. Watch it add seat, legs, crossbars, backrest, fillets/ribs, run validation, detect floating parts, connect them, and commit a valid CAD-like object.
527
+
528
+ ## Judging Story
529
+
530
+ Problem:
531
+
532
+ > LLMs can describe parts but are unreliable at building valid, editable CAD.
533
+
534
+ Environment:
535
+
536
+ > CADForge turns CAD creation into a long-horizon verifiable tool environment. The agent edits a CSG/feature tree, compiles it, receives topology/manufacturability/structure feedback, and revises until the design passes.
537
+
538
+ Training:
539
+
540
+ > We fine-tune or RL-train on CAD action traces and reward feedback. The model learns to use fewer invalid operations, avoid floating parts, create connected watertight solids, and satisfy structural constraints.
541
+
542
+ Evidence:
543
+
544
+ - reward curves,
545
+ - valid build rate,
546
+ - connected component count,
547
+ - watertight rate,
548
+ - editability pass rate,
549
+ - structural safety pass rate,
550
+ - before/after rendered parts.
551
+
552
+ Why it matters:
553
+
554
+ > Reliable CAD agents would unlock practical engineering workflows. The hard part is not drawing a shape; it is making geometry that builds, edits, exports, and survives physical constraints.
555
+
556
+ ## Score
557
+
558
+ Score: **9/10**
559
+
560
+ This direction is strong because it is novel, verifiable, visual, long-horizon, and directly aimed at a known frontier-model weakness.
561
+
562
+ The best MVP path is:
563
+
564
+ > Build chairs/hooks/brackets through constrained CSG actions, validate with OpenSCAD/trimesh-style checks, and add MechForge structural rewards as the next layer.
docs/brainstorm/11-reference-model-reward-pipeline.md ADDED
@@ -0,0 +1,192 @@
1
+ # Document 11: Reference Model Reward Pipeline
2
+
3
+ Date: 2026-04-25
4
+
5
+ ## Core Idea
6
+
7
+ CADForge should not only reward whether code compiles. It should reward whether the generated CAD becomes a valid, watertight, physically coherent object that resembles the intended target.
8
+
9
+ The strongest reward stack is:
10
+
11
+ ```text
12
+ prompt
13
+ -> reference image or reference mesh
14
+ -> agent-generated SCAD/CAD
15
+ -> compiled/rendered candidate mesh
16
+ -> topology gate
17
+ -> shape similarity reward
18
+ -> semantic contact graph reward
19
+ -> structural/editability reward
20
+ ```
21
+
22
+ The topology gate comes first.
23
+
24
+ If the candidate does not compile, is not watertight, or has floating parts, the reward should be near zero regardless of visual similarity.
25
+
26
+ ## Markus Chair Reference
27
+
28
+ Current reference asset:
29
+
30
+ ```text
31
+ 3d-models/ikea_markus_office_chair.glb
32
+ ```
33
+
34
+ This can become the first chair-reference target.
35
+
36
+ The CAD agent still outputs editable SCAD/CAD code. The GLB is only used as a reward/reference object.
37
+
38
+ ## Why A Reference Mesh Helps
39
+
40
+ Prompt-only reward is too fuzzy.
41
+
42
+ For example:
43
+
44
+ ```text
45
+ Build an office chair.
46
+ ```
47
+
48
+ The agent may create:
49
+
50
+ - a stool,
51
+ - a flat bracket,
52
+ - a disconnected pile of cubes,
53
+ - a chair-like silhouette with no structural connections,
54
+ - a raw mesh that looks okay but is not editable CAD.
55
+
56
+ A reference mesh gives a concrete target distribution:
57
+
58
+ - overall proportions,
59
+ - seat/back/arm/headrest layout,
60
+ - rough silhouette,
61
+ - support footprint,
62
+ - height/width/depth ratios,
63
+ - part presence.
64
+
65
+ But the reference mesh must not become the final output. The final output remains parametric CAD code.
66
+
67
+ ## Reward Order
68
+
69
+ Use a strict reward order:
70
+
71
+ 1. **Compile validity**
72
+ - SCAD/CAD parses.
73
+ - CAD kernel or renderer produces geometry.
74
+ - No unsupported operations.
75
+
76
+ 2. **Topology validity**
77
+ - one connected component,
78
+ - no floating parts,
79
+ - watertight mesh,
80
+ - no boundary edges,
81
+ - no non-manifold edges,
82
+ - nonzero volume.
83
+
84
+ 3. **Semantic contact graph**
85
+ - chair legs touch seat,
86
+ - backrest touches seat or rear supports,
87
+ - crossbars touch both intended legs,
88
+ - hook arm touches wall plate,
89
+ - truss members meet at nodes.
90
+
91
+ 4. **Reference similarity**
92
+ - voxel IoU,
93
+ - Chamfer distance,
94
+ - silhouette overlap,
95
+ - bounding-box proportions,
96
+ - support footprint similarity.
97
+
98
+ 5. **Engineering checks**
99
+ - load path,
100
+ - safety factor,
101
+ - deflection,
102
+ - material/process constraints.
103
+
104
+ 6. **Editability**
105
+ - named parameters,
106
+ - stable features,
107
+ - rebuilds after dimension/load/material mutation.
108
+
109
+ ## Reference Similarity Metrics
110
+
111
+ Candidate metrics (a voxel IoU and Chamfer sketch follows this list):
112
+
113
+ - **Bounding-box ratio score**
114
+ - Compare normalized width/depth/height proportions.
115
+
116
+ - **Voxel IoU**
117
+ - Normalize both meshes into a unit cube.
118
+ - Voxelize into 32^3 or 64^3 occupancy.
119
+ - Reward intersection over union.
120
+
121
+ - **Chamfer distance**
122
+ - Sample surface points from both meshes.
123
+ - Reward low bidirectional nearest-neighbor distance.
124
+
125
+ - **Silhouette reward**
126
+ - Render front, side, top, and isometric masks.
127
+ - Reward 2D IoU per view.
128
+
129
+ - **Part-presence reward**
130
+ - For chairs: seat, legs/base, backrest, arms/headrest if requested.
131
+ - For hooks: wall plate, hook arm, tip lip, screw holes.
132
+ - For trusses: triangular members and joint nodes.
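+
+ A minimal sketch of the voxel IoU and Chamfer scores above, assuming `trimesh`, `numpy`, and `scipy`, and assuming both meshes are already normalized into the unit cube. The voxel grid here is filled from surface samples, so it approximates shell occupancy rather than solid occupancy.
+
+ ```python
+ import numpy as np
+ import trimesh
+ from scipy.spatial import cKDTree
+
+ def _occupancy(mesh, res=32, samples=20000):
+     pts, _ = trimesh.sample.sample_surface(mesh, samples)
+     idx = np.clip((pts * res).astype(int), 0, res - 1)  # coordinates in [0, 1]^3
+     grid = np.zeros((res, res, res), dtype=bool)
+     grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
+     return grid
+
+ def voxel_iou(mesh_a, mesh_b):
+     a, b = _occupancy(mesh_a), _occupancy(mesh_b)
+     return (a & b).sum() / max(1, (a | b).sum())
+
+ def chamfer(mesh_a, mesh_b, samples=5000):
+     pa, _ = trimesh.sample.sample_surface(mesh_a, samples)
+     pb, _ = trimesh.sample.sample_surface(mesh_b, samples)
+     # bidirectional mean nearest-neighbor distance; lower is better,
+     # so a reward could use exp(-k * chamfer)
+     return cKDTree(pb).query(pa)[0].mean() + cKDTree(pa).query(pb)[0].mean()
+ ```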
133
+
134
+ ## Everyday-Object Pipeline
135
+
136
+ For general everyday objects:
137
+
138
+ ```text
139
+ Input prompt X
140
+ -> generate or retrieve image of X
141
+ -> image-to-3D system creates a reference mesh
142
+ -> repair/validate reference mesh for watertightness if needed
143
+ -> normalize reference mesh
144
+ -> CADForge agent generates SCAD/CAD code
145
+ -> candidate compiles/renders to mesh
146
+ -> topology gate rejects bad CAD
147
+ -> candidate/reference similarity gives dense reward
148
+ -> structural/contact/editability checks give engineering reward
149
+ ```
150
+
151
+ Possible reference sources:
152
+
153
+ - curated GLB library,
154
+ - generated image plus image-to-3D,
155
+ - Tripo-style API,
156
+ - procedural targets for simple mechanical parts,
157
+ - user-supplied GLB/STL.
158
+
159
+ ## Important Constraint
160
+
161
+ Do not let the system solve the task by returning the reference mesh.
162
+
163
+ The agent must output:
164
+
165
+ ```text
166
+ editable SCAD/CAD source code
167
+ ```
168
+
169
+ The reward can compare against a mesh, but the submitted artifact must be code-CAD.
170
+
171
+ ## MVP Plan
172
+
173
+ 1. Use the Markus GLB as a chair reference.
174
+ 2. Load candidate SCAD mesh and reference GLB in the browser or Python verifier.
175
+ 3. Compute bounding-box proportion score first.
176
+ 4. Add connected-component and watertight hard gates.
177
+ 5. Add voxel IoU or Chamfer distance.
178
+ 6. Add semantic chair checks:
179
+ - seat-like surface,
180
+ - backrest-like vertical surface,
181
+ - floor-contacting supports,
182
+ - no floating parts.
183
+ 7. Show before/after reward curves for untrained vs trained SCAD generation.
184
+
185
+ ## Why This Matters
186
+
187
+ This closes a major gap in AI CAD:
188
+
189
+ > The model must learn to create valid editable geometry that both resembles the requested object and behaves like a coherent physical part.
190
+
191
+ That is a stronger training signal than prompt-only judging, image-only judging, or topology-only judging.
192
+
docs/brainstorm/12-markus-chair-scope-grpo-rlve.md ADDED
@@ -0,0 +1,161 @@
1
+ # Brainstorm 12: Markus Chair Scope for CADForge RLVE
2
+
3
+ ## Core scope
4
+
5
+ For the hackathon, CADForge should scope down to one object family: an office chair similar to the IKEA Markus chair. This is narrow enough for a 0.6B model to learn meaningful structure, but still hard enough to prove the thesis that a model can generate editable, valid CAD instead of merely producing decorative 3D mesh output.
6
+
7
+ The environment should train and evaluate prompts such as:
8
+
9
+ - Make an editable SCAD model as similar as possible to the Markus chair reference.
10
+ - Make a Markus-like chair with a taller backrest.
11
+ - Make the chair with thicker armrests.
12
+ - Make the chair base wider while preserving the five-star support.
13
+ - Repair this candidate chair so every structural part touches the assembly.
14
+ - Improve the candidate so it is watertight, editable, and made from clean primitives.
15
+
16
+ This gives us a focused benchmark: not "make any CAD object," but "learn the grammar and construction pattern for one real household/mechanical object."
17
+
18
+ ## Why this is good for GRPO and RLVE
19
+
20
+ This is a strong fit for GRPO because each prompt can produce a group of SCAD candidates, and the verifier can rank them with real geometry signals. The reward does not need to be vague preference text. It can be built from compile success, mesh validity, connected components, watertightness, contact graph quality, bounding-box alignment, silhouette similarity, and shape similarity to the reference GLB.
21
+
22
+ It is also a strong fit for RLVE because the environment is not only judging final text. It is compiling and rendering the artifact, measuring what actually exists, and feeding that back into the next attempt.
23
+
24
+ The important constraint is that the 0.6B model should not be asked to emit arbitrary raw CAD text from nothing. It should be trained around a constrained SCAD/CSG grammar or AST-style action space:
25
+
26
+ - Add primitive: cube, cylinder, sphere.
27
+ - Add transform: translate, rotate, scale.
28
+ - Add boolean: union, difference, intersection.
29
+ - Add semantic chair part: seat, backrest, armrest, central column, five-star base spoke, caster proxy.
30
+ - Repair operation: snap a part to nearest body, thicken a thin wall, union overlapping contact regions, remove floating component.
31
+
32
+ That is how a small model can become surprisingly good. It learns the local construction game and the verifier keeps it from drifting into invalid geometry.
33
+
34
+ ## Reference target
35
+
36
+ The fixed hackathon reference should be the existing Markus chair GLB in the repo. We normalize it once and use it as the reward target:
37
+
38
+ 1. Load the GLB.
39
+ 2. Normalize scale, orientation, and origin.
40
+ 3. Extract reference bounding box, silhouette renders, voxel occupancy, point cloud samples, and major-part hints.
41
+ 4. Generate a SCAD candidate.
42
+ 5. Render candidate to mesh.
43
+ 6. Normalize candidate to the same coordinate frame.
44
+ 7. Score topology first, then score similarity.
45
+
46
+ The reference mesh does not need to be CAD-native. It can be a target signal. The output we care about is still editable SCAD.
47
+
48
+ ## Reward shape
49
+
50
+ Topology should dominate the reward. A model that makes a pretty chair with floating parts should lose badly.
51
+
52
+ Suggested hard gates:
53
+
54
+ - Uncompilable SCAD: terminate with severe penalty.
55
+ - Empty mesh: terminate with severe penalty.
56
+ - More than one connected component: severe penalty or terminate.
57
+ - Non-watertight mesh: severe penalty.
58
+ - Boundary edges or non-manifold edges: severe penalty.
59
+ - Parts that are close but not touching: high penalty.
60
+ - Excessive node count for the same shape quality: mild parsimony penalty.
61
+
62
+ Suggested positive rewards:
63
+
64
+ - Similar voxel occupancy to Markus reference.
65
+ - Low Chamfer distance between sampled candidate/reference point clouds.
66
+ - Matching front, side, top, and isometric silhouettes.
67
+ - Matching chair-specific dimensions: tall backrest, seat height, base radius, armrest height.
68
+ - Valid contact graph: back touches seat, armrests touch seat/back, column touches seat/base, all spokes touch the hub.
69
+ - Clean editability: named or segmented code blocks for seat, backrest, arms, hub, spokes, and caster proxies.
70
+
71
+ The reward should look roughly like:
72
+
73
+ ```text
74
+ R = topology_gate
75
+ + shape_similarity
76
+ + silhouette_similarity
77
+ + chair_part_contact_score
78
+ + editability_score
79
+ - node_count_penalty
80
+ - thin_wall_penalty
81
+ ```
82
+
83
+ Where `topology_gate` can zero out or heavily negate the rest when the CAD is invalid.
84
+
85
+ ## Self-correction loop
86
+
87
+ The agent should be allowed to see verifier output and revise:
88
+
89
+ - connected component count
90
+ - boundary edge count
91
+ - non-manifold edge count
92
+ - watertight true/false
93
+ - bounding-box error
94
+ - silhouette error by view
95
+ - nearest-part contact gaps
96
+ - largest missing region against the reference voxel grid
97
+
98
+ This makes the task agentic. The model can take steps like:
99
+
100
+ - "The backrest is disconnected from the seat; lower it by 4 mm and overlap by 2 mm."
101
+ - "The five-star base spokes are separate; add a central hub cylinder and union each spoke into it."
102
+ - "The silhouette is too short; increase backrest height."
103
+ - "The right side silhouette is missing armrests; add two horizontal cylinders/boxes connected to the backrest and seat."
104
+
105
+ For this benchmark, part assembly is better than carving a chair from one large slab. A slab can make connected geometry easier, but it hurts editability and does not teach meaningful mechanical construction. The model should learn to assemble semantic parts with intentional overlaps and unions.
106
+
107
+ ## FAL reference-model expansion path
108
+
109
+ For Markus scope, we start with the provided GLB. For broader household objects later, we can generate reference meshes from prompts:
110
+
111
+ 1. User enters an engineering prompt.
112
+ 2. OpenAI image generation creates a clean, white-background product image.
113
+ 3. FAL SAM 3D Objects reconstructs the object from that image.
114
+ 4. We download `model_glb`, `individual_glbs`, `gaussian_splat`, metadata, and `artifacts_zip` when available.
115
+ 5. We normalize the GLB into a reward target.
116
+ 6. CADForge trains or evaluates SCAD candidates against that reference.
117
+
118
+ Example input prompt:
119
+
120
+ ```text
121
+ Design a simple 6061 aluminum wall-mounted J hook for a 120 N downward hanging load at the hook tip. It should visibly look like a hook, with a compact wall mount and a curved hook arm, not a ribbed cantilever bracket.
122
+ ```
123
+
124
+ The image generation prompt should keep the reference clean:
125
+
126
+ ```text
127
+ Design a simple 6061 aluminum wall-mounted J hook for a 120 N downward hanging load at the hook tip. It should visibly look like a hook, with a compact wall mount and a curved hook arm, not a ribbed cantilever bracket. Realistic product render, plain white background, isolated object, no labels, no text.
128
+ ```
129
+
130
+ FAL's `fal-ai/sam-3/3d-objects` endpoint takes an `image_url` plus optional segmentation prompt/masks/point prompts/box prompts, and can return GLB outputs, Gaussian splat output, metadata, individual object files, and an artifacts zip. The API docs recommend keeping the FAL key on the server side through `FAL_KEY`. The playground currently lists the request cost as `$0.02` per generation, but the app should read and log actual cost metadata at runtime if FAL exposes it.
131
+
132
+ For our app, this should be server-side only:
133
+
134
+ ```text
135
+ POST /api/reference-models
136
+ prompt -> image generation
137
+ image_url -> fal-ai/sam-3/3d-objects
138
+ result -> download GLB/artifacts to data/reference-models/<slug>/
139
+ normalized target -> rewards/<slug>.json
140
+ ```
141
+
142
+ The frontend should never expose `FAL_KEY` or `FAL_AI_API_KEY`.
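+
+ A hypothetical server-side sketch using the `fal-client` Python package. The endpoint name comes from the FAL docs cited above; the `model_glb` response key is an assumption and should be verified against the live schema. `FAL_KEY` stays in the server environment.
+
+ ```python
+ import fal_client  # reads FAL_KEY from the environment
+
+ def build_reference_glb(image_url: str) -> str:
+     result = fal_client.subscribe(
+         "fal-ai/sam-3/3d-objects",
+         arguments={"image_url": image_url},
+     )
+     # assumption: verify the exact response key names before relying on them
+     return result["model_glb"]["url"]
+ ```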
143
+
144
+ ## Minimal hackathon deliverable
145
+
146
+ The most credible v1 is:
147
+
148
+ 1. Simple UI with one `Generate` button.
149
+ 2. Prompt asks for a Markus-like chair.
150
+ 3. Model generates SCAD.
151
+ 4. Browser renders the SCAD as real geometry.
152
+ 5. Viewer automatically shows isometric, front, back, left, right, top, and bottom views.
153
+ 6. Verifier reports watertightness, floating components, boundary edges, and non-manifold edges.
154
+ 7. Reward code compares the candidate mesh against the Markus GLB reference.
155
+ 8. GRPO loop ranks multiple candidates and fine-tunes a 0.6B model on the winning patterns.
156
+
157
+ This is a better hackathon scope than generic CAD generation because the demo can show visible learning: early chairs have floating parts and broken bases; later chairs become coherent, connected, watertight, and more Markus-like.
158
+
159
+ ## Position
160
+
161
+ I would rate this direction a 9/10 for the hackathon if we keep the scope narrow. It has a clean story, a measurable verifier, a concrete reference object, and a realistic training target for a small model. The risk is trying to generalize too early. The winning move is to make one chair family work extremely well, then expand the same environment to hooks, brackets, tables, trusses, and other household mechanical objects.
docs/brainstorm/13-markus-chair-cadquery-grpo-rlve-plan.md ADDED
@@ -0,0 +1,799 @@
1
+ # Brainstorm 13: CadQuery Markus Chair GRPO/RLVE Hackathon Plan
2
+
3
+ Date: 2026-04-25
4
+
5
+ ## One-line pitch
6
+
7
+ CADForge is a reinforcement-learning environment where language models learn to create and revise real parametric CadQuery models by acting through CAD tools, observing rendered geometry, and optimizing verifiable rewards for topology, editability, and similarity to a reference object.
8
+
9
+ The flagship benchmark is a Markus-style office chair reconstructed as editable CadQuery code from a reference GLB.
10
+
11
+ ## Hackathon thesis
12
+
13
+ Most models can write plausible CAD-looking Python once. They fail when they must maintain a persistent 3D world model across many revisions:
14
+
15
+ - parts float,
16
+ - booleans fail,
17
+ - dimensions drift,
18
+ - edits break the model,
19
+ - screenshots look plausible but topology is invalid,
20
+ - the model cannot reliably repair geometry from tool feedback.
21
+
22
+ CADForge turns this into a trainable environment:
23
+
24
+ ```text
25
+ prompt
26
+ -> model proposes CadQuery code or edit action
27
+ -> backend executes CadQuery
28
+ -> environment exports STL/mesh/screenshots
29
+ -> reward functions score topology, similarity, editability, and tool efficiency
30
+ -> model revises for up to 300 actions
31
+ -> GRPO trains a small model to need fewer revision steps
32
+ ```
33
+
34
+ The objective is not just "generate a chair." The objective is:
35
+
36
+ > Train a tiny Qwen model to become a better CadQuery CAD agent that produces valid, editable, reference-aligned geometry with fewer tool calls.
37
+
38
+ ## Theme alignment
39
+
40
+ ### Theme 2: Long-horizon planning
41
+
42
+ The environment supports up to 300 actions per episode:
43
+
44
+ - generate initial CadQuery,
45
+ - render,
46
+ - inspect topology,
47
+ - inspect screenshots,
48
+ - compare to GLB reference,
49
+ - edit dimensions,
50
+ - add missing parts,
51
+ - repair disconnected components,
52
+ - rerender,
53
+ - commit final design.
54
+
55
+ The reward is delayed and multi-part. A model must plan the full assembly, not just write a single pretty script.
56
+
57
+ ### Theme 3.1: Professional world modeling
58
+
59
+ This is a realistic engineering workflow:
60
+
61
+ - Python CadQuery execution,
62
+ - STL export,
63
+ - mesh validation,
64
+ - reference GLB normalization,
65
+ - point-cloud and silhouette comparison,
66
+ - persistent CAD state,
67
+ - artifact logs,
68
+ - render snapshots,
69
+ - code diffs,
70
+ - editability checks.
71
+
72
+ The agent must model a partially observable 3D world using tool feedback. It cannot solve the task by text pattern matching alone.
73
+
74
+ ### Theme 4: Self-improvement
75
+
76
+ The environment can generate adaptive tasks:
77
+
78
+ - make the chair taller,
79
+ - make armrests thicker,
80
+ - widen the five-star base,
81
+ - repair floating parts,
82
+ - reduce revision count,
83
+ - preserve editability under parameter changes.
84
+
85
+ Curriculum generation can escalate from simple blocks to full chairs. The model improves by repeatedly encountering its own CAD failure modes.
86
+
87
+ ### Theme 1: Multi-agent interactions, optional extension
88
+
89
+ If time permits, split the workflow into specialist roles:
90
+
91
+ - Designer agent proposes CadQuery.
92
+ - Critic agent interprets screenshots and reward reports.
93
+ - Repair agent edits only broken geometry.
94
+ - Verifier agent decides whether to commit.
95
+
96
+ This is not required for the MVP, but it is a clean demo extension.
97
+
98
+ ## What to build first
99
+
100
+ Build the CadQuery-only Markus environment before training.
101
+
102
+ Do not start with full RL. First make the environment stable, because GRPO only matters if the reward is trustworthy.
103
+
104
+ Priority order:
105
+
106
+ 1. Reference preprocessing pipeline.
107
+ 2. CadQuery candidate execution pipeline.
108
+ 3. Reward functions with visible breakdowns.
109
+ 4. GPT-5.4/GPT-5.5 multi-step benchmark traces.
110
+ 5. Small Qwen GRPO run.
111
+ 6. Before/after report and demo.
112
+
113
+ ## Reference pipeline
114
+
115
+ Input:
116
+
117
+ ```text
118
+ 3d-models/ikea_markus_office_chair.glb
119
+ ```
120
+
121
+ Steps:
122
+
123
+ 1. Load the GLB with `trimesh`.
124
+ 2. Convert scene nodes into one mesh.
125
+ 3. Normalize orientation so:
126
+ - Z is up,
127
+ - seat/back height is vertical,
128
+ - chair front faces negative Y or a fixed canonical direction.
129
+ 4. Normalize origin:
130
+ - center X/Y at zero,
131
+ - floor/base touches Z = 0,
132
+ - scale height to a canonical chair height, for example 1000 mm.
133
+ 5. Save normalized artifacts:
134
+ - `reference_normalized.glb`,
135
+ - `reference_normalized.stl`,
136
+ - `reference_point_cloud.npy`,
137
+ - `reference_voxels.npz`,
138
+ - `reference_silhouettes/*.png`,
139
+ - `reference_metrics.json`.
140
+
141
+ Reference metrics:
142
+
143
+ ```json
144
+ {
145
+ "bbox_mm": {"x": 620, "y": 650, "z": 1000},
146
+ "seat_height_ratio": 0.45,
147
+ "back_height_ratio": 0.55,
148
+ "base_radius_ratio": 0.32,
149
+ "views": ["front", "back", "left", "right", "top", "isometric"],
150
+ "semantic_hints": ["seat", "tall_backrest", "armrests", "central_column", "five_star_base"]
151
+ }
152
+ ```
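+
+ A minimal sketch of steps 1, 2, and 4 of the pipeline above, assuming `trimesh`. Flattening the scene by concatenating its geometries ignores node transforms, so a full pipeline should apply them first; the Z-up orientation fix of step 3 is asset-dependent and omitted here.
+
+ ```python
+ import trimesh
+
+ def normalize_reference(glb_path, target_height=1000.0):
+     loaded = trimesh.load(glb_path)
+     mesh = (
+         loaded
+         if isinstance(loaded, trimesh.Trimesh)
+         else trimesh.util.concatenate(list(loaded.geometry.values()))
+     )
+     lo, hi = mesh.bounds  # axis-aligned min/max corners
+     # center X/Y at zero and put the floor at Z = 0
+     mesh.apply_translation([-(lo[0] + hi[0]) / 2, -(lo[1] + hi[1]) / 2, -lo[2]])
+     # scale to the canonical chair height
+     mesh.apply_scale(target_height / (hi[2] - lo[2]))
+     return mesh
+ ```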
153
+
154
+ ## Candidate pipeline
155
+
156
+ Input:
157
+
158
+ ```text
159
+ task prompt + current CadQuery code + last verifier report + screenshot summaries
160
+ ```
161
+
162
+ Candidate execution:
163
+
164
+ 1. Write CadQuery code.
165
+ 2. Run in a sandboxed Python subprocess.
166
+ 3. Export STL.
167
+ 4. Load STL with `trimesh`.
168
+ 5. Normalize candidate to the same coordinate frame as the reference.
169
+ 6. Render six fixed views plus an isometric.
170
+ 7. Compute reward components.
171
+ 8. Save artifacts.
172
+
173
+ Artifacts per attempt:
174
+
175
+ ```text
176
+ runs/<episode_id>/<step_id>/
177
+ candidate.py
178
+ candidate.stl
179
+ candidate_normalized.stl
180
+ renders/isometric.png
181
+ renders/front.png
182
+ renders/back.png
183
+ renders/left.png
184
+ renders/right.png
185
+ renders/top.png
186
+ reward.json
187
+ verifier_report.md
188
+ ```
189
+
190
+ ## Agent action space
191
+
192
+ For the demo UI and GPT benchmark, allow free-form CadQuery edits.
193
+
194
+ For GRPO training, use a stricter action format so a small Qwen model can learn without forcing hand-authored chair-part tools:
195
+
196
+ ```json
197
+ {
198
+ "thought": "The backrest is too short and disconnected from the seat.",
199
+ "tool": "apply_patch",
200
+ "patch": "*** Begin Patch\n*** Update File: candidate.py\n@@\n-back_height = 520\n+back_height = 680\n*** End Patch"
201
+ }
202
+ ```
203
+
204
+ Initial allowed tools:
205
+
206
+ ```text
207
+ write_initial_cadquery
208
+ apply_patch
209
+ replace_file
210
+ render_candidate
211
+ inspect_reward
212
+ inspect_screenshots
213
+ commit_design
214
+ ```
215
+
216
+ The important rule is that the model edits CadQuery like a developer edits code in a REPL. It can add functions, refactor parameters, split subassemblies, create helpers, and compose objects however it wants. The environment should not expose narrow tools such as `add_part seat` or `add_backrest` as the main action space, because that bakes our solution into the policy. Semantic part names are reward probes, not required action names.
217
+
218
+ Optional later tools can stay code-native:
219
+
220
+ ```text
221
+ run_static_code_check
222
+ show_code_diff
223
+ revert_last_patch
224
+ ask_verifier_for_top_failure
225
+ export_artifacts
226
+ parameter_edit
227
+ ```
228
+
229
+ The long-horizon version should count every tool call as an action and cap the episode at 300 actions.
230
+
231
+ ## Reward design
232
+
233
+ The reward must be multi-signal. A single "looks like chair" score is too easy to hack.
234
+
235
+ Use two reward speeds:
236
+
237
+ - `fast`: dense RL feedback after ordinary edit/tool steps. It scores build success, topology, bounding-box similarity, code/semantic structure, and editability without writing screenshots or running point-cloud/silhouette comparison.
238
+ - `full`: checkpoint/final scoring. It saves render artifacts and computes silhouette IoU plus Chamfer-style point-cloud similarity against both the ideal CadQuery reference and the GLB reference.
239
+
240
+ The training loop should use `fast` for most rollout steps and `full` on `commit_design`, every N revisions, and benchmark/report runs.
241
+
242
+ Final score:
243
+
244
+ ```text
245
+ R_total =
246
+ 0.20 * R_build
247
+ + 0.20 * R_topology
248
+ + 0.15 * R_semantic_parts
249
+ + 0.15 * R_reference_similarity
250
+ + 0.10 * R_silhouette
251
+ + 0.10 * R_editability
252
+ + 0.05 * R_efficiency
253
+ + 0.05 * R_process
254
+ - penalties
255
+ ```
256
+
257
+ Use gates before adding soft similarity rewards:
258
+
259
+ ```text
260
+ if code_does_not_run: R_total = -1.0 and terminate
261
+ if no_mesh_or_empty_mesh: R_total = -1.0 and terminate
262
+ if final_design_has_many_components: cap R_total at 0.20
263
+ if final_design_is_not_chair_like: cap R_total at 0.35
264
+ ```
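+
+ A minimal sketch of the gate-then-weight logic; component scores are assumed to be precomputed floats in [0, 1] and the names mirror the table above.
+
+ ```python
+ WEIGHTS = {
+     "build": 0.20, "topology": 0.20, "semantic_parts": 0.15,
+     "reference_similarity": 0.15, "silhouette": 0.10,
+     "editability": 0.10, "efficiency": 0.05, "process": 0.05,
+ }
+
+ def total_reward(scores, penalties, gates):
+     if gates["code_does_not_run"] or gates["no_mesh_or_empty_mesh"]:
+         return -1.0, True  # (reward, terminate)
+     r = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) - penalties
+     if gates["final_design_has_many_components"]:
+         r = min(r, 0.20)
+     if gates["final_design_is_not_chair_like"]:
+         r = min(r, 0.35)
+     return r, False
+ ```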
265
+
266
+ ### R_build
267
+
268
+ Checks whether CadQuery code is executable and exports geometry.
269
+
270
+ Signals:
271
+
272
+ - imports only allowed modules,
273
+ - defines a final `result` or discoverable CadQuery object,
274
+ - CadQuery build succeeds,
275
+ - STL export succeeds,
276
+ - mesh loads in `trimesh`,
277
+ - bounding box is finite and nonzero.
278
+
279
+ Suggested scoring:
280
+
281
+ ```text
282
+ cadquery_import_ok: +0.15
283
+ script_exec_ok: +0.25
284
+ solid_found: +0.20
285
+ stl_export_ok: +0.20
286
+ mesh_load_ok: +0.20
287
+ ```
288
+
289
+ ### R_topology
290
+
291
+ Topology dominates because disconnected pretty geometry is not useful CAD.
292
+
293
+ Signals:
294
+
295
+ - connected component count,
296
+ - watertight mesh,
297
+ - manifold edges,
298
+ - boundary edges,
299
+ - non-manifold edges,
300
+ - degenerate faces,
301
+ - reasonable face count,
302
+ - no huge accidental slabs.
303
+
304
+ Suggested scoring:
305
+
306
+ ```text
307
+ single_connected_component: +0.35
308
+ watertight: +0.25
309
+ no_non_manifold_edges: +0.15
310
+ low_boundary_edges: +0.10
311
+ no_degenerate_faces: +0.10
312
+ reasonable_complexity: +0.05
313
+ ```
314
+
315
+ Penalties:
316
+
317
+ ```text
318
+ extra_connected_component: -0.15 each
319
+ boundary_edge_ratio_high: -0.10 to -0.30
320
+ non_manifold_edges_present: -0.20
321
+ degenerate_faces_present: -0.10
322
+ ```
323
+
324
+ ### R_semantic_parts
325
+
326
+ The Markus chair is not just any tall object. It needs recognizable functional parts.
327
+
328
+ Required part hints:
329
+
330
+ - seat,
331
+ - tall backrest,
332
+ - upper/headrest-like section,
333
+ - left armrest,
334
+ - right armrest,
335
+ - central support column,
336
+ - five-star base or at least 5 radial spokes,
337
+ - caster proxies or feet.
338
+
339
+ Detection can start from named variables and bounding boxes, then become geometric:
340
+
341
+ ```text
342
+ seat exists: +0.10
343
+ backrest taller than seat: +0.15
344
+ backrest touches seat/rear supports: +0.15
345
+ two armrests exist: +0.15
346
+ armrests connect to seat/back: +0.10
347
+ central column exists: +0.10
348
+ base has 5 radial spokes: +0.15
349
+ base contacts column: +0.10
350
+ ```
351
+
352
+ This reward should inspect both code and geometry. Code names are helpful but not sufficient.
353
+
354
+ ### R_reference_similarity
355
+
356
+ Use normalized candidate and normalized GLB reference.
357
+
358
+ Signals:
359
+
360
+ - bounding-box ratio similarity,
361
+ - point-cloud Chamfer distance,
362
+ - voxel IoU,
363
+ - rough mass distribution by height,
364
+ - principal-axis alignment.
365
+
366
+ Suggested scoring:
367
+
368
+ ```text
369
+ bbox_ratio_score: 0.25
370
+ chamfer_score: 0.30
371
+ voxel_iou_score: 0.25
372
+ height_distribution: 0.10
373
+ principal_axes_score: 0.10
374
+ ```
375
+
376
+ Important: this reward should not overpower topology. A broken mesh that happens to occupy similar pixels should not win.
377
+
378
+ ### R_silhouette
379
+
380
+ Render candidate and reference with the same camera settings:
381
+
382
+ - front,
383
+ - back,
384
+ - left,
385
+ - right,
386
+ - top,
387
+ - isometric.
388
+
389
+ Compute binary mask IoU or distance transform similarity.
390
+
391
+ Suggested scoring:
392
+
393
+ ```text
394
+ front_iou: 0.20
395
+ side_iou: 0.25
396
+ back_iou: 0.15
397
+ top_iou: 0.15
398
+ isometric_iou: 0.25
399
+ ```
400
+
401
+ This is the judge-friendly reward because it maps to visible screenshots.
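+
+ A minimal sketch of the per-view mask IoU, assuming candidate and reference renders are same-size boolean NumPy masks (object pixels True) produced with identical cameras:
+
+ ```python
+ VIEW_WEIGHTS = {
+     "front": 0.20, "side": 0.25, "back": 0.15, "top": 0.15, "isometric": 0.25,
+ }
+
+ def silhouette_reward(candidate_masks, reference_masks):
+     total = 0.0
+     for view, weight in VIEW_WEIGHTS.items():
+         a, b = candidate_masks[view], reference_masks[view]
+         iou = (a & b).sum() / max(1, (a | b).sum())
+         total += weight * iou
+     return total
+ ```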
402
+
403
+ ### R_editability
404
+
405
+ This is the product differentiator. The environment should mutate the code and check whether it still builds.
406
+
407
+ Edit tests:
408
+
409
+ - increase backrest height by 10 percent,
410
+ - widen seat by 10 percent,
411
+ - thicken armrests,
412
+ - increase base radius,
413
+ - change column height,
414
+ - change global scale.
415
+
416
+ Signals:
417
+
418
+ ```text
419
+ named_parameters_present: +0.20
420
+ all_major_dimensions_parameterized: +0.25
421
+ edit_backrest_height_rebuilds: +0.15
422
+ edit_seat_width_rebuilds: +0.15
423
+ edit_base_radius_rebuilds: +0.15
424
+ no_hardcoded_uneditable_blob: +0.10
425
+ ```
426
+
427
+ This blocks the model from generating a one-off decorative mesh-like CadQuery script.
428
+
429
+ ### R_efficiency
430
+
431
+ The goal is fewer revision steps.
432
+
433
+ Signals:
434
+
435
+ - number of tool calls,
436
+ - number of failed renders,
437
+ - number of compile failures,
438
+ - token count,
439
+ - code size.
440
+
441
+ Suggested scoring:
442
+
443
+ ```text
444
+ R_efficiency = max(0, 1 - tool_calls / max_tool_calls)
445
+ ```
446
+
447
+ Penalties:
448
+
449
+ ```text
450
+ compile_failure: -0.05 each
451
+ render_failure: -0.05 each
452
+ unproductive_edit: -0.03 each
453
+ excessive_code_size: -0.05
454
+ ```
455
+
456
+ This is where the "trained model needs fewer revisions" claim becomes measurable.
457
+
458
+ ### R_process
459
+
460
+ Reward the agent for using the workflow correctly:
461
+
462
+ - renders before committing,
463
+ - reads reward report before editing,
464
+ - repairs the biggest failure first,
465
+ - does not repeat the same failed edit,
466
+ - commits only after passing minimum topology gates.
467
+
468
+ Example:
469
+
470
+ ```text
471
+ rendered_before_commit: +0.20
472
+ used_verifier_feedback_in_edit: +0.25
473
+ fixed_previous_top_failure: +0.25
474
+ no_repeated_failed_patch: +0.15
475
+ commit_after_threshold: +0.15
476
+ ```
477
+
478
+ ## Anti-reward-hacking checks
479
+
480
+ Block or penalize:
481
+
482
+ - reading reference reward files directly,
483
+ - hardcoding saved mesh artifacts,
484
+ - importing network or filesystem tools outside the run directory,
485
+ - writing files outside the episode directory,
486
+ - returning a prebuilt STL instead of CadQuery,
487
+ - creating one giant slab that fills the silhouette,
488
+ - naming variables `seat` and `backrest` without matching geometry,
489
+ - disabling or bypassing verifier code,
490
+ - excessive triangle count to game silhouette overlap.
491
+
492
+ CadQuery execution should run in a subprocess with:
493
+
494
+ - timeout,
495
+ - allowed imports,
496
+ - isolated working directory,
497
+ - max file size,
498
+ - max mesh triangles,
499
+ - no network,
500
+ - no access to hidden reference internals.
501
+
502
+ ## GPT-5.4/GPT-5.5 benchmark
503
+
504
+ Purpose:
505
+
506
+ Show that even strong frontier models improve through tool use, and collect teacher traces for the small model.
507
+
508
+ Benchmark setup:
509
+
510
+ - Tasks: 5 to 10 Markus-chair variants.
511
+ - Models: GPT-5.4 and GPT-5.5 if available in the local/API stack.
512
+ - Budget: 1, 3, 5, 10, and 20 tool-call attempts.
513
+ - Each attempt saves code, STL, screenshots, reward breakdown, and critique.
514
+
515
+ Tasks:
516
+
517
+ 1. Baseline Markus-like chair.
518
+ 2. Taller backrest.
519
+ 3. Thicker armrests.
520
+ 4. Wider five-star base.
521
+ 5. Repair a provided broken chair with floating parts.
522
+ 6. Make the chair editable under global scale changes.
523
+ 7. Improve silhouette match against the GLB.
524
+
525
+ Report file:
526
+
527
+ ```text
528
+ experiment-2-cadforge/reports/gpt-cadquery-benchmark.md
529
+ ```
530
+
531
+ Report structure:
532
+
533
+ ```markdown
534
+ # GPT CadQuery Tool-Use Benchmark
535
+
536
+ ## Summary Table
537
+ | Task | Model | Attempts | Best Reward | Build | Topology | Similarity | Editability | Notes |
538
+
539
+ ## Task 1: Baseline Markus Chair
540
+ ### Attempt 1
541
+ - Code: ...
542
+ - Reward: ...
543
+ - Failure: ...
544
+ - Screenshots: ...
545
+ ### Attempt N
546
+ - Improvement: ...
547
+
548
+ ## Cross-task Findings
549
+ - What improved with more tool calls
550
+ - What did not improve
551
+ - Repeated failure modes
552
+ - Best teacher traces for SFT or GRPO warm start
553
+ ```
554
+
555
+ This is the "evidence" part judges will care about.
556
+
557
+ ## GRPO training plan
558
+
559
+ Use GRPO/RLVR, not classic human-preference RLHF, for the core result.
560
+
561
+ Why:
562
+
563
+ - rewards are verifiable,
564
+ - no reward model needed,
565
+ - multiple sampled candidates per prompt can be ranked by the verifier,
566
+ - the hackathon guide explicitly favors verifier-first GRPO-style tasks.
567
+
568
+ Model target:
569
+
570
+ ```text
571
+ Qwen small instruct model, ideally 0.5B to 1.5B for overnight feasibility.
572
+ ```
573
+
574
+ Training stages:
575
+
576
+ ### Stage 0: Formatting warm start
577
+
578
+ Small SFT dataset from:
579
+
580
+ - hand-written valid CadQuery templates,
581
+ - GPT teacher traces,
582
+ - environment tool-call transcripts.
583
+
584
+ Goal:
585
+
586
+ Teach the small model to emit the correct action JSON and basic CadQuery structure.
587
+
588
+ ### Stage 1: Easy GRPO
589
+
590
+ Tasks:
591
+
592
+ - create one box,
593
+ - create seat plus backrest,
594
+ - create connected chair silhouette from boxes/cylinders,
595
+ - pass build/topology rewards.
596
+
597
+ Reward focus:
598
+
599
+ - valid code,
600
+ - connected mesh,
601
+ - named parameters.
602
+
603
+ ### Stage 2: Markus semantic GRPO
604
+
605
+ Tasks:
606
+
607
+ - full Markus-like chair,
608
+ - add armrests,
609
+ - add five-star base,
610
+ - repair disconnected base/back/arms.
611
+
612
+ Reward focus:
613
+
614
+ - semantic part score,
615
+ - topology,
616
+ - silhouette.
617
+
618
+ ### Stage 3: Revision efficiency GRPO
619
+
620
+ Tasks:
621
+
622
+ - start from flawed candidates,
623
+ - repair within 5 to 20 tool calls,
624
+ - minimize failed renders and repeated edits.
625
+
626
+ Reward focus:
627
+
628
+ - fewer tool calls,
629
+ - fixed prior failure,
630
+ - final reward delta.
631
+
632
+ ## Overnight RunPod plan
633
+
634
+ Only start raw compute after the environment can run 100 local episodes without crashing.
635
+
636
+ Minimum preflight:
637
+
638
+ ```text
639
+ python scripts/run_cadquery_env_smoke.py --episodes 20
640
+ python scripts/run_reward_regression.py
641
+ python scripts/run_gpt_benchmark.py --tasks 2 --attempts 2
642
+ ```
643
+
644
+ When to rent RunPod:
645
+
646
+ 1. Reward code is stable.
647
+ 2. Artifacts save correctly.
648
+ 3. No reward file leakage.
649
+ 4. Qwen can produce valid action JSON at least sometimes.
650
+ 5. Local mini-GRPO or dry-run completes.
651
+
652
+ Suggested overnight jobs:
653
+
654
+ ```text
655
+ Job A: GPT teacher benchmark
656
+ - 5 to 10 tasks
657
+ - 5 to 20 attempts per task
658
+ - save markdown report and screenshots
659
+
660
+ Job B: Qwen Stage 0 SFT
661
+ - formatting/action traces
662
+ - 1 to 2 hours
663
+
664
+ Job C: Qwen GRPO Stage 1/2
665
+ - easy to medium curriculum
666
+ - 6 to 10 hours
667
+ - checkpoint every 30 to 60 minutes
668
+ ```
669
+
670
+ Metrics to monitor:
671
+
672
+ - total reward,
673
+ - build success rate,
674
+ - connected component pass rate,
675
+ - semantic part score,
676
+ - silhouette score,
677
+ - editability pass rate,
678
+ - average tool calls to best design,
679
+ - compile failure rate,
680
+ - examples every N steps.
681
+
682
+ Stop the run if:
683
+
684
+ - compile failure rate stays above 80 percent after warmup,
685
+ - reward rises while topology gets worse,
686
+ - outputs start exploiting file paths or constants,
687
+ - model stops using valid action JSON.
688
+
689
+ ## Demo plan
690
+
691
+ The winning demo should be visual and measurable.
692
+
693
+ Screen 1:
694
+
695
+ - user prompt,
696
+ - reference GLB thumbnails,
697
+ - baseline small Qwen attempt,
698
+ - broken render and reward report.
699
+
700
+ Screen 2:
701
+
702
+ - multi-step tool trace,
703
+ - failed topology warning,
704
+ - edit patch,
705
+ - rerender.
706
+
707
+ Screen 3:
708
+
709
+ - post-GRPO Qwen attempt,
710
+ - fewer revisions,
711
+ - better topology,
712
+ - better silhouette,
713
+ - reward improvement.
714
+
715
+ Screen 4:
716
+
717
+ - GPT-5.5 benchmark report as teacher/frontier baseline,
718
+ - trained tiny model improvement curve,
719
+ - environment API/OpenEnv story.
720
+
721
+ Core claim:
722
+
723
+ > We built a professional CAD RL environment, not just a CAD generator. The same verifier can train small models, benchmark frontier agents, and generate adaptive curricula.
724
+
725
+ ## Concrete next build tasks
726
+
727
+ ### Today: environment and rewards
728
+
729
+ 1. Convert the existing CadQuery renderer into a repeatable backend environment call.
730
+ 2. Add a `runs/<episode_id>/` artifact writer.
731
+ 3. Add reward JSON output for:
732
+ - build,
733
+ - topology,
734
+ - bbox,
735
+ - semantic parts,
736
+ - screenshots.
737
+ 4. Add GLB reference preprocessing.
738
+ 5. Add a one-command smoke test.
739
+
740
+ ### Next: benchmark report
741
+
742
+ 1. Build `scripts/experiment-2/run-gpt-cadquery-benchmark.js` or Python equivalent.
743
+ 2. Run GPT model for 5 tasks.
744
+ 3. Save every attempt as Markdown plus images.
745
+ 4. Summarize improvement with more tool calls.
746
+
747
+ ### Then: OpenEnv wrapper
748
+
749
+ 1. Define observation model:
750
+ - task prompt,
751
+ - current code,
752
+ - last reward,
753
+ - render paths,
754
+ - verifier warnings.
755
+ 2. Define action model:
756
+ - tool name,
757
+ - patch or code,
758
+ - commit flag.
759
+ 3. Implement `reset()`.
760
+ 4. Implement `step(action)`.
761
+ 5. Add timeout and sandbox limits.
762
+ 6. Validate with OpenEnv CLI.
763
+
764
+ ### Then: GRPO
765
+
766
+ 1. Create small curriculum dataset.
767
+ 2. Run formatting SFT if needed.
768
+ 3. Run GRPO with 4 to 8 samples per prompt.
769
+ 4. Save checkpoints and eval artifacts.
770
+ 5. Compare baseline Qwen vs trained Qwen.
771
+
772
+ ## Submission story
773
+
774
+ Use this structure in the final hackathon README:
775
+
776
+ 1. Problem: LLMs produce plausible but unreliable CAD.
777
+ 2. Environment: CadQuery tool-use world with persistent geometry state.
778
+ 3. Rewards: topology-first, reference similarity, editability, process efficiency.
779
+ 4. Themes: long horizon, professional world modeling, self-improvement, optional multi-agent.
780
+ 5. Training: GRPO on small Qwen with verifiable rewards.
781
+ 6. Evidence: GPT benchmark traces plus Qwen before/after curves.
782
+ 7. Product: CADForge can become a sellable CAD-agent evaluation and training platform.
783
+
784
+ ## What not to do
785
+
786
+ - Do not train before reward functions are stable.
787
+ - Do not optimize only screenshots.
788
+ - Do not let similarity beat topology.
789
+ - Do not make the first benchmark generic CAD generation.
790
+ - Do not promise full FEA for the first demo.
791
+ - Do not make the small model write arbitrary long Python without a structured action wrapper.
792
+
793
+ ## Final scope recommendation
794
+
795
+ The hackathon-winning scope is:
796
+
797
+ > CadQuery Markus Chair RLVE: a verifiable long-horizon CAD environment where frontier models and small open models iteratively generate, inspect, repair, and improve parametric CAD against a real GLB reference, with GRPO training showing that a tiny Qwen model learns to produce valid chair CAD in fewer revisions.
798
+
799
+ This is narrow enough to finish and strong enough to sell.
docs/brainstorm/14-cadquery-sft-grpo-rlve-training-plan.md ADDED
@@ -0,0 +1,295 @@
1
+ # Brainstorm 14: CadQuery SFT + GRPO/RLVE Training Plan
2
+
3
+ Date: 2026-04-25
4
+
5
+ ## Goal
6
+
7
+ Train a small Qwen3.5 model to act as a CadQuery CAD agent. The model should learn to generate and revise editable chair CAD with fewer failed attempts and fewer revision steps.
8
+
9
+ The target task is not free-form mesh generation. It is code-CAD tool use:
10
+
11
+ ```text
12
+ prompt
13
+ -> candidate.py
14
+ -> CadQuery build/export
15
+ -> mesh + rendered views
16
+ -> reward report
17
+ -> code edit/revision
18
+ -> repeat
19
+ -> commit
20
+ ```
21
+
22
+ ## Why SFT First
23
+
24
+ GRPO needs nonzero reward. A tiny 0.8B or 2B model may fail before it even reaches CadQuery execution unless we teach the basic format.
25
+
26
+ SFT is only a warm start. It should teach:
27
+
28
+ - return complete Python files,
29
+ - use `import cadquery as cq`,
30
+ - assign a final object to `fixture`,
31
+ - avoid unsafe imports and fragile CadQuery APIs,
32
+ - organize CAD as functions and named dimensions,
33
+ - revise code using verifier feedback.
34
+
35
+ SFT should not try to memorize the whole ideal chair. Keep it small and behavioral.
36
+
37
+ ## SFT Data
38
+
39
+ Create examples from:
40
+
41
+ 1. The ideal Markus CadQuery code.
42
+ 2. GPT-5.4/GPT-5.5 benchmark traces.
43
+ 3. Handwritten repair examples:
44
+ - missing armrests,
45
+ - floating base,
46
+ - too-short backrest,
47
+ - failed `loft()` replaced with boxes/cylinders,
48
+ - disconnected caster assembly,
49
+ - no final `fixture`.
50
+ 4. Environment transcripts:
51
+ - prompt,
52
+ - previous code,
53
+ - reward JSON,
54
+ - next corrected code.
55
+
56
+ Recommended first dataset size:
57
+
58
+ ```text
59
+ 50 to 200 examples
60
+ ```
61
+
62
+ That is enough for format and tool behavior.
63
+
64
+ ## GRPO Setup
65
+
66
+ Use normal GRPO with vLLM in serve mode on RunPod, as suggested by the judge.
67
+
68
+ Architecture:
69
+
70
+ ```text
71
+ GRPO trainer
72
+ -> requests N samples from vLLM server
73
+ -> parses candidate code
74
+ -> runs CADForge evaluator
75
+ -> computes reward
76
+ -> updates LoRA adapter
77
+ ```
78
+
79
+ For Qwen3.5, avoid vLLM async complications initially. Run vLLM as a plain server and call it synchronously from the rollout function.
80
+
81
+ Local Ollama is for baseline/debug only. It is not the training backend.
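+
+ A minimal sketch of the synchronous rollout call, assuming vLLM was started with `vllm serve <model>` and exposes the OpenAI-compatible completions endpoint; the URL and checkpoint name are placeholders.
+
+ ```python
+ import requests
+
+ def sample_candidates(prompt, n=8, url="http://localhost:8000/v1/completions"):
+     resp = requests.post(url, json={
+         "model": "Qwen/Qwen2.5-1.5B-Instruct",  # placeholder checkpoint name
+         "prompt": prompt,
+         "n": n,  # one request returns the whole GRPO group
+         "max_tokens": 1024,
+         "temperature": 0.9,
+     }, timeout=300)
+     resp.raise_for_status()
+     return [choice["text"] for choice in resp.json()["choices"]]
+ ```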
82
+
83
+ ## Reward Modes
84
+
85
+ Use two reward modes:
86
+
87
+ ### Fast Mode
88
+
89
+ Used for most rollout candidates.
90
+
91
+ Scores:
92
+
93
+ - build success,
94
+ - topology sanity,
95
+ - contact/gap score,
96
+ - semantic/code structure,
97
+ - bbox similarity,
98
+ - editability.
99
+
100
+ Does not write colored screenshots or run expensive point-cloud/silhouette scoring.
101
+
102
+ ### Full Mode
103
+
104
+ Used for:
105
+
106
+ - final commit,
107
+ - every N training steps,
108
+ - benchmark reports,
109
+ - judge artifacts.
110
+
111
+ Scores:
112
+
113
+ - everything from fast mode,
114
+ - colored view renders,
115
+ - silhouette IoU,
116
+ - Chamfer-style point-cloud similarity,
117
+ - candidate vs ideal CadQuery,
118
+ - candidate vs GLB.
119
+
120
+ ## Reward Functions
121
+
122
+ ### Build Reward
123
+
124
+ High only if:
125
+
126
+ - code executes,
127
+ - CadQuery object exists,
128
+ - STL export succeeds,
129
+ - mesh loads.
130
+
131
+ Hard failure:
132
+
133
+ ```text
134
+ reward = -1.0
135
+ ```
136
+
137
+ if code cannot run or no mesh exports.
138
+
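+ A minimal sketch of the gate, assuming hypothetical helpers `run_cadquery` (executes the code and exports `fixture` to STL) and `load_mesh` (loads the STL, raising on failure):
+
+ ```python
+ def build_reward(code: str) -> float:
+     # Hard gate: any failure in the execute -> export -> load chain is -1.0.
+     try:
+         stl_path = run_cadquery(code)  # hypothetical: exec code, export fixture
+         mesh = load_mesh(stl_path)     # hypothetical: raises on empty/bad STL
+     except Exception:
+         return -1.0
+     return 1.0 if len(mesh.faces) > 0 else -1.0
+ ```
+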
139
+ ### Topology Reward
140
+
141
+ Checks mesh health:
142
+
143
+ - component count,
144
+ - watertightness,
145
+ - boundary edges,
146
+ - non-manifold edges,
147
+ - degenerate faces,
148
+ - face count sanity.
149
+
150
+ For chairs, many components can be okay because a chair is an assembly. For monolithic hooks/brackets, this should become stricter later.
151
+
152
+ ### Contact/Gap Reward
153
+
154
+ This handles the important chair case:
155
+
156
+ - small assembly gaps are tolerated,
157
+ - large separated parts are penalized,
158
+ - lots of floating components are bad.
159
+
160
+ This prevents a model from making a plausible-looking chair where the base, back, or armrests float far away.
161
+
162
+ ### Semantic Reward
163
+
164
+ Checks whether the candidate behaves like a Markus-style chair:
165
+
166
+ - code mentions and organizes chair concepts,
167
+ - proportions are chair-like,
168
+ - there is a tall upper body/back region,
169
+ - there is lower base spread,
170
+ - code is split into reusable functions.
171
+
172
+ This is hackable if used alone, so it is never the only reward.
173
+
174
+ ### Reference Similarity
175
+
176
+ Compares the candidate to:
177
+
178
+ 1. ideal CadQuery reference,
179
+ 2. real Markus GLB.
180
+
181
+ Signals:
182
+
183
+ - bbox proportions,
184
+ - point-cloud distance,
185
+ - silhouette similarity.
186
+
187
+ The ideal CadQuery reference is the gold code target. The GLB is the real-world visual target.
188
+
189
+ ### Editability Reward
190
+
191
+ Rewards:
192
+
193
+ - functions,
194
+ - named dimensions,
195
+ - returns from helper builders,
196
+ - final object assignment.
197
+
198
+ Later this should become stronger by actually mutating parameters and rebuilding.
199
+
200
+ ### Efficiency Reward
201
+
202
+ For multi-step episodes:
203
+
204
+ - fewer failed CadQuery builds,
205
+ - fewer tool calls,
206
+ - fewer repeated edits,
207
+ - higher reward with fewer revisions.
208
+
209
+ This is where the final product claim comes from:
210
+
211
+ > After GRPO, the small model reaches a good CadQuery chair in fewer revisions.
212
+
213
+ ## Reward Hacking Risks
214
+
215
+ Known risks:
216
+
217
+ - naming a variable `backrest` without real geometry,
218
+ - making a giant bounding-box slab,
219
+ - making visually close but non-editable geometry,
220
+ - creating many disconnected decorative parts,
221
+ - overfitting to the ideal code,
222
+ - using APIs that work once but break under edits.
223
+
224
+ Mitigations:
225
+
226
+ - final full reward uses silhouettes and point clouds,
227
+ - contact/gap reward punishes large separation,
228
+ - editability reward punishes one-off blobs,
229
+ - holdout tasks change dimensions and requirements,
230
+ - inspect rendered reports often.
231
+
232
+ ## How We Know It Is Improving
233
+
234
+ Do not trust only average training reward.
235
+
236
+ Track:
237
+
238
+ - build success rate,
239
+ - best full reward on fixed holdout prompts,
240
+ - average revisions to reach reward greater than 0.75,
241
+ - contact/gap score,
242
+ - silhouette score,
243
+ - editability score,
244
+ - failure categories,
245
+ - rendered Markdown reports before and after training.
246
+
247
+ Main demo metric:
248
+
249
+ ```text
250
+ Baseline Qwen: needs many attempts, often fails build/contact.
251
+ Trained Qwen: builds earlier, fewer gaps, better full reward, fewer revisions.
252
+ ```
253
+
254
+ ## Local Baseline With Ollama
255
+
256
+ Use Ollama to see what 0.8B and 2B can do before training:
257
+
258
+ ```bash
259
+ scripts/experiment-2/run-gpt-cadquery-benchmark.js \
260
+ --run \
261
+ --provider ollama \
262
+ --model qwen3.5:0.8b \
263
+ --tasks 1 \
264
+ --attempts 1 \
265
+ --timeout-ms 180000
266
+ ```
267
+
268
+ The benchmark caps generation with `num_predict`, disables streaming, and has a timeout so a small model cannot hang forever.
269
+
270
+ ## RunPod Training Plan
271
+
272
+ Only start RunPod after:
273
+
274
+ - local Ollama baselines run,
275
+ - reward reports look sensible,
276
+ - SFT examples exist,
277
+ - the evaluator survives 20 to 100 episodes.
278
+
279
+ RunPod jobs:
280
+
281
+ 1. SFT warm start on 50 to 200 examples.
282
+ 2. GRPO Stage 1 on easy build/topology tasks.
283
+ 3. GRPO Stage 2 on Markus-chair reward.
284
+ 4. Full benchmark report against baseline Qwen and GPT-5.4.
285
+
286
+ Use vLLM serve mode for rollouts, not async mode, for the first stable GRPO run.
287
+
288
+ ## Immediate Next Steps
289
+
290
+ 1. Run Ollama 0.8B baseline.
291
+ 2. Run Ollama 2B baseline when download finishes.
292
+ 3. Save both reports with rendered images.
293
+ 4. Create first SFT JSONL from ideal code and GPT traces.
294
+ 5. Build a minimal GRPO script that calls the evaluator.
295
+ 6. Move to RunPod.
docs/brainstorm/15-cadquery-agentic-traces-sft-grpo-plan.md ADDED
@@ -0,0 +1,246 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CadQuery Agentic Traces, SFT, and GRPO Plan
2
+
3
+ ## Current truth
4
+
5
+ The first benchmark was multi-attempt generation, not a full agent loop. The new trace runner is the real loop:
6
+
7
+ 1. Evaluate current CadQuery code.
8
+ 2. Save reward JSON, rendered views, STL, and verifier report.
9
+ 3. Send the model the prompt, previous code, reward JSON, and optionally rendered images.
10
+ 4. Ask for a complete revised CadQuery file.
11
+ 5. Evaluate the revised code.
12
+ 6. Save a step transcript for SFT, preference learning, and RL rollouts.
13
+
14
+ The current GPT-5.4 vision trace improved the Markus repair seed from `0.613` to `0.800` in one edit. The second edit regressed slightly to `0.794`, which makes it a useful negative/preference example: more edits are only good when reward increases.
15
+
16
+ Qwen 3.5 2B can produce CadQuery-shaped code once `think: false` is set, but it currently fails builds. That is expected before SFT. Its failed trace is still useful because the verifier now exposes concrete Python errors, for example undefined dimensions.
17
+
18
+ ## Data we are collecting
19
+
20
+ ### SFT rows
21
+
22
+ Path: `experiment-2-cadforge/data/sft/cadquery_agentic_sft.jsonl`
23
+
24
+ Each row teaches:
25
+
26
+ - system prompt with CadQuery tool rules
27
+ - user observation: task, previous code, reward JSON
28
+ - assistant action: corrected complete CadQuery code
29
+ - metadata: reward before, reward after, reward delta, artifact path
30
+
31
+ Use only positive or mildly positive rows for the first SFT pass. Filter with the checks below, sketched in code after the list:
32
+
33
+ - `reward_after > reward_before`
34
+ - `reward_after >= 0.70`
35
+ - `build == 1`
36
+
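+ A minimal filtering sketch over the JSONL rows, assuming each row carries the metadata fields listed above:
+
+ ```python
+ import json
+
+ def load_positive_rows(path: str) -> list[dict]:
+     rows = []
+     with open(path, encoding="utf-8") as f:
+         for line in f:
+             row = json.loads(line)
+             # Keep only improving, reasonably good, buildable repair steps.
+             if (row["reward_after"] > row["reward_before"]
+                     and row["reward_after"] >= 0.70
+                     and row.get("build") == 1):
+                 rows.append(row)
+     return rows
+ ```
+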
37
+ ### Preference rows
38
+
39
+ Path: `experiment-2-cadforge/data/preferences/cadquery_agentic_preferences.jsonl`
40
+
41
+ Each row teaches:
42
+
43
+ - prompt: same observation as SFT
44
+ - chosen: higher reward code
45
+ - rejected: lower reward code
46
+ - chosen/rejected rewards
47
+
48
+ Use this for DPO/RLHF-style ranking if there is time. For the hackathon, this is also strong evidence that the environment can produce preference data automatically.
49
+
50
+ ### RL rollout rows
51
+
52
+ Path: `experiment-2-cadforge/data/rl/cadquery_rollouts.jsonl`
53
+
54
+ Each row teaches:
55
+
56
+ - state: prompt + code + reward JSON
57
+ - action: next CadQuery file
58
+ - reward: absolute reward after action
59
+ - reward_delta: improvement from previous step
60
+ - done: whether this was the last rollout step
61
+
62
+ This is the GRPO/RLVE substrate. For early GRPO, reward the action with:
63
+
64
+ ```text
65
+ step_reward = 0.60 * reward_after
66
+ + 0.35 * max(reward_after - reward_before, -0.25)
67
+ + 0.05 * build_bonus
68
+ ```
69
+
70
+ For group-relative training, sample several candidate revisions for the same observation and rank them by `step_reward`.
71
+
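+ A minimal sketch of the step reward and the group-relative ranking, assuming per-candidate rewards from the evaluator:
+
+ ```python
+ def step_reward(before: float, after: float, built: bool) -> float:
+     return (0.60 * after
+             + 0.35 * max(after - before, -0.25)
+             + 0.05 * (1.0 if built else 0.0))
+
+ def group_advantages(rewards: list[float]) -> list[float]:
+     # GRPO-style: score each candidate relative to its group mean.
+     mean = sum(rewards) / len(rewards)
+     return [r - mean for r in rewards]
+ ```
+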
72
+ ## Reward design
73
+
74
+ ### Build reward
75
+
76
+ This is the first gate. If code does not execute or does not export `fixture`, reward is `-1`.
77
+
78
+ Why it matters: tiny models hallucinate APIs or undefined variables. Penalizing build failures prevents reward hacking through long invalid code.
79
+
80
+ ### Topology reward
81
+
82
+ Checks mesh health:
83
+
84
+ - face count sane
85
+ - boundary/non-manifold/degenerate ratios low
86
+ - component count acceptable for an assembly
87
+
88
+ For chairs, many parts are allowed. For future single-piece objects, add a stricter single-body mode.
89
+
90
+ ### Contact/gap reward
91
+
92
+ Checks whether major components are plausibly connected. This catches:
93
+
94
+ - floating backrest
95
+ - disconnected caster assembly
96
+ - floating base
97
+ - armrests that hover too far away
98
+
99
+ For chairs, small gaps are not fatal because real chairs have visual separations. Large gaps should hurt.
100
+
101
+ ### Semantic parts reward
102
+
103
+ Uses code and geometry hints to check whether the candidate expresses chair-like intent:
104
+
105
+ - seat
106
+ - backrest
107
+ - headrest
108
+ - armrest
109
+ - gas cylinder or central column
110
+ - star base
111
+ - caster proxies
112
+ - mechanism/lumbar hints
113
+
114
+ This should not force exact function names. It should reward discoverable part structure and meaningful dimensions.
115
+
116
+ ### Reference similarity reward
117
+
118
+ Compares candidate geometry to both:
119
+
120
+ - IKEA Markus GLB reference
121
+ - ideal Markus CadQuery reference
122
+
123
+ This gives a grounded target while still allowing the generated CadQuery to differ from the exact ideal code.
124
+
125
+ ### Silhouette reward
126
+
127
+ Compares rendered masks across:
128
+
129
+ - front
130
+ - back
131
+ - left
132
+ - right
133
+ - top
134
+ - isometric
135
+
136
+ This catches shape-level errors faster than pure point-cloud comparison and makes the markdown reports human-readable.
137
+
138
+ ### Editability reward
139
+
140
+ Rewards code that a future agent can keep editing:
141
+
142
+ - named dimensions
143
+ - helper functions
144
+ - final `fixture`
145
+ - clear construction blocks
146
+ - avoids brittle operations like fragile loft/sweep chains
147
+
148
+ This is important because the goal is long-horizon CAD editing, not one-shot mesh generation.
149
+
150
+ ## What counts as improvement
151
+
152
+ Do not reward more steps by itself. Reward useful steps.
153
+
154
+ Good step:
155
+
156
+ - code builds
157
+ - reward increases
158
+ - one major issue is fixed
159
+ - model remains editable
160
+
161
+ Bad step:
162
+
163
+ - code stops building
164
+ - reward decreases sharply
165
+ - adds meaningless geometry to game the semantic keyword checks
166
+ - bloats code without improving geometry
167
+
168
+ Long horizon comes from decomposing into many useful edits, not from forcing a fixed number of edits.
169
+
170
+ ## GPT teacher data plan
171
+
172
+ Use GPT-5.4/GPT-5.5 as teacher agents to generate traces.
173
+
174
+ Recommended overnight settings:
175
+
176
+ ```bash
177
+ scripts/experiment-2/run-cadquery-agentic-trace.js \
178
+ --provider openai \
179
+ --model gpt-5.4 \
180
+ --steps 4 \
181
+ --vision
182
+ ```
183
+
184
+ Repeat with different task prompts and seeded failure modes:
185
+
186
+ - missing armrests
187
+ - floating base
188
+ - too-short backrest
189
+ - failed `loft()` replaced with boxes/cylinders
190
+ - disconnected caster assembly
191
+ - no final `fixture`
192
+ - wrong cylinder height/radius
193
+ - overfit blocky chair with no semantics
194
+
195
+ For each failure mode, GPT sees the reward report and optionally images, repairs the code, and creates training examples automatically.
196
+
197
+ ## Qwen student plan
198
+
199
+ Start with SFT, not GRPO.
200
+
201
+ 1. Collect 100-300 high-quality GPT repair steps.
202
+ 2. Filter for positive deltas and successful builds.
203
+ 3. SFT Qwen 3.5 2B on observation-to-code repair.
204
+ 4. Run Qwen in the same environment.
205
+ 5. Keep Qwen traces as before/after evidence.
206
+ 6. Then run GRPO using reward deltas.
207
+
208
+ Qwen 0.8B is useful as a dramatic baseline. Qwen 2B is the better hackathon target.
209
+
210
+ ## Generalization plan
211
+
212
+ The environment can generalize if every object has:
213
+
214
+ - task prompt
215
+ - reference GLB
216
+ - optional ideal CadQuery code
217
+ - object-specific semantic hints
218
+ - reward profile
219
+
220
+ Object families to add after Markus:
221
+
222
+ - table
223
+ - simple stool
224
+ - shelf bracket
225
+ - screw/bolt
226
+ - hinge
227
+ - drawer handle
228
+ - caster wheel
229
+
230
+ For each object, preprocess:
231
+
232
+ 1. Normalize GLB scale/origin/orientation.
233
+ 2. Extract bounding box, silhouettes, point samples, topology.
234
+ 3. Evaluate ideal CadQuery if available.
235
+ 4. Run teacher traces from seeded failures.
236
+ 5. Add object-specific semantic hints.
237
+
238
+ ## Tomorrow demo story
239
+
240
+ Show three things:
241
+
242
+ 1. GPT teacher improves a broken CAD file through multiple tool calls.
243
+ 2. The environment records every observation, action, reward, render, and code revision.
244
+ 3. Qwen starts weak, then after SFT/GRPO it builds more often and reaches higher reward in fewer edits.
245
+
246
+ The sellable product is not just "CAD generation." It is a repeatable professional-tool RL environment for teaching small models to use CAD tools over long horizons with persistent state and verifiable rewards.
docs/brainstorm/16-tonight-execution-plan.md ADDED
@@ -0,0 +1,140 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CADForge Tonight Execution Plan
2
+
3
+ ## Time Budget
4
+
5
+ You have 3-5 focused hours before overnight training. Use them like this:
6
+
7
+ | Block | Time | Goal |
8
+ |---|---:|---|
9
+ | Reward sanity + task setup | 30-45 min | Confirm task-specific rewards and prompts work. Done for Markus + six-leg table. |
10
+ | Asset generation | 60-120 min wall time | Generate 20-24 white-background reference images and optionally FAL GLBs in parallel. |
11
+ | Teacher traces | 90-180 min wall time | Run GPT-5.4/GPT-5.5 agentic repair traces, 2-4 steps each. |
12
+ | SFT packaging | 20-40 min | Filter positive deltas, make train/val JSONL, quick dataset card. |
13
+ | RunPod setup | 30-60 min | Start Qwen 2B/9B SFT. If SFT finishes, start GRPO. |
14
+ | README + demo assets | 60-120 min | Results table, trace screenshots, short video, slides. |
15
+
16
+ Overnight from 12am-6am:
17
+
18
+ 1. SFT Qwen 2B or 9B on positive GPT repair traces.
19
+ 2. Evaluate the tuned model on held-out tasks.
20
+ 3. If build rate is non-zero, run GRPO with grouped repair candidates.
21
+ 4. Save reward/loss curves and before/after traces.
22
+
23
+ ## Are Qwen 0.8B/2B Too Dumb?
24
+
25
+ They are not useless, but they are too cold for raw CAD repair. Qwen 2B currently writes CadQuery-shaped code but fails on undefined variables and API mistakes. That means:
26
+
27
+ - Do SFT first.
28
+ - Use Qwen 0.8B as a weak baseline.
29
+ - Use Qwen 2B as the realistic small-model demo.
30
+ - Downloading Qwen 9B is a good idea if you can afford the inference/training memory, because it should produce more buildable rollouts for GRPO.
31
+
32
+ ## What Is Graded Per Step?
33
+
34
+ Each agentic step is graded independently:
35
+
36
+ - previous code
37
+ - reward JSON
38
+ - rendered views if available
39
+ - next code
40
+ - next reward
41
+ - reward delta
42
+
43
+ The training data records:
44
+
45
+ - SFT: observation -> improved code
46
+ - Preference/RLHF: chosen higher-reward code vs rejected lower-reward code
47
+ - RL/GRPO: observation, action, reward, reward_delta
48
+
49
+ Do not reward more steps by itself. Reward useful improvement:
50
+
51
+ ```text
52
+ step_reward = 0.60 * reward_after
53
+ + 0.35 * clamp(reward_after - reward_before, -0.25, 0.25)
54
+ + 0.05 * build_success
55
+ ```
56
+
57
+ ## Current Reward Functions
58
+
59
+ ### Build
60
+
61
+ Hard gate. Invalid code or missing STL gets `-1`. This catches hallucinated CadQuery APIs, undefined variables, no final `fixture`, and execution crashes.
62
+
63
+ ### Topology
64
+
65
+ Checks mesh health: faces, components, watertightness, boundaries, non-manifold edges, and degenerate faces. Assemblies can have multiple components; future monolithic tasks should use stricter settings.
66
+
67
+ ### Contact/Gaps
68
+
69
+ Penalizes large disconnected components. Small chair/table gaps are tolerated; floating bases, detached backrests, disconnected caster assemblies, and floating load bosses lose reward.
70
+
71
+ ### Task Semantics
72
+
73
+ Now task-specific. A stator is rewarded for `stator`, `radial_tooth`, `center_bore`; a table is rewarded for `tabletop`, `leg`, `crossbar`, `stretcher`. This fixes the previous Markus-only bias.
74
+
75
+ ### Reference Similarity
76
+
77
+ If a task GLB exists, the evaluator compares bbox, point-cloud similarity, and silhouettes. If no GLB exists yet, this component is neutral and the report says so explicitly.
78
+
79
+ ### Silhouette
80
+
81
+ Full mode renders front/back/left/right/top/isometric masks. These are used for scoring when a GLB reference exists and are always saved for human inspection.
82
+
83
+ ### Editability
84
+
85
+ Rewards named dimensions, helper functions, clean final `fixture`, and reusable code structure. This matters because the project is about long-horizon editable CAD, not just one-shot meshes.
86
+
87
+ ## Commands
88
+
89
+ Generate reference images and FAL GLBs for the first 8 tasks:
90
+
91
+ ```bash
92
+ scripts/experiment-2/generate-cad-assets.js --limit 8 --concurrency 3
93
+ ```
94
+
95
+ Use FAL text-to-3D only, fastest path:
96
+
97
+ ```bash
98
+ scripts/experiment-2/generate-cad-assets.js --skip-images --limit 8 --concurrency 4
99
+ ```
100
+
101
+ Use image-to-3D after GPT images:
102
+
103
+ ```bash
104
+ scripts/experiment-2/generate-cad-assets.js --limit 8 --concurrency 2 --image-to-3d
105
+ ```
106
+
107
+ Run teacher traces on easy tasks:
108
+
109
+ ```bash
110
+ scripts/experiment-2/run-teacher-trace-batch.js --provider openai --model gpt-5.4 --levels easy --limit 8 --steps 3
111
+ ```
112
+
113
+ Run richer vision traces after images/renders exist:
114
+
115
+ ```bash
116
+ scripts/experiment-2/run-teacher-trace-batch.js --provider openai --model gpt-5.4 --levels easy,medium --limit 12 --steps 4 --vision
117
+ ```
118
+
119
+ Filter positive SFT rows:
120
+
121
+ ```bash
122
+ scripts/experiment-2/filter-positive-sft.js --min-after 0.70 --min-delta 0.001
123
+ ```
124
+
125
+ Run Qwen baseline:
126
+
127
+ ```bash
128
+ scripts/experiment-2/run-cadquery-agentic-trace.js --provider ollama --model qwen3.5:2b --task-spec table_six_leg_500n --task-id table_six_leg_500n --steps 1
129
+ ```
130
+
131
+ ## Submission Story
132
+
133
+ The story should mirror the best example project:
134
+
135
+ 1. Cold start: Qwen fails to emit valid CadQuery.
136
+ 2. Environment: CAD code executes in real CadQuery, renders, and scores every step.
137
+ 3. Teacher: GPT-5.4/GPT-5.5 improves broken CAD over multiple tool calls.
138
+ 4. Data: every observation/action/reward becomes SFT, preference, and RL rollout data.
139
+ 5. Student: Qwen learns to build valid editable CAD in fewer revisions.
140
+ 6. Self-improvement: new object prompts + generated GLBs expand the curriculum automatically.
docs/brainstorm/17-cadquery-reward-functions-deep-dive.md ADDED
@@ -0,0 +1,272 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CadQuery Reward Functions Deep Dive
2
+
3
+ This document explains the reward code in `experiment-2-cadforge/python_tools/cadquery_env.py`.
4
+
5
+ ## Total Reward
6
+
7
+ The evaluator produces component scores in `[0, 1]`, except build failures, which return total reward `-1`.
8
+
9
+ Fast mode is for dense RL feedback:
10
+
11
+ ```text
12
+ total = 0.22 * build
13
+ + 0.17 * topology
14
+ + 0.12 * contact
15
+ + 0.25 * semantic_parts
16
+ + 0.10 * reference_similarity
17
+ + 0.10 * editability
18
+ + 0.04 * efficiency
19
+ ```
20
+
21
+ Full mode is for reports, teacher traces, and benchmark artifacts:
22
+
23
+ ```text
24
+ total = 0.18 * build
25
+ + 0.17 * topology
26
+ + 0.10 * contact
27
+ + 0.15 * semantic_parts
28
+ + 0.15 * reference_similarity
29
+ + 0.10 * silhouette
30
+ + 0.10 * editability
31
+ + 0.05 * efficiency
32
+ ```
33
+
34
+ Fast mode deliberately weights task semantics higher and reference lower because, without renders/point clouds, bbox-only reference scoring is easier to game.
35
+
36
+ ## Build Reward
37
+
38
+ Build is a hard gate.
39
+
40
+ If the CadQuery code does not execute, does not create/export `fixture`, or fails STL export:
41
+
42
+ ```text
43
+ total = -1
44
+ build = 0
45
+ all other components = 0
46
+ ```
47
+
48
+ The evaluator now includes a concise Python/CadQuery error in `notes`, for example:
49
+
50
+ ```text
51
+ Build error: NameError: name 'headrest_height_from_ground' is not defined
52
+ ```
53
+
54
+ This is crucial for Qwen because most early failures are invalid code, not bad geometry.
55
+
56
+ ## Task Semantics Reward
57
+
58
+ Function: `semantic_reward(code, mesh, task_spec)`
59
+
60
+ This has three parts:
61
+
62
+ ```text
63
+ semantic = 0.35 * code_score
64
+ + 0.45 * geometry_score
65
+ + 0.20 * assembly_score
66
+ ```
67
+
68
+ ### Code Score
69
+
70
+ Each task has `semantic_hints` in `cad_tasks.json`.
71
+
72
+ Example table hints:
73
+
74
+ ```json
75
+ ["tabletop", "six_leg", "leg", "crossbar", "stretcher", "support", "load_500n"]
76
+ ```
77
+
78
+ The reward checks whether each hint appears in the code, allowing underscore-insensitive matches:
79
+
80
+ ```python
81
+ hit = (hint in lowered_code
+        or hint.replace("_", "") in lowered_code.replace("_", ""))
83
+ ```
84
+
85
+ This rewards explicit, editable part intent. It does not require a fixed function name like `add_seat()`. The model can invent its own structure, but it gets credit when the code makes the intended parts legible.
86
+
87
+ ### Geometry Score
88
+
89
+ If the task has `bbox_mm`, the evaluator compares the normalized shape ratios:
90
+
91
+ ```text
92
+ target = [x, y, z] / z
93
+ actual = candidate_bbox / candidate_height
94
+ geometry_score = 1 - mean(relative_ratio_error)
95
+ ```
96
+
97
+ This prevents a model from getting high semantic score by merely writing keywords in comments while building a totally wrong envelope.
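+ A minimal sketch of the ratio comparison, assuming `bbox_mm` is the task's `[x, y, z]` target and `extents` is the candidate's bounding-box size (names illustrative):
+
+ ```python
+ def geometry_score(bbox_mm: list[float], extents: list[float]) -> float:
+     # Normalize both boxes by their height so only proportions are compared.
+     target = [v / bbox_mm[2] for v in bbox_mm]
+     actual = [v / extents[2] for v in extents]
+     errors = [abs(a - t) / t for a, t in zip(actual, target)]
+     return max(0.0, 1.0 - sum(errors) / len(errors))
+ ```
+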
98
+
99
+ For the original Markus chair path without a task spec, it uses chair-specific geometry signals:
100
+
101
+ - chair-like width/height and depth/height ratios
102
+ - lower base spread
103
+ - meaningful upper-height geometry
104
+
105
+ ### Assembly Score
106
+
107
+ The reward counts helper functions:
108
+
109
+ ```python
110
+ functions = number of `def name(...):`
111
+ assembly_score = min(1, functions / 6)
112
+ ```
113
+
114
+ This encourages decomposed CAD: helper functions for legs, arms, bosses, teeth, ribs, etc. It does not force exact part names.
115
+
116
+ ## Editability Reward
117
+
118
+ Function: `editability_reward(code)`
119
+
120
+ This rewards code that another agent can revise over many steps.
121
+
122
+ ```text
123
+ editability = 0.35 * function_score
124
+ + 0.20 * named_dimension_score
125
+ + 0.25 * reusable_return_score
126
+ + 0.20 * final_object_score
127
+ ```
128
+
129
+ ### Function Score
130
+
131
+ ```python
132
+ functions = count_regex(r"^\s*def\s+\w+\s*\(")
133
+ function_score = min(1, functions / 6)
134
+ ```
135
+
136
+ Why: long-horizon CAD editing works better when the model can edit `make_leg()`, `make_backrest()`, `make_stator_tooth()`, or `make_mounting_hole_pattern()` instead of rewriting one giant union chain.
137
+
138
+ ### Named Dimension Score
139
+
140
+ ```python
141
+ named_values = count_regex(r"^\s*[a-zA-Z_][a-zA-Z0-9_]*\s*=\s*[-+]?\d")
142
+ named_dimension_score = min(1, named_values / 8)
143
+ ```
144
+
145
+ This rewards parameters like:
146
+
147
+ ```python
148
+ seat_width = 520
149
+ leg_height = 680
150
+ bolt_radius = 4
151
+ ```
152
+
153
+ Why: dimensions make future edits local and stable. A repair agent can increase `back_height` or move `caster_radius` without hunting through anonymous numbers.
154
+
155
+ ### Reusable Return Score
156
+
157
+ ```python
158
+ reusable_returns = count_regex(r"^\s*return\s+")
159
+ reusable_return_score = min(1, reusable_returns / max(1, functions))
160
+ ```
161
+
162
+ This rewards helper functions that return shapes instead of mutating unclear globals.
163
+
164
+ Good:
165
+
166
+ ```python
167
+ def make_leg(x, y):
168
+ return cq.Workplane("XY").box(35, 35, leg_height).translate((x, y, leg_height / 2))
169
+ ```
170
+
171
+ Less editable:
172
+
173
+ ```python
174
+ leg = cq.Workplane("XY").box(35, 35, 680)
175
+ ```
176
+
177
+ ### Final Object Score
178
+
179
+ ```python
180
+ has_final_object = "fixture" in code or "result" in code or "chair" in code or "show_object" in code
181
+ ```
182
+
183
+ This gives `0.20` when the model clearly exposes an exportable final object. In practice, `fixture` is the important convention because the runner exports `fixture`.
184
+
185
+ ## Topology Reward
186
+
187
+ Function: `topology_reward(topology_metrics(mesh))`
188
+
189
+ It checks:
190
+
191
+ - component count
192
+ - watertightness
193
+ - boundary edge ratio
194
+ - non-manifold edge ratio
195
+ - degenerate face ratio
196
+ - sane face count
197
+
198
+ Markus and many CAD assemblies are allowed to have multiple components, so this does not require a single monolithic body. For future tasks that explicitly require one connected watertight object, we should add a stricter task-level option.
199
+
200
+ ## Contact/Gaps Reward
201
+
202
+ Function: `contact_metrics(mesh)`
203
+
204
+ The mesh is split into components. Tiny fragments are ignored. For each meaningful component, the evaluator finds the nearest component bounding-box gap and normalizes it by object height.
205
+
206
+ The score decays with:
207
+
208
+ - higher mean gap
209
+ - higher max gap
210
+ - count of large separated components
211
+
212
+ This catches:
213
+
214
+ - floating backrests
215
+ - disconnected caster assemblies
216
+ - floating table legs
217
+ - load bosses not attached to arms
218
+ - detached wall plates or brackets
219
+
220
+ Small assembly gaps are tolerated because real CAD assemblies may have separate touching solids or visual separation.
221
+
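+ A minimal sketch of the gap computation with `trimesh`, using bounding-box distances as a simplification of the real metric:
+
+ ```python
+ import numpy as np
+ import trimesh
+
+ def bbox_gap(b1: np.ndarray, b2: np.ndarray) -> float:
+     # Euclidean gap between two axis-aligned boxes; zero when they overlap.
+     sep = np.maximum(b1[0], b2[0]) - np.minimum(b1[1], b2[1])
+     return float(np.linalg.norm(np.maximum(sep, 0.0)))
+
+ def contact_score(mesh: trimesh.Trimesh) -> float:
+     parts = [p for p in mesh.split(only_watertight=False)
+              if p.area > 1e-3 * mesh.area]  # ignore tiny fragments
+     if len(parts) < 2:
+         return 1.0                          # single body: nothing can float
+     height = mesh.extents[2] or 1.0
+     gaps = [min(bbox_gap(a.bounds, b.bounds)
+                 for b in parts if b is not a) / height
+             for a in parts]
+     # Score decays with both the mean and the worst normalized gap.
+     return float(np.exp(-4.0 * (np.mean(gaps) + np.max(gaps))))
+ ```
+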
222
+ ## Reference Similarity
223
+
224
+ If a GLB reference exists, the evaluator compares the candidate against:
225
+
226
+ - generated/reference GLB
227
+ - optional ideal CadQuery reference
228
+
229
+ The per-reference score is:
230
+
231
+ ```text
232
+ reference_one = 0.25 * bbox
233
+ + 0.35 * chamfer
234
+ + 0.40 * silhouette
235
+ ```
236
+
237
+ If both ideal CadQuery and GLB exist:
238
+
239
+ ```text
240
+ reference_similarity = 0.60 * ideal_score + 0.40 * glb_score
241
+ ```
242
+
243
+ If no GLB exists yet for a generated task, reference and silhouette are neutral `0.50`, and the report explicitly says the task-specific GLB is missing.
244
+
245
+ ## Silhouette Reward
246
+
247
+ Full mode renders masks from:
248
+
249
+ - front
250
+ - back
251
+ - left
252
+ - right
253
+ - top
254
+ - isometric
255
+
256
+ It computes mask IoU against reference silhouettes. This is cheap enough for reports and teacher traces and catches overall shape mistakes that bbox alone cannot catch.
257
+
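+ A minimal IoU sketch over boolean masks, assuming the candidate and reference renders are already aligned and binarized:
+
+ ```python
+ import numpy as np
+
+ def mask_iou(candidate: np.ndarray, reference: np.ndarray) -> float:
+     # Both inputs are boolean HxW masks rendered from the same camera.
+     intersection = np.logical_and(candidate, reference).sum()
+     union = np.logical_or(candidate, reference).sum()
+     return float(intersection / union) if union else 1.0
+
+ def silhouette_score(view_pairs) -> float:
+     # Average IoU across the six views.
+     return sum(mask_iou(c, r) for c, r in view_pairs) / len(view_pairs)
+ ```
+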
258
+ ## Known Limitations
259
+
260
+ 1. Code semantic hints can be gamed by comments. Geometry and reference scores reduce this, but we should later ignore comments or parse identifiers only.
261
+ 2. Editability currently checks simple regexes. It rewards structure, not true AST-level quality.
262
+ 3. No finite-element simulation is running yet. Load/safety-factor phrases are currently semantic intent, not verified stress analysis.
263
+ 4. Generic tasks without GLBs use neutral reference scores until generated references are preprocessed.
264
+ 5. Single-body/watertight requirements need task-specific stricter topology settings.
265
+
266
+ ## Near-Term Improvements
267
+
268
+ - Parse Python AST for identifiers and assignments instead of raw regex.
269
+ - Add per-task reward profiles, for example `single_body_required`, `min_holes`, `radial_symmetry_required`, `leg_count`.
270
+ - Add image/GLB generated references for all 24 tasks.
271
+ - Add cheap analytic checks for hole count, radial teeth count, and support/leg count.
272
+ - Add optional FEA proxy rewards for load-bearing prompts.
docs/brainstorm/18-how-sft-and-grpo-data-works.md ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # How CADForge SFT and GRPO Data Works
2
+
3
+ ## What The Agent Sees During A Trace
4
+
5
+ Each step is an observe -> edit -> score loop.
6
+
7
+ The observation includes:
8
+
9
+ 1. The task prompt.
10
+ 2. The previous CadQuery code.
11
+ 3. The reward JSON from the verifier.
12
+ 4. Verifier notes such as build errors, disconnected components, weak semantic hints, or missing reference similarity.
13
+ 5. Optional rendered views for teacher traces when vision is enabled.
14
+
15
+ The action is always a complete replacement CadQuery file. The model is not asked to emit prose. It is asked to emit executable Python where the final object is assigned to `fixture`.
16
+
17
+ ## Example SFT Row Shape
18
+
19
+ A training row looks like this:
20
+
21
+ ```json
22
+ {
23
+ "messages": [
24
+ {
25
+ "role": "system",
26
+ "content": "You are CADForge, a careful CadQuery CAD repair agent... Return only a complete executable Python CadQuery file."
27
+ },
28
+ {
29
+ "role": "user",
30
+ "content": "Task: Build a simple four-legged chair...\n\nCurrent reward JSON: ...\n\nCurrent CadQuery code: ...\n\nRevise the code to improve the reward."
31
+ },
32
+ {
33
+ "role": "assistant",
34
+ "content": "import cadquery as cq\n\nseat_width = 420\n...\nfixture = chair.clean()"
35
+ }
36
+ ],
37
+ "reward_before": 0.441,
38
+ "reward_after": 0.802,
39
+ "reward_delta": 0.361,
40
+ "artifacts_dir": ".../step-1"
41
+ }
42
+ ```
43
+
44
+ During SFT, Qwen learns the mapping:
45
+
46
+ ```text
47
+ (previous code + verifier feedback + task) -> better complete CadQuery code
48
+ ```
49
+
50
+ It does not learn from hidden GPT thinking. It learns from the repair action.
51
+
52
+ ## Why We Do Not Want Long Thinking In The Output
53
+
54
+ Qwen 3.5 can overthink in Ollama and spend tokens on internal reasoning before answering. For CADForge this is bad because:
55
+
56
+ - the environment needs code, not a long explanation;
57
+ - excess thinking slows rollouts;
58
+ - verbose traces can pollute SFT if included as assistant output;
59
+ - invalid prose around code can break execution.
60
+
61
+ For Ollama inference use:
62
+
63
+ ```json
64
+ {
65
+ "think": false,
66
+ "stream": false,
67
+ "options": {
68
+ "temperature": 0.2,
69
+ "num_predict": 3000,
70
+ "num_ctx": 8192
71
+ }
72
+ }
73
+ ```
74
+
75
+ And use a strict system prompt:
76
+
77
+ ```text
78
+ Return only complete executable Python CadQuery code. No markdown. No explanation.
79
+ ```
80
+
81
+ ## Which Rows Go Into SFT
82
+
83
+ We generated several files:
84
+
85
+ - `cadquery_agentic_sft.jsonl`: all raw teacher steps, including regressions and failures.
86
+ - `cadquery_agentic_sft_positive.jsonl`: strict positives, `reward_after >= 0.70` and `reward_delta > 0`.
87
+ - `cadquery_agentic_sft_positive_060.jsonl`: recommended first SFT set, `reward_after >= 0.60` and `reward_delta > 0`.
88
+ - `cadquery_agentic_sft_delta_positive.jsonl`: every improving step, even if final reward is still modest.
89
+ - `cadquery_agentic_sft_train.jsonl` and `cadquery_agentic_sft_val.jsonl`: train/val split from the recommended set.
90
+
91
+ For the first overnight run, train on:
92
+
93
+ ```text
94
+ experiment-2-cadforge/data/sft/cadquery_agentic_sft_train.jsonl
95
+ ```
96
+
97
+ Use the validation set:
98
+
99
+ ```text
100
+ experiment-2-cadforge/data/sft/cadquery_agentic_sft_val.jsonl
101
+ ```
102
+
103
+ ## What Preference/RLHF Data Means
104
+
105
+ Preference rows are chosen/rejected pairs for the same prompt.
106
+
107
+ If a teacher repair improves reward, the repair is chosen and the previous code is rejected. If it regresses, the previous code is chosen and the repair is rejected.
108
+
109
+ This supports DPO/RLHF-style training:
110
+
111
+ ```text
112
+ same observation -> prefer higher-reward code over lower-reward code
113
+ ```
114
+
115
+ ## What RL/GRPO Data Means
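+ A minimal sketch of pair construction from one teacher step, assuming the step record fields described above:
+
+ ```python
+ def preference_row(observation: str, prev_code: str, new_code: str,
+                    reward_before: float, reward_after: float) -> dict:
+     improved = reward_after > reward_before
+     return {
+         "prompt": observation,
+         "chosen": new_code if improved else prev_code,
+         "rejected": prev_code if improved else new_code,
+         "chosen_reward": max(reward_after, reward_before),
+         "rejected_reward": min(reward_after, reward_before),
+     }
+ ```
+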
116
+
117
+ RL rollout rows contain:
118
+
119
+ - observation
120
+ - action code
121
+ - reward after action
122
+ - reward delta
123
+ - done flag
124
+ - artifact paths
125
+
126
+ For GRPO, we should use live environment scoring, not only static rows. The static rollout rows are useful for debugging and offline analysis; the live verifier is what makes the RLVE environment real.
127
+
128
+ A practical step reward is:
129
+
130
+ ```text
131
+ step_reward = 0.60 * reward_after
132
+ + 0.35 * clamp(reward_after - reward_before, -0.25, 0.25)
133
+ + 0.05 * build_success
134
+ ```
135
+
136
+ Why this shape:
137
+
138
+ - absolute reward teaches final quality;
139
+ - reward delta teaches improvement;
140
+ - build success prevents invalid code from getting accidental credit.
141
+
142
+ ## Is More Data Better?
143
+
144
+ More data helps if it is diverse. Blind duplication does not help much.
145
+
146
+ Our diversity knobs are:
147
+
148
+ - 24 object prompts;
149
+ - easy/medium/hard tasks;
150
+ - generated images and GLBs;
151
+ - five seed modes: weak, missing features, disconnected, bad dimensions, build error;
152
+ - teacher prompt variants: editability, silhouette/contact, build robustness;
153
+ - multiple repair steps per trace;
154
+ - reward filtering that keeps only improving SFT examples.
155
+
156
+ This gives the small model patterns like:
157
+
158
+ - repair undefined variables;
159
+ - replace fragile geometry with reliable Workplane operations;
160
+ - add missing semantic parts;
161
+ - reconnect floating components;
162
+ - improve proportions against a GLB reference;
163
+ - expose dimensions and helper functions for later edits.
164
+
165
+ ## What To Expect After SFT
166
+
167
+ Before SFT, Qwen 2B/9B may:
168
+
169
+ - overthink;
170
+ - output prose;
171
+ - hallucinate CadQuery APIs;
172
+ - forget `fixture`;
173
+ - create disconnected blocky assemblies.
174
+
175
+ After SFT, success should first show up as:
176
+
177
+ - higher build rate;
178
+ - more `fixture = ...` completions;
179
+ - fewer fake APIs;
180
+ - more named dimensions/helper functions;
181
+ - better first-repair reward.
182
+
183
+ Do not expect perfect CAD from SFT alone. SFT makes the model trainable for GRPO. GRPO should then optimize reward and reduce revision count.
184
+
185
+ ## Recommended Training Order
186
+
187
+ 1. Baseline Qwen 2B and 9B on 5 held-out prompts.
188
+ 2. SFT Qwen 2B on recommended positive rows.
189
+ 3. SFT Qwen 9B on the same data.
190
+ 4. Evaluate build rate, reward, and average steps-to-threshold.
191
+ 5. Run GRPO with vLLM serve mode and live verifier rewards.
192
+ 6. Compare before/after traces in the demo UI and markdown reports.
docs/brainstorm/19-qwen35-2b-9b-cadforge-sft-grpo-runpod-plan.md ADDED
@@ -0,0 +1,224 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Qwen3.5 2B/9B CADForge SFT + GRPO Plan
2
+
3
+ ## Recommendation
4
+
5
+ Use `Qwen/Qwen3.5-2B` for the fast hackathon loop and `Qwen/Qwen3.5-9B` for the serious "we can win this" run.
6
+
7
+ Default plan:
8
+
9
+ 1. Train `Qwen/Qwen3.5-2B` with LoRA SFT to prove the environment/data loop works.
10
+ 2. Evaluate base vs SFT on held-out CADForge tasks.
11
+ 3. Train `Qwen/Qwen3.5-9B` with LoRA SFT on the same data.
12
+ 4. Run GRPO from the stronger SFT adapter, using CADForge as the reward environment.
13
+ 5. Report both: 2B as the tiny-model story, 9B as the strongest small open model result.
14
+
15
+ If H200 budget is available, do not waste time optimizing around 24 GB constraints. Use Unsloth BF16 LoRA for both SFT runs, keep QLoRA only as a last-resort fallback, and test FP8 mainly for GRPO/rollout throughput once the BF16 SFT baseline is working.
16
+
17
+ ## LoRA vs QLoRA vs PEFT
18
+
19
+ `PEFT` is the umbrella library/pattern: parameter-efficient fine-tuning. It includes LoRA, QLoRA-style quantized LoRA, prefix tuning, adapters, etc.
20
+
21
+ `LoRA` means the base model stays frozen and we train small low-rank matrices inserted into attention/MLP layers. It is the best default when we have enough VRAM.
22
+
23
+ `QLoRA` means the base model is loaded in 4-bit quantized form while training LoRA adapters. It saves VRAM, but can be slower, more finicky, and slightly less clean when memory is not the bottleneck.
24
+
25
+ For CADForge:
26
+
27
+ - Use `LoRA BF16` on H200/H100/A100/L40S, preferably with Unsloth for Qwen3.5.
28
+ - Use `FP8 LoRA/GRPO` as an optimization after the BF16 baseline, especially for RL where rollout inference dominates runtime.
29
+ - Use `QLoRA` only if nothing else fits. Unsloth's Qwen3.5 guide specifically says 4-bit QLoRA is not recommended for Qwen3.5 because quantization differences are higher than normal.
30
+ - Do not start with a full fine-tune. It is overkill for 1.7M-2.0M SFT tokens and increases the risk of overfitting and catastrophic drift.
31
+ - Consider a full fine-tune later only if we have much more data, a stable eval suite, and the LoRA result is clearly adapter-limited.
32
+
33
+ Suggested adapter settings:
34
+
35
+ | Model | Method | Rank | Alpha | Target Modules | Notes |
36
+ |---|---:|---:|---:|---|---|
37
+ | Qwen3.5-2B | Unsloth BF16 LoRA | 16-32 | 32-64 | attention + MLP projections | Fast proof run |
38
+ | Qwen3.5-9B | Unsloth BF16 LoRA | 32-64 | 64-128 | attention + MLP projections | Main result |
39
+ | GRPO optimization | Unsloth FP8 LoRA/GRPO | 32-64 | 64-128 | attention + MLP projections | Test after BF16 SFT |
40
+ | Emergency low-VRAM fallback | QLoRA | 16-32 | 32-64 | attention + MLP projections | Avoid for Qwen3.5 unless forced |
41
+
42
+ ## BF16 vs FP8 vs Ollama Q8
43
+
44
+ These are easy to mix up:
45
+
46
+ - `BF16` is the stable training dtype for our SFT baseline. On H200 it is the right first choice.
47
+ - `FP8` is a GPU precision/quantized-weight training and inference path. Unsloth supports FP8 RL/GRPO with `load_in_fp8=True`; their docs report faster RL inference and much lower VRAM. This is worth testing for CADForge GRPO after the normal BF16 SFT result exists.
48
+ - `Q8_0` / `q8` in Ollama or GGUF is an inference quantization format, not the same thing as FP8 training. Serving a model as Q8 in Ollama does not mean we should train in FP8.
49
+ - Export path should be: train BF16 LoRA -> evaluate -> optionally merge -> export GGUF `q8_0` or `q4_k_m` for llama.cpp/Ollama-style demos.
50
+
51
+ Practical choice:
52
+
53
+ 1. Train SFT in BF16 LoRA with Unsloth.
54
+ 2. Evaluate base vs SFT.
55
+ 3. Run GRPO first in the simplest working Unsloth setup.
56
+ 4. If GRPO rollout generation is the bottleneck, enable FP8 with `load_in_fp8=True` and compare reward/build-rate against BF16.
57
+
58
+ Do not switch the whole plan to FP8 before we have a BF16 control run. FP8 may be faster, especially in RL, but the hackathon needs a clean ablation: base -> SFT BF16 -> GRPO BF16/FP8.
59
+
60
+ ## Hardware
61
+
62
+ Since H200 spend is okay:
63
+
64
+ - `1x H200 141GB`: best single-GPU choice for SFT + GRPO. Use this if available.
65
+ - `1x H100 80GB`: good fallback for GRPO.
66
+ - `1x L40S 48GB`: fine for SFT and eval, less ideal for GRPO throughput.
67
+ - `1x RTX 4090 24GB`: only for cheap SFT experiments with QLoRA.
68
+
69
+ For `Qwen/Qwen3.5-2B`:
70
+
71
+ - SFT LoRA: H200 is more than enough; this should be quick.
72
+ - GRPO: possible on H200, but 2B may be too weak unless SFT build-rate improves clearly.
73
+
74
+ For `Qwen/Qwen3.5-9B`:
75
+
76
+ - SFT LoRA: H200 Unsloth BF16 LoRA, sequence length 8192-16384 depending on trace length. Unsloth lists Qwen3.5-9B BF16 LoRA around 22GB VRAM before our batch/context choices.
77
+ - GRPO: H200 preferred because we can keep policy, reference, optimizer state, and rollout engine comfortable. Test FP8 GRPO if rollout throughput or memory becomes the bottleneck.
78
+ - If using vLLM colocated, start conservative: 4 completions per prompt, then push to 8 if memory/throughput is stable.
79
+
80
+ ## Current Data
81
+
82
+ Local current snapshot:
83
+
84
+ - Raw SFT: 1240 rows, about 3.52M tokens.
85
+ - Relaxed positive SFT: 704 rows, about 1.95M tokens.
86
+ - Train split: 633 rows, about 1.76M tokens.
87
+ - Val split: 71 rows, about 0.19M tokens.
88
+ - Preferences: 1239 rows, about 4.16M tokens.
89
+ - RL rollouts: 1239 rows, about 3.33M tokens.
90
+ - Prompt-to-CadQuery direct set: 25 high-quality task-to-code rows.
91
+
92
+ Use this order:
93
+
94
+ 1. SFT on `cadquery_agentic_sft_train.jsonl`.
95
+ 2. Mix in `cadquery_prompt_to_cadquery_train.jsonl` at a small weight or oversample it 2-4x so the model can also do first-shot CAD, not only repair.
96
+ 3. Evaluate on `cadquery_agentic_sft_val.jsonl` plus held-out task prompts.
97
+ 4. Use preferences later for DPO/ORPO only if GRPO is too slow.
98
+ 5. Use `cadquery_rollouts.jsonl` to seed GRPO prompts and replay evals.
99
+
100
+ ## Thinking Traces
101
+
102
+ CAD is logical and step-by-step, so thinking can help at inference time. But we should not train on hidden chain-of-thought.
103
+
104
+ Important distinction:
105
+
106
+ - OpenAI hidden reasoning is not in our data. We only have the final visible model output.
107
+ - The SFT JSONL maps observation/reward/current-code context to improved CadQuery code.
108
+ - The trace JSON files contain `raw_model_output`, but for OpenAI traces this is visible output we asked for, usually complete code, not protected internal reasoning.
109
+ - A scan of the current local data found zero `<think>` / `</think>` blocks.
110
+
111
+ For training:
112
+
113
+ - Do not fabricate long chain-of-thought.
114
+ - Do not train the model to emit `<think>` blocks before code.
115
+ - Train it to produce clean final CadQuery code only.
116
+ - If we want structured reasoning, make it explicit and safe: short public planning comments inside the code or a separate non-submitted planning field for experiments, not hidden CoT imitation.
117
+
118
+ For inference/eval:
119
+
120
+ - For 2B, start in non-thinking mode for SFT evaluation because the target output is code only, and stray thinking tokens can leak into invalid Python.
121
+ - Also run a small A/B: thinking enabled vs disabled, then strip `<think>...</think>` before passing code to CADForge. If thinking improves build/reward, mention it as an inference-time scaffold, not SFT data.
122
+ - For 9B, thinking mode is worth testing more seriously. CAD repair benefits from multi-step reasoning, but the final action must still be clean code.
123
+ - Best demo agent: "think privately or in scratch, then submit only code to CADForge." The environment should never score thinking text; it scores executable code.
124
+
125
+ Qwen notes from official cards:
126
+
127
+ - `Qwen/Qwen3.5-2B` supports thinking and non-thinking; the model card says 2B operates in non-thinking mode by default and can be served with vLLM/SGLang.
128
+ - `Qwen/Qwen3.5-9B` is Apache-2.0, compatible with Transformers/vLLM/SGLang/KTransformers, and Qwen recommends vLLM/SGLang for throughput.
129
+ - Older `Qwen3-8B` docs explicitly describe `enable_thinking=True/False` and `<think>...</think>` parsing; the same care applies when testing Qwen thinking-mode models.
130
+
131
+ ## SFT Plan
132
+
133
+ 2B:
134
+
135
+ ```text
136
+ model: Qwen/Qwen3.5-2B
137
+ method: Unsloth BF16 LoRA
138
+ rank: 16 or 32
139
+ seq_len: 8192 first, 16384 if examples need it
140
+ epochs: 2-4
141
+ lr: 1e-4 to 2e-4
142
+ batching: maximize tokens/sec; use grad accumulation rather than tiny context
143
+ eval: every 25-50 steps
144
+ ```
145
+
146
+ 9B:
147
+
148
+ ```text
149
+ model: Qwen/Qwen3.5-9B
150
+ method: Unsloth BF16 LoRA on H200
151
+ rank: 32 or 64
152
+ seq_len: 8192-16384
153
+ epochs: 2-3
154
+ lr: 5e-5 to 1.5e-4
155
+ eval: held-out build rate and reward, not just loss
156
+ ```
157
+
158
+ What to watch:
159
+
160
+ - Validation loss is secondary.
161
+ - Build rate is the first key metric.
162
+ - Mean reward and reward >= 0.70 / >= 0.85 rates are the real metrics.
163
+ - Inspect failures for Python syntax, CadQuery API hallucination, missing `fixture`, disconnected parts, and semantic misses.
164
+
165
+ ## GRPO Plan
166
+
167
+ Run GRPO only after SFT has non-trivial build rate.
168
+
169
+ For 2B:
170
+
171
+ - Use as a proof of learning if SFT already reaches reasonable build rate.
172
+ - 4 completions per prompt, short rollouts, fast reward mode.
173
+ - Stop quickly if all completions fail to build.
174
+
175
+ For 9B:
176
+
177
+ - Main GRPO candidate.
178
+ - Start from SFT adapter.
179
+ - 4-8 completions per prompt.
180
+ - CADForge fast reward during training.
181
+ - Start BF16 for the clean baseline; then test Unsloth FP8 GRPO with `load_in_fp8=True` if rollout speed or VRAM is limiting.
182
+ - Periodically run full reward for report artifacts.
183
+ - Keep anti-hacking constraints: final object must be `fixture`, blocked tokens rejected, build hard-gated, semantic hints required.
184
+
185
+ Reward objective:
186
+
187
+ ```text
188
+ step_reward = 0.60 * reward_after
189
+ + 0.35 * clamp(reward_after - reward_before, -0.25, 0.25)
190
+ + 0.05 * build_success
191
+ ```
192
+
193
+ ## Inference After Training
194
+
195
+ Quick eval:
196
+
197
+ - Load base model + LoRA adapter with Transformers/PEFT.
198
+ - Generate code from held-out observations.
199
+ - Strip markdown fences and any accidental `<think>` block.
200
+ - Submit code to the OpenEnv Space `/step`.
201
+
202
+ Serving:
203
+
204
+ - vLLM with LoRA adapter if Qwen3.5 adapter support is stable.
205
+ - Otherwise merge LoRA into a HF checkpoint and serve the merged model.
206
+ - For Ollama/llama.cpp demos, merge and convert to GGUF after training. Use `q8_0` for quality-first local demos, `q4_k_m` for portable demos. This is inference quantization, separate from BF16/FP8 training.
207
+
208
+ Demo loop:
209
+
210
+ ```bash
211
+ OPENENV_BASE_URL=https://sanjuhs-cadforge-cadquery-openenv.hf.space \
212
+ python experiment-2-cadforge/inference.py
213
+ ```
214
+
215
+ ## Final Report Story
216
+
217
+ Mirror `docs/best-example-project.md`:
218
+
219
+ 1. Cold Qwen cannot reliably produce buildable CadQuery.
220
+ 2. CADForge executes real CadQuery, exports STL, renders, and scores every step.
221
+ 3. GPT teacher traces generate repair trajectories.
222
+ 4. SFT teaches the small model the code-CAD grammar and repair style.
223
+ 5. GRPO teaches verifier-directed improvement against objective geometry rewards.
224
+ 6. 2B proves the tiny-model story; 9B gives the strongest open small-model result.
docs/brainstorm/20-cadforge-qwen-training-runbook.md ADDED
@@ -0,0 +1,380 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CADForge Qwen Training Runbook
2
+
3
+ ## Goal
4
+
5
+ Train a small Qwen model to produce editable, buildable CadQuery and then improve it through reward feedback.
6
+
7
+ The hackathon story is:
8
+
9
+ 1. Base Qwen often writes invalid or incomplete CadQuery.
10
+ 2. SFT teaches two behaviors: create CAD from a prompt and repair CAD from verifier feedback.
11
+ 3. GRPO/RLVE then rewards buildable, connected, semantically correct, editable CAD.
12
+ 4. The environment stores artifacts, reward JSON, code, and renders, so improvement is visible and auditable.
13
+
14
+ ## Hardware
15
+
16
+ Use the H200 for the real run.
17
+
18
+ Recommended setup:
19
+
20
+ - GPU: 1x H200 141 GB
21
+ - CUDA image: PyTorch 2.8 or latest RunPod PyTorch CUDA image
22
+ - Python: 3.10 or 3.11
23
+ - Package runner: `uv`
24
+ - Training dtype: BF16
25
+ - Method: Unsloth LoRA SFT first, then TRL GRPO
26
+
27
+ Do not start with QLoRA on the H200. BF16 LoRA is cleaner and the GPU has enough memory.
28
+
29
+ ## Data Mix
30
+
31
+ The first SFT run should mix:
32
+
33
+ - all cold-start rows from `cadquery_prompt_to_cadquery_train.jsonl`
34
+ - all repair rows from `cadquery_agentic_sft_train.jsonl`
35
+ - cold-start rows repeated 4x
36
+
37
+ Upsampling means repeating the cold-start rows. It does not mean skipping them.
38
+
39
+ Why repeat them? We only have 20 cold-start train rows but 633 repair train rows. If each appears once, the model mostly learns "repair existing CAD." The demo also needs "write the first complete CAD file from a prompt," so we repeat cold-start rows to keep that behavior visible during training.
40
+
41
+ Expected mixed train size:
42
+
43
+ ```text
44
+ 20 cold-start rows * 4 = 80 cold-start examples
45
+ 633 repair rows * 1 = 633 repair examples
46
+ total = 713 train examples
47
+ ```
48
+
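+ A minimal sketch of what the mix script does, assuming plain JSONL inputs (the real logic lives in `training/prepare_sft_mix.py`):
+
+ ```python
+ import json
+ import random
+
+ def build_mix(cold_path: str, repair_path: str, out_path: str,
+               upsample: int = 4, seed: int = 0) -> None:
+     def read(path):
+         with open(path, encoding="utf-8") as f:
+             return [json.loads(line) for line in f]
+     # Repeat cold-start rows so both behaviors stay visible in training.
+     rows = read(cold_path) * upsample + read(repair_path)
+     random.Random(seed).shuffle(rows)
+     with open(out_path, "w", encoding="utf-8") as f:
+         for row in rows:
+             f.write(json.dumps(row) + "\n")
+ ```
+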
49
+ ## Local Prep Commands
50
+
51
+ From the repo root:
52
+
53
+ ```bash
54
+ uv run training/prepare_sft_mix.py --cold-start-upsample 4
55
+ uv run training/smoke_cadforge_reward.py --reward-mode fast
56
+ ```
57
+
58
+ The first command creates:
59
+
60
+ - `training/output/cadforge_sft_mix_train.jsonl`
61
+ - `training/output/cadforge_sft_mix_val.jsonl`
62
+
63
+ The second command verifies that the CadQuery reward backend can build and score one known-good row.
64
+
65
+ ## RunPod Setup
66
+
67
+ After the RunPod starts:
68
+
69
+ ```bash
70
+ apt-get update
71
+ apt-get install -y git git-lfs build-essential curl
72
+ curl -LsSf https://astral.sh/uv/install.sh | sh
73
+ source $HOME/.local/bin/env
74
+ git lfs install
75
+ git clone https://github.com/sanjuhs/open-env-meta-final-hackathon.git
76
+ cd open-env-meta-final-hackathon
77
+ ```
78
+
79
+ Set secrets:
80
+
81
+ ```bash
82
+ export HF_TOKEN=...
83
+ export TRACKIO_SPACE_ID=sanjuhs/cadforge-trackio
84
+ ```
85
+
86
+ Move caches to the workspace volume before installing training packages:
87
+
88
+ ```bash
89
+ export UV_CACHE_DIR=/workspace/.uv-cache
90
+ export HF_HOME=/workspace/.cache/huggingface
91
+ export TORCH_HOME=/workspace/.cache/torch
92
+ export TRITON_CACHE_DIR=/workspace/.cache/triton
93
+ export VLLM_CACHE_ROOT=/workspace/.cache/vllm
94
+ export UV_LINK_MODE=copy
95
+ export HF_HUB_ENABLE_HF_TRANSFER=1
96
+ ```
97
+
98
+ Install the app/runtime dependencies:
99
+
100
+ ```bash
101
+ uv sync --project experiment-2-cadforge
102
+ uv run training/prepare_sft_mix.py --cold-start-upsample 4
103
+ uv run training/smoke_cadforge_reward.py --reward-mode fast
104
+ ```
105
+
106
+ If the generated local data files are not present in git, the training scripts can download the uploaded dataset from Hugging Face.
107
+
108
+ ## SFT Smoke Test
109
+
110
+ Run this first. It should take only a few minutes on the H200.
111
+
112
+ ```bash
113
+ uv run training/train_sft_unsloth.py \
114
+ --model unsloth/Qwen3.5-2B \
115
+ --output-dir outputs/qwen35-2b-cadforge-sft-smoke \
116
+ --max-steps 10 \
117
+ --limit-train-rows 32 \
118
+ --limit-val-rows 8 \
119
+ --max-seq-length 4096 \
120
+ --per-device-train-batch-size 1 \
121
+ --gradient-accumulation-steps 4 \
122
+ --lora-r 16 \
123
+ --lora-alpha 32 \
124
+ --run-name qwen35-2b-sft-smoke
125
+ ```
126
+
127
+ Success criteria:
128
+
129
+ - model loads in BF16
130
+ - LoRA attaches
131
+ - loss logs for 10 steps
132
+ - one checkpoint/output folder is written
133
+ - no chat-template/data-format crash
134
+
135
+ Qwen3.5 note: the model type is `qwen3_5`, so Transformers must be `>=5.2.0`. If a smoke run says Transformers does not recognize `qwen3_5`, update the training environment and rerun; this is a dependency issue, not a data issue.
136
+
137
+ Qwen3.5 is a unified vision-language model. For text-only SFT, call the processor with `text=...`; do not pass the chat string positionally. A positional text string can be interpreted as an image input and trigger an image decode error. That error means the processor call is wrong, not that image fine-tuning itself is broken.
138
+
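+ A minimal illustration of the call shape, with an illustrative model id and message (the exact processor class and template call may differ in the real training script):
+
+ ```python
+ from transformers import AutoProcessor
+
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-2B")  # illustrative id
+ messages = [{"role": "user", "content": "Write a CadQuery file for a stool."}]
+ chat_str = processor.apply_chat_template(messages, tokenize=False)
+
+ # Correct for text-only SFT: the keyword keeps the string on the text path.
+ inputs = processor(text=chat_str, return_tensors="pt")
+
+ # Risky: a positional string may be routed to the image argument and
+ # trigger an image decode error on a unified vision-language processor.
+ # inputs = processor(chat_str, return_tensors="pt")
+ ```
+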
139
+ Trackio note: smoke tests default to TensorBoard only. Add `--enable-trackio` after `HF_TOKEN` is configured on the pod.
140
+
141
+ ## SFT Real 2B Run
142
+
143
+ ```bash
144
+ uv run training/train_sft_unsloth.py \
145
+ --model unsloth/Qwen3.5-2B \
146
+ --output-dir outputs/qwen35-2b-cadforge-sft \
147
+ --hub-model-id sanjuhs/qwen35-2b-cadforge-sft-lora \
148
+ --push-to-hub \
149
+ --enable-trackio \
150
+ --max-steps 0 \
151
+ --num-train-epochs 3 \
152
+ --max-seq-length 8192 \
153
+ --per-device-train-batch-size 1 \
154
+ --gradient-accumulation-steps 8 \
155
+ --learning-rate 2e-4 \
156
+ --lora-r 16 \
157
+ --lora-alpha 32 \
158
+ --eval-steps 25 \
159
+ --save-steps 50 \
160
+ --run-name qwen35-2b-sft-full
161
+ ```
162
+
163
+ Watch:
164
+
165
+ - train loss
166
+ - eval loss
167
+ - generated sample build rate after training
168
+ - whether outputs contain only Python code, not markdown or thinking tags
169
+
170
+ The first live 2B run launched on 2026-04-25 used the same settings, without `--push-to-hub` and `--enable-trackio` because no HF token was configured in the pod environment yet:
171
+
172
+ ```text
173
+ output: /workspace/open-env-meta-final/outputs/qwen35-2b-cadforge-sft-full-20260425
174
+ log: /workspace/open-env-meta-final/training/logs/sft-2b-full-20260425.log
175
+ ```
176
+
177
+ Generate the current/final curve images:
178
+
179
+ ```bash
180
+ uv run training/make_training_report.py \
181
+ --log training/logs/sft-2b-full-20260425.log \
182
+ --output-dir training/reports/qwen35-2b-sft-final
183
+ ```
184
+
185
+ Evaluate the finished adapter against CADForge:
186
+
187
+ ```bash
188
+ uv run training/evaluate_cadforge_model.py \
189
+ --base-model unsloth/Qwen3.5-2B \
190
+ --adapter outputs/qwen35-2b-cadforge-sft-full-20260425 \
191
+ --eval-jsonl training/output/cadforge_sft_mix_val.jsonl \
192
+ --output-dir training/eval/qwen35-2b-cadforge-sft-full-20260425 \
193
+ --limit 24 \
194
+ --max-new-tokens 2048 \
195
+ --reward-mode fast \
196
+ --episode-prefix qwen35-2b-sft-eval
197
+ ```
198
+
199
+ This writes:
200
+
201
+ - `training/eval/qwen35-2b-cadforge-sft-full-20260425/eval_report.md`
202
+ - `training/eval/qwen35-2b-cadforge-sft-full-20260425/eval_results.jsonl`
203
+ - one generated CadQuery file per eval row
204
+
205
+ The eval script strips accidental `<think>...</think>` blocks before scoring, but the current SFT data does not contain thinking traces.
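+
+ The stripping step amounts to a small regex pass. A sketch of the idea, not the eval script's exact code:
+
+ ```python
+ import re
+
+ THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
+
+ def strip_think(completion: str) -> str:
+     # Drop any accidental reasoning block so only the CadQuery code is scored.
+     return THINK_BLOCK.sub("", completion).strip()
+
+ assert strip_think("<think>plan</think>import cadquery as cq") == "import cadquery as cq"
+ ```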
206
+
207
+ ## SFT Real 9B Run
208
+
209
+ Start this after the 2B path works:
210
+
211
+ ```bash
212
+ uv run training/train_sft_unsloth.py \
213
+ --model Qwen/Qwen3.5-9B \
214
+ --output-dir outputs/qwen35-9b-cadforge-sft \
215
+ --hub-model-id sanjuhs/qwen35-9b-cadforge-sft-lora \
216
+ --push-to-hub \
217
+ --enable-trackio \
218
+ --max-steps 0 \
219
+ --num-train-epochs 2 \
220
+ --max-seq-length 8192 \
221
+ --per-device-train-batch-size 1 \
222
+ --gradient-accumulation-steps 8 \
223
+ --learning-rate 1e-4 \
224
+ --lora-r 32 \
225
+ --lora-alpha 64 \
226
+ --eval-steps 25 \
227
+ --save-steps 50 \
228
+ --run-name qwen35-9b-sft-full
229
+ ```
230
+
231
+ ## GRPO Smoke Test
232
+
233
+ First use the cheap reward backend. This verifies GRPO wiring without spending time on CadQuery execution for every completion.
234
+
235
+ ```bash
236
+ uv run training/train_grpo_cadforge.py \
237
+ --model unsloth/Qwen3.5-2B \
238
+ --output-dir outputs/qwen35-2b-cadforge-grpo-smoke \
239
+ --reward-backend cheap \
240
+ --limit-prompts 8 \
241
+ --max-steps 5 \
242
+ --num-generations 4 \
243
+ --max-prompt-length 4096 \
244
+ --max-completion-length 1024 \
245
+ --run-name qwen35-2b-grpo-cheap-smoke
246
+ ```
247
+
248
+ Success criteria:
249
+
250
+ - GRPOTrainer starts
251
+ - four completions per prompt are generated
252
+ - reward function returns scalar scores (see the sketch after this list)
253
+ - loss/reward metrics log
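+
+ For orientation, TRL's GRPOTrainer accepts reward callables of roughly this shape, one float per completion; the body below is illustrative, not the script's actual scoring:
+
+ ```python
+ def cheap_reward(completions: list[str], **kwargs) -> list[float]:
+     # One scalar per completion; here, a toy length-capped score.
+     return [min(len(c) / 1000.0, 1.0) for c in completions]
+ ```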
254
+
255
+ ## CADForge GRPO Smoke
256
+
257
+ Then use the real CADForge reward in fast mode:
258
+
259
+ ```bash
260
+ uv run training/train_grpo_cadforge.py \
261
+ --model unsloth/Qwen3.5-2B \
262
+ --output-dir outputs/qwen35-2b-cadforge-grpo-cadforge-smoke \
263
+ --reward-backend cadforge \
264
+ --cadforge-reward-mode fast \
265
+ --limit-prompts 4 \
266
+ --max-steps 2 \
267
+ --num-generations 4 \
268
+ --max-prompt-length 4096 \
269
+ --max-completion-length 1536 \
270
+ --run-name qwen35-2b-grpo-cadforge-smoke
271
+ ```
272
+
273
+ This is slower because every completion is executed as CadQuery, exported to mesh, and scored.
274
+
275
+ ## GRPO From SFT Adapter
276
+
277
+ After the 2B SFT run finishes and the eval pass is complete, launch GRPO from the trained adapter:
278
+
279
+ ```bash
280
+ training/launch_grpo_after_sft.sh
281
+ ```
282
+
283
+ The launcher defaults to:
284
+
285
+ ```text
286
+ base model: unsloth/Qwen3.5-2B
287
+ adapter: outputs/qwen35-2b-cadforge-sft-full-20260425
288
+ output: outputs/qwen35-2b-cadforge-grpo-from-sft-20260425
289
+ log: training/logs/grpo-2b-from-sft-20260425.log
290
+ prompts: 64
291
+ steps: 80
292
+ generations: 4
293
+ batch: 4
294
+ grad accum: 4
295
+ completion: 1536 tokens
296
+ reward: CADForge fast reward
297
+ ```
298
+
299
+ Override any path without editing the file:
300
+
301
+ ```bash
302
+ SFT_ADAPTER=outputs/qwen35-2b-cadforge-sft-full-20260425 \
303
+ OUT_DIR=outputs/qwen35-2b-cadforge-grpo-from-sft-20260425 \
304
+ training/launch_grpo_after_sft.sh
305
+ ```
306
+
307
+ This direct-adapter GRPO path is the safest first production run. vLLM server mode becomes useful once the adapter is merged and exported into a model path that vLLM can serve cleanly.
308
+
309
+ ## vLLM Server Mode
310
+
311
+ The judge recommended normal GRPO with vLLM serve mode, not async mode. The script exposes that path:
312
+
313
+ ```bash
314
+ python -m vllm.entrypoints.openai.api_server \
315
+ --model unsloth/Qwen3.5-2B \
316
+ --host 127.0.0.1 \
317
+ --port 8000
318
+ ```
319
+
320
+ Then:
321
+
322
+ ```bash
323
+ uv run training/train_grpo_cadforge.py \
324
+ --model Qwen/Qwen3.5-2B \
325
+ --use-vllm-server \
326
+ --vllm-server-host 127.0.0.1 \
327
+ --vllm-server-port 8000 \
328
+ --reward-backend cadforge \
329
+ --cadforge-reward-mode fast \
330
+ --limit-prompts 16 \
331
+ --max-steps 20 \
332
+ --num-generations 4 \
333
+ --run-name qwen35-2b-grpo-vllm-server-smoke
334
+ ```
335
+
336
+ If the local TRL version changes vLLM config names, disable `--use-vllm-server` for the first proof run and use standard colocated generation.
337
+
338
+ ## GRPO Real Run
339
+
340
+ Only do the real GRPO run if SFT has a non-trivial build rate.
341
+
342
+ ```bash
343
+ uv run training/train_grpo_cadforge.py \
344
+ --model outputs/qwen35-2b-cadforge-sft \
345
+ --output-dir outputs/qwen35-2b-cadforge-grpo \
346
+ --hub-model-id sanjuhs/qwen35-2b-cadforge-grpo \
347
+ --push-to-hub \
348
+ --enable-trackio \
349
+ --reward-backend cadforge \
350
+ --cadforge-reward-mode fast \
351
+ --limit-prompts 256 \
352
+ --max-steps 100 \
353
+ --num-generations 4 \
354
+ --max-prompt-length 4096 \
355
+ --max-completion-length 2048 \
356
+ --learning-rate 5e-6 \
357
+ --run-name qwen35-2b-grpo-cadforge-full
358
+ ```
359
+
360
+ Full reward with renders should be used for periodic eval/reporting, not every GRPO step. Fast reward is the training signal; full reward is the judge-facing artifact generator.
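+
+ One simple way to schedule that split, with a hypothetical helper (the name and cadence are illustrative, not from the training script):
+
+ ```python
+ def reward_mode_for_step(step: int, eval_every: int = 25) -> str:
+     # Train on the cheap signal; render judge-facing artifacts only periodically.
+     return "full" if step > 0 and step % eval_every == 0 else "fast"
+ ```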
361
+
362
+ ## What To Report
363
+
364
+ For the initial report after smoke tests, capture:
365
+
366
+ - exact base model repo
367
+ - GPU name and VRAM
368
+ - SFT smoke loss logs
369
+ - GRPO smoke reward logs
370
+ - one CADForge reward JSON from `smoke_cadforge_reward.py`
371
+ - whether artifacts were written
372
+ - blocker list, if any
373
+
374
+ The hackathon result should compare:
375
+
376
+ - base Qwen prompt-only build rate
377
+ - SFT Qwen prompt-only build rate
378
+ - SFT Qwen repair reward delta
379
+ - GRPO Qwen reward delta
380
+ - GPT-5.4 teacher trace improvement as the upper-bound teacher demonstration
docs/cadforge-openenv-project-report.md CHANGED
@@ -117,7 +117,7 @@ The strict 9B run completed on an H200 and produced exactly that separation:
117
  The raw logs and report artifacts are backed up here:
118
 
119
  - Training evidence dataset: [sanjuhs/cadforge-training-evidence](https://huggingface.co/datasets/sanjuhs/cadforge-training-evidence)
120
- - Local backup: `training/backups/cadforge-training-evidence-20260426`
121
  - Final adaptive repair report: `training/reports/qwen35-9b-grpo-20260426-adaptive-repair-final-8192/`
122
 
123
  ![Training evidence build-rate summary](https://huggingface.co/spaces/sanjuhs/cadforge-cadquery-openenv/resolve/main/docs/detailed-blog/rendered-assets/training-evidence-build-rate-summary.png)
 
117
  The raw logs and report artifacts are backed up here:
118
 
119
  - Training evidence dataset: [sanjuhs/cadforge-training-evidence](https://huggingface.co/datasets/sanjuhs/cadforge-training-evidence)
120
+ - Compressed archive on that dataset: `archives/cadforge-training-evidence-20260426.tar.gz`
121
  - Final adaptive repair report: `training/reports/qwen35-9b-grpo-20260426-adaptive-repair-final-8192/`
122
 
123
  ![Training evidence build-rate summary](https://huggingface.co/spaces/sanjuhs/cadforge-cadquery-openenv/resolve/main/docs/detailed-blog/rendered-assets/training-evidence-build-rate-summary.png)
docs/cadforge-submission-checklist.md ADDED
@@ -0,0 +1,71 @@
1
+ # CADForge Submission Checklist
2
+
3
+ ## Non-Negotiables
4
+
5
+ | Requirement | Status | File / Link |
6
+ |---|---|---|
7
+ | OpenEnv environment | done | `experiment-2-cadforge/openenv.yaml` |
8
+ | Hugging Face Space | ready to push | `sanjuhs/cadforge-cadquery-openenv` |
9
+ | Training notebook | done | `training/cadforge_openenv_training_colab.ipynb` |
10
+ | Unsloth / TRL scripts | done | `training/train_sft_unsloth.py`, `training/train_grpo_cadforge.py` |
11
+ | Evidence of training | done | `training/reports/qwen35-9b-grpo-strict-build-20260426-strict-build/` |
12
+ | Raw training logs | done | [sanjuhs/cadforge-training-evidence](https://huggingface.co/datasets/sanjuhs/cadforge-training-evidence) |
13
+ | Final adaptive repair evidence | done | `training/reports/qwen35-9b-grpo-20260426-adaptive-repair-final-8192/` |
14
+ | Final inference comparison | done | `inference/results/stator-qwen-vs-frontier/report.md` |
15
+ | Separate HF blog markdown | done | `experiment-2-cadforge/CADFORGE_BLOG.md` |
16
+ | README links to all materials | done | `README.md`, `experiment-2-cadforge/README.md` |
17
+
18
+ ## What To Push To The HF Space
19
+
20
+ Push from the Space root:
21
+
22
+ ```bash
23
+ cd experiment-2-cadforge
24
+ set -a; source ../.env; set +a
25
+ ../.venv/bin/openenv validate .
26
+ ../.venv/bin/openenv push . --repo-id sanjuhs/cadforge-cadquery-openenv --interface
27
+ ```
28
+
29
+ The HF Space should include:
30
+
31
+ - `README.md`
32
+ - `CADFORGE_BLOG.md`
33
+ - `openenv.yaml`
34
+ - server/client environment code
35
+
36
+ Do not upload large videos to the Space. Link to YouTube if a video is made.
37
+
38
+ ## Judge Story
39
+
40
+ CADForge teaches an LLM to write buildable, editable CadQuery by interacting with a real CAD compiler and verifier.
41
+
42
+ The strongest 30-second story:
43
+
44
+ 1. Base tiny models can write plausible CAD text, but much of it does not compile.
45
+ 2. SFT teaches the model the shape of editable CadQuery programs.
46
+ 3. First GRPO exposed a reward bug: dense reward was too forgiving.
47
+ 4. The environment fought back with strict build gating.
48
+ 5. Strict GRPO produced `96/320` buildable completions and a best CADForge score of `0.9352`.
49
+ 6. Adaptive repair started from the strict-GRPO adapter and fixed the clipping failure: `53/180` buildable repairs with `0` clipped completions.
50
+ 7. Final stator inference shows the product shape: base Qwen failed build, RL-tuned Qwen built editable CAD, and GPT-5.4 remained a strong frontier baseline.
51
+
52
+ ## Training Log Narrative
53
+
54
+ Use this evidence arc if judges ask whether training really happened:
55
+
56
+ | Run | Log evidence | Interpretation |
57
+ |---|---|---|
58
+ | 2B SFT | `training/reports/qwen35-2b-sft-final/` | tiny model learns CadQuery grammar and trace format |
59
+ | 2B dense GRPO | `training/logs/grpo-2b-completions.jsonl` | reward moved, but `0/160` builds exposed forgiving reward |
60
+ | 9B SFT | `training/reports/qwen35-9b-sft-final/` | stronger syntax/style learning |
61
+ | 9B dense GRPO | `training/logs/grpo-9b-completions.jsonl` | bigger model got higher reward but still `0/160` builds |
62
+ | 9B strict GRPO | `training/logs/grpo-9b-strict-build-20260426-strict-build-completions.jsonl` | build-gated reward produced `96/320` buildable completions |
63
+ | Adaptive v1 | `training/logs/grpo-9b-20260426-adaptive-repair-completions.jsonl` | failed run exposed clipping and curriculum bug |
64
+ | Adaptive final 8192 | `training/logs/grpo-9b-20260426-adaptive-repair-final-8192-completions.jsonl` | fixed setup produced `53/180` buildable repairs |
65
+
66
+ ## Remaining Optional Polish
67
+
68
+ - Add a <2 minute YouTube link to both READMEs if you record one.
69
+ - Add the HF Space URL after pushing/confirming the live Space.
70
+ - Add screenshots from the live browser UI if there is time.
71
+ - Run a broader 10-20 task inference comparison if there is extra GPU/API time.
docs/competiton-round1/COMPETITION_REQUIREMENTS.md ADDED
@@ -0,0 +1,69 @@
1
+ # OpenEnv Round 1 — Competition Requirements
2
+
3
+ **Deadline**: 8 April 2026, 11:59 PM IST
4
+ **Competing as**: Solo — Sanjayprasad H S (sanjuhs123@gmail.com)
5
+
6
+ ---
7
+
8
+ ## Mandatory Pass/Fail Gates (all must pass or DQ)
9
+
10
+ 1. **HF Space deploys** — automated ping to Space URL returns 200 + responds to `reset()`
11
+ 2. **OpenEnv spec compliance** — `openenv validate` passes (openenv.yaml, typed models, step/reset/state endpoints)
12
+ 3. **Dockerfile builds** — `docker build` succeeds on submitted repo
13
+ 4. **Baseline reproduces** — `inference.py` runs without error, produces scores
14
+ 5. **3+ tasks with graders** — each grader returns score in 0.0–1.0 range
15
+
16
+ ## Functional Requirements
17
+
18
+ | Requirement | Detail |
19
+ |---|---|
20
+ | Real-world task | Must simulate something humans actually do (not games/toys) |
21
+ | OpenEnv spec | Typed `Action`, `Observation` Pydantic models; `step(action)` → obs, reward, done, info; `reset()` → initial obs; `state()` → current state; `openenv.yaml` with metadata |
22
+ | 3+ tasks with graders | Each task has a concrete objective + programmatic grader (0.0–1.0). Easy → medium → hard progression. Deterministic, reproducible. |
23
+ | Reward function | Signal over full trajectory (not just binary end). Partial progress rewarded. Penalize bad behavior. |
24
+ | Baseline inference script | Named `inference.py` in project root. Uses OpenAI API client. Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars. Produces reproducible baseline score on all 3 tasks. Must emit `[START]`, `[STEP]`, `[END]` structured stdout logs. |
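+
+ In practice the stdout contract from the last row looks like this (the values are illustrative; the format itself comes from the provided sample script):
+
+ ```text
+ [START] task=cad-easy env=cadforge model=Qwen3.5-2B
+ [STEP] step=1 action=submit_code reward=0.42 done=true error=null
+ [END] success=true steps=1 score=0.42 rewards=0.42
+ ```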
25
+
26
+ ## Non-Functional Requirements
27
+
28
+ - Deploy as containerized HF Space tagged `openenv`
29
+ - Working Dockerfile (`docker build` + `docker run`)
30
+ - README: env description, action/observation spaces, task descriptions, setup instructions, baseline scores
31
+
32
+ ## Scoring Weights
33
+
34
+ | Parameter | Weight |
35
+ |---|---|
36
+ | Real-world utility | 30% |
37
+ | Task & grader quality | 25% |
38
+ | Environment design | 20% |
39
+ | Code quality & spec compliance | 15% |
40
+ | Creativity & novelty | 10% |
41
+
42
+ ## Infra Constraints
43
+
44
+ - Inference script runtime < 20 min
45
+ - Must run on vcpu=2, memory=8gb
46
+ - Use OpenAI Client for all LLM calls
47
+
48
+ ## Env Vars Required
49
+
50
+ ```
51
+ API_BASE_URL — LLM API endpoint
52
+ MODEL_NAME — model identifier for inference
53
+ HF_TOKEN — HF / API key
54
+ ```
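+
+ In `inference.py` these map directly onto the OpenAI client; a minimal sketch mirroring the sample script:
+
+ ```python
+ import os
+ from openai import OpenAI
+
+ # The three judged variables from the block above.
+ client = OpenAI(base_url=os.environ["API_BASE_URL"], api_key=os.environ["HF_TOKEN"])
+ MODEL_NAME = os.environ["MODEL_NAME"]
+ ```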
55
+
56
+ ## Pre-Submission Validation
57
+
58
+ ```bash
59
+ # Run the validation script before submitting
60
+ openenv validate
61
+ docker build .
62
+ # Then submit HF Spaces URL on platform
63
+ ```
64
+
65
+ ## Evaluation Pipeline
66
+
67
+ 1. **Phase 1**: Automated validation (pass/fail gate)
68
+ 2. **Phase 2**: Agentic evaluation — baseline agent + standard Open LLM agent (Nemotron 3 Super) run against all envs
69
+ 3. **Phase 3**: Human review by Meta + HF engineers
docs/competiton-round1/inference-script-example.md ADDED
@@ -0,0 +1,189 @@
1
+ """
2
+ Inference Script Example
3
+ ===================================
4
+ MANDATORY
5
+ - Before submitting, ensure the following variables are defined in your environment configuration:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
10
+ method
11
+
12
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
13
+ (and should reflect your active inference setup):
14
+ API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
15
+ MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
16
+
17
+ - The inference script must be named `inference.py` and placed in the root directory of the project
18
+ - Participants must use OpenAI Client for all LLM calls using above variables
19
+
20
+ STDOUT FORMAT
21
+ - The script must emit exactly three line types to stdout, in this order:
22
+
23
+ [START] task=<task_name> env=<benchmark> model=<model_name>
24
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
25
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
26
+
27
+ Rules:
28
+ - One [START] line at episode begin.
29
+ - One [STEP] line per step, immediately after env.step() returns.
30
+ - One [END] line after env.close(), always emitted (even on exception).
31
+ - reward and rewards are formatted to 2 decimal places.
32
+ - done and success are lowercase booleans: true or false.
33
+ - error is the raw last_action_error string, or null if none.
34
+ - All fields on a single line with no newlines within a line.
35
+ - Each tasks should return score in [0, 1]
36
+
37
+ Example:
38
+ [START] task=click-test env=miniwob model=Qwen3-VL-30B
39
+ [STEP] step=1 action=click('123') reward=0.00 done=false error=null
40
+ [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
41
+ [STEP] step=3 action=click('789') reward=1.00 done=true error=null
42
+ [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
43
+ """
44
+
45
+ import asyncio
46
+ import os
47
+ import textwrap
48
+ from typing import List, Optional
49
+
50
+ from openai import OpenAI
51
+
52
+ from my_env_v4 import MyEnvV4Action, MyEnvV4Env
53
+ IMAGE_NAME = os.getenv("IMAGE_NAME") # If you are using docker image
54
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
55
+
56
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
57
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
58
+ TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
59
+ BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
60
+ MAX_STEPS = 8
61
+ TEMPERATURE = 0.7
62
+ MAX_TOKENS = 150
63
+ SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
64
+
65
+ # Max possible reward: each token contributes 0.1, across all steps
66
+ _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
67
+ MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
68
+
69
+ SYSTEM_PROMPT = textwrap.dedent(
70
+ """
71
+ You are interacting with a simple echo environment.
72
+ Each turn you must send a message. The environment will echo it back.
73
+ Reward is proportional to message length: reward = len(message) * 0.1
74
+ Your goal is to maximize total reward by sending meaningful, substantive messages.
75
+ Reply with exactly one message string — no quotes, no prefixes, just the message text.
76
+ """
77
+ ).strip()
78
+
79
+
80
+ def log_start(task: str, env: str, model: str) -> None:
81
+ print(f"[START] task={task} env={env} model={model}", flush=True)
82
+
83
+
84
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
85
+ error_val = error if error else "null"
86
+ done_val = str(done).lower()
87
+ print(
88
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
89
+ flush=True,
90
+ )
91
+
92
+
93
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
94
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
95
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
96
+
97
+
98
+ def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
99
+ history_block = "\n".join(history[-4:]) if history else "None"
100
+ return textwrap.dedent(
101
+ f"""
102
+ Step: {step}
103
+ Last echoed message: {last_echoed!r}
104
+ Last reward: {last_reward:.2f}
105
+ Previous steps:
106
+ {history_block}
107
+ Send your next message.
108
+ """
109
+ ).strip()
110
+
111
+
112
+ def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
113
+ user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
114
+ try:
115
+ completion = client.chat.completions.create(
116
+ model=MODEL_NAME,
117
+ messages=[
118
+ {"role": "system", "content": SYSTEM_PROMPT},
119
+ {"role": "user", "content": user_prompt},
120
+ ],
121
+ temperature=TEMPERATURE,
122
+ max_tokens=MAX_TOKENS,
123
+ stream=False,
124
+ )
125
+ text = (completion.choices[0].message.content or "").strip()
126
+ return text if text else "hello"
127
+ except Exception as exc:
128
+ print(f"[DEBUG] Model request failed: {exc}", flush=True)
129
+ return "hello"
130
+
131
+
132
+ async def main() -> None:
133
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
134
+
135
+ env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
136
+
137
+ history: List[str] = []
138
+ rewards: List[float] = []
139
+ steps_taken = 0
140
+ score = 0.0
141
+ success = False
142
+
143
+ log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
144
+
145
+ try:
146
+ result = await env.reset() # OpenENV.reset()
147
+ last_echoed = result.observation.echoed_message
148
+ last_reward = 0.0
149
+
150
+ for step in range(1, MAX_STEPS + 1):
151
+ if result.done:
152
+ break
153
+
154
+ message = get_model_message(client, step, last_echoed, last_reward, history)
155
+
156
+ result = await env.step(MyEnvV4Action(message=message))
157
+ obs = result.observation
158
+
159
+ reward = result.reward or 0.0
160
+ done = result.done
161
+ error = None
162
+
163
+ rewards.append(reward)
164
+ steps_taken = step
165
+ last_echoed = obs.echoed_message
166
+ last_reward = reward
167
+
168
+ log_step(step=step, action=message, reward=reward, done=done, error=error)
169
+
170
+ history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
171
+
172
+ if done:
173
+ break
174
+
175
+ score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
176
+ score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
177
+ success = score >= SUCCESS_SCORE_THRESHOLD
178
+
179
+ finally:
180
+ try:
181
+ await env.close()
182
+ except Exception as e:
183
+ print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
184
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
185
+
186
+
187
+ if __name__ == "__main__":
188
+ asyncio.run(main())
189
+
docs/competiton-round1/objective.md ADDED
@@ -0,0 +1,581 @@
1
+ Step 1
2
+
3
+ How will you compete?
4
+
5
+ Choose solo or team before you can start the assessment
6
+
7
+ Step 1 Complete
8
+ Competing as Solo Warrior
9
+
10
+ 👤
11
+ Sanjayprasad H S
12
+ sanjuhs123@gmail.com
13
+ 🔒
14
+ Locked for Round 1. You cannot switch to a team until Round 1 is over.
15
+
16
40
+ OpenEnv Round 1 Bootcamp: Build Your First RL Environment
41
+
42
+ Live walkthrough to submit a strong Round 1 entry
43
+
44
+ timing
45
+
46
+ 8:00 PM Onwards
47
+
48
+ Wednesday, 1st April
49
+
50
+ Host
51
+
52
+
53
+ Ben Burtenshaw
54
+
55
+ Community Education in AI at Hugging Face
56
+
57
+
58
+ Pulkit Aneja
59
+
60
+ Scaler Instructor
61
+
62
+ Watch Recording
63
+
64
+ PROBLEM STATEMENT
65
+
66
+ Round 1 — Problem Statement
67
+
68
+ The Task
69
+
70
+ Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.
71
+
72
+ Key Requirements at a Glance
73
+
74
+ Must simulate a real-world task (not games or toys)
75
+
76
+ Implement full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml
77
+
78
+ Minimum 3 tasks with agent graders (easy → medium → hard, scores/reward 0.0–1.0)
79
+
80
+ Meaningful reward function with partial progress signals
81
+
82
+ Baseline inference script with reproducible scores
83
+
84
+ Deploy to Hugging Face Spaces + working Dockerfile
85
+
86
+ README with environment description, action/observation spaces, setup instructions
87
+
88
+ Functional Requirements
89
+
90
+ Real-world task simulation
91
+
92
+ The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
93
+
94
+ OpenEnv spec compliance
95
+
96
+ Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) → returns observation, reward, done, info. reset() → returns initial observation. state() → returns current state. openenv.yaml with metadata. Tested via openenv validate.
97
+
98
+ Minimum 3 tasks with agent graders
99
+
100
+ Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
101
+
102
+ Meaningful reward function
103
+
104
+ Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
105
+
106
+ Baseline inference script
107
+
108
+ Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
109
+
110
+ Detailed Requirements
111
+
112
+ Non-Functional Requirements
113
+
114
+ Deploys to a Hugging Face Space
115
+
116
+ Environment must run as a containerized HF Space tagged with openenv.
117
+
118
+ Containerized execution
119
+
120
+ Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
121
+
122
+ Documentation
123
+
124
+ README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
125
+
126
+ Parameter
127
+
128
+ Weight
129
+
130
+ Description
131
+
132
+ Real-world utility
133
+
134
+ 30%
135
+
136
+ Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?
137
+
138
+ Task & grader quality
139
+
140
+ 25%
141
+
142
+ Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?
143
+
144
+ Environment design
145
+
146
+ 20%
147
+
148
+ Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.
149
+
150
+ Code quality & spec compliance
151
+
152
+ 15%
153
+
154
+ Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.
155
+
156
+ Creativity & novelty
157
+
158
+ 10%
159
+
160
+ Novel problem domain, interesting mechanics, clever reward design, original approach.
161
+
162
+ Scoring Breakdown
163
+
164
+ Real-world utility (30%)
165
+
166
+ • 0–5: Toy/artificial problem with no practical application
167
+
168
+ • 6–15: Valid domain but shallow modeling of the real task
169
+
170
+ • 16–25: Good domain modeling, would be useful for agent evaluation
171
+
172
+ • 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
173
+
174
+ Task & grader quality (25%)
175
+
176
+ • 3+ tasks with difficulty range?
177
+
178
+ • Graders produce scores between 0.0–1.0?
179
+
180
+ • Graders deterministic and reproducible?
181
+
182
+ • Hard task genuinely challenges frontier models?
183
+
184
+ Environment design (20%)
185
+
186
+ • reset() produces clean state?
187
+
188
+ • Action/observation types well-designed and documented?
189
+
190
+ • Reward function provides useful varying signal (not just sparse)?
191
+
192
+ • Episode boundaries sensible?
193
+
194
+ Code quality & spec compliance (15%)
195
+
196
+ • openenv validate passes?
197
+
198
+ • docker build && docker run works?
199
+
200
+ • HF Space deploys and responds?
201
+
202
+ • Baseline script runs and reproduces scores?
203
+
204
+ Creativity & novelty (10%)
205
+
206
+ • Domain we haven’t seen in OpenEnv before?
207
+
208
+ • Reward design has interesting properties?
209
+
210
+ • Clever mechanics that make the environment engaging?
211
+
212
+ Evaluation Criteria
213
+
214
+ Phase 1: Automated Validation
215
+
216
+ Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
217
+
218
+ Phase 2: Agentic Evaluation
219
+
220
+ Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
221
+
222
+ Phase 3: Human Review
223
+
224
+ Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
225
+
226
+ Disqualification Criteria
227
+
228
+ Environment does not deploy or respond
229
+
230
+ Plagiarized or trivially modified existing environments
231
+
232
+ Graders that always return the same score
233
+
234
+ No baseline inference script
235
+
236
+ How Judging works
237
+
238
+ Pre-Submission Checklist — all must pass or you're disqualified
239
+
240
+ HF Space deploys
241
+
242
+ Automated ping to the Space URL — must return 200 and respond to reset()
243
+
244
+ OpenEnv spec compliance
245
+
246
+ Validate openenv.yaml, typed models, step()/reset()/state() endpoints
247
+
248
+ Dockerfile builds
249
+
250
+ Automated docker build on the submitted repo
251
+
252
+ Baseline reproduces
253
+
254
+ Run the submitted inference script — must complete without error and produce scores
255
+
256
+ 3+ tasks with graders
257
+
258
+ Enumerate tasks, run each grader, verify scores/reward in 0.0–1.0 range
259
+
260
+ Mandatory Additional Instructions
261
+
262
+ Before submitting, ensure the following variables are defined in your environment configuration:
263
+
264
+ API_BASE_URL The API endpoint for the LLM.
265
+
266
+ MODEL_NAME The model identifier to use for inference.
267
+
268
+ HF_TOKEN Your Hugging Face / API key.
269
+
270
+ The inference script must be named `inference.py` and placed in the root directory of the project
271
+
272
+ Participants must use OpenAI Client for all LLM calls using above variables
273
+
274
+ Participants must emit structured stdout logs strictly following the [START], [STEP], and [END] format defined in the sample inference.py provided below. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to the Sample Inference Script for the complete format specification and examples.
275
+
276
+ Infra Restrictions
277
+
278
+ Runtime of inference script should be less than 20min
279
+
280
+ Make sure your env and inference can run on a machine with vcpu=2, memory=8gb
281
+
282
+ Validator
283
+
284
+ Run the pre-submission validation script before submitting
285
+
286
+ NEW
287
+ Sample Inference Script
288
+
289
+ """
290
+ return text if text else "hello"
291
+ except Exception as exc:
292
+ print(f"[DEBUG] Model request failed: {exc}", flush=True)
293
+ return "hello"
294
+
295
+
296
+ async def main() -> None:
297
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
298
+
299
+ env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
300
+
301
+ history: List[str] = []
302
+ rewards: List[float] = []
303
+ steps_taken = 0
304
+ score = 0.0
305
+ success = False
306
+
307
+ log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
308
+
309
+ try:
310
+ result = await env.reset() # OpenENV.reset()
311
+ last_echoed = result.observation.echoed_message
312
+ last_reward = 0.0
313
+
314
+ for step in range(1, MAX_STEPS + 1):
315
+ if result.done:
316
+ break
317
+
318
+ message = get_model_message(client, step, last_echoed, last_reward, history)
319
+
320
+ result = await env.step(MyEnvV4Action(message=message))
321
+ obs = result.observation
322
+
323
+ reward = result.reward or 0.0
324
+ done = result.done
325
+ error = None
326
+
327
+ rewards.append(reward)
328
+ steps_taken = step
329
+ last_echoed = obs.echoed_message
330
+ last_reward = reward
331
+
332
+ log_step(step=step, action=message, reward=reward, done=done, error=error)
333
+
334
+ history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
335
+
336
+ if done:
337
+ break
338
+
339
+ score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
340
+ score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
341
+ success = score >= SUCCESS_SCORE_THRESHOLD
342
+
343
+ finally:
344
+ try:
345
+ await env.close()
346
+ except Exception as e:
347
+ print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
348
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
349
+
350
+
351
+ if __name__ == "__main__":
352
+ asyncio.run(main())
353
+ NEW
354
+ Pre Validation Script
355
+
356
+ #!/usr/bin/env bash
357
+ fail "HF Space not reachable (connection failed or timed out)"
358
+ hint "Check your network connection and that the Space is running."
359
+ hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
360
+ stop_at "Step 1"
361
+ else
362
+ fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
363
+ hint "Make sure your Space is running and the URL is correct."
364
+ hint "Try opening $PING_URL in your browser first."
365
+ stop_at "Step 1"
366
+ fi
367
+
368
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
369
+
370
+ if ! command -v docker &>/dev/null; then
371
+ fail "docker command not found"
372
+ hint "Install Docker: https://docs.docker.com/get-docker/"
373
+ stop_at "Step 2"
374
+ fi
375
+
376
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
377
+ DOCKER_CONTEXT="$REPO_DIR"
378
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
379
+ DOCKER_CONTEXT="$REPO_DIR/server"
380
+ else
381
+ fail "No Dockerfile found in repo root or server/ directory"
382
+ stop_at "Step 2"
383
+ fi
384
+
385
+ log " Found Dockerfile in $DOCKER_CONTEXT"
386
+
387
+ BUILD_OK=false
388
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
389
+
390
+ if [ "$BUILD_OK" = true ]; then
391
+ pass "Docker build succeeded"
392
+ else
393
+ fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
394
+ printf "%s\n" "$BUILD_OUTPUT" | tail -20
395
+ stop_at "Step 2"
396
+ fi
397
+
398
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
399
+
400
+ if ! command -v openenv &>/dev/null; then
401
+ fail "openenv command not found"
402
+ hint "Install it: pip install openenv-core"
403
+ stop_at "Step 3"
404
+ fi
405
+
406
+ VALIDATE_OK=false
407
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
408
+
409
+ if [ "$VALIDATE_OK" = true ]; then
410
+ pass "openenv validate passed"
411
+ [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
412
+ else
413
+ fail "openenv validate failed"
414
+ printf "%s\n" "$VALIDATE_OUTPUT"
415
+ stop_at "Step 3"
416
+ fi
417
+
418
+ printf "\n"
419
+ printf "${BOLD}========================================${NC}\n"
420
+ printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
421
+ printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
422
+ printf "${BOLD}========================================${NC}\n"
423
+ printf "\n"
424
+
425
+ exit 0
426
+ Submission window opens on 28th March
427
+
428
+ Deadline: 8 Apr 11:59 PM
429
+
430
+
431
+ Submit your Assessment
432
+
433
+ Study material
434
+
435
+ Preparatory Course
436
+
437
+ 4 modules · ~3.5 hours
438
+
439
+ Each module: read the README first, then open the notebook in Colab. No local setup needed.
440
+
441
+
442
+ What you'll do
443
+
444
+ Connect to 3 real AI environments hosted online — an Echo bot, a Catch game, and Wordle — and interact with each using the exact same code pattern.
445
+
446
+ Read Concept
447
+
448
+ Module 1: Why OpenEnv?
449
+
450
+ ESSENTIAL FOR ROUND 1
451
+
452
+ 45 min
453
+
454
+
455
+ What you'll do
456
+
457
+ Write 4 different game-playing strategies for a Catch game, run a competition between them, then switch to a completely different game using the same code.
458
+
459
+ Read Concept
460
+
461
+ Module 2: Using Existing Environments
462
+
463
+ ESSENTIAL FOR ROUND 1
464
+
465
+ 50 min
466
+
467
+
468
+ What you'll do
469
+
470
+ Clone an existing environment, modify it, run it on your machine, then deploy your version live to Hugging Face Spaces with one command.
471
+
472
+ Read Concept
473
+
474
+ Module 3: Deploying Environments
475
+
476
+ ESSENTIAL FOR ROUND 1
477
+
478
+ 45 min
479
+
480
+
481
+ What you'll do
482
+
483
+ Build a complete word-guessing game environment from scratch — define the rules, implement the logic, test it locally, and deploy it live. About 100 lines of real code.
484
+
485
+ Read Concept
486
+
487
+ Module 4: Building Your Own Environment
488
+
489
+ MOST IMPORTANT FOR ROUND 1
490
+
491
+ 60 min
492
+
493
+ View full course repository
494
+
495
+ GUIDE
496
+
497
+ Round 1 Guide
498
+
499
+ What to Expect
500
+
501
+ Prerequisites
502
+
503
+ How to Submit
504
+
505
+ When Round 1 starts on 1 April:
506
+
507
+ Step 1
508
+
509
+ Application Form
510
+ Choose 1 of the 4–5 problem statements revealed on the platform.
511
+
512
+ Step 2
513
+
514
+ Scaffold
515
+ $
516
+ openenv init my_env
518
+ Generate project structure.
519
+
520
+ Step 3
521
+
522
+ Build
523
+ Define your environment in the generated files.
524
+
525
+ Step 4
526
+
527
+ Test locally
528
+ $
529
+ uv run server
531
+ Step 5
532
+
533
+ Deploy
534
+ $
535
+ openenv push --repo-id your-username/my-env
537
+ Step 6
538
+
539
+ Submit
540
+ Paste your HF Spaces URL here before the deadline.
541
+
542
+ Deadline: 8 April 2026, 11:59 PM IST
543
+
544
+ Step 2
545
+
546
+ Submit your Assessment
547
+
548
+ Complete Step 1 first
549
+
550
+ Problem Statement is live. Build and submit.
551
+
552
+ Round 1 begins
553
+
554
+ Submission window opens on 28th March
555
+
556
+ Deadline: 8 Apr 11:59 PM
557
+
558
+
559
+ Submit your Assessment
560
+
561
+ NOTE: Only team leaders can make the final submission.
562
+
563
+ FAQs
564
+
565
+ Frequently Asked Questions
566
+
579
+ Need help? Reach out to us
580
+
581
+ help_openenvhackathon@scaler.com
docs/competiton-round1/pre-vaidationscript-example.md ADDED
@@ -0,0 +1,185 @@
1
+ #!/usr/bin/env bash
2
+ #
3
+ # validate-submission.sh — OpenEnv Submission Validator
4
+ #
5
+ # Checks that your HF Space is live, Docker image builds, and openenv validate passes.
6
+ #
7
+ # Prerequisites:
8
+ # - Docker: https://docs.docker.com/get-docker/
9
+ # - openenv-core: pip install openenv-core
10
+ # - curl (usually pre-installed)
11
+ #
12
+ # Run:
13
+ # curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
14
+ #
15
+ # Or download and run locally:
16
+ # chmod +x validate-submission.sh
17
+ # ./validate-submission.sh <ping_url> [repo_dir]
18
+ #
19
+ # Arguments:
20
+ # ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
21
+ # repo_dir Path to your repo (default: current directory)
22
+ #
23
+ # Examples:
24
+ # ./validate-submission.sh https://my-team.hf.space
25
+ # ./validate-submission.sh https://my-team.hf.space ./my-repo
26
+ #
27
+
28
+ set -uo pipefail
29
+
30
+ DOCKER_BUILD_TIMEOUT=600
31
+ if [ -t 1 ]; then
32
+ RED='\033[0;31m'
33
+ GREEN='\033[0;32m'
34
+ YELLOW='\033[1;33m'
35
+ BOLD='\033[1m'
36
+ NC='\033[0m'
37
+ else
38
+ RED='' GREEN='' YELLOW='' BOLD='' NC=''
39
+ fi
40
+
41
+ run_with_timeout() {
42
+ local secs="$1"; shift
43
+ if command -v timeout &>/dev/null; then
44
+ timeout "$secs" "$@"
45
+ elif command -v gtimeout &>/dev/null; then
46
+ gtimeout "$secs" "$@"
47
+ else
48
+ "$@" &
49
+ local pid=$!
50
+ ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
51
+ local watcher=$!
52
+ wait "$pid" 2>/dev/null
53
+ local rc=$?
54
+ kill "$watcher" 2>/dev/null
55
+ wait "$watcher" 2>/dev/null
56
+ return $rc
57
+ fi
58
+ }
59
+
60
+ portable_mktemp() {
61
+ local prefix="${1:-validate}"
62
+ mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
63
+ }
64
+
65
+ CLEANUP_FILES=()
66
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
67
+ trap cleanup EXIT
68
+
69
+ PING_URL="${1:-}"
70
+ REPO_DIR="${2:-.}"
71
+
72
+ if [ -z "$PING_URL" ]; then
73
+ printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
74
+ printf "\n"
75
+ printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
76
+ printf " repo_dir Path to your repo (default: current directory)\n"
77
+ exit 1
78
+ fi
79
+
80
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
81
+ printf "Error: directory '%s' not found\n" "${2:-.}"
82
+ exit 1
83
+ fi
84
+ PING_URL="${PING_URL%/}"
85
+ export PING_URL
86
+ PASS=0
87
+
88
+ log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
89
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
90
+ fail() { log "${RED}FAILED${NC} -- $1"; }
91
+ hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
92
+ stop_at() {
93
+ printf "\n"
94
+ printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
95
+ exit 1
96
+ }
97
+
98
+ printf "\n"
99
+ printf "${BOLD}========================================${NC}\n"
100
+ printf "${BOLD} OpenEnv Submission Validator${NC}\n"
101
+ printf "${BOLD}========================================${NC}\n"
102
+ log "Repo: $REPO_DIR"
103
+ log "Ping URL: $PING_URL"
104
+ printf "\n"
105
+
106
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
107
+
108
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
109
+ CLEANUP_FILES+=("$CURL_OUTPUT")
110
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
111
+ -H "Content-Type: application/json" -d '{}' \
112
+ "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
113
+
114
+ if [ "$HTTP_CODE" = "200" ]; then
115
+ pass "HF Space is live and responds to /reset"
116
+ elif [ "$HTTP_CODE" = "000" ]; then
117
+ fail "HF Space not reachable (connection failed or timed out)"
118
+ hint "Check your network connection and that the Space is running."
119
+ hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
120
+ stop_at "Step 1"
121
+ else
122
+ fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
123
+ hint "Make sure your Space is running and the URL is correct."
124
+ hint "Try opening $PING_URL in your browser first."
125
+ stop_at "Step 1"
126
+ fi
127
+
128
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
129
+
130
+ if ! command -v docker &>/dev/null; then
131
+ fail "docker command not found"
132
+ hint "Install Docker: https://docs.docker.com/get-docker/"
133
+ stop_at "Step 2"
134
+ fi
135
+
136
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
137
+ DOCKER_CONTEXT="$REPO_DIR"
138
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
139
+ DOCKER_CONTEXT="$REPO_DIR/server"
140
+ else
141
+ fail "No Dockerfile found in repo root or server/ directory"
142
+ stop_at "Step 2"
143
+ fi
144
+
145
+ log " Found Dockerfile in $DOCKER_CONTEXT"
146
+
147
+ BUILD_OK=false
148
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
149
+
150
+ if [ "$BUILD_OK" = true ]; then
151
+ pass "Docker build succeeded"
152
+ else
153
+ fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
154
+ printf "%s\n" "$BUILD_OUTPUT" | tail -20
155
+ stop_at "Step 2"
156
+ fi
157
+
158
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
159
+
160
+ if ! command -v openenv &>/dev/null; then
161
+ fail "openenv command not found"
162
+ hint "Install it: pip install openenv-core"
163
+ stop_at "Step 3"
164
+ fi
165
+
166
+ VALIDATE_OK=false
167
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
168
+
169
+ if [ "$VALIDATE_OK" = true ]; then
170
+ pass "openenv validate passed"
171
+ [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
172
+ else
173
+ fail "openenv validate failed"
174
+ printf "%s\n" "$VALIDATE_OUTPUT"
175
+ stop_at "Step 3"
176
+ fi
177
+
178
+ printf "\n"
179
+ printf "${BOLD}========================================${NC}\n"
180
+ printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
181
+ printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
182
+ printf "${BOLD}========================================${NC}\n"
183
+ printf "\n"
184
+
185
+ exit 0
docs/detailed-blog/cadforge-detailed-blog.md CHANGED
@@ -300,7 +300,7 @@ The model artifacts are on Hugging Face:
300
  The raw evidence bundle is also public:
301
 
302
  - Training logs and reports: [sanjuhs/cadforge-training-evidence](https://huggingface.co/datasets/sanjuhs/cadforge-training-evidence)
303
- - Local backup: `training/backups/cadforge-training-evidence-20260426`
304
  - Per-completion reward traces: `training/logs/*completions.jsonl`
305
  - Parsed plots and metrics: `training/reports/*`
306
 
 
300
  The raw evidence bundle is also public:
301
 
302
  - Training logs and reports: [sanjuhs/cadforge-training-evidence](https://huggingface.co/datasets/sanjuhs/cadforge-training-evidence)
303
+ - Compressed archive on that dataset: `archives/cadforge-training-evidence-20260426.tar.gz`
304
  - Per-completion reward traces: `training/logs/*completions.jsonl`
305
  - Parsed plots and metrics: `training/reports/*`
306
 
docs/doc-edit-game-v2.md ADDED
@@ -0,0 +1,149 @@
1
+ ---
2
+ title: DocEdit Game V2 — Document Editing RL Environment
3
+ emoji: 📄
4
+ colorFrom: indigo
5
+ colorTo: red
6
+ sdk: docker
7
+ pinned: false
8
+ app_port: 8000
9
+ base_path: /web
10
+ tags:
11
+ - openenv
12
+ ---
13
+
14
+ # DocEdit Game V2 — Production-Grade Document Editing RL Environment
15
+
16
+ Train applicator models to perform precise, fast edits on legal and pharmaceutical documents. Procedurally generated tasks with 6 document types, 12 corruption types, 16+ editing tools, and windowed navigation for documents of any size.
17
+
18
+ ## The Problem We Solve
19
+
20
+ Legal and pharmaceutical professionals spend hours editing massive documents — contracts, affidavits, drug labels, clinical study reports. A frontier LLM can *decide* what edits to make, but executing 200 precise edits on a 2000-page XML document is too slow and expensive for GPT-4o. We train **applicator models** (1-7B params) that execute edits with near-perfect accuracy at 500x lower cost.
21
+
22
+ ## Game Mechanics
23
+
24
+ 1. **Reset**: Environment generates a document with procedural corruptions (spelling, case, names, formatting, PDF artifacts, junk chars)
25
+ 2. **Observe**: Agent sees a document chunk + edit instruction + similarity score
26
+ 3. **Act**: Agent calls one tool per step (replace, format, delete, merge_runs, clean_junk_chars, etc.)
27
+ 4. **Reward**: Incremental similarity improvement to the hidden target, with bonuses for completion and penalties for collateral damage
28
+ 5. **Win**: Achieve similarity ≥ 0.999
29
+
30
+ ## Domains
31
+
32
+ | Domain | Document Types | Real-World Scenario |
33
+ |--------|---------------|-------------------|
34
+ | **Legal** | Contract, Affidavit, Case Brief | Redlining, name changes, section renumbering |
35
+ | **Pharmaceutical** | Drug Label, Clinical Study Report | Dosage updates, adverse reaction additions, regulatory formatting |
36
+ | **Business** | Business Report | Financial table fixes, executive summary edits |
37
+
38
+ ## 12 Corruption Types (3 Tiers)
39
+
40
+ **Tier 1 — Content**: spelling, case, names, punctuation, content deletion, content insertion
41
+ **Tier 2 — Formatting**: formatting strip, formatting wrong, alignment, spacing
42
+ **Tier 3 — Artifacts**: PDF-to-DOCX fragmented runs, junk characters (zero-width spaces, BOMs)
43
+
44
+ ## 16+ Tools (Agent Actions)
45
+
46
+ ```json
47
+ {"tool": "replace", "params": {"target": "recieve", "content": "receive"}}
48
+ {"tool": "format_text", "params": {"target": "Important Notice", "format": "bold"}}
49
+ {"tool": "highlight", "params": {"target": "Section 3.2", "color": "yellow"}}
50
+ {"tool": "merge_runs", "params": {"line_index": 23}}
51
+ {"tool": "clean_junk_chars", "params": {}}
52
+ {"tool": "set_alignment", "params": {"line_index": 5, "alignment": "center"}}
53
+ {"tool": "scroll_to", "params": {"chunk": 47}}
54
+ ```
55
+
56
+ ## Observation Space
57
+
58
+ | Field | Type | Description |
59
+ |-------|------|-------------|
60
+ | `document_chunk` | str | Currently visible document chunk (XML) |
61
+ | `chunk_index` / `total_chunks` | int | Navigation position |
62
+ | `document_overview` | str | Heading index for navigation |
63
+ | `edit_instruction` | str | Natural language edit description |
64
+ | `similarity` | float | Overall similarity to target (0-1) |
65
+ | `collateral_damage` | float | Fraction of correct text accidentally damaged |
66
+ | `task_difficulty` | int | 1-6 severity level |
67
+ | `doc_type` / `domain` | str | Document template and domain |
68
+
69
+ ## 5 Fixed Evaluation Tasks
70
+
71
+ | Task | Domain | Difficulty | Corruptions |
72
+ |------|--------|-----------|-------------|
73
+ | `legal_easy` | Legal | 2 (easy) | Spelling, punctuation, content insertion |
74
+ | `legal_medium` | Legal | 3 (medium) | Mixed Tier 1+2 |
75
+ | `legal_hard` | Legal | 5 (expert) | All tiers including PDF artifacts |
76
+ | `pharma_easy` | Pharma | 2 (easy) | Spelling, content deletion |
77
+ | `pharma_hard` | Pharma | 4 (hard) | Mixed Tier 1+2 |
78
+
79
+ ## Dual-Seed System
80
+
81
+ ```python
82
+ reset(doc_seed=42, corruption_seed=9042, difficulty=3, domain="legal")
83
+ ```
84
+
85
+ - `doc_seed` controls document generation (template, content, length)
86
+ - `corruption_seed` controls corruption application (types, positions)
87
+ - 2^32 × 2^32 = ~18 quintillion unique tasks (determinism sketch below)
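+
+ A sketch of what dual-seed determinism promises, using a hypothetical `env` handle (field names follow the observation space above):
+
+ ```python
+ # Same seed pair, same task -- bit-for-bit.
+ obs_a = env.reset(doc_seed=42, corruption_seed=9042, difficulty=3, domain="legal")
+ obs_b = env.reset(doc_seed=42, corruption_seed=9042, difficulty=3, domain="legal")
+ assert obs_a.document_chunk == obs_b.document_chunk
+ assert obs_a.edit_instruction == obs_b.edit_instruction
+ ```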
88
+
89
+ ## Reward Design
90
+
91
+ ```
92
+ reward = similarity_after - similarity_before # incremental
93
+ if exact_match: reward += 1.0 + 0.2 * efficiency # completion bonus scaled by speed
94
+ if noop: reward -= 0.01 # wasted step
95
+ if collateral_damage: reward -= 0.02 * damage # broke something
96
+ ```
97
+
98
+ ## Quick Start
99
+
100
+ ```bash
101
+ cd doc_edit_game_v2 && uv sync
102
+ uvicorn server.app:app --host 0.0.0.0 --port 8001
103
+
104
+ # Or Docker
105
+ docker build -t doc_edit_game_v2-env:latest -f server/Dockerfile .
106
+ docker run -p 8000:8000 doc_edit_game_v2-env:latest
107
+ ```
108
+
109
+ ## Human + Model Web UI
110
+
111
+ The server now includes a browser playground for the same document-generation and grading logic:
112
+
113
+ - `GET /` serves a human-editing interface
114
+ - `POST /api/game/new` creates a new task from a seed, domain, and difficulty
115
+ - `POST /api/game/{session_id}/submit-human` grades the human-edited document
116
+ - `POST /api/game/{session_id}/model-step` applies model-style tool calls on a parallel workspace
117
+ - `POST /api/game/{session_id}/submit-model` grades the model workspace
118
+
119
+ UI flow:
120
+
121
+ 1. Load a new random seed from the top bar
122
+ 2. Read the scenario exposition + instruction
123
+ 3. Edit the corrupted source document in the human lane
124
+ 4. Optionally apply environment tools in the model lane on the same seed
125
+ 5. Submit each lane and compare scores side by side
126
+
127
+ The human lane uses direct document submission for easy play-testing.
128
+ The model lane uses the existing tool-based editing logic so it stays compatible with the RL-style environment.
129
+
130
+ ## Architecture
131
+
132
+ ```
133
+ doc_edit_game_v2/
134
+ ├── game/
135
+ │ ├── templates/ # 6 document generators (legal, pharma, business)
136
+ │ ├── corruptions/ # 12 corruption types in 3 tiers
137
+ │ ├── tools/ # 16+ editing tools
138
+ │ ├── windowing.py # Chunked navigation for large docs
139
+ │ ├── grader.py # Multi-level grading (similarity + edit accuracy + collateral)
140
+ │ ├── generator.py # Task orchestrator with dual-seed system
141
+ │ └── content_pools.py # Domain-specific vocabulary
142
+ ├── models.py # DocEditAction + DocEditObservation
143
+ ├── client.py # WebSocket client
144
+ ├── inference.py # Baseline LLM inference script
145
+ └── server/
146
+ ├── doc_edit_game_v2_environment.py
147
+ ├── app.py
148
+ └── Dockerfile
149
+ ```
docs/docs-guide.md ADDED
@@ -0,0 +1 @@
1
+ This is the project we are working on. First, go through the judging criteria to understand the themes. Then read the hackathon help guide for context on what happened, followed by the Round 1 corrections, which cover what we had to fix. I previously submitted doc-edit-game-v2, so read that alongside the Round 1 corrections. After that, read @best-example-project.md, a previous hackathon winner in a very similar space. With all of this collated, come up with a strong new idea, or carry out whatever task the user assigns.
docs/final-postmortem-round1.md ADDED
@@ -0,0 +1,240 @@
1
+ # DocEdit Qwen2.5-3B SFT + GRPO Post-Mortem
2
+
3
+ Date:
4
+ - April 17, 2026
5
+
6
+ Hardware:
7
+ - `1x H200 SXM`
8
+
9
+ Base model:
10
+ - `Qwen/Qwen2.5-3B-Instruct`
11
+
12
+ Training recipe:
13
+ - `LoRA SFT`
14
+ - `LoRA GRPO`
15
+
16
+ Primary Hub repo:
17
+ - [sanjuhs/docedit-qwen25-3b-checkpoints](https://huggingface.co/sanjuhs/docedit-qwen25-3b-checkpoints)
18
+
19
+ ---
20
+
21
+ ## 1. Goal
22
+
23
+ The goal of this run was to answer a narrow but important question:
24
+
25
+ > Can a small open model be adapted and reinforcement-tuned to repair corrupted structured documents?
26
+
27
+ This was not yet the final tool-policy architecture.
28
+
29
+ Instead, this run intentionally produced a **rewrite-policy baseline** that we can later compare against:
30
+ - frontier-model tool use
31
+ - tool-trajectory training
32
+ - planner -> applicator architectures
33
+
34
+ ---
35
+
36
+ ## 2. What We Ran
37
+
38
+ ### SFT stage
39
+
40
+ We trained a LoRA adapter on paired:
41
+ - corrupted document
42
+ - repaired target document
43
+
44
+ This teaches:
45
+ - markup discipline
46
+ - structured output behavior
47
+ - basic repair mapping
48
+
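+ As a rough sketch, an SFT stage of this shape is a standard TRL `SFTTrainer` + LoRA run. The toy `examples` pairs and the prompt template below are illustrative assumptions, not the exact training script we used:
+
+ ```python
+ from datasets import Dataset
+ from peft import LoraConfig
+ from trl import SFTConfig, SFTTrainer
+
+ # Assumed toy data: (corrupted, repaired) document pairs.
+ examples = [
+     ("## Sectoin 1\nThe party shal pay...", "## Section 1\nThe party shall pay..."),
+ ]
+ rows = [{"text": f"### Corrupted\n{c}\n### Repaired\n{r}"} for c, r in examples]
+
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-3B-Instruct",
+     train_dataset=Dataset.from_list(rows),
+     args=SFTConfig(output_dir="sft-out"),
+     peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
+ )
+ trainer.train()
+ ```
+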
49
+ ### GRPO stage
50
+
51
+ We then continued from the SFT adapter using verifier-based RL.
52
+
53
+ Reward ingredients:
54
+ - structural correctness
55
+ - edit accuracy
56
+ - collateral damage penalty
57
+ - output format penalty
58
+
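+ A minimal sketch of how these ingredients can combine into one scalar (the weights are illustrative, not the exact verifier used in this run):
+
+ ```python
+ def composite_reward(structural: float, edit_acc: float,
+                      collateral: float, format_ok: bool) -> float:
+     # Component scores are assumed to be in [0, 1]; collateral
+     # damage and bad output format act as penalties.
+     reward = 0.4 * structural + 0.4 * edit_acc
+     reward -= 0.3 * collateral
+     if not format_ok:
+         reward -= 0.2
+     return reward
+
+ # Near-perfect repair, no collateral damage, clean format:
+ print(composite_reward(0.95, 0.9, 0.0, True))  # 0.74
+ ```
+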
59
+ ---
60
+
61
+ ## 3. Final Training Outcome
62
+
63
+ ### SFT
64
+
65
+ - runtime: about `109.38s`
66
+ - final train loss: about `0.06346`
67
+ - final mean token accuracy: about `0.98954`
68
+
69
+ ### GRPO
70
+
71
+ - runtime: about `5562.75s`
72
+ - total steps: `100`
73
+ - final train loss: about `0.03506`
74
+ - final logged step-100 reward mean: about `0.79567`
75
+
76
+ GRPO checkpoints written:
77
+ - `checkpoint-25`
78
+ - `checkpoint-50`
79
+ - `checkpoint-75`
80
+ - `checkpoint-100`
81
+
82
+ ---
83
+
84
+ ## 4. SFT Loss Curve
85
+
86
+ ```mermaid
87
+ xychart-beta
88
+ title "SFT Loss"
89
+ x-axis ["Step 5", "Step 10", "Step 15", "Final"]
90
+ y-axis "Loss" 0 --> 0.10
91
+ line [0.0811, 0.0352, 0.0910, 0.0635]
92
+ ```
93
+
94
+ ## 5. GRPO Reward Curve Snapshot
95
+
96
+ ```mermaid
97
+ xychart-beta
98
+ title "GRPO Reward Snapshot"
99
+ x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
100
+ y-axis "Reward" 0.55 --> 1.30
101
+ line [0.8422, 0.7638, 0.9102, 0.7957]
102
+ ```
103
+
104
+ ## 6. GRPO Step Time Snapshot
105
+
106
+ ```mermaid
107
+ xychart-beta
108
+ title "GRPO Step Time"
109
+ x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
110
+ y-axis "Seconds" 40 --> 70
111
+ line [66.42, 58.12, 55.95, 61.71]
112
+ ```
113
+
114
+ ---
115
+
116
+ ## 7. Quick Directional Eval
117
+
118
+ After training, we ran a **very small** local eval on `3` validation cases for:
119
+ - base model
120
+ - SFT adapter
121
+ - final GRPO adapter
122
+
123
+ This is not a full benchmark.
124
+
125
+ It is only a quick directional comparison to tell us whether the trained adapters are plausibly improving over baseline.
126
+
127
+ ### 3-case quick eval results
128
+
129
+ | Model | Cases | Exact match rate | Mean similarity | Mean composite score | Mean edit accuracy | Mean collateral damage |
130
+ |---|---:|---:|---:|---:|---:|---:|
131
+ | Base `Qwen2.5-3B-Instruct` | 3 | 0.0000 | 0.9358 | 0.7790 | 0.4444 | 0.2000 |
132
+ | `Qwen2.5-3B + SFT LoRA` | 3 | 0.3333 | 0.9964 | 0.9109 | 0.6667 | 0.0159 |
133
+ | `Qwen2.5-3B + GRPO LoRA` | 3 | 0.3333 | 0.9964 | 0.9149 | 0.6667 | 0.0000 |
134
+
135
+ ### Visual comparison
136
+
137
+ ```mermaid
138
+ xychart-beta
139
+ title "Quick Eval Mean Composite Score"
140
+ x-axis ["Base", "SFT", "GRPO"]
141
+ y-axis "Composite Score" 0.70 --> 0.95
142
+ bar [0.7790, 0.9109, 0.9149]
143
+ ```
144
+
145
+ ```mermaid
146
+ xychart-beta
147
+ title "Quick Eval Mean Collateral Damage"
148
+ x-axis ["Base", "SFT", "GRPO"]
149
+ y-axis "Collateral Damage" 0.00 --> 0.25
150
+ bar [0.2000, 0.0159, 0.0000]
151
+ ```
152
+
153
+ ### What this means
154
+
155
+ On this very small check:
156
+ - SFT clearly improved over the base model
157
+ - GRPO slightly improved over SFT on composite score
158
+ - GRPO also reduced collateral damage to zero on this 3-case slice
159
+
160
+ This is encouraging, but it is **not enough** to claim robust superiority yet.
161
+
162
+ ---
163
+
164
+ ## 8. What Went Well
165
+
166
+ 1. The H200 setup worked well for this scale.
167
+ 2. SFT completed quickly and produced a clean LoRA adapter.
168
+ 3. GRPO completed fully and wrote multiple checkpoints.
169
+ 4. The final GRPO adapter loads and generates correctly.
170
+ 5. The quick directional eval suggests the trained adapters beat the untuned base model.
171
+
172
+ ---
173
+
174
+ ## 9. What Did Not Go Perfectly
175
+
176
+ 1. The current policy is still a **rewrite policy**, not the final tool-call architecture.
177
+ 2. We had to patch `run_grpo.py` during the run to match the installed TRL version.
178
+ 3. We also had to fix a repo-root import issue in the GRPO entrypoint.
179
+ 4. The currently published eval is still small and should be treated as a sanity check, not a full research result.
180
+
181
+ ---
182
+
183
+ ## 10. Biggest Strategic Takeaway
184
+
185
+ This run successfully answers:
186
+
187
+ > Can we fine-tune and RL-tune a small model for DocEdit on one H200?
188
+
189
+ Answer:
190
+ - **yes**
191
+
192
+ But it does **not** yet settle the bigger architecture question:
193
+
194
+ > Is rewrite-policy the right final product design?
195
+
196
+ The answer there is still:
197
+ - **probably not**
198
+
199
+ The next likely better direction is:
200
+ - frontier model plans edits
201
+ - smaller executor/applicator handles structured edit application
202
+ - or frontier model directly uses a compact patch language
203
+
204
+ This run is therefore best understood as:
205
+ - a successful baseline
206
+ - a checkpoint artifact
207
+ - a comparison anchor for future tool-policy work
208
+
209
+ ---
210
+
211
+ ## 11. Recommended Next Steps
212
+
213
+ 1. Run `GPT-5.4` directly with a compact edit language or tool schema.
214
+ 2. Compare that against this rewrite-policy baseline.
215
+ 3. Decide whether to:
216
+ - keep frontier-only tool use
217
+ - or distill those edit traces into a smaller applicator model
218
+ 4. Move future training toward:
219
+ - structured edit plans
220
+ - tool trajectories
221
+ - planner -> executor separation
222
+
223
+ ---
224
+
225
+ ## 12. Final Judgment
226
+
227
+ Was the H200 run worth doing?
228
+
229
+ - **Yes.**
230
+
231
+ Why?
232
+ - it produced complete SFT and GRPO artifacts
233
+ - it gave us a usable small-model baseline
234
+ - it generated a real comparison point for future design decisions
235
+
236
+ Would I immediately continue training more rewrite-policy models after this?
237
+
238
+ - **No.**
239
+
240
+ I would pause here, keep these artifacts, and move the next cycle toward the cleaner frontier-planner / structured-edit direction.
docs/hackathon_help_guide.md ADDED
@@ -0,0 +1,425 @@
1
+ # **Hackathon Self-Serve Guide: Build an RL Environment, Train an LLM, Ship a Demo**
2
+
3
+ ## **0\) What you are building**
4
+
5
+ The core idea is not just to fine-tune a text model, but to build a **specialized LLM system** that can act inside an environment, get feedback, and improve through reinforcement learning. The practical stack discussed here is:
6
+
7
+ **Environment → verifier/reward functions → TRL trainer → Unsloth for efficiency → deployment on OpenEnv / Spaces**.
8
+
9
+ A strong project usually maps onto one of the official hackathon themes.
10
+
11
+ Please refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing) for theme guidelines on selecting & forming problem statements.
12
+
13
+ ## **1\) Start with the right project idea**
14
+
15
+ Pick a task that has all three of these properties:
16
+
17
+ 1. **The model can act step by step**
18
+ 2. **You can verify success programmatically**
19
+ 3. **The task is hard enough to be interesting, but not so hard that the model never succeeds**
20
+
21
+ This last point matters a lot. RL only works if the probability of getting a good answer is greater than zero. If your task is so hard that the model never gets any reward, you will burn compute and learn nothing.
22
+
23
+ Please refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing) for theme guidelines on selecting & forming problem statements.
24
+
25
+ A useful rule: **prefer tasks with crisp verification over tasks that only “look good” to a human.** RL gets easier when the reward is objective.
26
+
27
+ ## **2\) Understand the minimum RL loop before you build**
28
+
29
+ At a high level, your loop is:
30
+
31
+ 1. Give the model a prompt
32
+ 2. Let it generate an action, strategy, answer, or code
33
+ 3. Execute that output in an environment or verifier
34
+ 4. Convert the result into a reward
35
+ 5. Update the model so higher-reward behavior becomes more likely
36
+
37
+ That is the practical mental model for RL here. The system samples many outputs, scores them, and shifts probability mass away from bad outputs and toward better ones.
38
+
39
+ One especially useful framing is that RL is like a more efficient version of repeated in-context improvement. Instead of repeatedly stuffing previous examples into the context, you let backpropagation store what worked into the weights.
40
+
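+ In code, that loop is short. A skeleton sketch, where every function is a stand-in for your own model, verifier, and trainer:
+
+ ```python
+ def generate(prompt):            # stand-in: sample one output from the policy
+     ...
+
+ def verify(prompt, completion):  # stand-in: run the env/verifier, return a float
+     ...
+
+ def update_policy(prompt, candidates, rewards):  # stand-in: GRPO/PPO-style step
+     ...
+
+ def rl_loop(prompts, n_steps=100, group_size=8):
+     for _ in range(n_steps):
+         for prompt in prompts:
+             # sample several candidates, score each, reinforce the better ones
+             candidates = [generate(prompt) for _ in range(group_size)]
+             rewards = [verify(prompt, c) for c in candidates]
+             update_policy(prompt, candidates, rewards)
+ ```
+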
41
+ ## **3\) Decide whether you need SFT first**
42
+
43
+ Use this simple rule:
44
+
45
+ * If you have **a lot of good data**, use **SFT**
46
+ * If you **do not have data but can verify outputs**, use **RL**
47
+ * In many practical cases, do **a little SFT first**, then RL
48
+
49
+ Why this matters:
50
+
51
+ * SFT is generally more sample-efficient
52
+ * RL is useful when you can test outcomes but cannot cheaply author ideal traces
53
+ * RL often needs some warm start, formatting priming, or easy tasks first so that good rollouts happen at all
54
+
55
+ For hackathon teams, the best path is usually:
56
+
57
+ 1. Start from a capable base/instruct model
58
+ 2. Add light formatting or task scaffolding if needed
59
+ 3. Use RL for improvement, not as magic from scratch
60
+
61
+ ## **4\) Design the environment before you design the trainer**
62
+
63
+ Treat the environment as a first-class artifact. It should define:
64
+
65
+ * **reset()**: start a fresh episode
66
+ * **step(action)**: apply an action and return the next result
67
+ * **state() / observation**: what the agent sees
68
+ * **reward**: what counts as progress or success
69
+
70
+ OpenEnv standardizes this so the same training code can work across many environments, instead of every team inventing a different API. That is one of the main reasons to use it in a hackathon.
71
+
72
+ Think about your environment in this order:
73
+
74
+ 1. What does the agent observe?
75
+ 2. What actions can it take?
76
+ 3. What ends an episode?
77
+ 4. How do you compute reward?
78
+ 5. How do you stop abuse, infinite loops, or cheating?
79
+
80
+ ## **5\) Build the environment using OpenEnv**
81
+
82
+ The intended workflow is to bootstrap an environment skeleton and then fill in the behavior. OpenEnv’s CLI creates the scaffolding for you. The environment is implemented as a Python package and exposed via a FastAPI app.
83
+
84
+ Your implementation typically defines:
85
+
86
+ * action dataclass
87
+ * observation dataclass
88
+ * state representation
89
+ * environment methods like reset and step
90
+ * FastAPI wrapper / client-server interface
91
+
92
+ That gives you a clean separation:
93
+
94
+ * the **environment** handles world dynamics and scoring,
95
+ * the **trainer** handles optimization,
96
+ * and the **model** just learns to act inside the interface.
97
+
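+ A simplified, framework-agnostic sketch of those pieces (the real OpenEnv scaffold supplies base classes and the FastAPI wiring; the names here are illustrative only):
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class Action:
+     command: str                  # e.g. a tool call or text move
+
+ @dataclass
+ class Observation:
+     text: str                     # what the agent sees
+     reward: float = 0.0
+     done: bool = False
+
+ class MyEnvironment:
+     def reset(self, seed: int | None = None) -> Observation:
+         self.turns = 0
+         return Observation(text="initial task description")
+
+     def step(self, action: Action) -> Observation:
+         self.turns += 1
+         return Observation(text="result of the action",
+                            reward=self.score(action),
+                            done=self.turns >= 10)   # episode cap
+
+     def score(self, action: Action) -> float:
+         # Stand-in verifier: replace with real task scoring.
+         return 1.0 if "correct" in action.command else 0.0
+ ```
+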
98
+ ## **6\) Keep the task simple at first**
99
+
100
+ Do not begin with your hardest benchmark. Start with the easiest version of your environment that still proves the concept. This is where curriculum learning helps.
101
+
102
+ A good progression:
103
+
104
+ 1. easy tasks with short horizons,
105
+ 2. medium tasks with a little more branching,
106
+ 3. harder tasks only after the model starts getting non-zero reward.
107
+
108
+ The principle is simple: **make success possible early**. If the model never sees successful trajectories, learning stalls.
109
+
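+ One lightweight way to implement that progression is to gate difficulty on a rolling success rate. A sketch (the thresholds are arbitrary):
+
+ ```python
+ def pick_difficulty(recent_rewards, window=50):
+     """Escalate only once the agent succeeds often at the current level."""
+     recent = recent_rewards[-window:]
+     if not recent:
+         return "easy"
+     success_rate = sum(r > 0 for r in recent) / len(recent)
+     if success_rate > 0.5:
+         return "hard"
+     if success_rate > 0.2:
+         return "medium"
+     return "easy"
+
+ print(pick_difficulty([0, 0, 1, 1, 1]))  # "hard": 3/5 recent rollouts succeeded
+ ```
+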
110
+ ## **7\) Design rewards carefully**
111
+
112
+ Your reward function is your task specification. If it is weak, incomplete, or easy to exploit, the model will optimize the wrong thing very efficiently.
113
+
114
+ A strong reward design usually includes multiple components, for example:
115
+
116
+ * execution success,
117
+ * correctness,
118
+ * format compliance,
119
+ * timeouts,
120
+ * resource usage,
121
+ * safety constraints,
122
+ * and anti-cheating checks.
123
+
124
+ One explicit recommendation was to use **multiple independent reward functions**, not just one. If you only have a single reward signal, it is easier for the model to hack it. Multiple independent checks reduce that risk.
125
+
126
+ For example, for a coding environment:
127
+
128
+ * reward passing tests,
129
+ * penalize timeouts,
130
+ * reward format compliance,
131
+ * reject use of forbidden globals,
132
+ * and separately verify the function contract.
133
+
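+ In TRL, each of those checks can literally be its own reward function. A sketch of two independent checks (the test-harness reward is left as a named placeholder):
+
+ ```python
+ def format_reward(completions, **kwargs):
+     # Reward completions that define exactly one function.
+     return [1.0 if c.count("def ") == 1 else 0.0 for c in completions]
+
+ def no_globals_reward(completions, **kwargs):
+     # Penalize any use of the `global` keyword.
+     return [-1.0 if "global " in c else 0.0 for c in completions]
+
+ # Passed together to the trainer, each stays an independent, inspectable signal:
+ # GRPOTrainer(..., reward_funcs=[format_reward, no_globals_reward, tests_reward])
+ ```
+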
134
+ ## **8\) Protect yourself against reward hacking**
135
+
136
+ Reward hacking is one of the biggest practical failure modes. The model may learn shortcuts that maximize your reward without solving the real task. Examples mentioned include:
137
+
138
+ * editing timers,
139
+ * caching results,
140
+ * abusing globals,
141
+ * mutating protected state,
142
+ * or exploiting environment bugs.
143
+
144
+ What to do:
145
+
146
+ 1. Use multiple independent reward functions
147
+ 2. Lock down execution where possible
148
+ 3. Add time limits
149
+ 4. Avoid unrestricted global state
150
+ 5. Sample outputs frequently and inspect them
151
+ 6. Terminate or roll back runs if behavior drifts badly
152
+
153
+ A particularly practical recommendation was to use a **locked-down function** or restricted execution approach so the model cannot rely on undeclared globals or hidden cached state.
154
+
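+ A minimal sketch of that idea: execute candidate code in a subprocess with stripped-down builtins and a hard wall-clock limit (a real sandbox should go further, e.g. containers and resource limits):
+
+ ```python
+ import multiprocessing
+
+ SAFE_BUILTINS = {"len": len, "range": range, "min": min, "max": max, "sum": sum}
+
+ def _worker(code, queue):
+     scope = {"__builtins__": SAFE_BUILTINS}   # no open, __import__, exec, ...
+     try:
+         exec(code, scope)
+         queue.put(scope.get("result"))
+     except Exception:
+         queue.put(None)
+
+ def run_locked_down(code: str, timeout_s: float = 2.0):
+     queue = multiprocessing.Queue()
+     proc = multiprocessing.Process(target=_worker, args=(code, queue))
+     proc.start()
+     proc.join(timeout_s)
+     if proc.is_alive():                        # hard timeout: kill the rollout
+         proc.terminate()
+         return None
+     return queue.get() if not queue.empty() else None
+
+ # run_locked_down("result = sum(range(10))") -> 45
+ ```
+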
155
+ Also, do not just let training run forever without checking generations. Periodic human inspection is still necessary.
156
+
157
+ ## **9\) Use process-aware feedback when you can**
158
+
159
+ Naively assigning the same final reward to every token is inefficient. If possible, use richer supervision that distinguishes good intermediate steps from bad ones. That is the idea behind **process supervision**.
160
+
161
+ In practice, this can be approximated by:
162
+
163
+ * line-by-line checks,
164
+ * step-level verifiers,
165
+ * program trace analysis,
166
+ * or LLM-as-a-judge for intermediate reasoning.
167
+
168
+ But be careful: LLM-as-a-judge can itself be gamed. Use it as one signal, not the only signal.
169
+
170
+ For a hackathon, outcome-based verification plus a few lightweight process checks is usually the sweet spot.
171
+
172
+ ## **10\) Pick the right training stack**
173
+
174
+ The intended stack here is:
175
+
176
+ * **TRL** for RL training algorithms
177
+ * **Unsloth** to make RL training and inference more efficient
178
+ * **OpenEnv** to standardize environment interaction
179
+
180
+ This combination works because:
181
+
182
+ * OpenEnv gives you a common environment interface
183
+ * TRL gives you RL trainers like GRPO
184
+ * Unsloth reduces memory use and improves efficiency on top of TRL
185
+
186
+ One of the practical examples used the same prompt repeated many times, routed through an environment, with TRL driving training and Unsloth helping with performance.
187
+
188
+ ## **11\) Prefer GRPO / RLVR style training for verifiable tasks**
189
+
190
+ The RL setup discussed here leans toward **RL with verifiable rewards**:
191
+
192
+ * instead of a learned reward model,
193
+ * use a verifier, test harness, regex check, executor, or environment.
194
+
195
+ GRPO was described as a more efficient evolution relative to older PPO-style setups, especially by simplifying away parts like the value model.
196
+
197
+ For hackathon purposes, the key practical takeaway is:
198
+
199
+ * if the task is verifiable,
200
+ * build the verifier first,
201
+ * then plug that verifier into RL training.
202
+
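+ Wiring a verifier into TRL's `GRPOTrainer` looks roughly like this (the `verify` function and the prompt dataset are assumptions you supply):
+
+ ```python
+ from trl import GRPOConfig, GRPOTrainer
+
+ def verifier_reward(prompts, completions, **kwargs):
+     # `verify` is your test harness / environment returning a float per rollout.
+     return [verify(p, c) for p, c in zip(prompts, completions)]
+
+ trainer = GRPOTrainer(
+     model="Qwen/Qwen2.5-3B-Instruct",
+     reward_funcs=[verifier_reward],
+     args=GRPOConfig(output_dir="grpo-out", num_generations=8),
+     train_dataset=dataset,   # needs a "prompt" column
+ )
+ trainer.train()
+ ```
+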
203
+ ## **12\) Keep inference fast**
204
+
205
+ One important point: in RL for LLMs, **inference can dominate total runtime**. Over time, rollout generation often becomes the bottleneck, not the optimizer step.
206
+
207
+ That means your project speed depends heavily on:
208
+
209
+ * fast sampling,
210
+ * tight environment loops,
211
+ * low-overhead execution,
212
+ * and efficient model runtime.
213
+
214
+ This is one reason Unsloth matters in the stack, and another reason to avoid overly heavy environments early in the hackathon.
215
+
216
+ ## **13\) Deploy your environment early**
217
+
218
+ OpenEnv environments are designed to be deployed as **Hugging Face Spaces**, which provide:
219
+
220
+ * a running server,
221
+ * a Git repository,
222
+ * and a container registry.
223
+
224
+ That gives you several ways to work:
225
+
226
+ * interact with the remote Space directly,
227
+ * install the client code from the repo,
228
+ * pull and run the container locally,
229
+ * or run the FastAPI app locally via Python/Uvicorn.
230
+
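+ The last option is a one-liner. Assuming the scaffold's `server/app.py` exposes a FastAPI `app` (adjust the import string to your layout):
+
+ ```python
+ import uvicorn
+
+ # Serve the environment's FastAPI app locally on port 8000.
+ uvicorn.run("server.app:app", host="0.0.0.0", port=8000)
+ ```
+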
231
+ Why this is good for a hackathon:
232
+
233
+ * one shared source of truth,
234
+ * easier collaboration,
235
+ * easier demos,
236
+ * easier switching between local and remote execution.
237
+
238
+ A good habit is to deploy an early version of the environment before training seriously. That catches API and packaging issues early.
239
+
240
+ ## **14\) Scale only after the environment is stable**
241
+
242
+ There was a dedicated tutorial flow around:
243
+
244
+ 1. environment,
245
+ 2. deployment,
246
+ 3. scaling,
247
+ 4. training with TRL and Wordle.
248
+
249
+ Follow the same order.
250
+
251
+ Do **not** start with scale. First confirm:
252
+
253
+ * reset works,
254
+ * step works,
255
+ * rewards are sensible,
256
+ * timeouts work,
257
+ * logs are visible,
258
+ * and the environment can be run locally and remotely.
259
+
260
+ Only then:
261
+
262
+ * increase batch sizes,
263
+ * duplicate prompts or tasks,
264
+ * expand task diversity,
265
+ * and benchmark throughput.
266
+
267
+ ## **15\) Monitor the right things during training**
268
+
269
+ Do not watch only one scalar. Monitor:
270
+
271
+ * overall reward,
272
+ * individual reward function columns,
273
+ * success indicators,
274
+ * timeout frequency,
275
+ * and generated strategies over time.
276
+
277
+ A very concrete suggestion was:
278
+
279
+ * watch whether the reward is going up,
280
+ * and separately watch critical columns like “function works.”
281
+
282
+ Also inspect actual generations during training. A rising reward is not enough if the model is learning to exploit bugs.
283
+
284
+ ## **16\) Save models correctly**
285
+
286
+ If you use QLoRA / LoRA-style training, be careful when saving. One explicit warning was:
287
+
288
+ **Do not upcast a 4-bit model to 16-bit and then merge the LoRA weights naively.** That can badly damage model quality. Instead, use the proper merged-save path, or use the adapters directly.
289
+
290
+ For participants, that means:
291
+
292
+ * keep your training save path simple,
293
+ * test post-training inference immediately,
294
+ * and do not leave export until the end.
295
+
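+ One safe pattern is to merge the adapter into a fresh 16-bit copy of the base model instead of upcasting the quantized training copy. A sketch with `peft` (model and adapter paths are placeholders):
+
+ ```python
+ import torch
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM
+
+ # Reload the base in 16-bit; do NOT upcast the 4-bit training weights.
+ base = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen2.5-3B-Instruct", torch_dtype=torch.bfloat16
+ )
+ merged = PeftModel.from_pretrained(base, "my-lora-adapter").merge_and_unload()
+ merged.save_pretrained("merged-model")
+ ```
+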
296
+ ## **17\) How to structure your team over the hackathon**
297
+
298
+ A very effective team split is:
299
+
300
+ **Person A: Environment**
301
+
302
+ * builds reset/step/state
303
+ * adds timeouts and safety constraints
304
+ * makes local and remote execution work
305
+
306
+ **Person B: Verifier / Rewards**
307
+
308
+ * writes multiple reward functions
309
+ * adds anti-hacking checks
310
+ * makes failure cases visible
311
+
312
+ **Person C: Training**
313
+
314
+ * sets up TRL \+ Unsloth
315
+ * runs experiments
316
+ * tracks metrics and generations
317
+
318
+ **Person D: Demo / Product**
319
+
320
+ * prepares the Space demo
321
+ * creates a simple interface
322
+ * records examples and final benchmarks
323
+
324
+ This split matches the way the stack naturally decomposes in practice.
325
+
326
+ ## **18\) A practical 1-day execution plan**
327
+
328
+ ### **Phase 1: Pick a narrow task**
329
+
330
+ Choose a small, verifiable environment. Avoid huge long-horizon tasks first.
331
+
332
+ ### **Phase 2: Build the environment**
333
+
334
+ Use OpenEnv init, implement reset/step/state, and get a local loop working.
335
+
336
+ ### **Phase 3: Build rewards**
337
+
338
+ Add at least 2–4 independent reward checks, plus timeout and anti-cheat logic.
339
+
340
+ ### **Phase 4: Deploy**
341
+
342
+ Push to a Space or run locally via container/Uvicorn so teammates can use the same environment.
343
+
344
+ ### **Phase 5: Train small**
345
+
346
+ Run a tiny TRL \+ Unsloth experiment first. Look at outputs, not just metrics.
347
+
348
+ ### **Phase 6: Inspect for hacking**
349
+
350
+ Sample generations. Check for globals, hacks, environment abuse, or suspicious shortcuts.
351
+
352
+ ### **Phase 7: Add curriculum**
353
+
354
+ If the model gets zero reward too often, simplify tasks or add easier start states.
355
+
356
+ ### **Phase 8: Train bigger**
357
+
358
+ Only after the loop is stable should you increase scale, batch size, or environment diversity.
359
+
360
+ ### **Phase 9: Save and demo**
361
+
362
+ Export the trained model correctly, test inference, and show before/after behavior.
363
+
364
+ ## **19\) What judges or reviewers will likely find compelling**
365
+
366
+ The strongest hackathon projects usually show:
367
+
368
+ * a clear environment design,
369
+ * objective reward functions,
370
+ * evidence that the model improved,
371
+ * prevention against reward hacking,
372
+ * a reproducible deployment story,
373
+ * and a sharp demo.
374
+
375
+ A simple but strong demo format is:
376
+
377
+ 1. baseline model attempt,
378
+ 2. reward/verifier output,
379
+ 3. trained model attempt,
380
+ 4. measurable improvement,
381
+ 5. short explanation of safeguards.
382
+
383
+ ## **20\) Suggested problem statement theme directions**
384
+
385
+ Please Refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing)
386
+
387
+ ## **21\) Common mistakes to avoid**
388
+
389
+ * Picking a task so hard that success probability is zero
390
+ * Using only one reward function
391
+ * Not checking for reward hacking
392
+ * Training before the environment is stable
393
+ * Relying only on average reward and not inspecting outputs
394
+ * Forgetting timeouts and sandbox limits
395
+ * Saving LoRA/QLoRA models incorrectly
396
+
397
+ ## **22\) Learning Resources**
398
+
399
+ **(Recommended) RL Environment Lecture Chapters:**
400
+ [**RL Mega Lecture**](https://openenv-india-apr-2026.lovable.app/)
401
+
402
+ **Module 1: Why OpenEnv?** (\~7 min)
403
+ ▸ Workshop 8:02–15:05 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=482s](https://www.youtube.com/watch?v=1jU05MlENOI&t=482s)
404
+ ▸ Sanyam: RL loop, fragmented env APIs, OpenEnv as universal interface, Gymnasium spec \+ Docker
405
+ ▸ Alt: Mega Lecture 40:01–46:00 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=2401s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=2401s)
406
+
407
+ **Module 2: Using Existing Envs** (\~7.5 min)
408
+ ▸ Workshop 35:33–43:05 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2133s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2133s)
409
+ ▸ Ben: Hub org, env collections, 3 Space interfaces (server/repo/registry), from\_hub
410
+ ▸ Alt: Mega Lecture 1:24:11–1:30:00 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5051s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5051s)
411
+
412
+ **Module 3: Deploying Envs** (\~9 min)
413
+ ▸ Mega Lecture 1:30:00–1:39:07 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5400s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5400s)
414
+ ▸ Ben: live openenv init, scaffold, running locally, openenv push, Docker run from Space
415
+ ▸ Alt: Workshop 43:05–48:30 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2585s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2585s)
416
+
417
+ **Module 4: Building Your Own** (\~6.5 min)
418
+ ▸ Workshop 43:45–50:20 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2625s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2625s)
419
+ ▸ Ben: scaffold files, business logic (reset/step), models, client, publishing
420
+ ▸ Alt: Mega Lecture 1:33:30–1:39:07 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5610s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5610s)
421
+
422
+ **Module 5: Training \+ TRL** (\~14 min)
423
+ ▸ Mega Lecture 1:53:20–2:07:12 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=6800s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=6800s)
424
+ ▸ Lewis: Wordle GRPO walkthrough — rollout function, reward shaping, GRPOTrainer, live training
425
+ ▸ Alt: Workshop 22:24–34:12 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=1344s](https://www.youtube.com/watch?v=1jU05MlENOI&t=1344s)
docs/judging_criteria.md ADDED
@@ -0,0 +1,166 @@
1
+ Theme #1 - Multi-Agent Interactions
2
+ Environments for this theme involve cooperation, competition, negotiation, and coalition formation. Learning from these environments will enable agents to model the beliefs and incentives of others in partially observable settings. This drives theory-of-mind reasoning and emergent strategic behavior.
3
+ Expected Outcome: an environment that can be used to train multi-agent task handling in a LLM
4
+ Example environments: Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
5
+ Theme #2 - (Super) Long-Horizon Planning & Instruction Following
6
+ You will build environments that require deep, multi-step reasoning with sparse or delayed rewards. After using these environments, the goal is to enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes. The aim is to push beyond shallow next-token reasoning toward structured planning and durable internal representations.
7
+ Expected Outcome: an environment that can capture and improve LLM behaviour on challenging long horizon tasks that need long running sessions beyond context memory limits.
8
+ Example environments: (Think of OpenClaw workflows with Multi-turn tasks). Research-planning simulators, large-scale codebase refactoring tasks, strategic resource management worlds, long-horizon logistics optimization, extremely complicated long-horizon instruction following (e.g., 300 instructions scattered around).
9
+ Theme #3 - World Modeling
10
+ #3.1 Professional Tasks
11
+ Here you will develop environments that require real interaction with tools, APIs, or dynamic systems where the model is expected to do real hard work instead of exploiting short-cuts to arrive at the desired outcome. Learning from these environments will enable agents to maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.
12
+ Expected Outcome: an environment capturing nuances of a defined partially observable world and improve LLM interaction with it
13
+ Example environments: Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations with feedback, tool-discovery benchmarks.
14
+
15
+ #3.2 Personalized Tasks
16
+ Here we will develop an environment that offers real personalized task handling: imagine replying to personal messages, handling dinner conflicts caused by work, or replying to tough emails. Think of any personal-assistant task.
17
+
18
+
19
+ Expected Outcome: An environment that gives the model a realistic simulation of handling personal tasks, conflicts and managing them as delegations
20
+
21
+ Example environments: Executive Assistant Meeting Planner, Dinner and drive planning, email and message replying, shopping, etc
22
+
23
+ Theme #4 - Self-Improvement
24
+ The focus here is to create environments where agents can learn to generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own capability growth. The objective is recursive skill amplification.
25
+ Expected Outcome: an environment for improving self-play of a LLM over a defined set of tasks
26
+ Example environments: Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
27
+
28
+ Theme #5: Wild Card - Impress Us!
29
+ We do not want to limit your focus. If your idea doesn’t fit the boxes above, we want and WILL reward out-of-the-box tasks. Please be creative, but make sure your submission meaningfully adds value to LLM training on a specific task.
30
+
31
+ Guidelines for Problem Statement
32
+ It is NOT mandatory to choose the same problem statement as Round 1. Only choose the same problem statement if it aligns with the above provided Hackathon themes.
33
+ You can start working on your problem statement once you have finalized it. Post-training can be done onsite on the 25th & 26th, when you receive compute credits for Hugging Face.
34
+ Before the onsite, we suggest you work on building the environment, agent behaviours, reward model and evaluate if your work aligns with the judging criteria given below.
35
+
36
+
37
+ Judging Criteria
38
+ Minimum requirements:
39
+ Usage of OpenEnv (latest release)
40
+ Show a minimal training script for your environment using Unsloth or HF TRL in Colab
41
+ Write a mini-blog on HuggingFace or mini-video on YouTube talking about your submission, <2 minutes
42
+ Your OpenEnv compliant environment should be hosted on Hugging Face Spaces.
43
+
44
+ Judging Overview
45
+ Evaluation: Teams will be scored based on the following criteria:
46
+ Environment Innovation (40%): Is the environment novel, creative, or challenging? Does it meaningfully test the agent’s behavior?
47
+ Storytelling (30%): Does the team clearly explain the problem, environment, and agent behavior? Is the demo engaging and easy to follow?
48
+ Showing Improvement in Rewards (20%): Does the demo provide observable evidence of training progress (reward curves, metrics, or before/after behavior)?
49
+ Reward and Training Script/Pipeline Setup (10%): Is the reward logic coherent, and does the pipeline produce meaningful improvement in the agent’s inference (how it acts in the environment)?
50
+
51
+ OpenEnv Hackathon - What Judges Look For
52
+
53
+ This guide tells you what makes a strong submission for the OpenEnv Hackathon (India 2026).
54
+ Read it before you start building, and again before you submit.
55
+
56
+ For the list of themes and example problems, refer to the top sections.
57
+
58
+ NOTE: Please remember only one submission per team. If you have multiple ideas, pick the best one and go for it. Please make sure that the URL link of your environment is submitted as judges will pull the environment from the URL to evaluate it. Changes or commits after the submission deadline will not be considered.
59
+
60
+ TL;DR
61
+
62
+ Build an environment that an LLM could actually be trained on to get measurably better at
63
+ something interesting. Then show that training. Then tell the story.
64
+
65
+ A messy but ambitious environment with real training evidence beats a polished but boring one.
66
+ Pick a problem that excites you (that energy comes through in the pitch).
67
+
68
+ Judging Criteria
69
+
70
+ Criterion: Environment Innovation
71
+ Weight: 40%
72
+ What it means:
73
+ Is the environment novel, creative, or genuinely challenging?
74
+ Does it meaningfully test agent behavior in a way that hasn't been done before?
75
+
76
+
77
+ Criterion: Storytelling & Presentation
78
+ Weight: 30%
79
+ What it means:
80
+ Can you clearly explain the problem, the environment, and what the agent learned?
81
+ Is the demo engaging and easy to follow for a non-technical audience?
82
+
83
+
84
+ Criterion: Showing Improvement in Rewards
85
+ Weight: 20%
86
+ What it means:
87
+ Is there observable evidence of training progress? Reward curves, before/after behavior,
88
+ comparison against a baseline -- anything that proves the agent learned something.
89
+
90
+
91
+ Criterion: Reward & Training Pipeline
92
+ Weight: 10%
93
+ What it means:
94
+ Is the reward logic coherent? Does the pipeline produce meaningful improvement in the trained
95
+ agent's behavior?
96
+
97
+
98
+ Minimum Submission Requirements
99
+
100
+ NOTE: These are non-negotiable. Submissions missing any of these are at a serious disadvantage.
101
+ Use OpenEnv (latest release). Build on top of the framework; don’t reinvent the wheel.
102
+ A working training script using Unsloth or Hugging Face TRL, ideally as a Colab notebook so judges can re-run it.
103
+ Evidence that you actually trained; at minimum, loss and reward plots from a real run.
104
+ A short writeup: a mini-blog on Hugging Face or a < 2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck of presentation. Please make sure that all materials are linked from your README file so that judges can access them easily.
105
+ Push your environment to a Hugging Face Space so it’s discoverable and runnable.
106
+ A README that motivates the problem, explains how the env works, and shows results.
107
+ README should have a link to the environment in the Hugging Face Space. It should also have all additional references to other materials (e.g. videos, blog posts, slides, presentations, etc.) that you want to include.
108
+ Please do not include big video files in your Env submission on HF Hub as we would like to have a small size for each env (Please use url as reference link to additional materials).
109
+
110
+ What Makes a Submission Stand Out
111
+
112
+ Pick an ambitious, original problem
113
+ The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation,
114
+ you need a genuinely fresh angle. Some questions to ask yourself:
115
+ Does this environment exist to teach an LLM something it currently can’t do well?
116
+ Is the domain underexplored in RL/LLM training?
117
+ Could a researcher write a paper about training on this?
118
+
119
+ Design a reward signal that actually teaches
120
+ A great environment has a reward function that:
121
+ Provides a rich, informative signal (not just 0/1 at the end)
122
+ Captures something hard to measure in a clever way
123
+ Uses OpenEnv’s Rubric system thoughtfully (composable rubrics > monolithic scoring)
124
+ Is hard to game; an agent that exploits the reward without solving the task should not get high scores
125
+
126
+ Show real training, end to end
127
+ The bar isn’t “training script exists.” The bar is “training script runs against the environment, the
128
+ agent learns, and you can show it.” Concretely:
129
+ Your training loop should connect to your environment (not a static dataset)
130
+ Train long enough that the curves mean something
131
+ Compare a trained agent vs. a random/untrained baseline; quantitative and/or qualitative
132
+ Include the plots and numbers in your README and writeup
133
+
134
+ Make your plots readable
135
+ Reviewers spend seconds, not minutes, on each plot. Help them out:
136
+ Label both axes (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
137
+ Save plots as .png or .jpg and commit them to the repo (don’t leave them only in a Colab cell or a deleted Wandb run) (if you ran via Wandb, please include the link to that specific run of your plots)
138
+ Embed the key plots in your README with a one-line caption explaining what each one shows. If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious.
139
+
140
+ Tell a story, not an API doc
141
+ Your README, blog, and pitch should answer:
142
+ Problem) what capability gap or interesting domain are you targeting?
143
+ Environment) what does the agent see, do, and get rewarded for?
144
+ Results) what changed after training? Show it.
145
+ Why does it matter) who would care, and why?
146
+
147
+ A reviewer should be able to read your README in 3~5 minutes and want to try your
148
+ environment.
149
+
150
+ NOTE: If you have a video, HF post, or anything else interesting, please make sure that it’s linked
151
+ from your README as a link.
152
+
153
+ Engineer it cleanly (table stakes)
154
+ Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
155
+ Use OpenEnv’s Environment / MCPEnvironment base classes properly
156
+ Respect the client / server separation (clients should never import server internals)
157
+ Follow the standard Gym-style API (reset, step, state)
158
+ Have a valid openenv.yaml manifest
159
+ Don’t use reserved tool names (reset, step, state, close) for MCP tools
160
+
161
+ Final Note
162
+
163
+ Judges are looking for environments that push the frontier of what we can train LLMs to do. Be
164
+ ambitious. Pick a problem you find genuinely interesting; that almost always produces better
165
+ work than chasing what you think judges want. Good luck.
166
+
docs/project-setup.md ADDED
@@ -0,0 +1,3 @@
1
+ Ideally, we should always use Python 3.12, with uv to install all other Python packages. For the front end, we can use HTML, CSS, and JavaScript, or a Next.js application, whichever suits an interactive front-end environment. Plain HTML/CSS/JS should be good, or even Gradio is fine. Use whatever works with OpenEnv; refer to competition round one if you want more detail.
2
+
3
+ For more data and information, refer to @docs-guide.md and go through the other files and folders inside the docs folder of this repository to get a better idea of what we want and how we are going to build it.
docs/round1-corrections.md ADDED
@@ -0,0 +1,32 @@
1
+ I went through it. Short version: what passed the OpenEnv hackathon was the **DocEdit Game V2 environment**, not the later trained Qwen model.
2
+
3
+ The submitted project was [attempt1/doc_edit_game_v2](/Users/sanju/Desktop/coding/python/open-env-meta/attempt1/doc_edit_game_v2/README.md), deployed as HF Space `sanjuhs/doc_edit_v5`. It passed because it satisfied the OpenEnv gates:
4
+
5
+ 1. **HF Space existed and ran**: final target was `https://sanjuhs-doc-edit-v5.hf.space`.
6
+ 2. **OpenEnv spec**: [openenv.yaml](/Users/sanju/Desktop/coding/python/open-env-meta/attempt1/doc_edit_game_v2/openenv.yaml:1), typed `DocEditAction` / `DocEditObservation`, FastAPI mounted OpenEnv endpoints in [server/app.py](/Users/sanju/Desktop/coding/python/open-env-meta/attempt1/doc_edit_game_v2/server/app.py:29).
7
+ 3. **Docker build path**: [server/Dockerfile](/Users/sanju/Desktop/coding/python/open-env-meta/attempt1/doc_edit_game_v2/server/Dockerfile:13) used the OpenEnv base image and ran the server.
8
+ 4. **Baseline inference**: [inference.py](/Users/sanju/Desktop/coding/python/open-env-meta/attempt1/doc_edit_game_v2/inference.py:15) used the OpenAI client, read `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, emitted `[START]`, `[STEP]`, `[END]`, and ran 5 fixed tasks.
9
+ 5. **3+ graded tasks**: you actually had 5: `legal_easy`, `legal_medium`, `legal_hard`, `pharma_easy`, `pharma_hard` in [doc_edit_game_v2_environment.py](/Users/sanju/Desktop/coding/python/open-env-meta/attempt1/doc_edit_game_v2/server/doc_edit_game_v2_environment.py:24).
10
+
11
+ The “finally passed after many iterations” part seems to be these fixes:
12
+
13
+ - Switched from trying to pull/run the HF registry image to connecting directly to the live HF Space because the validator could not pull `registry.hf.space` cleanly.
14
+ - Made the Space/Docker path stable.
15
+ - Clamped scores away from exact `0.0` and `1.0` because the validator rejected boundary values.
16
+ - Kept the environment lightweight enough for the 2 vCPU / 8 GB RAM constraint.
17
+ - Framed the task as a real-world document editing environment, which matched the scoring weights: real-world utility, grader quality, environment design.
18
+
19
+ Important timeline: the later [training/FINAL_POSTMORTEM.md](/Users/sanju/Desktop/coding/python/open-env-meta/training/FINAL_POSTMORTEM.md:1) was dated **April 17, 2026**, after the hackathon deadline of **April 8, 2026**. That Qwen SFT + GRPO run proved the idea was trainable, but it was not the thing that made the OpenEnv submission pass.
20
+
21
+ For direction: I agree with your instinct. A pure “applicator model replaces frontier model” story is weaker now because frontier models have strong tool/function calling. OpenAI’s docs describe tool calling as a first-class way for models to call app-defined functions, and they recommend keeping tool sets small and evaluating accuracy as tools scale: [OpenAI function calling docs](https://platform.openai.com/docs/guides/function-calling/parallel-function-calling-and-structured-outputs).
22
+
23
+ So I’d make the next product a **DocEdit Workbench**:
24
+
25
+ 1. **Frontier planner baseline**: frontier model emits compact edit plans/tool calls.
26
+ 2. **Verifier + patch engine**: deterministic tools apply changes and score collateral damage.
27
+ 3. **Small model only where it wins**: train Qwen-sized models to do chunk-level edit localization/parameterization, not broad planning.
28
+ 4. **React app optional but useful**: not as a generic “look, training curves” page, but as a real evaluation cockpit: source vs target vs model output, tool trace, score, collateral damage, cost, latency, and replay.
29
+
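+ To make the “compact patch language” idea concrete, a hypothetical (entirely illustrative) edit-plan format plus deterministic applicator could look like:
+
+ ```python
+ import json
+
+ def apply_plan(doc: str, plan_json: str) -> str:
+     """Deterministically apply a planner-emitted edit plan."""
+     lines = doc.splitlines()
+     for op in json.loads(plan_json):
+         if op["op"] == "replace_line":
+             lines[op["line"]] = op["text"]
+         elif op["op"] == "delete_line":
+             lines[op["line"]] = None   # indices refer to the original doc
+     return "\n".join(l for l in lines if l is not None)
+
+ plan = '[{"op": "replace_line", "line": 0, "text": "# Section 1"}]'
+ print(apply_plan("# Sectoin 1\nbody", plan))  # "# Section 1\nbody"
+ ```
+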
30
+ For the small model, Qwen still makes sense as an experiment. Qwen2.5-1.5B-Instruct explicitly emphasizes structured output / JSON improvements, and Qwen3-1.7B emphasizes agent/tool capability: [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B). But the bar should be: can it beat frontier-tool-calling on **cost, latency, privacy, or batch volume** while staying within acceptable accuracy?
31
+
32
+ My recommendation: build the next repo around **frontier planner + verifiable editor + optional distilled executor**. That is much stronger than betting the whole project on a tiny model being magically better.