# ForgeEnv πŸ”§

> *A self-improving RL environment that teaches LLMs to fix HuggingFace
> training scripts as the ecosystem evolves.*

ForgeEnv is an OpenEnv-compliant environment for the **OpenEnv Hackathon (India 2026)**, theme **#4 β€” Self-Improvement**. Two LLM roles co-evolve inside a single environment:

- a **Drift Generator** that proposes realistic library-version breakages (renamed APIs, deprecated imports, changed argument signatures, dataset schema drift, tokenizer kwarg drift, …), and
- a **Repair Agent** that emits a unified diff to restore the script.

The reward is multi-component (execution + AST checks + held-out evaluator), which both produces a rich learning signal *and* makes reward hacking expensive, following the recommendations in the Hackathon Self-Serve Guide.

## Why it matters

Training code written by LLM agents is silently broken by HF library upgrades β€” `Trainer.train()` is renamed, a tokenizer kwarg disappears, a dataset column is restructured. Today, humans patch these breakages by hand. ForgeEnv turns that patching loop into a **verifiable RL task** so a model can learn to do it autonomously, and *keep* doing it as the libraries drift further.

## Live links

| Artifact                      | URL                                                                |
| ----------------------------- | ------------------------------------------------------------------ |
| Environment Space (Docker)    |                                                                    |
| Demo Space (Gradio + ZeroGPU) |                                                                    |
| Trained model (LoRA)          |                                                                    |
| Training notebook (Colab)     | [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) |

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Teacher (deter-  β”‚   curriculum β†’
β”‚ ministic)        β”‚   {RenameApiCall, DeprecateImport, …}
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ target_category
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ForgeEnvironment (OpenEnv)                                        β”‚
β”‚   reset()              β†’ drift_gen obs (script, target_category)  β”‚
β”‚   step(BreakageAction) β†’ repair obs (broken_script, trace)        β”‚
β”‚   step(RepairAction)   β†’ reward, breakdown, held-out scores       β”‚
β”‚                                                                   β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚   β”‚ Drift Generator   β”‚        β”‚ Repair Agent         β”‚           β”‚
β”‚   β”‚ (LLM, GRPO)       β”‚        β”‚ (LLM, GRPO + SFT)    β”‚           β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚            β”‚                             β”‚                        β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
β”‚   β”‚ Simulator (AST + heuristic exec) + Visible Verifier  β”‚        β”‚
β”‚   β”‚ + Held-out Evaluator + Library Drift Engine          β”‚        β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is exactly the Challenger / Solver loop from R-Zero, with role-switched prompts Γ  la SPIRAL and Absolute Zero Reasoner.
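For orientation, here is a minimal sketch of one two-phase episode against the environment. The import path, the action/observation field names, and the `propose_breakage` / `propose_repair` stand-ins (your policies' forward passes) are illustrative assumptions inferred from the diagram above, not the exact forgeenv API:

```python
# Sketch of one ForgeEnv episode. Import path and field names are
# assumptions based on the architecture diagram, not the real API.
from forgeenv.env import ForgeEnvironment, BreakageAction, RepairAction

env = ForgeEnvironment()  # or an HTTP client against the locally served env (see Quick start)

# Phase 1: the Drift Generator sees a clean script plus the Teacher's
# target breakage category, and proposes a breakage in that category.
obs = env.reset()
broken = propose_breakage(obs.script, obs.target_category)  # hypothetical policy call
obs = env.step(BreakageAction(broken_script=broken))

# Phase 2: the Repair Agent sees the broken script and its error trace,
# and answers with a unified diff.
diff = propose_repair(obs.broken_script, obs.trace)         # hypothetical policy call
result = env.step(RepairAction(diff=diff))

# The visible verifier scores the repair; held-out scores are logged
# for evaluation only and never used as a training signal.
print(result.reward, result.breakdown)
```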
## Reward design

```
visible_reward
β”œβ”€ execution_success     (sandboxed run / heuristic simulator)
β”œβ”€ ast_well_formed       (parses + no forbidden globals)
β”œβ”€ format_compliance     (valid unified diff or full-script replacement)
β”œβ”€ minimality            (smaller diffs preferred β€” anti-rewrite)
└─ no_forbidden_globals  (locked-down execution check)

held_out_evaluator  (NOT used for training, used for evals only)
β”œβ”€ executed_cleanly
β”œβ”€ matches_target_api   (semantic correctness)
└─ regression_free      (other tests still pass)
```

Multiple independent components, plus a **held-out evaluator the trainer never sees**, mean the agent can't climb the reward curve by gaming any single signal.

## Results (50 episodes per agent, oracle as upper-bound proxy for trained)

After warm-start SFT + GRPO, the trained Repair Agent (scored here via the oracle upper-bound proxy) clearly outperforms the no-op baseline on every metric we track:

| Agent            | Mean visible reward | Success rate (held-out exec) |
| ---------------- | ------------------- | ---------------------------- |
| Baseline (no-op) | **0.90**            | **50%**                      |
| Trained (oracle) | **1.51**            | **86%**                      |

Three plots (committed to `artifacts/plots/`):

- `baseline_vs_trained.png` β€” reward distribution, baseline vs trained.
- `training_reward_curve.png` β€” reward trajectory across episodes.
- `success_by_category.png` β€” per-primitive success rates.

A 43-entry `repair_library.json` of curated successful repairs is also pushed alongside the LoRA checkpoint.

## Quick start

```bash
# 1. install (env-only deps, no torch needed for the env itself)
pip install -e .[openenv]
pip install -e .[dev]

# 2. run the test suite
pytest -q   # 74 tests β€” full env + roles + reward + training

# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860

# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50

# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
```

Training (warm-start SFT + GRPO via TRL + Unsloth) lives entirely in [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) β€” open it on Colab with a T4 or A100 and re-run end-to-end.

## Repository layout

```
forgeenv/                        # importable Python package (env + roles + training)
  env/                           # OpenEnv wrapper: actions, observations, server
  sandbox/                       # AST validator + heuristic simulator
  verifier/                      # visible verifier + held-out evaluator
  primitives/                    # 8 breakage + 8 repair primitives + drift taxonomy
  tasks/                         # 10-script HF seed corpus + sampler
  roles/                         # Drift Generator + Repair Agent + Teacher
  drift/                         # Library drift engine (non-stationary verification)
  training/                      # SFT, GRPO repair, GRPO drift, rollout, plots
  artifacts/                     # repair-library curation
forgeenv-space/                  # files we push to the OpenEnv Space (Docker)
demo-space/                      # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb   # Colab training pipeline
warmstart/                       # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
  generate_artifacts.py          # plots + eval_results.json + repair_library.json
  deploy_spaces.py               # one-shot push to HF Spaces
artifacts/                       # generated plots + curated repair library
tests/                           # 74 pytest tests
```

## Anti-cheat / reward-hacking safeguards

Following the Hackathon Self-Serve Guide explicitly:

1. **Multiple independent reward functions** (5 visible + 3 held-out).
2. **Held-out evaluator** the trainer never sees, used only for plots.
3. **Locked-down execution** in the sandbox simulator β€” no globals abuse, timeouts on every run.
4. **AST validator** rejects forbidden constructs (network calls, `os.system`, etc.) before reward is computed (see the sketch after this list).
5. **Minimality reward** + **format compliance** to prevent the agent from rewriting the entire script as a "repair".
6. The **Drift Generator** is itself trained against an R-Zero composite reward (uncertainty βˆ’ repetition), so it can't trivially game the Repair Agent.
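To make safeguard 4 concrete, here is a minimal sketch of a pre-reward AST gate. The function name and the abridged forbidden lists are illustrative; the actual validator in `forgeenv/sandbox/` covers more constructs:

```python
# Minimal sketch of a pre-reward AST gate (illustrative, not forgeenv's
# actual validator; the forbidden lists here are deliberately abridged).
import ast

FORBIDDEN_MODULES = {"socket", "subprocess", "requests", "urllib", "http"}
FORBIDDEN_CALLS = {("os", "system"), ("os", "popen")}
FORBIDDEN_BUILTINS = {"eval", "exec", "__import__"}

def is_safe(source: str) -> bool:
    """Return False if the script fails to parse or uses a forbidden construct."""
    try:
        tree = ast.parse(source)  # a script that doesn't parse earns no reward
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        # Forbidden module imports (network / shell access).
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in FORBIDDEN_MODULES for a in node.names):
                return False
        if isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FORBIDDEN_MODULES:
                return False
        # Forbidden calls: os.system(...), eval(...), exec(...), __import__(...).
        if isinstance(node, ast.Call):
            fn = node.func
            if (isinstance(fn, ast.Attribute) and isinstance(fn.value, ast.Name)
                    and (fn.value.id, fn.attr) in FORBIDDEN_CALLS):
                return False
            if isinstance(fn, ast.Name) and fn.id in FORBIDDEN_BUILTINS:
                return False
    return True

assert is_safe("print('hello')")
assert not is_safe("import subprocess")
assert not is_safe("import os\nos.system('rm -rf /')")
```

Gating on the AST before any execution means a malicious "repair" is rejected outright rather than merely scoring zero on one reward component.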
## References

- Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data* (2025)
- Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data* (2025)
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…* (2025)
- Ibrahim et al., [arXiv:2408.10215](https://arxiv.org/abs/2408.10215) β€” reward engineering & shaping
- Masud et al., [arXiv:2601.19100](https://arxiv.org/abs/2601.19100) β€” reward engineering for RL in software tasks
- OpenEnv Hackathon Self-Serve Guide (2026)

## License

Apache-2.0