# ForgeEnv 🔧

> *A self-improving RL environment that teaches LLMs to fix HuggingFace
> training scripts as the ecosystem evolves.*

ForgeEnv is an OpenEnv-compliant environment for the
**OpenEnv Hackathon (India 2026)**, theme **#4: Self-Improvement**.
Two LLM roles co-evolve inside a single environment:

- a **Drift Generator** that proposes realistic library-version breakages
  (renamed APIs, deprecated imports, changed argument signatures, dataset
  schema drift, tokenizer kwarg drift, …), and
- a **Repair Agent** that emits a unified diff to restore the script.

Scoring is multi-component (execution + AST checks, plus a held-out
evaluator reserved for evaluation), which both produces a rich gradient
*and* makes reward hacking expensive, following the recommendations in the
Hackathon Self-Serve Guide.

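To make "emits a unified diff" concrete, here is a minimal sketch of what such a patch can look like, built with Python's standard `difflib`. The script contents, the `train.py` path, and the helper name `make_unified_diff` are assumptions for illustration, not ForgeEnv code. The `evaluation_strategy` to `eval_strategy` keyword rename is used only as an example of the kind of drift described above.

```python
import difflib

# Hypothetical "broken" script: it still uses a keyword that a newer
# library release renamed. The code lives in a string and is never executed.
broken = '''from transformers import TrainingArguments

args = TrainingArguments(output_dir="out", evaluation_strategy="epoch")
'''

# Hypothetical repaired script: identical except for the renamed keyword.
repaired = broken.replace("evaluation_strategy=", "eval_strategy=")


def make_unified_diff(before: str, after: str, path: str = "train.py") -> str:
    """Return a unified diff that turns `before` into `after`."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


patch = make_unified_diff(broken, repaired)
print(patch)
```

A one-line change like this is also what a minimality term would favor: the diff touches only the stale keyword and leaves the rest of the script alone.
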
## Why it matters

LLM agents that write training code today are silently broken by HF library
upgrades: a method such as `Trainer.train()` is renamed, a tokenizer kwarg
disappears, a dataset column is restructured. Today, humans patch these
breakages. ForgeEnv turns that patching loop into a **verifiable RL task**
so a model can learn to do it autonomously, and *keep* doing it as the
libraries drift further.

## Live links

| Artifact                      | URL                                                                  |
| ----------------------------- | -------------------------------------------------------------------- |
| Environment Space (Docker)    | <https://huggingface.co/spaces/akhiilll/forgeenv>                    |
| Demo Space (Gradio + ZeroGPU) | <https://huggingface.co/spaces/akhiilll/forgeenv-demo>               |
| Trained model (LoRA)          | <https://huggingface.co/akhiilll/forgeenv-repair-agent>              |
| Training notebook (Colab)     | [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb)   |

## Architecture

```
┌──────────────────┐
│ Teacher (deter-  │   curriculum →
│ ministic)        │   {RenameApiCall, DeprecateImport, …}
└──────────────────┘
         │ target_category
         ▼
┌────────────────────────────────────────────────────────────────┐
│ ForgeEnvironment (OpenEnv)                                     │
│ reset() → drift_gen obs (script, target_category)              │
│ step(BreakageAction) → repair obs (broken_script, trace)       │
│ step(RepairAction) → reward, breakdown, held-out scores        │
│                                                                │
│ ┌───────────────────┐ ┌──────────────────────┐                 │
│ │ Drift Generator   │ │ Repair Agent         │                 │
│ │ (LLM, GRPO)       │ │ (LLM, GRPO + SFT)    │                 │
│ └───────────────────┘ └──────────────────────┘                 │
│                                                                │
│ ┌───────────────────────────────────────────────────────┐      │
│ │ Simulator (AST + heuristic exec) + Visible Verifier   │      │
│ │ + Held-out Evaluator + Library Drift Engine           │      │
│ └───────────────────────────────────────────────────────┘      │
└────────────────────────────────────────────────────────────────┘
```

The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is exactly
the Challenger / Solver loop from R-Zero, with role-switched prompts à la
SPIRAL and Absolute Zero Reasoner.

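The two-phase flow can be sketched with a toy environment. Everything below is a hypothetical stand-in: `ToyForgeEnv`, the dataclass fields, the observation keys, and the exact-match reward are assumptions chosen to mirror the `reset()` / `step(BreakageAction)` / `step(RepairAction)` sequence in the diagram, not the real `ForgeEnvironment` API. For brevity the toy repair action carries a full script rather than a unified diff.

```python
from dataclasses import dataclass


@dataclass
class BreakageAction:
    """Phase 1 action (hypothetical fields)."""
    category: str
    broken_script: str


@dataclass
class RepairAction:
    """Phase 2 action (hypothetical fields)."""
    repaired_script: str


class ToyForgeEnv:
    """Toy stand-in for the two-phase flow; NOT the real ForgeEnvironment."""

    def __init__(self, seed_script: str) -> None:
        self.seed_script = seed_script
        self.phase = "idle"

    def reset(self, target_category: str) -> dict:
        # Phase 1 starts: the drift generator sees the working script.
        self.phase = "drift"
        return {"role": "drift_gen", "script": self.seed_script,
                "target_category": target_category}

    def step(self, action) -> dict:
        if self.phase == "drift" and isinstance(action, BreakageAction):
            # Phase 2 starts: the repair agent sees the broken script.
            self.phase = "repair"
            return {"role": "repair", "broken_script": action.broken_script,
                    "trace": f"simulated failure after {action.category}",
                    "done": False}
        if self.phase == "repair" and isinstance(action, RepairAction):
            self.phase = "done"
            # Toy reward: exact match with the working seed script.
            reward = 1.0 if action.repaired_script == self.seed_script else 0.0
            return {"reward": reward, "done": True}
        raise RuntimeError(
            f"unexpected {type(action).__name__} in phase {self.phase!r}")


def run_episode(env: ToyForgeEnv) -> dict:
    obs = env.reset(target_category="RenameApiCall")
    # Scripted "Drift Generator": swap in a stale call name.
    broken = obs["script"].replace("load_dataset(", "load_data(")
    obs = env.step(BreakageAction(obs["target_category"], broken))
    # Scripted "Repair Agent": undo the rename it sees in the broken script.
    fixed = obs["broken_script"].replace("load_data(", "load_dataset(")
    return env.step(RepairAction(fixed))


result = run_episode(ToyForgeEnv('ds = load_dataset("imdb")\n'))
print(result)
```

In the real environment both scripted roles would be LLM policies, and the final step would return the reward breakdown and held-out scores shown in the diagram.
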
## Reward design

```
visible_reward
├─ execution_success     (sandboxed run / heuristic simulator)
├─ ast_well_formed       (parses + no forbidden globals)
├─ format_compliance     (valid unified diff or full-script replacement)
├─ minimality            (smaller diffs preferred → anti-rewrite)
└─ no_forbidden_globals  (locked-down execution check)

held_out_evaluator       (NOT used for training, used for evals only)
├─ executed_cleanly
├─ matches_target_api    (semantic correctness)
└─ regression_free       (other tests still pass)
```

Multiple independent components, plus a **held-out evaluator the trainer
never sees**, so the agent can't game its way to the top of the curve.

## Results (50 episodes / agent, oracle as upper-bound proxy for trained)

With the oracle standing in as an upper-bound proxy for the Repair Agent
after warm-start SFT + GRPO, the trained configuration beats the no-op
baseline on both metrics we track:

| Agent              | Mean visible reward | Success rate (held-out exec) |
| ------------------ | ------------------- | ---------------------------- |
| Baseline (no-op)   | **0.90**            | **50 %**                     |
| Trained (oracle)   | **1.51**            | **86 %**                     |

Three plots (committed to `artifacts/plots/`):

- `baseline_vs_trained.png`: reward distribution, baseline vs trained.
- `training_reward_curve.png`: reward trajectory across episodes.
- `success_by_category.png`: per-primitive success rates.

A 43-entry `repair_library.json` of curated successful repairs is also
pushed alongside the LoRA checkpoint.

## Quick start

```bash
# 1. install (env-only deps, no torch needed for the env itself)
#    extras are quoted so shells such as zsh do not expand the brackets
pip install -e ".[openenv]"
pip install -e ".[dev]"

# 2. run the test suite
pytest -q  # 74 tests: full env + roles + reward + training

# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860

# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50

# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
```

Training (warm-start SFT + GRPO via TRL + Unsloth) lives entirely in
[`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb). Open
it on Colab with a T4 or A100 and re-run end-to-end.

## Repository layout

```
forgeenv/                        # importable Python package (env + roles + training)
  env/                           # OpenEnv wrapper: actions, observations, server
  sandbox/                       # AST validator + heuristic simulator
  verifier/                      # visible verifier + held-out evaluator
  primitives/                    # 8 breakage + 8 repair primitives + drift taxonomy
  tasks/                         # 10-script HF seed corpus + sampler
  roles/                         # Drift Generator + Repair Agent + Teacher
  drift/                         # Library drift engine (non-stationary verification)
  training/                      # SFT, GRPO repair, GRPO drift, rollout, plots
  artifacts/                     # repair-library curation
forgeenv-space/                  # files we push to the OpenEnv Space (Docker)
demo-space/                      # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb   # Colab training pipeline
warmstart/                       # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
  generate_artifacts.py          # plots + eval_results.json + repair_library.json
  deploy_spaces.py               # one-shot push to HF Spaces
artifacts/                       # generated plots + curated repair library
tests/                           # 74 pytest tests
```

## Anti-cheat / reward-hacking safeguards

Following the Hackathon Self-Serve Guide explicitly:

1. **Multiple independent reward functions** (5 visible + 3 held-out).
2. **Held-out evaluator** the trainer never sees, used only for evaluation
   and plots.
3. **Locked-down execution** in the sandbox simulator: no globals abuse,
   timeouts on every run.
4. **AST validator** rejects forbidden constructs (network calls, `os.system`,
   etc.) before reward is computed.
5. **Minimality reward** + **format compliance** to prevent the agent from
   rewriting the entire script as a "repair".
6. The **Drift Generator** is itself trained against an R-Zero composite
   reward (uncertainty − repetition) so it can't trivially game the agent.

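As an illustration of safeguard 4, the sketch below shows how a static check with Python's `ast` module can flag forbidden constructs before any code runs. The deny-lists and the function name `find_violations` are assumptions for this example and are not taken from the ForgeEnv sandbox.

```python
import ast
from typing import List

# Hypothetical deny-lists; the real validator's rules may differ.
FORBIDDEN_MODULES = {"socket", "requests", "urllib", "subprocess"}
FORBIDDEN_ATTR_CALLS = {("os", "system"), ("os", "popen")}
FORBIDDEN_BUILTINS = {"eval", "exec"}


def find_violations(source: str) -> List[str]:
    """Return human-readable violations; an empty list means the script passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_MODULES:
                    violations.append(f"forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] in FORBIDDEN_MODULES:
                violations.append(f"forbidden import: {module}")
        elif isinstance(node, ast.Call):
            func = node.func
            if (isinstance(func, ast.Attribute)
                    and isinstance(func.value, ast.Name)
                    and (func.value.id, func.attr) in FORBIDDEN_ATTR_CALLS):
                violations.append(f"forbidden call: {func.value.id}.{func.attr}")
            elif isinstance(func, ast.Name) and func.id in FORBIDDEN_BUILTINS:
                violations.append(f"forbidden call: {func.id}")
    return violations


print(find_violations("import os\nos.system('ls')\n"))
print(find_violations("x = 1\n"))
```

A purely syntactic check like this misses aliased calls (for example `from os import system`), which is one reason to pair it with locked-down execution and timeouts rather than rely on it alone.
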
## References

- Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data* (2025)
- Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data* (2025)
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…* (2025)
- Ibrahim et al., [arXiv:2408.10215](https://arxiv.org/abs/2408.10215): Reward engineering & shaping
- Masud et al., [arXiv:2601.19100](https://arxiv.org/abs/2601.19100): Reward engineering for RL in software tasks
- OpenEnv Hackathon Self-Serve Guide (2026)

## License

Apache-2.0