# ForgeEnv
|
|
| > *A self-improving RL environment that teaches LLMs to fix HuggingFace |
| > training scripts as the ecosystem evolves.* |
|
|
| ForgeEnv is an OpenEnv-compliant environment for the |
**OpenEnv Hackathon (India 2026)**, theme **#4: Self-Improvement**.
| Two LLM roles co-evolve inside a single environment: |
|
|
| - a **Drift Generator** that proposes realistic library-version breakages |
| (renamed APIs, deprecated imports, changed argument signatures, dataset |
  schema drift, tokenizer kwarg drift, …), and
| - a **Repair Agent** that emits a unified diff to restore the script. |
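
For example, consider a single renamed `TrainingArguments` kwarg and the
minimal diff that repairs it (an illustrative pair, not taken verbatim from
the seed corpus):

```python
# Illustrative drift/repair pair (hypothetical example, not from the seed corpus).
# Drift: transformers deprecates the `evaluation_strategy` kwarg in favour of
# `eval_strategy`, silently breaking older training scripts.
broken = 'args = TrainingArguments(output_dir="out", evaluation_strategy="epoch")'

# The Repair Agent's action: a minimal unified diff that restores the script.
patch = """\
--- a/train.py
+++ b/train.py
@@ -12 +12 @@
-args = TrainingArguments(output_dir="out", evaluation_strategy="epoch")
+args = TrainingArguments(output_dir="out", eval_strategy="epoch")
"""
```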
|
|
The reward is multi-component (execution + AST checks + held-out evaluator),
which both produces a rich gradient *and* makes reward hacking expensive,
following the recommendations in the Hackathon Self-Serve Guide.
|
|
| ## Why it matters |
|
|
LLM agents that write training code are silently broken by HF library
upgrades: `Trainer.train()` gets renamed, a tokenizer kwarg disappears, a
dataset column is restructured. Today, humans patch these breakages by hand.
ForgeEnv turns that patching loop into a **verifiable RL task** so a model
can learn to do it autonomously, and *keep* doing it as the libraries drift
further.
|
|
| ## Live links |
|
|
| | Artifact | URL | |
| | --------------------------- | -------------------------------------------------------------------- | |
| | Environment Space (Docker) | <https://huggingface.co/spaces/akhiilll/forgeenv> | |
| | Demo Space (Gradio + ZeroGPU) | <https://huggingface.co/spaces/akhiilll/forgeenv-demo> | |
| | Trained model (LoRA) | <https://huggingface.co/akhiilll/forgeenv-repair-agent> | |
| | Training notebook (Colab) | [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) | |
|
|
| ## Architecture |
|
|
| ForgeEnv is split into **four deployable artifacts** (two Spaces, one Jobs run, one Model repo): |
|
|
| - **Environment Space**: `akhiilll/forgeenv` (OpenEnv FastAPI server) |
| - **Training run**: Hugging Face Jobs (GPU) runs warm-start SFT + GRPO |
| - **Model repo**: `akhiilll/forgeenv-repair-agent` (LoRA + artifacts) |
| - **Demo Space**: `akhiilll/forgeenv-demo` (Gradio UI) |
|
|
| ### End-to-end (as deployed) |
|
|
```mermaid
flowchart LR
    U[User / Judge] -->|broken script + error trace| D["Demo Space\nakhiilll/forgeenv-demo"]
    D -->|unified diff patch| U

    subgraph TrainOnce["Training (HF Jobs GPU)"]
        J["Training Job\n(SFT + GRPO)"]
        E["Environment Space\nakhiilll/forgeenv"]
        M["Model Repo\nakhiilll/forgeenv-repair-agent"]
        J <-->|reset/step, obs/reward| E
        J -->|push LoRA + artifacts| M
    end

    D -. optional model usage .-> M
```
|
|
| ### Environment Space internals (OpenEnv server β env hub β verifier) |
|
|
```mermaid
flowchart TB
    API["OpenEnv FastAPI server\nforgeenv/env/server.py\n/health + reset + step"] --> ENV["ForgeEnvironment (hub)\nforgeenv/env/forge_environment.py"]

    ENV --> TASKS["Task sampler + seed corpus\nforgeenv/tasks/*"]
    ENV --> ROLES["Roles (prompting + parsing)\nforgeenv/roles/*"]
    ENV --> PRIMS["Primitives (break + repair)\nforgeenv/primitives/*"]
    ENV --> DRIFT["Library drift engine\nforgeenv/drift/library_drift_engine.py"]
    ENV --> VERIFY["Verifiers\nvisible + held-out\nforgeenv/verifier/*"]

    VERIFY --> SANDBOX["Sandbox execution\nAST validator + simulation\nforgeenv/sandbox/*"]
```
|
|
| ### Training pipeline internals (what actually runs today) |
|
|
In the current codebase, the **Repair Agent (Solver)** GRPO loop is fully implemented.
The **Drift Generator (Challenger)** GRPO logic exists as a reward loop + CPU dry-run,
but full "LLM Drift GRPO" is intentionally not wired as a single-GPU training path yet.
|
|
```mermaid
flowchart TB
    SETUP["Install deps\n(torch/trl/unsloth/openenv…)"] --> SFT["SFT warmstart\nformat + basics"]
    SFT --> SAVE1[Save SFT adapter]
    SAVE1 --> GRPO_REPAIR["GRPO Repair Agent (Solver)\nforgeenv/training/grpo_repair.py"]
    GRPO_REPAIR <-->|episodes + rewards| ENVSPACE["Env Space\nakhiilll/forgeenv"]
    GRPO_REPAIR --> PUSH["Upload\nadapter + tokenizer + plots + repair_library"]
    PUSH --> HUB["Model Repo\nakhiilll/forgeenv-repair-agent"]
```
|
|
| ### Target architecture (two-role co-evolution: Challenger/Solver) |
|
|
This is the **intended** architecture, in the style of R-Zero / SPIRAL self-play:
|
|
```mermaid
flowchart TB
    SFT2[SFT warmstart] --> CH["GRPO Drift Generator (Challenger)"]
    CH --> FILTER["Filter/select breakages\nusing p_hat from multiple solver attempts"]
    FILTER --> SOLVER["GRPO Repair Agent (Solver)"]
    SOLVER --> CH
```
|
|
The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is the core
Challenger/Solver loop: generate a hard breakage → attempt a repair → score it.
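
A sketch of one such episode over the environment's HTTP surface (the
`/reset` and `/step` endpoints come from the server diagram above; the
payload field names and the two LLM helpers are assumptions):

```python
# One Challenger/Solver episode (payload field names and the *_llm helpers
# are hypothetical; endpoints match the OpenEnv server: /reset, /step).
import requests

BASE = "http://localhost:7860"  # or the hosted Environment Space

obs = requests.post(f"{BASE}/reset", json={}).json()                # seed script
drift = challenger_llm(obs)                                         # hypothetical helper
obs = requests.post(f"{BASE}/step", json={"action": drift}).json()  # Phase 1: break it
patch = solver_llm(obs)                                             # hypothetical helper
out = requests.post(f"{BASE}/step", json={"action": patch}).json()  # Phase 2: repair + score
print(out.get("reward"))
```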
|
|
| ## Reward design |
|
|
| ``` |
| visible_reward |
| ββ execution_success (sandboxed run / heuristic simulator) |
| ββ ast_well_formed (parses + no forbidden globals) |
| ββ format_compliance (valid unified diff or full-script replacement) |
| ββ minimality (smaller diffs preferred β anti-rewrite) |
| ββ no_forbidden_globals (locked-down execution check) |
| |
| held_out_evaluator (NOT used for training, used for evals only) |
| ββ executed_cleanly |
| ββ matches_target_api (semantic correctness) |
| ββ regression_free (other tests still pass) |
| ``` |
|
|
| Multiple independent components, plus a **held-out evaluator the trainer |
| never sees**, so the agent can't game its way to the top of the curve. |
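
A sketch of how the visible components might combine (the weights are
illustrative assumptions, not the shipped values in `forgeenv/verifier/`):

```python
# Illustrative aggregation of the five visible components; the weights are
# assumptions, the shipped verifier lives in forgeenv/verifier/.
def visible_reward(executed: bool, ast_ok: bool, valid_diff: bool,
                   diff_lines: int, script_lines: int, sandbox_clean: bool) -> float:
    minimality = max(0.0, 1.0 - diff_lines / max(script_lines, 1))
    return (1.0 * executed        # sandboxed run / heuristic simulator
          + 0.5 * ast_ok          # parses + no forbidden globals
          + 0.5 * valid_diff      # unified diff or full-script replacement
          + 0.5 * minimality      # smaller diffs preferred
          + 0.5 * sandbox_clean)  # locked-down execution check
```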
|
|
| ## Results (50 episodes / agent, oracle as upper-bound proxy for trained) |
|
|
After warm-start SFT + GRPO, the Repair Agent (scored here via the oracle
upper-bound proxy) beats the no-op baseline on every metric we track:
|
|
| Agent            | Mean visible reward | Success rate (held-out exec) |
| ---------------- | ------------------- | ---------------------------- |
| Baseline (no-op) | **0.90**            | **50%**                      |
| Trained (oracle) | **1.51**            | **86%**                      |
|
|
| Three plots (committed to `artifacts/plots/`): |
|
|
- `baseline_vs_trained.png` – reward distribution, baseline vs trained.
- `training_reward_curve.png` – reward trajectory across episodes.
- `success_by_category.png` – per-primitive success rates.
|
|
| A 43-entry `repair_library.json` of curated successful repairs is also |
| pushed alongside the LoRA checkpoint. |
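
For reference, the shape of one entry might look like this (all field names
are assumptions; the shipped file is `repair_library.json` in the model repo):

```python
# Hypothetical repair_library.json entry; every field name is an assumption.
entry = {
    "category": "renamed_kwarg",
    "library": "transformers",
    "broken": 'TrainingArguments(evaluation_strategy="epoch")',
    "patch": "--- a/train.py\n+++ b/train.py\n@@ ... @@\n-...\n+...",
    "visible_reward": 2.5,  # illustrative score
}
```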
|
|
| ## Quick start |
|
|
| ```bash |
| # 1. install (env-only deps, no torch needed for the env itself) |
| pip install -e .[openenv] |
| pip install -e .[dev] |
| |
| # 2. run the test suite |
pytest -q   # 74 tests: full env + roles + reward + training
| |
| # 3. spin up the environment locally |
| uvicorn forgeenv.env.server:app --port 7860 |
| |
| # 4. generate the demo artifacts (plots + repair_library.json + eval JSON) |
| python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50 |
| |
| # 5. push to HF Spaces |
| export HF_TOKEN=hf_... |
| python scripts/deploy_spaces.py --user akhiilll |
| ``` |
|
|
| Training can run via: |
| - **HF Jobs GPU**: `scripts/jobs/train_repair_agent.py` (what we used for the successful run) |
| - **Notebook**: [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) (useful for iteration) |
|
|
| ## Repository layout |
|
|
| ``` |
| forgeenv/ # importable Python package (env + roles + training) |
| env/ # OpenEnv wrapper: actions, observations, server |
| sandbox/ # AST validator + heuristic simulator |
| verifier/ # visible verifier + held-out evaluator |
| primitives/ # 8 breakage + 8 repair primitives + drift taxonomy |
| tasks/ # 10-script HF seed corpus + sampler |
| roles/ # Drift Generator + Repair Agent + Teacher |
| drift/ # Library drift engine (non-stationary verification) |
| training/ # SFT, GRPO repair, GRPO drift, rollout, plots |
| artifacts/ # repair-library curation |
| forgeenv-space/ # files we push to the OpenEnv Space (Docker) |
| demo-space/ # files we push to the Gradio demo Space |
| notebooks/forgeenv_train.ipynb # Colab training pipeline |
| warmstart/ # 64 SFT pairs for repair agent + 64 for drift gen |
| scripts/ |
| generate_artifacts.py # plots + eval_results.json + repair_library.json |
| deploy_spaces.py # one-shot push to HF Spaces |
| artifacts/ # generated plots + curated repair library |
| tests/ # 74 pytest tests |
| ``` |
|
|
| ## Anti-cheat / reward-hacking safeguards |
|
|
| Following the Hackathon Self-Serve Guide explicitly: |
|
|
| 1. **Multiple independent reward functions** (5 visible + 3 held-out). |
| 2. **Held-out evaluator** the trainer never sees, used only for plots. |
3. **Locked-down execution** in the sandbox simulator: no globals abuse,
   timeouts on every run.
4. **AST validator** rejects forbidden constructs (network calls, `os.system`,
   etc.) before reward is computed (sketched after this list).
| 5. **Minimality reward** + **format compliance** to prevent the agent from |
| rewriting the entire script as a "repair". |
6. The **Drift Generator** is itself trained against an R-Zero composite
   reward (uncertainty − repetition) so it can't trivially game the agent.
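
A sketch of the forbidden-construct check from point 4 (the function name and
forbidden sets are illustrative; the shipped validator lives under
`forgeenv/sandbox/`):

```python
# Illustrative AST gate; the forbidden sets and the is_safe name are assumptions.
import ast

FORBIDDEN_NAMES = {"eval", "exec", "__import__"}
FORBIDDEN_ATTRS = {("os", "system"), ("socket", "socket"), ("subprocess", "Popen")}

def is_safe(source: str) -> bool:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            fn = node.func
            if isinstance(fn, ast.Name) and fn.id in FORBIDDEN_NAMES:
                return False
            if (isinstance(fn, ast.Attribute)
                    and isinstance(fn.value, ast.Name)
                    and (fn.value.id, fn.attr) in FORBIDDEN_ATTRS):
                return False
    return True
```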
|
|
| ## References |
|
|
| - Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data* (2025) |
| - Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data* (2025) |
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…* (2025)
- Ibrahim et al., [arXiv:2408.10215](https://arxiv.org/abs/2408.10215) – Reward engineering & shaping
- Masud et al., [arXiv:2601.19100](https://arxiv.org/abs/2601.19100) – Reward engineering for RL in software tasks
| - OpenEnv Hackathon Self-Serve Guide (2026) |
|
|
| ## License |
|
|
| Apache-2.0 |
|
|