# ForgeEnv
> *A self-improving RL environment that teaches LLMs to fix HuggingFace
> training scripts as the ecosystem evolves.*
ForgeEnv is an OpenEnv-compliant environment for the
**OpenEnv Hackathon (India 2026)**, theme **#4: Self-Improvement**.
Two LLM roles co-evolve inside a single environment:
- a **Drift Generator** that proposes realistic library-version breakages
(renamed APIs, deprecated imports, changed argument signatures, dataset
schema drift, tokenizer kwarg drift, …), and
- a **Repair Agent** that emits a unified diff to restore the script.
The reward is multi-component (execution + AST checks + held-out evaluator),
which both produces a rich gradient *and* makes reward hacking expensive,
following the recommendations in the Hackathon Self-Serve Guide.
## Why it matters
LLM agents that write training code today are silently broken by HF library
upgrades: `Trainer.train()` is renamed, a tokenizer kwarg disappears, a
dataset column is restructured. Today, humans patch these. ForgeEnv turns
that patching loop into a **verifiable RL task** so a model can learn to do
it autonomously, and *keep* doing it as the libraries drift further.
## Live links
| Artifact | URL |
| --------------------------- | -------------------------------------------------------------------- |
| Environment Space (Docker) | <https://huggingface.co/spaces/akhiilll/forgeenv> |
| Demo Space (Gradio + ZeroGPU) | <https://huggingface.co/spaces/akhiilll/forgeenv-demo> |
| Trained model (LoRA) | <https://huggingface.co/akhiilll/forgeenv-repair-agent> |
| Training notebook (Colab) | [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) |
## Architecture
```
┌───────────────────┐
│ Teacher           │  curriculum →
│ (deterministic)   │  {RenameApiCall, DeprecateImport, …}
└───────────────────┘
          │ target_category
          ▼
┌──────────────────────────────────────────────────────────────────┐
│                    ForgeEnvironment (OpenEnv)                    │
│  reset()               → drift_gen obs (script, target_category) │
│  step(BreakageAction)  → repair obs (broken_script, trace)       │
│  step(RepairAction)    → reward, breakdown, held-out scores      │
│                                                                  │
│  ┌───────────────────┐        ┌──────────────────────┐           │
│  │ Drift Generator   │        │ Repair Agent         │           │
│  │ (LLM, GRPO)       │        │ (LLM, GRPO + SFT)    │           │
│  └───────────────────┘        └──────────────────────┘           │
│                                                                  │
│  ┌─────────────────────────────────────────────────────┐         │
│  │ Simulator (AST + heuristic exec) + Visible Verifier │         │
│  │ + Held-out Evaluator + Library Drift Engine         │         │
│  └─────────────────────────────────────────────────────┘         │
└──────────────────────────────────────────────────────────────────┘
```
The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is exactly
the Challenger / Solver loop from R-Zero, with role-switched prompts à la
SPIRAL and Absolute Zero Reasoner.
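A full episode, sketched in Python against the names in the diagram. The import paths, attribute names, and the hard-coded breakage are illustrative assumptions, not the published API; see `forgeenv/env/` for the actual classes.

```python
# Minimal episode sketch. Class names mirror the diagram above; the exact
# import paths and attribute names are assumptions, not the published API.
from forgeenv.env import ForgeEnvironment, BreakageAction, RepairAction

env = ForgeEnvironment()

# Phase 1 (drift): reset() hands the Drift Generator a clean script plus
# the Teacher's target category. A hard-coded rename stands in here for
# the LLM's proposal.
obs = env.reset()
broken = obs.script.replace("evaluation_strategy", "eval_strategy")
repair_obs = env.step(BreakageAction(broken_script=broken))

# Phase 2 (repair): the Repair Agent would read repair_obs.broken_script
# and repair_obs.trace, then answer with a unified diff. An empty diff is
# the no-op baseline from the results table below.
result = env.step(RepairAction(diff=""))
print(result.reward, result.breakdown)
```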
## Reward design
```
visible_reward
├── execution_success      (sandboxed run / heuristic simulator)
├── ast_well_formed        (parses + no forbidden globals)
├── format_compliance      (valid unified diff or full-script replacement)
├── minimality             (smaller diffs preferred, anti-rewrite)
└── no_forbidden_globals   (locked-down execution check)

held_out_evaluator (NOT used for training, used for evals only)
├── executed_cleanly
├── matches_target_api     (semantic correctness)
└── regression_free        (other tests still pass)
```
Multiple independent components, plus a **held-out evaluator the trainer
never sees**, so the agent can't game its way to the top of the curve.
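As a concrete illustration of how the five visible components could be folded into a scalar, here is a minimal sketch assuming each component is scored in [0, 1]; the weights are placeholders, not the shipped configuration.

```python
# Illustrative aggregation only: component names come from the tree above,
# but these weights are placeholders, not the trained configuration.
VISIBLE_WEIGHTS = {
    "execution_success":    0.4,
    "ast_well_formed":      0.2,
    "format_compliance":    0.2,
    "minimality":           0.1,
    "no_forbidden_globals": 0.1,
}

def visible_reward(components: dict[str, float]) -> float:
    """Fold per-component scores in [0, 1] into one scalar; missing -> 0."""
    return sum(w * components.get(name, 0.0) for name, w in VISIBLE_WEIGHTS.items())
```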
## Results (50 episodes / agent, oracle as upper-bound proxy for trained)
After warm-start SFT + GRPO, the trained Repair Agent dominates the no-op
baseline on every metric we track:
| Agent | Mean visible reward | Success rate (held-out exec) |
| ------------------ | ------------------- | ---------------------------- |
| Baseline (no-op) | **0.90** | **50 %** |
| Trained (oracle) | **1.51** | **86 %** |
Three plots (committed to `artifacts/plots/`):
- `baseline_vs_trained.png`: reward distribution, baseline vs. trained.
- `training_reward_curve.png`: reward trajectory across episodes.
- `success_by_category.png`: per-primitive success rates.
A 43-entry `repair_library.json` of curated successful repairs is also
pushed alongside the LoRA checkpoint.
## Quick start
```bash
# 1. install (env-only deps, no torch needed for the env itself)
pip install -e .[openenv]
pip install -e .[dev]
# 2. run the test suite
pytest -q  # 74 tests: full env + roles + reward + training
# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860
# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50
# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
```
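With the server from step 3 running, you can exercise an episode over HTTP. The sketch below assumes JSON `/reset` and `/step` routes and payload keys matching the observation fields above; these are assumptions, so check `forgeenv/env/server.py` for the actual endpoints.

```python
import requests

BASE = "http://localhost:7860"

# Phase 1: start an episode; the response should carry the clean script
# and the Teacher's target drift category. (Route and key names assumed.)
obs = requests.post(f"{BASE}/reset").json()

# Submit a breakage, then a no-op repair (empty diff = baseline agent).
obs = requests.post(f"{BASE}/step", json={"broken_script": obs["script"]}).json()
result = requests.post(f"{BASE}/step", json={"diff": ""}).json()
print(result["reward"], result["breakdown"])
```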
Training (warm-start SFT + GRPO via TRL + Unsloth) lives entirely in
[`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb); open
it on Colab with a T4 or A100 and re-run it end-to-end.
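The GRPO stage in that notebook boils down to plugging the environment's visible reward into TRL's `GRPOTrainer`. A condensed sketch follows, where `score_repair` (one env round-trip: apply the diff, run the verifier, return the visible reward) and `broken_script_prompts` are hypothetical stand-ins for code defined in the notebook:

```python
# Condensed GRPO stage. score_repair and broken_script_prompts are
# hypothetical stand-ins for code defined in the notebook.
from trl import GRPOConfig, GRPOTrainer

def repair_reward(prompts, completions, **kwargs):
    # One env round-trip per completion: apply the diff, run the visible
    # verifier, return the scalar visible reward.
    return [score_repair(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model="akhiilll/forgeenv-repair-agent",  # the warm-started SFT checkpoint
    reward_funcs=repair_reward,
    args=GRPOConfig(output_dir="grpo-repair", num_generations=8),
    train_dataset=broken_script_prompts,     # prompts built from drifted scripts
)
trainer.train()
```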
## Repository layout
```
forgeenv/ # importable Python package (env + roles + training)
env/ # OpenEnv wrapper: actions, observations, server
sandbox/ # AST validator + heuristic simulator
verifier/ # visible verifier + held-out evaluator
primitives/ # 8 breakage + 8 repair primitives + drift taxonomy
tasks/ # 10-script HF seed corpus + sampler
roles/ # Drift Generator + Repair Agent + Teacher
drift/ # Library drift engine (non-stationary verification)
training/ # SFT, GRPO repair, GRPO drift, rollout, plots
artifacts/ # repair-library curation
forgeenv-space/ # files we push to the OpenEnv Space (Docker)
demo-space/ # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb # Colab training pipeline
warmstart/ # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
generate_artifacts.py # plots + eval_results.json + repair_library.json
deploy_spaces.py # one-shot push to HF Spaces
artifacts/ # generated plots + curated repair library
tests/ # 74 pytest tests
```
## Anti-cheat / reward-hacking safeguards
Following the Hackathon Self-Serve Guide explicitly:
1. **Multiple independent reward functions** (5 visible + 3 held-out).
2. **Held-out evaluator** the trainer never sees, used only for plots.
3. **Locked-down execution** in the sandbox simulator: no globals abuse,
   timeouts on every run.
4. **AST validator** rejects forbidden constructs (network calls, `os.system`,
etc.) before reward is computed.
5. **Minimality reward** + **format compliance** to prevent the agent from
   rewriting the entire script as a "repair" (see the sketch after this list).
6. The **Drift Generator** is itself trained against an R-Zero-style composite
   reward (uncertainty reward minus a repetition penalty), so it can't
   trivially game the agent.
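For instance, the minimality component (safeguard 5) can be as simple as an exponential penalty on how much of the script a diff touches. A minimal sketch, where the 0.2 decay scale is an illustrative choice rather than the shipped constant:

```python
import math

def minimality_score(diff_text: str, script_text: str) -> float:
    """Score in (0, 1]: 1.0 for an empty diff, decaying as the diff grows."""
    changed = sum(
        1 for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    total = max(len(script_text.splitlines()), 1)
    # Rewriting ~20% of the script already costs a factor of 1/e.
    return math.exp(-changed / (0.2 * total))
```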
## References
- Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data* (2025)
- Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data* (2025)
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…* (2025)
- Ibrahim et al., [arXiv:2408.10215](https://arxiv.org/abs/2408.10215): reward engineering & shaping
- Masud et al., [arXiv:2601.19100](https://arxiv.org/abs/2601.19100): reward engineering for RL in software tasks
- OpenEnv Hackathon Self-Serve Guide (2026)
## License
Apache-2.0