# ForgeEnv πŸ”§

> *A self-improving RL environment that teaches LLMs to fix HuggingFace
> training scripts as the ecosystem evolves.*

ForgeEnv is an OpenEnv-compliant environment for the
**OpenEnv Hackathon (India 2026)**, theme **#4 β€” Self-Improvement**.
Two LLM roles co-evolve inside a single environment:

- a **Drift Generator** that proposes realistic library-version breakages
  (renamed APIs, deprecated imports, changed argument signatures, dataset
  schema drift, tokenizer kwarg drift, …), and
- a **Repair Agent** that emits a unified diff to restore the script (see
  the example below).
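
For example, a repair for the real-world rename of `evaluation_strategy` to
`eval_strategy` in `transformers.TrainingArguments` might look like the
following; the `RepairAction(diff=...)` field name is illustrative (see
`forgeenv/env` for the real schema):

```python
# Hypothetical RepairAction payload: a minimal unified diff that adapts a
# script to the real evaluation_strategy -> eval_strategy rename.
RepairAction(diff="""\
--- a/train.py
+++ b/train.py
@@ -1 +1 @@
-    evaluation_strategy="epoch",
+    eval_strategy="epoch",
""")
```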

The reward is multi-component (execution + AST checks, plus a held-out
evaluator used only for evals), which both produces a rich gradient *and*
makes reward hacking expensive, following the recommendations in the
Hackathon Self-Serve Guide.

## Why it matters

LLM agents that write training code are silently broken by HF library
upgrades β€” a `Trainer` method is renamed, a tokenizer kwarg disappears, a
dataset column is restructured. Today, humans patch these breakages.
ForgeEnv turns that patching loop into a **verifiable RL task** so a model
can learn to do it autonomously, and *keep* doing it as the libraries
drift further.

## Live links

| Artifact                    | URL                                                                  |
| --------------------------- | -------------------------------------------------------------------- |
| Environment Space (Docker)  | <https://huggingface.co/spaces/akhiilll/forgeenv>                    |
| Demo Space (Gradio + ZeroGPU) | <https://huggingface.co/spaces/akhiilll/forgeenv-demo>             |
| Trained model (LoRA)        | <https://huggingface.co/akhiilll/forgeenv-repair-agent>              |
| Training notebook (Colab)   | [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb)   |

## Architecture

```
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚  Teacher (deter- β”‚     curriculum β†’
                 β”‚  ministic)       β”‚     {RenameApiCall, DeprecateImport, …}
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ target_category
                          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ForgeEnvironment (OpenEnv)                                      β”‚
β”‚   reset()  β†’  drift_gen obs (script, target_category)           β”‚
β”‚   step(BreakageAction)  β†’  repair obs (broken_script, trace)    β”‚
β”‚   step(RepairAction)    β†’  reward, breakdown, held-out scores   β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚   β”‚ Drift Generator   β”‚    β”‚ Repair Agent         β”‚            β”‚
β”‚   β”‚ (LLM, GRPO)       β”‚    β”‚ (LLM, GRPO + SFT)    β”‚            β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ Simulator (AST + heuristic exec) + Visible Verifier   β”‚    β”‚
β”‚   β”‚ + Held-out Evaluator + Library Drift Engine            β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is exactly
the Challenger / Solver loop from R-Zero, with role-switched prompts Γ  la
SPIRAL and Absolute Zero Reasoner.
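
In code, one episode looks roughly like the following sketch:
`drift_policy` and `repair_policy` are hypothetical callables wrapping the
two LLM roles, and the observation/result fields follow the diagram above
(see `forgeenv/env` for the real API):

```python
# Two-phase episode sketch. drift_policy / repair_policy are hypothetical
# wrappers that turn observations into BreakageAction / RepairAction.
from forgeenv.env import ForgeEnvironment

env = ForgeEnvironment()

obs = env.reset()                      # Phase 1 obs: script + target_category
obs = env.step(drift_policy(obs))      # BreakageAction -> broken_script + trace
result = env.step(repair_policy(obs))  # RepairAction -> reward + breakdown
print(result.reward, result.breakdown)
```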

## Reward design

```
visible_reward
 β”œβ”€ execution_success        (sandboxed run / heuristic simulator)
 β”œβ”€ ast_well_formed          (parses + no forbidden globals)
 β”œβ”€ format_compliance        (valid unified diff or full-script replacement)
 β”œβ”€ minimality               (smaller diffs preferred β€” anti-rewrite)
 └─ no_forbidden_globals     (locked-down execution check)

held_out_evaluator (NOT used for training, used for evals only)
 β”œβ”€ executed_cleanly
 β”œβ”€ matches_target_api       (semantic correctness)
 └─ regression_free          (other tests still pass)
```

Multiple independent components, plus a **held-out evaluator the trainer
never sees**, make it expensive for the agent to game its way to a high
score.
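
As a sketch, the visible aggregation might look like this; the equal
weights and the helper names (`runs_in_sandbox`, `parses_cleanly`,
`is_valid_unified_diff`, `uses_forbidden_globals`) are hypothetical
stand-ins for the logic in `forgeenv/verifier`:

```python
# Hypothetical aggregation of the five visible components above; the
# helpers named here stand in for forgeenv/verifier internals.
def visible_reward(repaired: str, diff: str) -> float:
    components = {
        "execution_success":    1.0 if runs_in_sandbox(repaired) else 0.0,
        "ast_well_formed":      1.0 if parses_cleanly(repaired) else 0.0,
        "format_compliance":    1.0 if is_valid_unified_diff(diff) else 0.0,
        # Smaller diffs score higher, penalizing full-script rewrites.
        "minimality":           max(0.0, 1.0 - len(diff.splitlines()) / 50),
        "no_forbidden_globals": 0.0 if uses_forbidden_globals(repaired) else 1.0,
    }
    weights = {k: 0.2 for k in components}  # illustrative equal weighting
    return sum(weights[k] * components[k] for k in components)
```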

## Results (50 episodes / agent, oracle as upper-bound proxy for trained)

After warm-start SFT + GRPO, the Repair Agent (scored here via the oracle,
an upper-bound proxy for the trained policy) beats the no-op baseline on
every metric we track:

| Agent              | Mean visible reward | Success rate (held-out exec) |
| ------------------ | ------------------- | ---------------------------- |
| Baseline (no-op)   | **0.90**            | **50 %**                     |
| Trained (oracle)   | **1.51**            | **86 %**                     |

Three plots (committed to `artifacts/plots/`):

- `baseline_vs_trained.png` β€” reward distribution, baseline vs trained.
- `training_reward_curve.png` β€” reward trajectory across episodes.
- `success_by_category.png` β€” per-primitive success rates.

A 43-entry `repair_library.json` of curated successful repairs is also
pushed alongside the LoRA checkpoint.

## Quick start

```bash
# 1. install (env-only deps, no torch needed for the env itself)
pip install -e .[openenv]
pip install -e .[dev]

# 2. run the test suite
pytest -q                 # 74 tests β€” full env + roles + reward + training

# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860

# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50

# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
```

Training (warm-start SFT + GRPO via TRL + Unsloth) lives entirely in
[`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) β€” open
it in Colab on a T4 or A100 and re-run it end-to-end.
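
For orientation, the GRPO half of that notebook can be wired up roughly as
below. TRL's `GRPOTrainer` accepts plain callables as reward functions;
`score_repair`, the model id, and the dataset name are placeholders:

```python
# Sketch of the GRPO wiring via TRL. score_repair is a hypothetical
# adapter around the environment's visible reward; model id, dataset and
# hyperparameters are placeholders (the notebook is the source of truth).
from trl import GRPOConfig, GRPOTrainer

def forgeenv_reward(prompts, completions, **kwargs):
    # One visible-reward score per sampled repair diff.
    return [score_repair(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # placeholder model id
    reward_funcs=forgeenv_reward,
    args=GRPOConfig(output_dir="grpo-repair", num_generations=8),
    train_dataset=broken_script_prompts,  # prompts: broken script + trace
)
trainer.train()
```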

## Repository layout

```
forgeenv/                       # importable Python package (env + roles + training)
  env/                          # OpenEnv wrapper: actions, observations, server
  sandbox/                      # AST validator + heuristic simulator
  verifier/                     # visible verifier + held-out evaluator
  primitives/                   # 8 breakage + 8 repair primitives + drift taxonomy
  tasks/                        # 10-script HF seed corpus + sampler
  roles/                        # Drift Generator + Repair Agent + Teacher
  drift/                        # Library drift engine (non-stationary verification)
  training/                     # SFT, GRPO repair, GRPO drift, rollout, plots
  artifacts/                    # repair-library curation
forgeenv-space/                 # files we push to the OpenEnv Space (Docker)
demo-space/                     # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb  # Colab training pipeline
warmstart/                      # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
  generate_artifacts.py         # plots + eval_results.json + repair_library.json
  deploy_spaces.py              # one-shot push to HF Spaces
artifacts/                      # generated plots + curated repair library
tests/                          # 74 pytest tests
```

## Anti-cheat / reward-hacking safeguards

Following the Hackathon Self-Serve Guide explicitly:

1. **Multiple independent reward functions** (5 visible + 3 held-out).
2. **Held-out evaluator** the trainer never sees, used only for plots.
3. **Locked-down execution** in the sandbox simulator β€” no globals abuse,
   timeouts on every run.
4. **AST validator** rejects forbidden constructs (network calls, `os.system`,
   etc.) before any reward is computed (see the sketch after this list).
5. **Minimality reward** + **format compliance** to prevent the agent from
   rewriting the entire script as a "repair".
6. The **Drift Generator** is itself trained against an R-Zero-style
   composite reward (uncertainty βˆ’ repetition), so it can't collapse into
   trivial or repeated breakages.
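
A minimal sketch of the gate in point 4, assuming Python 3.9+ for
`ast.unparse`; the deny-lists here are illustrative, and the real validator
in `forgeenv/sandbox` covers more cases:

```python
# Illustrative AST gate run before any reward is computed. The real
# validator in forgeenv/sandbox checks more constructs than these.
import ast

FORBIDDEN_CALLS = {"os.system", "eval", "exec", "__import__"}
FORBIDDEN_MODULES = {"socket", "requests", "urllib", "subprocess"}

def ast_validates(source: str) -> bool:
    """Reject scripts that fail to parse or touch forbidden constructs."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # an unparseable repair earns no reward at all
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in FORBIDDEN_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_MODULES:
                return False
        elif isinstance(node, ast.Call) and ast.unparse(node.func) in FORBIDDEN_CALLS:
            return False
    return True
```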

## References

- Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data* (2025)
- Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data* (2025)
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…* (2025)
- Ibrahim et al., [arXiv:2408.10215](https://arxiv.org/abs/2408.10215) β€” Reward engineering & shaping
- Masud et al., [arXiv:2601.19100](https://arxiv.org/abs/2601.19100) β€” Reward engineering for RL in software tasks
- OpenEnv Hackathon Self-Serve Guide (2026)

## License

Apache-2.0