# ForgeEnv 🔧
> *A self-improving RL environment that teaches LLMs to fix HuggingFace
> training scripts as the ecosystem evolves.*
ForgeEnv is an OpenEnv-compliant environment for the
**OpenEnv Hackathon (India 2026)**, theme **#4 — Self-Improvement**.
Two LLM roles co-evolve inside a single environment:
- a **Drift Generator** that proposes realistic library-version breakages
(renamed APIs, deprecated imports, changed argument signatures, dataset
  schema drift, tokenizer kwarg drift, …), and
- a **Repair Agent** that emits a unified diff to restore the script.
The reward is multi-component (execution + AST checks + held-out evaluator),
which both produces a rich training signal *and* makes reward hacking expensive,
following the recommendations in the Hackathon Self-Serve Guide.
## Why it matters
LLM agents that write training code are silently broken by HF library
upgrades — `Trainer.train()` gets renamed, a tokenizer kwarg disappears, a
dataset column is restructured. Today, humans patch these breakages by hand.
ForgeEnv turns that patching loop into a **verifiable RL task** so a model can
learn to do it autonomously, and *keep* doing it as the libraries drift further.
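Concretely, a single episode is a before/after pair. A minimal illustration (this particular drift, the real `TrainingArguments` rename from `evaluation_strategy` to `eval_strategy`, is just one example of the kind of breakage ForgeEnv covers):

```python
# Example drift: a script written against an older transformers release.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",  # kwarg renamed to `eval_strategy` in newer releases
)

# The Repair Agent is expected to answer with a minimal unified diff, e.g.:
# -    evaluation_strategy="epoch",
# +    eval_strategy="epoch",
```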
## Live links
| Artifact | URL |
| --------------------------- | -------------------------------------------------------------------- |
| Environment Space (Docker) | <https://huggingface.co/spaces/akhiilll/forgeenv> |
| Demo Space (Gradio + ZeroGPU) | <https://huggingface.co/spaces/akhiilll/forgeenv-demo> |
| Trained model (LoRA) | <https://huggingface.co/akhiilll/forgeenv-repair-agent> |
| Training notebook (Colab) | [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) |
## Architecture
ForgeEnv is split into **four deployable artifacts** (two Spaces, one Jobs run, one Model repo):
- **Environment Space**: `akhiilll/forgeenv` (OpenEnv FastAPI server)
- **Training run**: Hugging Face Jobs (GPU) runs warm-start SFT + GRPO
- **Model repo**: `akhiilll/forgeenv-repair-agent` (LoRA + artifacts)
- **Demo Space**: `akhiilll/forgeenv-demo` (Gradio UI)
### End-to-end (as deployed)
```mermaid
flowchart LR
U[User / Judge] -->|broken script + error trace| D[Demo Space\nakhiilll/forgeenv-demo]
D -->|unified diff patch| U
    subgraph TrainOnce["Training (HF Jobs GPU)"]
        J["Training Job\n(SFT + GRPO)"]
E[Environment Space\nakhiilll/forgeenv]
M[Model Repo\nakhiilll/forgeenv-repair-agent]
J <--> |reset/step, obs/reward| E
J -->|push LoRA + artifacts| M
end
D -. optional model usage .-> M
```
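The dotted `optional model usage` edge is the demo pulling the published LoRA adapter. A minimal sketch of that path (the base checkpoint name is an assumption; only the adapter repo id comes from this project):

```python
# Sketch: load the published LoRA adapter in the demo Space.
# The base checkpoint is an assumed placeholder; the adapter repo id is real.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"          # assumption: not stated in this README
ADAPTER = "akhiilll/forgeenv-repair-agent"   # trained LoRA + tokenizer from the Jobs run

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, device_map="auto"), ADAPTER
)

prompt = "### Broken script + error trace\n...\n### Emit a unified diff:\n"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```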
### Environment Space internals (OpenEnv server → env hub → verifier)
```mermaid
flowchart TB
    API[OpenEnv FastAPI server\n`forgeenv/env/server.py`\n/health + reset + step] --> ENV["ForgeEnvironment (hub)\n`forgeenv/env/forge_environment.py`"]
    ENV --> TASKS[Task sampler + seed corpus\n`forgeenv/tasks/*`]
    ENV --> ROLES["Roles (prompting + parsing)\n`forgeenv/roles/*`"]
    ENV --> PRIMS["Primitives (break + repair)\n`forgeenv/primitives/*`"]
ENV --> DRIFT[Library drift engine\n`forgeenv/drift/library_drift_engine.py`]
ENV --> VERIFY[Verifiers\nvisible + held-out\n`forgeenv/verifier/*`]
VERIFY --> SANDBOX[Sandbox execution\nAST validator + simulation\n`forgeenv/sandbox/*`]
```
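A minimal sketch of that HTTP surface (the route names come from the diagram; the payload fields are illustrative assumptions, not the exact schema of `forgeenv/env/server.py`):

```python
# Sketch of the OpenEnv-style HTTP surface: /health, /reset, /step.
# Payload field names are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StepRequest(BaseModel):
    episode_id: str
    action: str  # e.g. the unified diff emitted by the Repair Agent

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/reset")
def reset() -> dict:
    # Real server: sample a seed script, apply a drift, and return the broken
    # script plus its error trace as the initial observation.
    return {
        "episode_id": "ep-0",
        "observation": {"broken_script": "...", "error_trace": "..."},
    }

@app.post("/step")
def step(req: StepRequest) -> dict:
    # Real server: run the visible verifier (sandbox + AST + format checks)
    # on the submitted patch and return the reward.
    return {"observation": {}, "reward": 0.0, "done": True}
```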
### Training pipeline internals (what actually runs today)
In the current codebase, the **Repair Agent (Solver)** GRPO loop is fully implemented.
The **Drift Generator (Challenger)** GRPO logic exists as a reward loop + CPU dry-run,
but full “LLM Drift GRPO” is intentionally not wired as a single-GPU training path yet.
```mermaid
flowchart TB
    SETUP["Install deps\n(torch/trl/unsloth/openenv…)"] --> SFT[SFT warmstart\nformat + basics]
    SFT --> SAVE1[Save SFT adapter]
    SAVE1 --> GRPO_REPAIR["GRPO Repair Agent (Solver)\n`forgeenv/training/grpo_repair.py`"]
GRPO_REPAIR <--> |episodes + rewards| ENVSPACE[Env Space\n`akhiilll/forgeenv`]
GRPO_REPAIR --> PUSH[Upload\nadapter + tokenizer + plots + repair_library]
PUSH --> HUB[Model Repo\n`akhiilll/forgeenv-repair-agent`]
```
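A compressed sketch of how the Solver GRPO stage could be wired with TRL (the base model, the in-memory dataset, and the toy reward body are assumptions; the real loop lives in `forgeenv/training/grpo_repair.py` and scores completions through the environment):

```python
# Sketch: GRPO for the Repair Agent via TRL. Model name, dataset construction,
# and the toy reward are placeholders for the real env-backed loop.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def visible_reward(completions, **kwargs):
    # Placeholder: the real reward calls the ForgeEnv verifier
    # (execution + AST + format + minimality) on each emitted patch.
    return [1.0 if "--- " in c and "+++ " in c else 0.0 for c in completions]

train_dataset = Dataset.from_list(
    [{"prompt": "Fix this broken HF training script:\n..."}]  # episodes from the env
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumption: base checkpoint not stated here
    reward_funcs=visible_reward,
    args=GRPOConfig(output_dir="grpo-repair", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```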
### Target architecture (two-role co-evolution: Challenger/Solver)
This is the **intended** architecture described in R-Zero / SPIRAL-style self-play:
```mermaid
flowchart TB
    SFT2[SFT warmstart] --> CH["GRPO Drift Generator (Challenger)"]
    CH --> FILTER[Filter/select breakages\nusing p_hat from multiple solver attempts]
    FILTER --> SOLVER["GRPO Repair Agent (Solver)"]
SOLVER --> CH
```
The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is the core
Challenger/Solver loop: generate a hard breakage → attempt a repair → score it.
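The `p_hat` filter in the Challenger diagram is an uncertainty gate in the R-Zero sense: keep breakages the current Solver repairs only some of the time. A minimal sketch (the band and the `solver_attempt` callable are assumptions):

```python
# Sketch: select breakages whose empirical solve rate (p_hat) sits in an
# intermediate band, rewarding "hard but solvable" drifts.
from typing import Callable

def estimate_p_hat(breakage: str, solver_attempt: Callable[[str], bool], k: int = 8) -> float:
    """Fraction of k independent solver attempts that repair this breakage."""
    return sum(solver_attempt(breakage) for _ in range(k)) / k

def select_breakages(breakages, solver_attempt, low=0.25, high=0.75):
    """Keep breakages that are neither trivial (p_hat ~ 1) nor hopeless (p_hat ~ 0)."""
    kept = []
    for b in breakages:
        p_hat = estimate_p_hat(b, solver_attempt)
        if low <= p_hat <= high:
            kept.append((b, p_hat))
    return kept
```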
## Reward design
```
visible_reward
├─ execution_success     (sandboxed run / heuristic simulator)
├─ ast_well_formed       (parses + no forbidden globals)
├─ format_compliance     (valid unified diff or full-script replacement)
├─ minimality            (smaller diffs preferred — anti-rewrite)
└─ no_forbidden_globals  (locked-down execution check)
held_out_evaluator (NOT used for training, used for evals only)
├─ executed_cleanly
├─ matches_target_api    (semantic correctness)
└─ regression_free       (other tests still pass)
```
Multiple independent components, plus a **held-out evaluator the trainer
never sees**, so the agent can't game its way to the top of the curve.
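For illustration, the visible components can be read as a simple weighted sum; the weights below are assumptions, not the values used in `forgeenv/verifier/*`:

```python
# Sketch: aggregate the visible reward from independent components.
# Weights are illustrative assumptions.
VISIBLE_WEIGHTS = {
    "execution_success": 1.0,
    "ast_well_formed": 0.5,
    "format_compliance": 0.25,
    "minimality": 0.25,
    "no_forbidden_globals": 0.25,
}

def visible_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-check scores in [0, 1]."""
    return sum(VISIBLE_WEIGHTS[name] * components.get(name, 0.0) for name in VISIBLE_WEIGHTS)

# Example: a syntactically valid, minimal patch that still fails to execute.
print(visible_reward({
    "execution_success": 0.0,
    "ast_well_formed": 1.0,
    "format_compliance": 1.0,
    "minimality": 0.8,
    "no_forbidden_globals": 1.0,
}))  # -> 1.2
```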
## Results (50 episodes per agent, oracle as upper-bound proxy for trained)
After warm-start SFT + GRPO, the Repair Agent, proxied here by the oracle upper
bound, clearly beats the no-op baseline on every metric we track:
| Agent | Mean visible reward | Success rate (held-out exec) |
| ------------------ | ------------------- | ---------------------------- |
| Baseline (no-op) | **0.90** | **50 %** |
| Trained (oracle) | **1.51** | **86 %** |
Three plots (committed to `artifacts/plots/`):
- `baseline_vs_trained.png` — reward distribution, baseline vs trained.
- `training_reward_curve.png` — reward trajectory across episodes.
- `success_by_category.png` — per-primitive success rates.
A 43-entry `repair_library.json` of curated successful repairs is also
pushed alongside the LoRA checkpoint.
## Quick start
```bash
# 1. install (env-only deps, no torch needed for the env itself)
pip install -e .[openenv]
pip install -e .[dev]
# 2. run the test suite
pytest -q    # 74 tests — full env + roles + reward + training
# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860
# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50
# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
```
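With the server from step 3 running, a minimal round-trip looks roughly like this (the payload shape is an assumption; only the routes are given by the server):

```python
# Sketch: exercise the local environment server started in step 3.
# Payload/field names are assumptions about the schema.
import requests

BASE_URL = "http://localhost:7860"

assert requests.get(f"{BASE_URL}/health").ok

obs = requests.post(f"{BASE_URL}/reset").json()
print(obs)  # broken script + error trace for a freshly drifted task

result = requests.post(
    f"{BASE_URL}/step",
    json={
        "episode_id": obs.get("episode_id"),
        "action": "--- a/train.py\n+++ b/train.py\n...",  # a unified diff
    },
).json()
print(result)  # observation, reward, done
```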
Training can run via:
- **HF Jobs GPU**: `scripts/jobs/train_repair_agent.py` (what we used for the successful run)
- **Notebook**: [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) (useful for iteration)
## Repository layout
```
forgeenv/ # importable Python package (env + roles + training)
env/ # OpenEnv wrapper: actions, observations, server
sandbox/ # AST validator + heuristic simulator
verifier/ # visible verifier + held-out evaluator
primitives/ # 8 breakage + 8 repair primitives + drift taxonomy
tasks/ # 10-script HF seed corpus + sampler
roles/ # Drift Generator + Repair Agent + Teacher
drift/ # Library drift engine (non-stationary verification)
training/ # SFT, GRPO repair, GRPO drift, rollout, plots
artifacts/ # repair-library curation
forgeenv-space/ # files we push to the OpenEnv Space (Docker)
demo-space/ # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb # Colab training pipeline
warmstart/ # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
generate_artifacts.py # plots + eval_results.json + repair_library.json
deploy_spaces.py # one-shot push to HF Spaces
artifacts/ # generated plots + curated repair library
tests/ # 74 pytest tests
```
## Anti-cheat / reward-hacking safeguards
ForgeEnv follows the Hackathon Self-Serve Guide explicitly:
1. **Multiple independent reward functions** (5 visible + 3 held-out).
2. **Held-out evaluator** the trainer never sees, used only for plots.
3. **Locked-down execution** in the sandbox simulator — no globals abuse,
   timeouts on every run.
4. **AST validator** rejects forbidden constructs (network calls, `os.system`,
   etc.) before reward is computed (a minimal sketch follows this list).
5. **Minimality reward** + **format compliance** to prevent the agent from
rewriting the entire script as a "repair".
6. The **Drift Generator** is itself trained against an R-Zero composite
   reward (uncertainty − repetition), so it can't trivially exploit the Repair
   Agent with unsolvable or repetitive breakages.
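A minimal sketch of the AST gate from item 4 (the forbidden-call list here is an assumption; the real validator lives in `forgeenv/sandbox/`):

```python
# Sketch: reject patched scripts that call obviously unsafe APIs before any
# reward is computed. The forbidden list is an illustrative assumption.
import ast

FORBIDDEN_CALLS = {("os", "system"), ("subprocess", "run"), ("socket", "socket")}

def passes_ast_gate(source: str) -> bool:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            owner = node.func.value
            if isinstance(owner, ast.Name) and (owner.id, node.func.attr) in FORBIDDEN_CALLS:
                return False
    return True

print(passes_ast_gate("import os\nos.system('rm -rf /')"))  # False
print(passes_ast_gate("x = 1 + 1"))                         # True
```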
## References
- Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data* (2025)
- Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data* (2025)
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…* (2025)
- Ibrahim et al., [arXiv:2408.10215](https://arxiv.org/abs/2408.10215) — Reward engineering & shaping
- Masud et al., [arXiv:2601.19100](https://arxiv.org/abs/2601.19100) — Reward engineering for RL in software tasks
- OpenEnv Hackathon Self-Serve Guide (2026)
## License
Apache-2.0