# ForgeEnv 🔧
> *A self-improving RL environment that teaches LLMs to fix HuggingFace
> training scripts as the ecosystem evolves.*
ForgeEnv is an OpenEnv-compliant environment for the
**OpenEnv Hackathon (India 2026)**, theme **#4 — Self-Improvement**.
Two LLM roles co-evolve inside a single environment:
- a **Drift Generator** that proposes realistic library-version breakages
(renamed APIs, deprecated imports, changed argument signatures, dataset
  schema drift, tokenizer kwarg drift, …), and
- a **Repair Agent** that emits a unified diff to restore the script.
The reward is multi-component (execution + AST checks + held-out evaluator),
which both produces a rich training signal *and* makes reward hacking expensive,
following the recommendations in the Hackathon Self-Serve Guide.
## Why it matters
LLM agents that write training code are silently broken by HF library
upgrades — `Trainer.train()` gets renamed, a tokenizer kwarg disappears, a
dataset column is restructured. Today, humans patch these breakages by hand.
ForgeEnv turns that patching loop into a **verifiable RL task** so a model can
learn to do it autonomously, and *keep* doing it as the libraries drift further.
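Concretely, a single episode is a before/after pair. A minimal illustration (this particular drift, the real `TrainingArguments` rename from `evaluation_strategy` to `eval_strategy`, is just one example of the kind of breakage ForgeEnv covers):

```python
# Example drift: a script written against an older transformers release.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",  # kwarg renamed to `eval_strategy` in newer releases
)

# The Repair Agent is expected to answer with a minimal unified diff, e.g.:
# -    evaluation_strategy="epoch",
# +    eval_strategy="epoch",
```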
## Live links
| Artifact | URL |
| --------------------------- | -------------------------------------------------------------------- |
| Environment Space (Docker) | <https://huggingface.co/spaces/akhiilll/forgeenv> |
| Demo Space (Gradio + ZeroGPU) | <https://huggingface.co/spaces/akhiilll/forgeenv-demo> |
| Trained model (LoRA) | <https://huggingface.co/akhiilll/forgeenv-repair-agent> |
| Training notebook (Colab) | [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) |
## Architecture
ForgeEnv is split into **four deployable artifacts** (two Spaces, one Jobs run, one Model repo):
- **Environment Space**: `akhiilll/forgeenv` (OpenEnv FastAPI server)
- **Training run**: Hugging Face Jobs (GPU) runs warm-start SFT + GRPO
- **Model repo**: `akhiilll/forgeenv-repair-agent` (LoRA + artifacts)
- **Demo Space**: `akhiilll/forgeenv-demo` (Gradio UI)
### End-to-end (as deployed)
```mermaid
flowchart LR
U[User / Judge] -->|broken script + error trace| D[Demo Space\nakhiilll/forgeenv-demo]
D -->|unified diff patch| U
    subgraph TrainOnce["Training (HF Jobs GPU)"]
        J["Training Job\n(SFT + GRPO)"]
E[Environment Space\nakhiilll/forgeenv]
M[Model Repo\nakhiilll/forgeenv-repair-agent]
J <--> |reset/step, obs/reward| E
J -->|push LoRA + artifacts| M
end
D -. optional model usage .-> M
```
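The dotted `optional model usage` edge is the demo pulling the published LoRA adapter. A minimal sketch of that path (the base checkpoint name is an assumption; only the adapter repo id comes from this project):

```python
# Sketch: load the published LoRA adapter in the demo Space.
# The base checkpoint is an assumed placeholder; the adapter repo id is real.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"          # assumption: not stated in this README
ADAPTER = "akhiilll/forgeenv-repair-agent"   # trained LoRA + tokenizer from the Jobs run

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, device_map="auto"), ADAPTER
)

prompt = "### Broken script + error trace\n...\n### Emit a unified diff:\n"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```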
### Environment Space internals (OpenEnv server → env hub → verifier)
```mermaid
flowchart TB
    API[OpenEnv FastAPI server\n`forgeenv/env/server.py`\n/health + reset + step] --> ENV["ForgeEnvironment (hub)\n`forgeenv/env/forge_environment.py`"]
    ENV --> TASKS[Task sampler + seed corpus\n`forgeenv/tasks/*`]
    ENV --> ROLES["Roles (prompting + parsing)\n`forgeenv/roles/*`"]
    ENV --> PRIMS["Primitives (break + repair)\n`forgeenv/primitives/*`"]
ENV --> DRIFT[Library drift engine\n`forgeenv/drift/library_drift_engine.py`]
ENV --> VERIFY[Verifiers\nvisible + held-out\n`forgeenv/verifier/*`]
VERIFY --> SANDBOX[Sandbox execution\nAST validator + simulation\n`forgeenv/sandbox/*`]
```
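A minimal sketch of that HTTP surface (the route names come from the diagram; the payload fields are illustrative assumptions, not the exact schema of `forgeenv/env/server.py`):

```python
# Sketch of the OpenEnv-style HTTP surface: /health, /reset, /step.
# Payload field names are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StepRequest(BaseModel):
    episode_id: str
    action: str  # e.g. the unified diff emitted by the Repair Agent

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/reset")
def reset() -> dict:
    # Real server: sample a seed script, apply a drift, and return the broken
    # script plus its error trace as the initial observation.
    return {
        "episode_id": "ep-0",
        "observation": {"broken_script": "...", "error_trace": "..."},
    }

@app.post("/step")
def step(req: StepRequest) -> dict:
    # Real server: run the visible verifier (sandbox + AST + format checks)
    # on the submitted patch and return the reward.
    return {"observation": {}, "reward": 0.0, "done": True}
```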
### Training pipeline internals (what actually runs today)
In the current codebase, the **Repair Agent (Solver)** GRPO loop is fully implemented.
The **Drift Generator (Challenger)** GRPO logic exists as a reward loop + CPU dry-run,
but full “LLM Drift GRPO” is intentionally not wired as a single-GPU training path yet.
```mermaid
flowchart TB
    SETUP["Install deps\n(torch/trl/unsloth/openenv…)"] --> SFT[SFT warmstart\nformat + basics]
    SFT --> SAVE1[Save SFT adapter]
    SAVE1 --> GRPO_REPAIR["GRPO Repair Agent (Solver)\n`forgeenv/training/grpo_repair.py`"]
GRPO_REPAIR <--> |episodes + rewards| ENVSPACE[Env Space\n`akhiilll/forgeenv`]
GRPO_REPAIR --> PUSH[Upload\nadapter + tokenizer + plots + repair_library]
PUSH --> HUB[Model Repo\n`akhiilll/forgeenv-repair-agent`]
```
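A compressed sketch of how the Solver GRPO stage could be wired with TRL (the base model, the in-memory dataset, and the toy reward body are assumptions; the real loop lives in `forgeenv/training/grpo_repair.py` and scores completions through the environment):

```python
# Sketch: GRPO for the Repair Agent via TRL. Model name, dataset construction,
# and the toy reward are placeholders for the real env-backed loop.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def visible_reward(completions, **kwargs):
    # Placeholder: the real reward calls the ForgeEnv verifier
    # (execution + AST + format + minimality) on each emitted patch.
    return [1.0 if "--- " in c and "+++ " in c else 0.0 for c in completions]

train_dataset = Dataset.from_list(
    [{"prompt": "Fix this broken HF training script:\n..."}]  # episodes from the env
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumption: base checkpoint not stated here
    reward_funcs=visible_reward,
    args=GRPOConfig(output_dir="grpo-repair", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```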
### Target architecture (two-role co-evolution: Challenger/Solver)
This is the **intended** architecture described in R-Zero / SPIRAL-style self-play:
```mermaid
flowchart TB
    SFT2[SFT warmstart] --> CH["GRPO Drift Generator (Challenger)"]
    CH --> FILTER[Filter/select breakages\nusing p_hat from multiple solver attempts]
    FILTER --> SOLVER["GRPO Repair Agent (Solver)"]
SOLVER --> CH
```
The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is the core
Challenger/Solver loop: generate a hard breakage → attempt a repair → score it.
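The `p_hat` filter in the Challenger diagram is an uncertainty gate in the R-Zero sense: keep breakages the current Solver repairs only some of the time. A minimal sketch (the band and the `solver_attempt` callable are assumptions):

```python
# Sketch: select breakages whose empirical solve rate (p_hat) sits in an
# intermediate band, rewarding "hard but solvable" drifts.
from typing import Callable

def estimate_p_hat(breakage: str, solver_attempt: Callable[[str], bool], k: int = 8) -> float:
    """Fraction of k independent solver attempts that repair this breakage."""
    return sum(solver_attempt(breakage) for _ in range(k)) / k

def select_breakages(breakages, solver_attempt, low=0.25, high=0.75):
    """Keep breakages that are neither trivial (p_hat ~ 1) nor hopeless (p_hat ~ 0)."""
    kept = []
    for b in breakages:
        p_hat = estimate_p_hat(b, solver_attempt)
        if low <= p_hat <= high:
            kept.append((b, p_hat))
    return kept
```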
## Reward design
```
visible_reward
├─ execution_success     (sandboxed run / heuristic simulator)
├─ ast_well_formed       (parses + no forbidden globals)
├─ format_compliance     (valid unified diff or full-script replacement)
├─ minimality            (smaller diffs preferred — anti-rewrite)
└─ no_forbidden_globals  (locked-down execution check)
held_out_evaluator (NOT used for training, used for evals only)
├─ executed_cleanly
├─ matches_target_api    (semantic correctness)
└─ regression_free       (other tests still pass)
```
Multiple independent components, plus a **held-out evaluator the trainer
never sees**, so the agent can't game its way to the top of the curve.
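For illustration, the visible components can be read as a simple weighted sum; the weights below are assumptions, not the values used in `forgeenv/verifier/*`:

```python
# Sketch: aggregate the visible reward from independent components.
# Weights are illustrative assumptions.
VISIBLE_WEIGHTS = {
    "execution_success": 1.0,
    "ast_well_formed": 0.5,
    "format_compliance": 0.25,
    "minimality": 0.25,
    "no_forbidden_globals": 0.25,
}

def visible_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-check scores in [0, 1]."""
    return sum(VISIBLE_WEIGHTS[name] * components.get(name, 0.0) for name in VISIBLE_WEIGHTS)

# Example: a syntactically valid, minimal patch that still fails to execute.
print(visible_reward({
    "execution_success": 0.0,
    "ast_well_formed": 1.0,
    "format_compliance": 1.0,
    "minimality": 0.8,
    "no_forbidden_globals": 1.0,
}))  # -> 1.2
```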
## Results (50 episodes per agent, oracle as upper-bound proxy for trained)
After warm-start SFT + GRPO, the Repair Agent, proxied here by the oracle upper
bound, clearly beats the no-op baseline on every metric we track:
| Agent | Mean visible reward | Success rate (held-out exec) |
| ------------------ | ------------------- | ---------------------------- |
| Baseline (no-op) | **0.90** | **50 %** |
| Trained (oracle) | **1.51** | **86 %** |
Three plots (committed to `artifacts/plots/`):
- `baseline_vs_trained.png` — reward distribution, baseline vs trained.
- `training_reward_curve.png` — reward trajectory across episodes.
- `success_by_category.png` — per-primitive success rates.
A 43-entry `repair_library.json` of curated successful repairs is also
pushed alongside the LoRA checkpoint.
## Quick start
```bash
# 1. install (env-only deps, no torch needed for the env itself)
pip install -e .[openenv]
pip install -e .[dev]
# 2. run the test suite
pytest -q    # 74 tests — full env + roles + reward + training
# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860
# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50
# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
```
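With the server from step 3 running, a minimal round-trip looks roughly like this (the payload shape is an assumption; only the routes are given by the server):

```python
# Sketch: exercise the local environment server started in step 3.
# Payload/field names are assumptions about the schema.
import requests

BASE_URL = "http://localhost:7860"

assert requests.get(f"{BASE_URL}/health").ok

obs = requests.post(f"{BASE_URL}/reset").json()
print(obs)  # broken script + error trace for a freshly drifted task

result = requests.post(
    f"{BASE_URL}/step",
    json={
        "episode_id": obs.get("episode_id"),
        "action": "--- a/train.py\n+++ b/train.py\n...",  # a unified diff
    },
).json()
print(result)  # observation, reward, done
```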
Training can run via:
- **HF Jobs GPU**: `scripts/jobs/train_repair_agent.py` (what we used for the successful run)
- **Notebook**: [`notebooks/forgeenv_train.ipynb`](notebooks/forgeenv_train.ipynb) (useful for iteration)
## Repository layout
```
forgeenv/ # importable Python package (env + roles + training)
env/ # OpenEnv wrapper: actions, observations, server
sandbox/ # AST validator + heuristic simulator
verifier/ # visible verifier + held-out evaluator
primitives/ # 8 breakage + 8 repair primitives + drift taxonomy
tasks/ # 10-script HF seed corpus + sampler
roles/ # Drift Generator + Repair Agent + Teacher
drift/ # Library drift engine (non-stationary verification)
training/ # SFT, GRPO repair, GRPO drift, rollout, plots
artifacts/ # repair-library curation
forgeenv-space/ # files we push to the OpenEnv Space (Docker)
demo-space/ # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb # Colab training pipeline
warmstart/ # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
generate_artifacts.py # plots + eval_results.json + repair_library.json
deploy_spaces.py # one-shot push to HF Spaces
artifacts/ # generated plots + curated repair library
tests/ # 74 pytest tests
```
## Anti-cheat / reward-hacking safeguards
ForgeEnv follows the Hackathon Self-Serve Guide explicitly:
1. **Multiple independent reward functions** (5 visible + 3 held-out).
2. **Held-out evaluator** the trainer never sees, used only for plots.
3. **Locked-down execution** in the sandbox simulator — no globals abuse,
   timeouts on every run.
4. **AST validator** rejects forbidden constructs (network calls, `os.system`,
   etc.) before reward is computed (a minimal sketch follows this list).
5. **Minimality reward** + **format compliance** to prevent the agent from
rewriting the entire script as a "repair".
6. The **Drift Generator** is itself trained against an R-Zero composite
   reward (uncertainty − repetition), so it can't trivially exploit the Repair
   Agent with unsolvable or repetitive breakages.
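A minimal sketch of the AST gate from item 4 (the forbidden-call list here is an assumption; the real validator lives in `forgeenv/sandbox/`):

```python
# Sketch: reject patched scripts that call obviously unsafe APIs before any
# reward is computed. The forbidden list is an illustrative assumption.
import ast

FORBIDDEN_CALLS = {("os", "system"), ("subprocess", "run"), ("socket", "socket")}

def passes_ast_gate(source: str) -> bool:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            owner = node.func.value
            if isinstance(owner, ast.Name) and (owner.id, node.func.attr) in FORBIDDEN_CALLS:
                return False
    return True

print(passes_ast_gate("import os\nos.system('rm -rf /')"))  # False
print(passes_ast_gate("x = 1 + 1"))                         # True
```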
## References
- Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data* (2025)
- Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data* (2025)
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…* (2025)
- Ibrahim et al., [arXiv:2408.10215](https://arxiv.org/abs/2408.10215) — Reward engineering & shaping
- Masud et al., [arXiv:2601.19100](https://arxiv.org/abs/2601.19100) — Reward engineering for RL in software tasks
- OpenEnv Hackathon Self-Serve Guide (2026)
## License
Apache-2.0