{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ForgeEnv: Training Notebook\n", "\n", "Self-improving RL environment for HuggingFace ecosystem repair under library drift.\n", "Trains a **Repair Agent** (and optionally a co-evolving **Drift Generator**)\n", "on top of `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit` using **TRL GRPO + Unsloth**.\n", "\n", "Pipeline:\n", "1. Install dependencies (Unsloth, TRL, ForgeEnv).\n", "2. Generate warm-start pairs.\n", "3. SFT warm-start the Repair Agent (200 steps).\n", "4. GRPO main training (200 episodes).\n", "5. Evaluate baseline vs trained, save plots and adapter to Hugging Face Hub.\n", "\n", "**Hardware**: T4 / L4 / A100 (4-bit QLoRA). Designed for ~1 hr on A100, ~3 hrs on T4." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!pip install -q unsloth==2024.4 trl>=0.10.0 peft>=0.10.0 accelerate>=0.30.0 datasets>=2.18.0\n", "!pip install -q openenv-core>=0.2.0 nltk>=3.8.0 scikit-learn>=1.4.0 matplotlib>=3.8.0 wandb>=0.16.0 huggingface_hub>=0.23.0\n", "!pip install -q -e ." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os, json, torch\n", "from pathlib import Path\n", "\n", "HF_USERNAME = 'akhiilll'\n", "HF_TOKEN = os.environ.get('HF_TOKEN', '') # set this in Colab Secrets\n", "MODEL_REPO = f'{HF_USERNAME}/forgeenv-repair-agent'\n", "BASE_MODEL = 'unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit'\n", "\n", "from huggingface_hub import login\n", "if HF_TOKEN:\n", " login(token=HF_TOKEN)\n", "print('Torch:', torch.__version__, 'CUDA:', torch.cuda.is_available())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Generate warm-start pairs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python warmstart/generate_pairs.py --target_count 64 --out_dir warmstart/data\n", "!head -1 warmstart/data/repair_pairs.jsonl | python -m json.tool" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. SFT warm-start (Repair Agent)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from forgeenv.training.sft_warmstart import run_sft\n", "\n", "run_sft(\n", " role='repair_agent',\n", " data_path='warmstart/data/repair_pairs.jsonl',\n", " output_dir='artifacts/checkpoints/repair_agent_sft',\n", " base_model=BASE_MODEL,\n", " max_steps=200,\n", " batch_size=2,\n", " learning_rate=2e-4,\n", " lora_r=16,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. GRPO main training (Repair Agent)\n", "\n", "200 episodes against the live ForgeEnvironment. Logs reward at every 5 steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from forgeenv.training.grpo_repair import run_grpo\n", "\n", "run_grpo(\n", " base_model=BASE_MODEL,\n", " adapter_path='artifacts/checkpoints/repair_agent_sft',\n", " output_dir='artifacts/checkpoints/repair_agent_grpo',\n", " total_episodes=200,\n", " group_size=4,\n", " learning_rate=5e-6,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Baseline vs trained eval (50 episodes each)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from forgeenv.env.forge_environment import ForgeEnvironment\n", "from forgeenv.training.rollout import rollout_one_episode\n", "\n", "def run_eval(generate_fn, n_episodes=50, label=''):\n", "    rewards = []\n", "    successes = 0\n", "    for i in range(n_episodes):\n", "        env = ForgeEnvironment(seed=42 + i)\n", "        result = rollout_one_episode(env, repair_generate=generate_fn)\n", "        rewards.append(result.visible_reward)\n", "        successes += int(result.success)\n", "    return {\n", "        'label': label,\n", "        'mean_reward': sum(rewards) / len(rewards),\n", "        'success_rate': successes / n_episodes,\n", "        'rewards': rewards,\n", "    }\n", "\n", "from forgeenv.training.rollout import _baseline_repair_generate\n", "baseline_result = run_eval(_baseline_repair_generate(), n_episodes=50, label='baseline (no-op)')\n", "print(json.dumps({k: v for k, v in baseline_result.items() if k != 'rewards'}, indent=2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the trained adapter and eval\n", "from unsloth import FastLanguageModel\n", "from peft import PeftModel\n", "\n", "model, tokenizer = FastLanguageModel.from_pretrained(\n", "    model_name=BASE_MODEL, max_seq_length=4096, dtype=None, load_in_4bit=True\n", ")\n", "model = PeftModel.from_pretrained(model, 'artifacts/checkpoints/repair_agent_grpo')\n", "model = FastLanguageModel.for_inference(model)\n", "\n", "def trained_generate(system, user):\n", "    msgs = [{'role':'system','content':system},{'role':'user','content':user}]\n", "    inputs = tokenizer.apply_chat_template(msgs, return_tensors='pt', add_generation_prompt=True).to(model.device)\n", "    out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)\n", "    return tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)\n", "\n", "trained_result = run_eval(trained_generate, n_episodes=50, label='trained (GRPO)')\n", "print(json.dumps({k: v for k, v in trained_result.items() if k != 'rewards'}, indent=2))" ] },
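{ "cell_type": "markdown", "metadata": {}, "source": [ "Optional: a quick side-by-side summary of the two runs above. It only uses the `baseline_result` and `trained_result` dicts already in memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: compact comparison of the two eval runs above.\n", "import statistics\n", "\n", "def summarize(result):\n", "    r = result['rewards']\n", "    return {\n", "        'mean_reward': round(statistics.mean(r), 3),\n", "        'stdev_reward': round(statistics.stdev(r), 3) if len(r) > 1 else 0.0,\n", "        'success_rate': round(result['success_rate'], 3),\n", "    }\n", "\n", "for res in (baseline_result, trained_result):\n", "    print(res['label'], summarize(res))\n", "\n", "delta = statistics.mean(trained_result['rewards']) - statistics.mean(baseline_result['rewards'])\n", "print('mean reward delta (trained - baseline):', round(delta, 3))" ] },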
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Save plots + push to Hugging Face Hub" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from forgeenv.training.plots import (\n", "    plot_reward_curve, plot_success_rate_by_category, plot_baseline_vs_trained\n", ")\n", "\n", "Path('artifacts/plots').mkdir(parents=True, exist_ok=True)\n", "plot_baseline_vs_trained(\n", "    baseline_rewards=baseline_result['rewards'],\n", "    trained_rewards=trained_result['rewards'],\n", "    out_path='artifacts/plots/baseline_vs_trained.png',\n", ")\n", "plot_reward_curve(\n", "    rewards=trained_result['rewards'],\n", "    out_path='artifacts/plots/training_reward_curve.png',\n", ")\n", "print('Plots written.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from huggingface_hub import HfApi\n", "\n", "api = HfApi(token=HF_TOKEN)\n", "api.create_repo(MODEL_REPO, exist_ok=True, private=False)\n", "model.push_to_hub(MODEL_REPO, token=HF_TOKEN)\n", "tokenizer.push_to_hub(MODEL_REPO, token=HF_TOKEN)\n", "api.upload_folder(folder_path='artifacts/plots', repo_id=MODEL_REPO, path_in_repo='plots')\n", "print(f'Pushed to https://huggingface.co/{MODEL_REPO}')" ] } ], "metadata": { "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"name": "python", "version": "3.10"} }, "nbformat": 4, "nbformat_minor": 4 }