Don Rishabh and Claude Opus 4.7 (1M context) committed on
Commit a56bede · 1 Parent(s): 3724e90

remove untested Colab notebook + link training/ folder in README

- Deleted notebooks/prompt_golf_train_minimal.ipynb: it never ran
  end-to-end on a real Colab GPU, and the risk of judges hitting
  confusing errors outweighed the value of keeping the file in the repo
- Stripped notebook references from README (Links section, Files
tree) and BLOG_POST.md (TL;DR + Try-it-yourself section)
- Added a top-level "Training pipeline" link in the README Links
section pointing at github.com/.../tree/main/training (folder view)
- Made the existing "Training pipeline (training/)" sub-section
header itself a folder link

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -23,8 +23,8 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 - 🎛️ **Live demo (Gradio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-demo
 - 📊 **Training dashboard (Trackio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
 - 🐙 **GitHub mirror:** https://github.com/rishabh16196/prompt_golf_env
+ - 🛠️ **Training pipeline:** [`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training) — full GRPO trainers, eval harness, profilers, HF Jobs launchers
 - 📝 **Blog post:** [`BLOG_POST.md`](./BLOG_POST.md)
- - 📓 **Colab training notebook:** [`notebooks/prompt_golf_train_minimal.ipynb`](./notebooks/prompt_golf_train_minimal.ipynb)
 
 ### Trained adapters & data
 
@@ -36,7 +36,7 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama multi-turn | trajectory-level GRPO adapter |
 | [`prompt-golf-llama-self`](https://huggingface.co/rishabh16196/prompt-golf-llama-self) | Llama→Llama self-improvement | adapter where Llama writes prompts for itself |
 
- ### Training pipeline (`training/`)
+ ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
 | File | Role |
 |---|---|
@@ -166,7 +166,6 @@ prompt_golf_env/
 rubrics.py # additive reward composition
 tasks.py / tasks_v2.py / tasks_tough.py / tasks_policy.py # 90-task bank
 training/ # see Links → Training pipeline
- notebooks/ # Colab smoke training
 ui/ + space-demo/ # Gradio demos
 BLOG_POST.md # writeup
 ```
notebooks/prompt_golf_train_minimal.ipynb DELETED
@@ -1,256 +0,0 @@
- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Prompt Golf — Minimal Training Demo\n",
- "\n",
- "Train a Qwen3-1.7B **agent** (LoRA) to write short prompts that steer a frozen **target** LLM.\n",
- "Cross-family RL on the OpenEnv Prompt Golf environment using TRL GRPO.\n",
- "\n",
- "**Hardware**\n",
- "- Recommended: L4 or A100 (Colab Pro+) — runs the headline `Qwen agent → Llama-3.2-3B target` config.\n",
- "- Free T4 (16 GB): downsize the target to `Qwen/Qwen2.5-0.5B-Instruct` so everything fits.\n",
- "\n",
- "This notebook runs a 30-step smoke training so you can verify the pipeline end-to-end on Colab in ~10 min.\n",
- "For the full 500-step training that produced the demo CSVs, use HuggingFace Jobs via `training/hf_job_train.sh`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 1. Install dependencies\n",
- "\n",
- "Mirrors the OpenEnv-official pin set used by HF Jobs (`pytorch/2.4.0-cuda12.4` base + uv upgrade to torch ≥ 2.8 + `trl==0.22.2`)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!pip install -q -U uv\n",
- "!uv pip install --system -q \\\n",
- " \"torch>=2.8.0\" \"torchvision>=0.25.0\" \"triton>=3.4.0\" bitsandbytes \\\n",
- " \"transformers==4.56.2\" \\\n",
- " \"unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo\" \\\n",
- " \"unsloth[base] @ git+https://github.com/unslothai/unsloth\"\n",
- "!uv pip install --system --upgrade --no-deps -q \\\n",
- " \"transformers==4.56.2\" tokenizers \"trl==0.22.2\" unsloth unsloth_zoo\n",
- "!pip install -q 'openenv-core[core]>=0.2.2' 'peft>=0.13.0' 'datasets>=3.0.0' 'accelerate>=0.34.0'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 2. Clone the env + install the package"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!rm -rf /content/prompt_golf_env\n",
- "!git clone --depth 1 https://huggingface.co/spaces/rishabh16196/prompt_golf_env /content/prompt_golf_env\n",
- "%cd /content/prompt_golf_env\n",
- "!pip install -q --no-deps -e ."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 3. Log in to HuggingFace\n",
- "\n",
- "Needed to download Qwen3-1.7B and Llama-3.2-3B-Instruct (the latter is gated — accept the license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct first)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from huggingface_hub import notebook_login\n",
- "notebook_login()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 4. Verify the env (mock backend, CPU-only — no model load)\n",
- "\n",
- "Quick sanity check that the env imports and resets correctly."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "os.environ['PROMPT_GOLF_TARGET_BACKEND'] = 'mock'\n",
- "os.environ['PROMPT_GOLF_JUDGE_BACKEND'] = 'mock'\n",
- "\n",
- "from prompt_golf_env.server.prompt_golf_environment import PromptGolfEnvironment, _ALL_TASKS\n",
- "from prompt_golf_env.models import GolfAction\n",
- "\n",
- "print(f'task bank: {len(_ALL_TASKS)} tasks (20 v1 + 15 v2 + 52 tough)')\n",
- "\n",
- "env = PromptGolfEnvironment()\n",
- "obs = env.reset(task='sentiment_basic', seed=0)\n",
- "print(f'\\ntask: {obs.task_id} | budget: {obs.prompt_budget_tokens} tokens')\n",
- "print(f'verbose description ({len(obs.task_description)} chars):')\n",
- "print(f' {obs.task_description[:140]}...')\n",
- "\n",
- "# Try a hand-written prompt\n",
- "result = env.step(GolfAction(prompt='Classify the sentiment as positive, negative, or neutral. Output the label only.'))\n",
- "print(f'\\nhand-written prompt: reward={result.reward:+.3f} raw={result.raw_task_score:.2f} '\n",
- " f'tokens={result.submitted_prompt_tokens} leak={result.leakage_penalty:.2f}')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 5. Mini training run (30 steps)\n",
- "\n",
- "Runs the full agent + target pipeline with a small number of steps to verify the loop works on your hardware. Defaults below are sized for **L4 (24 GB)**.\n",
- "\n",
- "**For free T4 (16 GB)**: change `--target-model` to `Qwen/Qwen2.5-0.5B-Instruct`, drop `--num-generations 4` to `2`, and skip the judge (set `PROMPT_GOLF_JUDGE_BACKEND=mock` in the cell below)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Switch off mock backends — we want real model inference now.\n",
- "del os.environ['PROMPT_GOLF_TARGET_BACKEND']\n",
- "# Keep judge on mock for the smoke run unless you have an A100; the\n",
- "# 8B 8-bit judge alone takes ~8 GB on top of agent + target.\n",
- "os.environ['PROMPT_GOLF_JUDGE_BACKEND'] = 'mock'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": "# `train_grpo.py` trains on ALL tasks in the bank by default. The\n# --held-out-tasks flag carves out a small eval split that the GRPO\n# trainer reports on each step. With max_steps=30 the loop sees a\n# tiny fraction of the bank — purpose here is to verify the pipeline\n# runs on your hardware, not to converge.\n!python -u training/train_grpo.py \\\n --agent-model Qwen/Qwen3-1.7B \\\n --target-model meta-llama/Llama-3.2-3B-Instruct \\\n --max-steps 30 \\\n --num-generations 4 \\\n --per-device-batch-size 2 \\\n --gradient-accumulation-steps 2 \\\n --seeds-per-task 2 \\\n --learning-rate 5e-6 \\\n --beta 0.04 \\\n --enable-thinking \\\n --max-completion-length 768 \\\n --output-dir /content/outputs/grpo_demo"
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 6. Inspect training metrics"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import json\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "metrics_path = '/content/outputs/grpo_demo/train_metrics.jsonl'\n",
- "rows = [json.loads(l) for l in open(metrics_path)]\n",
- "print(f'{len(rows)} steps logged')\n",
- "\n",
- "fig, axes = plt.subplots(1, 2, figsize=(11, 3.5))\n",
- "steps = [r['step'] for r in rows]\n",
- "axes[0].plot(steps, [r['reward'] for r in rows], color='#1f77b4')\n",
- "axes[0].axhline(0, color='gray', lw=0.5)\n",
- "axes[0].set_title('reward per step'); axes[0].set_xlabel('step'); axes[0].grid(alpha=0.3)\n",
- "axes[1].plot(steps, [r.get('avg_tokens', 0) for r in rows], color='#ff7f0e')\n",
- "axes[1].set_title('avg prompt tokens per step'); axes[1].set_xlabel('step'); axes[1].grid(alpha=0.3)\n",
- "plt.tight_layout(); plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 7. Eval the trained adapter on a few tasks\n",
- "\n",
- "Loads the LoRA adapter you just trained and prints what it now writes for each task vs the verbose hand-written description."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!python -u training/eval_before_after.py \\\n",
- " --agent-model Qwen/Qwen3-1.7B \\\n",
- " --adapter /content/outputs/grpo_demo/adapter_final \\\n",
- " --target-model meta-llama/Llama-3.2-3B-Instruct \\\n",
- " --label trained \\\n",
- " --tasks tough_fallacy_classify,sentiment_basic,ner_people,format_uppercase \\\n",
- " --output-json /content/outputs/eval_trained.jsonl \\\n",
- " --enable-thinking"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rows = [json.loads(l) for l in open('/content/outputs/eval_trained.jsonl')]\n",
- "for r in rows:\n",
- " print(f\"\\n[{r['task_id']}] reward={r['reward']:+.3f} raw={r['raw_task_score']:.2f} tokens={r['tokens']}\")\n",
- " print(f\" trained agent's prompt: {r['agent_prompt'][:140]!r}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## What's next\n",
- "\n",
- "This notebook ran 30 steps on 4 tasks — enough to verify the pipeline. The adapter checkpoints used in the demo CSVs were produced by 500-step runs over all 87 tasks, which take ~3-4h on L40S.\n",
- "\n",
- "**To reproduce the full results:**\n",
- "1. `bash training/hf_job_train.sh` — same-family Qwen→Qwen baseline (single-turn)\n",
- "2. `ENABLE_THINKING=true PUSH_TO_HUB=rishabh16196/prompt-golf-qwen-to-llama bash training/hf_job_train.sh` — cross-family Qwen→Llama (the hero run)\n",
- "3. `bash training/hf_job_train_multistep.sh` — multi-turn trajectory-level GRPO (warm-started from #2)\n",
- "4. `bash training/hf_job_eval.sh both` — base + trained eval on either adapter\n",
- "5. `python training/build_before_after_csv.py ...` — merge eval JSONLs into the demo CSV\n",
- "\n",
- "Existing artifacts:\n",
- "- Qwen→Qwen demo CSV: https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv\n",
- "- Capability profiles (per task): https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/tree/main/profiles\n",
- "- Plots: https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/tree/main/plots"
- ]
- }
- ],
- "metadata": {
- "accelerator": "GPU",
- "colab": {
- "provenance": [],
- "gpuType": "L4"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- },
- "language_info": {
- "name": "python"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
- }
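
The mock-backend sanity check from section 4 of the deleted notebook is still handy for quickly verifying an install. A minimal standalone sketch, assuming `prompt_golf_env` is installed per the README (`pip install -e .`); the imports and attribute names below are copied from the deleted notebook, not re-verified against the current codebase:

```python
import os

# Mock backends: no model weights are loaded, so this runs CPU-only.
os.environ["PROMPT_GOLF_TARGET_BACKEND"] = "mock"
os.environ["PROMPT_GOLF_JUDGE_BACKEND"] = "mock"

from prompt_golf_env.server.prompt_golf_environment import PromptGolfEnvironment
from prompt_golf_env.models import GolfAction

env = PromptGolfEnvironment()
obs = env.reset(task="sentiment_basic", seed=0)
print(f"task: {obs.task_id} | budget: {obs.prompt_budget_tokens} tokens")

# Score one hand-written prompt against the mock target.
result = env.step(GolfAction(
    prompt="Classify the sentiment as positive, negative, or neutral. Output the label only."
))
print(f"reward={result.reward:+.3f} tokens={result.submitted_prompt_tokens}")
```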