Don Rishabh

ui: local Gradio demo app — verbose / untrained / trained side-by-side

1aee0c3 13 days ago

Prompt Golf — Local Demo UI

A single-file Gradio app that loads the target model locally and shows verbose, untrained, and trained prompts side by side. Pick a task and the app fills in the three prompts from the demo CSV. Type a test input, hit "Run target", and watch the target generate with all three prompts in one batched generation call.

Run

# from repo root
pip install gradio transformers torch  # if not already in your env
python ui/demo_app.py

Then open http://localhost:7860.

Configuration (env vars)

| Variable | Default | Description |
|---|---|---|
| `DEMO_TARGET_MODEL` | `meta-llama/Llama-3.2-3B-Instruct` | Target model loaded for live generation. |
| `DEMO_CSV` | `outputs/qwen_to_qwen_demo.csv` | Demo CSV produced by `training/build_before_after_csv.py`. Auto-fetches the published one from HF Hub if not local. |
| `HF_TOKEN` | — | Required to download Llama-3.2 (gated). |

Examples:

# Use the same-family Qwen demo CSV with Qwen3-1.7B target
DEMO_TARGET_MODEL=Qwen/Qwen3-1.7B python ui/demo_app.py

# Once the cross-family Qwen->Llama CSV is built, point at it:
DEMO_CSV=outputs/qwen_to_llama_demo.csv python ui/demo_app.py
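
As a rough illustration of how these variables could be resolved, here is a minimal sketch that reads them with the defaults from the table above. The function name `load_config` and the returned dictionary keys are hypothetical and are not taken from `ui/demo_app.py`; the auto-fetch from the HF Hub is left out.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "DEMO_TARGET_MODEL": "meta-llama/Llama-3.2-3B-Instruct",
    "DEMO_CSV": "outputs/qwen_to_qwen_demo.csv",
}


def load_config(env=None):
    """Resolve demo settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "target_model": env.get("DEMO_TARGET_MODEL", DEFAULTS["DEMO_TARGET_MODEL"]),
        "csv_path": env.get("DEMO_CSV", DEFAULTS["DEMO_CSV"]),
        # No default: stays None when unset, which only matters for gated models.
        "hf_token": env.get("HF_TOKEN"),
    }


if __name__ == "__main__":
    print(load_config())
```

Passing a plain dictionary instead of `os.environ` makes the lookup easy to check without touching the real environment.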

What you see

[ task dropdown — sorted by reward gain ]

┌────────── VERBOSE ─────────┐ ┌──── UNTRAINED (base) ────┐ ┌──── TRAINED ────┐
│ human-written, 200-500 tok │ │ raw Qwen3 agent          │ │ Qwen3 + LoRA    │
│ tokens │ accuracy          │ │ tokens │ accuracy        │ │ tokens │ acc.   │
└────────────────────────────┘ └──────────────────────────┘ └─────────────────┘

[ test input — type your own ]
[ Run target with all three prompts ]

┌─── target output: VERBOSE ──┐ ┌── target output: UNTRAINED ──┐ ┌── target: TRAINED ──┐

batched in 1.4s | verbose: 250 tok | untrained: 35 tok | trained: 12 tok

Performance notes

Batched generation: all three prompts go through one model.generate() call with left-padding, so the three-prompt round trip costs about the same as a single inference (roughly 1-2 s on a 4090 / L40S in bf16, about 3-5 s on M-series Macs using MPS).
Greedy decoding (temperature=0) for reproducibility.
No vLLM by design: vLLM's throughput gains only show up with concurrent users. For a single-presenter demo, vanilla transformers keeps setup minimal and works on Mac.
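
The batching idea above can be sketched as follows. This is an illustrative sketch, not the code in `ui/demo_app.py`: `left_pad` is a pure-Python picture of what left-padding does to a batch, and `generate_batch` is a hypothetical helper. It assumes `model` and `tokenizer` are already-loaded Hugging Face objects and that the prompts have already been formatted for the target (for example with a chat template).

```python
def left_pad(sequences, pad_id):
    """Left-pad lists of token ids to a common length.

    Returns (ids, attention_mask). Padding goes on the left so that every
    row ends on its last real token, which is where a decoder-only model
    continues generating from.
    """
    width = max(len(seq) for seq in sequences)
    ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]
    mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in sequences]
    return ids, mask


def generate_batch(model, tokenizer, prompts, max_new_tokens=64):
    """Greedy-decode all prompts in one generate() call (hypothetical helper)."""
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        # Many causal LMs ship without a pad token; reusing EOS is a common workaround.
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    # do_sample=False selects greedy decoding in transformers.
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # Drop the (padded) prompt tokens and keep only the newly generated ones.
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    ids, mask = left_pad([[11, 12, 13], [21]], pad_id=0)
    print(ids)
    print(mask)
```

With left-padding, slicing off the first `input_ids.shape[1]` columns removes exactly the prompt for every row, which is why the three outputs can be decoded together.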

Hardware

| Setup | Llama-3.2-3B | Qwen3-1.7B |
|---|---|---|
| Mac M-series (MPS, 16 GB unified) | tight, works | comfortable |
| Mac M-series (32 GB+) | comfortable | comfortable |
| RTX 4090 / L40S (24 GB+) | comfortable | comfortable |
| T4 (16 GB) | tight; switch to Qwen target instead | comfortable |
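
For a rough sense of where these ratings come from, the sketch below estimates weight memory as parameter count times bytes per parameter. The parameter counts are nominal values read off the model names (an assumption, as the real counts differ slightly), and the estimate covers weights only, not the KV cache, activations, or framework overhead. Which dtype the app actually uses on each device is not stated here, so both bf16 and fp32 are shown.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_gb(n_params, dtype="bf16"):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names (assumption).
    for name, n_params in [("Llama-3.2-3B", 3.0e9), ("Qwen3-1.7B", 1.7e9)]:
        bf16 = weight_gb(n_params, "bf16")
        fp32 = weight_gb(n_params, "fp32")
        print(f"{name}: ~{bf16:.1f} GB in bf16, ~{fp32:.1f} GB in fp32")
```

Under these assumptions the 3B target needs about 6 GB for weights in bf16 and about 12 GB in fp32, so a 16 GB device has little headroom if the weights end up in full precision.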