# Prompt Golf — Local Demo UI
A single-file Gradio app that loads the target model locally and shows
**verbose vs untrained vs trained** prompts side by side. Pick a task
and the app fills in the three prompts from the demo CSV; type a test
input, hit "Run target", and watch the target generate with all three
prompts in one batched forward pass.
## Run
```bash
# from repo root
pip install gradio transformers torch # if not already in your env
python ui/demo_app.py
```
Then open `http://localhost:7860`.
## Configuration (env vars)
| Var | Default | Description |
|---|---|---|
| `DEMO_TARGET_MODEL` | `meta-llama/Llama-3.2-3B-Instruct` | Target model loaded for live generation. |
| `DEMO_CSV` | `outputs/qwen_to_qwen_demo.csv` | Demo CSV produced by `training/build_before_after_csv.py`. Auto-fetches the published one from HF Hub if not local. |
| `HF_TOKEN` | — | Required to download Llama-3.2 (gated). |
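A minimal sketch of how the app can resolve these settings, assuming plain `os.environ` lookups; the HF Hub `repo_id` below is a hypothetical placeholder, not the real published dataset:

```python
import os
from pathlib import Path

from huggingface_hub import hf_hub_download

TARGET_MODEL = os.environ.get("DEMO_TARGET_MODEL", "meta-llama/Llama-3.2-3B-Instruct")
CSV_PATH = os.environ.get("DEMO_CSV", "outputs/qwen_to_qwen_demo.csv")


def resolve_csv(path: str) -> str:
    """Use the local CSV if present, else fetch the published copy from HF Hub."""
    if Path(path).exists():
        return path
    return hf_hub_download(
        repo_id="your-org/prompt-golf-demo",  # hypothetical placeholder
        filename=Path(path).name,
        repo_type="dataset",
        token=os.environ.get("HF_TOKEN"),
    )
```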
Examples:
```bash
# Use the same-family Qwen demo CSV with Qwen3-1.7B target
DEMO_TARGET_MODEL=Qwen/Qwen3-1.7B python ui/demo_app.py
# Once the cross-family Qwen->Llama CSV is built, point at it:
DEMO_CSV=outputs/qwen_to_llama_demo.csv python ui/demo_app.py
```
## What you see
```
[ task dropdown — sorted by reward gain ]
┌────────── VERBOSE ─────────┐ ┌──── UNTRAINED (base) ────┐ ┌──── TRAINED ────┐
│ human-written, 200-500 tok │ │ raw Qwen3 agent │ │ Qwen3 + LoRA │
│ tokens │ accuracy │ │ tokens │ accuracy │ │ tokens │ acc. │
└────────────────────────────┘ └──────────────────────────┘ └─────────────────┘
[ test input — type your own ]
[ Run target with all three prompts ]
┌─── target output: VERBOSE ──┐ ┌── target output: UNTRAINED ──┐ ┌── target: TRAINED ──┐
batched in 1.4s | verbose: 250 tok | untrained: 35 tok | trained: 12 tok
```
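For orientation, that layout maps onto a `gr.Blocks` skeleton roughly like this (a sketch, not the actual `demo_app.py`; the `run_target` stub and component names are illustrative):

```python
import gradio as gr


def run_target(verbose, untrained, trained, test_input):
    # Stub: the real app batches all three prompts through one generate() call.
    return "...", "...", "..."


with gr.Blocks(title="Prompt Golf demo") as demo:
    task = gr.Dropdown(label="Task (sorted by reward gain)", choices=["task placeholder"])
    with gr.Row():
        p_verbose = gr.Textbox(label="VERBOSE (human-written, 200-500 tok)", lines=8)
        p_untrained = gr.Textbox(label="UNTRAINED (raw Qwen3 agent)", lines=8)
        p_trained = gr.Textbox(label="TRAINED (Qwen3 + LoRA)", lines=8)
    test_input = gr.Textbox(label="Test input (type your own)")
    run_btn = gr.Button("Run target with all three prompts")
    with gr.Row():
        o_verbose = gr.Textbox(label="Target output: VERBOSE")
        o_untrained = gr.Textbox(label="Target output: UNTRAINED")
        o_trained = gr.Textbox(label="Target output: TRAINED")
    run_btn.click(run_target,
                  inputs=[p_verbose, p_untrained, p_trained, test_input],
                  outputs=[o_verbose, o_untrained, o_trained])

demo.launch()  # serves on http://localhost:7860 by default
```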
## Performance notes
- **Batched generation**: all three prompts go through one
  `model.generate()` call with left-padding, so the 3-prompt round trip
  costs about the same as a single inference (≈ 1-2 sec on a 4090 / L40S
  at bf16, ~3-5 sec on M-series Macs via MPS); see the sketch after this
  list.
- Greedy decoding (`do_sample=False`, i.e. temperature 0) for reproducibility.
- No vLLM by design — vLLM's throughput wins kick in with concurrent
users; for a single-presenter demo, vanilla `transformers` keeps
setup minimal and works on Mac.
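
A minimal sketch of the batched call, assuming a plain `transformers` setup (prompt strings and `max_new_tokens` are illustrative, not the actual `demo_app.py` internals):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"  # or whatever DEMO_TARGET_MODEL resolves to
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # left-pad so every sequence ends at the same position
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

device = "cuda" if torch.cuda.is_available() else (
    "mps" if torch.backends.mps.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# Three prompts, one shared test input, one batched generate() call.
test_input = "2, 4, 8, 16, ?"  # illustrative
verbose, untrained, trained = "...", "...", "..."  # the three prompts from the demo CSV
batch = tokenizer([f"{p}\n\n{test_input}" for p in (verbose, untrained, trained)],
                  return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=256, do_sample=False,  # greedy
                         pad_token_id=tokenizer.pad_token_id)

# With left padding all prompts end at the same index, so slicing off the
# prompt tokens is one clean cut across the batch.
completions = tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:],
                                     skip_special_tokens=True)
```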
## Hardware
| Setup | Llama-3.2-3B | Qwen3-1.7B |
|---|---|---|
| Mac M-series (MPS, 16 GB unified) | tight, works | comfortable |
| Mac M-series (32 GB+) | comfortable | comfortable |
| RTX 4090 / L40S (24 GB+) | comfortable | comfortable |
| T4 (16 GB) | tight; switch to Qwen target instead | comfortable |