---
title: WhipStudio Env
emoji: 🔧
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
base_path: /ui/
---

# WhipStudio: ML Debug Arena

An OpenEnv-compatible RL environment in which agents debug broken PyTorch training scripts. It features **6 debugging tasks**, each scored with a continuous reward between 0.0 and 1.0.

## Overview

WhipStudio presents agents with broken ML training code and challenges them to fix it. To succeed, an agent must diagnose the bugs, fix all of them, and meet the performance thresholds.

## Tasks

| Task | Difficulty | Bug Type |
|------|------------|----------|
| task1 | Easy | Wrong optimizer order + bad learning rate |
| task2 | Medium | Silent NaN from log(0) |
| task3 | Medium | Label inversion |
| task4 | Medium | Wrong loss function |
| task5 | Medium | Frozen backbone |
| task6 | Hard | I/O mismatch (4 bugs) |
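
To make one of these bug classes concrete, here is a minimal sketch of how a silent NaN can arise from log(0) (the task2 category) and one common guard against it. It is written in plain Python rather than PyTorch to stay dependency-free. The names `naive_bce` and `stable_bce` and the epsilon value are illustrative assumptions, not code taken from the WhipStudio task scripts.

```python
import math


def _log(x: float) -> float:
    """Mimic tensor-library semantics: log(0) yields -inf instead of raising."""
    return math.log(x) if x > 0.0 else float("-inf")


def naive_bce(y: float, p: float) -> float:
    """Binary cross-entropy with no guard.

    When a zero-weighted term meets log(0), 0 * -inf evaluates to NaN,
    and no exception is raised.
    """
    return -(y * _log(p) + (1.0 - y) * _log(1.0 - p))


def stable_bce(y: float, p: float, eps: float = 1e-7) -> float:
    """Clamp p into [eps, 1 - eps] before taking logs (eps is an assumed value)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```

With a perfectly confident correct prediction (`y = 1.0`, `p = 1.0`), `naive_bce` returns NaN while `stable_bce` returns a small finite loss. Whether the actual task2 script fails in exactly this way is not stated in this README.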

## Quick Start

### Run Locally

```bash
# Install dependencies
pip install -r server/requirements.txt

# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Run Inference

```bash
# Set required environment variables
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-Coder-32B-Instruct"
export HF_TOKEN="your_token"

# Run inference against the local server
python inference.py --env-url http://localhost:7860
```
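
Since the inference run depends on all three environment variables being set, it can help to validate them before any request is made. The following is a minimal sketch of such a check. `InferenceConfig` and `load_inference_config` are hypothetical names, and the real `inference.py` may read its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names taken from the README; everything else here is illustrative.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str


def load_inference_config(env: Optional[Mapping[str, str]] = None) -> InferenceConfig:
    """Read the required variables, failing fast with one message listing all gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to unit-test.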

### Docker

```bash
docker build -t whipstudio .
docker run -p 7860:7860 whipstudio
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start a new episode with `{"task_id": "task1"}` |
| `/step` | POST | Submit a fix with `{"action": {"action_type": "submit_fix", "fixed_code": "..."}}` |
| `/state` | GET | Get the current session state |
| `/tasks` | GET | List available tasks |
| `/health` | GET | Health check (returns 200) |
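
As a rough illustration of how a client might talk to these endpoints, the sketch below builds the two documented request bodies and sends them with the standard library. The helper names (`reset_payload`, `step_payload`, `call`) are hypothetical, and the sketch assumes that responses are JSON. The README does not document the response schema, so the decoded body is returned as-is.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def reset_payload(task_id: str) -> Dict[str, Any]:
    """Body for POST /reset, as documented in the endpoint table."""
    return {"task_id": task_id}


def step_payload(fixed_code: str) -> Dict[str, Any]:
    """Body for POST /step carrying a submit_fix action."""
    return {"action": {"action_type": "submit_fix", "fixed_code": fixed_code}}


def call(base_url: str, path: str,
         payload: Optional[Dict[str, Any]] = None,
         timeout: float = 30.0) -> Any:
    """POST JSON when a payload is given, otherwise GET.

    Assumes the server answers with a JSON body.
    """
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="GET" if payload is None else "POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a locally running server (not executed here):
#   call("http://localhost:7860", "/reset", reset_payload("task1"))
#   call("http://localhost:7860", "/step", step_payload(fixed_source))
#   call("http://localhost:7860", "/tasks")
```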

## Inference Output Format

The `inference.py` script emits structured logs:

```
[START] task_id=task1
[STEP] task_id=task1 step=1 action=submit_fix(1234chars) reward=0.4500 done=true
[END] task_id=task1 final_score=0.4500
```
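
Because each log line is a tag followed by `key=value` fields, it is easy to parse for downstream scoring. The sketch below is one possible parser, assuming that field values never contain whitespace (true of the sample lines above, but not guaranteed by this README). `parse_log_line` is a hypothetical helper, not part of the project.

```python
import re
from typing import Dict, Optional

# Matches the three documented tags followed by the rest of the line.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s*(.*)$")


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the event tag plus its key=value fields, or None for other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {"event": match.group(1)}
    for field in match.group(2).split():
        key, sep, value = field.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            record[key] = value
    return record
```

All values are kept as strings, so callers convert fields such as `reward` or `final_score` with `float(...)` as needed.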

## Project Structure

```
whipstudio/
├── server/
│   ├── app.py            # FastAPI application
│   ├── environment.py    # OpenEnv environment
│   └── tasks/            # Task definitions + graders
├── inference.py          # Hackathon inference script
├── models.py             # Pydantic schemas
├── openenv.yaml          # OpenEnv specification
├── Dockerfile
└── README.md
```

## Hackathon Compliance

- ✅ HF Space deploys and responds to `/health` (200)
- ✅ OpenEnv spec compliance (`openenv.yaml`, typed models, `/reset`, `/step`, `/state`)
- ✅ Dockerfile builds
- ✅ `inference.py` uses the OpenAI client with `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
- ✅ Structured stdout logs: `[START]`, `[STEP]`, `[END]`
- ✅ 6 tasks with graders returning scores in the 0.0-1.0 range
- ✅ Runtime < 20 min; runs on vcpu=2, memory=8gb

## License

Apache-2.0