---
title: LLMServeEnv
emoji: 🚀
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - llm-serving
---

# LLMServeEnv

OpenEnv-compliant RL environment for learning LLM inference serving policies under latency, memory, and cost constraints.

## Hackathon Submission Rules This Repo Targets

This repository is structured around the Round 1 automated gate. The submission-critical requirements are treated as non-optional:

- full OpenEnv compliance with typed `Action`, `Observation`, and reward-bearing trajectory behavior
- working `reset()`, `step()`, `state()`, `/tasks`, `/grader`, and `/baseline`
- valid `openenv.yaml`
- reproducible baseline inference path using the official OpenAI client and `OPENAI_API_KEY`
- clean Docker build for Hugging Face Docker Spaces
- built-in OpenEnv web interface available at `/web`

If any of those fail, the environment is effectively non-submittable.

## Environment Summary

LLMServeEnv models the control problem faced by LLM serving systems: an agent must choose batching, KV cache allocation, speculative decoding depth, quantization, and routing policies while serving changing request traffic. The environment rewards policies that improve throughput without violating latency SLOs, memory budgets, or cost constraints.

### RL-First Architecture

This environment is designed as a genuine reinforcement learning challenge: non-stationary workloads and interdependent resource trade-offs mean a fixed hand-coded heuristic policy (such as Orca- or vLLM-style rules) cannot solve it optimally. The reference PPO agent trained in this environment consistently outperforms the hand-coded heuristic baselines (see the benchmark table below).

The environment is CPU-simulated and deterministic under fixed seeds, which keeps RL experimentation and grader evaluation reproducible.
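
A quick way to sanity-check that determinism is to reset twice with the same seed and compare the initial observations. This is a minimal sketch using the Python client shown later in this README (the client API is assumed to match the "Python Client Example" section; field names follow the observation list below):

```python
from llmserve_env import LLMServeEnv

env = LLMServeEnv.from_url("http://localhost:7860")

# Two resets with the same task and seed should start from identical
# simulator state, so the initial observations should agree field for field.
obs_a = env.reset(task_id="bursty_workload", seed=7)
obs_b = env.reset(task_id="bursty_workload", seed=7)

assert obs_a.queue_depth == obs_b.queue_depth
assert obs_a.request_arrival_rate == obs_b.request_arrival_rate
```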

## Action Space

`ServeAction` is the full serving configuration applied to the next simulation window.

| Field | Type | Range | Meaning |
| --- | --- | --- | --- |
| `batch_cap` | `int` | `1..512` | Maximum requests batched at once |
| `kv_budget_fraction` | `float` | `0.1..1.0` | Relative KV cache budget |
| `speculation_depth` | `int` | `0..8` | Draft-token depth for speculation |
| `quantization_tier` | `enum` | `FP16`, `INT8`, `INT4` | Serving precision tier |
| `prefill_decode_split` | `bool` | `true/false` | Whether prefill/decode are disaggregated |
| `priority_routing` | `bool` | `true/false` | Whether priority traffic routing is enabled |

## Observation Space

`ServeObservation` reports queue state, latency, throughput, memory, and per-step reward metadata.

Key fields:

- `queue_depth`
- `active_requests`
- `kv_cache_occupancy`
- `mean_prompt_length`
- `p50_ttft_ms`
- `p99_ttft_ms`
- `p50_itl_ms`
- `throughput_tps`
- `slo_compliance_rate`
- `gpu_memory_used_gb`
- `estimated_cost_per_1k`
- `request_arrival_rate`
- `spec_acceptance_rate`
- `eviction_events`
- `step_index`
- `task_id`
- `reward`
- `done`
- `metadata`
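
As an illustration of how these fields are consumed, here is a minimal sketch that reads a few latency and throughput signals after a reset (it assumes the client exposes observation fields as attributes, as in the "Python Client Example" section below):

```python
from llmserve_env import LLMServeEnv

env = LLMServeEnv.from_url("http://localhost:7860")
obs = env.reset(task_id="static_workload", seed=0)

# Signals a serving policy typically conditions on.
print("queue depth:", obs.queue_depth)
print("p99 TTFT (ms):", obs.p99_ttft_ms)
print("throughput (tok/s):", obs.throughput_tps)
print("SLO compliance:", obs.slo_compliance_rate)
```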

## Tasks

The environment ships with three validator-facing tasks and deterministic graders.

### `static_workload` (easy)

- stable request rate
- short prompts
- teaches basic batching and KV budget tradeoffs

### `bursty_workload` (medium)

- bursty arrival process
- higher queue volatility
- requires adaptive latency-throughput balance

### `adversarial_multitenant` (hard)

- mixed prompt lengths
- sharp traffic spikes
- priority workload pressure and tighter resource stress
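
The three tasks above can be enumerated over HTTP. A small sketch, assuming `GET /tasks` returns JSON metadata for each task:

```python
import requests

# List the validator-facing tasks exposed by a running instance.
resp = requests.get("http://localhost:7860/tasks", timeout=10)
resp.raise_for_status()
print(resp.json())  # expected to cover static_workload, bursty_workload, adversarial_multitenant
```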

## Grading and Reward Design

- rewards are shaped at every step, not only at episode end
- reward combines throughput, SLO compliance, memory pressure, and cost behavior
- graders return final scores in `[0.0, 1.0]`
- grading is deterministic for the same episode log

`/grader` can grade either:

- the current completed in-memory episode
- an explicitly provided `episode_log`
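
Both modes can be exercised with plain HTTP calls. This is a hedged sketch: the endpoint and the two modes come from this README, but the exact request schema (including the shape of `episode_log`) is an assumption and should be checked against `GET /schema`:

```python
import requests

BASE = "http://localhost:7860"

# Grade the most recently completed in-memory episode.
print(requests.post(f"{BASE}/grader", json={}, timeout=30).json())

# Grade an explicitly provided episode log (payload shape assumed for illustration).
episode_log = {"task_id": "static_workload", "steps": []}
print(requests.post(f"{BASE}/grader", json={"episode_log": episode_log}, timeout=30).json())
```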

## Canonical Runtime Surface

The canonical runtime is the root Docker image serving `server.app:app` on port `7860`.

Required endpoints exposed by the app:

- `GET /health`
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `GET /tasks`
- `POST /grader`
- `GET /baseline`
- `GET /web`
- `GET /demo` -> redirects to `/web`

The built-in OpenEnv UI is available at `/web`. That is the recommended interface for judges and team debugging. There is no custom frontend in the submission-critical path.
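
A raw-HTTP walkthrough of the core loop might look like the sketch below. The field names mirror `ServeAction` and `ServeObservation` from this README, but the exact wire format (for example, whether `/step` wraps the action in an `"action"` key) is an assumption; consult `GET /schema` for the authoritative contract:

```python
import requests

BASE = "http://localhost:7860"

print(requests.get(f"{BASE}/health", timeout=10).json())

# Start an episode on the easy task with a fixed seed.
obs = requests.post(f"{BASE}/reset",
                    json={"task_id": "static_workload", "seed": 42},
                    timeout=30).json()

# Apply one serving configuration for the next simulation window.
action = {
    "batch_cap": 32,
    "kv_budget_fraction": 1.0,
    "speculation_depth": 0,
    "quantization_tier": "FP16",
    "prefill_decode_split": False,
    "priority_routing": False,
}
obs = requests.post(f"{BASE}/step", json={"action": action}, timeout=30).json()
print(obs)
```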

## Local Development

### Install

```bash
uv sync --frozen
pip install openenv
```

### Run the app

```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Runtime modes

Simulator mode is the default:

```bash
LLMSERVE_MODE=sim uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Real mode executes actual OpenAI requests during each environment `step()`:

```bash
export OPENAI_API_KEY=your_key_here
LLMSERVE_MODE=real \
LLMSERVE_REAL_PROVIDER=openai \
LLMSERVE_REAL_MODEL=gpt-4.1-mini \
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Useful real-mode tuning env vars:

- `LLMSERVE_REAL_MAX_REQUESTS_PER_STEP`
- `LLMSERVE_REAL_MAX_PROMPT_TOKENS`
- `LLMSERVE_REAL_MAX_COMPLETION_TOKENS`

### OpenEnv validation

```bash
openenv validate
```

### Run tests

```bash
pytest -q
```

## RL Agent Training & Benchmarks

A lightweight PyTorch PPO agent is included and can be trained directly on the tasks using only a CPU.

```bash
# Train on the hardest adversarial task
python train.py --task adversarial_multitenant --steps 120000 --seed 0

# Evaluate trained weights to view benchmark scores
python evaluate.py --agent ppo --task all --episodes 20
```

### Reference Benchmark

The trained PPO agent consistently outperforms the reference hand-coded heuristics and generic LLM control policies:

| Agent | Task 1 (Static) | Task 2 (Bursty) | Task 3 (Adversarial) |
|---|---|---|---|
| Random | ~0.05 | ~0.03 | ~0.02 |
| Heuristic (Orca+vLLM+Decima) | ~0.30 | ~0.25 | ~0.20 |
| Trained PPO | **~0.55** | **~0.48** | **~0.38** |

## Canonical Docker Build

Use the root `Dockerfile` as the canonical submission image.

```bash
docker build -t llmserve-env .
docker run --rm -p 7860:7860 llmserve-env
```

For a fully clean rebuild, use:

```bash
docker build --no-cache -t llmserve-env .
```

Then verify:

- API: `http://localhost:7860/health`
- OpenEnv UI: `http://localhost:7860/web`

The root `Dockerfile` builds a CPU-only image and packages the tracked `weights/` directory into the container. That is the Dockerfile used for local verification and Hugging Face submission. `server/Dockerfile` is kept only as a compatibility mirror.

## Baseline Inference

The submission requires an OpenAI-backed baseline path. This repo supports two baseline modes:

- deterministic local baseline for reproducible internal sanity checks
- OpenAI baseline for submission compliance

### Deterministic baseline

Runs entirely against the local simulator with no external model calls.

```bash
python -m server.baseline_inference --mode deterministic
```

### OpenAI baseline

This is the submission-facing baseline path. It uses the official OpenAI client and reads credentials from `OPENAI_API_KEY`.

```bash
export OPENAI_API_KEY=your_key_here
python -m server.baseline_inference --mode openai --runtime in-process --model gpt-4.1-mini
```

That standalone path is the safest submission artifact because it does not assume a separate local server is already running.

To run against a live local or deployed endpoint instead:

```bash
python -m server.baseline_inference \
  --mode openai \
  --runtime http \
  --base-url http://localhost:7860 \
  --model gpt-4.1-mini
```

You can also write the results to disk:

```bash
python -m server.baseline_inference \
  --mode openai \
  --runtime in-process \
  --model gpt-4.1-mini \
  --output artifacts/baseline_openai.json
```

The `/baseline` endpoint exposes the same logic:

- `GET /baseline` -> deterministic suite
- `GET /baseline?use_openai=true` -> OpenAI suite, requires `OPENAI_API_KEY`

The endpoint uses the in-process environment so it does not depend on the server making HTTP calls to itself.
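
Calling both modes from Python is straightforward; a small sketch using `requests` (timeouts are illustrative):

```python
import requests

BASE = "http://localhost:7860"

# Deterministic local suite: no external model calls.
print(requests.get(f"{BASE}/baseline", timeout=120).json())

# OpenAI-backed suite: the server process must have OPENAI_API_KEY set.
print(requests.get(f"{BASE}/baseline", params={"use_openai": "true"}, timeout=300).json())
```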

## Python Client Example

```python
from llmserve_env import LLMServeEnv

env = LLMServeEnv.from_url("http://localhost:7860")
observation = env.reset(task_id="static_workload", seed=42)

while not observation.done:
    # A fixed baseline configuration; a trained policy would adapt these fields per step.
    action = {
        "batch_cap": 32,
        "kv_budget_fraction": 1.0,
        "speculation_depth": 0,
        "quantization_tier": "FP16",
        "prefill_decode_split": False,
        "priority_routing": False,
    }
    observation, reward, done, info = env.step(action)

grader_result = env.grade()
print(grader_result)
```

## Hugging Face Space Deployment

Deploy as a Docker Space and keep the Space tagged with `openenv`.

Recommended deployment path:

1. Push this repository to the Space.
2. Use the root `Dockerfile`.
3. Set the Space port to `7860`.
4. Make sure the repository includes the `weights/` directory; the Docker image copies those model files at build time.
5. Add `OPENAI_API_KEY` as a secret only if you want the OpenAI baseline endpoint to run in the deployed Space.
6. After deployment, verify (a scripted check is sketched below):
   - `/health`
   - `/tasks`
   - `/web`
   - `/reset`
   - `/baseline`

For the built-in OpenEnv UI, the deployed URL should serve `/web` successfully. `/demo` exists only as a redirect for compatibility.
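
The verification in step 6 can be scripted. A minimal sketch, assuming `requests` is available and the placeholder URL is replaced with the real Space URL (the `/reset` payload shape is also an assumption):

```python
import requests

SPACE_URL = "https://your-space-name.hf.space"  # placeholder, substitute your Space

# GET endpoints should answer with 200.
for path in ("/health", "/tasks", "/baseline"):
    r = requests.get(SPACE_URL + path, timeout=120)
    print(path, r.status_code)

# /web should serve the built-in OpenEnv UI as HTML rather than JSON.
r = requests.get(SPACE_URL + "/web", timeout=60)
print("/web", r.status_code, r.headers.get("content-type"))

# /reset is a POST endpoint (payload shape assumed).
r = requests.post(SPACE_URL + "/reset",
                  json={"task_id": "static_workload", "seed": 0}, timeout=60)
print("/reset", r.status_code)
```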

## Pre-Submission Checklist

Run the local checks:

```bash
pytest -q
openenv validate
docker build -t llmserve-env .
```

Run the consolidated helper:

```bash
python scripts/pre_submission_check.py --skip-docker
```

Run the full helper once Docker is available:

```bash
python scripts/pre_submission_check.py --space-url https://your-space-name.hf.space
```

Run the OpenAI baseline verification:

```bash
export OPENAI_API_KEY=your_key_here
python scripts/pre_submission_check.py \
  --run-openai-baseline \
  --baseline-runtime in-process \
  --model gpt-4.1-mini
```

## What Still Requires Real Credentials or Deployment Access

These checks cannot be completed from a code-only scaffold:

- a real `OPENAI_API_KEY` to execute the submission baseline end to end
- a real Hugging Face Space URL to verify `/web` and validator-facing endpoints after deployment
- Docker daemon access on the machine that will perform the final build check

Everything else in this repo is designed so those last-mile checks are the only external dependencies left.