InosLihka committed on
Commit 69310d6 · 1 Parent(s): c07f15e

Add SPEC.md and hackathon reference docs
SPEC.md ADDED
@@ -0,0 +1,554 @@
# Build a Complete OpenEnv Environment: RhythmEnv

## Context

You are an expert software engineer tasked with building a **complete, production-grade OpenEnv environment** for a Meta x Hugging Face hackathon.

You have access to:

* The OpenEnv repository (including examples)
* Validation tools (`openenv validate`)
* Docker and Hugging Face Spaces

You must:

* Use this specification as a **strong foundation**
* Cross-reference with existing OpenEnv examples
* Improve design decisions where appropriate
* Ensure strict compliance with OpenEnv standards

Do NOT blindly follow instructions — refine and correct based on best practices observed in the repo.

---

# Objective

Build an environment called:

## **RhythmEnv**

> A deterministic reinforcement learning environment that simulates daily planning and execution under constraints like time, energy, deadlines, and task importance.

This environment should allow agents to learn:

* prioritization
* scheduling
* energy management
* decision-making under trade-offs

---

# Core Requirements (MANDATORY)

---

## 1. OpenEnv Spec Compliance

You MUST:

* Implement typed Pydantic models:

  * Observation
  * Action
  * Reward

* Implement:

  * `reset()`
  * `step(action)`
  * `state()`

* Include:

  * `openenv.yaml`
  * environment metadata

* Pass:

  ```bash
  openenv validate
  ```

---

## 2. Real-World Task

The environment simulates:

> “Given a set of tasks, deadlines, and constraints, plan and execute optimally over a day.”

This is NOT a game. It must feel like a real productivity system.

---

## 3. Determinism (CRITICAL)

* No randomness anywhere
* All transitions are pure functions
* Same input → same output, always

---

## 4. Episode Design

* 1 episode = 1 day
* 1 step = 30 minutes
* Total steps = ~20

---

## 5. Action Space (STRICT)

No free-form actions.

Define structured actions:

* START_TASK(task_id)
* CONTINUE_TASK()
* SWITCH_TASK(task_id)
* TAKE_BREAK(duration)

Validate all actions strictly.

---
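A minimal sketch of how these structured actions could be typed and validated (assuming Pydantic v2; the class names, `kind` discriminator, and duration bounds are illustrative, not prescribed):

```python
from typing import Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class StartTask(BaseModel):
    kind: Literal["START_TASK"] = "START_TASK"
    task_id: str


class ContinueTask(BaseModel):
    kind: Literal["CONTINUE_TASK"] = "CONTINUE_TASK"


class SwitchTask(BaseModel):
    kind: Literal["SWITCH_TASK"] = "SWITCH_TASK"
    task_id: str


class TakeBreak(BaseModel):
    kind: Literal["TAKE_BREAK"] = "TAKE_BREAK"
    duration: int = Field(ge=1, le=4)  # measured in 30-minute steps


# Union of the four allowed actions; anything else fails validation
RhythmAction = Union[StartTask, ContinueTask, SwitchTask, TakeBreak]
ACTION_ADAPTER = TypeAdapter(RhythmAction)


def parse_action(payload: dict) -> RhythmAction:
    """Reject malformed payloads at the environment boundary."""
    return ACTION_ADAPTER.validate_python(payload)
```

Parsing at the boundary means `step()` never sees a free-form or malformed action.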

## 6. Observation Space

Must include:

* current timestep
* energy (0–1)
* stress (0–1)
* current task (optional)
* tasks list
* calendar (meetings)
* remaining steps

### Task fields:

* id
* effort (0–1)
* progress (0–1)
* deadline (timestep)
* importance (0–1)

---

## 7. Environment Dynamics

---

### Task Progress

* Only when working on a task
* Scales with energy

Example baseline:

```
progress_delta = k * energy
```

---

### Energy

* decreases during work
* increases during breaks
* slight decay during idle/switch

Clamp to [0, 1].

---

### Stress

* increases when:

  * deadlines are missed
  * too many tasks are pending
* decreases during breaks

---

### Meetings

* block task progress
* slightly reduce energy

---
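The dynamics above can be sketched as a single pure transition function. All the constants here are illustrative only; the spec just requires the stated directions of change, determinism, and clamping:

```python
def step_dynamics(
    energy: float,
    stress: float,
    action: str,          # "work", "break", or anything else (idle/switch)
    in_meeting: bool,
    k: float = 0.15,      # progress coefficient (illustrative)
) -> tuple[float, float, float]:
    """One deterministic 30-minute transition: returns (energy, stress, progress_delta)."""
    clamp = lambda x: min(max(x, 0.0), 1.0)
    progress_delta = 0.0
    if action == "work" and not in_meeting:
        progress_delta = k * energy   # progress scales with energy
        energy -= 0.08                # work drains energy
    elif action == "break":
        energy += 0.12                # breaks restore energy
        stress -= 0.05                # and reduce stress
    else:
        energy -= 0.02                # slight decay during idle/switch (or blocked work)
    if in_meeting:
        energy -= 0.03                # meetings slightly reduce energy
    return clamp(energy), clamp(stress), progress_delta
```

Because the function is pure and has no randomness, the same `(state, action)` pair always yields the same transition, as section 3 demands.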

## 8. Hidden Internal Mode (IMPORTANT)

Implement a latent mode (NOT exposed to the agent):

* deep_work
* execution
* balanced

Derived deterministically from state.

Used to:

* slightly influence reward
* create a richer learning signal

---
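One possible deterministic derivation (the thresholds and the `deadline_pressure` input are illustrative assumptions, not part of the spec):

```python
def derive_mode(energy: float, deadline_pressure: float) -> str:
    """Latent mode derived deterministically from visible state.

    deadline_pressure in [0, 1]: e.g. the fraction of pending work due soon.
    """
    if energy >= 0.7 and deadline_pressure < 0.3:
        return "deep_work"   # fresh and unhurried: favor sustained focus
    if deadline_pressure >= 0.6:
        return "execution"   # deadlines looming: favor fast completion
    return "balanced"
```

Since the mode is a function of state only, it never introduces randomness; it can scale a reward term by a small mode-dependent factor.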

# 🔴 Reward & Grader Design Contract (STRICT)

This is the MOST IMPORTANT part of the system.

---

## Reward Design Requirements

---

### 1. Dense & Informative

* Every step must produce meaningful reward
* No flat or zero-reward sequences

---

### 2. Monotonic Progress

* More progress → higher reward
* Regressions → penalties

---

### 3. Multi-Component Reward

Reward MUST include:

### Positive:

* task progress
* task completion (scaled by importance)

### Negative:

* stress penalty
* missed deadlines
* excessive switching
* inefficiency
* no-op behavior

---

### 4. Anti-Exploitation

Explicitly prevent:

* infinite loops
* repeated switching
* spamming breaks
* idle actions

Add:

* penalties
* diminishing returns
* constraints

---

### 5. Bounded Reward

* Prevent reward explosion
* Keep values stable and comparable

---
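Requirements 3–5 can be combined in one clipped, weighted sum. The weights, the break-penalty schedule, and the ±2.0 bound below are illustrative assumptions only:

```python
def compute_reward(
    progress_delta: float,        # progress made this step
    completed_importance: float,  # importance of any task completed this step (0 if none)
    stress: float,
    switched: bool,
    consecutive_breaks: int,
) -> tuple[float, dict]:
    """Multi-component step reward, clipped to a stable range."""
    breakdown = {
        "progress": 1.0 * progress_delta,
        "completion_bonus": 2.0 * completed_importance,
        "stress_penalty": -0.3 * stress,
        "switch_penalty": -0.2 if switched else 0.0,
        # diminishing returns: every break after the first is penalized
        "inefficiency_penalty": -0.1 * max(consecutive_breaks - 1, 0),
    }
    total = max(min(sum(breakdown.values()), 2.0), -2.0)  # bound the step reward
    return total, breakdown
```

Returning the `breakdown` dict alongside the scalar also satisfies the Reward Breakdown requirement directly.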

### 6. Reward Breakdown (MANDATORY)

Return structured info:

```
{
  "progress": ...,
  "completion_bonus": ...,
  "stress_penalty": ...,
  "switch_penalty": ...,
  "inefficiency_penalty": ...
}
```

---

## Grader Design Requirements (CRITICAL)

Each task MUST include a deterministic grader.

---

### Requirements:

* Score range:

  ```
  0.0 ≤ score ≤ 1.0
  ```

* Deterministic
* Reproducible
* Continuous (not binary)

---

### Must Evaluate:

* task completion
* deadline adherence
* efficiency
* energy usage
* stress management

---

### Efficiency Metric (REQUIRED)

Define:

```
efficiency = optimal_steps / actual_steps
```

Use it in the grader.

---

### Normalization

Ensure:

* random agent → ~0.1–0.3
* baseline agent → ~0.4–0.6
* strong agent → ~0.7–1.0

---
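A grader satisfying these constraints might look like the following; the weights and input summaries are illustrative, and the normalization targets above still need to be verified empirically by running random and baseline policies:

```python
def grade_episode(
    completion: float,          # importance-weighted fraction of work done, [0, 1]
    deadline_adherence: float,  # fraction of deadlines met, [0, 1]
    optimal_steps: int,
    actual_steps: int,
    final_energy: float,        # [0, 1]
    peak_stress: float,         # [0, 1]
) -> float:
    """Continuous, deterministic episode score in [0, 1]."""
    efficiency = optimal_steps / max(actual_steps, 1)
    score = (
        0.40 * completion
        + 0.25 * deadline_adherence
        + 0.15 * min(efficiency, 1.0)   # cap so finishing "too fast" can't overflow
        + 0.10 * final_energy           # didn't burn out
        + 0.10 * (1.0 - peak_stress)    # kept stress under control
    )
    return min(max(score, 0.0), 1.0)
```

Every input is derived from the episode trace, so the same trajectory always grades identically, and the score is continuous rather than pass/fail.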

## Reward vs Grader Alignment

* Reward → guides learning
* Grader → evaluates outcome

They must align but NOT be identical.

---

## Anti-Exploitation Validation (MANDATORY)

Explicitly test:

* an agent spamming TAKE_BREAK
* an agent switching every step
* an agent doing nothing

Ensure these strategies score poorly.

---

## Logging (MANDATORY)

Return in `info`:

```
{
  "reward_breakdown": ...,
  "task_progress": ...,
  "deadline_status": ...
}
```

---

# 9. Tasks (3 Required)

---

## Task 1 — Easy (Single Priority)

* 3 tasks
* 1 clearly important
* no meetings
* high energy

Goal:

* complete the main task efficiently

---

## Task 2 — Medium (Deadline Pressure)

* multiple tasks
* tight deadlines
* at least one meeting

Goal:

* maximize completion before deadlines

---

## Task 3 — Hard (Energy Tradeoff)

* low energy
* one deep task
* multiple small tasks

Goal:

* balance:

  * rest
  * deep work
  * short tasks

---

# 10. Baseline Agent (`inference.py`)

---

## Requirements:

* Use the OpenAI client
* Read:

  * API_BASE_URL
  * MODEL_NAME
  * OPENAI_API_KEY

* Run all 3 tasks
* Output logs in the EXACT format:

  * `[START]`
  * `[STEP]`
  * `[END]`

---

## Baseline Strategy

Simple heuristic:

* pick the highest-importance task
* continue until it is done or its deadline passes
* take a break if energy is low

The baseline must be:

* non-trivial
* beatable

---
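The heuristic above can be sketched as a single policy function; the observation field names here are assumptions for illustration, not the required schema:

```python
def baseline_action(obs: dict) -> dict:
    """Greedy baseline: highest-importance pending task first, rest when drained.

    Field names ("energy", "tasks", "current_task", ...) are illustrative.
    """
    if obs["energy"] < 0.3:
        return {"kind": "TAKE_BREAK", "duration": 1}
    pending = [t for t in obs["tasks"] if t["progress"] < 1.0]
    if not pending:
        return {"kind": "TAKE_BREAK", "duration": 1}  # nothing left to do
    target = max(pending, key=lambda t: t["importance"])
    if obs.get("current_task") == target["id"]:
        return {"kind": "CONTINUE_TASK"}
    return {"kind": "START_TASK", "task_id": target["id"]}
```

This is deliberately beatable: it ignores deadlines, meetings, and switching costs, so a learned policy has clear room to improve.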

# 11. Code Structure

```
rhythm_env/
├── env.py
├── models.py
├── tasks/
├── graders/
├── utils/
├── openenv.yaml
├── inference.py
├── Dockerfile
└── README.md
```

---

# 12. README (MANDATORY)

Include:

* description
* motivation
* action space
* observation space
* reward design
* task descriptions
* grader explanation
* setup instructions
* baseline results

---

# 13. Validation Checklist

Before finishing:

* [ ] `openenv validate` passes
* [ ] Docker builds & runs
* [ ] HF Space responds to `reset()`
* [ ] All 3 tasks execute
* [ ] Graders return valid scores
* [ ] Baseline script runs in < 20 min
* [ ] Logs follow the required format

---

# 14. Iteration Requirement (MANDATORY)

After the initial implementation:

1. Run:

   * the baseline agent
   * a random policy

2. Compare scores

3. Adjust:

   * reward weights
   * penalties
   * grader scaling

DO NOT finalize without iteration.

---

# 15. Design Principles (FINAL)

---

## DO:

* Learn from OpenEnv examples
* Keep the environment deterministic
* Make trade-offs the core difficulty
* Keep state interpretable
* Ensure reward clarity

---

## DO NOT:

* introduce randomness
* hide critical information unnecessarily
* create sparse rewards
* build an overly complex simulation

---

# Final Goal

Produce an environment that:

* agents can learn from
* evaluators can trust
* demonstrates meaningful improvement across models
* reflects real-world decision-making

---

End of Prompt.
docs/problem_statement.md ADDED
@@ -0,0 +1,176 @@
# Round 1 — Problem Statement

## The Task

Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.

---

## Key Requirements at a Glance

- Must simulate a real-world task (not games or toys)
- Implement full OpenEnv spec: typed models, `step()`/`reset()`/`state()`, `openenv.yaml`
- Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)
- Meaningful reward function with partial progress signals
- Baseline inference script with reproducible scores
- Deploy to Hugging Face Spaces + working Dockerfile
- README with environment description, action/observation spaces, setup instructions

---

## Functional Requirements

### 1. Real-World Task Simulation
The environment must simulate a task humans actually do — not games or toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.

### 2. OpenEnv Spec Compliance
Implement the full OpenEnv interface:
- Typed `Observation`, `Action`, and `Reward` Pydantic models
- `step(action)` → returns observation, reward, done, info
- `reset()` → returns initial observation
- `state()` → returns current state
- `openenv.yaml` with metadata
- Tested via `openenv validate`

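As a dependency-free sketch of the interface shape only (a real implementation should subclass the openenv-core base types and use Pydantic models; `ToyEnv`, its fields, and its episode length are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Observation:
    timestep: int = 0
    done: bool = False


class ToyEnv:
    """Illustrates the step()/reset()/state() contract on a trivial 3-step episode."""

    def __init__(self) -> None:
        self._t = 0

    def reset(self) -> Observation:
        self._t = 0
        return Observation(timestep=0)

    def step(self, action: str) -> tuple[Observation, float, bool, dict]:
        self._t += 1
        obs = Observation(timestep=self._t, done=self._t >= 3)
        # info carries the structured reward breakdown required elsewhere
        return obs, 0.1, obs.done, {"reward_breakdown": {"base": 0.1}}

    def state(self) -> dict:
        return {"timestep": self._t}
```

The key points the validator checks are visible even at this scale: `reset()` produces a clean state, `step()` returns the four-tuple, and `state()` exposes the current state.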
### 3. Minimum 3 Tasks with Agent Graders
Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range from easy → medium → hard. Graders must have clear, deterministic success/failure criteria.

### 4. Meaningful Reward Function
- Provides signal over the full trajectory (not just binary end-of-episode)
- Rewards partial progress toward task completion
- Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions)

### 5. Baseline Inference Script
- Uses the OpenAI API client to run a model against the environment
- Reads API credentials from environment variables (`OPENAI_API_KEY`)
- Produces a reproducible baseline score on all 3 tasks

---

## Non-Functional Requirements

### 1. Hugging Face Space Deployment
The environment must run as a containerized HF Space tagged with `openenv`.

### 2. Containerized Execution
Must include a working Dockerfile. The environment should start cleanly with `docker build` + `docker run`.

### 3. Documentation
The README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, and baseline scores.

---

## Evaluation Criteria

| Parameter | Weight | Description |
|---|---|---|
| Real-world utility | 30% | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
| Task & grader quality | 25% | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
| Environment design | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries. |
| Code quality & spec compliance | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works. |
| Creativity & novelty | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach. |

---

## Scoring Breakdown

**Real-world utility (30%)**
- 0–5: Toy/artificial problem with no practical application
- 6–15: Valid domain but shallow modeling of the real task
- 16–25: Good domain modeling, would be useful for agent evaluation
- 26–30: Excellent — fills a real gap, immediate value for the RL/agent community

**Task & grader quality (25%)**
- 3+ tasks with a difficulty range?
- Graders produce scores between 0.0 and 1.0?
- Graders deterministic and reproducible?
- Hard task genuinely challenges frontier models?

**Environment design (20%)**
- `reset()` produces clean state?
- Action/observation types well-designed and documented?
- Reward function provides useful varying signal (not just sparse)?
- Episode boundaries sensible?

**Code quality & spec compliance (15%)**
- `openenv validate` passes?
- `docker build && docker run` works?
- HF Space deploys and responds?
- Baseline script runs and reproduces scores?

**Creativity & novelty (10%)**
- Domain we haven't seen in OpenEnv before?
- Reward design has interesting properties?
- Clever mechanics that make the environment engaging?

---

## How Judging Works

**Phase 1: Automated Validation**
Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.

**Phase 2: Agentic Evaluation**
Scored — the baseline agent is re-run, a standard open LLM agent (e.g. Nemotron 3 Super) is run against all environments, and score variance is checked.

**Phase 3: Human Review**
Top submissions are reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.

---

## Disqualification Criteria

- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score
- No baseline inference script

---

## Pre-Submission Checklist — all must pass or you're disqualified

- [ ] HF Space deploys — automated ping to the Space URL must return 200 and respond to `reset()`
- [ ] OpenEnv spec compliance — validate `openenv.yaml`, typed models, `step()`/`reset()`/`state()` endpoints
- [ ] Dockerfile builds — automated `docker build` on the submitted repo
- [ ] Baseline reproduces — run the submitted inference script; it must complete without error and produce scores
- [ ] 3+ tasks with graders — enumerate tasks, run each grader, verify scores are in the 0.0–1.0 range

---

## Mandatory Additional Instructions

Before submitting, ensure the following variables are defined in your environment configuration:

| Variable | Description |
|---|---|
| `API_BASE_URL` | The API endpoint for the LLM |
| `MODEL_NAME` | The model identifier to use for inference |
| `HF_TOKEN` | Your Hugging Face / API key |

- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use the OpenAI client for all LLM calls, using the above variables
- Participants must emit structured stdout logs strictly following the `[START]`, `[STEP]`, and `[END]` format (see `scripts/sample_inference.py`)

---

## Infrastructure Restrictions

- The inference script must run in under 20 minutes
- The environment and inference must run on a machine with 2 vCPUs and 8 GB of memory

---

## Setup Prerequisites

| Tool | Purpose | Install |
|---|---|---|
| Python 3.10+ | Runtime | `python --version` |
| Git + GitHub | Push submission | `git --version` |
| Hugging Face CLI | Deploy to HF Spaces | `pip install huggingface_hub` then `huggingface-cli login` |
| OpenEnv | The framework | `pip install openenv-core` |
| Docker (recommended) | Isolated container testing | `docker --version` |
| VS Code (recommended) | Best Python + Docker support | — |

---

*See `scripts/validate-submission.sh` to run pre-submission checks and `scripts/sample_inference.py` for the required inference script format.*
docs/scripts/sample_inference.py ADDED
@@ -0,0 +1,182 @@
"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL    The API endpoint for the LLM.
    MODEL_NAME      The model identifier to use for inference.
    HF_TOKEN        Your Hugging Face / API key.
    IMAGE_NAME      The name of the local image to use for the environment if you are
                    using the from_docker_image() method.

- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")

- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use the OpenAI client for all LLM calls, using the above variables

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Rules:
- One [START] line at episode start.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line, with no newlines within a line.
- Each task should return a score in [0, 1].

Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("IMAGE_NAME")  # if you are using a Docker image
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
MAX_TOKENS = 150
SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]

# Max possible reward: each token contributes 0.1, across all steps
_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are interacting with a simple echo environment.
    Each turn you must send a message. The environment will echo it back.
    Reward is proportional to message length: reward = len(message) * 0.1
    Your goal is to maximize total reward by sending meaningful, substantive messages.
    Reply with exactly one message string — no quotes, no prefixes, just the message text.
    """
).strip()


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)


def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    history_block = "\n".join(history[-4:]) if history else "None"
    return textwrap.dedent(
        f"""
        Step: {step}
        Last echoed message: {last_echoed!r}
        Last reward: {last_reward:.2f}
        Previous steps:
        {history_block}
        Send your next message.
        """
    ).strip()


def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            stream=False,
        )
        text = (completion.choices[0].message.content or "").strip()
        return text if text else "hello"
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "hello"


async def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)

    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)

    try:
        result = await env.reset()  # OpenEnv reset()
        last_echoed = result.observation.echoed_message
        last_reward = 0.0

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                break

            message = get_model_message(client, step, last_echoed, last_reward, history)

            result = await env.step(MyEnvV4Action(message=message))
            obs = result.observation

            reward = result.reward or 0.0
            done = result.done
            error = None

            rewards.append(reward)
            steps_taken = step
            last_echoed = obs.echoed_message
            last_reward = reward

            log_step(step=step, action=message, reward=reward, done=done, error=error)

            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")

            if done:
                break

        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        try:
            await env.close()
        except Exception as e:
            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    asyncio.run(main())
docs/scripts/validate-submission.sh ADDED
@@ -0,0 +1,186 @@
#!/usr/bin/env bash
#
# validate-submission.sh — OpenEnv Submission Validator
#
# Checks that your HF Space is live, the Docker image builds, and openenv validate passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x validate-submission.sh
#   ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[1;33m'
  BOLD='\033[1m'
  NC='\033[0m'
else
  RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi

run_with_timeout() {
  local secs="$1"; shift
  if command -v timeout &>/dev/null; then
    timeout "$secs" "$@"
  elif command -v gtimeout &>/dev/null; then
    gtimeout "$secs" "$@"
  else
    "$@" &
    local pid=$!
    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
    local watcher=$!
    wait "$pid" 2>/dev/null
    local rc=$?
    kill "$watcher" 2>/dev/null
    wait "$watcher" 2>/dev/null
    return $rc
  fi
}

portable_mktemp() {
  local prefix="${1:-validate}"
  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}

CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
  printf "\n"
  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
  printf "  repo_dir   Path to your repo (default: current directory)\n"
  exit 1
fi

if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
  printf "Error: directory '%s' not found\n" "${2:-.}"
  exit 1
fi
PING_URL="${PING_URL%/}"
export PING_URL
PASS=0

log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
stop_at() {
  printf "\n"
  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
  exit 1
}

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo:     $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
  -H "Content-Type: application/json" -d '{}' \
  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
  pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
  fail "HF Space not reachable (connection failed or timed out)"
  hint "Check your network connection and that the Space is running."
  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
  stop_at "Step 1"
else
  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
  hint "Make sure your Space is running and the URL is correct."
  hint "Try opening $PING_URL in your browser first."
  stop_at "Step 1"
fi

log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
  fail "docker command not found"
  hint "Install Docker: https://docs.docker.com/get-docker/"
  stop_at "Step 2"
fi

if [ -f "$REPO_DIR/Dockerfile" ]; then
  DOCKER_CONTEXT="$REPO_DIR"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
  DOCKER_CONTEXT="$REPO_DIR/server"
else
  fail "No Dockerfile found in repo root or server/ directory"
  stop_at "Step 2"
fi

log "  Found Dockerfile in $DOCKER_CONTEXT"

BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
  pass "Docker build succeeded"
else
  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
  printf "%s\n" "$BUILD_OUTPUT" | tail -20
  stop_at "Step 2"
fi

log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
  fail "openenv command not found"
  hint "Install it: pip install openenv-core"
  stop_at "Step 3"
fi

VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
  pass "openenv validate passed"
  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
else
  fail "openenv validate failed"
  printf "%s\n" "$VALIDATE_OUTPUT"
  stop_at "Step 3"
fi

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0