InosLihka Claude Sonnet 4.6 committed on
Commit
9bfe470
·
1 Parent(s): 69310d6

Reorganize docs: segregate Round 1 and Round 2


- Move Round 1 files (problem statement, SPEC, inference copy, scripts) into docs/round1/
- Add Round 2 docs: problem statement, confirmation, design notes
- Root inference.py unchanged (hackathon requirement)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs/Hackathon Themes.md ADDED
@@ -0,0 +1,180 @@
+ Theme #1 - Multi-Agent Interactions
+ Environments for this theme involve cooperation, competition, negotiation, and coalition formation. Learning from these environments will enable agents to model the beliefs and incentives of others in partially observable settings. This drives theory-of-mind reasoning and emergent strategic behavior.
+ Expected Outcome: an environment that can be used to train multi-agent task handling in an LLM.
+ Example environments: Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
+ Sub-themes with bonus prizes.
+ Fleet AI. Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents operating in complex, multi-agent settings.
+ Halluminate. Multi-Actor Environments: Build a realistic environment where an agent interacts with and manages multiple actors (agents) to discover and achieve the task.
+ Theme #2 - (Super) Long-Horizon Planning & Instruction Following
+ You will build environments that require deep, multi-step reasoning with sparse or delayed rewards. Learning from these environments should enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes. The aim is to push beyond shallow next-token reasoning toward structured planning and durable internal representations.
+ Expected Outcome: an environment that can capture and improve LLM behaviour on challenging long-horizon tasks that need long-running sessions beyond context-memory limits.
+ Example environments: Research-planning simulators, large-scale codebase refactoring tasks, strategic resource management worlds, long-horizon logistics optimization, extremely complicated long-horizon instruction following (e.g., 300 instructions scattered around).
+ Sub-themes with bonus prizes.
+ Scale AI. Environments for long-horizon workflows for non-code use cases within a business setting: focusing on either Sales, Project Management, or HR & IT.
+ Mercor. Make an environment with capped/uncapped rewards where frontier model rewards scale with token output.
+ Theme #3 - World Modeling
+ #3.1 Professional Tasks
+ Here you will develop environments that require real interaction with tools, APIs, or dynamic systems, where the model is expected to do real, hard work instead of exploiting shortcuts to arrive at the desired outcome. Learning from these environments will enable agents to maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.
+ Expected Outcome: an environment capturing the nuances of a defined partially observable world and improving LLM interaction with it.
+ Example environments: Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations with feedback, tool-discovery benchmarks.
+ Sub-themes with bonus prizes.
+ Scaler AI Labs. Multi-App RL Environment for Enterprise Workflows: Create RL environments that demonstrate complex workflows, business-rule nuances, etc., in a large enterprise.
+
+ #3.2 Personalized Tasks
+ Here you will develop an environment that offers real personalized task handling: replying to personal messages, resolving dinner plans that clash with work, answering tough emails. Think of any personal-assistant task.
+
+ Expected Outcome: An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and managing them as delegations.
+
+ Example environments: Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping, etc.
+
+ Sub-themes with bonus prizes.
+ Patronus AI. Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where the underlying data schemas, API contracts, and T&Cs/policies/rules change.
+
+ Theme #4 - Self-Improvement
+ The focus here is to create environments where agents can learn to generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own capability growth. The objective is recursive skill amplification.
+ Expected Outcome: an environment for improving self-play of an LLM over a defined set of tasks.
+ Example environments: Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
+ Sub-themes with bonus prizes.
+ Snorkel AI. Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements/preferences.
+
+ Theme #5: Wild Card - Impress Us!
+ We do not want to limit your focus. If your idea doesn’t fit the boxes above, we want and WILL reward out-of-the-box tasks. Be creative, but make sure your submission meaningfully adds value to LLM training on a specific task.
+
+ Guidelines for Problem Statement
+ It is NOT mandatory to choose the same problem statement as Round 1. Only choose the same problem statement if it aligns with the Hackathon themes provided above.
+ You can start working on your problem statement once you have finalized it. Post-training can be done onsite on the 25th & 26th, when you receive compute credits for Hugging Face.
+ Before the onsite, we suggest you work on building the environment, agent behaviours, and reward model, and evaluate whether your work aligns with the judging criteria given below.
+
+ Judging Criteria
+ Minimum requirements:
+ Usage of OpenEnv (latest release)
+ Show a minimal training script for your environment using Unsloth or HF TRL in Colab
+ Write a mini-blog on HuggingFace or a mini-video on YouTube talking about your submission, <2 minutes
+ Your OpenEnv-compliant environment should be hosted on Hugging Face Spaces.
+
+ First Round Judging Overview
+ Pitch Format: Each team has 3 minutes to pitch, followed by 2 minutes for Q&A (5 minutes total).
+ Evaluation: Teams will be scored based on the following criteria:
+ Environment Innovation (40%): Is the environment novel, creative, or challenging? Does it meaningfully test the agent’s behavior?
+ Storytelling (30%): Does the team clearly explain the problem, environment, and agent behavior? Is the demo engaging and easy to follow?
+ Showing Improvement in Rewards (20%): Does the demo provide observable evidence of training progress (reward curves, metrics, or before/after behavior)?
+ Reward and Training Script/Pipeline Setup (10%): Is the reward logic coherent, and does the pipeline produce meaningful improvement in the agent’s inference (how it acts in the environment)?
+
+ OpenEnv Hackathon - What Judges Look For
+
+ This guide tells you what makes a strong submission for the OpenEnv Hackathon (India 2026).
+ Read it before you start building, and again before you submit.
+
+ For the list of themes and example problems, refer to the top sections.
+
+ NOTE: Please remember, only one submission per team is allowed. If you have multiple ideas, pick the best one and go for it. Make sure the URL of your environment is submitted, as judges will pull the environment from that URL to evaluate it. Changes or commits after the submission deadline will not be considered.
+
+ TL;DR
+
+ Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story.
+
+ A messy but ambitious environment with real training evidence beats a polished but boring one. Pick a problem that excites you (that energy comes through in the pitch).
+
+ Judging Criteria
+
+ Criterion: Environment Innovation
+ Weight: 40%
+ What it means:
+ Is the environment novel, creative, or genuinely challenging?
+ Does it meaningfully test agent behavior in a way that hasn't been done before?
+
+ Criterion: Storytelling & Presentation
+ Weight: 30%
+ What it means:
+ Can you clearly explain the problem, the environment, and what the agent learned?
+ Is the demo engaging and easy to follow for a non-technical audience?
+
+ Criterion: Showing Improvement in Rewards
+ Weight: 20%
+ What it means:
+ Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline -- anything that proves the agent learned something.
+
+ Criterion: Reward & Training Pipeline
+ Weight: 10%
+ What it means:
+ Is the reward logic coherent? Does the pipeline produce meaningful improvement in the trained agent's behavior?
+
+ Minimum Submission Requirements
+
+ NOTE: These are non-negotiable. Submissions missing any of these are at a serious disadvantage.
+ Use OpenEnv (latest release). Build on top of the framework; don’t reinvent the wheel.
+ A working training script using Unsloth or Hugging Face TRL, ideally as a Colab notebook so judges can re-run it.
+ Evidence that you actually trained; at minimum, loss and reward plots from a real run.
+ A short writeup: a mini-blog on Hugging Face, a <2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck. Make sure that all materials are linked from your README file so that judges can access them easily.
+ Push your environment to a Hugging Face Space so it’s discoverable and runnable.
+ A README that motivates the problem, explains how the env works, and shows results.
+ The README should link to the environment in the Hugging Face Space. It should also reference all additional materials (e.g. videos, blog posts, slides, presentations) that you want to include.
+ Please do not include big video files in your env submission on HF Hub; we would like each env to stay small (use URLs to link additional materials).
+
+ What Makes a Submission Stand Out
+
+ Pick an ambitious, original problem
+ The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation, you need a genuinely fresh angle. Some questions to ask yourself:
+ Does this environment exist to teach an LLM something it currently can’t do well?
+ Is the domain underexplored in RL/LLM training?
+ Could a researcher write a paper about training on this?
+
+ Design a reward signal that actually teaches
+ A great environment has a reward function that:
+ Provides a rich, informative signal (not just 0/1 at the end)
+ Captures something hard to measure in a clever way
+ Uses OpenEnv’s Rubric system thoughtfully (composable rubrics > monolithic scoring; see the sketch after this list)
+ Is hard to game; an agent that exploits the reward without solving the task should not get high scores
+
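+ A minimal sketch of the composable idea, in plain Python rather than OpenEnv's actual Rubric API (the criteria names and weights are hypothetical):
+
+ ```python
+ # Several small, named scoring criteria that sum to an episode score in [0, 1],
+ # instead of one opaque 0/1 at the end.
+ def progress_score(state) -> float:
+     return 0.6 * state["tasks_completed_fraction"]   # dense, informative signal
+
+ def anti_gaming_penalty(state) -> float:
+     return -0.1 * state["unnecessary_switches"]      # punish reward exploitation
+
+ def episode_score(state) -> float:
+     raw = progress_score(state) + anti_gaming_penalty(state)
+     return max(0.0, min(1.0, raw))                   # clamp to [0, 1]
+ ```
+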
+ Show real training, end to end
+ The bar isn’t “training script exists.” The bar is “training script runs against the environment, the agent learns, and you can show it.” Concretely:
+ Your training loop should connect to your environment (not a static dataset)
+ Train long enough that the curves mean something
+ Compare a trained agent vs. a random/untrained baseline; quantitative and/or qualitative
+ Include the plots and numbers in your README and writeup
+
+ Make your plots readable
+ Reviewers spend seconds, not minutes, on each plot. Help them out (a plotting sketch follows this list):
+ Label both axes (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
+ Save plots as .png or .jpg and commit them to the repo; don’t leave them only in a Colab cell or a deleted WandB run (if you ran via WandB, please include the link to that specific run of your plots)
+ Embed the key plots in your README with a one-line caption explaining what each one shows
+ If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious
+
+ Tell a story, not an API doc
+ Your README, blog, and pitch should answer:
+ Problem) what capability gap or interesting domain are you targeting?
+ Environment) what does the agent see, do, and get rewarded for?
+ Results) what changed after training? Show it.
+ Why it matters) who would care, and why?
+
+ A reviewer should be able to read your README in 3–5 minutes and want to try your environment.
+
+ NOTE: If you have a video, HF post, or anything else interesting, please make sure that it’s linked from your README.
+
+ Engineer it cleanly (table stakes)
+ Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
+ Use OpenEnv’s Environment / MCPEnvironment base classes properly
+ Respect the client / server separation (clients should never import server internals)
+ Follow the standard Gym-style API (reset, step, state); see the sketch after this list
+ Have a valid openenv.yaml manifest
+ Don’t use reserved tool names (reset, step, state, close) for MCP tools
+
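+ A structural sketch of that Gym-style shape. This is illustrative only; the exact OpenEnv base-class signatures may differ, so treat it as the contract the checklist refers to, not the framework's literal API:
+
+ ```python
+ class MyEnv:
+     def reset(self):
+         """Start a new episode and return the initial observation."""
+         self._t = 0
+         return {"timestep": self._t}
+
+     def step(self, action):
+         """Apply one action; return (observation, reward, done, info)."""
+         self._t += 1
+         done = self._t >= 20
+         return {"timestep": self._t}, 0.0, done, {}
+
+     def state(self):
+         """Expose the current episode state for clients and debugging."""
+         return {"timestep": self._t}
+ ```
+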
+ Final Note
+
+ Judges are looking for environments that push the frontier of what we can train LLMs to do. Be ambitious. Pick a problem you find genuinely interesting; that almost always produces better work than chasing what you think judges want. Good luck.
+
SPEC.md → docs/round1/SPEC.md RENAMED
File without changes
docs/round1/inference.py ADDED
@@ -0,0 +1,304 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ RhythmEnv Inference Script
+ ===================================
+ MANDATORY
+ - Before submitting, ensure the following variables are defined in your environment configuration:
+     API_BASE_URL      The API endpoint for the LLM.
+     MODEL_NAME        The model identifier to use for inference.
+     HF_TOKEN          Your Hugging Face / API key.
+     LOCAL_IMAGE_NAME  The name of the local image to use for the environment if you are using from_docker_image().
+
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
+   (and should reflect your active inference setup):
+     API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
+     MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
+
+ - The inference script must be named `inference.py` and placed in the root directory of the project.
+ - Participants must use the OpenAI client for all LLM calls, using the above variables.
+
+ STDOUT FORMAT
+ - The script must emit exactly three line types to stdout, in this order:
+
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+
+ Rules:
+ - One [START] line at episode start.
+ - One [STEP] line per step, immediately after env.step() returns.
+ - One [END] line after env.close(), always emitted (even on exception).
+ - reward and rewards are formatted to 2 decimal places.
+ - done and success are lowercase booleans: true or false.
+ - error is the raw last_action_error string, or null if none.
+ - All fields on a single line, with no newlines within a line.
+ - Each task should return a score in [0, 1].
+ """
+
43
+ import asyncio
44
+ import os
45
+ import sys
46
+ import textwrap
47
+ from typing import List, Optional
48
+
49
+ from openai import OpenAI
50
+
51
+ # Add current directory to path for local imports
52
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
53
+
54
+ from client import RhythmEnv
55
+ from models import ActionType, RhythmAction
56
+
57
+ # ---------------------------------------------------------------------------
58
+ # Configuration
59
+ # ---------------------------------------------------------------------------
60
+
61
+ IMAGE_NAME = os.getenv("IMAGE_NAME")
62
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
63
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
64
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
65
+ BASE_URL = os.getenv("RHYTHM_ENV_URL", "https://InosLihka-rhythm-env.hf.space")
66
+ BENCHMARK = "rhythm_env"
67
+ TASKS = ["easy", "medium", "hard"]
68
+ MAX_STEPS = 20
69
+ SCORE_THRESHOLD = 0.1
70
+
71
+ SYSTEM_PROMPT = textwrap.dedent("""\
72
+ You are a daily planning agent. You manage tasks across a workday.
73
+ Each step is a 30-minute slot. You have energy (0-1) and stress (0-1).
74
+
75
+ Available actions (respond with EXACTLY one line in this format):
76
+ START_TASK <task_id>
77
+ CONTINUE_TASK
78
+ SWITCH_TASK <task_id>
79
+ TAKE_BREAK
80
+
81
+ Rules:
82
+ - START_TASK/SWITCH_TASK require a task_id (integer).
83
+ - CONTINUE_TASK continues your current task.
84
+ - TAKE_BREAK recovers energy and reduces stress.
85
+ - Take breaks when energy < 0.3.
86
+ - Prioritize tasks by deadline urgency, then importance.
87
+ - Avoid unnecessary switching (costs energy and reward).
88
+
89
+ Respond with ONLY the action line, nothing else.""")
90
+
91
+
92
+ # ---------------------------------------------------------------------------
93
+ # Logging helpers
94
+ # ---------------------------------------------------------------------------
95
+
96
+ def log_start(task: str, env: str, model: str) -> None:
97
+ print(f"[START] task={task} env={env} model={model}", flush=True)
98
+
99
+
100
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
101
+ error_val = error if error else "null"
102
+ done_val = str(done).lower()
103
+ print(
104
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
105
+ flush=True,
106
+ )
107
+
108
+
109
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
110
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
111
+ print(
112
+ f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
113
+ flush=True,
114
+ )
115
+
116
+
117
+ # ---------------------------------------------------------------------------
+ # Heuristic action selection (enhanced by LLM)
+ # ---------------------------------------------------------------------------
+
+ def choose_action_heuristic(obs) -> RhythmAction:
+     """Greedy heuristic: prioritize by deadline then importance."""
+     energy = obs.energy
+     current_task_id = obs.current_task_id
+     tasks = obs.tasks
+     timestep = obs.timestep
+     meetings = obs.meetings
+
+     # During meeting slots, just take a break
+     if timestep in meetings:
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     # Take break if energy is low
+     if energy < 0.3:
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     # Get uncompleted tasks
+     uncompleted = [t for t in tasks if t.progress < t.effort]
+     if not uncompleted:
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     # Sort by deadline (ascending), then importance (descending)
+     uncompleted.sort(key=lambda t: (t.deadline, -t.importance))
+
+     # Check for urgent tasks (deadline within 3 steps)
+     urgent = [t for t in uncompleted if t.deadline - timestep <= 3]
+     best = urgent[0] if urgent else uncompleted[0]
+
+     if current_task_id is not None and current_task_id == best.id:
+         return RhythmAction(action_type=ActionType.CONTINUE_TASK)
+     elif current_task_id is not None:
+         return RhythmAction(action_type=ActionType.SWITCH_TASK, task_id=best.id)
+     else:
+         return RhythmAction(action_type=ActionType.START_TASK, task_id=best.id)
+
+
+ def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
+     """Use LLM to pick an action, fall back to heuristic on failure."""
+     tasks_desc = "\n".join(
+         f"  Task {t.id}: {t.name} — {t.description}\n"
+         f"    (effort={t.effort:.2f}, progress={t.progress:.2f}, "
+         f"deadline=step {t.deadline}, importance={t.importance})"
+         for t in obs.tasks
+     )
+     user_prompt = textwrap.dedent(f"""\
+ Step: {obs.timestep}/{MAX_STEPS}
+ Energy: {obs.energy:.2f}
+ Stress: {obs.stress:.2f}
+ Current task: {obs.current_task_id}
+ Meetings at steps: {obs.meetings}
+ Remaining steps: {obs.remaining_steps}
+
+ Tasks:
+ {tasks_desc}
+
+ Choose your action:""")
+
+     try:
+         completion = llm_client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=0.3,
+             max_tokens=30,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+         return parse_llm_action(text, obs)
+     except Exception:
+         return choose_action_heuristic(obs)
+
+
+ def parse_llm_action(text: str, obs) -> RhythmAction:
+     """Parse LLM response text into a RhythmAction."""
+     text = text.strip().upper()
+
+     if text.startswith("TAKE_BREAK"):
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     if text.startswith("CONTINUE_TASK"):
+         if obs.current_task_id is not None:
+             return RhythmAction(action_type=ActionType.CONTINUE_TASK)
+         return choose_action_heuristic(obs)
+
+     for prefix, action_type in [
+         ("START_TASK", ActionType.START_TASK),
+         ("SWITCH_TASK", ActionType.SWITCH_TASK),
+     ]:
+         if text.startswith(prefix):
+             rest = text[len(prefix):].strip()
+             try:
+                 task_id = int(rest)
+                 if 0 <= task_id < len(obs.tasks):
+                     return RhythmAction(action_type=action_type, task_id=task_id)
+             except ValueError:
+                 pass
+
+     # Fallback
+     return choose_action_heuristic(obs)
+
+
+ # ---------------------------------------------------------------------------
+ # Main loop
+ # ---------------------------------------------------------------------------
+
+ async def run_task(task_name: str, llm_client: OpenAI) -> float:
+     """Run a single task and return the score."""
+     env = None
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+
+     log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         # Create the env inside the try block so [END] is still emitted
+         # even if construction fails.
+         if IMAGE_NAME:
+             env = await RhythmEnv.from_docker_image(IMAGE_NAME)
+         else:
+             env = RhythmEnv(base_url=BASE_URL)
+
+         async with env:
+             result = await env.reset(task=task_name)
+
+             for step in range(1, MAX_STEPS + 1):
+                 if result.done:
+                     break
+
+                 # Use LLM if available, otherwise heuristic
+                 if llm_client is not None:
+                     action = choose_action_llm(result.observation, llm_client)
+                 else:
+                     action = choose_action_heuristic(result.observation)
+
+                 action_str = action.action_type.value
+                 if action.task_id is not None:
+                     action_str += f"({action.task_id})"
+
+                 result = await env.step(action)
+
+                 reward = result.reward or 0.0
+                 done = result.done
+                 rewards.append(reward)
+                 steps_taken = step
+
+                 log_step(step=step, action=action_str, reward=reward, done=done, error=None)
+
+                 if done:
+                     break
+
+             # Get final score from grader
+             score = result.observation.reward_breakdown.get("final_score", 0.0)
+             score = max(0.0, min(1.0, score))
+             success = score >= SCORE_THRESHOLD
+
+     except Exception as e:
+         print(f"[DEBUG] Error running task {task_name}: {e}", flush=True)
+     finally:
+         if env is not None:
+             try:
+                 await env.close()
+             except Exception as e:
+                 print(f"[DEBUG] env.close() error: {e}", flush=True)
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return score
+
+
+ async def main() -> None:
+     llm_client = None
+     if API_KEY:
+         llm_client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     scores = []
+     for task_name in TASKS:
+         s = await run_task(task_name, llm_client)
+         scores.append(s)
+
+     avg = sum(scores) / len(scores) if scores else 0.0
+     print(f"\n[SUMMARY] avg_score={avg:.3f} scores={','.join(f'{s:.3f}' for s in scores)}", flush=True)
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
docs/{problem_statement.md → round1/problem_statement.md} RENAMED
File without changes
docs/{scripts → round1/scripts}/sample_inference.py RENAMED
File without changes
docs/{scripts → round1/scripts}/validate-submission.sh RENAMED
File without changes
docs/round2_confirmation.md ADDED
@@ -0,0 +1,60 @@
+ # Round 2 — Entry Confirmation
+
+ **Event:** Meta PyTorch OpenEnv Hackathon × Scaler School of Technology — Grand Finale
+ **Date:** 25–26 April 2026
+ **Venue:** Scaler School of Technology, Electronic City, Bangalore
+ **Category:** Solo
+ **Name:** Akhil Soni
+ **Email:** akhilsoni0102@gmail.com
+
+ > This document serves as your official team ticket to the finale. Present it at entry.
+
+ ---
+
+ ## Before You Arrive
+
+ - [ ] Join the private Discord (MANDATORY) — all major updates announced here first
+ - [ ] Check the travel guide — venue details, directions, nearby stay options
+
+ ---
+
+ ## Entry Requirements
+
+ ⚠️ You must present this document at entry. Entry will be denied if details don't match registration.
+
+ Bring:
+ - Valid government-issued ID
+ - College/company ID used during registration
+
+ ---
+
+ ## Round 2 Themes (Summary)
+
+ Your task is to choose one or more themes and design your own problem statement around it.
+
+ - **Multi-Agent Interactions** — cooperation, competition, negotiation
+ - **Long-Horizon Planning & Instruction Following** — multi-step reasoning, sparse rewards
+ - **World Modeling** — professional tasks (tools/APIs) or personalized tasks
+ - **Self-Improvement** — self-play, adaptive curricula, recursive skill amplification
+ - **Wild Card** — anything original and meaningful
+
+ As part of your submission, clearly define:
+
+ | What | Description |
+ |---|---|
+ | Problem statement | What capability gap are you solving? |
+ | Environment | Where the agent operates |
+ | Agent capabilities | What the agent can see and do |
+ | Tasks | What the agent must accomplish |
+ | Reward model / evaluation logic | How success is measured |
+ | Post-training / self-improvement strategy | How the agent improves |
+
+ > Full themes and judging criteria: see [round2_problem_statement.md](round2_problem_statement.md)
+
+ ---
+
+ ## Key Notes
+
+ - You can begin working on your problem statement immediately
+ - Post-training happens **on-site** with provided compute credits
+ - Use time before April 25 to build the environment, agent behaviours, and reward model
docs/round2_design_notes.md ADDED
@@ -0,0 +1,147 @@
+ # Round 2 — Design Notes
+
+ ## The Core Idea
+
+ Design the environment so that **reward is controlled by hidden variables the agent cannot see in state.**
+ The agent must discover them through trial and error across many episodes.
+
+ This is what creates genuine learning signal — the agent starts confused, gets inconsistent rewards,
+ and gradually figures out the hidden rules that explain its performance.
+
+ ---
+
+ ## The Principle (Generic)
+
+ ```
+ What agent sees in state:  observable facts (energy, tasks, deadlines, timestep)
+ What agent does NOT see:   hidden variables that secretly control reward
+
+ The agent's job: figure out the hidden variables through experience
+ ```
+
+ Each hidden variable should satisfy three rules:
+
+ 1. **Discoverable through reward signal alone** — the agent can figure it out without being told
+ 2. **Changes strategy significantly once discovered** — it's not a minor tweak, it rewires planning
+ 3. **Not guessable from common sense alone** — a pretrained LLM should not figure it out on the first try
+
+ ---
+
+ ## Reference Example — EchoEnv
+
+ EchoEnv is the simplest possible OpenEnv environment. The agent sends a string, the env echoes it back.
+
+ ```
+ Hidden variable:  string length
+ Agent observes:   it sent "hi"          → reward 0.2
+                   it sent "hello"       → reward 0.5
+                   it sent "hello world" → reward 0.8
+ Agent discovers:  longer string = higher reward
+ Strategy change:  always send the longest possible string
+ ```
+
+ The agent never sees a "length bonus" field in the observation. It just notices the correlation
+ between string length and reward over many episodes, and learns to exploit it.
+
+ ---
+
+ ## Applying This to RhythmEnv
+
+ RhythmEnv simulates a workday. One episode = one day. 20 steps = 20 half-hour slots.
+
+ The agent sees: energy, stress, tasks, deadlines, current_task_id, timestep.
+ The agent does NOT see: hidden variables that secretly influence how reward is calculated.
+
+ Below are candidate hidden variables, each of which the agent must discover through experience.
+
+ ---
+
+ ### Hidden Variable 1 — Task Sequencing Dependency
+
+ Some tasks give a secret bonus when done in a specific order.
+
+ ```
+ Deep work → then email → bonus multiplier applied to email reward
+ Email → then deep work → no bonus
+ ```
+
+ **What agent sees:** just the reward it earned on the email task.
+ **What agent discovers:** doing deep work first makes email more rewarding.
+ **Real world basis:** mental priming — focused work puts you in a state where communication is sharper.
+ **Strategy change:** agent learns to always front-load deep work, then switch to communication tasks.
+
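+ A sketch of this rule, assuming a hypothetical server-side `completed_categories` history that is never exposed in the observation (the 1.5x value is illustrative):
+
+ ```python
+ # Hidden sequencing bonus, applied inside the reward function only.
+ def email_reward(base: float, completed_categories: list) -> float:
+     if "deep_work" in completed_categories:  # deep work earlier primes email
+         return base * 1.5
+     return base
+ ```
+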
+ ---
+
+ ### Hidden Variable 2 — Energy Threshold Cliff
+
+ Progress rate does not degrade smoothly with energy. It drops sharply at a hidden threshold.
+
+ ```
+ energy > 0.5 → normal progress rate
+ energy < 0.5 → progress drops 60% suddenly
+ ```
+
+ **What agent sees:** energy level (e.g. 0.48) — looks like any other low energy value.
+ **What agent discovers:** something bad happens around 0.5, rewards drop sharply below it.
+ **Real world basis:** cognitive performance has cliff effects, not smooth degradation.
+ **Strategy change:** agent learns to take a break before hitting 0.5, not after.
+
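+ A sketch of the cliff, using the example threshold and drop above:
+
+ ```python
+ # Progress rate is a hidden step function of energy; the agent only
+ # ever observes the resulting progress, never this rule.
+ def progress_rate(energy: float) -> float:
+     if energy > 0.5:
+         return 1.0   # normal progress
+     return 0.4       # sudden 60% drop below the hidden threshold
+ ```
+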
+ ---
+
+ ### Hidden Variable 3 — Task Interference
+
+ Certain task combinations secretly multiply stress while both are active.
+
+ ```
+ working on Task A while Task B is overdue → stress multiplier 1.5x
+ agent does not see this multiplier in state
+ ```
+
+ **What agent sees:** stress rising faster than expected in certain situations.
+ **What agent discovers:** having overdue tasks in the background makes everything worse.
+ **Real world basis:** unfinished obligations create background cognitive load.
+ **Strategy change:** agent learns to clear small overdue tasks early, even at the cost of efficiency.
+
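+ A sketch, assuming tasks expose `deadline`, `progress`, and `effort` fields as in the Round 1 observation model:
+
+ ```python
+ # Hidden stress multiplier: active whenever any task is overdue.
+ def stress_delta(base_delta: float, tasks, timestep: int) -> float:
+     overdue = any(t.deadline < timestep and t.progress < t.effort for t in tasks)
+     return base_delta * (1.5 if overdue else 1.0)  # never shown in state
+ ```
+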
+ ---
+
+ ### Hidden Variable 4 — Recovery Curve Shape
+
+ Consecutive breaks compound in recovery — the second break recovers more than the first.
+
+ ```
+ 1 break alone → +0.12 energy
+ 2 breaks back to back → +0.12 + 0.18 energy (hidden compounding)
+ ```
+
+ **What agent sees:** it took a break, energy went up by 0.12. Looks linear.
+ **What agent discovers:** two breaks in a row sometimes outperforms one break + one work step.
+ **Real world basis:** rest has diminishing costs but increasing returns at low energy states.
+ **Strategy change:** agent learns that when very depleted, double breaks are worth it.
+
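+ A sketch matching the numbers above, with a hypothetical `consecutive_breaks` counter tracked server-side:
+
+ ```python
+ # 1st break recovers +0.12 energy; each consecutive break adds a hidden +0.06.
+ def break_recovery(consecutive_breaks: int) -> float:
+     return 0.12 + 0.06 * max(0, consecutive_breaks - 1)
+ ```
+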
+ ---
+
+ ### Hidden Variable 5 — Time-of-Day Sensitivity
+
+ Certain task types score better at certain timesteps via a hidden importance multiplier.
+
+ ```
+ creative/deep tasks at timestep 0–5 → importance multiplier 1.3x
+ creative/deep tasks at timestep 15+ → importance multiplier 0.7x
+ ```
+
+ **What agent sees:** same task, same effort, but different reward depending on when it was done.
+ **What agent discovers:** doing creative tasks early in the day is disproportionately rewarded.
+ **Real world basis:** cognitive peak hours — most people do their best focused work in the morning.
+ **Strategy change:** agent learns to protect early timesteps for high-effort tasks regardless of deadline pressure.
+
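+ A sketch using the cutoffs and factors above (task-type names are hypothetical):
+
+ ```python
+ # Hidden time-of-day multiplier applied to task importance in the reward.
+ def importance_multiplier(task_type: str, timestep: int) -> float:
+     if task_type in ("creative", "deep"):
+         if timestep <= 5:
+             return 1.3   # early-day bonus
+         if timestep >= 15:
+             return 0.7   # late-day penalty
+     return 1.0
+ ```
+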
+ ---
+
+ ## Open Questions
+
+ These are not decided yet. To be answered before building:
+
+ - Which of the 5 variables above do we actually implement?
+ - Do we implement all 5 or pick 2-3 strong ones?
+ - How do we make sure the hidden variables are discoverable but not too easy?
+ - Does the person profile (energy curve, task types) change across episodes or stay fixed?
+ - What does the training story look like — how many episodes before the agent figures it out?
docs/round2_problem_statement.md ADDED
@@ -0,0 +1,193 @@
+ # Round 2 — Grand Finale Problem Statement
+
+ **Date:** 25–26 April 2026
+ **Venue:** Scaler School of Technology, Electronic City, Bangalore
+ **Category:** Solo — Akhil Soni
+
+ ---
+
+ ## The Task
+
+ Choose one (or more) of the themes below and design your own problem statement around it.
+ Build an environment, train an agent on it, and show measurable improvement.
+
+ > *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*
+
+ It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.
+
+ ---
+
+ ## Themes
+
+ ### Theme 1 — Multi-Agent Interactions
+
+ Environments involving cooperation, competition, negotiation, and coalition formation.
+ Enables agents to model beliefs and incentives of others in partially observable settings.
+ Drives theory-of-mind reasoning and emergent strategic behavior.
+
+ **Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.
+
+ **Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
+
+ **Bonus prizes:**
+ - **Fleet AI** — Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
+ - **Halluminate** — Multi-Actor Environments: An agent interacts with and manages multiple actors to discover and achieve a task.
+
+ ---
+
+ ### Theme 2 — (Super) Long-Horizon Planning & Instruction Following
+
+ Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
+ Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
+ Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.
+
+ **Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks that need sessions beyond context memory limits.
+
+ **Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, 300+ instruction following.
+
+ **Bonus prizes:**
+ - **Scale AI** — Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
+ - **Mercor** — Environment with capped/uncapped rewards where frontier model rewards scale with token output.
+
+ ---
+
+ ### Theme 3 — World Modeling
+
+ #### 3.1 Professional Tasks
+
+ Environments requiring real interaction with tools, APIs, or dynamic systems.
+ The model must do real work instead of exploiting shortcuts.
+ Strengthens causal reasoning and persistent world models.
+
+ **Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.
+
+ **Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.
+
+ **Bonus prizes:**
+ - **Scaler AI Labs** — Multi-App RL Environment for Enterprise Workflows: Complex workflows and business rule nuances in a large enterprise.
+
+ #### 3.2 Personalized Tasks
+
+ Environments for real personalized task handling — personal messages, dinner conflicts, tough emails, any personal assistant task.
+
+ **Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks, conflicts, and managing them as delegations.
+
+ **Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.
+
+ **Bonus prizes:**
+ - **Patronus AI** — Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.
+
+ ---
+
+ ### Theme 4 — Self-Improvement
+
+ Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
+ Goal: agents learn to drive their own capability growth (recursive skill amplification).
+
+ **Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.
+
+ **Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
+
+ **Bonus prizes:**
+ - **Snorkel AI** — Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.
+
+ ---
+
+ ### Theme 5 — Wild Card
+
+ No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.
+
+ ---
+
+ ## Minimum Requirements (Non-Negotiable)
+
+ Missing any of these puts your submission at a serious disadvantage.
+
+ | Requirement | Details |
+ |---|---|
+ | OpenEnv (latest release) | Build on top of the framework, don't reinvent the wheel |
+ | Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
+ | Training evidence | Loss and reward plots from a real run |
+ | Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
+ | HF Space deployment | Environment hosted, discoverable, and runnable |
+ | README | Motivates the problem, explains the env, shows results, links all materials |
+
+ ---
+
+ ## Judging Criteria
+
+ | Criterion | Weight | What It Means |
+ |---|---|---|
+ | Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
+ | Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
+ | Showing Improvement in Rewards | 20% | Observable training progress — reward curves, before/after behavior, baseline comparison |
+ | Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
+
+ ---
+
+ ## Pitch Format
+
+ - **3 minutes** to pitch
+ - **2 minutes** Q&A
+ - 5 minutes total per team
+
+ Your pitch should answer:
+ 1. **Problem** — what capability gap or interesting domain are you targeting?
+ 2. **Environment** — what does the agent see, do, and get rewarded for?
+ 3. **Results** — what changed after training? Show it.
+ 4. **Why it matters** — who would care, and why?
+
+ ---
+
+ ## What Makes a Submission Stand Out
+
+ **Pick an ambitious problem.**
+ Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?
+
+ **Design a reward signal that actually teaches.**
+ - Rich signal throughout the episode (not just 0/1 at the end)
+ - Hard to game — an agent that exploits the reward without solving the task should not score high
+ - Use OpenEnv's Rubric system thoughtfully
+
+ **Show real training end to end.**
+ - Training loop connects to your environment (not a static dataset)
+ - Train long enough that curves mean something
+ - Compare trained agent vs random/untrained baseline — quantitative and qualitative
+ - Include plots and numbers in your README
+
+ **Make plots readable.**
+ - Label both axes with units
+ - Save as `.png` / `.jpg` and commit to repo
+ - Embed key plots in README with a one-line caption
+ - Put baseline vs trained on the same axes
+
+ ---
+
+ ## Engineering Checklist
+
+ - [ ] Use `Environment` / `MCPEnvironment` base classes properly
+ - [ ] Respect client/server separation (clients never import server internals)
+ - [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
+ - [ ] Valid `openenv.yaml` manifest
+ - [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
+ - [ ] README links to blog, video, or slides
+ - [ ] No large video files in HF repo (use URL references)
+
+ ---
+
+ ## Before You Arrive in Bangalore
179
+
180
+ Post-training happens on-site with provided compute credits.
181
+ Use the time before April 25 to:
182
+
183
+ - [ ] Finalize your problem statement
184
+ - [ ] Build and deploy your environment to HF Space
185
+ - [ ] Write your training script (ready to run, not necessarily fully executed)
186
+ - [ ] Prepare your 3-minute pitch story
187
+
188
+ ---
189
+
190
+ ## Infrastructure Constraints (same as Round 1)
191
+
192
+ - Inference script runtime: under 20 minutes
193
+ - Hardware: vCPU=2, memory=8GB