# Round 2: Grand Finale Problem Statement

**Date:** 25–26 April 2026
**Venue:** Scaler School of Technology, Electronic City, Bangalore
**Category:** Solo (Akhil Soni)

---

## The Task

Choose one (or more) of the themes below and design your own problem statement around it.
Build an environment, train an agent on it, and show measurable improvement.

> *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*

It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.

---

## Themes

### Theme 1: Multi-Agent Interactions

Environments involving cooperation, competition, negotiation, and coalition formation.
Enables agents to model beliefs and incentives of others in partially observable settings.
Drives theory-of-mind reasoning and emergent strategic behavior.

**Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.

**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.

**Bonus prizes:**
- **Fleet AI** (Scalable Oversight): Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
- **Halluminate** (Multi-Actor Environments): An agent interacts with and manages multiple actors to discover and achieve a task.

---

### Theme 2: (Super) Long-Horizon Planning & Instruction Following

Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.

**Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks that need sessions beyond context memory limits.

**Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, following 300+ instructions.

**Bonus prizes:**
- **Scale AI**: Long-horizon workflows for non-code business use cases (Sales, Project Management, or HR & IT).
- **Mercor**: Environment with capped/uncapped rewards where frontier model rewards scale with token output.

---

### Theme 3: World Modeling

#### 3.1 Professional Tasks

Environments requiring real interaction with tools, APIs, or dynamic systems.
The model must do real work instead of exploiting shortcuts.
Strengthens causal reasoning and persistent world models.

**Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.

**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.

**Bonus prizes:**
- **Scaler AI Labs** (Multi-App RL Environment for Enterprise Workflows): Complex workflows and business rule nuances in a large enterprise.

#### 3.2 Personalized Tasks

Environments for real personalized task handling: personal messages, dinner conflicts, tough emails, or any other personal assistant task.

**Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and of managing them as delegations.

**Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.

**Bonus prizes:**
- **Patronus AI** (Consumer Workflows with Schema Drift): Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.

---

### Theme 4: Self-Improvement

Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
Goal: agents learn to drive their own capability growth (recursive skill amplification).

**Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.

**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.

**Bonus prizes:**
- **Snorkel AI** (Simulated Experts-in-the-Loop): Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.

---

### Theme 5: Wild Card

No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.

---

## Minimum Requirements (Non-Negotiable)

Missing any of these puts your submission at a serious disadvantage.

| Requirement | Details |
|---|---|
| OpenEnv (latest release) | Build on top of the framework; don't reinvent the wheel |
| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
| Training evidence | Loss and reward plots from a real run |
| Writeup | Mini-blog on Hugging Face, a YouTube video under 2 minutes, or a short slide deck |
| HF Space deployment | Environment hosted, discoverable, and runnable |
| README | Motivates the problem, explains the env, shows results, links all materials |

---

## Judging Criteria

| Criterion | Weight | What It Means |
|---|---|---|
| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
| Showing Improvement in Rewards | 20% | Observable training progress: reward curves, before/after behavior, baseline comparison |
| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
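
As a rough self-check, the weights above can be combined into a single number. The sketch below assumes each criterion is rated on a 0-10 scale and that the ratings are combined linearly; the source states neither, and the function name `weighted_score` is hypothetical.

```python
# Weights copied from the judging table; the 0-10 rating scale is an assumption.
WEIGHTS = {
    "Environment Innovation": 0.40,
    "Storytelling & Presentation": 0.30,
    "Showing Improvement in Rewards": 0.20,
    "Reward & Training Pipeline": 0.10,
}


def weighted_score(ratings):
    """Combine per-criterion ratings into one weighted score (hypothetical helper)."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Illustrative self-assessment, not real judging data.
example = {
    "Environment Innovation": 8,
    "Storytelling & Presentation": 6,
    "Showing Improvement in Rewards": 7,
    "Reward & Training Pipeline": 9,
}
print(round(weighted_score(example), 2))
```

Because Environment Innovation carries 40% of the weight, one extra point there moves the total four times as much as one extra point on the pipeline criterion.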

---

## Pitch Format

- **3 minutes** to pitch
- **2 minutes** Q&A
- 5 minutes total per team

Your pitch should answer:
1. **Problem**: what capability gap or interesting domain are you targeting?
2. **Environment**: what does the agent see, do, and get rewarded for?
3. **Results**: what changed after training? Show it.
4. **Why it matters**: who would care, and why?

---

## What Makes a Submission Stand Out

**Pick an ambitious problem.**
Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?

**Design a reward signal that actually teaches.**
- Rich signal throughout the episode (not just 0/1 at the end)
- Hard to game: an agent that exploits the reward without solving the task should not score high
- Use OpenEnv's Rubric system thoughtfully
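
One way to read the first two points is sketched below: partial credit for each new subgoal, nothing for repeats, and a bonus only when every subgoal is done. This is plain Python and does not use OpenEnv's Rubric system. The class `ShapedReward`, its fields, and the numeric values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ShapedReward:
    """Hypothetical reward tracker with dense partial credit and a completion bonus."""

    subgoals: frozenset[str]
    step_credit: float = 0.1        # assumed credit for each newly completed subgoal
    final_bonus: float = 1.0        # assumed bonus once every subgoal is completed
    invalid_penalty: float = -0.05  # assumed cost of an unknown or missing subgoal
    completed: set[str] = field(default_factory=set)

    def score(self, subgoal: str | None) -> float:
        """Return the reward for one step that claims to have finished `subgoal`."""
        if subgoal is None or subgoal not in self.subgoals:
            return self.invalid_penalty
        if subgoal in self.completed:
            return 0.0  # repeating a finished subgoal earns nothing, so it cannot be farmed
        self.completed.add(subgoal)
        reward = self.step_credit
        if self.completed == self.subgoals:
            reward += self.final_bonus
        return reward
```

In a real environment the `subgoal` argument would come from a verifier that inspects the environment's state, not from the agent's own claim; otherwise the agent could simply assert progress.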

**Show real training end to end.**
- Training loop connects to your environment (not a static dataset)
- Train long enough that curves mean something
- Compare the trained agent against a random/untrained baseline, both quantitatively and qualitatively
- Include plots and numbers in your README
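
For the quantitative half of the comparison, a minimal sketch using only the standard library follows. It assumes you have already collected one total reward per evaluation episode for each agent; `summarize` and `compare` are hypothetical helper names.

```python
from statistics import mean, stdev


def summarize(rewards):
    """Return episode count, mean, and sample standard deviation of episode rewards."""
    if not rewards:
        raise ValueError("need at least one episode reward")
    spread = stdev(rewards) if len(rewards) > 1 else 0.0
    return {"episodes": len(rewards), "mean": mean(rewards), "stdev": spread}


def compare(baseline_rewards, trained_rewards):
    """Summarize both agents and report the difference in mean episode reward."""
    baseline = summarize(baseline_rewards)
    trained = summarize(trained_rewards)
    return {
        "baseline": baseline,
        "trained": trained,
        "mean_improvement": trained["mean"] - baseline["mean"],
    }
```

Evaluating both agents on the same episode seeds makes the difference in means easier to defend, and the episode count and spread are worth reporting next to it.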

**Make plots readable.**
- Label both axes with units
- Save as `.png` / `.jpg` and commit to repo
- Embed key plots in README with a one-line caption
- Put baseline vs trained on the same axes

---

## Engineering Checklist

- [ ] Use `Environment` / `MCPEnvironment` base classes properly
- [ ] Respect client/server separation (clients never import server internals)
- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
- [ ] Valid `openenv.yaml` manifest
- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
- [ ] README links to blog, video, or slides
- [ ] No large video files in HF repo (use URL references)
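
The API shape and the reserved-name rule from the checklist can be illustrated without the framework. The sketch below is plain Python and deliberately does not subclass OpenEnv's `Environment` or `MCPEnvironment`; `CounterEnv`, `find_reserved_tool_names`, the return shapes, and `state` as a property are assumptions made for illustration. Only the four reserved names come from the checklist.

```python
from __future__ import annotations

# Reserved names taken from the checklist above.
RESERVED_TOOL_NAMES = frozenset({"reset", "step", "state", "close"})


def find_reserved_tool_names(tool_names):
    """Return, sorted, the proposed MCP tool names that collide with reserved names."""
    return sorted(set(tool_names) & RESERVED_TOOL_NAMES)


class CounterEnv:
    """Toy environment with a reset/step/state shape (not an OpenEnv subclass)."""

    def __init__(self, target: int = 3) -> None:
        self.target = target
        self._count = 0
        self._steps = 0

    def reset(self) -> dict:
        """Start a new episode and return the initial observation."""
        self._count = 0
        self._steps = 0
        return {"count": self._count}

    def step(self, action: str) -> tuple[dict, float, bool]:
        """Apply one action and return (observation, reward, done)."""
        self._steps += 1
        if action == "increment":
            self._count += 1
        done = self._count >= self.target
        reward = 1.0 if done else 0.0
        return {"count": self._count}, reward, done

    @property
    def state(self) -> dict:
        """Expose episode bookkeeping separately from the observation."""
        return {"count": self._count, "steps": self._steps}
```

Running `find_reserved_tool_names` over your tool list before deployment catches collisions early; an empty list means none of the names clash.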

---

## Before You Arrive in Bangalore

Post-training happens on-site with provided compute credits.
Use the time before April 25 to:

- [ ] Finalize your problem statement
- [ ] Build and deploy your environment to HF Space
- [ ] Write your training script (ready to run, not necessarily fully executed)
- [ ] Prepare your 3-minute pitch story

---

## Infrastructure Constraints (same as Round 1)

- Inference script runtime: under 20 minutes
- Hardware: vCPU=2, memory=8GB
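
To stay inside the runtime limit, an inference script can check a deadline between tasks. The sketch below assumes the work splits into independent tasks; `run_with_budget`, its parameters, and the 60-second safety margin are hypothetical. Only the 20-minute figure comes from the constraint above.

```python
import time

BUDGET_SECONDS = 20 * 60  # "under 20 minutes" from the constraint above


def run_with_budget(tasks, handle, budget_seconds=BUDGET_SECONDS,
                    safety_margin=60.0, clock=time.monotonic):
    """Process tasks in order, stopping before the budget minus the margin is used up."""
    deadline = clock() + budget_seconds - safety_margin
    results = []
    for task in tasks:
        if clock() >= deadline:
            break  # leave the remaining tasks unprocessed and return what we have
        results.append(handle(task))
    return results
```

The check runs only between tasks, so the safety margin should be larger than the slowest single task. The `clock` parameter exists so the cutoff can be tested with a fake clock.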