# Round 2 - Grand Finale Problem Statement
**Date:** 25-26 April 2026
**Venue:** Scaler School of Technology, Electronic City, Bangalore
**Category:** Solo - Akhil Soni
---
## The Task
Choose one (or more) of the themes below and design your own problem statement around it.
Build an environment, train an agent on it, and show measurable improvement.
> *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*
It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.
---
## Themes
### Theme 1 - Multi-Agent Interactions
Environments involving cooperation, competition, negotiation, and coalition formation.
Enables agents to model beliefs and incentives of others in partially observable settings.
Drives theory-of-mind reasoning and emergent strategic behavior.
**Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.
**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
**Bonus prizes:**
- **Fleet AI** - Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
- **Halluminate** - Multi-Actor Environments: An agent interacts with and manages multiple actors to discover and achieve a task.
---
### Theme 2 - (Super) Long-Horizon Planning & Instruction Following
Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.
**Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks whose sessions exceed the model's context window.
**Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, following 300+ instructions in a single task.
**Bonus prizes:**
- **Scale AI** - Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
- **Mercor** - Environment with capped/uncapped rewards where frontier model rewards scale with token output.
---
### Theme 3 - World Modeling
#### 3.1 Professional Tasks
Environments requiring real interaction with tools, APIs, or dynamic systems.
The model must do real work instead of exploiting shortcuts.
Strengthens causal reasoning and persistent world models.
**Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.
**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.
**Bonus prizes:**
- **Scaler AI Labs** - Multi-App RL Environment for Enterprise Workflows: Complex workflows and business rule nuances in a large enterprise.
#### 3.2 Personalized Tasks
Environments for real personalized task handling: personal messages, dinner conflicts, tough emails, or any other personal-assistant task.
**Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and of managing them as delegated work.
**Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.
**Bonus prizes:**
- **Patronus AI** - Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.
---
### Theme 4 - Self-Improvement
Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
Goal: agents learn to drive their own capability growth (recursive skill amplification).
**Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.
**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
**Bonus prizes:**
- **Snorkel AI** - Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.
---
### Theme 5 - Wild Card
No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.
---
## Minimum Requirements (Non-Negotiable)
Missing any of these puts your submission at a serious disadvantage.
| Requirement | Details |
|---|---|
| OpenEnv (latest release) | Build on top of the framework; don't reinvent the wheel |
| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
| Training evidence | Loss and reward plots from a real run |
| Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
| HF Space deployment | Environment hosted, discoverable, and runnable |
| README | Motivates the problem, explains the env, shows results, links all materials |
---
## Judging Criteria
| Criterion | Weight | What It Means |
|---|---|---|
| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
| Showing Improvement in Rewards | 20% | Observable training progress: reward curves, before/after behavior, baseline comparison |
| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
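
As a worked example of how the weights combine, the sketch below computes a weighted total from per-criterion scores. Only the weights come from the table; the 0-10 raw-score scale, the `WEIGHTS` dictionary, the function name, and the sample scores are illustrative assumptions, since the statement does not say how judges assign raw scores.

```python
# Weights copied from the judging table; the 0-10 raw-score scale is an assumption.
WEIGHTS = {
    "environment_innovation": 0.40,
    "storytelling_presentation": 0.30,
    "showing_improvement": 0.20,
    "reward_training_pipeline": 0.10,
}


def weighted_total(scores: dict[str, float]) -> float:
    """Return the weighted sum of per-criterion scores (same scale as the inputs)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one submission, each out of 10.
example_scores = {
    "environment_innovation": 8.0,
    "storytelling_presentation": 6.0,
    "showing_improvement": 7.0,
    "reward_training_pipeline": 9.0,
}
total = weighted_total(example_scores)
```

With these made-up scores the total is 0.4·8 + 0.3·6 + 0.2·7 + 0.1·9 = 7.3 out of 10. Environment innovation alone contributes 3.2 of that, more than the last two criteria combined.
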
---
## Pitch Format
- **3 minutes** to pitch
- **2 minutes** Q&A
- 5 minutes total per team
Your pitch should answer:
1. **Problem** - what capability gap or interesting domain are you targeting?
2. **Environment** - what does the agent see, do, and get rewarded for?
3. **Results** - what changed after training? Show it.
4. **Why it matters** - who would care, and why?
---
## What Makes a Submission Stand Out
**Pick an ambitious problem.**
Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?
**Design a reward signal that actually teaches.**
- Rich signal throughout the episode (not just 0/1 at the end)
- Hard to game: an agent that exploits the reward without solving the task should not score high
- Use OpenEnv's Rubric system thoughtfully
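
One way to read the first two points is sketched below: a score with dense partial credit, a terminal bonus that is paid only after independent verification, and a small penalty for invalid actions. The function name `shaped_reward`, the subgoal-based progress measure, and all the constants are hypothetical. This is plain Python and does not use or reproduce OpenEnv's Rubric system.

```python
def shaped_reward(
    subgoals_done: int,
    subgoals_total: int,
    task_verified: bool,
    invalid_actions: int,
) -> float:
    """Hypothetical score in [0, 1]: dense partial credit plus a verified terminal bonus."""
    if subgoals_total <= 0:
        raise ValueError("subgoals_total must be positive")
    # Dense component: partial credit for each completed subgoal (at most 0.5).
    progress = 0.5 * min(subgoals_done, subgoals_total) / subgoals_total
    # Terminal component: paid only when an independent check confirms the task
    # is solved, so claiming success without doing the work earns nothing extra.
    bonus = 0.5 if task_verified else 0.0
    # Small penalty per invalid action discourages spamming actions to fish for credit.
    penalty = 0.05 * invalid_actions
    return max(0.0, min(1.0, progress + bonus - penalty))
```

Under these assumed constants, an agent that completes every subgoal but fails verification is capped at 0.5, so the full score is reachable only by actually solving the task.
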
**Show real training end to end.**
- Training loop connects to your environment (not a static dataset)
- Train long enough that curves mean something
- Compare the trained agent against a random/untrained baseline, both quantitatively and qualitatively
- Include plots and numbers in your README
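
The baseline comparison can be sketched as a small evaluation harness that runs both policies on the same fixed-seed episodes. Everything here is a stand-in: the toy task, `run_episode`, `evaluate`, and the two policies are hypothetical, and a real submission would call its own environment and trained model instead.

```python
import random
import statistics
from typing import Callable

Policy = Callable[[int, random.Random], int]  # (observation, rng) -> action


def run_episode(policy: Policy, env_rng: random.Random,
                policy_rng: random.Random, steps: int = 20) -> float:
    """Toy stand-in task: reward 1.0 whenever the action equals observation % 3."""
    total = 0.0
    for _ in range(steps):
        obs = env_rng.randrange(100)
        if policy(obs, policy_rng) == obs % 3:
            total += 1.0
    return total


def evaluate(policy: Policy, episodes: int = 50, seed: int = 0) -> tuple[float, float]:
    """Return (mean, standard deviation) of episode reward over fixed-seed episodes."""
    env_rng = random.Random(seed)         # same observations for every policy
    policy_rng = random.Random(seed + 1)  # separate stream for policy randomness
    returns = [run_episode(policy, env_rng, policy_rng) for _ in range(episodes)]
    return statistics.mean(returns), statistics.stdev(returns)


def random_policy(obs: int, rng: random.Random) -> int:
    return rng.randrange(3)  # untrained baseline: ignores the observation


def trained_policy(obs: int, rng: random.Random) -> int:
    return obs % 3  # placeholder for a policy that has learned the task


baseline_mean, baseline_std = evaluate(random_policy)
trained_mean, trained_std = evaluate(trained_policy)
print(f"baseline: {baseline_mean:.2f} +/- {baseline_std:.2f}")
print(f"trained:  {trained_mean:.2f} +/- {trained_std:.2f}")
```

Keeping the environment's random stream separate from the policy's means both agents face identical episodes, so the difference in means reflects the policy and not the luck of the draw.
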
**Make plots readable.**
- Label both axes with units
- Save as `.png` / `.jpg` and commit to repo
- Embed key plots in README with a one-line caption
- Put baseline vs trained on the same axes
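
A minimal plotting sketch that follows the points above, assuming matplotlib is available and that rewards are logged once per episode. The helper names, the smoothing window of 10, and the output file name `reward_curve.png` are illustrative choices.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Smooth a curve with a trailing moving average (shorter window at the start)."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def plot_rewards(baseline: list[float], trained: list[float],
                 path: str = "reward_curve.png") -> None:
    """Plot baseline and trained reward curves on the same axes and save to `path`."""
    # Imported here so moving_average stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # headless backend: write a file, open no window
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(moving_average(baseline, 10), label="untrained baseline")
    ax.plot(moving_average(trained, 10), label="trained agent")
    ax.set_xlabel("episode index")
    ax.set_ylabel("episode reward (moving average over 10 episodes)")
    ax.set_title("Episode reward: baseline vs trained")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

The saved file can then be committed and embedded in the README with a caption, for example `![Episode reward, trained agent vs untrained baseline](reward_curve.png)`.
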
---
## Engineering Checklist
- [ ] Use `Environment` / `MCPEnvironment` base classes properly
- [ ] Respect client/server separation (clients never import server internals)
- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
- [ ] Valid `openenv.yaml` manifest
- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
- [ ] README links to blog, video, or slides
- [ ] No large video files in HF repo (use URL references)
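
To make the `reset` / `step` / `state` surface and the reserved-name rule concrete, here is a toy sketch in plain Python. `CounterEnv` and `check_tool_names` are hypothetical names and deliberately do not subclass or imitate OpenEnv's `Environment` / `MCPEnvironment`, whose actual signatures and return types should be taken from the OpenEnv documentation.

```python
from dataclasses import dataclass

RESERVED_TOOL_NAMES = {"reset", "step", "state", "close"}  # from the checklist above


def check_tool_names(tool_names: list[str]) -> list[str]:
    """Return the tool names that collide with reserved names (empty list means OK)."""
    return sorted(set(tool_names) & RESERVED_TOOL_NAMES)


@dataclass
class CounterEnv:
    """Toy environment with a reset/step/state surface; not an OpenEnv class."""
    target: int = 3
    count: int = 0
    done: bool = False

    def reset(self) -> dict:
        """Start a new episode and return the initial state."""
        self.count, self.done = 0, False
        return self.state()

    def step(self, action: int) -> tuple[dict, float, bool]:
        """Add `action` to the counter; reward 1.0 for landing exactly on the target."""
        if self.done:
            raise RuntimeError("episode finished; call reset() first")
        self.count += action
        self.done = self.count >= self.target
        reward = 1.0 if self.count == self.target else 0.0
        return self.state(), reward, self.done

    def state(self) -> dict:
        """Return a read-only snapshot of the current episode."""
        return {"count": self.count, "target": self.target, "done": self.done}
```

Running a check like `check_tool_names(["search", "reset"])` before registering MCP tools would flag `reset` as a collision early, instead of at deployment time.
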
---
## Before You Arrive in Bangalore
Post-training happens on-site with provided compute credits.
Use the time before April 25 to:
- [ ] Finalize your problem statement
- [ ] Build and deploy your environment to HF Space
- [ ] Write your training script (ready to run, not necessarily fully executed)
- [ ] Prepare your 3-minute pitch story
---
## Infrastructure Constraints (same as Round 1)
- Inference script runtime: under 20 minutes
- Hardware: vCPU=2, memory=8GB
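
One way to stay inside the runtime cap is to track elapsed time in the inference script and stop taking new work before the limit. The sketch below assumes the script processes a list of independent tasks; `run_with_budget`, the 60-second safety margin, and the injectable `clock` parameter are hypothetical.

```python
import time

TIME_LIMIT_S = 20 * 60   # runtime cap from the constraints above
SAFETY_MARGIN_S = 60     # assumed headroom for startup and teardown


def run_with_budget(tasks: list, handle, budget_s: float = TIME_LIMIT_S - SAFETY_MARGIN_S,
                    clock=time.monotonic):
    """Process tasks in order and stop early once the time budget is used up.

    Returns (results, skipped) so the caller can report what was not attempted.
    """
    start = clock()
    results = []
    for index, task in enumerate(tasks):
        # Check before starting each task; a task already running is not interrupted.
        if clock() - start >= budget_s:
            return results, list(tasks[index:])
        results.append(handle(task))
    return results, []
```

Because the check happens only between tasks, the margin should be at least as long as the slowest single task is expected to take.
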