# Round 2 – Grand Finale Problem Statement

**Date:** 25–26 April 2026

**Venue:** Scaler School of Technology, Electronic City, Bangalore

**Category:** Solo – Akhil Soni

---

## The Task

Choose one (or more) of the themes below and design your own problem statement around it.
Build an environment, train an agent on it, and show measurable improvement.

> *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*

It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.

---
## Themes

### Theme 1 – Multi-Agent Interactions

Environments involving cooperation, competition, negotiation, and coalition formation.
Enables agents to model beliefs and incentives of others in partially observable settings.
Drives theory-of-mind reasoning and emergent strategic behavior.

**Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.

**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.

**Bonus prizes:**

- **Fleet AI** – Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
- **Halluminate** – Multi-Actor Environments: An agent interacts with and manages multiple actors to discover and achieve a task.

---
### Theme 2 – (Super) Long-Horizon Planning & Instruction Following

Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.

**Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks that need sessions beyond context memory limits.

**Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, 300+ instruction following.

**Bonus prizes:**

- **Scale AI** – Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
- **Mercor** – Environment with capped/uncapped rewards where frontier model rewards scale with token output.

---
### Theme 3 – World Modeling

#### 3.1 Professional Tasks

Environments requiring real interaction with tools, APIs, or dynamic systems.
The model must do real work instead of exploiting shortcuts.
Strengthens causal reasoning and persistent world models.

**Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.

**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.

**Bonus prizes:**

- **Scaler AI Labs** – Multi-App RL Environment for Enterprise Workflows: Complex workflows and business rule nuances in a large enterprise.

#### 3.2 Personalized Tasks

Environments for real personalized task handling: personal messages, dinner conflicts, tough emails, any personal assistant task.

**Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks, conflicts, and managing them as delegations.

**Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.

**Bonus prizes:**

- **Patronus AI** – Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.

---
### Theme 4 – Self-Improvement

Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
Goal: agents learn to drive their own capability growth (recursive skill amplification).

**Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.

**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.

**Bonus prizes:**

- **Snorkel AI** – Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.

---

### Theme 5 – Wild Card

No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.

---
## Minimum Requirements (Non-Negotiable)

Missing any of these puts your submission at a serious disadvantage.

| Requirement | Details |
|---|---|
| OpenEnv (latest release) | Build on top of the framework, don't reinvent the wheel |
| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
| Training evidence | Loss and reward plots from a real run |
| Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
| HF Space deployment | Environment hosted, discoverable, and runnable |
| README | Motivates the problem, explains the env, shows results, links all materials |

---
## Judging Criteria

| Criterion | Weight | What It Means |
|---|---|---|
| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
| Showing Improvement in Rewards | 20% | Observable training progress: reward curves, before/after behavior, baseline comparison |
| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
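As a worked example of how these weights combine, the sketch below computes a weighted total from per-criterion scores. Only the weights come from the table; the 0-10 scoring scale, the key names, and the sample scores are assumptions made for illustration, not the judges' actual procedure.

```python
# Weights taken from the judging table; key names are illustrative.
WEIGHTS = {
    "environment_innovation": 0.40,
    "storytelling_presentation": 0.30,
    "showing_improvement": 0.20,
    "reward_training_pipeline": 0.10,
}


def weighted_total(scores: dict[str, float]) -> float:
    """Combine per-criterion scores into one weighted total.

    Assumes every criterion is scored on the same scale (hypothetically 0-10).
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, not real judging data.
example_scores = {
    "environment_innovation": 8,
    "storytelling_presentation": 6,
    "showing_improvement": 7,
    "reward_training_pipeline": 9,
}
print(f"{weighted_total(example_scores):.2f}")  # prints 7.30
```

Under these assumed scores, a one-point gain in Environment Innovation moves the total four times as much as a one-point gain in Reward & Training Pipeline.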

---

## Pitch Format

- **3 minutes** to pitch
- **2 minutes** Q&A
- 5 minutes total per team

Your pitch should answer:

1. **Problem** – what capability gap or interesting domain are you targeting?
2. **Environment** – what does the agent see, do, and get rewarded for?
3. **Results** – what changed after training? Show it.
4. **Why it matters** – who would care, and why?

---
## What Makes a Submission Stand Out

**Pick an ambitious problem.**
Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?

**Design a reward signal that actually teaches.**

- Rich signal throughout the episode (not just 0/1 at the end)
- Hard to game: an agent that exploits the reward without solving the task should not score high
- Use OpenEnv's Rubric system thoughtfully
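One way to read the first two points is sketched below in plain Python: partial credit for each newly completed subgoal, a small per-step cost, and a success bonus that is paid only once and only when an independent check passes. This is not OpenEnv's Rubric system; `RewardTracker`, the subgoal names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RewardTracker:
    """Dense reward: partial credit per new subgoal, bonus only on verified success."""

    subgoals: tuple[str, ...]
    step_penalty: float = 0.01    # assumed value
    subgoal_reward: float = 0.2   # assumed value
    success_bonus: float = 1.0    # assumed value
    completed: set[str] = field(default_factory=set)
    bonus_paid: bool = False

    def score_step(self, achieved: set[str], task_verified: bool) -> float:
        reward = -self.step_penalty
        # Credit each subgoal only once, so repeating an easy subgoal
        # cannot be used to farm reward.
        newly_done = (achieved & set(self.subgoals)) - self.completed
        reward += self.subgoal_reward * len(newly_done)
        self.completed |= newly_done
        # The bonus depends on an external check of the final outcome
        # (task_verified), not on the agent's own claim of success.
        all_done = self.completed == set(self.subgoals)
        if task_verified and all_done and not self.bonus_paid:
            reward += self.success_bonus
            self.bonus_paid = True
        return reward


tracker = RewardTracker(subgoals=("parse", "plan", "execute"))
first = tracker.score_step({"parse"}, task_verified=False)   # new subgoal: credit
repeat = tracker.score_step({"parse"}, task_verified=False)  # repeat: step cost only
```

In this sketch the repeated step earns only the step penalty, which is the behavior the "hard to game" point asks for.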

**Show real training end to end.**

- Training loop connects to your environment (not a static dataset)
- Train long enough that curves mean something
- Compare trained agent vs random/untrained baseline, both quantitatively and qualitatively
- Include plots and numbers in your README
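The baseline comparison can be as simple as running both policies on the same seeds and reporting mean episode reward. The sketch below uses a hypothetical toy task and a hand-written stand-in for the trained agent; in a real submission both policies would act in your deployed environment, and the second would be your trained model.

```python
import random
from statistics import mean
from typing import Callable

# A policy maps (current position, RNG) to an action of -1 or +1.
Policy = Callable[[int, random.Random], int]


def run_episode(policy: Policy, seed: int, max_steps: int = 20) -> float:
    """Hypothetical toy task: move a counter from 0 to a target of 5."""
    rng = random.Random(seed)
    position, total = 0, 0.0
    for _ in range(max_steps):
        position += policy(position, rng)
        total -= 0.1  # small per-step cost
        if position == 5:
            total += 1.0  # reward for reaching the target
            break
    return total


def random_policy(position: int, rng: random.Random) -> int:
    return rng.choice((-1, 1))


def stand_in_trained_policy(position: int, rng: random.Random) -> int:
    # Placeholder for a trained agent; always moves toward the target.
    return 1


def evaluate(policy: Policy, seeds: range) -> float:
    """Mean episode reward over a fixed set of seeds."""
    return mean(run_episode(policy, seed) for seed in seeds)


seeds = range(100)  # same seeds for both policies, so the comparison is fair
baseline_score = evaluate(random_policy, seeds)
trained_score = evaluate(stand_in_trained_policy, seeds)
print(f"baseline={baseline_score:.2f} trained={trained_score:.2f}")
```

Reusing the same seeds for both policies keeps the difference attributable to the policy rather than to episode sampling.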

**Make plots readable.**

- Label both axes with units
- Save as `.png` / `.jpg` and commit to repo
- Embed key plots in README with a one-line caption
- Put baseline vs trained on the same axes
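A minimal sketch of these plotting points, assuming matplotlib is installed and that you have one list of per-episode rewards for each agent. The function names, file name, axis wording, and smoothing window are illustrative choices, not requirements.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Smooth a noisy reward curve with a trailing moving average."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def save_reward_plot(
    baseline: list[float],
    trained: list[float],
    path: str = "reward_curve.png",
    window: int = 10,
) -> str:
    """Plot baseline and trained rewards on the same axes and save a PNG."""
    # Imported here so the smoothing helper stays usable without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(moving_average(baseline, window), label="Baseline (untrained)")
    ax.plot(moving_average(trained, window), label="Trained")
    ax.set_xlabel("Episode (count)")
    ax.set_ylabel(f"Episode reward (points, {window}-episode average)")
    ax.set_title("Baseline vs trained agent")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return path
```

The saved file can then be committed and embedded in the README with a one-line caption, for example `![Reward per episode, baseline vs trained](reward_curve.png)`.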

---

## Engineering Checklist

- [ ] Use `Environment` / `MCPEnvironment` base classes properly
- [ ] Respect client/server separation (clients never import server internals)
- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
- [ ] Valid `openenv.yaml` manifest
- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
- [ ] README links to blog, video, or slides
- [ ] No large video files in HF repo (use URL references)
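To make the Gym-style shape and the reserved-name rule concrete, here is a plain-Python sketch. It deliberately does not subclass OpenEnv's `Environment` or `MCPEnvironment`, whose exact signatures are not given here; `GuessNumberEnv`, `StepResult`, and `check_tool_names` are hypothetical names, and treating `state` as a method is an assumption.

```python
from dataclasses import dataclass

# Reserved names listed in the checklist above.
RESERVED_TOOL_NAMES = frozenset({"reset", "step", "state", "close"})


def check_tool_names(names: list[str]) -> list[str]:
    """Return the proposed tool names that clash with reserved names."""
    return sorted(set(names) & RESERVED_TOOL_NAMES)


@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool


class GuessNumberEnv:
    """Hypothetical toy environment: guess a hidden number between 1 and 10."""

    def __init__(self, target: int = 7, max_steps: int = 5) -> None:
        self._target = target
        self._max_steps = max_steps
        self._steps = 0
        self._done = False

    def reset(self) -> str:
        """Start a new episode and return the first observation."""
        self._steps = 0
        self._done = False
        return "Guess a number between 1 and 10."

    def step(self, action: int) -> StepResult:
        """Apply one guess and return observation, reward, and done flag."""
        if self._done:
            raise RuntimeError("episode finished; call reset() first")
        self._steps += 1
        if action == self._target:
            self._done = True
            return StepResult("correct", 1.0, True)
        self._done = self._steps >= self._max_steps
        hint = "higher" if action < self._target else "lower"
        return StepResult(hint, 0.0, self._done)

    def state(self) -> dict:
        """Expose episode metadata without revealing the hidden target."""
        return {"steps": self._steps, "done": self._done}


env = GuessNumberEnv()
env.reset()
result = env.step(3)  # observation is the hint "higher"
clashes = check_tool_names(["search", "reset", "close"])  # ["close", "reset"]
```

Running a check like `check_tool_names` over your MCP tool list before deployment is one cheap way to catch the reserved-name item early.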

---

## Before You Arrive in Bangalore

Post-training happens on-site with provided compute credits.
Use the time before April 25 to:

- [ ] Finalize your problem statement
- [ ] Build and deploy your environment to HF Space
- [ ] Write your training script (ready to run, not necessarily fully executed)
- [ ] Prepare your 3-minute pitch story

---

## Infrastructure Constraints (same as Round 1)

- Inference script runtime: under 20 minutes
- Hardware: vCPU=2, memory=8GB
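One way to stay inside the runtime limit is to check a monotonic deadline between tasks and stop early. The sketch below is an assumption about how an inference script might be organized; `run_within_budget`, `run_one`, and the 60-second safety margin are hypothetical, and the check does not interrupt a task that is already running.

```python
import time
from typing import Callable, Iterable

RUNTIME_LIMIT_S = 20 * 60  # the 20-minute limit stated above
SAFETY_MARGIN_S = 60       # assumed margin for startup and saving results


def run_within_budget(
    tasks: Iterable,
    run_one: Callable,
    budget_s: float = RUNTIME_LIMIT_S - SAFETY_MARGIN_S,
    clock: Callable[[], float] = time.monotonic,
) -> list:
    """Run tasks in order, stopping before a new task once the budget is spent."""
    deadline = clock() + budget_s
    results = []
    for task in tasks:
        if clock() >= deadline:
            break  # out of time: keep what has been completed so far
        results.append(run_one(task))
    return results


# Usage with a trivial stand-in for a real inference call.
outputs = run_within_budget(["a", "b", "c"], run_one=str.upper)
```

Passing the clock as a parameter makes the cutoff testable without waiting for real time to pass.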