# Round 2: Grand Finale Problem Statement

**Date:** 25–26 April 2026

**Venue:** Scaler School of Technology, Electronic City, Bangalore

**Category:** Solo (Akhil Soni)

---

## The Task

Choose one (or more) of the themes below and design your own problem statement around it. Build an environment, train an agent on it, and show measurable improvement.

> *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*

It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.

---

## Themes

### Theme 1: Multi-Agent Interactions

Environments involving cooperation, competition, negotiation, and coalition formation. Enables agents to model beliefs and incentives of others in partially observable settings. Drives theory-of-mind reasoning and emergent strategic behavior.

**Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.

**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.

**Bonus prizes:**

- **Fleet AI** (Scalable Oversight): Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
- **Halluminate** (Multi-Actor Environments): An agent interacts with and manages multiple actors to discover and achieve a task.

---

### Theme 2: (Super) Long-Horizon Planning & Instruction Following

Environments requiring deep, multi-step reasoning with sparse or delayed rewards. Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes. Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.

**Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks that need sessions beyond context memory limits.

**Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, 300+ instruction following.

**Bonus prizes:**

- **Scale AI**: Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
- **Mercor**: Environment with capped/uncapped rewards where frontier model rewards scale with token output.

---

### Theme 3: World Modeling

#### 3.1 Professional Tasks

Environments requiring real interaction with tools, APIs, or dynamic systems. The model must do real work instead of exploiting shortcuts. Strengthens causal reasoning and persistent world models.

**Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.

**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.

**Bonus prizes:**

- **Scaler AI Labs** (Multi-App RL Environment for Enterprise Workflows): Complex workflows and business rule nuances in a large enterprise.

#### 3.2 Personalized Tasks

Environments for real personalized task handling: personal messages, dinner conflicts, tough emails, any personal assistant task.

**Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks, conflicts, and managing them as delegations.

**Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.

**Bonus prizes:**

- **Patronus AI** (Consumer Workflows with Schema Drift): Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.

---

### Theme 4: Self-Improvement

Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Goal: agents learn to drive their own capability growth (recursive skill amplification).

**Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.

**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.

**Bonus prizes:**

- **Snorkel AI** (Simulated Experts-in-the-Loop): Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.

---

### Theme 5: Wild Card

No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.

---

## Minimum Requirements (Non-Negotiable)

Missing any of these puts your submission at a serious disadvantage.

| Requirement | Details |
|---|---|
| OpenEnv (latest release) | Build on top of the framework, don't reinvent the wheel |
| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
| Training evidence | Loss and reward plots from a real run |
| Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
| HF Space deployment | Environment hosted, discoverable, and runnable |
| README | Motivates the problem, explains the env, shows results, links all materials |

---

## Judging Criteria

| Criterion | Weight | What It Means |
|---|---|---|
| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
| Showing Improvement in Rewards | 20% | Observable training progress: reward curves, before/after behavior, baseline comparison |
| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |

---

## Pitch Format

- **3 minutes** to pitch
- **2 minutes** Q&A
- 5 minutes total per team

Your pitch should answer:

1. **Problem:** what capability gap or interesting domain are you targeting?
2. **Environment:** what does the agent see, do, and get rewarded for?
3. **Results:** what changed after training? Show it.
4. **Why it matters:** who would care, and why?

---

## What Makes a Submission Stand Out

**Pick an ambitious problem.** Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?

**Design a reward signal that actually teaches.**

- Rich signal throughout the episode (not just 0/1 at the end)
- Hard to game: an agent that exploits the reward without solving the task should not score high
- Use OpenEnv's Rubric system thoughtfully

**Show real training end to end.**

- Training loop connects to your environment (not a static dataset)
- Train long enough that curves mean something
- Compare trained agent vs random/untrained baseline (quantitative and qualitative)
- Include plots and numbers in your README

**Make plots readable.**

- Label both axes with units
- Save as `.png` / `.jpg` and commit to repo
- Embed key plots in README with a one-line caption
- Put baseline vs trained on the same axes

---

## Engineering Checklist

- [ ] Use `Environment` / `MCPEnvironment` base classes properly
- [ ] Respect client/server separation (clients never import server internals)
- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
- [ ] Valid `openenv.yaml` manifest
- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
- [ ] README links to blog, video, or slides
- [ ] No large video files in HF repo (use URL references)

---

## Before You Arrive in Bangalore

Post-training happens on-site with provided compute credits.
Use the time before April 25 to:

- [ ] Finalize your problem statement
- [ ] Build and deploy your environment to HF Space
- [ ] Write your training script (ready to run, not necessarily fully executed)
- [ ] Prepare your 3-minute pitch story

---

## Infrastructure Constraints (same as Round 1)

- Inference script runtime: under 20 minutes
- Hardware: vCPU=2, memory=8GB
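
One way to stay inside the 20-minute inference limit is to check a wall-clock budget before starting each episode. The sketch below uses only the Python standard library. `TimeBudget`, `run_episodes`, and the 2-minute safety margin are hypothetical names and values chosen for illustration; they are not part of OpenEnv or of the official rules.

```python
import time


class TimeBudget:
    """Track wall-clock time against a fixed limit (hypothetical helper)."""

    def __init__(self, limit_s, margin_s=0.0, clock=time.monotonic):
        if margin_s < 0 or margin_s >= limit_s:
            raise ValueError("margin_s must satisfy 0 <= margin_s < limit_s")
        self._clock = clock
        # Stop margin_s seconds early to leave time for saving results.
        self._deadline = clock() + limit_s - margin_s

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._deadline - self._clock())

    def expired(self):
        """True once the deadline has been reached."""
        return self._clock() >= self._deadline


def run_episodes(episodes, budget):
    """Run zero-argument callables in order, stopping when the budget is spent."""
    results = []
    for episode in episodes:
        if budget.expired():
            break
        results.append(episode())
    return results


if __name__ == "__main__":
    # Assumed values: the 20-minute limit from the brief, minus a 2-minute margin.
    budget = TimeBudget(limit_s=20 * 60, margin_s=120)
    scores = run_episodes([lambda: 1.0, lambda: 0.5], budget)
    print(f"finished {len(scores)} episodes, {budget.remaining():.0f}s left")
```

The check runs only between episodes, so a single episode that runs long can still overshoot the deadline. Size the margin to cover your slowest episode.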