# Round 2 — Grand Finale Problem Statement

**Date:** 25–26 April 2026
**Venue:** Scaler School of Technology, Electronic City, Bangalore
**Category:** Solo — Akhil Soni

---

## The Task

Choose one (or more) of the themes below and design your own problem statement around it.
Build an environment, train an agent on it, and show measurable improvement.

> *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*

It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.

---

## Themes

### Theme 1 — Multi-Agent Interactions

Environments involving cooperation, competition, negotiation, and coalition formation.
Enables agents to model beliefs and incentives of others in partially observable settings.
Drives theory-of-mind reasoning and emergent strategic behavior.

**Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.

**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.

**Bonus prizes:**
- **Fleet AI** — Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
- **Halluminate** — Multi-Actor Environments: An agent interacts with and manages multiple actors to discover and achieve a task.

---

### Theme 2 — (Super) Long-Horizon Planning & Instruction Following

Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.

**Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks that require sessions longer than the model's context window.

**Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, following 300+ instructions.

**Bonus prizes:**
- **Scale AI** — Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
- **Mercor** — Environment with capped/uncapped rewards where frontier model rewards scale with token output.

---

### Theme 3 — World Modeling

#### 3.1 Professional Tasks

Environments requiring real interaction with tools, APIs, or dynamic systems.
The model must do real work instead of exploiting shortcuts.
Strengthens causal reasoning and persistent world models.

**Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.

**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.

**Bonus prizes:**
- **Scaler AI Labs** — Multi-App RL Environment for Enterprise Workflows: Complex workflows and business rule nuances in a large enterprise.

#### 3.2 Personalized Tasks

Environments for real personalized task handling — personal messages, dinner conflicts, tough emails, any personal assistant task.

**Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and of managing them as delegated tasks.

**Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.

**Bonus prizes:**
- **Patronus AI** — Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.

---

### Theme 4 — Self-Improvement

Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
Goal: agents learn to drive their own capability growth (recursive skill amplification).

**Expected outcome:** An environment in which an LLM improves through self-play over a defined set of tasks.

**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.

**Bonus prizes:**
- **Snorkel AI** — Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.

---

### Theme 5 — Wild Card

No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.

---

## Minimum Requirements (Non-Negotiable)

Missing any of these puts your submission at a serious disadvantage.

| Requirement | Details |
|---|---|
| OpenEnv (latest release) | Build on top of the framework, don't reinvent the wheel |
| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
| Training evidence | Loss and reward plots from a real run |
| Writeup | Mini-blog on Hugging Face, OR a YouTube video under 2 minutes, OR a short slide deck |
| HF Space deployment | Environment hosted, discoverable, and runnable |
| README | Motivates the problem, explains the env, shows results, links all materials |

---

## Judging Criteria

| Criterion | Weight | What It Means |
|---|---|---|
| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
| Showing Improvement in Rewards | 20% | Observable training progress — reward curves, before/after behavior, baseline comparison |
| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |

---

## Pitch Format

- **3 minutes** to pitch
- **2 minutes** Q&A
- 5 minutes total per team

Your pitch should answer:
1. **Problem** — what capability gap or interesting domain are you targeting?
2. **Environment** — what does the agent see, do, and get rewarded for?
3. **Results** — what changed after training? Show it.
4. **Why it matters** — who would care, and why?

---

## What Makes a Submission Stand Out

**Pick an ambitious problem.**
Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?

**Design a reward signal that actually teaches.**
- Rich signal throughout the episode (not just 0/1 at the end)
- Hard to game — an agent that exploits the reward without solving the task should not score high
- Use OpenEnv's Rubric system thoughtfully
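
As an illustration of the points above, here is a minimal sketch of a dense, harder-to-game reward in plain Python. It does not use OpenEnv's Rubric system; the `StepOutcome` fields, the `shaped_reward` function, and the weights (0.5, 0.5, 0.25) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical per-step summary produced by an environment."""
    subgoals_done: int      # subgoals completed so far
    subgoals_total: int     # subgoals in the whole task
    task_verified: bool     # an independent checker confirmed the final result
    claimed_success: bool   # the agent declared the task finished


def shaped_reward(prev: StepOutcome, curr: StepOutcome) -> float:
    """Progress delta, plus a one-time verified bonus, minus a false-claim penalty."""
    # Partial credit for newly completed subgoals gives signal during the episode.
    progress = (curr.subgoals_done - prev.subgoals_done) / curr.subgoals_total
    reward = 0.5 * progress
    if curr.task_verified and not prev.task_verified:
        # The completion bonus is paid once, and only when the checker agrees.
        reward += 0.5
    elif curr.claimed_success and not curr.task_verified:
        # Declaring success without verification costs reward.
        reward -= 0.25
    return reward
```

Tying the bonus to an independent check rather than to the agent's own claim is one way to keep an agent that merely announces success from scoring high.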

**Show real training end to end.**
- Training loop connects to your environment (not a static dataset)
- Train long enough that curves mean something
- Compare trained agent vs random/untrained baseline — quantitative and qualitative
- Include plots and numbers in your README
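
The baseline comparison can be sketched as a small evaluation harness. Everything here is a stand-in: `CountdownEnv` is a hypothetical toy environment, and `trained_policy` is a hand-written rule standing in for a trained agent. In a real submission the policies would call the model and the environment would be your OpenEnv environment.

```python
import random
from statistics import mean
from typing import Callable


class CountdownEnv:
    """Hypothetical toy environment: reach 0 from a start value using -1 or +1."""

    def __init__(self, start: int = 5, max_steps: int = 20):
        self.start, self.max_steps = start, max_steps

    def reset(self) -> int:
        self.value, self.steps = self.start, 0
        return self.value

    def step(self, action: int):
        self.value += action
        self.steps += 1
        solved = self.value == 0
        done = solved or self.steps >= self.max_steps
        reward = 1.0 if solved else -0.05  # small per-step cost, bonus on success
        return self.value, reward, done


def evaluate(policy: Callable[[int], int], episodes: int = 50, seed: int = 0) -> float:
    """Mean episode return of `policy` over a fixed number of episodes."""
    random.seed(seed)  # same seed for every policy so runs are comparable
    env = CountdownEnv()
    returns = []
    for _ in range(episodes):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return mean(returns)


def random_policy(obs: int) -> int:
    return random.choice([-1, 1])


def trained_policy(obs: int) -> int:
    # Stand-in for a trained agent: always move toward zero.
    return -1 if obs > 0 else 1


baseline = evaluate(random_policy)
trained = evaluate(trained_policy)
print(f"baseline mean return: {baseline:.2f}")
print(f"trained  mean return: {trained:.2f}")
```

Reporting both numbers from the same harness, with the same seed and episode count, gives the quantitative half of the comparison; a few side-by-side transcripts cover the qualitative half.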

**Make plots readable.**
- Label both axes with units
- Save as `.png` / `.jpg` and commit to repo
- Embed key plots in README with a one-line caption
- Put baseline vs trained on the same axes
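
A minimal plotting sketch that follows the checklist above, assuming `matplotlib` is installed. The function names, the output file name `reward_curve.png`, and the axis labels are illustrative choices; adjust the labels to match what your logs actually record.

```python
def moving_average(values, window=3):
    """Trailing moving average; early points average over the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def plot_reward_curves(baseline, trained, path="reward_curve.png", window=3):
    """Plot both runs on the same axes with labelled axes, then save as PNG."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend so this also runs without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(moving_average(baseline, window), label="baseline (untrained)")
    ax.plot(moving_average(trained, window), label="trained")
    ax.set_xlabel("training step")
    ax.set_ylabel("mean episode reward (smoothed)")
    ax.set_title("Baseline vs trained reward")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return path
```

Calling `plot_reward_curves(baseline_rewards, trained_rewards)` with two lists of logged rewards writes a single PNG that can be committed and embedded in the README with a one-line caption.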

---

## Engineering Checklist

- [ ] Use `Environment` / `MCPEnvironment` base classes properly
- [ ] Respect client/server separation (clients never import server internals)
- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
- [ ] Valid `openenv.yaml` manifest
- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
- [ ] README links to blog, video, or slides
- [ ] No large video files in HF repo (use URL references)
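
The reserved-name rule is easy to check automatically before deployment. The sketch below is plain Python and independent of OpenEnv; `find_reserved_collisions` and the example tool names are hypothetical, and matching case-insensitively is a conservative assumption rather than documented framework behavior.

```python
RESERVED_TOOL_NAMES = frozenset({"reset", "step", "state", "close"})


def find_reserved_collisions(tool_names):
    """Return the tool names that clash with the reserved names, sorted."""
    return sorted(
        name for name in tool_names
        if name.strip().lower() in RESERVED_TOOL_NAMES
    )


# Example tool list for an imaginary travel-assistant environment.
tools = ["search_flights", "book_hotel", "state", "send_email"]
collisions = find_reserved_collisions(tools)
if collisions:
    print(f"Rename these MCP tools: {collisions}")
```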

---

## Before You Arrive in Bangalore

Post-training happens on-site with provided compute credits.
Use the time before April 25 to:

- [ ] Finalize your problem statement
- [ ] Build and deploy your environment to HF Space
- [ ] Write your training script (ready to run, not necessarily fully executed)
- [ ] Prepare your 3-minute pitch story

---

## Infrastructure Constraints (same as Round 1)

- Inference script runtime: under 20 minutes
- Hardware: 2 vCPUs, 8 GB memory
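
One way to stay under the 20-minute limit is to give the inference loop an explicit time budget. This is a sketch only: `run_with_budget`, the 60-second safety margin, and the injectable `clock` parameter are assumptions made for the example. The check happens between episodes, so a single very long episode could still overrun.

```python
import time

RUNTIME_LIMIT_S = 20 * 60  # stated limit: under 20 minutes
SAFETY_MARGIN_S = 60       # assumption: leave a minute for shutdown and logging


def run_with_budget(episode_fn, max_episodes,
                    budget_s=RUNTIME_LIMIT_S - SAFETY_MARGIN_S,
                    clock=time.monotonic):
    """Run episodes until max_episodes is reached or the time budget is spent."""
    start = clock()
    results = []
    for i in range(max_episodes):
        if clock() - start >= budget_s:
            break  # stop early rather than overrun the limit
        results.append(episode_fn(i))
    return results
```

Passing `clock` as a parameter makes the cut-off testable with a fake clock instead of waiting for real time to pass.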