# Round 2: Grand Finale Problem Statement
**Date:** 25–26 April 2026
**Venue:** Scaler School of Technology, Electronic City, Bangalore
**Category:** Solo (Akhil Soni)
---
## The Task
Choose one (or more) of the themes below and design your own problem statement around it.
Build an environment, train an agent on it, and show measurable improvement.
> *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*
It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.
---
## Themes
### Theme 1: Multi-Agent Interactions
Environments involving cooperation, competition, negotiation, and coalition formation.
Enables agents to model beliefs and incentives of others in partially observable settings.
Drives theory-of-mind reasoning and emergent strategic behavior.
**Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.
**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
**Bonus prizes:**
- **Fleet AI** (Scalable Oversight): Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
- **Halluminate** (Multi-Actor Environments): An agent interacts with and manages multiple actors to discover and achieve a task.
---
### Theme 2: (Super) Long-Horizon Planning & Instruction Following
Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.
**Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks whose sessions run beyond the model's context memory limits.
**Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, 300+ instruction following.
**Bonus prizes:**
- **Scale AI**: Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
- **Mercor**: Environment with capped/uncapped rewards where frontier model rewards scale with token output.
---
### Theme 3: World Modeling
#### 3.1 Professional Tasks
Environments requiring real interaction with tools, APIs, or dynamic systems.
The model must do real work instead of exploiting shortcuts.
Strengthens causal reasoning and persistent world models.
**Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.
**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.
**Bonus prizes:**
- **Scaler AI Labs** (Multi-App RL Environment for Enterprise Workflows): Complex workflows and business rule nuances in a large enterprise.
#### 3.2 Personalized Tasks
Environments for real personalized task handling: personal messages, dinner conflicts, tough emails, or any other personal assistant task.
**Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and of managing them as delegated work.
**Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.
**Bonus prizes:**
- **Patronus AI** (Consumer Workflows with Schema Drift): Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.
---
### Theme 4: Self-Improvement
Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
Goal: agents learn to drive their own capability growth (recursive skill amplification).
**Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.
**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
**Bonus prizes:**
- **Snorkel AI** (Simulated Experts-in-the-Loop): Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.
---
### Theme 5: Wild Card
No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.
---
## Minimum Requirements (Non-Negotiable)
Missing any of these puts your submission at a serious disadvantage.
| Requirement | Details |
|---|---|
| OpenEnv (latest release) | Build on top of the framework; don't reinvent the wheel |
| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
| Training evidence | Loss and reward plots from a real run |
| Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
| HF Space deployment | Environment hosted, discoverable, and runnable |
| README | Motivates the problem, explains the env, shows results, links all materials |
---
## Judging Criteria
| Criterion | Weight | What It Means |
|---|---|---|
| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
| Showing Improvement in Rewards | 20% | Observable training progress: reward curves, before/after behavior, baseline comparison |
| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
---
## Pitch Format
- **3 minutes** to pitch
- **2 minutes** Q&A
- 5 minutes total per team
Your pitch should answer:
1. **Problem:** What capability gap or interesting domain are you targeting?
2. **Environment:** What does the agent see, do, and get rewarded for?
3. **Results:** What changed after training? Show it.
4. **Why it matters:** Who would care, and why?
---
## What Makes a Submission Stand Out
**Pick an ambitious problem.**
Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?
**Design a reward signal that actually teaches.**
- Rich signal throughout the episode (not just 0/1 at the end)
- Hard to game: an agent that exploits the reward without solving the task should not score high
- Use OpenEnv's Rubric system thoughtfully
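One way to read the first two bullets is as a per-step reward that pays for progress and reserves the large bonus for a verified solution. The sketch below is only an illustration under assumptions: the scalar `progress` in [0, 1], the function name `shaped_reward`, and the penalty and bonus values are all hypothetical, and none of it is part of OpenEnv or its Rubric system.

```python
def shaped_reward(prev_progress: float, progress: float, done: bool,
                  solved: bool, step_penalty: float = 0.01) -> float:
    """Dense per-step reward: progress delta, small step cost, terminal bonus.

    All names and constants here are hypothetical, chosen for illustration.
    """
    # Pay only for the *change* in progress, so idling or repeating an
    # action that achieves nothing earns no credit.
    reward = progress - prev_progress
    # A small cost per step discourages padding the episode.
    reward -= step_penalty
    # The terminal bonus is paid only when the task is verifiably solved,
    # not merely when the episode ends.
    if done and solved:
        reward += 1.0
    return reward
```

Because the bonus depends on `solved` rather than on `done`, an agent that simply runs out the clock collects only step penalties, which is one simple guard against reward gaming.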
**Show real training end to end.**
- Training loop connects to your environment (not a static dataset)
- Train long enough that curves mean something
- Compare the trained agent against a random/untrained baseline, both quantitatively and qualitatively
- Include plots and numbers in your README
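A baseline comparison is easier to trust when both agents are evaluated on the same episodes. The following is a minimal sketch on a made-up one-step task; `run_episode`, `random_policy`, `parity_policy` (a stand-in for a trained agent), and `compare` are hypothetical names, and a real submission would call its own environment instead.

```python
import random
from statistics import mean
from typing import Callable

# A policy maps (observation, rng) to an action. Hypothetical interface.
Policy = Callable[[int, random.Random], int]

def run_episode(policy: Policy, seed: int) -> float:
    """Toy task: observe the parity of a hidden target in 0..3, then guess it."""
    rng = random.Random(seed)
    target = rng.randrange(4)
    action = policy(target % 2, rng)
    return 1.0 if action == target else 0.0

def random_policy(obs: int, rng: random.Random) -> int:
    # Untrained baseline: ignores the observation entirely.
    return rng.randrange(4)

def parity_policy(obs: int, rng: random.Random) -> int:
    # Stand-in for a trained agent: only guesses values matching the hint.
    return obs + 2 * rng.randrange(2)

def compare(baseline: Policy, trained: Policy, seeds: range) -> dict:
    """Evaluate both policies on the same seeds so the comparison is paired."""
    base = mean(run_episode(baseline, s) for s in seeds)
    new = mean(run_episode(trained, s) for s in seeds)
    return {"baseline": base, "trained": new, "delta": new - base}

if __name__ == "__main__":
    print(compare(random_policy, parity_policy, range(2000)))
```

Reusing the seed list means both agents face identical hidden targets, so the reported `delta` reflects the policies rather than luck in the episode draw.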
**Make plots readable.**
- Label both axes with units
- Save as `.png` / `.jpg` and commit to repo
- Embed key plots in README with a one-line caption
- Put baseline vs trained on the same axes
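The plotting advice above could be followed with something like the sketch below. It assumes matplotlib is installed and that rewards are available as plain lists; the function names, the file name `reward_curve.png`, and the axis label text are illustrative choices, not requirements.

```python
from typing import Sequence

def moving_average(values: Sequence[float], window: int) -> list[float]:
    """Trailing mean over up to `window` points; output has the input's length."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def plot_reward_curves(baseline: Sequence[float], trained: Sequence[float],
                       path: str = "reward_curve.png", window: int = 10) -> str:
    """Draw baseline and trained rewards on one set of labeled axes and save a PNG."""
    # Imported here so the smoothing helper works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(moving_average(baseline, window), label="baseline (untrained)")
    ax.plot(moving_average(trained, window), label="trained")
    ax.set_xlabel("evaluation episode")
    ax.set_ylabel("episode reward (unitless)")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return path
```

The saved file can then be committed and embedded in the README with standard Markdown image syntax and a one-line caption.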
---
## Engineering Checklist
- [ ] Use `Environment` / `MCPEnvironment` base classes properly
- [ ] Respect client/server separation (clients never import server internals)
- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
- [ ] Valid `openenv.yaml` manifest
- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
- [ ] README links to blog, video, or slides
- [ ] No large video files in HF repo (use URL references)
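To make the Gym-style item and the reserved-name item concrete, here is a toy sketch in plain Python. It deliberately does not subclass OpenEnv's `Environment` or `MCPEnvironment`, whose actual signatures and return types should be taken from the framework itself; `CounterEnv` and `invalid_tool_names` are hypothetical names, and only the reserved names come from the checklist.

```python
from dataclasses import dataclass

# Reserved names as listed in the checklist above.
RESERVED_TOOL_NAMES = {"reset", "step", "state", "close"}

def invalid_tool_names(tool_names):
    """Return, sorted, the proposed tool names that collide with reserved names."""
    return sorted(set(tool_names) & RESERVED_TOOL_NAMES)

@dataclass
class CounterEnv:
    """Toy environment: reach `target` with +1 / -1 moves. Not an OpenEnv class."""
    target: int = 3
    max_steps: int = 10
    value: int = 0
    steps: int = 0

    def reset(self) -> dict:
        # Start a fresh episode and return the initial observation.
        self.value, self.steps = 0, 0
        return self.state()

    def step(self, action: int) -> tuple[dict, float, bool]:
        if action not in (-1, 1):
            raise ValueError("action must be -1 or +1")
        before = abs(self.target - self.value)
        self.value += action
        self.steps += 1
        after = abs(self.target - self.value)
        # Episode ends on success or when the step limit is reached.
        done = after == 0 or self.steps >= self.max_steps
        # Reward is the reduction in distance to the target (negative if worse).
        return self.state(), float(before - after), done

    def state(self) -> dict:
        # Read-only snapshot; calling it does not advance the episode.
        return {"value": self.value, "steps": self.steps}
```

A quick pre-submission check might call `invalid_tool_names` on the list of MCP tool names an environment exposes and fail the build if the result is non-empty.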
---
## Before You Arrive in Bangalore
Post-training happens on-site with provided compute credits.
Use the time before April 25 to:
- [ ] Finalize your problem statement
- [ ] Build and deploy your environment to HF Space
- [ ] Write your training script (ready to run, not necessarily fully executed)
- [ ] Prepare your 3-minute pitch story
---
## Infrastructure Constraints (same as Round 1)
- Inference script runtime: under 20 minutes
- Hardware: vCPU=2, memory=8GB
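
One defensive pattern for the 20-minute runtime limit is to track a wall-clock budget and stop taking new work before it runs out. This is a sketch under assumptions: `TimeBudget`, `run_inference`, and the 60-second safety margin are hypothetical, and how the limit is actually enforced is not specified here.

```python
import time

class TimeBudget:
    """Tracks elapsed wall-clock time against a hard limit minus a safety margin."""

    def __init__(self, limit_s: float = 20 * 60, margin_s: float = 60.0,
                 clock=time.monotonic):
        # `clock` is injectable so the budget can be tested without sleeping.
        self._clock = clock
        self._deadline = clock() + limit_s - margin_s

    def remaining(self) -> float:
        """Seconds left before the (margin-adjusted) deadline, never negative."""
        return max(0.0, self._deadline - self._clock())

    def expired(self) -> bool:
        return self.remaining() == 0.0

def run_inference(tasks, handle, budget: TimeBudget) -> list:
    """Process tasks in order, stopping before a new task once the budget is spent."""
    results = []
    for task in tasks:
        if budget.expired():
            break
        results.append(handle(task))
    return results
```

The check happens only between tasks, so the margin should be at least as long as the slowest single task is expected to take.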