	# Round 2 — Grand Finale Problem Statement

	Date: 25–26 April 2026
	Venue: Scaler School of Technology, Electronic City, Bangalore
	Category: Solo — Akhil Soni

	---

	## The Task

	Choose one (or more) of the themes below and design your own problem statement around it.
	Build an environment, train an agent on it, and show measurable improvement.

	> "Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."

	It is NOT mandatory to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.

	---

	## Themes

	### Theme 1 — Multi-Agent Interactions

	Environments involving cooperation, competition, negotiation, and coalition formation.
	Enables agents to model beliefs and incentives of others in partially observable settings.
	Drives theory-of-mind reasoning and emergent strategic behavior.

	Expected outcome: An environment that can be used to train multi-agent task handling in an LLM.

	Example environments: Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.

	Bonus prizes:
	- Fleet AI — Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
	- Halluminate — Multi-Actor Environments: An agent interacts with and manages multiple actors to discover and achieve a task.

	---

	### Theme 2 — (Super) Long-Horizon Planning & Instruction Following

	Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
	Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
	Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.

	Expected outcome: An environment that captures and improves LLM behavior on challenging long-horizon tasks whose sessions exceed the model's context window.

	Example environments: Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, following 300+ instructions.

	Bonus prizes:
	- Scale AI — Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
	- Mercor — Environment with capped/uncapped rewards where frontier model rewards scale with token output.

	---

	### Theme 3 — World Modeling

	#### 3.1 Professional Tasks

	Environments requiring real interaction with tools, APIs, or dynamic systems.
	The model must do real work instead of exploiting shortcuts.
	Strengthens causal reasoning and persistent world models.

	Expected outcome: An environment capturing nuances of a partially observable world and improving LLM interaction with it.

	Example environments: Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.

	Bonus prizes:
	- Scaler AI Labs — Multi-App RL Environment for Enterprise Workflows: Complex workflows and business rule nuances in a large enterprise.

	#### 3.2 Personalized Tasks

	Environments for real personalized task handling — personal messages, dinner conflicts, tough emails, any personal assistant task.

	Expected outcome: An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and of managing them as delegated work.

	Example environments: Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.

	Bonus prizes:
	- Patronus AI — Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.

	---

	### Theme 4 — Self-Improvement

	Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
	Goal: agents learn to drive their own capability growth (recursive skill amplification).

	Expected outcome: An environment for improving self-play of an LLM over a defined set of tasks.

	Example environments: Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.

	Bonus prizes:
	- Snorkel AI — Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.

	---

	### Theme 5 — Wild Card

	No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.

	---

	## Minimum Requirements (Non-Negotiable)

	Missing any of these puts your submission at a serious disadvantage.

	| Requirement | Details |
	|---|---|
	| OpenEnv (latest release) | Build on top of the framework, don't reinvent the wheel |
	| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
	| Training evidence | Loss and reward plots from a real run |
	| Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
	| HF Space deployment | Environment hosted, discoverable, and runnable |
	| README | Motivates the problem, explains the env, shows results, links all materials |

	---

	## Judging Criteria

	| Criterion | Weight | What It Means |
	|---|---|---|
	| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
	| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
	| Showing Improvement in Rewards | 20% | Observable training progress: reward curves, before/after behavior, baseline comparison |
	| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
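
To see how the weights combine, here is a minimal sketch of a weighted total. The weights come from the table. The 0-10 per-criterion scale, the dictionary keys, and the example scores are assumptions for illustration only; the source does not say how judges score each criterion.

```python
# Weights taken from the judging table, expressed as fractions of the total.
WEIGHTS = {
    "environment_innovation": 0.40,
    "storytelling_presentation": 0.30,
    "showing_improvement": 0.20,
    "reward_training_pipeline": 0.10,
}


def weighted_total(scores: dict) -> float:
    """Combine per-criterion scores into one weighted total.

    Assumes every criterion is scored on the same scale
    (hypothetically 0-10; the real scale is not stated).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, not real results.
example_scores = {
    "environment_innovation": 8,
    "storytelling_presentation": 6,
    "showing_improvement": 7,
    "reward_training_pipeline": 5,
}

# 0.4*8 + 0.3*6 + 0.2*7 + 0.1*5
total = weighted_total(example_scores)
print(f"weighted total: {total:.2f}")
```

Because Environment Innovation carries four times the weight of Reward & Training Pipeline, one extra point on the former moves the total as much as four extra points on the latter.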

	---

	## Pitch Format

	- 3 minutes to pitch
	- 2 minutes Q&A
	- 5 minutes total per team

	Your pitch should answer:
	1. Problem — what capability gap or interesting domain are you targeting?
	2. Environment — what does the agent see, do, and get rewarded for?
	3. Results — what changed after training? Show it.
	4. Why it matters — who would care, and why?

	---

	## What Makes a Submission Stand Out

	Pick an ambitious problem.
	Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?

	Design a reward signal that actually teaches.
	- Rich signal throughout the episode (not just 0/1 at the end)
	- Hard to game — an agent that exploits the reward without solving the task should not score high
	- Use OpenEnv's Rubric system thoughtfully
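
The first two points can be illustrated with a small progress-based reward. This is a sketch on a hypothetical toy task (reduce a distance to zero), written in plain Python; it does not use OpenEnv's Rubric system, whose API is not shown here. `TERMINAL_BONUS`, `shaped_reward`, and `episode_return` are illustrative names.

```python
TERMINAL_BONUS = 10.0  # hypothetical bonus paid only when the task is solved


def shaped_reward(prev_distance: float, new_distance: float, solved: bool) -> float:
    """Dense reward: pay for progress on every step, plus a bonus on success.

    Progress is the reduction in distance to the goal, so moving away
    from the goal is penalized by the same amount that moving closer earns.
    """
    reward = prev_distance - new_distance
    if solved:
        reward += TERMINAL_BONUS
    return reward


def episode_return(distances: list) -> float:
    """Sum shaped rewards along a trajectory of distances to the goal."""
    total = 0.0
    for prev, new in zip(distances, distances[1:]):
        total += shaped_reward(prev, new, solved=(new == 0))
    return total


# An agent that oscillates without solving the task earns nothing overall.
oscillating = episode_return([5, 4, 5, 4, 5])

# An agent that solves the task earns progress on every step plus the bonus.
solving = episode_return([5, 4, 3, 2, 1, 0])

print(oscillating, solving)
```

The signal is available at every step rather than only at the end, and the per-step terms cancel when the agent moves back and forth, so looping behavior cannot accumulate reward without real progress.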

	Show real training end to end.
	- Training loop connects to your environment (not a static dataset)
	- Train long enough that curves mean something
	- Compare trained agent vs random/untrained baseline — quantitative and qualitative
	- Include plots and numbers in your README
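
A baseline comparison can be as simple as running both policies over the same number of seeded episodes and reporting mean return. The sketch below uses a hypothetical toy task (walk to a target on a number line) and a hand-written greedy policy as a stand-in for a trained agent; in a real submission the second policy would be the trained model acting in your environment.

```python
import random
from statistics import mean

TARGET, MAX_STEPS = 5, 10  # hypothetical toy task parameters


def run_episode(policy, rng: random.Random) -> float:
    """Run one episode and return the total progress made toward TARGET."""
    position, total = 0, 0.0
    for _ in range(MAX_STEPS):
        action = policy(position, rng)  # -1 or +1
        new_position = position + action
        # Reward is the reduction in distance to the target on this step.
        total += abs(TARGET - position) - abs(TARGET - new_position)
        position = new_position
        if position == TARGET:
            break
    return total


def random_policy(position: int, rng: random.Random) -> int:
    """Untrained baseline: pick a direction uniformly at random."""
    return rng.choice((-1, 1))


def stand_in_trained_policy(position: int, rng: random.Random) -> int:
    """Stand-in for a trained agent: always move toward the target."""
    return 1 if position < TARGET else -1


def evaluate(policy, episodes: int = 50, seed: int = 0) -> float:
    """Mean return over a fixed number of seeded episodes."""
    rng = random.Random(seed)
    return mean(run_episode(policy, rng) for _ in range(episodes))


baseline_mean = evaluate(random_policy)
trained_mean = evaluate(stand_in_trained_policy)
print(f"baseline={baseline_mean:.2f} trained={trained_mean:.2f}")
```

Fixing the seed and episode count keeps the two numbers comparable across runs, which is what makes the before/after claim checkable.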

	Make plots readable.
	- Label both axes with units
	- Save as `.png` / `.jpg` and commit to repo
	- Embed key plots in README with a one-line caption
	- Put baseline vs trained on the same axes

	---

	## Engineering Checklist

	- [ ] Use `Environment` / `MCPEnvironment` base classes properly
	- [ ] Respect client/server separation (clients never import server internals)
	- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
	- [ ] Valid `openenv.yaml` manifest
	- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
	- [ ] README links to blog, video, or slides
	- [ ] No large video files in HF repo (use URL references)
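
Two of the checklist items can be sketched in plain Python: the `reset` / `step` / `state` shape and the reserved-name check for MCP tools. This is an assumption-laden illustration, not the OpenEnv API: `CounterEnv` deliberately does not subclass `Environment` or `MCPEnvironment`, the observation dictionary layout is invented, and exposing `state` as a property is a guess. Consult the OpenEnv documentation for the real base-class signatures.

```python
from dataclasses import dataclass

# Reserved names listed in the checklist above.
RESERVED_TOOL_NAMES = {"reset", "step", "state", "close"}


def invalid_tool_names(tool_names) -> list:
    """Return the proposed tool names that collide with reserved names."""
    return sorted(set(tool_names) & RESERVED_TOOL_NAMES)


@dataclass
class CounterState:
    step_count: int = 0
    value: int = 0


class CounterEnv:
    """Toy environment with a reset/step/state shape (not an OpenEnv subclass)."""

    def __init__(self, target: int = 3):
        self.target = target
        self._state = CounterState()

    def _observation(self, reward: float, done: bool) -> dict:
        # Hypothetical observation layout chosen for this sketch.
        return {"value": self._state.value, "reward": reward, "done": done}

    def reset(self) -> dict:
        """Start a new episode and return the initial observation."""
        self._state = CounterState()
        return self._observation(reward=0.0, done=False)

    def step(self, action: int) -> dict:
        """Apply an action and return the next observation."""
        self._state.step_count += 1
        self._state.value += action
        done = self._state.value >= self.target
        return self._observation(reward=1.0 if done else 0.0, done=done)

    @property
    def state(self) -> CounterState:
        """Episode bookkeeping, separate from what the agent observes."""
        return self._state


env = CounterEnv(target=2)
env.reset()
first = env.step(1)
second = env.step(1)
print(first, second, env.state)
print(invalid_tool_names(["search", "reset", "close"]))
```

Running the name check before registering tools catches collisions early instead of at deployment time.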

	---

	## Before You Arrive in Bangalore

	Post-training happens on-site with provided compute credits.
	Use the time before April 25 to:

	- [ ] Finalize your problem statement
	- [ ] Build and deploy your environment to HF Space
	- [ ] Write your training script (ready to run, not necessarily fully executed)
	- [ ] Prepare your 3-minute pitch story

	---

	## Infrastructure Constraints (same as Round 1)

	- Inference script runtime: under 20 minutes
	- Hardware: 2 vCPUs, 8 GB memory
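
One way to stay inside the runtime limit is to check elapsed time before each unit of work and stop early. The sketch below assumes the inference script can be split into independent callables; `run_with_budget`, the safety margin, and the injectable `clock` parameter are illustrative choices, not part of any required interface.

```python
import time

RUNTIME_LIMIT_SECONDS = 20 * 60  # the 20-minute limit stated above
SAFETY_MARGIN_SECONDS = 60       # hypothetical buffer for startup and teardown


def run_with_budget(
    tasks,
    limit: float = RUNTIME_LIMIT_SECONDS,
    margin: float = SAFETY_MARGIN_SECONDS,
    clock=time.monotonic,
) -> list:
    """Run tasks in order, stopping before the time budget is exhausted.

    `tasks` is an iterable of zero-argument callables. The budget is only
    checked between tasks, so each task should be short relative to the limit.
    """
    start = clock()
    results = []
    for task in tasks:
        if clock() - start >= limit - margin:
            break  # out of budget: return what has been completed so far
        results.append(task())
    return results


# Usage with trivially fast tasks; all of them finish well within the budget.
outputs = run_with_budget([lambda: "episode-1", lambda: "episode-2"])
print(outputs)
```

Using `time.monotonic` rather than wall-clock time avoids surprises if the system clock is adjusted during the run.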