InosLihka Claude Sonnet 4.6 committed on
Commit
9bfe470
·
1 Parent(s): 69310d6

Reorganize docs: segregate Round 1 and Round 2


- Move Round 1 files (problem statement, SPEC, inference copy, scripts) into docs/round1/
- Add Round 2 docs: problem statement, confirmation, design notes
- Root inference.py unchanged (hackathon requirement)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs/Hackathon Themes.md ADDED
@@ -0,0 +1,180 @@
+ Theme #1 - Multi-Agent Interactions
+ Environments for this theme involve cooperation, competition, negotiation, and coalition formation. Learning from these environments will enable agents to model the beliefs and incentives of others in partially observable settings. This drives theory-of-mind reasoning and emergent strategic behavior.
+ Expected Outcome: an environment that can be used to train multi-agent task handling in an LLM.
+ Example environments: Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
+ Sub-themes with bonus prizes.
+ Fleet AI. Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents operating in complex, multi-agent settings.
+ Halluminate. Multi-Actor Environments: Build a realistic environment where an agent interacts with and manages multiple actors (agents) to discover and achieve the task.
+ Theme #2 - (Super) Long-Horizon Planning & Instruction Following
+ You will build environments that require deep, multi-step reasoning with sparse or delayed rewards. Learning from these environments should enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes. The aim is to push beyond shallow next-token reasoning toward structured planning and durable internal representations.
+ Expected Outcome: an environment that can capture and improve LLM behaviour on challenging long-horizon tasks that need long-running sessions beyond context-memory limits.
+ Example environments: Research-planning simulators, large-scale codebase refactoring tasks, strategic resource management worlds, long-horizon logistics optimization, extremely complicated long-horizon instruction following (e.g., 300 instructions scattered around).
+ Sub-themes with bonus prizes.
+ Scale AI. Environments for long-horizon workflows for non-code use cases within a business setting: focusing on either Sales, Project Management, or HR & IT.
+ Mercor. Make an environment with capped/uncapped rewards where frontier model rewards scale with token output.
+ Theme #3 - World Modeling
+ #3.1 Professional Tasks
+ Here you will develop environments that require real interaction with tools, APIs, or dynamic systems, where the model is expected to do real, hard work instead of exploiting shortcuts to arrive at the desired outcome. Learning from these environments will enable agents to maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.
+ Expected Outcome: an environment capturing the nuances of a defined partially observable world and improving LLM interaction with it.
+ Example environments: Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations with feedback, tool-discovery benchmarks.
+ Sub-themes with bonus prizes.
+ Scaler AI Labs. Multi-App RL Environment for Enterprise Workflows: Create RL environments that demonstrate complex workflows, business-rule nuances, etc., in a large enterprise.
+
+ #3.2 Personalized Tasks
+ Here you will develop an environment that offers real personalized task handling: replying to personal messages, resolving dinner plans that clash with work, answering tough emails. Think of any personal-assistant task.
+
+ Expected Outcome: An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and managing them as delegations.
+
+ Example environments: Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping, etc.
+
+ Sub-themes with bonus prizes.
+ Patronus AI. Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where the underlying data schemas, API contracts, and T&Cs/policies/rules change.
+
+ Theme #4 - Self-Improvement
+ The focus here is to create environments where agents can learn to generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own capability growth. The objective is recursive skill amplification.
+ Expected Outcome: an environment for improving self-play of an LLM over a defined set of tasks.
+ Example environments: Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
+ Sub-themes with bonus prizes.
+ Snorkel AI. Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements/preferences.
+
+ Theme #5: Wild Card - Impress Us!
+ We do not want to limit your focus. If your idea doesn’t fit the boxes above, we want and WILL reward out-of-the-box tasks. Be creative, but make sure your submission meaningfully adds value to LLM training on a specific task.
+
+ Guidelines for Problem Statement
+ It is NOT mandatory to choose the same problem statement as Round 1. Only choose the same problem statement if it aligns with the Hackathon themes provided above.
+ You can start working on your problem statement once you have finalized it. Post-training can be done onsite on the 25th & 26th, when you receive compute credits for Hugging Face.
+ Before the onsite, we suggest you work on building the environment, agent behaviours, and reward model, and evaluate whether your work aligns with the judging criteria given below.
+
+ Judging Criteria
+ Minimum requirements:
+ Usage of OpenEnv (latest release)
+ Show a minimal training script for your environment using Unsloth or HF TRL in Colab
+ Write a mini-blog on HuggingFace or a mini-video on YouTube talking about your submission, <2 minutes
+ Your OpenEnv-compliant environment should be hosted on Hugging Face Spaces.
+
+ First Round Judging Overview
+ Pitch Format: Each team has 3 minutes to pitch, followed by 2 minutes for Q&A (5 minutes total).
+ Evaluation: Teams will be scored based on the following criteria:
+ Environment Innovation (40%): Is the environment novel, creative, or challenging? Does it meaningfully test the agent’s behavior?
+ Storytelling (30%): Does the team clearly explain the problem, environment, and agent behavior? Is the demo engaging and easy to follow?
+ Showing Improvement in Rewards (20%): Does the demo provide observable evidence of training progress (reward curves, metrics, or before/after behavior)?
+ Reward and Training Script/Pipeline Setup (10%): Is the reward logic coherent, and does the pipeline produce meaningful improvement in the agent’s inference (how it acts in the environment)?
+
+ OpenEnv Hackathon - What Judges Look For
+
+ This guide tells you what makes a strong submission for the OpenEnv Hackathon (India 2026).
+ Read it before you start building, and again before you submit.
+
+ For the list of themes and example problems, refer to the top sections.
+
+ NOTE: Please remember, only one submission per team is allowed. If you have multiple ideas, pick the best one and go for it. Make sure the URL of your environment is submitted, as judges will pull the environment from that URL to evaluate it. Changes or commits after the submission deadline will not be considered.
+
+ TL;DR
+
+ Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story.
+
+ A messy but ambitious environment with real training evidence beats a polished but boring one. Pick a problem that excites you (that energy comes through in the pitch).
+
+ Judging Criteria
+
+ Criterion: Environment Innovation
+ Weight: 40%
+ What it means:
+ Is the environment novel, creative, or genuinely challenging?
+ Does it meaningfully test agent behavior in a way that hasn't been done before?
+
+ Criterion: Storytelling & Presentation
+ Weight: 30%
+ What it means:
+ Can you clearly explain the problem, the environment, and what the agent learned?
+ Is the demo engaging and easy to follow for a non-technical audience?
+
+ Criterion: Showing Improvement in Rewards
+ Weight: 20%
+ What it means:
+ Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline -- anything that proves the agent learned something.
+
+ Criterion: Reward & Training Pipeline
+ Weight: 10%
+ What it means:
+ Is the reward logic coherent? Does the pipeline produce meaningful improvement in the trained agent's behavior?
+
+ Minimum Submission Requirements
+
+ NOTE: These are non-negotiable. Submissions missing any of these are at a serious disadvantage.
+ Use OpenEnv (latest release). Build on top of the framework; don’t reinvent the wheel.
+ A working training script using Unsloth or Hugging Face TRL, ideally as a Colab notebook so judges can re-run it.
+ Evidence that you actually trained; at minimum, loss and reward plots from a real run.
+ A short writeup: a mini-blog on Hugging Face, a <2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck. Make sure that all materials are linked from your README file so that judges can access them easily.
+ Push your environment to a Hugging Face Space so it’s discoverable and runnable.
+ A README that motivates the problem, explains how the env works, and shows results.
+ The README should link to the environment in the Hugging Face Space. It should also reference all additional materials (e.g. videos, blog posts, slides, presentations) that you want to include.
+ Please do not include big video files in your env submission on HF Hub; we would like each env to stay small (use URLs to link additional materials).
+
+ What Makes a Submission Stand Out
+
+ Pick an ambitious, original problem
+ The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation, you need a genuinely fresh angle. Some questions to ask yourself:
+ Does this environment exist to teach an LLM something it currently can’t do well?
+ Is the domain underexplored in RL/LLM training?
+ Could a researcher write a paper about training on this?
+
+ Design a reward signal that actually teaches
+ A great environment has a reward function that:
+ Provides a rich, informative signal (not just 0/1 at the end)
+ Captures something hard to measure in a clever way
+ Uses OpenEnv’s Rubric system thoughtfully (composable rubrics > monolithic scoring; see the sketch after this list)
+ Is hard to game; an agent that exploits the reward without solving the task should not get high scores
+
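+ A minimal sketch of the composable idea, in plain Python rather than OpenEnv's actual Rubric API (the criteria names and weights are hypothetical):
+
+ ```python
+ # Several small, named scoring criteria that sum to an episode score in [0, 1],
+ # instead of one opaque 0/1 at the end.
+ def progress_score(state) -> float:
+     return 0.6 * state["tasks_completed_fraction"]   # dense, informative signal
+
+ def anti_gaming_penalty(state) -> float:
+     return -0.1 * state["unnecessary_switches"]      # punish reward exploitation
+
+ def episode_score(state) -> float:
+     raw = progress_score(state) + anti_gaming_penalty(state)
+     return max(0.0, min(1.0, raw))                   # clamp to [0, 1]
+ ```
+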
+ Show real training, end to end
+ The bar isn’t “training script exists.” The bar is “training script runs against the environment, the agent learns, and you can show it.” Concretely:
+ Your training loop should connect to your environment (not a static dataset)
+ Train long enough that the curves mean something
+ Compare a trained agent vs. a random/untrained baseline; quantitative and/or qualitative
+ Include the plots and numbers in your README and writeup
+
+ Make your plots readable
+ Reviewers spend seconds, not minutes, on each plot. Help them out (a plotting sketch follows this list):
+ Label both axes (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
+ Save plots as .png or .jpg and commit them to the repo; don’t leave them only in a Colab cell or a deleted WandB run (if you ran via WandB, please include the link to that specific run of your plots)
+ Embed the key plots in your README with a one-line caption explaining what each one shows
+ If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious
+
+ Tell a story, not an API doc
+ Your README, blog, and pitch should answer:
+ Problem) what capability gap or interesting domain are you targeting?
+ Environment) what does the agent see, do, and get rewarded for?
+ Results) what changed after training? Show it.
+ Why it matters) who would care, and why?
+
+ A reviewer should be able to read your README in 3–5 minutes and want to try your environment.
+
+ NOTE: If you have a video, HF post, or anything else interesting, please make sure that it’s linked from your README.
+
+ Engineer it cleanly (table stakes)
+ Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
+ Use OpenEnv’s Environment / MCPEnvironment base classes properly
+ Respect the client / server separation (clients should never import server internals)
+ Follow the standard Gym-style API (reset, step, state); see the sketch after this list
+ Have a valid openenv.yaml manifest
+ Don’t use reserved tool names (reset, step, state, close) for MCP tools
+
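+ A structural sketch of that Gym-style shape. This is illustrative only; the exact OpenEnv base-class signatures may differ, so treat it as the contract the checklist refers to, not the framework's literal API:
+
+ ```python
+ class MyEnv:
+     def reset(self):
+         """Start a new episode and return the initial observation."""
+         self._t = 0
+         return {"timestep": self._t}
+
+     def step(self, action):
+         """Apply one action; return (observation, reward, done, info)."""
+         self._t += 1
+         done = self._t >= 20
+         return {"timestep": self._t}, 0.0, done, {}
+
+     def state(self):
+         """Expose the current episode state for clients and debugging."""
+         return {"timestep": self._t}
+ ```
+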
+ Final Note
+
+ Judges are looking for environments that push the frontier of what we can train LLMs to do. Be ambitious. Pick a problem you find genuinely interesting; that almost always produces better work than chasing what you think judges want. Good luck.
+
SPEC.md → docs/round1/SPEC.md RENAMED
File without changes
docs/round1/inference.py ADDED
@@ -0,0 +1,304 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ RhythmEnv Inference Script
+ ===================================
+ MANDATORY
+ - Before submitting, ensure the following variables are defined in your environment configuration:
+     API_BASE_URL      The API endpoint for the LLM.
+     MODEL_NAME        The model identifier to use for inference.
+     HF_TOKEN          Your Hugging Face / API key.
+     LOCAL_IMAGE_NAME  The name of the local image to use for the environment if you are using from_docker_image().
+
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
+   (and should reflect your active inference setup):
+     API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
+     MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
+
+ - The inference script must be named `inference.py` and placed in the root directory of the project.
+ - Participants must use the OpenAI client for all LLM calls, using the above variables.
+
+ STDOUT FORMAT
+ - The script must emit exactly three line types to stdout, in this order:
+
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+
+ Rules:
+ - One [START] line at episode start.
+ - One [STEP] line per step, immediately after env.step() returns.
+ - One [END] line after env.close(), always emitted (even on exception).
+ - reward and rewards are formatted to 2 decimal places.
+ - done and success are lowercase booleans: true or false.
+ - error is the raw last_action_error string, or null if none.
+ - All fields on a single line, with no newlines within a line.
+ - Each task should return a score in [0, 1].
+ """
+
43
+ import asyncio
44
+ import os
45
+ import sys
46
+ import textwrap
47
+ from typing import List, Optional
48
+
49
+ from openai import OpenAI
50
+
51
+ # Add current directory to path for local imports
52
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
53
+
54
+ from client import RhythmEnv
55
+ from models import ActionType, RhythmAction
56
+
57
+ # ---------------------------------------------------------------------------
58
+ # Configuration
59
+ # ---------------------------------------------------------------------------
60
+
61
+ IMAGE_NAME = os.getenv("IMAGE_NAME")
62
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
63
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
64
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
65
+ BASE_URL = os.getenv("RHYTHM_ENV_URL", "https://InosLihka-rhythm-env.hf.space")
66
+ BENCHMARK = "rhythm_env"
67
+ TASKS = ["easy", "medium", "hard"]
68
+ MAX_STEPS = 20
69
+ SCORE_THRESHOLD = 0.1
70
+
71
+ SYSTEM_PROMPT = textwrap.dedent("""\
72
+ You are a daily planning agent. You manage tasks across a workday.
73
+ Each step is a 30-minute slot. You have energy (0-1) and stress (0-1).
74
+
75
+ Available actions (respond with EXACTLY one line in this format):
76
+ START_TASK <task_id>
77
+ CONTINUE_TASK
78
+ SWITCH_TASK <task_id>
79
+ TAKE_BREAK
80
+
81
+ Rules:
82
+ - START_TASK/SWITCH_TASK require a task_id (integer).
83
+ - CONTINUE_TASK continues your current task.
84
+ - TAKE_BREAK recovers energy and reduces stress.
85
+ - Take breaks when energy < 0.3.
86
+ - Prioritize tasks by deadline urgency, then importance.
87
+ - Avoid unnecessary switching (costs energy and reward).
88
+
89
+ Respond with ONLY the action line, nothing else.""")
90
+
91
+
92
+ # ---------------------------------------------------------------------------
93
+ # Logging helpers
94
+ # ---------------------------------------------------------------------------
95
+
96
+ def log_start(task: str, env: str, model: str) -> None:
97
+ print(f"[START] task={task} env={env} model={model}", flush=True)
98
+
99
+
100
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
101
+ error_val = error if error else "null"
102
+ done_val = str(done).lower()
103
+ print(
104
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
105
+ flush=True,
106
+ )
107
+
108
+
109
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
110
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
111
+ print(
112
+ f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
113
+ flush=True,
114
+ )
115
+
116
+
117
+ # ---------------------------------------------------------------------------
+ # Heuristic action selection (enhanced by LLM)
+ # ---------------------------------------------------------------------------
+
+ def choose_action_heuristic(obs) -> RhythmAction:
+     """Greedy heuristic: prioritize by deadline then importance."""
+     energy = obs.energy
+     current_task_id = obs.current_task_id
+     tasks = obs.tasks
+     timestep = obs.timestep
+     meetings = obs.meetings
+
+     # During meeting slots, just take a break
+     if timestep in meetings:
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     # Take break if energy is low
+     if energy < 0.3:
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     # Get uncompleted tasks
+     uncompleted = [t for t in tasks if t.progress < t.effort]
+     if not uncompleted:
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     # Sort by deadline (ascending), then importance (descending)
+     uncompleted.sort(key=lambda t: (t.deadline, -t.importance))
+
+     # Check for urgent tasks (deadline within 3 steps)
+     urgent = [t for t in uncompleted if t.deadline - timestep <= 3]
+     best = urgent[0] if urgent else uncompleted[0]
+
+     if current_task_id is not None and current_task_id == best.id:
+         return RhythmAction(action_type=ActionType.CONTINUE_TASK)
+     elif current_task_id is not None:
+         return RhythmAction(action_type=ActionType.SWITCH_TASK, task_id=best.id)
+     else:
+         return RhythmAction(action_type=ActionType.START_TASK, task_id=best.id)
+
+
+ def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
+     """Use LLM to pick an action, fall back to heuristic on failure."""
+     tasks_desc = "\n".join(
+         f"  Task {t.id}: {t.name} — {t.description}\n"
+         f"    (effort={t.effort:.2f}, progress={t.progress:.2f}, "
+         f"deadline=step {t.deadline}, importance={t.importance})"
+         for t in obs.tasks
+     )
+     user_prompt = textwrap.dedent(f"""\
+ Step: {obs.timestep}/{MAX_STEPS}
+ Energy: {obs.energy:.2f}
+ Stress: {obs.stress:.2f}
+ Current task: {obs.current_task_id}
+ Meetings at steps: {obs.meetings}
+ Remaining steps: {obs.remaining_steps}
+
+ Tasks:
+ {tasks_desc}
+
+ Choose your action:""")
+
+     try:
+         completion = llm_client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=0.3,
+             max_tokens=30,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+         return parse_llm_action(text, obs)
+     except Exception:
+         return choose_action_heuristic(obs)
+
+
+ def parse_llm_action(text: str, obs) -> RhythmAction:
+     """Parse LLM response text into a RhythmAction."""
+     text = text.strip().upper()
+
+     if text.startswith("TAKE_BREAK"):
+         return RhythmAction(action_type=ActionType.TAKE_BREAK)
+
+     if text.startswith("CONTINUE_TASK"):
+         if obs.current_task_id is not None:
+             return RhythmAction(action_type=ActionType.CONTINUE_TASK)
+         return choose_action_heuristic(obs)
+
+     for prefix, action_type in [
+         ("START_TASK", ActionType.START_TASK),
+         ("SWITCH_TASK", ActionType.SWITCH_TASK),
+     ]:
+         if text.startswith(prefix):
+             rest = text[len(prefix):].strip()
+             try:
+                 task_id = int(rest)
+                 if 0 <= task_id < len(obs.tasks):
+                     return RhythmAction(action_type=action_type, task_id=task_id)
+             except ValueError:
+                 pass
+
+     # Fallback
+     return choose_action_heuristic(obs)
+
+
+ # ---------------------------------------------------------------------------
+ # Main loop
+ # ---------------------------------------------------------------------------
+
+ async def run_task(task_name: str, llm_client: OpenAI) -> float:
+     """Run a single task and return the score."""
+     env = None
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+
+     log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         # Create the env inside the try block so [END] is still emitted
+         # even if construction fails.
+         if IMAGE_NAME:
+             env = await RhythmEnv.from_docker_image(IMAGE_NAME)
+         else:
+             env = RhythmEnv(base_url=BASE_URL)
+
+         async with env:
+             result = await env.reset(task=task_name)
+
+             for step in range(1, MAX_STEPS + 1):
+                 if result.done:
+                     break
+
+                 # Use LLM if available, otherwise heuristic
+                 if llm_client is not None:
+                     action = choose_action_llm(result.observation, llm_client)
+                 else:
+                     action = choose_action_heuristic(result.observation)
+
+                 action_str = action.action_type.value
+                 if action.task_id is not None:
+                     action_str += f"({action.task_id})"
+
+                 result = await env.step(action)
+
+                 reward = result.reward or 0.0
+                 done = result.done
+                 rewards.append(reward)
+                 steps_taken = step
+
+                 log_step(step=step, action=action_str, reward=reward, done=done, error=None)
+
+                 if done:
+                     break
+
+             # Get final score from grader
+             score = result.observation.reward_breakdown.get("final_score", 0.0)
+             score = max(0.0, min(1.0, score))
+             success = score >= SCORE_THRESHOLD
+
+     except Exception as e:
+         print(f"[DEBUG] Error running task {task_name}: {e}", flush=True)
+     finally:
+         if env is not None:
+             try:
+                 await env.close()
+             except Exception as e:
+                 print(f"[DEBUG] env.close() error: {e}", flush=True)
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return score
+
+
+ async def main() -> None:
+     llm_client = None
+     if API_KEY:
+         llm_client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     scores = []
+     for task_name in TASKS:
+         s = await run_task(task_name, llm_client)
+         scores.append(s)
+
+     avg = sum(scores) / len(scores) if scores else 0.0
+     print(f"\n[SUMMARY] avg_score={avg:.3f} scores={','.join(f'{s:.3f}' for s in scores)}", flush=True)
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
docs/{problem_statement.md → round1/problem_statement.md} RENAMED
File without changes
docs/{scripts → round1/scripts}/sample_inference.py RENAMED
File without changes
docs/{scripts → round1/scripts}/validate-submission.sh RENAMED
File without changes
docs/round2_confirmation.md ADDED
@@ -0,0 +1,60 @@
+ # Round 2 — Entry Confirmation
+
+ **Event:** Meta PyTorch OpenEnv Hackathon × Scaler School of Technology — Grand Finale
+ **Date:** 25–26 April 2026
+ **Venue:** Scaler School of Technology, Electronic City, Bangalore
+ **Category:** Solo
+ **Name:** Akhil Soni
+ **Email:** akhilsoni0102@gmail.com
+
+ > This document serves as your official team ticket to the finale. Present it at entry.
+
+ ---
+
+ ## Before You Arrive
+
+ - [ ] Join the private Discord (MANDATORY) — all major updates announced here first
+ - [ ] Check the travel guide — venue details, directions, nearby stay options
+
+ ---
+
+ ## Entry Requirements
+
+ ⚠️ You must present this document at entry. Entry will be denied if details don't match registration.
+
+ Bring:
+ - Valid government-issued ID
+ - College/company ID used during registration
+
+ ---
+
+ ## Round 2 Themes (Summary)
+
+ Your task is to choose one or more themes and design your own problem statement around it.
+
+ - **Multi-Agent Interactions** — cooperation, competition, negotiation
+ - **Long-Horizon Planning & Instruction Following** — multi-step reasoning, sparse rewards
+ - **World Modeling** — professional tasks (tools/APIs) or personalized tasks
+ - **Self-Improvement** — self-play, adaptive curricula, recursive skill amplification
+ - **Wild Card** — anything original and meaningful
+
+ As part of your submission, clearly define:
+
+ | What | Description |
+ |---|---|
+ | Problem statement | What capability gap are you solving? |
+ | Environment | Where the agent operates |
+ | Agent capabilities | What the agent can see and do |
+ | Tasks | What the agent must accomplish |
+ | Reward model / evaluation logic | How success is measured |
+ | Post-training / self-improvement strategy | How the agent improves |
+
+ > Full themes and judging criteria: see [round2_problem_statement.md](round2_problem_statement.md)
+
+ ---
+
+ ## Key Notes
+
+ - You can begin working on your problem statement immediately
+ - Post-training happens **on-site** with provided compute credits
+ - Use time before April 25 to build the environment, agent behaviours, and reward model
docs/round2_design_notes.md ADDED
@@ -0,0 +1,147 @@
+ # Round 2 — Design Notes
+
+ ## The Core Idea
+
+ Design the environment so that **reward is controlled by hidden variables the agent cannot see in state.**
+ The agent must discover them through trial and error across many episodes.
+
+ This is what creates genuine learning signal — the agent starts confused, gets inconsistent rewards,
+ and gradually figures out the hidden rules that explain its performance.
+
+ ---
+
+ ## The Principle (Generic)
+
+ ```
+ What agent sees in state:  observable facts (energy, tasks, deadlines, timestep)
+ What agent does NOT see:   hidden variables that secretly control reward
+
+ The agent's job: figure out the hidden variables through experience
+ ```
+
+ Each hidden variable should satisfy three rules:
+
+ 1. **Discoverable through reward signal alone** — the agent can figure it out without being told
+ 2. **Changes strategy significantly once discovered** — it's not a minor tweak, it rewires planning
+ 3. **Not guessable from common sense alone** — a pretrained LLM should not figure it out on the first try
+
+ ---
+
+ ## Reference Example — EchoEnv
+
+ EchoEnv is the simplest possible OpenEnv environment. The agent sends a string, the env echoes it back.
+
+ ```
+ Hidden variable:  string length
+ Agent observes:   it sent "hi"          → reward 0.2
+                   it sent "hello"       → reward 0.5
+                   it sent "hello world" → reward 0.8
+ Agent discovers:  longer string = higher reward
+ Strategy change:  always send the longest possible string
+ ```
+
+ The agent never sees a "length bonus" field in the observation. It just notices the correlation
+ between string length and reward over many episodes, and learns to exploit it.
+
+ ---
+
+ ## Applying This to RhythmEnv
+
+ RhythmEnv simulates a workday. One episode = one day. 20 steps = 20 half-hour slots.
+
+ The agent sees: energy, stress, tasks, deadlines, current_task_id, timestep.
+ The agent does NOT see: hidden variables that secretly influence how reward is calculated.
+
+ Below are candidate hidden variables, each of which the agent must discover through experience.
+
+ ---
+
+ ### Hidden Variable 1 — Task Sequencing Dependency
+
+ Some tasks give a secret bonus when done in a specific order.
+
+ ```
+ Deep work → then email → bonus multiplier applied to email reward
+ Email → then deep work → no bonus
+ ```
+
+ **What agent sees:** just the reward it earned on the email task.
+ **What agent discovers:** doing deep work first makes email more rewarding.
+ **Real world basis:** mental priming — focused work puts you in a state where communication is sharper.
+ **Strategy change:** agent learns to always front-load deep work, then switch to communication tasks.
+
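+ A sketch of this rule, assuming a hypothetical server-side `completed_categories` history that is never exposed in the observation (the 1.5x value is illustrative):
+
+ ```python
+ # Hidden sequencing bonus, applied inside the reward function only.
+ def email_reward(base: float, completed_categories: list) -> float:
+     if "deep_work" in completed_categories:  # deep work earlier primes email
+         return base * 1.5
+     return base
+ ```
+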
+ ---
+
+ ### Hidden Variable 2 — Energy Threshold Cliff
+
+ Progress rate does not degrade smoothly with energy. It drops sharply at a hidden threshold.
+
+ ```
+ energy > 0.5 → normal progress rate
+ energy < 0.5 → progress drops 60% suddenly
+ ```
+
+ **What agent sees:** energy level (e.g. 0.48) — looks like any other low energy value.
+ **What agent discovers:** something bad happens around 0.5, rewards drop sharply below it.
+ **Real world basis:** cognitive performance has cliff effects, not smooth degradation.
+ **Strategy change:** agent learns to take a break before hitting 0.5, not after.
+
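+ A sketch of the cliff, using the example threshold and drop above:
+
+ ```python
+ # Progress rate is a hidden step function of energy; the agent only
+ # ever observes the resulting progress, never this rule.
+ def progress_rate(energy: float) -> float:
+     if energy > 0.5:
+         return 1.0   # normal progress
+     return 0.4       # sudden 60% drop below the hidden threshold
+ ```
+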
+ ---
+
+ ### Hidden Variable 3 — Task Interference
+
+ Certain task combinations secretly multiply stress while both are active.
+
+ ```
+ working on Task A while Task B is overdue → stress multiplier 1.5x
+ agent does not see this multiplier in state
+ ```
+
+ **What agent sees:** stress rising faster than expected in certain situations.
+ **What agent discovers:** having overdue tasks in the background makes everything worse.
+ **Real world basis:** unfinished obligations create background cognitive load.
+ **Strategy change:** agent learns to clear small overdue tasks early, even at the cost of efficiency.
+
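+ A sketch, assuming tasks expose `deadline`, `progress`, and `effort` fields as in the Round 1 observation model:
+
+ ```python
+ # Hidden stress multiplier: active whenever any task is overdue.
+ def stress_delta(base_delta: float, tasks, timestep: int) -> float:
+     overdue = any(t.deadline < timestep and t.progress < t.effort for t in tasks)
+     return base_delta * (1.5 if overdue else 1.0)  # never shown in state
+ ```
+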
+ ---
+
+ ### Hidden Variable 4 — Recovery Curve Shape
+
+ Consecutive breaks compound in recovery — the second break recovers more than the first.
+
+ ```
+ 1 break alone → +0.12 energy
+ 2 breaks back to back → +0.12 + 0.18 energy (hidden compounding)
+ ```
+
+ **What agent sees:** it took a break, energy went up by 0.12. Looks linear.
+ **What agent discovers:** two breaks in a row sometimes outperforms one break + one work step.
+ **Real world basis:** rest has diminishing costs but increasing returns at low energy states.
+ **Strategy change:** agent learns that when very depleted, double breaks are worth it.
+
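+ A sketch matching the numbers above, with a hypothetical `consecutive_breaks` counter tracked server-side:
+
+ ```python
+ # 1st break recovers +0.12 energy; each consecutive break adds a hidden +0.06.
+ def break_recovery(consecutive_breaks: int) -> float:
+     return 0.12 + 0.06 * max(0, consecutive_breaks - 1)
+ ```
+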
+ ---
+
+ ### Hidden Variable 5 — Time-of-Day Sensitivity
+
+ Certain task types score better at certain timesteps via a hidden importance multiplier.
+
+ ```
+ creative/deep tasks at timestep 0–5 → importance multiplier 1.3x
+ creative/deep tasks at timestep 15+ → importance multiplier 0.7x
+ ```
+
+ **What agent sees:** same task, same effort, but different reward depending on when it was done.
+ **What agent discovers:** doing creative tasks early in the day is disproportionately rewarded.
+ **Real world basis:** cognitive peak hours — most people do their best focused work in the morning.
+ **Strategy change:** agent learns to protect early timesteps for high-effort tasks regardless of deadline pressure.
+
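+ A sketch using the cutoffs and factors above (task-type names are hypothetical):
+
+ ```python
+ # Hidden time-of-day multiplier applied to task importance in the reward.
+ def importance_multiplier(task_type: str, timestep: int) -> float:
+     if task_type in ("creative", "deep"):
+         if timestep <= 5:
+             return 1.3   # early-day bonus
+         if timestep >= 15:
+             return 0.7   # late-day penalty
+     return 1.0
+ ```
+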
+ ---
+
+ ## Open Questions
+
+ These are not decided yet. To be answered before building:
+
+ - Which of the 5 variables above do we actually implement?
+ - Do we implement all 5 or pick 2-3 strong ones?
+ - How do we make sure the hidden variables are discoverable but not too easy?
+ - Does the person profile (energy curve, task types) change across episodes or stay fixed?
+ - What does the training story look like — how many episodes before the agent figures it out?
docs/round2_problem_statement.md ADDED
@@ -0,0 +1,193 @@
+ # Round 2 — Grand Finale Problem Statement
+
+ **Date:** 25–26 April 2026
+ **Venue:** Scaler School of Technology, Electronic City, Bangalore
+ **Category:** Solo — Akhil Soni
+
+ ---
+
+ ## The Task
+
+ Choose one (or more) of the themes below and design your own problem statement around it.
+ Build an environment, train an agent on it, and show measurable improvement.
+
+ > *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*
+
+ It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.
+
+ ---
+
+ ## Themes
+
+ ### Theme 1 — Multi-Agent Interactions
+
+ Environments involving cooperation, competition, negotiation, and coalition formation.
+ Enables agents to model beliefs and incentives of others in partially observable settings.
+ Drives theory-of-mind reasoning and emergent strategic behavior.
+
+ **Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.
+
+ **Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
+
+ **Bonus prizes:**
+ - **Fleet AI** — Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
+ - **Halluminate** — Multi-Actor Environments: An agent interacts with and manages multiple actors to discover and achieve a task.
+
+ ---
+
+ ### Theme 2 — (Super) Long-Horizon Planning & Instruction Following
+
+ Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
+ Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
+ Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.
+
+ **Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks that need sessions beyond context memory limits.
+
+ **Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, 300+ instruction following.
+
+ **Bonus prizes:**
+ - **Scale AI** — Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
+ - **Mercor** — Environment with capped/uncapped rewards where frontier model rewards scale with token output.
+
+ ---
+
+ ### Theme 3 — World Modeling
+
+ #### 3.1 Professional Tasks
+
+ Environments requiring real interaction with tools, APIs, or dynamic systems.
+ The model must do real work instead of exploiting shortcuts.
+ Strengthens causal reasoning and persistent world models.
+
+ **Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.
+
+ **Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.
+
+ **Bonus prizes:**
+ - **Scaler AI Labs** — Multi-App RL Environment for Enterprise Workflows: Complex workflows and business rule nuances in a large enterprise.
+
+ #### 3.2 Personalized Tasks
+
+ Environments for real personalized task handling — personal messages, dinner conflicts, tough emails, any personal assistant task.
+
+ **Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks, conflicts, and managing them as delegations.
+
+ **Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.
+
+ **Bonus prizes:**
+ - **Patronus AI** — Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.
+
+ ---
+
+ ### Theme 4 — Self-Improvement
+
+ Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
+ Goal: agents learn to drive their own capability growth (recursive skill amplification).
+
+ **Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.
+
+ **Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
+
+ **Bonus prizes:**
+ - **Snorkel AI** — Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.
+
+ ---
+
+ ### Theme 5 — Wild Card
+
+ No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.
+
+ ---
+
+ ## Minimum Requirements (Non-Negotiable)
+
+ Missing any of these puts your submission at a serious disadvantage.
+
+ | Requirement | Details |
+ |---|---|
+ | OpenEnv (latest release) | Build on top of the framework, don't reinvent the wheel |
+ | Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
+ | Training evidence | Loss and reward plots from a real run |
+ | Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
+ | HF Space deployment | Environment hosted, discoverable, and runnable |
+ | README | Motivates the problem, explains the env, shows results, links all materials |
+
+ ---
+
+ ## Judging Criteria
+
+ | Criterion | Weight | What It Means |
+ |---|---|---|
+ | Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
+ | Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
+ | Showing Improvement in Rewards | 20% | Observable training progress — reward curves, before/after behavior, baseline comparison |
+ | Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
+
+ ---
+
+ ## Pitch Format
+
+ - **3 minutes** to pitch
+ - **2 minutes** Q&A
+ - 5 minutes total per team
+
+ Your pitch should answer:
+ 1. **Problem** — what capability gap or interesting domain are you targeting?
+ 2. **Environment** — what does the agent see, do, and get rewarded for?
+ 3. **Results** — what changed after training? Show it.
+ 4. **Why it matters** — who would care, and why?
+
+ ---
+
+ ## What Makes a Submission Stand Out
+
+ **Pick an ambitious problem.**
+ Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?
+
+ **Design a reward signal that actually teaches.**
+ - Rich signal throughout the episode (not just 0/1 at the end)
+ - Hard to game — an agent that exploits the reward without solving the task should not score high
+ - Use OpenEnv's Rubric system thoughtfully
+
+ **Show real training end to end.**
+ - Training loop connects to your environment (not a static dataset)
+ - Train long enough that curves mean something
+ - Compare trained agent vs random/untrained baseline — quantitative and qualitative
+ - Include plots and numbers in your README
+
+ **Make plots readable.**
+ - Label both axes with units
+ - Save as `.png` / `.jpg` and commit to repo
+ - Embed key plots in README with a one-line caption
+ - Put baseline vs trained on the same axes
+
+ ---
+
+ ## Engineering Checklist
+
+ - [ ] Use `Environment` / `MCPEnvironment` base classes properly
+ - [ ] Respect client/server separation (clients never import server internals)
+ - [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
+ - [ ] Valid `openenv.yaml` manifest
+ - [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
+ - [ ] README links to blog, video, or slides
+ - [ ] No large video files in HF repo (use URL references)
+
+ ---
+
+ ## Before You Arrive in Bangalore
179
+
180
+ Post-training happens on-site with provided compute credits.
181
+ Use the time before April 25 to:
182
+
183
+ - [ ] Finalize your problem statement
184
+ - [ ] Build and deploy your environment to HF Space
185
+ - [ ] Write your training script (ready to run, not necessarily fully executed)
186
+ - [ ] Prepare your 3-minute pitch story
187
+
188
+ ---
189
+
190
+ ## Infrastructure Constraints (same as Round 1)
191
+
192
+ - Inference script runtime: under 20 minutes
193
+ - Hardware: vCPU=2, memory=8GB