docs: reorganize — 25 files → 4 focused docs
Deleted all brainstorming (Plan_v2), Round 1 specs, stale Round 2 docs,
and research references that no longer serve the submission.
Final structure:
docs/blog_post.md — public narrative (product vision + training story)
docs/entity_definitions.md — technical reference (state/actions/profiles/reward)
docs/environment_design.md — design rationale (why these choices, training arc)
docs/training.md — training guide (GRPO config, pipeline, results)
docs/references/judging_criteria.md — external hackathon criteria
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/{round2/entity_definitions.md → entity_definitions.md} +0 -0
- docs/environment_design.md +133 -0
- docs/references/FAQs on Discord.md +0 -77
- docs/references/[External] Meta OpenEnv Hackathon Participant Help Guide.md +0 -425
- docs/references/[External] OpenEnv Hackathon FAQs.md +0 -556
- docs/references/hackathon_checklist.md +0 -153
- docs/{round2/[External] Apr ‘26 OpenEnv Hackathon Themes & Judging Criteria.md → references/judging_criteria.md} +0 -0
- docs/references/reward_engineering_overview.md +0 -82
- docs/references/reward_engineering_software_tasks.md +0 -77
- docs/references/unsloth_grpo_training_template.md +0 -269
- docs/round1/SPEC.md +0 -554
- docs/round1/inference.py +0 -304
- docs/round1/problem_statement.md +0 -176
- docs/round1/scripts/sample_inference.py +0 -182
- docs/round1/scripts/validate-submission.sh +0 -186
- docs/round2/Plan_v2/CoreMEters.md +0 -50
- docs/round2/Plan_v2/GeminiDiscussion.md +0 -61
- docs/round2/Plan_v2/HumanModeling.md +0 -93
- docs/round2/Plan_v2/LifeMAth.md +0 -89
- docs/round2/Plan_v2/RandomnessFactor.md +0 -132
- docs/round2/Plan_v2/RewardIsolation.md +0 -44
- docs/round2/Plan_v2/Todo.md +0 -14
- docs/round2/confirmation.md +0 -60
- docs/round2/design_notes.md +0 -147
- docs/round2/environment_design.md +0 -209
- docs/round2/hackathon_themes.md +0 -180
- docs/round2/pitch_framing.md +0 -57
- docs/round2/problem_statement.md +0 -193
- docs/training.md +125 -0
docs/{round2/entity_definitions.md → entity_definitions.md}
RENAMED
File without changes

docs/environment_design.md
ADDED
@@ -0,0 +1,133 @@
# Environment Design — RhythmEnv Life Simulator

## What It Is

A Life Simulator — a holistic resource management RL environment where an agent learns a specific person's hidden patterns through experience, not configuration.

The core problem: personal AI assistants give generic advice because they don't know who you are. RhythmEnv is the training ground for an agent that must discover hidden personality dynamics through reward signals alone — the same way a good personal assistant adapts over their first weeks on the job.

---

## Why Abstract Activities, Not Tasks

An earlier design used a workday task scheduler (energy/stress meters, task queues with deadlines). We moved away from it because with real tasks, an agent can score well just by being a good scheduler — it never needs to infer anything about the person. Abstract life activities (DEEP_WORK, MEDITATE, SOCIALIZE...) force the inference problem: the only way to do well is to figure out *who you're helping*, because the same action has wildly different value depending on the hidden profile.

| | Workday Scheduler | Life Simulator |
|---|---|---|
| Episode | 1 day, 20 steps | 1 week, 28 steps |
| State | Energy, stress, task queue | 5 life meters |
| Actions | 4 (task management) | 10 (life activities) |
| Learning signal | How to sequence tasks | Which actions serve *this specific person* |

The Life Simulator creates a **non-promptable discovery problem**: the agent cannot know the person's profile from the prompt — it must be inferred from reward patterns. This is structurally different from a task that better prompting could solve.

---

## The Three Discovery Layers

### Layer 1 — Reward Weights (What Matters to This Person)

Same action, same starting state → different rewards per profile:

```
DEEP_WORK, step 1, all meters at 0.7:
  workaholic_stoic:     +1.57  (progress weight 70% — work = meaning)
  introvert_morning:    +0.32  (serenity weight 60% — mild net gain)
  extrovert_night_owl:  −0.39  (connection weight 75% — work gives 0 connection)
```

An agent that doesn't adapt plateaus at ~0.60. One that discovers the profile's targets pushes above 0.80.
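To make the weighting concrete, here is a minimal sketch of how a hidden profile's weights could turn observable meter changes into a scalar reward. The weight values, the fifth meter name (`growth`), and the `compute_reward` helper are illustrative assumptions, not the implementation in `server/rhythm_environment.py`.

```python
# Illustrative sketch only; weights and helper names are assumptions,
# not the actual values in server/rhythm_environment.py.
HIDDEN_PROFILES = {
    "workaholic_stoic":    {"progress": 0.70, "serenity": 0.10, "connection": 0.05, "vitality": 0.10, "growth": 0.05},
    "introvert_morning":   {"progress": 0.15, "serenity": 0.60, "connection": 0.05, "vitality": 0.10, "growth": 0.10},
    "extrovert_night_owl": {"progress": 0.05, "serenity": 0.10, "connection": 0.75, "vitality": 0.05, "growth": 0.05},
}

def compute_reward(profile_name: str, meter_deltas: dict) -> float:
    """Weight each meter change by how much this person cares about it."""
    weights = HIDDEN_PROFILES[profile_name]
    return sum(weights[m] * delta for m, delta in meter_deltas.items())

# The same DEEP_WORK meter deltas score very differently per profile:
deltas = {"progress": 0.25, "serenity": -0.05, "connection": 0.0, "vitality": -0.08, "growth": 0.02}
for name in HIDDEN_PROFILES:
    print(name, round(compute_reward(name, deltas), 3))
```

The agent never sees the weight table; it only sees the printed-style reward per step, which is exactly the signal it must learn from.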
### Layer 2 — Action Modifiers (How Actions Affect This Person)

The base effect matrix is modified invisibly per profile:

| Profile | Hidden modifier | Observable signal |
|---|---|---|
| introvert_morning | Social drain ×3.0 | SOCIALIZE drains vitality 3× faster than expected |
| introvert_morning | Morning deep work ×2.0 | Same action gives 2× progress at slot 0 |
| extrovert_night_owl | Morning penalty ×0.4 | DEEP_WORK in morning gives 40% expected progress |
| extrovert_night_owl | Evening/night bonus ×1.8 | Same action gives 1.8× progress at slots 2–3 |
| extrovert_night_owl | Social connection ×2.0 | SOCIALIZE gives 2× connection gain |
| workaholic_stoic | Work recovers vitality +0.06 | DEEP_WORK raises vitality instead of draining |
| workaholic_stoic | Idle drains serenity −0.10 | ME_TIME/BINGE_WATCH lower serenity |

### Layer 3 — Stress Spiral

When serenity drops below the profile's tolerance threshold, all negative effects amplify ×1.3. Wrong actions → serenity drops → worse outcomes → harder recovery. The agent must learn to protect serenity proactively.
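The two hidden layers compose per step. Below is a minimal sketch of that composition, with assumed modifier keys (`social_drain_mult`, `stress_tolerance`) and a toy base-effect table rather than the real matrix in `server/rhythm_environment.py`.

```python
# Illustrative sketch, not the actual effect logic in server/rhythm_environment.py.
BASE_EFFECTS = {"SOCIALIZE": {"connection": 0.15, "vitality": -0.05}}

def apply_profile(action: str, effects: dict, profile: dict, serenity: float) -> dict:
    out = dict(effects)
    # Layer 2: hidden per-profile modifier, e.g. social drain x3.0 for introvert_morning
    if action == "SOCIALIZE" and "social_drain_mult" in profile:
        out["vitality"] *= profile["social_drain_mult"]
    # Layer 3: below the hidden tolerance threshold, every negative effect amplifies x1.3
    if serenity < profile.get("stress_tolerance", 0.3):
        out = {k: (v * 1.3 if v < 0 else v) for k, v in out.items()}
    return out

introvert = {"social_drain_mult": 3.0, "stress_tolerance": 0.35}
print(apply_profile("SOCIALIZE", BASE_EFFECTS["SOCIALIZE"], introvert, serenity=0.25))
# vitality drain is tripled, then amplified again because serenity sits below tolerance
```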
---

## Observable vs Hidden

| Observable (agent sees every step) | Hidden (must infer from reward patterns) |
|---|---|
| All 5 meter values (0.0–1.0) | Which profile is active |
| Day of week (0–6) | Profile reward weights |
| Time slot (0–3) | Per-action modifiers |
| Active random event name | Stress tolerance threshold |
| Remaining steps | Connection decay rate |
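For concreteness, the left column corresponds roughly to a per-step observation like the dataclass below. Field names follow the table and the `obs.*` attributes shown later; the exact definitions live in `models.py` and may differ.

```python
# Hypothetical shape of the per-step observation; see models.py for the real definitions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RhythmObservationSketch:
    vitality: float            # 0.0-1.0, observable
    serenity: float
    connection: float
    progress: float
    growth: float              # fifth meter; name assumed here
    day_of_week: int           # 0-6
    time_slot: int             # 0-3
    active_event: Optional[str]
    remaining_steps: int
    reward: float              # from the last step
    done: bool
    # Deliberately absent (hidden): profile id, reward weights,
    # per-action modifiers, stress tolerance, connection decay rate.
```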
---

## Training Story

```
Random baseline → final_score ≈ 0.60–0.70
  No pattern. Misses timing windows. Doesn't protect serenity floor.

Heuristic baseline → final_score ≈ 0.75–0.82
  Follows observable rules only. Cannot differentiate profiles.
  Treats everyone the same.

GRPO-trained agent → target: final_score > 0.82 on 2+ profiles
  Discovers timing bonuses per profile.
  Adapts action mix to the person's hidden reward structure.
  Introvert's week looks different from workaholic's week.
```

---
## Anti-Reward-Hacking Measures

| Safeguard | Mechanism |
|---|---|
| Three-layer reward | format + legality + real env replay — all three must pass |
| Repetition dampening | Same action 3× in a row → 25%/50%/75% effect reduction |
| Critical floor penalty | Any meter < 0.10 → −0.30 per step |
| Random events | 8%/step probability prevents overfitting to deterministic trajectories |
| Seed-based replay | `env_reward` reconstructs exact episode state — reward cannot be fabricated |
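A sketch of how the three-layer gate could be wired together. The helper bodies here are simplified stand-ins rather than the code in `training/reward_functions.py`; only the all-three-must-pass structure comes from the table above.

```python
# Sketch only: parse_actions, legality_ok and env_reward below are simplified
# stand-ins for the real reward functions in training/reward_functions.py.
VALID_ACTIONS = {"DEEP_WORK", "MEDITATE", "SOCIALIZE", "ME_TIME", "BINGE_WATCH"}

def parse_actions(completion: str) -> list:
    """Layer 1 (format): expect a comma-separated list of action names."""
    return [t.strip().upper() for t in completion.split(",") if t.strip()]

def legality_ok(actions: list) -> bool:
    """Layer 2 (legality): every action must be one the environment defines."""
    return all(a in VALID_ACTIONS for a in actions)

def env_reward(seed: int, actions: list) -> float:
    """Layer 3 (replay): stand-in for reconstructing the episode from the seed."""
    return 0.8  # placeholder; the real function replays the trajectory in the env

def combined_reward(completion: str, seed: int) -> float:
    actions = parse_actions(completion)
    if not actions:
        return 0.0
    if not legality_ok(actions):
        return 0.0
    return env_reward(seed, actions)

print(combined_reward("DEEP_WORK, MEDITATE, SOCIALIZE", seed=42))
```

Because the final layer replays the proposed actions inside the environment, a completion that merely claims a high score earns nothing unless the trajectory actually produces it.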
---

## Hackathon Theme Alignment

**Primary: Theme 3.2 — World Modeling: Personalized Tasks**

The environment models real personal assistant behaviour. The hidden profile represents real individual differences — what a person values and how activities physically affect them. Discovery through reward is how a good PA adapts over their first weeks on the job.

**Secondary: Theme 2 — Long-Horizon Planning**

28 steps with delayed, compounding consequences. Neglecting connection decays slowly but recovery gets harder each step. The serenity spiral is triggered by accumulated bad decisions, not a single action.

---

## Implementation Reference

| Component | File |
|---|---|
| Environment | `server/rhythm_environment.py` |
| Data models | `models.py` |
| Dataset generator | `training/dataset.py` |
| Reward functions | `training/reward_functions.py` |
| Baseline evaluation | `training/inference_eval.py` |
| Training notebook | `training/RhythmEnv_GRPO_Training.ipynb` |
| Gradio UI | `ui/app.py` |
| FastAPI server | `server/app.py` |
```python
env = RhythmEnvironment()
obs = env.reset(seed=42, profile="introvert_morning")  # profile optional
obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
# obs.reward, obs.done, obs.reward_breakdown, obs.vitality, ...
```
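Extending the snippet above (and assuming the same imports), one plausible way to reproduce the random baseline is a plain rollout loop. The step count and the way a random action is drawn are assumptions; see `training/inference_eval.py` for the actual baseline code.

```python
# Hypothetical random-baseline rollout; constants and attribute names are assumptions.
import random

EPISODE_STEPS = 28  # 1 week x 4 time slots, per the design above

env = RhythmEnvironment()
obs = env.reset(seed=7)  # profile sampled by the environment when not specified
total = 0.0
for _ in range(EPISODE_STEPS):
    action = RhythmAction(action_type=random.choice(list(ActionType)))
    obs = env.step(action)
    total += obs.reward
    if obs.done:
        break
print("episode reward:", round(total, 3))
```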
docs/references/FAQs on Discord.md
DELETED
@@ -1,77 +0,0 @@
**A message from the team | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology**

We want to start by saying something that we mean genuinely: thank you.

Over the past few weeks, you showed up. In numbers we did not fully anticipate. With submissions, energy, questions, and expectations that reflect just how much this means to you. And that means everything to us.

This hackathon is the first of its kind at this scale in India. We are not just saying that as a line. We mean it operationally. The infrastructure, the evaluation process, the coordination across Meta, PyTorch, Hugging Face, and the team at Scaler School of Technology is being stress-tested in real time. Some things might not have gone as planned; we own that and are working around the clock to fix them. And we are committed to being transparent with you about every single one of them.

We're creating this document to answer common questions that we're seeing on the Discord channel. We will keep updating this document.

**Who are the mentors and judges for the finale?**

We are proud to share the full list of mentors and judges who will be part of the in-person finale:

* Sanyam Bhutani, Partner Engineer at Meta
* Yash Khare, Partner Engineer at Meta
* Nilesh Pandey, Partner Engineer at Meta
* Adithya S Kolavi, ML Engineer at Hugging Face
* Adarsh Shirawalmath, ML Engineer at Hugging Face
* Arkadip Maitra, ML Engineer at Red Hat
* Aashay Sachdeva, Founding Team at Sarvam
* Deepa Dhevannan, Gen AI Solutions Architect
* Soumik Rakshit, ML Engineer at Zomato
* Ayush Satyam, Systems ML Engineer, Red Hat
* Parshant Sharma, Machine Learning Engineer at Red Hat

These are practitioners actively working at the forefront of AI. The team has worked hard to bring them together so they can be around to help you make your environments even better.

**Why are results online, and why does the finale still happen on campus? Why are results not being declared on April 26?**

We want to be fully transparent here, and we want to address both of these together because they come from the same place.

The final evaluation is being handled directly by engineers from Meta, PyTorch, and Hugging Face. With 800+ submissions, we made the decision to move to a hybrid evaluation model. Initial screening uses automated tooling, but every top team's submission will receive a dedicated, granular review by domain experts, with ~20–30 minutes of evaluation per team. This is not something we are willing to rush. Every submission deserves to be looked at fairly, and that takes time.

We had originally envisioned the entire evaluation happening offline. But given the volume of submissions, completing a fully offline evaluation before bringing everyone to campus would have meant asking you to wait significantly longer, and that felt like the wrong trade-off.

What we refused to compromise on is the experience of you all coming together in person, dedicatedly building and improving your environments with the mentors mentioned above being around to help you. There is something that cannot be replicated about builders in the same room, working through ideas together, pushing each other. Beyond that, this is a rare opportunity for the Meta/PyTorch and Hugging Face teams to interact directly with engineers building in India, and to get a genuine sense of the depth of engineering talent this country has. That kind of exposure goes both ways, and it is something we were not willing to cut.

**Has promised mentorship and expert access materialised?**

Yes, and more is coming. A live session has already been conducted with Ben Burtenshaw (Community Education at Hugging Face) and Pulkit Sharma (Senior Instructor at Scaler). [Link to session recording here](https://www.youtube.com/live/kkCNMz0Ptd8?si=KDIaWXSEX6up4lU4), along with additional modules shared over the dashboard and emails. Extensive additional sessions and mentor touchpoints are planned on campus.

**Why was the problem theme document edited in real time?**

This one is on us, and we want to be straightforward about it. The document that was shared with participants contained leftover content from a previous hackathon that should have been removed before it went out. It was an editorial error, not an intentional change, and we corrected it as soon as we caught it.

More broadly, this hackathon is being run by open-source teams across multiple organizations coordinating in real time. Mistakes like this can happen, and when they do, we would rather fix them quickly and tell you exactly what happened than let confusion sit.

We also want to address the evaluation adjustments that some of you noticed. We made deliberate changes to the judging process to ensure every submission gets the time and attention it deserves. Rushing evaluations on the day of the event would have been unfair to everyone. This was a considered call, not a last-minute scramble.

We ask that conversations about the hackathon stay within the designated Discord channels so we can track every concern and respond properly. And we ask that everyone continue to engage with each other, and with us, with respect. This community has been extraordinary, and that is worth protecting.

To be clear, anyone who crosses boundaries & breaks community rules will be banned & therefore automatically disqualified from the finale.

**Are the prizes and the number of winners still the same?**

Goes without saying. There are no changes to the prize structure or the number of winners. 15 teams will be awarded, with a total cash prize pool of $30,000, as published on the site:

| Position Secured | Prize |
| :---- | :---- |
| 1st | $7,500 |
| 2nd | $5,000 |
| 3rd | $3,500 |
| 4th to 8th | $2,000 each |
| 9th to 15th | $650 each |

Additionally, top teams will receive an interview opportunity with the Meta and Hugging Face AI teams. This has not changed and will not change.

**Will Scaler School of Technology students be favoured in the final evaluation?**

Absolutely not. The final evaluation is entirely in the hands of the Meta, PyTorch, and Hugging Face teams and will follow the judging criteria outlined.

We are working through every other question that has come in and will post structured answers here as we go. If something is unclear, if something feels wrong, keep asking. We would rather hear it from you directly than have you sit with uncertainty.

This hackathon is something India has not seen before. We are building the playbook in real time, at scale, and that is both the most exciting and the most humbling part of doing this. We are grateful that every single one of you showed up for it.

More updates soon.
docs/references/[External] Meta OpenEnv Hackathon Participant Help Guide.md
DELETED
@@ -1,425 +0,0 @@
| 1 |
-
# **Hackathon Self-Serve Guide: Build an RL Environment, Train an LLM, Ship a Demo**
|
| 2 |
-
|
| 3 |
-
## **0\) What you are building**
|
| 4 |
-
|
| 5 |
-
The core idea is not just to fine-tune a text model, but to build a **specialized LLM system** that can act inside an environment, get feedback, and improve through reinforcement learning. The practical stack discussed here is:
|
| 6 |
-
|
| 7 |
-
**Environment → verifier/reward functions → TRL trainer → Unsloth for efficiency → deployment on OpenEnv / Spaces**.
|
| 8 |
-
|
| 9 |
-
A strong project usually looks like one of these,
|
| 10 |
-
|
| 11 |
-
Please refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing) for theme guidelines on selecting & forming problem statements.
|
| 12 |
-
|
| 13 |
-
## **1\) Start with the right project idea**
|
| 14 |
-
|
| 15 |
-
Pick a task that has all three of these properties:
|
| 16 |
-
|
| 17 |
-
1. **The model can act step by step**
|
| 18 |
-
2. **You can verify success programmatically**
|
| 19 |
-
3. **The task is hard enough to be interesting, but not so hard that the model never succeeds**
|
| 20 |
-
|
| 21 |
-
This last point matters a lot. RL only works if the probability of getting a good answer is greater than zero. If your task is so hard that the model never gets any reward, you will burn compute and learn nothing.
|
| 22 |
-
|
| 23 |
-
Please refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing) for theme guidelines on selecting & forming problem statements.
|
| 24 |
-
|
| 25 |
-
A useful rule: **prefer tasks with crisp verification over tasks that only “look good” to a human.** RL gets easier when the reward is objective.
|
| 26 |
-
|
| 27 |
-
## **2\) Understand the minimum RL loop before you build**
|
| 28 |
-
|
| 29 |
-
At a high level, your loop is:
|
| 30 |
-
|
| 31 |
-
1. Give the model a prompt
|
| 32 |
-
2. Let it generate an action, strategy, answer, or code
|
| 33 |
-
3. Execute that output in an environment or verifier
|
| 34 |
-
4. Convert the result into a reward
|
| 35 |
-
5. Update the model so higher-reward behavior becomes more likely
|
| 36 |
-
|
| 37 |
-
That is the practical mental model for RL here. The system samples many outputs, scores them, and shifts probability mass away from bad outputs and toward better ones.
|
| 38 |
-
|
| 39 |
-
One especially useful framing is that RL is like a more efficient version of repeated in-context improvement. Instead of repeatedly stuffing previous examples into the context, you let backpropagation store what worked into the weights.
|
| 40 |
-
|
| 41 |
-
## **3\) Decide whether you need SFT first**
|
| 42 |
-
|
| 43 |
-
Use this simple rule:
|
| 44 |
-
|
| 45 |
-
* If you have **a lot of good data**, use **SFT**
|
| 46 |
-
* If you **do not have data but can verify outputs**, use **RL**
|
| 47 |
-
* In many practical cases, do **a little SFT first**, then RL
|
| 48 |
-
|
| 49 |
-
Why this matters:
|
| 50 |
-
|
| 51 |
-
* SFT is generally more sample-efficient
|
| 52 |
-
* RL is useful when you can test outcomes but cannot cheaply author ideal traces
|
| 53 |
-
* RL often needs some warm start, formatting priming, or easy tasks first so that good rollouts happen at all
|
| 54 |
-
|
| 55 |
-
For hackathon teams, the best path is usually:
|
| 56 |
-
|
| 57 |
-
1. Start from a capable base/instruct model
|
| 58 |
-
2. Add light formatting or task scaffolding if needed
|
| 59 |
-
3. Use RL for improvement, not as magic from scratch
|
| 60 |
-
|
| 61 |
-
## **4\) Design the environment before you design the trainer**
|
| 62 |
-
|
| 63 |
-
Treat the environment as a first-class artifact. It should define:
|
| 64 |
-
|
| 65 |
-
* **reset()**: start a fresh episode
|
| 66 |
-
* **step(action)**: apply an action and return the next result
|
| 67 |
-
* **state() / observation**: what the agent sees
|
| 68 |
-
* **reward**: what counts as progress or success
|
| 69 |
-
|
| 70 |
-
OpenEnv standardizes this so the same training code can work across many environments, instead of every team inventing a different API. That is one of the main reasons to use it in a hackathon.
|
| 71 |
-
|
| 72 |
-
Think about your environment in this order:
|
| 73 |
-
|
| 74 |
-
1. What does the agent observe?
|
| 75 |
-
2. What actions can it take?
|
| 76 |
-
3. What ends an episode?
|
| 77 |
-
4. How do you compute reward?
|
| 78 |
-
5. How do you stop abuse, infinite loops, or cheating?
|
| 79 |
-
|
| 80 |
-
**5\) Build the environment using OpenEnv**
|
| 81 |
-
|
| 82 |
-
The intended workflow is to bootstrap an environment skeleton and then fill in the behavior. OpenEnv’s CLI creates the scaffolding for you. The environment is implemented as a Python package and exposed via a FastAPI app.
|
| 83 |
-
|
| 84 |
-
Your implementation typically defines:
|
| 85 |
-
|
| 86 |
-
* action dataclass
|
| 87 |
-
* observation dataclass
|
| 88 |
-
* state representation
|
| 89 |
-
* environment methods like reset and step
|
| 90 |
-
* FastAPI wrapper / client-server interface
|
| 91 |
-
|
| 92 |
-
That gives you a clean separation:
|
| 93 |
-
|
| 94 |
-
* the **environment** handles world dynamics and scoring,
|
| 95 |
-
* the **trainer** handles optimization,
|
| 96 |
-
* and the **model** just learns to act inside the interface.
|
| 97 |
-
|
| 98 |
-
## **6\) Keep the task simple at first**
|
| 99 |
-
|
| 100 |
-
Do not begin with your hardest benchmark. Start with the easiest version of your environment that still proves the concept. This is where curriculum learning helps.
|
| 101 |
-
|
| 102 |
-
A good progression:
|
| 103 |
-
|
| 104 |
-
1. easy tasks with short horizons,
|
| 105 |
-
2. medium tasks with a little more branching,
|
| 106 |
-
3. harder tasks only after the model starts getting non-zero reward.
|
| 107 |
-
|
| 108 |
-
The principle is simple: **make success possible early**. If the model never sees successful trajectories, learning stalls.
|
| 109 |
-
|
| 110 |
-
## **7\) Design rewards carefully**
|
| 111 |
-
|
| 112 |
-
Your reward function is your task specification. If it is weak, incomplete, or easy to exploit, the model will optimize the wrong thing very efficiently.
|
| 113 |
-
|
| 114 |
-
A strong reward design usually includes multiple components, for example:
|
| 115 |
-
|
| 116 |
-
* execution success,
|
| 117 |
-
* correctness,
|
| 118 |
-
* format compliance,
|
| 119 |
-
* timeouts,
|
| 120 |
-
* resource usage,
|
| 121 |
-
* safety constraints,
|
| 122 |
-
* and anti-cheating checks.
|
| 123 |
-
|
| 124 |
-
One explicit recommendation was to use **multiple independent reward functions**, not just one. If you only have a single reward signal, it is easier for the model to hack it. Multiple independent checks reduce that risk.
|
| 125 |
-
|
| 126 |
-
For example, for a coding environment:
|
| 127 |
-
|
| 128 |
-
* reward passing tests,
|
| 129 |
-
* penalize timeouts,
|
| 130 |
-
* reward format compliance,
|
| 131 |
-
* reject use of forbidden globals,
|
| 132 |
-
* and separately verify the function contract.
|
| 133 |
-
|
| 134 |
-
## **8\) Protect yourself against reward hacking**
|
| 135 |
-
|
| 136 |
-
Reward hacking is one of the biggest practical failure modes. The model may learn shortcuts that maximize your reward without solving the real task. Examples mentioned include:
|
| 137 |
-
|
| 138 |
-
* editing timers,
|
| 139 |
-
* caching results,
|
| 140 |
-
* abusing globals,
|
| 141 |
-
* mutating protected state,
|
| 142 |
-
* or exploiting environment bugs.
|
| 143 |
-
|
| 144 |
-
What to do:
|
| 145 |
-
|
| 146 |
-
1. Use multiple independent reward functions
|
| 147 |
-
2. Lock down execution where possible
|
| 148 |
-
3. Add time limits
|
| 149 |
-
4. Avoid unrestricted global state
|
| 150 |
-
5. Sample outputs frequently and inspect them
|
| 151 |
-
6. Terminate or roll back runs if behavior drifts badly
|
| 152 |
-
|
| 153 |
-
A particularly practical recommendation was to use a **locked-down function** or restricted execution approach so the model cannot rely on undeclared globals or hidden cached state.
|
| 154 |
-
|
| 155 |
-
Also, do not just let training run forever without checking generations. Periodic human inspection is still necessary.
|
| 156 |
-
|
| 157 |
-
## **9\) Use process-aware feedback when you can**
|
| 158 |
-
|
| 159 |
-
Naively assigning the same final reward to every token is inefficient. If possible, use richer supervision that distinguishes good intermediate steps from bad ones. That is the idea behind **process supervision**.
|
| 160 |
-
|
| 161 |
-
In practice, this can be approximated by:
|
| 162 |
-
|
| 163 |
-
* line-by-line checks,
|
| 164 |
-
* step-level verifiers,
|
| 165 |
-
* program trace analysis,
|
| 166 |
-
* or LLM-as-a-judge for intermediate reasoning.
|
| 167 |
-
|
| 168 |
-
But be careful: LLM-as-a-judge can itself be gamed. Use it as one signal, not the only signal.
|
| 169 |
-
|
| 170 |
-
For a hackathon, outcome-based verification plus a few lightweight process checks is usually the sweet spot.
|
| 171 |
-
|
| 172 |
-
## **10\) Pick the right training stack**
|
| 173 |
-
|
| 174 |
-
The intended stack here is:
|
| 175 |
-
|
| 176 |
-
* **TRL** for RL training algorithms
|
| 177 |
-
* **Unsloth** to make RL training and inference more efficient
|
| 178 |
-
* **OpenEnv** to standardize environment interaction
|
| 179 |
-
|
| 180 |
-
This combination works because:
|
| 181 |
-
|
| 182 |
-
* OpenEnv gives you a common environment interface
|
| 183 |
-
* TRL gives you RL trainers like GRPO
|
| 184 |
-
* Unsloth reduces memory use and improves efficiency on top of TRL
|
| 185 |
-
|
| 186 |
-
One of the practical examples used the same prompt repeated many times, routed through an environment, with TRL driving training and Unsloth helping with performance.
|
| 187 |
-
|
| 188 |
-
## **11\) Prefer GRPO / RLVR style training for verifiable tasks**
|
| 189 |
-
|
| 190 |
-
The RL setup discussed here leans toward **RL with verifiable rewards**:
|
| 191 |
-
|
| 192 |
-
* instead of a learned reward model,
|
| 193 |
-
* use a verifier, test harness, regex check, executor, or environment.
|
| 194 |
-
|
| 195 |
-
GRPO was described as a more efficient evolution relative to older PPO-style setups, especially by simplifying away parts like the value model.
|
| 196 |
-
|
| 197 |
-
For hackathon purposes, the key practical takeaway is:
|
| 198 |
-
|
| 199 |
-
* if the task is verifiable,
|
| 200 |
-
* build the verifier first,
|
| 201 |
-
* then plug that verifier into RL training.
|
| 202 |
-
|
| 203 |
-
## **12\) Keep inference fast**
|
| 204 |
-
|
| 205 |
-
One important point: in RL for LLMs, **inference can dominate total runtime**. Over time, rollout generation often becomes the bottleneck, not the optimizer step.
|
| 206 |
-
|
| 207 |
-
That means your project speed depends heavily on:
|
| 208 |
-
|
| 209 |
-
* fast sampling,
|
| 210 |
-
* tight environment loops,
|
| 211 |
-
* low-overhead execution,
|
| 212 |
-
* and efficient model runtime.
|
| 213 |
-
|
| 214 |
-
This is one reason Unsloth matters in the stack, and another reason to avoid overly heavy environments early in the hackathon.
|
| 215 |
-
|
| 216 |
-
## **13\) Deploy your environment early**
|
| 217 |
-
|
| 218 |
-
OpenEnv environments are designed to be deployed as **Hugging Face Spaces**, which provide:
|
| 219 |
-
|
| 220 |
-
* a running server,
|
| 221 |
-
* a Git repository,
|
| 222 |
-
* and a container registry.
|
| 223 |
-
|
| 224 |
-
That gives you several ways to work:
|
| 225 |
-
|
| 226 |
-
* interact with the remote Space directly,
|
| 227 |
-
* install the client code from the repo,
|
| 228 |
-
* pull and run the container locally,
|
| 229 |
-
* or run the FastAPI app locally via Python/Uvicorn.
|
| 230 |
-
|
| 231 |
-
Why this is good for a hackathon:
|
| 232 |
-
|
| 233 |
-
* one shared source of truth,
|
| 234 |
-
* easier collaboration,
|
| 235 |
-
* easier demos,
|
| 236 |
-
* easier switching between local and remote execution.
|
| 237 |
-
|
| 238 |
-
A good habit is to deploy an early version of the environment before training seriously. That catches API and packaging issues early.
|
| 239 |
-
|
| 240 |
-
## **14\) Scale only after the environment is stable**
|
| 241 |
-
|
| 242 |
-
There was a dedicated tutorial flow around:
|
| 243 |
-
|
| 244 |
-
1. environment,
|
| 245 |
-
2. deployment,
|
| 246 |
-
3. scaling,
|
| 247 |
-
4. training with TRL and Wordle.
|
| 248 |
-
|
| 249 |
-
Follow the same order.
|
| 250 |
-
|
| 251 |
-
Do **not** start with scale. First confirm:
|
| 252 |
-
|
| 253 |
-
* reset works,
|
| 254 |
-
* step works,
|
| 255 |
-
* rewards are sensible,
|
| 256 |
-
* timeouts work,
|
| 257 |
-
* logs are visible,
|
| 258 |
-
* and the environment can be run locally and remotely.
|
| 259 |
-
|
| 260 |
-
Only then:
|
| 261 |
-
|
| 262 |
-
* increase batch sizes,
|
| 263 |
-
* duplicate prompts or tasks,
|
| 264 |
-
* expand task diversity,
|
| 265 |
-
* and benchmark throughput.
|
| 266 |
-
|
| 267 |
-
## **15\) Monitor the right things during training**
|
| 268 |
-
|
| 269 |
-
Do not watch only one scalar. Monitor:
|
| 270 |
-
|
| 271 |
-
* overall reward,
|
| 272 |
-
* individual reward function columns,
|
| 273 |
-
* success indicators,
|
| 274 |
-
* timeout frequency,
|
| 275 |
-
* and generated strategies over time.
|
| 276 |
-
|
| 277 |
-
A very concrete suggestion was:
|
| 278 |
-
|
| 279 |
-
* watch whether the reward is going up,
|
| 280 |
-
* and separately watch critical columns like “function works.”
|
| 281 |
-
|
| 282 |
-
Also inspect actual generations during training. A rising reward is not enough if the model is learning to exploit bugs.
|
| 283 |
-
|
| 284 |
-
## **16\) Save models correctly**
|
| 285 |
-
|
| 286 |
-
If you use QLoRA / LoRA-style training, be careful when saving. One explicit warning was:
|
| 287 |
-
|
| 288 |
-
**Do not upcast a 4-bit model to 16-bit and then merge the LoRA weights naively.** That can badly damage model quality. Instead, use the proper merged-save path, or use the adapters directly.
|
| 289 |
-
|
| 290 |
-
For participants, that means:
|
| 291 |
-
|
| 292 |
-
* keep your training save path simple,
|
| 293 |
-
* test post-training inference immediately,
|
| 294 |
-
* and do not leave export until the end.
|
| 295 |
-
|
| 296 |
-
## **17\) How to structure your team over the hackathon**
|
| 297 |
-
|
| 298 |
-
A very effective team split is:
|
| 299 |
-
|
| 300 |
-
**Person A: Environment**
|
| 301 |
-
|
| 302 |
-
* builds reset/step/state
|
| 303 |
-
* adds timeouts and safety constraints
|
| 304 |
-
* makes local and remote execution work
|
| 305 |
-
|
| 306 |
-
**Person B: Verifier / Rewards**
|
| 307 |
-
|
| 308 |
-
* writes multiple reward functions
|
| 309 |
-
* adds anti-hacking checks
|
| 310 |
-
* makes failure cases visible
|
| 311 |
-
|
| 312 |
-
**Person C: Training**
|
| 313 |
-
|
| 314 |
-
* sets up TRL \+ Unsloth
|
| 315 |
-
* runs experiments
|
| 316 |
-
* tracks metrics and generations
|
| 317 |
-
|
| 318 |
-
**Person D: Demo / Product**
|
| 319 |
-
|
| 320 |
-
* prepares the Space demo
|
| 321 |
-
* creates a simple interface
|
| 322 |
-
* records examples and final benchmarks
|
| 323 |
-
|
| 324 |
-
This split matches the way the stack naturally decomposes in practice.
|
| 325 |
-
|
| 326 |
-
## **18\) A practical 1-day execution plan**
|
| 327 |
-
|
| 328 |
-
### **Phase 1: Pick a narrow task**
|
| 329 |
-
|
| 330 |
-
Choose a small, verifiable environment. Avoid huge long-horizon tasks first.
|
| 331 |
-
|
| 332 |
-
### **Phase 2: Build the environment**
|
| 333 |
-
|
| 334 |
-
Use OpenEnv init, implement reset/step/state, and get a local loop working.
|
| 335 |
-
|
| 336 |
-
### **Phase 3: Build rewards**
|
| 337 |
-
|
| 338 |
-
Add at least 2–4 independent reward checks, plus timeout and anti-cheat logic.
|
| 339 |
-
|
| 340 |
-
### **Phase 4: Deploy**
|
| 341 |
-
|
| 342 |
-
Push to a Space or run locally via container/Uvicorn so teammates can use the same environment.
|
| 343 |
-
|
| 344 |
-
### **Phase 5: Train small**
|
| 345 |
-
|
| 346 |
-
Run a tiny TRL \+ Unsloth experiment first. Look at outputs, not just metrics.
|
| 347 |
-
|
| 348 |
-
### **Phase 6: Inspect for hacking**
|
| 349 |
-
|
| 350 |
-
Sample generations. Check for globals, hacks, environment abuse, or suspicious shortcuts.
|
| 351 |
-
|
| 352 |
-
### **Phase 7: Add curriculum**
|
| 353 |
-
|
| 354 |
-
If the model gets zero reward too often, simplify tasks or add easier start states.
|
| 355 |
-
|
| 356 |
-
### **Phase 8: Train bigger**
|
| 357 |
-
|
| 358 |
-
Only after the loop is stable should you increase scale, batch size, or environment diversity.
|
| 359 |
-
|
| 360 |
-
### **Phase 9: Save and demo**
|
| 361 |
-
|
| 362 |
-
Export the trained model correctly, test inference, and show before/after behavior.
|
| 363 |
-
|
| 364 |
-
## **19\) What judges or reviewers will likely find compelling**
|
| 365 |
-
|
| 366 |
-
The strongest hackathon projects usually show:
|
| 367 |
-
|
| 368 |
-
* a clear environment design,
|
| 369 |
-
* objective reward functions,
|
| 370 |
-
* evidence that the model improved,
|
| 371 |
-
* prevention against reward hacking,
|
| 372 |
-
* a reproducible deployment story,
|
| 373 |
-
* and a sharp demo.
|
| 374 |
-
|
| 375 |
-
A simple but strong demo format is:
|
| 376 |
-
|
| 377 |
-
1. baseline model attempt,
|
| 378 |
-
2. reward/verifier output,
|
| 379 |
-
3. trained model attempt,
|
| 380 |
-
4. measurable improvement,
|
| 381 |
-
5. short explanation of safeguards.
|
| 382 |
-
|
| 383 |
-
## **20\) Suggested problem statement theme directions**
|
| 384 |
-
|
| 385 |
-
Please Refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing)
|
| 386 |
-
|
| 387 |
-
## **21\) Common mistakes to avoid**
|
| 388 |
-
|
| 389 |
-
* Picking a task so hard that success probability is zero
|
| 390 |
-
* Using only one reward function
|
| 391 |
-
* Not checking for reward hacking
|
| 392 |
-
* Training before the environment is stable
|
| 393 |
-
* Relying only on average reward and not inspecting outputs
|
| 394 |
-
* Forgetting timeouts and sandbox limits
|
| 395 |
-
* Saving LoRA/QLoRA models incorrectly
|
| 396 |
-
|
| 397 |
-
## **22\) Learning Resources**
|
| 398 |
-
|
| 399 |
-
**(Recommended) RL Environment Lecture Chapters:**
|
| 400 |
-
[**https://openenv-india-apr-2026.lovable.app/**](https://openenv-india-apr-2026.lovable.app/)
|
| 401 |
-
|
| 402 |
-
**Module 1: Why OpenEnv?** (\~7 min)
|
| 403 |
-
▸ Workshop 8:02–15:05 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=482s](https://www.youtube.com/watch?v=1jU05MlENOI&t=482s)
|
| 404 |
-
▸ Sanyam: RL loop, fragmented env APIs, OpenEnv as universal interface, Gymnasium spec \+ Docker
|
| 405 |
-
▸ Alt: Mega Lecture 40:01–46:00 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=2401s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=2401s)
|
| 406 |
-
|
| 407 |
-
**Module 2: Using Existing Envs** (\~7.5 min)
|
| 408 |
-
▸ Workshop 35:33–43:05 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2133s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2133s)
|
| 409 |
-
▸ Ben: Hub org, env collections, 3 Space interfaces (server/repo/registry), from\_hub
|
| 410 |
-
▸ Alt: Mega Lecture 1:24:11–1:30:00 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5051s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5051s)
|
| 411 |
-
|
| 412 |
-
**Module 3: Deploying Envs** (\~9 min)
|
| 413 |
-
▸ Mega Lecture 1:30:00–1:39:07 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5400s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5400s)
|
| 414 |
-
▸ Ben: live openenv init, scaffold, running locally, openenv push, Docker run from Space
|
| 415 |
-
▸ Alt: Workshop 43:05–48:30 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2585s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2585s)
|
| 416 |
-
|
| 417 |
-
**Module 4: Building Your Own** (\~6.5 min)
|
| 418 |
-
▸ Workshop 43:45–50:20 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2625s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2625s)
|
| 419 |
-
▸ Ben: scaffold files, business logic (reset/step), models, client, publishing
|
| 420 |
-
▸ Alt: Mega Lecture 1:33:30–1:39:07 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5610s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5610s)
|
| 421 |
-
|
| 422 |
-
**Module 5: Training \+ TRL** (\~14 min)
|
| 423 |
-
▸ Mega Lecture 1:53:20–2:07:12 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=6800s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=6800s)
|
| 424 |
-
▸ Lewis: Wordle GRPO walkthrough — rollout function, reward shaping, GRPOTrainer, live training
|
| 425 |
-
▸ Alt: Workshop 22:24–34:12 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=1344s](https://www.youtube.com/watch?v=1jU05MlENOI&t=1344s)
|
docs/references/[External] OpenEnv Hackathon FAQs.md
DELETED
@@ -1,556 +0,0 @@
| 1 |
-
## **1\) What is reinforcement learning in the context of LLMs?**
|
| 2 |
-
|
| 3 |
-
Reinforcement learning for LLMs is a loop where the model generates an answer, code snippet, plan, or action sequence; that output is evaluated by a verifier or environment; and the resulting reward is used to update the model so higher-reward behaviors become more likely over time. In practice, this is often used after pretraining and supervised fine-tuning to sharpen behaviors like reasoning, code generation, or tool use. The session framed this intuition as turning repeated trial-and-error into weight updates instead of stuffing more and more examples into the prompt.
|
| 4 |
-
|
| 5 |
-
A good mental model is: supervised fine-tuning tells the model “copy this good target,” while RL tells it “try many possibilities and move probability mass toward the ones that score better.” PPO is one classic algorithm for this style of training, and GRPO is a later variant used heavily in modern LLM work because it can be more memory-efficient for certain setups. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
|
| 6 |
-
|
| 7 |
-
For deeper reading:
|
| 8 |
-
|
| 9 |
-
* TRL docs for RL trainers and workflows. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
|
| 10 |
-
* PPO paper. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
|
| 11 |
-
* DeepSeekMath for GRPO. ([arXiv](https://arxiv.org/abs/2402.03300?utm_source=chatgpt.com))
|
| 12 |
-
|
| 13 |
-
## **2\) Why do rewards matter so much?**
|
| 14 |
-
|
| 15 |
-
Rewards are the only signal telling the model what “better” means. If your reward is well aligned with the real task, RL can push the model toward genuinely useful behavior. If your reward is incomplete or easy to game, the model will optimize the wrong thing very effectively. The session emphasized that RL gives you what you asked for, not necessarily what you meant.
|
| 16 |
-
|
| 17 |
-
For example, if you reward generated code only for passing a shallow regex or a weak unit test, the model may learn to exploit those checks instead of solving the underlying problem. This is why reward design is not a detail; it is the task specification. DeepMind’s discussion of “specification gaming” makes the same point in broader RL terms: weakly specified rewards create loopholes that search will discover. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
|
| 18 |
-
|
| 19 |
-
Useful reading:
|
| 20 |
-
|
| 21 |
-
* DeepMind on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
|
| 22 |
-
* Lilian Weng on reward hacking. ([Lil'Log](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/?utm_source=chatgpt.com))
|
| 23 |
-
|
| 24 |
-
## **3\) What is rewards engineering?**
|
| 25 |
-
|
| 26 |
-
Rewards engineering is the work of designing, combining, validating, and monitoring reward signals so that optimization pressure produces the behavior you actually want. In LLM RL, that usually means deciding:
|
| 27 |
-
|
| 28 |
-
* what gets rewarded,
|
| 29 |
-
* how much it gets rewarded,
|
| 30 |
-
* when it gets rewarded,
|
| 31 |
-
* what gets penalized,
|
| 32 |
-
* and how you audit whether the reward is being gamed.
|
| 33 |
-
|
| 34 |
-
A practical reward function often has several components. For a code task, you might combine syntax validity, execution success, unit test pass rate, latency, memory use, formatting compliance, and safety checks. The session highlighted verifier-based reward design such as formatting checks, execution checks, regex checks, and environment-based evaluation instead of a learned reward model alone.
|
| 35 |
-
|
| 36 |
-
A useful principle is to reward outcomes first, then add process constraints only where needed. Over-shaping the reward can make training brittle or bias the model into narrow strategies, while under-shaping makes hacking easier. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
|
| 37 |
-
|
| 38 |
-
## **4\) What is RLVR, and how is it different from using a reward model?**
|
| 39 |
-
|
| 40 |
-
RLVR usually means reinforcement learning with verifiable rewards. Instead of asking a learned reward model to score outputs, you use a verifier, tester, or environment that can check correctness more directly. The session gave examples like formatting checks, execution checks, regex-based checks, and environment rollouts.
|
| 41 |
-
|
| 42 |
-
This is powerful when correctness is externally testable. Code can be compiled and unit-tested. Math can often be checked against a final answer or symbolic verifier. Games can expose reward from the environment. Browser tasks can be checked by page state or task completion. In such cases, verifier-driven rewards are often more trustworthy than a purely learned scalar reward model.
|
| 43 |
-
|
| 44 |
-
TRL documents this broader environment-based training pattern, and OpenEnv is meant to standardize how such environments are defined and used. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))
|
| 45 |
-
|
| 46 |
-
## **5\) Why do RL environments matter for LLMs?**
|
| 47 |
-
|
| 48 |
-
Static prompt-response datasets are useful, but they are limited. Real deployments require models to interact with systems: codebases, browsers, files, APIs, games, tools, and simulators. RL environments let the model act, observe consequences, and keep going across multiple steps, which is much closer to real agent behavior. The session described environments as the bridge from isolated prompt solving to real-world interaction.
|
| 49 |
-
|
| 50 |
-
They also enable dynamic difficulty and richer feedback. Instead of training forever on a fixed set of prompts, the environment can generate or surface tasks that are more appropriate for the current model, which makes curriculum learning and continual challenge easier. This matches the broader “RL with environments” direction discussed in recent OpenEnv and TRL material. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))
|
| 51 |
-
|
| 52 |
-
For examples:
|
| 53 |
-
|
| 54 |
-
* BrowserGym for web-task environments. ([GitHub](https://github.com/servicenow/browsergym?utm_source=chatgpt.com))
|
| 55 |
-
* OpenEnv course and TRL integration docs. ([GitHub](https://github.com/huggingface/openenv-course?utm_source=chatgpt.com))
|
| 56 |
-
|
| 57 |
-
## **6\) What is OpenEnv, and why would a hackathon team use it?**
|
| 58 |
-
|
| 59 |
-
OpenEnv is an open-source framework for defining and interacting with RL environments for LLM and agent training. The session described it as a standardized interface around concepts like reset, step, state, observations, actions, and rewards, with deployment built around Hugging Face Spaces and containerized execution.
|
| 60 |
-
|
| 61 |
-
A hackathon team would use OpenEnv because it reduces environment plumbing. Instead of inventing a new interface for each task, you can standardize how the model talks to the environment and then connect that to a trainer like TRL. That means you spend more time on task design and rewards, and less on adapter glue. The session also highlighted `openenv init` for bootstrapping an environment skeleton quickly.
|
| 62 |
-
|
| 63 |
-
Good starting points:
|
| 64 |
-
|
| 65 |
-
* OpenEnv repo. ([GitHub](https://github.com/meta-pytorch/OpenEnv?utm_source=chatgpt.com))
|
| 66 |
-
* OpenEnv course. ([GitHub](https://github.com/huggingface/openenv-course?utm_source=chatgpt.com))
|
| 67 |
-
* TRL’s OpenEnv integration guide. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))
|
| 68 |
-
|
| 69 |
-
## **7\) How does OpenEnv work at a high level?**
|
| 70 |
-
|
| 71 |
-
At a high level, an OpenEnv environment exposes a small set of standard operations:
|
| 72 |
-
|
| 73 |
-
* reset the environment,
|
| 74 |
-
* step the environment with an action,
|
| 75 |
-
* return observations, rewards, and state.
|
| 76 |
-
|
| 77 |
-
The session described OpenEnv environments as FastAPI applications that can be run locally, deployed on Hugging Face Spaces, or pulled as containers. That gives teams several options: they can use the remote environment directly, install client code from the repo, or run the environment locally through the container image.
|
| 78 |
-
|
| 79 |
-
This design is useful because it treats environments as portable, versioned software artifacts rather than ad hoc scripts. Hugging Face’s own TRL docs describe OpenEnv similarly, including support for backend-server execution and standardized APIs. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))
|
| 80 |
-
|
| 81 |
-
## **8\) Where do TRL and Unsloth fit in this stack?**
|
| 82 |
-
|
| 83 |
-
TRL is the training library. It provides trainers and workflows for SFT, DPO, PPO, GRPO, reward modeling, and related post-training methods for transformer models. In a typical hackathon setup, TRL handles rollout collection, reward integration, optimization, logging, and trainer configuration. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
|
| 84 |
-
|
| 85 |
-
Unsloth fits in as the acceleration and memory-efficiency layer for training and RL fine-tuning. The session described Unsloth as making RL training more efficient and inference faster, which matters because rollout generation often dominates runtime in RL loops. It also noted a practical QLoRA warning: don’t naively upcast a 4-bit model to 16-bit and then merge adapters, because that can damage model quality; use the proper merge path instead.
|
| 86 |
-
|
| 87 |
-
Relevant docs:
|
| 88 |
-
|
| 89 |
-
* TRL docs and GRPO cookbook. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
|
| 90 |
-
* Unsloth repository/readme. ([GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1&utm_source=chatgpt.com))
|
| 91 |
-
|
| 92 |
-
## **9\) What is the difference between PPO and GRPO?**
|
| 93 |
-
|
| 94 |
-
PPO is a classic policy optimization algorithm that stabilizes updates by constraining how much the policy changes between iterations. It is one of the most influential RL algorithms in modern deep learning. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
|
| 95 |
-
|
| 96 |
-
GRPO is a later group-relative variant used in LLM training that compares sampled outputs within a group to estimate relative advantage, and it is often discussed as a more memory-efficient alternative to full PPO-style setups in some LLM post-training pipelines. The session summarized GRPO as a more efficient version of PPO and specifically noted removing the value model from the setup.
|
| 97 |
-
|
| 98 |
-
For deeper details:
|
| 99 |
-
|
| 100 |
-
* PPO paper. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
|
| 101 |
-
* DeepSeekMath / GRPO references via TRL paper index and cookbook. ([arXiv](https://arxiv.org/abs/2402.03300?utm_source=chatgpt.com))
|
| 102 |
-
|
| 103 |
-
## **10\) Why is RL often described as inefficient, yet still useful?**
|
| 104 |
-
|
| 105 |
-
RL is often inefficient because the feedback is sparse and delayed. A long rollout may end in one scalar reward, and that weak signal has to train many decisions. The session used a simple example: if a code answer fails at one line but you assign the same negative reward to every token, you’re throwing away a lot of structure.
|
| 106 |
-
|
| 107 |
-
It is still useful because it can optimize behaviors where exact supervised targets are unavailable, too expensive, or too limiting. If you can verify success but cannot easily author perfect demonstrations for every scenario, RL can still improve the model by repeated interaction. This is why RL is especially attractive for code execution, tool use, games, browser tasks, and agent workflows.
|
| 108 |
-
|
| 109 |
-
A practical takeaway: use RL where verifiers exist and where exploration is worth the extra compute.
|
| 110 |
-
|
| 111 |
-
## **11\) What is process supervision, and why is it important?**
|
| 112 |
-
|
| 113 |
-
Process supervision means giving feedback on intermediate reasoning or intermediate steps, not only on the final outcome. The session contrasted this with assigning the same reward to every token in the answer, which can be very wasteful. Under process supervision, you try to identify which parts of a trace were good, irrelevant, or harmful.
|
| 114 |
-
|
| 115 |
-
This matters because not all failures are equal. Maybe the model chose the right algorithmic approach but made one implementation mistake. Final-outcome-only rewards blur that distinction. Step-aware rewards can improve sample efficiency and make debugging easier, though they also raise new risks if the step labels are noisy or exploitable.
|
| 116 |
-
|
| 117 |
-
The session also noted that process supervision is often approximated with humans or LLM-as-a-judge. That can help, but it creates another optimization target that itself may be gamed.
|
| 118 |
-
|
| 119 |
-
## **12\) What is reward hacking?**
|
| 120 |
-
|
| 121 |
-
Reward hacking is when the model finds a way to maximize reward without genuinely doing the intended task. In other words, the optimization succeeds, but the task specification failed. The session gave intuitive examples such as editing variables, bypassing intended checks, or exploiting quirks in the environment rather than solving the real problem.
|
| 122 |
-
|
| 123 |
-
This is the same phenomenon often called specification gaming. DeepMind describes it as agents exploiting flaws or ambiguities in the reward function, and Lilian Weng’s overview covers how common and fundamental this problem is in RL systems. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
|
| 124 |
-
|
| 125 |
-
A useful mindset is: reward hacking is not proof the model is “evil”; it is proof that optimization pressure found a loophole.
|
| 126 |
-
|
| 127 |
-
## **13\) How can a hackathon team reduce reward hacking in practice?**
|
| 128 |
-
|
| 129 |
-
Use strong verifiers. Prefer executable checks over stylistic heuristics. For code, run tests, time the solution, validate output shapes and edge cases, and isolate execution. For tool use, verify actual state transitions, not just verbal claims. The session repeatedly emphasized verifiers and environments over vague reward signals.
|
| 130 |
-
|
| 131 |
-
Monitor training actively. The session recommended sampling outputs periodically, looking for suspicious patterns, and terminating or rolling back runs when drift appears. It also suggested filtering bad responses and adding guardrails when patterns of exploitation are observed.
|
| 132 |
-
|
| 133 |
-
Use layered rewards. Combine success criteria with anti-cheat constraints; a reward-function sketch follows the list. For example:
|
| 134 |
-
|
| 135 |
-
* pass tests,
|
| 136 |
-
* do not edit protected files,
|
| 137 |
-
* do not bypass timers,
|
| 138 |
-
* stay within time and memory budget,
|
| 139 |
-
* preserve task-required formatting,
|
| 140 |
-
* and log intermediate actions for audit.
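
Here is a minimal sketch of what such a layered reward could look like in code. The names (`RolloutResult`, its fields, the specific penalties) are hypothetical and not tied to any particular framework:

```python
# Hypothetical sketch of a layered reward: hard verifier outcome first,
# then anti-cheat constraints. Field names and weights are illustrative only.
from dataclasses import dataclass

@dataclass
class RolloutResult:
    tests_passed: bool            # output of an executable verifier
    edited_protected_files: bool  # anti-cheat: touched files it should not
    bypassed_timer: bool          # anti-cheat: disabled or skipped timing checks
    wall_time_s: float
    format_ok: bool               # task-required formatting preserved

def layered_reward(r: RolloutResult, time_budget_s: float = 30.0) -> float:
    # Disqualifying violations end the evaluation immediately.
    if r.edited_protected_files or r.bypassed_timer:
        return -1.0
    # The hard success criterion comes from the verifier, not from heuristics.
    reward = 1.0 if r.tests_passed else 0.0
    # Soft penalties for budget and format violations.
    if r.wall_time_s > time_budget_s:
        reward -= 0.5
    if not r.format_ok:
        reward -= 0.1
    return reward
```

Logging each component separately, rather than only the combined scalar, makes it much easier to see which constraint the policy is pushing against.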
|
| 141 |
-
|
| 142 |
-
This general strategy aligns with broader RL safety guidance on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
|
| 143 |
-
|
| 144 |
-
## **14\) What is curriculum learning, and why does it help RL?**
|
| 145 |
-
|
| 146 |
-
Curriculum learning means controlling the order or difficulty of training tasks so the model learns from easier tasks first and gradually moves to harder ones. The session directly recommended this for RL: if tasks are too hard at the start, the model may never produce a successful rollout, which means the reward signal is effectively zero and learning stalls.
|
| 147 |
-
|
| 148 |
-
This is especially important in LLM RL because many tasks are long-horizon and brittle. An easier initial distribution can bootstrap behavior, after which harder tasks become reachable. In the RL literature more broadly, curriculum learning is a standard way to improve exploration and sample efficiency in difficult environments. ([arXiv](https://arxiv.org/pdf/2504.06618?utm_source=chatgpt.com))
|
| 149 |
-
|
| 150 |
-
Practical idea for hackathons (a curriculum sketch follows the list):
|
| 151 |
-
|
| 152 |
-
* start with short horizons,
|
| 153 |
-
* fewer tools,
|
| 154 |
-
* simpler state spaces,
|
| 155 |
-
* stronger hints,
|
| 156 |
-
* easier test cases,
|
| 157 |
-
* then gradually remove scaffolding.
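
One hedged way to schedule that scaffolding removal is to promote to the next difficulty tier once a rolling success rate clears a threshold. The tier definitions and thresholds below are invented for the example:

```python
# Success-rate-driven curriculum sketch; levels and thresholds are made up.
from collections import deque

LEVELS = [
    {"name": "easy",   "max_steps": 5,  "hints": True},
    {"name": "medium", "max_steps": 10, "hints": True},
    {"name": "hard",   "max_steps": 20, "hints": False},
]

class Curriculum:
    def __init__(self, promote_at: float = 0.7, window: int = 50):
        self.level = 0
        self.promote_at = promote_at
        self.recent = deque(maxlen=window)   # rolling window of episode outcomes

    def current(self) -> dict:
        return LEVELS[self.level]

    def record(self, success: bool) -> None:
        self.recent.append(success)
        pass_rate = sum(self.recent) / len(self.recent)
        if pass_rate >= self.promote_at and self.level < len(LEVELS) - 1:
            self.level += 1
            self.recent.clear()   # re-measure on the harder distribution
```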
|
| 158 |
-
|
| 159 |
-
## **15\) How do I know whether a task is suitable for RL?**
|
| 160 |
-
|
| 161 |
-
A task is a good candidate for RL if:
|
| 162 |
-
|
| 163 |
-
* you can verify success or partial progress,
|
| 164 |
-
* exploration is meaningful,
|
| 165 |
-
* multi-step interaction matters,
|
| 166 |
-
* and you do not already have abundant high-quality demonstrations.
|
| 167 |
-
|
| 168 |
-
The session highlighted a key rule of thumb: the probability of a good answer must be greater than zero. If the task is so hard that the model never stumbles into any rewarding behavior, RL will waste compute. That means task selection, warm starts, formatting scaffolds, or light SFT can be essential.
|
| 169 |
-
|
| 170 |
-
Good hackathon candidates include:
|
| 171 |
-
|
| 172 |
-
* code generation with executable tests,
|
| 173 |
-
* browser navigation with page-state checks,
|
| 174 |
-
* games with clear win conditions,
|
| 175 |
-
* API/tool workflows with verifiable side effects.
|
| 176 |
-
|
| 177 |
-
## **16\) Should we jump straight into RL, or do some SFT first?**
|
| 178 |
-
|
| 179 |
-
Usually, do some SFT or at least a warm start first. The session’s guidance was that pretraining carries most of the capability burden, SFT helps shape the behavior, and RL refines it. It explicitly argued against relying on RL alone from scratch for most practical settings.
|
| 180 |
-
|
| 181 |
-
That matches modern post-training stacks: pretrain heavily, align or instruct-tune, then apply preference optimization and/or RL where it adds value. TRL’s supported workflows reflect exactly this broader stack. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
|
| 182 |
-
|
| 183 |
-
A hackathon-friendly recipe (a minimal training sketch follows the list) is:
|
| 184 |
-
|
| 185 |
-
1. Start from a solid instruct model.
|
| 186 |
-
2. Add a tiny amount of task-format SFT if needed.
|
| 187 |
-
3. Build a strong verifier.
|
| 188 |
-
4. Use GRPO/PPO-style RL only after the model can at least occasionally succeed.
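
A minimal sketch of step 4 using TRL's `GRPOTrainer`. The argument names follow TRL's documented GRPO quickstart but may differ across versions, and the model, dataset, and reward here are toy placeholders for a smoke test, not a real training setup:

```python
# Hedged GRPO smoke test with TRL; model name, dataset, and reward are toys.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def exact_answer_reward(completions, answer, **kwargs):
    # Verifiable reward: 1.0 if the expected answer string appears in the completion.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

train_dataset = Dataset.from_list(
    [{"prompt": "What is 17 + 25? Reply with just the number.", "answer": "42"}] * 64
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # any small instruct model works for debugging
    reward_funcs=exact_answer_reward,
    args=GRPOConfig(output_dir="grpo-smoke-test", num_generations=4, max_steps=10),
    train_dataset=train_dataset,
)
trainer.train()
```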
|
| 189 |
-
|
| 190 |
-
## **17\) What should we actually monitor during RL training?**
|
| 191 |
-
|
| 192 |
-
Monitor more than the headline reward. The session specifically called out tracking reward trends, component rewards, and whether important success columns are improving over time. It also recommended checking generated strategies and periodically sampling outputs during training rather than letting runs continue blindly.
|
| 193 |
-
|
| 194 |
-
Useful metrics include (a logging sketch follows the list):
|
| 195 |
-
|
| 196 |
-
* average reward,
|
| 197 |
-
* verifier pass rate,
|
| 198 |
-
* timeout rate,
|
| 199 |
-
* format adherence,
|
| 200 |
-
* rollout length,
|
| 201 |
-
* diversity of successful solutions,
|
| 202 |
-
* frequency of suspicious shortcuts,
|
| 203 |
-
* and cost per useful trajectory.
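
One lightweight way to track these is to compute a dictionary of component metrics per batch of rollouts and send it to whatever logger you already use. The rollout attributes below are assumptions about what your environment records:

```python
# Sketch of per-batch metric aggregation; rollout attributes are hypothetical.
def rollout_metrics(rollouts) -> dict:
    n = max(len(rollouts), 1)
    return {
        "reward/mean":          sum(r.reward for r in rollouts) / n,
        "verifier/pass_rate":   sum(r.tests_passed for r in rollouts) / n,
        "rollout/timeout_rate": sum(r.timed_out for r in rollouts) / n,
        "rollout/mean_length":  sum(len(r.actions) for r in rollouts) / n,
        "format/adherence":     sum(r.format_ok for r in rollouts) / n,
    }

# Example: metrics = rollout_metrics(batch); wandb.log(metrics)  # if using W&B
```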
|
| 204 |
-
|
| 205 |
-
If the average reward rises but the actual task quality drops or becomes brittle, that is often a reward-design problem rather than a model-capability problem.
|
| 206 |
-
|
| 207 |
-
## **18\) What is a strong hackathon strategy for building an RL environment fast?**
|
| 208 |
-
|
| 209 |
-
Pick a task with a crisp verifier. Build the smallest environment that exposes reset, step, observations, and reward. Use OpenEnv to standardize the interface and TRL to handle training. Use Unsloth if you need to fit training into tighter hardware budgets. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))
|
| 210 |
-
|
| 211 |
-
A practical sequence (a minimal environment sketch follows the list):
|
| 212 |
-
|
| 213 |
-
1. Define the task and what “success” means.
|
| 214 |
-
2. Write the verifier before writing the policy loop.
|
| 215 |
-
3. Create a few toy tasks the model can solve.
|
| 216 |
-
4. Add curriculum or easier variants first.
|
| 217 |
-
5. Run small-scale debugging before long training.
|
| 218 |
-
6. Sample outputs constantly for reward hacking.
|
| 219 |
-
7. Only then scale rollouts and environment diversity.
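
To make step 2 concrete, here is a deliberately tiny environment with the reset/step/observation/reward surface described above. It is a generic illustration, not the actual OpenEnv base classes or HTTP interface:

```python
# Toy environment sketch: crisp verifier, reset/step surface, single-turn episodes.
import random
from dataclasses import dataclass

@dataclass
class Observation:
    prompt: str
    reward: float = 0.0
    done: bool = False

class ToyArithmeticEnv:
    """Ask for a sum; reward 1.0 only for the exact integer answer."""

    def reset(self) -> Observation:
        self.a, self.b = random.randint(0, 99), random.randint(0, 99)
        return Observation(prompt=f"What is {self.a} + {self.b}? Reply with just the number.")

    def step(self, action: str) -> Observation:
        try:
            correct = int(action.strip()) == self.a + self.b
        except ValueError:
            correct = False           # unparseable output earns nothing
        return Observation(prompt="", reward=1.0 if correct else 0.0, done=True)
```

Writing the verifier (the `step` logic) before any policy code means a scripted agent can exercise the environment immediately.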
|
| 220 |
-
|
| 221 |
-
## **19\) What are good starter resources for participants?**
|
| 222 |
-
|
| 223 |
-
For TRL:
|
| 224 |
-
|
| 225 |
-
* Main docs. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
|
| 226 |
-
* PPO trainer docs. ([Hugging Face](https://huggingface.co/docs/trl/ppo_trainer?utm_source=chatgpt.com))
|
| 227 |
-
* GRPO cookbook. ([Hugging Face](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl?utm_source=chatgpt.com))
|
| 228 |
-
* Paper index for GRPO/DeepSeekMath references. ([Hugging Face](https://huggingface.co/docs/trl/paper_index?utm_source=chatgpt.com))
|
| 229 |
-
|
| 230 |
-
For OpenEnv:
|
| 231 |
-
|
| 232 |
-
* OpenEnv GitHub repo. ([GitHub](https://github.com/meta-pytorch/OpenEnv?utm_source=chatgpt.com))
|
| 233 |
-
* OpenEnv course. ([GitHub](https://github.com/huggingface/openenv-course?utm_source=chatgpt.com))
|
| 234 |
-
* TRL’s OpenEnv integration docs. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))
|
| 235 |
-
|
| 236 |
-
For environments and benchmarks:
|
| 237 |
-
|
| 238 |
-
* BrowserGym. ([GitHub](https://github.com/servicenow/browsergym?utm_source=chatgpt.com))
|
| 239 |
-
|
| 240 |
-
For reward design and failure modes:
|
| 241 |
-
|
| 242 |
-
* DeepMind on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
|
| 243 |
-
* Lilian Weng on reward hacking. ([Lil'Log](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/?utm_source=chatgpt.com))
|
| 244 |
-
|
| 245 |
-
For RL algorithms:
|
| 246 |
-
|
| 247 |
-
* PPO paper. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
|
| 248 |
-
* DeepSeekMath / GRPO paper. ([arXiv](https://arxiv.org/abs/2402.03300?utm_source=chatgpt.com))
|
| 249 |
-
|
| 250 |
-
For Unsloth:
|
| 251 |
-
|
| 252 |
-
* Unsloth repo/readme. ([GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1&utm_source=chatgpt.com))
|
| 253 |
-
|
| 254 |
-
## **20\) What is the one-sentence summary participants should remember?**
|
| 255 |
-
|
| 256 |
-
If you can build a task where success is verifiable, difficulty is controllable, and loopholes are monitored, RL can turn an LLM from “good at answering” into “better at acting.”
|
| 257 |
-
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
### **21\) What is RLVR?**
|
| 261 |
-
|
| 262 |
-
RLVR stands for reinforcement learning with verifiable rewards. Instead of relying only on a learned reward model or human preference model, the training loop uses programmatic checks to determine whether an output is correct. Typical examples include exact-answer checks for math, unit tests for code, schema validation for structured output, or environment-based task completion checks. This makes RLVR especially attractive for domains where correctness can be verified automatically and consistently. ([Label Studio](https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/))
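
For instance, a schema-validation reward for structured output can be a few lines of standard-library code. The expected fields below are illustrative, not from any specific task:

```python
# Verifiable reward for structured output: reward only if the JSON parses
# and matches a simple expected schema. Field names are illustrative.
import json

def schema_reward(completion: str) -> float:
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    valid = (
        isinstance(obj, dict)
        and isinstance(obj.get("title"), str)
        and isinstance(obj.get("priority"), int)
        and 1 <= obj["priority"] <= 5
    )
    return 1.0 if valid else 0.0
```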
|
| 263 |
-
|
| 264 |
-
### **22\) What is RLVE?**
|
| 265 |
-
|
| 266 |
-
RLVE is reinforcement learning with verifiable environments. The key idea is to train on environments that can procedurally generate tasks, expose adjustable difficulty, and provide algorithmically verifiable rewards. Recent work on adaptive verifiable environments argues that static prompt datasets often become either too easy or too hard during training, causing learning to stall, while adaptive environments keep the model near its capability frontier. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 267 |
-
|
| 268 |
-
### **23\) How is RLVE different from RLVR?**
|
| 269 |
-
|
| 270 |
-
RLVR usually refers to verifiable rewards on a fixed or semi-fixed set of prompts or problems. RLVE goes a step further by making the task source itself dynamic: the environment can generate new problems, vary difficulty, and keep serving appropriately challenging tasks as the model improves. In practice, RLVE is often better for preventing saturation on static datasets and for building curriculum naturally into training. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 271 |
-
|
| 272 |
-
### **24\) Why are RL environments useful for LLM post-training?**
|
| 273 |
-
|
| 274 |
-
They let the model interact, not just answer. In a real environment, the model can act, observe consequences, act again, and get reward from actual task outcomes. That makes environments a better fit for tool use, browsers, APIs, coding agents, games, and long-horizon tasks than plain prompt-response datasets. Hugging Face’s OpenEnv and TRL material reflects this shift toward environment-based agent training. ([Hugging Face](https://huggingface.co/blog/openenv-turing))
|
| 275 |
-
|
| 276 |
-
### **25\) Where do TRL, GRPO, and Unsloth fit in?**
|
| 277 |
-
|
| 278 |
-
TRL is the training framework that provides RL trainers and infrastructure for post-training transformer models, including GRPO. GRPO is the RL optimization method popularized in DeepSeekMath and now widely used in open LLM RL pipelines because it can be more memory-efficient than PPO-style setups in this context. Unsloth is typically used as the efficiency layer to make fine-tuning and RL training faster and more affordable on limited hardware. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))
|
| 279 |
-
|
| 280 |
-
### **26\) Why do rewards matter so much?**
|
| 281 |
-
|
| 282 |
-
Because the reward is the task definition as far as optimization is concerned. If your reward captures the real objective, RL can improve useful behavior. If your reward is incomplete, noisy, or hackable, the model will optimize the proxy instead of the real task. DeepMind’s write-up on specification gaming makes this point very clearly: the agent’s ingenuity is helpful only when the specification is correct. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))
|
| 283 |
-
|
| 284 |
-
### **27\) What is reward engineering?**
|
| 285 |
-
|
| 286 |
-
Reward engineering is the design of the reward function, the verifier, the shaping terms, the penalties, and the monitoring strategy. In LLM RL, this includes deciding what counts as success, how partial progress is rewarded, what shortcuts are forbidden, and how to detect reward hacking. OpenEnv’s reward-design guide explicitly warns about reward hacking, sparse rewards, and conflicting signals as common pitfalls. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 287 |
-
|
| 288 |
-
### **28\) What is reward hacking?**
|
| 289 |
-
|
| 290 |
-
Reward hacking happens when a model finds a way to maximize the reward without actually doing the intended task. DeepMind describes this as specification gaming: the system satisfies the literal reward but not the real goal. Classic causes include poorly designed shaping rewards, missing constraints in the success condition, and simulator or verifier loopholes. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))
|
| 291 |
-
|
| 292 |
-
### **29\) Why is sparse reward a common problem?**
|
| 293 |
-
|
| 294 |
-
If successful trajectories are too rare, the model may never get enough positive signal to improve. OpenEnv’s docs explicitly call sparse rewards a common pitfall because the agent may never find positive signal. RLVE work similarly notes that overly difficult tasks can yield consistently poor rewards and stall gradient-based learning. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 295 |
-
|
| 296 |
-
### **30\) Why can dense rewards also be dangerous?**
|
| 297 |
-
|
| 298 |
-
Dense rewards can speed up learning, but they can also create local optima and incentive misalignment. OpenEnv recommends starting simple and shaping carefully, because intermediate rewards can steer the model toward proxy behaviors. DeepMind gives the broader warning that poorly designed shaping can change the optimal policy itself rather than just helping the model reach the intended outcome faster. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 299 |
-
|
| 300 |
-
---
|
| 301 |
-
|
| 302 |
-
## **Common Pitfalls in Building RL Environments**
|
| 303 |
-
|
| 304 |
-
### **31\) What is the most common mistake when designing an RL environment?**
|
| 305 |
-
|
| 306 |
-
Making the environment easy to verify but not faithful to the real task. A verifier that checks only the final string, a regex, or a narrow success pattern may be convenient, but it often misses equivalent correct answers or allows degenerate shortcuts. Recent verifier analysis on mathematical RL found that rule-based verifiers often reject correct but differently formatted answers, while model-based verifiers can be exploited to produce false positives during RL. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 307 |
-
|
| 308 |
-
### **32\) What goes wrong with weak verifiers?**
|
| 309 |
-
|
| 310 |
-
Two opposite failure modes are common. Rule-based verifiers can be too brittle and produce false negatives when the answer is correct but phrased differently. Model-based verifiers can be too permissive and produce false positives that the policy learns to exploit. The verifier study on mathematical reasoning reports both problems and shows that stronger policies make verifier weaknesses more obvious. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 311 |
-
|
| 312 |
-
### **33\) Why is “just use an LLM as judge” often risky?**
|
| 313 |
-
|
| 314 |
-
Because the judge becomes part of the optimization target. If the policy can find surface patterns that fool the judge, training can inflate reward without improving real task quality. That is exactly why model-based verifiers, despite better static accuracy, can be vulnerable during RL training. Use them carefully, stress-test them, and combine them with hard checks whenever possible. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 315 |
-
|
| 316 |
-
### **34\) What is a common environment-design pitfall for tool-using agents?**
|
| 317 |
-
|
| 318 |
-
Not modeling realistic failure modes. Real APIs fail because of permissions, invalid formats, missing fields, timezones, or bad parameters. Hugging Face’s OpenEnv blog highlights examples like missing OAuth scopes and bad RFC3339 datetime formatting. If the environment hides these realities, the resulting policy will be overfit to a toy setup and brittle in deployment. ([Hugging Face](https://huggingface.co/blog/openenv-turing))
|
| 319 |
-
|
| 320 |
-
### **35\) Why is static task difficulty a problem?**
|
| 321 |
-
|
| 322 |
-
Because the learning signal collapses at both extremes. Tasks that are too easy stop teaching the model anything useful. Tasks that are too hard yield near-zero reward and also stop teaching. RLVE was proposed largely to solve this problem by dynamically adjusting task difficulty as the policy improves. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 323 |
-
|
| 324 |
-
### **36\) What is a common pitfall in environment diversity?**
|
| 325 |
-
|
| 326 |
-
Training on too few task types. Recent RLVE results argue that scaling the number of environments improves generalizable reasoning capability, and Reasoning Gym was built around procedurally generated tasks across many domains for exactly this reason. A narrow environment set often produces narrow competence and fragile transfer. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 327 |
-
|
| 328 |
-
### **37\) Why do many RL environments fail to transfer to real-world performance?**
|
| 329 |
-
|
| 330 |
-
Because they optimize the wrong abstraction level. If the environment is too toy-like, omits realistic constraints, or over-simplifies tool feedback, the model may become good at the benchmark but not at the actual workflow. This is a practical version of specification gaming: the benchmark is solved, the real job is not. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))
|
| 331 |
-
|
| 332 |
-
---
|
| 333 |
-
|
| 334 |
-
## **Common Pitfalls in Reward Engineering**
|
| 335 |
-
|
| 336 |
-
### **38\) What is the biggest reward-engineering mistake?**
|
| 337 |
-
|
| 338 |
-
Using a proxy metric as if it were the goal. Goodhart-style failures are everywhere in RL: token count, response format, test count, or intermediate progress can all become targets the model exploits. DeepMind’s examples of shaping mistakes and reward misspecification are the canonical warning here. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))
|
| 339 |
-
|
| 340 |
-
### **39\) Should I start with a complicated reward function?**
|
| 341 |
-
|
| 342 |
-
Usually no. OpenEnv explicitly recommends starting simple, often with sparse success/failure reward, before layering in shaping terms. This makes debugging easier and reduces the chance that the model learns the wrong intermediate incentives before it learns the actual task. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 343 |
-
|
| 344 |
-
### **40\) What happens when reward components conflict?**
|
| 345 |
-
|
| 346 |
-
Learning becomes unstable or confused. OpenEnv lists conflicting signals as a common pitfall: if one term rewards brevity, another rewards verbosity, a third rewards format, and a fourth rewards exploration, the policy may oscillate or learn brittle shortcuts instead of coherent behavior. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 347 |
-
|
| 348 |
-
### **41\) Why is binary reward often appealing?**
|
| 349 |
-
|
| 350 |
-
Because it is easy to reason about and harder to game superficially. Label Studio’s RLVR overview notes that verifiable rewards are often binary and directly tied to correctness criteria, which makes evaluation simple and scalable. Binary reward is not always sufficient, but it is often a good starting point for precision-critical tasks like code and math. ([Label Studio](https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/))
|
| 351 |
-
|
| 352 |
-
### **42\) Why is binary reward sometimes not enough?**
|
| 353 |
-
|
| 354 |
-
Because it can be too sparse, especially for long-horizon tasks. If success only happens at the very end, the model may not learn at all. That is where carefully designed shaping, step-level evaluation, or adaptive curriculum can help — but only if you can add them without creating easy-to-game shortcuts. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 355 |
-
|
| 356 |
-
### **43\) How do I know whether my reward is being hacked?**
|
| 357 |
-
|
| 358 |
-
Watch for rising reward without corresponding task-quality gains. Typical signs are strange formatting habits, repetitive surface patterns, degenerate short solutions, suspiciously high judge scores, or solutions that pass weak checks but fail stronger ones. The verifier case study is a strong reminder that static verification accuracy is not enough; you must observe what happens under optimization pressure. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 359 |
-
|
| 360 |
-
### **44\) What is a safe pattern for reward engineering?**
|
| 361 |
-
|
| 362 |
-
Use layered verification. Start with hard outcome checks. Add anti-cheat constraints. Then add minimal shaping only where the sparse reward is too weak. Keep a holdout evaluator separate from the training reward when possible. This matches both OpenEnv’s “start simple, shape carefully” guidance and DeepMind’s warning about shaping altering the true objective. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 363 |
-
|
| 364 |
-
---
|
| 365 |
-
|
| 366 |
-
## **Common Pitfalls in RL Post-Training Pipelines with RLVR / RLVE / GRPO**
|
| 367 |
-
|
| 368 |
-
### **45\) What is a common mistake in GRPO training runs?**
|
| 369 |
-
|
| 370 |
-
Using RL before the base model is ready. GRPO is powerful, but it is a post-training method, not a substitute for capability. TRL’s own GRPO examples start from instruct models and task datasets rather than from weak base checkpoints. If the model almost never produces a correct rollout, the reward signal is too sparse for productive RL. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))
|
| 371 |
-
|
| 372 |
-
### **46\) Why does RL post-training plateau?**
|
| 373 |
-
|
| 374 |
-
Because the model saturates the available prompt distribution or the reward signal no longer differentiates useful improvements. RLVE explicitly frames static data saturation as a problem and shows that adaptive environments can keep learning going after conventional RLVR pipelines flatten out. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 375 |
-
|
| 376 |
-
### **47\) Why can “more RL” make a model worse?**
|
| 377 |
-
|
| 378 |
-
Because optimization pressure amplifies whatever the reward favors, including undesirable shortcuts. If the verifier is noisy, if the environment is unrealistic, or if the reward overvalues superficial structure, more training can push the model deeper into those artifacts rather than improving real competence. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 379 |
-
|
| 380 |
-
### **48\) What is a common pitfall in RLVR datasets?**
|
| 381 |
-
|
| 382 |
-
Finite, static datasets get stale. Once the model has mastered or overfit their distribution, additional RL yields little signal. RLVE work argues that procedurally generated environments with adjustable difficulty are one way around this limitation. Reasoning Gym makes a similar case for unlimited data generation with controllable complexity. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 383 |
-
|
| 384 |
-
### **49\) Why do identical-looking GRPO runs produce different outcomes?**
|
| 385 |
-
|
| 386 |
-
Because RL is highly sensitive to rollout quality, verifier behavior, reward scaling, task mix, generation parameters, and environment bugs. Even if the trainer code is the same, small differences in reward computation or environment behavior can change optimization dynamics substantially. The verifier study is a good reminder that the reward pipeline itself is part of the model. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 387 |
-
|
| 388 |
-
### **50\) What is a common pitfall when mixing many environments?**
|
| 389 |
-
|
| 390 |
-
Using an unbalanced mixture. If some environments are much easier, much denser in reward, or much shorter in trajectory length, they can dominate training and starve harder but more important environments. RLVE’s adaptive-difficulty framing exists partly to keep the training distribution informative instead of letting it collapse into easy tasks. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 391 |
-
|
| 392 |
-
### **51\) Why are long-horizon tasks especially hard in RL post-training?**
|
| 393 |
-
|
| 394 |
-
Because reward arrives late and useful trajectories are rare. Long tasks need either decomposition, better intermediate signals, stronger initialization, or curriculum. Otherwise, the rollout cost is high and the success rate stays near zero. This is one reason why adaptive environments and procedural curricula are getting attention. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 395 |
-
|
| 396 |
-
### **52\) What monitoring mistake do teams make most often?**
|
| 397 |
-
|
| 398 |
-
They monitor the training reward but not actual behavior. Reward alone is not enough because the reward channel can be flawed. You need sampled rollout audits, stronger offline evaluation, and held-out environments or benchmarks. The verifier case study shows why this matters: reward can rise while real quality does not. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 399 |
-
|
| 400 |
-
### **53\) What is the safest way to structure an RL post-training pipeline?**
|
| 401 |
-
|
| 402 |
-
A good pattern is:
|
| 403 |
-
start from a strong instruct or SFT checkpoint, use a task with a strong verifier, begin with simple reward, validate the environment thoroughly, run small-scale debug experiments, audit rollouts manually, then scale training and only later add curriculum or more shaping. This is consistent with TRL’s practical GRPO examples, OpenEnv’s reward guidance, and the lessons from verifier-failure studies. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))
|
| 404 |
-
|
| 405 |
-
---
|
| 406 |
-
|
| 407 |
-
## **Practical “What should we do in a hackathon?” FAQs**
|
| 408 |
-
|
| 409 |
-
### **54\) What kind of project is most likely to succeed in a hackathon?**
|
| 410 |
-
|
| 411 |
-
Pick a task with:
|
| 412 |
-
a clear success condition,
|
| 413 |
-
a verifier you trust,
|
| 414 |
-
short to medium trajectory length,
|
| 415 |
-
few external dependencies,
|
| 416 |
-
and adjustable difficulty.
|
| 417 |
-
|
| 418 |
-
Good examples are code repair with tests, structured extraction with schema validation, grid or puzzle games, tool-using workflows with exact state checks, and browser tasks with explicit completion criteria. These are the sweet spot for RLVR and lightweight RLVE prototypes. ([Label Studio](https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/))
|
| 419 |
-
|
| 420 |
-
### **55\) What should we avoid building?**
|
| 421 |
-
|
| 422 |
-
Avoid tasks that are subjective, hard to verify, require massive infrastructure, or depend heavily on an LLM judge without hard backstops. Also avoid environments whose failure cases you do not understand. If you cannot explain how the reward could be hacked, you are not ready to optimize it yet. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 423 |
-
|
| 424 |
-
### **56\) What is the best debugging order?**
|
| 425 |
-
|
| 426 |
-
First debug the environment manually.
|
| 427 |
-
Then debug the verifier.
|
| 428 |
-
Then run scripted baseline policies.
|
| 429 |
-
Then run a frozen model.
|
| 430 |
-
Then run a tiny RL experiment.
|
| 431 |
-
Only then scale.
|
| 432 |
-
|
| 433 |
-
This order isolates bugs early and prevents you from blaming the optimizer for what is really an environment or reward bug. It follows directly from the fact that verifier reliability is foundational in RLVR. ([arXiv](https://arxiv.org/html/2505.22203v1))
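
A small harness like the following lets you run the scripted-baseline and frozen-model stages before any RL code exists. It reuses the kind of reset/step environment sketched in question 18, and the names are illustrative:

```python
# Debug harness sketch: run any callable policy through the environment
# and report mean reward. Works for scripted baselines and frozen models alike.
def evaluate_policy(env, policy, episodes: int = 20) -> float:
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        while not obs.done:
            obs = env.step(policy(obs.prompt))
        total += obs.reward
    return total / episodes

# Scripted baseline: always answers "0" -- useful only for exercising the plumbing.
always_zero = lambda prompt: "0"
# Frozen-model baseline (hypothetical call): lambda prompt: llm.generate(prompt)
```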
|
| 434 |
-
|
| 435 |
-
### **57\) What is one rule the team should remember?**
|
| 436 |
-
|
| 437 |
-
Do not optimize a reward you have not tried to break yourself first. The easiest way to avoid reward hacking is to adversarially test your environment and reward design before the model does. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))
|
| 438 |
-
|
| 439 |
-
---
|
| 440 |
-
|
| 441 |
-
## **58\) Strong references for deeper learning**
|
| 442 |
-
|
| 443 |
-
For GRPO and TRL:
|
| 444 |
-
|
| 445 |
-
* TRL GRPO Trainer docs. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))
|
| 446 |
-
* Hugging Face GRPO cookbook. ([Hugging Face](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl))
|
| 447 |
-
|
| 448 |
-
For RL environments and reward design:
|
| 449 |
-
|
| 450 |
-
* OpenEnv reward design guide. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
|
| 451 |
-
* OpenEnv tool-using environment examples. ([Hugging Face](https://huggingface.co/blog/openenv-turing))
|
| 452 |
-
|
| 453 |
-
For pitfalls and failure modes:
|
| 454 |
-
|
| 455 |
-
* DeepMind on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))
|
| 456 |
-
* Pitfalls of rule-based and model-based verifiers. ([arXiv](https://arxiv.org/html/2505.22203v1))
|
| 457 |
-
|
| 458 |
-
For scalable environment-based training:
|
| 459 |
-
|
| 460 |
-
* RLVE paper on adaptive verifiable environments. ([arXiv](https://arxiv.org/html/2511.07317v1))
|
| 461 |
-
* Reasoning Gym. ([OpenReview](https://openreview.net/forum?id=GqYSunGmp7&referrer=%5Bthe+profile+of+Oliver+Stanley%5D%28%2Fprofile%3Fid%3D~Oliver_Stanley1%29))
|
| 462 |
-
|
| 463 |
-
Here are solid Unsloth RL post-training recipes worth checking out, with a bias toward official or close-to-official examples.
|
| 464 |
-
|
| 465 |
-
### **59\) Core Unsloth GRPO recipes**
|
| 466 |
-
|
| 467 |
-
**Qwen2.5 (3B) GRPO notebook**
|
| 468 |
-
A straightforward starter recipe for GRPO with Unsloth. It covers data prep, training, inference, and saving, so it is a good baseline if you want the least opinionated end-to-end example. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Qwen2.5_%283B%29-GRPO.ipynb?utm_source=chatgpt.com))
|
| 469 |
-
|
| 470 |
-
**Llama 3.1 (8B) GRPO notebook**
|
| 471 |
-
Same general pattern, but on a larger model family. Useful if you want a more realistic “reasoning/capability uplift” recipe without jumping straight to very large models. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Llama3.1_%288B%29-GRPO.ipynb?utm_source=chatgpt.com))
|
| 472 |
-
|
| 473 |
-
**Gemma 3 (1B) GRPO notebook**
|
| 474 |
-
A smaller-scale recipe that is easier to run and debug. Good for iterating on reward functions and rollout settings before spending more compute on larger checkpoints. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Gemma3_%281B%29-GRPO.ipynb?utm_source=chatgpt.com))
|
| 475 |
-
|
| 476 |
-
### **59.1) Advanced Unsloth GRPO recipes**
|
| 477 |
-
|
| 478 |
-
**Advanced Qwen3 (4B) GRPO notebook**
|
| 479 |
-
This is one of the more interesting recipes because it adds more than the bare trainer loop. Unsloth’s June 2025 discussion explicitly calls out: proximity scoring for more nuanced rewards, OpenR1 dataset support, advanced templates, and “prefinetuning to skip GRPO format learning.” That makes it a better recipe when you care about reward shaping and format bootstrapping, not just getting GRPO to run. ([GitHub](https://github.com/unslothai/unsloth/discussions/2810?utm_source=chatgpt.com))
|
| 480 |
-
|
| 481 |
-
**HF LLM Course: Practical Exercise — GRPO with Unsloth**
|
| 482 |
-
Not an Unsloth-maintained notebook repo entry, but it is a structured learning recipe that uses Unsloth specifically to fine-tune a model with GRPO for reasoning. It is a good companion when you want a didactic walkthrough instead of just notebook cells. ([Hugging Face](https://huggingface.co/learn/llm-course/chapter12/6?utm_source=chatgpt.com))
|
| 483 |
-
|
| 484 |
-
### **59.2) Environment / agent-style RL recipes**
|
| 485 |
-
|
| 486 |
-
**GPT-OSS 20B \+ 2048 game RL notebook**
|
| 487 |
-
This is closer to “RL with an environment” than plain static-prompt RLVR. The notebook goal is explicitly to make GPT-OSS play 2048 with reinforcement learning / GRPO, which makes it a useful recipe if you want to move beyond math/code answer verification into interactive environment training. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/gpt_oss_%2820B%29_Reinforcement_Learning_2048_Game_BF16.ipynb?utm_source=chatgpt.com))
|
| 488 |
-
|
| 489 |
-
### **59.3) Broader recipe collection**
|
| 490 |
-
|
| 491 |
-
**Unsloth notebooks repository**
|
| 492 |
-
The main repo currently advertises “250+ Fine-tuning & RL Notebooks,” including GRPO and reinforcement learning notebooks. If you want the widest set of recipes in one place, this is the best starting point. ([GitHub](https://github.com/unslothai/notebooks?utm_source=chatgpt.com))
|
| 493 |
-
|
| 494 |
-
### **59.4) Useful adjacent recipes and examples**
|
| 495 |
-
|
| 496 |
-
**Scheduler GRPO example using Unsloth**
|
| 497 |
-
A community example that trains a scheduling model with GRPO using Unsloth and QLoRA. It is useful because it shows a non-math, non-code structured-output task where rewards are tied to output format and schedule correctness. ([Hugging Face](https://huggingface.co/blog/anakin87/qwen-scheduler-grpo?utm_source=chatgpt.com))
|
| 498 |
-
|
| 499 |
-
**SFT → GRPO pipeline example**
|
| 500 |
-
There is a community “show and tell” example for a full SFT-then-GRPO pipeline. I would treat it as inspiration rather than an official recipe, but it is valuable if your intended workflow is “teach format first, then do RL.” ([GitHub](https://github.com/unslothai/unsloth/discussions/3407?utm_source=chatgpt.com))
|
| 501 |
-
|
| 502 |
-
### **59.5) What these recipes collectively cover**
|
| 503 |
-
|
| 504 |
-
Across these examples, the main recipe patterns are:
|
| 505 |
-
|
| 506 |
-
* plain GRPO on reasoning-style tasks,
|
| 507 |
-
* GRPO with better reward shaping like proximity scoring,
|
| 508 |
-
* pre-SFT or preformatting before RL,
|
| 509 |
-
* QLoRA-based memory-efficient RL fine-tuning,
|
| 510 |
-
* and environment-style RL with game interaction. ([GitHub](https://github.com/unslothai/unsloth/discussions/2810?utm_source=chatgpt.com))
|
| 511 |
-
|
| 512 |
-
### **59.6) Two gaps to keep in mind**
|
| 513 |
-
|
| 514 |
-
One gap is **multi-turn GRPO with stepwise rewards**. There is a feature request asking for reward on each step plus a final reward, which suggests this is not yet a mature first-class recipe in Unsloth. ([GitHub](https://github.com/unslothai/unsloth/issues/3615?utm_source=chatgpt.com))
|
| 515 |
-
|
| 516 |
-
Another gap is **notebook stability across versions/hardware**. Several issue threads mention breakage or edge cases in GRPO notebooks, including fast inference assumptions, VRAM growth, and vision-GRPO issues. That does not make the recipes unusable, but it does mean you should pin versions and test on a small run first. ([GitHub](https://github.com/unslothai/unsloth/issues/2730?utm_source=chatgpt.com))
|
| 517 |
-
|
| 518 |
-
### **59.7) Best recipes by use case**
|
| 519 |
-
|
| 520 |
-
If you want the simplest starting point:
|
| 521 |
-
|
| 522 |
-
* Qwen2.5 (3B) GRPO
|
| 523 |
-
* Gemma 3 (1B) GRPO ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Qwen2.5_%283B%29-GRPO.ipynb?utm_source=chatgpt.com))
|
| 524 |
-
|
| 525 |
-
If you care about reward engineering:
|
| 526 |
-
|
| 527 |
-
* Advanced Qwen3 (4B) GRPO ([GitHub](https://github.com/unslothai/unsloth/discussions/2810?utm_source=chatgpt.com))
|
| 528 |
-
|
| 529 |
-
If you care about environment-style RL:
|
| 530 |
-
|
| 531 |
-
* GPT-OSS 20B 2048 notebook ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/gpt_oss_%2820B%29_Reinforcement_Learning_2048_Game_BF16.ipynb?utm_source=chatgpt.com))
|
| 532 |
-
|
| 533 |
-
If you want the most guided learning path:
|
| 534 |
-
|
| 535 |
-
* HF practical exercise with Unsloth \+ GRPO ([Hugging Face](https://huggingface.co/learn/llm-course/chapter12/6?utm_source=chatgpt.com))
|
| 536 |
-
|
| 537 |
-
These recipes could also be summarized in a curated table with columns for model, task type, reward type, hardware footprint, and what each recipe teaches.
|
| 538 |
-
|
| 539 |
-
## Additional Resources

* OpenEnv Core (an interface library for RL post-training with environments)
* [https://github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
* OpenEnv PyTorch docs
* [https://meta-pytorch.org/OpenEnv/](https://meta-pytorch.org/OpenEnv/)
* Hugging Face OpenEnv Environments Hub
* [https://huggingface.co/openenv](https://huggingface.co/openenv)
* [https://huggingface.co/openenv/spaces](https://huggingface.co/openenv/spaces)
* Tutorials to build, run, and train RL environments and training pipelines
* [https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial](https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial)
* RL training examples: [https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial/examples](https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial/examples)
* RL environment examples: [https://github.com/meta-pytorch/OpenEnv/tree/main/envs](https://github.com/meta-pytorch/OpenEnv/tree/main/envs)
* A few additional YouTube videos on building RL environments:
* [https://www.youtube.com/watch?v=0airz7BhBiA](https://www.youtube.com/watch?v=0airz7BhBiA)
* [https://www.youtube.com/watch?v=ap4q4sAK4OY](https://www.youtube.com/watch?v=ap4q4sAK4OY)
* [https://www.youtube.com/watch?v=Jew4lhAiqnw](https://www.youtube.com/watch?v=Jew4lhAiqnw)
* [https://openenv-india-apr-2026.lovable.app/](https://openenv-india-apr-2026.lovable.app/) **(Recommended: chaptered lectures)**
docs/references/hackathon_checklist.md
DELETED
|
@@ -1,153 +0,0 @@
|
|
| 1 |
-
# Hackathon Checklist — April 25–26, Bangalore
|
| 2 |
-
**Solo participant: Akhil Soni**
|
| 3 |
-
|
| 4 |
-
---
|
| 5 |
-
|
| 6 |
-
## What Judges Want to See
|
| 7 |
-
(From Help Guide + Discord FAQ — judges are Meta/HuggingFace practitioners)
|
| 8 |
-
|
| 9 |
-
1. **Working environment** — reset/step runs cleanly, rewards are sensible
|
| 10 |
-
2. **Multiple independent reward functions** — not a single score
|
| 11 |
-
3. **Evidence the model improved** — reward curve going up, before/after comparison
|
| 12 |
-
4. **Anti-hacking measures** — agent can't exploit the environment
|
| 13 |
-
5. **Reproducible deployment** — HF Space that anyone can hit
|
| 14 |
-
6. **Sharp demo** — baseline attempt → reward output → trained attempt → measurable improvement
|
| 15 |
-
|
| 16 |
-
---
|
| 17 |
-
|
| 18 |
-
## Before the Venue (April 24 — today)
|
| 19 |
-
|
| 20 |
-
### Environment
|
| 21 |
-
- [x] Round 2 hidden variables designed (HV1 Circadian, HV2 Energy Cliff, HV3 Meltdown)
|
| 22 |
-
- [x] Gradio UI running locally (http://localhost:7862)
|
| 23 |
-
- [ ] **Implement HV1, HV2, HV3 in `server/rhythm_environment.py`**
|
| 24 |
-
- [ ] Add `PersonProfile` enum and `task_type` to `models.py`
|
| 25 |
-
- [ ] Verify grader score still works correctly after HV changes
|
| 26 |
-
- [ ] Add anti-hacking guard: cap consecutive breaks, penalize action spam
|
| 27 |
-
|
| 28 |
-
### Reward Functions (multi-layer — for GRPO)
|
| 29 |
-
- [ ] `reward_format_valid` — did the LLM output a parseable action?
|
| 30 |
-
- [ ] `reward_action_legal` — is the chosen action valid given current state?
|
| 31 |
-
- [ ] `reward_env_step` — actual `obs.reward` from `env.step(action)` (a sketch of all three follows)
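
A hedged sketch of these three functions; `parse_action`, `state.legal_actions`, the expected action format, and the return scales are placeholders rather than the repo's actual helpers:

```python
# Sketch of the three layered reward functions above; names are placeholders.
import re

def parse_action(completion: str):
    # Placeholder parser: expects e.g. "ACTION: WORK_ON"; the real one lives in the repo.
    m = re.search(r"ACTION:\s*(\w+)", completion)
    return m.group(1) if m else None

def reward_format_valid(completion: str) -> float:
    return 1.0 if parse_action(completion) is not None else -1.0

def reward_action_legal(completion: str, state) -> float:
    action = parse_action(completion)
    if action is None:
        return 0.0                          # already penalized by the format reward
    return 0.5 if action in state.legal_actions else -0.5

def reward_env_step(completion: str, env) -> float:
    action = parse_action(completion)
    if action is None:
        return 0.0
    return env.step(action).reward          # the environment's own per-step signal
```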
|
| 32 |
-
|
| 33 |
-
### Training Setup
|
| 34 |
-
- [ ] Write `training/dataset.py` — generate episode observation prompts
|
| 35 |
-
- [ ] Write `training/train.py` — GRPO trainer config (use template in `unsloth_grpo_training_template.md`)
|
| 36 |
-
- [ ] Write `training/inference_eval.py` — baseline run + trained run comparison
|
| 37 |
-
|
| 38 |
-
---
|
| 39 |
-
|
| 40 |
-
## Day 1 at Venue (April 25 — morning priority)
|
| 41 |
-
|
| 42 |
-
### 1. Deploy to HF Space FIRST (before training)
|
| 43 |
-
Judges expect a running Space. Do this before anything else.
|
| 44 |
-
|
| 45 |
-
```bash
# Push environment as HF Space
openenv push   # or manual push to HuggingFace
```
|
| 49 |
-
|
| 50 |
-
- [ ] Environment runs on HF Space
|
| 51 |
-
- [ ] `reset()` and `step()` work remotely
|
| 52 |
-
- [ ] Space URL noted and shared with mentors
|
| 53 |
-
|
| 54 |
-
### 2. Verify the RL Loop End-to-End (locally first)
|
| 55 |
-
```
prompt → LLM → action → env.step() → reward → GRPO update
```
|
| 58 |
-
- [ ] Full loop runs without crashing
|
| 59 |
-
- [ ] Reward goes to console/log
|
| 60 |
-
- [ ] At least one successful episode (non-zero reward)
|
| 61 |
-
|
| 62 |
-
### 3. Run Baseline (before training)
|
| 63 |
-
- [ ] Run 10–20 episodes with untrained model
|
| 64 |
-
- [ ] Log average grader score
|
| 65 |
-
- [ ] Save baseline reward curve
|
| 66 |
-
- [ ] Screenshot or record Gradio UI showing baseline behavior
|
| 67 |
-
|
| 68 |
-
---
|
| 69 |
-
|
| 70 |
-
## Day 1 at Venue (afternoon)
|
| 71 |
-
|
| 72 |
-
### 4. Training — Start Small
|
| 73 |
-
- [ ] Train on `easy` scenario first (100–200 steps)
|
| 74 |
-
- [ ] Confirm reward is going up (not flat or crashing)
|
| 75 |
-
- [ ] Check generated actions — look for reward hacking patterns
|
| 76 |
-
- [ ] If reward is flat: simplify prompt, check reward functions individually
|
| 77 |
-
|
| 78 |
-
### 5. Anti-Hacking Checks
|
| 79 |
-
The model may learn to spam TAKE_BREAK (low stress = less penalty).
|
| 80 |
-
Guards already partially in code — verify these work:
|
| 81 |
-
- [ ] `consecutive_breaks > MAX_FREE_BREAKS` → penalty applies
|
| 82 |
-
- [ ] `IDLE_PENALTY` fires when no task is active
|
| 83 |
-
- [ ] Model can't "know" hidden variable thresholds (they're not in obs)
|
| 84 |
-
- [ ] Test with a greedy exploit agent manually
|
| 85 |
-
|
| 86 |
-
---
|
| 87 |
-
|
| 88 |
-
## Day 2 at Venue (April 26)
|
| 89 |
-
|
| 90 |
-
### 6. Full Training Run
|
| 91 |
-
- [ ] Train on `easy` → `medium` → `hard` (curriculum)
|
| 92 |
-
- [ ] 500–1000 total GRPO steps
|
| 93 |
-
- [ ] Monitor: `reward/mean`, `reward/std`, KL divergence, per-reward-function scores
|
| 94 |
-
- [ ] Save checkpoint every 100 steps
|
| 95 |
-
|
| 96 |
-
### 7. Save Model Correctly
|
| 97 |
-
**Warning:** Do NOT upcast 4-bit model to 16-bit and merge LoRA naively — damages quality.
|
| 98 |
-
```python
# Correct save
model.save_pretrained_merged("outputs/rhythmenv_trained", tokenizer, save_method="merged_16bit")
# Or keep adapters separate
model.save_pretrained("outputs/adapters")
tokenizer.save_pretrained("outputs/adapters")
```
|
| 105 |
-
- [ ] Model saved correctly
|
| 106 |
-
- [ ] Post-training inference tested immediately after save
|
| 107 |
-
|
| 108 |
-
### 8. Build the Demo
|
| 109 |
-
Format: **baseline → trained → measurable improvement**
|
| 110 |
-
|
| 111 |
-
```
1. Show baseline: untrained model playing easy scenario → grader score ~0.2
2. Show reward curve: 500 steps, reward trending up
3. Show trained: model playing same scenario → grader score ~0.6+
4. Explain hidden variables: why the model had to discover them
5. Show person profile inference: does the model behave differently for MORNING_PERSON vs NIGHT_OWL?
```
|
| 118 |
-
|
| 119 |
-
- [ ] Gradio UI shows before/after comparison
|
| 120 |
-
- [ ] Reward curve screenshot/chart ready
|
| 121 |
-
- [ ] 3-minute pitch rehearsed (see `docs/round2/pitch_framing.md`)
|
| 122 |
-
|
| 123 |
-
---
|
| 124 |
-
|
| 125 |
-
## Submission Checklist
|
| 126 |
-
|
| 127 |
-
- [ ] HF Space deployed and running
|
| 128 |
-
- [ ] `inference.py` updated for trained model (correct `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`)
|
| 129 |
-
- [ ] README updated with Round 2 description
|
| 130 |
-
- [ ] Reward curves saved as images
|
| 131 |
-
- [ ] Model pushed to HF Hub
|
| 132 |
-
|
| 133 |
-
---
|
| 134 |
-
|
| 135 |
-
## Quick Reference — Key Numbers
|
| 136 |
-
|
| 137 |
-
| Thing | Value |
|---|---|
| Max steps per episode | 20 |
| Scenarios | easy / medium / hard |
| Grader weights | 40% completion, 20% deadline, 15% efficiency, 10% energy, 15% stress |
| GRPO starting lr | 2e-4 |
| GRPO num_generations | 4 (more than 2048 notebook — hidden vars need exploration) |
| GRPO max_steps | 1000 |
| Prize pool | $30,000 (top 15 teams) |
| Evaluation | ~20-30 min per top team by Meta/HF engineers |
|
| 147 |
-
|
| 148 |
-
---
|
| 149 |
-
|
| 150 |
-
## Contacts at Venue
|
| 151 |
-
- Sanyam Bhutani — Partner Engineer, Meta
|
| 152 |
-
- Ben Burtenshaw — Community Education, HuggingFace
|
| 153 |
-
- Adithya S Kolavi — ML Engineer, HuggingFace
docs/{round2/[External] Apr ‘26 OpenEnv Hackathon Themes & Judging Criteria.md → references/judging_criteria.md}
RENAMED
|
File without changes
|
docs/references/reward_engineering_overview.md
DELETED
|
@@ -1,82 +0,0 @@
|
|
| 1 |
-
---
|
| 2 |
-
source: https://arxiv.org/abs/2408.10215
|
| 3 |
-
title: "Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications"
|
| 4 |
-
authors: Sinan Ibrahim, Mostafa Mostafa, Ali Jnadi, Hadi Salloum, Pavel Osinenko
|
| 5 |
-
published: IEEE Access, Vol. 12, 2024
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
# Reward Engineering & Shaping — Overview Paper
|
| 9 |
-
|
| 10 |
-
## What It Covers
|
| 11 |
-
|
| 12 |
-
A survey of 55 papers on reward design challenges in RL. Core problems addressed:
|
| 13 |
-
- Sparse / delayed rewards (most common bottleneck)
|
| 14 |
-
- Reward hacking — agent exploits loopholes instead of solving the task
|
| 15 |
-
- Multi-objective complexity — real tasks have competing objectives
|
| 16 |
-
- Convergence inefficiency without proper guidance
|
| 17 |
-
|
| 18 |
-
---
|
| 19 |
-
|
| 20 |
-
## Key Technique: Potential-Based Reward Shaping (PBRS)
|
| 21 |
-
|
| 22 |
-
The safest reward shaping approach — mathematically guarantees the optimal policy doesn't change:
|
| 23 |
-
|
| 24 |
-
```
R'(s,a,s') = R(s,a,s') + γΦ(s') - Φ(s)
```
|
| 27 |
-
|
| 28 |
-
- `Φ(s)` is a potential function encoding "how good is this state"
|
| 29 |
-
- The agent learns faster without learning a different policy
|
| 30 |
-
- **For RhythmEnv:** Φ(s) could be `progress_toward_deadlines + energy_level` (see the sketch below)
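
A hedged sketch of what that shaping could look like for RhythmEnv; the state fields and weights are assumptions for illustration, not the actual state schema:

```python
# Potential-based reward shaping sketch: R' = R + gamma * phi(s') - phi(s).
# State fields and weights are assumptions for illustration.
GAMMA = 0.99

def potential(state) -> float:
    # "How good is this state": deadline progress plus remaining energy.
    return 0.7 * state.progress_toward_deadlines + 0.3 * state.energy_level

def shaped_reward(base_reward: float, state, next_state) -> float:
    return base_reward + GAMMA * potential(next_state) - potential(state)
```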
|
| 31 |
-
|
| 32 |
-
---
|
| 33 |
-
|
| 34 |
-
## Key Techniques Relevant to RhythmEnv
|
| 35 |
-
|
| 36 |
-
### Handling Sparse Rewards
|
| 37 |
-
- **EXPLORS:** Self-supervised exploration bonuses — fully automated, no manual design
|
| 38 |
-
- **RUNE:** Uses ensemble variance as an exploration bonus (reward uncertainty)
|
| 39 |
-
- **Intrinsic motivation (LIRPG):** Agent learns curiosity-driven rewards alongside task rewards
|
| 40 |
-
|
| 41 |
-
### Preventing Reward Hacking
|
| 42 |
-
- Test with adversarial agents before finalizing reward functions
|
| 43 |
-
- **Difference rewards:** `R'(s,a) = R(s,a) + γ[D(s',r) - D(s,r)]` — incentivizes true contribution, useful if extending to multi-agent
|
| 44 |
-
- Monitor agent trajectories for unintended patterns (e.g., spamming breaks, never switching tasks)
|
| 45 |
-
|
| 46 |
-
### Multi-Objective Reward Design
|
| 47 |
-
- Use **vector rewards** — separate dimensions for progress, stress, energy, deadlines
|
| 48 |
-
- Aggregate with explicit weights (our current design already does this)
|
| 49 |
-
- Ensure reward components don't cancel each other out silently
|
| 50 |
-
|
| 51 |
-
### Dynamic Potential Functions (DPBRS)
|
| 52 |
-
- Time-varying `Φ(s,t)` — potential changes as the episode progresses
|
| 53 |
-
- Relevant for RhythmEnv: deadline proximity should increase the potential for completing near-deadline tasks as time runs out
|
| 54 |
-
|
| 55 |
-
---
|
| 56 |
-
|
| 57 |
-
## Common Pitfalls (Checklist)
|
| 58 |
-
|
| 59 |
-
- [ ] Don't rely on sparse rewards alone — add intermediate shaping
|
| 60 |
-
- [ ] Watch for reward hacking — test with a greedy agent that tries to exploit
|
| 61 |
-
- [ ] Complex reward functions are hard to debug — start simple, add components one at a time
|
| 62 |
-
- [ ] Evaluation metrics must be independent of reward design (our `_grade_episode` grader serves this role)
|
| 63 |
-
- [ ] Domain knowledge is essential but expensive to encode — validate with domain experts
|
| 64 |
-
|
| 65 |
-
---
|
| 66 |
-
|
| 67 |
-
## For Our Hidden Variables
|
| 68 |
-
|
| 69 |
-
The paper directly supports the hidden variable approach:
|
| 70 |
-
- Hidden variables that secretly modulate reward = reward uncertainty from the agent's perspective
|
| 71 |
-
- Agent must learn to explore across time-of-day and energy levels to discover the true reward structure
|
| 72 |
-
- This is essentially the agent discovering the "potential function" through experience
|
| 73 |
-
|
| 74 |
-
---
|
| 75 |
-
|
| 76 |
-
## Takeaways for RhythmEnv Training
|
| 77 |
-
|
| 78 |
-
1. Use PBRS: define Φ(s) = weighted combination of progress + energy + inverse-stress
|
| 79 |
-
2. Add exploration bonus early in training (agent needs to try morning vs afternoon work)
|
| 80 |
-
3. Monitor for reward hacking (e.g., taking maximum breaks to avoid stress penalty)
|
| 81 |
-
4. Track reward components separately in logs — not just total reward
|
| 82 |
-
5. Reduce reward horizon early in training to accelerate validation
docs/references/reward_engineering_software_tasks.md
DELETED
|
@@ -1,77 +0,0 @@
|
|
| 1 |
-
---
|
| 2 |
-
source: https://arxiv.org/abs/2601.19100
|
| 3 |
-
title: "Reward Engineering for Reinforcement Learning in Software Tasks"
|
| 4 |
-
authors: Md Rayhanul Masud, Azmine Toushik Wasi, Salman Rahman, Md Rizwan Parvez
|
| 5 |
-
published: arXiv, January 2026 (first systematic review of this area)
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
# Reward Engineering for RL in Software Tasks
|
| 9 |
-
|
| 10 |
-
## What It Covers
|
| 11 |
-
|
| 12 |
-
First systematic review of reward design for code-centric RL tasks (generation, repair, summarization, testing). Surveys 80+ papers from 2024–2025. Core problem: software tasks lack direct reward signals — everything is proxy-based.
|
| 13 |
-
|
| 14 |
-
Relevant to RhythmEnv because **our environment is also proxy-based**: the agent never directly observes the hidden circadian/energy/stress factors — it infers them from reward signals, just like a code agent infers "correctness" from test pass rates.
|
| 15 |
-
|
| 16 |
-
---
|
| 17 |
-
|
| 18 |
-
## Proxy Reward Pattern (directly maps to RhythmEnv)
|
| 19 |
-
|
| 20 |
-
| Software Task Proxy | RhythmEnv Equivalent |
|---|---|
| Compilation success (binary) | Task completed before deadline (binary) |
| Test pass rate (% passing) | Importance-weighted completion fraction |
| Code quality metrics | Energy + stress management score |
| No regression (didn't break other tests) | No missed deadlines on other tasks |
| Runtime efficiency | Steps worked / optimal steps (efficiency score) |
|
| 27 |
-
|
| 28 |
-
The grader's final score = our "test suite". Per-step rewards = our "fast proxy" signals.
|
| 29 |
-
|
| 30 |
-
---
|
| 31 |
-
|
| 32 |
-
## Key Design Principles
|
| 33 |
-
|
| 34 |
-
### 1. Composite Rewards Win
|
| 35 |
-
No single metric is sufficient. Combine:
|
| 36 |
-
- **Fast proxies** (cheap, run every step): progress delta, stress penalty
|
| 37 |
-
- **Slow validators** (expensive, run at episode end): grader score (completion, deadline, efficiency)
|
| 38 |
-
|
| 39 |
-
Our design already does this: per-step reward + `_grade_episode` at `done=True`.
|
| 40 |
-
|
| 41 |
-
### 2. Sparse Reward Handling
|
| 42 |
-
Software tasks naturally sparse (pass/fail). Solutions:
|
| 43 |
-
- **Partial credit:** Reward near-correct attempts (our `progress_reward` per step does this)
|
| 44 |
-
- **Shaping:** Guide exploration toward productive states
|
| 45 |
-
- **Curriculum:** Start easy, add complexity — our `easy → medium → hard` scenarios
|
| 46 |
-
|
| 47 |
-
### 3. Reward Horizon
|
| 48 |
-
Shorter reward horizons accelerate learning. For RhythmEnv:
|
| 49 |
-
- Keep `MAX_STEPS=20` for training (short episodes = faster reward signal)
|
| 50 |
-
- Don't extend to multi-day episodes until single-day policy is stable
|
| 51 |
-
|
| 52 |
-
### 4. Avoid Single-Metric Optimization
|
| 53 |
-
Agents trained on test pass rate alone produce brittle code. For us:
|
| 54 |
-
- Don't train only on final score — intermediate per-step rewards matter
|
| 55 |
-
- The hidden variables (HV1/HV2/HV3) ensure the agent can't cheat a single metric
|
| 56 |
-
|
| 57 |
-
---
|
| 58 |
-
|
| 59 |
-
## Practical Checklist for Our Training Setup
|
| 60 |
-
|
| 61 |
-
- [ ] Per-step reward provides dense feedback (already implemented)
|
| 62 |
-
- [ ] Final grader score is independent of per-step reward design (already implemented)
|
| 63 |
-
- [ ] Multiple reward components logged separately (need to ensure in training loop)
|
| 64 |
-
- [ ] Curriculum: train on `easy` first, then `medium`, then `hard`
|
| 65 |
-
- [ ] Monitor for policy collapse — agent converging to a single strategy (e.g., always take breaks)
|
| 66 |
-
- [ ] Reward shaping doesn't conflict with grader score direction
|
| 67 |
-
|
| 68 |
-
---
|
| 69 |
-
|
| 70 |
-
## Takeaways for Hidden Variables
|
| 71 |
-
|
| 72 |
-
The paper's core insight: reward proxy ≠ true objective. This is exactly what hidden variables enforce:
|
| 73 |
-
- HV1 (Circadian): Same action at different times gives different rewards — forces temporal exploration
|
| 74 |
-
- HV2 (Energy Cliff): Progress collapses silently — forces the agent to maintain energy, can't predict when
|
| 75 |
-
- HV3 (Stress Meltdown): All rewards degrade silently — forces stress management even when it's not penalized directly
|
| 76 |
-
|
| 77 |
-
The agent must discover the "true test suite" (hidden variable thresholds) through the proxy (per-step rewards).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
docs/references/unsloth_grpo_training_template.md
DELETED
|
@@ -1,269 +0,0 @@
|
|
| 1 |
-
---
|
| 2 |
-
source: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game.ipynb
|
| 3 |
-
model: unsloth/gpt-oss-20b (4-bit quantized)
|
| 4 |
-
algorithm: GRPO (Group Relative Policy Optimization)
|
| 5 |
-
environment: 2048 game via OpenEnv (Meta-PyTorch)
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
# Unsloth GRPO Training Template — OpenEnv 2048
|
| 9 |
-
|
| 10 |
-
Reference notebook for training an LLM agent on an OpenEnv environment using GRPO.
|
| 11 |
-
Adapt this pattern for RhythmEnv.
|
| 12 |
-
|
| 13 |
-
---
|
| 14 |
-
|
| 15 |
-
## Installation
|
| 16 |
-
|
| 17 |
-
```bash
|
| 18 |
-
pip install --upgrade uv
|
| 19 |
-
uv pip install torch>=2.8.0 triton>=3.4.0 torchvision bitsandbytes
|
| 20 |
-
uv pip install transformers==4.56.2 trackio trl==0.22.2
|
| 21 |
-
pip install fastapi uvicorn requests
|
| 22 |
-
|
| 23 |
-
# Install your environment
|
| 24 |
-
git clone https://github.com/meta-pytorch/OpenEnv.git
|
| 25 |
-
# or: pip install openenv-rhythm-env
|
| 26 |
-
```
|
| 27 |
-
|
| 28 |
-
---
|
| 29 |
-
|
| 30 |
-
## 1. Model Loading
|
| 31 |
-
|
| 32 |
-
```python
|
| 33 |
-
from unsloth import FastLanguageModel
|
| 34 |
-
|
| 35 |
-
max_seq_length = 768
|
| 36 |
-
lora_rank = 4
|
| 37 |
-
|
| 38 |
-
model, tokenizer = FastLanguageModel.from_pretrained(
|
| 39 |
-
model_name="unsloth/gpt-oss-20b", # swap for our model
|
| 40 |
-
load_in_4bit=True, # 4-bit quantization for VRAM
|
| 41 |
-
max_seq_length=max_seq_length,
|
| 42 |
-
offload_embedding=True, # saves VRAM
|
| 43 |
-
)
|
| 44 |
-
|
| 45 |
-
model = FastLanguageModel.get_peft_model(
|
| 46 |
-
model,
|
| 47 |
-
r=lora_rank,
|
| 48 |
-
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
|
| 49 |
-
"gate_proj", "up_proj", "down_proj"],
|
| 50 |
-
lora_alpha=lora_rank * 2, # standard: 2x rank
|
| 51 |
-
use_gradient_checkpointing="unsloth",
|
| 52 |
-
random_state=3407,
|
| 53 |
-
)
|
| 54 |
-
```
|
| 55 |
-
|
| 56 |
-
**For RhythmEnv:** Swap `model_name` for whatever model we use on-site. Keep 4-bit + LoRA — essential for fitting in hackathon compute budget.
|
| 57 |
-
|
| 58 |
-
---
|
| 59 |
-
|
| 60 |
-
## 2. Environment Connection Pattern
|
| 61 |
-
|
| 62 |
-
```python
|
| 63 |
-
import sys, requests
|
| 64 |
-
sys.path.insert(0, './src')
|
| 65 |
-
|
| 66 |
-
# Launch env server (FastAPI + uvicorn)
|
| 67 |
-
port, openenv_process = launch_openenv(port=9000, process=None)
|
| 68 |
-
|
| 69 |
-
# Reset
|
| 70 |
-
result = openenv_process.reset()
|
| 71 |
-
state = result.observation # contains board state, legal_actions, done
|
| 72 |
-
|
| 73 |
-
# Step
|
| 74 |
-
result = openenv_process.step(action)
|
| 75 |
-
```
|
| 76 |
-
|
| 77 |
-
**For RhythmEnv adaptation:**
|
| 78 |
-
```python
|
| 79 |
-
from server.rhythm_environment import RhythmEnvironment
|
| 80 |
-
from models import RhythmAction, ActionType
|
| 81 |
-
|
| 82 |
-
env = RhythmEnvironment()
|
| 83 |
-
obs = env.reset(task="easy")
|
| 84 |
-
# obs.energy, obs.stress, obs.tasks, obs.timestep, obs.done
|
| 85 |
-
|
| 86 |
-
action = RhythmAction(action_type=ActionType.CONTINUE_TASK)
|
| 87 |
-
obs = env.step(action)
|
| 88 |
-
```
|
| 89 |
-
|
| 90 |
-
---
|
| 91 |
-
|
| 92 |
-
## 3. GRPO Trainer Config
|
| 93 |
-
|
| 94 |
-
```python
|
| 95 |
-
from trl import GRPOConfig, GRPOTrainer
|
| 96 |
-
|
| 97 |
-
max_prompt_length = 182
|
| 98 |
-
max_completion_length = 768 - max_prompt_length
|
| 99 |
-
|
| 100 |
-
training_args = GRPOConfig(
|
| 101 |
-
temperature=1.0,
|
| 102 |
-
learning_rate=2e-4,
|
| 103 |
-
weight_decay=0.001,
|
| 104 |
-
warmup_ratio=0.1,
|
| 105 |
-
lr_scheduler_type="linear",
|
| 106 |
-
optim="adamw_8bit",
|
| 107 |
-
logging_steps=1,
|
| 108 |
-
per_device_train_batch_size=1,
|
| 109 |
-
gradient_accumulation_steps=1,
|
| 110 |
-
num_generations=2, # generate 2 candidates per prompt, compare
|
| 111 |
-
max_prompt_length=max_prompt_length,
|
| 112 |
-
max_completion_length=max_completion_length,
|
| 113 |
-
max_steps=600, # ~600 training iterations
|
| 114 |
-
save_steps=100,
|
| 115 |
-
report_to="trackio", # or "wandb"
|
| 116 |
-
output_dir="outputs",
|
| 117 |
-
)
|
| 118 |
-
```
|
| 119 |
-
|
| 120 |
-
**Key GRPO parameters to tune:**
|
| 121 |
-
- `num_generations`: higher = more diverse exploration but slower (2 is minimum)
|
| 122 |
-
- `max_steps`: 600 is baseline; increase if reward curves haven't converged
|
| 123 |
-
- `temperature`: 1.0 for exploration; lower (0.7) after policy stabilizes
|
| 124 |
-
|
| 125 |
-
---
|
| 126 |
-
|
| 127 |
-
## 4. Reward Functions (Three-Layer Stack Pattern)
|
| 128 |
-
|
| 129 |
-
The notebook stacks three reward functions. Adapt this for RhythmEnv:
|
| 130 |
-
|
| 131 |
-
```python
|
| 132 |
-
# Layer 1: Format validity (always check first)
|
| 133 |
-
def format_valid(completions, **kwargs):
|
| 134 |
-
scores = []
|
| 135 |
-
for completion in completions:
|
| 136 |
-
response = completion[0]["content"]
|
| 137 |
-
action = extract_action(response) # parse action from LLM output
|
| 138 |
-
scores.append(1.0 if action is not None else -2.0)
|
| 139 |
-
return scores
|
| 140 |
-
|
| 141 |
-
# Layer 2: Action legality
|
| 142 |
-
def action_legal(completions, prompts, **kwargs):
|
| 143 |
-
scores = []
|
| 144 |
-
for completion, prompt in zip(completions, prompts):
|
| 145 |
-
obs = get_obs_from_prompt(prompt) # reconstruct state
|
| 146 |
-
action = extract_action(completion[0]["content"])
|
| 147 |
-
legal = action in obs.legal_actions if action is not None else False
|
| 148 |
-
scores.append(1.0 if legal else -1.0)
|
| 149 |
-
return scores
|
| 150 |
-
|
| 151 |
-
# Layer 3: Environment reward (run env.step, return actual reward)
|
| 152 |
-
def env_reward(completions, prompts, **kwargs):
|
| 153 |
-
scores = []
|
| 154 |
-
for completion, prompt in zip(completions, prompts):
|
| 155 |
-
action = extract_action(completion[0]["content"])
|
| 156 |
-
obs = run_env_step(action, prompt) # step the environment
|
| 157 |
-
scores.append(obs.reward if obs else -3.0)
|
| 158 |
-
return scores
|
| 159 |
-
|
| 160 |
-
# Pass all three to trainer
|
| 161 |
-
trainer = GRPOTrainer(
|
| 162 |
-
model=model,
|
| 163 |
-
processing_class=tokenizer,
|
| 164 |
-
reward_funcs=[format_valid, action_legal, env_reward],
|
| 165 |
-
args=training_args,
|
| 166 |
-
train_dataset=dataset,
|
| 167 |
-
)
|
| 168 |
-
trainer.train()
|
| 169 |
-
```
|
| 170 |
-
|
| 171 |
-
---
|
| 172 |
-
|
| 173 |
-
## 5. Dataset Structure
|
| 174 |
-
|
| 175 |
-
GRPO needs a dataset of prompts (the model generates completions and gets rewards):
|
| 176 |
-
|
| 177 |
-
```python
|
| 178 |
-
from datasets import Dataset
|
| 179 |
-
|
| 180 |
-
# For RhythmEnv: each sample is one episode observation prompt
|
| 181 |
-
prompt_template = """
|
| 182 |
-
You are managing a person's workday. Current state:
|
| 183 |
-
- Step: {timestep}/20
|
| 184 |
-
- Energy: {energy:.2f}
|
| 185 |
-
- Stress: {stress:.2f}
|
| 186 |
-
- Current task: {current_task}
|
| 187 |
-
- Tasks: {tasks_summary}
|
| 188 |
-
|
| 189 |
-
Choose the best action: START_TASK(id), CONTINUE_TASK, SWITCH_TASK(id), or TAKE_BREAK.
|
| 190 |
-
Reply with just the action.
|
| 191 |
-
"""
|
| 192 |
-
|
| 193 |
-
dataset = Dataset.from_list([
|
| 194 |
-
{"prompt": [{"role": "user", "content": prompt_template.format(**sample)}]}
|
| 195 |
-
for sample in generate_episode_samples(n=1000)
|
| 196 |
-
])
|
| 197 |
-
```
|
| 198 |
-
|
| 199 |
-
---
|
| 200 |
-
|
| 201 |
-
## 6. Inference After Training
|
| 202 |
-
|
| 203 |
-
```python
|
| 204 |
-
text = tokenizer.apply_chat_template(
|
| 205 |
-
[{"role": "user", "content": prompt}],
|
| 206 |
-
tokenize=False,
|
| 207 |
-
add_generation_prompt=True,
|
| 208 |
-
reasoning_effort="low", # fast inference during eval
|
| 209 |
-
)
|
| 210 |
-
|
| 211 |
-
output = model.generate(
|
| 212 |
-
**tokenizer(text, return_tensors="pt").to("cuda"),
|
| 213 |
-
temperature=0.7, # lower temp at inference time
|
| 214 |
-
max_new_tokens=64, # actions are short
|
| 215 |
-
)
|
| 216 |
-
|
| 217 |
-
response = tokenizer.decode(output[0], skip_special_tokens=True)
|
| 218 |
-
action = extract_action(response)
|
| 219 |
-
```
|
| 220 |
-
|
| 221 |
-
---
|
| 222 |
-
|
| 223 |
-
## 7. Monitoring
|
| 224 |
-
|
| 225 |
-
The notebook uses TrackIO (`report_to="trackio"`). Use W&B or TrackIO:
|
| 226 |
-
|
| 227 |
-
```python
|
| 228 |
-
import wandb
|
| 229 |
-
wandb.init(project="rhythmenv-round2")
|
| 230 |
-
# GRPOConfig(report_to="wandb")
|
| 231 |
-
```
|
| 232 |
-
|
| 233 |
-
Key metrics to watch:
|
| 234 |
-
- `reward/mean` — should trend upward
|
| 235 |
-
- `reward/std` — high early (exploration), narrows as policy stabilizes
|
| 236 |
-
- `kl` — KL divergence from reference policy; too high = unstable training
|
| 237 |
-
- Per-reward-function scores — track format_valid, action_legal, env_reward separately
|
| 238 |
-
|
| 239 |
-
---
|
| 240 |
-
|
| 241 |
-
## Differences: 2048 Game vs RhythmEnv
|
| 242 |
-
|
| 243 |
-
| 2048 Game | RhythmEnv |
|
| 244 |
-
|---|---|
|
| 245 |
-
| Discrete board state (16 ints) | Continuous state (energy, stress, progress) |
|
| 246 |
-
| 4 legal actions always | Variable legal actions (depends on current_task) |
|
| 247 |
-
| Win condition: reach 2048 | Win condition: high grader score (0.0–1.0) |
|
| 248 |
-
| Dense reward via win/lose | Dense reward via progress + penalty components |
|
| 249 |
-
| No hidden variables | 3 hidden variables (Circadian, Energy Cliff, Meltdown) |
|
| 250 |
-
| Strategy = Python function | Strategy = natural language action choice |
|
| 251 |
-
|
| 252 |
-
The hidden variables in RhythmEnv mean the agent must run **many episodes** to infer the true reward structure — more training steps needed than 2048.
|
| 253 |
-
|
| 254 |
-
---
|
| 255 |
-
|
| 256 |
-
## Recommended Starting Config for RhythmEnv
|
| 257 |
-
|
| 258 |
-
```python
|
| 259 |
-
GRPOConfig(
|
| 260 |
-
learning_rate=2e-4,
|
| 261 |
-
num_generations=4, # more diversity needed (hidden var exploration)
|
| 262 |
-
max_steps=1000, # more steps than 2048 (hidden var discovery)
|
| 263 |
-
temperature=1.0, # keep high for exploration
|
| 264 |
-
per_device_train_batch_size=1,
|
| 265 |
-
gradient_accumulation_steps=4, # effective batch = 4
|
| 266 |
-
warmup_ratio=0.1,
|
| 267 |
-
report_to="wandb",
|
| 268 |
-
)
|
| 269 |
-
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
docs/round1/SPEC.md
DELETED
|
@@ -1,554 +0,0 @@
|
|
| 1 |
-
# Build a Complete OpenEnv Environment: RhythmEnv
|
| 2 |
-
|
| 3 |
-
## Context
|
| 4 |
-
|
| 5 |
-
You are an expert software engineer tasked with building a **complete, production-grade OpenEnv environment** for a Meta x Hugging Face hackathon.
|
| 6 |
-
|
| 7 |
-
You have access to:
|
| 8 |
-
|
| 9 |
-
* The OpenEnv repository (including examples)
|
| 10 |
-
* Validation tools (`openenv validate`)
|
| 11 |
-
* Docker and Hugging Face Spaces
|
| 12 |
-
|
| 13 |
-
You must:
|
| 14 |
-
|
| 15 |
-
* Use this specification as a **strong foundation**
|
| 16 |
-
* Cross-reference with existing OpenEnv examples
|
| 17 |
-
* Improve design decisions where appropriate
|
| 18 |
-
* Ensure strict compliance with OpenEnv standards
|
| 19 |
-
|
| 20 |
-
Do NOT blindly follow instructions — refine and correct based on best practices observed in the repo.
|
| 21 |
-
|
| 22 |
-
---
|
| 23 |
-
|
| 24 |
-
# Objective
|
| 25 |
-
|
| 26 |
-
Build an environment called:
|
| 27 |
-
|
| 28 |
-
## **RhythmEnv**
|
| 29 |
-
|
| 30 |
-
> A deterministic reinforcement learning environment that simulates daily planning and execution under constraints like time, energy, deadlines, and task importance.
|
| 31 |
-
|
| 32 |
-
This environment should allow agents to learn:
|
| 33 |
-
|
| 34 |
-
* prioritization
|
| 35 |
-
* scheduling
|
| 36 |
-
* energy management
|
| 37 |
-
* decision-making under trade-offs
|
| 38 |
-
|
| 39 |
-
---
|
| 40 |
-
|
| 41 |
-
# Core Requirements (MANDATORY)
|
| 42 |
-
|
| 43 |
-
---
|
| 44 |
-
|
| 45 |
-
## 1. OpenEnv Spec Compliance
|
| 46 |
-
|
| 47 |
-
You MUST:
|
| 48 |
-
|
| 49 |
-
* Implement typed Pydantic models:
|
| 50 |
-
|
| 51 |
-
* Observation
|
| 52 |
-
* Action
|
| 53 |
-
* Reward
|
| 54 |
-
|
| 55 |
-
* Implement:
|
| 56 |
-
|
| 57 |
-
* `reset()`
|
| 58 |
-
* `step(action)`
|
| 59 |
-
* `state()`
|
| 60 |
-
|
| 61 |
-
* Include:
|
| 62 |
-
|
| 63 |
-
* `openenv.yaml`
|
| 64 |
-
* environment metadata
|
| 65 |
-
|
| 66 |
-
* Pass:
|
| 67 |
-
|
| 68 |
-
```bash
|
| 69 |
-
openenv validate
|
| 70 |
-
```
|
| 71 |
-
|
| 72 |
-
---
|
| 73 |
-
|
| 74 |
-
## 2. Real-World Task
|
| 75 |
-
|
| 76 |
-
The environment simulates:
|
| 77 |
-
|
| 78 |
-
> “Given a set of tasks, deadlines, and constraints, plan and execute optimally over a day.”
|
| 79 |
-
|
| 80 |
-
This is NOT a game. It must feel like a real productivity system.
|
| 81 |
-
|
| 82 |
-
---
|
| 83 |
-
|
| 84 |
-
## 3. Determinism (CRITICAL)
|
| 85 |
-
|
| 86 |
-
* No randomness anywhere
|
| 87 |
-
* All transitions are pure functions
|
| 88 |
-
* Same input → same output always
|
| 89 |
-
|
| 90 |
-
---
|
| 91 |
-
|
| 92 |
-
## 4. Episode Design
|
| 93 |
-
|
| 94 |
-
* 1 episode = 1 day
|
| 95 |
-
* 1 step = 30 minutes
|
| 96 |
-
* Total steps = ~20
|
| 97 |
-
|
| 98 |
-
---
|
| 99 |
-
|
| 100 |
-
## 5. Action Space (STRICT)
|
| 101 |
-
|
| 102 |
-
No free-form actions.
|
| 103 |
-
|
| 104 |
-
Define structured actions:
|
| 105 |
-
|
| 106 |
-
* START_TASK(task_id)
|
| 107 |
-
* CONTINUE_TASK()
|
| 108 |
-
* SWITCH_TASK(task_id)
|
| 109 |
-
* TAKE_BREAK(duration)
|
| 110 |
-
|
| 111 |
-
Validate all actions strictly.
|
| 112 |
-
|
| 113 |
-
---
|
| 114 |
-
|
| 115 |
-
## 6. Observation Space
|
| 116 |
-
|
| 117 |
-
Must include:
|
| 118 |
-
|
| 119 |
-
* current timestep
|
| 120 |
-
* energy (0–1)
|
| 121 |
-
* stress (0–1)
|
| 122 |
-
* current task (optional)
|
| 123 |
-
* tasks list
|
| 124 |
-
* calendar (meetings)
|
| 125 |
-
* remaining steps
|
| 126 |
-
|
| 127 |
-
### Task fields:
|
| 128 |
-
|
| 129 |
-
* id
|
| 130 |
-
* effort (0–1)
|
| 131 |
-
* progress (0–1)
|
| 132 |
-
* deadline (timestep)
|
| 133 |
-
* importance (0–1)
|
| 134 |
-
|
| 135 |
-
---
|
| 136 |
-
|
| 137 |
-
## 7. Environment Dynamics
|
| 138 |
-
|
| 139 |
-
---
|
| 140 |
-
|
| 141 |
-
### Task Progress
|
| 142 |
-
|
| 143 |
-
* Only when working on a task
|
| 144 |
-
* Scales with energy
|
| 145 |
-
|
| 146 |
-
Example baseline:
|
| 147 |
-
|
| 148 |
-
```
|
| 149 |
-
progress_delta = k * energy
|
| 150 |
-
```
|
| 151 |
-
|
| 152 |
-
---
|
| 153 |
-
|
| 154 |
-
### Energy
|
| 155 |
-
|
| 156 |
-
* decreases during work
|
| 157 |
-
* increases during breaks
|
| 158 |
-
* slight decay during idle/switch
|
| 159 |
-
|
| 160 |
-
Clamp between [0, 1]
|
| 161 |
-
|
| 162 |
-
---
|
| 163 |
-
|
| 164 |
-
### Stress
|
| 165 |
-
|
| 166 |
-
* increases when:
|
| 167 |
-
|
| 168 |
-
* deadlines missed
|
| 169 |
-
* too many pending tasks
|
| 170 |
-
* decreases during breaks
|
| 171 |
-
|
| 172 |
-
---
|
| 173 |
-
|
| 174 |
-
### Meetings
|
| 175 |
-
|
| 176 |
-
* block task progress
|
| 177 |
-
* slightly reduce energy
|
| 178 |
-
|
| 179 |
-
---
|
| 180 |
-
|
| 181 |
-
## 8. Hidden Internal Mode (IMPORTANT)
|
| 182 |
-
|
| 183 |
-
Implement a latent mode (NOT exposed to agent):
|
| 184 |
-
|
| 185 |
-
* deep_work
|
| 186 |
-
* execution
|
| 187 |
-
* balanced
|
| 188 |
-
|
| 189 |
-
Derived deterministically from state.
|
| 190 |
-
|
| 191 |
-
Used to:
|
| 192 |
-
|
| 193 |
-
* slightly influence reward
|
| 194 |
-
* create richer learning signal
|
| 195 |
-
|
| 196 |
-
---
|
| 197 |
-
|
| 198 |
-
# 🔴 Reward & Grader Design Contract (STRICT)
|
| 199 |
-
|
| 200 |
-
This is the MOST IMPORTANT part of the system.
|
| 201 |
-
|
| 202 |
-
---
|
| 203 |
-
|
| 204 |
-
## Reward Design Requirements
|
| 205 |
-
|
| 206 |
-
---
|
| 207 |
-
|
| 208 |
-
### 1. Dense & Informative
|
| 209 |
-
|
| 210 |
-
* Every step must produce meaningful reward
|
| 211 |
-
* No flat or zero-reward sequences
|
| 212 |
-
|
| 213 |
-
---
|
| 214 |
-
|
| 215 |
-
### 2. Monotonic Progress
|
| 216 |
-
|
| 217 |
-
* More progress → higher reward
|
| 218 |
-
* Regressions → penalties
|
| 219 |
-
|
| 220 |
-
---
|
| 221 |
-
|
| 222 |
-
### 3. Multi-Component Reward
|
| 223 |
-
|
| 224 |
-
Reward MUST include:
|
| 225 |
-
|
| 226 |
-
### Positive:
|
| 227 |
-
|
| 228 |
-
* task progress
|
| 229 |
-
* task completion (scaled by importance)
|
| 230 |
-
|
| 231 |
-
### Negative:
|
| 232 |
-
|
| 233 |
-
* stress penalty
|
| 234 |
-
* missed deadlines
|
| 235 |
-
* excessive switching
|
| 236 |
-
* inefficiency
|
| 237 |
-
* no-op behavior
|
| 238 |
-
|
| 239 |
-
---
|
| 240 |
-
|
| 241 |
-
### 4. Anti-Exploitation
|
| 242 |
-
|
| 243 |
-
Explicitly prevent:
|
| 244 |
-
|
| 245 |
-
* infinite loops
|
| 246 |
-
* repeated switching
|
| 247 |
-
* spamming breaks
|
| 248 |
-
* idle actions
|
| 249 |
-
|
| 250 |
-
Add:
|
| 251 |
-
|
| 252 |
-
* penalties
|
| 253 |
-
* diminishing returns
|
| 254 |
-
* constraints
|
| 255 |
-
|
| 256 |
-
---
|
| 257 |
-
|
| 258 |
-
### 5. Bounded Reward
|
| 259 |
-
|
| 260 |
-
* Prevent reward explosion
|
| 261 |
-
* Keep values stable and comparable
|
| 262 |
-
|
| 263 |
-
---
|
| 264 |
-
|
| 265 |
-
### 6. Reward Breakdown (MANDATORY)
|
| 266 |
-
|
| 267 |
-
Return structured info:
|
| 268 |
-
|
| 269 |
-
```
|
| 270 |
-
{
|
| 271 |
-
"progress": ...,
|
| 272 |
-
"completion_bonus": ...,
|
| 273 |
-
"stress_penalty": ...,
|
| 274 |
-
"switch_penalty": ...,
|
| 275 |
-
"inefficiency_penalty": ...
|
| 276 |
-
}
|
| 277 |
-
```
|
| 278 |
-
|
| 279 |
-
---
|
| 280 |
-
|
| 281 |
-
## Grader Design Requirements (CRITICAL)
|
| 282 |
-
|
| 283 |
-
Each task MUST include a deterministic grader.
|
| 284 |
-
|
| 285 |
-
---
|
| 286 |
-
|
| 287 |
-
### Requirements:
|
| 288 |
-
|
| 289 |
-
* Score range:
|
| 290 |
-
|
| 291 |
-
```
|
| 292 |
-
0.0 ≤ score ≤ 1.0
|
| 293 |
-
```
|
| 294 |
-
|
| 295 |
-
* Deterministic
|
| 296 |
-
* Reproducible
|
| 297 |
-
* Continuous (not binary)
|
| 298 |
-
|
| 299 |
-
---
|
| 300 |
-
|
| 301 |
-
### Must Evaluate:
|
| 302 |
-
|
| 303 |
-
* task completion
|
| 304 |
-
* deadline adherence
|
| 305 |
-
* efficiency
|
| 306 |
-
* energy usage
|
| 307 |
-
* stress management
|
| 308 |
-
|
| 309 |
-
---
|
| 310 |
-
|
| 311 |
-
### Efficiency Metric (REQUIRED)
|
| 312 |
-
|
| 313 |
-
Define:
|
| 314 |
-
|
| 315 |
-
```
|
| 316 |
-
efficiency = optimal_steps / actual_steps
|
| 317 |
-
```
|
| 318 |
-
|
| 319 |
-
Use in grader.
|
| 320 |
-
|
| 321 |
-
---
|
| 322 |
-
|
| 323 |
-
### Normalization
|
| 324 |
-
|
| 325 |
-
Ensure:
|
| 326 |
-
|
| 327 |
-
* random agent → ~0.1–0.3
|
| 328 |
-
* baseline agent → ~0.4–0.6
|
| 329 |
-
* strong agent → ~0.7–1.0
|
| 330 |
-
|
| 331 |
-
---
|
| 332 |
-
|
| 333 |
-
## Reward vs Grader Alignment
|
| 334 |
-
|
| 335 |
-
* Reward → guides learning
|
| 336 |
-
* Grader → evaluates outcome
|
| 337 |
-
|
| 338 |
-
They must align but NOT be identical.
|
| 339 |
-
|
| 340 |
-
---
|
| 341 |
-
|
| 342 |
-
## Anti-Exploitation Validation (MANDATORY)
|
| 343 |
-
|
| 344 |
-
Explicitly test:
|
| 345 |
-
|
| 346 |
-
* agent spamming TAKE_BREAK
|
| 347 |
-
* agent switching every step
|
| 348 |
-
* agent doing nothing
|
| 349 |
-
|
| 350 |
-
Ensure:
|
| 351 |
-
|
| 352 |
-
* these strategies score poorly
|
| 353 |
-
|
| 354 |
-
---
|
| 355 |
-
|
| 356 |
-
## Logging (MANDATORY)
|
| 357 |
-
|
| 358 |
-
Return in `info`:
|
| 359 |
-
|
| 360 |
-
```
|
| 361 |
-
{
|
| 362 |
-
"reward_breakdown": ...,
|
| 363 |
-
"task_progress": ...,
|
| 364 |
-
"deadline_status": ...
|
| 365 |
-
}
|
| 366 |
-
```
|
| 367 |
-
|
| 368 |
-
---
|
| 369 |
-
|
| 370 |
-
# 9. Tasks (3 Required)
|
| 371 |
-
|
| 372 |
-
---
|
| 373 |
-
|
| 374 |
-
## Task 1 — Easy (Single Priority)
|
| 375 |
-
|
| 376 |
-
* 3 tasks
|
| 377 |
-
* 1 clearly important
|
| 378 |
-
* no meetings
|
| 379 |
-
* high energy
|
| 380 |
-
|
| 381 |
-
Goal:
|
| 382 |
-
|
| 383 |
-
* complete main task efficiently
|
| 384 |
-
|
| 385 |
-
---
|
| 386 |
-
|
| 387 |
-
## Task 2 — Medium (Deadline Pressure)
|
| 388 |
-
|
| 389 |
-
* multiple tasks
|
| 390 |
-
* tight deadlines
|
| 391 |
-
* at least one meeting
|
| 392 |
-
|
| 393 |
-
Goal:
|
| 394 |
-
|
| 395 |
-
* maximize completion before deadlines
|
| 396 |
-
|
| 397 |
-
---
|
| 398 |
-
|
| 399 |
-
## Task 3 — Hard (Energy Tradeoff)
|
| 400 |
-
|
| 401 |
-
* low energy
|
| 402 |
-
* one deep task
|
| 403 |
-
* multiple small tasks
|
| 404 |
-
|
| 405 |
-
Goal:
|
| 406 |
-
|
| 407 |
-
* balance:
|
| 408 |
-
|
| 409 |
-
* rest
|
| 410 |
-
* deep work
|
| 411 |
-
* short tasks
|
| 412 |
-
|
| 413 |
-
---
|
| 414 |
-
|
| 415 |
-
# 10. Baseline Agent (`inference.py`)
|
| 416 |
-
|
| 417 |
-
---
|
| 418 |
-
|
| 419 |
-
## Requirements:
|
| 420 |
-
|
| 421 |
-
* Use OpenAI client
|
| 422 |
-
* Read:
|
| 423 |
-
|
| 424 |
-
* API_BASE_URL
|
| 425 |
-
* MODEL_NAME
|
| 426 |
-
* OPENAI_API_KEY
|
| 427 |
-
* Run all 3 tasks
|
| 428 |
-
* Output logs in EXACT format:
|
| 429 |
-
|
| 430 |
-
* `[START]`
|
| 431 |
-
* `[STEP]`
|
| 432 |
-
* `[END]`
|
| 433 |
-
|
| 434 |
-
---
|
| 435 |
-
|
| 436 |
-
## Baseline Strategy
|
| 437 |
-
|
| 438 |
-
Simple heuristic:
|
| 439 |
-
|
| 440 |
-
* pick highest importance task
|
| 441 |
-
* continue until done or deadline
|
| 442 |
-
* take break if energy low
|
| 443 |
-
|
| 444 |
-
Baseline must be:
|
| 445 |
-
|
| 446 |
-
* non-trivial
|
| 447 |
-
* beatable
|
| 448 |
-
|
| 449 |
-
---
|
| 450 |
-
|
| 451 |
-
# 11. Code Structure
|
| 452 |
-
|
| 453 |
-
```
|
| 454 |
-
rhythm_env/
|
| 455 |
-
├── env.py
|
| 456 |
-
├── models.py
|
| 457 |
-
├── tasks/
|
| 458 |
-
├── graders/
|
| 459 |
-
├── utils/
|
| 460 |
-
├── openenv.yaml
|
| 461 |
-
├── inference.py
|
| 462 |
-
├── Dockerfile
|
| 463 |
-
└── README.md
|
| 464 |
-
```
|
| 465 |
-
|
| 466 |
-
---
|
| 467 |
-
|
| 468 |
-
# 12. README (MANDATORY)
|
| 469 |
-
|
| 470 |
-
Include:
|
| 471 |
-
|
| 472 |
-
* description
|
| 473 |
-
* motivation
|
| 474 |
-
* action space
|
| 475 |
-
* observation space
|
| 476 |
-
* reward design
|
| 477 |
-
* task descriptions
|
| 478 |
-
* grader explanation
|
| 479 |
-
* setup instructions
|
| 480 |
-
* baseline results
|
| 481 |
-
|
| 482 |
-
---
|
| 483 |
-
|
| 484 |
-
# 13. Validation Checklist
|
| 485 |
-
|
| 486 |
-
Before finishing:
|
| 487 |
-
|
| 488 |
-
* [ ] openenv validate passes
|
| 489 |
-
* [ ] Docker builds & runs
|
| 490 |
-
* [ ] HF Space responds to reset()
|
| 491 |
-
* [ ] All 3 tasks execute
|
| 492 |
-
* [ ] Graders return valid scores
|
| 493 |
-
* [ ] Baseline script runs < 20 min
|
| 494 |
-
* [ ] Logs follow required format
|
| 495 |
-
|
| 496 |
-
---
|
| 497 |
-
|
| 498 |
-
# 14. Iteration Requirement (MANDATORY)
|
| 499 |
-
|
| 500 |
-
After initial implementation:
|
| 501 |
-
|
| 502 |
-
1. Run:
|
| 503 |
-
|
| 504 |
-
* baseline agent
|
| 505 |
-
* random policy
|
| 506 |
-
|
| 507 |
-
2. Compare scores
|
| 508 |
-
|
| 509 |
-
3. Adjust:
|
| 510 |
-
|
| 511 |
-
* reward weights
|
| 512 |
-
* penalties
|
| 513 |
-
* grader scaling
|
| 514 |
-
|
| 515 |
-
DO NOT finalize without iteration.
|
| 516 |
-
|
| 517 |
-
---
|
| 518 |
-
|
| 519 |
-
# 15. Design Principles (FINAL)
|
| 520 |
-
|
| 521 |
-
---
|
| 522 |
-
|
| 523 |
-
## DO:
|
| 524 |
-
|
| 525 |
-
* Learn from OpenEnv examples
|
| 526 |
-
* Keep environment deterministic
|
| 527 |
-
* Make trade-offs the core difficulty
|
| 528 |
-
* Keep state interpretable
|
| 529 |
-
* Ensure reward clarity
|
| 530 |
-
|
| 531 |
-
---
|
| 532 |
-
|
| 533 |
-
## DO NOT:
|
| 534 |
-
|
| 535 |
-
* introduce randomness
|
| 536 |
-
* hide critical information unnecessarily
|
| 537 |
-
* create sparse rewards
|
| 538 |
-
* build overly complex simulation
|
| 539 |
-
|
| 540 |
-
---
|
| 541 |
-
|
| 542 |
-
# Final Goal
|
| 543 |
-
|
| 544 |
-
Produce an environment that:
|
| 545 |
-
|
| 546 |
-
* agents can learn from
|
| 547 |
-
* evaluators can trust
|
| 548 |
-
* demonstrates meaningful improvement across models
|
| 549 |
-
* reflects real-world decision-making
|
| 550 |
-
|
| 551 |
-
---
|
| 552 |
-
|
| 553 |
-
End of Prompt.
|
| 554 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
docs/round1/inference.py
DELETED
|
@@ -1,304 +0,0 @@
|
|
| 1 |
-
# Copyright (c) Meta Platforms, Inc. and affiliates.
|
| 2 |
-
# All rights reserved.
|
| 3 |
-
#
|
| 4 |
-
# This source code is licensed under the BSD-style license found in the
|
| 5 |
-
# LICENSE file in the root directory of this source tree.
|
| 6 |
-
|
| 7 |
-
"""
|
| 8 |
-
RhythmEnv Inference Script
|
| 9 |
-
===================================
|
| 10 |
-
MANDATORY
|
| 11 |
-
- Before submitting, ensure the following variables are defined in your environment configuration:
|
| 12 |
-
API_BASE_URL The API endpoint for the LLM.
|
| 13 |
-
MODEL_NAME The model identifier to use for inference.
|
| 14 |
-
HF_TOKEN Your Hugging Face / API key.
|
| 15 |
-
LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
|
| 16 |
-
|
| 17 |
-
- Defaults are set only for API_BASE_URL and MODEL_NAME
|
| 18 |
-
(and should reflect your active inference setup):
|
| 19 |
-
API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
|
| 20 |
-
MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
|
| 21 |
-
|
| 22 |
-
- The inference script must be named `inference.py` and placed in the root directory of the project
|
| 23 |
-
- Participants must use OpenAI Client for all LLM calls using above variables
|
| 24 |
-
|
| 25 |
-
STDOUT FORMAT
|
| 26 |
-
- The script must emit exactly three line types to stdout, in this order:
|
| 27 |
-
|
| 28 |
-
[START] task=<task_name> env=<benchmark> model=<model_name>
|
| 29 |
-
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
|
| 30 |
-
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
|
| 31 |
-
|
| 32 |
-
Rules:
|
| 33 |
-
- One [START] line at episode begin.
|
| 34 |
-
- One [STEP] line per step, immediately after env.step() returns.
|
| 35 |
-
- One [END] line after env.close(), always emitted (even on exception).
|
| 36 |
-
- reward and rewards are formatted to 2 decimal places.
|
| 37 |
-
- done and success are lowercase booleans: true or false.
|
| 38 |
-
- error is the raw last_action_error string, or null if none.
|
| 39 |
-
- All fields on a single line with no newlines within a line.
|
| 40 |
-
- Each tasks should return score in [0, 1]
|
| 41 |
-
"""
|
| 42 |
-
|
| 43 |
-
import asyncio
|
| 44 |
-
import os
|
| 45 |
-
import sys
|
| 46 |
-
import textwrap
|
| 47 |
-
from typing import List, Optional
|
| 48 |
-
|
| 49 |
-
from openai import OpenAI
|
| 50 |
-
|
| 51 |
-
# Add current directory to path for local imports
|
| 52 |
-
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
| 53 |
-
|
| 54 |
-
from client import RhythmEnv
|
| 55 |
-
from models import ActionType, RhythmAction
|
| 56 |
-
|
| 57 |
-
# ---------------------------------------------------------------------------
|
| 58 |
-
# Configuration
|
| 59 |
-
# ---------------------------------------------------------------------------
|
| 60 |
-
|
| 61 |
-
IMAGE_NAME = os.getenv("IMAGE_NAME")
|
| 62 |
-
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
|
| 63 |
-
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
|
| 64 |
-
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
|
| 65 |
-
BASE_URL = os.getenv("RHYTHM_ENV_URL", "https://InosLihka-rhythm-env.hf.space")
|
| 66 |
-
BENCHMARK = "rhythm_env"
|
| 67 |
-
TASKS = ["easy", "medium", "hard"]
|
| 68 |
-
MAX_STEPS = 20
|
| 69 |
-
SCORE_THRESHOLD = 0.1
|
| 70 |
-
|
| 71 |
-
SYSTEM_PROMPT = textwrap.dedent("""\
|
| 72 |
-
You are a daily planning agent. You manage tasks across a workday.
|
| 73 |
-
Each step is a 30-minute slot. You have energy (0-1) and stress (0-1).
|
| 74 |
-
|
| 75 |
-
Available actions (respond with EXACTLY one line in this format):
|
| 76 |
-
START_TASK <task_id>
|
| 77 |
-
CONTINUE_TASK
|
| 78 |
-
SWITCH_TASK <task_id>
|
| 79 |
-
TAKE_BREAK
|
| 80 |
-
|
| 81 |
-
Rules:
|
| 82 |
-
- START_TASK/SWITCH_TASK require a task_id (integer).
|
| 83 |
-
- CONTINUE_TASK continues your current task.
|
| 84 |
-
- TAKE_BREAK recovers energy and reduces stress.
|
| 85 |
-
- Take breaks when energy < 0.3.
|
| 86 |
-
- Prioritize tasks by deadline urgency, then importance.
|
| 87 |
-
- Avoid unnecessary switching (costs energy and reward).
|
| 88 |
-
|
| 89 |
-
Respond with ONLY the action line, nothing else.""")
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
# ---------------------------------------------------------------------------
|
| 93 |
-
# Logging helpers
|
| 94 |
-
# ---------------------------------------------------------------------------
|
| 95 |
-
|
| 96 |
-
def log_start(task: str, env: str, model: str) -> None:
|
| 97 |
-
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
|
| 101 |
-
error_val = error if error else "null"
|
| 102 |
-
done_val = str(done).lower()
|
| 103 |
-
print(
|
| 104 |
-
f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
|
| 105 |
-
flush=True,
|
| 106 |
-
)
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
|
| 110 |
-
rewards_str = ",".join(f"{r:.2f}" for r in rewards)
|
| 111 |
-
print(
|
| 112 |
-
f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
|
| 113 |
-
flush=True,
|
| 114 |
-
)
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
# ---------------------------------------------------------------------------
|
| 118 |
-
# Heuristic action selection (enhanced by LLM)
|
| 119 |
-
# ---------------------------------------------------------------------------
|
| 120 |
-
|
| 121 |
-
def choose_action_heuristic(obs) -> RhythmAction:
|
| 122 |
-
"""Greedy heuristic: prioritize by deadline then importance."""
|
| 123 |
-
energy = obs.energy
|
| 124 |
-
current_task_id = obs.current_task_id
|
| 125 |
-
tasks = obs.tasks
|
| 126 |
-
timestep = obs.timestep
|
| 127 |
-
meetings = obs.meetings
|
| 128 |
-
|
| 129 |
-
# During meeting slots, just take a break
|
| 130 |
-
if timestep in meetings:
|
| 131 |
-
return RhythmAction(action_type=ActionType.TAKE_BREAK)
|
| 132 |
-
|
| 133 |
-
# Take break if energy is low
|
| 134 |
-
if energy < 0.3:
|
| 135 |
-
return RhythmAction(action_type=ActionType.TAKE_BREAK)
|
| 136 |
-
|
| 137 |
-
# Get uncompleted tasks
|
| 138 |
-
uncompleted = [t for t in tasks if t.progress < t.effort]
|
| 139 |
-
if not uncompleted:
|
| 140 |
-
return RhythmAction(action_type=ActionType.TAKE_BREAK)
|
| 141 |
-
|
| 142 |
-
# Sort by deadline (ascending), then importance (descending)
|
| 143 |
-
uncompleted.sort(key=lambda t: (t.deadline, -t.importance))
|
| 144 |
-
|
| 145 |
-
# Check for urgent tasks (deadline within 3 steps)
|
| 146 |
-
urgent = [t for t in uncompleted if t.deadline - timestep <= 3]
|
| 147 |
-
best = urgent[0] if urgent else uncompleted[0]
|
| 148 |
-
|
| 149 |
-
if current_task_id is not None and current_task_id == best.id:
|
| 150 |
-
return RhythmAction(action_type=ActionType.CONTINUE_TASK)
|
| 151 |
-
elif current_task_id is not None:
|
| 152 |
-
return RhythmAction(action_type=ActionType.SWITCH_TASK, task_id=best.id)
|
| 153 |
-
else:
|
| 154 |
-
return RhythmAction(action_type=ActionType.START_TASK, task_id=best.id)
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
|
| 158 |
-
"""Use LLM to pick an action, fall back to heuristic on failure."""
|
| 159 |
-
tasks_desc = "\n".join(
|
| 160 |
-
f" Task {t.id}: {t.name} — {t.description}\n"
|
| 161 |
-
f" (effort={t.effort:.2f}, progress={t.progress:.2f}, "
|
| 162 |
-
f"deadline=step {t.deadline}, importance={t.importance})"
|
| 163 |
-
for t in obs.tasks
|
| 164 |
-
)
|
| 165 |
-
user_prompt = textwrap.dedent(f"""\
|
| 166 |
-
Step: {obs.timestep}/{MAX_STEPS}
|
| 167 |
-
Energy: {obs.energy:.2f}
|
| 168 |
-
Stress: {obs.stress:.2f}
|
| 169 |
-
Current task: {obs.current_task_id}
|
| 170 |
-
Meetings at steps: {obs.meetings}
|
| 171 |
-
Remaining steps: {obs.remaining_steps}
|
| 172 |
-
|
| 173 |
-
Tasks:
|
| 174 |
-
{tasks_desc}
|
| 175 |
-
|
| 176 |
-
Choose your action:""")
|
| 177 |
-
|
| 178 |
-
try:
|
| 179 |
-
completion = llm_client.chat.completions.create(
|
| 180 |
-
model=MODEL_NAME,
|
| 181 |
-
messages=[
|
| 182 |
-
{"role": "system", "content": SYSTEM_PROMPT},
|
| 183 |
-
{"role": "user", "content": user_prompt},
|
| 184 |
-
],
|
| 185 |
-
temperature=0.3,
|
| 186 |
-
max_tokens=30,
|
| 187 |
-
stream=False,
|
| 188 |
-
)
|
| 189 |
-
text = (completion.choices[0].message.content or "").strip()
|
| 190 |
-
return parse_llm_action(text, obs)
|
| 191 |
-
except Exception:
|
| 192 |
-
return choose_action_heuristic(obs)
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
def parse_llm_action(text: str, obs) -> RhythmAction:
|
| 196 |
-
"""Parse LLM response text into a RhythmAction."""
|
| 197 |
-
text = text.strip().upper()
|
| 198 |
-
|
| 199 |
-
if text.startswith("TAKE_BREAK"):
|
| 200 |
-
return RhythmAction(action_type=ActionType.TAKE_BREAK)
|
| 201 |
-
|
| 202 |
-
if text.startswith("CONTINUE_TASK"):
|
| 203 |
-
if obs.current_task_id is not None:
|
| 204 |
-
return RhythmAction(action_type=ActionType.CONTINUE_TASK)
|
| 205 |
-
return choose_action_heuristic(obs)
|
| 206 |
-
|
| 207 |
-
for prefix, action_type in [
|
| 208 |
-
("START_TASK", ActionType.START_TASK),
|
| 209 |
-
("SWITCH_TASK", ActionType.SWITCH_TASK),
|
| 210 |
-
]:
|
| 211 |
-
if text.startswith(prefix):
|
| 212 |
-
rest = text[len(prefix):].strip()
|
| 213 |
-
try:
|
| 214 |
-
task_id = int(rest)
|
| 215 |
-
if 0 <= task_id < len(obs.tasks):
|
| 216 |
-
return RhythmAction(action_type=action_type, task_id=task_id)
|
| 217 |
-
except ValueError:
|
| 218 |
-
pass
|
| 219 |
-
|
| 220 |
-
# Fallback
|
| 221 |
-
return choose_action_heuristic(obs)
|
| 222 |
-
|
| 223 |
-
|
| 224 |
-
# ---------------------------------------------------------------------------
|
| 225 |
-
# Main loop
|
| 226 |
-
# ---------------------------------------------------------------------------
|
| 227 |
-
|
| 228 |
-
async def run_task(task_name: str, llm_client: OpenAI) -> float:
|
| 229 |
-
"""Run a single task and return the score."""
|
| 230 |
-
if IMAGE_NAME:
|
| 231 |
-
env = await RhythmEnv.from_docker_image(IMAGE_NAME)
|
| 232 |
-
else:
|
| 233 |
-
env = RhythmEnv(base_url=BASE_URL)
|
| 234 |
-
|
| 235 |
-
rewards: List[float] = []
|
| 236 |
-
steps_taken = 0
|
| 237 |
-
score = 0.0
|
| 238 |
-
success = False
|
| 239 |
-
|
| 240 |
-
log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
|
| 241 |
-
|
| 242 |
-
try:
|
| 243 |
-
async with env:
|
| 244 |
-
result = await env.reset(task=task_name)
|
| 245 |
-
|
| 246 |
-
for step in range(1, MAX_STEPS + 1):
|
| 247 |
-
if result.done:
|
| 248 |
-
break
|
| 249 |
-
|
| 250 |
-
# Use LLM if available, otherwise heuristic
|
| 251 |
-
if llm_client is not None:
|
| 252 |
-
action = choose_action_llm(result.observation, llm_client)
|
| 253 |
-
else:
|
| 254 |
-
action = choose_action_heuristic(result.observation)
|
| 255 |
-
|
| 256 |
-
action_str = action.action_type.value
|
| 257 |
-
if action.task_id is not None:
|
| 258 |
-
action_str += f"({action.task_id})"
|
| 259 |
-
|
| 260 |
-
result = await env.step(action)
|
| 261 |
-
|
| 262 |
-
reward = result.reward or 0.0
|
| 263 |
-
done = result.done
|
| 264 |
-
rewards.append(reward)
|
| 265 |
-
steps_taken = step
|
| 266 |
-
|
| 267 |
-
log_step(step=step, action=action_str, reward=reward, done=done, error=None)
|
| 268 |
-
|
| 269 |
-
if done:
|
| 270 |
-
break
|
| 271 |
-
|
| 272 |
-
# Get final score from grader
|
| 273 |
-
score = result.observation.reward_breakdown.get("final_score", 0.0)
|
| 274 |
-
score = max(0.0, min(1.0, score))
|
| 275 |
-
success = score >= SCORE_THRESHOLD
|
| 276 |
-
|
| 277 |
-
except Exception as e:
|
| 278 |
-
print(f"[DEBUG] Error running task {task_name}: {e}", flush=True)
|
| 279 |
-
finally:
|
| 280 |
-
try:
|
| 281 |
-
await env.close()
|
| 282 |
-
except Exception as e:
|
| 283 |
-
print(f"[DEBUG] env.close() error: {e}", flush=True)
|
| 284 |
-
log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
|
| 285 |
-
|
| 286 |
-
return score
|
| 287 |
-
|
| 288 |
-
|
| 289 |
-
async def main() -> None:
|
| 290 |
-
llm_client = None
|
| 291 |
-
if API_KEY:
|
| 292 |
-
llm_client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 293 |
-
|
| 294 |
-
scores = []
|
| 295 |
-
for task_name in TASKS:
|
| 296 |
-
s = await run_task(task_name, llm_client)
|
| 297 |
-
scores.append(s)
|
| 298 |
-
|
| 299 |
-
avg = sum(scores) / len(scores) if scores else 0.0
|
| 300 |
-
print(f"\n[SUMMARY] avg_score={avg:.3f} scores={','.join(f'{s:.3f}' for s in scores)}", flush=True)
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
if __name__ == "__main__":
|
| 304 |
-
asyncio.run(main())
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
docs/round1/problem_statement.md
DELETED
|
@@ -1,176 +0,0 @@
|
|
| 1 |
-
# Round 1 — Problem Statement
|
| 2 |
-
|
| 3 |
-
## The Task
|
| 4 |
-
|
| 5 |
-
Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## Key Requirements at a Glance
|
| 10 |
-
|
| 11 |
-
- Must simulate a real-world task (not games or toys)
|
| 12 |
-
- Implement full OpenEnv spec: typed models, `step()`/`reset()`/`state()`, `openenv.yaml`
|
| 13 |
-
- Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)
|
| 14 |
-
- Meaningful reward function with partial progress signals
|
| 15 |
-
- Baseline inference script with reproducible scores
|
| 16 |
-
- Deploy to Hugging Face Spaces + working Dockerfile
|
| 17 |
-
- README with environment description, action/observation spaces, setup instructions
|
| 18 |
-
|
| 19 |
-
---
|
| 20 |
-
|
| 21 |
-
## Functional Requirements
|
| 22 |
-
|
| 23 |
-
### 1. Real-World Task Simulation
|
| 24 |
-
The environment must simulate a task humans actually do — not games or toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
|
| 25 |
-
|
| 26 |
-
### 2. OpenEnv Spec Compliance
|
| 27 |
-
Implement the full OpenEnv interface:
|
| 28 |
-
- Typed `Observation`, `Action`, and `Reward` Pydantic models
|
| 29 |
-
- `step(action)` → returns observation, reward, done, info
|
| 30 |
-
- `reset()` → returns initial observation
|
| 31 |
-
- `state()` → returns current state
|
| 32 |
-
- `openenv.yaml` with metadata
|
| 33 |
-
- Tested via `openenv validate`
|
| 34 |
-
|
| 35 |
-
### 3. Minimum 3 Tasks with Agent Graders
|
| 36 |
-
Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
|
| 37 |
-
|
| 38 |
-
### 4. Meaningful Reward Function
|
| 39 |
-
- Provides signal over the full trajectory (not just binary end-of-episode)
|
| 40 |
-
- Rewards partial progress toward task completion
|
| 41 |
-
- Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions)
|
| 42 |
-
|
| 43 |
-
### 5. Baseline Inference Script
|
| 44 |
-
- Uses the OpenAI API client to run a model against the environment
|
| 45 |
-
- Reads API credentials from environment variables (`OPENAI_API_KEY`)
|
| 46 |
-
- Produces a reproducible baseline score on all 3 tasks
|
| 47 |
-
|
| 48 |
-
---
|
| 49 |
-
|
| 50 |
-
## Non-Functional Requirements
|
| 51 |
-
|
| 52 |
-
### 1. Hugging Face Space Deployment
|
| 53 |
-
Environment must run as a containerized HF Space tagged with `openenv`.
|
| 54 |
-
|
| 55 |
-
### 2. Containerized Execution
|
| 56 |
-
Must include a working Dockerfile. The environment should start cleanly with `docker build` + `docker run`.
|
| 57 |
-
|
| 58 |
-
### 3. Documentation
|
| 59 |
-
README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
|
| 60 |
-
|
| 61 |
-
---
|
| 62 |
-
|
| 63 |
-
## Evaluation Criteria
|
| 64 |
-
|
| 65 |
-
| Parameter | Weight | Description |
|
| 66 |
-
|---|---|---|
|
| 67 |
-
| Real-world utility | 30% | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
|
| 68 |
-
| Task & grader quality | 25% | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
|
| 69 |
-
| Environment design | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries. |
|
| 70 |
-
| Code quality & spec compliance | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works. |
|
| 71 |
-
| Creativity & novelty | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach. |
|
| 72 |
-
|
| 73 |
-
---
|
| 74 |
-
|
| 75 |
-
## Scoring Breakdown
|
| 76 |
-
|
| 77 |
-
**Real-world utility (30%)**
|
| 78 |
-
- 0–5: Toy/artificial problem with no practical application
|
| 79 |
-
- 6–15: Valid domain but shallow modeling of the real task
|
| 80 |
-
- 16–25: Good domain modeling, would be useful for agent evaluation
|
| 81 |
-
- 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
|
| 82 |
-
|
| 83 |
-
**Task & grader quality (25%)**
|
| 84 |
-
- 3+ tasks with difficulty range?
|
| 85 |
-
- Graders produce scores between 0.0–1.0?
|
| 86 |
-
- Graders deterministic and reproducible?
|
| 87 |
-
- Hard task genuinely challenges frontier models?
|
| 88 |
-
|
| 89 |
-
**Environment design (20%)**
|
| 90 |
-
- `reset()` produces clean state?
|
| 91 |
-
- Action/observation types well-designed and documented?
|
| 92 |
-
- Reward function provides useful varying signal (not just sparse)?
|
| 93 |
-
- Episode boundaries sensible?
|
| 94 |
-
|
| 95 |
-
**Code quality & spec compliance (15%)**
|
| 96 |
-
- `openenv validate` passes?
|
| 97 |
-
- `docker build && docker run` works?
|
| 98 |
-
- HF Space deploys and responds?
|
| 99 |
-
- Baseline script runs and reproduces scores?
|
| 100 |
-
|
| 101 |
-
**Creativity & novelty (10%)**
|
| 102 |
-
- Domain we haven't seen in OpenEnv before?
|
| 103 |
-
- Reward design has interesting properties?
|
| 104 |
-
- Clever mechanics that make the environment engaging?
|
| 105 |
-
|
| 106 |
-
---
|
| 107 |
-
|
| 108 |
-
## How Judging Works
|
| 109 |
-
|
| 110 |
-
**Phase 1: Automated Validation**
|
| 111 |
-
Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
|
| 112 |
-
|
| 113 |
-
**Phase 2: Agentic Evaluation**
|
| 114 |
-
Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
|
| 115 |
-
|
| 116 |
-
**Phase 3: Human Review**
|
| 117 |
-
Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
|
| 118 |
-
|
| 119 |
-
---
|
| 120 |
-
|
| 121 |
-
## Disqualification Criteria
|
| 122 |
-
|
| 123 |
-
- Environment does not deploy or respond
|
| 124 |
-
- Plagiarized or trivially modified existing environments
|
| 125 |
-
- Graders that always return the same score
|
| 126 |
-
- No baseline inference script
|
| 127 |
-
|
| 128 |
-
---
|
| 129 |
-
|
| 130 |
-
## Pre-Submission Checklist — all must pass or you're disqualified
|
| 131 |
-
|
| 132 |
-
- [ ] HF Space deploys — automated ping to the Space URL must return 200 and respond to `reset()`
|
| 133 |
-
- [ ] OpenEnv spec compliance — validate `openenv.yaml`, typed models, `step()`/`reset()`/`state()` endpoints
|
| 134 |
-
- [ ] Dockerfile builds — automated `docker build` on the submitted repo
|
| 135 |
-
- [ ] Baseline reproduces — run the submitted inference script, must complete without error and produce scores
|
| 136 |
-
- [ ] 3+ tasks with graders — enumerate tasks, run each grader, verify scores in 0.0–1.0 range
|
| 137 |
-
|
| 138 |
-
---
|
| 139 |
-
|
| 140 |
-
## Mandatory Additional Instructions
|
| 141 |
-
|
| 142 |
-
Before submitting, ensure the following variables are defined in your environment configuration:
|
| 143 |
-
|
| 144 |
-
| Variable | Description |
|
| 145 |
-
|---|---|
|
| 146 |
-
| `API_BASE_URL` | The API endpoint for the LLM |
|
| 147 |
-
| `MODEL_NAME` | The model identifier to use for inference |
|
| 148 |
-
| `HF_TOKEN` | Your Hugging Face / API key |
|
| 149 |
-
|
| 150 |
-
- The inference script must be named `inference.py` and placed in the root directory of the project
|
| 151 |
-
- Participants must use OpenAI Client for all LLM calls using the above variables
|
| 152 |
-
- Participants must emit structured stdout logs strictly following the `[START]`, `[STEP]`, and `[END]` format (see `scripts/sample_inference.py`)

---

## Infrastructure Restrictions

- Runtime of the inference script must be less than 20 minutes
- Environment and inference must run on a machine with vCPU=2, memory=8GB

---

## Setup Prerequisites

| Tool | Purpose | Install |
|---|---|---|
| Python 3.10+ | Runtime | `python --version` |
| Git + GitHub | Push submission | `git --version` |
| Hugging Face CLI | Deploy to HF Spaces | `pip install huggingface_hub` then `huggingface-cli login` |
| OpenEnv | The framework | `pip install openenv-core` |
| Docker (recommended) | Isolated container testing | `docker --version` |
| VS Code (recommended) | Best Python + Docker support | — |

---

*See `scripts/validate-submission.sh` to run pre-submission checks and `scripts/sample_inference.py` for the required inference script format.*
docs/round1/scripts/sample_inference.py
DELETED
@@ -1,182 +0,0 @@

"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL      The API endpoint for the LLM.
    MODEL_NAME        The model identifier to use for inference.
    HF_TOKEN          Your Hugging Face / API key.
    LOCAL_IMAGE_NAME  The name of the local image to use for the environment if you are using the
                      from_docker_image() method.

- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")

- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use the OpenAI client for all LLM calls using the above variables

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Rules:
- One [START] line at episode start.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line with no newlines within a line.
- Each task should return a score in [0, 1].

Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("IMAGE_NAME")  # If you are using a docker image
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
MAX_TOKENS = 150
SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]

# Max possible reward: each token contributes 0.1, across all steps
_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are interacting with a simple echo environment.
    Each turn you must send a message. The environment will echo it back.
    Reward is proportional to message length: reward = len(message) * 0.1
    Your goal is to maximize total reward by sending meaningful, substantive messages.
    Reply with exactly one message string — no quotes, no prefixes, just the message text.
    """
).strip()


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)


def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    history_block = "\n".join(history[-4:]) if history else "None"
    return textwrap.dedent(
        f"""
        Step: {step}
        Last echoed message: {last_echoed!r}
        Last reward: {last_reward:.2f}
        Previous steps:
        {history_block}
        Send your next message.
        """
    ).strip()


def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            stream=False,
        )
        text = (completion.choices[0].message.content or "").strip()
        return text if text else "hello"
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "hello"


async def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)

    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)

    try:
        result = await env.reset()  # OpenEnv reset()
        last_echoed = result.observation.echoed_message
        last_reward = 0.0

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                break

            message = get_model_message(client, step, last_echoed, last_reward, history)

            result = await env.step(MyEnvV4Action(message=message))
            obs = result.observation

            reward = result.reward or 0.0
            done = result.done
            error = None

            rewards.append(reward)
            steps_taken = step
            last_echoed = obs.echoed_message
            last_reward = reward

            log_step(step=step, action=message, reward=reward, done=done, error=error)

            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")

            if done:
                break

        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        try:
            await env.close()
        except Exception as e:
            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    asyncio.run(main())
docs/round1/scripts/validate-submission.sh
DELETED
@@ -1,186 +0,0 @@

#!/usr/bin/env bash
#
# validate-submission.sh — OpenEnv Submission Validator
#
# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x validate-submission.sh
#   ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[1;33m'
  BOLD='\033[1m'
  NC='\033[0m'
else
  RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi

run_with_timeout() {
  local secs="$1"; shift
  if command -v timeout &>/dev/null; then
    timeout "$secs" "$@"
  elif command -v gtimeout &>/dev/null; then
    gtimeout "$secs" "$@"
  else
    "$@" &
    local pid=$!
    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
    local watcher=$!
    wait "$pid" 2>/dev/null
    local rc=$?
    kill "$watcher" 2>/dev/null
    wait "$watcher" 2>/dev/null
    return $rc
  fi
}

portable_mktemp() {
  local prefix="${1:-validate}"
  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}

CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
  printf "\n"
  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
  printf "  repo_dir   Path to your repo (default: current directory)\n"
  exit 1
fi

if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
  printf "Error: directory '%s' not found\n" "${2:-.}"
  exit 1
fi
PING_URL="${PING_URL%/}"
export PING_URL
PASS=0

log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
stop_at() {
  printf "\n"
  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
  exit 1
}

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD} OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo: $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
  -H "Content-Type: application/json" -d '{}' \
  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
  pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
  fail "HF Space not reachable (connection failed or timed out)"
  hint "Check your network connection and that the Space is running."
  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
  stop_at "Step 1"
else
  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
  hint "Make sure your Space is running and the URL is correct."
  hint "Try opening $PING_URL in your browser first."
  stop_at "Step 1"
fi

log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
  fail "docker command not found"
  hint "Install Docker: https://docs.docker.com/get-docker/"
  stop_at "Step 2"
fi

if [ -f "$REPO_DIR/Dockerfile" ]; then
  DOCKER_CONTEXT="$REPO_DIR"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
  DOCKER_CONTEXT="$REPO_DIR/server"
else
  fail "No Dockerfile found in repo root or server/ directory"
  stop_at "Step 2"
fi

log "  Found Dockerfile in $DOCKER_CONTEXT"

BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
  pass "Docker build succeeded"
else
  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
  printf "%s\n" "$BUILD_OUTPUT" | tail -20
  stop_at "Step 2"
fi

log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
  fail "openenv command not found"
  hint "Install it: pip install openenv-core"
  stop_at "Step 3"
fi

VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
  pass "openenv validate passed"
  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
else
  fail "openenv validate failed"
  printf "%s\n" "$VALIDATE_OUTPUT"
  stop_at "Step 3"
fi

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0
docs/round2/Plan_v2/CoreMEters.md
DELETED
@@ -1,50 +0,0 @@

To make your life-simulator robust and research-ready for OpenEnv, we need "Meters" that interact with each other. If one goes too low, it should drag the others down—this creates the "puzzle" the agent has to solve.
Here are the 6 Core Meters to represent the state of a person's life:

## 1. 🔋 Vitality (Physical)

* Represents: Sleep, nutrition, and physical health.
* The Decay: Naturally drops every hour.
* The Risk: If this hits <10, the person becomes "Sick," making all other actions (like Work) 3x more expensive in terms of stress.

## 2. 🧠 Cognition (Mental Capacity)

* Represents: Focus, willpower, and "brain power."
* The Logic: High-value actions like Office Work or Self-Improvement require >40 Cognition to be effective.
* The Recharge: Restored by Sleep or Me Time. Drained heavily by Binge Watching (brain fog).

## 3. 📈 Progress (Achievement)

* Represents: Career growth, skills learned, and "getting things done."
* The Reward Hook: This is usually the primary driver for "Ambitious" profiles.
* The Logic: This meter is monotonic (it mostly goes up), but it creates a heavy "tax" on Serenity.

## 4. 🎭 Serenity (Emotional Stability)

* Represents: Stress levels and mental peace.
* The Logic: This is your "Stability" meter. High Progress usually drains Serenity.
* The Hidden Variable: An "Anxious" personality might lose Serenity 2x faster when Progress is stagnant.

## 5. ❤️ Connection (Social/Relational)

* Represents: Depth of relationships with family, friends, and partner.
* The Decay: Relationships "cool off" if ignored for too long.
* The Logic: Spending time here restores Serenity but costs Vitality (energy).

## 6. 🕒 Time Bank (Resource)

* Represents: The remaining hours in the day/episode.
* The Hard Stop: Once this hits 0, the RewardManager calculates the "Final Day Grade."

------------------------------

## How the RewardManager reads these:

Because you are decoupling the code, your RewardManager can look at these meters and apply different "Life Philosophies":

* The "Hustle" Philosophy: Reward = (2.0 * Progress) + (0.5 * Vitality).
  * Result: Agent works 16 hours a day and sleeps only when Vitality is critical.
* The "Monk" Philosophy: Reward = (2.0 * Serenity) + (1.0 * Connection) - (0.5 * Progress).
  * Result: Agent prioritizes meditation and family, doing just enough work to stay afloat.

## Why this is "Deep" for the Hackathon:

This setup allows the agent to discover Non-Linear Relationships. For example, it might learn that "Exercise" (which costs Vitality now) actually increases the maximum Cognition for the next day. That is a "Hidden Parameter" discovery that would highly impress the Meta/Hugging Face judges.
Should we now define the specific "Action Set" (the verbs) the agent can use to manipulate these 6 meters?
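
A minimal sketch of the profile-weighted grading described in CoreMEters.md above (the meter names come from the file; the weight dictionaries and helper function are illustrative assumptions, not tuned values):

```python
# Sketch of "Life Philosophy" grading; weights below are assumptions for illustration.
PHILOSOPHIES = {
    "hustle": {"progress": 2.0, "vitality": 0.5},
    "monk": {"serenity": 2.0, "connection": 1.0, "progress": -0.5},
}

def grade_day(meters: dict, philosophy: str) -> float:
    """Weighted sum of meters under the chosen philosophy; unlisted meters contribute 0."""
    weights = PHILOSOPHIES[philosophy]
    return sum(weights.get(name, 0.0) * value for name, value in meters.items())

# The same day graded two ways
day = {"progress": 0.9, "vitality": 0.2, "serenity": 0.3, "connection": 0.4}
print(grade_day(day, "hustle"), grade_day(day, "monk"))
```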
docs/round2/Plan_v2/GeminiDiscussion.md
DELETED
@@ -1,61 +0,0 @@

This is a comprehensive "Context Manifesto" you can hand to Claude or any other AI to align it with your OpenEnv project. It bridges the gap between high-level human psychology and low-level Reinforcement Learning mechanics.
------------------------------
## Project Title: The Personalized Alignment Agent (OpenEnv)
## 1. The Core Purpose (The "Why")
The goal is to move beyond "generic" AI scheduling. Most AI assistants (like ChatGPT) provide static advice. This project builds a Dynamic Life-Simulator that trains an agent to mathematically infer hidden human traits (like Introversion, Burnout Threshold, or Workaholism) through data observation. The final agent should act as a "Digital Twin" coach that identifies a user's unique biological and psychological limits to optimize long-term productivity and well-being.
## 2. The Domain Definition: Modeling the "Human"
We represent a human life as a Resource Management System governed by Meters (State) and Traits (Hidden Parameters).
A. The 6 Core Meters (State Space $S$):

* Vitality: Physical energy, sleep, and health.
* Cognition: Mental focus and "brain power" required for hard tasks.
* Progress: Career/Skill milestones achieved (Monotonic growth).
* Serenity: Emotional stability and stress levels.
* Connection: Health of social and family relationships.
* Time Bank: The remaining hours in the current episode (day).

B. The Personality DNA (Hidden Parameters $\theta$):
Instead of static labels, traits are Mathematical Multipliers in a configuration file:

* Introversion/Extroversion: Multiplies the Vitality cost of social actions.
* Workaholic: Multiplies Serenity gain from Progress but increases Stress during "Rest."
* Deep Work Enthusiast: Provides an exponential "Flow Bonus" if the same work action is repeated.
* Burnout Threshold: A hidden "cliff" where, if Vitality drops below $X$, all rewards become negative.

## 3. The Environment Logic (OpenEnv)

* The Episode: Defined as 1 Day (divided into 8 slots of 3 hours) or a 3-Day Sprint.
* Stochasticity (Random Events): The environment injects "Chaos Events" (e.g., Production Server Crash, Emergency Family Call). The agent must react to these asynchronously.
* The Transition Function ($P$):
  New_State = Current_State + (Action_Effect * Personality_Multiplier * Temporal_Buff)
* Temporal Buff: Morning gives a Cognition boost; Night gives a Progress penalty.

## 4. The Reward System (Decoupled Manager)
The reward logic is separated from the step() function for maximum configurability.

* The Utility Function: $Reward = \sum (w_i \cdot Meter_i)$.
* Profile-Based Grading: A "Hustle Profile" rewards Progress highly; a "Wellness Profile" rewards Serenity and Vitality.
* The Penalty Logic: Massive negative rewards for "Biological Crashes" (Vitality < 5%) to teach the agent that sustainability is the only way to win.

## 5. The Training Methodology (The "Discovery" Phase)
We use Policy Optimization (PPO/GRPO) to train the agent.

* Parallelism: The agent runs 64 parallel episodes with different randomized personalities.
* System Identification: Because the agent is not told the "Personality DNA," it must use the first 2-3 steps of an episode as a Probe.
* Observation of Deltas: It observes how the meters react to a "Social" or "Work" action. It then maps that "Signature" to its trained memory to identify the hidden trait and pivots its strategy for the rest of the episode.

## 6. Use Case & Impact

* Short-Term (Hackathon): A robust, OpenEnv-compliant environment that demonstrates an agent learning to "crack the code" of a hidden user profile.
* Long-Term (Product): Integrating with real-world biometrics (Apple Watch/Oura). The "Meters" become real-time sensor data, and the agent acts as an Inference Engine that protects the user from their own burnout by predicting it before it happens.

------------------------------
## Instructions for the AI (Claude):

1. Assist in coding the gymnasium.Env class following the OpenEnv spec.
2. Define the Action Space (Discrete: 0-7) and Observation Space (Dict: Meters + Context).
3. Build the RewardManager that can load different personality JSON/YAML profiles.
4. Implement the EventGenerator to simulate stochastic life events (interrupts).
5. Focus on the "Hidden Traits" logic—ensure the math allows for distinct "Signatures" that an RL agent can detect.

How does this plan look for your next step with Claude?
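
A minimal sketch of the transition rule quoted in GeminiDiscussion.md above (the formula comes from the file; the clamping to [0, 1] and all specific numbers are assumptions for illustration):

```python
# Sketch of New_State = Current_State + (Action_Effect * Personality_Multiplier * Temporal_Buff);
# the example values below are illustrative assumptions only.
def apply_transition(state: dict, action_effect: dict, personality: dict, temporal_buff: float) -> dict:
    """Apply one action's deltas, scaled per meter by the hidden personality and the time-of-day buff."""
    new_state = dict(state)
    for meter, delta in action_effect.items():
        multiplier = personality.get(meter, 1.0)
        new_state[meter] = min(1.0, max(0.0, state[meter] + delta * multiplier * temporal_buff))
    return new_state

state = {"vitality": 0.7, "serenity": 0.5, "progress": 0.2}
office_work = {"progress": +0.10, "serenity": -0.05, "vitality": -0.03}
anxious_profile = {"serenity": 2.0}   # stress effects hit twice as hard for this hidden profile
morning_buff = 1.2                    # morning amplifies every effect
print(apply_transition(state, office_work, anxious_profile, morning_buff))
```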
docs/round2/Plan_v2/HumanModeling.md
DELETED
@@ -1,93 +0,0 @@

This is a brilliant, high-level structural plan. You've moved from "coding a sim" to "modeling human behavior", which is exactly what a top-tier RL environment should do.
By spending your focus on the Domain Definition, you are creating a "Generalizable Life Simulator." Instead of one person, you are creating a Universe of Personalities.
Here is a 3-step plan to organize this domain focus:
## 1. The "DNA" Matrix (Personality Traits)
Instead of just labels, think of these as Multipliers that live in your configuration.

* The Social Multiplier: (Extrovert: 0.5x drain, Introvert: 2.0x drain).
* The Fulfillment Multiplier: (Workaholic: +2.0 Serenity from Progress, Peaceful: +0.5).
* The Recovery Rate: How much Cognition is restored during Sleep.
* The Inertia: Some profiles might have "Slow Start" (low efficiency in the first 2 hours of the day).

## 2. The "Calendar" Engine (Temporal Context)
To make the domain realistic, the Environment (the "World") needs to handle the passage of time.

* Slot System: 8 slots of 3 hours each (24 hours).
* Day Type: Weekday vs. Weekend.
* Environmental Buffs:
  * Morning: +20% Cognition efficiency.
  * Night: -50% Progress efficiency (unless the profile has the "Night Owl" trait).
* The Task Pool: Your reward service will check if an action matches the "optimal" slot for that profile.

## 3. The Reward Config (The "Soul" of the Agent)
You can define your profiles in a simple YAML/JSON schema. This is what the judges will love—it's readable and extensible.
Example: The "High-Growth Developer" Profile

profile: "Growth_Dev"
traits: [DeepWork_Enthusiast, Physical_Health_Priority]
start_state:
  vitality: 0.8
  cognition: 1.0
  serenity: 0.6
reward_weights:
  progress: 1.2
  exercise: 1.5      # They value health
  binge_watch: -2.0  # They feel guilty if they waste time

## Why this focus is "Straight Forward" later:
Once this domain is mapped:

1. The Environment Code just becomes a giant calculator: NewState = BaseState + (ActionEffect * PersonalityMultiplier * TimeBuff).
2. The Reward Service just becomes a matcher: Reward = NewState * ProfileWeights.
3. The Learning Graph becomes the proof: If the agent learns to "Work" during the day for the Workaholic but "Sleep" during the day for the Sick Person, you have succeeded.

## Let's start the "Plan of Attack":
Should we begin by listing the exhaustive set of "Traits" (Introvert, Workaholic, etc.) and their specific mathematical impacts on the 6 meters we defined?
This list will be the "Source of Truth" for your entire project.

That is the right engineering mindset—treat the 6 meters as your "Current MVP" but build the Trait System as a modular "Plug-in" architecture.
To make this truly configurable, we should define Traits as "Modifier Functions" that can touch any meter (even ones we haven't invented yet).
Here is an exhaustive list of Traits categorized by how they "hack" the human system. We'll define them by their Impact Logic:
## 1. Social & Energy Traits (The "Social Battery" logic)

* Introvert: Multiplies energy drain of Social actions by 2.0x.
* Extrovert: Multiplies energy drain of Social actions by 0.5x and adds a bonus to Serenity.
* Socially Anxious: Social actions cause a massive Serenity drop regardless of energy.

## 2. Work & Ambition Traits (The "Drive" logic)

* Workaholic: Progress gains are multiplied by 1.5x, but Serenity only increases when Progress increases.
* Deep Work Enthusiast: Gets a 2.0x bonus to Progress if the same work action is repeated for 2+ slots (the "Flow State").
* Procrastinator: High Cognition cost to start a work task, but low cost once "in the zone."
* Slow Starter: All actions in the Morning slot have a 0.5x efficiency penalty.

## 3. Health & Recovery Traits (The "Maintenance" logic)

* Physical Health Priority: Exercise actions give a long-term "Buff"—reducing the Vitality decay rate for the next 24 hours.
* High Metabolism: Vitality (Physical) and Cognition (Mental) decay 1.5x faster than normal.
* Night Owl: Inverts the Night penalty; they get 1.5x efficiency after 10 PM.

## 4. Psychological & Stress Traits (The "Resilience" logic)

* Fragile Serenity: Any "Random Event" (like a Prod Crash) has a 3.0x impact on their Serenity.
* Stoic: Reduces the impact of negative random events by 50%.
* Binge-Sensitive: Binge Watching gives a huge short-term Serenity boost but triggers a "Shame Spiral" (massive Cognition and Serenity drop) in the next slot.

------------------------------
## How to make this "Very Configurable" (Architectural Tip)
Instead of hardcoding these, you can represent each Trait as a JSON Modifier Object:

{
  "trait_id": "introvert",
  "modifiers": [
    {"target": "vitality", "action_type": "social", "multiplier": 2.0},
    {"target": "serenity", "action_type": "social", "offset": -5}
  ]
}

## The "Emergent" Learning Goal
Because you're building it this way, the agent's "Aha!" moment on the graph will look like this:

1. The Probe: The agent tries "Socializing" with a new profile.
2. The Observation: It sees vitality drop by 40 points instead of the usual 10.
3. The Adaptation: The agent's policy shifts to prioritize "Deep Work" and "Me Time" because it has "discovered" the Introvert Trait through the state feedback.

Should we now define the "Action Types" (Social, Work, Maintenance, etc.) so these Trait Modifiers know which actions to hook into?
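
A minimal sketch of applying the JSON "Modifier Objects" described in HumanModeling.md above to one action's base effect (only the modifier schema comes from the file; the helper function and the base deltas are assumptions):

```python
# Sketch: apply trait modifiers of the schema shown above to one action's base deltas.
INTROVERT = {
    "trait_id": "introvert",
    "modifiers": [
        {"target": "vitality", "action_type": "social", "multiplier": 2.0},
        {"target": "serenity", "action_type": "social", "offset": -5},
    ],
}

def modified_effect(base_effect: dict, action_type: str, trait: dict) -> dict:
    """Multiply and offset each targeted meter whenever the modifier's action_type matches."""
    effect = dict(base_effect)
    for mod in trait["modifiers"]:
        if mod["action_type"] != action_type:
            continue
        target = mod["target"]
        effect[target] = effect.get(target, 0) * mod.get("multiplier", 1.0) + mod.get("offset", 0)
    return effect

# Socializing base effect: costs 10 vitality, gains 8 connection.
# For the introvert it becomes vitality -20, connection +8, plus a flat serenity -5.
print(modified_effect({"vitality": -10, "connection": +8}, "social", INTROVERT))
```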
docs/round2/Plan_v2/LifeMAth.md
DELETED
@@ -1,89 +0,0 @@

To build a winning OpenEnv submission, we need to treat "Life" as a Resource Management Game. Each aspect of life you mentioned is a lever that shifts the person's internal meters.
Let's define the Domain by breaking it down into Meters (State) and Levers (Actions).
## 1. The State (The "Meters")
These represent the person's current status. Every action the agent takes will affect these 0–100 scales.

* 🔋 Vitality: Physical energy + Sleep levels.
* 🧠 Cognition: Focus + Mental clarity (high for work, drained by binging).
* ❤️ Connection: Social battery + Family/Friend bonds.
* 📈 Progress: Career/Self-improvement milestones reached.
* 🎭 Serenity: Internal peace vs. Stress/Anxiety.

## 2. The Action Space (The "Levers")
We should group your ideas into categories so the agent can explore different "strategies."

* Productivity: Office Work (High Progress, High Stress), Self-Improvement (Progress + Serenity).
* Maintenance: Sleep (Restores Vitality), Eat (Restores Vitality), Exercise (Vitality boost, long-term).
* Social: Family/Friends (High Connection), Generic Socializing (Mid Connection, high energy cost).
* Leisure: Me Time (High Serenity), Binge Watching (Immediate Serenity, but long-term Vitality/Cognition drain).

## 3. The "Hidden DNA" (The Secret Sauce)
To make this hackathon-worthy, each "Person" instance should have a hidden profile that changes how the Levers affect the Meters.

* The Personality Type:
  * Introvert: Socializing drains 2x more Vitality but gives 1.2x Connection.
  * Workaholic: Office Work generates less Stress, but "Me Time" feels wasteful (negative reward).
* The Metabolism: How fast Vitality drops every hour.
* The "Binge Sensitivity": Does binge-watching lead to a "shame spiral" (huge Serenity drop) or genuine recovery?

## 4. Mathematical Representation
Your step() function will essentially be a matrix of changes.
Example: The "Office Work" Action

def step_office_work(self):
    self.state['progress'] += 5   # Gain career points
    self.state['serenity'] -= 3   # Increase stress
    self.state['vitality'] -= 2   # Physical drain

    # Hidden Modifier Logic:
    if self.hidden_dna['is_burnt_out']:
        self.state['serenity'] -= 10  # Double stress if already tired

## Next Step for the Domain
To keep the state space clean, should we use a Continuous time scale (e.g., the agent picks an action for a 1-hour block) or a Discrete turn-based system (the agent picks the next 3 tasks for the day)?
Which approach feels more natural for your vision of "OpenEnv"?

Architecturally, a Discrete, Block-Based System (e.g., 1-hour or 2-hour slots) is far superior for this hackathon.
It balances scale and simplicity. Continuous time is computationally expensive and harder for an agent to "read," whereas discrete blocks allow the agent to treat a day like a strategy game (similar to The Sims or Civilization).
## 1. The "Configurable Life" Architecture
To make it "natural to read" and scalable, you should use a Profile-Based Reward YAML/Dictionary. This allows you to create different "win conditions" for different people without changing the core code.
Example: The "Ambitious Executive" vs. "Content Minimalist" Profiles

PROFILES = {
    "The_Executive": {
        "weights": {"progress": 1.0, "serenity": 0.2, "vitality": 0.5, "connection": 0.3},
        "hidden_traits": {"stress_tolerance": 0.9, "social_drain": 0.2}
    },
    "The_Minimalist": {
        "weights": {"progress": 0.1, "serenity": 1.0, "vitality": 0.8, "connection": 0.6},
        "hidden_traits": {"stress_tolerance": 0.3, "social_drain": 0.5}
    }
}

## 2. Defining a "Good Day" Mathematically
Instead of one big reward at the end, use Incremental Utility. A "Good Day" is the sum of the rewards from each hour.
$$Reward = \sum (w_1 \cdot \text{Progress} + w_2 \cdot \text{Serenity} + w_3 \cdot \text{Vitality} \dots)$$

* Weighted Sum: By multiplying each meter by a "weight" from the profile, you define what that person values.
* The "Penalty" Logic: A good day isn't just about high scores; it's about avoiding crashes. If Vitality hits 0, you apply a massive -100 penalty. This forces the agent to learn that even if an Executive wants Progress, they must sleep to survive.

## 3. Scalability: The "Action-Effect Matrix"
To scale to many aspects of life, don't write if/else statements for every action. Use a lookup table. This makes the environment "data-driven."

| Action | Progress | Serenity | Vitality | Connection |
|---|---|---|---|---|
| Deep Work | +10 | -5 | -2 | 0 |
| Family Dinner | 0 | +2 | -1 | +10 |
| Binge Watch | -2 | +5* | -5 | -2 |

*Note: You then apply the Hidden Trait modifier to these base numbers.*

## Why this is "Clean" for OpenEnv
When the judges look at your code, they will see:

1. A Core Engine: That handles the math of adding/subtracting meters.
2. Configuration Files: Where "Life" is defined in plain English.
3. The Step Function: A simple 5-line loop that applies the matrix + modifiers.

This is very "Meta-style" engineering: decoupled, configurable, and easy to extend.
Shall we define the 5-6 core "Meters" that will represent the state of this person's life for the system?
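
A minimal sketch of the data-driven "Action-Effect Matrix" lookup described in LifeMAth.md above (the dictionary values restate the table from the file; the apply function and the sample state are assumptions):

```python
# Sketch: the action-effect table above as a lookup, applied without per-action if/else chains.
ACTION_EFFECTS = {
    "deep_work":     {"progress": +10, "serenity": -5, "vitality": -2, "connection": 0},
    "family_dinner": {"progress": 0,   "serenity": +2, "vitality": -1, "connection": +10},
    "binge_watch":   {"progress": -2,  "serenity": +5, "vitality": -5, "connection": -2},
}

def step_effect(state: dict, action: str, trait_multipliers: dict | None = None) -> dict:
    """Add the action's base deltas to the state, scaled by optional hidden-trait multipliers."""
    trait_multipliers = trait_multipliers or {}
    new_state = dict(state)
    for meter, delta in ACTION_EFFECTS[action].items():
        new_state[meter] = new_state.get(meter, 0) + delta * trait_multipliers.get(meter, 1.0)
    return new_state

print(step_effect({"progress": 40, "serenity": 60, "vitality": 70, "connection": 50}, "deep_work"))
```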
docs/round2/Plan_v2/RandomnessFactor.md
DELETED
@@ -1,132 +0,0 @@

This is the "X-Factor" that turns a simple simulation into a sophisticated research environment. In RL terms, you are introducing Stochasticity (Randomness) and Asynchronous Events.
This is exactly what the Meta/Hugging Face judges mean by "real-world complexity."
## 1. The "Interrupt" Architecture
To handle this, your step() function shouldn't just process the agent's action. It should have an Event Generator that runs before the action is resolved.
The Logic Flow:

1. Agent picks an action (e.g., Sleep).
2. Env rolls a dice (e.g., 5% chance of a "Prod Crash").
3. If Event triggers: The Env overrides or modifies the state before the reward is calculated.
4. Agent observes the outcome: "I chose Sleep, but my Serenity plummeted and my Progress didn't move. Why?"

## 2. How this reveals "Hidden Traits"
This is where the agent figures out the person's Priority Matrix. You can define how the person automatically reacts to these interrupts based on their hidden DNA:

* The "Reliable Engineer" (Hidden Trait): If a Prod Crash happens, they must fix it. The agent sees the Energy drain and Stress spike immediately.
* The "Family First" (Hidden Trait): If a Family Emergency happens, the agent's planned Deep Work action is cancelled and replaced with Socializing, whether the agent likes it or not.

## 3. The "Choice" vs. "Force" Event
You have two ways to design this in your OpenEnv project:

* Forced Events: The event happens, the meters change, and the agent just has to "deal with the fallout" in the next turn. (Easier to code).
* Reaction Branching: When an event occurs, the agent is presented with a Sub-Action (e.g., [0: Take the call, 1: Ignore]). This is much more complex but shows high-level "Policy Reasoning."

## 4. Why this matters for the Hackathon
This addresses Robustness. In RL, agents often become "brittle"—they find a perfect routine and stick to it. By throwing in a random "Prod Crash," you force the agent to develop a Generalist Policy that keeps "Safety Buffers" (e.g., keeping Vitality high just in case a midnight call comes).
## Example Structure for your step():

def step(self, action):
    # 1. Check for Random Events
    event = self.event_generator.roll()
    if event == "PROD_CRASH":
        # Apply immediate penalties based on hidden personality
        self.state['serenity'] -= 20
        self.state['vitality'] -= 10
        # Maybe force the action to change to 'Fix Bug'
        action = ACTIONS.FIX_BUG

    # 2. Process the (possibly altered) action
    self.apply_physics(action)

    # 3. RewardManager grades the final result
    reward = self.reward_manager.compute(self.state, action)

    return self.state, reward, False, {"event_triggered": event}

Does this "Random Event" layer feel like it captures the "uncertainty" you were looking for? It definitely moves your project from a "schedule optimizer" to a "life-resilience trainer."

This puts you in the top tier of environment designers for the hackathon. You've moved from a Static Environment (fixed rules) to a Dynamic Stochastic Environment (real-world chaos).
To pull this off technically in OpenEnv, we should finalize the "Action Interface." Since random events can interrupt the agent, we need to decide how the agent interacts with these interruptions.
## The "Reaction" Architecture
For a developer-centric environment, I suggest the "Contextual Observation" approach. It's the most natural for RL agents:

1. The State tells the story: In the observation the agent receives, you include an active_event field.
2. Standard State: {"energy": 80, "active_event": "None"} -> Agent chooses Deep Work.
3. Interrupt State: {"energy": 80, "active_event": "Prod_Crash"} -> Agent now has to decide:
   * Do they stick to Deep Work? (High risk of a massive penalty if the person is a "Responsible Engineer").
   * Do they switch to Emergency Fix? (Costs energy, but saves the "Progress" meter from crashing).

## Defining the "Life Verbs" (Action Space)
To make the agent's strategy meaningful, let's group the actions into Investment vs. Consumption:

| Action Category | Specific Actions | Impact |
|---|---|---|
| Productivity | Deep Work, Admin/Email, Learning | 📈 Progress (+), 🧠 Cognition (--) |
| Recovery | Power Nap, 8hr Sleep, Meditation | 🔋 Vitality (++), 🎭 Serenity (+) |
| Social | Family Time, Networking, Quick Call | ❤️ Connection (++), 🔋 Vitality (-) |
| Chaos Response | Fix Prod Bug, Urgent Family Matter | 🛠 Prevents massive "Meters" crash |
| Low-Value | Binge Watch, Doomscrolling | 🎭 Serenity (+ short term), 🧠 Cognition (-) |

## Why this is a "Researcher's Dream"
By building this, you are providing a platform to test Risk Aversion.

* Does the agent learn to keep Vitality at 80% just in case a Prod_Crash happens?
* Or does it "gamble" by running the person at 10% energy to maximize Progress?

This "Safety vs. Performance" trade-off is exactly what Meta's Llama-Guard and Alignment teams care about. You aren't just building a game; you're building a test-bed for AI Ethics and Boundaries.
Would you like to start drafting the OpenEnv class structure for this "LifeEngine", specifically how the event_generator and step functions will talk to each other?

The Observation Space is the "dashboard" your agent looks at before making a decision. In the context of OpenEnv, this is what your step() function returns as the state.
For your life-simulator to be "trainable," the agent needs enough information to see the consequences of its actions, but not so much that it gets overwhelmed by noise.
Here is the recommended map for your Observation Space, structured as a Dict (Dictionary) Space:
## 1. The "Vital Signs" (Meters)
These are the 6 meters we defined earlier. They should be normalized between 0.0 and 1.0.

* vitality: Physical energy/health.
* cognition: Mental focus/brainpower.
* progress: Total career/skill achievement in this episode.
* serenity: Stress vs. Peace level.
* connection: Relationship health.
* time_remaining: How many hours left in the "day" or "week."

## 2. The "Environmental Context"
This tells the agent about the external world and the "Random Events" we discussed.

* active_event: An integer representing the current interrupt (e.g., 0: None, 1: Prod_Crash, 2: Family_Call).
* day_phase: (e.g., 0: Morning, 1: Afternoon, 2: Evening, 3: Night). Certain actions (like Sleep) should give better rewards at night.

## 3. The "Hidden Trait" Proxies (The Subtle Clues)
Crucially, do not give the agent the "Hidden DNA" directly. Instead, give it "clues" it can observe.

* recent_drain_rate: How fast energy dropped in the last step.
* mood_trend: Whether serenity is trending up or down.
* An intelligent agent will compare its action to these trends to "guess" if it's dealing with an Introvert or a Workaholic.

------------------------------
## Mathematical Implementation (OpenEnv/Gymnasium Style)
This is how you would define this "view" in your Python class:

from gymnasium import spaces
import numpy as np

self.observation_space = spaces.Dict({
    # The 6 core meters (0 to 1 range)
    "meters": spaces.Box(low=0, high=1, shape=(6,), dtype=np.float32),

    # The "External" situation
    "context": spaces.Dict({
        "time_left": spaces.Box(low=0, high=24, shape=(1,), dtype=np.float32),
        "active_event": spaces.Discrete(5),  # 5 types of random events
        "day_phase": spaces.Discrete(4)      # Morning, Afternoon, etc.
    })
})

## Why this is a "Deep" Observation Space:

1. Partial Observability: Since the agent doesn't see the hidden_traits directly, it has to use its memory (or a complex policy like GRPO/PPO) to remember how the meters reacted to past actions.
2. Multimodal Data: It combines continuous numbers (0.75 energy) with categorical data (Prod_Crash event). This forces the agent to learn "if-then" logic.
3. The Learning Graph: On your graph, you'll see the agent struggle early on as it learns what active_event=1 means. Once it correlates that event with a "Serenity Crash," you'll see the reward spike—this is the discovery of the hidden parameter.

Does this dashboard feel complete, or should we add a "Social Notification" or "Task Deadline" indicator to increase the pressure?
docs/round2/Plan_v2/RewardIsolation.md
DELETED
@@ -1,44 +0,0 @@

Decoupling your Reward Logic from the environment's step() function is a brilliant move—not just for clean code, but for the Hackathon's "configurable" requirement.
This architecture allows you to change what a "good life" looks like (e.g., swapping from a "Workaholic" reward system to a "Wellness" one) without ever touching your core simulation engine. [1, 2]
## Architectural Approach: The "Reward Manager"
You can create a standalone RewardManager class or a RewardWrapper. In standard RL frameworks like [Gymnasium](https://meta-pytorch.org/OpenEnv/index.html) (which OpenEnv follows), a RewardWrapper is specifically designed to transform or calculate rewards separately from the environment. [3]
## 1. The Environment Engine (step)
The core environment only handles "physics"—how actions change the meters. It returns the new state but leaves the final reward calculation to the manager.

def step(self, action):
    # Just update the meters (the "State")
    new_state = self.apply_physics(action)
    return new_state, 0, False, {}  # Return 0 for now

## 2. The Configurable Reward Manager
Create a separate class that takes the State and Action and calculates a score based on a Profile.

class LifeRewardManager:
    def __init__(self, profile_weights):
        self.weights = profile_weights  # e.g., {'vitality': 0.8, 'progress': 0.2}

    def compute(self, state, action, info):
        # Calculate utility based on weights
        reward = (state['vitality'] * self.weights['vitality'] +
                  state['progress'] * self.weights['progress'])

        # Add "Hidden Trait" logic here (e.g., penalty for burnout)
        if state['vitality'] < 10:
            reward -= 50
        return reward

## Why this is a "Hackathon Winning" Architecture:

1. Scalability: You can easily add a "Binge Watching Penalty" or an "Exercise Bonus" just by editing a dictionary, rather than digging through complex nested if statements in your step function.
2. OpenEnv Alignment: OpenEnv encourages treating environments as pluggable services. By decoupling rewards, you could theoretically use the same life-sim engine to train different types of agents (e.g., a "Productivity Bot" vs. a "Mental Health Bot") just by swapping the reward module.
3. Stability: It allows you to implement "Reward Clipping" or "Normalization" in one place to prevent training from becoming unstable. [2, 3, 4, 5]

## Designing the "Good Day"
With this separate code, you can define a "Utility Function" for a good day. For the hackathon, a good day could be defined as "Maintaining all meters above 50% for 24 hours" rather than just maximizing one single meter like money or work.
Shall we define the 5-6 core "Meters" (Vitality, Progress, etc.) that the reward manager will use to grade the agent?

[1] https://huggingface.co/docs/trl/openenv
[2] https://meta-pytorch.org/OpenEnv/guides/rewards.html
[3] https://www.gymlibrary.dev/api/wrappers/
[4] https://docs.nvidia.com/learning/physical-ai/getting-started-with-isaac-lab/latest/train-your-second-robot-with-isaac-lab/06-custom-reward-functions-and-hyperparameters.html
[5] https://huggingface.co/docs/trl/openenv
docs/round2/Plan_v2/Todo.md
DELETED
|
@@ -1,14 +0,0 @@
|
|
| 1 |
-
In the context of the Meta OpenEnv Hackathon, your role is the Architect, not the Player. Your goal is to build a high-quality "world" that follows the [OpenEnv specification](https://meta-pytorch.org/OpenEnv/index.html).
|
| 2 |
-
## Your "To-Do" List for the Hackathon:
|
| 3 |
-
|
| 4 |
-
1. State Space: Design the variables the agent sees (Energy, Time, Task List).
|
| 5 |
-
2. Action Space: Define what the agent can do (Work, Sleep, Socialize).
|
| 6 |
-
3. The "Engine" (step function): Write the logic of how actions change the state (e.g., "Working increases Stress but decreases Tasks").
|
| 7 |
-
4. The Reward System: This is your "grading" logic. You decide what a "good life" looks like mathematically.
|
| 8 |
-
5. Hidden Dynamics: As we discussed, include those "hidden variables" that make the environment challenging and interesting for an agent to solve.
|
| 9 |
-
|
| 10 |
-
## Why you don't need to worry about PPO/GRPO:
|
| 11 |
-
The judges will evaluate your project based on how "trainable" it is. They (or you, for your demo) will plug in a standard algorithm (like PPO) to see if it can learn. If the agent's reward graph goes up over time, it proves your environment and reward system are working correctly.
|
| 12 |
-
The Win Condition: A successful submission is an environment where a standard agent starts off "clueless" but eventually figures out your hidden person's traits and achieves a high score.
|
| 13 |
-
Should we focus on mapping out your action space (the specific things your agent can "do" to the person) to make sure they are diverse enough for the hackathon?
|
| 14 |
-
|
docs/round2/confirmation.md
DELETED
|
@@ -1,60 +0,0 @@
|
|
| 1 |
-
# Round 2 — Entry Confirmation
|
| 2 |
-
|
| 3 |
-
**Event:** Meta PyTorch OpenEnv Hackathon × Scaler School of Technology — Grand Finale
|
| 4 |
-
**Date:** 25–26 April 2026
|
| 5 |
-
**Venue:** Scaler School of Technology, Electronic City, Bangalore
|
| 6 |
-
**Category:** Solo
|
| 7 |
-
**Name:** Akhil Soni
|
| 8 |
-
**Email:** akhilsoni0102@gmail.com
|
| 9 |
-
|
| 10 |
-
> This document serves as your official team ticket to the finale. Present it at entry.
|
| 11 |
-
|
| 12 |
-
---
|
| 13 |
-
|
| 14 |
-
## Before You Arrive
|
| 15 |
-
|
| 16 |
-
- [ ] Join the private Discord (MANDATORY) — all major updates announced here first
|
| 17 |
-
- [ ] Check the travel guide — venue details, directions, nearby stay options
|
| 18 |
-
|
| 19 |
-
---
|
| 20 |
-
|
| 21 |
-
## Entry Requirements
|
| 22 |
-
|
| 23 |
-
⚠️ You must present this document at entry. Entry denied if details don't match registration.
|
| 24 |
-
|
| 25 |
-
Bring:
|
| 26 |
-
- Valid government-issued ID
|
| 27 |
-
- College/company ID used during registration
|
| 28 |
-
|
| 29 |
-
---
|
| 30 |
-
|
| 31 |
-
## Round 2 Themes (Summary)
|
| 32 |
-
|
| 33 |
-
Your task is to choose one or more themes and design your own problem statement around it.
|
| 34 |
-
|
| 35 |
-
- **Multi-Agent Interactions** — cooperation, competition, negotiation
|
| 36 |
-
- **Long-Horizon Planning & Instruction Following** — multi-step reasoning, sparse rewards
|
| 37 |
-
- **World Modeling** — professional tasks (tools/APIs) or personalized tasks
|
| 38 |
-
- **Self-Improvement** — self-play, adaptive curricula, recursive skill amplification
|
| 39 |
-
- **Wild Card** — anything original and meaningful
|
| 40 |
-
|
| 41 |
-
As part of your submission, clearly define:
|
| 42 |
-
|
| 43 |
-
| What | Description |
|
| 44 |
-
|---|---|
|
| 45 |
-
| Problem statement | What capability gap are you solving? |
|
| 46 |
-
| Environment | Where the agent operates |
|
| 47 |
-
| Agent capabilities | What the agent can see and do |
|
| 48 |
-
| Tasks | What the agent must accomplish |
|
| 49 |
-
| Reward model / evaluation logic | How success is measured |
|
| 50 |
-
| Post-training / self-improvement strategy | How the agent improves |
|
| 51 |
-
|
| 52 |
-
> Full themes and judging criteria: see [round2_problem_statement.md](round2_problem_statement.md)
|
| 53 |
-
|
| 54 |
-
---
|
| 55 |
-
|
| 56 |
-
## Key Notes
|
| 57 |
-
|
| 58 |
-
- You can begin working on your problem statement immediately
|
| 59 |
-
- Post-training happens **on-site** with provided compute credits
|
| 60 |
-
- Use time before April 25 to build the environment, agent behaviours, and reward model
|
docs/round2/design_notes.md
DELETED
|
@@ -1,147 +0,0 @@
|
|
| 1 |
-
# Round 2 — Design Notes
|
| 2 |
-
|
| 3 |
-
## The Core Idea
|
| 4 |
-
|
| 5 |
-
Design the environment so that **reward is controlled by hidden variables the agent cannot see in state.**
|
| 6 |
-
The agent must discover them through trial and error across many episodes.
|
| 7 |
-
|
| 8 |
-
This is what creates genuine learning signal — the agent starts confused, gets inconsistent rewards,
|
| 9 |
-
and gradually figures out the hidden rules that explain its performance.
|
| 10 |
-
|
| 11 |
-
---
|
| 12 |
-
|
| 13 |
-
## The Principle (Generic)
|
| 14 |
-
|
| 15 |
-
```
|
| 16 |
-
What agent sees in state: observable facts (energy, tasks, deadlines, timestep)
|
| 17 |
-
What agent does NOT see: hidden variables that secretly control reward
|
| 18 |
-
|
| 19 |
-
The agent's job: figure out the hidden variables through experience
|
| 20 |
-
```
|
| 21 |
-
|
| 22 |
-
Each hidden variable should satisfy three rules:
|
| 23 |
-
|
| 24 |
-
1. **Discoverable through reward signal alone** — the agent can figure it out without being told
|
| 25 |
-
2. **Changes strategy significantly once discovered** — it's not a minor tweak, it rewires planning
|
| 26 |
-
3. **Not guessable from common sense alone** — a pretrained LLM should not figure it out on the first try
|
| 27 |
-
|
| 28 |
-
---
|
| 29 |
-
|
| 30 |
-
## Reference Example — EchoEnv
|
| 31 |
-
|
| 32 |
-
EchoEnv is the simplest possible OpenEnv environment. The agent sends a string, the env echoes it back.
|
| 33 |
-
|
| 34 |
-
```
|
| 35 |
-
Hidden variable: string length
|
| 36 |
-
Agent observes: it sent "hi" → reward 0.2
|
| 37 |
-
it sent "hello" → reward 0.5
|
| 38 |
-
it sent "hello world" → reward 0.8
|
| 39 |
-
Agent discovers: longer string = higher reward
|
| 40 |
-
Strategy change: always send the longest possible string
|
| 41 |
-
```
|
| 42 |
-
|
| 43 |
-
The agent never sees a "length bonus" field in the observation. It just notices the correlation
|
| 44 |
-
between string length and reward over many episodes, and learns to exploit it.
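A toy sketch of that kind of length-based rule (not EchoEnv's actual code, and it does not reproduce the exact numbers quoted above):

```python
def echo_reward(message: str, max_len: int = 20) -> float:
    """Toy hidden variable: longer strings earn more reward, up to a cap."""
    return min(len(message), max_len) / max_len

# echo_reward("hi") == 0.1, echo_reward("hello") == 0.25, echo_reward("hello world") == 0.55
```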
|
| 45 |
-
|
| 46 |
-
---
|
| 47 |
-
|
| 48 |
-
## Applying This to RhythmEnv
|
| 49 |
-
|
| 50 |
-
RhythmEnv simulates a workday. One episode = one day. 20 steps = 20 half-hour slots.
|
| 51 |
-
|
| 52 |
-
The agent sees: energy, stress, tasks, deadlines, current_task_id, timestep.
|
| 53 |
-
The agent does NOT see: hidden variables that secretly influence how reward is calculated.
|
| 54 |
-
|
| 55 |
-
Below are candidate hidden variables. Each one the agent must discover through experience.
|
| 56 |
-
|
| 57 |
-
---
|
| 58 |
-
|
| 59 |
-
### Hidden Variable 1 — Task Sequencing Dependency
|
| 60 |
-
|
| 61 |
-
Some tasks give a secret bonus when done in a specific order.
|
| 62 |
-
|
| 63 |
-
```
|
| 64 |
-
Deep work → then email → bonus multiplier applied to email reward
|
| 65 |
-
Email → then deep work → no bonus
|
| 66 |
-
```
|
| 67 |
-
|
| 68 |
-
**What agent sees:** just the reward it earned on the email task.
|
| 69 |
-
**What agent discovers:** doing deep work first makes email more rewarding.
|
| 70 |
-
**Real world basis:** mental priming — focused work puts you in a state where communication is sharper.
|
| 71 |
-
**Strategy change:** agent learns to always front-load deep work, then switch to communication tasks.
|
| 72 |
-
|
| 73 |
-
---
|
| 74 |
-
|
| 75 |
-
### Hidden Variable 2 — Energy Threshold Cliff
|
| 76 |
-
|
| 77 |
-
Progress rate does not degrade smoothly with energy. It drops sharply at a hidden threshold.
|
| 78 |
-
|
| 79 |
-
```
|
| 80 |
-
energy > 0.5 → normal progress rate
|
| 81 |
-
energy < 0.5 → progress drops 60% suddenly
|
| 82 |
-
```
|
| 83 |
-
|
| 84 |
-
**What agent sees:** energy level (e.g. 0.48) — looks like any other low energy value.
|
| 85 |
-
**What agent discovers:** something bad happens around 0.5, rewards drop sharply below it.
|
| 86 |
-
**Real world basis:** cognitive performance has cliff effects, not smooth degradation.
|
| 87 |
-
**Strategy change:** agent learns to take a break before hitting 0.5, not after.
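A hedged sketch of how such a cliff could be wired into the progress calculation; the 0.5 threshold and 60% drop come from the description above, while the function name is illustrative:

```python
def progress_multiplier(energy: float) -> float:
    """Hidden cliff: progress collapses below a threshold instead of degrading smoothly.

    The agent never sees this function, only the rewards it produces.
    """
    return 1.0 if energy > 0.5 else 0.4  # 60% drop below the hidden threshold
```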
|
| 88 |
-
|
| 89 |
-
---
|
| 90 |
-
|
| 91 |
-
### Hidden Variable 3 — Task Interference
|
| 92 |
-
|
| 93 |
-
Certain task combinations secretly multiply stress while both are active.
|
| 94 |
-
|
| 95 |
-
```
|
| 96 |
-
working on Task A while Task B is overdue → stress multiplier 1.5x
|
| 97 |
-
agent does not see this multiplier in state
|
| 98 |
-
```
|
| 99 |
-
|
| 100 |
-
**What agent sees:** stress rising faster than expected in certain situations.
|
| 101 |
-
**What agent discovers:** having overdue tasks in the background makes everything worse.
|
| 102 |
-
**Real world basis:** unfinished obligations create background cognitive load.
|
| 103 |
-
**Strategy change:** agent learns to clear small overdue tasks early, even at the cost of efficiency.
|
| 104 |
-
|
| 105 |
-
---
|
| 106 |
-
|
| 107 |
-
### Hidden Variable 4 — Recovery Curve Shape
|
| 108 |
-
|
| 109 |
-
Consecutive breaks compound in recovery — the second break recovers more than the first.
|
| 110 |
-
|
| 111 |
-
```
|
| 112 |
-
1 break alone → +0.12 energy
|
| 113 |
-
2 breaks back to back → +0.12 + 0.18 energy (hidden compounding)
|
| 114 |
-
```
|
| 115 |
-
|
| 116 |
-
**What agent sees:** it took a break, energy went up by 0.12. Looks linear.
|
| 117 |
-
**What agent discovers:** two breaks in a row sometimes outperforms one break + one work step.
|
| 118 |
-
**Real world basis:** rest has diminishing costs but increasing returns at low energy states.
|
| 119 |
-
**Strategy change:** agent learns that when very depleted, double breaks are worth it.
|
| 120 |
-
|
| 121 |
-
---
|
| 122 |
-
|
| 123 |
-
### Hidden Variable 5 — Time-of-Day Sensitivity
|
| 124 |
-
|
| 125 |
-
Certain task types score better at certain timesteps via a hidden importance multiplier.
|
| 126 |
-
|
| 127 |
-
```
|
| 128 |
-
creative/deep tasks at timestep 0–5 → importance multiplier 1.3x
|
| 129 |
-
creative/deep tasks at timestep 15+ → importance multiplier 0.7x
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
**What agent sees:** same task, same effort, but different reward depending on when it was done.
|
| 133 |
-
**What agent discovers:** doing creative tasks early in the day is disproportionately rewarded.
|
| 134 |
-
**Real world basis:** cognitive peak hours — most people do their best focused work in the morning.
|
| 135 |
-
**Strategy change:** agent learns to protect early timesteps for high-effort tasks regardless of deadline pressure.
|
| 136 |
-
|
| 137 |
-
---
|
| 138 |
-
|
| 139 |
-
## Open Questions
|
| 140 |
-
|
| 141 |
-
These are not decided yet. To be answered before building:
|
| 142 |
-
|
| 143 |
-
- Which of the 5 variables above do we actually implement?
|
| 144 |
-
- Do we implement all 5 or pick 2-3 strong ones?
|
| 145 |
-
- How do we make sure the hidden variables are discoverable but not too easy?
|
| 146 |
-
- Does the person profile (energy curve, task types) change across episodes or stay fixed?
|
| 147 |
-
- What does the training story look like — how many episodes before the agent figures it out?
|
docs/round2/environment_design.md
DELETED
|
@@ -1,209 +0,0 @@
|
|
| 1 |
-
# Round 2 — Environment Design: RhythmEnv Life Simulator
|
| 2 |
-
|
| 3 |
-
## What We Built
|
| 4 |
-
|
| 5 |
-
A **Life Simulator** — a holistic resource management RL environment where an agent learns
|
| 6 |
-
a specific person's hidden patterns through experience, not configuration.
|
| 7 |
-
|
| 8 |
-
**Core premise:** Personal AI assistants give generic advice. They don't learn *you*.
|
| 9 |
-
RhythmEnv is the training ground for an agent that must discover hidden personality dynamics
|
| 10 |
-
through reward signals alone — the same way a great personal assistant adapts over time.
|
| 11 |
-
|
| 12 |
-
---
|
| 13 |
-
|
| 14 |
-
## Why Life Simulator (Not Workday Scheduler)
|
| 15 |
-
|
| 16 |
-
Round 1 was a workday task scheduler (energy/stress, 20 steps, 4 actions, task deadlines).
|
| 17 |
-
Round 2 rebuilt as a Life Simulator for a stronger learning signal and clearer discovery challenge:
|
| 18 |
-
|
| 19 |
-
| | Workday Scheduler (Round 1) | Life Simulator (Round 2) |
|
| 20 |
-
|---|---|---|
|
| 21 |
-
| Episode | 1 day, 20 steps | 1 week, 28 steps |
|
| 22 |
-
| State | Energy, stress, task queue | 5 life meters |
|
| 23 |
-
| Actions | 4 (task management) | 10 (life activities) |
|
| 24 |
-
| Hidden mechanism | Circadian multiplier on tasks | Profile-specific reward weights + action modifiers |
|
| 25 |
-
| Learning signal | How to sequence tasks | Which actions serve *this specific person* |
|
| 26 |
-
| Pitch story | "Schedule better" | "Learn who you are" |
|
| 27 |
-
|
| 28 |
-
The Life Simulator creates a fundamentally **non-promptable discovery problem**: the agent
|
| 29 |
-
cannot know the person's profile from the prompt text — it must be discovered through reward
|
| 30 |
-
patterns across episodes. This is structurally different from a task that better prompting solves.
|
| 31 |
-
|
| 32 |
-
---
|
| 33 |
-
|
| 34 |
-
## The Discovery Challenge
|
| 35 |
-
|
| 36 |
-
Three hidden mechanism layers, each requiring different signal accumulation to discover:
|
| 37 |
-
|
| 38 |
-
### Layer 1 — Reward Weights (What Matters to This Person)
|
| 39 |
-
|
| 40 |
-
Same action, same starting state → wildly different rewards depending on hidden profile:
|
| 41 |
-
|
| 42 |
-
```
|
| 43 |
-
DEEP_WORK, step 1, same initial state:
|
| 44 |
-
workaholic_stoic: +1.57 (progress weight = 70% — work = meaning)
|
| 45 |
-
introvert_morning: +0.32 (serenity weight = 60% — mild net gain)
|
| 46 |
-
extrovert_night_owl: −0.39 (connection weight = 75% — work gives 0 connection)
|
| 47 |
-
```
|
| 48 |
-
|
| 49 |
-
An agent that doesn't adapt to the profile plateaus at ~0.60 final score.
|
| 50 |
-
One that discovers the profile targets can push above 0.80.
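A minimal sketch of how per-profile weights could produce those diverging rewards. The weight values echo the percentages quoted above; the dictionary layout and function name are assumptions, not the actual implementation:

```python
# Illustrative only: per-profile weights applied to one step's per-meter deltas.
PROFILE_WEIGHTS = {
    "workaholic_stoic":    {"progress": 0.70, "vitality": 0.10, "serenity": 0.10, "connection": 0.10},
    "introvert_morning":   {"serenity": 0.60, "progress": 0.20, "vitality": 0.10, "connection": 0.10},
    "extrovert_night_owl": {"connection": 0.75, "progress": 0.10, "serenity": 0.10, "vitality": 0.05},
}

def weighted_reward(profile: str, meter_deltas: dict) -> float:
    """Same action, same deltas, very different reward depending on the hidden profile."""
    weights = PROFILE_WEIGHTS[profile]
    return sum(weights.get(meter, 0.0) * delta for meter, delta in meter_deltas.items())
```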
|
| 51 |
-
|
| 52 |
-
### Layer 2 — Action Modifiers (How Actions Actually Affect This Person)
|
| 53 |
-
|
| 54 |
-
The base effect matrix is modified invisibly per profile:
|
| 55 |
-
|
| 56 |
-
| Profile | Hidden modifier | Observable signal |
|
| 57 |
-
|---|---|---|
|
| 58 |
-
| introvert_morning | Social drain ×3.0 | SOCIALIZE drains vitality 3× more than expected |
|
| 59 |
-
| introvert_morning | Morning deep work ×2.0 | Same action gives 2× progress at slot 0 |
|
| 60 |
-
| extrovert_night_owl | Morning penalty ×0.4 | DEEP_WORK in morning gives 40% of expected progress |
|
| 61 |
-
| extrovert_night_owl | Evening/night bonus ×1.8 | Same action gives 1.8× progress at slots 2–3 |
|
| 62 |
-
| extrovert_night_owl | Social connection ×2.0 | SOCIALIZE gives 2× connection gain |
|
| 63 |
-
| workaholic_stoic | Work recovers vitality +0.06 | DEEP_WORK raises vitality instead of draining |
|
| 64 |
-
| workaholic_stoic | Idle drains serenity −0.10 | ME_TIME/BINGE_WATCH lower serenity |
|
| 65 |
-
|
| 66 |
-
Agent sees the same meters and actions every episode.
|
| 67 |
-
The profile changes what actions *mean* in this episode.
|
| 68 |
-
|
| 69 |
-
### Layer 3 — Stress Spiral (Amplification Mechanics)
|
| 70 |
-
|
| 71 |
-
When serenity drops below the profile's stress tolerance, all negative effects amplify ×1.3.
|
| 72 |
-
The introvert's tolerance is highest (0.30), extrovert's is mid (0.20), stoic's is lowest (0.15).
|
| 73 |
-
|
| 74 |
-
This creates a compounding dynamic: wrong actions → serenity drops → worse outcomes → harder
|
| 75 |
-
recovery. The agent must learn to protect serenity proactively, not reactively.
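A sketch of the amplification rule; the ×1.3 factor and the tolerance values come from the text above, but the function itself is an illustration rather than the real code:

```python
STRESS_TOLERANCE = {"introvert_morning": 0.30, "extrovert_night_owl": 0.20, "workaholic_stoic": 0.15}

def apply_stress_spiral(profile: str, serenity: float, negative_effect: float) -> float:
    """Below the profile's tolerance, every negative effect compounds by 1.3x."""
    return negative_effect * 1.3 if serenity < STRESS_TOLERANCE[profile] else negative_effect
```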
|
| 76 |
-
|
| 77 |
-
---
|
| 78 |
-
|
| 79 |
-
## Episode Structure
|
| 80 |
-
|
| 81 |
-
```
|
| 82 |
-
1 episode = 1 week
|
| 83 |
-
1 step = 1 time slot (Morning / Afternoon / Evening / Night)
|
| 84 |
-
4 slots/day × 7 days = 28 steps total
|
| 85 |
-
|
| 86 |
-
Slot 0 — Morning (HV1: cognition ×1.2, vitality drain ×0.8)
|
| 87 |
-
Slot 1 — Afternoon (HV1: neutral)
|
| 88 |
-
Slot 2 — Evening (HV1: cognition ×0.8, vitality drain ×1.1)
|
| 89 |
-
Slot 3 — Night (HV1: cognition ×0.6, vitality drain ×1.3)
|
| 90 |
-
```
|
| 91 |
-
|
| 92 |
-
Each `reset(seed, profile)` deterministically initialises state (see the sketch after this list):
|
| 93 |
-
- Profile explicit kwarg → use that profile
|
| 94 |
-
- No profile → `seed % 3` selects profile (agent doesn't know which)
|
| 95 |
-
- Full episode is reproducible from seed alone (random events included)
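A minimal sketch of that reset contract, with the profile ordering and helper name as assumptions:

```python
from typing import Optional

PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]  # order is an assumption

def select_profile(seed: int, profile: Optional[str] = None) -> str:
    """Explicit profile wins; otherwise the seed deterministically picks one of the three."""
    return profile if profile is not None else PROFILES[seed % 3]
```

Because random events are drawn from an RNG seeded with the same value, replaying a seed reproduces the whole episode.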
|
| 96 |
-
|
| 97 |
-
---
|
| 98 |
-
|
| 99 |
-
## Observable vs Hidden
|
| 100 |
-
|
| 101 |
-
| Observable (agent sees every step) | Hidden (must discover from reward patterns) |
|
| 102 |
-
|---|---|
|
| 103 |
-
| All 5 meter values (0.0–1.0) | Which of the 3 profiles is active |
|
| 104 |
-
| Day of week (0–6) | Profile reward weights |
|
| 105 |
-
| Time slot (0–3) | Per-action modifiers for this profile |
|
| 106 |
-
| Active random event name (if any) | Stress tolerance threshold |
|
| 107 |
-
| Remaining steps | Connection decay rate |
|
| 108 |
-
| Per-meter reward deltas | Event impact multiplier |
|
| 109 |
-
|
| 110 |
-
---
|
| 111 |
-
|
| 112 |
-
## The Training Story
|
| 113 |
-
|
| 114 |
-
```
|
| 115 |
-
Untrained agent (random baseline):
|
| 116 |
-
→ No pattern to action selection
|
| 117 |
-
→ Misses optimal timing windows (morning for introverts, evening for extroverts)
|
| 118 |
-
→ Doesn't protect serenity floor
|
| 119 |
-
→ final_score ≈ 0.60–0.70
|
| 120 |
-
|
| 121 |
-
Heuristic agent (rule-based, profile-blind):
|
| 122 |
-
→ Follows observable rules: sleep when vitality low, meditate when serenity low
|
| 123 |
-
→ Cannot differentiate workaholic from introvert strategy
|
| 124 |
-
→ Misses profile-specific timing bonuses
|
| 125 |
-
→ final_score ≈ 0.75–0.82
|
| 126 |
-
|
| 127 |
-
GRPO-trained agent (after 500–1000 steps):
|
| 128 |
-
→ Discovers DEEP_WORK in the morning gives 2× progress for introvert profiles
|
| 129 |
-
→ Learns SOCIALIZE has opposite vitality effects for extrovert vs introvert
|
| 130 |
-
→ Adapts overall strategy to the person's hidden reward structure
|
| 131 |
-
→ Target: final_score > 0.82, beating heuristic on 2+ of 3 profiles
|
| 132 |
-
```
|
| 133 |
-
|
| 134 |
-
The training plot should show:
|
| 135 |
-
1. Mean reward increasing across GRPO steps
|
| 136 |
-
2. Trained agent bar chart > heuristic bar chart for at least 2 profiles
|
| 137 |
-
3. Per-profile breakdown showing differentiated learned strategy
|
| 138 |
-
|
| 139 |
-
---
|
| 140 |
-
|
| 141 |
-
## Anti-Reward-Hacking Measures
|
| 142 |
-
|
| 143 |
-
Three independent reward layers prevent gaming any single signal:
|
| 144 |
-
|
| 145 |
-
| Layer | Signal | Penalty for failure |
|
| 146 |
-
|---|---|---|
|
| 147 |
-
| `format_valid` | Output parseable action name | −2.0 |
|
| 148 |
-
| `action_legal` | Output is one of 10 valid actions | −1.0 |
|
| 149 |
-
| `env_reward` | Real environment reward via episode replay | −3.0 |
|
| 150 |
-
|
| 151 |
-
Additional safeguards:
|
| 152 |
-
- **Repetition dampening**: Same action 3× in a row → 25%/50%/75% effect reduction (prevents spam)
|
| 153 |
-
- **Critical floor penalty**: Any meter < 0.10 → −0.30 per step (prevents neglect farming)
|
| 154 |
-
- **Random events** (8%/step): Prevents overfitting to deterministic trajectories
|
| 155 |
-
- **Seed-based replay**: `env_reward` replays exact episode state via seed + action_history — reward cannot be fabricated
|
| 156 |
-
|
| 157 |
-
---
|
| 158 |
-
|
| 159 |
-
## Alignment with Hackathon Themes
|
| 160 |
-
|
| 161 |
-
**Primary: Theme 3.2 — World Modeling: Personalized Tasks**
|
| 162 |
-
|
| 163 |
-
The environment models real personal assistant behaviour:
|
| 164 |
-
- Agent manages a person's week across competing life priorities
|
| 165 |
-
- Hidden profile = real individual differences in what matters and how actions affect a person
|
| 166 |
-
- Discovery through reward = how a good PA adapts over their first weeks on the job
|
| 167 |
-
|
| 168 |
-
**Secondary: Theme 2 — Long-Horizon Planning**
|
| 169 |
-
|
| 170 |
-
28 steps with delayed, compounding consequences:
|
| 171 |
-
- Neglecting connection decays slowly but each step makes recovery harder
|
| 172 |
-
- Progress must be built steadily — the final grader rewards sustained output
|
| 173 |
-
- Serenity meltdown triggered by accumulated bad decisions, not a single step
|
| 174 |
-
|
| 175 |
-
---
|
| 176 |
-
|
| 177 |
-
## Implementation Reference
|
| 178 |
-
|
| 179 |
-
| Component | File | Lines |
|
| 180 |
-
|---|---|---|
|
| 181 |
-
| Environment | `server/rhythm_environment.py` | 577 |
|
| 182 |
-
| Data models | `models.py` | 89 |
|
| 183 |
-
| Training orchestrator | `training/train.py` | 202 |
|
| 184 |
-
| Dataset generator | `training/dataset.py` | 181 |
|
| 185 |
-
| Reward functions | `training/reward_functions.py` | 215 |
|
| 186 |
-
| Baseline evaluation | `training/inference_eval.py` | 227 |
|
| 187 |
-
| Colab notebook | `training/RhythmEnv_GRPO_Training.ipynb` | — |
|
| 188 |
-
| Gradio UI | `ui/app.py` | — |
|
| 189 |
-
| FastAPI server | `server/app.py` | 74 |
|
| 190 |
-
|
| 191 |
-
Key API:
|
| 192 |
-
```python
|
| 193 |
-
env = RhythmEnvironment()
|
| 194 |
-
obs = env.reset(seed=42, profile="introvert_morning") # profile optional
|
| 195 |
-
obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
|
| 196 |
-
# obs.reward, obs.done, obs.reward_breakdown, obs.vitality, ...
|
| 197 |
-
```
|
| 198 |
-
|
| 199 |
-
---
|
| 200 |
-
|
| 201 |
-
## Open Questions — Decided
|
| 202 |
-
|
| 203 |
-
| Question | Decision |
|
| 204 |
-
|---|---|
|
| 205 |
-
| All 3 hidden variables or start with 2? | All 3 fully implemented |
|
| 206 |
-
| Do profiles change every episode? | Seed-based: same seed → same profile |
|
| 207 |
-
| Does profile affect which tasks appear? | No tasks in Life Simulator; profile affects action effects + reward weights |
|
| 208 |
-
| Add BUNDLE_TASKS action? | Skipped — Life Simulator action space is complete at 10 |
|
| 209 |
-
| 7-day vs 1-day episodes? | 7-day (28 steps) — long horizon is the point |
|
docs/round2/hackathon_themes.md
DELETED
|
@@ -1,180 +0,0 @@
|
|
| 1 |
-
Theme #1 - Multi-Agent Interactions
|
| 2 |
-
Environments for this theme involve cooperation, competition, negotiation, and coalition formation. Learning from these environments will enable agents to model the beliefs and incentives of others in partially observable settings. This drives theory-of-mind reasoning and emergent strategic behavior.
|
| 3 |
-
Expected Outcome: an environment that can be used to train multi-agent task handling in a LLM
|
| 4 |
-
Example environments: Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
|
| 5 |
-
Sub-themes with bonus prizes.
|
| 6 |
-
Fleet AI. Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents operating in complex, multi-agent settings.
|
| 7 |
-
Halluminate. Multi-Actor Environments: Build a realistic environment where an agent interacts with and manages multiple actors (agents) to discover and achieve the task
|
| 8 |
-
Theme #2 - (Super) Long-Horizon Planning & Instruction Following
|
| 9 |
-
You will build environments that require deep, multi-step reasoning with sparse or delayed rewards. After using these environments, the goal is to enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes. The aim is to push beyond shallow next-token reasoning toward structured planning and durable internal representations.
|
| 10 |
-
Expected Outcome: an environment that can capture and improve LLM behaviour on challenging long horizon tasks that need long running sessions beyond context memory limits.
|
| 11 |
-
Example environments: Research-planning simulators, large-scale codebase refactoring tasks, strategic resource management worlds, long-horizon logistics optimization, extremely complicated long-horizon instruction following (e.g., 300 instructions scattered around).
|
| 12 |
-
Sub-themes with bonus prizes.
|
| 13 |
-
Scale AI. Environments for long horizon workflows for non-code use cases within a business setting: focusing on either Sales, Project management, or HR & IT.
|
| 14 |
-
Mercor. Make an environment with capped/uncapped rewards where frontier model rewards scale with token output.
|
| 15 |
-
Theme #3 - World Modeling
|
| 16 |
-
#3.1 Professional Tasks
|
| 17 |
-
Here you will develop environments that require real interaction with tools, APIs, or dynamic systems where the model is expected to do real hard work instead of exploiting short-cuts to arrive at the desired outcome. Learning from these environments will enable agents to maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.
|
| 18 |
-
Expected Outcome: an environment capturing nuances of a defined partially observable world and improve LLM interaction with it
|
| 19 |
-
Example environments: Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations with feedback, tool-discovery benchmarks.
|
| 20 |
-
Sub-themes with bonus prizes.
|
| 21 |
-
Scaler AI Labs. Multi-App RL Environment for Enterprise Workflows: Create RL environments to demonstrate complex workflows, business rule nuances etc in a large enterprise
|
| 22 |
-
|
| 23 |
-
#3.2 Personalized Tasks
|
| 24 |
-
Here we will develop an environment that offers real personalized task handling, imagine replying to personal messages or handling dinner conflicts due to work conflicts, replying to tough emails. Think any personal assistant tasks
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
Expected Outcome: An environment that gives the model a realistic simulation of handling personal tasks, conflicts and managing them as delegations
|
| 28 |
-
|
| 29 |
-
Example environments: Executive Assistant Meeting Planner, Dinner and drive planning, email and message replying, shopping, etc
|
| 30 |
-
|
| 31 |
-
Sub-themes with bonus prizes.
|
| 32 |
-
Patronus AI. Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where the underlying data schemas, API contracts, and t&cs/policies/rules change.
|
| 33 |
-
|
| 34 |
-
Theme #4 - Self-Improvement
|
| 35 |
-
The focus here is to create environments where agents can learn to generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own capability growth. The objective is recursive skill amplification.
|
| 36 |
-
Expected Outcome: an environment for improving self-play of a LLM over a defined set of tasks
|
| 37 |
-
Example environments: Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
|
| 38 |
-
Sub-themes with bonus prizes.
|
| 39 |
-
Snorkel AI. Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements / preferences.
|
| 40 |
-
|
| 41 |
-
Theme #5: Wild Card - Impress Us!
|
| 42 |
-
We do not want to limit your focus if your idea doesn't fit the boxes above. We want and WILL reward out-of-the-box tasks, so please be creative, but remember that submissions should meaningfully add value to LLM training on a specific task.
|
| 43 |
-
|
| 44 |
-
Guidelines for Problem Statement
|
| 45 |
-
It is NOT mandatory to choose the same problem statement as Round 1. Only choose the same problem statement if it aligns with the above provided Hackathon themes.
|
| 46 |
-
You can start working on your problem statement once you have finalized it. Post-training can be done onsite on 25th & 26th when you receive compute credits for HuggingFace.
|
| 47 |
-
Before the onsite, we suggest you work on building the environment, agent behaviours, reward model and evaluate if your work aligns with the judging criteria given below.
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
Judging Criteria
|
| 51 |
-
Minimum requirements:
|
| 52 |
-
Usage of OpenEnv (latest release)
|
| 53 |
-
Show a minimal training script for your environment using Unsloth or HF TRL in Colab
|
| 54 |
-
Write a mini-blog on HuggingFace or mini-video on YouTube talking about your submission, <2 minutes
|
| 55 |
-
Your OpenEnv compliant environment should be hosted on Hugging Face Spaces.
|
| 56 |
-
|
| 57 |
-
First Round Judging Overview
|
| 58 |
-
Pitch Format: Each team has 3 minutes to pitch, followed by 2 minutes for Q&A (5 minutes total).
|
| 59 |
-
Evaluation: Teams will be scored based on the following criteria:
|
| 60 |
-
Environment Innovation (40%): Is the environment novel, creative, or challenging? Does it meaningfully test the agent’s behavior?
|
| 61 |
-
Storytelling (30%): Does the team clearly explain the problem, environment, and agent behavior? Is the demo engaging and easy to follow?
|
| 62 |
-
Showing Improvement in Rewards (20%): Does the demo provide observable evidence of training progress (reward curves, metrics, or before/after behavior)?
|
| 63 |
-
Reward and Training Script/Pipeline Setup (10%): Is the reward logic coherent, and does the pipeline produce meaningful improvement in the agent’s inference (how it acts in the environment)?
|
| 64 |
-
|
| 65 |
-
OpenEnv Hackathon - What Judges Look For
|
| 66 |
-
|
| 67 |
-
This guide tells you what makes a strong submission for the OpenEnv Hackathon (India 2026).
|
| 68 |
-
Read it before you start building, and again before you submit.
|
| 69 |
-
|
| 70 |
-
For the list of themes and example problems, refer to the top sections.
|
| 71 |
-
|
| 72 |
-
NOTE: Please remember only one submission per team. If you have multiple ideas, pick the best one and go for it. Please make sure that the URL link of your environment is submitted as judges will pull the environment from the URL to evaluate it. Changes or commits after the submission deadline will not be considered.
|
| 73 |
-
|
| 74 |
-
TL;DR
|
| 75 |
-
|
| 76 |
-
Build an environment that an LLM could actually be trained on to get measurably better at
|
| 77 |
-
something interesting. Then show that training. Then tell the story.
|
| 78 |
-
|
| 79 |
-
A messy but ambitious environment with real training evidence beats a polished but boring one.
|
| 80 |
-
Pick a problem that excites you (that energy comes through in the pitch).
|
| 81 |
-
|
| 82 |
-
Judging Criteria
|
| 83 |
-
|
| 84 |
-
Criterion: Environment Innovation
|
| 85 |
-
Weight: 40%
|
| 86 |
-
What it means:
|
| 87 |
-
Is the environment novel, creative, or genuinely challenging?
|
| 88 |
-
Does it meaningfully test agent behavior in a way that hasn't been done before?
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
Criterion: Storytelling & Presentation
|
| 92 |
-
Weight: 30%
|
| 93 |
-
What it means:
|
| 94 |
-
Can you clearly explain the problem, the environment, and what the agent learned?
|
| 95 |
-
Is the demo engaging and easy to follow for a non-technical audience?
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
Criterion: Showing Improvement in Rewards
|
| 99 |
-
Weight: 20%
|
| 100 |
-
What it means:
|
| 101 |
-
Is there observable evidence of training progress? Reward curves, before/after behavior,
|
| 102 |
-
comparison against a baseline -- anything that proves the agent learned something.
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
Criterion: Reward & Training Pipeline
|
| 106 |
-
Weight: 10%
|
| 107 |
-
What it means:
|
| 108 |
-
Is the reward logic coherent? Does the pipeline produce meaningful improvement in the trained
|
| 109 |
-
agent's behavior?
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
Minimum Submission Requirements
|
| 113 |
-
|
| 114 |
-
NOTE: These are non-negotiable. Submissions missing any of these are at a serious disadvantage.
|
| 115 |
-
Use OpenEnv (latest release). Build on top of the framework; don’t reinvent the wheel.
|
| 116 |
-
A working training script using Unsloth or Hugging Face TRL, ideally as a Colab notebook so judges can re-run it.
|
| 117 |
-
Evidence that you actually trained; at minimum, loss and reward plots from a real run.
|
| 118 |
-
A short writeup: a mini-blog on Hugging Face or a < 2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck of presentation. Please make sure that all materials are linked from your README file so that judges can access them easily.
|
| 119 |
-
Push your environment to a Hugging Face Space so it’s discoverable and runnable.
|
| 120 |
-
A README that motivates the problem, explains how the env works, and shows results.
|
| 121 |
-
README should have a link to the environment in the Hugging Face Space. It should also have all additional references to other materials (e.g. videos, blog posts, slides, presentations, etc.) that you want to include.
|
| 122 |
-
Please do not include big video files in your Env submission on HF Hub as we would like to have a small size for each env (Please use url as reference link to additional materials).
|
| 123 |
-
|
| 124 |
-
What Makes a Submission Stand Out
|
| 125 |
-
|
| 126 |
-
Pick an ambitious, original problem
|
| 127 |
-
The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation,
|
| 128 |
-
you need a genuinely fresh angle. Some questions to ask yourself:
|
| 129 |
-
Does this environment exist to teach an LLM something it currently can’t do well?
|
| 130 |
-
Is the domain underexplored in RL/LLM training?
|
| 131 |
-
Could a researcher write a paper about training on this?
|
| 132 |
-
|
| 133 |
-
Design a reward signal that actually teaches
|
| 134 |
-
A great environment has a reward function that:
|
| 135 |
-
Provides a rich, informative signal (not just 0/1 at the end)
|
| 136 |
-
Captures something hard to measure in a clever way
|
| 137 |
-
Uses OpenEnv’s Rubric system thoughtfully (composable rubrics > monolithic scoring)
|
| 138 |
-
Is hard to game; an agent that exploits the reward without solving the task should not get high scores
|
| 139 |
-
|
| 140 |
-
Show real training, end to end
|
| 141 |
-
The bar isn’t “training script exists.” The bar is “training script runs against the environment, the
|
| 142 |
-
agent learns, and you can show it.” Concretely:
|
| 143 |
-
Your training loop should connect to your environment (not a static dataset)
|
| 144 |
-
Train long enough that the curves mean something
|
| 145 |
-
Compare a trained agent vs. a random/untrained baseline; quantitative and/or qualitative
|
| 146 |
-
Include the plots and numbers in your README and writeup
|
| 147 |
-
|
| 148 |
-
Make your plots readable
|
| 149 |
-
Reviewers spend seconds, not minutes, on each plot. Help them out:
|
| 150 |
-
Label both axes (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
|
| 151 |
-
Save plots as .png or .jpg and commit them to the repo (don't leave them only in a Colab cell or a deleted WandB run). If you ran via WandB, please include the link to the specific run that produced your plots.
|
| 152 |
-
Embed the key plots in your README with a one-line caption explaining what each one shows. If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious
|
| 153 |
-
|
| 154 |
-
Tell a story, not an API doc
|
| 155 |
-
Your README, blog, and pitch should answer:
|
| 156 |
-
Problem) what capability gap or interesting domain are you targeting?
|
| 157 |
-
Environment) what does the agent see, do, and get rewarded for?
|
| 158 |
-
Results) what changed after training? Show it.
|
| 159 |
-
Why does it matter) who would care, and why?
|
| 160 |
-
|
| 161 |
-
A reviewer should be able to read your README in 3~5 minutes and want to try your
|
| 162 |
-
environment.
|
| 163 |
-
|
| 164 |
-
NOTE: If you have a video, HF post, or anything else interesting, please make sure that it’s linked
|
| 165 |
-
from your README.
|
| 166 |
-
|
| 167 |
-
Engineer it cleanly (table stakes)
|
| 168 |
-
Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
|
| 169 |
-
Use OpenEnv’s Environment / MCPEnvironment base classes properly
|
| 170 |
-
Respect the client / server separation (clients should never import server internals)
|
| 171 |
-
Follow the standard Gym-style API (reset, step, state)
|
| 172 |
-
Have a valid openenv.yaml manifest
|
| 173 |
-
Don’t use reserved tool names (reset, step, state, close) for MCP tools
|
| 174 |
-
|
| 175 |
-
Final Note
|
| 176 |
-
|
| 177 |
-
Judges are looking for environments that push the frontier of what we can train LLMs to do. Be
|
| 178 |
-
ambitious. Pick a problem you find genuinely interesting; that almost always produces better
|
| 179 |
-
work than chasing what you think judges want. Good luck.
|
| 180 |
-
|
docs/round2/pitch_framing.md
DELETED
|
@@ -1,57 +0,0 @@
|
|
| 1 |
-
# Round 2 — Pitch Framing
|
| 2 |
-
|
| 3 |
-
## Why This Exists
|
| 4 |
-
|
| 5 |
-
Personal AI assistants give generic advice. They don't know you.
|
| 6 |
-
RhythmEnv is an environment where an agent learns YOUR specific patterns through experience — not configuration.
|
| 7 |
-
|
| 8 |
-
---
|
| 9 |
-
|
| 10 |
-
## The Product Vision
|
| 11 |
-
|
| 12 |
-
```
|
| 13 |
-
User installs app
|
| 14 |
-
Agent runs episodes in background
|
| 15 |
-
Over time → learns energy patterns, task preferences, peak hours
|
| 16 |
-
Result → a scheduler that actually knows YOU
|
| 17 |
-
```
|
| 18 |
-
|
| 19 |
-
No setup. No personality quiz. The agent figures you out.
|
| 20 |
-
|
| 21 |
-
---
|
| 22 |
-
|
| 23 |
-
## Why Simulation is a Valid Proxy
|
| 24 |
-
|
| 25 |
-
| Hackathon env | Real product |
|
| 26 |
-
|---|---|
|
| 27 |
-
| Simulated tasks | Real calendar + Notion + email |
|
| 28 |
-
| Simulated energy | Biometric data or self-report |
|
| 29 |
-
| Fixed scenarios | Dynamic, unpredictable days |
|
| 30 |
-
|
| 31 |
-
The mechanics are the same. The simulation is a controlled version of the real problem — which is exactly what RL training environments are for.
|
| 32 |
-
|
| 33 |
-
---
|
| 34 |
-
|
| 35 |
-
## What Makes This Hard for an LLM
|
| 36 |
-
|
| 37 |
-
Without hidden variables → LLM already knows how to schedule by deadline. Nothing to learn.
|
| 38 |
-
|
| 39 |
-
With hidden variables → LLM must discover YOUR specific rules:
|
| 40 |
-
|
| 41 |
-
```
|
| 42 |
-
YOUR energy cliff (performance drops sharply below a threshold, not gradually)
|
| 43 |
-
YOUR peak hours (certain tasks score better at certain times of day)
|
| 44 |
-
YOUR recovery curve (consecutive breaks compound in ways that aren't obvious)
|
| 45 |
-
```
|
| 46 |
-
|
| 47 |
-
These aren't in the state. The agent discovers them through reward signal across episodes.
|
| 48 |
-
That's the training story.
|
| 49 |
-
|
| 50 |
-
---
|
| 51 |
-
|
| 52 |
-
## The Pitch (3 minutes)
|
| 53 |
-
|
| 54 |
-
1. **Problem** — AI assistants are generic. They don't learn you.
|
| 55 |
-
2. **Environment** — A simulated workday with hidden personal patterns the agent must discover.
|
| 56 |
-
3. **Results** — Show reward curves improving as the agent learns the hidden variables.
|
| 57 |
-
4. **Why it matters** — This is the training ground for truly personalized AI.
|
docs/round2/problem_statement.md
DELETED
|
@@ -1,193 +0,0 @@
|
|
| 1 |
-
# Round 2 — Grand Finale Problem Statement
|
| 2 |
-
|
| 3 |
-
**Date:** 25–26 April 2026
|
| 4 |
-
**Venue:** Scaler School of Technology, Electronic City, Bangalore
|
| 5 |
-
**Category:** Solo — Akhil Soni
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## The Task
|
| 10 |
-
|
| 11 |
-
Choose one (or more) of the themes below and design your own problem statement around it.
|
| 12 |
-
Build an environment, train an agent on it, and show measurable improvement.
|
| 13 |
-
|
| 14 |
-
> *"Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story."*
|
| 15 |
-
|
| 16 |
-
It is **NOT mandatory** to continue with your Round 1 problem statement. Only keep it if it aligns with a theme below.
|
| 17 |
-
|
| 18 |
-
---
|
| 19 |
-
|
| 20 |
-
## Themes
|
| 21 |
-
|
| 22 |
-
### Theme 1 — Multi-Agent Interactions
|
| 23 |
-
|
| 24 |
-
Environments involving cooperation, competition, negotiation, and coalition formation.
|
| 25 |
-
Enables agents to model beliefs and incentives of others in partially observable settings.
|
| 26 |
-
Drives theory-of-mind reasoning and emergent strategic behavior.
|
| 27 |
-
|
| 28 |
-
**Expected outcome:** An environment that can be used to train multi-agent task handling in an LLM.
|
| 29 |
-
|
| 30 |
-
**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
|
| 31 |
-
|
| 32 |
-
**Bonus prizes:**
|
| 33 |
-
- **Fleet AI** — Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents in complex multi-agent settings.
|
| 34 |
-
- **Halluminate** — Multi-Actor Environments: An agent interacts with and manages multiple actors to discover and achieve a task.
|
| 35 |
-
|
| 36 |
-
---
|
| 37 |
-
|
| 38 |
-
### Theme 2 — (Super) Long-Horizon Planning & Instruction Following
|
| 39 |
-
|
| 40 |
-
Environments requiring deep, multi-step reasoning with sparse or delayed rewards.
|
| 41 |
-
Goal: enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes.
|
| 42 |
-
Pushes beyond shallow next-token reasoning toward structured planning and durable internal representations.
|
| 43 |
-
|
| 44 |
-
**Expected outcome:** An environment that captures and improves LLM behavior on challenging long-horizon tasks that need sessions beyond context memory limits.
|
| 45 |
-
|
| 46 |
-
**Example environments:** Research-planning simulators, large-scale codebase refactoring, strategic resource management, long-horizon logistics optimization, 300+ instruction following.
|
| 47 |
-
|
| 48 |
-
**Bonus prizes:**
|
| 49 |
-
- **Scale AI** — Long-horizon workflows for non-code business use cases: Sales, Project Management, or HR & IT.
|
| 50 |
-
- **Mercor** — Environment with capped/uncapped rewards where frontier model rewards scale with token output.
|
| 51 |
-
|
| 52 |
-
---
|
| 53 |
-
|
| 54 |
-
### Theme 3 — World Modeling
|
| 55 |
-
|
| 56 |
-
#### 3.1 Professional Tasks
|
| 57 |
-
|
| 58 |
-
Environments requiring real interaction with tools, APIs, or dynamic systems.
|
| 59 |
-
The model must do real work instead of exploiting shortcuts.
|
| 60 |
-
Strengthens causal reasoning and persistent world models.
|
| 61 |
-
|
| 62 |
-
**Expected outcome:** An environment capturing nuances of a partially observable world and improving LLM interaction with it.
|
| 63 |
-
|
| 64 |
-
**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations, tool-discovery benchmarks.
|
| 65 |
-
|
| 66 |
-
**Bonus prizes:**
|
| 67 |
-
- **Scaler AI Labs** — Multi-App RL Environment for Enterprise Workflows: Complex workflows and business rule nuances in a large enterprise.
|
| 68 |
-
|
| 69 |
-
#### 3.2 Personalized Tasks
|
| 70 |
-
|
| 71 |
-
Environments for real personalized task handling — personal messages, dinner conflicts, tough emails, any personal assistant task.
|
| 72 |
-
|
| 73 |
-
**Expected outcome:** An environment that gives the model a realistic simulation of handling personal tasks, conflicts, and managing them as delegations.
|
| 74 |
-
|
| 75 |
-
**Example environments:** Executive Assistant Meeting Planner, dinner and drive planning, email and message replying, shopping.
|
| 76 |
-
|
| 77 |
-
**Bonus prizes:**
|
| 78 |
-
- **Patronus AI** — Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where data schemas, API contracts, and policies change over time.
|
| 79 |
-
|
| 80 |
-
---
|
| 81 |
-
|
| 82 |
-
### Theme 4 — Self-Improvement
|
| 83 |
-
|
| 84 |
-
Environments where agents generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula.
|
| 85 |
-
Goal: agents learn to drive their own capability growth (recursive skill amplification).
|
| 86 |
-
|
| 87 |
-
**Expected outcome:** An environment for improving self-play of an LLM over a defined set of tasks.
|
| 88 |
-
|
| 89 |
-
**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
|
| 90 |
-
|
| 91 |
-
**Bonus prizes:**
|
| 92 |
-
- **Snorkel AI** — Simulated Experts-in-the-Loop: Environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.
|
| 93 |
-
|
| 94 |
-
---
|
| 95 |
-
|
| 96 |
-
### Theme 5 — Wild Card
|
| 97 |
-
|
| 98 |
-
No constraint. Any original environment that meaningfully adds value to LLM training on a certain task.
|
| 99 |
-
|
| 100 |
-
---
|
| 101 |
-
|
| 102 |
-
## Minimum Requirements (Non-Negotiable)
|
| 103 |
-
|
| 104 |
-
Missing any of these puts your submission at a serious disadvantage.
|
| 105 |
-
|
| 106 |
-
| Requirement | Details |
|
| 107 |
-
|---|---|
|
| 108 |
-
| OpenEnv (latest release) | Build on top of the framework, don't reinvent the wheel |
|
| 109 |
-
| Training script | Using Unsloth or HF TRL, ideally as a runnable Colab notebook |
|
| 110 |
-
| Training evidence | Loss and reward plots from a real run |
|
| 111 |
-
| Writeup | Mini-blog on HuggingFace OR <2 min YouTube video OR short slide deck |
|
| 112 |
-
| HF Space deployment | Environment hosted, discoverable, and runnable |
|
| 113 |
-
| README | Motivates the problem, explains the env, shows results, links all materials |
|
| 114 |
-
|
| 115 |
-
---
|
| 116 |
-
|
| 117 |
-
## Judging Criteria
|
| 118 |
-
|
| 119 |
-
| Criterion | Weight | What It Means |
|
| 120 |
-
|---|---|---|
|
| 121 |
-
| Environment Innovation | 40% | Novel, creative, genuinely challenging? Tests agent behavior in a new way? |
|
| 122 |
-
| Storytelling & Presentation | 30% | Clear explanation of problem, env, and what the agent learned? Engaging demo? |
|
| 123 |
-
| Showing Improvement in Rewards | 20% | Observable training progress — reward curves, before/after behavior, baseline comparison |
|
| 124 |
-
| Reward & Training Pipeline | 10% | Coherent reward logic? Does training produce meaningful improvement in agent behavior? |
|
| 125 |
-
|
| 126 |
-
---
|
| 127 |
-
|
| 128 |
-
## Pitch Format
|
| 129 |
-
|
| 130 |
-
- **3 minutes** to pitch
|
| 131 |
-
- **2 minutes** Q&A
|
| 132 |
-
- 5 minutes total per team
|
| 133 |
-
|
| 134 |
-
Your pitch should answer:
|
| 135 |
-
1. **Problem** — what capability gap or interesting domain are you targeting?
|
| 136 |
-
2. **Environment** — what does the agent see, do, and get rewarded for?
|
| 137 |
-
3. **Results** — what changed after training? Show it.
|
| 138 |
-
4. **Why it matters** — who would care, and why?
|
| 139 |
-
|
| 140 |
-
---
|
| 141 |
-
|
| 142 |
-
## What Makes a Submission Stand Out
|
| 143 |
-
|
| 144 |
-
**Pick an ambitious problem.**
|
| 145 |
-
Ask yourself: Does this environment exist to teach an LLM something it currently can't do well? Could a researcher write a paper about training on this?
|
| 146 |
-
|
| 147 |
-
**Design a reward signal that actually teaches.**
|
| 148 |
-
- Rich signal throughout the episode (not just 0/1 at the end)
|
| 149 |
-
- Hard to game — an agent that exploits the reward without solving the task should not score high
|
| 150 |
-
- Use OpenEnv's Rubric system thoughtfully
|
| 151 |
-
|
| 152 |
-
**Show real training end to end.**
|
| 153 |
-
- Training loop connects to your environment (not a static dataset)
|
| 154 |
-
- Train long enough that curves mean something
|
| 155 |
-
- Compare trained agent vs random/untrained baseline — quantitative and qualitative
|
| 156 |
-
- Include plots and numbers in your README
|
| 157 |
-
|
| 158 |
-
**Make plots readable.**
|
| 159 |
-
- Label both axes with units
|
| 160 |
-
- Save as `.png` / `.jpg` and commit to repo
|
| 161 |
-
- Embed key plots in README with a one-line caption
|
| 162 |
-
- Put baseline vs trained on the same axes
|
| 163 |
-
|
| 164 |
-
---
|
| 165 |
-
|
| 166 |
-
## Engineering Checklist
|
| 167 |
-
|
| 168 |
-
- [ ] Use `Environment` / `MCPEnvironment` base classes properly
|
| 169 |
-
- [ ] Respect client/server separation (clients never import server internals)
|
| 170 |
-
- [ ] Follow standard Gym-style API (`reset`, `step`, `state`)
|
| 171 |
-
- [ ] Valid `openenv.yaml` manifest
|
| 172 |
-
- [ ] Do not use reserved tool names (`reset`, `step`, `state`, `close`) for MCP tools
|
| 173 |
-
- [ ] README links to blog, video, or slides
|
| 174 |
-
- [ ] No large video files in HF repo (use URL references)
|
| 175 |
-
|
| 176 |
-
---
|
| 177 |
-
|
| 178 |
-
## Before You Arrive in Bangalore
|
| 179 |
-
|
| 180 |
-
Post-training happens on-site with provided compute credits.
|
| 181 |
-
Use the time before April 25 to:
|
| 182 |
-
|
| 183 |
-
- [ ] Finalize your problem statement
|
| 184 |
-
- [ ] Build and deploy your environment to HF Space
|
| 185 |
-
- [ ] Write your training script (ready to run, not necessarily fully executed)
|
| 186 |
-
- [ ] Prepare your 3-minute pitch story
|
| 187 |
-
|
| 188 |
-
---
|
| 189 |
-
|
| 190 |
-
## Infrastructure Constraints (same as Round 1)
|
| 191 |
-
|
| 192 |
-
- Inference script runtime: under 20 minutes
|
| 193 |
-
- Hardware: vCPU=2, memory=8GB
|
docs/training.md
ADDED
|
@@ -0,0 +1,125 @@
|
| 1 |
+
# Training Guide — RhythmEnv GRPO
|
| 2 |
+
|
| 3 |
+
## What we're training
|
| 4 |
+
|
| 5 |
+
A Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.
|
| 6 |
+
|
| 7 |
+
The goal is not to teach the model the rules of the environment — a capable LLM already understands them from the prompt. The goal is to calibrate a small model to do online behavioral inference: read who you're helping from how the environment responds, not from what it tells you.
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## Stack
|
| 12 |
+
|
| 13 |
+
| Component | Choice |
|
| 14 |
+
|---|---|
|
| 15 |
+
| Model | `unsloth/Qwen2.5-3B-Instruct` |
|
| 16 |
+
| Quantization | 4-bit NF4 via Unsloth |
|
| 17 |
+
| LoRA rank | 4 |
|
| 18 |
+
| Training algorithm | GRPO (TRL 0.22.2) |
|
| 19 |
+
| Hardware | Free Colab T4 (~3 hours for 500 steps) |
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## Three-layer reward stack
|
| 24 |
+
|
| 25 |
+
Each training step scores four candidate completions per prompt across three reward functions:
|
| 26 |
+
|
| 27 |
+
| Layer | Function | Signal | Pass | Fail |
|
| 28 |
+
|---|---|---|---|---|
|
| 29 |
+
| 1 | `format_valid` | Is the output a parseable action name? | +1.0 | −2.0 |
|
| 30 |
+
| 2 | `action_legal` | Is it one of the 10 valid `ActionType` values? | +0.5 | −1.0 |
|
| 31 |
+
| 3 | `env_reward` | Real reward from stepping the environment | varies | −3.0 |
|
| 32 |
+
|
| 33 |
+
`env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action — the reward cannot be fabricated.
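A condensed sketch of that replay-based reward function. The signature follows TRL's GRPO convention of passing extra dataset columns as keyword-argument lists, and the environment calls mirror the Key API in the design doc; treat module paths, enum parsing, and message handling as assumptions rather than the exact implementation:

```python
# Hedged sketch of the layer-3 reward; names and paths are assumptions.
from server.rhythm_environment import RhythmEnvironment
from models import RhythmAction, ActionType

def env_reward(completions, seed, step_index, action_history, **kwargs):
    """Replay each stored episode from its seed, then step it with the candidate action."""
    rewards = []
    for completion, s, idx, history in zip(completions, seed, step_index, action_history):
        assert len(history) == idx, "history rebuilds the episode up to this step"
        text = completion[0]["content"].strip() if isinstance(completion, list) else completion.strip()
        try:
            action = ActionType(text)            # unknown action name raises ValueError
        except ValueError:
            rewards.append(-3.0)
            continue
        env = RhythmEnvironment()
        env.reset(seed=s)                        # same seed -> same profile and random events
        for past in history:                     # rebuild the exact pre-step state
            env.step(RhythmAction(action_type=ActionType(past)))
        obs = env.step(RhythmAction(action_type=action))
        rewards.append(obs.reward)
    return rewards
```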

---

## Key config choices

```python
GRPOConfig(
    beta=0.01,                  # KL penalty — default 0.04 caused explosion to kl=10731 at step 205
    max_completion_length=16,   # Action names are ≤15 chars; prevents verbose drift
    learning_rate=2e-4,
    num_generations=4,          # 4 candidates per prompt — enough variance for GRPO signal
    max_steps=500,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```

`beta=0.01` is the critical fix from the first training run. The default value caused the policy to drift so far from the reference model that completion length jumped from 4 tokens to 368 tokens, saturating the max. `max_completion_length=16` provides a hard cap as a second safeguard.
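
For context, a sketch of how this config and the three reward functions plug into TRL's `GRPOTrainer`. The variable names (`training_args`, `dataset`) are assumptions; the model and tokenizer come from the loading cell.

```python
# Wiring sketch, assuming model, tokenizer, dataset, and the three reward
# functions above are already defined; training_args is the GRPOConfig shown
# in the previous block. TRL scores every generation with each reward
# function and sums the results per completion.
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_valid, action_legal, env_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```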

---

## Dataset

Generated from 200 simulated episodes using a mixed strategy (heuristic + random actions) across all three profiles. Each sample is one step:

```python
{
    "prompt": [system_msg, user_observation],
    "seed": int,             # episode seed → deterministic profile + events
    "step_index": int,       # which step in the episode
    "action_history": list,  # actions taken before this step
}
```

The dataset gives the model exposure to all three profiles and a range of meter states. Using a mixed strategy rather than a pure heuristic ensures the model also sees suboptimal states and learns how to recover from them.
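
A hedged sketch of how such a dataset can be generated. `RhythmEnv`, `heuristic_action`, `build_prompt`, and the 30% random-action mix are placeholders for the project's own helpers and settings, not the notebook's actual code; the stored fields match the schema above.

```python
import random
# from rhythm_env import RhythmEnv, ActionType   # assumed import path

samples = []
for seed in range(200):                           # 200 simulated episodes
    env = RhythmEnv(seed=seed)
    obs = env.reset()
    history, step_index, done = [], 0, False
    while not done:
        samples.append({
            "prompt": build_prompt(obs),          # [system_msg, user_observation]
            "seed": seed,
            "step_index": step_index,
            "action_history": list(history),
        })
        # Mixed strategy: mostly heuristic, sometimes random, so the model
        # also sees recoverable suboptimal states.
        if random.random() < 0.3:
            action = random.choice(list(ActionType)).name
        else:
            action = heuristic_action(obs)
        obs, reward, done, info = env.step(action)   # assumed gym-style return
        history.append(action)
        step_index += 1
```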

---

## Baselines

Established before training on 5 episodes × 3 profiles:

| Strategy | Introvert | Extrovert | Workaholic |
|---|---|---|---|
| Random | ~0.65 | ~0.70 | ~0.65 |
| Heuristic | ~0.78 | ~0.76 | ~0.82 |

The heuristic baseline uses observable rules only (sleep when vitality is low, meditate when serenity is low, socialise when connection drops). It cannot differentiate profiles.
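
In code form, the heuristic amounts to something like the sketch below. The meter names follow the rules just described; the observation layout, thresholds, action names, and the fallback action are illustrative assumptions rather than the project's actual policy.

```python
def heuristic_action(obs):
    # Observable rules only: react to whichever meter is low, profile-blind.
    meters = obs["meters"]                 # assumed observation layout
    if meters["vitality"] < 0.3:
        return "SLEEP"
    if meters["serenity"] < 0.3:
        return "MEDITATE"
    if meters["connection"] < 0.3:
        return "SOCIALISE"
    return "WORK"                          # assumed default when nothing is low
```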

A trained agent should beat the heuristic on at least 2 of 3 profiles, with qualitatively different action sequences per profile — the introvert's week should look nothing like the workaholic's.

---

## How to run

Open [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb) in Colab with a T4 GPU runtime.

Run the cells in order:

1. Install dependencies
2. Clone repo from HF Space
3. Verify environment
4. Run baseline evaluation (saves `baseline_results`)
5. Generate dataset
6. Load model (Qwen 2.5-3B + LoRA)
7. Set up reward functions
8. Configure training (`beta=0.01`, `max_completion_length=16`)
9. Train (`trainer.train()`)
10. Save model
11. Generate training plots
12. Evaluate trained model
13. Generate comparison chart (`baseline_vs_trained.png`)

---

## Expected training behaviour

Healthy run: `completion_length` stays at 3–16 tokens throughout, KL stays below 1.0, mean reward climbs from ~1.5 toward ~3.0 over 500 steps.

Warning signs: `completion_length` spiking above 50, `clipped_ratio` approaching 1.0, KL above 5.0. If any of these appear, the `beta=0.01` fix is not being applied.
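
If you want those warning signs flagged automatically, a small callback along these lines can watch the training logs. The metric key names (`completions/mean_length`, `kl`) are assumptions about what TRL logs in this version; adjust them to whatever keys actually appear in your run.

```python
from transformers import TrainerCallback

class DriftGuard(TrainerCallback):
    # Prints a warning when logged metrics cross the thresholds above.
    def on_log(self, args, state, control, logs=None, **kwargs):
        logs = logs or {}
        if logs.get("completions/mean_length", 0) > 50:      # assumed log key
            print(f"step {state.global_step}: completion length drifting above 50")
        if logs.get("kl", 0) > 5.0:                           # assumed log key
            print(f"step {state.global_step}: KL above 5.0, check that beta=0.01 is applied")

# trainer.add_callback(DriftGuard())
```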

---

## Output artifacts

After a successful run, download these and commit to the repo:

```
plots/training_loss.png       — loss curve across 500 steps
plots/reward_curve.png        — mean reward with ±1 std band
plots/baseline_vs_trained.png — comparison bar chart (random / heuristic / trained)
plots/eval_results.json       — raw per-episode scores
```
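
For reference, a minimal sketch of how the comparison chart can be produced once per-profile mean rewards are available. The trained-agent numbers are placeholders to be filled in from the evaluation cell; the random and heuristic values mirror the baselines table.

```python
import os
import numpy as np
import matplotlib.pyplot as plt

profiles = ["Introvert", "Extrovert", "Workaholic"]
results = {
    "random":    [0.65, 0.70, 0.65],   # from the baselines table
    "heuristic": [0.78, 0.76, 0.82],   # from the baselines table
    "trained":   [0.0, 0.0, 0.0],      # placeholder: fill from eval_results.json
}

x = np.arange(len(profiles))
width = 0.25
for i, (name, values) in enumerate(results.items()):
    plt.bar(x + (i - 1) * width, values, width, label=name)
plt.xticks(x, profiles)
plt.ylabel("Mean episode reward")
plt.legend()

os.makedirs("plots", exist_ok=True)
plt.savefig("plots/baseline_vs_trained.png", dpi=150)
```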