# OpenEnv Hackathon Readout


Date: 2026-04-24


## What The Hackathon Wants


The winning submission should be an OpenEnv-compliant environment where an LLM acts step by step, receives programmatic feedback, and measurably improves through RL or RL-style training.
|
|
The judging criteria are weighted as follows:
|
|
| Criterion | Weight | Practical meaning |
|---|---:|---|
| Environment innovation | 40% | Novel, challenging, meaningful agent behavior, not a clone of common games or toy tasks. |
| Storytelling | 30% | A judge should understand the world, the agent, what it learned, and why it matters in 3 to 5 minutes. |
| Showing improvement | 20% | Reward curves, before/after runs, baseline comparison, actual training evidence. |
| Reward/training pipeline | 10% | Coherent rubrics, TRL or Unsloth script, reproducible pipeline. |
|
|
Minimum gates:


- Use the latest OpenEnv release.
- A hosted Hugging Face Space.
- OpenEnv-compliant `reset`, `step`, and `state`, typed action/observation models, and an `openenv.yaml` (see the sketch after this list).
- A training script using Unsloth or HF TRL, ideally runnable in Colab.
- Evidence of real training, including reward/loss plots.
- A README covering the problem, environment, actions, observations, tasks, setup, and results.
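
For the compliance gate, here is a minimal sketch of the expected `reset`/`step`/`state` shape, assuming a document-editing task. The class and field names are illustrative stand-ins, not the actual OpenEnv base types; a real submission would subclass the current OpenEnv interfaces and declare the environment in `openenv.yaml`.

```python
from dataclasses import dataclass, field


# Illustrative typed models. A real submission would extend the current
# OpenEnv Action/Observation/State base types rather than plain dataclasses.
@dataclass
class EditAction:
    command: str  # hypothetical, e.g. "replace_span"
    payload: dict = field(default_factory=dict)


@dataclass
class DocObservation:
    document: str
    verifier_feedback: str
    reward: float
    done: bool


@dataclass
class DocState:
    task_id: str
    step_count: int


class DocEditEnv:
    """Sketch of the reset/step/state contract the gate asks for."""

    def __init__(self, tasks: list[str], max_steps: int = 300):
        self.tasks = tasks
        self.max_steps = max_steps  # long horizons, up to 300 actions
        self._state = DocState(task_id="", step_count=0)

    def reset(self) -> DocObservation:
        self._state = DocState(task_id=self.tasks[0], step_count=0)
        return DocObservation(document=self.tasks[0], verifier_feedback="",
                              reward=0.0, done=False)

    def step(self, action: EditAction) -> DocObservation:
        self._state.step_count += 1
        done = self._state.step_count >= self.max_steps
        # Apply the edit and run the verifier here (elided in this sketch).
        return DocObservation(document="", verifier_feedback="ok",
                              reward=0.0, done=done)

    @property
    def state(self) -> DocState:
        return self._state
```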
|
|
## Strategic Lessons From The Docs
|
|
1. Pick a task where success can be verified programmatically.
2. Make the environment ambitious, but keep the first curriculum levels easy enough that an untrained policy earns non-zero reward.
3. Use multiple reward signals, not one monolithic score (see the sketch after this list).
4. Build the environment and verifier before training.
5. Show a before/after behavior difference, not only a training script.
6. Avoid a static benchmark. An adaptive curriculum or self-play reads as far more ambitious.
7. The story matters almost as much as the engineering.
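
A sketch of lesson 3, wired into the TRL gate above. The signal names and weights are assumptions for illustration: each signal is scored separately so it can be plotted on its own reward curve, then combined into the single scalar per completion that TRL's `GRPOTrainer` reward-function convention expects. The verifier checks and toy dataset are placeholders.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical signal names and weights; each signal gets its own curve.
WEIGHTS = {"syntax_valid": 0.2, "target_match": 0.5, "diff_minimal": 0.3}


def score_signals(completion: str, target: str) -> dict[str, float]:
    # Placeholder checks; a real verifier would parse and diff the documents.
    return {
        "syntax_valid": 1.0 if completion.strip() else 0.0,
        "target_match": 1.0 if completion.strip() == target.strip() else 0.0,
        "diff_minimal": max(0.0, 1.0 - abs(len(completion) - len(target)) / 100.0),
    }


def combined_reward(completions, target, **kwargs):
    """TRL-style reward function: returns one float per completion."""
    return [
        sum(WEIGHTS[name] * value for name, value in score_signals(c, t).items())
        for c, t in zip(completions, target)
    ]


# Toy dataset: extra columns ("target") are forwarded to the reward function.
dataset = Dataset.from_dict({
    "prompt": ["Fix the broken heading levels in: ## Title / # Subtitle"],
    "target": ["# Title / ## Subtitle"],
})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=combined_reward,
    args=GRPOConfig(output_dir="grpo-docedit"),
    train_dataset=dataset,
)
trainer.train()
```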
|
|
## Lessons From The Prior DocEdit Work
|
|
The old DocEdit environment succeeded because it was:
|
|
- Real-world, not a game.
- OpenEnv compliant.
- Lightweight enough for the constraints.
- Deterministically graded.
- Easy to explain.
|
|
The later Qwen SFT + GRPO postmortem proved that document repair can improve with training, but it also exposed a strategic limitation: full-document rewrite policies are probably not the best final design. A stronger next step is a planner/executor setup with structured edit actions and verifier feedback.
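
To make "structured edit actions" concrete, here is one possible action vocabulary for the planner/executor split; all names are hypothetical. The design point is that the executor emits small, individually verifiable operations instead of rewriting the whole document.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical structured edit vocabulary: each action targets a small span,
# so the verifier can grade every step instead of only the final document.
@dataclass
class ReplaceLines:
    start: int
    end: int
    text: str


@dataclass
class InsertAfter:
    line: int
    text: str


@dataclass
class DeleteLines:
    start: int
    end: int


Edit = Union[ReplaceLines, InsertAfter, DeleteLines]


def apply_edit(lines: list[str], edit: Edit) -> list[str]:
    """Executor: apply one structured edit to the document (1-indexed lines)."""
    if isinstance(edit, ReplaceLines):
        return lines[: edit.start - 1] + edit.text.splitlines() + lines[edit.end :]
    if isinstance(edit, InsertAfter):
        return lines[: edit.line] + edit.text.splitlines() + lines[edit.line :]
    return lines[: edit.start - 1] + lines[edit.end :]
```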
|
|
## Lessons From The Winning Kube SRE Example


The winning pattern was not just "Kubernetes environment." It was:


- A vivid professional world: a tiny model learns to be on-call.
- Real or realistic tools.
- Multi-step investigation and repair.
- Adaptive curriculum.
- Adversarial scenario generation.
- Multi-layer rewards.
- A story where the agent and environment co-evolve.


The key insight to borrow:


> The environment should fight back as the agent improves.
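
One minimal way to implement this, assuming a leveled scenario generator: a controller that raises the difficulty level whenever the agent's rolling success rate crosses a threshold, and lowers it when the agent stalls. The window size and thresholds here are illustrative, not a prescribed design.

```python
from collections import deque


class AdaptiveDifficulty:
    """Raise scenario difficulty when the agent succeeds too often,
    lower it when the agent stalls, keeping reward non-degenerate."""

    def __init__(self, window: int = 50, raise_at: float = 0.8,
                 lower_at: float = 0.2):
        self.results: deque = deque(maxlen=window)
        self.raise_at = raise_at
        self.lower_at = lower_at
        self.level = 1

    def record(self, success: bool) -> int:
        self.results.append(success)
        if len(self.results) == self.results.maxlen:
            rate = sum(self.results) / len(self.results)
            if rate >= self.raise_at:
                self.level += 1       # the environment fights back
                self.results.clear()
            elif rate <= self.lower_at and self.level > 1:
                self.level -= 1       # back off so reward stays non-zero
                self.results.clear()
        return self.level
```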
|
|
## Our Target Shape


To maximize win probability, the idea should combine:


- Theme 2: long-horizon planning, ideally up to 300 actions.
- Theme 3.1: professional world modeling with realistic tools and persistent state.
- Theme 4: self-improvement through adaptive scenario generation.
- Existing leverage from DocEdit so we can build fast.


The strongest direction is therefore not "another document editor." It is a long-horizon professional control room where document edits are one part of a larger verified workflow.
|
|
|
|