OpenEnv Hackathon Readout
Date: 2026-04-24
What The Hackathon Wants
The winning submission should be an OpenEnv-compliant environment where an LLM acts step by step, receives programmatic feedback, and measurably improves through RL or RL-style training.
The most important judging weights are:
| Criterion | Weight | Practical meaning |
|---|---|---|
| Environment innovation | 40% | Novel, challenging, meaningful agent behavior, not a clone of common games or toy tasks. |
| Storytelling | 30% | A judge should understand the world, the agent, what it learned, and why it matters in 3 to 5 minutes. |
| Showing improvement | 20% | Reward curves, before/after runs, baseline comparison, actual training evidence. |
| Reward/training pipeline | 10% | Coherent rubrics, TRL or Unsloth script, reproducible pipeline. |
Minimum gates:
- Use the latest OpenEnv release.
- Hosted Hugging Face Space.
- OpenEnv-compliant `reset`, `step`, and `state`, typed models, and an `openenv.yaml`.
- Training script using Unsloth or HF TRL, ideally Colab.
- Evidence of real training, including reward/loss plots.
- README with problem, environment, actions, observations, tasks, setup, results.
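The `reset`/`step`/`state` gate can be pictured as a toy environment skeleton. This is a minimal sketch only: the class names, fields, and signatures below are illustrative stand-ins, not the actual OpenEnv base classes.

```python
from dataclasses import dataclass


# Illustrative stand-ins for OpenEnv's typed models; the real base
# classes and signatures come from the OpenEnv package itself.
@dataclass
class Observation:
    text: str
    done: bool = False
    reward: float = 0.0


@dataclass
class State:
    episode_id: int = 0
    step_count: int = 0


class DocRepairEnv:
    """Toy environment following the reset/step/state contract."""

    def __init__(self):
        self._state = State()
        self._target = "clean document"
        self._current = "brok3n document"

    def reset(self) -> Observation:
        # Start a new episode with a fresh corrupted document.
        self._state = State(episode_id=self._state.episode_id + 1)
        self._current = "brok3n document"
        return Observation(text=self._current)

    def step(self, action: str) -> Observation:
        # Here the policy proposes a full replacement document.
        self._state.step_count += 1
        self._current = action
        done = self._current == self._target
        return Observation(text=self._current, done=done,
                           reward=1.0 if done else 0.0)

    @property
    def state(self) -> State:
        return self._state
```

The point of the sketch is the shape of the contract, not the task: typed observations and state, a `reset` that starts an episode, and a `step` that returns programmatic feedback.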
Strategic Lessons From The Docs
- Pick a task where success can be verified programmatically.
- Make the environment ambitious, but keep the first curriculum levels easy enough that an untrained policy can earn non-zero reward.
- Use multiple reward signals, not one monolithic score.
- Build the environment and verifier before training.
- Show a before/after behavior difference, not only a training script.
- Avoid a static benchmark. Adaptive curriculum and self-play read as much more ambitious.
- The story matters almost as much as the engineering.
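The multiple-reward-signals lesson can be made concrete with a small shaped-reward sketch. The components and weights here are invented for illustration; the real rubric would use whatever checks the verifier supports.

```python
def shaped_reward(doc: str, target: str) -> dict:
    """Blend several verifiable signals instead of one monolithic score.

    Component checks and weights are illustrative, not a real rubric.
    """
    parses = doc.count("(") == doc.count(")")      # structural validity
    length_ok = abs(len(doc) - len(target)) <= 10  # stayed near target size
    exact = doc == target                          # full task success
    components = {
        "parses": 0.2 * parses,
        "length": 0.1 * length_ok,
        "exact": 0.7 * exact,
    }
    components["total"] = sum(components.values())
    return components
```

Partial credit from the cheaper checks keeps early reward non-zero, while the exact-match term still dominates once the policy starts solving the task.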
Lessons From The Prior DocEdit Work
The old DocEdit environment passed because it was:
- Real-world, not a game.
- OpenEnv compliant.
- Lightweight enough for the constraints.
- Deterministically graded.
- Easy to explain.
The later Qwen SFT + GRPO postmortem showed that document repair can improve with training, but it also exposed a strategic limitation: full-document rewrite policies are probably not the best final design. A stronger next step is a planner/executor setup with structured edit actions and verifier feedback.
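One way to picture the structured-edit direction is a minimal action schema plus an executor. The action types and fields below are assumptions sketched for this note, not an existing interface.

```python
from dataclasses import dataclass


# Hypothetical structured edit actions a planner policy could emit.
@dataclass
class ReplaceSpan:
    start: int
    end: int
    text: str


@dataclass
class InsertAt:
    pos: int
    text: str


def apply_edit(doc: str, edit) -> str:
    """Executor: apply one structured edit instead of rewriting the doc."""
    if isinstance(edit, ReplaceSpan):
        return doc[:edit.start] + edit.text + doc[edit.end:]
    if isinstance(edit, InsertAt):
        return doc[:edit.pos] + edit.text + doc[edit.pos:]
    raise ValueError(f"unknown edit type: {edit!r}")
```

Because each action is small and typed, the verifier can grade individual edits, which gives denser feedback than scoring a full rewrite.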
Lessons From The Winning Kube SRE Example
The winning pattern was not just "Kubernetes environment." It was:
- A vivid professional world: a tiny model learns to be on-call.
- Real or realistic tools.
- Multi-step investigation and repair.
- Adaptive curriculum.
- Adversarial scenario generation.
- Multi-layer rewards.
- A story where the agent and environment co-evolve.
The key insight to borrow:
The environment should fight back as the agent improves.
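That "fighting back" loop can be sketched as an adaptive curriculum that escalates difficulty when the agent's recent success rate climbs. The thresholds, window size, and scenario knobs below are invented for illustration.

```python
import random


class AdaptiveCurriculum:
    """Raise scenario difficulty as the agent improves; thresholds are illustrative."""

    def __init__(self, window: int = 20):
        self.level = 1
        self.window = window
        self.recent: list[bool] = []

    def record(self, success: bool) -> None:
        # Track a sliding window of recent episode outcomes.
        self.recent.append(success)
        self.recent = self.recent[-self.window:]
        if len(self.recent) == self.window:
            rate = sum(self.recent) / self.window
            if rate > 0.8:  # agent is winning: escalate
                self.level += 1
                self.recent.clear()
            elif rate < 0.2 and self.level > 1:  # agent is stuck: ease off
                self.level -= 1
                self.recent.clear()

    def sample_scenario(self, rng: random.Random) -> dict:
        # More corruptions per document at higher levels (assumed difficulty knob).
        return {"level": self.level,
                "num_corruptions": self.level + rng.randrange(2)}
```

The same mechanism extends naturally to adversarial scenario generation: instead of only raising a difficulty integer, the generator can target failure modes observed in the success window.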
Our Target Shape
To maximize win probability, the idea should combine:
- Theme 2: long-horizon planning, ideally up to 300 actions.
- Theme 3.1: professional world modeling with realistic tools and persistent state.
- Theme 4: self-improvement through adaptive scenario generation.
- Existing leverage from DocEdit so we can build fast.
The strongest direction is therefore not "another document editor." It is a long-horizon professional control room where document edits are one part of a larger verified workflow.