# OpenEnv Hackathon Readout

Date: 2026-04-24

## What The Hackathon Wants

The winning submission should be an OpenEnv-compliant environment where an LLM acts step by step, receives programmatic feedback, and measurably improves through RL or RL-style training.

The most important judging weights are:

| Criterion | Weight | Practical meaning |
|---|---:|---|
| Environment innovation | 40% | Novel, challenging, meaningful agent behavior, not a clone of common games or toy tasks. |
| Storytelling | 30% | A judge should understand the world, the agent, what it learned, and why it matters in 3 to 5 minutes. |
| Showing improvement | 20% | Reward curves, before/after runs, baseline comparison, actual training evidence. |
| Reward/training pipeline | 10% | Coherent rubrics, TRL or Unsloth script, reproducible pipeline. |

Minimum gates:

- Use the latest OpenEnv.
- Hosted Hugging Face Space.
- OpenEnv-compliant `reset`, `step`, `state`, typed models, `openenv.yaml` (sketched in the appendix below).
- Training script using Unsloth or HF TRL, ideally runnable in Colab (sketched in the appendix).
- Evidence of real training, including reward/loss plots.
- README covering the problem, environment, actions, observations, tasks, setup, and results.

## Strategic Lessons From The Docs

1. Pick a task where success can be verified programmatically.
2. Make the environment ambitious, but keep the first curriculum levels easy enough to yield non-zero reward early.
3. Use multiple reward signals, not one monolithic score (sketched in the appendix).
4. Build the environment and verifier before training.
5. Show a before/after behavior difference, not only a training script.
6. Avoid a static benchmark. An adaptive curriculum and self-play read as far more ambitious (sketched in the appendix).
7. The story matters almost as much as the engineering.

## Lessons From The Prior DocEdit Work

The old DocEdit environment passed because it was:

- Real-world, not a game.
- OpenEnv compliant.
- Lightweight enough for the hackathon's constraints.
- Deterministically graded.
- Easy to explain.

The later Qwen SFT + GRPO postmortem proved that document repair can improve with training, but it also exposed a strategic limitation: full-document rewrite policies are probably not the best final design. A stronger next step is a planner/executor setup with structured edit actions and verifier feedback (sketched in the appendix).

## Lessons From The Winning Kube SRE Example

The winning pattern was not just "Kubernetes environment." It was:

- A vivid professional world: a tiny model learns to be on-call.
- Real or realistic tools.
- Multi-step investigation and repair.
- An adaptive curriculum.
- Adversarial scenario generation.
- Multi-layer rewards.
- A story in which the agent and environment co-evolve.

The key insight to borrow:

> The environment should fight back as the agent improves.

## Our Target Shape

To maximize win probability, the idea should combine:

- Theme 2: long-horizon planning, ideally up to 300 actions.
- Theme 3.1: professional world modeling with realistic tools and persistent state.
- Theme 4: self-improvement through adaptive scenario generation.
- Existing leverage from DocEdit so we can build fast.

The strongest direction is therefore not "another document editor." It is a long-horizon professional control room where document edits are one part of a larger verified workflow.
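## Appendix: Sketches

The sketches below back the pointers above. Every class name, weight, and helper in them is an assumption for illustration, not a decision.

First, the compliance gate. This is a minimal sketch of a `reset`/`step`/`state` surface with typed models, written as plain Python dataclasses with toy logic. `DocOpsEnv`, `Action`, `Observation`, and `StepResult` are made-up names, not OpenEnv's actual base classes; a real submission should subclass the latest OpenEnv interfaces instead.

```python
from dataclasses import dataclass, field

# Hypothetical typed models; OpenEnv's real base classes will differ.
@dataclass
class Action:
    name: str                          # e.g. "append", later "replace_span"
    args: dict = field(default_factory=dict)

@dataclass
class Observation:
    text: str                          # what the agent sees this step
    step: int
    done: bool

@dataclass
class StepResult:
    observation: Observation
    reward: float
    done: bool

class DocOpsEnv:
    """Minimal reset/step/state surface matching the hackathon gates."""

    def __init__(self, target: str = "hello world", max_steps: int = 300):
        self.target = target           # stand-in for a real verified goal
        self.max_steps = max_steps
        self.doc = ""
        self.steps = 0

    def reset(self) -> Observation:
        self.doc, self.steps = "", 0
        return Observation(self.doc, 0, False)

    def step(self, action: Action) -> StepResult:
        # Toy executor: a single structured action that appends text.
        if action.name == "append":
            self.doc += action.args.get("text", "")
        self.steps += 1
        solved = self.doc.strip() == self.target
        done = solved or self.steps >= self.max_steps
        return StepResult(Observation(self.doc, self.steps, done),
                          1.0 if solved else 0.0, done)

    def state(self) -> dict:
        return {"doc": self.doc, "steps": self.steps}
```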
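For the training-script gate, a minimal TRL GRPO sketch. It assumes the `GRPOTrainer`/`GRPOConfig` API from recent TRL releases, where reward functions receive completions and return floats; the model name, two-prompt dataset, and `</edit>`-tag reward are placeholders, and the pinned TRL version's docs should be checked before relying on any argument here.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def verifier_reward(completions, **kwargs):
    # Stand-in verifier: reward completions that emit a closing tag.
    # A real run would call the environment's deterministic grader.
    return [1.0 if "</edit>" in c else 0.0 for c in completions]

# Placeholder prompts; a real dataset would come from the task generator.
dataset = Dataset.from_dict({"prompt": [
    "Fix the heading levels in: # a ### b",
    "Repair the broken table row: | x | y",
]})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=verifier_reward,
    args=GRPOConfig(output_dir="grpo-docops", num_generations=4),
    train_dataset=dataset,
)
trainer.train()
```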
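For strategic lesson 3, a sketch of a layered reward that blends a format check, partial task credit, and an efficiency term instead of one monolithic score. The 0.2/0.6/0.2 weights and the input fields are placeholders to tune against real rollouts.

```python
def layered_reward(schema_ok: bool, tests_passed: int, tests_total: int,
                   edit_count: int, budget: int) -> float:
    """Blend several verifier signals into one scalar reward."""
    format_score = 1.0 if schema_ok else 0.0           # did the output parse?
    task_score = tests_passed / max(tests_total, 1)    # partial credit per check
    efficiency = max(0.0, 1.0 - edit_count / max(budget, 1))  # fewer edits wins
    return 0.2 * format_score + 0.6 * task_score + 0.2 * efficiency
```

Keeping the components separate also gives the storytelling section something to plot: each signal can be graphed on its own curve rather than hiding inside one number.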
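For the planner/executor direction from the DocEdit postmortem, a sketch of structured edit actions as typed operations over document lines, so the policy proposes small verifiable steps instead of full rewrites. The action names and line-based indexing are assumptions.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical structured edit actions; indices refer to line positions.
@dataclass
class ReplaceLines:
    start: int
    end: int          # exclusive
    text: str

@dataclass
class InsertAfter:
    line: int
    text: str

@dataclass
class DeleteLines:
    start: int
    end: int          # exclusive

EditAction = Union[ReplaceLines, InsertAfter, DeleteLines]

def apply_edit(lines: list[str], action: EditAction) -> list[str]:
    """Apply one structured action; the verifier re-scores the result."""
    if isinstance(action, ReplaceLines):
        return lines[:action.start] + [action.text] + lines[action.end:]
    if isinstance(action, InsertAfter):
        return lines[:action.line + 1] + [action.text] + lines[action.line + 1:]
    if isinstance(action, DeleteLines):
        return lines[:action.start] + lines[action.end:]
    raise ValueError(f"unknown action: {action!r}")
```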
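Finally, for lesson 6 and the Kube SRE insight, a sketch of an adaptive curriculum that promotes the agent to harder scenarios once its rolling success rate clears a threshold, so the environment fights back rather than staying a static benchmark. The window size, promotion threshold, and scenario fields are placeholders; a stronger version would generate scenarios adversarially.

```python
import random

class AdaptiveCurriculum:
    """Raise scenario difficulty as the rolling success rate climbs."""

    def __init__(self, levels: int = 10, window: int = 50, promote_at: float = 0.7):
        self.level = 1
        self.levels = levels
        self.window = window
        self.promote_at = promote_at
        self.recent: list[bool] = []

    def record(self, solved: bool) -> None:
        self.recent.append(solved)
        self.recent = self.recent[-self.window:]
        rate = sum(self.recent) / len(self.recent)
        if len(self.recent) == self.window and rate >= self.promote_at:
            self.level = min(self.level + 1, self.levels)
            self.recent.clear()            # re-measure at the new level

    def sample_scenario(self) -> dict:
        # Placeholder generator: harder levels inject more faults and
        # demand longer horizons, up to the 300-action target shape.
        return {"faults": self.level,
                "horizon": min(30 * self.level, 300),
                "seed": random.randrange(10**6)}
```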