| # **Hackathon Self-Serve Guide: Build an RL Environment, Train an LLM, Ship a Demo**
|
|
| ## **0\) What you are building** |
|
|
| The core idea is not just to fine-tune a text model, but to build a **specialized LLM system** that can act inside an environment, get feedback, and improve through reinforcement learning. The practical stack discussed here is: |
|
|
| **Environment → verifier/reward functions → TRL trainer → Unsloth for efficiency → deployment on OpenEnv / Spaces**. |
|
|
| A strong project usually maps onto one of the hackathon themes.
|
|
| Please refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing) for theme guidelines on selecting & forming problem statements. |
|
|
| ## **1\) Start with the right project idea** |
|
|
| Pick a task that has all three of these properties: |
|
|
| 1. **The model can act step by step** |
| 2. **You can verify success programmatically** |
| 3. **The task is hard enough to be interesting, but not so hard that the model never succeeds** |
|
|
| This last point matters a lot. RL only works if the probability of getting a good answer is greater than zero. If your task is so hard that the model never gets any reward, you will burn compute and learn nothing. |
|
|
|
|
| A useful rule: **prefer tasks with crisp verification over tasks that only “look good” to a human.** RL gets easier when the reward is objective. |
|
|
| ## **2\) Understand the minimum RL loop before you build** |
|
|
| At a high level, your loop is: |
|
|
| 1. Give the model a prompt |
| 2. Let it generate an action, strategy, answer, or code |
| 3. Execute that output in an environment or verifier |
| 4. Convert the result into a reward |
| 5. Update the model so higher-reward behavior becomes more likely |
|
|
| That is the practical mental model for RL here. The system samples many outputs, scores them, and shifts probability mass away from bad outputs and toward better ones. |
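|
|
| A minimal sketch of that loop, assuming a `model` object with a `.generate()` method and an `env` with `.reset()` / `.step()` along the lines of the interface described later (the names here are placeholders, not a specific library API):
|
| ```python
| # Schematic rollout collection: sample several outputs for one prompt, score each
| # with the environment/verifier, and hand (prompt, completion, reward) triples to
| # whatever RL trainer you use for the update step.
| def collect_rollouts(model, env, prompt: str, n_samples: int = 8):
|     rollouts = []
|     for _ in range(n_samples):
|         env.reset()                              # 1) fresh episode for this prompt
|         completion = model.generate(prompt)      # 2) action / strategy / answer / code
|         result = env.step(completion)            # 3) execute in the environment or verifier
|         rollouts.append((prompt, completion, result.reward))  # 4) outcome -> scalar reward
|     return rollouts  # 5) the trainer shifts probability mass toward high-reward outputs
| ```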
|
|
| One especially useful framing is that RL is like a more efficient version of repeated in-context improvement. Instead of repeatedly stuffing previous examples into the context, you let backpropagation store what worked into the weights. |
|
|
| ## **3\) Decide whether you need SFT first** |
|
|
| Use this simple rule: |
|
|
| * If you have **a lot of good data**, use **SFT** |
| * If you **do not have data but can verify outputs**, use **RL** |
| * In many practical cases, do **a little SFT first**, then RL |
|
|
| Why this matters: |
|
|
| * SFT is generally more sample-efficient |
| * RL is useful when you can test outcomes but cannot cheaply author ideal traces |
| * RL often needs some warm start, formatting priming, or easy tasks first so that good rollouts happen at all |
|
|
| For hackathon teams, the best path is usually: |
|
|
| 1. Start from a capable base/instruct model |
| 2. Add light formatting or task scaffolding if needed |
| 3. Use RL for improvement, not as magic from scratch |
|
|
| ## **4\) Design the environment before you design the trainer** |
|
|
| Treat the environment as a first-class artifact. It should define: |
|
|
| * **reset()**: start a fresh episode |
| * **step(action)**: apply an action and return the next result |
| * **state() / observation**: what the agent sees |
| * **reward**: what counts as progress or success |
|
|
| OpenEnv standardizes this so the same training code can work across many environments, instead of every team inventing a different API. That is one of the main reasons to use it in a hackathon. |
|
|
| Think about your environment in this order: |
|
|
| 1. What does the agent observe? |
| 2. What actions can it take? |
| 3. What ends an episode? |
| 4. How do you compute reward? |
| 5. How do you stop abuse, infinite loops, or cheating? |
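|
|
| A toy example that answers those five questions for a number-guessing task (purely illustrative; a real environment would live inside the OpenEnv scaffold described in the next section):
|
| ```python
| import random
| from dataclasses import dataclass
|
| @dataclass
| class Observation:
|     hint: str        # 1) what the agent observes
|     reward: float    # 4) what counts as progress or success
|     done: bool       # 3) what ends an episode
|
| class GuessEnv:
|     """Guess a hidden number between 1 and 100 within a fixed step budget."""
|
|     def __init__(self, max_steps: int = 6):
|         self.max_steps = max_steps   # 5) hard step cap stops infinite loops
|         self.reset()
|
|     def reset(self) -> Observation:
|         self.target = random.randint(1, 100)
|         self.steps = 0
|         return Observation(hint="guess a number between 1 and 100", reward=0.0, done=False)
|
|     def step(self, action: int) -> Observation:   # 2) the action space is one integer
|         self.steps += 1
|         if action == self.target:
|             return Observation(hint="correct", reward=1.0, done=True)
|         hint = "higher" if action < self.target else "lower"
|         return Observation(hint=hint, reward=0.0, done=self.steps >= self.max_steps)
| ```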
|
|
| ## **5\) Build the environment using OpenEnv**
|
|
| The intended workflow is to bootstrap an environment skeleton and then fill in the behavior. OpenEnv’s CLI creates the scaffolding for you. The environment is implemented as a Python package and exposed via a FastAPI app. |
|
|
| Your implementation typically defines: |
|
|
| * action dataclass |
| * observation dataclass |
| * state representation |
| * environment methods like reset and step |
| * FastAPI wrapper / client-server interface |
|
|
| That gives you a clean separation: |
|
|
| * the **environment** handles world dynamics and scoring, |
| * the **trainer** handles optimization, |
| * and the **model** just learns to act inside the interface. |
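|
|
| As a rough sketch of that separation, here is a hand-rolled FastAPI wrapper around the toy environment from section 4. The real scaffold generated by OpenEnv’s CLI will differ in file layout and naming; this only shows the client-server shape (typed action in, JSON observation out):
|
| ```python
| from fastapi import FastAPI
| from pydantic import BaseModel
|
| class StepRequest(BaseModel):   # stands in for the action dataclass
|     guess: int
|
| app = FastAPI()
| env = GuessEnv()                # the toy environment sketched in section 4
|
| @app.post("/reset")
| def reset():
|     obs = env.reset()
|     return {"hint": obs.hint, "reward": obs.reward, "done": obs.done}
|
| @app.post("/step")
| def step(req: StepRequest):
|     obs = env.step(req.guess)
|     return {"hint": obs.hint, "reward": obs.reward, "done": obs.done}
| ```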
|
|
| ## **6\) Keep the task simple at first** |
|
|
| Do not begin with your hardest benchmark. Start with the easiest version of your environment that still proves the concept. This is where curriculum learning helps. |
|
|
| A good progression: |
|
|
| 1. easy tasks with short horizons, |
| 2. medium tasks with a little more branching, |
| 3. harder tasks only after the model starts getting non-zero reward. |
|
|
| The principle is simple: **make success possible early**. If the model never sees successful trajectories, learning stalls. |
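|
|
| One lightweight way to implement this, sketched with made-up names: keep sampling from the easiest tier until a rolling success rate clears a threshold, then unlock the next tier.
|
| ```python
| import random
| from collections import deque
|
| class Curriculum:
|     def __init__(self, tiers, promote_at: float = 0.6, window: int = 50):
|         self.tiers = tiers                  # e.g. [easy_tasks, medium_tasks, hard_tasks]
|         self.level = 0
|         self.promote_at = promote_at
|         self.recent = deque(maxlen=window)  # rolling window of episode successes
|
|     def sample_task(self):
|         return random.choice(self.tiers[self.level])
|
|     def record(self, success: bool):
|         self.recent.append(success)
|         rate = sum(self.recent) / max(len(self.recent), 1)
|         if rate >= self.promote_at and self.level < len(self.tiers) - 1:
|             self.level += 1                 # unlock the next tier
|             self.recent.clear()
| ```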
|
|
| ## **7\) Design rewards carefully** |
|
|
| Your reward function is your task specification. If it is weak, incomplete, or easy to exploit, the model will optimize the wrong thing very efficiently. |
|
|
| A strong reward design usually includes multiple components, for example: |
|
|
| * execution success, |
| * correctness, |
| * format compliance, |
| * timeouts, |
| * resource usage, |
| * safety constraints, |
| * and anti-cheating checks. |
|
|
| One explicit recommendation was to use **multiple independent reward functions**, not just one. If you only have a single reward signal, it is easier for the model to hack it. Multiple independent checks reduce that risk. |
|
|
| For example, for a coding environment: |
|
|
| * reward passing tests, |
| * penalize timeouts, |
| * reward format compliance, |
| * reject use of forbidden globals, |
| * and separately verify the function contract. |
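|
|
| A sketch of what those components could look like as separate functions. The signals they consume (test counts, a timeout flag, the generated source) are assumed to come from your verifier or executor:
|
| ```python
| def reward_tests_pass(tests_passed: int, tests_total: int) -> float:
|     # Partial credit per passing test; 1.0 only when everything passes.
|     return tests_passed / max(tests_total, 1)
|
| def reward_no_timeout(timed_out: bool) -> float:
|     return -1.0 if timed_out else 0.0
|
| def reward_format(completion: str) -> float:
|     # Format compliance: the answer must define the required entry point.
|     return 0.2 if "def solve(" in completion else 0.0
|
| def reward_no_forbidden_globals(source: str) -> float:
|     banned = ("globals(", "__import__", "eval(", "exec(")
|     return -1.0 if any(tok in source for tok in banned) else 0.0
|
| def total_reward(**components: float) -> float:
|     # Log every component separately; combine (or weight) them only at the end.
|     return sum(components.values())
| ```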
|
|
| ## **8\) Protect yourself against reward hacking** |
|
|
| Reward hacking is one of the biggest practical failure modes. The model may learn shortcuts that maximize your reward without solving the real task. Examples mentioned include: |
|
|
| * editing timers, |
| * caching results, |
| * abusing globals, |
| * mutating protected state, |
| * or exploiting environment bugs. |
|
|
| What to do: |
|
|
| 1. Use multiple independent reward functions |
| 2. Lock down execution where possible |
| 3. Add time limits |
| 4. Avoid unrestricted global state |
| 5. Sample outputs frequently and inspect them |
| 6. Terminate or roll back runs if behavior drifts badly |
|
|
| A particularly practical recommendation was to use a **locked-down function** or restricted execution approach so the model cannot rely on undeclared globals or hidden cached state. |
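|
|
| A sketch of that idea: run generated code in a subprocess with near-empty globals and a hard time limit, so it cannot rely on undeclared globals, cached state, or stalling. This is illustrative hardening, not a real sandbox; for genuinely untrusted code you still want container- or process-level isolation with resource limits.
|
| ```python
| import multiprocessing
|
| def _run(code: str, queue) -> None:
|     # Near-empty globals: only an explicit allowlist of builtins is visible.
|     safe_globals = {"__builtins__": {"range": range, "len": len, "print": print}}
|     try:
|         exec(code, safe_globals)
|         queue.put(("ok", None))
|     except Exception as exc:
|         queue.put(("error", repr(exc)))
|
| def locked_down_run(code: str, timeout_s: float = 2.0):
|     queue = multiprocessing.Queue()
|     proc = multiprocessing.Process(target=_run, args=(code, queue))
|     proc.start()
|     proc.join(timeout_s)                 # hard wall-clock limit
|     if proc.is_alive():
|         proc.terminate()
|         return ("timeout", None)
|     return queue.get() if not queue.empty() else ("error", "no result")
| ```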
|
|
| Also, do not just let training run forever without checking generations. Periodic human inspection is still necessary. |
|
|
| ## **9\) Use process-aware feedback when you can** |
|
|
| Naively assigning the same final reward to every token is inefficient. If possible, use richer supervision that distinguishes good intermediate steps from bad ones. That is the idea behind **process supervision**. |
|
|
| In practice, this can be approximated by: |
|
|
| * line-by-line checks, |
| * step-level verifiers, |
| * program trace analysis, |
| * or LLM-as-a-judge for intermediate reasoning. |
|
|
| But be careful: LLM-as-a-judge can itself be gamed. Use it as one signal, not the only signal. |
|
|
| For a hackathon, outcome-based verification plus a few lightweight process checks is usually the sweet spot. |
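|
|
| A minimal sketch of that sweet spot: blend a per-step check with the final outcome, keeping most of the weight on the outcome. The step verifier and the weights below are placeholders for whatever fits your task:
|
| ```python
| def check_step(step: str) -> float:
|     # Placeholder step verifier: replace with a real check, e.g. "does this line
|     # parse", "does this equation balance", or an LLM-as-a-judge call.
|     return 1.0 if step.strip() else 0.0
|
| def process_reward(steps: list[str], final_ok: bool) -> float:
|     step_score = sum(check_step(s) for s in steps) / max(len(steps), 1)
|     return 0.3 * step_score + 0.7 * (1.0 if final_ok else 0.0)
| ```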
|
|
| ## **10\) Pick the right training stack** |
|
|
| The intended stack here is: |
|
|
| * **TRL** for RL training algorithms |
| * **Unsloth** to make RL training and inference more efficient |
| * **OpenEnv** to standardize environment interaction |
|
|
| This combination works because: |
|
|
| * OpenEnv gives you a common environment interface |
| * TRL gives you RL trainers like GRPO |
| * Unsloth reduces memory use and improves efficiency on top of TRL |
|
|
| One of the practical examples used the same prompt repeated many times, routed through an environment, with TRL driving training and Unsloth helping with performance. |
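|
|
| On the model side, loading through Unsloth typically looks something like the following. The parameter names follow Unsloth’s documented `FastLanguageModel` API, but check the current docs and pick a model that fits your GPU:
|
| ```python
| from unsloth import FastLanguageModel
|
| # 4-bit base model plus LoRA adapters keeps memory low enough for a single GPU.
| model, tokenizer = FastLanguageModel.from_pretrained(
|     model_name="unsloth/Qwen2.5-3B-Instruct",   # any small instruct model works
|     max_seq_length=2048,
|     load_in_4bit=True,
| )
| model = FastLanguageModel.get_peft_model(
|     model,
|     r=16,             # LoRA rank
|     lora_alpha=16,
|     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
|                     "gate_proj", "up_proj", "down_proj"],
| )
| ```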
|
|
| ## **11\) Prefer GRPO / RLVR style training for verifiable tasks** |
|
|
| The RL setup discussed here leans toward **RL with verifiable rewards**: |
|
|
| * instead of a learned reward model, |
| * use a verifier, test harness, regex check, executor, or environment. |
|
|
| GRPO was presented as a more efficient evolution of older PPO-style setups, chiefly because it drops the separate value model and instead estimates advantages from a group of sampled completions.
|
|
| For hackathon purposes, the key practical takeaway is: |
|
|
| * if the task is verifiable, |
| * build the verifier first, |
| * then plug that verifier into RL training. |
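|
|
| Sketched with TRL’s `GRPOTrainer`, that flow looks roughly like this. The reward-function signature (completions plus dataset columns passed as keyword arguments, returning one float per completion) follows TRL’s documented pattern, but verify it against the TRL version you install:
|
| ```python
| from datasets import Dataset
| from trl import GRPOConfig, GRPOTrainer
|
| def verifier_reward(completions, answer, **kwargs):
|     # `answer` is the dataset column of the same name; here the verifier is a
|     # trivial string check, but it could call your environment or test harness.
|     return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]
|
| dataset = Dataset.from_list([
|     {"prompt": "What is 6 * 7? Reply with just the number.", "answer": "42"},
|     # ... plus many more verifiable prompts
| ])
|
| trainer = GRPOTrainer(
|     model=model,                      # e.g. the Unsloth-loaded model from above
|     processing_class=tokenizer,
|     reward_funcs=[verifier_reward],   # pass several independent functions here
|     args=GRPOConfig(
|         output_dir="grpo-demo",
|         per_device_train_batch_size=8,   # keep divisible by num_generations
|         num_generations=8,
|         max_completion_length=128,
|     ),
|     train_dataset=dataset,
| )
| trainer.train()
| ```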
|
|
| ## **12\) Keep inference fast** |
|
|
| One important point: in RL for LLMs, **inference can dominate total runtime**. Over time, rollout generation often becomes the bottleneck, not the optimizer step. |
|
|
| That means your project speed depends heavily on: |
|
|
| * fast sampling, |
| * tight environment loops, |
| * low-overhead execution, |
| * and efficient model runtime. |
|
|
| This is one reason Unsloth matters in the stack, and another reason to avoid overly heavy environments early in the hackathon. |
|
|
| ## **13\) Deploy your environment early** |
|
|
| OpenEnv environments are designed to be deployed as **Hugging Face Spaces**, which provide: |
|
|
| * a running server, |
| * a Git repository, |
| * and a container registry. |
|
|
| That gives you several ways to work: |
|
|
| * interact with the remote Space directly, |
| * install the client code from the repo, |
| * pull and run the container locally, |
| * or run the FastAPI app locally via Python/Uvicorn. |
|
|
| Why this is good for a hackathon: |
|
|
| * one shared source of truth, |
| * easier collaboration, |
| * easier demos, |
| * easier switching between local and remote execution. |
|
|
| A good habit is to deploy an early version of the environment before training seriously. That catches API and packaging issues early. |
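|
|
| For example, once the Space is up you can smoke-test it with plain HTTP. The routes below mirror the FastAPI sketch from section 5; your generated app and client may use different routes and payload schemas, so treat this as a shape check rather than a copy-paste recipe.
|
| ```python
| import requests
|
| BASE_URL = "https://your-username-your-env.hf.space"   # hypothetical Space URL
|
| obs = requests.post(f"{BASE_URL}/reset", json={}).json()
| print("initial observation:", obs)
|
| result = requests.post(f"{BASE_URL}/step", json={"guess": 50}).json()
| print("reward:", result.get("reward"), "done:", result.get("done"))
| ```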
|
|
| ## **14\) Scale only after the environment is stable** |
|
|
| There was a dedicated tutorial flow around: |
|
|
| 1. environment, |
| 2. deployment, |
| 3. scaling, |
| 4. training with TRL and Wordle. |
|
|
| Follow the same order. |
|
|
| Do **not** start with scale. First confirm: |
|
|
| * reset works, |
| * step works, |
| * rewards are sensible, |
| * timeouts work, |
| * logs are visible, |
| * and the environment can be run locally and remotely. |
|
|
| Only then: |
|
|
| * increase batch sizes, |
| * duplicate prompts or tasks, |
| * expand task diversity, |
| * and benchmark throughput. |
|
|
| ## **15\) Monitor the right things during training** |
|
|
| Do not watch only one scalar. Monitor: |
|
|
| * overall reward, |
| * individual reward function columns, |
| * success indicators, |
| * timeout frequency, |
| * and generated strategies over time. |
|
|
| A very concrete suggestion was: |
|
|
| * watch whether the reward is going up, |
| * and separately watch critical columns like “function works.” |
|
|
| Also inspect actual generations during training. A rising reward is not enough if the model is learning to exploit bugs. |
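|
|
| One cheap habit, sketched below: wrap each reward function so every component is recorded under its own name, and append a small sample of raw generations to a file you actually open during training.
|
| ```python
| import json
| import random
|
| def logged(reward_fn, history: dict):
|     # Record each reward component under its own column-like key.
|     def wrapper(completions, **kwargs):
|         scores = reward_fn(completions, **kwargs)
|         history.setdefault(reward_fn.__name__, []).extend(scores)
|         return scores
|     wrapper.__name__ = reward_fn.__name__
|     return wrapper
|
| def dump_samples(completions, scores, k: int = 2, path: str = "samples.jsonl"):
|     # Append a few random generations with their rewards for human inspection.
|     picks = random.sample(list(zip(completions, scores)), k=min(k, len(completions)))
|     with open(path, "a") as f:
|         for completion, score in picks:
|             f.write(json.dumps({"completion": completion, "reward": score}) + "\n")
| ```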
|
|
| ## **16\) Save models correctly** |
|
|
| If you use QLoRA / LoRA-style training, be careful when saving. One explicit warning was: |
|
|
| **Do not upcast a 4-bit model to 16-bit and then merge the LoRA weights naively.** That can badly damage model quality. Instead, use the proper merged-save path, or use the adapters directly. |
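|
|
| With Unsloth, the two safe paths look roughly like this (method names follow Unsloth’s documented saving utilities; confirm against the current docs before relying on them):
|
| ```python
| # Option 1: keep the LoRA adapters only and load them on top of the base model later.
| model.save_pretrained("my-run-lora")
| tokenizer.save_pretrained("my-run-lora")
|
| # Option 2: use the dedicated merged-save helper instead of manually upcasting the
| # 4-bit base model and merging the adapters yourself.
| model.save_pretrained_merged("my-run-merged", tokenizer, save_method="merged_16bit")
| ```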
|
|
| For participants, that means: |
|
|
| * keep your training save path simple, |
| * test post-training inference immediately, |
| * and do not leave export until the end. |
|
|
| ## **17\) How to structure your team over the hackathon** |
|
|
| A very effective team split is: |
|
|
| **Person A: Environment** |
|
|
| * builds reset/step/state |
| * adds timeouts and safety constraints |
| * makes local and remote execution work |
|
|
| **Person B: Verifier / Rewards** |
|
|
| * writes multiple reward functions |
| * adds anti-hacking checks |
| * makes failure cases visible |
|
|
| **Person C: Training** |
|
|
| * sets up TRL \+ Unsloth |
| * runs experiments |
| * tracks metrics and generations |
|
|
| **Person D: Demo / Product** |
|
|
| * prepares the Space demo |
| * creates a simple interface |
| * records examples and final benchmarks |
|
|
| This split matches the way the stack naturally decomposes in practice. |
|
|
| ## **18\) A practical 1-day execution plan** |
|
|
| ### **Phase 1: Pick a narrow task** |
|
|
| Choose a small, verifiable environment. Avoid huge long-horizon tasks first. |
|
|
| ### **Phase 2: Build the environment** |
|
|
| Use OpenEnv init, implement reset/step/state, and get a local loop working. |
|
|
| ### **Phase 3: Build rewards** |
|
|
| Add at least 2–4 independent reward checks, plus timeout and anti-cheat logic. |
|
|
| ### **Phase 4: Deploy** |
|
|
| Push to a Space or run locally via container/Uvicorn so teammates can use the same environment. |
|
|
| ### **Phase 5: Train small** |
|
|
| Run a tiny TRL \+ Unsloth experiment first. Look at outputs, not just metrics. |
|
|
| ### **Phase 6: Inspect for hacking** |
|
|
| Sample generations. Check for globals, hacks, environment abuse, or suspicious shortcuts. |
|
|
| ### **Phase 7: Add curriculum** |
|
|
| If the model gets zero reward too often, simplify tasks or add easier start states. |
|
|
| ### **Phase 8: Train bigger** |
|
|
| Only after the loop is stable should you increase scale, batch size, or environment diversity. |
|
|
| ### **Phase 9: Save and demo** |
|
|
| Export the trained model correctly, test inference, and show before/after behavior. |
|
|
| ## **19\) What judges or reviewers will likely find compelling** |
|
|
| The strongest hackathon projects usually show: |
|
|
| * a clear environment design, |
| * objective reward functions, |
| * evidence that the model improved, |
| * prevention against reward hacking, |
| * a reproducible deployment story, |
| * and a sharp demo. |
|
|
| A simple but strong demo format is: |
|
|
| 1. baseline model attempt, |
| 2. reward/verifier output, |
| 3. trained model attempt, |
| 4. measurable improvement, |
| 5. short explanation of safeguards. |
|
|
| ## **20\) Suggested problem statement theme directions** |
|
|
| Please refer to [\[External\] Apr ‘26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing)
|
|
| ## **21\) Common mistakes to avoid** |
|
|
| * Picking a task so hard that success probability is zero |
| * Using only one reward function |
| * Not checking for reward hacking |
| * Training before the environment is stable |
| * Relying only on average reward and not inspecting outputs |
| * Forgetting timeouts and sandbox limits |
| * Saving LoRA/QLoRA models incorrectly |
|
|
| ## **22\) Learning Resources** |
|
|
| **(Recommended) RL Environment Lecture Chapters:** |
| [**RL Mega Lecture**](https://openenv-india-apr-2026.lovable.app/) |
|
|
| **Module 1: Why OpenEnv?** (\~7 min) |
| ▸ Workshop 8:02–15:05 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=482s](https://www.youtube.com/watch?v=1jU05MlENOI&t=482s) |
| ▸ Sanyam: RL loop, fragmented env APIs, OpenEnv as universal interface, Gymnasium spec \+ Docker |
| ▸ Alt: Mega Lecture 40:01–46:00 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=2401s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=2401s) |
|
|
| **Module 2: Using Existing Envs** (\~7.5 min) |
| ▸ Workshop 35:33–43:05 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2133s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2133s) |
| ▸ Ben: Hub org, env collections, 3 Space interfaces (server/repo/registry), from\_hub |
| ▸ Alt: Mega Lecture 1:24:11–1:30:00 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5051s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5051s) |
| |
| **Module 3: Deploying Envs** (\~9 min) |
| ▸ Mega Lecture 1:30:00–1:39:07 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5400s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5400s) |
| ▸ Ben: live openenv init, scaffold, running locally, openenv push, Docker run from Space |
| ▸ Alt: Workshop 43:05–48:30 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2585s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2585s) |
| |
| **Module 4: Building Your Own** (\~6.5 min) |
| ▸ Workshop 43:45–50:20 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2625s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2625s) |
| ▸ Ben: scaffold files, business logic (reset/step), models, client, publishing |
| ▸ Alt: Mega Lecture 1:33:30–1:39:07 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5610s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5610s) |
| |
| **Module 5: Training \+ TRL** (\~14 min) |
| ▸ Mega Lecture 1:53:20–2:07:12 — [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=6800s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=6800s) |
| ▸ Lewis: Wordle GRPO walkthrough — rollout function, reward shaping, GRPOTrainer, live training |
| ▸ Alt: Workshop 22:24–34:12 — [https://www.youtube.com/watch?v=1jU05MlENOI\&t=1344s](https://www.youtube.com/watch?v=1jU05MlENOI&t=1344s) |