feat: interactive Gradio demo at /demo
- Hackathon FAQs (participants).txt +339 -0
- requirements-train.txt +7 -0
- requirements-unsloth.txt +10 -0
- server/app.py +27 -1
- space/env/gradio_demo.py +690 -0
- space/env/requirements.txt +1 -0
- training/evaluate.py +153 -152
- training/training_unsloth.py +342 -341
Hackathon FAQs (participants).txt
ADDED
|
@@ -0,0 +1,339 @@
| 1 |
+
1) What is reinforcement learning in the context of LLMs?
|
| 2 |
+
Reinforcement learning for LLMs is a loop where the model generates an answer, code snippet, plan, or action sequence; that output is evaluated by a verifier or environment; and the resulting reward is used to update the model so higher-reward behaviors become more likely over time. In practice, this is often used after pretraining and supervised fine-tuning to sharpen behaviors like reasoning, code generation, or tool use. The session framed this intuition as turning repeated trial-and-error into weight updates instead of stuffing more and more examples into the prompt.
|
| 3 |
+
A good mental model is: supervised fine-tuning tells the model “copy this good target,” while RL tells it “try many possibilities and move probability mass toward the ones that score better.” PPO is one classic algorithm for this style of training, and GRPO is a later variant used heavily in modern LLM work because it can be more memory-efficient for certain setups. (arXiv)
|
| 4 |
+
For deeper reading:
|
| 5 |
+
* TRL docs for RL trainers and workflows. (Hugging Face)
|
| 6 |
+
* PPO paper. (arXiv)
|
| 7 |
+
* DeepSeekMath for GRPO. (arXiv)
|
| 8 |
+
2) Why do rewards matter so much?
|
| 9 |
+
Rewards are the only signal telling the model what “better” means. If your reward is well aligned with the real task, RL can push the model toward genuinely useful behavior. If your reward is incomplete or easy to game, the model will optimize the wrong thing very effectively. The session emphasized that RL gives you what you asked for, not necessarily what you meant.
|
| 10 |
+
For example, if you reward generated code only for passing a shallow regex or a weak unit test, the model may learn to exploit those checks instead of solving the underlying problem. This is why reward design is not a detail; it is the task specification. DeepMind’s discussion of “specification gaming” makes the same point in broader RL terms: weakly specified rewards create loopholes that search will discover. (Google DeepMind)
|
| 11 |
+
Useful reading:
|
| 12 |
+
* DeepMind on specification gaming. (Google DeepMind)
|
| 13 |
+
* Lilian Weng on reward hacking. (Lil'Log)
|
| 14 |
+
3) What is reward engineering?
|
| 15 |
+
Reward engineering is the work of designing, combining, validating, and monitoring reward signals so that optimization pressure produces the behavior you actually want. In LLM RL, that usually means deciding:
|
| 16 |
+
* what gets rewarded,
|
| 17 |
+
* how much it gets rewarded,
|
| 18 |
+
* when it gets rewarded,
|
| 19 |
+
* what gets penalized,
|
| 20 |
+
* and how you audit whether the reward is being gamed.
|
| 21 |
+
A practical reward function often has several components. For a code task, you might combine syntax validity, execution success, unit test pass rate, latency, memory use, formatting compliance, and safety checks. The session highlighted verifier-based reward design such as formatting checks, execution checks, regex checks, and environment-based evaluation instead of a learned reward model alone.
|
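As a concrete illustration, a minimal composite reward for a code task could look like the sketch below. The component names, weights, and function signature are illustrative assumptions, not part of TRL or OpenEnv:

    # Illustrative composite reward for a code-generation task.
    # Each component returns a score in [0, 1]; the weights are arbitrary.
    def composite_reward(candidate_code, ran_ok, tests_passed, tests_total,
                         runtime_s, time_limit_s):
        def syntax_ok(code):
            try:
                compile(code, "<candidate>", "exec")
                return 1.0
            except SyntaxError:
                return 0.0

        components = {
            "syntax":   (0.1, syntax_ok(candidate_code)),
            "executes": (0.2, 1.0 if ran_ok else 0.0),
            "tests":    (0.6, tests_passed / max(tests_total, 1)),
            "latency":  (0.1, 1.0 if runtime_s <= time_limit_s else 0.0),
        }
        return sum(weight * score for weight, score in components.values())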
| 22 |
+
A useful principle is to reward outcomes first, then add process constraints only where needed. Over-shaping the reward can make training brittle or bias the model into narrow strategies, while under-shaping makes hacking easier. (Google DeepMind)
|
| 23 |
+
4) What is RLVR, and how is it different from using a reward model?
|
| 24 |
+
RLVR usually means reinforcement learning with verifiable rewards. Instead of asking a learned reward model to score outputs, you use a verifier, tester, or environment that can check correctness more directly. The session gave examples like formatting checks, execution checks, regex-based checks, and environment rollouts.
|
| 25 |
+
This is powerful when correctness is externally testable. Code can be compiled and unit-tested. Math can often be checked against a final answer or symbolic verifier. Games can expose reward from the environment. Browser tasks can be checked by page state or task completion. In such cases, verifier-driven rewards are often more trustworthy than a purely learned scalar reward model.
|
| 26 |
+
TRL documents this broader environment-based training pattern, and OpenEnv is meant to standardize how such environments are defined and used. (Hugging Face)
|
| 27 |
+
5) Why do RL environments matter for LLMs?
|
| 28 |
+
Static prompt-response datasets are useful, but they are limited. Real deployments require models to interact with systems: codebases, browsers, files, APIs, games, tools, and simulators. RL environments let the model act, observe consequences, and keep going across multiple steps, which is much closer to real agent behavior. The session described environments as the bridge from isolated prompt solving to real-world interaction.
|
| 29 |
+
They also enable dynamic difficulty and richer feedback. Instead of training forever on a fixed set of prompts, the environment can generate or surface tasks that are more appropriate for the current model, which makes curriculum learning and continual challenge easier. This matches the broader “RL with environments” direction discussed in recent OpenEnv and TRL material. (Hugging Face)
|
| 30 |
+
For examples:
|
| 31 |
+
* BrowserGym for web-task environments. (GitHub)
|
| 32 |
+
* OpenEnv course and TRL integration docs. (GitHub)
|
| 33 |
+
6) What is OpenEnv, and why would a hackathon team use it?
|
| 34 |
+
OpenEnv is an open-source framework for defining and interacting with RL environments for LLM and agent training. The session described it as a standardized interface around concepts like reset, step, state, observations, actions, and rewards, with deployment built around Hugging Face Spaces and containerized execution.
|
| 35 |
+
A hackathon team would use OpenEnv because it reduces environment plumbing. Instead of inventing a new interface for each task, you can standardize how the model talks to the environment and then connect that to a trainer like TRL. That means you spend more time on task design and rewards, and less on adapter glue. The session also highlighted openenv init for bootstrapping an environment skeleton quickly.
|
| 36 |
+
Good starting points:
|
| 37 |
+
* OpenEnv repo. (GitHub)
|
| 38 |
+
* OpenEnv course. (GitHub)
|
| 39 |
+
* TRL’s OpenEnv integration guide. (Hugging Face)
|
| 40 |
+
7) How does OpenEnv work at a high level?
|
| 41 |
+
At a high level, an OpenEnv environment exposes a small set of standard operations:
|
| 42 |
+
* reset the environment,
|
| 43 |
+
* step the environment with an action,
|
| 44 |
+
* return observations, rewards, and state.
|
| 45 |
+
The session described OpenEnv environments as FastAPI applications that can be run locally, deployed on Hugging Face Spaces, or pulled as containers. That gives teams several options: they can use the remote environment directly, install client code from the repo, or run the environment locally through the container image.
|
| 46 |
+
This design is useful because it treats environments as portable, versioned software artifacts rather than ad hoc scripts. Hugging Face’s own TRL docs describe OpenEnv similarly, including support for backend-server execution and standardized APIs. (Hugging Face)
|
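A rough sketch of driving such an environment over its HTTP API is shown below. The /reset and /step routes are the ones named above, but the JSON field names ("observation", "reward", "done") and the placeholder policy are assumptions that will differ per environment schema:

    # Sketch: one episode against an OpenEnv-style environment served over HTTP.
    import requests

    BASE_URL = "http://localhost:8000"  # or the URL of a deployed Space

    def choose_action(observation):
        # Placeholder policy: in practice an LLM or a scripted baseline decides here.
        return {"text": "noop"}

    def run_episode(max_steps=10):
        obs = requests.post(f"{BASE_URL}/reset", json={}).json()
        total_reward = 0.0
        for _ in range(max_steps):
            result = requests.post(f"{BASE_URL}/step",
                                   json={"action": choose_action(obs)}).json()
            obs = result.get("observation", {})
            total_reward += float(result.get("reward", 0.0))
            if result.get("done", False):
                break
        return total_reward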
| 47 |
+
8) Where do TRL and Unsloth fit in this stack?
|
| 48 |
+
TRL is the training library. It provides trainers and workflows for SFT, DPO, PPO, GRPO, reward modeling, and related post-training methods for transformer models. In a typical hackathon setup, TRL handles rollout collection, reward integration, optimization, logging, and trainer configuration. (Hugging Face)
|
| 49 |
+
Unsloth fits in as the acceleration and memory-efficiency layer for training and RL fine-tuning. The session described Unsloth as making RL training more efficient and inference faster, which matters because rollout generation often dominates runtime in RL loops. It also noted a practical QLoRA warning: don’t naively upcast a 4-bit model to 16-bit and then merge adapters, because that can damage model quality; use the proper merge path instead.
|
| 50 |
+
Relevant docs:
|
| 51 |
+
* TRL docs and GRPO cookbook. (Hugging Face)
|
| 52 |
+
* Unsloth repository/readme. (GitHub)
|
| 53 |
+
9) What is the difference between PPO and GRPO?
|
| 54 |
+
PPO is a classic policy optimization algorithm that stabilizes updates by constraining how much the policy changes between iterations. It is one of the most influential RL algorithms in modern deep learning. (arXiv)
|
| 55 |
+
GRPO is a later group-relative variant used in LLM training that compares sampled outputs within a group to estimate relative advantage, and it is often discussed as a more memory-efficient alternative to full PPO-style setups in some LLM post-training pipelines. The session summarized GRPO as a more efficient version of PPO and specifically noted removing the value model from the setup.
|
| 56 |
+
For deeper details:
|
| 57 |
+
* PPO paper. (arXiv)
|
| 58 |
+
* DeepSeekMath / GRPO references via TRL paper index and cookbook. (arXiv)
|
| 59 |
+
10) Why is RL often described as inefficient, yet still useful?
|
| 60 |
+
RL is often inefficient because the feedback is sparse and delayed. A long rollout may end in one scalar reward, and that weak signal has to train many decisions. The session used a simple example: if a code answer fails at one line but you assign the same negative reward to every token, you’re throwing away a lot of structure.
|
| 61 |
+
It is still useful because it can optimize behaviors where exact supervised targets are unavailable, too expensive, or too limiting. If you can verify success but cannot easily author perfect demonstrations for every scenario, RL can still improve the model by repeated interaction. This is why RL is especially attractive for code execution, tool use, games, browser tasks, and agent workflows.
|
| 62 |
+
A practical takeaway: use RL where verifiers exist and where exploration is worth the extra compute.
|
| 63 |
+
11) What is process supervision, and why is it important?
|
| 64 |
+
Process supervision means giving feedback on intermediate reasoning or intermediate steps, not only on the final outcome. The session contrasted this with assigning the same reward to every token in the answer, which can be very wasteful. Under process supervision, you try to identify which parts of a trace were good, irrelevant, or harmful.
|
| 65 |
+
This matters because not all failures are equal. Maybe the model chose the right algorithmic approach but made one implementation mistake. Final-outcome-only rewards blur that distinction. Step-aware rewards can improve sample efficiency and make debugging easier, though they also raise new risks if the step labels are noisy or exploitable.
|
| 66 |
+
The session also noted that process supervision is often approximated with humans or LLM-as-a-judge. That can help, but it creates another optimization target that itself may be gamed.
|
| 67 |
+
12) What is reward hacking?
|
| 68 |
+
Reward hacking is when the model finds a way to maximize reward without genuinely doing the intended task. In other words, the optimization succeeds, but the task specification failed. The session gave intuitive examples such as editing variables, bypassing intended checks, or exploiting quirks in the environment rather than solving the real problem.
|
| 69 |
+
This is the same phenomenon often called specification gaming. DeepMind describes it as agents exploiting flaws or ambiguities in the reward function, and Lilian Weng’s overview covers how common and fundamental this problem is in RL systems. (Google DeepMind)
|
| 70 |
+
A useful mindset is: reward hacking is not proof the model is “evil”; it is proof that optimization pressure found a loophole.
|
| 71 |
+
13) How can a hackathon team reduce reward hacking in practice?
|
| 72 |
+
Use strong verifiers. Prefer executable checks over stylistic heuristics. For code, run tests, time the solution, validate output shapes and edge cases, and isolate execution. For tool use, verify actual state transitions, not just verbal claims. The session repeatedly emphasized verifiers and environments over vague reward signals.
|
| 73 |
+
Monitor training actively. The session recommended sampling outputs periodically, looking for suspicious patterns, and terminating or rolling back runs when drift appears. It also suggested filtering bad responses and adding guardrails when patterns of exploitation are observed.
|
| 74 |
+
Use layered rewards. Combine success criteria with anti-cheat constraints. For example:
|
| 75 |
+
* pass tests,
|
| 76 |
+
* do not edit protected files,
|
| 77 |
+
* do not bypass timers,
|
| 78 |
+
* stay within time and memory budget,
|
| 79 |
+
* preserve task-required formatting,
|
| 80 |
+
* and log intermediate actions for audit.
|
| 81 |
+
This general strategy aligns with broader RL safety guidance on specification gaming. (Google DeepMind)
|
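In code, that layering can be expressed as hard gates: any anti-cheat violation zeroes out the rollout before the success criteria are scored. The sketch below is illustrative; the violation list and budget fields are assumptions about what your environment reports:

    # Illustrative layered reward: hard constraints gate the outcome reward.
    def layered_reward(tests_pass_rate, violations, elapsed_s, time_budget_s,
                       format_ok):
        # Hard gates: any violation (e.g. edited protected files, bypassed
        # timers) makes the rollout worthless regardless of test results.
        if violations:
            return -1.0
        if elapsed_s > time_budget_s:
            return -0.5
        # Outcome reward plus a small formatting bonus.
        reward = tests_pass_rate
        if format_ok:
            reward += 0.1
        return reward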
| 82 |
+
14) What is curriculum learning, and why does it help RL?
|
| 83 |
+
Curriculum learning means controlling the order or difficulty of training tasks so the model learns from easier tasks first and gradually moves to harder ones. The session directly recommended this for RL: if tasks are too hard at the start, the model may never produce a successful rollout, which means the reward signal is effectively zero and learning stalls.
|
| 84 |
+
This is especially important in LLM RL because many tasks are long-horizon and brittle. An easier initial distribution can bootstrap behavior, after which harder tasks become reachable. In the RL literature more broadly, curriculum learning is a standard way to improve exploration and sample efficiency in difficult environments. (arXiv)
|
| 85 |
+
Practical ideas for hackathons (a small difficulty-scheduling sketch follows this list):
|
| 86 |
+
* start with short horizons,
|
| 87 |
+
* fewer tools,
|
| 88 |
+
* simpler state spaces,
|
| 89 |
+
* stronger hints,
|
| 90 |
+
* easier test cases,
|
| 91 |
+
* then gradually remove scaffolding.
|
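A minimal sketch of that kind of difficulty scheduling, assuming the environment accepts a difficulty setting on reset; the promotion threshold and window size are arbitrary:

    # Hypothetical curriculum controller: promote to the next difficulty level
    # only once the recent success rate on the current level is high enough.
    from collections import deque

    class Curriculum:
        def __init__(self, levels=("easy", "medium", "hard"),
                     promote_at=0.7, window=50):
            self.levels = list(levels)
            self.idx = 0
            self.promote_at = promote_at
            self.recent = deque(maxlen=window)

        @property
        def level(self):
            return self.levels[self.idx]

        def record(self, success):
            self.recent.append(1.0 if success else 0.0)
            window_full = len(self.recent) == self.recent.maxlen
            if (window_full
                    and sum(self.recent) / len(self.recent) >= self.promote_at
                    and self.idx < len(self.levels) - 1):
                self.idx += 1
                self.recent.clear()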
| 92 |
+
15) How do I know whether a task is suitable for RL?
|
| 93 |
+
A task is a good candidate for RL if:
|
| 94 |
+
* you can verify success or partial progress,
|
| 95 |
+
* exploration is meaningful,
|
| 96 |
+
* multi-step interaction matters,
|
| 97 |
+
* and you do not already have abundant high-quality demonstrations.
|
| 98 |
+
The session highlighted a key rule of thumb: the probability of a good answer must be greater than zero. If the task is so hard that the model never stumbles into any rewarding behavior, RL will waste compute. That means task selection, warm starts, formatting scaffolds, or light SFT can be essential.
|
| 99 |
+
Good hackathon candidates include:
|
| 100 |
+
* code generation with executable tests,
|
| 101 |
+
* browser navigation with page-state checks,
|
| 102 |
+
* games with clear win conditions,
|
| 103 |
+
* API/tool workflows with verifiable side effects.
|
| 104 |
+
16) Should we jump straight into RL, or do some SFT first?
|
| 105 |
+
Usually, do some SFT or at least a warm start first. The session’s guidance was that pretraining carries most of the capability burden, SFT helps shape the behavior, and RL refines it. It explicitly argued against relying on RL alone from scratch for most practical settings.
|
| 106 |
+
That matches modern post-training stacks: pretrain heavily, align or instruct-tune, then apply preference optimization and/or RL where it adds value. TRL’s supported workflows reflect exactly this broader stack. (Hugging Face)
|
| 107 |
+
A hackathon-friendly recipe is:
|
| 108 |
+
1. Start from a solid instruct model.
|
| 109 |
+
2. Add a tiny amount of task-format SFT if needed.
|
| 110 |
+
3. Build a strong verifier.
|
| 111 |
+
4. Use GRPO/PPO-style RL only after the model can at least occasionally succeed.
|
| 112 |
+
17) What should we actually monitor during RL training?
|
| 113 |
+
Monitor more than the headline reward. The session specifically called out tracking reward trends, component rewards, and whether important success columns are improving over time. It also recommended checking generated strategies and periodically sampling outputs during training rather than letting runs continue blindly.
|
| 114 |
+
Useful metrics include:
|
| 115 |
+
* average reward,
|
| 116 |
+
* verifier pass rate,
|
| 117 |
+
* timeout rate,
|
| 118 |
+
* format adherence,
|
| 119 |
+
* rollout length,
|
| 120 |
+
* diversity of successful solutions,
|
| 121 |
+
* frequency of suspicious shortcuts,
|
| 122 |
+
* and cost per useful trajectory.
|
| 123 |
+
If the average reward rises but the actual task quality drops or becomes brittle, that is often a reward-design problem rather than a model-capability problem.
|
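Beyond dashboards, a cheap automated check is to compare the training-reward trend against a held-out verifier that is not part of the training reward. The sketch below is a heuristic with arbitrary thresholds, not a standard metric:

    # Heuristic drift check: rising training reward with flat or falling
    # held-out verifier quality is a common symptom of reward hacking.
    def looks_like_reward_hacking(train_rewards, holdout_pass_rates,
                                  lookback=5, rise_threshold=0.2):
        if min(len(train_rewards), len(holdout_pass_rates)) <= lookback:
            return False
        reward_rise = train_rewards[-1] - train_rewards[-1 - lookback]
        holdout_rise = holdout_pass_rates[-1] - holdout_pass_rates[-1 - lookback]
        return reward_rise > rise_threshold and holdout_rise <= 0.0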
| 124 |
+
18) What is a strong hackathon strategy for building an RL environment fast?
|
| 125 |
+
Pick a task with a crisp verifier. Build the smallest environment that exposes reset, step, observations, and reward. Use OpenEnv to standardize the interface and TRL to handle training. Use Unsloth if you need to fit training into tighter hardware budgets. (Hugging Face)
|
| 126 |
+
A practical sequence:
|
| 127 |
+
1. Define the task and what “success” means.
|
| 128 |
+
2. Write the verifier before writing the policy loop.
|
| 129 |
+
3. Create a few toy tasks the model can solve.
|
| 130 |
+
4. Add curriculum or easier variants first.
|
| 131 |
+
5. Run small-scale debugging before long training.
|
| 132 |
+
6. Sample outputs constantly for reward hacking.
|
| 133 |
+
7. Only then scale rollouts and environment diversity.
|
| 134 |
+
19) What are good starter resources for participants?
|
| 135 |
+
For TRL:
|
| 136 |
+
* Main docs. (Hugging Face)
|
| 137 |
+
* PPO trainer docs. (Hugging Face)
|
| 138 |
+
* GRPO cookbook. (Hugging Face)
|
| 139 |
+
* Paper index for GRPO/DeepSeekMath references. (Hugging Face)
|
| 140 |
+
For OpenEnv:
|
| 141 |
+
* OpenEnv GitHub repo. (GitHub)
|
| 142 |
+
* OpenEnv course. (GitHub)
|
| 143 |
+
* TRL’s OpenEnv integration docs. (Hugging Face)
|
| 144 |
+
For environments and benchmarks:
|
| 145 |
+
* BrowserGym. (GitHub)
|
| 146 |
+
For reward design and failure modes:
|
| 147 |
+
* DeepMind on specification gaming. (Google DeepMind)
|
| 148 |
+
* Lilian Weng on reward hacking. (Lil'Log)
|
| 149 |
+
For RL algorithms:
|
| 150 |
+
* PPO paper. (arXiv)
|
| 151 |
+
* DeepSeekMath / GRPO paper. (arXiv)
|
| 152 |
+
For Unsloth:
|
| 153 |
+
* Unsloth repo/readme. (GitHub)
|
| 154 |
+
20) What is the one-sentence summary participants should remember?
|
| 155 |
+
If you can build a task where success is verifiable, difficulty is controllable, and loopholes are monitored, RL can turn an LLM from “good at answering” into “better at acting.”
|
| 156 |
+
21) What is RLVR?
|
| 157 |
+
RLVR stands for reinforcement learning with verifiable rewards. Instead of relying only on a learned reward model or human preference model, the training loop uses programmatic checks to determine whether an output is correct. Typical examples include exact-answer checks for math, unit tests for code, schema validation for structured output, or environment-based task completion checks. This makes RLVR especially attractive for domains where correctness can be verified automatically and consistently. (Label Studio)
|
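A toy example of such a programmatic check for a math task is shown below; the \boxed{...} answer convention and the extraction helper are assumptions about how the model formats its output, not a fixed standard:

    # Toy RLVR-style verifier: binary reward from an exact final-answer check.
    import re

    def extract_final_answer(completion):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        return match.group(1).strip() if match else None

    def math_reward(completion, gold_answer):
        predicted = extract_final_answer(completion)
        return 1.0 if predicted is not None and predicted == gold_answer.strip() else 0.0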
| 158 |
+
22) What is RLVE?
|
| 159 |
+
RLVE is reinforcement learning with verifiable environments. The key idea is to train on environments that can procedurally generate tasks, expose adjustable difficulty, and provide algorithmically verifiable rewards. Recent work on adaptive verifiable environments argues that static prompt datasets often become either too easy or too hard during training, causing learning to stall, while adaptive environments keep the model near its capability frontier. (arXiv)
|
| 160 |
+
23) How is RLVE different from RLVR?
|
| 161 |
+
RLVR usually refers to verifiable rewards on a fixed or semi-fixed set of prompts or problems. RLVE goes a step further by making the task source itself dynamic: the environment can generate new problems, vary difficulty, and keep serving appropriately challenging tasks as the model improves. In practice, RLVE is often better for preventing saturation on static datasets and for building curriculum naturally into training. (arXiv)
|
| 162 |
+
24) Why are RL environments useful for LLM post-training?
|
| 163 |
+
They let the model interact, not just answer. In a real environment, the model can act, observe consequences, act again, and get reward from actual task outcomes. That makes environments a better fit for tool use, browsers, APIs, coding agents, games, and long-horizon tasks than plain prompt-response datasets. Hugging Face’s OpenEnv and TRL material reflects this shift toward environment-based agent training. (Hugging Face)
|
| 164 |
+
25) Where do TRL, GRPO, and Unsloth fit in?
|
| 165 |
+
TRL is the training framework that provides RL trainers and infrastructure for post-training transformer models, including GRPO. GRPO is the RL optimization method popularized in DeepSeekMath and now widely used in open LLM RL pipelines because it can be more memory-efficient than PPO-style setups in this context. Unsloth is typically used as the efficiency layer to make fine-tuning and RL training faster and more affordable on limited hardware. (Hugging Face)
|
| 166 |
+
26) Why do rewards matter so much?
|
| 167 |
+
Because the reward is the task definition as far as optimization is concerned. If your reward captures the real objective, RL can improve useful behavior. If your reward is incomplete, noisy, or hackable, the model will optimize the proxy instead of the real task. DeepMind’s write-up on specification gaming makes this point very clearly: the agent’s ingenuity is helpful only when the specification is correct. (Google DeepMind)
|
| 168 |
+
27) What is reward engineering?
|
| 169 |
+
Reward engineering is the design of the reward function, the verifier, the shaping terms, the penalties, and the monitoring strategy. In LLM RL, this includes deciding what counts as success, how partial progress is rewarded, what shortcuts are forbidden, and how to detect reward hacking. OpenEnv’s reward-design guide explicitly warns about reward hacking, sparse rewards, and conflicting signals as common pitfalls. (Meta-PyTorch)
|
| 170 |
+
28) What is reward hacking?
|
| 171 |
+
Reward hacking happens when a model finds a way to maximize the reward without actually doing the intended task. DeepMind describes this as specification gaming: the system satisfies the literal reward but not the real goal. Classic causes include poorly designed shaping rewards, missing constraints in the success condition, and simulator or verifier loopholes. (Google DeepMind)
|
| 172 |
+
29) Why is sparse reward a common problem?
|
| 173 |
+
If successful trajectories are too rare, the model may never get enough positive signal to improve. OpenEnv’s docs explicitly call sparse rewards a common pitfall because the agent may never find positive signal. RLVE work similarly notes that overly difficult tasks can yield consistently poor rewards and stall gradient-based learning. (Meta-PyTorch)
|
| 174 |
+
30) Why can dense rewards also be dangerous?
|
| 175 |
+
Dense rewards can speed up learning, but they can also create local optima and incentive misalignment. OpenEnv recommends starting simple and shaping carefully, because intermediate rewards can steer the model toward proxy behaviors. DeepMind gives the broader warning that poorly designed shaping can change the optimal policy itself rather than just helping the model reach the intended outcome faster. (Meta-PyTorch)
|
| 176 |
+
________________
|
| 177 |
+
|
| 178 |
+
|
| 179 |
+
Common Pitfalls in Building RL Environments
|
| 180 |
+
31) What is the most common mistake when designing an RL environment?
|
| 181 |
+
Making the environment easy to verify but not faithful to the real task. A verifier that checks only the final string, a regex, or a narrow success pattern may be convenient, but it often misses equivalent correct answers or allows degenerate shortcuts. Recent verifier analysis on mathematical RL found that rule-based verifiers often reject correct but differently formatted answers, while model-based verifiers can be exploited to produce false positives during RL. (arXiv)
|
| 182 |
+
32) What goes wrong with weak verifiers?
|
| 183 |
+
Two opposite failure modes are common. Rule-based verifiers can be too brittle and produce false negatives when the answer is correct but phrased differently. Model-based verifiers can be too permissive and produce false positives that the policy learns to exploit. The verifier study on mathematical reasoning reports both problems and shows that stronger policies make verifier weaknesses more obvious. (arXiv)
|
| 184 |
+
33) Why is “just use an LLM as judge” often risky?
|
| 185 |
+
Because the judge becomes part of the optimization target. If the policy can find surface patterns that fool the judge, training can inflate reward without improving real task quality. That is exactly why model-based verifiers, despite better static accuracy, can be vulnerable during RL training. Use them carefully, stress-test them, and combine them with hard checks whenever possible. (arXiv)
|
| 186 |
+
34) What is a common environment-design pitfall for tool-using agents?
|
| 187 |
+
Not modeling realistic failure modes. Real APIs fail because of permissions, invalid formats, missing fields, timezones, or bad parameters. Hugging Face’s OpenEnv blog highlights examples like missing OAuth scopes and bad RFC3339 datetime formatting. If the environment hides these realities, the resulting policy will be overfit to a toy setup and brittle in deployment. (Hugging Face)
|
| 188 |
+
35) Why is static task difficulty a problem?
|
| 189 |
+
Because the learning signal collapses at both extremes. Tasks that are too easy stop teaching the model anything useful. Tasks that are too hard yield near-zero reward and also stop teaching. RLVE was proposed largely to solve this problem by dynamically adjusting task difficulty as the policy improves. (arXiv)
|
| 190 |
+
36) What is a common pitfall in environment diversity?
|
| 191 |
+
Training on too few task types. Recent RLVE results argue that scaling the number of environments improves generalizable reasoning capability, and Reasoning Gym was built around procedurally generated tasks across many domains for exactly this reason. A narrow environment set often produces narrow competence and fragile transfer. (arXiv)
|
| 192 |
+
37) Why do many RL environments fail to transfer to real-world performance?
|
| 193 |
+
Because they optimize the wrong abstraction level. If the environment is too toy-like, omits realistic constraints, or over-simplifies tool feedback, the model may become good at the benchmark but not at the actual workflow. This is a practical version of specification gaming: the benchmark is solved, the real job is not. (Google DeepMind)
|
| 194 |
+
________________
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
Common Pitfalls in Reward Engineering
|
| 198 |
+
38) What is the biggest reward-engineering mistake?
|
| 199 |
+
Using a proxy metric as if it were the goal. Goodhart-style failures are everywhere in RL: token count, response format, test count, or intermediate progress can all become targets the model exploits. DeepMind’s examples of shaping mistakes and reward misspecification are the canonical warning here. (Google DeepMind)
|
| 200 |
+
39) Should I start with a complicated reward function?
|
| 201 |
+
Usually no. OpenEnv explicitly recommends starting simple, often with sparse success/failure reward, before layering in shaping terms. This makes debugging easier and reduces the chance that the model learns the wrong intermediate incentives before it learns the actual task. (Meta-PyTorch)
|
| 202 |
+
40) What happens when reward components conflict?
|
| 203 |
+
Learning becomes unstable or confused. OpenEnv lists conflicting signals as a common pitfall: if one term rewards brevity, another rewards verbosity, a third rewards format, and a fourth rewards exploration, the policy may oscillate or learn brittle shortcuts instead of coherent behavior. (Meta-PyTorch)
|
| 204 |
+
41) Why is binary reward often appealing?
|
| 205 |
+
Because it is easy to reason about and harder to game superficially. Label Studio’s RLVR overview notes that verifiable rewards are often binary and directly tied to correctness criteria, which makes evaluation simple and scalable. Binary reward is not always sufficient, but it is often a good starting point for precision-critical tasks like code and math. (Label Studio)
|
| 206 |
+
42) Why is binary reward sometimes not enough?
|
| 207 |
+
Because it can be too sparse, especially for long-horizon tasks. If success only happens at the very end, the model may not learn at all. That is where carefully designed shaping, step-level evaluation, or adaptive curriculum can help — but only if you can add them without creating easy-to-game shortcuts. (Meta-PyTorch)
|
| 208 |
+
43) How do I know whether my reward is being hacked?
|
| 209 |
+
Watch for rising reward without corresponding task-quality gains. Typical signs are strange formatting habits, repetitive surface patterns, degenerate short solutions, suspiciously high judge scores, or solutions that pass weak checks but fail stronger ones. The verifier case study is a strong reminder that static verification accuracy is not enough; you must observe what happens under optimization pressure. (arXiv)
|
| 210 |
+
44) What is a safe pattern for reward engineering?
|
| 211 |
+
Use layered verification. Start with hard outcome checks. Add anti-cheat constraints. Then add minimal shaping only where the sparse reward is too weak. Keep a holdout evaluator separate from the training reward when possible. This matches both OpenEnv’s “start simple, shape carefully” guidance and DeepMind’s warning about shaping altering the true objective. (Meta-PyTorch)
|
| 212 |
+
________________
|
| 213 |
+
|
| 214 |
+
|
| 215 |
+
Common Pitfalls in RL Post-Training Pipelines with RLVR / RLVE / GRPO
|
| 216 |
+
45) What is a common mistake in GRPO training runs?
|
| 217 |
+
Using RL before the base model is ready. GRPO is powerful, but it is a post-training method, not a substitute for capability. TRL’s own GRPO examples start from instruct models and task datasets rather than from weak base checkpoints. If the model almost never produces a correct rollout, the reward signal is too sparse for productive RL. (Hugging Face)
|
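For reference, a minimal GRPO run with TRL looks roughly like the sketch below. The model and dataset names are placeholders, the length-based reward is a toy, and exact argument names can differ between TRL versions, so check the GRPO Trainer docs cited in this FAQ before copying it:

    # Rough sketch of a TRL GRPO run starting from an instruct checkpoint.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

    def reward_len(completions, **kwargs):
        # Toy reward: prefer completions close to 200 characters long.
        return [-abs(200 - len(completion)) / 200.0 for completion in completions]

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder instruct model
        reward_funcs=reward_len,
        args=GRPOConfig(output_dir="grpo-demo"),
        train_dataset=dataset,
    )
    trainer.train()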
| 218 |
+
46) Why does RL post-training plateau?
|
| 219 |
+
Because the model saturates the available prompt distribution or the reward signal no longer differentiates useful improvements. RLVE explicitly frames static data saturation as a problem and shows that adaptive environments can keep learning going after conventional RLVR pipelines flatten out. (arXiv)
|
| 220 |
+
47) Why can “more RL” make a model worse?
|
| 221 |
+
Because optimization pressure amplifies whatever the reward favors, including undesirable shortcuts. If the verifier is noisy, if the environment is unrealistic, or if the reward overvalues superficial structure, more training can push the model deeper into those artifacts rather than improving real competence. (arXiv)
|
| 222 |
+
48) What is a common pitfall in RLVR datasets?
|
| 223 |
+
Finite, static datasets get stale. Once the model has mastered or overfit their distribution, additional RL yields little signal. RLVE work argues that procedurally generated environments with adjustable difficulty are one way around this limitation. Reasoning Gym makes a similar case for unlimited data generation with controllable complexity. (arXiv)
|
| 224 |
+
49) Why do identical-looking GRPO runs produce different outcomes?
|
| 225 |
+
Because RL is highly sensitive to rollout quality, verifier behavior, reward scaling, task mix, generation parameters, and environment bugs. Even if the trainer code is the same, small differences in reward computation or environment behavior can change optimization dynamics substantially. The verifier study is a good reminder that the reward pipeline itself is part of the model. (arXiv)
|
| 226 |
+
50) What is a common pitfall when mixing many environments?
|
| 227 |
+
Using an unbalanced mixture. If some environments are much easier, much denser in reward, or much shorter in trajectory length, they can dominate training and starve harder but more important environments. RLVE’s adaptive-difficulty framing exists partly to keep the training distribution informative instead of letting it collapse into easy tasks. (arXiv)
|
| 228 |
+
51) Why are long-horizon tasks especially hard in RL post-training?
|
| 229 |
+
Because reward arrives late and useful trajectories are rare. Long tasks need either decomposition, better intermediate signals, stronger initialization, or curriculum. Otherwise, the rollout cost is high and the success rate stays near zero. This is one reason why adaptive environments and procedural curricula are getting attention. (arXiv)
|
| 230 |
+
52) What monitoring mistake do teams make most often?
|
| 231 |
+
They monitor the training reward but not actual behavior. Reward alone is not enough because the reward channel can be flawed. You need sampled rollout audits, stronger offline evaluation, and held-out environments or benchmarks. The verifier case study shows why this matters: reward can rise while real quality does not. (arXiv)
|
| 232 |
+
53) What is the safest way to structure an RL post-training pipeline?
|
| 233 |
+
A good pattern is:
|
| 234 |
+
start from a strong instruct or SFT checkpoint, use a task with a strong verifier, begin with simple reward, validate the environment thoroughly, run small-scale debug experiments, audit rollouts manually, then scale training and only later add curriculum or more shaping. This is consistent with TRL’s practical GRPO examples, OpenEnv’s reward guidance, and the lessons from verifier-failure studies. (Hugging Face)
|
| 235 |
+
________________
|
| 236 |
+
|
| 237 |
+
|
| 238 |
+
Practical “What should we do in a hackathon?” FAQs
|
| 239 |
+
54) What kind of project is most likely to succeed in a hackathon?
|
| 240 |
+
Pick a task with:
|
| 241 |
+
* a clear success condition,
|
| 242 |
+
* a verifier you trust,
|
| 243 |
+
* short to medium trajectory length,
|
| 244 |
+
* few external dependencies,
|
| 245 |
+
* and adjustable difficulty.
|
| 246 |
+
Good examples are code repair with tests, structured extraction with schema validation, grid or puzzle games, tool-using workflows with exact state checks, and browser tasks with explicit completion criteria. These are the sweet spot for RLVR and lightweight RLVE prototypes. (Label Studio)
|
| 247 |
+
55) What should we avoid building?
|
| 248 |
+
Avoid tasks that are subjective, hard to verify, require massive infrastructure, or depend heavily on an LLM judge without hard backstops. Also avoid environments whose failure cases you do not understand. If you cannot explain how the reward could be hacked, you are not ready to optimize it yet. (arXiv)
|
| 249 |
+
56) What is the best debugging order?
|
| 250 |
+
First debug the environment manually.
|
| 251 |
+
Then debug the verifier.
|
| 252 |
+
Then run scripted baseline policies.
|
| 253 |
+
Then run a frozen model.
|
| 254 |
+
Then run a tiny RL experiment.
|
| 255 |
+
Only then scale.
|
| 256 |
+
This order isolates bugs early and prevents you from blaming the optimizer for what is really an environment or reward bug. It follows directly from the fact that verifier reliability is foundational in RLVR. (arXiv)
|
| 257 |
+
57) What is one rule the team should remember?
|
| 258 |
+
Do not optimize a reward you have not tried to break yourself first. The easiest way to avoid reward hacking is to adversarially test your environment and reward design before the model does. (Google DeepMind)
|
| 259 |
+
________________
|
| 260 |
+
|
| 261 |
+
|
| 262 |
+
58) Strong references for deeper learning
|
| 263 |
+
For GRPO and TRL:
|
| 264 |
+
* TRL GRPO Trainer docs. (Hugging Face)
|
| 265 |
+
* Hugging Face GRPO cookbook. (Hugging Face)
|
| 266 |
+
For RL environments and reward design:
|
| 267 |
+
* OpenEnv reward design guide. (Meta-PyTorch)
|
| 268 |
+
* OpenEnv tool-using environment examples. (Hugging Face)
|
| 269 |
+
For pitfalls and failure modes:
|
| 270 |
+
* DeepMind on specification gaming. (Google DeepMind)
|
| 271 |
+
* Pitfalls of rule-based and model-based verifiers. (arXiv)
|
| 272 |
+
For scalable environment-based training:
|
| 273 |
+
* RLVE paper on adaptive verifiable environments. (arXiv)
|
| 274 |
+
* Reasoning Gym. (OpenReview)
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
Here are solid Unsloth RL post-training recipes worth checking out, with a bias toward official or close-to-official examples.
|
| 278 |
+
59) Core Unsloth GRPO recipes
|
| 279 |
+
Qwen2.5 (3B) GRPO notebook
|
| 280 |
+
A straightforward starter recipe for GRPO with Unsloth. It covers data prep, training, inference, and saving, so it is a good baseline if you want the least opinionated end-to-end example. (GitHub)
|
| 281 |
+
Llama 3.1 (8B) GRPO notebook
|
| 282 |
+
Same general pattern, but on a larger model family. Useful if you want a more realistic “reasoning/capability uplift” recipe without jumping straight to very large models. (GitHub)
|
| 283 |
+
Gemma 3 (1B) GRPO notebook
|
| 284 |
+
A smaller-scale recipe that is easier to run and debug. Good for iterating on reward functions and rollout settings before spending more compute on larger checkpoints. (GitHub)
|
| 285 |
+
59.1) Advanced Unsloth GRPO recipes
|
| 286 |
+
Advanced Qwen3 (4B) GRPO notebook
|
| 287 |
+
This is one of the more interesting recipes because it adds more than the bare trainer loop. Unsloth’s June 2025 discussion explicitly calls out: proximity scoring for more nuanced rewards, OpenR1 dataset support, advanced templates, and “prefinetuning to skip GRPO format learning.” That makes it a better recipe when you care about reward shaping and format bootstrapping, not just getting GRPO to run. (GitHub)
|
| 288 |
+
HF LLM Course: Practical Exercise — GRPO with Unsloth
|
| 289 |
+
Not an Unsloth-maintained notebook repo entry, but it is a structured learning recipe that uses Unsloth specifically to fine-tune a model with GRPO for reasoning. It is a good companion when you want a didactic walkthrough instead of just notebook cells. (Hugging Face)
|
| 290 |
+
59.2) Environment / agent-style RL recipes
|
| 291 |
+
GPT-OSS 20B + 2048 game RL notebook
|
| 292 |
+
This is closer to “RL with an environment” than plain static-prompt RLVR. The notebook goal is explicitly to make GPT-OSS play 2048 with reinforcement learning / GRPO, which makes it a useful recipe if you want to move beyond math/code answer verification into interactive environment training. (GitHub)
|
| 293 |
+
59.3) Broader recipe collection
|
| 294 |
+
Unsloth notebooks repository
|
| 295 |
+
The main repo currently advertises “250+ Fine-tuning & RL Notebooks,” including GRPO and reinforcement learning notebooks. If you want the widest set of recipes in one place, this is the best starting point. (GitHub)
|
| 296 |
+
59.4) Useful adjacent recipes and examples
|
| 297 |
+
Scheduler GRPO example using Unsloth
|
| 298 |
+
A community example that trains a scheduling model with GRPO using Unsloth and QLoRA. It is useful because it shows a non-math, non-code structured-output task where rewards are tied to output format and schedule correctness. (Hugging Face)
|
| 299 |
+
SFT → GRPO pipeline example
|
| 300 |
+
There is a community “show and tell” example for a full SFT-then-GRPO pipeline. Treat it as inspiration rather than an official recipe; it is valuable if your intended workflow is “teach format first, then do RL.” (GitHub)
|
| 301 |
+
59.5) What these recipes collectively cover
|
| 302 |
+
Across these examples, the main recipe patterns are:
|
| 303 |
+
* plain GRPO on reasoning-style tasks,
|
| 304 |
+
* GRPO with better reward shaping like proximity scoring,
|
| 305 |
+
* pre-SFT or preformatting before RL,
|
| 306 |
+
* QLoRA-based memory-efficient RL fine-tuning,
|
| 307 |
+
* and environment-style RL with game interaction. (GitHub)
|
| 308 |
+
59.6) Two gaps to keep in mind
|
| 309 |
+
One gap is multi-turn GRPO with stepwise rewards. There is a feature request asking for reward on each step plus a final reward, which suggests this is not yet a mature first-class recipe in Unsloth. (GitHub)
|
| 310 |
+
Another gap is notebook stability across versions/hardware. Several issue threads mention breakage or edge cases in GRPO notebooks, including fast inference assumptions, VRAM growth, and vision-GRPO issues. That does not make the recipes unusable, but it does mean you should pin versions and test on a small run first. (GitHub)
|
| 311 |
+
59.7) Best recipes by use case
|
| 312 |
+
If you want the simplest starting point:
|
| 313 |
+
* Qwen2.5 (3B) GRPO
|
| 314 |
+
* Gemma 3 (1B) GRPO (GitHub)
|
| 315 |
+
If you care about reward engineering:
|
| 316 |
+
* Advanced Qwen3 (4B) GRPO (GitHub)
|
| 317 |
+
If you care about environment-style RL:
|
| 318 |
+
* GPT-OSS 20B 2048 notebook (GitHub)
|
| 319 |
+
If you want the most guided learning path:
|
| 320 |
+
* HF practical exercise with Unsloth + GRPO (Hugging Face)
|
| 321 |
+
|
| 322 |
+
|
| 323 |
+
|
| 324 |
+
Additional Resources:
|
| 325 |
+
* OpenEnv Core (an interface library for RL post-training with environments)
|
| 326 |
+
* https://github.com/meta-pytorch/OpenEnv
|
| 327 |
+
* OpenEnv-PyTorch Docs
|
| 328 |
+
* https://meta-pytorch.org/OpenEnv/
|
| 329 |
+
* HuggingFace OpenEnv Environments Hub
|
| 330 |
+
* https://huggingface.co/openenv
|
| 331 |
+
* https://huggingface.co/openenv/spaces
|
| 332 |
+
* Tutorials to build, run and train RL environments and training pipelines
|
| 333 |
+
* https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial
|
| 334 |
+
* RL Training Examples: https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial/examples
|
| 335 |
+
* RL Environment Examples: https://github.com/meta-pytorch/OpenEnv/tree/main/envs
|
| 336 |
+
* A few additional YouTube videos on building RL environments:
|
| 337 |
+
* https://www.youtube.com/watch?v=0airz7BhBiA
|
| 338 |
+
* https://www.youtube.com/watch?v=ap4q4sAK4OY
|
| 339 |
+
* https://www.youtube.com/watch?v=Jew4lhAiqnw (Recommended)
|
requirements-train.txt
ADDED
|
@@ -0,0 +1,7 @@
| 1 |
+
torch>=2.2.0
|
| 2 |
+
transformers>=4.44.0
|
| 3 |
+
accelerate>=1.0.0
|
| 4 |
+
trl>=0.9.0
|
| 5 |
+
peft>=0.10.0
|
| 6 |
+
datasets>=2.18.0
|
| 7 |
+
matplotlib>=3.8.0
|
requirements-unsloth.txt
ADDED
|
@@ -0,0 +1,10 @@
| 1 |
+
unsloth
|
| 2 |
+
unsloth_zoo
|
| 3 |
+
torch>=2.2.0
|
| 4 |
+
transformers>=4.44.0
|
| 5 |
+
trl>=0.9.0
|
| 6 |
+
peft>=0.10.0
|
| 7 |
+
accelerate>=1.0.0
|
| 8 |
+
datasets>=2.18.0
|
| 9 |
+
matplotlib>=3.8.0
|
| 10 |
+
bitsandbytes>=0.43.0
|
server/app.py
CHANGED
|
@@ -46,6 +46,14 @@ _LANDING_PAGE = """\
|
|
| 46 |
<h1>⚛️ CERNenv</h1>
|
| 47 |
<p class=muted>An LHC (Large Hadron Collider) particle-discovery RL environment for autonomous physicist agents — built for the Meta OpenEnv Hackathon.</p>
|
| 48 |
|
| 49 |
<p>
|
| 50 |
<span class=pill>OpenEnv</span>
|
| 51 |
<span class=pill>POMDP</span>
|
|
@@ -124,7 +132,9 @@ def build_app(
|
|
| 124 |
|
| 125 |
The OpenEnv-provided routes (`/reset`, `/step`, `/state`, `/schema`,
|
| 126 |
`/health`, `/mcp`) come from ``create_fastapi_app``. We then mount a
|
| 127 |
-
friendly landing page at ``/`` so the Space preview is informative
|
| 128 |
"""
|
| 129 |
factory = make_env_factory(max_steps=max_steps, default_difficulty=default_difficulty)
|
| 130 |
fa_app = create_fastapi_app(factory, ExperimentAction, CollisionObservation)
|
|
@@ -133,6 +143,22 @@ def build_app(
|
|
| 133 |
def landing() -> HTMLResponse: # pragma: no cover - trivial
|
| 134 |
return HTMLResponse(_LANDING_PAGE)
|
| 135 |
|
|
| 136 |
return fa_app
|
| 137 |
|
| 138 |
|
|
|
|
| 46 |
<h1>⚛️ CERNenv</h1>
|
| 47 |
<p class=muted>An LHC (Large Hadron Collider) particle-discovery RL environment for autonomous physicist agents — built for the Meta OpenEnv Hackathon.</p>
|
| 48 |
|
| 49 |
+
<p style="margin: 1rem 0">
|
| 50 |
+
<a href="/demo" style="display:inline-block; background:#1d4ed8; color:white;
|
| 51 |
+
padding:.6rem 1.1rem; border-radius:6px; text-decoration:none;
|
| 52 |
+
font-weight:600; box-shadow:0 1px 3px rgba(0,0,0,.1)">
|
| 53 |
+
▶ Try the interactive demo at /demo
|
| 54 |
+
</a>
|
| 55 |
+
</p>
|
| 56 |
+
|
| 57 |
<p>
|
| 58 |
<span class=pill>OpenEnv</span>
|
| 59 |
<span class=pill>POMDP</span>
|
|
|
|
| 132 |
|
| 133 |
The OpenEnv-provided routes (`/reset`, `/step`, `/state`, `/schema`,
|
| 134 |
`/health`, `/mcp`) come from ``create_fastapi_app``. We then mount a
|
| 135 |
+
friendly landing page at ``/`` so the Space preview is informative,
|
| 136 |
+
and an interactive Gradio demo at ``/demo`` so non-technical visitors
|
| 137 |
+
can play with the env in the browser without writing any code.
|
| 138 |
"""
|
| 139 |
factory = make_env_factory(max_steps=max_steps, default_difficulty=default_difficulty)
|
| 140 |
fa_app = create_fastapi_app(factory, ExperimentAction, CollisionObservation)
|
|
|
|
| 143 |
def landing() -> HTMLResponse: # pragma: no cover - trivial
|
| 144 |
return HTMLResponse(_LANDING_PAGE)
|
| 145 |
|
| 146 |
+
# Best-effort mount of the Gradio interactive demo at /demo. We only
|
| 147 |
+
# *add* a route here — the OpenEnv HTTP API at /health, /reset, /step,
|
| 148 |
+
# /state, /schema, /mcp, /docs is unchanged. If gradio isn't installed
|
| 149 |
+
# in the runtime environment (e.g. during minimal CI), we log and skip.
|
| 150 |
+
try:
|
| 151 |
+
import gradio as gr # noqa: F401 (presence check)
|
| 152 |
+
from space.env.gradio_demo import build_gradio_demo
|
| 153 |
+
|
| 154 |
+
demo = build_gradio_demo()
|
| 155 |
+
fa_app = gr.mount_gradio_app(fa_app, demo, path="/demo")
|
| 156 |
+
except Exception as exc: # pragma: no cover - optional dep
|
| 157 |
+
import logging
|
| 158 |
+
logging.getLogger(__name__).warning(
|
| 159 |
+
"Gradio demo not mounted at /demo: %s", exc
|
| 160 |
+
)
|
| 161 |
+
|
| 162 |
return fa_app
|
| 163 |
|
| 164 |
|
space/env/gradio_demo.py
ADDED
|
@@ -0,0 +1,690 @@
| 1 |
+
"""Interactive Gradio demo for CERNenv.
|
| 2 |
+
|
| 3 |
+
A small, dependency-light Blocks app a non-technical visitor can use to
|
| 4 |
+
*play* with the LHC discovery environment in the browser:
|
| 5 |
+
|
| 6 |
+
* Tab 1 — *Watch a baseline*: pick a scenario, run a Random / Heuristic /
|
| 7 |
+
Oracle agent, watch its action+reward trace stream in, see whether it
|
| 8 |
+
found the hidden particle.
|
| 9 |
+
* Tab 2 — *Build your own actions*: an "Action Builder" that lets the
|
| 10 |
+
visitor pick one of the 16 ``ActionType`` values, fill in a few
|
| 11 |
+
parameters, and step the env one action at a time. State persists
|
| 12 |
+
across clicks via ``gr.State``.
|
| 13 |
+
* Tab 3 — *About*: a tight one-screen explanation of what the env is and
|
| 14 |
+
why this is interesting RL data for LLMs.
|
| 15 |
+
|
| 16 |
+
This module is mounted onto the Space's existing FastAPI app at ``/demo``
|
| 17 |
+
by ``server.app.build_app``. It does **not** alter the OpenEnv HTTP API
|
| 18 |
+
(``/health``, ``/reset``, ``/step``, ``/state``, ``/schema``, ``/mcp``).
|
| 19 |
+
|
| 20 |
+
No heavy ML deps (torch / transformers / trl) are imported here — the
|
| 21 |
+
env Space runs on ``cpu-basic`` and only needs gradio + pydantic + numpy.
|
| 22 |
+
"""
|
| 23 |
+
|
| 24 |
+
from __future__ import annotations
|
| 25 |
+
|
| 26 |
+
import logging
|
| 27 |
+
from typing import Any, Dict, Iterator, List, Optional, Tuple
|
| 28 |
+
|
| 29 |
+
import gradio as gr
|
| 30 |
+
|
| 31 |
+
logger = logging.getLogger(__name__)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
# ── Scenario / agent registries ─────────────────────────────────────────
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
# Visible labels in the dropdown. Curated scenarios use their canonical
|
| 38 |
+
# names so the UI matches what the docs / scripts use.
|
| 39 |
+
SCENARIO_CHOICES: List[Tuple[str, str]] = [
|
| 40 |
+
("easy_diphoton_160 — narrow scalar at 160 GeV (tutorial)", "easy_diphoton_160"),
|
| 41 |
+
("higgs_like_125 — Higgs-like scalar at 125 GeV", "higgs_like_125"),
|
| 42 |
+
("hidden_zprime_600 — heavy Z' at 600 GeV", "hidden_zprime_600"),
|
| 43 |
+
("diphoton_750 — 2015 diphoton excess re-investigation", "diphoton_750"),
|
| 44 |
+
("random easy", "__random_easy__"),
|
| 45 |
+
("random medium", "__random_medium__"),
|
| 46 |
+
("random hard", "__random_hard__"),
|
| 47 |
+
]
|
| 48 |
+
|
| 49 |
+
_LABEL_TO_VALUE: Dict[str, str] = {lab: val for lab, val in SCENARIO_CHOICES}
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def _resolve_scenario(label_or_value: str) -> Dict[str, Any]:
|
| 53 |
+
"""Map a dropdown selection (label OR canonical value) to ``reset()`` kwargs."""
|
| 54 |
+
value = _LABEL_TO_VALUE.get(label_or_value, label_or_value)
|
| 55 |
+
if value.startswith("__random_"):
|
| 56 |
+
difficulty = value.strip("_").replace("random_", "")
|
| 57 |
+
return {"difficulty": difficulty, "scenario": None}
|
| 58 |
+
return {"scenario": value, "difficulty": None}
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
AGENT_CHOICES = ["random", "heuristic", "oracle"]
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
# ── Helpers for rendering observations ──────────────────────────────────
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
def _resource_progress_md(usage) -> str:
|
| 68 |
+
"""Pretty resource bars as one Markdown block."""
|
| 69 |
+
def _bar(label: str, used: float, total: float, unit: str) -> str:
|
| 70 |
+
if total <= 0:
|
| 71 |
+
pct = 0.0
|
| 72 |
+
else:
|
| 73 |
+
pct = max(0.0, min(1.0, used / total))
|
| 74 |
+
n = 24
|
| 75 |
+
filled = int(round(pct * n))
|
| 76 |
+
bar = "█" * filled + "░" * (n - filled)
|
| 77 |
+
return f"`{label:<10}` `{bar}` **{used:.1f} / {total:.1f} {unit}**"
|
| 78 |
+
|
| 79 |
+
budget_total = usage.budget_used_musd + usage.budget_remaining_musd
|
| 80 |
+
lumi_total = usage.luminosity_used_fb + usage.luminosity_remaining_fb
|
| 81 |
+
time_total = usage.time_used_days + usage.time_remaining_days
|
| 82 |
+
return "\n\n".join([
|
| 83 |
+
_bar("budget", usage.budget_used_musd, budget_total, "M$"),
|
| 84 |
+
_bar("lumi", usage.luminosity_used_fb, lumi_total, "fb⁻¹"),
|
| 85 |
+
_bar("time", usage.time_used_days, time_total, "days"),
|
| 86 |
+
])
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
def _candidates_table(obs) -> List[List[Any]]:
|
| 90 |
+
rows: List[List[Any]] = []
|
| 91 |
+
masses = list(obs.candidate_masses_gev or [])
|
| 92 |
+
sigs = list(obs.candidate_significances or [])
|
| 93 |
+
for i in range(max(len(masses), len(sigs))):
|
| 94 |
+
m = masses[i] if i < len(masses) else None
|
| 95 |
+
s = sigs[i] if i < len(sigs) else None
|
| 96 |
+
rows.append([
|
| 97 |
+
i,
|
| 98 |
+
f"{m:.2f}" if isinstance(m, (int, float)) else "—",
|
| 99 |
+
f"{s:.2f}" if isinstance(s, (int, float)) else "—",
|
| 100 |
+
])
|
| 101 |
+
if not rows:
|
| 102 |
+
rows = [[0, "—", "—"]]
|
| 103 |
+
return rows
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
def _step_breakdown_table(breakdown: Dict[str, float]) -> List[List[Any]]:
|
| 107 |
+
if not breakdown:
|
| 108 |
+
return [["—", 0.0]]
|
| 109 |
+
return [[k, round(float(v), 4)] for k, v in breakdown.items()]
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
def _violations_md(violations: List[str]) -> str:
|
| 113 |
+
if not violations:
|
| 114 |
+
return "*(no rule violations)*"
|
| 115 |
+
items = "\n".join(f"- ⚠️ `{v}`" for v in violations)
|
| 116 |
+
return f"<span style='color:#c00'>**Rule violations**</span>\n\n{items}"
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
def _truth_md(truth: Optional[Dict[str, Any]], state) -> str:
|
| 120 |
+
"""Hidden-truth reveal: the latent particle the agent was trying to find."""
|
| 121 |
+
if not truth:
|
| 122 |
+
return "*(no episode loaded yet)*"
|
| 123 |
+
rows = [
|
| 124 |
+
f"- **name**: `{truth.get('name', '?')}`",
|
| 125 |
+
f"- **mass**: `{float(truth.get('mass_gev', 0.0)):.2f}` GeV",
|
| 126 |
+
f"- **width**: `{float(truth.get('width_gev', 0.0)):.4f}` GeV",
|
| 127 |
+
f"- **spin / parity**: `{int(truth.get('spin', 0))}{truth.get('parity', '?')}`",
|
| 128 |
+
f"- **primary channel**: `{truth.get('primary_channel', '?')}`",
|
| 129 |
+
f"- **cross-section**: `{float(truth.get('cross_section_fb', 0.0)):.2f}` fb",
|
| 130 |
+
]
|
| 131 |
+
if state is not None:
|
| 132 |
+
flags = []
|
| 133 |
+
if state.discovered is not None:
|
| 134 |
+
flags.append(("discovered", state.discovered))
|
| 135 |
+
if state.correct_mass is not None:
|
| 136 |
+
flags.append(("correct mass", state.correct_mass))
|
| 137 |
+
if state.correct_channel is not None:
|
| 138 |
+
flags.append(("correct channel", state.correct_channel))
|
| 139 |
+
if state.correct_spin is not None:
|
| 140 |
+
flags.append(("correct spin", state.correct_spin))
|
| 141 |
+
if flags:
|
| 142 |
+
verdict_lines = []
|
| 143 |
+
for label, ok in flags:
|
| 144 |
+
badge = "✅" if ok else "❌"
|
| 145 |
+
verdict_lines.append(f"- {badge} **{label}**: `{ok}`")
|
| 146 |
+
rows.append("\n**Verdict vs. agent claim**\n" + "\n".join(verdict_lines))
|
| 147 |
+
return "\n".join(rows)
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
def _format_action(action) -> str:
|
| 151 |
+
parts = [f"`{action.action_type.value}`"]
|
| 152 |
+
if action.method:
|
| 153 |
+
parts.append(f"method=`{action.method}`")
|
| 154 |
+
if action.parameters:
|
| 155 |
+
# Truncate verbose parameter dicts (e.g. claim) so the log stays readable.
|
| 156 |
+
s = str(action.parameters)
|
| 157 |
+
if len(s) > 80:
|
| 158 |
+
s = s[:77] + "…"
|
| 159 |
+
parts.append(f"params={s}")
|
| 160 |
+
return " ".join(parts)
|
| 161 |
+
|
| 162 |
+
|
| 163 |
+
# ── Tab 1: watch a baseline ─────────────────────────────────────────────
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
def _stream_baseline(
|
| 167 |
+
scenario_label: str,
|
| 168 |
+
seed: int,
|
| 169 |
+
agent_name: str,
|
| 170 |
+
max_steps: int = 40,
|
| 171 |
+
) -> Iterator[Tuple[str, str, str, str, List[List[Any]], str, str, str]]:
|
| 172 |
+
"""Run a full episode in-process; yield UI updates per step."""
|
| 173 |
+
# Lazy imports so this module is cheap to import (and trivially testable).
|
| 174 |
+
from server.environment import CERNCollisionEnvironment
|
| 175 |
+
from scripts.baseline_agents import HeuristicAgent, OracleAgent, RandomAgent
|
| 176 |
+
|
| 177 |
+
agent_cls = {"random": RandomAgent, "heuristic": HeuristicAgent, "oracle": OracleAgent}[agent_name]
|
| 178 |
+
agent = agent_cls(seed=int(seed)) if agent_name == "random" else agent_cls()
|
| 179 |
+
|
| 180 |
+
env = CERNCollisionEnvironment(max_steps=int(max_steps))
|
| 181 |
+
reset_kwargs = _resolve_scenario(scenario_label)
|
| 182 |
+
obs = env.reset(seed=int(seed), **reset_kwargs)
|
| 183 |
+
|
| 184 |
+
if agent_name == "oracle":
|
| 185 |
+
agent.truth = env.hidden_truth()
|
| 186 |
+
agent.reset()
|
| 187 |
+
|
| 188 |
+
log_lines: List[str] = [
|
| 189 |
+
f"### Running **{agent_name}** agent on `{env.state.scenario_name}` "
|
| 190 |
+
f"(difficulty=`{env.state.difficulty}`, seed=`{seed}`)\n",
|
| 191 |
+
"```",
|
| 192 |
+
f"{'step':>4} {'action':<24} {'reward':>8} {'cum.':>8} done",
|
| 193 |
+
"-" * 60,
|
| 194 |
+
]
|
| 195 |
+
|
| 196 |
+
cumulative = 0.0
|
| 197 |
+
truth = env.hidden_truth()
|
| 198 |
+
|
| 199 |
+
yield (
|
| 200 |
+
log_lines[0] + "\n" + "\n".join(log_lines[1:]) + "\n```",
|
| 201 |
+
"0.0", "0", "0.0",
|
| 202 |
+
_candidates_table(obs),
|
| 203 |
+
_resource_progress_md(obs.resource_usage),
|
| 204 |
+
_violations_md(obs.rule_violations or []),
|
| 205 |
+
_truth_md(truth, env.state) if obs.done else "*(truth revealed when the episode ends)*",
|
| 206 |
+
)
|
| 207 |
+
|
| 208 |
+
while not obs.done:
|
| 209 |
+
action = agent.act(obs)
|
| 210 |
+
obs = env.step(action)
|
| 211 |
+
rew = float(obs.reward or 0.0)
|
| 212 |
+
cumulative += rew
|
| 213 |
+
log_lines.append(
|
| 214 |
+
f"{obs.step_index:>4} {action.action_type.value:<24} "
|
| 215 |
+
f"{rew:>+8.3f} {cumulative:>+8.3f} {obs.done}"
|
| 216 |
+
)
|
| 217 |
+
|
| 218 |
+
body = log_lines[0] + "\n" + "\n".join(log_lines[1:]) + "\n```"
|
| 219 |
+
yield (
|
| 220 |
+
body,
|
| 221 |
+
f"{cumulative:+.3f}",
|
| 222 |
+
str(obs.step_index),
|
| 223 |
+
f"{obs.cumulative_significance:.2f}",
|
| 224 |
+
_candidates_table(obs),
|
| 225 |
+
_resource_progress_md(obs.resource_usage),
|
| 226 |
+
_violations_md(obs.rule_violations or []),
|
| 227 |
+
_truth_md(truth, env.state) if obs.done else "*(truth revealed when the episode ends)*",
|
| 228 |
+
)
|
| 229 |
+
|
| 230 |
+
# Append final claim summary
|
| 231 |
+
log_lines.append("-" * 60)
|
| 232 |
+
log_lines.append(
|
| 233 |
+
f"final cumulative_reward={cumulative:+.3f} "
|
| 234 |
+
f"terminal_reward={env.state.terminal_reward} "
|
| 235 |
+
f"discovered={env.state.discovered}"
|
| 236 |
+
)
|
| 237 |
+
body = log_lines[0] + "\n" + "\n".join(log_lines[1:]) + "\n```"
|
| 238 |
+
yield (
|
| 239 |
+
body,
|
| 240 |
+
f"{cumulative:+.3f}",
|
| 241 |
+
str(obs.step_index),
|
| 242 |
+
f"{obs.cumulative_significance:.2f}",
|
| 243 |
+
_candidates_table(obs),
|
| 244 |
+
_resource_progress_md(obs.resource_usage),
|
| 245 |
+
_violations_md(obs.rule_violations or []),
|
| 246 |
+
_truth_md(truth, env.state),
|
| 247 |
+
)
|
| 248 |
+
|
| 249 |
+
|
| 250 |
+
# ── Tab 2: build your own actions ───────────────────────────────────────
|
| 251 |
+
|
| 252 |
+
|
| 253 |
+
def _new_episode(scenario_label: str, seed: int) -> Tuple[Any, Any, str, str, str, List[List[Any]], List[List[Any]], str, str, str]:
|
| 254 |
+
"""Start a fresh episode for the manual Action Builder. Returns the
|
| 255 |
+
(env, obs) pair to be stored in ``gr.State`` and a fresh set of UI
|
| 256 |
+
values."""
|
| 257 |
+
from server.environment import CERNCollisionEnvironment
|
| 258 |
+
|
| 259 |
+
env = CERNCollisionEnvironment(max_steps=40)
|
| 260 |
+
reset_kwargs = _resolve_scenario(scenario_label)
|
| 261 |
+
obs = env.reset(seed=int(seed), **reset_kwargs)
|
| 262 |
+
|
| 263 |
+
header = (
|
| 264 |
+
f"### Episode started — scenario `{env.state.scenario_name}` "
|
| 265 |
+
f"(difficulty=`{env.state.difficulty}`, seed=`{seed}`)\n\n"
|
| 266 |
+
f"**Mass search window**: `{obs.task.mass_search_window_gev[0]:.0f} – "
|
| 267 |
+
f"{obs.task.mass_search_window_gev[1]:.0f}` GeV\n\n"
|
| 268 |
+
f"_Pick an action below and click **Submit step**._"
|
| 269 |
+
)
|
| 270 |
+
return (
|
| 271 |
+
env, # gr.State env
|
| 272 |
+
obs, # gr.State obs
|
| 273 |
+
header, # status_md
|
| 274 |
+
"0.0", # cumulative_reward
|
| 275 |
+
"0", # step_index
|
| 276 |
+
_step_breakdown_table({}), # step_breakdown
|
| 277 |
+
_candidates_table(obs), # candidates
|
| 278 |
+
_resource_progress_md(obs.resource_usage), # resources
|
| 279 |
+
_violations_md([]), # violations
|
| 280 |
+
"*(submit a `submit_discovery_claim` or run out of budget to reveal)*",
|
| 281 |
+
)
|
| 282 |
+
|
| 283 |
+
|
| 284 |
+
def _submit_step(
|
| 285 |
+
env,
|
| 286 |
+
obs,
|
| 287 |
+
action_type: str,
|
| 288 |
+
method: str,
|
| 289 |
+
channel: str,
|
| 290 |
+
trigger: str,
|
| 291 |
+
beam_energy: str,
|
| 292 |
+
luminosity_fb: float,
|
| 293 |
+
mass_window_lo: float,
|
| 294 |
+
mass_window_hi: float,
|
| 295 |
+
claim_mass: float,
|
| 296 |
+
claim_sigma: float,
|
| 297 |
+
claim_spin: int,
|
| 298 |
+
claim_parity: str,
|
| 299 |
+
confidence: float,
|
| 300 |
+
) -> Tuple[Any, Any, str, str, str, List[List[Any]], List[List[Any]], str, str, str]:
|
| 301 |
+
"""Apply one manual action to the persisted ``env``."""
|
| 302 |
+
from models import ActionType, ExperimentAction
|
| 303 |
+
|
| 304 |
+
if env is None or obs is None:
|
| 305 |
+
return (
|
| 306 |
+
env, obs,
|
| 307 |
+
"_No active episode — click **New episode** first._",
|
| 308 |
+
"0.0", "0",
|
| 309 |
+
_step_breakdown_table({}),
|
| 310 |
+
[[0, "—", "—"]],
|
| 311 |
+
"_(no episode)_",
|
| 312 |
+
_violations_md([]),
|
| 313 |
+
"_(no episode)_",
|
| 314 |
+
)
|
| 315 |
+
|
| 316 |
+
if obs.done:
|
| 317 |
+
return (
|
| 318 |
+
env, obs,
|
| 319 |
+
"_Episode is already over — click **New episode** to start fresh._",
|
| 320 |
+
f"{env.state.cumulative_reward:+.3f}",
|
| 321 |
+
str(obs.step_index),
|
| 322 |
+
_step_breakdown_table(obs.step_reward_breakdown or {}),
|
| 323 |
+
_candidates_table(obs),
|
| 324 |
+
_resource_progress_md(obs.resource_usage),
|
| 325 |
+
_violations_md(obs.rule_violations or []),
|
| 326 |
+
_truth_md(env.hidden_truth(), env.state),
|
| 327 |
+
)
|
| 328 |
+
|
| 329 |
+
try:
|
| 330 |
+
at = ActionType(action_type)
|
| 331 |
+
except Exception:
|
| 332 |
+
return (
|
| 333 |
+
env, obs,
|
| 334 |
+
f"_Invalid action_type: `{action_type}`._",
|
| 335 |
+
f"{env.state.cumulative_reward:+.3f}",
|
| 336 |
+
str(obs.step_index),
|
| 337 |
+
_step_breakdown_table(obs.step_reward_breakdown or {}),
|
| 338 |
+
_candidates_table(obs),
|
| 339 |
+
_resource_progress_md(obs.resource_usage),
|
| 340 |
+
_violations_md(obs.rule_violations or []),
|
| 341 |
+
"*(truth shown at end of episode)*",
|
| 342 |
+
)
|
| 343 |
+
|
| 344 |
+
# Build parameter dict from the relevant fields. We keep this minimal
|
| 345 |
+
# and forgiving — the env's RulesEngine will reject anything illegal.
|
| 346 |
+
params: Dict[str, Any] = {}
|
| 347 |
+
if at == ActionType.CONFIGURE_BEAM and beam_energy:
|
| 348 |
+
params["beam_energy"] = beam_energy
|
| 349 |
+
if at == ActionType.SELECT_CHANNEL and channel:
|
| 350 |
+
params["channel"] = channel
|
| 351 |
+
if at == ActionType.SET_TRIGGER and trigger:
|
| 352 |
+
params["trigger"] = trigger
|
| 353 |
+
if at in (ActionType.ALLOCATE_LUMINOSITY, ActionType.COLLECT_COLLISIONS):
|
| 354 |
+
params["luminosity_fb"] = float(luminosity_fb or 0.0)
|
| 355 |
+
if at == ActionType.BUILD_INVARIANT_MASS:
|
| 356 |
+
lo = float(mass_window_lo or obs.task.mass_search_window_gev[0])
|
| 357 |
+
hi = float(mass_window_hi or obs.task.mass_search_window_gev[1])
|
| 358 |
+
params["mass_window_gev"] = [lo, hi]
|
| 359 |
+
if at == ActionType.SUBMIT_DISCOVERY_CLAIM:
|
| 360 |
+
params["claim"] = {
|
| 361 |
+
"mass_estimate_gev": float(claim_mass or 0.0) or None,
|
| 362 |
+
"mass_uncertainty_gev": 1.0,
|
| 363 |
+
"significance_sigma": float(claim_sigma or 0.0) or None,
|
| 364 |
+
"decay_channel": (channel or obs.selected_channel or "diphoton"),
|
| 365 |
+
"spin_hypothesis": int(claim_spin),
|
| 366 |
+
"parity": claim_parity or "+",
|
| 367 |
+
"confidence": float(confidence or 0.5),
|
| 368 |
+
}
|
| 369 |
+
|
| 370 |
+
action = ExperimentAction(
|
| 371 |
+
action_type=at,
|
| 372 |
+
method=(method or None),
|
| 373 |
+
parameters=params,
|
| 374 |
+
confidence=float(confidence or 0.5),
|
| 375 |
+
justification="manual action from /demo Action Builder",
|
| 376 |
+
)
|
| 377 |
+
|
| 378 |
+
new_obs = env.step(action)
|
| 379 |
+
|
| 380 |
+
status = (
|
| 381 |
+
f"### Step `{new_obs.step_index}` — `{at.value}` "
|
| 382 |
+
f"→ reward `{(new_obs.reward or 0.0):+.3f}`\n\n"
|
| 383 |
+
)
|
| 384 |
+
if new_obs.latest_output is not None:
|
| 385 |
+
status += f"**Output** (`{new_obs.latest_output.output_type.value}`): {new_obs.latest_output.summary}\n"
|
| 386 |
+
if new_obs.done:
|
| 387 |
+
status += "\n*Episode complete — see the Hidden Truth panel below.*"
|
| 388 |
+
|
| 389 |
+
truth_md = (
|
| 390 |
+
_truth_md(env.hidden_truth(), env.state)
|
| 391 |
+
if new_obs.done
|
| 392 |
+
else "*(truth revealed when the episode ends)*"
|
| 393 |
+
)
|
| 394 |
+
|
| 395 |
+
return (
|
| 396 |
+
env,
|
| 397 |
+
new_obs,
|
| 398 |
+
status,
|
| 399 |
+
f"{env.state.cumulative_reward:+.3f}",
|
| 400 |
+
str(new_obs.step_index),
|
| 401 |
+
_step_breakdown_table(new_obs.step_reward_breakdown or {}),
|
| 402 |
+
_candidates_table(new_obs),
|
| 403 |
+
_resource_progress_md(new_obs.resource_usage),
|
| 404 |
+
_violations_md(new_obs.rule_violations or []),
|
| 405 |
+
truth_md,
|
| 406 |
+
)
|
| 407 |
+
|
| 408 |
+
|
| 409 |
+
# ── Top-level Blocks ────────────────────────────────────────────────────
|
| 410 |
+
|
| 411 |
+
|
| 412 |
+
_HEADER_MD = """\
|
| 413 |
+
# ⚛️ CERNenv — interactive demo
|
| 414 |
+
|
| 415 |
+
You are a high-energy physicist running an analysis at the **Large Hadron Collider**.
|
| 416 |
+
There is a hidden particle out there — somewhere inside a mass search window —
|
| 417 |
+
and your job is to discover and characterise it.
|
| 418 |
+
|
| 419 |
+
Each step you pick **one structured action** (configure the beam, allocate luminosity,
|
| 420 |
+
fit a resonance, estimate significance, submit a discovery claim, …) and the
|
| 421 |
+
environment hands back a noisy detector-style observation. Wrong channel,
|
| 422 |
+
mis-matched trigger, or wasteful step? You burn budget and don't see the signal.
|
| 423 |
+
Make a calibrated 5σ discovery claim before you run out of resources, and you win.
|
| 424 |
+
|
| 425 |
+
This page lets you **watch a baseline agent run an episode**, or **drive the
|
| 426 |
+
environment yourself** action-by-action. The hidden particle is revealed at the end
|
| 427 |
+
of every episode so you can see what the agent was actually trying to find.
|
| 428 |
+
"""
|
| 429 |
+
|
| 430 |
+
_ABOUT_MD = """\
|
| 431 |
+
## About this environment
|
| 432 |
+
|
| 433 |
+
`CERNenv` is a partially-observable Markov decision process (POMDP) modelled on
|
| 434 |
+
LHC particle-discovery campaigns. It is a research environment for training and
|
| 435 |
+
evaluating LLM agents on a *real-shaped* scientific task with prerequisites,
|
| 436 |
+
budgets, calibration, systematics, and structured discovery claims.
|
| 437 |
+
|
| 438 |
+
* **16 structured actions** — DAQ, reconstruction, calibration, analysis,
|
| 439 |
+
systematics, theory review, discovery claim.
|
| 440 |
+
* **Hidden ground truth per episode** — mass, width, spin, parity, primary
|
| 441 |
+
decay channel, cross-section. The agent never sees these directly.
|
| 442 |
+
* **Curated scenarios** inspired by famous LHC discoveries (Higgs-like 125 GeV,
|
| 443 |
+
hidden Z', the 2015 750 GeV diphoton excess) plus a procedural curriculum at
|
| 444 |
+
three difficulty tiers.
|
| 445 |
+
* **Reward decomposition** — per-step shaping (good prerequisites, tool fit,
|
| 446 |
+
rule compliance) plus a dominant terminal reward calibrated against the
|
| 447 |
+
hidden particle.
|
| 448 |
+
* **OpenEnv-compatible HTTP API** at `/health`, `/reset`, `/step`, `/state`,
|
| 449 |
+
`/schema`, `/mcp`, `/docs` — see the landing page for examples.
|
| 450 |
+
|
| 451 |
+
### Hackathon submission angle (Theme #3.1 — World Modeling)
|
| 452 |
+
|
| 453 |
+
`CERNenv` serves as a *miniature world model* for an LHC analysis. The agent
|
| 454 |
+
must plan over a long horizon, manage scarce resources, choose the right tool
|
| 455 |
+
for each sub-task, and — most critically — submit a structured, calibrated
|
| 456 |
+
discovery claim. That makes it a good fit for **RL training of LLMs on
|
| 457 |
+
professional scientific tasks**: every transition is structured, every reward
|
| 458 |
+
is decomposable, and the terminal reward is grounded in physical truth rather
|
| 459 |
+
than human preference labels.
|
| 460 |
+
|
| 461 |
+
### Companion artefacts
|
| 462 |
+
|
| 463 |
+
* Trainer Space (Unsloth + LoRA + GRPO on A100) —
|
| 464 |
+
[`anugrah55/cernenv-trainer`](https://huggingface.co/spaces/anugrah55/cernenv-trainer)
|
| 465 |
+
* Trained adapter weights —
|
| 466 |
+
[`anugrah55/cernenv-grpo-qwen2.5-3b`](https://huggingface.co/anugrah55/cernenv-grpo-qwen2.5-3b)
|
| 467 |
+
"""
|
| 468 |
+
|
| 469 |
+
|
| 470 |
+
def build_gradio_demo() -> gr.Blocks:
|
| 471 |
+
"""Construct the CERNenv interactive demo as a Gradio Blocks app."""
|
| 472 |
+
|
| 473 |
+
# Lazy import so module-level import remains side-effect-light.
|
| 474 |
+
from models import ActionType
|
| 475 |
+
from server.simulator.latent_state import LatentParticle # noqa: F401 (side-effect import, keeps tests honest)
|
| 476 |
+
|
| 477 |
+
action_options = [a.value for a in ActionType]
|
| 478 |
+
channels = ["diphoton", "dilepton_ee", "dilepton_mumu", "dijet", "four_lepton", "bb"]
|
| 479 |
+
triggers = ["low_pt", "high_pt", "diphoton_hlt", "dilepton_hlt", "jet_hlt"]
|
| 480 |
+
beam_energies = ["7TeV", "8TeV", "13TeV", "14TeV"]
|
| 481 |
+
|
| 482 |
+
with gr.Blocks(theme=gr.themes.Soft(), title="CERNenv interactive demo") as demo:
|
| 483 |
+
gr.Markdown(_HEADER_MD)
|
| 484 |
+
|
| 485 |
+
with gr.Tabs():
|
| 486 |
+
|
| 487 |
+
# ───────── Tab 1: Watch a baseline ─────────
|
| 488 |
+
with gr.TabItem("▶ Watch a baseline"):
|
| 489 |
+
gr.Markdown(
|
| 490 |
+
"Pick a scenario and seed, then click one of **Random / Heuristic / "
|
| 491 |
+
"Oracle**. The agent will play a full episode and stream every "
|
| 492 |
+
"action+reward into the log. The hidden particle truth is revealed "
|
| 493 |
+
"at the end."
|
| 494 |
+
)
|
| 495 |
+
|
| 496 |
+
with gr.Row():
|
| 497 |
+
scenario_dd = gr.Dropdown(
|
| 498 |
+
choices=[lab for lab, _ in SCENARIO_CHOICES],
|
| 499 |
+
value=SCENARIO_CHOICES[0][0],
|
| 500 |
+
label="Scenario",
|
| 501 |
+
interactive=True,
|
| 502 |
+
)
|
| 503 |
+
seed_in = gr.Number(value=7, precision=0, label="Seed")
|
| 504 |
+
|
| 505 |
+
with gr.Row():
|
| 506 |
+
btn_random = gr.Button("▶ Run Random agent", variant="secondary")
|
| 507 |
+
btn_heuristic = gr.Button("▶ Run Heuristic agent", variant="primary")
|
| 508 |
+
btn_oracle = gr.Button("▶ Run Oracle agent", variant="secondary")
|
| 509 |
+
|
| 510 |
+
with gr.Row():
|
| 511 |
+
with gr.Column(scale=3):
|
| 512 |
+
log_md = gr.Markdown(
|
| 513 |
+
"*(no rollout yet — pick an agent above)*",
|
| 514 |
+
label="Episode log",
|
| 515 |
+
)
|
| 516 |
+
with gr.Column(scale=2):
|
| 517 |
+
cum_reward_b = gr.Textbox(value="0.0", label="Cumulative reward", interactive=False)
|
| 518 |
+
step_b = gr.Textbox(value="0", label="Step", interactive=False)
|
| 519 |
+
sig_b = gr.Textbox(value="0.0", label="Best significance σ", interactive=False)
|
| 520 |
+
cands_b = gr.Dataframe(
|
| 521 |
+
headers=["#", "mass (GeV)", "σ"],
|
| 522 |
+
value=[[0, "—", "—"]],
|
| 523 |
+
label="Candidate peaks",
|
| 524 |
+
interactive=False,
|
| 525 |
+
)
|
| 526 |
+
res_b = gr.Markdown("*(no rollout yet)*", label="Resources")
|
| 527 |
+
viol_b = gr.Markdown("", label="Rule violations")
|
| 528 |
+
truth_b = gr.Markdown(
|
| 529 |
+
"*(truth revealed when the episode ends)*",
|
| 530 |
+
label="🎯 Hidden truth (revealed at end of episode)",
|
| 531 |
+
)
|
| 532 |
+
|
| 533 |
+
def _run(scenario_label, seed, agent_name):
|
| 534 |
+
yield from _stream_baseline(
|
| 535 |
+
scenario_label,
|
| 536 |
+
int(seed),
|
| 537 |
+
agent_name,
|
| 538 |
+
)
|
| 539 |
+
|
| 540 |
+
outputs_b = [log_md, cum_reward_b, step_b, sig_b, cands_b, res_b, viol_b, truth_b]
|
| 541 |
+
btn_random.click(
|
| 542 |
+
lambda s, sd: _run(s, sd, "random"),
|
| 543 |
+
inputs=[scenario_dd, seed_in],
|
| 544 |
+
outputs=outputs_b,
|
| 545 |
+
)
|
| 546 |
+
btn_heuristic.click(
|
| 547 |
+
lambda s, sd: _run(s, sd, "heuristic"),
|
| 548 |
+
inputs=[scenario_dd, seed_in],
|
| 549 |
+
outputs=outputs_b,
|
| 550 |
+
)
|
| 551 |
+
btn_oracle.click(
|
| 552 |
+
lambda s, sd: _run(s, sd, "oracle"),
|
| 553 |
+
inputs=[scenario_dd, seed_in],
|
| 554 |
+
outputs=outputs_b,
|
| 555 |
+
)
|
| 556 |
+
|
| 557 |
+
# ───────── Tab 2: Build your own actions ─────────
|
| 558 |
+
with gr.TabItem("🛠 Build your own actions"):
|
| 559 |
+
gr.Markdown(
|
| 560 |
+
"Run the env one action at a time. Each click of **Submit step** "
|
| 561 |
+
"calls `env.step(action)` on a session-scoped environment. The "
|
| 562 |
+
"**Live evidence** panel updates after every step."
|
| 563 |
+
)
|
| 564 |
+
|
| 565 |
+
with gr.Row():
|
| 566 |
+
scenario_dd2 = gr.Dropdown(
|
| 567 |
+
choices=[lab for lab, _ in SCENARIO_CHOICES],
|
| 568 |
+
value=SCENARIO_CHOICES[0][0],
|
| 569 |
+
label="Scenario",
|
| 570 |
+
interactive=True,
|
| 571 |
+
)
|
| 572 |
+
seed_in2 = gr.Number(value=42, precision=0, label="Seed")
|
| 573 |
+
btn_new = gr.Button("🔄 New episode", variant="primary")
|
| 574 |
+
|
| 575 |
+
env_state = gr.State(value=None)
|
| 576 |
+
obs_state = gr.State(value=None)
|
| 577 |
+
|
| 578 |
+
status_md = gr.Markdown(
|
| 579 |
+
"*Click **🔄 New episode** to begin.*", label="Status"
|
| 580 |
+
)
|
| 581 |
+
|
| 582 |
+
with gr.Row():
|
| 583 |
+
with gr.Column(scale=3):
|
| 584 |
+
gr.Markdown("### Action Builder")
|
| 585 |
+
action_type = gr.Dropdown(
|
| 586 |
+
choices=action_options,
|
| 587 |
+
value="configure_beam",
|
| 588 |
+
label="action_type",
|
| 589 |
+
)
|
| 590 |
+
with gr.Row():
|
| 591 |
+
method = gr.Textbox(
|
| 592 |
+
value="",
|
| 593 |
+
label="method (optional, e.g. ROOT_RooFit)",
|
| 594 |
+
placeholder="ROOT_RooFit / BumpHunter / Athena / …",
|
| 595 |
+
)
|
| 596 |
+
confidence = gr.Slider(
|
| 597 |
+
minimum=0.0, maximum=1.0, step=0.05, value=0.7,
|
| 598 |
+
label="confidence",
|
| 599 |
+
)
|
| 600 |
+
with gr.Row():
|
| 601 |
+
channel = gr.Dropdown(
|
| 602 |
+
choices=channels, value="diphoton", label="channel",
|
| 603 |
+
)
|
| 604 |
+
trigger = gr.Dropdown(
|
| 605 |
+
choices=triggers, value="diphoton_hlt", label="trigger",
|
| 606 |
+
)
|
| 607 |
+
beam_energy = gr.Dropdown(
|
| 608 |
+
choices=beam_energies, value="13TeV", label="beam_energy",
|
| 609 |
+
)
|
| 610 |
+
with gr.Row():
|
| 611 |
+
luminosity_fb = gr.Number(
|
| 612 |
+
value=80.0, label="luminosity_fb (allocate / collect)",
|
| 613 |
+
)
|
| 614 |
+
mass_window_lo = gr.Number(
|
| 615 |
+
value=80.0, label="mass_window_lo (GeV)",
|
| 616 |
+
)
|
| 617 |
+
mass_window_hi = gr.Number(
|
| 618 |
+
value=300.0, label="mass_window_hi (GeV)",
|
| 619 |
+
)
|
| 620 |
+
gr.Markdown("**Discovery claim parameters** *(only used for `submit_discovery_claim`)*")
|
| 621 |
+
with gr.Row():
|
| 622 |
+
claim_mass = gr.Number(value=125.0, label="claim mass (GeV)")
|
| 623 |
+
claim_sigma = gr.Number(value=5.0, label="claim significance σ")
|
| 624 |
+
claim_spin = gr.Dropdown(choices=[0, 1, 2], value=0, label="claim spin")
|
| 625 |
+
claim_parity = gr.Dropdown(choices=["+", "-"], value="+", label="claim parity")
|
| 626 |
+
|
| 627 |
+
btn_submit = gr.Button("✅ Submit step", variant="primary")
|
| 628 |
+
|
| 629 |
+
with gr.Column(scale=3):
|
| 630 |
+
gr.Markdown("### Live evidence")
|
| 631 |
+
cum_reward = gr.Textbox(value="0.0", label="Cumulative reward", interactive=False)
|
| 632 |
+
step_idx = gr.Textbox(value="0", label="Step index", interactive=False)
|
| 633 |
+
breakdown = gr.Dataframe(
|
| 634 |
+
headers=["component", "value"],
|
| 635 |
+
value=[["—", 0.0]],
|
| 636 |
+
label="Step reward breakdown",
|
| 637 |
+
interactive=False,
|
| 638 |
+
)
|
| 639 |
+
candidates = gr.Dataframe(
|
| 640 |
+
headers=["#", "mass (GeV)", "σ"],
|
| 641 |
+
value=[[0, "—", "—"]],
|
| 642 |
+
label="Candidate peaks",
|
| 643 |
+
interactive=False,
|
| 644 |
+
)
|
| 645 |
+
resources = gr.Markdown("*(no episode yet)*", label="Resources")
|
| 646 |
+
violations = gr.Markdown("", label="Rule violations")
|
| 647 |
+
truth = gr.Markdown(
|
| 648 |
+
"*(truth revealed when the episode ends)*",
|
| 649 |
+
label="🎯 Hidden truth",
|
| 650 |
+
)
|
| 651 |
+
|
| 652 |
+
btn_new.click(
|
| 653 |
+
_new_episode,
|
| 654 |
+
inputs=[scenario_dd2, seed_in2],
|
| 655 |
+
outputs=[
|
| 656 |
+
env_state, obs_state, status_md, cum_reward, step_idx,
|
| 657 |
+
breakdown, candidates, resources, violations, truth,
|
| 658 |
+
],
|
| 659 |
+
)
|
| 660 |
+
|
| 661 |
+
btn_submit.click(
|
| 662 |
+
_submit_step,
|
| 663 |
+
inputs=[
|
| 664 |
+
env_state, obs_state, action_type, method,
|
| 665 |
+
channel, trigger, beam_energy, luminosity_fb,
|
| 666 |
+
mass_window_lo, mass_window_hi,
|
| 667 |
+
claim_mass, claim_sigma, claim_spin, claim_parity,
|
| 668 |
+
confidence,
|
| 669 |
+
],
|
| 670 |
+
outputs=[
|
| 671 |
+
env_state, obs_state, status_md, cum_reward, step_idx,
|
| 672 |
+
breakdown, candidates, resources, violations, truth,
|
| 673 |
+
],
|
| 674 |
+
)
|
| 675 |
+
|
| 676 |
+
# ───────── Tab 3: About ─────────
|
| 677 |
+
with gr.TabItem("ℹ About"):
|
| 678 |
+
gr.Markdown(_ABOUT_MD)
|
| 679 |
+
|
| 680 |
+
gr.Markdown(
|
| 681 |
+
"---\n"
|
| 682 |
+
"*Built for the Meta OpenEnv Hackathon (Theme #3.1, World Modeling). "
|
| 683 |
+
"The OpenEnv HTTP API is still live alongside this UI at "
|
| 684 |
+
"`/health`, `/reset`, `/step`, `/state`, `/schema`, `/mcp`, `/docs`.*"
|
| 685 |
+
)
|
| 686 |
+
|
| 687 |
+
return demo
|
| 688 |
+
|
| 689 |
+
|
| 690 |
+
__all__ = ["build_gradio_demo"]
|
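The module docstring above says this Blocks app is mounted at `/demo` by `server.app.build_app`, but the `server/app.py` part of this commit is not reproduced here. The snippet below is therefore only a minimal sketch of how such a mount usually looks with Gradio's `mount_gradio_app` helper; the `build_app` body and the import path are illustrative assumptions, not the actual change.

# Hypothetical sketch — not the server/app.py diff from this commit.
# Shows the usual way a Gradio Blocks UI is attached to an existing FastAPI
# app under a sub-path while the JSON API routes keep working.
from fastapi import FastAPI
import gradio as gr

from gradio_demo import build_gradio_demo  # module lives at space/env/gradio_demo.py in this repo (import path assumed)


def build_app() -> FastAPI:
    app = FastAPI()  # in the real Space this app already exposes /health, /reset, /step, /state, ...
    demo = build_gradio_demo()
    # gr.mount_gradio_app returns the FastAPI app with the Blocks UI served at /demo.
    app = gr.mount_gradio_app(app, demo, path="/demo")
    return app


if __name__ == "__main__":
    # For quick local testing the demo can also run standalone, without FastAPI.
    build_gradio_demo().launch()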
space/env/requirements.txt
CHANGED
@@ -4,3 +4,4 @@ pydantic>=2.0.0
 fastapi>=0.110.0
 uvicorn>=0.27.0
 openenv-core[core]>=0.2.3
+gradio>=4.40.0,<5.0
training/evaluate.py
CHANGED
|
@@ -1,152 +1,153 @@
|
|
| 1 |
+
"""Evaluate an LLM (with optional LoRA adapters) on CERNenv.
|
| 2 |
+
|
| 3 |
+
Usage:
|
| 4 |
+
python -m training.evaluate --model_name unsloth/Qwen2.5-3B-Instruct \\
|
| 5 |
+
--difficulty easy --episodes 16 --tag pre_train \\
|
| 6 |
+
--out training/runs/eval_pre_train.jsonl
|
| 7 |
+
|
| 8 |
+
python -m training.evaluate --model_name unsloth/Qwen2.5-3B-Instruct \\
|
| 9 |
+
--adapter_dir training/runs/unsloth-grpo --difficulty easy \\
|
| 10 |
+
--episodes 16 --tag post_train --out training/runs/eval_post_train.jsonl
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
import argparse
|
| 16 |
+
import json
|
| 17 |
+
import logging
|
| 18 |
+
import os
|
| 19 |
+
from dataclasses import asdict
|
| 20 |
+
from pathlib import Path
|
| 21 |
+
from typing import Any, Dict, List, Optional
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
| 25 |
+
logger = logging.getLogger(__name__)
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def _build_generate_fn(
|
| 29 |
+
*,
|
| 30 |
+
model_name: str,
|
| 31 |
+
adapter_dir: Optional[str],
|
| 32 |
+
use_unsloth: bool,
|
| 33 |
+
max_seq_length: int,
|
| 34 |
+
):
|
| 35 |
+
if use_unsloth:
|
| 36 |
+
from unsloth import FastLanguageModel # type: ignore
|
| 37 |
+
|
| 38 |
+
model, tokenizer = FastLanguageModel.from_pretrained(
|
| 39 |
+
model_name=model_name,
|
| 40 |
+
max_seq_length=max_seq_length,
|
| 41 |
+
load_in_4bit=True,
|
| 42 |
+
# fast_inference requires vLLM, which is not in requirements; plain transformers generation is used instead. Re-enable after pinning vllm in space/training/requirements.txt.
|
| 43 |
+
fast_inference=False,
|
| 44 |
+
)
|
| 45 |
+
if adapter_dir:
|
| 46 |
+
model.load_adapter(adapter_dir)
|
| 47 |
+
FastLanguageModel.for_inference(model)
|
| 48 |
+
else:
|
| 49 |
+
import torch
|
| 50 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 51 |
+
|
| 52 |
+
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 53 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 54 |
+
model_name,
|
| 55 |
+
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
|
| 56 |
+
device_map="auto" if torch.cuda.is_available() else None,
|
| 57 |
+
)
|
| 58 |
+
if adapter_dir:
|
| 59 |
+
from peft import PeftModel # type: ignore
|
| 60 |
+
model = PeftModel.from_pretrained(model, adapter_dir)
|
| 61 |
+
|
| 62 |
+
if tokenizer.pad_token is None:
|
| 63 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 64 |
+
|
| 65 |
+
def prompt_fn(chat: List[Dict[str, str]]) -> str:
|
| 66 |
+
return tokenizer.apply_chat_template(
|
| 67 |
+
chat, add_generation_prompt=True, tokenize=False
|
| 68 |
+
)
|
| 69 |
+
|
| 70 |
+
def generate_fn(prompt: str, config) -> str:
|
| 71 |
+
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
| 72 |
+
outputs = model.generate(
|
| 73 |
+
**inputs,
|
| 74 |
+
max_new_tokens=config.max_new_tokens,
|
| 75 |
+
do_sample=True,
|
| 76 |
+
temperature=config.temperature,
|
| 77 |
+
top_p=config.top_p,
|
| 78 |
+
pad_token_id=tokenizer.pad_token_id,
|
| 79 |
+
)
|
| 80 |
+
gen = outputs[0][inputs["input_ids"].shape[1]:]
|
| 81 |
+
return tokenizer.decode(gen, skip_special_tokens=True)
|
| 82 |
+
|
| 83 |
+
return prompt_fn, generate_fn
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
def main() -> None: # pragma: no cover
|
| 87 |
+
parser = argparse.ArgumentParser()
|
| 88 |
+
parser.add_argument("--model_name", required=True)
|
| 89 |
+
parser.add_argument("--adapter_dir", default=None)
|
| 90 |
+
parser.add_argument("--scenario", default=None)
|
| 91 |
+
parser.add_argument("--difficulty", choices=["easy", "medium", "hard"], default="easy")
|
| 92 |
+
parser.add_argument("--episodes", type=int, default=16)
|
| 93 |
+
parser.add_argument("--seed", type=int, default=1000)
|
| 94 |
+
parser.add_argument("--max_steps", type=int, default=18)
|
| 95 |
+
parser.add_argument("--max_seq_length", type=int, default=2048)
|
| 96 |
+
parser.add_argument("--no_unsloth", action="store_true")
|
| 97 |
+
parser.add_argument("--tag", default="eval")
|
| 98 |
+
parser.add_argument("--out", required=True)
|
| 99 |
+
args = parser.parse_args()
|
| 100 |
+
|
| 101 |
+
from server.environment import CERNCollisionEnvironment
|
| 102 |
+
from training.llm_agent import LLMAgentConfig
|
| 103 |
+
from training.rollouts import collect_episode, save_episodes_jsonl
|
| 104 |
+
|
| 105 |
+
use_unsloth = not args.no_unsloth
|
| 106 |
+
try:
|
| 107 |
+
prompt_fn, generate_fn = _build_generate_fn(
|
| 108 |
+
model_name=args.model_name,
|
| 109 |
+
adapter_dir=args.adapter_dir,
|
| 110 |
+
use_unsloth=use_unsloth,
|
| 111 |
+
max_seq_length=args.max_seq_length,
|
| 112 |
+
)
|
| 113 |
+
except ImportError as exc:
|
| 114 |
+
logger.warning("Unsloth not available (%s); falling back to transformers.", exc)
|
| 115 |
+
prompt_fn, generate_fn = _build_generate_fn(
|
| 116 |
+
model_name=args.model_name,
|
| 117 |
+
adapter_dir=args.adapter_dir,
|
| 118 |
+
use_unsloth=False,
|
| 119 |
+
max_seq_length=args.max_seq_length,
|
| 120 |
+
)
|
| 121 |
+
|
| 122 |
+
env = CERNCollisionEnvironment(max_steps=args.max_steps)
|
| 123 |
+
cfg = LLMAgentConfig()
|
| 124 |
+
|
| 125 |
+
episodes = []
|
| 126 |
+
for ep in range(args.episodes):
|
| 127 |
+
seed = args.seed + ep
|
| 128 |
+
rec = collect_episode(
|
| 129 |
+
env=env,
|
| 130 |
+
seed=seed,
|
| 131 |
+
scenario=args.scenario,
|
| 132 |
+
difficulty=args.difficulty,
|
| 133 |
+
prompt_fn=prompt_fn,
|
| 134 |
+
generate_fn=generate_fn,
|
| 135 |
+
config=cfg,
|
| 136 |
+
)
|
| 137 |
+
episodes.append(rec)
|
| 138 |
+
logger.info(
|
| 139 |
+
"[%s][%d/%d] reward=%+.3f discovered=%s mass=%s channel=%s",
|
| 140 |
+
args.tag, ep + 1, args.episodes,
|
| 141 |
+
rec.cumulative_reward, rec.discovered, rec.correct_mass, rec.correct_channel,
|
| 142 |
+
)
|
| 143 |
+
|
| 144 |
+
Path(args.out).parent.mkdir(parents=True, exist_ok=True)
|
| 145 |
+
save_episodes_jsonl(episodes, args.out)
|
| 146 |
+
|
| 147 |
+
rewards = [e.cumulative_reward for e in episodes]
|
| 148 |
+
success = sum(1 for e in episodes if e.discovered) / len(episodes)
|
| 149 |
+
logger.info("[%s] mean_reward=%.3f success_rate=%.2f", args.tag, sum(rewards) / len(rewards), success)
|
| 150 |
+
|
| 151 |
+
|
| 152 |
+
if __name__ == "__main__": # pragma: no cover
|
| 153 |
+
main()
|
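The evaluator above logs `mean_reward` and `success_rate` and writes one JSONL record per episode via `save_episodes_jsonl`, so the pre_train vs post_train comparison suggested in its usage block can be done offline from the two output files. Below is a small sketch under the assumption that each record exposes `cumulative_reward` and `discovered` keys matching the per-episode attributes logged above; the exact keys written by `save_episodes_jsonl` are not shown in this diff, so treat them as placeholders.

# Rough before/after comparison of two evaluation runs.
# Assumes each JSONL line is one JSON object with at least
# "cumulative_reward" (float) and "discovered" (bool) — field names inferred
# from the logging above, not verified against training/rollouts.py.
import json
from pathlib import Path


def summarise(path: str) -> dict:
    records = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    n = len(records)
    return {
        "episodes": n,
        "mean_reward": sum(r["cumulative_reward"] for r in records) / n,
        "success_rate": sum(1 for r in records if r["discovered"]) / n,
    }


if __name__ == "__main__":
    pre = summarise("training/runs/eval_pre_train.jsonl")
    post = summarise("training/runs/eval_post_train.jsonl")
    print("pre :", pre)
    print("post:", post)
    print("delta mean_reward:", post["mean_reward"] - pre["mean_reward"])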
training/training_unsloth.py
CHANGED
|
@@ -1,341 +1,342 @@
|
|
| 1 |
+
"""Unsloth + LoRA (Low-Rank Adaptation) GRPO training for CERNenv.
|
| 2 |
+
|
| 3 |
+
This is the recommended path for Colab / single- or multi-GPU runs because
|
| 4 |
+
Unsloth's fused kernels and 4-bit loading let us train 2B–8B models with
|
| 5 |
+
limited VRAM, while TRL's GRPO (Group-Relative Policy Optimization) loop
|
| 6 |
+
handles the policy-gradient math.
|
| 7 |
+
|
| 8 |
+
The trainer is wired up to produce **all** "training-progress evidence"
|
| 9 |
+
artifacts demanded by the OpenEnv hackathon's scoring rubric:
|
| 10 |
+
|
| 11 |
+
* per-step training log + reward/loss curve PNG (Portable Network Graphics)
|
| 12 |
+
* mid-training checkpoint evaluations + progression curve PNG
|
| 13 |
+
* (post-run) before/after summary + reward-distribution PNG
|
| 14 |
+
|
| 15 |
+
All artifacts land in ``--evidence_dir`` (default: ``evidence/``).
|
| 16 |
+
|
| 17 |
+
Run on Colab / single GPU:
|
| 18 |
+
!python -m training.training_unsloth \
|
| 19 |
+
--model_name unsloth/Qwen2.5-3B-Instruct \
|
| 20 |
+
--total_episodes 400 --num_generations 4 --output_dir runs/unsloth-grpo
|
| 21 |
+
|
| 22 |
+
Run on a 4×A100 Hugging Face Space (multi-GPU via accelerate):
|
| 23 |
+
accelerate launch --num_processes 4 -m training.training_unsloth \
|
| 24 |
+
--total_episodes 1500 --num_generations 8 --output_dir runs/unsloth-grpo
|
| 25 |
+
"""
|
| 26 |
+
|
| 27 |
+
from __future__ import annotations
|
| 28 |
+
|
| 29 |
+
import argparse
|
| 30 |
+
import logging
|
| 31 |
+
import time
|
| 32 |
+
from pathlib import Path
|
| 33 |
+
from typing import Any, Dict, List, Optional
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
| 37 |
+
logger = logging.getLogger(__name__)
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def _build_args() -> argparse.Namespace:
|
| 41 |
+
parser = argparse.ArgumentParser()
|
| 42 |
+
parser.add_argument("--model_name", default="unsloth/Qwen2.5-3B-Instruct")
|
| 43 |
+
parser.add_argument("--scenario", default=None)
|
| 44 |
+
parser.add_argument("--difficulty", choices=["easy", "medium", "hard"], default="easy")
|
| 45 |
+
parser.add_argument(
|
| 46 |
+
"--curriculum",
|
| 47 |
+
action="store_true",
|
| 48 |
+
help=(
|
| 49 |
+
"Enable adaptive curriculum: start at --difficulty and promote "
|
| 50 |
+
"to medium/hard once held-out success rate clears the threshold "
|
| 51 |
+
"(see training/curriculum.py)."
|
| 52 |
+
),
|
| 53 |
+
)
|
| 54 |
+
parser.add_argument("--curriculum_promote", type=float, default=0.55)
|
| 55 |
+
parser.add_argument("--curriculum_demote", type=float, default=0.10)
|
| 56 |
+
parser.add_argument("--total_episodes", type=int, default=400)
|
| 57 |
+
parser.add_argument("--seed", type=int, default=42)
|
| 58 |
+
parser.add_argument("--max_steps", type=int, default=18)
|
| 59 |
+
parser.add_argument("--num_generations", type=int, default=4)
|
| 60 |
+
parser.add_argument("--max_prompt_length", type=int, default=2048)
|
| 61 |
+
parser.add_argument("--max_completion_length", type=int, default=384)
|
| 62 |
+
parser.add_argument("--learning_rate", type=float, default=5e-6)
|
| 63 |
+
parser.add_argument("--load_in_4bit", action="store_true", default=True)
|
| 64 |
+
parser.add_argument("--lora_rank", type=int, default=16)
|
| 65 |
+
parser.add_argument("--lora_alpha", type=int, default=16)
|
| 66 |
+
parser.add_argument("--per_device_batch_size", type=int, default=1)
|
| 67 |
+
parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
|
| 68 |
+
parser.add_argument("--logging_steps", type=int, default=2)
|
| 69 |
+
parser.add_argument("--save_steps", type=int, default=50)
|
| 70 |
+
parser.add_argument("--checkpoint_eval_steps", type=int, default=25,
|
| 71 |
+
help="Run a held-out eval every N updates for the progression curve.")
|
| 72 |
+
parser.add_argument("--checkpoint_eval_episodes", type=int, default=8,
|
| 73 |
+
help="Number of held-out episodes per mid-training eval.")
|
| 74 |
+
parser.add_argument("--output_dir", default="runs/unsloth-grpo")
|
| 75 |
+
parser.add_argument("--evidence_dir", default="evidence")
|
| 76 |
+
return parser.parse_args()
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def main() -> None: # pragma: no cover - heavy GPU path
|
| 80 |
+
args = _build_args()
|
| 81 |
+
|
| 82 |
+
# IMPORTANT: Unsloth MUST be imported before transformers / trl. It
|
| 83 |
+
# patches transformers' lazy ``_import_structure`` to register a few
|
| 84 |
+
# symbols (notably ``PreTrainedModel`` under torch-aware paths). If trl
|
| 85 |
+
# loads transformers first, the lazy loader will fail with a confusing
|
| 86 |
+
# ``ImportError: cannot import name 'PreTrainedModel' from 'transformers'``
|
| 87 |
+
# at GRPOTrainer import time — which is exactly what we hit on the
|
| 88 |
+
# trainer Space before this reorder.
|
| 89 |
+
# See: https://github.com/unslothai/unsloth and the matching
|
| 90 |
+
# transformers issue #42548 for the lazy-import root cause.
|
| 91 |
+
from unsloth import FastLanguageModel
|
| 92 |
+
from transformers import TrainerCallback
|
| 93 |
+
from trl import GRPOConfig, GRPOTrainer
|
| 94 |
+
|
| 95 |
+
from server.environment import CERNCollisionEnvironment
|
| 96 |
+
from training.curriculum import CurriculumConfig, CurriculumManager
|
| 97 |
+
from training.evidence import (
|
| 98 |
+
CheckpointEvalWriter,
|
| 99 |
+
EvidencePaths,
|
| 100 |
+
RewardComponentLogWriter,
|
| 101 |
+
TrainingLogWriter,
|
| 102 |
+
render_checkpoint_progression,
|
| 103 |
+
render_reward_components,
|
| 104 |
+
render_training_curve,
|
| 105 |
+
)
|
| 106 |
+
from training.llm_agent import LLMAgentConfig
|
| 107 |
+
from training.rollouts import collect_episode
|
| 108 |
+
from training.training_script import (
|
| 109 |
+
EpisodeContext,
|
| 110 |
+
RewardComponentAccumulator,
|
| 111 |
+
)
|
| 112 |
+
|
| 113 |
+
paths = EvidencePaths(root=Path(args.evidence_dir))
|
| 114 |
+
paths.ensure()
|
| 115 |
+
log_writer = TrainingLogWriter(paths.training_log_csv)
|
| 116 |
+
ckpt_writer = CheckpointEvalWriter(paths.checkpoint_evals_csv)
|
| 117 |
+
component_writer = RewardComponentLogWriter(paths.reward_components_csv)
|
| 118 |
+
component_accumulator = RewardComponentAccumulator()
|
| 119 |
+
|
| 120 |
+
curriculum: Optional[CurriculumManager] = None
|
| 121 |
+
if args.curriculum:
|
| 122 |
+
curriculum = CurriculumManager(
|
| 123 |
+
CurriculumConfig(
|
| 124 |
+
start_difficulty=args.difficulty,
|
| 125 |
+
promote_threshold=args.curriculum_promote,
|
| 126 |
+
demote_threshold=args.curriculum_demote,
|
| 127 |
+
)
|
| 128 |
+
)
|
| 129 |
+
logger.info("Curriculum enabled: start=%s promote≥%.2f demote≤%.2f",
|
| 130 |
+
args.difficulty, args.curriculum_promote, args.curriculum_demote)
|
| 131 |
+
|
| 132 |
+
logger.info("Loading Unsloth model: %s", args.model_name)
|
| 133 |
+
model, tokenizer = FastLanguageModel.from_pretrained(
|
| 134 |
+
model_name=args.model_name,
|
| 135 |
+
max_seq_length=args.max_prompt_length + args.max_completion_length,
|
| 136 |
+
load_in_4bit=args.load_in_4bit,
|
| 137 |
+
# fast_inference requires vLLM, which is not in requirements; plain transformers generation is used instead. Re-enable after pinning vllm in space/training/requirements.txt.
|
| 138 |
+
fast_inference=False,
|
| 139 |
+
)
|
| 140 |
+
model = FastLanguageModel.get_peft_model(
|
| 141 |
+
model,
|
| 142 |
+
r=args.lora_rank,
|
| 143 |
+
lora_alpha=args.lora_alpha,
|
| 144 |
+
target_modules=[
|
| 145 |
+
"q_proj", "k_proj", "v_proj", "o_proj",
|
| 146 |
+
"gate_proj", "up_proj", "down_proj",
|
| 147 |
+
],
|
| 148 |
+
use_gradient_checkpointing="unsloth",
|
| 149 |
+
)
|
| 150 |
+
if tokenizer.pad_token is None:
|
| 151 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 152 |
+
|
| 153 |
+
from training.training_script import build_dataset, make_reward_fn
|
| 154 |
+
|
| 155 |
+
env = CERNCollisionEnvironment(max_steps=args.max_steps)
|
| 156 |
+
dataset = build_dataset(
|
| 157 |
+
tokenizer=tokenizer,
|
| 158 |
+
n_prompts=args.total_episodes,
|
| 159 |
+
seed=args.seed,
|
| 160 |
+
scenario=args.scenario,
|
| 161 |
+
difficulty=args.difficulty,
|
| 162 |
+
curriculum=args.curriculum,
|
| 163 |
+
)
|
| 164 |
+
|
| 165 |
+
ctx = EpisodeContext(
|
| 166 |
+
env=env, seed=args.seed,
|
| 167 |
+
scenario=args.scenario, difficulty=args.difficulty,
|
| 168 |
+
)
|
| 169 |
+
reward_fn = make_reward_fn(ctx, accumulator=component_accumulator)
|
| 170 |
+
|
| 171 |
+
cfg = GRPOConfig(
|
| 172 |
+
output_dir=args.output_dir,
|
| 173 |
+
per_device_train_batch_size=args.per_device_batch_size,
|
| 174 |
+
gradient_accumulation_steps=args.gradient_accumulation_steps,
|
| 175 |
+
num_generations=args.num_generations,
|
| 176 |
+
learning_rate=args.learning_rate,
|
| 177 |
+
max_prompt_length=args.max_prompt_length,
|
| 178 |
+
max_completion_length=args.max_completion_length,
|
| 179 |
+
logging_steps=args.logging_steps,
|
| 180 |
+
save_steps=args.save_steps,
|
| 181 |
+
seed=args.seed,
|
| 182 |
+
bf16=True,
|
| 183 |
+
report_to=[],
|
| 184 |
+
)
|
| 185 |
+
|
| 186 |
+
held_out_seeds = list(range(900_000, 900_000 + args.checkpoint_eval_episodes))
|
| 187 |
+

    class EvidenceCallback(TrainerCallback):
        """Stream training metrics + run periodic mid-training evals."""

        def __init__(self) -> None:
            self._t0 = time.time()
            self._last_eval_step = -1

        def on_log(self, _args, state, control, logs=None, **kw):
            logs = logs or {}
            row = {
                "step": state.global_step,
                "epoch": logs.get("epoch"),
                "loss": logs.get("loss"),
                "reward": logs.get("reward") or logs.get("rewards/mean"),
                "reward_std": logs.get("reward_std") or logs.get("rewards/std"),
                "kl": logs.get("kl"),
                "grad_norm": logs.get("grad_norm"),
                "learning_rate": logs.get("learning_rate"),
                "wall_time_s": round(time.time() - self._t0, 2),
            }
            if any(v is not None for k, v in row.items() if k != "step"):
                log_writer.append(row)
                render_training_curve(paths.training_log_csv, paths.training_curve_png)

            # Per-component reward summary (FAQ Q17, Q43, Q52: don't watch
            # only the mean reward; track terminal vs shaping, success
            # rates, and parse rate so verifier hacks become visible).
            drained = component_accumulator.drain()
            if drained:
                summary = RewardComponentAccumulator.summarise(drained)
                summary["step"] = state.global_step
                component_writer.append(summary)
                render_reward_components(
                    paths.reward_components_csv, paths.reward_components_png,
                )

        def on_step_end(self, _args, state, control, **kw):
            step = state.global_step
            if step <= 0 or step == self._last_eval_step:
                return control
            if step % args.checkpoint_eval_steps != 0:
                return control
            self._last_eval_step = step
            try:
                self._run_checkpoint_eval(step, state)
            except Exception as exc:
                logger.warning("checkpoint eval failed at step %d: %s", step, exc)
            return control

        def _run_checkpoint_eval(self, step: int, state) -> None:
            FastLanguageModel.for_inference(model)
            try:
                # When curriculum is enabled, evaluate at whatever tier the
                # adaptive manager currently considers appropriate. Otherwise
                # use the static --difficulty.
                eval_difficulty = (
                    curriculum.next_difficulty()
                    if curriculum is not None
                    else args.difficulty
                )
                episodes = []
                for s in held_out_seeds:
                    ep = self._rollout_one(seed=s, difficulty=eval_difficulty)
                    if ep is not None:
                        episodes.append(ep)
                if not episodes:
                    return
                rewards = [e.cumulative_reward for e in episodes]
                mean_reward = sum(rewards) / len(rewards)
                success_rate = sum(1 for e in episodes if e.discovered) / len(episodes)
                ckpt_writer.append(
                    step=step,
                    fraction_done=round(step / max(state.max_steps or step, 1), 4),
                    episodes=len(episodes),
                    mean_reward=round(mean_reward, 4),
                    success_rate=round(success_rate, 4),
                    mass_acc=round(sum(1 for e in episodes if e.correct_mass) / len(episodes), 4),
                    channel_acc=round(sum(1 for e in episodes if e.correct_channel) / len(episodes), 4),
                )
                render_checkpoint_progression(
                    paths.checkpoint_evals_csv,
                    paths.checkpoint_progression_png,
                )
                if curriculum is not None:
                    snap = curriculum.record(
                        success=success_rate >= 0.5,
                        reward=mean_reward,
                    )
                    curriculum.save(paths.root / "curriculum_state.json")
                    if snap.get("event"):
                        logger.info(
                            "[curriculum] %s @ step=%d → tier=%s (rolling=%.2f)",
                            snap["event"], step, snap["current"], snap["rolling_success"],
                        )
                logger.info(
                    "[checkpoint-eval step=%d difficulty=%s] reward=%.3f success=%.2f",
                    step, eval_difficulty, mean_reward, success_rate,
                )
            finally:
                FastLanguageModel.for_training(model)

        def _rollout_one(self, seed: int, difficulty: Optional[str] = None):
            def prompt_fn(chat):
                return tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)

            def generate_fn(prompt: str, _config) -> str:
                inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=args.max_completion_length,
                    do_sample=True, temperature=0.7, top_p=0.95,
                    pad_token_id=tokenizer.pad_token_id,
                )
                # Decode only the newly generated tokens; the prompt is sliced off.
                gen = outputs[0][inputs["input_ids"].shape[1]:]
                return tokenizer.decode(gen, skip_special_tokens=True)

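            # collect_episode drives one full environment episode with these
            # prompt/generate hooks and returns the episode record (cumulative
            # reward, discovery/accuracy flags) consumed by the checkpoint eval.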
            return collect_episode(
                env=env, seed=seed,
                scenario=args.scenario,
                difficulty=difficulty or args.difficulty,
                prompt_fn=prompt_fn, generate_fn=generate_fn,
                config=LLMAgentConfig(),
            )

    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=dataset,
        reward_funcs=[reward_fn],
        args=cfg,
        callbacks=[EvidenceCallback()],
    )
    logger.info("Starting Unsloth + LoRA GRPO training")
    trainer.train()

    # Drain whatever rollouts the final on_log didn't catch so the last
    # row of reward_components.csv is correct.
    final_drain = component_accumulator.drain()
    if final_drain:
        summary = RewardComponentAccumulator.summarise(final_drain)
        summary["step"] = trainer.state.global_step
        component_writer.append(summary)
        render_reward_components(
            paths.reward_components_csv, paths.reward_components_png,
        )

    trainer.save_model(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)
    logger.info("Saved adapters to %s", args.output_dir)
    logger.info("Evidence artifacts in %s", paths.root)


if __name__ == "__main__":  # pragma: no cover
    main()
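The script ends by writing the LoRA adapters and tokenizer to --output_dir. A minimal, illustrative way to reload them for inference with the standard PEFT API; the base model name and directory below are placeholders, not part of this commit:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: substitute the base model used for training and the --output_dir path.
base = AutoModelForCausalLM.from_pretrained("BASE_MODEL_NAME", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("OUTPUT_DIR")
model = PeftModel.from_pretrained(base, "OUTPUT_DIR")  # adapters written by trainer.save_model
model.eval()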