# MolForge RL Training in Colab

Use [issue/molforge_grpo_colab_training.ipynb](issue/molforge_grpo_colab_training.ipynb) for the judge-rerunnable workflow.

The notebook trains from the Qwen3.5 2B SFT v4 adapter with TRL GRPO against the real MolForge environment reward. It uses the TRL/OpenEnv `environment_factory` pattern from the Wordle/Sudoku examples: MolForge exposes tool methods for `edit`, `run_assay`, `submit`, `restart`, and `defer`, and reward functions read scores from the environment instances. It is set up for short evidence runs on A100/H100 rather than full convergence.

## Outputs

Each run writes to `/content/molforge_rl_runs//` and copies the same folder to `DRIVE_OUTPUT_DIR` when set.

Important artifacts:

- `logs/openenv_tool_rollouts.jsonl`: every tool call, reward, governance status, and score diagnostics.
- `logs/trainer_log_history.jsonl`: trainer loss, grad norm, learning rate, and step timing.
- `openenv_tool_metrics.csv`: spreadsheet-friendly tool rollout reward table.
- `eval_before_training.json`: full 3-task rollout before GRPO.
- `eval_after_training.json`: full 3-task rollout after GRPO.
- `plots/reward_curve.png`: completion reward curve and moving average.
- `plots/loss_curve.png`: trainer loss curve.
- `plots/eval_before_after.png`: before/after `final_score` comparison.
- `plots/action_distribution.png`: sampled action mix.
- `adapters/`: trained LoRA adapter checkpoint.
- `.zip`: portable archive of the run outputs.

## Fast Demo Settings

For a quick A100/H100 proof run:

```python
import os  # needed if the cell runs before any other import of os

# Short evidence run: few steps, small dataset, two generations per prompt.
os.environ["RL_MAX_STEPS"] = "80"
os.environ["NUM_GENERATIONS"] = "2"
os.environ["RL_DATASET_SIZE"] = "120"
os.environ["RL_BATCH_SIZE"] = "2"
os.environ["RL_GRAD_ACCUM"] = "4"
os.environ["RL_LEARNING_RATE"] = "2e-6"
```

For a stronger run, try `RL_MAX_STEPS=200` and `NUM_GENERATIONS=4` on H100. If Colab runs out of memory, reduce `MAX_COMPLETION_LENGTH` to `1024`. In all cases, keep `RL_BATCH_SIZE` divisible by `NUM_GENERATIONS`.
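The divisibility rule can be checked before launching training. This is a minimal sketch, assuming the notebook reads `RL_BATCH_SIZE` and `NUM_GENERATIONS` from environment variables as shown above. The helper `check_rl_config` is hypothetical and not part of the notebook, and its fallback defaults are simply the demo values.

```python
import os


def check_rl_config(env=None):
    """Return a list of problems found in the RL settings (empty if none).

    Hypothetical helper: only checks the batch-size rule stated in this guide.
    """
    env = os.environ if env is None else env
    # ASSUMPTION: fall back to the fast demo values when a variable is unset.
    batch_size = int(env.get("RL_BATCH_SIZE", "2"))
    num_generations = int(env.get("NUM_GENERATIONS", "2"))

    problems = []
    if batch_size % num_generations != 0:
        problems.append(
            f"RL_BATCH_SIZE={batch_size} is not divisible by "
            f"NUM_GENERATIONS={num_generations}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_rl_config():
        print("config problem:", problem)
```

With the demo values the list is empty. Setting `NUM_GENERATIONS=4` while leaving `RL_BATCH_SIZE=2` is flagged, so raise the batch size along with the generation count.
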
If TRL fails during import with `No module named 'mergekit'`, install `mergekit` in the same setup cell as `trl`.

## What to Show Judges

Use the before/after rollout JSON plus these plots:

- `reward_curve.png` for reward improvement during RL.
- `loss_curve.png` for actual training evidence.
- `eval_before_after.png` for task-level behavior change.

The official environment score remains `final_score`; `progress_score` and per-step rewards are debugging signals.
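
To summarize the before/after rollout JSON as numbers alongside the plots, something like the following may help. It is only a sketch: the JSON schema is not described here, so the code assumes each file is an object keyed by task name whose values contain a numeric `final_score`. The helpers `load_scores` and `score_deltas` are hypothetical; adjust the field access to the real file layout.

```python
import json
from pathlib import Path


def load_scores(path):
    """Read a rollout JSON file and return {task_name: final_score}.

    ASSUMPTION: the file is a JSON object keyed by task name, and each
    value is an object with a numeric "final_score" field.
    """
    data = json.loads(Path(path).read_text())
    return {task: float(record["final_score"]) for task, record in data.items()}


def score_deltas(before, after):
    """Return {task: after - before} for tasks present in both mappings."""
    return {task: after[task] - before[task] for task in before if task in after}


if __name__ == "__main__":
    run_dir = Path(".")  # hypothetical: point this at the run output folder
    before = load_scores(run_dir / "eval_before_training.json")
    after = load_scores(run_dir / "eval_after_training.json")
    for task, delta in sorted(score_deltas(before, after).items()):
        print(f"{task}: {before[task]:.3f} -> {after[task]:.3f} ({delta:+.3f})")
```

Only `final_score` is compared, since it is the official environment score.
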