Spaces:
Running
MolForge RL Training in Colab
Use issue/molforge_grpo_colab_training.ipynb for the judge-rerunnable workflow.
The notebook trains from the Qwen3.5 2B SFT v4 adapter with TRL GRPO against the real MolForge environment reward. It uses the TRL/OpenEnv environment_factory pattern from the Wordle/Sudoku examples: MolForge exposes tool methods for edit, run_assay, submit, restart, and defer, and reward functions read scores from the environment instances. It is set up for short evidence runs on A100/H100 rather than full convergence.
Outputs
Each run writes to /content/molforge_rl_runs/<run_name>/ and copies the same folder to DRIVE_OUTPUT_DIR when set.
Important artifacts:
logs/openenv_tool_rollouts.jsonl: every tool call, reward, governance status, and score diagnostics.logs/trainer_log_history.jsonl: trainer loss, grad norm, learning rate, and step timing.openenv_tool_metrics.csv: spreadsheet-friendly tool rollout reward table.eval_before_training.json: full 3-task rollout before GRPO.eval_after_training.json: full 3-task rollout after GRPO.plots/reward_curve.png: completion reward curve and moving average.plots/loss_curve.png: trainer loss curve.plots/eval_before_after.png: before/after final_score comparison.plots/action_distribution.png: sampled action mix.adapters/: trained LoRA adapter checkpoint.<run_name>.zip: portable archive of the run outputs.
Fast Demo Settings
For a quick A100/H100 proof run:
os.environ["RL_MAX_STEPS"] = "80"
os.environ["NUM_GENERATIONS"] = "2"
os.environ["RL_DATASET_SIZE"] = "120"
os.environ["RL_BATCH_SIZE"] = "2"
os.environ["RL_GRAD_ACCUM"] = "4"
os.environ["RL_LEARNING_RATE"] = "2e-6"
For a stronger run, try RL_MAX_STEPS=200 and NUM_GENERATIONS=4 on H100.
If Colab runs out of memory, reduce MAX_COMPLETION_LENGTH to 1024; keep RL_BATCH_SIZE divisible by NUM_GENERATIONS.
If TRL fails during import with No module named 'mergekit', install mergekit in the same setup cell as trl.
What to Show Judges
Use the before/after rollout JSON plus these plots:
reward_curve.pngfor reward improvement during RL.loss_curve.pngfor actual training evidence.eval_before_after.pngfor task-level behavior change.
The official environment score remains final_score; progress_score and per-step rewards are debugging signals.