MolForge RL Training in Colab

Use issue/molforge_grpo_colab_training.ipynb for the judge-rerunnable workflow.

The notebook trains from the Qwen3.5 2B SFT v4 adapter using TRL GRPO, with rewards taken from the real MolForge environment. It follows the TRL/OpenEnv environment_factory pattern used in the Wordle and Sudoku examples: MolForge exposes edit, run_assay, submit, restart, and defer as tool methods, and the reward functions read scores from the environment instances. The defaults target short evidence runs on an A100 or H100, not training to full convergence.
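The shape of that pattern can be sketched without any TRL import. Everything below is a hypothetical stand-in: the class name ToyMolForgeEnv, the method arguments, and the scoring rule are assumptions made for illustration, and the reward function signature is only a guess at how a trainer might pass environment instances. The real MolForge environment and its reward are defined in the repository, not here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMolForgeEnv:
    """Hypothetical stand-in for one MolForge environment instance."""

    final_score: float = 0.0
    done: bool = False
    history: list = field(default_factory=list)

    # Each public method below plays the role of a tool the policy can call.
    def edit(self, change: str) -> str:
        self.history.append(("edit", change))
        return f"applied edit: {change}"

    def run_assay(self, assay: str) -> str:
        self.history.append(("run_assay", assay))
        return f"assay {assay} recorded"

    def defer(self) -> str:
        self.history.append(("defer", None))
        return "deferred"

    def restart(self) -> str:
        # Reset all episode state so the next rollout starts clean.
        self.history.clear()
        self.final_score = 0.0
        self.done = False
        return "episode restarted"

    def submit(self) -> str:
        # Placeholder scoring rule, NOT the real MolForge reward.
        self.final_score = min(1.0, 0.25 * len(self.history))
        self.done = True
        return f"submitted with final_score={self.final_score:.2f}"


def env_reward(environments, **kwargs):
    """Reward function that reads the score stored on each environment."""
    return [env.final_score for env in environments]
```

The point of the design is that the reward is never parsed out of the completion text: the model acts through tool calls, the environment keeps the score, and the reward function only reads it back.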

Outputs

Each run writes to /content/molforge_rl_runs/<run_name>/ and copies the same folder to DRIVE_OUTPUT_DIR when set.
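A minimal sketch of the mirroring step follows, assuming the run folder is copied to DRIVE_OUTPUT_DIR/<run_name>. The helper name mirror_run_dir and that target layout are assumptions; the notebook may copy the folder differently.

```python
import os
import shutil
from pathlib import Path


def mirror_run_dir(run_dir, drive_output_dir=None):
    """Copy run_dir into drive_output_dir/<run_name>; return the target or None."""
    if drive_output_dir is None:
        drive_output_dir = os.environ.get("DRIVE_OUTPUT_DIR", "")
    if not drive_output_dir:
        return None  # Nothing to do when DRIVE_OUTPUT_DIR is unset or empty.
    run_dir = Path(run_dir)
    target = Path(drive_output_dir) / run_dir.name
    # dirs_exist_ok lets a rerun overwrite an earlier copy of the same run.
    shutil.copytree(run_dir, target, dirs_exist_ok=True)
    return target
```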

Important artifacts:

  • logs/openenv_tool_rollouts.jsonl: every tool call, reward, governance status, and score diagnostics.
  • logs/trainer_log_history.jsonl: trainer loss, grad norm, learning rate, and step timing.
  • openenv_tool_metrics.csv: spreadsheet-friendly tool rollout reward table.
  • eval_before_training.json: full 3-task rollout before GRPO.
  • eval_after_training.json: full 3-task rollout after GRPO.
  • plots/reward_curve.png: completion reward curve and moving average.
  • plots/loss_curve.png: trainer loss curve.
  • plots/eval_before_after.png: before/after final_score comparison.
  • plots/action_distribution.png: sampled action mix.
  • adapters/: trained LoRA adapter checkpoint.
  • <run_name>.zip: portable archive of the run outputs.
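The JSONL logs can be inspected outside the notebook with a few lines of standard-library Python. This is a sketch only: the helper names are hypothetical, and the record schema is not documented here, so any field name such as "reward" is an assumption to check against a real log line first.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def moving_average(values, window=10):
    """Trailing mean over up to `window` most recent values."""
    averages = []
    running = 0.0
    for index, value in enumerate(values):
        running += value
        if index >= window:
            # Drop the value that just fell out of the window.
            running -= values[index - window]
        averages.append(running / min(index + 1, window))
    return averages
```

For example, assuming each rollout record carries a numeric "reward" field, `moving_average([row["reward"] for row in load_jsonl(path) if "reward" in row])` gives a smoothed series comparable to the moving average in plots/reward_curve.png.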

Fast Demo Settings

For a quick A100/H100 proof run:

import os

os.environ["RL_MAX_STEPS"] = "80"
os.environ["NUM_GENERATIONS"] = "2"
os.environ["RL_DATASET_SIZE"] = "120"
os.environ["RL_BATCH_SIZE"] = "2"
os.environ["RL_GRAD_ACCUM"] = "4"
os.environ["RL_LEARNING_RATE"] = "2e-6"

For a stronger run on an H100, try RL_MAX_STEPS=200 and NUM_GENERATIONS=4. Keep RL_BATCH_SIZE divisible by NUM_GENERATIONS, so NUM_GENERATIONS=4 also needs RL_BATCH_SIZE raised from 2 to a multiple of 4. If Colab runs out of memory, reduce MAX_COMPLETION_LENGTH to 1024.
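The divisibility rule is easy to break when changing one variable at a time, so a small pre-flight check can help. The sketch below is an assumption about how the settings might be parsed: the function name read_rl_settings is hypothetical, and falling back to the fast demo values when a variable is unset is a choice made here, not documented notebook behavior.

```python
import os


def read_rl_settings(env=None):
    """Parse the RL_* variables and check the batch divisibility rule."""
    env = os.environ if env is None else env
    # Fallbacks mirror the fast demo settings (an assumption of this sketch).
    settings = {
        "max_steps": int(env.get("RL_MAX_STEPS", "80")),
        "num_generations": int(env.get("NUM_GENERATIONS", "2")),
        "dataset_size": int(env.get("RL_DATASET_SIZE", "120")),
        "batch_size": int(env.get("RL_BATCH_SIZE", "2")),
        "grad_accum": int(env.get("RL_GRAD_ACCUM", "4")),
        "learning_rate": float(env.get("RL_LEARNING_RATE", "2e-6")),
    }
    if settings["batch_size"] % settings["num_generations"] != 0:
        raise ValueError(
            f"RL_BATCH_SIZE={settings['batch_size']} is not divisible by "
            f"NUM_GENERATIONS={settings['num_generations']}"
        )
    return settings
```

Calling read_rl_settings() right after setting the environment variables fails fast, before any model weights are loaded.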

If TRL fails during import with No module named 'mergekit', install mergekit in the same setup cell as trl.
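To catch this before the TRL import, the setup cell can check which packages are importable. The helper below is a hypothetical convenience, not part of the notebook; it only uses importlib from the standard library and expects top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

If `missing_modules(["trl", "mergekit"])` returns a non-empty list, add those packages to the install line in the setup cell and rerun it.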

What to Show Judges

Use the before/after rollout JSON plus these plots:

  • reward_curve.png for reward improvement during RL.
  • loss_curve.png as evidence that training actually ran.
  • eval_before_after.png for task-level behavior change.

The official environment score remains final_score; progress_score and per-step rewards are debugging signals.
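Because final_score is the number that counts, the before/after comparison reduces to a per-task difference. The sketch below assumes the scores have already been extracted into plain dictionaries mapping task name to final_score; how to pull them out of eval_before_training.json and eval_after_training.json depends on a schema not described here, and the task names and values in the usage example are made up.

```python
def score_deltas(before, after):
    """Per-task change in final_score for tasks present in both evaluations."""
    return {task: after[task] - before[task] for task in before if task in after}


def mean_score(scores):
    """Average final_score across tasks; 0.0 when there are no tasks."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

For example, `score_deltas({"task_a": 0.25}, {"task_a": 0.5})` reports a gain of 0.25 on the hypothetical task_a.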