

MolForge RL Training in Colab

Use issue/molforge_grpo_colab_training.ipynb for the judge-rerunnable workflow.

The notebook trains from the Qwen3.5 2B SFT v4 adapter using TRL GRPO, with rewards coming from the real MolForge environment. It follows the TRL/OpenEnv environment_factory pattern used in the Wordle and Sudoku examples: MolForge exposes tool methods for edit, run_assay, submit, restart, and defer, and the reward functions read scores from the environment instances. The setup targets short evidence runs on an A100 or H100, not full convergence.
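
The pattern described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: ToyMolForgeEnv is a hypothetical stand-in and not the real MolForge class, the scoring rule is a placeholder, and the `environments` keyword on the reward function is assumed rather than taken from this document. Check the TRL/OpenEnv documentation for the exact interface that environment_factory expects.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMolForgeEnv:
    """Hypothetical stand-in for the MolForge environment (not the real class)."""

    final_score: float = 0.0
    submitted: bool = False
    history: list = field(default_factory=list)

    # Each public method plays the role of one tool the policy can call.
    def edit(self, change: str) -> str:
        self.history.append(("edit", change))
        return f"applied edit: {change}"

    def run_assay(self, assay: str) -> str:
        self.history.append(("run_assay", assay))
        return f"assay result recorded for: {assay}"

    def submit(self) -> str:
        # Placeholder scoring: 0.25 per edit, capped at 1.0.
        n_edits = sum(1 for name, _ in self.history if name == "edit")
        self.final_score = min(1.0, 0.25 * n_edits)
        self.submitted = True
        self.history.append(("submit", None))
        return "submitted"

    def restart(self) -> str:
        self.history = []
        self.final_score = 0.0
        self.submitted = False
        return "restarted"

    def defer(self) -> str:
        self.history.append(("defer", None))
        return "deferred"


def final_score_reward(environments, **kwargs):
    """Return one reward per rollout, read from its environment instance."""
    return [env.final_score if env.submitted else 0.0 for env in environments]


# Simulate two rollouts: one edits twice and submits, one never submits.
envs = [ToyMolForgeEnv(), ToyMolForgeEnv()]
envs[0].edit("add fluorine")
envs[0].edit("swap ring")
envs[0].submit()
envs[1].defer()
rewards = final_score_reward(envs)
```

The design point is that the reward function holds no scoring logic of its own; it only reads state that the environment accumulated while the model was calling tools.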

Outputs

Each run writes to /content/molforge_rl_runs/<run_name>/ and copies the same folder to DRIVE_OUTPUT_DIR when set.

Important artifacts:

logs/openenv_tool_rollouts.jsonl: every tool call, reward, governance status, and score diagnostics.
logs/trainer_log_history.jsonl: trainer loss, grad norm, learning rate, and step timing.
openenv_tool_metrics.csv: spreadsheet-friendly tool rollout reward table.
eval_before_training.json: full 3-task rollout before GRPO.
eval_after_training.json: full 3-task rollout after GRPO.
plots/reward_curve.png: completion reward curve and moving average.
plots/loss_curve.png: trainer loss curve.
plots/eval_before_after.png: before/after final_score comparison.
plots/action_distribution.png: sampled action mix.
adapters/: trained LoRA adapter checkpoint.
<run_name>.zip: portable archive of the run outputs.
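
For a quick look at the rollout log without opening the plots, a sketch along the following lines may help. The record schema is an assumption: this document does not specify the field names in openenv_tool_rollouts.jsonl, so the `reward` key and the sample records below are hypothetical, and the trailing moving average is only one plausible reading of the smoothing used in reward_curve.png.

```python
import json
from collections import deque


def load_rewards(lines, key="reward"):
    """Collect the numeric field `key` from JSONL lines (field name is assumed)."""
    rewards = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if key in record:
            rewards.append(float(record[key]))
    return rewards


def moving_average(values, window=3):
    """Trailing moving average; early points average over what is available."""
    buf = deque(maxlen=window)
    smoothed = []
    for value in values:
        buf.append(value)
        smoothed.append(sum(buf) / len(buf))
    return smoothed


# Hypothetical sample records; a real run would iterate over the log file.
sample = [
    '{"tool": "edit", "reward": 0.0}',
    '{"tool": "run_assay", "reward": 1.0}',
    '{"tool": "submit", "reward": 2.0}',
    "",
]
rewards = load_rewards(sample)
smoothed = moving_average(rewards, window=2)
```

With a real run, passing an open file handle for logs/openenv_tool_rollouts.jsonl in place of `sample` should work, provided the key name is adjusted to match the actual records.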

Fast Demo Settings

For a quick A100/H100 proof run:

import os

# Set these before running the training cell.
os.environ["RL_MAX_STEPS"] = "80"
os.environ["NUM_GENERATIONS"] = "2"
os.environ["RL_DATASET_SIZE"] = "120"
os.environ["RL_BATCH_SIZE"] = "2"
os.environ["RL_GRAD_ACCUM"] = "4"
os.environ["RL_LEARNING_RATE"] = "2e-6"

For a stronger run, try RL_MAX_STEPS=200 and NUM_GENERATIONS=4 on an H100. If Colab runs out of memory, reduce MAX_COMPLETION_LENGTH to 1024. Keep RL_BATCH_SIZE divisible by NUM_GENERATIONS; with NUM_GENERATIONS=4, that means an RL_BATCH_SIZE that is a multiple of 4.
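
The divisibility rule can be checked before launching a run. The sketch below is a hypothetical helper, not code from the notebook: read_rl_config is an invented name, and the fallback defaults simply mirror the fast demo values listed above, so the notebook's real defaults may differ.

```python
import os


def read_rl_config(env=None):
    """Read RL settings from an env mapping and validate the batch/generation rule."""
    env = os.environ if env is None else env
    cfg = {
        # Defaults mirror the fast demo values; the notebook's own may differ.
        "max_steps": int(env.get("RL_MAX_STEPS", "80")),
        "num_generations": int(env.get("NUM_GENERATIONS", "2")),
        "batch_size": int(env.get("RL_BATCH_SIZE", "2")),
        "grad_accum": int(env.get("RL_GRAD_ACCUM", "4")),
        "learning_rate": float(env.get("RL_LEARNING_RATE", "2e-6")),
    }
    if cfg["batch_size"] % cfg["num_generations"] != 0:
        raise ValueError(
            f"RL_BATCH_SIZE={cfg['batch_size']} is not divisible by "
            f"NUM_GENERATIONS={cfg['num_generations']}"
        )
    return cfg


# Fast demo values pass the check.
demo = read_rl_config({"RL_BATCH_SIZE": "2", "NUM_GENERATIONS": "2"})

# Raising NUM_GENERATIONS to 4 without raising RL_BATCH_SIZE fails it.
try:
    read_rl_config({"RL_BATCH_SIZE": "2", "NUM_GENERATIONS": "4"})
    stronger_ok = True
except ValueError:
    stronger_ok = False
```

Calling read_rl_config() with no argument reads from os.environ, so it can run right after the settings cell.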

If TRL fails during import with No module named 'mergekit', install mergekit in the same setup cell as trl.

What to Show Judges

Use the before/after rollout JSON plus these plots:

reward_curve.png for reward improvement during RL.
loss_curve.png as evidence that training actually ran.
eval_before_after.png for task-level behavior change.

The official environment score remains final_score; progress_score and per-step rewards are debugging signals.
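
To put numbers next to the before/after plot, the two evaluation files can be compared on final_score alone. The sketch below assumes a schema this document does not confirm: each file is taken to map a task name to a record containing final_score, and the task names and values are made up for illustration.

```python
def final_score_deltas(before, after):
    """Per-task change in final_score for tasks present in both evaluations."""
    return {
        task: after[task]["final_score"] - before[task]["final_score"]
        for task in before
        if task in after
    }


# Hypothetical contents of eval_before_training.json and eval_after_training.json.
before = {
    "task_a": {"final_score": 0.25, "progress_score": 0.5},
    "task_b": {"final_score": 0.5, "progress_score": 0.5},
    "task_c": {"final_score": 0.25, "progress_score": 0.75},
}
after = {
    "task_a": {"final_score": 0.75, "progress_score": 0.75},
    "task_b": {"final_score": 0.5, "progress_score": 0.75},
    "task_c": {"final_score": 0.5, "progress_score": 0.75},
}
deltas = final_score_deltas(before, after)
```

progress_score is deliberately ignored here, in line with the note that only final_score is the official score.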