# MolForge RL Training in Colab

Use [issue/molforge_grpo_colab_training.ipynb](issue/molforge_grpo_colab_training.ipynb) for the judge-rerunnable workflow.

The notebook trains from the Qwen3.5 2B SFT v4 adapter with TRL GRPO against the real MolForge environment reward. It uses the TRL/OpenEnv `environment_factory` pattern from the Wordle/Sudoku examples: MolForge exposes tool methods for `edit`, `run_assay`, `submit`, `restart`, and `defer`, and reward functions read scores from the environment instances. It is set up for short evidence runs on A100/H100 rather than full convergence.
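
To make the tool-method pattern concrete, here is a minimal sketch of an environment whose methods act as tools and whose score is read back by a reward function. `ToyMolForgeEnv`, its toy scoring rule, and `reward_from_environments` are hypothetical stand-ins, not the real MolForge classes, and the exact reward-function signature TRL expects is an assumption here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMolForgeEnv:
    """Hypothetical stand-in for the MolForge environment (not the real class)."""

    final_score: float = 0.0
    done: bool = False
    history: list = field(default_factory=list)

    def edit(self, change: str) -> str:
        """Record a candidate edit to the current molecule."""
        self.history.append(("edit", change))
        return f"applied edit: {change}"

    def run_assay(self, assay: str) -> str:
        """Record a request for an assay."""
        self.history.append(("run_assay", assay))
        return f"ran assay: {assay}"

    def submit(self) -> str:
        """End the episode and set the score (toy rule, for illustration only)."""
        self.history.append(("submit", None))
        ran_assay = any(name == "run_assay" for name, _ in self.history)
        self.final_score = 1.0 if ran_assay else 0.2
        self.done = True
        return "submitted"

    def restart(self) -> str:
        """Reset the episode state."""
        self.history.clear()
        self.final_score = 0.0
        self.done = False
        return "restarted"

    def defer(self) -> str:
        """Record a decision to defer."""
        self.history.append(("defer", None))
        return "deferred"


def reward_from_environments(environments, **kwargs):
    """Return one reward per rollout by reading the score off each environment."""
    return [float(env.final_score) for env in environments]
```

The point of the design is that the reward function does not parse model text; it only reads state that the tool calls left on each environment instance.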
## Outputs

Each run writes to `/content/molforge_rl_runs/<run_name>/` and copies the same folder to `DRIVE_OUTPUT_DIR` when set.

Important artifacts:

- `logs/openenv_tool_rollouts.jsonl`: every tool call, reward, governance status, and score diagnostics.
- `logs/trainer_log_history.jsonl`: trainer loss, grad norm, learning rate, and step timing.
- `openenv_tool_metrics.csv`: spreadsheet-friendly tool rollout reward table.
- `eval_before_training.json`: full 3-task rollout before GRPO.
- `eval_after_training.json`: full 3-task rollout after GRPO.
- `plots/reward_curve.png`: completion reward curve and moving average.
- `plots/loss_curve.png`: trainer loss curve.
- `plots/eval_before_after.png`: before/after final_score comparison.
- `plots/action_distribution.png`: sampled action mix.
- `adapters/`: trained LoRA adapter checkpoint.
- `<run_name>.zip`: portable archive of the run outputs.
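
For a quick look at the rollout log without opening a spreadsheet, a sketch like the following could summarize rewards per tool. The field names `tool` and `reward` are assumptions about the JSONL schema, so adjust them to the actual keys; the sample records are made up for illustration and are not real run output.

```python
import json
from collections import defaultdict


def summarize_rollouts(lines, tool_key="tool", reward_key="reward"):
    """Return {tool: (count, mean_reward)} from an iterable of JSONL lines.

    tool_key and reward_key are assumed field names, not confirmed ones.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        reward = record.get(reward_key)
        if reward is None:
            continue  # skip records that carry no reward
        tool = record.get(tool_key, "unknown")
        totals[tool][0] += 1
        totals[tool][1] += float(reward)
    return {tool: (count, total / count) for tool, (count, total) in totals.items()}


# Illustrative records only; a real run would pass an open file handle for
# logs/openenv_tool_rollouts.jsonl instead of this list.
sample = [
    '{"tool": "edit", "reward": 0.25}',
    '{"tool": "submit", "reward": 1.0}',
    '{"tool": "edit", "reward": 0.75}',
]
summary = summarize_rollouts(sample)
for tool, (count, mean_reward) in sorted(summary.items()):
    print(f"{tool}: n={count}, mean reward={mean_reward:.3f}")
```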
## Fast Demo Settings

For a quick A100/H100 proof run:

```python
import os

# Settings for a short proof run.
os.environ["RL_MAX_STEPS"] = "80"
os.environ["NUM_GENERATIONS"] = "2"
os.environ["RL_DATASET_SIZE"] = "120"
os.environ["RL_BATCH_SIZE"] = "2"
os.environ["RL_GRAD_ACCUM"] = "4"
os.environ["RL_LEARNING_RATE"] = "2e-6"
```

For a stronger run, try `RL_MAX_STEPS=200` and `NUM_GENERATIONS=4` on H100.

If Colab runs out of memory, reduce `MAX_COMPLETION_LENGTH` to `1024`; keep `RL_BATCH_SIZE` divisible by `NUM_GENERATIONS`.

If TRL fails during import with `No module named 'mergekit'`, install `mergekit` in the same setup cell as `trl`.
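
The divisibility rule above can be checked before training starts. The sketch below is a hypothetical helper, not part of the notebook: it assumes the notebook reads these variables as integers, that `RL_BATCH_SIZE` counts completions per device, and that training runs on a single device. The fallback values simply mirror the fast demo settings.

```python
import os


def check_grpo_batch_config(env=None):
    """Validate batch settings and report derived sizes (single-device assumption)."""
    env = os.environ if env is None else env
    batch_size = int(env.get("RL_BATCH_SIZE", "2"))
    num_generations = int(env.get("NUM_GENERATIONS", "2"))
    grad_accum = int(env.get("RL_GRAD_ACCUM", "4"))

    if batch_size % num_generations != 0:
        raise ValueError(
            f"RL_BATCH_SIZE={batch_size} must be divisible by "
            f"NUM_GENERATIONS={num_generations}"
        )

    return {
        # Distinct prompts that fit in one device batch.
        "prompts_per_batch": batch_size // num_generations,
        # Completions contributing to one optimizer step.
        "completions_per_optimizer_step": batch_size * grad_accum,
    }


print(check_grpo_batch_config(
    {"RL_BATCH_SIZE": "2", "NUM_GENERATIONS": "2", "RL_GRAD_ACCUM": "4"}
))
```

Under these assumptions, raising `NUM_GENERATIONS` to `4` while leaving `RL_BATCH_SIZE` at `2` would fail the check, so the batch size would need to go up with it.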
## What to Show Judges

Use the before/after rollout JSON plus these plots:

- `reward_curve.png` for reward improvement during RL.
- `loss_curve.png` for actual training evidence.
- `eval_before_after.png` for task-level behavior change.

The official environment score remains `final_score`; `progress_score` and per-step rewards are debugging signals.
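
Since `final_score` is the number that counts, a per-task delta between the two eval files is a compact way to present the change. The sketch below assumes each eval JSON is a mapping from task name to a result object containing `final_score`; that layout is a guess, and the task names and scores in the demo are placeholders rather than real results.

```python
import json


def load_eval(path):
    """Load one eval JSON file (assumed layout: {task_name: {"final_score": ...}})."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def final_score_deltas(before, after, score_key="final_score"):
    """Return {task: after - before} for tasks present in both evals."""
    deltas = {}
    for task in sorted(set(before) & set(after)):
        deltas[task] = float(after[task][score_key]) - float(before[task][score_key])
    return deltas


# Placeholder data for illustration; a real comparison would use
# load_eval("eval_before_training.json") and load_eval("eval_after_training.json").
before = {"task_a": {"final_score": 0.25}, "task_b": {"final_score": 0.5}}
after = {"task_a": {"final_score": 0.5}, "task_b": {"final_score": 0.5}}
for task, delta in final_score_deltas(before, after).items():
    print(f"{task}: {delta:+.3f}")
```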