molforge / RL_TRAINING_COLAB.md

Adhitya122

Prepare MolForge OpenEnv Docker Space submission

bf9e424 verified 28 days ago

	# MolForge RL Training in Colab

	Use [issue/molforge_grpo_colab_training.ipynb](issue/molforge_grpo_colab_training.ipynb) for the judge-rerunnable workflow.

	The notebook starts from the Qwen3.5 2B SFT v4 adapter and trains it with TRL GRPO, using rewards from the real MolForge environment. It follows the TRL/OpenEnv `environment_factory` pattern from the Wordle/Sudoku examples: MolForge exposes `edit`, `run_assay`, `submit`, `restart`, and `defer` as tool methods, and the reward functions read scores from the environment instances. The configuration targets short evidence runs on an A100 or H100, not full convergence.
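
The shape of that pattern can be sketched in plain Python. This is only an illustration: `ToyMolForgeEnv`, its scoring rule, and the assay names are hypothetical stand-ins, not the real MolForge environment, and the wiring into the TRL trainer is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMolForgeEnv:
    """Hypothetical stand-in for one MolForge environment instance."""

    final_score: float = 0.0
    assays_run: int = 0
    calls: list = field(default_factory=list)

    # Each public method below plays the role of one tool the policy can call.
    def edit(self, change: str) -> str:
        self.calls.append("edit")
        return f"edit applied: {change}"

    def run_assay(self, assay: str) -> str:
        self.calls.append("run_assay")
        self.assays_run += 1
        return f"assay recorded: {assay}"

    def submit(self) -> str:
        self.calls.append("submit")
        # Toy scoring rule (an assumption): 0.25 per assay, capped at 1.0.
        self.final_score = min(1.0, 0.25 * self.assays_run)
        return f"submitted with final_score={self.final_score:.2f}"

    def restart(self) -> str:
        self.calls.append("restart")
        self.assays_run = 0
        self.final_score = 0.0
        return "episode restarted"

    def defer(self) -> str:
        self.calls.append("defer")
        return "decision deferred"


def final_score_reward(environments, **kwargs):
    """Reward function that reads the score off each environment instance."""
    return [env.final_score for env in environments]


# Two toy rollouts: one runs two assays and submits, the other only defers.
env_a, env_b = ToyMolForgeEnv(), ToyMolForgeEnv()
env_a.run_assay("solubility")  # hypothetical assay name
env_a.run_assay("binding")     # hypothetical assay name
env_a.submit()
env_b.defer()
rewards = final_score_reward([env_a, env_b])
print(rewards)
```

The point of the design is that the reward function holds no scoring logic of its own; it only reports what each environment instance already computed.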

	## Outputs

	Each run writes to `/content/molforge_rl_runs/<run_name>/` and copies the same folder to `DRIVE_OUTPUT_DIR` when set.
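
A minimal sketch of the copy step, assuming `DRIVE_OUTPUT_DIR` is an environment variable and that the run folder lands in a subfolder named after the run. Both points are assumptions, and `mirror_run` is a hypothetical helper, not a function from the notebook.

```python
import shutil
from pathlib import Path


def mirror_run(run_dir, drive_output_dir=None):
    """Copy run_dir to drive_output_dir/<run_name>; return the destination or None."""
    if not drive_output_dir:
        return None  # nothing to copy when the destination is unset or empty
    run_dir = Path(run_dir)
    destination = Path(drive_output_dir) / run_dir.name
    # dirs_exist_ok lets a rerun overwrite an earlier copy (Python 3.8+).
    shutil.copytree(run_dir, destination, dirs_exist_ok=True)
    return destination
```

In a notebook cell this could be called as `mirror_run(run_dir, os.environ.get("DRIVE_OUTPUT_DIR"))` once the run has finished.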

	Important artifacts:

	- `logs/openenv_tool_rollouts.jsonl`: every tool call, reward, governance status, and score diagnostics.
	- `logs/trainer_log_history.jsonl`: trainer loss, grad norm, learning rate, and step timing.
	- `openenv_tool_metrics.csv`: spreadsheet-friendly tool rollout reward table.
	- `eval_before_training.json`: full 3-task rollout before GRPO.
	- `eval_after_training.json`: full 3-task rollout after GRPO.
	- `plots/reward_curve.png`: completion reward curve and moving average.
	- `plots/loss_curve.png`: trainer loss curve.
	- `plots/eval_before_after.png`: before/after final_score comparison.
	- `plots/action_distribution.png`: sampled action mix.
	- `adapters/`: trained LoRA adapter checkpoint.
	- `<run_name>.zip`: portable archive of the run outputs.
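
The rollout log can also be inspected without the plots. The sketch below assumes each JSONL record is an object with a numeric `reward` field. That field name is a guess, so check one line of `logs/openenv_tool_rollouts.jsonl` before relying on it. `load_rewards` and `moving_average` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_rewards(jsonl_path, key="reward"):
    """Return the numeric `key` field from each JSONL record that has one."""
    rewards = []
    with Path(jsonl_path).open() as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if isinstance(record.get(key), (int, float)):
                rewards.append(float(record[key]))
    return rewards


def moving_average(values, window=10):
    """Trailing mean over at most the `window` most recent values."""
    averaged = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        averaged.append(running / min(i + 1, window))
    return averaged
```

A trailing window keeps the first few points defined, which is useful for short runs.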

	## Fast Demo Settings

	For a quick A100/H100 proof run:

	```python
	import os

	# Set these before the training cell reads its configuration.
	os.environ["RL_MAX_STEPS"] = "80"
	os.environ["NUM_GENERATIONS"] = "2"
	os.environ["RL_BATCH_SIZE"] = "2"
	os.environ["RL_DATASET_SIZE"] = "120"
	os.environ["RL_GRAD_ACCUM"] = "4"
	os.environ["RL_LEARNING_RATE"] = "2e-6"
	```

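	For a stronger run on an H100, try `RL_MAX_STEPS=200` and `NUM_GENERATIONS=4`. Keep `RL_BATCH_SIZE` divisible by `NUM_GENERATIONS`, so raising `NUM_GENERATIONS` to `4` also means raising `RL_BATCH_SIZE` from `2` to a multiple of `4`.
	If Colab runs out of memory, reduce `MAX_COMPLETION_LENGTH` to `1024`.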
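
The divisibility rule is easy to check before launching a run. The sketch below is a hypothetical helper. It assumes the notebook reads these values from environment variables and that the defaults match the fast demo settings, neither of which is confirmed here.

```python
import os


def read_rl_settings(env=os.environ):
    """Parse the RL knobs from an environment-like mapping and validate them."""
    settings = {
        # Defaults mirror the fast demo values (an assumption).
        "max_steps": int(env.get("RL_MAX_STEPS", "80")),
        "num_generations": int(env.get("NUM_GENERATIONS", "2")),
        "batch_size": int(env.get("RL_BATCH_SIZE", "2")),
        "grad_accum": int(env.get("RL_GRAD_ACCUM", "4")),
        "learning_rate": float(env.get("RL_LEARNING_RATE", "2e-6")),
    }
    if settings["batch_size"] % settings["num_generations"] != 0:
        raise ValueError(
            f"RL_BATCH_SIZE={settings['batch_size']} is not divisible by "
            f"NUM_GENERATIONS={settings['num_generations']}"
        )
    return settings
```

Calling `read_rl_settings()` at the top of the training cell turns a misconfiguration into an immediate error.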

	If TRL fails during import with `No module named 'mergekit'`, install `mergekit` in the same setup cell as `trl`.
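
One way to guard against this in the setup cell is to check for the module before importing `trl`. The helpers below are hypothetical, and they assume the pip package name matches the import name, which holds for `trl` and `mergekit`.

```python
import importlib.util
import subprocess
import sys


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def install_missing(names):
    """pip-install whichever of `names` is missing; return what was installed."""
    missing = missing_packages(names)
    if missing:
        # Use the running interpreter so the install targets the same environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing
```

Running `install_missing(["trl", "mergekit"])` before the first `import trl` keeps both installs in the same cell.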

	## What to Show Judges

	Use the before/after rollout JSON (`eval_before_training.json` and `eval_after_training.json`) plus these plots from `plots/`:

	- `reward_curve.png` for reward improvement during RL.
	- `loss_curve.png` as evidence that training actually ran.
	- `eval_before_after.png` for task-level behavior change.

	The official environment score remains `final_score`; `progress_score` and per-step rewards are debugging signals.
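
A per-task comparison of `final_score` can be computed from the two evaluation files. The schema here is an assumption: the sketch treats each file as a JSON list of objects with `task` and `final_score` keys, which may not match what the notebook writes. `final_scores` and `score_deltas` are hypothetical helpers.

```python
import json
from pathlib import Path


def final_scores(path):
    """Map task name to final_score, assuming a list of per-task records."""
    records = json.loads(Path(path).read_text())
    return {record["task"]: float(record["final_score"]) for record in records}


def score_deltas(before_path, after_path):
    """Return after minus before for every task present in both files."""
    before = final_scores(before_path)
    after = final_scores(after_path)
    return {task: after[task] - before[task] for task in before if task in after}
```

`progress_score` and per-step rewards are left out on purpose, since only `final_score` counts as the official result.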