molforge / TRAINING_INSTRUCTIONS.md

Adhitya122

Prepare MolForge OpenEnv Docker Space submission

bf9e424 verified 12 days ago


	# MolForge Training Instructions

	This guide is for training a small model against MolForge without teaching it to exploit the environment.

	## 1. Safety Defaults

	MolForge now hides true internal molecule properties from public `state()` metadata by default. If you need to debug the environment manually, use:

	```bash
	MOLFORGE_DEBUG_STATE=1 python inference.py
	```

	Do not use `MOLFORGE_DEBUG_STATE=1` while collecting SFT data or running RL.
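
One way to enforce this rule is a guard at the top of any data-collection or RL entry point. The sketch below is not part of MolForge: `assert_debug_state_disabled` is a hypothetical helper, and treating any value other than empty or `0` as "enabled" is an assumption about how the variable is interpreted.

```python
import os


def assert_debug_state_disabled(environ=None):
    """Raise if MOLFORGE_DEBUG_STATE looks enabled in the given environment mapping."""
    env = os.environ if environ is None else environ
    value = env.get("MOLFORGE_DEBUG_STATE", "")
    # Assumption: empty or "0" means disabled; anything else is treated as enabled.
    if value not in ("", "0"):
        raise RuntimeError(
            "MOLFORGE_DEBUG_STATE is set; unset it before collecting SFT data or running RL."
        )
```

Calling `assert_debug_state_disabled()` before the first episode makes a leaked debug flag fail loudly instead of silently contaminating the dataset.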

	The chemistry oracle path uses RDKit descriptors by default and TDC molecule oracles when `pytdc` is available. TDC is kept as an optional extra because current PyTDC releases pull in a large, platform-sensitive ML stack; to activate the TDC SA/QED oracles, install it with `uv sync --extra tdc` on a compatible Python version. RDKit remains active in the default Docker/HF deployment, and the environment records the active backend in observation metadata.

	The default reward mode is `assay_gated`, which gives coarse edit feedback and leaves the strongest quality signal to assays and terminal graders. For early RL warmup, use the curriculum reward mode:

	```bash
	MOLFORGE_REWARD_MODE=curriculum python inference.py
	```

	Curriculum mode keeps the official `submission_score` strict, but gives bounded
	training reward for useful evidence collection, evidence-supported submit
	decisions, and non-submitted near-miss episodes. If the model reaches a strong
	evidence package and still fails to submit before the deadline, curriculum mode
	adds a small missed-nomination penalty. This prevents small models from seeing
	only zero terminal scores while they are still learning when to submit, without
	letting endless assay collection become the best behavior. Use this for initial
	GRPO curves, then switch back to `assay_gated` for final evaluation.

	For curriculum experiments only, you can also restore the older dense edit reward:

	```bash
	MOLFORGE_REWARD_MODE=dense python inference.py
	```

	Use randomized training episodes when collecting data or training a policy:

	```bash
	MOLFORGE_TRAINING_RANDOMIZATION=1 MOLFORGE_RANDOM_SEED=42 python inference.py
	```

	Keep randomization off for judge-facing baseline runs so scores remain reproducible.
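
The training and baseline configurations can be kept apart with a small helper that builds the environment for a subprocess. This is a minimal sketch: `episode_env` is a hypothetical name, and only the environment variable names come from this guide.

```python
import os


def episode_env(training, seed=42, base=None):
    """Return a copy of the environment configured for training or baseline runs."""
    env = dict(os.environ if base is None else base)
    # Debug state must never be active for data collection, RL, or reported baselines.
    env.pop("MOLFORGE_DEBUG_STATE", None)
    if training:
        env["MOLFORGE_TRAINING_RANDOMIZATION"] = "1"
        env["MOLFORGE_RANDOM_SEED"] = str(seed)
    else:
        # Baseline runs: drop both variables so scores stay reproducible.
        env.pop("MOLFORGE_TRAINING_RANDOMIZATION", None)
        env.pop("MOLFORGE_RANDOM_SEED", None)
    return env
```

Passing the result as `env=` to `subprocess.run([sys.executable, "inference.py"], env=..., check=True)` keeps the setting local to that run rather than inherited from the parent shell.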

	## 2. Recommended Training Plan

	Use a two-stage plan:

	1. Small SFT warm start
	2. RL with verifiable rewards

	SFT is only for teaching the model the action schema and basic workflow. RL should do the real environment optimization.

	## 3. What SFT Should Teach

	Include these example types:

	- Valid JSON action formatting
	- Correct `acting_role` for each action
	- Short `rationale` values that explain the decision without chain-of-thought
	- `evidence` lists that cite visible observation facts only
	- `expected_effects` dictionaries with directional predictions, not hidden scores
	- Specialist message bundles with proposal, approval, objection, assay request, or rejection
	- Running cheap/necessary assays before risky submissions
	- Editing toward safer fragments when toxicity risk is visible
	- Restarting early in the hard sunk-cost scenario
	- Submitting only when evidence covers the task constraints
	- Handling noisy assay estimates without undoing a high-confidence final candidate at the last moment
	- Recovering from low budget by choosing small actions or stopping

	Avoid these example types:

	- Any example that reads `state.metadata.debug_hidden_properties`
	- Any answer that mentions exact hidden objective deltas
	- Hidden chain-of-thought or long private reasoning transcripts
	- Repetitive message spam just to collect coordination reward
	- Premature submit actions without potency/safety evidence
	- Examples where missing specialist messages are silently repaired by the runner
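
Some of the "avoid" rules can be checked mechanically before an example enters the dataset. The sketch below is an illustration, not the repository's validator: `rejection_reason` and the 400-character rationale cap are hypothetical, and it only inspects the `acting_role` and `rationale` fields named above.

```python
import json

# Substrings that indicate the example looked at hidden debug state.
FORBIDDEN_SUBSTRINGS = ("debug_hidden_properties",)
# Hypothetical cap that keeps rationales short rather than chain-of-thought length.
MAX_RATIONALE_CHARS = 400


def rejection_reason(action_text):
    """Return a reason string if the assistant action should be dropped, else None."""
    if any(marker in action_text for marker in FORBIDDEN_SUBSTRINGS):
        return "references hidden debug state"
    try:
        action = json.loads(action_text)
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(action, dict):
        return "not a JSON object"
    if not action.get("acting_role"):
        return "missing acting_role"
    rationale = action.get("rationale") or ""
    if not isinstance(rationale, str) or len(rationale) > MAX_RATIONALE_CHARS:
        return "rationale too long"
    return None
```

Rules such as "no message spam" or "no premature submit" depend on the whole trajectory and still need manual review or a trajectory-level check.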

	## 4. Generate a Starter SFT Dataset

	For the first schema warm start, use the strict curriculum dataset. It includes
	explicit JSON `null` fields, only the intended top-level action keys, all action
	types, all assay tools, all edit subtypes, and valid role/message permissions:

	```bash
	python scripts/generate_sft_schema_strict_dataset.py \
	  --episodes 75 \
	  --output data/molforge_sft_schema_strict.jsonl

	python scripts/validate_sft_traces.py data/molforge_sft_schema_strict.jsonl
	```

	Use this file first for Qwen 2B-class SFT:

	```text
	data/molforge_sft_schema_strict.jsonl
	```

	The older trace generator is still useful after the model learns the exact
	schema, because it provides more policy-like trajectories. Run:

	```bash
	python scripts/generate_sft_traces.py --episodes 80 --output data/molforge_sft_traces.jsonl
	```

	For a more robust dataset:

	```bash
	python scripts/generate_sft_traces.py \
	  --episodes 200 \
	  --randomized \
	  --output data/molforge_sft_traces_randomized.jsonl
	```

	The generated records use chat-style JSONL:

	```json
	{"messages":[{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}
	```
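
A structural check on each line can catch truncated or malformed records early. The following is a sketch under the assumption that every record holds exactly one system, one user, and one assistant message, as in the example above; `check_record` is a hypothetical helper and does not replace `scripts/validate_sft_traces.py`.

```python
import json

# Assumption: one system, one user, one assistant message per record.
EXPECTED_ROLES = ["system", "user", "assistant"]


def check_record(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    for index, message in enumerate(messages):
        if not isinstance(message, dict) or not isinstance(message.get("content"), str):
            problems.append(f"message {index} has no string content")
    return problems
```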

	Before training, spot-check the JSONL:

	```bash
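	python - <<'PY'
	import json
	from pathlib import Path

	path = Path("data/molforge_sft_traces.jsonl")
	# Print the first three records: metadata (if present) and the start of the assistant action.
	with path.open(encoding="utf-8") as handle:
	    for i, line in zip(range(3), handle):
	        item = json.loads(line)
	        print(i, item.get("metadata"), item["messages"][-1]["content"][:300])
	PY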
	```

	## 5. SFT Settings

	Start small:

	- Dataset size: 200 to 1,000 action examples
	- Max sequence length: 2,048 or 4,096
	- LoRA rank: 16 or 32
	- Learning rate: `1e-4` to `2e-4`
	- Epochs: 1 to 3
	- Target modules: attention and MLP projection layers
	- Save LoRA adapters first; test them before merging
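
These ranges can be recorded in a small settings object so that a run outside them is flagged before training starts. This is a sketch only: `SftSettings` is a hypothetical container, not a MolForge or Unsloth class, and the target module names are an assumption based on the projection names commonly used in Qwen/Llama-style models.

```python
from dataclasses import dataclass, field


def _default_target_modules():
    # Assumption: typical attention and MLP projection names; verify against your model.
    return ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]


@dataclass
class SftSettings:
    max_seq_length: int = 2048
    lora_rank: int = 16
    learning_rate: float = 1e-4
    epochs: int = 1
    target_modules: list = field(default_factory=_default_target_modules)

    def out_of_range(self):
        """Return the names of settings that fall outside the ranges listed above."""
        issues = []
        if self.max_seq_length not in (2048, 4096):
            issues.append("max_seq_length")
        if self.lora_rank not in (16, 32):
            issues.append("lora_rank")
        if not 1e-4 <= self.learning_rate <= 2e-4:
            issues.append("learning_rate")
        if not 1 <= self.epochs <= 3:
            issues.append("epochs")
        return issues
```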

	Stop SFT once the model reliably emits valid `MolForgeAction` JSON. Do not overfit it into copying one fixed heuristic path.
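
The stopping rule can be made concrete by measuring how often sampled generations parse as JSON objects. A minimal sketch: the 0.98 threshold is a hypothetical choice, and parsing as a JSON object is a weaker test than validating against the real `MolForgeAction` schema.

```python
import json


def is_parseable_action(text):
    """True if the text parses as a JSON object (a proxy for a valid action)."""
    try:
        parsed = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict)


def valid_action_rate(generations):
    """Fraction of generations that parse as JSON objects; 0.0 for an empty list."""
    if not generations:
        return 0.0
    return sum(is_parseable_action(g) for g in generations) / len(generations)


def should_stop_sft(generations, threshold=0.98):
    # Hypothetical threshold: tune it to what "reliably" means for your setup.
    return valid_action_rate(generations) >= threshold
```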

	## 6. RL Stage

	After SFT, run RL/GRPO with MolForge as the verifier environment.

	Use these environment settings:

	```bash
	export MOLFORGE_TRAINING_RANDOMIZATION=1
	export MOLFORGE_REWARD_MODE=curriculum
	unset MOLFORGE_DEBUG_STATE
	```

	Once the model starts submitting valid candidates, run a second RL/evaluation
	phase with:

	```bash
	export MOLFORGE_REWARD_MODE=assay_gated
	```
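
The switch between phases can be expressed as a simple rule on the rate of valid submissions. The function below is a sketch: `next_reward_mode` and the 0.5 threshold are hypothetical, and only the mode names come from this guide.

```python
KNOWN_MODES = {"assay_gated", "curriculum", "dense"}


def next_reward_mode(current_mode, valid_submit_rate, switch_threshold=0.5):
    """Move from curriculum to assay_gated once valid submissions are common enough."""
    if current_mode not in KNOWN_MODES:
        raise ValueError(f"unknown reward mode: {current_mode}")
    # Hypothetical rule: leave curriculum mode once the threshold is reached.
    if current_mode == "curriculum" and valid_submit_rate >= switch_threshold:
        return "assay_gated"
    return current_mode
```

The returned value is what you would export as `MOLFORGE_REWARD_MODE` for the next phase.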

	Report both curves if possible:

	- Curriculum reward curve for early learning progress
	- Strict terminal `submission_score` before and after training for judge-facing task success

	Track these metrics separately:

	- Average terminal `submission_score`
	- Average terminal `candidate_score`
	- Average terminal `budget_score`
	- Budget remaining at valid submit
	- Invalid action rate
	- Policy veto rate
	- Budget exhaustion rate
	- Repeated assay count
	- Loop penalty count
	- Coordination score
	- Evidence score
	- Submitted-without-evidence count
	- Constraint margin score
	- Number of actions before submit

	Inspect generations every few hundred updates. A rising reward is not enough if the model learns to spam messages, submit without evidence, or memorize the three default scenarios.

	## 7. Evaluation Protocol

	Use three evaluations:

	1. Deterministic public tasks: run with randomization off and compare against a baseline `python inference.py` run.
	2. Randomized training tasks: run with `MOLFORGE_TRAINING_RANDOMIZATION=1`.
	3. Holdout tasks: add new scenario configs or fragment perturbations not present in the SFT traces.

	A trained model should improve terminal submission score while keeping invalid actions and evidence-free submissions low.
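
This acceptance criterion can be written as a check over the three evaluation suites. A sketch under stated assumptions: the suite names, the metric keys, and the 5% limits are hypothetical placeholders, not values defined by MolForge.

```python
SUITES = ("deterministic", "randomized", "holdout")


def evaluation_failures(baseline, trained, max_invalid_rate=0.05, max_unsupported_rate=0.05):
    """Compare per-suite metric dicts and return human-readable failures (empty means pass)."""
    failures = []
    for suite in SUITES:
        before, after = baseline[suite], trained[suite]
        if after["avg_submission_score"] <= before["avg_submission_score"]:
            failures.append(f"{suite}: submission score did not improve")
        if after["invalid_action_rate"] > max_invalid_rate:
            failures.append(f"{suite}: invalid action rate too high")
        if after["unsupported_submit_rate"] > max_unsupported_rate:
            failures.append(f"{suite}: too many evidence-free submissions")
    return failures
```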

	For the full testing protocol, including how to compare curriculum reward
	against strict evaluation, see [EVALUATION_PROTOCOL.md](EVALUATION_PROTOCOL.md).

	## 8. Model Choice

	Recommended starting point:

	- `unsloth/Qwen3.5-2B` for the lightest serious iteration loop
	- `unsloth/Qwen3-4B-Instruct-2507` if you can afford a little more VRAM and want stronger JSON/tool following

	Why:

	- Unsloth supports fine-tuning Qwen3.5 at the 0.8B, 2B, and 4B sizes.
	- The 2B class should be fast enough for repeated MolForge SFT/RL experiments.
	- The 4B class is still lightweight, but should be more reliable for structured action generation.

	Use `Qwen3.5-0.8B` only for plumbing tests. It is useful to verify the training loop, but likely too weak to judge the environment.

	If you have more GPU budget:

	- `unsloth/Qwen3-8B` or a current Qwen3/Qwen3.5 8B-class instruct model

	If you specifically want alternate-family baselines:

	- `unsloth/Llama-3.1-8B-Instruct`
	- Gemma 3/4 small instruct models can be tested, but prefer Qwen first because the current Unsloth Qwen3.5 fine-tuning path is clearer for 2B/4B RL iteration.

	For the hackathon, prefer faster iteration over maximum model size. A clean 4B model trained well against this environment is more useful than a larger model that only runs a few noisy experiments.

	## 9. Honest Inference Reporting

	`inference.py` has no heuristic fallback. It requires a configured model and exits with an error if the model is missing, times out, or emits unparsable action JSON.

	`local_inference.py` also has no heuristic policy fallback and does not patch missing team messages into model outputs. If a model omits reviewer communication, that weakness should appear as missing-review penalties and a lower `coordination_score`.

	For real model evaluation, run:

	```bash
	API_BASE_URL=https://router.huggingface.co/v1 \
	MODEL_NAME=your-model \
	HF_TOKEN=your-token \
	python inference.py
	```
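
A wrapper script can fail fast in the same spirit by checking the configuration before launching a run. The sketch below assumes the three variables in the command above are all required; `missing_model_config` is a hypothetical helper, not part of `inference.py`.

```python
import os

# Assumption: these are the variables the evaluation run depends on.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def missing_model_config(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

If the returned list is non-empty, report it and exit with a non-zero status instead of substituting a default model or policy.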

	Use the deterministic trace policy only for SFT data generation, not for reporting model scores.