# MolForge Training Instructions

This guide is for training a small model against MolForge without teaching it to exploit the environment.

## 1. Safety Defaults

MolForge now hides true internal molecule properties from public `state()` metadata by default. If you need to debug the environment manually, use:

```bash
MOLFORGE_DEBUG_STATE=1 python inference.py
```

Do not use `MOLFORGE_DEBUG_STATE=1` while collecting SFT data or running RL.

The chemistry oracle path uses RDKit descriptors by default and TDC molecule oracles when `pytdc` is available. TDC is kept as an optional extra because current PyTDC releases pull in a large, platform-sensitive ML stack; install it with `uv sync --extra tdc` on a compatible Python version if you want the TDC SA/QED oracles active. RDKit remains active in the default Docker/HF deployment, and the environment records the active backend in observation metadata.

The default reward mode is `assay_gated`, which gives coarse edit feedback and leaves the strongest quality signal to assays and terminal graders. For early RL warmup, use the curriculum reward mode:

```bash
MOLFORGE_REWARD_MODE=curriculum python inference.py
```

Curriculum mode keeps the official `submission_score` strict, but gives bounded training reward for useful evidence collection, evidence-supported submit decisions, and non-submitted near-miss episodes. If the model reaches a strong evidence package and still fails to submit before the deadline, curriculum mode adds a small missed-nomination penalty. This prevents small models from seeing only zero terminal scores while they are still learning when to submit, without letting endless assay collection become the best behavior. Use this for initial GRPO curves, then switch back to `assay_gated` for final evaluation.

For curriculum experiments only, you can also restore the older dense edit reward:

```bash
MOLFORGE_REWARD_MODE=dense python inference.py
```

Use randomized training episodes when collecting data or training a policy:

```bash
MOLFORGE_TRAINING_RANDOMIZATION=1 MOLFORGE_RANDOM_SEED=42 python inference.py
```

Keep randomization off for judge-facing baseline runs so scores remain reproducible.

## 2. Recommended Training Plan

Use a two-stage plan:

1. Small SFT warm start
2. RL with verifiable rewards

SFT is only for teaching the model the action schema and basic workflow. RL should do the real environment optimization.

## 3. What SFT Should Teach

Include these example types (a sketched assistant target follows the list):

- Valid JSON action formatting
- Correct `acting_role` for each action
- Short `rationale` values that explain the decision without chain-of-thought
- `evidence` lists that cite visible observation facts only
- `expected_effects` dictionaries with directional predictions, not hidden scores
- Specialist message bundles with proposal, approval, objection, assay request, or rejection
- Running cheap/necessary assays before risky submissions
- Editing toward safer fragments when toxicity risk is visible
- Restarting early in the hard sunk-cost scenario
- Submitting only when evidence covers the task constraints
- Handling noisy assay estimates without undoing a high-confidence final candidate at the last moment
- Recovering from low budget by choosing small actions or stopping
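
For reference, the sketch below shows what one SFT assistant target in this spirit might look like. It is a hedged illustration only: the canonical `MolForgeAction` schema is defined by the environment, and the action type, role name, assay tool, and field values used here are illustrative assumptions.

```python
# Hedged sketch of a single SFT assistant target (not the canonical schema).
# The field names follow the list above; the action type, role, and assay tool
# identifiers are illustrative assumptions, not values from the environment.
import json

example_action = {
    "action_type": "run_assay",            # assumed action type name
    "acting_role": "assay_scientist",      # assumed role name
    "rationale": "Potency is unverified; run the cheap assay before editing further.",
    "evidence": ["observation: predicted potency flagged as low-confidence"],
    "expected_effects": {"potency_estimate": "more certain", "budget": "small decrease"},
    "tool": "binding_assay",               # assumed assay tool name
    "message": None,                       # explicit JSON null for unused fields
}

print(json.dumps(example_action))          # the content of one assistant turn
```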

Avoid these example types:

- Any example that reads `state.metadata.debug_hidden_properties`
- Any answer that mentions exact hidden objective deltas
- Hidden chain-of-thought or long private reasoning transcripts
- Repetitive message spam just to collect coordination reward
- Premature submit actions without potency/safety evidence
- Examples where missing specialist messages are silently repaired by the runner

## 4. Generate a Starter SFT Dataset

For the first schema warm start, use the strict curriculum dataset. It includes explicit JSON `null` fields, only the intended top-level action keys, all action types, all assay tools, all edit subtypes, and valid role/message permissions:

```bash
python scripts/generate_sft_schema_strict_dataset.py \
  --episodes 75 \
  --output data/molforge_sft_schema_strict.jsonl
python scripts/validate_sft_traces.py data/molforge_sft_schema_strict.jsonl
```

Use this file first for Qwen 2B-class SFT:

```text
data/molforge_sft_schema_strict.jsonl
```

The older trace generator is still useful after the model learns the exact schema, because it provides more policy-like trajectories. Run:

```bash
python scripts/generate_sft_traces.py --episodes 80 --output data/molforge_sft_traces.jsonl
```

For a more robust dataset:

```bash
python scripts/generate_sft_traces.py \
  --episodes 200 \
  --randomized \
  --output data/molforge_sft_traces_randomized.jsonl
```

The generated records use chat-style JSONL:

```json
{"messages":[{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}
```

Before training, spot-check the JSONL:

```bash
python - <<'PY'
import json
from pathlib import Path

path = Path("data/molforge_sft_traces.jsonl")
for i, line in zip(range(3), path.open()):
    item = json.loads(line)
    # Show per-record metadata (if present) and the start of the assistant target.
    print(i, item.get("metadata"), item["messages"][-1]["content"][:300])
PY
```

## 5. SFT Settings

Start small (a LoRA config sketch follows the list):

- Dataset size: 200 to 1,000 action examples
- Max sequence length: 2,048 or 4,096
- LoRA rank: 16 or 32
- Learning rate: `1e-4` to `2e-4`
- Epochs: 1 to 3
- Target modules: attention and MLP projection layers
- Save LoRA adapters first; test them before merging
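
A minimal sketch of matching LoRA settings, assuming the `peft` library and a Qwen-style module layout; verify the target module names against the base model you pick.

```python
# Minimal LoRA config sketch for the ranges above (assumes the `peft` library).
# Target module names assume a Qwen-style decoder; check them for your base model.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                        # LoRA rank: 16 or 32
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
# Pair this with a 1e-4 to 2e-4 learning rate, 1-3 epochs, and a 2,048-4,096 max
# sequence length in your SFT trainer; save adapters and test before merging.
```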

Stop SFT once the model reliably emits valid `MolForgeAction` JSON; a quick way to estimate this is sketched below. Do not overfit it into copying one fixed heuristic path.
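
One quick check for that stop criterion is to parse sampled completions and measure how many are JSON objects with the expected top-level keys. A minimal sketch, assuming the key names from section 3 (the true required set comes from the environment's schema):

```python
# Hedged sketch: estimate the valid-action rate of sampled model completions.
# REQUIRED_KEYS is an assumption based on section 3, not the authoritative schema.
import json

REQUIRED_KEYS = {"acting_role", "rationale", "evidence", "expected_effects"}

def valid_action_rate(completions: list[str]) -> float:
    ok = 0
    for text in completions:
        try:
            action = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(action, dict) and REQUIRED_KEYS <= action.keys():
            ok += 1
    return ok / max(len(completions), 1)
```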

## 6. RL Stage

After SFT, run RL/GRPO with MolForge as the verifier environment.

Use these environment settings:

```bash
export MOLFORGE_TRAINING_RANDOMIZATION=1
export MOLFORGE_REWARD_MODE=curriculum
unset MOLFORGE_DEBUG_STATE
```

Once the model starts submitting valid candidates, run a second RL/evaluation phase with:

```bash
export MOLFORGE_REWARD_MODE=assay_gated
```

Report both curves if possible:

- curriculum reward curve for early learning progress;
- strict terminal `submission_score` before/after for judge-facing task success.

Track these metrics separately (an aggregation sketch follows the list):

- Average terminal `submission_score`
- Average terminal `candidate_score`
- Average terminal `budget_score`
- Budget remaining at valid submit
- Invalid action rate
- Policy veto rate
- Budget exhaustion rate
- Repeated assay count
- Loop penalty count
- Coordination score
- Evidence score
- Submitted-without-evidence count
- Constraint margin score
- Number of actions before submit
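
If your RL harness logs per-episode terminal results as dictionaries, a small aggregator such as the sketch below keeps these metrics separated. The episode field names mirror the metric names above but are assumptions about your logging, not fields guaranteed by the environment.

```python
# Hedged sketch: aggregate per-episode terminal metrics into one summary dict.
# Episode keys mirror the metric names above and are assumptions about how your
# harness logs results, not fields guaranteed by the environment.
from statistics import mean

SCORE_KEYS = ["submission_score", "candidate_score", "budget_score",
              "coordination_score", "evidence_score", "constraint_margin_score"]
COUNT_KEYS = ["invalid_actions", "policy_vetoes", "repeated_assays",
              "loop_penalties", "submitted_without_evidence", "actions_before_submit"]

def summarize(episodes: list[dict]) -> dict:
    summary = {k: mean(ep.get(k, 0.0) for ep in episodes) for k in SCORE_KEYS}
    summary.update({k: mean(ep.get(k, 0) for ep in episodes) for k in COUNT_KEYS})
    summary["budget_exhaustion_rate"] = mean(
        1.0 if ep.get("budget_exhausted") else 0.0 for ep in episodes
    )
    return summary
```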

Inspect generations every few hundred updates. A rising reward is not enough if the model learns to spam messages, submit without evidence, or memorize the three default scenarios.

## 7. Evaluation Protocol

Use three evaluations (a driver sketch follows the list):

1. Deterministic public tasks: run with randomization off and compare to `python inference.py`.
2. Randomized training tasks: run with `MOLFORGE_TRAINING_RANDOMIZATION=1`.
3. Holdout tasks: add new scenario configs or fragment perturbations not present in SFT traces.
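
A minimal driver sketch for the first two runs is shown below; the log file names and seed value are illustrative choices, and the holdout run is left as a comment because the holdout configuration mechanism is project-specific.

```python
# Hedged sketch: drive the deterministic and randomized evaluation runs.
# Log file names and the seed value are illustrative, not part of the environment.
import os
import subprocess

def run_eval(tag: str, extra_env: dict[str, str]) -> None:
    env = dict(os.environ)
    env.pop("MOLFORGE_TRAINING_RANDOMIZATION", None)  # randomization off by default
    env.pop("MOLFORGE_DEBUG_STATE", None)             # never evaluate with debug state
    env.update(extra_env)
    with open(f"eval_{tag}.log", "w") as log:
        subprocess.run(["python", "inference.py"], env=env, stdout=log, check=True)

run_eval("public_deterministic", {})  # 1. deterministic public tasks
run_eval("randomized", {"MOLFORGE_TRAINING_RANDOMIZATION": "1",
                        "MOLFORGE_RANDOM_SEED": "123"})  # 2. randomized training tasks
# 3. Holdout tasks: run against scenario configs or fragment perturbations that
#    were not present in SFT traces (selection mechanism is project-specific).
```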

A trained model should improve terminal submission score while keeping invalid actions and evidence-free submissions low.

For the full testing protocol, including how to compare curriculum reward against strict evaluation, see [EVALUATION_PROTOCOL.md](EVALUATION_PROTOCOL.md).

## 8. Model Choice

Recommended starting point:

- `unsloth/Qwen3.5-2B` for the lightest serious iteration loop
- `unsloth/Qwen3-4B-Instruct-2507` if you can afford a little more VRAM and want stronger JSON/tool following

Why:

- Qwen3.5 has 0.8B, 2B, and 4B Unsloth fine-tuning support.
- The 2B class should be fast enough for repeated MolForge SFT/RL experiments.
- The 4B class is still lightweight, but should be more reliable for structured action generation.

Use `Qwen3.5-0.8B` only for plumbing tests. It is useful to verify the training loop, but likely too weak to judge the environment.

If you have more GPU budget:

- `unsloth/Qwen3-8B` or a current Qwen3/Qwen3.5 8B-class instruct model

If you specifically want alternate-family baselines:

- `unsloth/Llama-3.1-8B-Instruct`
- Gemma 3/4 small instruct models can be tested, but prefer Qwen first because the current Unsloth Qwen3.5 fine-tuning path is clearer for 2B/4B RL iteration.

For the hackathon, prefer faster iteration over maximum model size. A clean 4B model trained well against this environment is more useful than a larger model that only runs a few noisy experiments.

## 9. Honest Inference Reporting

`inference.py` has no heuristic fallback. It requires a configured model and exits with an error if the model is missing, times out, or emits unparsable action JSON.

`local_inference.py` also has no heuristic policy fallback and does not patch missing team messages into model outputs. If a model omits reviewer communication, that weakness should appear as missing-review penalties and a lower `coordination_score`.

For real model evaluation, run:

```bash
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=your-model \
HF_TOKEN=your-token \
python inference.py
```

Use the deterministic trace policy only for SFT data generation, not for reporting model scores.