MolForge Training Instructions
This guide is for training a small model against MolForge without teaching it to exploit the environment.
1. Safety Defaults
MolForge now hides true internal molecule properties from public state() metadata by default. If you need to debug the environment manually, use:
MOLFORGE_DEBUG_STATE=1 python inference.py
Do not use MOLFORGE_DEBUG_STATE=1 while collecting SFT data or running RL.
The chemistry oracle path uses RDKit descriptors by default and TDC molecule oracles when pytdc is available. TDC is kept as an optional extra because current PyTDC releases pull a large platform-sensitive ML stack; install it with uv sync --extra tdc on a compatible Python if you want TDC SA/QED oracles active. RDKit remains active in the default Docker/HF deployment, and the environment records the active backend in observation metadata.
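The fallback rule above can be sketched as a small check; this is an illustrative sketch only, assuming PyTDC is imported as tdc — the function name is not part of MolForge's API:

```python
# Illustrative sketch of the optional-backend rule described above:
# use TDC oracles when the optional PyTDC dependency is importable,
# otherwise fall back to the default RDKit descriptor path.
# The import name "tdc" and this helper are assumptions, not MolForge internals.
import importlib.util

def select_oracle_backend() -> str:
    # PyTDC is commonly imported as "tdc"; treat its absence as "not installed".
    if importlib.util.find_spec("tdc") is not None:
        return "tdc"
    return "rdkit"

print(select_oracle_backend())
```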
The default reward mode is assay_gated, which gives coarse edit feedback and leaves the strongest quality signal to assays and terminal graders. For early RL warmup, use the curriculum reward mode:
MOLFORGE_REWARD_MODE=curriculum python inference.py
Curriculum mode keeps the official submission_score strict, but gives bounded
training reward for useful evidence collection, evidence-supported submit
decisions, and non-submitted near-miss episodes. If the model reaches a strong
evidence package and still fails to submit before the deadline, curriculum mode
adds a small missed-nomination penalty. This prevents small models from seeing
only zero terminal scores while they are still learning when to submit, without
letting endless assay collection become the best behavior. Use this for initial
GRPO curves, then switch back to assay_gated for final evaluation.
For curriculum experiments only, you can also restore the older dense edit reward:
MOLFORGE_REWARD_MODE=dense python inference.py
Use randomized training episodes when collecting data or training a policy:
MOLFORGE_TRAINING_RANDOMIZATION=1 MOLFORGE_RANDOM_SEED=42 python inference.py
Keep randomization off for judge-facing baseline runs so scores remain reproducible.
2. Recommended Training Plan
Use a two-stage plan:
- Small SFT warm start
- RL with verifiable rewards
SFT is only for teaching the model the action schema and basic workflow. RL should do the real environment optimization.
3. What SFT Should Teach
Include these example types:
- Valid JSON action formatting
- Correct acting_role for each action
- Short rationale values that explain the decision without chain-of-thought
- evidence lists that cite visible observation facts only
- expected_effects dictionaries with directional predictions, not hidden scores
- Specialist message bundles with proposal, approval, objection, assay request, or rejection
- Running cheap/necessary assays before risky submissions
- Editing toward safer fragments when toxicity risk is visible
- Restarting early in the hard sunk-cost scenario
- Submitting only when evidence covers the task constraints
- Handling noisy assay estimates without undoing a high-confidence final candidate at the last moment
- Recovering from low budget by choosing small actions or stopping
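A minimal, hedged example of one such SFT action target, built only from the field names this guide mentions (acting_role, rationale, evidence, expected_effects); the role value, action type, and record shape are illustrative placeholders, not MolForge's exact schema:

```python
# Hedged sketch of a single assistant-turn action target for SFT.
# Only the field names named in this guide are assumed; the values and
# the "action" key are illustrative placeholders, not the real schema.
import json

action = {
    "acting_role": "medicinal_chemist",      # illustrative role value
    "action": "run_assay",                   # illustrative action type
    "rationale": "Cheap solubility assay before any risky edit.",
    "evidence": ["Observation shows no solubility measurement yet."],
    "expected_effects": {"solubility_estimate": "becomes_known"},
}

serialized = json.dumps(action)
print(serialized[:60])
```

The point of the sketch is the shape: a short rationale, evidence citing only visible observations, and directional expected_effects rather than hidden scores.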
Avoid these example types:
- Any example that reads state.metadata.debug_hidden_properties
- Any answer that mentions exact hidden objective deltas
- Hidden chain-of-thought or long private reasoning transcripts
- Repetitive message spam just to collect coordination reward
- Premature submit actions without potency/safety evidence
- Examples where missing specialist messages are silently repaired by the runner
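The avoid-list above can be enforced mechanically before training. A minimal sketch, assuming your own forbidden-substring list; the helper name and substrings are illustrative and should be extended for your dataset:

```python
# Hedged sketch: drop SFT records that leak hidden state or exact deltas.
# The substrings come from the avoid-list above; everything else here is
# an illustrative helper, not part of MolForge.
import json

FORBIDDEN_SUBSTRINGS = (
    "debug_hidden_properties",   # hidden-property reads
    "hidden objective delta",    # exact hidden-score mentions
)

def is_clean(record: dict) -> bool:
    text = json.dumps(record).lower()
    return not any(s in text for s in FORBIDDEN_SUBSTRINGS)

print(is_clean({"messages": [{"role": "assistant", "content": "run assay"}]}))
```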
4. Generate a Starter SFT Dataset
For the first schema warm start, use the strict curriculum dataset. It includes
explicit JSON null fields, only the intended top-level action keys, all action
types, all assay tools, all edit subtypes, and valid role/message permissions:
python scripts/generate_sft_schema_strict_dataset.py \
--episodes 75 \
--output data/molforge_sft_schema_strict.jsonl
python scripts/validate_sft_traces.py data/molforge_sft_schema_strict.jsonl
Use this file first for Qwen 2B-class SFT:
data/molforge_sft_schema_strict.jsonl
The older trace generator is still useful after the model learns the exact schema, because it provides more policy-like trajectories:
Run:
python scripts/generate_sft_traces.py --episodes 80 --output data/molforge_sft_traces.jsonl
For a more robust dataset:
python scripts/generate_sft_traces.py \
--episodes 200 \
--randomized \
--output data/molforge_sft_traces_randomized.jsonl
The generated records use chat-style JSONL:
{"messages":[{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}
Before training, spot-check the JSONL:
python - <<'PY'
import json
from pathlib import Path

path = Path("data/molforge_sft_traces.jsonl")
for i, line in zip(range(3), path.open()):
    item = json.loads(line)
    # Print any per-record metadata plus the start of the assistant turn.
    print(i, item.get("metadata"), item["messages"][-1]["content"][:300])
PY
5. SFT Settings
Start small:
- Dataset size: 200 to 1,000 action examples
- Max sequence length: 2,048 or 4,096
- LoRA rank: 16 or 32
- Learning rate: 1e-4 to 2e-4
- Epochs: 1 to 3
- Target modules: attention and MLP projection layers
- Save LoRA adapters first; test them before merging
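As a starting point, the settings above can be collected into one config dict; the keys mirror common PEFT/TRL argument names and the target-module names are a common choice for Qwen-style models — both are assumptions, so treat this as a checklist rather than a drop-in config:

```python
# Hedged mapping of the SFT settings above into one config dict.
# Key names mirror common PEFT/TRL arguments; your trainer API may differ.
sft_config = {
    "max_seq_length": 4096,
    "lora_r": 16,
    "lora_alpha": 32,                      # common 2x-rank default; an assumption
    "learning_rate": 2e-4,
    "num_train_epochs": 2,
    "target_modules": [                    # attention + MLP projections, Qwen-style names
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}
print(sorted(sft_config))
```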
Stop SFT once the model reliably emits valid MolForgeAction JSON. Do not overfit it into copying one fixed heuristic path.
6. RL Stage
After SFT, run RL/GRPO with MolForge as the verifier environment.
Use these environment settings:
export MOLFORGE_TRAINING_RANDOMIZATION=1
export MOLFORGE_REWARD_MODE=curriculum
unset MOLFORGE_DEBUG_STATE
Once the model starts submitting valid candidates, run a second RL/evaluation phase with:
export MOLFORGE_REWARD_MODE=assay_gated
Report both curves if possible:
- curriculum reward curve for early learning progress;
- strict terminal submission_score before/after for judge-facing task success.
Track these metrics separately:
- Average terminal submission_score
- Average terminal candidate_score
- Average terminal budget_score
- Budget remaining at valid submit
- Invalid action rate
- Policy veto rate
- Budget exhaustion rate
- Repeated assay count
- Loop penalty count
- Coordination score
- Evidence score
- Submitted-without-evidence count
- Constraint margin score
- Number of actions before submit
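A minimal sketch of tracking a few of these metrics per evaluation batch; the per-episode keys here are assumptions about your rollout logs, not MolForge output fields:

```python
# Hedged sketch: aggregate a handful of the metrics listed above from
# per-episode result dicts. The episode keys are illustrative assumptions.
from statistics import mean

def summarize(episodes: list[dict]) -> dict:
    return {
        "avg_submission_score": mean(e["submission_score"] for e in episodes),
        "invalid_action_rate": mean(
            e["invalid_actions"] / max(e["total_actions"], 1) for e in episodes
        ),
        "no_evidence_submit_rate": mean(
            1.0 if e["submitted_without_evidence"] else 0.0 for e in episodes
        ),
    }

episodes = [
    {"submission_score": 0.6, "invalid_actions": 1, "total_actions": 20,
     "submitted_without_evidence": False},
    {"submission_score": 0.2, "invalid_actions": 4, "total_actions": 10,
     "submitted_without_evidence": True},
]
print(summarize(episodes))
```

Tracking the rates separately from the reward curve is what exposes degenerate policies, such as a rising reward driven by evidence-free submissions.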
Inspect generations every few hundred updates. A rising reward is not enough if the model learns to spam messages, submit without evidence, or memorize the three default scenarios.
7. Evaluation Protocol
Use three evaluations:
- Deterministic public tasks: run with randomization off and compare to python inference.py.
- Randomized training tasks: run with MOLFORGE_TRAINING_RANDOMIZATION=1.
- Holdout tasks: add new scenario configs or fragment perturbations not present in SFT traces.
A trained model should improve terminal submission score while keeping invalid actions and evidence-free submissions low.
For the full testing protocol, including how to compare curriculum reward against strict evaluation, see EVALUATION_PROTOCOL.md.
8. Model Choice
Recommended starting point:
- unsloth/Qwen3.5-2B for the lightest serious iteration loop
- unsloth/Qwen3-4B-Instruct-2507 if you can afford a little more VRAM and want stronger JSON/tool following
Why:
- Qwen3.5 has 0.8B, 2B, and 4B Unsloth fine-tuning support.
- The 2B class should be fast enough for repeated MolForge SFT/RL experiments.
- The 4B class is still lightweight, but should be more reliable for structured action generation.
Use Qwen3.5-0.8B only for plumbing tests. It is useful for verifying the training loop, but likely too weak to give a meaningful read on the environment.
If you have more GPU budget:
- unsloth/Qwen3-8B or a current Qwen3/Qwen3.5 8B-class instruct model
If you specifically want alternate-family baselines:
- unsloth/Llama-3.1-8B-Instruct
- Gemma 3/4 small instruct models can be tested, but prefer Qwen first because the current Unsloth Qwen3.5 fine-tuning path is clearer for 2B/4B RL iteration.
For the hackathon, prefer faster iteration over maximum model size. A clean 4B model trained well against this environment is more useful than a larger model that only runs a few noisy experiments.
9. Honest Inference Reporting
inference.py has no heuristic fallback. It requires a configured model and exits with an error if the model is missing, times out, or emits unparsable action JSON.
local_inference.py also has no heuristic policy fallback and does not patch missing team messages into model outputs. If a model omits reviewer communication, that weakness should appear as missing-review penalties and a lower coordination_score.
For real model evaluation, run:
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=your-model \
HF_TOKEN=your-token \
python inference.py
Use the deterministic trace policy only for SFT data generation, not for reporting model scores.