# MolForge Training Instructions

This guide is for training a small model against MolForge without teaching it to exploit the environment.

## 1. Safety Defaults

MolForge now hides true internal molecule properties from public `state()` metadata by default. If you need to debug the environment manually, use:

```bash
MOLFORGE_DEBUG_STATE=1 python inference.py
```

Do not use `MOLFORGE_DEBUG_STATE=1` while collecting SFT data or running RL.

The chemistry oracle path uses RDKit descriptors by default and TDC molecule oracles when `pytdc` is available. TDC is kept as an optional extra because current PyTDC releases pull in a large, platform-sensitive ML stack; install it with `uv sync --extra tdc` on a compatible Python if you want the TDC SA/QED oracles active. RDKit remains active in the default Docker/HF deployment, and the environment records the active backend in observation metadata.

The default reward mode is `assay_gated`, which gives coarse edit feedback and leaves the strongest quality signal to assays and terminal graders. For early RL warmup, use the curriculum reward mode:

```bash
MOLFORGE_REWARD_MODE=curriculum python inference.py
```

Curriculum mode keeps the official `submission_score` strict, but gives bounded training reward for useful evidence collection, evidence-supported submit decisions, and non-submitted near-miss episodes. If the model reaches a strong evidence package and still fails to submit before the deadline, curriculum mode adds a small missed-nomination penalty. This prevents small models from seeing only zero terminal scores while they are still learning when to submit, without letting endless assay collection become the best behavior. Use this mode for initial GRPO curves, then switch back to `assay_gated` for final evaluation.

For curriculum experiments only, you can also restore the older dense edit reward:

```bash
MOLFORGE_REWARD_MODE=dense python inference.py
```

Use randomized training episodes when collecting data or training a policy:

```bash
MOLFORGE_TRAINING_RANDOMIZATION=1 MOLFORGE_RANDOM_SEED=42 python inference.py
```

Keep randomization off for judge-facing baseline runs so scores remain reproducible.
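Taken together, the defaults for any training-data or RL rollout run are: randomization on, curriculum reward for warmup, and debug state unset. Below is a minimal launcher sketch using only the environment variables documented above; the script itself is hypothetical glue, not part of MolForge:

```python
import os
import subprocess

# Training-run defaults from this section: randomized episodes,
# curriculum reward for warmup, and MOLFORGE_DEBUG_STATE explicitly
# removed so hidden properties never leak into collected data.
env = dict(os.environ)
env["MOLFORGE_TRAINING_RANDOMIZATION"] = "1"
env["MOLFORGE_RANDOM_SEED"] = "42"
env["MOLFORGE_REWARD_MODE"] = "curriculum"
env.pop("MOLFORGE_DEBUG_STATE", None)

subprocess.run(["python", "inference.py"], env=env, check=True)
```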
## 2. Recommended Training Plan

Use a two-stage plan:

1. Small SFT warm start
2. RL with verifiable rewards

SFT is only for teaching the model the action schema and basic workflow. RL should do the real environment optimization.

## 3. What SFT Should Teach

Include these example types:

- Valid JSON action formatting
- Correct `acting_role` for each action
- Short `rationale` values that explain the decision without chain-of-thought
- `evidence` lists that cite visible observation facts only
- `expected_effects` dictionaries with directional predictions, not hidden scores
- Specialist message bundles with proposal, approval, objection, assay request, or rejection
- Running cheap/necessary assays before risky submissions
- Editing toward safer fragments when toxicity risk is visible
- Restarting early in the hard sunk-cost scenario
- Submitting only when evidence covers the task constraints
- Handling noisy assay estimates without undoing a high-confidence final candidate at the last moment
- Recovering from a low budget by choosing small actions or stopping

Avoid these example types:

- Any example that reads `state.metadata.debug_hidden_properties`
- Any answer that mentions exact hidden objective deltas
- Hidden chain-of-thought or long private reasoning transcripts
- Repetitive message spam just to collect coordination reward
- Premature submit actions without potency/safety evidence
- Examples where missing specialist messages are silently repaired by the runner

## 4. Generate a Starter SFT Dataset

For the first schema warm start, use the strict curriculum dataset. It includes explicit JSON `null` fields, only the intended top-level action keys, all action types, all assay tools, all edit subtypes, and valid role/message permissions:

```bash
python scripts/generate_sft_schema_strict_dataset.py \
  --episodes 75 \
  --output data/molforge_sft_schema_strict.jsonl

python scripts/validate_sft_traces.py data/molforge_sft_schema_strict.jsonl
```

Use this file first for Qwen 2B-class SFT:

```text
data/molforge_sft_schema_strict.jsonl
```

The older trace generator is still useful after the model learns the exact schema, because it provides more policy-like trajectories. Run:

```bash
python scripts/generate_sft_traces.py --episodes 80 --output data/molforge_sft_traces.jsonl
```

For a more robust dataset:

```bash
python scripts/generate_sft_traces.py \
  --episodes 200 \
  --randomized \
  --output data/molforge_sft_traces_randomized.jsonl
```

The generated records use chat-style JSONL:

```json
{"messages":[{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}
```

Before training, spot-check the JSONL:

```bash
python - <<'PY'
import json
from pathlib import Path

path = Path("data/molforge_sft_traces.jsonl")
for i, line in zip(range(3), path.open()):
    item = json.loads(line)
    print(i, item["metadata"], item["messages"][-1]["content"][:300])
PY
```

## 5. SFT Settings

Start small:

- Dataset size: 200 to 1,000 action examples
- Max sequence length: 2,048 or 4,096
- LoRA rank: 16 or 32
- Learning rate: `1e-4` to `2e-4`
- Epochs: 1 to 3
- Target modules: attention and MLP projection layers
- Save LoRA adapters first; test them before merging

Stop SFT once the model reliably emits valid `MolForgeAction` JSON. Do not overfit it into copying one fixed heuristic path.
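If you use the `peft` library, the settings above translate into a config roughly like the sketch below. This is a minimal sketch under assumptions: the target module names (`q_proj` through `down_proj`) assume a Qwen-style architecture, and `lora_alpha`/`lora_dropout` are common choices, not values prescribed by this guide:

```python
# Minimal LoRA sketch with the `peft` library, matching the list above.
# Assumptions: Qwen-style module names; lora_alpha/lora_dropout are our picks.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,  # LoRA rank: 16 or 32
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention and MLP projection layers on Qwen-style models:
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

Train with this config at the learning rates and epoch counts listed above, and save and test the adapters before merging.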
## 6. RL Stage

After SFT, run RL/GRPO with MolForge as the verifier environment. Use these environment settings:

```bash
export MOLFORGE_TRAINING_RANDOMIZATION=1
export MOLFORGE_REWARD_MODE=curriculum
unset MOLFORGE_DEBUG_STATE
```

Once the model starts submitting valid candidates, run a second RL/evaluation phase with:

```bash
export MOLFORGE_REWARD_MODE=assay_gated
```

Report both curves if possible:

- the curriculum reward curve for early learning progress
- the strict terminal `submission_score` before/after for judge-facing task success

Track these metrics separately:

- Average terminal `submission_score`
- Average terminal `candidate_score`
- Average terminal `budget_score`
- Budget remaining at valid submit
- Invalid action rate
- Policy veto rate
- Budget exhaustion rate
- Repeated assay count
- Loop penalty count
- Coordination score
- Evidence score
- Submitted-without-evidence count
- Constraint margin score
- Number of actions before submit

Inspect generations every few hundred updates. A rising reward curve is not enough if the model learns to spam messages, submit without evidence, or memorize the three default scenarios.

## 7. Evaluation Protocol

Use three evaluations:

1. Deterministic public tasks: run with randomization off and compare to `python inference.py`.
2. Randomized training tasks: run with `MOLFORGE_TRAINING_RANDOMIZATION=1`.
3. Holdout tasks: add new scenario configs or fragment perturbations not present in SFT traces.

A trained model should improve the terminal submission score while keeping invalid actions and evidence-free submissions low. For the full testing protocol, including how to compare curriculum reward against strict evaluation, see [EVALUATION_PROTOCOL.md](EVALUATION_PROTOCOL.md).

## 8. Model Choice

Recommended starting points:

- `unsloth/Qwen3.5-2B` for the lightest serious iteration loop
- `unsloth/Qwen3-4B-Instruct-2507` if you can afford a little more VRAM and want stronger JSON/tool following

Why:

- Qwen3.5 has 0.8B, 2B, and 4B Unsloth fine-tuning support.
- The 2B class should be fast enough for repeated MolForge SFT/RL experiments.
- The 4B class is still lightweight, but should be more reliable for structured action generation.

Use `Qwen3.5-0.8B` only for plumbing tests. It is useful for verifying the training loop, but likely too weak to judge the environment.

If you have more GPU budget:

- `unsloth/Qwen3-8B` or a current Qwen3/Qwen3.5 8B-class instruct model

If you specifically want alternate-family baselines:

- `unsloth/Llama-3.1-8B-Instruct`
- Gemma 3/4 small instruct models can be tested, but prefer Qwen first because the current Unsloth Qwen3.5 fine-tuning path is clearer for 2B/4B RL iteration.

For the hackathon, prefer faster iteration over maximum model size. A clean 4B model trained well against this environment is more useful than a larger model that only runs a few noisy experiments.

## 9. Honest Inference Reporting

`inference.py` has no heuristic fallback. It requires a configured model and exits with an error if the model is missing, times out, or emits unparsable action JSON. `local_inference.py` also has no heuristic policy fallback and does not patch missing team messages into model outputs. If a model omits reviewer communication, that weakness should appear as missing-review penalties and a lower `coordination_score`.

For real model evaluation, run:

```bash
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=your-model \
HF_TOKEN=your-token \
python inference.py
```

Use the deterministic trace policy only for SFT data generation, not for reporting model scores.
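The same no-repair stance applies when you compute the invalid action rate from Section 6: count a completion as invalid if it does not parse as a single JSON object, and never patch it. MolForge itself may apply stricter schema checks; the helper below is our own minimal sketch, not part of the environment:

```python
import json

def is_valid_action_json(raw: str) -> bool:
    """Strict minimal check: the completion must parse as one JSON object.
    No repair or fallback, mirroring inference.py's no-fallback behavior."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False

# Tallying the invalid-action-rate metric over raw model completions
# (the completions here are illustrative fixtures):
completions = ['{"acting_role": "reviewer"}', "not json at all"]
invalid_rate = sum(not is_valid_action_json(c) for c in completions) / len(completions)
print(f"invalid action rate: {invalid_rate:.2f}")  # 0.50
```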