
# MolForge Evaluation Protocol

Use two reward settings for different purposes.

## 1. Training / RL Warmup

Use curriculum mode:

```shell
export MOLFORGE_REWARD_MODE=curriculum
export MOLFORGE_TRAINING_RANDOMIZATION=1
```

Track:

- mean episode reward;
- valid JSON/action rate;
- policy veto rate;
- evidence score;
- number of oracle calls;
- budget remaining at submit;
- submit rate;
- missed-nomination rate;
- strict terminal submission_score.

The curriculum reward is deliberately generous because its purpose is learning: it rewards useful evidence collection and evidence-supported submit timing.
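The tracked quantities above can be aggregated from per-episode logs roughly as follows. This is a minimal sketch; the per-episode field names (`reward`, `valid_actions`, `vetoes`, etc.) are illustrative assumptions, not MolForge's actual log schema:

```python
from statistics import mean

def summarize_training(episodes):
    """Aggregate per-episode training metrics into run-level numbers.

    `episodes` is a list of dicts; the field names used here are
    illustrative assumptions, not MolForge's actual log schema.
    """
    submitted = [e for e in episodes if e["submitted"]]
    return {
        "mean_episode_reward": mean(e["reward"] for e in episodes),
        "valid_action_rate": mean(e["valid_actions"] / max(e["actions"], 1) for e in episodes),
        "policy_veto_rate": mean(e["vetoes"] / max(e["actions"], 1) for e in episodes),
        "mean_evidence_score": mean(e["evidence_score"] for e in episodes),
        "mean_oracle_calls": mean(e["oracle_calls"] for e in episodes),
        "mean_budget_at_submit": mean(e["budget_remaining"] for e in submitted) if submitted else 0.0,
        "submit_rate": len(submitted) / len(episodes),
        "mean_submission_score": mean(e["submission_score"] for e in episodes),
    }
```

Computing these per checkpoint makes it easy to plot each metric over training rather than only the raw reward curve.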

## 2. Judge-Facing Evaluation

Use strict/default mode:

```shell
unset MOLFORGE_TRAINING_RANDOMIZATION
export MOLFORGE_REWARD_MODE=assay_gated
```

Report:

- average_submission_score;
- average_final_score;
- per-task final_score;
- per-task submission_score;
- candidate_score;
- progress_score;
- constraint_margin_score;
- evidence_score;
- coordination_score;
- budget_score;
- submitted vs not submitted;
- invalid action count;
- policy veto count.
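A judge-facing report covering these fields can be assembled from per-task results along these lines. The result-dict schema here is an assumption for illustration; only the field names listed above come from the protocol:

```python
def build_report(task_results):
    """Build the judge-facing summary from per-task evaluation results.

    `task_results` is a list of dicts; the schema is an illustrative
    assumption, keyed by the fields this protocol asks to report.
    """
    n = len(task_results)
    per_task_fields = (
        "task_id", "final_score", "submission_score",
        "candidate_score", "progress_score", "constraint_margin_score",
        "evidence_score", "coordination_score", "budget_score",
        "submitted", "invalid_action_count", "policy_veto_count",
    )
    return {
        "average_final_score": sum(t["final_score"] for t in task_results) / n,
        "average_submission_score": sum(t["submission_score"] for t in task_results) / n,
        "per_task": [{k: t[k] for k in per_task_fields} for t in task_results],
    }
```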

The official score should not be the minimum number of steps: real drug discovery does not reward the fastest project if it skips necessary evidence. Instead, MolForge rewards finishing within the available budget and decision horizon.

final_score is the single scalar to optimize and headline. It equals submission_score for submitted episodes and gives only small, capped partial credit to non-submitted episodes. progress_score is useful for debugging but is not a substitute for final_score or submission_score: it is capped when constraints fail, when the hard trap scenario is not restarted, or when the model loops through repeated assays and vetoes.
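The final_score rule described above can be sketched as follows. The cap value and partial-credit formula are illustrative assumptions, not MolForge's actual constants; only the structure (submission_score for submitted episodes, small capped credit otherwise) comes from the protocol:

```python
PARTIAL_CREDIT_CAP = 0.1  # illustrative cap, not the environment's actual constant

def final_score(submitted, submission_score, progress_score):
    # Submitted episodes: final_score is exactly the strict submission_score.
    if submitted:
        return submission_score
    # Non-submitted episodes: only small, capped partial credit for progress.
    return min(PARTIAL_CREDIT_CAP, PARTIAL_CREDIT_CAP * progress_score)
```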

## Budget And Step Interpretation

MolForge has both:

- max_steps: the project decision deadline;
- remaining_budget: the assay/resource budget.

The agent must finish inside both limits.

Budget effects:

- assays subtract from remaining_budget;
- over-budget assays are invalid;
- budget exhaustion terminates the episode;
- valid submissions receive a transition-level budget_efficiency reward;
- formal submission_score receives a small bonus for unused budget only when the submission has required evidence, passes constraints, and beats baseline;
- curriculum near-miss reward includes budget_score, but missed nomination is penalized if the evidence package was ready and the model failed to submit.
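The first three budget rules above amount to simple bookkeeping, sketched here as a hypothetical helper (not the environment's actual implementation):

```python
def apply_assay(remaining_budget, assay_cost):
    """Apply one assay against the remaining budget.

    Returns (new_budget, valid, terminated). A hedged sketch of the
    budget rules above, not MolForge's actual implementation.
    """
    if assay_cost > remaining_budget:
        # Over-budget assays are invalid; the budget is unchanged.
        return remaining_budget, False, False
    new_budget = remaining_budget - assay_cost
    # Budget exhaustion terminates the episode.
    return new_budget, True, new_budget == 0
```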

Step effects:

- reaching max_steps without submission ends the episode;
- there is a step-limit penalty;
- no extra score is given merely for fewer steps;
- faster is better only if the candidate is supported by evidence and budget is preserved.
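The step rules above can be sketched the same way; the penalty value is an illustrative assumption, and the key point is that the only step-related signal is the penalty for hitting the deadline, never a bonus for finishing early:

```python
def step_outcome(step, max_steps, submitted, step_limit_penalty=-1.0):
    """Terminal bookkeeping for the step limit (illustrative penalty value).

    Returns (terminated, step_reward). No reward is given merely for
    using fewer steps.
    """
    if submitted:
        return True, 0.0                 # episode ends at submit, no step bonus
    if step >= max_steps:
        return True, step_limit_penalty  # deadline reached without submitting
    return False, 0.0
```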

## Recommended Comparison Table

For the README/demo, compare:

| Model | Reward mode | Submit rate | Avg final_score | Avg submission_score | Avg evidence_score | Avg budget_score | Veto rate |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Base model | assay_gated | low | low | low | low/medium | variable | high |
| SFT v4 | assay_gated | better | better | better | better | variable | lower |
| SFT v4 + RL | assay_gated | best | best | best | high | healthy | low |

For training plots, show the curriculum reward increasing, but always pair it with the strict submission_score before and after training so the improvement is credible.