# MolForge Evaluation Protocol
MolForge provides two reward settings for two different purposes: a generous curriculum mode for training and a strict mode for judge-facing evaluation.
## 1. Training / RL Warmup
Use curriculum mode:
```bash
export MOLFORGE_REWARD_MODE=curriculum
export MOLFORGE_TRAINING_RANDOMIZATION=1
```
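These are ordinary environment variables read by the environment process at startup. A minimal sketch of the consuming side, assuming plain `os.environ` lookups; the fallback default shown is an assumption, not a confirmed value:

```python
import os

# Sketch: how the env process might read these settings at startup.
# The fallback default here is an assumption, not a confirmed value.
reward_mode = os.environ.get("MOLFORGE_REWARD_MODE", "assay_gated")
training_randomization = os.environ.get("MOLFORGE_TRAINING_RANDOMIZATION") == "1"
```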
Track (see the aggregation sketch below):
- mean episode reward;
- valid JSON/action rate;
- policy veto rate;
- evidence score;
- number of oracle calls;
- budget remaining at submit;
- submit rate;
- missed-nomination rate;
- strict terminal `submission_score`.
The curriculum reward is deliberately generous because its purpose is learning:
it rewards useful evidence collection and evidence-supported submit timing.
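A minimal sketch of aggregating these metrics over a batch of rollouts; every field name below is a hypothetical stand-in for whatever the env actually reports per episode:

```python
from statistics import mean

def summarize_rollouts(episodes):
    """Aggregate curriculum-phase training metrics over episode dicts.

    All field names are hypothetical; map them to the env's real outputs.
    """
    return {
        "mean_episode_reward": mean(e["episode_reward"] for e in episodes),
        "valid_action_rate": mean(e["valid_actions"] / max(e["total_actions"], 1)
                                  for e in episodes),
        "policy_veto_rate": mean(e["veto_count"] / max(e["total_actions"], 1)
                                 for e in episodes),
        "mean_evidence_score": mean(e["evidence_score"] for e in episodes),
        "mean_oracle_calls": mean(e["oracle_calls"] for e in episodes),
        "submit_rate": mean(1.0 if e["submitted"] else 0.0 for e in episodes),
        "missed_nomination_rate": mean(1.0 if e["missed_nomination"] else 0.0
                                       for e in episodes),
        "mean_submission_score": mean(e["submission_score"] for e in episodes),
    }
```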
## 2. Judge-Facing Evaluation
Use strict/default mode:
```bash
unset MOLFORGE_TRAINING_RANDOMIZATION
export MOLFORGE_REWARD_MODE=assay_gated
```
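The same configuration can be set programmatically from an evaluation script, as long as it happens before the environment process is spawned:

```python
import os

# Strict, judge-facing configuration, equivalent to the shell commands above.
os.environ.pop("MOLFORGE_TRAINING_RANDOMIZATION", None)  # unset if present
os.environ["MOLFORGE_REWARD_MODE"] = "assay_gated"
```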
Report (see the report-writing sketch below):
- `average_submission_score`;
- `average_final_score`;
- per-task `final_score`;
- per-task `submission_score`;
- `candidate_score`;
- `progress_score`;
- `constraint_margin_score`;
- `evidence_score`;
- `coordination_score`;
- `budget_score`;
- submitted vs not submitted;
- invalid action count;
- policy veto count.
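A sketch of turning per-task results into the report above; `REPORT_FIELDS`, the per-task dict layout, and the output path are assumptions:

```python
import json

# Hypothetical per-task fields, mirroring the report list above.
REPORT_FIELDS = [
    "final_score", "submission_score", "candidate_score", "progress_score",
    "constraint_margin_score", "evidence_score", "coordination_score",
    "budget_score", "submitted", "invalid_action_count", "policy_veto_count",
]

def write_report(per_task_results, path="eval_report.json"):
    """Write per-task rows plus the two headline averages to a JSON file."""
    rows = [{k: r.get(k) for k in REPORT_FIELDS} for r in per_task_results]
    n = max(len(rows), 1)
    summary = {
        "average_final_score": sum(r["final_score"] or 0 for r in rows) / n,
        "average_submission_score": sum(r["submission_score"] or 0 for r in rows) / n,
    }
    with open(path, "w") as f:
        json.dump({"summary": summary, "per_task": rows}, f, indent=2)
```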
The official score is not a minimum-step count. Real drug discovery does not
reward the fastest project if it skips necessary evidence. Instead, MolForge
rewards finishing within the available budget and decision horizon.
`final_score` is the single scalar to optimize and to headline. It equals
`submission_score` for submitted episodes and gives only small, capped partial
credit to non-submitted episodes. `progress_score` is useful for debugging but
is not a substitute for `final_score` or `submission_score`: it is capped when
constraints fail, when the hard trap scenario is not restarted, or when the
model loops through repeated assays and vetoes.
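As a sketch, the rule above can be read as the following scoring function; the cap value and field names are assumptions, not the environment's actual constants:

```python
PARTIAL_CREDIT_CAP = 0.2  # assumption: illustrative cap, not the real constant

def final_score(episode):
    """Headline scalar as described above (field names are assumptions)."""
    if episode["submitted"]:
        return episode["submission_score"]
    # Non-submitted episodes earn only small, capped partial credit.
    return min(episode["progress_score"], PARTIAL_CREDIT_CAP)
```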
## Budget and Step Interpretation
MolForge has both:
- `max_steps`: the project decision deadline;
- `remaining_budget`: the assay/resource budget.
The agent must finish inside both limits.
Budget effects (see the bonus-gating sketch below):
- assays subtract from `remaining_budget`;
- over-budget assays are invalid;
- budget exhaustion terminates the episode;
- valid submissions receive a transition-level `budget_efficiency` reward;
- formal `submission_score` receives a small bonus for unused budget only when
the submission has required evidence, passes constraints, and beats baseline;
- curriculum near-miss reward includes `budget_score`, but missed nomination is
penalized if the evidence package was ready and the model failed to submit.
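A sketch of the gating on the unused-budget bonus; the bonus weight and all field names are illustrative assumptions:

```python
def unused_budget_bonus(sub, weight=0.05):
    """Small bonus on submission_score for unused budget, gated as above.

    `sub` is a hypothetical submission dict; the 0.05 weight is illustrative.
    """
    qualifies = (sub["has_required_evidence"]
                 and sub["passes_constraints"]
                 and sub["beats_baseline"])
    if not qualifies:
        return 0.0
    return weight * sub["remaining_budget"] / sub["initial_budget"]
```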
Step effects (see the termination sketch below):
- reaching `max_steps` without submission ends the episode;
- there is a step-limit penalty;
- no extra score is given merely for fewer steps;
- faster is better only if the candidate is supported by evidence and budget is
preserved.
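A sketch of the dual-limit termination logic implied by the two lists above; the field names are assumptions about the env's state:

```python
def check_termination(state):
    """Return (done, reason) under the budget and step limits described above."""
    if state["remaining_budget"] <= 0:
        return True, "budget_exhausted"
    if state["submitted"]:
        return True, "submitted"
    if state["step"] >= state["max_steps"]:
        return True, "step_limit"  # ends the episode; incurs the step-limit penalty
    return False, ""
```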
## Recommended Comparison Table
For the README/demo, compare:
| Model | Reward mode | Submit rate | Avg final_score | Avg submission_score | Avg evidence_score | Avg budget_score | Veto rate |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Base model | assay_gated | low | low | low | low/medium | variable | high |
| SFT v4 | assay_gated | better | better | better | better | variable | lower |
| SFT v4 + RL | assay_gated | best | best | best | high | healthy | low |
For training plots, show curriculum reward increasing, but always pair it with
strict `submission_score` before/after so the improvement is credible.