Spaces:
Running
Running
| # MolForge Evaluation Protocol | |
| Use two reward settings for different purposes. | |
| ## 1. Training / RL Warmup | |
| Use curriculum mode: | |
| ```bash | |
| MOLFORGE_REWARD_MODE=curriculum | |
| MOLFORGE_TRAINING_RANDOMIZATION=1 | |
| ``` | |
| Track: | |
| - mean episode reward; | |
| - valid JSON/action rate; | |
| - policy veto rate; | |
| - evidence score; | |
| - number of oracle calls; | |
| - budget remaining at submit; | |
| - submit rate; | |
| - missed-nomination rate; | |
| - strict terminal `submission_score`. | |
| Curriculum reward is allowed to be generous because its purpose is learning. | |
| It rewards useful evidence collection and evidence-supported submit timing. | |
| ## 2. Judge-Facing Evaluation | |
| Use strict/default mode: | |
| ```bash | |
| unset MOLFORGE_TRAINING_RANDOMIZATION | |
| export MOLFORGE_REWARD_MODE=assay_gated | |
| ``` | |
| Report: | |
| - `average_submission_score`; | |
| - `average_final_score`; | |
| - per-task `final_score`; | |
| - per-task `submission_score`; | |
| - `candidate_score`; | |
| - `progress_score`; | |
| - `constraint_margin_score`; | |
| - `evidence_score`; | |
| - `coordination_score`; | |
| - `budget_score`; | |
| - submitted vs not submitted; | |
| - invalid action count; | |
| - policy veto count. | |
| The official score should not be minimum number of steps. Real drug discovery | |
| does not reward the fastest project if it skips necessary evidence. Instead, | |
| MolForge rewards finishing within the available budget and decision horizon. | |
| `final_score` is the single scalar to optimize and headline. It equals | |
| `submission_score` for submitted episodes and gives only small capped partial | |
| credit to non-submitted episodes. `progress_score` is useful for debugging but | |
| is not a substitute for `final_score` or `submission_score`: it is capped when | |
| constraints fail, when the hard trap scenario is not restarted, or when the | |
| model loops through repeated assays and vetoes. | |
| ## Budget And Step Interpretation | |
| MolForge has both: | |
| - `max_steps`: the project decision deadline; | |
| - `remaining_budget`: the assay/resource budget. | |
| The agent must finish inside both limits. | |
| Budget effects: | |
| - assays subtract from `remaining_budget`; | |
| - over-budget assays are invalid; | |
| - budget exhaustion terminates the episode; | |
| - valid submissions receive a transition-level `budget_efficiency` reward; | |
| - formal `submission_score` receives a small bonus for unused budget only when | |
| the submission has required evidence, passes constraints, and beats baseline; | |
| - curriculum near-miss reward includes `budget_score`, but missed nomination is | |
| penalized if the evidence package was ready and the model failed to submit. | |
| Step effects: | |
| - reaching `max_steps` without submission ends the episode; | |
| - there is a step-limit penalty; | |
| - no extra score is given merely for fewer steps; | |
| - faster is better only if the candidate is supported by evidence and budget is | |
| preserved. | |
| ## Recommended Comparison Table | |
| For the README/demo, compare: | |
| | Model | Reward mode | Submit rate | Avg final_score | Avg submission_score | Avg evidence_score | Avg budget_score | Veto rate | | |
| | --- | --- | ---: | ---: | ---: | ---: | ---: | | |
| | Base model | assay_gated | low | low | low | low/medium | variable | high | | |
| | SFT v4 | assay_gated | better | better | better | better | variable | lower | | |
| | SFT v4 + RL | assay_gated | best | best | best | high | healthy | low | | |
| For training plots, show curriculum reward increasing, but always pair it with | |
| strict `submission_score` before/after so the improvement is credible. | |