# MolForge Real-World Workflow Mapping
MolForge should feel like a compressed medicinal-chemistry lead-optimization program, not a one-shot molecule generator.
The real-world pattern is:
- A team starts with a scaffold.
- Chemists propose edits based on structure-activity reasoning.
- Assay teams spend limited budget to measure uncertain properties.
- Safety and process specialists veto risky or impractical candidates.
- The team decides whether to keep optimizing, restart, or nominate a lead.
- Success depends on evidence, not only on the final molecule.
This is exactly the shape MolForge should copy.
## Real-World Loop
### 1. Design Hypothesis
Real teams do not mutate molecules randomly. A medicinal chemist proposes a change with an intended purpose:
- improve potency;
- reduce toxicity;
- improve solubility or ADME;
- simplify synthesis;
- escape a known scaffold liability.
MolForge equivalent:
- `edit`
- `rationale`
- `expected_effects`
- `evidence`
The model should not only choose a fragment. It should say what scientific pressure that edit is meant to address.
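A minimal sketch of what one structured design action could look like, assuming a Python dataclass representation; the field names follow the schema above, and the example values are invented:

```python
from dataclasses import dataclass, field

@dataclass
class DesignAction:
    """One structured design proposal, following the
    edit/rationale/expected_effects/evidence schema above."""
    edit: str               # e.g. "swap R1 phenyl for 2-pyridyl"
    rationale: str          # the scientific pressure the edit addresses
    expected_effects: dict  # predicted direction per property
    evidence: list = field(default_factory=list)  # citations, prior assay IDs

example = DesignAction(
    edit="swap R1 phenyl for 2-pyridyl",
    rationale="reduce lipophilicity to improve solubility",
    expected_effects={"logP": "down", "potency": "unchanged"},
    evidence=["search_literature:result-3"],
)
```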
### 2. Cheap Triage Before Expensive Assays
Real projects usually run cheap computational or low-cost screens before expensive experiments.
MolForge equivalent:
- `evaluate_properties`
- `search_literature`
- `estimate_synthesizability`
- `dock_target`
These should be useful but imperfect. They help the model decide where to spend more serious assay budget.
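A sketch of that triage ordering, assuming a hypothetical `tools` wrapper exposing the surrogate calls above; the return shapes and thresholds are illustrative placeholders, not the environment's tuned values:

```python
def triage(candidate, tools):
    # Run the cheap surrogate tools first; they are imperfect but nearly free.
    cheap = {
        "props": tools.evaluate_properties(candidate),
        "synth": tools.estimate_synthesizability(candidate),
        "dock": tools.dock_target(candidate),
    }
    promising = (
        cheap["props"].get("qed", 0.0) > 0.5  # drug-likeness looks plausible
        and cheap["synth"] < 6.0              # SA-like score, lower = easier
        and cheap["dock"] < -7.0              # rough docking-score cutoff
    )
    # Spend real assay budget only on candidates that clear cheap triage.
    return cheap, promising
```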
### 3. Expensive Evidence Gates
Real lead candidates require stronger evidence before nomination:
- potency evidence;
- toxicity/safety evidence;
- synthesis or route feasibility evidence;
- sometimes post-mutation or resistance-panel evidence.
MolForge equivalent:
- `assay_toxicity`
- `dock_target`
- `estimate_synthesizability`
- hard evidence requirements in `submit`
- `evidence_score`
This is why `submission_score` should remain strict. A molecule that looks good but was never properly assayed is not a real lead candidate.
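One way such a gate could be enforced, sketched in Python; the assay-log record shape and the channel names are assumptions for illustration:

```python
# Assumed record shape: each assay-log entry carries a "channel" label.
REQUIRED_EVIDENCE = {"potency", "toxicity", "synthesizability"}

def evidence_gate(assay_log: list[dict]) -> bool:
    covered = {entry["channel"] for entry in assay_log}
    # No measured record for a required channel means no valid nomination.
    return REQUIRED_EVIDENCE <= covered
```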
### 4. Cross-Functional Decision Board
Real projects are not controlled by one chemist. A lead-optimization meeting usually includes:
- medicinal chemistry;
- assay biology;
- toxicology/safety;
- process chemistry or manufacturability;
- project leadership.
MolForge equivalent:
- `lead_chemist`
- `assay_planner`
- `toxicologist`
- `process_chemist`
- governance messages;
- hard vetoes;
- `coordination_score`
This is one of MolForge's strongest innovations as an environment: the agent is not just optimizing a molecule; it is coordinating a scientific team.
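A minimal sketch of the veto logic, assuming each specialist exposes a hypothetical `review()` method returning "approve" or "veto"; any single hard veto blocks the proposal:

```python
def board_review(proposal, specialists: dict):
    verdicts = {name: spec.review(proposal) for name, spec in specialists.items()}
    vetoes = sorted(name for name, verdict in verdicts.items() if verdict == "veto")
    # A single hard veto from any role blocks the proposal outright.
    return {"approved": not vetoes, "vetoed_by": vetoes, "verdicts": verdicts}
```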
### 5. Stop, Submit, or Restart
Real teams must decide when to stop spending money. Sometimes the right answer is to abandon a scaffold early because the series is a trap.
MolForge equivalent:
- `submit`
- `restart`
- budget limits;
- max decision horizon;
- hard scenario target shift;
- sunk-cost trap in `level_2_hard`
This lets the environment test project judgment, not just local molecule edits.
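An illustrative stop/submit/restart policy; the `state` attributes and thresholds here are assumptions for illustration, not MolForge's actual internals:

```python
def project_decision(state) -> str:
    if state.evidence_complete and state.best_estimate >= state.target:
        return "submit"    # nomination package is ready
    if state.scaffold_liability and state.budget_left >= state.restart_cost:
        return "restart"   # abandon a trapped series before sunk cost grows
    if state.budget_left <= 0 or state.step >= state.max_steps:
        return "submit"    # forced terminal decision at the horizon
    return "continue"      # keep designing and triaging
```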
## How To Use This In MolForge
### Keep Two Scores
Use two kinds of reward:
- **Training reward**: helps the model learn the workflow.
- **Formal submission score**: measures whether the agent actually nominated a valid candidate.
That means:
- `MOLFORGE_REWARD_MODE=curriculum` for early RL;
- default `assay_gated` mode for final reporting;
- `submission_score` stays `0.0` without a formal submit.
This mirrors the real world: a project can make progress without nominating a lead, but it cannot claim lead success without a nomination package.
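A minimal sketch of keeping the two scores separate, driven by the `MOLFORGE_REWARD_MODE` variable above; the episode fields used here are assumptions for illustration:

```python
import os

MODE = os.environ.get("MOLFORGE_REWARD_MODE", "assay_gated")

def score_episode(episode):
    # Shaped training reward only exists in curriculum mode.
    training_reward = sum(episode.shaped_rewards) if MODE == "curriculum" else 0.0
    # submission_score stays 0.0 unless a formal submit passed the evidence gate.
    submitted_ok = episode.submitted and episode.evidence_gate_passed
    submission_score = episode.final_score if submitted_ok else 0.0
    return training_reward, submission_score
```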
### Make Rewards Stage-Gated
A good real-world reward should not be a single monolithic final number.
Useful reward components:
- valid action/schema;
- useful design edit;
- useful first assay;
- evidence coverage;
- safety improvement;
- synthesis improvement;
- avoiding repeated assays;
- avoiding vetoed decisions;
- submitting only with enough support;
- restarting from a bad scaffold when appropriate.
This gives RL a learnable path while preserving strict final success.
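One possible decomposition, sketched in Python with one boolean flag per component above; the weights are illustrative, not the environment's actual values:

```python
COMPONENTS = {
    "valid_action": 0.05,
    "useful_edit": 0.10,
    "useful_first_assay": 0.10,
    "evidence_coverage": 0.15,
    "safety_improvement": 0.10,
    "synthesis_improvement": 0.10,
    "no_repeated_assay": 0.05,
    "no_vetoed_decision": 0.10,
    "supported_submit": 0.15,
    "appropriate_restart": 0.10,
}

def shaped_reward(flags: dict[str, bool]) -> float:
    # Sum the weights of every stage-gate component the episode achieved.
    return sum(weight for name, weight in COMPONENTS.items() if flags.get(name))
```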
### Make The Demo Story Simple
Judges should understand this in one sentence:
MolForge tests whether an LLM can run a miniature drug-discovery project: design molecules, buy assays, respect safety vetoes, manage budget, and nominate a candidate only when the evidence package is strong enough.
Then show:
- baseline model repeats invalid or vetoed actions;
- SFT model learns the action language;
- RL model learns better evidence and submit timing;
- final candidate report card shows potency, toxicity, synthesis, evidence, budget, and coordination.
## What We Already Have
MolForge already contains most of this real-world structure:
- molecule slot edits;
- RDKit/TDC-backed surrogate oracle path;
- limited assay budget;
- cheap and expensive tools;
- hidden true properties;
- visible assay estimates;
- toxicity and synthesis constraints;
- multi-agent specialist governance;
- safety vetoes;
- restart action;
- hard target-shift scenario;
- decomposed report card;
- strict terminal `submission_score`;
- curriculum reward mode for early RL.
## What To Strengthen Next
The next useful additions should make the environment feel even more like a real project:
- **Assay uncertainty**: repeated assays should narrow confidence intervals, but cost budget (see the sketch after this list).
- **Stage labels**: mark states as `design`, `triage`, `evidence_package`, `nomination`, or `no-go`.
- **No-go decisions**: reward the model for stopping or restarting when the evidence says the series is unsafe or infeasible.
- **Portfolio-style report**: at terminal time, show why the candidate was nominated or rejected.
- **Holdout variants**: randomize scaffold starts and budgets so the model cannot memorize only three paths.
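A minimal sketch of that assay-uncertainty mechanic, assuming each assay is a noisy Gaussian read of the hidden true value so the standard error shrinks as 1/sqrt(n) with repeats; the noise level is a placeholder:

```python
import math

def assay_estimate(readings: list[float], assay_sigma: float = 0.5):
    n = len(readings)
    mean = sum(readings) / n
    stderr = assay_sigma / math.sqrt(n)  # repeats narrow the interval
    return mean, (mean - 1.96 * stderr, mean + 1.96 * stderr)

print(assay_estimate([6.8]))       # one read: wide 95% interval
print(assay_estimate([6.8, 7.1]))  # a paid repeat: narrower interval
```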
For the hackathon, the best near-term path is:

1. SFT v4 for action/workflow competence;
2. curriculum RL for observable reward improvement;
3. strict `assay_gated` evaluation for final `submission_score`;
4. README/demo framed as a real drug-discovery decision board.