DrugEnv
Nine out of ten drugs that enter clinical trials never make it out. The expensive part isn't the chemistry. It isn't even the trial itself. The expensive part is choosing the wrong protein to attack in the first place. That decision is made years earlier, often on a Tuesday afternoon, often by someone who genuinely believed they had picked the right one.
This is the decision we wanted to teach a language model to make.
Not to write about it. Not to summarise it. To make it.
The Gap Nobody Talks About
Ask a modern LLM why a drug program against CD33 in acute myeloid leukaemia is risky and you'll get a clean, well-cited paragraph. The model knows. It has read every review article on the planet. But knowing is not the same as deciding. A scientist sitting in a meeting room with a finite budget and a dozen candidate targets cannot afford to recite a textbook. She has to pick what to investigate next, when to stop, and when to walk away.
That is a different skill entirely. And it is the skill current LLMs are quietly bad at.
DrugEnv exists to close that gap. We turned the validation of a drug target into a game with rules, costs, evidence, and consequences, and then we trained an LLM to play it.
How a DrugEnv Episode Feels
You are dropped into a problem like this: "Validate KRAS_G12C as a target in pancreatic cancer. You have one hundred research credits."
You see the gene, the disease, your remaining budget, and an empty dossier. What you do not see is the truth. You don't know whether the target is actually druggable, whether it has hidden toxicity, whether somebody has already tried it and failed quietly in a Phase II trial three years ago. The simulator knows all of this. You will only learn what you choose to learn, and every question costs you credits you cannot get back.
This is the entire point. Real drug discovery is not an exam with a visible question paper. It is an investigation in fog.
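The split between what the simulator knows and what the agent sees can be sketched as a pair of records. This is a minimal sketch; the field names and budget default are illustrative, not the environment's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class HiddenTruth:
    """Ground truth the simulator holds but never shows the agent."""
    druggable: bool
    toxic: bool
    failed_prior_trial: bool
    correct_call: str            # "go" or "no_go"

@dataclass
class Observation:
    """Everything the agent sees at the start of an episode."""
    gene: str
    disease: str
    credits_remaining: int
    dossier: list = field(default_factory=list)   # evidence gathered so far

def reset(gene: str, disease: str, budget: int = 100) -> Observation:
    """Start an episode: the observation carries no ground truth."""
    return Observation(gene=gene, disease=disease, credits_remaining=budget)
```

The design point is that `HiddenTruth` never appears in the observation; the only way information crosses the boundary is through tool calls that cost credits.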
The Tools at the Agent's Disposal
We did not give the agent a toy interface. We gave it the kind of toolkit a real computational biologist would reach for, organised the way a real workflow flows.
When the agent wants to understand whether the target is even expressed where the disease lives, it has the omics layer. query_expression returns tissue-level expression, differential_expression compares tumour against normal, pathway_enrichment shows which biological pathways light up, and coexpression_network finds what moves with the target.
When the question turns to can we actually drug this thing, the structural layer kicks in. protein_structure_lookup retrieves known 3D structures, binding_site_analysis looks for a pocket a small molecule can sit in, protein_interaction_network maps partners, and druggability_screen returns an overall druggability verdict.
The clinical and safety layer is where promising targets often die. clinical_trial_lookup reveals what has already been tried, toxicity_panel predicts safety liabilities, off_target_screen warns of collateral damage to other proteins, and patient_stratification answers which subgroup of patients would actually benefit.
The literature and synthesis layer (literature_search, evidence_synthesis, competitor_landscape) is the cheap one. A smart agent uses it early, before burning budget on expensive experiments.
The experimental layer is the most expensive. crispr_knockout asks whether removing the gene changes the disease, biomarker_correlation ties the target to a measurable signal, in_vitro_assay provides cell-line evidence, and in_vivo_model provides animal-model evidence.
And finally the meta layer. flag_red_flag lets the agent explicitly note a concern, request_expert_review lets it escalate, and the terminal action submit_validation_report is where the agent must commit: go or no_go, with a confidence score it cannot take back.
Every tool costs credits. Submitting early is a guess. Submitting late wastes the budget. The skill is the order in which questions are asked.
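A registry with per-call costs is enough to make that trade-off concrete. The layer grouping below mirrors the tools described above, but every credit cost is an assumption for this sketch, not the environment's real pricing.

```python
# Illustrative tool registry: layers match the write-up above,
# credit costs are assumed for the sketch.
TOOLS = {
    "literature_search":        {"layer": "literature",   "cost": 2},
    "evidence_synthesis":       {"layer": "literature",   "cost": 2},
    "query_expression":         {"layer": "omics",        "cost": 5},
    "druggability_screen":      {"layer": "structural",   "cost": 8},
    "toxicity_panel":           {"layer": "clinical",     "cost": 8},
    "crispr_knockout":          {"layer": "experimental", "cost": 15},
    "in_vivo_model":            {"layer": "experimental", "cost": 25},
    "submit_validation_report": {"layer": "meta",         "cost": 0},
}

def call_tool(name: str, credits: int) -> int:
    """Deduct a tool's cost, refusing the call if the budget can't cover it."""
    cost = TOOLS[name]["cost"]
    if cost > credits:
        raise RuntimeError(f"not enough credits for {name}")
    return credits - cost
```

The gradient from cheap literature to expensive in-vivo work is what makes ordering a skill: the agent that reads first and experiments last simply gets more questions answered per episode.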
The Reward System, Designed Like a Suspicious Examiner
A single scalar reward would be hacked within an afternoon. A model would discover that spamming literature_search is cheap and gives small positive reward, and it would happily do that forever. So the reward function is not one number. It is a panel of independent judges, each watching for a different thing.
At every step, the agent gets a small positive reward when it uncovers a genuinely new dimension of evidence such as expression, druggability, safety, or clinical history. Repeating a tool it has already used pays nothing. Calling a tool that has no relevance to the target type costs it. Investigating in a coherent order, like checking whether the target is expressed before splurging on an in-vivo model, is rewarded. Doing things in absurd order is penalised because it signals the agent is acting without a hypothesis. Saving credits is rewarded too, because real scientists do not have infinite money.
The terminal reward is where the real test lives. When the agent finally submits its go or no_go, the environment compares the answer against the hidden ground truth. A correct decision is rewarded. A wrong one is penalised. But here is the part that matters: confidence is separately graded. A confident correct answer earns more than a hesitant correct answer. A confident wrong answer is punished far more harshly than a hesitant wrong one. We are not just teaching the model to be right. We are teaching it to know when it is right.
There are also hard penalties for the obvious failures: submitting without enough evidence to support the decision, running out of budget without submitting at all, or quietly never committing. The agent cannot survive by being clever in only one dimension. It has to investigate, decide, and calibrate, all three.
This is the part we are most proud of. The reward function does not reward sounding scientific. It rewards being scientific.
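The two halves of that scoring logic, step-level novelty and terminal calibration, can be sketched as follows. All constants and the evidence-dimension mapping are illustrative; the real environment uses its own weights and more judges than these two.

```python
# Which evidence dimension each tool contributes (illustrative subset).
EVIDENCE_DIM = {
    "query_expression":      "expression",
    "druggability_screen":   "druggability",
    "toxicity_panel":        "safety",
    "clinical_trial_lookup": "clinical_history",
}

def step_reward(tool: str, dims_seen: set) -> float:
    """A new evidence dimension pays a little; repeats pay nothing;
    irrelevant tools cost a little."""
    dim = EVIDENCE_DIM.get(tool)
    if dim is None:
        return -0.1
    if dim in dims_seen:
        return 0.0
    dims_seen.add(dim)
    return 0.2

def terminal_reward(decision: str, truth: str, confidence: float) -> float:
    """Correctness and calibration graded together: confident-and-wrong
    is punished far more harshly than hesitant-and-wrong."""
    if decision == truth:
        return 1.0 + confidence
    return -1.0 - 2.0 * confidence
```

Note the asymmetry: confidence scales the terminal reward in both directions, but the penalty slope for a wrong call is steeper than the bonus slope for a right one, so the cheapest way to hedge is honest uncertainty, not bravado.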
The Traps We Built On Purpose
A few of the curated scenarios are designed to expose specific failure modes.
EGFR in EGFR-mutant non-small-cell lung cancer is the honest case. Clear expression, clear druggability, clear go. An agent that gets this wrong has not learned anything yet.
CD33 in acute myeloid leukaemia is the trap. Expression is sky high. A naive agent looks at expression alone, declares victory, and submits go with high confidence. Only an agent that bothers with toxicity_panel and off_target_screen discovers why every CD33 program in history has struggled. It is the right protein in the wrong context. Correct answer: no_go.
KRAS_G12C in pancreatic cancer is the temporal trap. Until very recently, this target was considered undruggable, and any agent leaning on older priors will say so. Recent literature and clinical trial data tell a different story. An agent that does not check what year it is gets the historically correct but currently wrong answer.
TP53 is the fame trap. It is one of the most important genes in cancer biology. It is also a terrible direct drug target. An agent that confuses "important" with "druggable" walks straight into a no_go.
These are not memorisation tests. They are tests of whether the agent has learned to investigate before believing.
How We Trained It
The base model is Qwen2.5-3B-Instruct. The algorithm is GRPO through Hugging Face TRL. Unsloth handles memory and rollout speed. The environment itself acts as the verifier. There is no learned reward model, no fragile RLHF preference data, just a simulator that knows the truth and a reward function that has opinions about how the agent got there.
There is one wrinkle. Long-horizon RL from a cold start rarely converges on a small model. A 3B parameter network does not stumble into a coherent fifteen-step investigation often enough to learn anything. So we run a brief supervised warm start on oracle trajectories first. Just enough for the model to learn the shape of a real investigation. Then GRPO takes over and improves the policy through experience. Curriculum matters too. Easy scenarios first, traps later, very hard cases only once the reward stops sitting at zero.
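The curriculum gating described above can be sketched as a small scheduler. The step thresholds and the zero-reward gate here are assumptions for illustration, not the schedule we actually ran.

```python
def curriculum_tier(step: int, recent_mean_reward: float) -> str:
    """Pick the scenario pool for the next rollout batch.
    Thresholds are illustrative, not the real schedule."""
    if step < 500:
        return "easy"            # honest cases like EGFR first
    if recent_mean_reward <= 0.0:
        return "easy"            # don't advance while reward sits at zero
    if step < 2000:
        return "trap"            # CD33-style traps next
    return "hard"                # temporal and fame traps last
```

The gate matters more than the thresholds: without the reward check, a struggling policy gets promoted into traps it cannot yet solve and the gradient signal collapses back to zero.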
What Changed After Training
Before training, the base model behaved exactly like an overeager intern. On the CD33 scenario, it called query_expression once, saw high expression, and submitted go with confidence zero point nine. Wrong call. Overconfident. Most of its budget unused.
After GRPO, the same scenario unfolded differently. The model called query_expression first, saw the expression signal, was not satisfied, called druggability_screen, then off_target_screen, then toxicity_panel. It saw the safety red flag. It submitted no_go with confidence zero point seven. Right answer, appropriately uncertain, with an evidence dossier that actually supported the decision.
This is the shift we wanted. Not from wrong to right. From confidently wrong to cautiously right. That is the difference between an LLM that talks about science and one that practises it.
Why This Matters Beyond Drug Discovery
DrugEnv is a simulator. The targets are curated. The biology is simplified. We are not claiming to have built a real drug discovery pipeline.
What we are claiming is something more interesting. The pattern underneath this environment, investigate under a budget and commit to a decision, is the same pattern in clinical diagnosis, security incident triage, financial due diligence, and a dozen other domains where current LLMs are eloquent but unreliable. The skill we trained here is portable. The environment we built here is a template.
Drug discovery was the right place to start because the verification is crisp, the ground truth exists, and the failure modes are intellectually honest. But the deeper bet is this. The next leap in language models will not come from larger pre-training corpora. It will come from environments that reward how a model decides, not just what it says.
DrugEnv is one such environment. We hope it is the first of many.
Links
- Environment Space: https://huggingface.co/spaces/anugrahteesdollar/drugenv
- Trainer Space: https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
- Training script: training/training_script.py
- Demo UI: space/env/gradio_demo.py
- Full scenario list and code: README.md
Thank you for reading. We know you have many of these to get through.