# DrugEnv

Nine out of ten drugs that enter clinical trials never make it out. The expensive part isn't the chemistry. It isn't even the trial itself. The expensive part is choosing the **wrong protein to attack** in the first place. That decision is made years earlier, often on a Tuesday afternoon, often by someone who genuinely believed they had picked the right one.

This is the decision we wanted to teach a language model to make.

Not to write about it. Not to summarise it. *To make it.*

## The Gap Nobody Talks About

Ask a modern LLM why a drug program against `CD33` in acute myeloid leukaemia is risky and you'll get a clean, well-cited paragraph. The model knows. It has read every review article on the planet. But knowing is not the same as deciding. A scientist sitting in a meeting room with a finite budget and a dozen candidate targets cannot afford to recite a textbook. She has to pick *what to investigate next*, *when to stop*, and *when to walk away*.

That is a different skill entirely. And it is the skill current LLMs are quietly bad at.

DrugEnv exists to close that gap. We turned the validation of a drug target into a game with rules, costs, evidence, and consequences, and then we trained an LLM to play it.

## How a DrugEnv Episode Feels

You are dropped into a problem like this: *"Validate `KRAS_G12C` as a target in pancreatic cancer. You have one hundred research credits."*

You see the gene, the disease, your remaining budget, and an empty dossier. What you do not see is the truth. You don't know whether the target is actually druggable, whether it has hidden toxicity, whether somebody has already tried it and failed quietly in a Phase II trial three years ago. The simulator knows all of this. You will only learn what you choose to learn, and every question costs you credits you cannot get back.

This is the entire point. Real drug discovery is not an exam with a visible question paper. It is an investigation in fog.
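
To make the fog concrete, here is roughly what the agent's opening state might look like. This is a sketch: the field names are our guesses, not the environment's exact schema.

```python
# Hypothetical opening state for the KRAS_G12C episode above.
# Field names are illustrative guesses, not DrugEnv's exact schema.
initial_observation = {
    "target": "KRAS_G12C",
    "disease": "pancreatic cancer",
    "credits_remaining": 100,
    "dossier": [],  # evidence gathered so far; empty at the start
}

# The simulator keeps the ground truth on its side of the wall, e.g.
# {"druggable": True, "hidden_toxicity": False, "verdict": "go"},
# and the agent never sees it directly.
```
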
## The Tools at the Agent's Disposal

We did not give the agent a toy interface. We gave it the kind of toolkit a real computational biologist would reach for, organised the way a real workflow unfolds.

When the agent wants to understand whether the target is even *expressed* where the disease lives, it has the omics layer. `query_expression` returns tissue-level expression, `differential_expression` compares tumour against normal, `pathway_enrichment` shows which biological pathways light up, and `coexpression_network` finds what moves with the target.

When the question turns to *can we actually drug this thing*, the structural layer kicks in. `protein_structure_lookup` retrieves known 3D structures, `binding_site_analysis` looks for a pocket a small molecule can sit in, `protein_interaction_network` maps partners, and `druggability_screen` returns an overall druggability verdict.

The clinical and safety layer is where promising targets often die. `clinical_trial_lookup` reveals what has already been tried, `toxicity_panel` predicts safety liabilities, `off_target_screen` warns of collateral damage to other proteins, and `patient_stratification` answers which subgroup of patients would actually benefit.

The literature and synthesis layer (`literature_search`, `evidence_synthesis`, `competitor_landscape`) is the cheap one. A smart agent uses it early, before burning budget on expensive experiments.

The experimental layer is the most expensive. `crispr_knockout` asks whether removing the gene changes the disease, `biomarker_correlation` ties the target to a measurable signal, `in_vitro_assay` provides cell-line evidence, and `in_vivo_model` provides animal-model evidence.

And finally the meta layer. `flag_red_flag` lets the agent explicitly note a concern, `request_expert_review` lets it escalate, and the terminal action `submit_validation_report` is where the agent must commit: `go` or `no_go`, with a confidence score it cannot take back.

Every tool costs credits. Submitting early is a guess. Submitting late wastes the budget. The skill is the *order* in which questions are asked.
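
A minimal sketch of how that pricing might work mechanically. The tool names come from the layers above, but the credit costs, the `step` signature, and the simulator stub are our assumptions, not DrugEnv's actual API.

```python
# Illustrative credit pricing: literature is cheap, in vivo is dear.
TOOL_COSTS = {
    "literature_search": 2,
    "query_expression": 5,
    "druggability_screen": 8,
    "toxicity_panel": 10,
    "in_vivo_model": 25,
}

def run_tool(tool: str, hidden_truth: dict) -> dict:
    # Stand-in for the simulator, which would consult its hidden
    # ground truth here. We return a placeholder evidence record.
    return {"tool": tool, "finding": "simulated evidence"}

def step(state: dict, tool: str) -> dict:
    """Charge the tool's cost, run it, and log the evidence.

    `state` lives on the simulator's side, so it may hold the truth."""
    cost = TOOL_COSTS[tool]
    if state["credits_remaining"] < cost:
        raise RuntimeError("out of budget before submitting")
    state["credits_remaining"] -= cost
    result = run_tool(tool, state["hidden_truth"])
    state["dossier"].append(result)
    return result
```
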
## The Reward System, Designed Like a Suspicious Examiner

A single scalar reward would be hacked within an afternoon. A model would discover that spamming `literature_search` is cheap and gives a small positive reward, and it would happily do that forever. So the reward function is not one number. It is a panel of independent judges, each watching for a different thing.

At every step, the agent gets a small positive reward when it uncovers a genuinely *new* dimension of evidence such as expression, druggability, safety, or clinical history. Repeating a tool it has already used pays nothing. Calling a tool that has no relevance to the target type costs it. Investigating in a coherent order, like checking whether the target is expressed before splurging on an in-vivo model, is rewarded. Doing things in an absurd order is penalised, because it signals the agent is acting without a hypothesis. Saving credits is rewarded too, because real scientists do not have infinite money.
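
In code, that step-level shaping might look like the sketch below. The evidence dimensions, weights, and the single ordering rule are illustrative stand-ins for the full panel of judges.

```python
# Which evidence dimension each tool contributes to (illustrative subset).
EVIDENCE_DIMENSION = {
    "query_expression": "expression",
    "druggability_screen": "druggability",
    "toxicity_panel": "safety",
    "clinical_trial_lookup": "clinical_history",
    "in_vivo_model": "in_vivo",
}

def step_reward(state: dict, tool: str) -> float:
    """Shape one step; state["covered_dims"] is a set of dimensions seen."""
    reward = 0.0
    dim = EVIDENCE_DIMENSION.get(tool)
    if dim is None:
        reward -= 0.2       # tool has no relevance to this target type
    elif dim in state["covered_dims"]:
        pass                # repeating covered evidence pays nothing
    else:
        reward += 0.3       # a genuinely new dimension of evidence
        state["covered_dims"].add(dim)
    # Coherence: splurging on an in-vivo model before confirming
    # expression signals an agent acting without a hypothesis.
    if tool == "in_vivo_model" and "expression" not in state["covered_dims"]:
        reward -= 0.5
    return reward
```
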
The terminal reward is where the real test lives. When the agent finally submits its `go` or `no_go`, the environment compares the answer against the hidden ground truth. A correct decision is rewarded. A wrong one is penalised. But here is the part that matters: confidence is *separately* graded. A confident correct answer earns more than a hesitant correct answer. A confident wrong answer is punished far more harshly than a hesitant wrong one. We are not just teaching the model to be right. We are teaching it to *know when it is right*.

There are also hard penalties for the obvious failures: submitting without enough evidence to support the decision, running out of budget without submitting at all, or quietly never committing. The agent cannot survive by being clever in only one dimension. It has to investigate, decide, and calibrate, all three.
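
The terminal grade fits in a few lines. The constants below are ours; what is faithful to the design is the shape, where confidence scales the penalty harder than the prize.

```python
def terminal_reward(decision: str, confidence: float, state: dict) -> float:
    # Illustrative constants; only the asymmetry mirrors the design.
    if len(state["dossier"]) < 3:
        return -1.0                     # submitted on too little evidence
    correct = decision == state["hidden_truth"]["verdict"]
    if correct:
        return 1.0 + confidence         # confident and right pays most
    return -(1.0 + 2.0 * confidence)    # confident and wrong hurts most
```
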
This is the part we are most proud of. The reward function does not reward sounding scientific. It rewards *being* scientific.

## The Traps We Built On Purpose

A few of the curated scenarios are designed to expose specific failure modes.

`EGFR` in EGFR-mutant non-small-cell lung cancer is the honest case. Clear expression, clear druggability, clear `go`. An agent that gets this wrong has not learned anything yet.

`CD33` in acute myeloid leukaemia is the trap. Expression is sky high. A naive agent looks at expression alone, declares victory, and submits `go` with high confidence. Only an agent that bothers with `toxicity_panel` and `off_target_screen` discovers why every CD33 program in history has struggled. It is the right protein in the wrong context. Correct answer: `no_go`.

`KRAS_G12C` in pancreatic cancer is the temporal trap. Until very recently, this target was considered undruggable, and any agent leaning on older priors will say so. Recent literature and clinical trial data tell a different story. An agent that does not check what year it is gets the historically correct but currently wrong answer.

`TP53` is the fame trap. It is one of the most important genes in cancer biology. It is also a terrible direct drug target. An agent that confuses "important" with "druggable" walks straight into a `no_go`.

These are not memorisation tests. They are tests of whether the agent has learned to *investigate before believing*.
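
Under the hood, each curated scenario amounts to a visible brief paired with a hidden verdict. A sketch with invented field names (the full scenario list lives in the repo's `README.md`):

```python
# Illustrative scenario records; field names are our invention.
SCENARIOS = [
    {"target": "EGFR", "disease": "EGFR-mutant NSCLC",
     "hidden": {"verdict": "go", "trap": None}},            # honest case
    {"target": "CD33", "disease": "acute myeloid leukaemia",
     "hidden": {"verdict": "no_go", "trap": "safety"}},     # the trap
    {"target": "KRAS_G12C", "disease": "pancreatic cancer",
     "hidden": {"verdict": "go", "trap": "stale_priors"}},  # temporal trap
    {"target": "TP53", "disease": "pan-cancer",
     "hidden": {"verdict": "no_go", "trap": "fame"}},       # fame trap
]
```
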
## How We Trained It

The base model is Qwen2.5-3B-Instruct. The algorithm is GRPO through Hugging Face TRL. Unsloth handles memory and rollout speed. The environment itself acts as the verifier. There is no learned reward model, no fragile RLHF preference data, just a simulator that knows the truth and a reward function that has opinions about how the agent got there.
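
Stripped of Unsloth's patches and the environment plumbing, the wiring follows TRL's standard GRPO recipe. The reward function below is a stub: in the real setup it replays each completion through the simulator and sums the step shaping and the terminal grade.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Scenario briefs become prompts; "prompt" is the column name TRL expects.
train_dataset = Dataset.from_list([
    {"prompt": "Validate KRAS_G12C as a target in pancreatic cancer. "
               "You have one hundred research credits."},
])

def drugenv_reward(completions, **kwargs):
    # Stub verifier: the real one replays each trajectory through the
    # simulator and returns step shaping plus the terminal grade.
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=drugenv_reward,
    args=GRPOConfig(output_dir="drugenv-grpo", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()
```
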
There is one wrinkle. Long-horizon RL from a cold start rarely converges on a small model. A 3B-parameter network does not stumble into a coherent fifteen-step investigation often enough to learn anything. So we run a brief supervised warm start on oracle trajectories first. Just enough for the model to learn the *shape* of a real investigation. Then GRPO takes over and improves the policy through experience. Curriculum matters too. Easy scenarios first, traps later, very hard cases only once the reward stops sitting at zero.
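
The curriculum logic is simple enough to state in a few lines. The stage labels and the gating condition are our paraphrase of the schedule, not code lifted from `training/training_script.py`:

```python
# Curriculum in spirit: easy first, traps later, hard cases last.
CURRICULUM = [
    ("easy", ["EGFR"]),
    ("traps", ["CD33", "TP53"]),
    ("hard", ["KRAS_G12C"]),
]

def train_stage(targets: list[str]) -> float:
    # Placeholder: run GRPO on these scenarios, return mean episode reward.
    return 0.0

for stage, targets in CURRICULUM:
    if train_stage(targets) <= 0.0:
        break  # don't advance while reward is still sitting at zero
```
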
## What Changed After Training

Before training, the base model behaved exactly like an overeager intern. On the `CD33` scenario, it called `query_expression` once, saw high expression, and submitted `go` with a confidence of 0.9. Wrong call. Overconfident. Most of its budget unused.

After GRPO, the same scenario unfolded differently. The model called `query_expression` first, saw the expression signal, was *not* satisfied, called `druggability_screen`, then `off_target_screen`, then `toxicity_panel`. It saw the safety red flag. It submitted `no_go` with a confidence of 0.7. Right answer, appropriately uncertain, with an evidence dossier that actually supported the decision.

This is the shift we wanted. Not from wrong to right. From *confidently wrong* to *cautiously right*. That is the difference between an LLM that talks about science and one that practises it.

## Why This Matters Beyond Drug Discovery

DrugEnv is a simulator. The targets are curated. The biology is simplified. We are not claiming to have built a real drug discovery pipeline.

What we are claiming is something more interesting. The pattern underneath this environment, *investigate under a budget and commit to a decision*, is the same pattern in clinical diagnosis, security incident triage, financial due diligence, and a dozen other domains where current LLMs are eloquent but unreliable. The skill we trained here is portable. The environment we built here is a template.

Drug discovery was the right place to start because the verification is crisp, the ground truth exists, and the failure modes are intellectually honest. But the deeper bet is this. The next leap in language models will not come from larger pre-training corpora. It will come from environments that reward *how* a model decides, not just *what* it says.

DrugEnv is one such environment. We hope it is the first of many.

## Links

- Environment Space: https://huggingface.co/spaces/anugrahteesdollar/drugenv
- Trainer Space: https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
- Training script: `training/training_script.py`
- Demo UI: `space/env/gradio_demo.py`
- Full scenario list and code: `README.md`

Thank you for reading. We know you have many of these to get through.