---
title: Drug Target Validation Environment
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
- reinforcement-learning
- drug-discovery
- pharma
---
# 🧬 DrugEnv – Drug Target Validation Environment

DrugEnv is an OpenEnv RL environment that teaches LLMs to do computational drug-target validation.
This repository implements an OpenEnv-compatible reinforcement learning environment in which an agent acts as a computational drug discovery scientist. Given a proposed drug target (gene / protein) and a disease context, the agent must investigate target viability by issuing simulated bioinformatics, clinical, and experimental queries, and finally submit a calibrated go / no-go validation report with a confidence score.
The environment is designed as a partially observable Markov decision process (POMDP) with:

- a hidden ground-truth `TargetProfile` (expression, druggability, selectivity, toxicity, clinical precedent)
- noisy database / assay outputs governed by `DataQualityState`
- a single unified experimental credit budget per episode
- visible task metadata, a dossier of accumulated findings, and step history
- dense step-wise reward plus a terminal reward for decision quality and evidence coverage
## Why drug target validation?
Roughly 90% of drug development programs fail in clinical trials, and a large fraction of failures trace back to mistakes during target validation: targets that are not actually disease-driving, are undruggable, lack selectivity, or have hidden toxicity. The cost of progressing a single bad target through Phase III can run into the billions of dollars. Even modest improvements in early-stage decision quality therefore translate into enormous savings and faster cures.
This environment lets you train and benchmark agents on exactly that bottleneck: acquiring the right evidence cheaply and submitting a well-calibrated go / no-go.
## How it works
At a high level, each episode looks like this:
1. `reset()` selects a drug-target-validation scenario and seeds the simulator.
2. The agent receives a `ValidationObservation` describing the target, indication, remaining credits, accumulated dossier, and step history.
3. The agent submits a `DrugTargetAction` such as `query_expression`, `druggability_screen`, `off_target_screen`, or `submit_validation_report`.
4. The rule engine checks credit budget, redundancy, and ordering prerequisites.
5. The transition engine deducts credits and asks the output generator to simulate evidence from the hidden `TargetProfile`.
6. The reward computer scores the step for novelty, reasoning coherence, credit efficiency, and rule compliance.
7. The environment returns a new observation with an updated `EvidenceDossier`, latest output, violations, and reward.
8. The episode ends when the agent submits a validation report, exhausts credits, or hits the step limit.
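Put together, an episode is just a loop over `env.step` until `done`. A minimal sketch using the client shown in *Talking to the environment* below; the `scripted_policy` helper here is hypothetical and only illustrates the loop shape:

```python
from client import DrugTargetEnv
from models import DrugTargetAction

def scripted_policy(obs) -> DrugTargetAction:
    # Hypothetical policy: gather cheap literature evidence first, then submit.
    if obs.credits_remaining > 10 and len(obs.pipeline_history) < 4:
        return DrugTargetAction(
            action_type="literature_search",
            parameters={},
            reasoning="Cheap evidence before committing credits",
        )
    return DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Budget nearly spent; commit to a decision",
        final_decision="go",
        confidence=0.6,
    )

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    while not result.done:  # ends on report submission, empty budget, or step limit
        result = env.step(scripted_policy(result.observation))
    print("episode reward:", result.reward)
```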
## The core mental model

### Hidden state
The simulator maintains a `FullLatentState` that the agent never sees directly:
- `TargetProfile` – true expression level / tissue specificity / disease over-expression, druggability score, binding-pocket quality, selectivity ratio, off-target genes, toxicity profile, clinical precedent, expected in-vitro and in-vivo behaviour, plus the hidden `correct_decision`, `true_viability_score`, `key_evidence_dimensions`, and any `misleading_signals`.
- `DataQualityState` – noise level, false-positive rate, false-negative rate, database coverage.
- `CreditState` – total / used / remaining experimental credits.
- `ValidationProgress` – boolean flags for which evidence dimensions have been investigated and whether a report has been submitted.
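For orientation, two of these components rendered as plain dataclasses; the field names follow the list above, but the types and the dataclass form itself are illustrative assumptions, not the repository's actual models:

```python
from dataclasses import dataclass

@dataclass
class DataQualityState:
    # Governs how noisy the simulated database / assay outputs are.
    noise_level: float
    false_positive_rate: float
    false_negative_rate: float
    database_coverage: float

@dataclass
class CreditState:
    # Single unified experimental budget per episode.
    total: int
    used: int

    @property
    def remaining(self) -> int:
        return self.total - self.used
```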
### Visible state

The agent only sees `ValidationObservation`, which includes:
- `target_gene`, `disease_context`, `indication`
- `credits_remaining` / `credits_total`
- `dossier` – running `EvidenceDossier` of expression / protein / clinical / safety / literature / experimental findings, plus any `flagged_red_flags`
- `pipeline_history` – list of past actions and their summary outputs
- `latest_output` – typed `IntermediateOutput` from the most recent step
- `rule_violations` and `step_reward_breakdown` for the last step
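A hypothetical sketch of how an agent might read these fields when deciding what to do next (the helper itself and the `action_type` attribute on history entries are assumptions for illustration):

```python
def needs_safety_work(obs) -> bool:
    # Prioritise safety evidence once red flags appear, if budget allows.
    has_red_flags = bool(obs.dossier.flagged_red_flags)
    already_screened = any(
        step.action_type == "off_target_screen" for step in obs.pipeline_history
    )
    return has_red_flags and not already_screened and obs.credits_remaining >= 3
```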
## Action space
| Category | Action | Cost (credits) |
|---|---|---|
| Expression & omics | `query_expression`, `differential_expression`, `pathway_enrichment`, `coexpression_network` | 2 |
| Protein & structure | `protein_structure_lookup`, `binding_site_analysis`, `druggability_screen` | 3 |
| Protein & structure | `protein_interaction_network` | 2 |
| Clinical & safety | `clinical_trial_lookup`, `toxicity_panel`, `off_target_screen`, `patient_stratification` | 3 |
| Literature | `literature_search`, `evidence_synthesis`, `competitor_landscape` | 1 |
| Experimental | `crispr_knockout`, `biomarker_correlation` | 4 / 3 |
| Experimental | `in_vitro_assay` | 5 |
| Experimental | `in_vivo_model` | 8 |
| Meta | `flag_red_flag`, `request_expert_review` | 0 / 1 |
| Terminal | `submit_validation_report` | 0 |
`submit_validation_report` carries two extra fields: `final_decision` (`"go"` or `"no_go"`) and `confidence` in `[0, 1]`. The episode ends as soon as the report is submitted.
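For budget planning it can help to mirror the cost table in code. A hypothetical lookup (the costs are copied from the table above; the `ACTION_COSTS` name and `affordable` helper are not part of the repository):

```python
ACTION_COSTS = {
    # Expression & omics
    "query_expression": 2, "differential_expression": 2,
    "pathway_enrichment": 2, "coexpression_network": 2,
    # Protein & structure
    "protein_structure_lookup": 3, "binding_site_analysis": 3,
    "druggability_screen": 3, "protein_interaction_network": 2,
    # Clinical & safety
    "clinical_trial_lookup": 3, "toxicity_panel": 3,
    "off_target_screen": 3, "patient_stratification": 3,
    # Literature
    "literature_search": 1, "evidence_synthesis": 1, "competitor_landscape": 1,
    # Experimental
    "crispr_knockout": 4, "biomarker_correlation": 3,
    "in_vitro_assay": 5, "in_vivo_model": 8,
    # Meta / terminal
    "flag_red_flag": 0, "request_expert_review": 1,
    "submit_validation_report": 0,
}

def affordable(action_type: str, credits_remaining: int) -> bool:
    return ACTION_COSTS[action_type] <= credits_remaining
```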
## Reward function
Every step receives a decomposed reward:
```
R_t = evidence_novelty_bonus
    + reasoning_coherence_bonus
    - credit_efficiency_penalty
    - rule_violation_penalty
    + [Φ(s_{t+1}) - Φ(s_t)]
```
When the episode ends, a terminal reward is added:
```
R_T = 0.40 * decision_accuracy
    + 0.35 * evidence_coverage
    + 0.15 * credit_efficiency
    + 0.10 * reasoning_coherence
```
Where:

- `decision_accuracy` – `1.0` if the final go / no-go matched the hidden `correct_decision`, scaled by `2 * |confidence - 0.5|` so a confidently correct answer is fully rewarded and a confidently wrong answer is fully penalised.
- `evidence_coverage` – fraction of the scenario's `key_evidence_dimensions` (e.g. `expression`, `druggability`, `off_target`, `clinical`, `in_vitro`) that the agent actually investigated.
- `credit_efficiency` – `1 - redundant_calls / total_calls`.
- `reasoning_coherence` – fraction of actions whose soft prerequisites (e.g. `expression` before `toxicity`, `in_vitro` before `in_vivo`) were satisfied.
Hard penalties are applied for: submitting without any evidence, submitting without a decision or confidence, and exhausting credits without ever submitting a report.
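A worked sketch of the terminal reward under one plausible reading of the definitions above, ignoring the hard penalties (the function and argument names are illustrative, not the repository's API):

```python
def terminal_reward(decision: str, correct_decision: str, confidence: float,
                    dims_covered: int, dims_key: int,
                    redundant_calls: int, coherent_actions: int,
                    total_calls: int) -> float:
    # Confidence scaling: confidently right -> +1, confidently wrong -> -1.
    calibration = 2 * abs(confidence - 0.5)
    decision_accuracy = calibration if decision == correct_decision else -calibration
    evidence_coverage = dims_covered / dims_key
    credit_efficiency = 1 - redundant_calls / total_calls
    reasoning_coherence = coherent_actions / total_calls
    return (0.40 * decision_accuracy
            + 0.35 * evidence_coverage
            + 0.15 * credit_efficiency
            + 0.10 * reasoning_coherence)

# Confidently correct "go", full coverage, no redundant or incoherent calls:
print(terminal_reward("go", "go", 0.9, 5, 5, 0, 8, 8))  # 0.4*0.8 + 0.35 + 0.15 + 0.10 = 0.92
```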
## Curated scenarios
| Name | Difficulty | Correct decision | Why it's interesting |
|---|---|---|---|
| `egfr_nsclc_viable` | easy | `go` | Clear viable target: expression + druggability alone are sufficient. |
| `kras_pdac_borderline` | medium | `go` | Historically undruggable; recent inhibitor literature is decisive. |
| `cd33_aml_misleading` | hard | `no_go` | Naive expression query says "go", but off-target + toxicity + clinical reveal the right answer. |
| `tp53_solid_tumors_clear_fail` | easy-medium | `no_go` | Druggability check alone is sufficient. |
| `ptpn11_juvenile_mml_complex` | very hard | `go` | Requires `binding_site_analysis(include_allosteric=True)`, off-target work, patient stratification, and an in-vitro assay. |
The procedural generator (`server/tasks/procedural_generator.py`) layers on additional easy / medium / hard scenarios sampled from a pool of 20 real cancer targets and 8 cancer indications.
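A hypothetical sketch of how such a generator can work (the pools here are illustrative excerpts and the field names are assumptions; see `server/tasks/procedural_generator.py` for the real implementation):

```python
import random

TARGET_POOL = ["EGFR", "KRAS", "CD33", "TP53", "PTPN11"]  # excerpt; the real pool has 20
INDICATION_POOL = ["NSCLC", "PDAC", "AML"]                # excerpt; the real pool has 8

def sample_scenario(difficulty: str, seed: int) -> dict:
    rng = random.Random(seed)  # seeded for reproducible episodes
    return {
        "target_gene": rng.choice(TARGET_POOL),
        "indication": rng.choice(INDICATION_POOL),
        "difficulty": difficulty,
        # Harder scenarios plausibly get noisier simulated data.
        "noise_level": {"easy": 0.05, "medium": 0.15, "hard": 0.30}[difficulty],
    }
```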
## Setup

```bash
# 1. Install dependencies (env runtime only)
pip install -e .

# 2. Or install with training extras (torch + transformers + trl + peft pinned to a working set)
pip install -e .[train]

# 3. Run the environment server
PYTHONPATH=. python -m server.app
# server is now available at http://localhost:8000
```
The legacy `uv sync` workflow still works if you have `uv.lock` checked in locally; the editable pip install path above is the primary supported route.
## Talking to the environment
```python
from client import DrugTargetEnv
from models import DrugTargetAction

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.target_gene, "/", result.observation.indication)

    result = env.step(DrugTargetAction(
        action_type="query_expression",
        parameters={"database": "GTEx"},
        reasoning="Establish tissue baseline",
    ))
    print(result.observation.latest_output.summary)

    result = env.step(DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Sufficient evidence for go",
        final_decision="go",
        confidence=0.85,
    ))
    print("done:", result.done, "reward:", result.reward)
```
## Running the baseline agent
```bash
PYTHONPATH=. python run_agent.py
```
The script writes a live JSON snapshot to `_dashboard_state.json` after every step so you can watch the agent's progress. The agent model is configurable via env vars:

- `RUN_AGENT_MODEL_ID` (HF model id)
- `RUN_AGENT_LORA_ADAPTER` (optional LoRA adapter repo/path)
Example (base + LoRA):

```bash
RUN_AGENT_MODEL_ID=Qwen/Qwen2.5-0.5B-Instruct \
RUN_AGENT_LORA_ADAPTER=anugrahteesdollar/drugenv-qwen25-05b-lora \
PYTHONPATH=. python run_agent.py
```
## Quick Training Update (Colab LoRA Adapter)
HF compute credits ran out before a full GRPO run could complete, so we ran a short adapter-only LoRA fine-tune on a free Colab T4 to demonstrate measurable improvement.

- Base model: `Qwen/Qwen2.5-0.5B-Instruct` (4-bit)
- LoRA adapter (HF Hub): `anugrahteesdollar/drugenv-qwen25-05b-lora`
- Training: ~300 steps, LoRA r=8, alpha=16, dropout=0.05, target modules `q_proj`, `v_proj`
- Outcome: improved adherence to the DrugEnv action JSON parameter schema.
Example (step 0 prompt, EGFR / NSCLC):

Before (base):

```json
{"action_type":"query_expression","parameters":{"term":"EGFR","dossier_id":"NSCLC"},"reasoning":"Query for the presence of EGFR in the given dataset."}
```

After (base + LoRA):

```json
{"action_type":"query_expression","parameters":{"database":"GTEx"},"reasoning":"Start by querying the expressible space across the entire expression space."}
```
## Reproduce
Three commands cover the env-locally / training-locally / training-on-Space paths:

```bash
# 1. Env locally (CPU is fine: the env itself is dependency-light)
pip install -e . && PYTHONPATH=. python -m server.app
# -> http://localhost:8000 (also at https://huggingface.co/spaces/anugrahteesdollar/drugenv when deployed)

# 2. Training locally (single GPU, vanilla GRPO)
pip install -e .[train]
PYTHONPATH=. python -m training.training_script \
  --model-id Qwen/Qwen2.5-3B-Instruct \
  --evidence-dir evidence \
  --output-dir runs/grpo-output

# 3. Training on a Hugging Face Space (H200 single-GPU)
# Push space/training/ to anugrahteesdollar/drugenv-trainer, set PUSH_REPO + HF_TOKEN
# in the Space variables, then POST /train.
# -> https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
```
The trainer Space's FastAPI control panel (`space/training/app.py`) streams a live evidence dashboard while training runs: per-step training curve, mid-training checkpoint progression, and a before / after summary card. Default expected hardware is a single H200 GPU (`h200x1`); an H200 delivers roughly 4× A100 throughput, at ~$0.05–0.10 per step on Qwen2.5-3B-class GRPO.
An optional SFT warm-start (`training/sft_warmstart.py`) is controlled via the `SFT_WARMSTART` env var on the Space (default on). It collects oracle trajectories on the curated scenario library, SFTs the base model with a small LoRA, and hands the merged checkpoint to GRPO so the policy starts with a non-zero prior over correct trajectories.
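A minimal sketch of the oracle-trajectory collection step, assuming the client API above and Pydantic-style models with `model_dump_json()`; the trivial `oracle_action` stub and the JSONL record format are illustrative assumptions (the real logic lives in `training/sft_warmstart.py`):

```python
import json

from client import DrugTargetEnv
from models import DrugTargetAction

def oracle_action(obs) -> DrugTargetAction:
    # Stub oracle: the real warm-start follows each scenario's key evidence
    # dimensions; this placeholder just submits immediately.
    return DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="oracle decision",
        final_decision="go",
        confidence=0.9,
    )

def collect_oracle_trajectories(num_episodes: int, out_path: str = "oracle_sft.jsonl"):
    # Roll out the oracle and store (observation, action) pairs as SFT data.
    with DrugTargetEnv(base_url="http://localhost:8000") as env, open(out_path, "w") as f:
        for _ in range(num_episodes):
            result = env.reset()
            while not result.done:
                action = oracle_action(result.observation)
                f.write(json.dumps({
                    "prompt": result.observation.model_dump_json(),
                    "completion": action.model_dump_json(),
                }) + "\n")
                result = env.step(action)
```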
## Baseline scores
| Difficulty bucket | Random policy | Heuristic policy | Trained Qwen2.5-3B |
|---|---|---|---|
| Easy (`egfr_nsclc_viable`) | filled in after first training run | filled in after first training run | filled in after first training run |
| Medium (`kras_pdac_borderline`) | filled in after first training run | filled in after first training run | filled in after first training run |
| Hard (`cd33_aml_misleading`) | filled in after first training run | filled in after first training run | filled in after first training run |
The trainer Space writes the populated table to `evidence/before_after_metrics.json` automatically on every run.
## Evolution note
The deployment scaffolding in this repository (the trainer Space control panel, the live training-evidence callback, the SFT warm-start script, and the working dependency pin set) was originally validated against a particle-physics-themed prototype and carried forward when we pivoted to drug discovery. The simulator, scenarios, action space, reward function, and rules engine are all drug-domain native; the inheritance is exclusively in the training and evaluation scaffolding.