---
title: Drug Target Validation Environment
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - drug-discovery
  - pharma
---
# 🧬 DrugEnv – Drug Target Validation Environment

> **DrugEnv** – an OpenEnv RL environment that teaches LLMs to do computational drug-target validation.
This repository implements an OpenEnv-compatible reinforcement learning environment in which an agent acts as a **computational drug discovery scientist**. Given a proposed drug target (gene / protein) and a disease context, the agent must investigate target viability by issuing simulated bioinformatics, clinical, and experimental queries, and finally submit a calibrated **go / no-go** validation report with a confidence score.
The environment is designed as a partially observable Markov decision process (POMDP) with:

- a hidden ground-truth `TargetProfile` (expression, druggability, selectivity, toxicity, clinical precedent)
- noisy database / assay outputs governed by `DataQualityState`
- a single unified **experimental credit** budget per episode
- visible task metadata, dossier of accumulated findings, and step history
- dense step-wise reward plus terminal reward for decision quality and evidence coverage
## Why drug target validation?

Roughly **90% of drug development programs fail** in clinical trials, and a large fraction of failures trace back to mistakes during target validation: targets that are not actually disease-driving, are undruggable, lack selectivity, or have hidden toxicity. The cost of progressing a single bad target through Phase III can run into the **billions of dollars**. Even modest improvements in early-stage decision quality therefore translate into enormous savings and faster cures.

This environment lets you train and benchmark agents on exactly that bottleneck: **acquiring the right evidence cheaply and submitting a well-calibrated go / no-go**.
## How it works

At a high level, each episode looks like this (a minimal driver loop is sketched below the list):

1. `reset()` selects a drug-target-validation scenario and seeds the simulator.
2. The agent receives a `ValidationObservation` describing the target, indication, remaining credits, accumulated dossier, and step history.
3. The agent submits a `DrugTargetAction` such as `query_expression`, `druggability_screen`, `off_target_screen`, or `submit_validation_report`.
4. The rule engine checks the credit budget, redundancy, and ordering prerequisites.
5. The transition engine deducts credits and asks the output generator to simulate evidence from the hidden `TargetProfile`.
6. The reward computer scores the step for novelty, reasoning coherence, credit efficiency, and rule compliance.
7. The environment returns a new observation with an updated `EvidenceDossier`, the latest output, violations, and reward.
8. The episode ends when the agent submits a validation report, exhausts credits, or hits the step limit.
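Concretely, steps 1–8 can be driven with the client API documented later in this README. This is only a sketch: `policy` is a hypothetical stand-in for whatever agent you plug in, and the result shape follows the usage shown in "Talking to the environment".

```python
from client import DrugTargetEnv

def run_episode(env: DrugTargetEnv, policy) -> float:
    """Drive one episode through steps 1-8 above.

    `policy` is any callable mapping a ValidationObservation to a
    DrugTargetAction (hypothetical stand-in for your agent).
    """
    result = env.reset()                     # step 1: scenario selected, simulator seeded
    total_reward = 0.0
    while not result.done:                   # step 8: report submitted / credits / step limit
        action = policy(result.observation)  # steps 2-3: observe, choose an action
        result = env.step(action)            # steps 4-7: rules, transition, evidence, reward
        total_reward += result.reward
    return total_reward
```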
## The core mental model

### Hidden state

The simulator maintains a `FullLatentState` that the agent never sees directly (illustrative shapes for two of its components follow the list):

- `TargetProfile`: true expression level / tissue specificity / disease over-expression, druggability score, binding-pocket quality, selectivity ratio, off-target genes, toxicity profile, clinical precedent, expected in-vitro and in-vivo behaviour, plus the hidden `correct_decision`, `true_viability_score`, `key_evidence_dimensions`, and any `misleading_signals`.
- `DataQualityState`: noise level, false-positive rate, false-negative rate, database coverage.
- `CreditState`: total / used / remaining experimental credits.
- `ValidationProgress`: boolean flags for which evidence dimensions have been investigated and whether a report has been submitted.
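This is illustrative only: the real definitions live in the repo's models, and the exact field types here are assumptions read off the list above.

```python
from dataclasses import dataclass

@dataclass
class DataQualityState:
    """Noise model applied to simulated database / assay outputs."""
    noise_level: float
    false_positive_rate: float
    false_negative_rate: float
    database_coverage: float

@dataclass
class CreditState:
    """Unified experimental credit budget for the episode."""
    total: int
    used: int

    @property
    def remaining(self) -> int:
        return self.total - self.used
```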
### Visible state

The agent only sees `ValidationObservation`, which includes the fields below (a prompt-rendering sketch follows the list):

- `target_gene`, `disease_context`, `indication`
- `credits_remaining` / `credits_total`
- `dossier`: a running `EvidenceDossier` of expression / protein / clinical / safety / literature / experimental findings, plus any `flagged_red_flags`
- `pipeline_history`: a list of past actions and their summary outputs
- `latest_output`: a typed `IntermediateOutput` from the most recent step
- `rule_violations` and `step_reward_breakdown` for the last step
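For an LLM agent, these fields typically get flattened into the prompt. A minimal sketch assuming plain attribute access on the fields listed above (the baseline agent in `run_agent.py` may format things differently):

```python
def observation_to_prompt(obs) -> str:
    """Render a ValidationObservation as plain text for an LLM policy."""
    last = obs.latest_output.summary if obs.latest_output else "none"
    return "\n".join([
        f"Target: {obs.target_gene} | Disease: {obs.disease_context} | Indication: {obs.indication}",
        f"Credits remaining: {obs.credits_remaining}/{obs.credits_total}",
        f"Latest output: {last}",
        f"Steps so far: {len(obs.pipeline_history)}",
    ])
```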
## Action space

| Category | Action | Cost (credits) |
|---|---|---|
| Expression & omics | `query_expression`, `differential_expression`, `pathway_enrichment`, `coexpression_network` | 2 |
| Protein & structure | `protein_structure_lookup`, `binding_site_analysis`, `druggability_screen` | 3 |
| Protein & structure | `protein_interaction_network` | 2 |
| Clinical & safety | `clinical_trial_lookup`, `toxicity_panel`, `off_target_screen`, `patient_stratification` | 3 |
| Literature | `literature_search`, `evidence_synthesis`, `competitor_landscape` | 1 |
| Experimental | `crispr_knockout`, `biomarker_correlation` | 4 / 3 |
| Experimental | `in_vitro_assay` | 5 |
| Experimental | `in_vivo_model` | 8 |
| Meta | `flag_red_flag`, `request_expert_review` | 0 / 1 |
| Terminal | `submit_validation_report` | 0 |
`submit_validation_report` carries two extra fields: `final_decision` (`"go"` or `"no_go"`) and `confidence` in `[0, 1]`. The episode ends as soon as the report is submitted.
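For reference, the terminal action serialises to the same JSON shape as any other action (compare the examples in the LoRA section below), plus the two extra fields. The empty `parameters` object here is an assumption:

```json
{
  "action_type": "submit_validation_report",
  "parameters": {},
  "reasoning": "Expression, druggability, and safety evidence all support progression.",
  "final_decision": "go",
  "confidence": 0.85
}
```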
## Reward function

Every step receives a decomposed reward:

    R_t = evidence_novelty_bonus
        + reasoning_coherence_bonus
        + credit_efficiency_penalty
        + rule_violation_penalty
        + [Φ(s_{t+1}) - Φ(s_t)]

The `Φ(s_{t+1}) - Φ(s_t)` term is standard potential-based shaping, so the dense per-step signal guides exploration without changing which policies are optimal.
When the episode ends, a terminal reward is added:

    R_T = 0.40 * decision_accuracy
        + 0.35 * evidence_coverage
        + 0.15 * credit_efficiency
        + 0.10 * reasoning_coherence

Where:

- `decision_accuracy`: `1.0` if the final go / no-go matched the hidden `correct_decision`, scaled by `2 * |confidence - 0.5|` so a confidently correct answer is fully rewarded and a confidently wrong answer is fully penalised.
- `evidence_coverage`: the fraction of the scenario's `key_evidence_dimensions` (e.g. `expression`, `druggability`, `off_target`, `clinical`, `in_vitro`) that the agent actually investigated.
- `credit_efficiency`: `1 - redundant_calls / total_calls`.
- `reasoning_coherence`: the fraction of actions whose soft prerequisites (e.g. `expression` before `toxicity`, `in_vitro` before `in_vivo`) were satisfied.

Hard penalties are applied for: submitting without any evidence, submitting without a decision or confidence, and exhausting credits without ever submitting a report.
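Put together, the terminal term can be computed as below. This is a sketch, not the repo's implementation; in particular, treating the calibration factor as signed for wrong answers is our reading of "fully penalised".

```python
def terminal_reward(decision: str, correct_decision: str, confidence: float,
                    covered_dims: set, key_dims: set,
                    redundant_calls: int, total_calls: int,
                    coherent_steps: int, total_steps: int) -> float:
    # Calibration factor: 0 at confidence 0.5, 1 at confidence 0.0 or 1.0.
    calibration = 2 * abs(confidence - 0.5)
    decision_accuracy = calibration if decision == correct_decision else -calibration
    evidence_coverage = len(covered_dims & key_dims) / max(len(key_dims), 1)
    credit_efficiency = 1 - redundant_calls / max(total_calls, 1)
    reasoning_coherence = coherent_steps / max(total_steps, 1)
    return (0.40 * decision_accuracy
            + 0.35 * evidence_coverage
            + 0.15 * credit_efficiency
            + 0.10 * reasoning_coherence)
```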
## Curated scenarios

| Name | Difficulty | Correct decision | Why it's interesting |
|---|---|---|---|
| `egfr_nsclc_viable` | easy | `go` | Clear viable target: expression + druggability alone are sufficient. |
| `kras_pdac_borderline` | medium | `go` | Historically undruggable; recent inhibitor literature is decisive. |
| `cd33_aml_misleading` | hard | `no_go` | A naive expression query says "go", but off-target + toxicity + clinical evidence reveal the right answer. |
| `tp53_solid_tumors_clear_fail` | easy-medium | `no_go` | A druggability check alone is sufficient. |
| `ptpn11_juvenile_mml_complex` | very hard | `go` | Requires `binding_site_analysis(include_allosteric=True)`, off-target work, patient stratification, and an in-vitro assay. |

The procedural generator (`server/tasks/procedural_generator.py`) layers on additional easy / medium / hard scenarios sampled from a pool of 20 real cancer targets and 8 cancer indications.
## Setup

```bash
# 1. Install dependencies (env runtime only)
pip install -e .

# 2. Or install with training extras (torch + transformers + trl + peft pinned to a working set)
pip install -e .[train]

# 3. Run the environment server
PYTHONPATH=. python -m server.app
# server is now available at http://localhost:8000
```

The legacy `uv sync` workflow still works if you have `uv.lock` checked in locally; the editable pip install path above is the primary supported route.
## Talking to the environment

```python
from client import DrugTargetEnv
from models import DrugTargetAction

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.target_gene, "/", result.observation.indication)

    result = env.step(DrugTargetAction(
        action_type="query_expression",
        parameters={"database": "GTEx"},
        reasoning="Establish tissue baseline",
    ))
    print(result.observation.latest_output.summary)

    result = env.step(DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Sufficient evidence for go",
        final_decision="go",
        confidence=0.85,
    ))
    print("done:", result.done, "reward:", result.reward)
```
## Running the baseline agent

```bash
PYTHONPATH=. python run_agent.py
```

The script writes a live JSON snapshot to `_dashboard_state.json` after every step so you can watch the agent's progress.

The agent model is configurable via env vars:

- `RUN_AGENT_MODEL_ID` (HF model id)
- `RUN_AGENT_LORA_ADAPTER` (optional LoRA adapter repo/path)

Example (base + LoRA):

```bash
RUN_AGENT_MODEL_ID=Qwen/Qwen2.5-0.5B-Instruct \
RUN_AGENT_LORA_ADAPTER=anugrahteesdollar/drugenv-qwen25-05b-lora \
PYTHONPATH=. python run_agent.py
```
## Quick Training Update (Colab LoRA Adapter)

HF compute credits ran out before completing a full GRPO run, so we ran a short adapter-only LoRA fine-tune on a free Colab T4 to demonstrate measurable improvement.

- Base model: `Qwen/Qwen2.5-0.5B-Instruct` (4-bit)
- LoRA adapter (HF Hub): `anugrahteesdollar/drugenv-qwen25-05b-lora`
- Training: ~300 steps, LoRA r=8, alpha=16, dropout=0.05, target modules `q_proj`, `v_proj`
- Outcome: improved adherence to the DrugEnv action-JSON parameter schema.

Example (step 0 prompt, EGFR / NSCLC):

Before (base):

```json
{"action_type":"query_expression","parameters":{"term":"EGFR","dossier_id":"NSCLC"},"reasoning":"Query for the presence of EGFR in the given dataset."}
```

After (base + LoRA):

```json
{"action_type":"query_expression","parameters":{"database":"GTEx"},"reasoning":"Start by querying the expressible space across the entire expression space."}
```
## Reproduce

Three commands cover the env-locally / training-locally / training-on-Space paths:

```bash
# 1. Env locally (CPU is fine: the env itself is dependency-light)
pip install -e . && PYTHONPATH=. python -m server.app
# → http://localhost:8000 (also at https://huggingface.co/spaces/anugrahteesdollar/drugenv when deployed)

# 2. Training locally (single GPU, vanilla GRPO)
pip install -e .[train]
PYTHONPATH=. python -m training.training_script \
    --model-id Qwen/Qwen2.5-3B-Instruct \
    --evidence-dir evidence \
    --output-dir runs/grpo-output

# 3. Training on a Hugging Face Space (H200 single-GPU)
# Push space/training/ to anugrahteesdollar/drugenv-trainer, set PUSH_REPO + HF_TOKEN
# in the Space variables, then POST /train.
# → https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
```
The trainer Space's FastAPI control panel (`space/training/app.py`) streams a live evidence dashboard while training runs: a per-step training curve, mid-training checkpoint progression, and a before / after summary card. Default expected hardware is a single H200 GPU (`h200x1`); an H200 delivers roughly 4× A100 throughput, at ~$0.05–0.10 per step on Qwen2.5-3B-class GRPO.

An optional SFT warm-start (`training/sft_warmstart.py`) is controlled via the `SFT_WARMSTART` env var on the Space (default on). It collects oracle trajectories on the curated scenario library, SFTs the base model with a small LoRA, and hands the merged checkpoint to GRPO so the policy starts with a non-zero prior over correct trajectories.
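The handoff at the end of that pipeline looks roughly like the sketch below. Paths are hypothetical and the real logic lives in `training/sft_warmstart.py`:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the SFT'd LoRA on top of the base model, then fold it into the
# base weights so GRPO can start from a plain merged checkpoint.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
sft = PeftModel.from_pretrained(base, "runs/sft-warmstart-lora")  # hypothetical path
merged = sft.merge_and_unload()
merged.save_pretrained("runs/sft-merged")  # GRPO initialises from this
```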
## Baseline scores

| Difficulty bucket | Random policy | Heuristic policy | Trained Qwen2.5-3B |
|---|---|---|---|
| Easy (`egfr_nsclc_viable`) | filled in after first training run | filled in after first training run | filled in after first training run |
| Medium (`kras_pdac_borderline`) | filled in after first training run | filled in after first training run | filled in after first training run |
| Hard (`cd33_aml_misleading`) | filled in after first training run | filled in after first training run | filled in after first training run |

The trainer Space writes the populated table to `evidence/before_after_metrics.json` automatically on every run.
## Evolution note

The deployment scaffolding in this repository (the trainer Space control panel, the live training-evidence callback, the SFT warm-start script, and the working dependency pin set) was originally validated against a particle-physics-themed prototype and then carried forward when we pivoted to drug discovery. The simulator, scenarios, action space, reward function, and rules engine are all drug-domain native; the inheritance is exclusively in the training and evaluation scaffolding.