---
title: Drug Target Validation Environment
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - drug-discovery
  - pharma
---

# 🧬 DrugEnv — Drug Target Validation Environment

> **DrugEnv** — an OpenEnv RL environment that teaches LLMs to do computational drug-target validation.

This repository implements an OpenEnv-compatible reinforcement learning environment in which an agent acts as a **computational drug discovery scientist**. Given a proposed drug target (gene / protein) and a disease context, the agent must investigate target viability by issuing simulated bioinformatics, clinical, and experimental queries, and finally submit a calibrated **go / no-go** validation report with a confidence score.

The environment is designed as a partially observable Markov decision process (POMDP) with:

- a hidden ground-truth `TargetProfile` (expression, druggability, selectivity, toxicity, clinical precedent)
- noisy database / assay outputs governed by `DataQualityState`
- a single unified **experimental credit** budget per episode
- visible task metadata, a dossier of accumulated findings, and step history
- dense step-wise reward plus terminal reward for decision quality and evidence coverage

## Why drug target validation?

Roughly **90% of drug development programs fail** in clinical trials, and a large fraction of failures trace back to mistakes during target validation: targets that are not actually disease-driving, are undruggable, lack selectivity, or have hidden toxicity. The cost of progressing a single bad target through Phase III can run into the **billions of dollars**. Even modest improvements in early-stage decision quality therefore translate into enormous savings and faster cures.

This environment lets you train and benchmark agents on exactly that bottleneck: **acquiring the right evidence cheaply and submitting a well-calibrated go / no-go**.

## How it works

At a high level, each episode looks like this:

1. `reset()` selects a drug-target-validation scenario and seeds the simulator.
2. The agent receives a `ValidationObservation` describing the target, indication, remaining credits, accumulated dossier, and step history.
3. The agent submits a `DrugTargetAction` such as `query_expression`, `druggability_screen`, `off_target_screen`, or `submit_validation_report`.
4. The rule engine checks the credit budget, redundancy, and ordering prerequisites.
5. The transition engine deducts credits and asks the output generator to simulate evidence from the hidden `TargetProfile`.
6. The reward computer scores the step for novelty, reasoning coherence, credit efficiency, and rule compliance.
7. The environment returns a new observation with an updated `EvidenceDossier`, latest output, violations, and reward.
8. The episode ends when the agent submits a validation report, exhausts credits, or hits the step limit.

## The core mental model

### Hidden state

The simulator maintains a `FullLatentState` that the agent never sees directly (sketched below):

- `TargetProfile` — true expression level / tissue specificity / disease over-expression, druggability score, binding-pocket quality, selectivity ratio, off-target genes, toxicity profile, clinical precedent, expected in-vitro and in-vivo behaviour, plus the hidden `correct_decision`, `true_viability_score`, `key_evidence_dimensions`, and any `misleading_signals`.
- `DataQualityState` — noise level, false-positive rate, false-negative rate, database coverage.
- `CreditState` — total / used / remaining experimental credits.
- `ValidationProgress` — boolean flags for which evidence dimensions have been investigated and whether a report has been submitted.
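For orientation, here is a minimal sketch of how that hidden state might be laid out in Python. Field names are inferred from the bullet points above rather than copied from the server's actual model definitions, so treat everything here as illustrative:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: field names are inferred from this README,
# not from the repository's actual model definitions.

@dataclass
class TargetProfile:
    correct_decision: str        # "go" or "no_go"; never shown to the agent
    true_viability_score: float  # hidden scalar summarising viability
    druggability_score: float
    selectivity_ratio: float
    key_evidence_dimensions: list[str] = field(default_factory=list)
    misleading_signals: list[str] = field(default_factory=list)

@dataclass
class CreditState:
    total: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return self.total - self.used

@dataclass
class FullLatentState:
    target_profile: TargetProfile
    credits: CreditState
    # DataQualityState and ValidationProgress omitted for brevity.
```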
### Visible state

The agent only sees `ValidationObservation`, which includes:

- `target_gene`, `disease_context`, `indication`
- `credits_remaining` / `credits_total`
- `dossier` — running `EvidenceDossier` of expression / protein / clinical / safety / literature / experimental findings, plus any `flagged_red_flags`
- `pipeline_history` — list of past actions and their summary outputs
- `latest_output` — typed `IntermediateOutput` from the most recent step
- `rule_violations` and `step_reward_breakdown` for the last step

## Action space

| Category | Action | Cost (credits) |
|---|---|---|
| Expression & omics | `query_expression`, `differential_expression`, `pathway_enrichment`, `coexpression_network` | 2 |
| Protein & structure | `protein_structure_lookup`, `binding_site_analysis`, `druggability_screen` | 3 |
| Protein & structure | `protein_interaction_network` | 2 |
| Clinical & safety | `clinical_trial_lookup`, `toxicity_panel`, `off_target_screen`, `patient_stratification` | 3 |
| Literature | `literature_search`, `evidence_synthesis`, `competitor_landscape` | 1 |
| Experimental | `crispr_knockout` | 4 |
| Experimental | `biomarker_correlation` | 3 |
| Experimental | `in_vitro_assay` | 5 |
| Experimental | `in_vivo_model` | 8 |
| Meta | `flag_red_flag` | 0 |
| Meta | `request_expert_review` | 1 |
| Terminal | `submit_validation_report` | 0 |

`submit_validation_report` carries two extra fields: `final_decision` (`"go"` or `"no_go"`) and `confidence` in `[0, 1]`. The episode ends as soon as the report is submitted.

## Reward function

Every step receives a decomposed reward:

R_t = evidence_novelty_bonus + reasoning_coherence_bonus + credit_efficiency_penalty + rule_violation_penalty + [φ(s_{t+1}) − φ(s_t)]

When the episode ends, a terminal reward is added:

R_T = 0.40 * decision_accuracy + 0.35 * evidence_coverage + 0.15 * credit_efficiency + 0.10 * reasoning_coherence

Where:

- `decision_accuracy` — `1.0` if the final go / no-go matched the hidden `correct_decision`, scaled by `2 * |confidence - 0.5|` so a confidently correct answer is fully rewarded and a confidently wrong answer is fully penalised.
- `evidence_coverage` — fraction of the scenario's `key_evidence_dimensions` (e.g. `expression`, `druggability`, `off_target`, `clinical`, `in_vitro`) that the agent actually investigated.
- `credit_efficiency` — `1 − redundant_calls / total_calls`.
- `reasoning_coherence` — fraction of actions whose soft prerequisites (e.g. `expression` before `toxicity`, `in_vitro` before `in_vivo`) were satisfied.

Hard penalties are applied for: submitting without any evidence, submitting without a decision or confidence, and exhausting credits without ever submitting a report.
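To make the weighting concrete, here is a hypothetical transcription of the terminal reward into Python. The helper name and signature are invented for illustration, and treating a confidently wrong decision as a negative contribution is one plausible reading of "fully penalised"; the repository's actual reward computer also handles the dense per-step terms and hard penalties:

```python
def terminal_reward(
    decision: str,
    correct_decision: str,
    confidence: float,
    investigated: set[str],
    key_dims: set[str],
    redundant_calls: int,
    total_calls: int,
    coherence: float,
) -> float:
    """Hypothetical transcription of the terminal reward described above."""
    # 2*|confidence - 0.5| maps confidence 1.0 (or 0.0) to full weight
    # and an uninformative confidence of 0.5 to zero weight.
    calibration = 2 * abs(confidence - 0.5)
    # Assumption: "fully penalised" means a confidently wrong answer
    # contributes the full weight negatively.
    decision_accuracy = calibration if decision == correct_decision else -calibration
    evidence_coverage = len(investigated & key_dims) / max(len(key_dims), 1)
    credit_efficiency = 1 - redundant_calls / max(total_calls, 1)
    return (
        0.40 * decision_accuracy
        + 0.35 * evidence_coverage
        + 0.15 * credit_efficiency
        + 0.10 * coherence
    )

# e.g. a confident, correct "go" with full coverage and no redundant calls:
# terminal_reward("go", "go", 0.9, {"expression", "druggability"},
#                 {"expression", "druggability"}, 0, 6, 1.0) -> 0.92
```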
## Curated scenarios

| Name | Difficulty | Correct decision | Why it's interesting |
|---|---|---|---|
| `egfr_nsclc_viable` | easy | `go` | Clear viable target — expression + druggability alone are sufficient. |
| `kras_pdac_borderline` | medium | `go` | Historically undruggable; recent inhibitor literature is decisive. |
| `cd33_aml_misleading` | hard | `no_go` | Naive expression query says "go", but off-target + toxicity + clinical reveal the right answer. |
| `tp53_solid_tumors_clear_fail` | easy-medium | `no_go` | Druggability check alone is sufficient. |
| `ptpn11_juvenile_mml_complex` | very hard | `go` | Requires `binding_site_analysis(include_allosteric=True)`, off-target work, patient stratification, and an in-vitro assay. |

The procedural generator (`server/tasks/procedural_generator.py`) layers on additional easy / medium / hard scenarios sampled from a pool of 20 real cancer targets and 8 cancer indications.

## Setup

```bash
# 1. Install dependencies (env runtime only)
pip install -e .

# 2. Or install with training extras (torch + transformers + trl + peft pinned to a working set)
pip install -e ".[train]"

# 3. Run the environment server
PYTHONPATH=. python -m server.app
# server is now available at http://localhost:8000
```

The legacy `uv sync` workflow still works if you have `uv.lock` checked in locally; the editable pip install path above is the primary supported route.

## Talking to the environment

```python
from client import DrugTargetEnv
from models import DrugTargetAction

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.target_gene, "/", result.observation.indication)

    result = env.step(DrugTargetAction(
        action_type="query_expression",
        parameters={"database": "GTEx"},
        reasoning="Establish tissue baseline",
    ))
    print(result.observation.latest_output.summary)

    result = env.step(DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Sufficient evidence for go",
        final_decision="go",
        confidence=0.85,
    ))
    print("done:", result.done, "reward:", result.reward)
```

## Running the baseline agent

```bash
PYTHONPATH=. python run_agent.py
```

The script writes a live JSON snapshot to `_dashboard_state.json` after every step so you can watch the agent's progress. The agent model is configurable via env vars:

- `RUN_AGENT_MODEL_ID` (HF model id)
- `RUN_AGENT_LORA_ADAPTER` (optional LoRA adapter repo/path)

Example (base + LoRA):

```bash
RUN_AGENT_MODEL_ID=Qwen/Qwen2.5-0.5B-Instruct \
RUN_AGENT_LORA_ADAPTER=anugrahteesdollar/drugenv-qwen25-05b-lora \
PYTHONPATH=. python run_agent.py
```

## Quick Training Update (Colab LoRA Adapter)

HF compute credits ran out before completing a full GRPO run, so we ran a short adapter-only LoRA fine-tune on a free Colab T4 to demonstrate measurable improvement.

- Base model: `Qwen/Qwen2.5-0.5B-Instruct` (4-bit)
- LoRA adapter (HF Hub): `anugrahteesdollar/drugenv-qwen25-05b-lora`
- Training: ~300 steps, LoRA r=8, alpha=16, dropout=0.05, target modules `q_proj`, `v_proj`
- Outcome: better adherence to the DrugEnv action-JSON parameter schema.

Example (step 0 prompt, EGFR / NSCLC):

Before (base):

```json
{"action_type":"query_expression","parameters":{"term":"EGFR","dossier_id":"NSCLC"},"reasoning":"Query for the presence of EGFR in the given dataset."}
```

After (base + LoRA):

```json
{"action_type":"query_expression","parameters":{"database":"GTEx"},"reasoning":"Start by querying the expressible space across the entire expression space."}
```

## Reproduce

Three commands cover the env-locally / training-locally / training-on-Space paths:

```bash
# 1. Env locally (CPU is fine — the env itself is dependency-light)
pip install -e . && PYTHONPATH=. python -m server.app
# → http://localhost:8000 (also at https://huggingface.co/spaces/anugrahteesdollar/drugenv when deployed)

# 2. Training locally (single GPU, vanilla GRPO)
pip install -e ".[train]"
PYTHONPATH=. python -m training.training_script \
  --model-id Qwen/Qwen2.5-3B-Instruct \
  --evidence-dir evidence \
  --output-dir runs/grpo-output

# 3. Training on a Hugging Face Space (H200 single-GPU)
# Push space/training/ to anugrahteesdollar/drugenv-trainer, set PUSH_REPO + HF_TOKEN
# in the Space variables, then POST /train.
# → https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
```
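Once `PUSH_REPO` and `HF_TOKEN` are set in the Space variables, step 3 boils down to one HTTP call. A minimal sketch using `requests`, where the `*.hf.space` base URL follows the usual Space-hosting convention and is an assumption, as is the shape of the response:

```python
import requests

# Assumed URL: Hugging Face Spaces conventionally serve at
# <owner>-<space-name>.hf.space. Adjust if the Space is renamed.
TRAINER_URL = "https://anugrahteesdollar-drugenv-trainer.hf.space"

resp = requests.post(f"{TRAINER_URL}/train", timeout=30)
resp.raise_for_status()
# The exact response payload is Space-defined; print whatever comes back.
print(resp.text)
```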
The trainer Space's FastAPI control panel (`space/training/app.py`) streams a live evidence dashboard while training runs — per-step training curve, mid-training checkpoint progression, and a before / after summary card. The default hardware target is a single H200 (`h200x1`); an H200 gives roughly 4× A100 throughput, which works out to about $0.05–0.10 per step on Qwen2.5-3B-class GRPO.

An optional SFT warm-start (`training/sft_warmstart.py`) is controlled via the `SFT_WARMSTART` env var on the Space (default on). It collects oracle trajectories on the curated scenario library, SFTs the base model with a small LoRA, and hands the merged checkpoint to GRPO so the policy starts with a non-zero prior over correct trajectories.

## Baseline scores

| Difficulty bucket | Random policy | Heuristic policy | Trained Qwen2.5-3B |
|---|---|---|---|
| Easy (`egfr_nsclc_viable`) | *filled in after first training run* | *filled in after first training run* | *filled in after first training run* |
| Medium (`kras_pdac_borderline`) | *filled in after first training run* | *filled in after first training run* | *filled in after first training run* |
| Hard (`cd33_aml_misleading`) | *filled in after first training run* | *filled in after first training run* | *filled in after first training run* |

The trainer Space writes the populated table to `evidence/before_after_metrics.json` automatically on every run.

## Evolution note

The deployment scaffolding in this repository — the trainer Space control panel, the live training-evidence callback, the SFT warm-start script, and the working dependency pin set — was originally validated against a particle-physics-themed prototype and then carried forward when we pivoted to drug discovery. The simulator, scenarios, action space, reward function, and rules engine are all drug-domain native; the inheritance is exclusively in the training and evaluation scaffolding.