---
title: Drug Target Validation Environment
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - drug-discovery
  - pharma
---

# 🧬 DrugEnv: Drug Target Validation Environment

DrugEnv is an OpenEnv RL environment that teaches LLMs to perform computational drug-target validation.

This repository implements an OpenEnv-compatible reinforcement learning environment in which an agent acts as a computational drug discovery scientist. Given a proposed drug target (gene / protein) and a disease context, the agent must investigate target viability by issuing simulated bioinformatics, clinical, and experimental queries, and finally submit a calibrated go / no-go validation report with a confidence score.

The environment is designed as a partially observable Markov decision process (POMDP) with:

- a hidden ground-truth `TargetProfile` (expression, druggability, selectivity, toxicity, clinical precedent)
- noisy database / assay outputs governed by `DataQualityState`
- a single unified experimental credit budget per episode
- visible task metadata, a dossier of accumulated findings, and step history
- dense step-wise reward plus a terminal reward for decision quality and evidence coverage

## Why drug target validation?

Roughly 90% of drug development programs fail in clinical trials, and a large fraction of failures trace back to mistakes during target validation: targets that are not actually disease-driving, are undruggable, lack selectivity, or have hidden toxicity. The cost of progressing a single bad target through Phase III can run into the billions of dollars. Even modest improvements in early-stage decision quality therefore translate into enormous savings and faster cures.

This environment lets you train and benchmark agents on exactly that bottleneck: acquiring the right evidence cheaply and submitting a well-calibrated go / no-go.

## How it works

At a high level, each episode looks like this (a toy sketch of one step follows the list):

1. `reset()` selects a drug-target-validation scenario and seeds the simulator.
2. The agent receives a `ValidationObservation` describing the target, indication, remaining credits, accumulated dossier, and step history.
3. The agent submits a `DrugTargetAction` such as `query_expression`, `druggability_screen`, `off_target_screen`, or `submit_validation_report`.
4. The rule engine checks the credit budget, redundancy, and ordering prerequisites.
5. The transition engine deducts credits and asks the output generator to simulate evidence from the hidden `TargetProfile`.
6. The reward computer scores the step for novelty, reasoning coherence, credit efficiency, and rule compliance.
7. The environment returns a new observation with an updated `EvidenceDossier`, latest output, violations, and reward.
8. The episode ends when the agent submits a validation report, exhausts credits, or hits the step limit.
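
For intuition, here is a minimal, self-contained toy of steps 4-7. Every name and number in it is an illustrative stand-in; the real rule engine, transition engine, and reward computer live under `server/`:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool

def toy_step(state: dict, action: dict) -> StepResult:
    # 4. Rule engine: check the credit budget (redundancy / prerequisites omitted here).
    violations = ["insufficient_credits"] if action["cost"] > state["credits"] else []
    # 5. Transition engine: deduct credits and simulate evidence from hidden state.
    if not violations:
        state["credits"] -= action["cost"]
    evidence = {"summary": f"simulated output for {action['action_type']}"}
    # 6. Reward computer: toy scoring of novelty minus violation penalties.
    reward = 0.1 - 0.05 * len(violations)
    # 7. New observation carrying the latest output and any violations.
    observation = {**state, "latest_output": evidence, "rule_violations": violations}
    done = action["action_type"] == "submit_validation_report" or state["credits"] <= 0
    return StepResult(observation, reward, done)

print(toy_step({"credits": 20}, {"action_type": "query_expression", "cost": 2}))
```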

## The core mental model

### Hidden state

The simulator maintains a `FullLatentState` that the agent never sees directly:

- `TargetProfile`: true expression level / tissue specificity / disease over-expression, druggability score, binding-pocket quality, selectivity ratio, off-target genes, toxicity profile, clinical precedent, expected in-vitro and in-vivo behaviour, plus the hidden `correct_decision`, `true_viability_score`, `key_evidence_dimensions`, and any `misleading_signals`.
- `DataQualityState`: noise level, false-positive rate, false-negative rate, database coverage.
- `CreditState`: total / used / remaining experimental credits.
- `ValidationProgress`: boolean flags for which evidence dimensions have been investigated and whether a report has been submitted.
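
As a rough sketch, two of these containers might look like the following, assuming plain dataclasses (field names mirror the bullets above; the real definitions live in the server code and may differ):

```python
from dataclasses import dataclass

@dataclass
class DataQualityState:
    noise_level: float            # how noisy simulated assay outputs are
    false_positive_rate: float
    false_negative_rate: float
    database_coverage: float      # fraction of queries the databases can answer

@dataclass
class CreditState:
    total: int
    used: int = 0

    @property
    def remaining(self) -> int:   # remaining experimental credit budget
        return self.total - self.used
```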

### Visible state

The agent only sees `ValidationObservation`, which includes:

- `target_gene`, `disease_context`, `indication`
- `credits_remaining` / `credits_total`
- `dossier`: a running `EvidenceDossier` of expression / protein / clinical / safety / literature / experimental findings, plus any `flagged_red_flags`
- `pipeline_history`: a list of past actions and their summary outputs
- `latest_output`: the typed `IntermediateOutput` from the most recent step
- `rule_violations` and `step_reward_breakdown` for the last step
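
A hypothetical shape for that observation, again as a dataclass sketch with assumed types (the authoritative definition presumably sits alongside `DrugTargetAction` in `models.py`):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ValidationObservation:
    target_gene: str
    disease_context: str
    indication: str
    credits_remaining: int
    credits_total: int
    dossier: dict[str, Any] = field(default_factory=dict)        # EvidenceDossier in the real code
    pipeline_history: list[dict[str, Any]] = field(default_factory=list)
    latest_output: Optional[dict[str, Any]] = None                # typed IntermediateOutput in the real code
    rule_violations: list[str] = field(default_factory=list)
    step_reward_breakdown: dict[str, float] = field(default_factory=dict)
```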

## Action space

| Category | Actions | Cost (credits) |
|---|---|---|
| Expression & omics | `query_expression`, `differential_expression`, `pathway_enrichment`, `coexpression_network` | 2 |
| Protein & structure | `protein_structure_lookup`, `binding_site_analysis`, `druggability_screen` | 3 |
| Protein & structure | `protein_interaction_network` | 2 |
| Clinical & safety | `clinical_trial_lookup`, `toxicity_panel`, `off_target_screen`, `patient_stratification` | 3 |
| Literature | `literature_search`, `evidence_synthesis`, `competitor_landscape` | 1 |
| Experimental | `crispr_knockout`, `biomarker_correlation` | 4 / 3 |
| Experimental | `in_vitro_assay` | 5 |
| Experimental | `in_vivo_model` | 8 |
| Meta | `flag_red_flag`, `request_expert_review` | 0 / 1 |
| Terminal | `submit_validation_report` | 0 |

`submit_validation_report` carries two extra fields: `final_decision` (`"go"` or `"no_go"`) and `confidence` in [0, 1]. The episode ends as soon as the report is submitted.

## Reward function

Every step receives a decomposed reward:

```
R_t = evidence_novelty_bonus
    + reasoning_coherence_bonus
    - credit_efficiency_penalty
    - rule_violation_penalty
    + [φ(s_{t+1}) - φ(s_t)]
```

The bracketed term is potential-based shaping over a state potential φ: it densifies feedback without changing which policy is optimal.

When the episode ends, a terminal reward is added:

```
R_T = 0.40 * decision_accuracy
    + 0.35 * evidence_coverage
    + 0.15 * credit_efficiency
    + 0.10 * reasoning_coherence
```

Where:

- `decision_accuracy`: 1.0 if the final go / no-go matched the hidden `correct_decision`, scaled by 2 * |confidence - 0.5| so a confidently correct answer is fully rewarded and a confidently wrong answer is fully penalised.
- `evidence_coverage`: the fraction of the scenario's `key_evidence_dimensions` (e.g. expression, druggability, off_target, clinical, in_vitro) that the agent actually investigated.
- `credit_efficiency`: 1 - redundant_calls / total_calls.
- `reasoning_coherence`: the fraction of actions whose soft prerequisites (e.g. expression before toxicity, in_vitro before in_vivo) were satisfied.
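
As a worked example, take a hypothetical episode that ends with a correct "go" at confidence 0.85, having covered 4 of 5 key dimensions, with 1 redundant call out of 10 and 9 of 10 prerequisite-respecting actions:

```python
decision_accuracy   = 1.0 * 2 * abs(0.85 - 0.5)   # 0.70  (correct decision, confidence 0.85)
evidence_coverage   = 4 / 5                        # 0.80
credit_efficiency   = 1 - 1 / 10                   # 0.90
reasoning_coherence = 9 / 10                       # 0.90

R_T = (0.40 * decision_accuracy + 0.35 * evidence_coverage
       + 0.15 * credit_efficiency + 0.10 * reasoning_coherence)   # = 0.785
```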

Hard penalties are applied for: submitting without any evidence, submitting without a decision or confidence, and exhausting credits without ever submitting a report.

## Curated scenarios

| Name | Difficulty | Correct decision | Why it's interesting |
|---|---|---|---|
| `egfr_nsclc_viable` | easy | go | Clear viable target: expression + druggability alone are sufficient. |
| `kras_pdac_borderline` | medium | go | Historically undruggable; recent inhibitor literature is decisive. |
| `cd33_aml_misleading` | hard | no_go | A naive expression query says "go", but off-target + toxicity + clinical evidence reveal the right answer. |
| `tp53_solid_tumors_clear_fail` | easy-medium | no_go | A druggability check alone is sufficient. |
| `ptpn11_juvenile_mml_complex` | very hard | go | Requires `binding_site_analysis(include_allosteric=True)`, off-target work, patient stratification, and an in-vitro assay. |

The procedural generator (`server/tasks/procedural_generator.py`) layers on additional easy / medium / hard scenarios sampled from a pool of 20 real cancer targets and 8 cancer indications.

## Setup

```bash
# 1. Install dependencies (env runtime only)
pip install -e .

# 2. Or install with training extras (torch + transformers + trl + peft pinned to a working set)
pip install -e .[train]

# 3. Run the environment server
PYTHONPATH=. python -m server.app
# server is now available at http://localhost:8000
```

The legacy `uv sync` workflow still works if you have `uv.lock` checked in locally; the editable pip install path above is the primary supported route.

## Talking to the environment

```python
from client import DrugTargetEnv
from models import DrugTargetAction

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.target_gene, "/", result.observation.indication)

    result = env.step(DrugTargetAction(
        action_type="query_expression",
        parameters={"database": "GTEx"},
        reasoning="Establish tissue baseline",
    ))
    print(result.observation.latest_output.summary)

    result = env.step(DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Sufficient evidence for go",
        final_decision="go",
        confidence=0.85,
    ))
    print("done:", result.done, "reward:", result.reward)
```

## Running the baseline agent

```bash
PYTHONPATH=. python run_agent.py
```

The script writes a live JSON snapshot to `_dashboard_state.json` after every step so you can watch the agent's progress.
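
If you want to tail that snapshot outside the bundled dashboard, a tiny polling loop is enough (the filename comes from above; the snapshot's schema is whatever `run_agent.py` last wrote):

```python
import json
import time
from pathlib import Path

snapshot = Path("_dashboard_state.json")
while True:
    if snapshot.exists():
        # Print the latest agent step; the schema is defined by run_agent.py.
        print(json.loads(snapshot.read_text()))
    time.sleep(2.0)
```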

The agent model is configurable via env vars:

- `RUN_AGENT_MODEL_ID` (HF model id)
- `RUN_AGENT_LORA_ADAPTER` (optional LoRA adapter repo/path)

Example (base + LoRA):

```bash
RUN_AGENT_MODEL_ID=Qwen/Qwen2.5-0.5B-Instruct \
RUN_AGENT_LORA_ADAPTER=anugrahteesdollar/drugenv-qwen25-05b-lora \
PYTHONPATH=. python run_agent.py
```

## Quick Training Update (Colab LoRA Adapter)

HF compute credits ran out before completing a full GRPO run, so we ran a short adapter-only LoRA fine-tune on a free Colab T4 to demonstrate measurable improvement.

- Base model: `Qwen/Qwen2.5-0.5B-Instruct` (4-bit)
- LoRA adapter (HF Hub): `anugrahteesdollar/drugenv-qwen25-05b-lora`
- Training: ~300 steps, LoRA r=8, alpha=16, dropout=0.05, target modules `q_proj`, `v_proj`
- Outcome: improved adherence to the DrugEnv action-JSON parameter schema.

Example (step 0 prompt, EGFR / NSCLC):

Before (base):

```json
{"action_type":"query_expression","parameters":{"term":"EGFR","dossier_id":"NSCLC"},"reasoning":"Query for the presence of EGFR in the given dataset."}
```

After (base + LoRA):

```json
{"action_type":"query_expression","parameters":{"database":"GTEx"},"reasoning":"Start by querying the expressible space across the entire expression space."}
```
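
For reference, combining the base model with the adapter at inference time follows the standard `transformers` + `peft` pattern (a sketch; `run_agent.py` wires this up from the env vars above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the LoRA adapter published on the Hub.
model = PeftModel.from_pretrained(base, "anugrahteesdollar/drugenv-qwen25-05b-lora")
```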

## Reproduce

Three commands cover the env-locally / training-locally / training-on-Space paths:

```bash
# 1. Env locally (CPU is fine; the env itself is dependency-light)
pip install -e . && PYTHONPATH=. python -m server.app
# → http://localhost:8000  (also at https://huggingface.co/spaces/anugrahteesdollar/drugenv when deployed)

# 2. Training locally (single GPU, vanilla GRPO)
pip install -e .[train]
PYTHONPATH=. python -m training.training_script \
    --model-id Qwen/Qwen2.5-3B-Instruct \
    --evidence-dir evidence \
    --output-dir runs/grpo-output

# 3. Training on a Hugging Face Space (H200 single-GPU)
#    Push space/training/ to anugrahteesdollar/drugenv-trainer, set PUSH_REPO + HF_TOKEN
#    in the Space variables, then POST /train.
#    → https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
```

The trainer Space's FastAPI control panel (`space/training/app.py`) streams a live evidence dashboard while training runs: a per-step training curve, mid-training checkpoint progression, and a before / after summary card. The default expected hardware is a single H200 GPU (`h200x1`); an H200 delivers roughly 4× A100 throughput, working out to about $0.05–0.10 per step on Qwen2.5-3B-class GRPO.

An optional SFT warm-start (`training/sft_warmstart.py`) is controlled via the `SFT_WARMSTART` env var on the Space (default on). It collects oracle trajectories on the curated scenario library, SFTs the base model with a small LoRA, and hands the merged checkpoint to GRPO so the policy starts with a non-zero prior over correct trajectories.

## Baseline scores

| Difficulty bucket | Random policy | Heuristic policy | Trained Qwen2.5-3B |
|---|---|---|---|
| Easy (`egfr_nsclc_viable`) | filled in after first training run | filled in after first training run | filled in after first training run |
| Medium (`kras_pdac_borderline`) | filled in after first training run | filled in after first training run | filled in after first training run |
| Hard (`cd33_aml_misleading`) | filled in after first training run | filled in after first training run | filled in after first training run |

The trainer Space writes the populated table to `evidence/before_after_metrics.json` automatically on every run.

## Evolution note

The deployment scaffolding in this repository (the trainer Space control panel, the live training-evidence callback, the SFT warm-start script, and the working dependency pin set) was originally validated against a particle-physics-themed prototype and then carried forward when we pivoted to drug discovery. The simulator, scenarios, action space, reward function, and rules engine are all drug-domain native; the inheritance is exclusively in the training and evaluation scaffolding.