---
title: Drug Target Validation Environment
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
- reinforcement-learning
- drug-discovery
- pharma
---
# 🧬 DrugEnv: Drug Target Validation Environment
> **DrugEnv** is an OpenEnv RL environment that teaches LLMs to do computational drug-target validation.
This repository implements an OpenEnv-compatible reinforcement learning environment in which an agent acts as a **computational drug discovery scientist**. Given a proposed drug target (gene / protein) and a disease context, the agent must investigate target viability by issuing simulated bioinformatics, clinical, and experimental queries, and finally submit a calibrated **go / no-go** validation report with a confidence score.
The environment is designed as a partially observable Markov decision process (POMDP) with:
- a hidden ground-truth `TargetProfile` (expression, druggability, selectivity, toxicity, clinical precedent)
- noisy database / assay outputs governed by `DataQualityState`
- a single unified **experimental credit** budget per episode
- visible task metadata, dossier of accumulated findings, and step history
- dense step-wise reward plus terminal reward for decision quality and evidence coverage
## Why drug target validation?
Roughly **90% of drug development programs fail** in clinical trials, and a large fraction of failures trace back to mistakes during target validation: targets that are not actually disease-driving, are undruggable, lack selectivity, or have hidden toxicity. The cost of progressing a single bad target through Phase III can run into the **billions of dollars**. Even modest improvements in early-stage decision quality therefore translate into enormous savings and faster cures.
This environment lets you train and benchmark agents on exactly that bottleneck: **acquiring the right evidence cheaply and submitting a well-calibrated go / no-go**.
## How it works
At a high level, each episode looks like this:
1. `reset()` selects a drug-target-validation scenario and seeds the simulator.
2. The agent receives a `ValidationObservation` describing the target, indication, remaining credits, accumulated dossier, and step history.
3. The agent submits a `DrugTargetAction` such as `query_expression`, `druggability_screen`, `off_target_screen`, or `submit_validation_report`.
4. The rule engine checks credit budget, redundancy, and ordering prerequisites.
5. The transition engine deducts credits and asks the output generator to simulate evidence from the hidden `TargetProfile`.
6. The reward computer scores the step for novelty, reasoning coherence, credit efficiency, and rule compliance.
7. The environment returns a new observation with an updated `EvidenceDossier`, latest output, violations, and reward.
8. The episode ends when the agent submits a validation report, exhausts credits, or hits the step limit.
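Condensed into code, one episode is a plain reset/step loop over the client API shown under "Talking to the environment" below. A toy sketch; `pick_next_action` is a hypothetical stand-in for a real policy:
```python
import random

from client import DrugTargetEnv
from models import DrugTargetAction

CHEAP_PROBES = ["query_expression", "literature_search", "druggability_screen"]

def pick_next_action(obs) -> DrugTargetAction:
    """Toy policy: probe while the budget holds, then commit to a report."""
    if obs.credits_remaining > 5:
        return DrugTargetAction(
            action_type=random.choice(CHEAP_PROBES),
            reasoning="Gather cheap evidence first",
        )
    return DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Budget nearly spent; committing to a decision",
        final_decision="no_go" if obs.dossier.flagged_red_flags else "go",
        confidence=0.6,
    )

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    while not result.done:
        result = env.step(pick_next_action(result.observation))
    print("episode reward:", result.reward)
```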
## The core mental model
### Hidden state
The simulator maintains a `FullLatentState` that the agent never sees directly:
- `TargetProfile`: true expression level / tissue specificity / disease over-expression, druggability score, binding-pocket quality, selectivity ratio, off-target genes, toxicity profile, clinical precedent, expected in-vitro and in-vivo behaviour, plus the hidden `correct_decision`, `true_viability_score`, `key_evidence_dimensions`, and any `misleading_signals`.
- `DataQualityState`: noise level, false-positive rate, false-negative rate, database coverage.
- `CreditState`: total / used / remaining experimental credits.
- `ValidationProgress`: boolean flags for which evidence dimensions have been investigated and whether a report has been submitted.
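For orientation, a rough sketch of these components; field names are taken from the descriptions above, and the exact types and extra fields in `FullLatentState` may differ:
```python
# Illustrative shapes only (subset of fields); the authoritative
# definitions live in the server code.
from dataclasses import dataclass, field

@dataclass
class TargetProfile:
    expression_level: float
    druggability_score: float
    selectivity_ratio: float
    off_target_genes: list[str]
    correct_decision: str                 # "go" or "no_go"; never shown to the agent
    true_viability_score: float
    key_evidence_dimensions: list[str]    # e.g. ["expression", "off_target", "clinical"]
    misleading_signals: list[str] = field(default_factory=list)

@dataclass
class DataQualityState:
    noise_level: float
    false_positive_rate: float
    false_negative_rate: float
    database_coverage: float

@dataclass
class CreditState:
    total: int
    used: int

    @property
    def remaining(self) -> int:
        return self.total - self.used
```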
### Visible state
The agent only sees `ValidationObservation`, which includes:
- `target_gene`, `disease_context`, `indication`
- `credits_remaining` / `credits_total`
- `dossier`: running `EvidenceDossier` of expression / protein / clinical / safety / literature / experimental findings, plus any `flagged_red_flags`
- `pipeline_history`: list of past actions and their summary outputs
- `latest_output`: typed `IntermediateOutput` from the most recent step
- `rule_violations` and `step_reward_breakdown` for the last step
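An LLM policy typically flattens these fields into its prompt each turn; a small sketch using the attribute names above (exact shapes may differ):
```python
def observation_to_prompt(obs) -> str:
    """Render a ValidationObservation as compact prompt context (sketch)."""
    lines = [
        f"Target: {obs.target_gene} | Disease: {obs.disease_context} | Indication: {obs.indication}",
        f"Credits: {obs.credits_remaining}/{obs.credits_total}",
        f"Red flags: {obs.dossier.flagged_red_flags}",
        f"Past steps: {len(obs.pipeline_history)}",
    ]
    if obs.latest_output is not None:
        lines.append(f"Latest result: {obs.latest_output.summary}")
    if obs.rule_violations:
        lines.append(f"Rule violations last step: {obs.rule_violations}")
    return "\n".join(lines)
```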
## Action space
| Category | Action | Cost (credits) |
|---|---|---|
| Expression & omics | `query_expression`, `differential_expression`, `pathway_enrichment`, `coexpression_network` | 2 |
| Protein & structure | `protein_structure_lookup`, `binding_site_analysis`, `druggability_screen` | 3 |
| Protein & structure | `protein_interaction_network` | 2 |
| Clinical & safety | `clinical_trial_lookup`, `toxicity_panel`, `off_target_screen`, `patient_stratification` | 3 |
| Literature | `literature_search`, `evidence_synthesis`, `competitor_landscape` | 1 |
| Experimental | `crispr_knockout`, `biomarker_correlation` | 4 / 3 |
| Experimental | `in_vitro_assay` | 5 |
| Experimental | `in_vivo_model` | 8 |
| Meta | `flag_red_flag`, `request_expert_review` | 0 / 1 |
| Terminal | `submit_validation_report` | 0 |
`submit_validation_report` carries two extra fields: `final_decision` (`"go"` or `"no_go"`) and `confidence` in `[0, 1]`. The episode ends as soon as the report is submitted.
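The budget pressure is easiest to see with the costs as a lookup table; a hypothetical helper (costs transcribed from the table above; enforcement itself lives in the server's rule engine):
```python
ACTION_COSTS = {
    # Expression & omics
    "query_expression": 2, "differential_expression": 2,
    "pathway_enrichment": 2, "coexpression_network": 2,
    # Protein & structure
    "protein_structure_lookup": 3, "binding_site_analysis": 3,
    "druggability_screen": 3, "protein_interaction_network": 2,
    # Clinical & safety
    "clinical_trial_lookup": 3, "toxicity_panel": 3,
    "off_target_screen": 3, "patient_stratification": 3,
    # Literature
    "literature_search": 1, "evidence_synthesis": 1, "competitor_landscape": 1,
    # Experimental
    "crispr_knockout": 4, "biomarker_correlation": 3,
    "in_vitro_assay": 5, "in_vivo_model": 8,
    # Meta & terminal
    "flag_red_flag": 0, "request_expert_review": 1,
    "submit_validation_report": 0,
}

def can_afford(action_type: str, credits_remaining: int) -> bool:
    """True if the action fits the remaining experimental-credit budget."""
    return ACTION_COSTS[action_type] <= credits_remaining
```
A single `in_vivo_model` run costs 8 credits on its own, which is one reason the soft prerequisite of `in_vitro` before `in_vivo` matters.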
## Reward function
Every step receives a decomposed reward:
```
R_t = evidence_novelty_bonus
    + reasoning_coherence_bonus
    + credit_efficiency_penalty
    + rule_violation_penalty
    + [φ(s_{t+1}) - φ(s_t)]
```
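As a minimal sketch, assuming the four bonus/penalty terms are already computed by the reward computer and φ is a scalar potential over states:
```python
def step_reward(terms: dict[str, float], phi_prev: float, phi_next: float) -> float:
    """Assemble the decomposed step reward (sketch; real logic is server-side).

    The φ difference is potential-based shaping: it telescopes over the
    episode, so it nudges exploration without changing the optimal policy.
    """
    return (
        terms["evidence_novelty_bonus"]
        + terms["reasoning_coherence_bonus"]
        + terms["credit_efficiency_penalty"]   # <= 0 by convention
        + terms["rule_violation_penalty"]      # <= 0 by convention
        + (phi_next - phi_prev)
    )
```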
When the episode ends, a terminal reward is added:
```
R_T = 0.40 * decision_accuracy
    + 0.35 * evidence_coverage
    + 0.15 * credit_efficiency
    + 0.10 * reasoning_coherence
```
Where:
- `decision_accuracy`: `1.0` if the final go / no-go matched the hidden `correct_decision`, scaled by `2 * |confidence - 0.5|` so a confidently correct answer is fully rewarded and a confidently wrong answer is fully penalised.
- `evidence_coverage`: fraction of the scenario's `key_evidence_dimensions` (e.g. `expression`, `druggability`, `off_target`, `clinical`, `in_vitro`) that the agent actually investigated.
- `credit_efficiency`: `1 - redundant_calls / total_calls`.
- `reasoning_coherence`: fraction of actions whose soft prerequisites (e.g. `expression` before `toxicity`, `in_vitro` before `in_vivo`) were satisfied.
Hard penalties are applied for: submitting without any evidence, submitting without a decision or confidence, and exhausting credits without ever submitting a report.
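Putting the weights and the confidence scaling together, a sketch of the terminal score; one plausible reading of "fully penalised" is a negative margin for wrong answers:
```python
def terminal_reward(correct: bool, confidence: float, evidence_coverage: float,
                    credit_efficiency: float, reasoning_coherence: float) -> float:
    """Terminal reward sketch using the weights above (server may differ)."""
    # 2 * |confidence - 0.5| maps confidence in [0, 1] to a margin in [0, 1]:
    # hedging at 0.5 is worth nothing either way; 1.0 (or 0.0) is full strength.
    margin = 2 * abs(confidence - 0.5)
    decision_accuracy = margin if correct else -margin
    return (0.40 * decision_accuracy
            + 0.35 * evidence_coverage
            + 0.15 * credit_efficiency
            + 0.10 * reasoning_coherence)
```
For example, a correct `go` at `confidence=0.85` contributes `0.40 * 0.7 = 0.28` to `R_T`; the same report with the wrong decision contributes `-0.28`.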
## Curated scenarios
| Name | Difficulty | Correct decision | Why it's interesting |
|---|---|---|---|
| `egfr_nsclc_viable` | easy | `go` | Clear viable target: expression + druggability alone are sufficient. |
| `kras_pdac_borderline` | medium | `go` | Historically undruggable; recent inhibitor literature is decisive. |
| `cd33_aml_misleading` | hard | `no_go` | Naive expression query says "go", but off-target + toxicity + clinical reveal the right answer. |
| `tp53_solid_tumors_clear_fail` | easy-medium | `no_go` | Druggability check alone is sufficient. |
| `ptpn11_juvenile_mml_complex` | very hard | `go` | Requires `binding_site_analysis(include_allosteric=True)`, off-target work, patient stratification, and an in-vitro assay. |
The procedural generator (`server/tasks/procedural_generator.py`) layers on additional easy / medium / hard scenarios sampled from a pool of 20 real cancer targets and 8 cancer indications.
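For intuition, the sampling step amounts to drawing a (target, indication) pair at a difficulty tier; a hypothetical sketch (the pools below are truncated to names that appear in the curated scenarios; the real lists live in `server/tasks/procedural_generator.py`):
```python
import random

# Truncated pools for illustration; the repo ships 20 targets and 8 indications.
TARGET_POOL = ["EGFR", "KRAS", "CD33", "TP53", "PTPN11"]
INDICATION_POOL = ["NSCLC", "PDAC", "AML"]

def sample_scenario(difficulty: str, seed: int) -> dict:
    """Draw one procedural scenario spec (hypothetical shape)."""
    rng = random.Random(seed)
    return {
        "target_gene": rng.choice(TARGET_POOL),
        "indication": rng.choice(INDICATION_POOL),
        "difficulty": difficulty,
    }
```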
## Setup
```bash
# 1. Install dependencies (env runtime only)
pip install -e .
# 2. Or install with training extras (torch + transformers + trl + peft pinned to working set)
pip install -e .[train]
# 3. Run the environment server
PYTHONPATH=. python -m server.app
# server is now available at http://localhost:8000
```
The legacy `uv sync` workflow still works if you have `uv.lock` checked
in locally; the editable `pip install` path above is the primary
supported route.
## Talking to the environment
```python
from client import DrugTargetEnv
from models import DrugTargetAction

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.target_gene, "/", result.observation.indication)

    result = env.step(DrugTargetAction(
        action_type="query_expression",
        parameters={"database": "GTEx"},
        reasoning="Establish tissue baseline",
    ))
    print(result.observation.latest_output.summary)

    result = env.step(DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Sufficient evidence for go",
        final_decision="go",
        confidence=0.85,
    ))
    print("done:", result.done, "reward:", result.reward)
```
## Running the baseline agent
```bash
PYTHONPATH=. python run_agent.py
```
The script writes a live JSON snapshot to `_dashboard_state.json` after every step so you can watch the agent's progress. Default model is `Qwen/Qwen2.5-3B-Instruct`.
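Because the snapshot is rewritten in place as plain JSON, a tiny poller in a second terminal is enough to follow a run; a convenience sketch (the file's schema is whatever `run_agent.py` writes):
```python
import json
import time
from pathlib import Path

SNAPSHOT = Path("_dashboard_state.json")

last = None
while True:
    if SNAPSHOT.exists():
        try:
            state = json.loads(SNAPSHOT.read_text())
        except json.JSONDecodeError:
            time.sleep(0.1)  # caught the file mid-write; retry shortly
            continue
        if state != last:
            print(json.dumps(state, indent=2)[:800])  # truncated live view
            last = state
    time.sleep(1.0)
```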
## Reproduce
Three commands cover the env-locally / training-locally / training-on-Space paths:
```bash
# 1. Env locally (CPU is fine; the env itself is dependency-light)
pip install -e . && PYTHONPATH=. python -m server.app
# → http://localhost:8000 (also at https://huggingface.co/spaces/anugrahteesdollar/drugenv when deployed)
# 2. Training locally (single GPU, vanilla GRPO)
pip install -e .[train]
PYTHONPATH=. python -m training.training_script \
--model-id Qwen/Qwen2.5-3B-Instruct \
--evidence-dir evidence \
--output-dir runs/grpo-output
# 3. Training on a Hugging Face Space (H200 single-GPU)
# Push space/training/ to anugrahteesdollar/drugenv-trainer, set PUSH_REPO + HF_TOKEN
# in the Space variables, then POST /train.
# → https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
```
The trainer Space's FastAPI control panel (`space/training/app.py`)
streams a live evidence dashboard while training runs: per-step
training curve, mid-training checkpoint progression, and a before /
after summary card. Default expected hardware: **H200 single-GPU**
(`h200x1`); H200 is ≈4× A100 throughput, ~$0.05–0.10 per step on
Qwen2.5-3B-class GRPO.
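Kicking off a run from outside the Space is a single request to the control panel. A hedged example: the hostname follows the usual `<owner>-<space>.hf.space` Spaces convention, and whether `/train` accepts a request body depends on what `space/training/app.py` defines:
```python
import urllib.request

# Assumed host per the Spaces naming convention; an empty-body POST is
# assumed to start a run with the Space's configured defaults.
req = urllib.request.Request(
    "https://anugrahteesdollar-drugenv-trainer.hf.space/train",
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```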
An optional **SFT warm-start** (`training/sft_warmstart.py`) is
controlled via the `SFT_WARMSTART` env var on the Space (default on).
It collects oracle trajectories on the curated scenario library, SFTs
the base model with a small LoRA, and hands the merged checkpoint to
GRPO so the policy starts with a non-zero prior over correct
trajectories.
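A minimal sketch of what the LoRA-SFT-then-merge stage could look like with `trl` + `peft`, assuming recent versions of both; the repo pins its own dependency set, and `training/sft_warmstart.py` may structure this differently. The one-example dataset is a hypothetical stand-in for the collected oracle trajectories:
```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical oracle trajectory rendered as one training text.
dataset = Dataset.from_list([
    {"text": "TARGET: EGFR / NSCLC\nACTION: query_expression ...\nREPORT: go (0.9)"},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="runs/sft-warmstart", max_steps=50),
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
)
trainer.train()

# Merge the LoRA adapter into the base weights so GRPO receives a plain
# checkpoint with a non-zero prior over correct trajectories.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("runs/sft-warmstart/merged")
```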
## Baseline scores
| Difficulty bucket | Random policy | Heuristic policy | Trained Qwen2.5-3B |
|---|---|---|---|
| Easy (`egfr_nsclc_viable`) | _filled in after first training run_ | _filled in after first training run_ | _filled in after first training run_ |
| Medium (`kras_pdac_borderline`) | _filled in after first training run_ | _filled in after first training run_ | _filled in after first training run_ |
| Hard (`cd33_aml_misleading`) | _filled in after first training run_ | _filled in after first training run_ | _filled in after first training run_ |
The trainer Space writes the populated table to
`evidence/before_after_metrics.json` automatically on every run.
## Evolution note
The deployment scaffolding in this repository (the trainer Space
control panel, the live training-evidence callback, the SFT warm-start
script, and the working dependency pin set) was originally validated
against a particle-physics-themed prototype and then carried forward
when we pivoted to drug discovery. The simulator, scenarios, action
space, reward function, and rules engine are all drug-domain native;
the inheritance is exclusively in the training and evaluation
scaffolding.