---
title: Drug Target Validation Environment
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
- reinforcement-learning
- drug-discovery
- pharma
---
# 🧬 DrugEnv: Drug Target Validation Environment
> **DrugEnv** is an OpenEnv RL environment that teaches LLMs to do computational drug-target validation.
This repository implements an OpenEnv-compatible reinforcement learning environment in which an agent acts as a **computational drug discovery scientist**. Given a proposed drug target (gene / protein) and a disease context, the agent must investigate target viability by issuing simulated bioinformatics, clinical, and experimental queries, and finally submit a calibrated **go / no-go** validation report with a confidence score.
The environment is designed as a partially observable Markov decision process (POMDP) with:
- a hidden ground-truth `TargetProfile` (expression, druggability, selectivity, toxicity, clinical precedent)
- noisy database / assay outputs governed by `DataQualityState`
- a single unified **experimental credit** budget per episode
- visible task metadata, dossier of accumulated findings, and step history
- dense step-wise reward plus terminal reward for decision quality and evidence coverage
## Why drug target validation?
Roughly **90% of drug development programs fail** in clinical trials, and a large fraction of failures trace back to mistakes during target validation: targets that are not actually disease-driving, are undruggable, lack selectivity, or have hidden toxicity. The cost of progressing a single bad target through Phase III can run into the **billions of dollars**. Even modest improvements in early-stage decision quality therefore translate into enormous savings and faster cures.
This environment lets you train and benchmark agents on exactly that bottleneck: **acquiring the right evidence cheaply and submitting a well-calibrated go / no-go**.
## How it works
At a high level, each episode looks like this:
1. `reset()` selects a drug-target-validation scenario and seeds the simulator.
2. The agent receives a `ValidationObservation` describing the target, indication, remaining credits, accumulated dossier, and step history.
3. The agent submits a `DrugTargetAction` such as `query_expression`, `druggability_screen`, `off_target_screen`, or `submit_validation_report`.
4. The rule engine checks credit budget, redundancy, and ordering prerequisites.
5. The transition engine deducts credits and asks the output generator to simulate evidence from the hidden `TargetProfile`.
6. The reward computer scores the step for novelty, reasoning coherence, credit efficiency, and rule compliance.
7. The environment returns a new observation with an updated `EvidenceDossier`, latest output, violations, and reward.
8. The episode ends when the agent submits a validation report, exhausts credits, or hits the step limit.
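Condensed into code, one episode is a plain reset/step loop over the client API shown under "Talking to the environment" below. A toy sketch; `pick_next_action` is a hypothetical stand-in for a real policy:
```python
import random

from client import DrugTargetEnv
from models import DrugTargetAction

CHEAP_PROBES = ["query_expression", "literature_search", "druggability_screen"]

def pick_next_action(obs) -> DrugTargetAction:
    """Toy policy: probe while the budget holds, then commit to a report."""
    if obs.credits_remaining > 5:
        return DrugTargetAction(
            action_type=random.choice(CHEAP_PROBES),
            reasoning="Gather cheap evidence first",
        )
    return DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Budget nearly spent; committing to a decision",
        final_decision="no_go" if obs.dossier.flagged_red_flags else "go",
        confidence=0.6,
    )

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    while not result.done:
        result = env.step(pick_next_action(result.observation))
    print("episode reward:", result.reward)
```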
## The core mental model
### Hidden state
The simulator maintains a `FullLatentState` that the agent never sees directly:
- `TargetProfile`: true expression level / tissue specificity / disease over-expression, druggability score, binding-pocket quality, selectivity ratio, off-target genes, toxicity profile, clinical precedent, expected in-vitro and in-vivo behaviour, plus the hidden `correct_decision`, `true_viability_score`, `key_evidence_dimensions`, and any `misleading_signals`.
- `DataQualityState`: noise level, false-positive rate, false-negative rate, database coverage.
- `CreditState`: total / used / remaining experimental credits.
- `ValidationProgress`: boolean flags for which evidence dimensions have been investigated and whether a report has been submitted.
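For orientation, a rough sketch of these components; field names are taken from the descriptions above, and the exact types and extra fields in `FullLatentState` may differ:
```python
# Illustrative shapes only (subset of fields); the authoritative
# definitions live in the server code.
from dataclasses import dataclass, field

@dataclass
class TargetProfile:
    expression_level: float
    druggability_score: float
    selectivity_ratio: float
    off_target_genes: list[str]
    correct_decision: str                 # "go" or "no_go"; never shown to the agent
    true_viability_score: float
    key_evidence_dimensions: list[str]    # e.g. ["expression", "off_target", "clinical"]
    misleading_signals: list[str] = field(default_factory=list)

@dataclass
class DataQualityState:
    noise_level: float
    false_positive_rate: float
    false_negative_rate: float
    database_coverage: float

@dataclass
class CreditState:
    total: int
    used: int

    @property
    def remaining(self) -> int:
        return self.total - self.used
```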
### Visible state
The agent only sees `ValidationObservation`, which includes:
- `target_gene`, `disease_context`, `indication`
- `credits_remaining` / `credits_total`
- `dossier`: running `EvidenceDossier` of expression / protein / clinical / safety / literature / experimental findings, plus any `flagged_red_flags`
- `pipeline_history`: list of past actions and their summary outputs
- `latest_output`: typed `IntermediateOutput` from the most recent step
- `rule_violations` and `step_reward_breakdown` for the last step
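An LLM policy typically flattens these fields into its prompt each turn; a small sketch using the attribute names above (exact shapes may differ):
```python
def observation_to_prompt(obs) -> str:
    """Render a ValidationObservation as compact prompt context (sketch)."""
    lines = [
        f"Target: {obs.target_gene} | Disease: {obs.disease_context} | Indication: {obs.indication}",
        f"Credits: {obs.credits_remaining}/{obs.credits_total}",
        f"Red flags: {obs.dossier.flagged_red_flags}",
        f"Past steps: {len(obs.pipeline_history)}",
    ]
    if obs.latest_output is not None:
        lines.append(f"Latest result: {obs.latest_output.summary}")
    if obs.rule_violations:
        lines.append(f"Rule violations last step: {obs.rule_violations}")
    return "\n".join(lines)
```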
## Action space
| Category | Action | Cost (credits) |
|---|---|---|
| Expression & omics | `query_expression`, `differential_expression`, `pathway_enrichment`, `coexpression_network` | 2 |
| Protein & structure | `protein_structure_lookup`, `binding_site_analysis`, `druggability_screen` | 3 |
| Protein & structure | `protein_interaction_network` | 2 |
| Clinical & safety | `clinical_trial_lookup`, `toxicity_panel`, `off_target_screen`, `patient_stratification` | 3 |
| Literature | `literature_search`, `evidence_synthesis`, `competitor_landscape` | 1 |
| Experimental | `crispr_knockout`, `biomarker_correlation` | 4 / 3 |
| Experimental | `in_vitro_assay` | 5 |
| Experimental | `in_vivo_model` | 8 |
| Meta | `flag_red_flag`, `request_expert_review` | 0 / 1 |
| Terminal | `submit_validation_report` | 0 |
`submit_validation_report` carries two extra fields: `final_decision` (`"go"` or `"no_go"`) and `confidence` in `[0, 1]`. The episode ends as soon as the report is submitted.
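The budget pressure is easiest to see with the costs as a lookup table; a hypothetical helper (costs transcribed from the table above; enforcement itself lives in the server's rule engine):
```python
ACTION_COSTS = {
    # Expression & omics
    "query_expression": 2, "differential_expression": 2,
    "pathway_enrichment": 2, "coexpression_network": 2,
    # Protein & structure
    "protein_structure_lookup": 3, "binding_site_analysis": 3,
    "druggability_screen": 3, "protein_interaction_network": 2,
    # Clinical & safety
    "clinical_trial_lookup": 3, "toxicity_panel": 3,
    "off_target_screen": 3, "patient_stratification": 3,
    # Literature
    "literature_search": 1, "evidence_synthesis": 1, "competitor_landscape": 1,
    # Experimental
    "crispr_knockout": 4, "biomarker_correlation": 3,
    "in_vitro_assay": 5, "in_vivo_model": 8,
    # Meta & terminal
    "flag_red_flag": 0, "request_expert_review": 1,
    "submit_validation_report": 0,
}

def can_afford(action_type: str, credits_remaining: int) -> bool:
    """True if the action fits the remaining experimental-credit budget."""
    return ACTION_COSTS[action_type] <= credits_remaining
```
A single `in_vivo_model` run costs 8 credits on its own, which is one reason the soft prerequisite of `in_vitro` before `in_vivo` matters.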
## Reward function
Every step receives a decomposed reward:
```
R_t = evidence_novelty_bonus
    + reasoning_coherence_bonus
    + credit_efficiency_penalty
    + rule_violation_penalty
    + [φ(s_{t+1}) - φ(s_t)]
```
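As a minimal sketch, assuming the four bonus/penalty terms are already computed by the reward computer and φ is a scalar potential over states:
```python
def step_reward(terms: dict[str, float], phi_prev: float, phi_next: float) -> float:
    """Assemble the decomposed step reward (sketch; real logic is server-side).

    The φ difference is potential-based shaping: it telescopes over the
    episode, so it nudges exploration without changing the optimal policy.
    """
    return (
        terms["evidence_novelty_bonus"]
        + terms["reasoning_coherence_bonus"]
        + terms["credit_efficiency_penalty"]   # <= 0 by convention
        + terms["rule_violation_penalty"]      # <= 0 by convention
        + (phi_next - phi_prev)
    )
```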
When the episode ends, a terminal reward is added:
```
R_T = 0.40 * decision_accuracy
    + 0.35 * evidence_coverage
    + 0.15 * credit_efficiency
    + 0.10 * reasoning_coherence
```
Where:
- `decision_accuracy`: `1.0` if the final go / no-go matched the hidden `correct_decision`, scaled by `2 * |confidence - 0.5|` so a confidently correct answer is fully rewarded and a confidently wrong answer is fully penalised.
- `evidence_coverage`: fraction of the scenario's `key_evidence_dimensions` (e.g. `expression`, `druggability`, `off_target`, `clinical`, `in_vitro`) that the agent actually investigated.
- `credit_efficiency`: `1 - redundant_calls / total_calls`.
- `reasoning_coherence`: fraction of actions whose soft prerequisites (e.g. `expression` before `toxicity`, `in_vitro` before `in_vivo`) were satisfied.
Hard penalties are applied for: submitting without any evidence, submitting without a decision or confidence, and exhausting credits without ever submitting a report.
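Putting the weights and the confidence scaling together, a sketch of the terminal score; one plausible reading of "fully penalised" is a negative margin for wrong answers:
```python
def terminal_reward(correct: bool, confidence: float, evidence_coverage: float,
                    credit_efficiency: float, reasoning_coherence: float) -> float:
    """Terminal reward sketch using the weights above (server may differ)."""
    # 2 * |confidence - 0.5| maps confidence in [0, 1] to a margin in [0, 1]:
    # hedging at 0.5 is worth nothing either way; 1.0 (or 0.0) is full strength.
    margin = 2 * abs(confidence - 0.5)
    decision_accuracy = margin if correct else -margin
    return (0.40 * decision_accuracy
            + 0.35 * evidence_coverage
            + 0.15 * credit_efficiency
            + 0.10 * reasoning_coherence)
```
For example, a correct `go` at `confidence=0.85` contributes `0.40 * 0.7 = 0.28` to `R_T`; the same report with the wrong decision contributes `-0.28`.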
## Curated scenarios
| Name | Difficulty | Correct decision | Why it's interesting |
|---|---|---|---|
| `egfr_nsclc_viable` | easy | `go` | Clear viable target: expression + druggability alone are sufficient. |
| `kras_pdac_borderline` | medium | `go` | Historically undruggable; recent inhibitor literature is decisive. |
| `cd33_aml_misleading` | hard | `no_go` | Naive expression query says "go", but off-target + toxicity + clinical reveal the right answer. |
| `tp53_solid_tumors_clear_fail` | easy-medium | `no_go` | Druggability check alone is sufficient. |
| `ptpn11_juvenile_mml_complex` | very hard | `go` | Requires `binding_site_analysis(include_allosteric=True)`, off-target work, patient stratification, and an in-vitro assay. |
The procedural generator (`server/tasks/procedural_generator.py`) layers on additional easy / medium / hard scenarios sampled from a pool of 20 real cancer targets and 8 cancer indications.
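For intuition, the sampling step amounts to drawing a (target, indication) pair at a difficulty tier; a hypothetical sketch (the pools below are truncated to names that appear in the curated scenarios; the real lists live in `server/tasks/procedural_generator.py`):
```python
import random

# Truncated pools for illustration; the repo ships 20 targets and 8 indications.
TARGET_POOL = ["EGFR", "KRAS", "CD33", "TP53", "PTPN11"]
INDICATION_POOL = ["NSCLC", "PDAC", "AML"]

def sample_scenario(difficulty: str, seed: int) -> dict:
    """Draw one procedural scenario spec (hypothetical shape)."""
    rng = random.Random(seed)
    return {
        "target_gene": rng.choice(TARGET_POOL),
        "indication": rng.choice(INDICATION_POOL),
        "difficulty": difficulty,
    }
```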
## Setup
```bash
# 1. Install dependencies (env runtime only)
pip install -e .
# 2. Or install with training extras (torch + transformers + trl + peft pinned to working set)
pip install -e .[train]
# 3. Run the environment server
PYTHONPATH=. python -m server.app
# server is now available at http://localhost:8000
```
The legacy `uv sync` workflow still works if you have `uv.lock` checked
in locally; the editable `pip install` path above is the primary
supported route.
## Talking to the environment
```python
from client import DrugTargetEnv
from models import DrugTargetAction

with DrugTargetEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.target_gene, "/", result.observation.indication)

    result = env.step(DrugTargetAction(
        action_type="query_expression",
        parameters={"database": "GTEx"},
        reasoning="Establish tissue baseline",
    ))
    print(result.observation.latest_output.summary)

    result = env.step(DrugTargetAction(
        action_type="submit_validation_report",
        reasoning="Sufficient evidence for go",
        final_decision="go",
        confidence=0.85,
    ))
    print("done:", result.done, "reward:", result.reward)
```
## Running the baseline agent
```bash
PYTHONPATH=. python run_agent.py
```
The script writes a live JSON snapshot to `_dashboard_state.json` after every step so you can watch the agent's progress. Default model is `Qwen/Qwen2.5-3B-Instruct`.
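Because the snapshot is rewritten in place as plain JSON, a tiny poller in a second terminal is enough to follow a run; a convenience sketch (the file's schema is whatever `run_agent.py` writes):
```python
import json
import time
from pathlib import Path

SNAPSHOT = Path("_dashboard_state.json")

last = None
while True:
    if SNAPSHOT.exists():
        try:
            state = json.loads(SNAPSHOT.read_text())
        except json.JSONDecodeError:
            time.sleep(0.1)  # caught the file mid-write; retry shortly
            continue
        if state != last:
            print(json.dumps(state, indent=2)[:800])  # truncated live view
            last = state
    time.sleep(1.0)
```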
## Reproduce
Three commands cover the env-locally / training-locally / training-on-Space paths:
```bash
# 1. Env locally (CPU is fine; the env itself is dependency-light)
pip install -e . && PYTHONPATH=. python -m server.app
# → http://localhost:8000 (also at https://huggingface.co/spaces/anugrahteesdollar/drugenv when deployed)
# 2. Training locally (single GPU, vanilla GRPO)
pip install -e .[train]
PYTHONPATH=. python -m training.training_script \
--model-id Qwen/Qwen2.5-3B-Instruct \
--evidence-dir evidence \
--output-dir runs/grpo-output
# 3. Training on a Hugging Face Space (H200 single-GPU)
# Push space/training/ to anugrahteesdollar/drugenv-trainer, set PUSH_REPO + HF_TOKEN
# in the Space variables, then POST /train.
# → https://huggingface.co/spaces/anugrahteesdollar/drugenv-trainer
```
The trainer Space's FastAPI control panel (`space/training/app.py`)
streams a live evidence dashboard while training runs: per-step
training curve, mid-training checkpoint progression, and a before /
after summary card. Default expected hardware: **H200 single-GPU**
(`h200x1`); H200 is ≈4× A100 throughput, ~$0.05–0.10 per step on
Qwen2.5-3B-class GRPO.
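Kicking off a run from outside the Space is a single request to the control panel. A hedged example: the hostname follows the usual `<owner>-<space>.hf.space` Spaces convention, and whether `/train` accepts a request body depends on what `space/training/app.py` defines:
```python
import urllib.request

# Assumed host per the Spaces naming convention; an empty-body POST is
# assumed to start a run with the Space's configured defaults.
req = urllib.request.Request(
    "https://anugrahteesdollar-drugenv-trainer.hf.space/train",
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```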
An optional **SFT warm-start** (`training/sft_warmstart.py`) is
controlled via the `SFT_WARMSTART` env var on the Space (default on).
It collects oracle trajectories on the curated scenario library, SFTs
the base model with a small LoRA, and hands the merged checkpoint to
GRPO so the policy starts with a non-zero prior over correct
trajectories.
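A minimal sketch of what the LoRA-SFT-then-merge stage could look like with `trl` + `peft`, assuming recent versions of both; the repo pins its own dependency set, and `training/sft_warmstart.py` may structure this differently. The one-example dataset is a hypothetical stand-in for the collected oracle trajectories:
```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical oracle trajectory rendered as one training text.
dataset = Dataset.from_list([
    {"text": "TARGET: EGFR / NSCLC\nACTION: query_expression ...\nREPORT: go (0.9)"},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="runs/sft-warmstart", max_steps=50),
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
)
trainer.train()

# Merge the LoRA adapter into the base weights so GRPO receives a plain
# checkpoint with a non-zero prior over correct trajectories.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("runs/sft-warmstart/merged")
```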
## Baseline scores
| Difficulty bucket | Random policy | Heuristic policy | Trained Qwen2.5-3B |
|---|---|---|---|
| Easy (`egfr_nsclc_viable`) | _filled in after first training run_ | _filled in after first training run_ | _filled in after first training run_ |
| Medium (`kras_pdac_borderline`) | _filled in after first training run_ | _filled in after first training run_ | _filled in after first training run_ |
| Hard (`cd33_aml_misleading`) | _filled in after first training run_ | _filled in after first training run_ | _filled in after first training run_ |
The trainer Space writes the populated table to
`evidence/before_after_metrics.json` automatically on every run.
## Evolution note
The deployment scaffolding in this repository (the trainer Space
control panel, the live training-evidence callback, the SFT warm-start
script, and the working dependency pin set) was originally validated
against a particle-physics-themed prototype and then carried forward
when we pivoted to drug discovery. The simulator, scenarios, action
space, reward function, and rules engine are all drug-domain native;
the inheritance is exclusively in the training and evaluation
scaffolding.