E. coli K12 Drug Effect Prediction Model

Multi-task GNN predicting antibacterial activity, mechanism of action, and protein targets from compound structure

What It Does

Given any compound's SMILES string, this model predicts three things simultaneously:

Output	Description	Test-Set Performance
MIC	Minimum inhibitory concentration (predicted as log₂ µM, reported in µM)	R²=0.50, Spearman ρ=0.70, 54% within 2-fold
Mechanism of Action	Which of 6 biological processes the drug disrupts	Macro F1=0.98
Protein Targets	Which of 17 E. coli proteins the drug binds	AUROC=0.997

Example Predictions

Drug	Predicted MIC	True MIC	MoA Prediction	Top Target
Ciprofloxacin	0.25 µM	0.03 µM	DNA replication ✓	gyrA ✓
Ampicillin	10.9 µM	8.0 µM	Cell wall ✓	murA ✓
Tetracycline	2.0 µM	1.5 µM	Protein synthesis ✓	—
Trimethoprim	1.4 µM	0.5 µM	Folate pathway ✓	folA (91%) ✓
Rifampicin	27.6 µM	16.0 µM	RNA synthesis ✓	—

Quick Start

Option 1: One-click launcher (recommended)

Requires Python 3.9+ and Git LFS installed.

git clone https://huggingface.co/MrMufasi/ecoli-k12-drug-model
cd ecoli-k12-drug-model
git lfs pull

Then:

Mac: double-click START_MAC.command
Windows: double-click START_WINDOWS.bat

Opens at http://localhost:7860

Option 2: Manual setup

git clone https://huggingface.co/MrMufasi/ecoli-k12-drug-model
cd ecoli-k12-drug-model
git lfs pull

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements-app.txt

ECOLI_PROJECT_ROOT="$(pwd)" PYTHONPATH="$(pwd)" python app.py

Option 3: Python API

from src.inference.predict import predict_drug_effects

result = predict_drug_effects(
    "OC(=O)C1=CN(C2CC2)c2cc(N3CCNCC3)c(F)cc2C1=O",  # Ciprofloxacin
    device="cpu"
)

print(f"MIC: {result['predicted_mic_uM']:.2f} µM")
print(f"MoA: {result['moa_class']} ({result['moa_confidence']:.0%})")
for t in result['top_targets'][:3]:
    print(f"  Target: {t['gene']} ({t['score']:.3f})")

Model Architecture

SMILES Input
    │
    ├── GINEConv GNN (3 layers, 256-dim) ──────────── molecular graph topology
    │     └── 17 atom features + 4 bond features
    │     └── triple pooling: mean + max + attention → 768-dim
    │
    ├── Morgan + RDKit Fingerprints (4096-bit) ────── substructure patterns
    │     └── 3-layer MLP → 256-dim
    │
    ├── Physicochemical Descriptors (12 features) ─── scaffold-independent properties
    │     └── MW, logP, TPSA, HBD, HBA, RotBonds, ArRings, Rings, FrCSP3, MolMR, etc.
    │     └── 2-layer MLP → 256-dim
    │
    └── Data Source Embedding (8 → 16-dim) ────────── corrects for inter-source bias
          │
          ▼
    Fusion: 768-dim MLP + LayerNorm + residual connection
          │
          ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Target Predictor                                            │
    │   Bilinear attention: compound emb × ESM-2 protein embs    │
    │   17 E. coli genes with pre-computed ESM-2 (650M) vectors  │
    └──────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Metabolic Integration Layer (differentiable FBA surrogate)  │
    │   iML1515: 1,516 genes × 2,712 reactions × 40 pathways     │
    │   Computes: reaction inhibition → flux perturbation →       │
    │   pathway impact → growth proxy → 256-dim features          │
    └──────────────────────────┬──────────────────────────────────┘
                               │
                 ┌─────────────┼─────────────┐
                 ▼             ▼             ▼
            MIC Head      MoA Head     Pathway Head
          3-layer MLP    6-class CE    40-dim BCE
          → log₂ µM      → class       → pathway impacts

Total parameters: ~9M (7.6M trainable)

Training

Curriculum Learning

Phase	Epochs	Tasks Active	GNN	LR
A	1–20	MIC only	Training	1.5×10⁻⁴
B	21–40	MIC + Targets	Frozen	1.5×10⁻⁴
C	41+	All tasks	Unfrozen at LR/10	GNN: 1.5×10⁻⁵

Phase A builds strong MIC features. Phase B trains the target head without destabilising the GNN. Phase C brings MoA online with differential learning rates.
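
The phase schedule above can be sketched as a lookup from epoch to configuration. This is an illustration only: `PhaseConfig` and `phase_for_epoch` are hypothetical names, not taken from train.py, and the sketch assumes that "All tasks" in Phase C means MIC, targets, MoA, and pathways.

```python
from dataclasses import dataclass

BASE_LR = 1.5e-4  # base learning rate from the curriculum table


@dataclass(frozen=True)
class PhaseConfig:
    name: str
    active_tasks: tuple
    gnn_trainable: bool
    gnn_lr: float  # 0.0 while the GNN is frozen


def phase_for_epoch(epoch: int) -> PhaseConfig:
    """Map a 1-indexed epoch to its curriculum phase."""
    if epoch < 1:
        raise ValueError("epochs are 1-indexed")
    if epoch <= 20:
        return PhaseConfig("A", ("mic",), True, BASE_LR)
    if epoch <= 40:
        return PhaseConfig("B", ("mic", "targets"), False, 0.0)
    # Assumption: "All tasks" covers every head listed in the architecture.
    return PhaseConfig("C", ("mic", "targets", "moa", "pathways"), True, BASE_LR / 10)
```

In a PyTorch training loop, the returned `gnn_lr` would typically be applied to a separate optimizer parameter group for the GNN branch.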

Loss Function

Task	Loss	Weight
MIC	Huber (δ=2.0)	1.0
MoA	Cross-entropy (label smoothing 0.1)	25.0
Targets	Focal loss (γ=2.0, α=0.25)	80.0
Pathways	Binary cross-entropy	5.0
Target entropy	Activation + diversity regulariser	0.15
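
For readers unfamiliar with the two less common losses in the table, here is a minimal scalar sketch of Huber loss (δ=2.0) and binary focal loss (γ=2.0, α=0.25), plus the weighted sum. It uses plain Python rather than PyTorch, the function names are hypothetical, and it omits the label-smoothed cross-entropy and the target entropy regulariser. The actual implementation in src/training/losses.py may differ.

```python
import math


def huber(pred: float, target: float, delta: float = 2.0) -> float:
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def focal_bce(prob: float, label: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one predicted probability and a 0/1 label."""
    eps = 1e-7
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Task weights copied from the table above.
LOSS_WEIGHTS = {"mic": 1.0, "moa": 25.0, "targets": 80.0, "pathways": 5.0}


def weighted_total(task_losses: dict) -> float:
    """Weighted sum over whichever task losses are present."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in task_losses.items())
```

The focal term matters for the target head because positives are rare (5.4% of compounds carry target labels), so easy negatives would otherwise dominate the gradient.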

Regularisation

Technique	Details
Dropout	0.35 on all heads and fusion layers
Weight decay	5×10⁻³ (AdamW)
MIC label noise	Gaussian N(0, 0.5) on training MIC values
SMILES augmentation	2× non-canonical enumerations per compound
EMA	Exponential moving average (decay=0.999) for smoother validation
SWA	Stochastic weight averaging in final 20% of training
LR schedule	Linear warmup (5 epochs) → cosine annealing
Early stopping	Patience 30, monitoring val_mic
Gradient clipping	max_norm=5.0
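
The LR schedule and EMA rows can be illustrated with two small functions. This is a sketch under assumptions: epochs are 0-indexed here, the cosine phase anneals to zero, and `total_epochs` is supplied by the caller, since the README does not state the final LR or the total epoch count. The function names are hypothetical.

```python
import math


def lr_at_epoch(epoch: int, total_epochs: int, base_lr: float = 1.5e-4,
                warmup_epochs: int = 5) -> float:
    """Linear warmup to base_lr, then cosine annealing to zero (0-indexed epochs)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def ema_update(shadow: list, params: list, decay: float = 0.999) -> list:
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```

With decay=0.999 the shadow weights average over roughly the last 1,000 updates, which is why EMA checkpoints tend to give smoother validation curves than the raw weights.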

Test Set Results (v21)

Metric	Value
MIC RMSE	2.21 log₂ µM
MIC R²	0.50
MIC Spearman ρ	0.70
Within 2-fold accuracy	54.4%
MoA Macro F1	0.98
MoA Balanced Accuracy	97.0%
Target AUROC	0.997

Evaluated on a held-out scaffold-split test set of 2,743 compounds.
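
The MIC metrics above are computed on log₂ µM values. The following sketch shows one way to compute RMSE and within-2-fold accuracy; it assumes "within 2-fold" means an absolute error of at most 1 log₂ unit, which is the usual definition but is not confirmed by src/training/evaluation.py. The function names are hypothetical.

```python
import math


def rmse_log2(pred_log2: list, true_log2: list) -> float:
    """Root-mean-square error between log2(uM) predictions and measurements."""
    sq = [(p - t) ** 2 for p, t in zip(pred_log2, true_log2)]
    return math.sqrt(sum(sq) / len(sq))


def within_two_fold(pred_log2: list, true_log2: list) -> float:
    """Fraction of predictions within a factor of 2, i.e. |error| <= 1 log2 unit."""
    hits = sum(abs(p - t) <= 1.0 for p, t in zip(pred_log2, true_log2))
    return hits / len(pred_log2)


def fold_error(log2_error: float) -> float:
    """Convert an error in log2 units to a multiplicative fold error."""
    return 2.0 ** log2_error
```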

Dataset

27,488 Compounds from 7 Sources

Source	Compounds	Data Type	Quality Filter
ChEMBL 34	9,143	MIC, IC50, EC50	Functional whole-cell assays only (assay_type=F), enzyme IC50 excluded, IQR variance filter
SPARK	12,310	MIC	Curated antimicrobial data from Pew Charitable Trusts
CO-ADD	4,211	MIC (dose-response)	Community for Open Antimicrobial Drug Discovery
PubChem	~1,800	MIC	E. coli bioassay growth inhibition data
PATRIC/BV-BRC	32	MIC	Standardised clinical broth microdilution
ChEMBL Mechanisms	386	MoA labels	Curated drug mechanism annotations
BindingDB	729	Target labels	Compound-protein binding (Ki/IC50/Kd < 10 µM)

Data Quality

Enzyme IC50 exclusion: 2,585 enzyme-level IC50/EC50 records (beta-lactamase, endonuclease assays) removed — these are not whole-cell MIC values
Variance filter: Compounds with IQR > 2 log₂ units across measurements dropped (203 high-variance compounds)
Winsorisation: MIC values clipped at 1st/99th percentiles
Deduplication: Priority order — ChEMBL > SPARK > CO-ADD > PubChem > PATRIC
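
The variance filter, winsorisation, and deduplication steps above can be sketched with the standard library. The helper names and the record layout (a dict with a "source" key) are hypothetical, and the quantile method is an assumption; the actual logic lives in src/data/merge_datasets.py and may differ.

```python
import statistics

# Priority order from the deduplication rule above (highest first).
SOURCE_PRIORITY = ["ChEMBL", "SPARK", "CO-ADD", "PubChem", "PATRIC"]


def iqr(values: list) -> float:
    """Interquartile range; defined as 0 for fewer than two measurements."""
    if len(values) < 2:
        return 0.0
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def keep_compound(log2_mics: list, max_iqr: float = 2.0) -> bool:
    """Variance filter: drop compounds whose replicate MICs spread too widely."""
    return iqr(log2_mics) <= max_iqr


def winsorise(values: list, lower: float, upper: float) -> list:
    """Clip values to [lower, upper], e.g. the 1st and 99th percentiles."""
    return [min(max(v, lower), upper) for v in values]


def pick_record(records: list) -> dict:
    """Deduplicate: keep the record from the highest-priority source."""
    return min(records, key=lambda r: SOURCE_PRIORITY.index(r["source"]))
```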

Labels

Label	Train	Val	Test	Coverage
MIC values	22,313	2,836	2,743	98% of compounds
MoA (6 classes)	4,734	591	593	21% (482 curated, 5,918 with propagation)
Targets (17 genes)	1,208	134	147	5.4% of compounds

MoA Classes

ID	Mechanism	Examples	Curated Labels
0	Cell wall synthesis	Ampicillin, meropenem, vancomycin	135 curated
1	DNA replication / damage	Ciprofloxacin, nalidixic acid, novobiocin	84 curated
2	RNA synthesis	Rifampicin, fidaxomicin	37 curated
3	Protein synthesis	Tetracycline, chloramphenicol, gentamicin	114 curated
4	Membrane disruption	Polymyxin B, colistin, daptomycin	19 curated
5	Folate / metabolic	Trimethoprim, sulfamethoxazole	93 curated

MoA labels expanded from 482 curated → 5,918 via Tanimoto similarity propagation (threshold 0.40).
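
As a rough illustration of similarity-based propagation, the sketch below assigns an unlabelled compound the class of its most similar curated neighbour when Tanimoto similarity reaches 0.40. It represents fingerprints as sets of on-bit indices and uses a nearest-neighbour rule; both are assumptions, and src/data/propagate_moa_labels.py may use a different fingerprint, tie-breaking rule, or voting scheme.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def propagate_label(query_bits: set, labelled: list, threshold: float = 0.40):
    """Return the MoA class of the most similar labelled compound, or None.

    `labelled` is a list of (bits, moa_class) pairs for curated compounds.
    """
    best_class, best_sim = None, threshold
    for bits, moa_class in labelled:
        sim = tanimoto(query_bits, bits)
        if sim >= best_sim:
            best_class, best_sim = moa_class, sim
    return best_class
```

A compound with no curated neighbour at or above the threshold stays unlabelled (`None`) and contributes no MoA loss.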

Predicted Protein Targets

Gene	Name	Function	Labelled Compounds
folA	Dihydrofolate reductase	Folate biosynthesis	475
lpxC	LPS deacetylase	LPS biosynthesis	264
fabI	Enoyl-ACP reductase	Fatty acid biosynthesis	107
dxr	DXP reductoisomerase	Isoprenoid biosynthesis	71
deoA	Thymidine phosphorylase	Nucleotide salvage	67
thyA	Thymidylate synthase	Thymidylate biosynthesis	56
aroA	EPSP synthase	Aromatic amino acid synthesis	36
murA	MurA transferase	Peptidoglycan biosynthesis	22
...

17 targets total (filtered from 1,516 iML1515 genes to those with ≥20 labelled compounds).

Splitting

Scaffold-based — Murcko scaffold clustering ensures val/test contain novel chemotypes
MoA-stratified — MoA-labelled compounds distributed proportionally across all splits
80/10/10 train/validation/test
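
A scaffold split keeps every compound sharing a Murcko scaffold in the same partition. The sketch below assumes scaffold SMILES have already been computed (for example with RDKit) and uses a simple greedy fill, largest groups first. It does not implement the MoA stratification described above, and `scaffold_split` is a hypothetical name.

```python
from collections import defaultdict


def scaffold_split(scaffolds: list, fractions=(0.8, 0.1, 0.1)) -> dict:
    """Assign compound indices to train/val/test so no scaffold spans two splits.

    `scaffolds[i]` is the Murcko scaffold SMILES of compound i.
    Largest scaffold groups are placed first.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    n = len(scaffolds)
    train_cap = fractions[0] * n
    val_cap = fractions[1] * n
    split = {"train": [], "val": [], "test": []}
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(split["train"]) + len(members) <= train_cap:
            split["train"].extend(members)
        elif len(split["val"]) + len(members) <= val_cap:
            split["val"].extend(members)
        else:
            split["test"].extend(members)
    return split
```

Because whole groups move together, the realised proportions only approximate 80/10/10 when scaffold groups are large.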

Biological Integration

ESM-2 Protein Embeddings

Pre-computed embeddings from facebook/esm2_t33_650M_UR50D for all 1,516 iML1515 gene products. Projected from 1280-dim → 512-dim for bilinear attention scoring against compound embeddings.
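
Bilinear scoring of a compound against a protein can be written as σ(cᵀ W p), where c is the compound embedding, p the projected protein embedding, and W a learned matrix. The toy sketch below uses plain Python lists in place of tensors. The sigmoid output and the function name are assumptions; src/models/target_predictor.py may add attention weights or other terms.

```python
import math


def bilinear_score(compound: list, weight: list, protein: list) -> float:
    """sigmoid(compound^T . W . protein) for one compound-protein pair."""
    # W . protein: one dot product per row of W.
    projected = [sum(w * p for w, p in zip(row, protein)) for row in weight]
    logit = sum(c * x for c, x in zip(compound, projected))
    return 1.0 / (1.0 + math.exp(-logit))
```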

iML1515 Metabolic Model

The differentiable FBA surrogate uses matrices from the iML1515 genome-scale metabolic model:

Gene-reaction matrix: 1,516 genes × 2,712 reactions (binary)
Reaction-pathway matrix: 2,712 reactions × 40 pathways (binary)
Wildtype FBA fluxes: Baseline metabolic state from constraint-based modelling
Biomass coefficients: Growth contribution of each reaction
Learnable sensitivity weights: How much each reaction responds to drug-induced inhibition
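
To make the data flow concrete, here is a toy sketch of how predicted gene inhibition could propagate through these matrices to a growth proxy. Everything about the functional form is an assumption for illustration: the soft-OR over genes, the linear flux scaling, and the biomass-weighted ratio are not taken from src/models/metabolic_layer.py, and the sketch skips the pathway-impact step.

```python
def reaction_inhibition(gene_inhib: list, gene_rxn: list) -> list:
    """Soft-OR over genes: a reaction is inhibited if any mapped gene is.

    gene_inhib[i] is in [0, 1]; gene_rxn[i][j] is 1 if gene i maps to reaction j.
    """
    n_rxn = len(gene_rxn[0])
    out = []
    for j in range(n_rxn):
        survive = 1.0
        for i, g in enumerate(gene_inhib):
            survive *= 1.0 - gene_rxn[i][j] * g
        out.append(1.0 - survive)
    return out


def perturbed_fluxes(wt_flux: list, rxn_inhib: list, sensitivity: list) -> list:
    """Scale each wildtype flux down by sensitivity-weighted inhibition."""
    return [v * (1.0 - s * r) for v, r, s in zip(wt_flux, rxn_inhib, sensitivity)]


def growth_proxy(flux: list, wt_flux: list, biomass_coef: list) -> float:
    """Biomass-weighted flux relative to wildtype (1.0 = unperturbed)."""
    wt = sum(b * v for b, v in zip(biomass_coef, wt_flux))
    return sum(b * v for b, v in zip(biomass_coef, flux)) / wt
```

Every operation here is a product or sum, so the same computation expressed with tensors stays differentiable with respect to both the gene inhibition scores and the sensitivity weights.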

File Structure

ecoli-k12-drug-model/
├── app.py                          # Gradio web interface
├── train.py                        # Training script
├── install_and_run.py              # One-click installer + launcher
├── START_MAC.command               # Mac double-click launcher
├── START_WINDOWS.bat               # Windows double-click launcher
├── requirements-app.txt            # Inference-only dependencies
├── requirements.txt                # Full training dependencies
├── configs/
│   └── model_config.yaml           # Architecture hyperparameters
├── models/
│   ├── best_model.pt               # Best Phase B checkpoint (val_mic=1.09)
│   ├── best_model_phase_c.pt       # Best Phase C checkpoint (has MoA)
│   └── final_model.pt              # EMA model at end of training
├── src/
│   ├── data/
│   │   ├── merge_datasets.py       # Multi-source dataset builder
│   │   ├── dataset.py              # PyTorch Dataset + collate
│   │   ├── precompute_features.py  # Graph/FP/physchem caching
│   │   ├── compile_moa_labels.py   # 180 curated antibiotic MoA labels
│   │   ├── fetch_chembl_moa.py     # ChEMBL mechanism API
│   │   ├── fetch_patric_amr.py     # PATRIC clinical MIC data
│   │   ├── fetch_bindingdb_targets.py  # BindingDB interactions
│   │   └── propagate_moa_labels.py # Tanimoto MoA propagation
│   ├── features/
│   │   └── molecular_graph.py      # SMILES → graph + fingerprint + physchem
│   ├── models/
│   │   ├── compound_encoder.py     # Multi-branch encoder (GNN+FP+Physchem)
│   │   ├── target_predictor.py     # Bilinear compound-protein scoring
│   │   ├── metabolic_layer.py      # Differentiable FBA surrogate
│   │   └── multitask_head.py       # Top-level model with task heads
│   ├── training/
│   │   ├── losses.py               # Multi-task loss function
│   │   └── evaluation.py           # Test set metrics
│   └── inference/
│       └── predict.py              # Single-compound prediction API
└── data/
    ├── raw/                        # Source data (.parquet files)
    └── processed/                  # Feature caches, splits, gene lists

Interpreting Results

MIC Values

MIC (µg/mL)	Interpretation
≤ 0.5	Highly potent — strong antibiotic candidate
0.5 – 4	Potent — comparable to clinical antibiotics
4 – 16	Moderate — may need optimisation
16 – 64	Weak — significant barriers to activity
> 64	Likely inactive against E. coli

The model predicts in log₂ µM internally. To convert:

µM → µg/mL: multiply by molecular weight / 1000
log₂ µM → µM: compute 2^(predicted value)
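
The two conversions, together with the interpretation bands from the table above, can be written as follows. The helper names are hypothetical, and the handling of band boundaries (upper bounds inclusive) is an assumption, since the table does not say which band a value of exactly 4 or 16 falls into.

```python
def log2_um_to_um(log2_um: float) -> float:
    """Undo the log2 transform: model output -> concentration in uM."""
    return 2.0 ** log2_um


def um_to_ug_per_ml(conc_um: float, mol_weight: float) -> float:
    """Convert uM to ug/mL given molecular weight in g/mol."""
    return conc_um * mol_weight / 1000.0


def interpret_mic(ug_per_ml: float) -> str:
    """Map a MIC in ug/mL to the interpretation bands in the table above."""
    if ug_per_ml <= 0.5:
        return "Highly potent"
    if ug_per_ml <= 4:
        return "Potent"
    if ug_per_ml <= 16:
        return "Moderate"
    if ug_per_ml <= 64:
        return "Weak"
    return "Likely inactive"
```

For example, a prediction of -2.0 log₂ µM is 0.25 µM; for ciprofloxacin (molecular weight 331.34 g/mol) that is about 0.083 µg/mL, which falls in the "Highly potent" band.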

MoA Confidence

The model outputs softmax probabilities across 6 classes. Predictions with >80% confidence on the top class are generally reliable. Lower confidence suggests the compound may have a novel or mixed mechanism.
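
A minimal sketch of applying the 80% rule to raw class scores follows. It assumes access to the six pre-softmax logits, which the Python API shown earlier may not expose (it returns `moa_class` and `moa_confidence` directly); `call_moa` is a hypothetical helper.

```python
import math

# Class order follows the MoA Classes table (IDs 0-5).
MOA_CLASSES = [
    "Cell wall synthesis", "DNA replication / damage", "RNA synthesis",
    "Protein synthesis", "Membrane disruption", "Folate / metabolic",
]


def softmax(logits: list) -> list:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def call_moa(logits: list, min_confidence: float = 0.80) -> tuple:
    """Return (class name, confidence, is_reliable) for the top class."""
    probs = softmax(logits)
    best = probs.index(max(probs))
    return MOA_CLASSES[best], probs[best], probs[best] > min_confidence
```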

Target Scores

Target interaction probabilities range 0–1. Most compounds show low activation (<0.1) across all targets. Scores above 0.3 indicate meaningful predicted binding. The model is most confident for well-characterised targets like folA (dihydrofolate reductase) and lpxC (LPS biosynthesis).

Limitations

Prediction accuracy: MIC predictions have a test-set RMSE of 2.21 log₂ µM, which corresponds to roughly a 4.6-fold error (2^2.21), and 54% of predictions fall within 2-fold of the measured value. Treat predictions as order-of-magnitude estimates, not precise values.
E. coli K12 only: Trained on laboratory K12 strain data. Clinical isolates with acquired resistance (ESBL, carbapenemases, plasmid-mediated resistance) will have different MIC profiles.
Scaffold bias: Best performance on compounds structurally similar to the training set (fluoroquinolones, beta-lactams, aminoglycosides, tetracyclines, sulfonamides). Novel scaffolds may have higher prediction error.
Target coverage: Only 17 of the 1,516 iML1515 gene products have sufficient training labels (≥20 labelled compounds). The model cannot predict interactions with any other protein.
MoA propagation: About 92% of MoA labels (5,436 of 5,918) are inferred by Tanimoto similarity (threshold 0.40), not experimentally confirmed. Novel chemotypes may be misclassified.
Not for clinical use. This is a research tool for prioritising compounds in early-stage drug discovery. All predictions require experimental validation.

System Requirements

	Minimum	Recommended
Python	3.9	3.10+
RAM	4 GB	8 GB
Disk	2 GB	4 GB
GPU	Not required	MPS (Mac) or CUDA for faster inference

Citation

@misc{ecoli-k12-drug-model,
  title={E. coli K12 Drug Effect Prediction Model},
  author={Alex Sheridan},
  year={2026},
  url={https://huggingface.co/MrMufasi/ecoli-k12-drug-model}
}

Built with PyTorch, PyTorch Geometric, ESM-2, RDKit, and Gradio
