E. coli K12 Drug Effect Prediction Model
Multi-task GNN predicting antibacterial activity, mechanism of action, and protein targets from compound structure
What It Does
Given any compound's SMILES string, this model predicts three things simultaneously:
| Output | Description | Accuracy |
|---|---|---|
| MIC | Minimum inhibitory concentration (µg/mL) | R²=0.50, Spearman ρ=0.70, 54% within 2-fold |
| Mechanism of Action | Which of 6 biological processes the drug disrupts | F1=0.98 |
| Protein Targets | Which of 17 E. coli proteins the drug binds | AUROC=0.997 |
Example Predictions
| Drug | Predicted MIC | True MIC | MoA Prediction | Top Target |
|---|---|---|---|---|
| Ciprofloxacin | 0.25 µM | 0.03 µM | DNA replication ✓ | gyrA ✓ |
| Ampicillin | 10.9 µM | 8.0 µM | Cell wall ✓ | murA ✓ |
| Tetracycline | 2.0 µM | 1.5 µM | Protein synthesis ✓ | - |
| Trimethoprim | 1.4 µM | 0.5 µM | Folate pathway ✓ | folA (91%) ✓ |
| Rifampicin | 27.6 µM | 16.0 µM | RNA synthesis ✓ | - |
Quick Start
Option 1: One-click launcher (recommended)
Requires Python 3.9+ installed.
git clone https://huggingface.co/MrMufasi/ecoli-k12-drug-model
cd ecoli-k12-drug-model
git lfs pull
Then:
- Mac: double-click START_MAC.command
- Windows: double-click START_WINDOWS.bat
Opens at http://localhost:7860
Option 2: Manual setup
git clone https://huggingface.co/MrMufasi/ecoli-k12-drug-model
cd ecoli-k12-drug-model
git lfs pull
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements-app.txt
ECOLI_PROJECT_ROOT=$(pwd) PYTHONPATH=$(pwd) python app.py
Option 3: Python API
from src.inference.predict import predict_drug_effects
result = predict_drug_effects(
    "OC(=O)C1=CN(C2CC2)c2cc(N3CCNCC3)c(F)cc2C1=O",  # Ciprofloxacin
    device="cpu"
)
print(f"MIC: {result['predicted_mic_uM']:.2f} µM")
print(f"MoA: {result['moa_class']} ({result['moa_confidence']:.0%})")
for t in result['top_targets'][:3]:
    print(f"  Target: {t['gene']} ({t['score']:.3f})")
Model Architecture
SMILES Input
│
├── GINEConv GNN (3 layers, 256-dim) ──────────── molecular graph topology
│   ├── 17 atom features + 4 bond features
│   └── triple pooling: mean + max + attention → 768-dim
│
├── Morgan + RDKit Fingerprints (4096-bit) ────── substructure patterns
│   └── 3-layer MLP → 256-dim
│
├── Physicochemical Descriptors (12 features) ─── scaffold-independent properties
│   ├── MW, logP, TPSA, HBD, HBA, RotBonds, ArRings, Rings, FrCSP3, MolMR, etc.
│   └── 2-layer MLP → 256-dim
│
├── Data Source Embedding (8 → 16-dim) ────────── corrects for inter-source bias
│
▼
Fusion: 768-dim MLP + LayerNorm + residual connection
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Target Predictor                                             │
│ Bilinear attention: compound emb × ESM-2 protein embs        │
│ 17 E. coli genes with pre-computed ESM-2 (650M) vectors      │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ Metabolic Integration Layer (differentiable FBA surrogate)   │
│ iML1515: 1,516 genes × 2,712 reactions × 40 pathways         │
│ Computes: reaction inhibition → flux perturbation →          │
│ pathway impact → growth proxy → 256-dim features             │
└──────────────────────────┬───────────────────────────────────┘
                           │
             ┌─────────────┼─────────────┐
             ▼             ▼             ▼
         MIC Head      MoA Head    Pathway Head
        3-layer MLP   6-class CE    40-dim BCE
          → µg/mL       → class  → pathway impacts
Total parameters: ~9M (7.6M trainable)
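The "triple pooling" step in the GNN branch can be illustrated in plain Python. This is a minimal sketch, not the repository's implementation: it assumes the attention branch is a softmax over per-node scores from a single learned vector (`attn_vector` is a hypothetical name), and it uses tiny lists instead of 256-dim tensors. Concatenating the three pooled vectors triples the width, which is how 256-dim node embeddings become a 768-dim graph embedding.

```python
import math


def triple_pool(nodes, attn_vector):
    """Concatenate mean, max and attention pooling over node embeddings.

    nodes: list of per-atom embedding vectors (all the same length).
    attn_vector: hypothetical learned vector used to score each node.
    """
    dim = len(nodes[0])
    mean_pool = [sum(n[d] for n in nodes) / len(nodes) for d in range(dim)]
    max_pool = [max(n[d] for n in nodes) for d in range(dim)]

    # Softmax over node scores (shifted by the max for numerical stability)
    scores = [sum(a * x for a, x in zip(attn_vector, n)) for n in nodes]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn_pool = [sum(e / total * n[d] for e, n in zip(exps, nodes)) for d in range(dim)]

    return mean_pool + max_pool + attn_pool  # length is 3 * dim


pooled = triple_pool([[1.0, 2.0], [3.0, 4.0]], attn_vector=[0.0, 0.0])
print(len(pooled))
```

With an all-zero attention vector every node gets equal weight, so the attention branch reduces to the mean; a trained vector would instead emphasise the atoms it scores highly.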
Training
Curriculum Learning
| Phase | Epochs | Tasks Active | GNN | LR |
|---|---|---|---|---|
| A | 1–20 | MIC only | Training | 1.5×10⁻⁴ |
| B | 21–40 | MIC + Targets | Frozen | 1.5×10⁻⁴ |
| C | 41+ | All tasks | Unfrozen at LR/10 | GNN: 1.5×10⁻⁵ |
Phase A builds strong MIC features. Phase B trains the target head without destabilising the GNN. Phase C brings MoA online with differential learning rates.
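The schedule can be expressed as a small lookup from epoch to phase settings. The sketch below is illustrative only: `PhaseConfig` and `phase_for_epoch` are hypothetical names, the task labels are made up for the example, and the head learning rate in Phase C is assumed to stay at the base rate because the table only states the GNN rate.

```python
from dataclasses import dataclass

BASE_LR = 1.5e-4  # base learning rate from the curriculum table


@dataclass(frozen=True)
class PhaseConfig:
    name: str
    tasks: tuple         # task heads that contribute to the loss
    gnn_trainable: bool
    gnn_lr: float        # 0.0 while the GNN is frozen
    head_lr: float       # assumption: heads always train at the base rate


def phase_for_epoch(epoch: int) -> PhaseConfig:
    """Map a 1-indexed epoch to its curriculum phase."""
    if epoch < 1:
        raise ValueError("epochs are 1-indexed")
    if epoch <= 20:
        return PhaseConfig("A", ("mic",), True, BASE_LR, BASE_LR)
    if epoch <= 40:
        return PhaseConfig("B", ("mic", "targets"), False, 0.0, BASE_LR)
    return PhaseConfig("C", ("mic", "targets", "moa", "pathways"), True, BASE_LR / 10, BASE_LR)


for epoch in (1, 21, 41):
    print(epoch, phase_for_epoch(epoch))
```

In a real training loop the returned settings would drive which loss terms are summed and which optimiser parameter groups are active.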
Loss Function
| Task | Loss | Weight |
|---|---|---|
| MIC | Huber (δ=2.0) | 1.0 |
| MoA | Cross-entropy (label smoothing 0.1) | 25.0 |
| Targets | Focal loss (γ=2.0, α=0.25) | 80.0 |
| Pathways | Binary cross-entropy | 5.0 |
| Target entropy | Activation + diversity regulariser | 0.15 |
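To make the table concrete, here is a scalar sketch of the Huber and focal terms and of the weighted sum. It uses the textbook definitions of both losses with the hyperparameters listed above; the function names are hypothetical, the cross-entropy, pathway and entropy-regulariser terms are passed in as precomputed numbers or omitted, and the actual code in src/training/losses.py may differ.

```python
import math

# Task weights from the loss table (regulariser omitted for brevity)
WEIGHTS = {"mic": 1.0, "moa": 25.0, "targets": 80.0, "pathways": 5.0}


def huber(pred, target, delta=2.0):
    """Quadratic for small errors, linear beyond delta."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def focal_loss(prob, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one compound-target pair; prob is P(binds)."""
    eps = 1e-7
    prob = min(max(prob, eps), 1.0 - eps)
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def combine(task_losses):
    """Weighted sum over whichever tasks are active for this batch."""
    return sum(WEIGHTS[name] * value for name, value in task_losses.items())


total = combine({"mic": huber(3.1, 2.0), "targets": focal_loss(0.2, 1)})
print(total)
```

The focal term shrinks the contribution of confidently correct predictions, which matters here because most compound-target pairs are negatives.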
Regularisation
| Technique | Details |
|---|---|
| Dropout | 0.35 on all heads and fusion layers |
| Weight decay | 5×10⁻³ (AdamW) |
| MIC label noise | Gaussian N(0, 0.5) on training MIC values |
| SMILES augmentation | 2Γ non-canonical enumerations per compound |
| EMA | Exponential moving average (decay=0.999) for smoother validation |
| SWA | Stochastic weight averaging in final 20% of training |
| LR schedule | Linear warmup (5 epochs) → cosine annealing |
| Early stopping | Patience 30, monitoring val_mic |
| Gradient clipping | max_norm=5.0 |
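Two of the rows above are easy to show numerically: the warmup-plus-cosine learning-rate schedule and the EMA weight update. This is a sketch under assumptions: `total_epochs=100` and `min_lr=0.0` are placeholders not stated in this README, epochs are 0-indexed here, and weights are a flat list rather than model tensors.

```python
import math


def lr_at_epoch(epoch, base_lr=1.5e-4, warmup_epochs=5, total_epochs=100, min_lr=0.0):
    """Linear warmup over the first epochs, then cosine annealing to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def ema_update(shadow, weights, decay=0.999):
    """One EMA step: the shadow weights drift slowly towards the live weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


for epoch in (0, 4, 5, 50, 99):
    print(epoch, lr_at_epoch(epoch))
```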
Test Set Results (v21)
| Metric | Value |
|---|---|
| MIC RMSE | 2.21 log₂ µM |
| MIC R² | 0.50 |
| MIC Spearman ρ | 0.70 |
| Within 2-fold accuracy | 54.4% |
| MoA Macro F1 | 0.98 |
| MoA Balanced Accuracy | 97.0% |
| Target AUROC | 0.997 |
Evaluated on a held-out scaffold-split test set of 2,743 compounds.
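For readers who want to reproduce the MIC rows on their own predictions, the following sketch computes RMSE and within-2-fold accuracy in log₂ space. It assumes "within 2-fold" means an absolute error of at most 1 log₂ unit; `mic_metrics` is a hypothetical helper, not the function in src/training/evaluation.py.

```python
import math


def mic_metrics(pred_log2, true_log2):
    """RMSE and within-2-fold fraction for MIC values given in log2 units."""
    if not pred_log2 or len(pred_log2) != len(true_log2):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - t for p, t in zip(pred_log2, true_log2)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # A 2-fold difference in concentration is exactly 1 unit in log2 space
    within_2fold = sum(abs(e) <= 1.0 for e in errors) / len(errors)
    return {"rmse_log2": rmse, "within_2fold": within_2fold}


print(mic_metrics([0.0, 1.0, 4.0], [0.0, 0.0, 0.0]))
```

The toy example shows how a single large miss inflates RMSE while the within-2-fold fraction stays high, which is why both numbers are worth reporting together.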
Dataset
27,488 Compounds from 7 Sources
| Source | Compounds | Data Type | Quality Filter |
|---|---|---|---|
| ChEMBL 34 | 9,143 | MIC, IC50, EC50 | Functional whole-cell assays only (assay_type=F), enzyme IC50 excluded, IQR variance filter |
| SPARK | 12,310 | MIC | Curated antimicrobial data from Pew Charitable Trusts |
| CO-ADD | 4,211 | MIC (dose-response) | Community for Open Antimicrobial Drug Discovery |
| PubChem | ~1,800 | MIC | E. coli bioassay growth inhibition data |
| PATRIC/BV-BRC | 32 | MIC | Standardised clinical broth microdilution |
| ChEMBL Mechanisms | 386 | MoA labels | Curated drug mechanism annotations |
| BindingDB | 729 | Target labels | Compound-protein binding (Ki/IC50/Kd < 10 µM) |
Data Quality
- Enzyme IC50 exclusion: 2,585 enzyme-level IC50/EC50 records (beta-lactamase, endonuclease assays) removed, since these are not whole-cell MIC values
- Variance filter: Compounds with IQR > 2 log₂ units across measurements dropped (203 high-variance compounds)
- Winsorisation: MIC values clipped at 1st/99th percentiles
- Deduplication: Priority order is ChEMBL > SPARK > CO-ADD > PubChem > PATRIC
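The variance filter and the source-priority deduplication can be sketched with the standard library. Several details are assumptions made for illustration: the IQR is computed over all measurements of a compound, repeated values from the winning source are collapsed with the median, and the record layout `(compound_id, source, log2_mic)` is hypothetical. Winsorisation is left out.

```python
from statistics import median, quantiles

# Highest-priority source first, as listed under Deduplication
SOURCE_PRIORITY = ["ChEMBL", "SPARK", "CO-ADD", "PubChem", "PATRIC"]


def iqr(values):
    """Interquartile range; 0.0 when there are too few points to estimate it."""
    if len(values) < 2:
        return 0.0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def clean(records, max_iqr=2.0):
    """records: iterable of (compound_id, source, log2_mic) tuples."""
    by_compound = {}
    for cid, source, value in records:
        by_compound.setdefault(cid, {}).setdefault(source, []).append(value)

    cleaned = {}
    for cid, per_source in by_compound.items():
        all_values = [v for vals in per_source.values() for v in vals]
        if iqr(all_values) > max_iqr:
            continue  # high-variance compound: drop it entirely
        best = min(per_source, key=SOURCE_PRIORITY.index)
        cleaned[cid] = (best, median(per_source[best]))
    return cleaned


records = [
    ("a", "SPARK", 1.0), ("a", "ChEMBL", 2.0), ("a", "ChEMBL", 3.0),
    ("b", "PubChem", 0.0), ("b", "PubChem", 10.0),
    ("c", "PATRIC", 4.0),
]
print(clean(records))
```

Compound "b" is dropped because its two measurements disagree by far more than 2 log₂ units, while "a" keeps only its ChEMBL values.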
Labels
| Label | Train | Val | Test | Coverage |
|---|---|---|---|---|
| MIC values | 22,313 | 2,836 | 2,743 | 98% of compounds |
| MoA (6 classes) | 4,734 | 591 | 593 | 21% (482 curated, 5,918 with propagation) |
| Targets (17 genes) | 1,208 | 134 | 147 | 5.4% of compounds |
MoA Classes
| ID | Mechanism | Examples | Train Count |
|---|---|---|---|
| 0 | Cell wall synthesis | Ampicillin, meropenem, vancomycin | 135 curated |
| 1 | DNA replication / damage | Ciprofloxacin, nalidixic acid, novobiocin | 84 curated |
| 2 | RNA synthesis | Rifampicin, fidaxomicin | 37 curated |
| 3 | Protein synthesis | Tetracycline, chloramphenicol, gentamicin | 114 curated |
| 4 | Membrane disruption | Polymyxin B, colistin, daptomycin | 19 curated |
| 5 | Folate / metabolic | Trimethoprim, sulfamethoxazole | 93 curated |
MoA labels expanded from 482 curated → 5,918 via Tanimoto similarity propagation (threshold 0.40).
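A minimal sketch of similarity-based label propagation follows. It assumes each unlabelled compound inherits the MoA of its single most similar curated compound when that similarity reaches the threshold; the README does not spell out the exact rule, so treat this nearest-neighbour variant as one plausible reading. Fingerprints are represented as sets of on-bit indices rather than RDKit bit vectors.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def propagate(curated, unlabelled, threshold=0.40):
    """curated: {name: (fingerprint, moa)}; unlabelled: {name: fingerprint}."""
    propagated = {}
    for name, fp in unlabelled.items():
        best_sim, best_moa = 0.0, None
        for ref_fp, moa in curated.values():
            sim = tanimoto(fp, ref_fp)
            if sim > best_sim:
                best_sim, best_moa = sim, moa
        if best_moa is not None and best_sim >= threshold:
            propagated[name] = (best_moa, best_sim)
    return propagated


curated = {
    "ref_quinolone": ({1, 2, 3, 4}, "DNA replication / damage"),
    "ref_antifolate": ({10, 11, 12}, "Folate / metabolic"),
}
unlabelled = {"analogue": {1, 2, 3, 5}, "unrelated": {20, 21}}
print(propagate(curated, unlabelled))
```

Only "analogue" receives a label; "unrelated" shares no bits with any curated compound and stays unlabelled.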
Predicted Protein Targets
| Gene | Name | Function | Labelled Compounds |
|---|---|---|---|
| folA | Dihydrofolate reductase | Folate biosynthesis | 475 |
| lpxC | LPS deacetylase | LPS biosynthesis | 264 |
| fabI | Enoyl-ACP reductase | Fatty acid biosynthesis | 107 |
| dxr | DXP reductoisomerase | Isoprenoid biosynthesis | 71 |
| deoA | Thymidine phosphorylase | Nucleotide salvage | 67 |
| thyA | Thymidylate synthase | Thymidylate biosynthesis | 56 |
| aroA | EPSP synthase | Aromatic amino acid synthesis | 36 |
| murA | MurA transferase | Peptidoglycan biosynthesis | 22 |
| ... | | | |
17 targets total (filtered from 1,516 iML1515 genes to those with ≥20 labelled compounds).
Splitting
- Scaffold-based: Murcko scaffold clustering ensures val/test contain novel chemotypes
- MoA-stratified: MoA-labelled compounds distributed proportionally across all splits
- 80/10/10 train/validation/test
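The scaffold-based part of the split can be sketched as a greedy assignment of whole scaffold groups. This follows the common "largest groups first" recipe and is an assumption about the method, not a copy of the repository's code; it takes precomputed scaffold keys as input (in practice these would come from RDKit's Murcko scaffold utilities) and does not implement the MoA stratification.

```python
from collections import defaultdict


def scaffold_split(scaffold_of, fractions=(0.8, 0.1, 0.1)):
    """scaffold_of: {compound_id: scaffold_key}. Returns (train, val, test) id lists."""
    groups = defaultdict(list)
    for cid, scaffold in scaffold_of.items():
        groups[scaffold].append(cid)

    # Largest scaffold families first, so common chemotypes land in train
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffold_of)
    train_cap, val_cap = fractions[0] * n, fractions[1] * n

    train, val, test = [], [], []
    for group in ordered:
        # A scaffold group is never divided between splits
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(val) + len(group) <= val_cap:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test


scaffold_of = {f"a{i}": "A" for i in range(6)}
scaffold_of.update({"b0": "B", "b1": "B", "c0": "C", "d0": "D"})
train, val, test = scaffold_split(scaffold_of)
print(len(train), val, test)
```

Because groups are kept intact, every scaffold in validation or test is absent from training, which is what makes the reported test metrics a measure of generalisation to new chemotypes.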
Biological Integration
ESM-2 Protein Embeddings
Pre-computed embeddings from facebook/esm2_t33_650M_UR50D for all 1,516 iML1515 gene products. Projected from 1280-dim → 512-dim for bilinear attention scoring against compound embeddings.
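Bilinear scoring of a compound against a protein reduces to a single matrix expression. The sketch below uses plain lists and toy 2-dim vectors; squashing the logit with a sigmoid is an assumption (chosen because target scores are described as probabilities between 0 and 1), and `bilinear_score` is a hypothetical name rather than the API in src/models/target_predictor.py.

```python
import math


def bilinear_score(compound, weight, protein):
    """sigmoid(compound^T W protein) for one compound-protein pair.

    weight is a list of rows with shape (len(compound), len(protein)).
    """
    # compound^T W: one dot product per column of W
    projected = [sum(c * w for c, w in zip(compound, column)) for column in zip(*weight)]
    logit = sum(p * q for p, q in zip(projected, protein))
    return 1.0 / (1.0 + math.exp(-logit))


compound = [1.0, 0.0]
weight = [[2.0, 0.0], [0.0, 2.0]]  # learned interaction matrix (toy values)
print(bilinear_score(compound, weight, [1.0, 0.0]))  # aligned protein embedding
print(bilinear_score(compound, weight, [0.0, 1.0]))  # orthogonal protein embedding
```

In the full model the same compound embedding would be scored against each of the 17 projected protein embeddings to produce one probability per gene.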
iML1515 Metabolic Model
The differentiable FBA surrogate uses matrices from the iML1515 genome-scale metabolic model:
- Gene-reaction matrix: 1,516 genes × 2,712 reactions (binary)
- Reaction-pathway matrix: 2,712 reactions × 40 pathways (binary)
- Wildtype FBA fluxes: Baseline metabolic state from constraint-based modelling
- Biomass coefficients: Growth contribution of each reaction
- Learnable sensitivity weights: How much each reaction responds to drug-induced inhibition
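The chain "reaction inhibition → flux perturbation → pathway impact → growth proxy" can be traced on a toy network. Everything about the arithmetic here is an assumption made for illustration: a reaction inherits the strongest inhibition among its genes, flux scales linearly with sensitivity-weighted inhibition, pathway impact is the mean fractional flux loss, and growth is biomass-weighted flux relative to wildtype. The real layer in src/models/metabolic_layer.py operates on the full iML1515 matrices and emits 256-dim features.

```python
def metabolic_features(gene_inhibition, gene_reaction, reaction_pathway,
                       wildtype_flux, biomass_coeff, sensitivity):
    """Toy forward pass: gene inhibition -> reactions -> fluxes -> pathways -> growth."""
    n_genes, n_rxns = len(gene_inhibition), len(wildtype_flux)
    n_paths = len(reaction_pathway[0])

    # 1. A reaction is inhibited as strongly as its most-inhibited gene (assumption)
    rxn_inhibition = [
        max((gene_inhibition[g] for g in range(n_genes) if gene_reaction[g][r]), default=0.0)
        for r in range(n_rxns)
    ]
    # 2. Fractional flux loss, scaled by the per-reaction sensitivity weight
    flux_loss = [min(1.0, sensitivity[r] * rxn_inhibition[r]) for r in range(n_rxns)]
    flux = [wildtype_flux[r] * (1.0 - flux_loss[r]) for r in range(n_rxns)]
    # 3. Pathway impact: mean fractional flux loss over member reactions
    pathway_impact = []
    for p in range(n_paths):
        members = [r for r in range(n_rxns) if reaction_pathway[r][p]]
        pathway_impact.append(sum(flux_loss[r] for r in members) / len(members) if members else 0.0)
    # 4. Growth proxy: biomass-weighted flux relative to the wildtype state
    wt_growth = sum(b * f for b, f in zip(biomass_coeff, wildtype_flux))
    growth = sum(b * f for b, f in zip(biomass_coeff, flux)) / wt_growth if wt_growth else 0.0
    return pathway_impact, growth


impact, growth = metabolic_features(
    gene_inhibition=[0.5, 0.0],                  # 2 genes
    gene_reaction=[[1, 1, 0], [0, 0, 1]],        # genes x reactions
    reaction_pathway=[[1, 0], [1, 0], [0, 1]],   # reactions x pathways
    wildtype_flux=[2.0, 4.0, 6.0],
    biomass_coeff=[1.0, 0.0, 1.0],
    sensitivity=[1.0, 1.0, 1.0],
)
print(impact, growth)
```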
File Structure
ecoli-k12-drug-model/
├── app.py                              # Gradio web interface
├── train.py                            # Training script
├── install_and_run.py                  # One-click installer + launcher
├── START_MAC.command                   # Mac double-click launcher
├── START_WINDOWS.bat                   # Windows double-click launcher
├── requirements-app.txt                # Inference-only dependencies
├── requirements.txt                    # Full training dependencies
├── configs/
│   └── model_config.yaml               # Architecture hyperparameters
├── models/
│   ├── best_model.pt                   # Best Phase B checkpoint (val_mic=1.09)
│   ├── best_model_phase_c.pt           # Best Phase C checkpoint (has MoA)
│   └── final_model.pt                  # EMA model at end of training
├── src/
│   ├── data/
│   │   ├── merge_datasets.py           # Multi-source dataset builder
│   │   ├── dataset.py                  # PyTorch Dataset + collate
│   │   ├── precompute_features.py      # Graph/FP/physchem caching
│   │   ├── compile_moa_labels.py       # 180 curated antibiotic MoA labels
│   │   ├── fetch_chembl_moa.py         # ChEMBL mechanism API
│   │   ├── fetch_patric_amr.py         # PATRIC clinical MIC data
│   │   ├── fetch_bindingdb_targets.py  # BindingDB interactions
│   │   └── propagate_moa_labels.py     # Tanimoto MoA propagation
│   ├── features/
│   │   └── molecular_graph.py          # SMILES → graph + fingerprint + physchem
│   ├── models/
│   │   ├── compound_encoder.py         # Multi-branch encoder (GNN+FP+Physchem)
│   │   ├── target_predictor.py         # Bilinear compound-protein scoring
│   │   ├── metabolic_layer.py          # Differentiable FBA surrogate
│   │   └── multitask_head.py           # Top-level model with task heads
│   ├── training/
│   │   ├── losses.py                   # Multi-task loss function
│   │   └── evaluation.py               # Test set metrics
│   └── inference/
│       └── predict.py                  # Single-compound prediction API
└── data/
    ├── raw/                            # Source data (.parquet files)
    └── processed/                      # Feature caches, splits, gene lists
Interpreting Results
MIC Values
| MIC (µg/mL) | Interpretation |
|---|---|
| ≤ 0.5 | Highly potent: strong antibiotic candidate |
| 0.5 – 4 | Potent: comparable to clinical antibiotics |
| 4 – 16 | Moderate: may need optimisation |
| 16 – 64 | Weak: significant barriers to activity |
| > 64 | Likely inactive against E. coli |
The model predicts in log₂ µM internally. To convert:
- µM → µg/mL: multiply by molecular weight / 1000
- log₂ µM → µM: compute 2^(predicted value)
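The two conversions and the interpretation bands can be chained in a few lines. The helper names are hypothetical, band edges are assumed to be inclusive on their upper bound, and the molecular weight of ciprofloxacin (about 331.34 g/mol) is used only as a worked example.

```python
def log2_um_to_um(pred_log2_um):
    """Undo the log2 transform: 2^(predicted value) gives the MIC in µM."""
    return 2.0 ** pred_log2_um


def um_to_ug_per_ml(mic_um, mol_weight):
    """µM -> µg/mL: multiply by molecular weight (g/mol) and divide by 1000."""
    return mic_um * mol_weight / 1000.0


def interpret_mic(mic_ug_per_ml):
    """Map a MIC in µg/mL to the interpretation bands (upper edges inclusive)."""
    if mic_ug_per_ml <= 0:
        raise ValueError("MIC must be positive")
    for upper, label in ((0.5, "Highly potent"), (4.0, "Potent"),
                         (16.0, "Moderate"), (64.0, "Weak")):
        if mic_ug_per_ml <= upper:
            return label
    return "Likely inactive"


mic_um = log2_um_to_um(-2.0)                 # raw model output of -2 is 0.25 µM
mic_ug = um_to_ug_per_ml(mic_um, 331.34)     # about 0.083 µg/mL
print(mic_um, mic_ug, interpret_mic(mic_ug))
```

Doing the unit conversion before banding matters: the same µM value lands in different bands for a 300 g/mol compound and a 1,200 g/mol one.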
MoA Confidence
The model outputs softmax probabilities across 6 classes. Predictions with >80% confidence on the top class are generally reliable. Lower confidence suggests the compound may have a novel or mixed mechanism.
Target Scores
Target interaction probabilities range 0–1. Most compounds show low activation (<0.1) across all targets. Scores above 0.3 indicate meaningful predicted binding. The model is most confident for well-characterised targets like folA (dihydrofolate reductase) and lpxC (LPS biosynthesis).
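The two rules of thumb (MoA confidence above 80%, target scores above 0.3) can be applied to a prediction in one step. The sketch assumes a result dictionary with the keys used in the Python API example (`moa_class`, `moa_confidence`, `top_targets` with `gene` and `score`); `summarise` is a hypothetical helper and the sample values are invented.

```python
def summarise(result, moa_threshold=0.80, target_threshold=0.30):
    """Keep only the parts of a prediction that clear the suggested thresholds."""
    moa_reliable = result["moa_confidence"] > moa_threshold
    hits = [t["gene"] for t in result["top_targets"] if t["score"] > target_threshold]
    return {
        "moa_class": result["moa_class"] if moa_reliable else None,
        "moa_reliable": moa_reliable,
        "target_hits": hits,
    }


# Invented example output, shaped like the dictionary from the Python API section
example = {
    "moa_class": "DNA replication",
    "moa_confidence": 0.93,
    "top_targets": [{"gene": "gyrA", "score": 0.71}, {"gene": "folA", "score": 0.05}],
}
print(summarise(example))
```

A compound with no reliable MoA and no target hits is not necessarily inactive; it may act through a mechanism or protein outside the model's label set.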
Limitations
Prediction accuracy: MIC predictions have ~2.5-fold average error (RMSE 2.21 log₂ µM). This is comparable to inter-laboratory MIC variability, but predictions should still be treated as order-of-magnitude estimates, not precise values.
E. coli K12 only: Trained on laboratory K12 strain data. Clinical isolates with acquired resistance (ESBL, carbapenemases, plasmid-mediated resistance) will have different MIC profiles.
Scaffold bias: Best performance on compounds structurally similar to the training set (fluoroquinolones, beta-lactams, aminoglycosides, tetracyclines, sulfonamides). Novel scaffolds may have higher prediction error.
Target coverage: Only 17 of ~1,500 E. coli proteins have sufficient training labels. The model cannot predict interactions with uncharacterised targets.
MoA propagation: 91% of MoA labels are inferred by Tanimoto similarity (threshold 0.40), not experimentally confirmed. Novel chemotypes may be misclassified.
Not for clinical use. This is a research tool for prioritising compounds in early-stage drug discovery. All predictions require experimental validation.
System Requirements
| | Minimum | Recommended |
|---|---|---|
| Python | 3.9 | 3.10+ |
| RAM | 4 GB | 8 GB |
| Disk | 2 GB | 4 GB |
| GPU | Not required | MPS (Mac) or CUDA for faster inference |
Citation
@misc{ecoli-k12-drug-model,
title={E. coli K12 Drug Effect Prediction Model},
author={Alex Sheridan},
year={2026},
url={https://huggingface.co/MrMufasi/ecoli-k12-drug-model}
}
Built with PyTorch, PyTorch Geometric, ESM-2, RDKit, and Gradio