E. coli K12 Drug Effect Prediction Model

Multi-task GNN predicting antibacterial activity, mechanism of action, and protein targets from compound structure



What It Does

Given any compound's SMILES string, this model predicts three things simultaneously:

| Output | Description | Accuracy |
|---|---|---|
| MIC | Minimum inhibitory concentration (µg/mL) | R²=0.50, Spearman ρ=0.70, 54% within 2-fold |
| Mechanism of Action | Which of 6 biological processes the drug disrupts | F1=0.98 |
| Protein Targets | Which of 17 E. coli proteins the drug binds | AUROC=0.997 |

Example Predictions

| Drug | Predicted MIC | True MIC | MoA Prediction | Top Target |
|---|---|---|---|---|
| Ciprofloxacin | 0.25 µM | 0.03 µM | DNA replication ✓ | gyrA ✓ |
| Ampicillin | 10.9 µM | 8.0 µM | Cell wall ✓ | murA ✓ |
| Tetracycline | 2.0 µM | 1.5 µM | Protein synthesis ✓ | - |
| Trimethoprim | 1.4 µM | 0.5 µM | Folate pathway ✓ | folA (91%) ✓ |
| Rifampicin | 27.6 µM | 16.0 µM | RNA synthesis ✓ | - |

Quick Start

Option 1: One-click launcher (recommended)

Requires Python 3.9+ installed.

git clone https://huggingface.co/MrMufasi/ecoli-k12-drug-model
cd ecoli-k12-drug-model
git lfs pull

Then:

  • Mac: double-click START_MAC.command
  • Windows: double-click START_WINDOWS.bat

Opens at http://localhost:7860

Option 2: Manual setup

git clone https://huggingface.co/MrMufasi/ecoli-k12-drug-model
cd ecoli-k12-drug-model
git lfs pull

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements-app.txt

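# macOS/Linux shells; on Windows, set ECOLI_PROJECT_ROOT and PYTHONPATH to the repo path before running python app.py
ECOLI_PROJECT_ROOT=$(pwd) PYTHONPATH=$(pwd) python app.py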

Option 3: Python API

from src.inference.predict import predict_drug_effects

result = predict_drug_effects(
    "OC(=O)C1=CN(C2CC2)c2cc(N3CCNCC3)c(F)cc2C1=O",  # Ciprofloxacin
    device="cpu"
)

print(f"MIC: {result['predicted_mic_uM']:.2f} µM")
print(f"MoA: {result['moa_class']} ({result['moa_confidence']:.0%})")
for t in result['top_targets'][:3]:
    print(f"  Target: {t['gene']} ({t['score']:.3f})")

Model Architecture

SMILES Input
    │
    ├── GINEConv GNN (3 layers, 256-dim) ──────────── molecular graph topology
    │     └── 17 atom features + 4 bond features
    │     └── triple pooling: mean + max + attention → 768-dim
    │
    ├── Morgan + RDKit Fingerprints (4096-bit) ────── substructure patterns
    │     └── 3-layer MLP → 256-dim
    │
    ├── Physicochemical Descriptors (12 features) ─── scaffold-independent properties
    │     └── MW, logP, TPSA, HBD, HBA, RotBonds, ArRings, Rings, FrCSP3, MolMR, etc.
    │     └── 2-layer MLP → 256-dim
    │
    └── Data Source Embedding (8 → 16-dim) ────────── corrects for inter-source bias
          │
          ▼
    Fusion: 768-dim MLP + LayerNorm + residual connection
          │
          ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Target Predictor                                            │
    │   Bilinear attention: compound emb × ESM-2 protein embs     │
    │   17 E. coli genes with pre-computed ESM-2 (650M) vectors   │
    └──────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Metabolic Integration Layer (differentiable FBA surrogate)  │
    │   iML1515: 1,516 genes × 2,712 reactions × 40 pathways      │
    │   Computes: reaction inhibition → flux perturbation →       │
    │   pathway impact → growth proxy → 256-dim features          │
    └──────────────────────────┬──────────────────────────────────┘
                               │
                 ┌─────────────┼─────────────┐
                 ▼             ▼             ▼
            MIC Head      MoA Head     Pathway Head
          3-layer MLP    6-class CE    40-dim BCE
          → µg/mL        → class       → pathway impacts

Total parameters: ~9M (7.6M trainable)


Training

Curriculum Learning

| Phase | Epochs | Tasks Active | GNN | LR |
|---|---|---|---|---|
| A | 1–20 | MIC only | Training | 1.5×10⁻⁴ |
| B | 21–40 | MIC + Targets | Frozen | 1.5×10⁻⁴ |
| C | 41+ | All tasks | Unfrozen at LR/10 | GNN: 1.5×10⁻⁵ |

Phase A builds strong MIC features. Phase B trains the target head without destabilising the GNN. Phase C brings MoA online with differential learning rates.
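
The schedule above can be written as a small per-epoch lookup. This is a minimal sketch, not code from train.py: the function name `curriculum_phase`, the task names, and the choice to represent a frozen GNN as a learning rate of 0 are illustrative assumptions, and "All tasks" is assumed to mean MIC, targets, MoA, and pathways.

```python
BASE_LR = 1.5e-4  # learning rate listed in the curriculum table


def curriculum_phase(epoch: int) -> dict:
    """Return phase, active tasks, and GNN learning rate for a 1-indexed epoch."""
    if epoch < 1:
        raise ValueError("epochs are 1-indexed")
    if epoch <= 20:
        # Phase A: MIC only, GNN trains at the base learning rate
        return {"phase": "A", "tasks": ["mic"], "gnn_lr": BASE_LR}
    if epoch <= 40:
        # Phase B: GNN frozen (assumption: modelled here as lr = 0)
        return {"phase": "B", "tasks": ["mic", "targets"], "gnn_lr": 0.0}
    # Phase C: everything on, GNN unfrozen at one tenth of the base rate
    return {
        "phase": "C",
        "tasks": ["mic", "targets", "moa", "pathways"],
        "gnn_lr": BASE_LR / 10,
    }
```

In a PyTorch training loop, the returned `gnn_lr` would typically be written into the GNN parameter group of the optimizer at the start of each epoch.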

Loss Function

| Task | Loss | Weight |
|---|---|---|
| MIC | Huber (δ=2.0) | 1.0 |
| MoA | Cross-entropy (label smoothing 0.1) | 25.0 |
| Targets | Focal loss (γ=2.0, α=0.25) | 80.0 |
| Pathways | Binary cross-entropy | 5.0 |
| Target entropy | Activation + diversity regulariser | 0.15 |
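
To make the table concrete, here is a scalar sketch of the Huber and focal terms and of the weighted sum. It uses the standard textbook definitions of both losses with the δ, γ, and α values listed above; the function names and the dictionary keys are hypothetical, and the actual implementation in src/training/losses.py may differ (for example in reduction or masking of missing labels).

```python
import math


def huber(error: float, delta: float = 2.0) -> float:
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def focal_bce(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one predicted probability p and label y (0 or 1)."""
    p = min(max(p, 1e-7), 1 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# Task weights copied from the table above (key names are illustrative)
WEIGHTS = {"mic": 1.0, "moa": 25.0, "targets": 80.0,
           "pathways": 5.0, "target_entropy": 0.15}


def total_loss(per_task: dict) -> float:
    """Weighted sum over whichever task losses are active this step."""
    return sum(WEIGHTS[name] * value for name, value in per_task.items())
```

The large target weight (80.0) is easier to understand next to the focal term: well-classified examples contribute almost nothing, so the raw focal loss is small compared with the MIC Huber loss.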

Regularisation

| Technique | Details |
|---|---|
| Dropout | 0.35 on all heads and fusion layers |
| Weight decay | 5×10⁻³ (AdamW) |
| MIC label noise | Gaussian N(0, 0.5) on training MIC values |
| SMILES augmentation | 2× non-canonical enumerations per compound |
| EMA | Exponential moving average (decay=0.999) for smoother validation |
| SWA | Stochastic weight averaging in final 20% of training |
| LR schedule | Linear warmup (5 epochs) → cosine annealing |
| Early stopping | Patience 30, monitoring val_mic |
| Gradient clipping | max_norm=5.0 |
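
Two of the rows above reduce to short formulas. The sketch below shows one common form of linear warmup followed by cosine annealing, and a single EMA update step. The total epoch count, the minimum learning rate of 0, and the function names are assumptions made for illustration; the source does not state them.

```python
import math


def learning_rate(epoch: int, total_epochs: int, base_lr: float = 1.5e-4,
                  warmup_epochs: int = 5, min_lr: float = 0.0) -> float:
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warmup epoch reaches base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def ema_update(ema: list, weights: list, decay: float = 0.999) -> list:
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]
```

With decay=0.999 the averaged weights respond to changes over roughly the last thousand updates, which is why EMA checkpoints tend to give smoother validation curves.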

Test Set Results (v21)

| Metric | Value |
|---|---|
| MIC RMSE | 2.21 log₂ µM |
| MIC R² | 0.50 |
| MIC Spearman ρ | 0.70 |
| Within 2-fold accuracy | 54.4% |
| MoA Macro F1 | 0.98 |
| MoA Balanced Accuracy | 97.0% |
| Target AUROC | 0.997 |

Evaluated on a held-out scaffold-split test set of 2,743 compounds.
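
For readers who want to reproduce the MIC rows on their own predictions, the sketch below computes RMSE in log₂ units and within-2-fold accuracy. It assumes "within 2-fold" means an absolute log₂ error of at most 1, which is the usual definition; the function name is hypothetical and this is not the code in src/training/evaluation.py.

```python
import math


def mic_metrics(pred_log2: list, true_log2: list) -> dict:
    """RMSE (log2 units), its fold-change equivalent, and within-2-fold rate."""
    errors = [p - t for p, t in zip(pred_log2, true_log2)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # A 2-fold difference in concentration is exactly 1 unit on a log2 scale
    within = sum(abs(e) <= 1.0 for e in errors) / len(errors)
    return {"rmse_log2": rmse, "rmse_fold": 2 ** rmse, "within_2_fold": within}
```

Applied to the reported RMSE of 2.21 log₂ µM, the fold-change equivalent is 2^2.21 ≈ 4.6.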


Dataset

27,488 Compounds from 7 Sources

| Source | Compounds | Data Type | Quality Filter |
|---|---|---|---|
| ChEMBL 34 | 9,143 | MIC, IC50, EC50 | Functional whole-cell assays only (assay_type=F), enzyme IC50 excluded, IQR variance filter |
| SPARK | 12,310 | MIC | Curated antimicrobial data from Pew Charitable Trusts |
| CO-ADD | 4,211 | MIC (dose-response) | Community for Open Antimicrobial Drug Discovery |
| PubChem | ~1,800 | MIC | E. coli bioassay growth inhibition data |
| PATRIC/BV-BRC | 32 | MIC | Standardised clinical broth microdilution |
| ChEMBL Mechanisms | 386 | MoA labels | Curated drug mechanism annotations |
| BindingDB | 729 | Target labels | Compound-protein binding (Ki/IC50/Kd < 10 µM) |

Data Quality

  • Enzyme IC50 exclusion: 2,585 enzyme-level IC50/EC50 records (beta-lactamase, endonuclease assays) removed, because these are not whole-cell MIC values
  • Variance filter: Compounds with IQR > 2 log₂ units across measurements dropped (203 high-variance compounds)
  • Winsorisation: MIC values clipped at 1st/99th percentiles
  • Deduplication: Priority order is ChEMBL > SPARK > CO-ADD > PubChem > PATRIC
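
The variance filter, winsorisation, and source priority can be sketched with the standard library alone. This is an illustration under stated assumptions, not the logic of merge_datasets.py: the function names and the record format are hypothetical, the "inclusive" quantile method is one of several reasonable choices, and compounds with a single measurement are assumed to pass the variance filter.

```python
import statistics

PRIORITY = ["ChEMBL", "SPARK", "CO-ADD", "PubChem", "PATRIC"]


def iqr(values: list) -> float:
    """Interquartile range (needs at least two values)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def keep_compound(log2_mics: list, max_iqr: float = 2.0) -> bool:
    """Variance filter: drop compounds whose repeat measurements disagree."""
    return len(log2_mics) < 2 or iqr(log2_mics) <= max_iqr


def winsorise(values: list, lower_pct: int = 1, upper_pct: int = 99) -> list:
    """Clip values to the given lower/upper percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


def pick_source(records: list) -> dict:
    """Deduplication: keep the record from the highest-priority source."""
    return min(records, key=lambda r: PRIORITY.index(r["source"]))
```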

Labels

| Label | Train | Val | Test | Coverage |
|---|---|---|---|---|
| MIC values | 22,313 | 2,836 | 2,743 | 98% of compounds |
| MoA (6 classes) | 4,734 | 591 | 593 | 21% (482 curated, 5,918 with propagation) |
| Targets (17 genes) | 1,208 | 134 | 147 | 5.4% of compounds |

MoA Classes

| ID | Mechanism | Examples | Train Count |
|---|---|---|---|
| 0 | Cell wall synthesis | Ampicillin, meropenem, vancomycin | 135 curated |
| 1 | DNA replication / damage | Ciprofloxacin, nalidixic acid, novobiocin | 84 curated |
| 2 | RNA synthesis | Rifampicin, fidaxomicin | 37 curated |
| 3 | Protein synthesis | Tetracycline, chloramphenicol, gentamicin | 114 curated |
| 4 | Membrane disruption | Polymyxin B, colistin, daptomycin | 19 curated |
| 5 | Folate / metabolic | Trimethoprim, sulfamethoxazole | 93 curated |

MoA labels were expanded from 482 curated labels to 5,918 via Tanimoto similarity propagation (threshold 0.40).
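
Tanimoto propagation can be sketched as a nearest-neighbour lookup over fingerprint bit sets. The version below is a minimal illustration: fingerprints are represented as Python sets of on-bit indices, and an unlabelled compound inherits the label of its most similar curated compound when that similarity is at least 0.40. The fingerprint type, the tie-breaking rule, and the function names are assumptions; propagate_moa_labels.py may handle these differently.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def propagate_label(query_bits: set, curated: list, threshold: float = 0.40):
    """Return the MoA label of the most similar curated compound, or None.

    curated is a list of (bit_set, label) pairs.
    """
    best_label, best_sim = None, 0.0
    for bits, label in curated:
        sim = tanimoto(query_bits, bits)
        if sim >= threshold and sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```

A threshold of 0.40 is fairly permissive, which is why the propagated labels are best read as hypotheses about mechanism, not confirmed annotations.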

Predicted Protein Targets

| Gene | Name | Function | Labelled Compounds |
|---|---|---|---|
| folA | Dihydrofolate reductase | Folate biosynthesis | 475 |
| lpxC | LPS deacetylase | LPS biosynthesis | 264 |
| fabI | Enoyl-ACP reductase | Fatty acid biosynthesis | 107 |
| dxr | DXP reductoisomerase | Isoprenoid biosynthesis | 71 |
| deoA | Thymidine phosphorylase | Nucleotide salvage | 67 |
| thyA | Thymidylate synthase | Thymidylate biosynthesis | 56 |
| aroA | EPSP synthase | Aromatic amino acid synthesis | 36 |
| murA | MurA transferase | Peptidoglycan biosynthesis | 22 |
| ... | | | |

17 targets total (filtered from 1,516 iML1515 genes to those with ≥20 labelled compounds).

Splitting

  • Scaffold-based: Murcko scaffold clustering ensures val/test contain novel chemotypes
  • MoA-stratified: MoA-labelled compounds distributed proportionally across all splits
  • 80/10/10 train/validation/test
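
The scaffold-based part of the split can be sketched as a grouping problem: every compound sharing a scaffold must land in the same split. The example below assumes the scaffold key for each compound has already been computed (in practice this would be a Murcko scaffold SMILES from RDKit) and uses a simple greedy fill, largest groups first. It does not implement the MoA stratification, and the function name is hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds: list, frac_train: float = 0.8, frac_val: float = 0.1):
    """Split compound indices 80/10/10 so that no scaffold spans two splits."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first, so common chemotypes end up in train
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(val) + len(group) <= frac_val * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```

Because whole groups are assigned at once, the realised fractions only approximate 80/10/10 when a few scaffolds are very large.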

Biological Integration

ESM-2 Protein Embeddings

Pre-computed embeddings from facebook/esm2_t33_650M_UR50D for all 1,516 iML1515 gene products. Projected from 1280-dim to 512-dim for bilinear attention scoring against compound embeddings.
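
At its core, bilinear scoring computes a single number from a compound vector c, a protein vector p, and a learned matrix W: score = sigmoid(cᵀ W p). The sketch below shows that computation on plain Python lists. It is an illustration of the general technique only; the 1280-to-512 projection, any attention weighting, and bias terms in target_predictor.py are omitted, and the function name is hypothetical.

```python
import math


def bilinear_score(compound: list, weight: list, protein: list) -> float:
    """sigmoid(compound^T W protein) for W of shape len(compound) x len(protein)."""
    # projected[j] = sum_i compound[i] * W[i][j]
    projected = [sum(c * w for c, w in zip(compound, column))
                 for column in zip(*weight)]
    logit = sum(x * p for x, p in zip(projected, protein))
    return 1.0 / (1.0 + math.exp(-logit))
```

Since the protein embeddings are pre-computed and fixed, scoring one compound against all 17 targets amounts to 17 such products sharing the same W.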

iML1515 Metabolic Model

The differentiable FBA surrogate uses matrices from the iML1515 genome-scale metabolic model:

  • Gene-reaction matrix: 1,516 genes × 2,712 reactions (binary)
  • Reaction-pathway matrix: 2,712 reactions × 40 pathways (binary)
  • Wildtype FBA fluxes: Baseline metabolic state from constraint-based modelling
  • Biomass coefficients: Growth contribution of each reaction
  • Learnable sensitivity weights: How much each reaction responds to drug-induced inhibition
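
The chain "reaction inhibition → flux perturbation → pathway impact → growth proxy" can be illustrated with a toy forward pass over the matrices listed above. Everything in this sketch beyond the names of those matrices is an assumption: the noisy-OR combination of gene hits, the linear flux reduction, the mean over pathway members, and the ratio used as growth proxy are plausible choices, not the formulas in metabolic_layer.py, and the final projection to 256-dim features is omitted.

```python
def metabolic_forward(target_probs, gene_reaction, wildtype_flux,
                      sensitivity, reaction_pathway, biomass):
    """Toy surrogate: returns (pathway_impact, growth_proxy).

    gene_reaction[g][r] and reaction_pathway[r][k] are binary matrices;
    sensitivity[r] is assumed to lie in [0, 1].
    """
    n_rxn = len(wildtype_flux)
    n_path = len(reaction_pathway[0])

    # 1. Reaction inhibition: noisy-OR over the genes mapped to each reaction
    inhibition = []
    for r in range(n_rxn):
        survive = 1.0
        for g, p in enumerate(target_probs):
            if gene_reaction[g][r]:
                survive *= 1.0 - p
        inhibition.append(1.0 - survive)

    # 2. Flux perturbation: scale wildtype flux by the sensitivity-weighted loss
    loss = [sensitivity[r] * inhibition[r] for r in range(n_rxn)]
    flux = [wildtype_flux[r] * (1.0 - loss[r]) for r in range(n_rxn)]

    # 3. Pathway impact: mean fractional flux loss over member reactions
    impact = []
    for k in range(n_path):
        members = [r for r in range(n_rxn) if reaction_pathway[r][k]]
        impact.append(sum(loss[r] for r in members) / len(members) if members else 0.0)

    # 4. Growth proxy: biomass-weighted flux relative to wildtype
    wt_growth = sum(b * f for b, f in zip(biomass, wildtype_flux))
    growth = sum(b * f for b, f in zip(biomass, flux)) / wt_growth if wt_growth else 0.0
    return impact, growth
```

Every step is a product or a sum, so the same computation written with tensors stays differentiable, which is what lets gradients from the MIC head reach the target predictor.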

File Structure

ecoli-k12-drug-model/
├── app.py                          # Gradio web interface
├── train.py                        # Training script
├── install_and_run.py              # One-click installer + launcher
├── START_MAC.command               # Mac double-click launcher
├── START_WINDOWS.bat               # Windows double-click launcher
├── requirements-app.txt            # Inference-only dependencies
├── requirements.txt                # Full training dependencies
├── configs/
│   └── model_config.yaml           # Architecture hyperparameters
├── models/
│   ├── best_model.pt               # Best Phase B checkpoint (val_mic=1.09)
│   ├── best_model_phase_c.pt       # Best Phase C checkpoint (has MoA)
│   └── final_model.pt              # EMA model at end of training
├── src/
│   ├── data/
│   │   ├── merge_datasets.py       # Multi-source dataset builder
│   │   ├── dataset.py              # PyTorch Dataset + collate
│   │   ├── precompute_features.py  # Graph/FP/physchem caching
│   │   ├── compile_moa_labels.py   # 180 curated antibiotic MoA labels
│   │   ├── fetch_chembl_moa.py     # ChEMBL mechanism API
│   │   ├── fetch_patric_amr.py     # PATRIC clinical MIC data
│   │   ├── fetch_bindingdb_targets.py  # BindingDB interactions
│   │   └── propagate_moa_labels.py # Tanimoto MoA propagation
│   ├── features/
│   │   └── molecular_graph.py      # SMILES → graph + fingerprint + physchem
│   ├── models/
│   │   ├── compound_encoder.py     # Multi-branch encoder (GNN+FP+Physchem)
│   │   ├── target_predictor.py     # Bilinear compound-protein scoring
│   │   ├── metabolic_layer.py      # Differentiable FBA surrogate
│   │   └── multitask_head.py       # Top-level model with task heads
│   ├── training/
│   │   ├── losses.py               # Multi-task loss function
│   │   └── evaluation.py           # Test set metrics
│   └── inference/
│       └── predict.py              # Single-compound prediction API
└── data/
    ├── raw/                        # Source data (.parquet files)
    └── processed/                  # Feature caches, splits, gene lists

Interpreting Results

MIC Values

| MIC (µg/mL) | Interpretation |
|---|---|
| ≤ 0.5 | Highly potent: strong antibiotic candidate |
| 0.5 – 4 | Potent: comparable to clinical antibiotics |
| 4 – 16 | Moderate: may need optimisation |
| 16 – 64 | Weak: significant barriers to activity |
| > 64 | Likely inactive against E. coli |
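
The bands above can be applied programmatically with a small helper. The table leaves the shared boundaries (4, 16, 64) ambiguous, so this sketch assumes each upper bound is inclusive; the function name is hypothetical.

```python
def interpret_mic(mic_ug_per_ml: float) -> str:
    """Map an MIC in µg/mL to the interpretation bands (upper bounds inclusive)."""
    if mic_ug_per_ml <= 0.5:
        return "Highly potent"
    if mic_ug_per_ml <= 4:
        return "Potent"
    if mic_ug_per_ml <= 16:
        return "Moderate"
    if mic_ug_per_ml <= 64:
        return "Weak"
    return "Likely inactive"
```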

The model predicts in log₂ µM internally. To convert:

  • µM → µg/mL: multiply by molecular weight / 1000
  • log₂ µM → µM: compute 2^(predicted value)
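
Both conversions chained together look like this. The function name is hypothetical, and the molecular weight must be supplied by the caller (for example from RDKit); the value used in the usage note, about 331.34 g/mol for ciprofloxacin, is given for illustration.

```python
def log2_um_to_ug_per_ml(log2_um: float, mol_weight: float) -> float:
    """Convert a log2(µM) prediction to µg/mL given molecular weight in g/mol."""
    micromolar = 2.0 ** log2_um            # log2 µM -> µM
    return micromolar * mol_weight / 1000  # µM * (g/mol) = µg/L, then /1000 -> µg/mL
```

For example, a prediction of -2 log₂ µM is 0.25 µM, which for ciprofloxacin works out to roughly 0.083 µg/mL.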

MoA Confidence

The model outputs softmax probabilities across 6 classes. Predictions with >80% confidence on the top class are generally reliable. Lower confidence suggests the compound may have a novel or mixed mechanism.
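
The 80% rule can be expressed as a softmax followed by a threshold check. This sketch works on raw logits with the class order from the MoA Classes table; the function names and the returned tuple are illustrative, and predict_drug_effects already returns a class and confidence directly, so this is only meant to show what the confidence value represents.

```python
import math

MOA_CLASSES = [
    "Cell wall synthesis", "DNA replication / damage", "RNA synthesis",
    "Protein synthesis", "Membrane disruption", "Folate / metabolic",
]


def softmax(logits: list) -> list:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_moa(logits: list, threshold: float = 0.80):
    """Return (class name, confidence, is_reliable) for six MoA logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return MOA_CLASSES[best], probs[best], probs[best] > threshold
```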

Target Scores

Target interaction probabilities range 0–1. Most compounds show low activation (<0.1) across all targets. Scores above 0.3 indicate meaningful predicted binding. The model is most confident for well-characterised targets like folA (dihydrofolate reductase) and lpxC (LPS biosynthesis).
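
A simple filter applies the 0.3 cut-off to the `top_targets` list returned by the Python API. The `gene` and `score` keys follow the Quick Start example; the helper name is hypothetical.

```python
def confident_targets(top_targets: list, threshold: float = 0.3) -> list:
    """Keep targets scoring above the threshold, highest score first."""
    hits = [t for t in top_targets if t["score"] > threshold]
    return sorted(hits, key=lambda t: t["score"], reverse=True)
```

An empty result is the expected outcome for most compounds, since scores stay below 0.1 for the majority of inputs.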


Limitations

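  1. Prediction accuracy: MIC predictions have an RMSE of 2.21 log₂ µM, which corresponds to roughly a 4.6-fold error (2^2.21 ≈ 4.6); 54% of test predictions fall within 2-fold of the measured value. For that within-2-fold subset the error is comparable to inter-laboratory MIC variability, but overall the predictions should be treated as order-of-magnitude estimates, not precise values.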

  2. E. coli K12 only: Trained on laboratory K12 strain data. Clinical isolates with acquired resistance (ESBL, carbapenemases, plasmid-mediated resistance) will have different MIC profiles.

  3. Scaffold bias: Best performance on compounds structurally similar to the training set (fluoroquinolones, beta-lactams, aminoglycosides, tetracyclines, sulfonamides). Novel scaffolds may have higher prediction error.

  4. Target coverage: Only 17 of ~1,500 E. coli proteins have sufficient training labels. The model cannot predict interactions with uncharacterised targets.

  5. MoA propagation: 91% of MoA labels are inferred by Tanimoto similarity (threshold 0.40), not experimentally confirmed. Novel chemotypes may be misclassified.

  6. Not for clinical use. This is a research tool for prioritising compounds in early-stage drug discovery. All predictions require experimental validation.


System Requirements

| | Minimum | Recommended |
|---|---|---|
| Python | 3.9 | 3.10+ |
| RAM | 4 GB | 8 GB |
| Disk | 2 GB | 4 GB |
| GPU | Not required | MPS (Mac) or CUDA for faster inference |

Citation

@misc{ecoli-k12-drug-model,
  title={E. coli K12 Drug Effect Prediction Model},
  author={Alex Sheridan},
  year={2026},
  url={https://huggingface.co/MrMufasi/ecoli-k12-drug-model}
}

Built with PyTorch, PyTorch Geometric, ESM-2, RDKit, and Gradio
