CMMP: Contrastive Micrograph-Metadata Pre-Training
A contrastive encoder that aligns HAADF-STEM microscopy images with their acquisition metadata in a shared 128-d embedding space. Enables cross-modal retrieval (find images by metadata, or predict metadata from images) and serves as a conditioning backbone for style transfer models.
Model Details
- Architecture: ViT-B/16 image encoder (ImageNet-pretrained, fine-tuned) + MLP metadata encoder
- Embedding dimension: 128
- Image input: Single-channel grayscale, 256×256 pixels
- Metadata input: 7-d z-scored vector (pixel_size, dwell_time, convergence_angle, beam_current, gain, offset, inner_collection_angle)
- Loss: Symmetric cross-entropy (CLIP-style) with learnable temperature and bias
- Parameters: ~86M (ViT-B/16 backbone)
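The two-tower layout above can be sketched in plain PyTorch. This is a minimal illustration, not the released `CMMP` class: module and parameter names are hypothetical, and the ViT-B/16 image backbone is omitted so the snippet stays self-contained, showing only the MLP metadata tower and the L2-normalized 128-d output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaEncoder(nn.Module):
    """Hypothetical MLP tower: 7-d z-scored metadata -> 128-d unit embedding."""
    def __init__(self, in_dim=7, hidden=256, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x):
        # Project onto the unit sphere so dot products are cosine similarities
        return F.normalize(self.net(x), dim=-1)

meta_enc = MetaEncoder()
emb = meta_enc(torch.randn(4, 7))
print(emb.shape)          # torch.Size([4, 128])
print(emb.norm(dim=-1))   # all close to 1.0
```

The image tower produces embeddings of the same shape, so the two modalities can be compared directly with a dot product.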
Performance
Retrieval accuracy on the validation set (663 held-out images from Henrik's dataset):
| Metric | Score |
|---|---|
| Top-1 | 0.7952 |
| Top-5 | 0.9398 |
| Top-10 | 1.0000 |
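Top-k here means: for each image embedding, does its paired metadata embedding rank among the k nearest by cosine similarity? A sketch of that evaluation, using identical synthetic towers so retrieval is trivially perfect (the real numbers above come from the trained model on the validation split):

```python
import torch
import torch.nn.functional as F

def topk_accuracy(img_emb, meta_emb, ks=(1, 5, 10)):
    """Fraction of rows whose matched pair ranks in the top k by similarity.
    Both inputs are (N, D) L2-normalized; row i of each is a matched pair."""
    sim = img_emb @ meta_emb.T                          # (N, N) cosine similarities
    ranks = sim.argsort(dim=1, descending=True)         # best match first
    target = torch.arange(sim.size(0)).unsqueeze(1)
    hit_pos = (ranks == target).float().argmax(dim=1)   # rank of the true pair
    return {k: (hit_pos < k).float().mean().item() for k in ks}

emb = F.normalize(torch.randn(100, 128), dim=-1)
acc = topk_accuracy(emb, emb)   # identical towers -> perfect retrieval
print(acc)                      # {1: 1.0, 5: 1.0, 10: 1.0}
```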
Training Configuration
| Parameter | Value |
|---|---|
| Dataset | Henrik's HAADF-STEM (6,627 images, archived/original) |
| Image encoder | ViT-B/16 (ImageNet-pretrained, fine-tuned) |
| Crop size | 256×256 (on-the-fly from full-resolution images) |
| Loss function | CLIP (symmetric cross-entropy) with learnable logit bias |
| Optimizer | AdamW (lr=1e-4, weight_decay=0.01) |
| Scheduler | Cosine annealing |
| Batch size | 64 per GPU × 8 GPUs = 512 effective |
| Epochs | 1000 (best checkpoint at epoch 476) |
| Hardware | 8× H100 GPUs |
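The loss in the table can be sketched as follows. This is an illustrative implementation, not the released training code: the temperature is stored as a learnable log-scale parameter as in CLIP, and the learnable logit bias is applied additively to the similarity matrix; initial values and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Symmetric cross-entropy over an N x N similarity matrix (CLIP-style)."""
    def __init__(self, init_temp=0.07, init_bias=0.0):
        super().__init__()
        self.log_temp = nn.Parameter(torch.tensor(init_temp).log())  # learnable temperature
        self.bias = nn.Parameter(torch.tensor(init_bias))            # learnable logit bias

    def forward(self, img_emb, meta_emb):
        logits = img_emb @ meta_emb.T / self.log_temp.exp() + self.bias
        targets = torch.arange(logits.size(0))        # matched pairs on the diagonal
        loss_i = F.cross_entropy(logits, targets)     # image -> metadata direction
        loss_m = F.cross_entropy(logits.T, targets)   # metadata -> image direction
        return 0.5 * (loss_i + loss_m)

emb_a = F.normalize(torch.randn(64, 128), dim=-1)
emb_b = F.normalize(torch.randn(64, 128), dim=-1)
loss = ContrastiveLoss()(emb_a, emb_b)
print(loss.item())   # scalar loss for one batch
```

Every other sample in the batch serves as a negative, which is why the effective batch size matters so much (see the note below the table).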
Note on batch size
Contrastive learning benefits significantly from larger batch sizes (more negatives per step). An earlier single-GPU run with effective batch 128 achieved Top-1 = 0.447. Scaling to 8 GPUs with effective batch 512 improved Top-1 to 0.795 — a 78% relative improvement with no other changes. This is consistent with findings in CLIP and similar contrastive methods.
Training Data
Henrik's HAADF-STEM Dataset
- 6,627 high-angle annular dark-field STEM images from a Titan Themis microscope at Empa Dübendorf
- Variable resolution: 256×256 to 4096×4096 pixels, uint16
- 7 metadata parameters automatically logged by the microscope software
- Train/val split: 90/10 by image (seed 67), resulting in 5,964 train / 663 val
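The seven metadata parameters are z-scored before entering the MLP tower. A sketch of that preprocessing, assuming (as is standard) that the per-parameter mean and standard deviation are computed on the training split only and reused at validation/inference time; the statistics below are placeholders, not the dataset's real values:

```python
import torch

# Stand-in for the real training metadata: 5,964 rows of
# (pixel_size, dwell_time, convergence_angle, beam_current,
#  gain, offset, inner_collection_angle)
train_meta = torch.rand(5964, 7)
mean = train_meta.mean(dim=0)   # computed on the training split only
std = train_meta.std(dim=0)

def zscore(meta):
    """Normalize raw metadata with the stored training-split statistics."""
    return (meta - mean) / std

z = zscore(train_meta)
print(z.mean(dim=0))   # ~0 per parameter
print(z.std(dim=0))    # ~1 per parameter
```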
Usage
```python
import torch
from models import CMMP

# Load model
model = CMMP(
    meta_input_dim=7,
    embed_dim=128,
    image_encoder="vit_pretrained",
    image_size=256,
)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

# Embed an image and its metadata
image = torch.randn(1, 1, 256, 256)  # single-channel grayscale [0, 1]
metadata = torch.randn(1, 7)         # z-scored metadata vector

with torch.no_grad():
    img_emb, meta_emb, temp, bias = model(image, metadata)

# img_emb:  (1, 128) — L2-normalized image embedding
# meta_emb: (1, 128) — L2-normalized metadata embedding

# Cross-modal similarity
similarity = (img_emb @ meta_emb.T).item()
```
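Because both towers land in the same unit-norm space, retrieval reduces to nearest-neighbour search. A sketch of looking up the images that best match one metadata query over a pre-embedded gallery (random embeddings stand in for real model outputs here):

```python
import torch
import torch.nn.functional as F

# Pre-computed gallery of image embeddings, one row per validation image
# (in practice these come from model(...) over the dataset)
gallery = F.normalize(torch.randn(663, 128), dim=-1)
query = F.normalize(torch.randn(1, 128), dim=-1)   # metadata embedding of the query

sim = (query @ gallery.T).squeeze(0)   # cosine similarity to every gallery image
top5 = sim.topk(5)                     # best-matching image indices and scores
print(top5.indices.tolist())
print(top5.values.tolist())
```

The image-to-metadata direction is symmetric: swap which tower produces the query and which produces the gallery.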
Files
- model.pth — Best checkpoint (epoch 476, highest Top-1)
- config.json — Full training configuration
- training_log.csv — Per-epoch training metrics (loss, retrieval accuracy, learning rate)
Related Models
- Stemson-AI/cmmp-vit-pretrained-256-with-sample-atomagined — Same architecture trained with atomagined simulated data mixed in (Top-1: 0.705). Its gain over the single-GPU baseline was later attributed to the larger batch size, not to the atomagined data.
Citation
```bibtex
@misc{cmmp2026,
  title={Contrastive Micrograph-Metadata Pre-Training for Transmission Electron Microscopy},
  author={Channing, Georgia and Keller, Debora and Rossell, Marta D. and Erni, Rolf and Helveg, Stig and Eliasson, Henrik},
  year={2026},
}
```