CMMP: Contrastive Micrograph-Metadata Pre-Training
A contrastive encoder that aligns HAADF-STEM microscopy images with their acquisition metadata in a shared 128-d embedding space. Enables cross-modal retrieval (find images by metadata, or predict metadata from images) and serves as a conditioning backbone for style transfer models.
Model Details
- Architecture: ViT-B/16 image encoder (ImageNet-pretrained, fine-tuned) + MLP metadata encoder
- Embedding dimension: 128
- Image input: Single-channel grayscale, 256×256 pixels
- Metadata input: 7-d z-scored vector (pixel_size, dwell_time, convergence_angle, beam_current, gain, offset, inner_collection_angle)
- Loss: Symmetric cross-entropy (CLIP-style) with learnable temperature and bias
- Parameters: ~86M (ViT-B/16 backbone)
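The two-tower layout above can be sketched in plain PyTorch. This is a minimal illustration, not the released `CMMP` class: module and parameter names are hypothetical, and the ViT-B/16 image backbone is omitted so the snippet stays self-contained, showing only the MLP metadata tower and the L2-normalized 128-d output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaEncoder(nn.Module):
    """Hypothetical MLP tower: 7-d z-scored metadata -> 128-d unit embedding."""
    def __init__(self, in_dim=7, hidden=256, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x):
        # Project onto the unit sphere so dot products are cosine similarities
        return F.normalize(self.net(x), dim=-1)

meta_enc = MetaEncoder()
emb = meta_enc(torch.randn(4, 7))
print(emb.shape)          # torch.Size([4, 128])
print(emb.norm(dim=-1))   # all close to 1.0
```

The image tower produces embeddings of the same shape, so the two modalities can be compared directly with a dot product.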
Performance
Retrieval accuracy on the validation set (663 held-out images from Henrik's dataset):
| Metric | Score |
|---|---|
| Top-1 | 0.7952 |
| Top-5 | 0.9398 |
| Top-10 | 1.0000 |
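Top-k here means: for each image embedding, does its paired metadata embedding rank among the k nearest by cosine similarity? A sketch of that evaluation, using identical synthetic towers so retrieval is trivially perfect (the real numbers above come from the trained model on the validation split):

```python
import torch
import torch.nn.functional as F

def topk_accuracy(img_emb, meta_emb, ks=(1, 5, 10)):
    """Fraction of rows whose matched pair ranks in the top k by similarity.
    Both inputs are (N, D) L2-normalized; row i of each is a matched pair."""
    sim = img_emb @ meta_emb.T                          # (N, N) cosine similarities
    ranks = sim.argsort(dim=1, descending=True)         # best match first
    target = torch.arange(sim.size(0)).unsqueeze(1)
    hit_pos = (ranks == target).float().argmax(dim=1)   # rank of the true pair
    return {k: (hit_pos < k).float().mean().item() for k in ks}

emb = F.normalize(torch.randn(100, 128), dim=-1)
acc = topk_accuracy(emb, emb)   # identical towers -> perfect retrieval
print(acc)                      # {1: 1.0, 5: 1.0, 10: 1.0}
```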
Training Configuration
| Parameter | Value |
|---|---|
| Dataset | Henrik's HAADF-STEM (6,627 images, archived/original) |
| Image encoder | ViT-B/16 (ImageNet-pretrained, fine-tuned) |
| Crop size | 256×256 (on-the-fly from full-resolution images) |
| Loss function | CLIP (symmetric cross-entropy) with learnable logit bias |
| Optimizer | AdamW (lr=1e-4, weight_decay=0.01) |
| Scheduler | Cosine annealing |
| Batch size | 64 per GPU × 8 GPUs = 512 effective |
| Epochs | 1000 (best checkpoint at epoch 476) |
| Hardware | 8× H100 GPUs |
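The loss in the table can be sketched as follows. This is an illustrative implementation, not the released training code: the temperature is stored as a learnable log-scale parameter as in CLIP, and the learnable logit bias is applied additively to the similarity matrix; initial values and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Symmetric cross-entropy over an N x N similarity matrix (CLIP-style)."""
    def __init__(self, init_temp=0.07, init_bias=0.0):
        super().__init__()
        self.log_temp = nn.Parameter(torch.tensor(init_temp).log())  # learnable temperature
        self.bias = nn.Parameter(torch.tensor(init_bias))            # learnable logit bias

    def forward(self, img_emb, meta_emb):
        logits = img_emb @ meta_emb.T / self.log_temp.exp() + self.bias
        targets = torch.arange(logits.size(0))        # matched pairs on the diagonal
        loss_i = F.cross_entropy(logits, targets)     # image -> metadata direction
        loss_m = F.cross_entropy(logits.T, targets)   # metadata -> image direction
        return 0.5 * (loss_i + loss_m)

emb_a = F.normalize(torch.randn(64, 128), dim=-1)
emb_b = F.normalize(torch.randn(64, 128), dim=-1)
loss = ContrastiveLoss()(emb_a, emb_b)
print(loss.item())   # scalar loss for one batch
```

Every other sample in the batch serves as a negative, which is why the effective batch size matters so much (see the note below the table).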
Note on batch size
Contrastive learning benefits significantly from larger batch sizes (more negatives per step). An earlier single-GPU run with effective batch 128 achieved Top-1 = 0.447. Scaling to 8 GPUs with effective batch 512 improved Top-1 to 0.795 — a 78% relative improvement with no other changes. This is consistent with findings in CLIP and similar contrastive methods.
Training Data
Henrik's HAADF-STEM Dataset
- 6,627 high-angle annular dark-field STEM images from a Titan Themis microscope at Empa Dübendorf
- Variable resolution: 256×256 to 4096×4096 pixels, uint16
- 7 metadata parameters automatically logged by the microscope software
- Train/val split: 90/10 by image (seed 67), resulting in 5,964 train / 663 val
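The seven metadata parameters are z-scored before entering the MLP tower. A sketch of that preprocessing, assuming (as is standard) that the per-parameter mean and standard deviation are computed on the training split only and reused at validation/inference time; the statistics below are placeholders, not the dataset's real values:

```python
import torch

# Stand-in for the real training metadata: 5,964 rows of
# (pixel_size, dwell_time, convergence_angle, beam_current,
#  gain, offset, inner_collection_angle)
train_meta = torch.rand(5964, 7)
mean = train_meta.mean(dim=0)   # computed on the training split only
std = train_meta.std(dim=0)

def zscore(meta):
    """Normalize raw metadata with the stored training-split statistics."""
    return (meta - mean) / std

z = zscore(train_meta)
print(z.mean(dim=0))   # ~0 per parameter
print(z.std(dim=0))    # ~1 per parameter
```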
Usage
```python
import torch
from models import CMMP

# Load model
model = CMMP(
    meta_input_dim=7,
    embed_dim=128,
    image_encoder="vit_pretrained",
    image_size=256,
)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

# Embed an image and its metadata
image = torch.randn(1, 1, 256, 256)  # single-channel grayscale [0, 1]
metadata = torch.randn(1, 7)         # z-scored metadata vector

with torch.no_grad():
    img_emb, meta_emb, temp, bias = model(image, metadata)

# img_emb:  (1, 128) — L2-normalized image embedding
# meta_emb: (1, 128) — L2-normalized metadata embedding

# Cross-modal similarity
similarity = (img_emb @ meta_emb.T).item()
```
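Because both towers land in the same unit-norm space, retrieval reduces to nearest-neighbour search. A sketch of looking up the images that best match one metadata query over a pre-embedded gallery (random embeddings stand in for real model outputs here):

```python
import torch
import torch.nn.functional as F

# Pre-computed gallery of image embeddings, one row per validation image
# (in practice these come from model(...) over the dataset)
gallery = F.normalize(torch.randn(663, 128), dim=-1)
query = F.normalize(torch.randn(1, 128), dim=-1)   # metadata embedding of the query

sim = (query @ gallery.T).squeeze(0)   # cosine similarity to every gallery image
top5 = sim.topk(5)                     # best-matching image indices and scores
print(top5.indices.tolist())
print(top5.values.tolist())
```

The image-to-metadata direction is symmetric: swap which tower produces the query and which produces the gallery.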
Files
- model.pth — Best checkpoint (epoch 476, highest Top-1)
- config.json — Full training configuration
- training_log.csv — Per-epoch training metrics (loss, retrieval accuracy, learning rate)
Related Models
- Stemson-AI/cmmp-vit-pretrained-256-with-sample-atomagined — Same architecture trained with atomagined simulated data mixed in (Top-1: 0.705). Its gain over the single-GPU baseline was later attributed to the larger batch size, not to the atomagined data.
Citation
```bibtex
@misc{cmmp2026,
  title={Contrastive Micrograph-Metadata Pre-Training for Transmission Electron Microscopy},
  author={Channing, Georgia and Keller, Debora and Rossell, Marta D. and Erni, Rolf and Helveg, Stig and Eliasson, Henrik},
  year={2026},
}
```