CMMP: Contrastive Micrograph-Metadata Pre-Training

A contrastive encoder that aligns HAADF-STEM microscopy images with their acquisition metadata in a shared 128-d embedding space. Enables cross-modal retrieval (find images by metadata, or predict metadata from images) and serves as a conditioning backbone for style transfer models.

Model Details

  • Architecture: ViT-B/16 image encoder (ImageNet-pretrained, fine-tuned) + MLP metadata encoder
  • Embedding dimension: 128
  • Image input: Single-channel grayscale, 256×256 pixels
  • Metadata input: 7-d z-scored vector (pixel_size, dwell_time, convergence_angle, beam_current, gain, offset, inner_collection_angle)
  • Loss: Symmetric cross-entropy (CLIP-style) with learnable temperature and bias
  • Parameters: ~86M (ViT-B/16 backbone)
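
The symmetric cross-entropy loss described above can be sketched as follows. This is a minimal illustration, not the training code: the exact parameterization of the learnable temperature and bias is assumed here (a log-temperature scalar plus an additive logit bias), and `clip_symmetric_loss` is an illustrative name.

```python
import torch
import torch.nn.functional as F

def clip_symmetric_loss(img_emb, meta_emb, log_temp, bias):
    """CLIP-style symmetric cross-entropy over a batch of matched pairs.

    img_emb, meta_emb: (B, 128) L2-normalized embeddings; row i of each
    comes from the same micrograph. log_temp and bias are learnable
    scalars (parameterization assumed).
    """
    logits = img_emb @ meta_emb.T * log_temp.exp() + bias  # (B, B) similarity logits
    targets = torch.arange(logits.size(0))                 # matched pairs lie on the diagonal
    loss_img = F.cross_entropy(logits, targets)            # image -> metadata direction
    loss_meta = F.cross_entropy(logits.T, targets)         # metadata -> image direction
    return 0.5 * (loss_img + loss_meta)

# Toy check: aligned pairs should score lower than misaligned ones
torch.manual_seed(0)
emb = F.normalize(torch.randn(8, 128), dim=-1)
matched = clip_symmetric_loss(emb, emb, torch.tensor(2.0), torch.tensor(0.0))
mismatched = clip_symmetric_loss(emb, emb.roll(1, dims=0), torch.tensor(2.0), torch.tensor(0.0))
```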

Performance

Retrieval accuracy on the validation set (663 held-out images):

Metric   Score
------   ------
Top-1    0.7952
Top-5    0.9398
Top-10   1.0000
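
These scores can be reproduced from precomputed embeddings with a simple Top-k routine. A sketch (function name and tie-breaking behavior are assumptions; the checkpoint's own evaluation code may differ):

```python
import torch

def topk_retrieval_accuracy(img_emb, meta_emb, ks=(1, 5, 10)):
    """Fraction of images whose true metadata row ranks in the top k.

    img_emb, meta_emb: (N, D) L2-normalized embeddings where row i of
    each modality belongs to the same micrograph.
    """
    sim = img_emb @ meta_emb.T                       # (N, N) cosine similarities
    order = sim.argsort(dim=1, descending=True)      # candidate ranking per image
    target = torch.arange(sim.size(0)).unsqueeze(1)  # correct column for each row
    hits = order == target                           # one True per row
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}

# Sanity check: identical embeddings retrieve themselves perfectly
acc = topk_retrieval_accuracy(torch.eye(16), torch.eye(16))
```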

Training Configuration

Parameter       Value
---------       -----
Dataset         Henrik's HAADF-STEM (6,627 images, archived/original)
Image encoder   ViT-B/16 (ImageNet-pretrained, fine-tuned)
Crop size       256×256 (on-the-fly from full-resolution images)
Loss function   CLIP-style symmetric cross-entropy with learnable temperature and logit bias
Optimizer       AdamW (lr=1e-4, weight_decay=0.01)
Scheduler       Cosine annealing
Batch size      64 per GPU × 8 GPUs = 512 effective
Epochs          1000 (best checkpoint at epoch 476)
Hardware        8× H100 GPUs
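
The optimizer and scheduler rows translate directly into PyTorch. A sketch with a placeholder model standing in for CMMP:

```python
import torch

model = torch.nn.Linear(7, 128)  # placeholder standing in for the actual CMMP model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)  # anneal over 1000 epochs

for epoch in range(3):  # real run: 1000 epochs
    # ... one epoch of contrastive updates would go here ...
    optimizer.step()    # no-op without gradients; keeps the step order valid
    scheduler.step()    # decay the learning rate along the cosine curve
```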

Note on batch size

Contrastive learning benefits significantly from larger batch sizes (more negatives per step). An earlier single-GPU run with effective batch 128 achieved Top-1 = 0.447. Scaling to 8 GPUs with effective batch 512 improved Top-1 to 0.795 — a 78% relative improvement with no other changes. This is consistent with findings in CLIP and similar contrastive methods.

Training Data

Henrik's HAADF-STEM Dataset

  • 6,627 high-angle annular dark-field STEM images from a Titan Themis microscope at Empa Dübendorf
  • Variable resolution: 256×256 to 4096×4096 pixels, uint16
  • 7 metadata parameters automatically logged by the microscope software
  • Train/val split: 90/10 by image (seed 67), resulting in 5,964 train / 663 val
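
The 7-d metadata vector fed to the model is z-scored. A sketch of the normalization, assuming per-parameter statistics computed on the training split only to avoid leakage into validation (the synthetic values below are purely illustrative):

```python
import torch

def zscore(meta, mean, std, eps=1e-8):
    """Standardize raw metadata columns with train-set statistics."""
    return (meta - mean) / (std + eps)

# Stand-in for the raw logged parameters of the training images
torch.manual_seed(0)
train_meta = torch.rand(100, 7) * 10.0
mean, std = train_meta.mean(dim=0), train_meta.std(dim=0)

normed = zscore(train_meta, mean, std)  # each column now has mean ~0, std ~1
```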

Usage

import torch
from models import CMMP

# Load model
model = CMMP(
    meta_input_dim=7,
    embed_dim=128,
    image_encoder="vit_pretrained",
    image_size=256,
)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

# Embed an image and its metadata (dummy tensors shown; real images should
# be normalized to [0, 1] and metadata z-scored with training statistics)
image = torch.randn(1, 1, 256, 256)   # (B, 1, H, W) single-channel grayscale
metadata = torch.randn(1, 7)          # (B, 7) metadata vector

with torch.no_grad():
    img_emb, meta_emb, temp, bias = model(image, metadata)
    # img_emb: (1, 128) — L2-normalized image embedding
    # meta_emb: (1, 128) — L2-normalized metadata embedding

# Cross-modal similarity
similarity = (img_emb @ meta_emb.T).item()
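
Beyond single-pair similarity, cross-modal retrieval over a gallery works the same way. A sketch with synthetic embeddings standing in for the model's outputs:

```python
import torch
import torch.nn.functional as F

# Synthetic stand-ins for precomputed, L2-normalized embeddings
torch.manual_seed(0)
meta_gallery = F.normalize(torch.randn(663, 128), dim=-1)  # e.g. all validation metadata
img_query = meta_gallery[42]                               # pretend this image matches row 42

scores = meta_gallery @ img_query   # (663,) cosine similarities
top5 = scores.topk(5).indices       # indices of the 5 best-matching metadata rows
```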

Files

  • model.pth — Best checkpoint (epoch 476, highest Top-1)
  • config.json — Full training configuration
  • training_log.csv — Per-epoch training metrics (loss, retrieval accuracy, learning rate)

Citation

@misc{cmmp2026,
  title={Contrastive Micrograph-Metadata Pre-Training for Transmission Electron Microscopy},
  author={Channing, Georgia and Keller, Debora and Rossell, Marta D. and Erni, Rolf and Helveg, Stig and Eliasson, Henrik},
  year={2026},
}