AudioMosaic: Contrastive Masked Audio Representation Learning
Code: https://github.com/HanxunH/AudioMosaic
AudioMosaic is a self-supervised audio foundation model that combines masked spectrogram modeling with NT-Xent contrastive learning over mel-spectrogram patches. The model is pretrained on AudioSet-2M and transfers to a wide range of downstream audio tasks.
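To make the contrastive part concrete, here is a minimal, illustrative sketch of the NT-Xent objective over two embedded views of the same audio clips. This is not the authors' training code; batch size, temperature, and how the two views are produced are assumptions here.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: [B, D] embeddings of two views of the same B clips.
    Each (z1[i], z2[i]) pair is a positive; all other entries in the
    concatenated batch act as negatives.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                 # [2B, D]
    sim = z @ z.t() / temperature                  # [2B, 2B] cosine sims
    # Mask out self-similarity so a sample cannot be its own positive.
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))
    B = z1.size(0)
    # Positive of row i is row i+B, and vice versa.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Near-identical views yield a loss close to zero, while unrelated embeddings give a loss around log(2B - 1).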
import sys, torch
from huggingface_hub import snapshot_download
# Download the self-contained release (architecture + weights + loader)
local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-pretrained")
sys.path.insert(0, local_dir)
from load_model import load_pretrained_encoder
model = load_pretrained_encoder(device="cuda")
# Forward a log-mel spectrogram batch of shape [B, 1, 1024, 128]
fbank = torch.randn(2, 1, 1024, 128).cuda()
with torch.no_grad():
    features = model.forward_encoder(fbank)  # [B, num_patches+1, 768]
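Real inputs must be fit to the fixed [1024, 128] time-frequency shape shown above (in practice a log-mel filterbank, e.g. from torchaudio.compliance.kaldi.fbank, though that dependency is not part of this release). A minimal pad-or-truncate helper; how the official pipeline handles clip length is an assumption here:

```python
import torch
import torch.nn.functional as F

def pad_or_trim_fbank(fbank, target_frames=1024):
    """Fit a log-mel spectrogram [T, n_mels] to [1, 1, target_frames, n_mels].

    Zero-pads short clips along the time axis and truncates long ones.
    """
    T = fbank.size(0)
    if T < target_frames:
        # F.pad pads the last dim first: (left, right, top, bottom).
        fbank = F.pad(fbank, (0, 0, 0, target_frames - T))
    else:
        fbank = fbank[:target_frames]
    return fbank.unsqueeze(0).unsqueeze(0)  # add batch and channel dims
```

For example, a 5-second clip at a 10 ms frame shift yields roughly 500 frames, which this helper zero-pads up to 1024.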
The release contains:
model.safetensors — encoder weights
config.json — architecture hyperparameters
modeling.py — vendored model architecture (no need to install AudioMosaic)
load_model.py — convenience loader

Required dependencies: torch, timm, torchlibrosa, safetensors, huggingface_hub.
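For downstream tasks, the per-token encoder output is typically pooled into one clip-level embedding. A small sketch assuming the standard ViT layout where index 0 is the [CLS] token; whether the authors recommend CLS or mean pooling is an assumption:

```python
import torch

def clip_embedding(features, use_cls=False):
    """Pool ViT token features [B, num_patches+1, D] to [B, D].

    use_cls=True returns the [CLS] token; otherwise the mean over
    patch tokens (excluding [CLS]) is returned.
    """
    if use_cls:
        return features[:, 0]
    return features[:, 1:].mean(dim=1)
```

The resulting [B, 768] embeddings can be fed directly to a linear probe or k-NN classifier.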
@inproceedings{huang2026audiomosaic,
title={AudioMosaic: Contrastive Masked Audio Representation Learning},
author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
booktitle={ICML},
year={2026}
}