AudioMosaic

AudioMosaic is a self-supervised audio foundation model that combines time/frequency masking with contrastive training (NT-Xent loss) over mel-spectrogram patches. The model is pretrained on AudioSet-2M and transfers to a wide range of downstream audio tasks.
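
For readers unfamiliar with NT-Xent, here is a minimal PyTorch sketch of the standard formulation (as popularised by SimCLR). It is not taken from the AudioMosaic code base: the function name `nt_xent_loss`, the temperature of 0.1, and the use of two embedding views per clip are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent over paired embeddings z1[i] <-> z2[i], each of shape [B, D].

    Hypothetical helper for illustration; the temperature is an assumed value.
    """
    batch_size = z1.shape[0]
    # Stack both views and L2-normalise so dot products are cosine similarities.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2B, D]
    logits = z @ z.t() / temperature                    # [2B, 2B]
    # A sample must not act as its own positive or negative.
    self_mask = torch.eye(2 * batch_size, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))
    # Row i is paired with row i + B, and row i + B with row i.
    targets = torch.cat(
        [torch.arange(batch_size, 2 * batch_size), torch.arange(0, batch_size)]
    ).to(z.device)
    return F.cross_entropy(logits, targets)


torch.manual_seed(0)
view_a = torch.randn(4, 768)
view_b = view_a + 0.01 * torch.randn(4, 768)   # nearly identical second view
loss_aligned = nt_xent_loss(view_a, view_b)
loss_random = nt_xent_loss(view_a, torch.randn(4, 768))
print(float(loss_aligned), float(loss_random))
```

The loss is low when the two views of each clip agree and higher when the second view is unrelated, which is the behaviour the contrastive objective rewards.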

Model Details

Architecture: ViT-B/16
Embedding dim: 768, Depth: 12, Heads: 12
Input: log-mel spectrogram of size 1024 × 128
Patch size: 16 × 16, Patch stride: 16 × 16
Pretraining: NT-Xent contrastive loss + temporal/frequency masking (60% time / 40% freq)
Pretraining data: AudioSet-2M
Pretraining epochs: 400
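
The token count implied by these settings can be checked with a little arithmetic. The sketch below assumes a standard strided patch embedding without padding; `patch_grid` is a hypothetical helper, and treating the 1024 axis as time and the 128 axis as mel bins follows the usual convention rather than anything stated in this card.

```python
def patch_grid(input_size, patch_size, stride):
    """Number of patches along each axis for an unpadded, strided patch embedding."""
    return tuple((dim - p) // s + 1 for dim, p, s in zip(input_size, patch_size, stride))


# Values from the Model Details list: 1024 x 128 input, 16 x 16 patches, stride 16 x 16.
time_patches, freq_patches = patch_grid((1024, 128), (16, 16), (16, 16))
num_patches = time_patches * freq_patches
print(time_patches, freq_patches, num_patches)  # 64 8 512
```

With one extra prefix token, the encoder output would therefore have 513 tokens per clip, matching the `[B, num_patches+1, 768]` shape noted in the usage example.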

Model Usage

import sys, torch
from huggingface_hub import snapshot_download

# Download the self-contained release (architecture + weights + loader)
local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-pretrained")
sys.path.insert(0, local_dir)

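from load_model import load_pretrained_encoder

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_pretrained_encoder(device=device)

# Forward a log-mel spectrogram batch of shape [B, 1, 1024, 128]
# (random values stand in for real log-mel features here)
fbank = torch.randn(2, 1, 1024, 128, device=device)
with torch.no_grad():
    features = model.forward_encoder(fbank)  # [B, num_patches+1, 768]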
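
Downstream classifiers usually want one vector per clip rather than a token sequence. The sketch below shows one common way to get it, mean pooling over patch tokens. It assumes the extra token in `num_patches+1` is a prefix (CLS-style) token at index 0, which this card does not confirm; `pool_clip_embedding` is a hypothetical helper, and a random tensor stands in for real encoder output so the snippet runs without downloading the model.

```python
import torch


def pool_clip_embedding(features: torch.Tensor, has_prefix_token: bool = True) -> torch.Tensor:
    """Average patch tokens into one vector per clip: [B, N(+1), D] -> [B, D]."""
    # Assumption: when present, the non-patch token sits at position 0.
    patch_tokens = features[:, 1:, :] if has_prefix_token else features
    return patch_tokens.mean(dim=1)


# Stand-in for model.forward_encoder(fbank): 512 patch tokens + 1 prefix token.
features = torch.randn(2, 513, 768)
clip_embeddings = pool_clip_embedding(features)
print(clip_embeddings.shape)  # torch.Size([2, 768])
```

Using the prefix token alone (`features[:, 0]`) is another common choice; which one works better depends on the downstream task.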

The release contains:

model.safetensors — encoder weights
config.json — architecture hyperparameters
modeling.py — vendored model architecture (no need to install AudioMosaic)
load_model.py — convenience loader

Required dependencies: torch, timm, torchlibrosa, safetensors, huggingface_hub.
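
A quick way to confirm the environment is ready before downloading the release is to check that each dependency is importable. This is a small sketch using only the standard library; `missing_packages` is a hypothetical helper, and it assumes each package's import name matches the name listed above.

```python
from importlib.util import find_spec

REQUIRED = ["torch", "timm", "torchlibrosa", "safetensors", "huggingface_hub"]


def missing_packages(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```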

Citation

@inproceedings{huang2026audiomosaic,
  title={AudioMosaic: Contrastive Masked Audio Representation Learning},
  author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
  booktitle={ICML},
  year={2026}
}

Model size: 86.1M params (Safetensors, tensor type F32)

Paper: AudioMosaic: Contrastive Masked Audio Representation Learning (2605.14231)