# AudioMosaic: Contrastive Masked Audio Representation Learning
- Code: https://github.com/HanxunH/AudioMosaic
- Pretrained encoder: `hanxunh/AudioMosaic-vit-b16-pretrained`
This is the AudioMosaic ViT-B/16 encoder fine-tuned on ESC-50 (fold 1) for single-label audio classification over 50 classes.
| Metric | Value |
|---|---|
| Accuracy (%) | 97.25 |
```python
import sys
import torch
from huggingface_hub import snapshot_download

# Download the checkpoint together with the vendored model code
local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-finetune-esc-split1")
sys.path.insert(0, local_dir)

from load_model import load_classifier

model = load_classifier(device="cuda")

# Forward a log-mel spectrogram batch of shape [B, 1, 1024, 128]
fbank = torch.randn(2, 1, 1024, 128).cuda()
with torch.no_grad():
    logits = model(fbank)        # [B, 50]
pred = logits.argmax(dim=-1)     # predicted class ids
```
The release contains:

- `model.safetensors`: fine-tuned classifier weights
- `config.json`: architecture hyperparameters
- `modeling.py`: vendored model architecture (no need to install AudioMosaic)
- `load_model.py`: convenience loader

Required dependencies: `torch`, `timm`, `torchlibrosa`, `safetensors`, `huggingface_hub`.
If you use this model, please cite:

```bibtex
@inproceedings{huang2026audiomosaic,
  title={AudioMosaic: Contrastive Masked Audio Representation Learning},
  author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
  booktitle={ICML},
  year={2026}
}
```