
AudioMosaic: Contrastive Masked Audio Representation Learning

Code: https://github.com/HanxunH/AudioMosaic

Pretrained encoder: hanxunh/AudioMosaic-vit-b16-pretrained


AudioMosaic: ESC-50 (fold 2) Classifier

This is the AudioMosaic ViT-B/16 encoder fine-tuned on ESC-50 (fold 2) for single-label audio classification over 50 classes.

Metric         Value
Accuracy (%)   98.75

Model Details

  • Architecture: ViT-B/16 with linear classifier head
  • Embedding dim: 768, Depth: 12, Heads: 12
  • Input: log-mel spectrogram of size 1024 × 128
  • Patch size: 16 × 16
  • Pooling: average over patch tokens
  • Classes: 50
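
To make the shapes above concrete, here is a minimal sketch of the token arithmetic. It assumes the patches are non-overlapping (stride equal to the patch size), which the card does not state explicitly. The helper names `patch_grid` and `mean_pool` are hypothetical and are not part of the released code.

```python
def patch_grid(height, width, patch=16):
    """Return (rows, cols) of non-overlapping patches covering the input."""
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    return height // patch, width // patch


def mean_pool(tokens):
    """Average a list of equal-length token vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


rows, cols = patch_grid(1024, 128)   # 1024 time frames, 128 mel bins
num_tokens = rows * cols             # patch tokens fed to the transformer
print(rows, cols, num_tokens)

# Toy example: two 2-dimensional "tokens" pooled into one embedding.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

Under that assumption, a 1024 × 128 spectrogram yields a 64 × 8 grid, or 512 patch tokens. Averaging them gives one 768-dimensional embedding per clip, which the linear head maps to 50 class logits.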

Model Usage

import sys, torch
from huggingface_hub import snapshot_download

local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-finetune-esc-split2")
sys.path.insert(0, local_dir)

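from load_model import load_classifier

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_classifier(device=device)
model.eval()  # inference mode (a no-op if the loader already sets it)

# Forward a log-mel spectrogram batch of shape [B, 1, 1024, 128]
fbank = torch.randn(2, 1, 1024, 128).to(device)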
with torch.no_grad():
    logits = model(fbank)              # [B, 50]
    pred = logits.argmax(dim=-1)       # predicted class id
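
Real clips rarely produce exactly 1024 frames, and raw logits are hard to read. The sketch below shows one way to fit a spectrogram to the expected input shape and to turn logits into top-k probabilities. The helpers `fit_frames` and `top_k` are hypothetical, zero-padding is an assumption (the authors' preprocessing may pad or normalize differently), and random tensors stand in for a real spectrogram and for the model output.

```python
import torch
import torch.nn.functional as F

TARGET_FRAMES = 1024  # time frames expected by the model, per the card


def fit_frames(fbank: torch.Tensor, target: int = TARGET_FRAMES) -> torch.Tensor:
    """Zero-pad or crop a [T, n_mels] log-mel spectrogram to [target, n_mels]."""
    frames = fbank.shape[0]
    if frames < target:
        # F.pad takes pairs starting from the last dim:
        # (mel_left, mel_right, frame_top, frame_bottom)
        fbank = F.pad(fbank, (0, 0, 0, target - frames))
    return fbank[:target]


def top_k(logits: torch.Tensor, k: int = 5):
    """Return (probabilities, class ids) of the k most likely classes per item."""
    probs = torch.softmax(logits, dim=-1)
    return probs.topk(k, dim=-1)


clip = torch.randn(600, 128)            # stand-in for a real [T, 128] spectrogram
batch = fit_frames(clip)[None, None]    # add batch and channel dims
print(batch.shape)                      # torch.Size([1, 1, 1024, 128])

logits = torch.randn(1, 50)             # stand-in for model(batch)
top_probs, top_ids = top_k(logits)
print(top_ids.shape)                    # torch.Size([1, 5])
```

The class ids index into the 50 ESC-50 categories. Mapping ids to label names depends on the label ordering used during fine-tuning, which this card does not list.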

The release contains:

  • model.safetensors: fine-tuned classifier weights
  • config.json: architecture hyperparameters
  • modeling.py: vendored model architecture (no need to install AudioMosaic)
  • load_model.py: convenience loader

Required dependencies: torch, timm, torchlibrosa, safetensors, huggingface_hub.
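
Before loading the model, it can help to check that these packages are importable. This is a minimal sketch using only the standard library. It assumes each package's import name matches the name listed above, and `missing_packages` is a hypothetical helper, not part of the release.

```python
from importlib.util import find_spec

REQUIRED = ("torch", "timm", "torchlibrosa", "safetensors", "huggingface_hub")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install them with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```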


Citation

@inproceedings{huang2026audiomosaic,
  title={AudioMosaic: Contrastive Masked Audio Representation Learning},
  author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
  booktitle={ICML},
  year={2026}
}