AudioMosaic — AudioSet-20K Linear Probe

This release pairs the AudioMosaic ViT-B/16 encoder with a linear classifier head trained on AudioSet-20K. The encoder is frozen; only the probe parameters are trained.

Metric	Value
mAP (AS-20K eval)	29.40
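
For readers unfamiliar with the metric, here is a minimal sketch of mean average precision as it is conventionally computed for multi-label audio tagging: average precision is computed per class over all evaluation clips, then averaged across classes. This card does not describe the evaluation script, so the function names below are hypothetical, and the assumption that the reported value is this quantity multiplied by 100 is ours.

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision@k taken at the rank k of each positive."""
    # Rank clips from highest to lowest score (ties keep their input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive clips has no defined AP.
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(score_rows, label_rows):
    """Macro-average AP over classes; rows are clips, columns are classes."""
    num_classes = len(score_rows[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision(
            [row[c] for row in score_rows],
            [row[c] for row in label_rows],
        )
        if ap is not None:  # skip classes without positives
            aps.append(ap)
    return sum(aps) / len(aps)


# Toy example: 3 clips, 2 classes (the real label space has 527 classes).
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]]
toy_labels = [[1, 1], [0, 0], [0, 1]]
print(mean_average_precision(toy_scores, toy_labels))
```

In practice the same quantity can be obtained from a library routine; the pure-Python version is only meant to make the definition explicit.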

Model Details

Architecture: ViT-B/16 with linear classifier head
Embedding dim: 768, Depth: 12, Heads: 12
Input: log-mel spectrogram of size 1024 × 128 (time frames × mel bins)
Patch size: 16 × 16
Classes: 527 (AudioSet ontology)
Encoder weights are frozen during probe training.
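
The numbers above determine how many patch tokens the encoder sees per clip. The arithmetic is sketched below; `patch_grid` is a hypothetical helper, and whether the model prepends extra tokens (such as a class token) is not stated in this card, so the count covers patch tokens only.

```python
def patch_grid(height, width, patch):
    """Return (rows, cols) of non-overlapping square patches covering the input."""
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    return height // patch, width // patch


frames, mels, patch = 1024, 128, 16   # values from the model details above
embed_dim, heads = 768, 12

rows, cols = patch_grid(frames, mels, patch)
num_patches = rows * cols             # patch tokens per spectrogram
values_per_patch = patch * patch      # spectrogram bins flattened into one token
head_dim = embed_dim // heads         # width of each attention head

print(rows, cols, num_patches)        # 64 8 512
print(values_per_patch, head_dim)     # 256 64
```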

Model Usage

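import sys

import torch
from huggingface_hub import snapshot_download

# Download the release files and make the vendored modules importable.
local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-linear-prob-as20k")
sys.path.insert(0, local_dir)

from load_model import load_classifier  # shipped in the downloaded release

# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_classifier(device=device)
model.eval()  # inference mode (harmless if the loader already sets it)

# Forward a log-mel spectrogram batch of shape [B, 1, 1024, 128]
fbank = torch.randn(2, 1, 1024, 128, device=device)  # random placeholder input
with torch.no_grad():
    logits = model(fbank)        # [B, 527]
    probs = logits.sigmoid()     # multi-label probabilities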
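
Because the outputs are independent per-class probabilities rather than a softmax distribution, a clip can carry several labels at once. The sketch below shows two common ways to turn `probs` into predictions: taking the top-k classes, or keeping every class above a threshold. The helper names, `k`, and the 0.5 threshold are illustrative choices, not part of the release, and the toy tensor uses 4 classes instead of 527.

```python
import torch


def top_k_labels(probs, k=5):
    """Return (indices, scores) of the k most probable classes for each clip."""
    scores, indices = probs.topk(k, dim=-1)
    return indices, scores


def active_labels(probs, threshold=0.5):
    """Return, for each clip, the class indices whose probability exceeds threshold."""
    return [torch.nonzero(row > threshold).flatten().tolist() for row in probs]


# Toy logits standing in for the model output of shape [B, 527].
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0],
                       [-2.0, 3.0, -0.5, 1.0]])
probs = logits.sigmoid()

indices, scores = top_k_labels(probs, k=2)
print(indices.tolist())        # [[0, 2], [1, 3]]
print(active_labels(probs))    # [[0, 2], [1, 3]]
```

Mapping class indices to human-readable names requires the AudioSet label list, which is not among the files listed for this release.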

The release contains:

model.safetensors — probe weights
config.json — architecture hyperparameters
modeling.py — vendored model architecture (no need to install AudioMosaic)
load_model.py — convenience loader

Required dependencies: torch, timm, torchlibrosa, safetensors, huggingface_hub.
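
Before running the usage snippet, it may help to confirm that the packages listed above are importable. A minimal check using only the standard library follows; `missing_dependencies` is a hypothetical helper, and it assumes each package's import name matches the name given in the list.

```python
from importlib.util import find_spec

REQUIRED = ["torch", "timm", "torchlibrosa", "safetensors", "huggingface_hub"]


def missing_dependencies(modules=REQUIRED):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_dependencies()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```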

Citation

@inproceedings{huang2026audiomosaic,
  title={AudioMosaic: Contrastive Masked Audio Representation Learning},
  author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
  booktitle={ICML},
  year={2026}
}