AudioMosaic: Contrastive Masked Audio Representation Learning
Code: https://github.com/HanxunH/AudioMosaic
Pretrained encoder: hanxunh/AudioMosaic-vit-b16-pretrained
End-to-end audio-language model that pairs the AudioMosaic ViT-B/16 audio encoder with a LoRA-fine-tuned Llama-7B for instruction-following on audio.
This repo bundles everything needed for inference in a single download:
import sys, torch
from huggingface_hub import snapshot_download
local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-ltu-stage4")
sys.path.insert(0, local_dir)
from load_model import load_ltu
model, tokenizer = load_ltu(device_map="auto")
# Audio fbank: log-mel spectrogram of shape [1, 1, 1024, 128]
# (built with torchaudio.compliance.kaldi.fbank; see main_ltu_inference.py in the GitHub repo)
# Random placeholder input; replace with a real fbank for meaningful output
fbank = torch.randn(1, 1, 1024, 128).half().cuda()
prompt = "Close-ended question: Write an audio caption describing the sound."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    out = model.generate(input_ids=input_ids, audio=fbank, max_new_tokens=64, temperature=0.1)
print(tokenizer.decode(out[0], skip_special_tokens=True))
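The snippet above feeds a random tensor. As a rough sketch of how a real input of shape [1, 1, 1024, 128] might be built with torchaudio.compliance.kaldi.fbank, the helper below loads a file, computes the log-mel features, and pads or crops to 1024 frames. The function names (`pad_or_crop`, `wav_to_fbank`), the 16 kHz sample rate, and the window and HTK settings are assumptions for illustration, and torchaudio is an extra install not in the dependency list. Normalisation is left out on purpose; main_ltu_inference.py in the GitHub repo remains the reference for the exact preprocessing.

```python
import numpy as np

TARGET_FRAMES = 1024  # time frames, taken from the input shape above
NUM_MEL_BINS = 128    # mel bins, taken from the input shape above


def pad_or_crop(fbank: np.ndarray, target_frames: int = TARGET_FRAMES) -> np.ndarray:
    """Zero-pad or truncate a [frames, mels] array along time (hypothetical helper)."""
    n_frames = fbank.shape[0]
    if n_frames < target_frames:
        pad = np.zeros((target_frames - n_frames, fbank.shape[1]), dtype=fbank.dtype)
        return np.concatenate([fbank, pad], axis=0)
    return fbank[:target_frames]


def wav_to_fbank(path: str, sample_rate: int = 16000):
    """Load an audio file and return a [1, 1, 1024, 128] float tensor (hypothetical helper)."""
    # Imported lazily so pad_or_crop stays usable without torch/torchaudio.
    import torch
    import torchaudio

    waveform, sr = torchaudio.load(path)            # [channels, samples]
    waveform = waveform.mean(dim=0, keepdim=True)   # downmix to mono
    if sr != sample_rate:                           # 16 kHz is an assumption
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        num_mel_bins=NUM_MEL_BINS,
        frame_shift=10.0,        # assumed: 10 ms hop, so 1024 frames is about 10 s
        window_type="hanning",   # assumed setting
        htk_compat=True,         # assumed setting
        use_energy=False,
        dither=0.0,
    )                                               # [frames, 128]
    fbank = torch.from_numpy(pad_or_crop(fbank.numpy()))
    # Dataset-level normalisation is intentionally omitted here.
    return fbank[None, None]                        # [1, 1, 1024, 128]
```

The returned tensor can stand in for the random `fbank` after the same `.half().cuda()` conversion.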
The release contains:
- pytorch_model-0000{1,2,3}-of-00002.bin: LTU-pretrained Llama-7B base
- extra_weights.bin: audio encoder + audio-to-LLM projector + Llama LoRA (372 MB)
- modeling.py: vendored AudioMosaic audio-encoder architecture
- modeling_audiomosaic_ltu.py: vendored LTU/Llama wrapper
- transformers_vendored/: vendored slim copy of HF Transformers (Llama only)
- tokenizer.model, config.json, etc.: standard HF Llama files
- load_model.py: one-line loader

Required dependencies: torch, timm, torchlibrosa, safetensors, huggingface_hub, numpy. The LTU codebase and HF Transformers are vendored, so no extra installs are required.
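Before downloading the weights, it can help to confirm that the listed dependencies are importable. The check below is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and it assumes each package is imported under the same name it is listed with.

```python
import importlib.util

# Package names exactly as listed in the dependency note above.
REQUIRED = ("torch", "timm", "torchlibrosa", "safetensors", "huggingface_hub", "numpy")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

This only checks that each top-level package can be found; it does not verify versions or GPU support.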
@inproceedings{huang2026audiomosaic,
title={AudioMosaic: Contrastive Masked Audio Representation Learning},
author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
booktitle={ICML},
year={2026}
}
LTU base model and training recipe:
@inproceedings{gong2024ltu,
title={Listen, think, and understand},
author={Gong, Yuan and Luo, Hongyin and Liu, Alexander and Karlinsky, Leonid and Glass, James R},
booktitle={ICLR},
year={2024}
}