
AudioMosaic: Contrastive Masked Audio Representation Learning

Code: https://github.com/HanxunH/AudioMosaic

Pretrained encoder: hanxunh/AudioMosaic-vit-b16-pretrained


AudioMosaic + LTU (Listen, Think, Understand): Stage 4

End-to-end audio-language model that pairs the AudioMosaic ViT-B/16 audio encoder with a LoRA-fine-tuned Llama-7B for instruction-following on audio.

This repo bundles everything needed for inference in a single download:

  • LTU-pretrained Llama-7B base (~9.3 GB)
  • AudioMosaic ViT-B/16 audio encoder
  • Audio → LLM projector (768 → 4096)
  • Llama q_proj / v_proj LoRA adapters (4.2M params)
  • Vendored modeling code (no need to install AudioMosaic)

Model Details

  • Audio encoder: ViT-B/16, 768-dim, 12-layer (input: log-mel spectrogram 1024 × 128)
  • LLM: Llama-7B with LoRA (rank 8, q_proj + v_proj, alpha 16)
  • Training stages: stage1 (projector) → stage2 (closed-ended cla.) → stage3 (closed-ended all) → stage4 (open + closed mix)
  • Training data: OpenAQA-5.6M
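
The sizes quoted above can be sanity-checked with a little arithmetic. The sketch below assumes the standard Llama-7B shape (32 decoder layers, hidden size 4096, square q_proj and v_proj matrices) and non-overlapping 16 × 16 patches. Neither assumption is stated on this card, so treat the result as a consistency check rather than a specification.

```python
# Back-of-the-envelope check of the figures listed in this card.
# Assumptions (not stated in the card): Llama-7B has 32 decoder layers with
# hidden size 4096, and the ViT splits its input into non-overlapping
# 16 x 16 patches.

LLAMA_LAYERS = 32
LLAMA_HIDDEN = 4096
LORA_RANK = 8
LORA_TARGETS = ("q_proj", "v_proj")


def lora_param_count(layers, hidden, rank, n_targets):
    """Parameters added by LoRA: each adapted matrix gets A (rank x hidden)
    and B (hidden x rank)."""
    per_module = rank * hidden + hidden * rank
    return layers * n_targets * per_module


def patch_count(frames, mel_bins, patch=16):
    """Number of patches produced from a frames x mel_bins spectrogram."""
    return (frames // patch) * (mel_bins // patch)


lora_params = lora_param_count(LLAMA_LAYERS, LLAMA_HIDDEN, LORA_RANK, len(LORA_TARGETS))
num_patches = patch_count(1024, 128)

print(f"LoRA parameters: {lora_params:,}")      # 4,194,304, i.e. about 4.2M
print(f"Patches per spectrogram: {num_patches}")  # 512
```

Under these assumptions the LoRA count matches the 4.2M figure listed for the adapters.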

Model Usage

import sys, torch
from huggingface_hub import snapshot_download

local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-ltu-stage4")
sys.path.insert(0, local_dir)

from load_model import load_ltu
model, tokenizer = load_ltu(device_map="auto")

# Audio fbank: log-mel spectrogram of shape [1, 1, 1024, 128]
# (built with torchaudio.compliance.kaldi.fbank; see main_ltu_inference.py in the GitHub repo).
# Random values stand in for a real spectrogram here; replace them with features from your audio.
fbank = torch.randn(1, 1, 1024, 128).half().cuda()
prompt = "Close-ended question: Write an audio caption describing the sound."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    out = model.generate(input_ids=input_ids, audio=fbank, max_new_tokens=64, temperature=0.1)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The release contains:

  • pytorch_model-0000{1,2,3}-of-00002.bin: LTU-pretrained Llama-7B base
  • extra_weights.bin: audio encoder + audio→LLM projector + Llama LoRA (372 MB)
  • modeling.py: vendored AudioMosaic audio-encoder architecture
  • modeling_audiomosaic_ltu.py: vendored LTU/Llama wrapper
  • transformers_vendored/: vendored slim copy of HF Transformers (Llama only)
  • tokenizer.model, config.json, etc.: standard HF Llama files
  • load_model.py: one-line loader

Required dependencies: torch, timm, torchlibrosa, safetensors, huggingface_hub, numpy. The LTU codebase and HF Transformers are vendored so no extra installs are required.
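
Before downloading the weights, it can help to confirm that the packages listed above are importable. The check below is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical and not part of this release.

```python
from importlib.util import find_spec

# Dependencies listed in this card (import names match the package names).
REQUIRED = ["torch", "timm", "torchlibrosa", "safetensors", "huggingface_hub", "numpy"]


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that each package can be located; it does not verify versions or GPU support.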


Citation

@inproceedings{huang2026audiomosaic,
  title={AudioMosaic: Contrastive Masked Audio Representation Learning},
  author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
  booktitle={ICML},
  year={2026}
}

LTU base model and training recipe:

@inproceedings{gong2024ltu,
  title={Listen, think, and understand},
  author={Gong, Yuan and Luo, Hongyin and Liu, Alexander and Karlinsky, Leonid and Glass, James R},
  booktitle={ICLR},
  year={2024}
}