AudioMosaic + LTU (Listen, Think, Understand) — Stage 4

End-to-end audio-language model that pairs the AudioMosaic ViT-B/16 audio encoder with a LoRA-fine-tuned Llama-7B for instruction-following on audio.

This repo bundles everything needed for inference in a single download:

LTU-pretrained Llama-7B base (~9.3 GB)
AudioMosaic ViT-B/16 audio encoder
Audio → LLM projector (768 → 4096)
Llama q_proj / v_proj LoRA adapters (4.2M params)
Vendored modeling code (no need to install AudioMosaic)

Model Details

Audio encoder: ViT-B/16, 768-dim, 12-layer (input: log-mel spectrogram 1024 × 128)
LLM: Llama-7B with LoRA (rank 8, q_proj + v_proj, alpha 16)
Training stages: stage1 (projector) → stage2 (closed-ended classification) → stage3 (all closed-ended tasks) → stage4 (open-ended + closed-ended mix)
Training data: OpenAQA-5.6M
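
The sizes quoted above can be cross-checked with a little arithmetic. The sketch below assumes the standard Llama-7B shape (32 decoder layers, hidden size 4096) and non-overlapping 16 × 16 patches for the ViT-B/16 encoder. Neither figure is stated in this card, and the function names are illustrative only.

```python
def lora_param_count(hidden: int, rank: int, n_layers: int, n_modules: int) -> int:
    """Trainable LoRA parameters for square (hidden x hidden) projections.

    Each adapted projection gets two low-rank factors:
    A with shape (rank, hidden) and B with shape (hidden, rank).
    """
    per_module = rank * hidden + hidden * rank
    return per_module * n_modules * n_layers


def patch_grid(frames: int, mel_bins: int, patch: int = 16) -> tuple:
    """Patch grid (time, frequency), assuming non-overlapping square patches."""
    return frames // patch, mel_bins // patch


# Assumed Llama-7B shape: 32 layers, hidden size 4096.
# Rank 8 on q_proj + v_proj (2 modules per layer) comes from the card.
lora = lora_param_count(hidden=4096, rank=8, n_layers=32, n_modules=2)
print(f"LoRA parameters: {lora:,} (~{lora / 1e6:.1f}M)")

# 1024 x 128 log-mel input from the card, 16 x 16 patches assumed.
t, f = patch_grid(1024, 128)
print(f"Patch grid: {t} x {f} = {t * f} patches, 768 dims each")
```

Under these assumptions the adapter count comes out at 4,194,304, which matches the 4.2M figure listed for the LoRA adapters.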

Model Usage

import sys, torch
from huggingface_hub import snapshot_download

local_dir = snapshot_download("hanxunh/AudioMosaic-vit-b16-ltu-stage4")
sys.path.insert(0, local_dir)

from load_model import load_ltu
model, tokenizer = load_ltu(device_map="auto")

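# Audio input: log-mel filterbank (fbank) features of shape [1, 1, 1024, 128],
# i.e. [batch, channel, time frames, mel bins].
# Real features are built with torchaudio.compliance.kaldi.fbank
# (see main_ltu_inference.py in the GitHub repo).
# The random tensor below is only a shape-correct placeholder.
fbank = torch.randn(1, 1, 1024, 128).half().cuda()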
prompt = "Close-ended question: Write an audio caption describing the sound."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    out = model.generate(input_ids=input_ids, audio=fbank, max_new_tokens=64, temperature=0.1)
print(tokenizer.decode(out[0], skip_special_tokens=True))
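
To run the model on real audio, the input has to be a log-mel fbank with the shape shown above. The following is a minimal sketch of one way to build it, not a copy of main_ltu_inference.py. The 16 kHz mono input, the 10 ms frame shift, the zero-padding, and the helper names `fit_frames` and `wav_to_fbank` are all assumptions, and any normalization the official script applies is not reproduced here.

```python
import numpy as np

TARGET_FRAMES, MEL_BINS = 1024, 128


def fit_frames(fbank: np.ndarray, target: int = TARGET_FRAMES) -> np.ndarray:
    """Zero-pad or crop a [frames, mel_bins] array along time to `target` frames."""
    frames = fbank.shape[0]
    if frames < target:
        pad = np.zeros((target - frames, fbank.shape[1]), dtype=fbank.dtype)
        return np.concatenate([fbank, pad], axis=0)
    return fbank[:target]


def wav_to_fbank(path: str):
    """Load an audio file and return a [1, 1, 1024, 128] float tensor (assumed pipeline)."""
    # Imported lazily so fit_frames can be used without torch/torchaudio installed.
    import torch
    import torchaudio

    waveform, sr = torchaudio.load(path)           # [channels, samples]
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != 16000:                                # assumed target sample rate
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=16000,
        num_mel_bins=MEL_BINS,
        frame_shift=10.0,  # assumed: 10 ms hop, so 1024 frames is about 10 s
        dither=0.0,
    )                                              # [frames, 128]
    feats = fit_frames(feats.numpy())
    return torch.from_numpy(feats)[None, None]     # [1, 1, 1024, 128]
```

The result can replace the placeholder tensor, for example `fbank = wav_to_fbank("clip.wav").half().cuda()`, where `clip.wav` is a hypothetical file name.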

The release contains:

pytorch_model-0000{1,2,3}-of-00002.bin — LTU-pretrained Llama-7B base
extra_weights.bin — audio encoder + audio→LLM projector + Llama LoRA (372 MB)
modeling.py — vendored AudioMosaic audio-encoder architecture
modeling_audiomosaic_ltu.py — vendored LTU/Llama wrapper
transformers_vendored/ — vendored slim copy of HF Transformers (Llama only)
tokenizer.model, config.json, etc. — standard HF Llama files
load_model.py — one-line loader
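
A quick way to confirm that a download is complete is to check the snapshot directory against the file list above. This is a sketch, not part of the release: `missing_release_files` is a hypothetical helper, and the sharded base weights are left out of the check because only the unsharded names are used here.

```python
from pathlib import Path

# Names taken from the release list above (base-weight shards not included).
EXPECTED_FILES = [
    "extra_weights.bin",
    "modeling.py",
    "modeling_audiomosaic_ltu.py",
    "load_model.py",
    "tokenizer.model",
    "config.json",
]
EXPECTED_DIRS = ["transformers_vendored"]


def missing_release_files(local_dir: str) -> list:
    """Return the expected files and directories that are absent from local_dir."""
    root = Path(local_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

Calling `missing_release_files(local_dir)` with the path returned by `snapshot_download` should give an empty list when everything listed is present.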

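Required dependencies: torch, timm, torchlibrosa, safetensors, huggingface_hub, numpy. The LTU codebase and HF Transformers are vendored, so neither needs to be installed separately. Computing real fbank features with torchaudio.compliance.kaldi.fbank additionally requires torchaudio.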
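
Before loading the model, it can help to check that the packages listed above are importable. The sketch below uses only the standard library. `missing_packages` is a hypothetical helper, and it assumes each package's import name matches the name given in the list.

```python
import importlib.util

# Package names copied from the dependency list above.
REQUIRED = ["torch", "timm", "torchlibrosa", "safetensors", "huggingface_hub", "numpy"]


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a package without importing it, so the check stays fast even for heavy libraries such as torch.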

Citation

@inproceedings{huang2026audiomosaic,
  title={AudioMosaic: Contrastive Masked Audio Representation Learning},
  author={Hanxun Huang and Qizhou Wang and Xingjun Ma and Cihang Xie and Christopher Leckie and Sarah Erfani},
  booktitle={ICML},
  year={2026}
}

LTU base model and training recipe:

@inproceedings{gong2024ltu,
  title={Listen, think, and understand},
  author={Gong, Yuan and Luo, Hongyin and Liu, Alexander and Karlinsky, Leonid and Glass, James R},
  booktitle={ICLR},
  year={2024}
}