# SpaAudioLM

*Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding*
## Model Summary
SpaAudioLM is a multimodal audio language model fine-tuned from Qwen2.5-Omni-7B for geospatially aware environmental sound classification. It jointly reasons over audio signals and geospatial Point-of-Interest (POI) metadata across 28 environmental sound categories.
Existing environmental sound classification (ESC) methods treat sounds as isolated signals, ignoring where they occur. SpaAudioLM bridges this gap by enabling spatially grounded sound understanding.
## Training Hyperparameters
| Phase | Base Model | Epochs | Learning Rate | Key Details |
|---|---|---|---|---|
| SFT | Qwen2.5-Omni-7B | 6 | 1e-5 | DeepSpeed Zero-2, batch size 4/GPU, full parameter fine-tuning |
| GRPO | SFT checkpoint | 3 | 1e-6 | Group size 8, KL coeff 0.05, rewards: F1 (1.0) + format (0.1) + POI (0.3) |
Hardware: 4× GPUs, 32GB+ VRAM each
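The GRPO reward in the table above is a weighted sum of three components. A minimal sketch of how such a combination works (the function name and boolean component checks here are illustrative, not taken from the repository):

```python
def composite_reward(f1: float, format_ok: bool, poi_consistent: bool) -> float:
    """Weighted GRPO reward: F1 (weight 1.0) + format (0.1) + POI (0.3).

    `f1` is a per-sample label F1 in [0, 1]; the boolean arguments are
    illustrative stand-ins for the format and POI-consistency checks.
    """
    return 1.0 * f1 + 0.1 * float(format_ok) + 0.3 * float(poi_consistent)

# A perfect, well-formatted, POI-consistent prediction scores the maximum 1.4.
print(composite_reward(1.0, True, True))
```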
## Results

Comparison on multi-label audio event classification, averaged over 5 runs (%):
| Model | F1-Micro | F1-Macro | F1-Weighted | Jaccard | Exact Match |
|---|---|---|---|---|---|
| Qwen2-Audio-7B | 4.73 | 2.86 | 5.27 | 1.96 | 0.00 |
| Qwen2.5-Omni-7B | 34.36 | 25.90 | 37.35 | 18.31 | 9.97 |
| Qwen3-Omni-30B | 29.66 | 20.26 | 28.80 | 14.81 | 14.02 |
| GPT-4o Audio | 30.09 | 26.47 | 34.07 | 17.18 | 9.43 |
| Gemini 2.5 Pro | 44.24 | 40.35 | 47.65 | 28.04 | 15.58 |
| SpaAudioLM (Ours) | 73.36 | 63.48 | 72.98 | 53.57 | 54.47 |
## Quick Start

### Download & Inference

```bash
# Download model weights
huggingface-cli download shiran-yu/SpaAudioLM --local-dir models/SpaAudioLM

# Clone the repo for inference scripts
git clone https://github.com/<your-username>/SpaAudioLM.git
cd SpaAudioLM

# Set up the environment with uv
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate

# Run inference
bash app/src/grpo/GeoOmniR1-grpo-strength-infer.sh
```
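Under the hood, each inference request pairs an audio clip with POI context in a single chat-style prompt. The sketch below shows one plausible way to assemble such a request; the message schema, POI field names, and `build_prompt` helper are assumptions for illustration, not the repository's actual format:

```python
import json


def build_prompt(audio_path: str, poi: dict, categories: list) -> list:
    """Pack an audio clip plus nearby-POI context into a chat-style message list."""
    poi_text = json.dumps(poi, ensure_ascii=False)
    instruction = (
        "Given the audio and nearby points of interest, reason step by step, "
        "then list the sound events from: " + ", ".join(categories) + "."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_path},
            {"type": "text", "text": "POI context: " + poi_text + "\n" + instruction},
        ],
    }]


messages = build_prompt(
    "data/test/0001.wav",
    {"name": "Central Park", "category": "park", "distance_m": 120},
    ["birdsong", "traffic", "speech"],
)
print(messages[0]["content"][0]["type"])
```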
### Dataset

```bash
git clone https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset data
```

The dataset contains 3,854 WAV files with POI metadata, split into train (2,697), validation (578), and test (579) samples.
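The split works out to roughly 70/15/15, which a quick sanity check confirms:

```python
# Split sizes from the dataset description above.
splits = {"train": 2697, "validation": 578, "test": 579}
total = sum(splits.values())
assert total == 3854  # matches the 3,854 WAV files stated above

for name, count in splits.items():
    print(f"{name}: {count} ({count / total:.1%})")
```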
### Training

```bash
# Phase 1: SFT
bash app/src/sft/GeoOmniR1Strength-sft.sh

# Phase 2: GRPO (requires the SFT checkpoint from Phase 1)
bash app/src/grpo/GeoOmniR1-grpo-strength.sh
```
### Evaluation

```bash
# Single run
uv run app/src/GeoOmniR1Strength_evaluate.py --output_dir <path_to_output.json> --save_results

# 5-run aggregation (mean ± std)
uv run app/src/evaluateAverageScore.py --base_dir <path_to_5runs_dir>
```
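For reference, the multi-label metrics reported in the results table can be computed as follows. This is a minimal pure-Python sketch of the standard definitions (micro-F1, sample-averaged Jaccard, exact match); the repository's evaluation scripts may compute them differently, e.g. via scikit-learn:

```python
def multilabel_metrics(preds: list, golds: list) -> dict:
    """Standard multi-label metrics over parallel lists of label sets."""
    tp = fp = fn = 0
    jaccard_sum = 0.0
    exact = 0
    for p, g in zip(preds, golds):
        tp += len(p & g)          # labels predicted and correct
        fp += len(p - g)          # labels predicted but wrong
        fn += len(g - p)          # labels missed
        union = p | g
        jaccard_sum += len(p & g) / len(union) if union else 1.0
        exact += int(p == g)      # all labels exactly right
    n = len(preds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_micro = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return {
        "f1_micro": f1_micro,
        "jaccard": jaccard_sum / n,
        "exact_match": exact / n,
    }


# Toy example: one prediction has a spurious label, the other is exact.
metrics = multilabel_metrics(
    preds=[{"traffic", "speech"}, {"birdsong"}],
    golds=[{"traffic"}, {"birdsong"}],
)
print(metrics)
```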
## Intended Use
This model is designed for multi-label environmental sound classification in geospatial contexts. It takes audio input along with POI metadata and produces chain-of-thought reasoning followed by sound event labels.
## Limitations
- Requires POI metadata for optimal performance; audio-only inference may degrade results.
- Trained on 28 environmental sound categories; may not generalize to other sound taxonomies.
- Requires significant GPU resources (4× 32GB+ VRAM) for training.
## Citation

```bibtex
@article{hou2025spaaudioLM,
  title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
  author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
  year={2025}
}
```