MOSS-Audio

MOSS-Audio is an open-source audio understanding model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It performs unified modeling over complex real-world audio, supporting speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning. In this release, we provide four models: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.

News

  • 2026.4.13: 🎉🎉🎉 We have released MOSS-Audio. Blog and paper coming soon!

Introduction

Understanding audio requires more than simply transcribing words — it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. MOSS-Audio is built to unify these capabilities within a single model.

  • Speech & Content Understanding: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
  • Speaker, Emotion & Event Analysis: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
  • Scene & Sound Cue Extraction: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
  • Music Understanding: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
  • Audio Question Answering & Summarization: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
  • Time-Aware QA: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
  • Complex Reasoning: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.

Model Architecture

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.

Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.
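
A minimal sketch of this flow is shown below. The dimensions and stand-in modules are illustrative assumptions, not the repository's actual classes:

```python
import torch
import torch.nn as nn

# Minimal sketch of the MOSS-Audio forward path. AUDIO_DIM, LLM_DIM, and
# the stand-in modules are illustrative assumptions.
AUDIO_DIM, LLM_DIM = 1024, 2560

encoder = nn.Identity()                  # stand-in for MOSS-Audio-Encoder
adapter = nn.Linear(AUDIO_DIM, LLM_DIM)  # modality adapter (projection)

def encode_audio(audio_features: torch.Tensor) -> torch.Tensor:
    """audio_features: (B, T, AUDIO_DIM) continuous features at 12.5 Hz,
    i.e. one vector per 80 ms of audio."""
    feats = encoder(audio_features)
    return adapter(feats)                # (B, T, LLM_DIM)

# The projected audio embeddings are concatenated with the text prompt's
# token embeddings and consumed by the LLM backbone (Qwen3-4B/8B) for
# auto-regressive text generation.
audio_embeds = encode_audio(torch.randn(1, 125, AUDIO_DIM))  # 10 s of audio
text_embeds = torch.randn(1, 16, LLM_DIM)                    # prompt embeddings
llm_inputs = torch.cat([audio_embeds, text_embeds], dim=1)   # (1, 141, LLM_DIM)
print(llm_inputs.shape)
```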

DeepStack Cross-Layer Feature Injection

Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers. This preserves multi-granularity information, from low-level acoustic details to high-level semantic abstractions.

This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure — information that a single high-level representation cannot fully capture.
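
The sketch below illustrates the mechanism only; the tapped layer indices, dimensions, and residual-add placement are assumptions for illustration, not the repository's actual configuration:

```python
import torch
import torch.nn as nn

class DeepStackInjector(nn.Module):
    """Sketch of DeepStack-style cross-layer feature injection.

    Each tapped encoder layer (indices below are illustrative) gets an
    independent projection; the projected features are added residually
    to the hidden states of the LLM's early layers, so low-level prosody
    and transients survive alongside top-layer semantics.
    """

    def __init__(self, tap_layers=(4, 12, 20), audio_dim=1024, llm_dim=2560):
        super().__init__()
        self.tap_layers = tap_layers
        # One independent projection per tapped encoder layer.
        self.projs = nn.ModuleList(
            nn.Linear(audio_dim, llm_dim) for _ in tap_layers
        )

    def forward(self, encoder_layer_outputs, early_llm_hiddens):
        # encoder_layer_outputs: list of (B, T, audio_dim), one per encoder layer
        # early_llm_hiddens: hidden states of the first few LLM layers,
        # restricted to the positions occupied by audio frames
        out = []
        for proj, tap, hidden in zip(self.projs, self.tap_layers, early_llm_hiddens):
            out.append(hidden + proj(encoder_layer_outputs[tap]))  # residual add
        return out

# Toy demo: 24 encoder layers, 100 audio frames, 3 injection points.
enc_outs = [torch.randn(1, 100, 1024) for _ in range(24)]
llm_hid = [torch.randn(1, 100, 2560) for _ in range(3)]
print([h.shape for h in DeepStackInjector()(enc_outs, llm_hid)])
```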

Time-Aware Representation

Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
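
As a toy illustration of the idea (a minimal sketch; the 2-second interval and the `<|time_Ns|>` token format are assumptions, not the model's actual vocabulary):

```python
# Toy illustration of time-marker insertion. The 12.5 Hz frame rate comes
# from the encoder description above; the 2-second interval and the
# "<|time_Ns|>" token format are illustrative assumptions.
FRAME_RATE_HZ = 12.5
MARKER_EVERY_S = 2.0
FRAMES_PER_MARKER = int(FRAME_RATE_HZ * MARKER_EVERY_S)  # 25 frames

def insert_time_markers(frames):
    """Interleave explicit time tokens with audio frame representations.

    frames: sequence of per-frame features (one per 80 ms at 12.5 Hz).
    Returns a mixed sequence like [<|time_0s|>, f0, ..., f24, <|time_2s|>, ...].
    """
    out = []
    for i, frame in enumerate(frames):
        if i % FRAMES_PER_MARKER == 0:
            out.append(f"<|time_{i / FRAME_RATE_HZ:.0f}s|>")  # hypothetical token
        out.append(frame)
    return out

# 5 seconds of audio -> 62 frames, with markers at 0 s, 2 s, and 4 s.
mixed = insert_time_markers([f"f{i}" for i in range(62)])
print(mixed[:4])  # ['<|time_0s|>', 'f0', 'f1', 'f2']
```

Because the markers live in the same token stream as the audio frames, timestamp prediction reduces to ordinary next-token generation.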

Released Models

| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face |
|---|---|---|---|---|
| MOSS-Audio-4B-Instruct | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face |
| MOSS-Audio-4B-Thinking | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face |
| MOSS-Audio-8B-Instruct | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face |
| MOSS-Audio-8B-Thinking | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face |

More model families, sizes, and variants will be released in the future. Stay tuned!

Evaluation

We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:

  • General Audio Understanding: MOSS-Audio-8B-Thinking achieves an average accuracy of 70.80, outperforming all open-source models in our comparison, including the larger 30B+ systems.
  • Speech Captioning: MOSS-Audio-Instruct variants lead across 11 out of 13 fine-grained speech description dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score (3.7252).
  • ASR: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the lowest overall CER (11.30), with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
  • Timestamp ASR: MOSS-Audio-8B-Instruct achieves 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech (lower is better), far ahead of Qwen3-Omni-30B-A3B-Instruct (833.66) and Gemini-3.1-Pro (708.24) on AISHELL-1.

General Audio Understanding (Accuracy↑)

| Model | Model Size | MMAU | MMAU-Pro | MMAR | MMSU | Avg |
|---|---|---|---|---|---|---|
| **Open Source (small)** | | | | | | |
| Kimi-Audio | 7B | 72.41 | 56.58 | 60.82 | 54.74 | 61.14 |
| Qwen2.5-Omni | 7B | 65.60 | 52.20 | 56.70 | 61.32 | 58.96 |
| Audio Flamingo 3 | 7B | 61.23 | 51.70 | 57.96 | 60.04 | 57.73 |
| MiMo-Audio-7B | 7B | 74.90 | 53.35 | 61.70 | 61.94 | 62.97 |
| MiniCPM-o-4.5 | 9B | 70.97 | 39.65 | 55.75 | 60.96 | 56.83 |
| MOSS-Audio-4B-Instruct | 4B | 75.79 | 58.16 | 59.68 | 59.68 | 64.04 |
| MOSS-Audio-4B-Thinking | 4B | 77.64 | 60.75 | 63.91 | 71.20 | 68.37 |
| MOSS-Audio-8B-Instruct | 8B | 77.03 | 57.48 | 64.42 | 66.36 | 66.32 |
| MOSS-Audio-8B-Thinking | 8B | 77.13 | 64.29 | 65.73 | 76.06 | 70.80 |
| **Open Source (large)** | | | | | | |
| Qwen3-Omni-30B-A3B-Instruct | 30B | 75.00 | 61.22 | 66.40 | 69.00 | 67.91 |
| Step-Audio-R1.1 | 33B | 72.18 | 60.80 | 68.75 | 64.18 | 66.48 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| **Closed Source** | | | | | | |
| GPT4o-Audio | - | 65.66 | 52.30 | 59.78 | 58.76 | 59.13 |
| Gemini-3-Pro | - | 80.15 | 68.28 | 81.73 | 81.28 | 77.86 |
| Gemini-3.1-Pro | - | 81.10 | 73.47 | 83.70 | 81.30 | 79.89 |

Speech Captioning (LLM-as-a-Judge Score↑)

| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | 4.026 | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | 4.697 | 3.980 | 4.497 | 3.628 | 3.722 | 3.564 | 3.407 | 3.841 | 3.744 | 3.311 | 3.282 | 3.305 | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | 4.572 | 3.682 | 3.709 | 3.638 | 3.403 | 3.869 | 3.747 | 3.314 | 3.253 | 3.272 | 3.307 | 3.7252 |

ASR (CER↓)

| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Environment (Clean) | Acoustic Environment (Noisy) | Acoustic Characteristics: Whisper | Acoustic Characteristics: Far-Field / Near-Field | Multi-Speaker | Age | Semantic Content |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | 14.95 | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | 2.20 | 2.15 | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | 1.90 | 17.08 | 18.15 | 11.46 | 5.74 |
| MOSS-Audio-4B-Instruct | 11.58 | 21.11 | 11.84 | 10.79 | 4.01 | 10.11 | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| MOSS-Audio-8B-Instruct | 11.30 | 19.18 | 8.76 | 9.81 | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |
Detailed ASR Results

The per-dimension scores above are averages over the following datasets (evaluation splits in parentheses):

  • Acoustic Environment (Clean): AISHELL-1 (test), AISHELL-2 (Android / IOS / Mic), THCHS-30 (test)
  • Acoustic Environment (Noisy): MAGICDATA-READ (test)
  • Acoustic Characteristics (Whisper): AISHELL6-Whisper (normal / whisper)
  • Acoustic Characteristics (Far-Field / Near-Field): AliMeeting (Test_Ali_far / Test_Ali_near)
  • Multi-Speaker: AISHELL-4 (test)
  • Age: SeniorTalk (sentence), ChildMandarin (test)
  • Health Condition: AISHELL-6A (mild / moderate / severe / StutteringSpeech), AISHELL_6B (LRDWWS / Uncontrol)
  • Semantic Content: WenetSpeech (test-meeting), Fleurs (cmn_hans_cn)
  • Code-Switching: CS-Dialogue (test), TALCS (test), ASCEND (test)
  • Dialect: KeSpeech (test), WSYue-ASR-eval (short)
  • Singing: MIR-1K (test), openc-pop (test)
  • Non-Speech Vocalizations: MNV_17

| Model | AISHELL-1 (test) | AISHELL-2 (Android / IOS / Mic) | THCHS-30 (test) | MAGICDATA-READ (test) | AISHELL6-Whisper (normal / whisper) | AliMeeting (Test_Ali_far / Test_Ali_near) | AISHELL-4 (test) | SeniorTalk (sentence) | ChildMandarin (test) | AISHELL-6A (mild / moderate / severe / StutteringSpeech) | AISHELL_6B (LRDWWS / Uncontrol) | WenetSpeech (test-meeting) | Fleurs (cmn_hans_cn) | CS-Dialogue (test) | TALCS (test) | ASCEND (test) | KeSpeech (test) | WSYue-ASR-eval (short) | MIR-1K (test) | openc-pop (test) | MNV_17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paraformer-Large | 1.98 | 3.28 / 3.21 / 3.00 | 4.07 | 4.67 | 1.11 / 8.92 | 25.64 / 9.27 | 20.33 | 17.31 | 12.60 | 6.98 / 9.30 / 13.34 / 10.74 | 47.59 / 45.08 | 7.88 | 6.40 | 10.64 | 10.77 | 16.55 | 11.48 | 75.42 | 57.70 | 6.98 | 4.95 |
| GLM-ASR-Nano | 2.89 | 3.75 / 3.73 / 3.78 | 4.23 | 5.02 | 0.83 / 9.06 | 40.27 / 14.76 | 28.02 | 20.33 | 14.06 | 8.74 / 12.11 / 14.38 / 12.29 | 50.34 / 49.09 | 9.70 | 4.94 | 11.06 | 11.07 | 13.50 | 9.72 | 35.07 | 95.87 | 8.03 | 4.65 |
| Fun-ASR-Nano | 2.16 | 3.04 / 2.99 / 3.07 | 3.65 | 3.46 | 0.81 / 6.76 | 27.21 / 9.55 | 19.82 | 16.96 | 12.94 | 6.60 / 8.81 / 12.98 / 10.30 | 47.42 / 45.84 | 7.39 | 4.76 | 10.47 | 8.09 | 15.13 | 7.43 | 8.17 | 35.85 | 2.84 | 4.76 |
| SenseVoice-Small | 3.23 | 4.16 / 4.02 / 3.96 | 5.26 | 4.93 | 1.25 / 9.88 | 37.01 / 16.31 | 24.06 | 21.07 | 14.18 | 7.62 / 9.85 / 14.39 / 11.47 | 52.92 / 47.97 | 8.35 | 6.75 | 12.81 | 10.52 | 18.38 | 10.45 | 7.34 | 39.51 | 8.07 | 4.92 |
| Kimi-Audio-7B-Instruct | 0.79 | 2.91 / 3.03 / 2.88 | 1.39 | 2.15 | 0.69 / 4.63 | 28.22 / 13.82 | 20.61 | 19.70 | 13.79 | 7.00 / 9.34 / 12.56 / 10.75 | 44.44 / 42.57 | 7.15 | 5.10 | 14.56 | 12.74 | 21.83 | 5.51 | 53.17 | 38.35 | 5.17 | 4.68 |
| Qwen2.5-Omni-3B | 1.51 | 3.10 / 2.94 / 2.93 | 3.32 | 3.56 | 0.82 / 7.82 | 32.14 / 12.16 | 22.91 | 17.38 | 12.96 | 6.87 / 10.55 / 14.57 / 11.33 | 54.54 / 50.03 | 9.04 | 5.45 | 10.78 | 10.94 | 13.25 | 7.67 | 60.06 | 45.00 | 3.47 | 5.54 |
| Qwen2.5-Omni-7B | 1.16 | 2.88 / 2.77 / 2.73 | 3.06 | 3.16 | 0.71 / 6.57 | 32.03 / 18.73 | 21.01 | 19.96 | 12.29 | 7.27 / 10.94 / 12.92 / 10.53 | 51.99 / 49.45 | 8.43 | 5.13 | 14.02 | 10.46 | 14.42 | 6.40 | 57.43 | 42.62 | 2.75 | 4.56 |
| Qwen3-Omni-30B-A3B-Instruct | 0.95 | 2.70 / 2.72 / 2.57 | 2.21 | 2.47 | 0.59 / 3.22 | 25.72 / 8.44 | 18.15 | 14.13 | 8.79 | 6.20 / 8.88 / 11.59 / 10.25 | 45.80 / 41.65 | 6.64 | 4.84 | 12.94 | 8.33 | 12.64 | 5.87 | 25.39 | 30.81 | 1.21 | 4.73 |
| MOSS-Audio-4B-Instruct | 2.26 | 3.22 / 3.20 / 3.33 | 3.53 | 3.72 | 0.73 / 5.86 | 27.27 / 9.68 | 20.33 | 16.93 | 13.25 | 6.36 / 9.77 / 12.68 / 10.28 | 43.35 / 44.25 | 8.17 | 8.13 | 9.14 | 8.37 | 12.83 | 14.65 | 9.04 | 18.47 | 3.10 | 4.01 |
| MOSS-Audio-8B-Instruct | 1.82 | 2.97 / 2.95 / 2.91 | 2.82 | 3.20 | 0.69 / 4.80 | 36.82 / 11.25 | 24.36 | 17.42 | 13.10 | 5.84 / 8.94 / 11.52 / 9.72 | 39.76 / 39.27 | 7.86 | 7.52 | 9.07 | 8.22 | 13.26 | 9.18 | 8.33 | 17.24 | 2.39 | 4.31 |

Timestamp ASR (AAS↓)

| Model | AISHELL-1 (zh) | LibriSpeech (en) |
|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |

Quickstart

Environment Setup

We recommend Python 3.12 with a clean Conda environment. The commands below are enough for local inference.

Recommended setup

```bash
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```

Optional: FlashAttention 2

If your GPU supports FlashAttention 2, you can replace the last install command with:

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```

Basic Usage

Download the model first:

```bash
huggingface-cli download OpenMOSS-Team/MOSS-Audio --local-dir ./weights/MOSS-Audio
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Instruct --local-dir ./weights/MOSS-Audio-Instruct
```

Then edit `MODEL_PATH` / `AUDIO_PATH` in `infer.py` as needed, and run:

```bash
python infer.py
```

The default prompt in `infer.py` is "Describe this audio." You can edit that line directly to try transcription, audio QA, or speech captioning.
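
If you would rather call the model from your own script, a hypothetical transformers-style sketch follows. It assumes the checkpoint ships a trust-remote-code processor that accepts audio; the actual call signatures may differ, so treat `infer.py` as the authoritative reference:

```python
# Hypothetical transformers-style usage. The exact processor/model
# classes and prompt format for MOSS-Audio may differ; infer.py in
# this repository is the authoritative reference.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_PATH = "./weights/MOSS-Audio-Instruct"  # downloaded above
AUDIO_PATH = "example.wav"                    # your audio file

processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True,
)

# Assumed interface: the remote-code processor packs audio + text into
# model inputs. Check infer.py for the actual call signature.
inputs = processor(text="Describe this audio.", audio=AUDIO_PATH,
                   return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```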

Gradio App

Start the Gradio demo with:

```bash
python app.py
```

SGLang Serving

If you want to serve MOSS-Audio with SGLang, see the full guide in moss_audio_usage_guide.md.

The shortest setup is:

```bash
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
```

If you use the default `torch==2.9.1+cu128` runtime, installing `nvidia-cudnn-cu12==9.16.0.29` is recommended before starting `sglang serve`.
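
Once the server is running, you can query SGLang's OpenAI-compatible endpoint. A minimal sketch, assuming the default port 30000 and a text-only message (how to attach audio to the request is covered in moss_audio_usage_guide.md):

```python
# Hypothetical query against SGLang's OpenAI-compatible endpoint
# (default port 30000). The "model" field may need to match the value
# you passed to --model-path; see moss_audio_usage_guide.md.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "MOSS-Audio",
        "messages": [{"role": "user", "content": "Describe this audio."}],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```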

More Information

LICENSE

Models in MOSS-Audio are licensed under the Apache License 2.0.

Citation

```bibtex
@misc{mossaudio2026,
      title={MOSS-Audio Technical Report},
      author={OpenMOSS Team},
      year={2026},
      howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
      note={GitHub repository}
}
```