Higgs Audio v3 8B STT

A speech-to-text model combining a Whisper-Large-v3 encoder with a Qwen3-8B decoder (8.91B total parameters), fine-tuned with LoRA on diverse ASR benchmarks.

Usage

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    "bosonai/higgs-audio-v3-8b-stt",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("bosonai/higgs-audio-v3-8b-stt")

# Transcribe audio (16kHz mono numpy array)
from transformers.utils import cached_file
import importlib.util

path = cached_file(
    "bosonai/higgs-audio-v3-8b-stt",
    "transcribe.py",
    _raise_exceptions_for_connection_errors=False,
)
spec = importlib.util.spec_from_file_location("transcribe", path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

audio_np = np.random.randn(16000).astype(np.float32)  # replace with your audio
text = mod.transcribe(model, tokenizer, audio_np)
print(text)
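The snippet above feeds random noise as a stand-in. Real audio must be 16kHz mono float32; below is a minimal numpy-only sketch of the usual conversion (integer PCM to float, stereo to mono, naive linear-interpolation resampling). The helper name `to_model_input` is illustrative, not part of this repo; for production resampling prefer librosa or torchaudio.

```python
import numpy as np

def to_model_input(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Convert arbitrary PCM audio to 16kHz mono float32 in [-1, 1] (demo helper)."""
    audio = np.asarray(audio)
    # Integer PCM (e.g. int16 WAV) -> float in [-1, 1].
    if np.issubdtype(audio.dtype, np.integer):
        audio = audio.astype(np.float32) / np.iinfo(audio.dtype).max
    # Stereo/multi-channel -> mono by averaging channels.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float32)
    # Naive resampling via linear interpolation (fine for a demo;
    # use a proper polyphase resampler for real workloads).
    if sr != target_sr:
        n_out = int(round(len(audio) * target_sr / sr))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio).astype(np.float32)
    return audio

# Example: one second of 44.1 kHz stereo int16 becomes 16000 mono float32 samples.
stereo = (np.random.randn(44100, 2) * 1000).astype(np.int16)
mono16k = to_model_input(stereo, sr=44100)
```

The result can be passed directly as `audio_np` in the transcription example above.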

Requirements

torch
transformers>=4.51.0
whisper  # audio preprocessing (note: the WhisperProcessor class itself ships with transformers)

Architecture

  • Encoder: Whisper-Large-v3 (frozen)
  • Decoder: Qwen3-8B (LoRA fine-tuned, merged)
  • Total parameters: 8.91B
  • Audio input: 16kHz mono WAV
  • Supports: Thinking mode for improved accuracy
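The 8.91B total is roughly consistent with the published sizes of the two components. The figures below are approximate public numbers for Qwen3-8B and the Whisper-Large-v3 encoder, not taken from this model card:

```python
# Rough back-of-envelope check; component sizes are approximate public figures.
qwen3_8b = 8.2          # Qwen3-8B decoder, billions of parameters
whisper_encoder = 0.64  # Whisper-Large-v3 encoder (~half of the 1.55B full model)
total = qwen3_8b + whisper_encoder
print(f"components ~{total:.2f}B vs stated 8.91B")
```

The small remaining gap is plausibly the audio-to-text projector that bridges encoder and decoder.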

Performance (ESB Benchmark — Full Scale, All Samples)

Dataset              WER
AMI                  6.23%
Earnings22          11.33%
GigaSpeech           9.34%
LibriSpeech Clean    1.24%
LibriSpeech Other    2.34%
SPGISpeech           3.14%
TED-LIUM             3.14%
VoxPopuli            5.63%
Average              5.30%
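As a sanity check, the reported average is the unweighted mean of the eight per-dataset WERs above:

```python
# Per-dataset WERs (%) copied from the table above.
wer = {
    "AMI": 6.23, "Earnings22": 11.33, "GigaSpeech": 9.34,
    "LibriSpeech Clean": 1.24, "LibriSpeech Other": 2.34,
    "SPGISpeech": 3.14, "TED-LIUM": 3.14, "VoxPopuli": 5.63,
}
avg = sum(wer.values()) / len(wer)
print(f"{avg:.2f}%")  # 5.30%
```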

Training Data

  • AMI: 10K samples
  • SPGISpeech: 6K samples
  • Earnings22: 5K samples
  • VoxPopuli: 4K samples
  • LibriSpeech: 3K samples
  • TED-LIUM: 3K samples