Higgs Audio v3 8B STT

A speech-to-text model combining a Whisper-Large-v3 encoder with a Qwen3-8B decoder (8.91B total parameters), fine-tuned with LoRA on diverse ASR benchmarks.

Usage

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    "bosonai/higgs-audio-v3-8b-stt",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("bosonai/higgs-audio-v3-8b-stt")

# Transcribe audio (16kHz mono numpy array)
from transformers.utils import cached_file
import importlib.util

path = cached_file(
    "bosonai/higgs-audio-v3-8b-stt",
    "transcribe.py",
    _raise_exceptions_for_connection_errors=False,
)
spec = importlib.util.spec_from_file_location("transcribe", path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

audio_np = np.random.randn(16000).astype(np.float32)  # replace with your audio
text = mod.transcribe(model, tokenizer, audio_np)
print(text)
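The snippet above feeds random noise as a stand-in. Real audio must be 16kHz mono float32; below is a minimal numpy-only sketch of the usual conversion (integer PCM to float, stereo to mono, naive linear-interpolation resampling). The helper name `to_model_input` is illustrative, not part of this repo; for production resampling prefer librosa or torchaudio.

```python
import numpy as np

def to_model_input(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Convert arbitrary PCM audio to 16kHz mono float32 in [-1, 1] (demo helper)."""
    audio = np.asarray(audio)
    # Integer PCM (e.g. int16 WAV) -> float in [-1, 1].
    if np.issubdtype(audio.dtype, np.integer):
        audio = audio.astype(np.float32) / np.iinfo(audio.dtype).max
    # Stereo/multi-channel -> mono by averaging channels.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float32)
    # Naive resampling via linear interpolation (fine for a demo;
    # use a proper polyphase resampler for real workloads).
    if sr != target_sr:
        n_out = int(round(len(audio) * target_sr / sr))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio).astype(np.float32)
    return audio

# Example: one second of 44.1 kHz stereo int16 becomes 16000 mono float32 samples.
stereo = (np.random.randn(44100, 2) * 1000).astype(np.int16)
mono16k = to_model_input(stereo, sr=44100)
```

The result can be passed directly as `audio_np` in the transcription example above.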

Requirements

torch
transformers>=4.51.0
whisper  # audio preprocessing (note: the WhisperProcessor class itself ships with transformers)

Architecture

  • Encoder: Whisper-Large-v3 (frozen)
  • Decoder: Qwen3-8B (LoRA fine-tuned, merged)
  • Total parameters: 8.91B
  • Audio input: 16kHz mono WAV
  • Supports: Thinking mode for improved accuracy
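The 8.91B total is roughly consistent with the published sizes of the two components. The figures below are approximate public numbers for Qwen3-8B and the Whisper-Large-v3 encoder, not taken from this model card:

```python
# Rough back-of-envelope check; component sizes are approximate public figures.
qwen3_8b = 8.2          # Qwen3-8B decoder, billions of parameters
whisper_encoder = 0.64  # Whisper-Large-v3 encoder (~half of the 1.55B full model)
total = qwen3_8b + whisper_encoder
print(f"components ~{total:.2f}B vs stated 8.91B")
```

The small remaining gap is plausibly the audio-to-text projector that bridges encoder and decoder.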

Performance (ESB Benchmark — Full Scale, All Samples)

Dataset              WER
AMI                  6.23%
Earnings22          11.33%
GigaSpeech           9.34%
LibriSpeech Clean    1.24%
LibriSpeech Other    2.34%
SPGISpeech           3.14%
TED-LIUM             3.14%
VoxPopuli            5.63%
Average              5.30%
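As a sanity check, the reported average is the unweighted mean of the eight per-dataset WERs above:

```python
# Per-dataset WERs (%) copied from the table above.
wer = {
    "AMI": 6.23, "Earnings22": 11.33, "GigaSpeech": 9.34,
    "LibriSpeech Clean": 1.24, "LibriSpeech Other": 2.34,
    "SPGISpeech": 3.14, "TED-LIUM": 3.14, "VoxPopuli": 5.63,
}
avg = sum(wer.values()) / len(wer)
print(f"{avg:.2f}%")  # 5.30%
```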

Training Data

  • AMI: 10K samples
  • SPGISpeech: 6K samples
  • Earnings22: 5K samples
  • VoxPopuli: 4K samples
  • LibriSpeech: 3K samples
  • TED-LIUM: 3K samples