Ultra Diar Streaming Sortformer (5-Speaker)

This model extends NVIDIA's Streaming Sortformer speaker diarization model, diar_streaming_sortformer_4spk-v2.1, from a maximum of 4 speakers to 5 speakers through fine-tuning and architectural modifications.

Model Details

Code & Training

Extension scripts, NeMo patches for split-head / split-LR training, synthetic data tooling, and documentation: Ultra-Sortformer (GitHub).

Training

This model was trained on 2× NVIDIA H100 GPUs. We use synthetic data with 2–5 speakers.

Usage

This model requires the NVIDIA NeMo toolkit to train, fine-tune, or perform diarization. Install NeMo after installing Cython and the latest PyTorch.

Install NeMo

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install "nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git@main"
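Before loading the model, it can help to confirm that the pieces installed above are visible from Python. The following is a minimal sketch using only the standard library. The helper name check_environment is hypothetical, and it only checks that ffmpeg is on PATH and that the torch and nemo packages can be located, not that their versions are compatible.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "nemo"), binaries=("ffmpeg",)):
    """Return a dict mapping each requirement to True/False (hypothetical helper)."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level package cannot be located.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        report[f"binary:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'found' if ok else 'MISSING'}")
```

Any line reported as MISSING points to the install step worth repeating.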

Quick Start: Run Diarization

from nemo.collections.asr.models import SortformerEncLabelModel

# Load model from Hugging Face
diar_model = SortformerEncLabelModel.from_pretrained("devsy0117/ultra_diar_streaming_sortformer_5spk_v1")
diar_model.eval()

# Streaming parameters (recommended for best performance)
diar_model.sortformer_modules.chunk_len = 340
diar_model.sortformer_modules.chunk_right_context = 40
diar_model.sortformer_modules.fifo_len = 40
diar_model.sortformer_modules.spkcache_update_period = 300

# Run diarization
predicted_segments = diar_model.diarize(audio=["/path/to/your/audio.wav"], batch_size=1)

for segment in predicted_segments[0]:
    print(segment)
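The loop above prints each segment as returned. To work with the results numerically, the segments can be parsed first. The sketch below ASSUMES each segment is a whitespace-separated string of the form "start end speaker_k" with times in seconds. This card does not state the format, so check it against your own output. The helper names and the sample strings are hypothetical.

```python
from collections import defaultdict


def parse_segment(segment):
    """Split an assumed 'start end speaker_k' string into (start, end, speaker)."""
    start, end, speaker = segment.split()
    return float(start), float(end), speaker


def talk_time_per_speaker(segments):
    """Sum segment durations (seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        start, end, speaker = parse_segment(segment)
        totals[speaker] += end - start
    return dict(totals)


# Made-up segments for illustration only, not real model output.
example = ["0.0 2.5 speaker_0", "2.0 4.0 speaker_1", "4.0 5.5 speaker_0"]
print(talk_time_per_speaker(example))
```

With real output, pass predicted_segments[0] in place of example.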

Loading the Model

from nemo.collections.asr.models import SortformerEncLabelModel

# Option 1: Load directly from Hugging Face
diar_model = SortformerEncLabelModel.from_pretrained("devsy0117/ultra_diar_streaming_sortformer_5spk_v1")

# Option 2: Load from a downloaded .nemo file
diar_model = SortformerEncLabelModel.restore_from(
    restore_path="/path/to/ultra_diar_streaming_sortformer_5spk_v1.nemo",
    map_location="cuda",
    strict=False,
)

diar_model.eval()
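The two options can be folded into one helper that picks the loader from its argument. This is a sketch, not part of NeMo: pick_loader and load_diar_model are hypothetical names, and the rule "a source ending in .nemo is a local checkpoint" is an assumption. It uses only the two calls shown above, with the same arguments. NeMo is imported inside the function so the selection logic can be exercised without NeMo installed.

```python
def pick_loader(source):
    """Return which loader to use (hypothetical rule: a .nemo suffix means a local file)."""
    return "restore_from" if str(source).endswith(".nemo") else "from_pretrained"


def load_diar_model(source, map_location="cuda"):
    """Load the diarization model from a Hugging Face ID or a local .nemo file."""
    # Imported lazily so this module can be imported without NeMo installed.
    from nemo.collections.asr.models import SortformerEncLabelModel

    if pick_loader(source) == "restore_from":
        model = SortformerEncLabelModel.restore_from(
            restore_path=str(source), map_location=map_location, strict=False
        )
    else:
        model = SortformerEncLabelModel.from_pretrained(str(source))
    model.eval()  # inference mode, as in the snippets above
    return model
```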

Input Format

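Pass either form as the audio argument of diarize(), for example diar_model.diarize(audio=audio_input, batch_size=1):

  • Single audio file: audio_input="/path/to/multispeaker_audio.wav"
  • Multiple files: audio_input=["/path/to/audio1.wav", "/path/to/audio2.wav"]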
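For the multiple-file case, the list can be built from a directory rather than typed by hand. The following is a minimal sketch using pathlib. collect_wav_files is a hypothetical helper, the directory path is a placeholder, and it assumes all inputs are .wav files as in the examples above.

```python
from pathlib import Path


def collect_wav_files(directory):
    """Return sorted string paths of the .wav files directly inside `directory`."""
    return sorted(str(p) for p in Path(directory).glob("*.wav") if p.is_file())


# Intended use (placeholder path; diar_model is loaded as shown earlier):
# audio_input = collect_wav_files("/path/to/audio_dir")
# predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```

Sorting keeps the order of the results predictable, so predicted_segments[i] can be matched back to the i-th path.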

Note: The base model (v2.1) was trained for up to 4 speakers. This model's lower Spk_Count_Acc on AliMeeting and AMI IHM comes from sessions with 4 or fewer speakers being predicted as having 5, a trade-off of extending support to 5 speakers. DER nonetheless improves significantly, driven by a lower MISS rate.
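For readers unfamiliar with the metrics in this note: DER is conventionally the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time, so a lower MISS term lowers DER directly. The sketch below shows that arithmetic, plus one plausible reading of Spk_Count_Acc as the fraction of sessions whose predicted speaker count matches the reference. That reading is an assumption, since this card does not define the metric. The function names are hypothetical, and the numbers are made up for illustration; they are not results for this model.

```python
def diarization_error_rate(miss, false_alarm, confusion, total_speech):
    """DER = (missed + false-alarm + confused speech time) / total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (miss + false_alarm + confusion) / total_speech


def speaker_count_accuracy(predicted_counts, reference_counts):
    """Fraction of sessions whose predicted speaker count equals the reference (assumed definition)."""
    if len(predicted_counts) != len(reference_counts):
        raise ValueError("count lists must have the same length")
    if not reference_counts:
        raise ValueError("at least one session is required")
    matches = sum(p == r for p, r in zip(predicted_counts, reference_counts))
    return matches / len(reference_counts)


# Illustrative values only (seconds of audio and per-session speaker counts).
print(diarization_error_rate(miss=30.0, false_alarm=10.0, confusion=20.0, total_speech=600.0))
print(speaker_count_accuracy([5, 4, 5, 3], [4, 4, 5, 3]))
```

In the second call, the first session is a 4-speaker session predicted as 5, which is the kind of error the note describes: it lowers the count accuracy without necessarily raising DER.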

License

This model is a derivative of NVIDIA Sortformer, licensed under the NVIDIA Open Model License.

Attribution: Licensed by NVIDIA Corporation under the NVIDIA Open Model License.
