mlx-community/diar_sortformer_4spk-v1-fp32

This model was converted to MLX format from nvidia/diar_sortformer_4spk-v1 using mlx-audio version 0.3.2.

Refer to the original model card for more details.

Use with mlx-audio

pip install -U mlx-audio

Python Example - Offline Inference:

from mlx_audio.vad import load

model = load("mlx-community/diar_sortformer_4spk-v1-fp32")
result = model.generate("meeting.wav", threshold=0.5, verbose=True)  # threshold: per-frame speaker-activity cutoff
print(result.text)

for seg in result.segments:
    print(f"Speaker {seg.speaker}: {seg.start:.2f}s - {seg.end:.2f}s")

Python Example - Streaming Inference:

from mlx_audio.vad import load

model = load("mlx-community/diar_sortformer_4spk-v1-fp32")

for result in model.generate_stream("meeting.wav", chunk_duration=5.0):
    for seg in result.segments:
        print(f"Speaker {seg.speaker}: {seg.start:.2f}s - {seg.end:.2f}s")

Python Example - Real-time Microphone Streaming:

from mlx_audio.vad import load

model = load("mlx-community/diar_sortformer_4spk-v1-fp32")
state = model.init_streaming_state()

for chunk in mic_stream():  # your audio source
    result, state = model.feed(chunk, state, sample_rate=16000)
    for seg in result.segments:
        print(f"Speaker {seg.speaker}: {seg.start:.2f}s - {seg.end:.2f}s")
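In the example above, `mic_stream()` is a placeholder for your own audio capture loop. For offline testing you can mimic it by slicing a pre-loaded waveform into fixed-size chunks. The helper below is a hypothetical sketch (the names `chunk_signal` and `chunk_ms` are illustrative, not part of the mlx-audio API):

```python
def chunk_signal(samples, sample_rate=16000, chunk_ms=100):
    """Yield successive fixed-length chunks from a list of samples,
    zero-padding the final chunk to the full length."""
    hop = int(sample_rate * chunk_ms / 1000)  # samples per chunk
    for start in range(0, len(samples), hop):
        chunk = samples[start:start + hop]
        if len(chunk) < hop:
            chunk = chunk + [0.0] * (hop - len(chunk))  # pad the tail
        yield chunk
```

Each yielded chunk can then be passed to `model.feed(chunk, state, sample_rate=16000)` in place of live microphone input.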

Model Details

  • Architecture: FastConformer (18 layers) + Transformer Encoder (18 layers) + Sortformer Modules
  • Mel bins: 80
  • Max speakers: 4
  • Input: 16 kHz mono audio
  • Output: Per-frame speaker activity probabilities

Ported from NVIDIA NeMo SortformerEncLabelModel.
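The model's raw output is a per-frame activity probability for each of the four speaker slots. To illustrate what the `threshold` parameter does, here is a minimal, framework-free sketch that binarizes one speaker's frame probabilities into (start, end) segments; the function name and the frame duration are assumptions for illustration, not mlx-audio API:

```python
def probs_to_segments(probs, threshold=0.5, frame_dur=0.08):
    """Convert per-frame activity probabilities for one speaker
    into a list of (start_s, end_s) segments."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i * frame_dur                    # segment opens
        elif not active and start is not None:
            segments.append((start, i * frame_dur))  # segment closes
            start = None
    if start is not None:                            # activity runs to the end
        segments.append((start, len(probs) * frame_dur))
    return segments
```

Raising `threshold` makes segmentation stricter (fewer, shorter segments); lowering it merges nearby activity into longer segments.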

Model size: 0.1B parameters (F32, Safetensors)