WhisperKit CoreML: Distil Large v3 Italian

CoreML model for Italian speech-to-text on Apple Silicon, compatible with WhisperKit.

Why this model exists

The official distil-whisper/distil-large-v3 model available on argmaxinc/whisperkit-coreml is English-only. It is flagged as multilingual because it inherits the large-v3 tokenizer, but its decoder was distilled exclusively on English data. As a result, it ignores language tokens such as <|it|> and always outputs English text, whatever the DecodingOptions.language setting.
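For context, a Whisper decoder receives a short prefix of special tokens before it emits any text, and the language token is part of that prefix. The sketch below builds the prefix as plain strings. `decoderPrompt` and `WhisperTask` are hypothetical names used only for illustration and are not WhisperKit API. Real decoders work with token IDs from the tokenizer. The point is that <|it|> does reach the decoder. An English-only distilled decoder simply does not act on it.

```swift
// Illustrative only: Whisper's special-token prefix, written as strings.
enum WhisperTask: String {
    case transcribe
    case translate
}

/// Hypothetical helper that builds the decoder prefix for a given language and task.
func decoderPrompt(language: String?, task: WhisperTask, timestamps: Bool) -> [String] {
    var tokens = ["<|startoftranscript|>"]
    // Language token, e.g. "<|it|>". A decoder distilled only on English data
    // still receives this token, but it keeps producing English regardless.
    // When no language is forced, this sketch omits the token (real pipelines
    // typically run language detection first).
    if let language = language {
        tokens.append("<|\(language)|>")
    }
    tokens.append("<|\(task.rawValue)|>")
    if !timestamps {
        tokens.append("<|notimestamps|>")
    }
    return tokens
}

let prompt = decoderPrompt(language: "it", task: .transcribe, timestamps: false)
print(prompt.joined())
// prints <|startoftranscript|><|it|><|transcribe|><|notimestamps|>
```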

This repo provides a CoreML conversion of bofenghuang/whisper-large-v3-distil-it-v0.2. That model was distilled on more than 6,500 hours of Italian audio, so it is a distilled Whisper model that genuinely supports Italian.

Model composition

Component                Source                                        Notes
AudioEncoder             openai/whisper-large-v3                       Identical to large-v3 (frozen during distillation)
MelSpectrogram           openai/whisper-large-v3                       Standard mel-spectrogram preprocessing
TextDecoder              bofenghuang/whisper-large-v3-distil-it-v0.2   2 decoder layers, trained on Italian data
config.json              distil-whisper/distil-large-v3                Architecture config (2 decoder layers, 32 encoder layers)
generation_config.json   distil-whisper/distil-large-v3                Modified: language set to null (was "<|en|>")
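You can confirm that an assembled model folder contains every component in the table by comparing its contents with the expected names. The sketch below is minimal and assumes WhisperKit's layout, in which compiled `.mlmodelc` bundles sit next to the two JSON files, as in argmaxinc/whisperkit-coreml. `ModelFolderCheck` is a hypothetical helper, and it does not check tokenizer files.

```swift
import Foundation

/// Hypothetical helper: reports which expected components are missing.
enum ModelFolderCheck {
    // Assumed file names, following the component table above.
    static let expectedComponents = [
        "AudioEncoder.mlmodelc",
        "MelSpectrogram.mlmodelc",
        "TextDecoder.mlmodelc",
        "config.json",
        "generation_config.json",
    ]

    /// Returns the expected names absent from a directory listing, in table order.
    static func missing(from contents: [String]) -> [String] {
        let present = Set(contents)
        return expectedComponents.filter { !present.contains($0) }
    }

    /// Convenience wrapper that lists the folder on disk first.
    static func missing(inFolder folder: URL) throws -> [String] {
        let names = try FileManager.default.contentsOfDirectory(atPath: folder.path)
        return missing(from: names)
    }
}
```

An empty result from `ModelFolderCheck.missing(inFolder:)` means every component listed in the table is present.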

How it was built

  1. TextDecoder was converted from PyTorch to CoreML using whisperkittools:

    whisperkit-generate-model \
      --model-version bofenghuang/whisper-large-v3-distil-it-v0.2 \
      --output-dir ./output
    

    The AudioEncoder conversion failed due to a coremltools compatibility issue, but since the encoder is identical to large-v3 (frozen during distillation), we reused the encoder from argmaxinc/whisperkit-coreml.

  2. AudioEncoder + MelSpectrogram were copied from the official openai_whisper-large-v3 CoreML model on argmaxinc/whisperkit-coreml.

  3. generation_config.json was patched to set "language": null instead of "language": "<|en|>" to avoid English bias.
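Step 3 can be scripted instead of edited by hand. The following is a minimal sketch that uses Foundation's JSONSerialization. `patchLanguageToNull` and `patchGenerationConfig` are hypothetical helpers, and you supply the file path.

```swift
import Foundation

enum ConfigPatchError: Error {
    case notAJSONObject
}

/// Hypothetical helper: returns generation_config.json data with "language" set to null.
func patchLanguageToNull(_ data: Data) throws -> Data {
    guard var json = try JSONSerialization.jsonObject(with: data) as? [String: Any] else {
        throw ConfigPatchError.notAJSONObject
    }
    json["language"] = NSNull()  // serialized as JSON `null`
    return try JSONSerialization.data(withJSONObject: json, options: [.prettyPrinted, .sortedKeys])
}

/// Hypothetical helper: patches the file in place (atomic write).
func patchGenerationConfig(at url: URL) throws {
    let patched = try patchLanguageToNull(Data(contentsOf: url))
    try patched.write(to: url, options: .atomic)
}
```

Re-serializing can change formatting and key order (keys are sorted here). All other fields, nested ones included, are carried over unchanged.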

Usage with WhisperKit

```swift
import Foundation
import WhisperKit

// Download the model
let modelURL = try await WhisperKit.download(
    variant: "bofenghuang_whisper-large-v3-distil-it",
    from: "jmadseeker/whisperkit-coreml-distil-large-v3-it"
)

// Initialize WhisperKit
let config = WhisperKitConfig(modelFolder: modelURL.path, verbose: false, logLevel: .error, load: true)
let whisperKit = try await WhisperKit(config)

// Transcribe in Italian (placeholder path: point this at your own audio file)
let audioURL = URL(fileURLWithPath: "/path/to/audio.wav")
let options = DecodingOptions(
    task: .transcribe,
    language: "it",
    temperature: 0.0,
    temperatureIncrementOnFallback: 0.2,
    temperatureFallbackCount: 2
)
let results = try await whisperKit.transcribe(audioPath: audioURL.path, decodeOptions: options)
print(results.first?.text ?? "")
```
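The three temperature settings in the usage example work together. This assumes WhisperKit follows the usual Whisper fallback scheme, in which a window whose output fails the quality thresholds is decoded again at `temperature + n × temperatureIncrementOnFallback`, up to `temperatureFallbackCount` extra times. The sketch below only computes that schedule. `fallbackTemperatures` is a hypothetical function and is not part of WhisperKit.

```swift
/// Illustrative sketch (assumption): temperatures tried for one audio window,
/// starting at `start` and adding `increment` for each fallback retry.
func fallbackTemperatures(start: Float, increment: Float, fallbackCount: Int) -> [Float] {
    guard fallbackCount > 0 else { return [start] }
    return (0...fallbackCount).map { start + Float($0) * increment }
}

print(fallbackTemperatures(start: 0.0, increment: 0.2, fallbackCount: 2))
// [0.0, 0.2, 0.4]
```

Under this assumption, the options above give each window at most three decoding attempts, at temperatures 0.0, 0.2 and 0.4.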

Performance

  • ~6x faster than large-v3 (2 decoder layers vs 32)
  • ~1.5 GB model size
  • 98% ANE dispatch on Apple Silicon (share of operations run on the Apple Neural Engine)
  • Quality comparable to large-v3 for Italian transcription

Used by

  • Pulsecribe: macOS voice dictation app with multi-provider transcription

Credits
