WhisperKit CoreML — Distil Large v3 Italian

CoreML model for Italian speech-to-text on Apple Silicon, compatible with WhisperKit.

Why this model exists

The official distil-whisper/distil-large-v3 build on argmaxinc/whisperkit-coreml is effectively English-only. It is flagged as multilingual because it inherits the large-v3 tokenizer, but its decoder was distilled exclusively on English data. As a result, it ignores language tokens such as <|it|> and always outputs English text, whatever DecodingOptions.language is set to.

This repo provides a CoreML conversion of bofenghuang/whisper-large-v3-distil-it-v0.2, a model distilled on 6,500+ hours of Italian audio, making it a true Italian-capable distilled Whisper model.

Model composition

| Component | Source | Notes |
|---|---|---|
| AudioEncoder | `openai/whisper-large-v3` | Identical to large-v3 (frozen during distillation) |
| MelSpectrogram | `openai/whisper-large-v3` | Standard mel-spectrogram preprocessing |
| TextDecoder | `bofenghuang/whisper-large-v3-distil-it-v0.2` | 2 decoder layers, trained on Italian data |
| config.json | `distil-whisper/distil-large-v3` | Architecture config (2 decoder layers, 32 encoder layers) |
| generation_config.json | `distil-whisper/distil-large-v3` | Modified: `language` set to `null` (was `<\|en\|>`) |
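
The composition above amounts to merging three source folders into one model folder. The following is a minimal sketch of that merge, not the exact procedure used for this repo. It assumes the usual WhisperKit layout in which each component is a compiled `.mlmodelc` bundle sitting next to the two JSON files; the function name `assemble_model` and all four folder arguments are hypothetical.

```shell
# assemble_model ENCODER_SRC DECODER_SRC CONFIG_SRC DEST
# Hypothetical helper: merges components from three local folders into DEST.
# Assumes each component is a compiled .mlmodelc bundle (a directory).
assemble_model() (
  [ "$#" -eq 4 ] || { echo "usage: assemble_model ENCODER_SRC DECODER_SRC CONFIG_SRC DEST" >&2; exit 2; }
  encoder_src=$1
  decoder_src=$2
  config_src=$3
  dest=$4

  mkdir -p "$dest" &&
  # Encoder and mel front end: taken from the large-v3 CoreML export.
  cp -R "$encoder_src/AudioEncoder.mlmodelc" "$encoder_src/MelSpectrogram.mlmodelc" "$dest/" &&
  # Decoder: taken from the Italian distilled conversion.
  cp -R "$decoder_src/TextDecoder.mlmodelc" "$dest/" &&
  # Architecture and generation configs.
  cp "$config_src/config.json" "$config_src/generation_config.json" "$dest/"
)
```

Called as, for example, `assemble_model ./large-v3 ./output ./distil-large-v3 ./bofenghuang_whisper-large-v3-distil-it` (placeholder paths), it stops at the first failed copy and returns a non-zero status.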

How it was built

TextDecoder was converted from PyTorch to CoreML using whisperkittools:
```
whisperkit-generate-model \
  --model-version bofenghuang/whisper-large-v3-distil-it-v0.2 \
  --output-dir ./output
```
The AudioEncoder conversion failed due to a coremltools compatibility issue. Because the encoder is identical to large-v3 (frozen during distillation), it did not need to be reconverted.
AudioEncoder + MelSpectrogram were copied from the official openai_whisper-large-v3 CoreML model on argmaxinc/whisperkit-coreml.
generation_config.json was patched to set "language": null instead of "language": "<|en|>" to avoid English bias.
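
The patch in the last step can be sketched as a one-line text substitution. This is an illustration under stated assumptions, not the command actually used: it assumes the key appears on a single line in the form `"language": "<|en|>"`, and the function name `patch_language` is hypothetical. It writes to a temporary file rather than using `sed -i`, whose syntax differs between macOS and GNU sed.

```shell
# patch_language FILE
# Hypothetical helper: rewrites `"language": "<|xx|>"` to `"language": null`.
# Text-level edit only; assumes key and value sit on the same line.
patch_language() (
  [ "$#" -eq 1 ] || { echo "usage: patch_language FILE" >&2; exit 2; }
  file=$1
  tmp="$file.tmp"

  # [|] matches a literal pipe, so the pattern covers any <|xx|> language token.
  sed -E 's/("language"[[:space:]]*:[[:space:]]*)"<[|][a-z]+[|]>"/\1null/' "$file" > "$tmp" &&
  mv "$tmp" "$file"
)
```

A JSON-aware tool is the safer choice if the file's formatting is not known in advance; the substitution above leaves every other key untouched.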

Usage with WhisperKit

```swift
import WhisperKit

// Download the model
let modelURL = try await WhisperKit.download(
    variant: "bofenghuang_whisper-large-v3-distil-it",
    from: "jmadseeker/whisperkit-coreml-distil-large-v3-it"
)

// Initialize WhisperKit
let config = WhisperKitConfig(modelFolder: modelURL.path, verbose: false, logLevel: .error, load: true)
let whisperKit = try await WhisperKit(config)

// Transcribe in Italian
let options = DecodingOptions(
    task: .transcribe,
    language: "it",
    temperature: 0.0,
    temperatureIncrementOnFallback: 0.2,
    temperatureFallbackCount: 2
)
// `audioURL` is a file URL to the audio to transcribe, defined elsewhere
let results = try await whisperKit.transcribe(audioPath: audioURL.path, decodeOptions: options)
print(results.first?.text ?? "")
```

Performance

~6x faster than large-v3 (2 decoder layers vs 32)
~1.5 GB model size
98% ANE dispatch on Apple Silicon (Neural Engine accelerated)
Quality comparable to large-v3 for Italian transcription

Used by

Pulsecribe — macOS voice dictation app with multi-provider transcription

Credits

Original Italian distilled model: bofenghuang/whisper-large-v3-distil-it-v0.2
WhisperKit framework: argmaxinc/WhisperKit
CoreML conversion tools: argmaxinc/whisperkittools
Official CoreML models: argmaxinc/whisperkit-coreml
