	---
	license: openrail
	language:
	- en
	- ja
	- zh
	- ko
	- es
	- fr
	- de
	- multilingual
	library_name: coremltools
	tags:
	- coreml
	- ane
	- apple-neural-engine
	- text-to-speech
	- tts
	- audio
	- diffusion
	- flow-matching
	- on-device
	- ios
	- macos
	- fp16
	pipeline_tag: text-to-speech
	base_model: Supertone/supertonic-3
	base_model_relation: quantized
	---

	# Supertonic-3 — CoreML (fp16, ANE-ready)

	CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
	a 99M-parameter multilingual TTS model. All 4 components run on the
	Apple Neural Engine (1.9–3.7× faster than CPU, measured on an M2 Pro).

	| Component | Size | Role |
	| --- | ---: | --- |
	| `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count |
	| `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent |
	| `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) |
	| `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform |
	| Total | 272 MB | (originals: ~400 MB ONNX) |

	## Quickstart

	```bash
	pip install coremltools soundfile numpy supertonic
	git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
	cd supertonic-3-coreml

	# Short prompt
	python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav

	# Long prompt — use --auto-pad for full content rendering
	python inference.py \
	--text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
	--voice F5 --lang en --auto-pad --out long.wav
	```

	10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
	31 languages supported via `unicode_indexer.json`.

	## The auto-pad trick (why `--auto-pad` matters)

	The supertonic-3 model has a soft cap on how much speech it renders per
	utterance. For long inputs (more than ~13 s of natural speech) the model
	truncates the prompt and emits a low-amplitude filler tone for the rest
	of the budget. The CoreML conversion's static bucket (T=L=320) extends
	this cap by ~3 s, because the bucket's padded positions leak into the
	real positions through ConvNeXt's dilated convolutions. That leak is
	**why CoreML inference sounds more natural than the original ONNX
	library** (proper word separation, intonation), but long prompts still
	cut off mid-sentence.

	`--auto-pad` is a two-pass workaround:

	1. Pass 1 synthesizes the prompt alone at full bucket length to find
	where the model's content naturally stops (`t_orig`).
	2. Pass 2 appends a long filler sentence
	(`" And with that, the gentle silence wrapped itself around the room."`)
	to the prompt. The extra text gives the model enough frames to render
	the original prompt in full; it then renders the filler sentence and
	finally drops into the filler tone.
	3. The longest clean-silence gap after `t_orig` is the boundary between
	the original prompt and the appended filler. The pipeline trims
	there and tail-pads with 0.5 s of true silence.

	Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.
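
Step 3 above can be sketched roughly as follows. This is an illustrative sketch, not the code in `inference.py`: the helper names (`longest_silence_gap`, `trim_and_pad`), the `threshold` value, the choice to cut at the start of the gap, and the plain-list waveform are all assumptions made for the example.

```python
def longest_silence_gap(samples, start, threshold=1e-3):
    """Return (begin, end) of the longest run with |x| < threshold at or after `start`.

    `end` is exclusive. Returns None if no sample from `start` on is below
    the threshold. The threshold is an assumed value; it has to sit below
    the amplitude of the model's filler tone to isolate true silence.
    """
    best = None
    run_begin = None
    for i in range(start, len(samples)):
        if abs(samples[i]) < threshold:
            if run_begin is None:
                run_begin = i
        elif run_begin is not None:
            if best is None or i - run_begin > best[1] - best[0]:
                best = (run_begin, i)
            run_begin = None
    # A silent run that reaches the end of the signal also counts.
    if run_begin is not None:
        if best is None or len(samples) - run_begin > best[1] - best[0]:
            best = (run_begin, len(samples))
    return best


def trim_and_pad(samples, t_orig, sample_rate=44100, tail_seconds=0.5, threshold=1e-3):
    """Cut at the longest silence gap after `t_orig`, then append true silence."""
    gap = longest_silence_gap(samples, t_orig, threshold)
    cut = gap[0] if gap is not None else len(samples)  # assumption: cut at gap start
    return list(samples[:cut]) + [0.0] * int(tail_seconds * sample_rate)
```

Here `t_orig` is a sample index taken from pass 1, and `samples` is the pass-2 waveform. A real pipeline would more likely work on NumPy arrays and a smoothed energy envelope rather than raw per-sample amplitudes.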

	## ANE engagement

	All 4 components compile to ANE-resident programs when loaded with
	`compute_units=ct.ComputeUnit.ALL` (the coremltools default). Measured
	speedups on M2 Pro vs CPU:

	| Component | ANE speedup |
	| --- | --- |
	| duration_predictor | 1.9× |
	| text_encoder | 2.8× |
	| vector_estimator | 2.4× (per step; 8 steps total) |
	| vocoder | 3.7× |
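
Speedups like those in the table above could be measured with a small timing harness along these lines. This is a sketch under assumptions, not the benchmark actually used for this repo: `median_latency_ms` and `speedup` are hypothetical helper names, and the warmup and run counts are arbitrary.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=20):
    """Call `fn` a few times to warm up, then return the median latency in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, accelerated_ms):
    """Ratio of baseline latency to accelerated latency (higher is better)."""
    return baseline_ms / accelerated_ms
```

In practice `fn` would wrap a `model.predict(inputs)` call, once for a model loaded with `ct.models.MLModel(path, compute_units=ct.ComputeUnit.CPU_ONLY)` and once with `ct.ComputeUnit.ALL`. Warmup runs matter here because the first prediction typically includes one-time compilation and loading cost.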

	Verify ANE engagement with:

	```bash
	xcrun xctrace record --template "Core ML" --output trace.trace --launch -- "$(which python)" inference.py --text "test"
	```

	## Conversion notes

	- Static bucket: T=320 (text length), L=320 (latent length). Inputs are
	zero-padded on the right and masked. Bucket = 22.3 s of audio.
	- `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented
	in PyTorch from the ONNX initializers, then traced to CoreML.
	Per-component float parity vs ONNX: max-abs error 2e-5
	(duration_predictor), cosine similarity 0.998 (text_encoder) and
	0.9998 (vocoder).
	- `vector_estimator` (the heavy diffusion model) goes through
	`onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace`
	-> coremltools. Cos 0.998 vs ONNX per diffusion step.
	- The diffusion sampler stays host-side (8 Euler steps over the single
	step graph). All 4 components are individually quantizable.
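
The host-side sampler mentioned in the last bullet can be sketched as a fixed-step Euler loop. This is a minimal illustration under assumptions: `velocity_fn` is a stand-in for one call to the `vector_estimator` graph, and the uniform time grid from 0 to 1 is assumed rather than taken from this repo. The real model also takes conditioning inputs that are omitted here.

```python
def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler.

    `velocity_fn(x, t)` must return a list the same length as `x`.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)  # one single-step graph evaluation per iteration
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Keeping the loop on the host means the converted graph only has to represent a single step, so the step count can be changed without reconverting the model.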

	## License

	This conversion follows the original Supertone/supertonic-3 license
	(OpenRAIL). See `LICENSE` (or the upstream model card).

	## Why fp16 and not INT4?

	We attempted to ship an INT4 variant. After exhaustive testing
	(results summarized in the table below), the vocoder and
	vector_estimator of supertonic-3 cannot be quantized below INT8:

	| Component | INT4 result | Why |
	| --- | --- | --- |
	| vocoder | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed precision (head-only carve-out) is also broken. | HiFi-GAN-style upsampling is uniformly sensitive; INT8 (cos 0.99) is the floor. |
	| vector_estimator | Per-step cos 0.82–0.999; over 8 diffusion steps this compounds to garbled, "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96–0.998, except short_ja_F3, which is architecturally cos 0.82 even at INT8). |
	| duration_predictor | Smallest drift was 0.11 s with pt_uniform, which is enough to shift L_real, move the bucket-leak boundary, and perceptibly break pacing. | The duration_predictor output sets the diffusion frame budget; any drift propagates. |
	| text_encoder | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with vector_estimator drift. |

	The best achievable mixed config (vocoder at INT8, everything else
	fp16) saves ~25 MB out of 272 MB, which is not worth a separate
	variant. The fp16 build shipped here is the final deliverable.
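
The `cos` figures in the table above are cosine similarities between a quantized output and its reference. A minimal sketch of such a metric follows, assuming it is computed over flattened output tensors (this README does not spell out the exact procedure, and `cosine_similarity` is a hypothetical helper name):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flat vectors.

    Returns 0.0 if either vector is all zeros, so that a silent output
    never scores as a perfect match.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A value near 1 means the quantized output points in almost the same direction as the reference, while a value near 0 (as for the INT4 vocoder) means the two are essentially uncorrelated.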

	## Companion build

	The cross-platform LiteRT version is at
	[Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
	LiteRT ships INT4/INT8 variants plus an INT8-quantized ONNX
	vector_estimator (65 MB instead of 256 MB). However, LiteRT can't
	reproduce the CoreML-only "bucket-leak" extension, so long prompts
	sound rushed on LiteRT. Use CoreML for full quality on Apple platforms.

	## Credits

	- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
	- CoreML conversion + auto-pad workflow: this repo
	- Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)