Deterministic SLERP (ZeroVoice) for Voxtral-mini-4b - 3.2 Billion voices?

#19

by mushroomfleet - opened 24 days ago

Discussion

edited 24 days ago

Prototype is fully working, source: https://github.com/MushroomFleet/ZeroVoice-Voxtral-mini-4b

ZeroBytes = "The Universe springs forth - from Zero Bytes!"

How It Works

ZeroVoice applies the ZeroBytes "position-is-seed" procedural generation paradigm to text-to-speech voice synthesis:

Z-axis selects the voice family: z<100 = English, z=100-199 = European, z>=200 = Asian/Arabic
Position hash (xxHash64 of packed coordinates) selects a cross-family voice pair -- the primary voice (A) and a tint voice (B) are always from different language families
Coherent noise on the X/Y plane derives a smooth SLERP blend weight (capped at 0.20) so nearby coordinates sound similar
Row-wise SLERP blends the two voice embeddings on the 3072-dimensional hypersphere with magnitude preservation and norm calibration
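
The first two steps above (Z selects the family, a position hash selects a cross-family pair) can be sketched roughly as follows. This is an illustration, not the repository's code: the voice names and family sizes are placeholders, `select_pair` and the other function names are hypothetical, and BLAKE2b stands in for xxHash64 so the sketch needs only the standard library.

```python
import hashlib
import struct

# Hypothetical preset inventory: names and family sizes are placeholders,
# not the actual Voxtral voice list.
FAMILIES = {
    "english": ["en_a", "en_b", "en_c"],
    "european": ["eu_a", "eu_b"],
    "asian_arabic": ["as_a", "as_b"],
}


def family_for_z(z: int) -> str:
    """Map the Z coordinate to a voice family using the ranges from the post."""
    if z < 100:
        return "english"
    if z < 200:
        return "european"
    return "asian_arabic"


def position_hash(x: int, y: int, z: int, seed: int = 0) -> int:
    """64-bit hash of the packed coordinates.

    BLAKE2b is a standard-library stand-in; the post uses xxHash64.
    """
    packed = struct.pack("<qqqq", x, y, z, seed)
    digest = hashlib.blake2b(packed, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def select_pair(x: int, y: int, z: int, seed: int = 0) -> tuple[str, str]:
    """Pick a primary voice (A) from the Z family and a tint voice (B)
    from any other family, so the pair is always cross-family."""
    h = position_hash(x, y, z, seed)
    primary_family = family_for_z(z)
    primaries = FAMILIES[primary_family]
    tints = [
        voice
        for family, voices in FAMILIES.items()
        if family != primary_family
        for voice in voices
    ]
    # Use different bit ranges of the hash for the two choices.
    primary = primaries[h % len(primaries)]
    tint = tints[(h >> 32) % len(tints)]
    return primary, tint
```

Because the pair depends only on `(x, y, z, seed)`, calling `select_pair` twice with the same coordinates returns the same voices, and no lookup table has to be stored.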

The result is a voice that sounds like the primary preset with subtle characteristics from the secondary, producing natural-sounding variation without artifacts.
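
For readers unfamiliar with SLERP, here is a minimal pure-Python sketch of a row-wise blend. It assumes each voice embedding is a list of rows (each 3072 values wide in the real model) and that "magnitude preservation" means the output row's norm is interpolated between the two input norms. The post does not spell out the exact calibration, so treat that choice, and the names `slerp_row` and `slerp_rows`, as assumptions.

```python
import math


def _norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def slerp_row(a, b, t):
    """Spherically interpolate from row a towards row b by weight t in [0, 1].

    Directions are blended on the unit hypersphere; the result is then
    rescaled so its norm is the linear blend of the input norms
    (an assumed form of magnitude preservation).
    """
    norm_a, norm_b = _norm(a), _norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        # No direction to rotate: fall back to plain linear interpolation.
        return [(1.0 - t) * p + t * q for p, q in zip(a, b)]

    ua = [v / norm_a for v in a]
    ub = [v / norm_b for v in b]
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(ua, ub))))
    omega = math.acos(dot)
    sin_omega = math.sin(omega)

    if sin_omega < 1e-6:
        # Nearly parallel (or antiparallel) rows: SLERP weights are unstable.
        wa, wb = 1.0 - t, t
    else:
        wa = math.sin((1.0 - t) * omega) / sin_omega
        wb = math.sin(t * omega) / sin_omega

    direction = [wa * p + wb * q for p, q in zip(ua, ub)]
    dir_norm = _norm(direction)
    if dir_norm == 0.0:
        return list(a)

    target_norm = (1.0 - t) * norm_a + t * norm_b
    return [target_norm * v / dir_norm for v in direction]


def slerp_rows(voice_a, voice_b, t):
    """Apply slerp_row to each pair of corresponding rows."""
    return [slerp_row(ra, rb, t) for ra, rb in zip(voice_a, voice_b)]
```

With the weight capped at 0.20 as described above, the blended rows stay close to the primary voice, which fits the description of a primary preset with a subtle tint.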

Key Numbers

Metric	Value
Base preset voices	20 (9 languages, 10 male / 10 female)
Cross-family voice pairs	156 (A and B always from different families)
Perceptually distinct voices per seed	~3,276
Total addressable voices (all seeds)	~3.28 billion
Coordinate addresses (practical range)	301 million+
Additional voice data stored	0 bytes
Audio output	24 kHz WAV
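
The table does not say how the per-seed and total figures are derived. One reading that fits the numbers, offered as an assumption rather than anything the post confirms: 3,276 is exactly 156 × 21, which would correspond to 21 distinguishable blend levels (for example, weights 0.00 to 0.20 in steps of 0.01), and the total then implies roughly one million seeds.

```python
# Figures quoted in the table above.
cross_family_pairs = 156
voices_per_seed = 3_276
total_voices = 3.28e9

# Assumed decomposition (not stated in the post): pairs x blend levels.
blend_levels = voices_per_seed // cross_family_pairs
assert cross_family_pairs * blend_levels == voices_per_seed

# Seed count implied by the quoted total; an inference, not a quoted figure.
implied_seeds = total_voices / voices_per_seed

print(f"blend levels per pair: {blend_levels}")
print(f"implied seed count: about {implied_seeds:,.0f}")
```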

Full statistics and voice inventory: ZeroVoice-stats.md

ZeroBytes Law Compliance

Law	Status
O(1) Access	Any voice computed directly from (x,y,z) -- no iteration
Parallelism	Each coordinate depends only on its own values, never neighbors
Coherence	Adjacent coordinates produce similar voices via multi-octave noise
Hierarchy	Z selects family, X/Y explore variation within
Determinism	Same inputs produce identical output on any machine
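
To make the Coherence and Determinism rows concrete, here is a hedged sketch of how a blend weight could be derived from multi-octave value noise on the X/Y plane. The noise type, octave count, and `scale` are all assumptions for illustration, the function names are hypothetical, and BLAKE2b again stands in for xxHash64. Only the 0.20 cap comes from the post.

```python
import hashlib
import math
import struct

MAX_BLEND = 0.20  # blend-weight cap quoted in the post


def lattice_value(ix: int, iy: int, seed: int) -> float:
    """Deterministic pseudo-random value in [0, 1] for an integer lattice point."""
    packed = struct.pack("<qqq", ix, iy, seed)
    digest = hashlib.blake2b(packed, digest_size=8).digest()
    return int.from_bytes(digest, "little") / 2**64


def smoothstep(t: float) -> float:
    return t * t * (3.0 - 2.0 * t)


def value_noise(x: float, y: float, seed: int) -> float:
    """Bilinear value noise with smoothstep easing; output stays in [0, 1]."""
    ix, iy = math.floor(x), math.floor(y)
    fx, fy = smoothstep(x - ix), smoothstep(y - iy)
    v00 = lattice_value(ix, iy, seed)
    v10 = lattice_value(ix + 1, iy, seed)
    v01 = lattice_value(ix, iy + 1, seed)
    v11 = lattice_value(ix + 1, iy + 1, seed)
    top = v00 + (v10 - v00) * fx
    bottom = v01 + (v11 - v01) * fx
    return top + (bottom - top) * fy


def blend_weight(x: int, y: int, seed: int = 0,
                 octaves: int = 3, scale: float = 32.0) -> float:
    """Sum a few octaves of noise and map the result into [0, MAX_BLEND]."""
    total, amplitude, frequency, norm = 0.0, 1.0, 1.0 / scale, 0.0
    for octave in range(octaves):
        total += amplitude * value_noise(x * frequency, y * frequency, seed + octave)
        norm += amplitude
        amplitude *= 0.5
        frequency *= 2.0
    return MAX_BLEND * total / norm
```

Each call reads only its own coordinates, so weights can be computed in any order or in parallel, and neighbouring coordinates such as `(10, 10)` and `(11, 10)` yield weights that differ only slightly.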

I already built this system for KokoroTTS, where it lets a 150 MB model give video game NPCs unique, consistent voices by using the world spawn position (X, Y, Z) as the voice ID. Split into male and female, that offered 1 billion voices per gender. This was achieved by applying the determinism scope to SLERP calculations on the NumPy voice data. I wanted to see whether the same approach could be applied to Voxtral, which stores its voices differently, as a prototype proof of concept.

This will never beat "fine tuning" or "cloning", but offers consistent voices at large scale, easily integrated into procedurally deterministic applications.

Thanks to the team for the weights release!

3 billion voices is the technical total, but many will be strange blends. Still, it offers millions of practically unique voices that are just as easy to reference, so it's perfect for gamedev and other development applications.

mushroomfleet changed discussion title from Deterministic SLERP (ZeroVoice) for Voxtral-mini-4b to Deterministic SLERP (ZeroVoice) for Voxtral-mini-4b - 3.2 Billion voices? 24 days ago

mushroomfleet changed discussion status to closed 24 days ago

mushroomfleet changed discussion status to open 24 days ago
