QORA-TTS 0.6B - Pure Rust Text-to-Speech

Pure Rust TTS engine with 9 built-in speakers. No Python, no CUDA, no external ML frameworks. Single executable + model weights = portable text-to-speech that runs on any machine.

Smart system awareness: the engine detects your hardware (RAM, CPU threads) at startup and adjusts generation limits so TTS runs well even on constrained systems. 9 built-in voices work out of the box, with no reference audio needed. 10 languages supported.

Based on Qwen3-TTS-12Hz-0.6B-CustomVoice (Apache 2.0).

License

This project is licensed under Apache 2.0. The base model Qwen3-TTS-12Hz-0.6B-CustomVoice is released by the Qwen team under Apache 2.0.

What It Does

QORA-TTS 0.6B converts text to natural-sounding speech. Highlights:

  • 9 built-in voices - male and female speakers embedded in the model, no reference audio needed
  • 10 languages - English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian
  • 24 kHz output - high-quality mono WAV audio
  • Fast generation - lighter model for quicker speech synthesis
  • Controllable generation - adjust length, temperature, and sampling parameters

Quick Start

  1. Download all files into the same folder
  2. Run:
# Use built-in speaker
qora-tts.exe --speaker ryan --language english --text "Hello, how are you?"

# Different speaker
qora-tts.exe --speaker serena --language chinese --text "δ½ ε₯½δΈ–η•Œ"

# Japanese speaker
qora-tts.exe --speaker ono_anna --language japanese --text "こんにけは"

# Control length and output
qora-tts.exe --speaker aiden --language english --text "Good morning!" --max-codes 200 --output greeting.wav

# Reproducible output with seed
qora-tts.exe --speaker ryan --text "Same every time" --seed 42

Files

qora-tts.exe           4.1 MB   Inference engine
model.qora-tts         971 MB   Q4 weights (talker + predictor + decoder)
config.json            4.8 KB   Model configuration
tokenizer.json          11 MB   Tokenizer (151,936 vocab)
vocab.json             2.7 MB   Vocabulary
merges.txt             1.6 MB   BPE merges
tokenizer_config.json  7.2 KB   Tokenizer config

No safetensors needed. Everything loads from model.qora-tts. The exe auto-finds all files in its own directory.
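The "auto-find" behavior can be mimicked in a few lines. This is only a sketch of the idea (the engine's actual lookup code is not published here): resolve each asset relative to the executable's own directory.

```rust
use std::env;
use std::path::PathBuf;

/// Resolve a model file relative to the running executable's directory.
/// Sketch of the "auto-find" behavior: all assets live next to the binary.
fn resolve_asset(name: &str) -> Option<PathBuf> {
    let exe = env::current_exe().ok()?;        // path of the running binary
    let dir = exe.parent()?.to_path_buf();      // its containing directory
    Some(dir.join(name))                        // e.g. <dir>/config.json
}
```

With this scheme, moving the folder anywhere keeps the binary and its weights together.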

Architecture

Component        Details
Parameters       0.6B total
Talker           28 layers, hidden=1024, 16/8 GQA heads, SwiGLU FFN 3072
Code Predictor   5 layers, hidden=1024, 16 code groups
Speech Decoder   8-layer transformer + Vocos vocoder, 16 VQ codebooks
Quantization     Q4 (4-bit symmetric, group_size=32) with LUT-optimized dequantization
Sample Rate      24 kHz mono WAV
Code Rate        12.5 Hz (1 code = 80 ms of audio)
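The 12.5 Hz code rate makes duration math simple: each code covers 80 ms, so n codes yield n / 12.5 seconds of audio. A small helper (function names are illustrative, not part of the engine):

```rust
/// Milliseconds of audio produced by `n` speech codes at the 12.5 Hz code rate.
fn codes_to_ms(n: u32) -> u32 {
    n * 80 // 1 code = 1 / 12.5 s = 80 ms
}

/// Codes needed to cover `ms` milliseconds of audio (rounded up).
fn ms_to_codes(ms: u32) -> u32 {
    (ms + 79) / 80
}
```

For example, the default --max-codes 500 caps output at 500 / 12.5 = 40 seconds.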

How It Works

  1. Text encoding - tokenize input text with 151K BPE vocabulary
  2. Speaker selection - load built-in speaker embedding by name
  3. Code generation - 28-layer transformer (Talker) generates speech codes autoregressively
  4. Code expansion - 5-layer Code Predictor expands code0 into 16 codebooks (codes 0-15)
  5. Audio synthesis - VQ decoder + Vocos vocoder converts codes to 24 kHz waveform
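The steps above map onto a simple data flow. The sketch below uses stub stages and dummy values (all names hypothetical, not the engine's real API) just to show the shapes: code0 timesteps fan out into 16-codebook frames, and each frame becomes 80 ms of 24 kHz samples.

```rust
/// One timestep of the first codebook (code0), produced by the Talker.
type Code0 = u16;
/// One timestep expanded to all 16 codebooks by the Code Predictor.
type CodeFrame = [u16; 16];

const NUM_CODEBOOKS: usize = 16;
const SAMPLES_PER_FRAME: usize = 1920; // 80 ms at 24 kHz

/// Stub Talker: autoregressively emit up to `max_codes` code0 tokens.
fn talker_generate(max_codes: usize) -> Vec<Code0> {
    (0..max_codes as u16).collect()
}

/// Stub Code Predictor: expand each code0 into a full 16-codebook frame.
fn expand(code0: &[Code0]) -> Vec<CodeFrame> {
    code0.iter().map(|&c| [c; NUM_CODEBOOKS]).collect()
}

/// Stub decoder: each frame becomes 80 ms of 24 kHz samples (silence here).
fn decode(frames: &[CodeFrame]) -> Vec<f32> {
    vec![0.0; frames.len() * SAMPLES_PER_FRAME]
}
```

The real stages are transformers, but the tensor shapes flow exactly this way: N codes in, N frames of 16 codes out, N × 1920 samples decoded.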

Smart System Awareness

QORA-TTS detects your system at startup and automatically adjusts generation limits:

Example startup banner:

    QORA-TTS - Pure Rust Text-to-Speech Engine
    System: 16384 MB RAM (9856 MB free), 12 threads

Available RAM   Max Codes   Default   Audio Length
< 4 GB          200         100       ~8s
4-8 GB          500         300       ~20s
8-12 GB         1000        500       ~40s
>= 12 GB        2000        500       ~80s

Hard caps apply even to explicit user values: if you pass --max-codes 2000 on a system with 6 GB of free RAM, it is clamped to 500 automatically. This prevents the model from running too long on weak systems.
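The clamping rule can be expressed directly from the table above; a sketch with illustrative function names (thresholds copied from the table):

```rust
/// Hard cap on --max-codes as a function of free RAM, per the table above.
fn max_codes_cap(free_ram_gb: f64) -> u32 {
    if free_ram_gb < 4.0 {
        200
    } else if free_ram_gb < 8.0 {
        500
    } else if free_ram_gb < 12.0 {
        1000
    } else {
        2000
    }
}

/// A user-supplied --max-codes value is clamped to the system cap.
fn effective_max_codes(requested: u32, free_ram_gb: f64) -> u32 {
    requested.min(max_codes_cap(free_ram_gb))
}
```

So --max-codes 2000 with 6 GB free resolves to 500, exactly the clamping described above.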

CLI Arguments

Flag                Default                        Description
--text <text>       "Hello, how are you today?"    Text to synthesize
--speaker <name>    ryan                           Built-in speaker name
--language <name>   english                        Target language
--output <path>     output.wav                     Output WAV path
--max-codes <n>     500                            Max code timesteps (~n/12.5 seconds)
--temperature <f>   0.8                            Sampling temperature
--top-k <n>         50                             Top-K sampling
--seed <n>          random                         Random seed for reproducibility
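As a sketch, the defaults above could live in a plain struct (illustrative only; field and type names are not the binary's real internals):

```rust
/// CLI defaults from the table above (sketch; names are illustrative).
struct TtsArgs {
    text: String,
    speaker: String,
    language: String,
    output: String,
    max_codes: u32,
    temperature: f32,
    top_k: u32,
    seed: Option<u64>, // None = random seed
}

impl Default for TtsArgs {
    fn default() -> Self {
        TtsArgs {
            text: "Hello, how are you today?".into(),
            speaker: "ryan".into(),
            language: "english".into(),
            output: "output.wav".into(),
            max_codes: 500,
            temperature: 0.8,
            top_k: 50,
            seed: None,
        }
    }
}
```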

Built-in Speakers

Speaker    Language         Description
ryan       English          Dynamic male voice
aiden      English          Sunny American male
serena     Chinese          Warm, gentle female
vivian     Chinese          Bright young female
uncle_fu   Chinese          Seasoned male
dylan      Beijing dialect  Youthful male
eric       Sichuan dialect  Lively male
ono_anna   Japanese         Playful female
sohee      Korean           Warm female
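Validating a --speaker value against this table is a simple membership check; a sketch (the engine's real validation is not published here):

```rust
/// Built-in speaker names from the table above.
const SPEAKERS: [&str; 9] = [
    "ryan", "aiden", "serena", "vivian", "uncle_fu",
    "dylan", "eric", "ono_anna", "sohee",
];

/// Case-insensitive lookup of a built-in speaker name.
fn find_speaker(name: &str) -> Option<&'static str> {
    let lower = name.to_ascii_lowercase();
    SPEAKERS.iter().copied().find(|s| *s == lower.as_str())
}
```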

Supported Languages

Language     Flag Value
English      english
Chinese      chinese
German       german
Italian      italian
Portuguese   portuguese
Spanish      spanish
Japanese     japanese
Korean       korean
French       french
Russian      russian

Performance

Tested on i5-11500 (6C/12T), 16GB RAM, CPU-only:

Phase            Time          Notes
Model Load       ~0.6s         From binary, 971 MB
Prefill          ~2-5s         Text + speaker embedding processing
Code Generation  ~1.5s/code    Autoregressive, 12.5 codes/sec of audio
Code Expansion   ~0.1s         5-layer predictor, 16 codebooks
Audio Decode     ~0.5s/frame   VQ + Vocos vocoder
RAM Usage        ~970 MB       Q4 model in memory

Example: "Hello, how are you?" (~3 seconds of audio) takes ~10-15 seconds total.

Comparison with 1.7B

                    QORA-TTS 0.6B       QORA-TTS 1.7B
Parameters          0.6B                1.7B
Model size          971 MB              1559 MB
Voice cloning       No                  Yes (ECAPA-TDNN)
Built-in speakers   9 (embedded)        25 (via voice files)
Code generation     ~1.5s/code          ~2.5s/code
Quality             Good                Higher
Best for            Speed + simplicity  Quality + cloning

Built With

  • Language: Pure Rust (2024 edition)
  • Dependencies: half (f16), rayon (parallelism), tokenizers (HuggingFace tokenizer), memmap2 (mmap for converter), serde_json (config parsing)
  • No ML framework for inference - all matrix ops are hand-written Rust
  • Burn framework used only as a build dependency (for binary format types)
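Since the Architecture table names the scheme (Q4, 4-bit symmetric, group_size=32), each group of 32 weights shares one f32 scale, with nibble values in [-8, 7]. A hedged sketch of such a dequantizer (not the engine's actual code; the LUT optimization and nibble unpacking are omitted):

```rust
const GROUP_SIZE: usize = 32;

/// Dequantize 4-bit symmetric weights: one f32 scale per 32-value group,
/// each quantized value already unpacked to a signed integer in [-8, 7].
fn dequant_q4(quants: &[i8], scales: &[f32]) -> Vec<f32> {
    quants
        .iter()
        .enumerate()
        .map(|(i, &q)| q as f32 * scales[i / GROUP_SIZE])
        .collect()
}
```

Symmetric quantization needs no zero-point per group, which keeps the inner loop to a single multiply per weight.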

Model Binary Format (.qora-tts)

Custom binary format for fast loading:

Header:  "QTTS" magic + version + format byte
Talker:  28 transformer layers (Q4 quantized)
Predictor: 5 transformer layers + code embeddings
Decoder: VQ codebooks + 8 transformer layers + Vocos vocoder
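A minimal sketch of reading that header; beyond the 4-byte "QTTS" magic stated above, the one-byte widths for version and format are assumptions:

```rust
/// Parsed .qora-tts header (sketch; version/format widths are assumptions).
struct QttsHeader {
    version: u8,
    format: u8,
}

/// Check the "QTTS" magic, then pull out the version and format bytes.
fn parse_header(bytes: &[u8]) -> Option<QttsHeader> {
    if bytes.len() < 6 || &bytes[0..4] != b"QTTS" {
        return None; // too short or wrong magic
    }
    Some(QttsHeader { version: bytes[4], format: bytes[5] })
}
```

Checking the magic before touching any weights lets the loader reject a wrong or truncated file immediately.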