QORA-TTS 1.7B - Pure Rust Text-to-Speech with Voice Cloning
Pure Rust TTS engine with voice cloning. No Python, no CUDA, no external ML frameworks. Single executable + model weights = portable text-to-speech that runs on any machine.
Smart system awareness: automatically detects your hardware (RAM, CPU threads) and adjusts generation limits so TTS runs well even on constrained systems. Voice cloning: clone any voice from a 3-10 second WAV recording. 25 included voices: ready to use out of the box.
Based on Qwen3-TTS-12Hz-1.7B-Base (Apache 2.0).
License
This project is licensed under Apache 2.0. The base model Qwen3-TTS-12Hz-1.7B-Base is released by the Qwen team under Apache 2.0.
What It Does
QORA-TTS 1.7B converts text to natural-sounding speech. Its features:
- Voice cloning: clone any voice from a short WAV recording (3-10 seconds)
- 25 built-in voices: 13 female + 12 male voices included, ready to use
- 10 languages: English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian
- 24 kHz output: high-quality mono WAV audio
- Controllable generation: adjust length, temperature, and sampling parameters
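To make the "24 kHz mono WAV" output concrete, here is a minimal sketch of a WAV writer using only the standard library. The function name `write_wav_mono16` and the 16-bit PCM sample depth are assumptions for illustration; the README states only the sample rate and channel count, not how QORA-TTS encodes samples.

```rust
use std::io::{self, Write};

/// Sample rate stated in the README.
pub const SAMPLE_RATE: u32 = 24_000;

/// Encode mono f32 samples in [-1.0, 1.0] as a 16-bit PCM WAV stream.
/// The 16-bit depth is an assumption; the README only says 24 kHz mono.
pub fn write_wav_mono16<W: Write>(mut out: W, samples: &[f32]) -> io::Result<()> {
    let data_len = (samples.len() * 2) as u32; // 2 bytes per sample
    let byte_rate = SAMPLE_RATE * 2; // sample_rate * channels * bytes_per_sample

    out.write_all(b"RIFF")?;
    out.write_all(&(36 + data_len).to_le_bytes())?; // RIFF chunk size
    out.write_all(b"WAVE")?;
    out.write_all(b"fmt ")?;
    out.write_all(&16u32.to_le_bytes())?; // fmt chunk size
    out.write_all(&1u16.to_le_bytes())?; // audio format: PCM
    out.write_all(&1u16.to_le_bytes())?; // channels: mono
    out.write_all(&SAMPLE_RATE.to_le_bytes())?;
    out.write_all(&byte_rate.to_le_bytes())?;
    out.write_all(&2u16.to_le_bytes())?; // block align
    out.write_all(&16u16.to_le_bytes())?; // bits per sample
    out.write_all(b"data")?;
    out.write_all(&data_len.to_le_bytes())?;

    for &s in samples {
        // Clamp to the valid range, then scale to the i16 range.
        let v = (s.clamp(-1.0, 1.0) * 32767.0).round() as i16;
        out.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}
```

Passing a `&mut Vec<u8>` or a `std::fs::File` as `out` both work, since each implements `Write`.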
Platform Support
| Platform | Binary | Status |
|---|---|---|
| Windows x86_64 | qora-tts.exe | Tested |
| Linux x86_64 | qora-tts | Supported |
| macOS aarch64 | qora-tts | Supported |
CPU-only: no GPU needed. Pre-built binaries are available on the Releases page.
Quick Start
- Download from the Releases page
- Run:
# Voice cloning with included voice
qora-tts.exe --ref-audio voices/luna.wav --text "Hello, how are you?" --language english
# Different voice
qora-tts.exe --ref-audio voices/adam.wav --text "Good morning!" --language english
# Clone your own voice (any 24kHz WAV)
qora-tts.exe --ref-audio my_recording.wav --text "Custom voice" --language english
# Control length (codes = seconds x 12.5)
qora-tts.exe --ref-audio voices/luna.wav --text "Short" --max-codes 100
# Custom output path
qora-tts.exe --ref-audio voices/sagar.wav --text "Hi there" --language english --output greeting.wav
# Reproducible output with seed
qora-tts.exe --ref-audio voices/luna.wav --text "Same every time" --seed 42
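The relation "codes = seconds x 12.5" used for --max-codes can be written as two small helpers. This is a sketch only; the function names are hypothetical and not part of the QORA-TTS source.

```rust
/// Code rate from the README: 12.5 codes per second (one code = 80 ms).
pub const CODES_PER_SECOND: f64 = 12.5;

/// Number of codes needed to cover `seconds` of audio, rounded up.
pub fn codes_for_seconds(seconds: f64) -> u32 {
    (seconds * CODES_PER_SECOND).ceil() as u32
}

/// Approximate audio duration in seconds for `codes` code timesteps.
pub fn seconds_for_codes(codes: u32) -> f64 {
    f64::from(codes) / CODES_PER_SECOND
}
```

For instance, `--max-codes 100` corresponds to about 8 seconds of audio, and a 3-second target needs 38 codes after rounding up.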
Files
| File | Size | Description |
|---|---|---|
| qora-tts.exe | 4.1 MB | Inference engine |
| model.qora-tts | 1559 MB | Q4 weights (talker + predictor + decoder + speaker encoder) |
| config.json | 4.4 KB | Model configuration |
| tokenizer.json | 11 MB | Tokenizer (151,936 vocab) |
| vocab.json | 2.7 MB | Vocabulary |
| merges.txt | 1.6 MB | BPE merges |
| tokenizer_config.json | 7.2 KB | Tokenizer config |
| voices/ | | 25 reference WAV files for voice cloning |
No safetensors needed. Everything loads from model.qora-tts. The exe auto-finds all files in its own directory.
Architecture
| Component | Details |
|---|---|
| Parameters | 1.7B total |
| Talker | 28 layers, hidden=2048, 16/8 GQA heads, SwiGLU FFN 6144 |
| Code Predictor | 5 layers, hidden=1024, input_proj [1024, 2048], 16 code groups |
| Speech Decoder | 8-layer transformer + Vocos vocoder, 16 VQ codebooks |
| Speaker Encoder | ECAPA-TDNN (3 Res2Net blocks, 2048-dim output) |
| Quantization | Q4 (4-bit symmetric, group_size=32) with LUT-optimized dequantization |
| Sample Rate | 24 kHz mono WAV |
| Code Rate | 12.5 Hz (1 code = 80ms of audio) |
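As an illustration of the "4-bit symmetric, group_size=32" scheme with table-based dequantization, the sketch below quantizes one group of 32 weights and restores it through a 16-entry lookup table. The storage layout is an assumption: one f32 scale per group, values clamped to -7..=7, stored as nibble = value + 8, two nibbles per byte. The actual QORA-TTS layout is not documented here and may differ (for example, it may store scales as f16).

```rust
pub const GROUP_SIZE: usize = 32;

/// Maps a stored nibble (0..=15) to its signed value (-8..=7).
const LUT: [f32; 16] = [
    -8.0, -7.0, -6.0, -5.0, -4.0, -3.0, -2.0, -1.0,
    0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,
];

/// One quantized group: a shared scale plus 32 packed 4-bit values.
pub struct Q4Group {
    pub scale: f32,
    pub packed: [u8; GROUP_SIZE / 2],
}

/// Quantize a single value to a biased nibble (assumed encoding).
fn nibble(x: f32, scale: f32) -> u8 {
    let q = (x / scale).round().clamp(-7.0, 7.0) as i32;
    (q + 8) as u8
}

/// Symmetric quantization: the scale maps the largest magnitude to +/-7.
pub fn quantize_group(w: &[f32; GROUP_SIZE]) -> Q4Group {
    let max_abs = w.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 7.0 } else { 1.0 };
    let mut packed = [0u8; GROUP_SIZE / 2];
    for (i, byte) in packed.iter_mut().enumerate() {
        let lo = nibble(w[2 * i], scale);
        let hi = nibble(w[2 * i + 1], scale);
        *byte = lo | (hi << 4);
    }
    Q4Group { scale, packed }
}

/// Dequantize by table lookup followed by one multiply per value.
pub fn dequantize_group(g: &Q4Group) -> [f32; GROUP_SIZE] {
    let mut out = [0.0f32; GROUP_SIZE];
    for (i, &byte) in g.packed.iter().enumerate() {
        out[2 * i] = LUT[(byte & 0x0F) as usize] * g.scale;
        out[2 * i + 1] = LUT[(byte >> 4) as usize] * g.scale;
    }
    out
}
```

With this layout each group of 32 weights takes 16 bytes plus its scale, and the round-trip error per weight is at most half the group's scale.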
How It Works
- Text encoding: tokenize input text with 151K BPE vocabulary
- Voice extraction: ECAPA-TDNN speaker encoder extracts voice embedding from reference audio
- Code generation: 28-layer transformer (Talker) generates speech codes autoregressively
- Code expansion: 5-layer Code Predictor expands code0 into 16 codebooks (codes 0-15)
- Audio synthesis: VQ decoder + Vocos vocoder converts codes to 24kHz waveform
AVX-512 SIMD Acceleration
On CPUs with AVX-512 support (for example Intel 11th gen Core or AMD Zen 4 and later; Intel consumer CPUs from 12th gen onward generally ship without it), QORA-TTS automatically uses hand-written AVX-512 SIMD kernels for faster inference:
| Kernel | Technique | Speedup |
|---|---|---|
| Q4 GEMV | permutexvar_ps 16-entry LUT lookup, nibble extract via cvtepu8_epi32 | ~2.5x |
| F16 GEMV | cvtph_ps f16→f32 + fmadd_ps FMA accumulation | ~2.5x |
Detection is automatic at runtime: on CPUs without AVX-512 the engine falls back to scalar code with zero overhead.
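Runtime detection with a scalar fallback could look roughly like the sketch below. It uses the standard library macro `is_x86_feature_detected!`; the `Kernel` enum and all function names are hypothetical, and the AVX-512 branch simply reuses the scalar routine because the real kernels are not shown in this README.

```rust
/// Which GEMV implementation to use for the lifetime of the process.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kernel {
    Avx512,
    Scalar,
}

/// True if the running CPU reports AVX-512 Foundation support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn cpu_has_avx512f() -> bool {
    std::arch::is_x86_feature_detected!("avx512f")
}

/// Non-x86 targets (for example aarch64) never have AVX-512.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn cpu_has_avx512f() -> bool {
    false
}

/// Pick the kernel once at startup so the hot loop has no per-call check.
pub fn select_kernel() -> Kernel {
    if cpu_has_avx512f() {
        Kernel::Avx512
    } else {
        Kernel::Scalar
    }
}

/// Scalar reference GEMV: y = W * x, with W stored row-major (rows x cols).
pub fn gemv_scalar(w: &[f32], x: &[f32], rows: usize) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// Dispatch on the selected kernel.
pub fn gemv(kernel: Kernel, w: &[f32], x: &[f32], rows: usize) -> Vec<f32> {
    match kernel {
        // A real build would call a SIMD kernel for Avx512; this sketch
        // reuses the scalar path so the example stays portable.
        Kernel::Avx512 | Kernel::Scalar => gemv_scalar(w, x, rows),
    }
}
```

Selecting the kernel once and passing it down keeps the feature check out of the inner loop, which is one way to get a fallback that costs nothing per call.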
Smart System Awareness
QORA-TTS detects your system at startup and automatically adjusts generation limits:
QORA-TTS - Pure Rust Text-to-Speech Engine
System: 16384 MB RAM (9856 MB free), 12 threads
| Available RAM | Max Codes | Default | Audio Length |
|---|---|---|---|
| < 4 GB | 200 | 100 | ~8s |
| 4-8 GB | 500 | 300 | ~20s |
| 8-12 GB | 1000 | 500 | ~40s |
| >= 12 GB | 2000 | 500 | ~80s |
Hard caps apply even to explicit user values: if you pass --max-codes 2000 on a system with 6 GB free RAM, it is clamped to 500 automatically. This prevents generation from running for too long on weak systems.
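The tier table and the clamping rule can be sketched as follows. The type and function names are hypothetical, 1 GB is taken as 1024 MB, and the tier boundaries are assumed to be half-open (so exactly 8 GB falls in the 8-12 GB tier); the README does not specify either detail.

```rust
/// Generation limits for one RAM tier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CodeLimits {
    pub max_codes: u32,
    pub default_codes: u32,
}

/// Map free RAM in MB to the limits listed in the table.
pub fn limits_for_free_ram(free_mb: u64) -> CodeLimits {
    const GB: u64 = 1024; // assumption: 1 GB = 1024 MB
    if free_mb < 4 * GB {
        CodeLimits { max_codes: 200, default_codes: 100 }
    } else if free_mb < 8 * GB {
        CodeLimits { max_codes: 500, default_codes: 300 }
    } else if free_mb < 12 * GB {
        CodeLimits { max_codes: 1000, default_codes: 500 }
    } else {
        CodeLimits { max_codes: 2000, default_codes: 500 }
    }
}

/// Apply the hard cap to an explicit --max-codes value,
/// or fall back to the tier default when none was given.
pub fn effective_max_codes(requested: Option<u32>, free_mb: u64) -> u32 {
    let limits = limits_for_free_ram(free_mb);
    match requested {
        Some(n) => n.min(limits.max_codes),
        None => limits.default_codes,
    }
}
```

With 6 GB free, `effective_max_codes(Some(2000), 6 * 1024)` yields 500, matching the example in the text.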
CLI Arguments
| Flag | Default | Description |
|---|---|---|
| --text <text> | "Hello, how are you today?" | Text to synthesize |
| --ref-audio <wav> | - | Required - reference WAV for voice cloning |
| --language <name> | english | Target language |
| --output <path> | output.wav | Output WAV path |
| --max-codes <n> | 500 | Max code timesteps (~n/12.5 seconds) |
| --temperature <f> | 0.8 | Sampling temperature |
| --top-k <n> | 50 | Top-K sampling |
| --seed <n> | random | Random seed for reproducibility |
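To show how --temperature, --top-k, and --seed typically interact, here is a sketch of seeded top-k sampling over a logits vector. Everything in it is an assumption for illustration: the `Rng` type is a small xorshift64* generator chosen to avoid external crates, and `sample_top_k` is a generic sampler, not the one QORA-TTS ships.

```rust
/// Small xorshift64* generator so the sketch needs no external crates.
pub struct Rng(u64);

impl Rng {
    pub fn new(seed: u64) -> Self {
        // Force a non-zero state; xorshift cannot leave the all-zero state.
        Rng((seed ^ 0x9E37_79B9_7F4A_7C15) | 1)
    }

    /// Uniform value in [0, 1) built from the top 24 bits.
    pub fn next_f32(&mut self) -> f32 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        let r = x.wrapping_mul(0x2545_F491_4F6C_DD1D);
        (r >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Sample an index from the `top_k` largest logits after temperature scaling.
pub fn sample_top_k(logits: &[f32], temperature: f32, top_k: usize, rng: &mut Rng) -> usize {
    assert!(!logits.is_empty() && temperature > 0.0 && top_k > 0);

    // Keep the indices of the top_k largest logits, best first.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(top_k.min(logits.len()));

    // Softmax weights over the kept logits (max-subtracted for stability).
    let max = logits[idx[0]];
    let weights: Vec<f32> = idx
        .iter()
        .map(|&i| ((logits[i] - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();

    // Inverse-CDF draw over the unnormalized weights.
    let mut r = rng.next_f32() * total;
    for (&i, &w) in idx.iter().zip(&weights) {
        if r < w {
            return i;
        }
        r -= w;
    }
    idx[idx.len() - 1] // fallback for floating-point rounding
}
```

Lower temperatures concentrate probability on the best candidates, `top_k = 1` reduces to greedy decoding, and reusing the same seed reproduces the same sequence of draws.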
Included Voices
| Female | Male |
|---|---|
| luna, anushri, beth, caty, cherie, ember, faith, hope, jessica, kea, riya, vidhi, velvety | adam, charles, david, hale, heisenberg, joe, peter, quentin, sagar, steven, titan, true |
All voices are 24 kHz WAV files in voices/. For custom voice cloning, use any clean 3-10 second speech recording.
Supported Languages
| Language | Flag Value |
|---|---|
| English | english |
| Chinese | chinese |
| German | german |
| Italian | italian |
| Portuguese | portuguese |
| Spanish | spanish |
| Japanese | japanese |
| Korean | korean |
| French | french |
| Russian | russian |
Performance
Tested on i5-11500 (6C/12T), 16GB RAM, CPU-only:
| Phase | Time | Notes |
|---|---|---|
| Model Load | ~0.8s | From binary, 1559 MB |
| Voice Extraction | ~5-10s | ECAPA-TDNN speaker encoder |
| Prefill | ~3-8s | Text + voice embedding processing |
| Code Generation | ~2.5s/code | Autoregressive, 12.5 codes/sec of audio |
| Code Expansion | ~0.1s | 5-layer predictor, 16 codebooks |
| Audio Decode | ~0.5s/frame | VQ + Vocos vocoder |
| RAM Usage | ~1560 MB | Q4 model in memory |
Example: "Hello, how are you?" (~3 seconds of audio) takes ~15-20 seconds total.
Comparison with 0.6B
| | QORA-TTS 1.7B | QORA-TTS 0.6B |
|---|---|---|
| Parameters | 1.7B | 0.6B |
| Model size | 1559 MB | 971 MB |
| Voice cloning | Yes (ECAPA-TDNN) | No |
| Built-in speakers | 25 (via voice files) | 9 (embedded) |
| Code generation | ~2.5s/code | ~1.5s/code |
| Quality | Higher | Good |
| Best for | Quality + cloning | Speed + simplicity |
Building from Source
cargo build --release
Dependencies
- Language: Pure Rust (2021 edition)
- half: F16 support
- tokenizers: HuggingFace tokenizer
- safetensors: Weight loading
- serde_json: Config parsing
- No ML framework for inference: all matrix ops are hand-written Rust
Cross-Platform Releases
Pre-built binaries are produced via GitHub Actions for Windows x86_64, Linux x86_64, and macOS aarch64.
Model Binary Format (.qora-tts)
Custom binary format for fast loading:
Header: "QTTS" magic + version + format byte
Talker: 28 transformer layers (Q4 quantized)
Predictor: 5 transformer layers + code embeddings
Decoder: VQ codebooks + 8 transformer layers + Vocos vocoder
Speaker Encoder: ECAPA-TDNN (3 Res2Net blocks)
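A loader would start by validating the header. The sketch below assumes a layout of 4 magic bytes, a 4-byte little-endian version, and 1 format byte. Only the "QTTS" magic and the presence of a version and a format byte come from the listing above; the field widths, byte order, and all names are guesses for illustration.

```rust
/// Parsed header fields (widths are assumptions, see below).
#[derive(Debug, PartialEq, Eq)]
pub struct Header {
    pub version: u32,
    pub format: u8,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    TooShort,
    BadMagic,
}

/// Assumed layout: "QTTS" (4 bytes), version (u32, little-endian), format (1 byte).
pub fn parse_header(bytes: &[u8]) -> Result<Header, HeaderError> {
    if bytes.len() < 9 {
        return Err(HeaderError::TooShort);
    }
    if &bytes[0..4] != b"QTTS" {
        return Err(HeaderError::BadMagic);
    }
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(Header { version, format: bytes[8] })
}
```

Checking the magic before reading anything else lets the loader reject a wrong or truncated file immediately instead of misreading 1.5 GB of weights.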