# MOSS-TTS-Local-Transformer GGUF
First known GGUF quantization of MOSS-TTS-Local-Transformer (1.7B parameters).
The OpenMOSS team has only published GGUFs for their 8B Delay model. This repo provides quantized GGUFs for the lightweight 1.7B Local variant, suitable for edge/on-device TTS deployment.
## Available Quantizations
| File | Quant | Size | BPW | Notes |
|---|---|---|---|---|
| `moss-tts-local-Q2_K.gguf` | Q2_K | 1.1 GB | 3.31 | Smallest, most aggressive |
| `moss-tts-local-Q3_K_M.gguf` | Q3_K_M | 1.4 GB | 4.10 | Recommended balance |
| `moss-tts-local-Q4_K_M.gguf` | Q4_K_M | 1.9 GB | 5.06 | Same tier as the OpenMOSS 8B GGUF |
## Architecture
MOSS-TTS-Local is a two-component TTS system:
```
Text → [Qwen3 backbone (1.7B, 28 layers)] → audio tokens → [ONNX decoder] → 24 kHz PCM
                    │
                    ▼
       [Local transformer (4 layers)]
                    │
                    ▼
        [Bridge MLPs (33 codebooks)]
```
- Backbone: Qwen3 architecture (hidden_size=2048, 28 layers, 16 attention heads, GQA 8 KV heads)
- Audio components: 32 codebook embeddings, 33 LM heads, 33 layer norms, 4-layer local transformer, 33+1 bridge MLPs
- Total tensors: 555 (after skipping duplicate text embedding)
- Audio decoder: MOSS-Audio-Tokenizer-ONNX (encoder 1.5MB + decoder 14MB)
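To make the multi-head layout concrete, here is a minimal NumPy sketch of how 33 per-codebook LM heads can project a single backbone hidden state into per-codebook logits. The hidden size of 2048 matches the Qwen3 config above; the audio vocabulary size of 1024 is a made-up placeholder, and the function name is illustrative, not the model's actual API:

```python
import numpy as np

# Illustrative sketch only: 33 LM heads, one per codebook stream.
HIDDEN = 2048        # matches the Qwen3 backbone hidden_size
N_HEADS = 33         # 33 LM heads as listed above
VOCAB = 1024         # hypothetical audio-codebook vocabulary size

rng = np.random.default_rng(0)
lm_heads = [rng.standard_normal((HIDDEN, VOCAB)) * 0.02 for _ in range(N_HEADS)]

def codebook_logits(hidden_state):
    """Project one hidden state through every per-codebook head."""
    return np.stack([hidden_state @ w for w in lm_heads])  # shape (33, VOCAB)

h = rng.standard_normal(HIDDEN)
logits = codebook_logits(h)
print(logits.shape)  # (33, 1024)
```

In the real model the local transformer and bridge MLPs sit between the backbone hidden state and these heads; the sketch only shows the fan-out into 33 parallel projections.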
## How This Was Converted
The standard convert_hf_to_gguf.py (both upstream llama.cpp and the OpenMOSS fork) fails on the Local variant because 213 audio-specific tensors lack GGUF name mappings. We added:
- 19 new `MODEL_TENSOR` entries in `gguf-py/gguf/constants.py` for local transformer layers, bridge MLPs, and audio layer norms
- ~70 lines of tensor mapping in `convert_hf_to_gguf.py` handling:
  - Embedding offset: `model.embedding_list.0` = text (skipped), indices `1`-`32` = audio codebooks 0-31
  - Local transformer with layer indexing
  - Indexed bridge MLPs (33 per projection type)
- Quantizer must be built from the OpenMOSS fork (upstream doesn't know the `moss-tts-delay` architecture)
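The embedding-offset rule above can be sketched as a small renaming helper. This is a hypothetical illustration of the described convention, not the actual code in `convert_hf_to_gguf.py`; the output name `audio_embd.N.weight` is a made-up GGUF-style placeholder:

```python
def map_embedding_name(hf_name):
    """Illustrative mapping for model.embedding_list.* tensors:
    index 0 is the duplicate text embedding (skipped, returns None);
    indices 1-32 become audio codebook embeddings 0-31."""
    prefix = "model.embedding_list."
    if not hf_name.startswith(prefix):
        return hf_name  # other tensors pass through unchanged here
    idx = int(hf_name[len(prefix):].split(".")[0])
    if idx == 0:
        return None  # skip the duplicate text embedding
    return f"audio_embd.{idx - 1}.weight"  # hypothetical GGUF name

print(map_embedding_name("model.embedding_list.0.weight"))  # None
print(map_embedding_name("model.embedding_list.5.weight"))  # audio_embd.4.weight
```

Skipping index 0 is what brings the tensor count to 555, since the text embedding is already stored once by the standard Qwen3 mapping.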
Full conversion guide: see the accompanying PR on OpenMOSS/llama.cpp.
## Inference
### Current Status
The GGUF files contain all 555 tensors correctly. However, the `llama-moss-tts` C++ binary currently only supports the 8B Delay model's tensor layout (375 tensors). The Local variant's extra tensors need C++ loading and inference code in `moss-tts-delay.cpp`.
PyTorch inference works: audio quality was verified with the original safetensors, which produce clean English speech at 24 kHz.
### Audio Quality Verification
Tested with PyTorch bf16 on an RTX 3090:
- "Hello, I am LUI, your personal AI assistant. How can I help you today?" → 5.8 s of clean audio
- "Your stress has been elevated for the past hour..." → 5.6 s of clean audio
- "Playing rain sounds. Notifications muted..." → 7.0 s of clean audio
## Future: On-Device Mobile TTS
Target: an Android phone with 8 GB RAM, using hot-swap loading:
1. Unload the chat LLM (Qwen 2.5 1.5B, 940 MB)
2. Load MOSS-TTS-Local Q3_K_M (1.4 GB) via llama.cpp
3. Generate audio tokens
4. Decode with the ONNX audio decoder (15.5 MB)
5. Play the audio
6. Unload the TTS model and reload the chat LLM
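Because the two models are swapped rather than co-resident, peak RAM is the larger model plus the always-small decoder, not the sum. A rough budget check, using the sizes listed above plus an assumed runtime overhead (the 512 MB headroom figure is a guess, not a measurement):

```python
# Rough RAM budget for the hot-swap plan; sizes in MB from the list above.
CHAT_LLM_MB = 940      # Qwen 2.5 1.5B
TTS_MB = 1400          # MOSS-TTS-Local Q3_K_M
DECODER_MB = 15.5      # ONNX audio decoder
OVERHEAD_MB = 512      # assumed headroom for KV cache, buffers, app, OS

# Only one of the two big models is resident at any time.
peak_mb = max(CHAT_LLM_MB, TTS_MB) + DECODER_MB + OVERHEAD_MB
print(f"peak ≈ {peak_mb:.0f} MB")  # ≈ 1928 MB, well inside an 8 GB budget
```

The comfortable margin is why hot-swapping is attractive here: even doubling the overhead assumption leaves several gigabytes free for the rest of the system.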
## Reproduction
```bash
# Prerequisites
pip install torch transformers==5.0.0 numpy sentencepiece gguf
git clone https://github.com/OpenMOSS/llama.cpp.git moss-fork
# Apply tensor mapping patches (see PR), then:

# Download the model
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('OpenMOSS-Team/MOSS-TTS-Local-Transformer', local_dir='moss-tts-local')"

# Convert to bf16 GGUF
python3 moss-fork/convert_hf_to_gguf.py moss-tts-local/ --outfile moss-tts-local-bf16.gguf --outtype bf16

# Quantize (must use the fork's quantizer)
cd moss-fork && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . --target llama-quantize -j$(nproc)
./bin/llama-quantize ../../moss-tts-local-bf16.gguf ../../moss-tts-local-Q3_K_M.gguf Q3_K_M
```
## Credits
- Model: OpenMOSS-Team/MOSS-TTS-Local-Transformer (Apache 2.0)
- Conversion infrastructure: OpenMOSS/llama.cpp fork
- GGUF conversion + quantization: obirije
- Context: Built for LUI, an open-source Android AI assistant with on-device TTS