# PersonaPlex 7B Hybrid: Distilled + LLM Reasoning
Distilled NF4 weights with a hybrid architecture: PersonaPlex handles voice I/O, Qwen/Ollama handles reasoning.
## Architecture
```
User Voice → PersonaPlex (ASR+TTS) → Text → Qwen 27B/122B (reasoning) → Text → User
```
PersonaPlex processes audio in real time (full duplex). When it generates a complete sentence, the hybrid agent intercepts that sentence and routes it through a local LLM to produce the response.
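The interception step can be sketched as a small sentence-boundary buffer. This is an illustration only (the function name and regex are assumptions, not code from this repo): streamed text fragments are accumulated until a sentence ends, then handed off to the reasoning LLM.

```python
import re

# Heuristic sentence terminator: ., !, or ? optionally followed by quotes/brackets.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*\s*$')

def intercept_sentences(token_stream):
    """Buffer streamed text fragments and yield complete sentences.

    Hypothetical sketch of the hybrid agent's interception step:
    PersonaPlex emits text incrementally; once a sentence boundary
    appears, the full sentence is released for routing to the LLM.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

For example, `list(intercept_sentences(["Hello", " there", ".", " How", "?"]))` yields two complete sentences.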
## Distillation Results
Trained for 5 epochs on 3,000 samples from the bf16 teacher (73 minutes on an A100).
| Model | Token Match vs bf16 | Output Quality |
|---|---|---|
| bf16 (teacher) | 100% | Reference |
| NF4 raw (before) | 75% | Coherent but divergent |
| NF4 distilled | 90% | Close match to teacher |
Training loss: 0.5823 → 0.0697 (an 88% reduction over 5 epochs).
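The objective implied by the training config below (temperature 2.0, `alpha_kl` 0.7, `alpha_hard` 0.3) matches a standard soft/hard knowledge-distillation loss. A minimal NumPy sketch, assuming the common Hinton-style formulation (the exact loss in `distill_v2.py` may differ):

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, hard_labels,
                 T=2.0, alpha_kl=0.7, alpha_hard=0.3):
    """Weighted sum of temperature-softened KL to the teacher and
    cross-entropy on the ground-truth tokens (sketch, not repo code)."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes
    # comparable across temperatures (Hinton et al.).
    kl = (p_t * (np.log(p_t) - log_p_s)).sum(axis=-1).mean() * T * T
    # Standard cross-entropy against the hard labels.
    log_p = np.log(softmax(student_logits))
    ce = -log_p[np.arange(len(hard_labels)), hard_labels].mean()
    return alpha_kl * kl + alpha_hard * ce
```

When the student's logits equal the teacher's, the KL term vanishes and only the hard-label term contributes.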
## Quick Start
```bash
# Clone the repo
git clone https://github.com/robit-man/personaplex.git
cd personaplex

# Start in hybrid mode (PersonaPlex voice + Qwen reasoning)
source personaplex-setup/venv/bin/activate
export PYTHONPATH="personaplex-setup/moshi:"
export HYBRID_LLM_MODEL="open-agents-qwen35:27b"

# student_best.pt can also be downloaded from this repo
python -m moshi.server --moshi-weight student_best.pt \
  --device cuda --hybrid --host 0.0.0.0

# For Qwen 122B (deeper reasoning, higher latency):
export HYBRID_LLM_MODEL="open-agents-qwen35:122b"
```
## LLM Model Selection
| Model | Latency | Best For |
|---|---|---|
| Qwen 3.5:9B | ~1s | Quick exchanges |
| Qwen 3.5:27B | ~2s | General conversation (recommended) |
| Qwen 3.5:122B | ~5-10s | Complex analysis |
| Nemotron 3 Super 120B | ~5-10s | Tool calling, codebase analysis |
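Since reasoning runs through Ollama, the hybrid agent's LLM call amounts to a chat request against the local Ollama server. The sketch below builds a payload in the shape of Ollama's `/api/chat` API; the function name, default model, and system prompt are illustrative assumptions, not code from this repo.

```python
import os

def build_ollama_request(user_text, system_prompt="Respond directly and naturally."):
    """Build an Ollama /api/chat payload for the configured reasoning model.

    Sketch: the model name comes from HYBRID_LLM_MODEL, falling back to
    the recommended 27B variant (default chosen here for illustration).
    """
    model = os.environ.get("HYBRID_LLM_MODEL", "open-agents-qwen35:27b")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "stream": False,  # wait for the full reply before handing text to TTS
    }
```

Switching to the 122B model is then just a matter of re-exporting `HYBRID_LLM_MODEL`, as shown in the Quick Start.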
## Files
| File | Description |
|---|---|
| `student_best.pt` | Distilled bf16 weights (15.6 GB) |
| `training_log.json` | Training metrics |
| `distill_v2.py` | Distillation training script |
## Anti-Call-Center Training
The prompts used for distillation explicitly enforce:
- No self-naming (model never introduces itself by name)
- No "how can I help" patterns
- Direct, natural responses instead of customer service scripts
Note: the base PersonaPlex model was trained on call-center data, so these tendencies are baked into the base model. The hybrid approach sidesteps this by routing responses through an LLM that follows the prompt correctly.
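As an illustration of the kind of constraint being enforced (this filter is hypothetical, not part of the repo), residual call-center phrasing could also be stripped from replies with a small regex post-filter:

```python
import re

# Hypothetical post-filter mirroring the distillation constraints:
# drop self-naming openers and "how can I help" service scripts.
CALL_CENTER_PATTERNS = [
    r"^\s*(hi|hello)[,!.]?\s*(my name is|this is)\s+\w+[,.]?\s*",  # self-naming
    r"\bhow (can|may) i (help|assist) you( today)?\??",            # service script
]

def strip_call_center_phrases(text):
    """Remove call-center boilerplate, leaving the substantive reply."""
    for pat in CALL_CENTER_PATTERNS:
        text = re.sub(pat, "", text, flags=re.IGNORECASE)
    return text.strip()
```

A reply consisting only of boilerplate reduces to an empty string, while normal responses pass through unchanged.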
## Training Config
```json
{
  "epochs": 5,
  "lr": 5e-6,
  "temperature": 2.0,
  "alpha_kl": 0.7,
  "alpha_hard": 0.3,
  "total_samples": 3000,
  "optimizer": "AdamW",
  "scheduler": "CosineAnnealingLR"
}
```
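For reference, `CosineAnnealingLR` decays the learning rate along a half cosine from the base value toward a minimum. A sketch of the schedule with the config's `lr = 5e-6` (the step count standing in for `T_max` is an assumption, since it is not in the config):

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-6, eta_min=0.0):
    """CosineAnnealingLR schedule (sketch):
    lr(t) = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max)),
    decaying from base_lr at step 0 to eta_min at step total_steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / total_steps))
```

At the halfway point the rate is exactly half the base value; by the final step it has annealed to `eta_min`.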
## License
Same as base: NVIDIA Open Model License.
Built by open-agents-ai.