# PersonaPlex 7B Hybrid: Distilled + LLM Reasoning
Distilled NF4 weights with a hybrid architecture: PersonaPlex handles voice I/O, Qwen/Ollama handles reasoning.
## Architecture
```
User Voice → PersonaPlex (ASR+TTS) → Text → Qwen 27B/122B (reasoning) → Text → User
```
PersonaPlex processes audio in real time (full duplex). When it generates a complete sentence, the hybrid agent intercepts that sentence and routes it through a local LLM to produce the response.
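The interception step can be sketched as a small sentence-boundary buffer. This is an illustration only (the function name and regex are assumptions, not code from this repo): streamed text fragments are accumulated until a sentence ends, then handed off to the reasoning LLM.

```python
import re

# Heuristic sentence terminator: ., !, or ? optionally followed by quotes/brackets.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*\s*$')

def intercept_sentences(token_stream):
    """Buffer streamed text fragments and yield complete sentences.

    Hypothetical sketch of the hybrid agent's interception step:
    PersonaPlex emits text incrementally; once a sentence boundary
    appears, the full sentence is released for routing to the LLM.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

For example, `list(intercept_sentences(["Hello", " there", ".", " How", "?"]))` yields two complete sentences.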
## Distillation Results
Trained for 5 epochs on 3,000 samples from the bf16 teacher (73 minutes on an A100).
| Model | Token Match vs bf16 | Output Quality |
|---|---|---|
| bf16 (teacher) | 100% | Reference |
| NF4 raw (before) | 75% | Coherent but divergent |
| NF4 distilled | 90% | Close match to teacher |
Training loss: 0.5823 → 0.0697 (an 88% reduction over 5 epochs).
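The objective implied by the training config below (temperature 2.0, `alpha_kl` 0.7, `alpha_hard` 0.3) matches a standard soft/hard knowledge-distillation loss. A minimal NumPy sketch, assuming the common Hinton-style formulation (the exact loss in `distill_v2.py` may differ):

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, hard_labels,
                 T=2.0, alpha_kl=0.7, alpha_hard=0.3):
    """Weighted sum of temperature-softened KL to the teacher and
    cross-entropy on the ground-truth tokens (sketch, not repo code)."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes
    # comparable across temperatures (Hinton et al.).
    kl = (p_t * (np.log(p_t) - log_p_s)).sum(axis=-1).mean() * T * T
    # Standard cross-entropy against the hard labels.
    log_p = np.log(softmax(student_logits))
    ce = -log_p[np.arange(len(hard_labels)), hard_labels].mean()
    return alpha_kl * kl + alpha_hard * ce
```

When the student's logits equal the teacher's, the KL term vanishes and only the hard-label term contributes.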
## Quick Start
```bash
# Clone the repo
git clone https://github.com/robit-man/personaplex.git
cd personaplex

# Start in hybrid mode (PersonaPlex voice + Qwen reasoning)
source personaplex-setup/venv/bin/activate
export PYTHONPATH="personaplex-setup/moshi:"
export HYBRID_LLM_MODEL="open-agents-qwen35:27b"

# student_best.pt can also be downloaded from this repo
python -m moshi.server --moshi-weight student_best.pt \
  --device cuda --hybrid --host 0.0.0.0

# For Qwen 122B (deeper reasoning, higher latency):
export HYBRID_LLM_MODEL="open-agents-qwen35:122b"
```
## LLM Model Selection
| Model | Latency | Best For |
|---|---|---|
| Qwen 3.5:9B | ~1s | Quick exchanges |
| Qwen 3.5:27B | ~2s | General conversation (recommended) |
| Qwen 3.5:122B | ~5-10s | Complex analysis |
| Nemotron 3 Super 120B | ~5-10s | Tool calling, codebase analysis |
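Since reasoning runs through Ollama, the hybrid agent's LLM call amounts to a chat request against the local Ollama server. The sketch below builds a payload in the shape of Ollama's `/api/chat` API; the function name, default model, and system prompt are illustrative assumptions, not code from this repo.

```python
import os

def build_ollama_request(user_text, system_prompt="Respond directly and naturally."):
    """Build an Ollama /api/chat payload for the configured reasoning model.

    Sketch: the model name comes from HYBRID_LLM_MODEL, falling back to
    the recommended 27B variant (default chosen here for illustration).
    """
    model = os.environ.get("HYBRID_LLM_MODEL", "open-agents-qwen35:27b")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "stream": False,  # wait for the full reply before handing text to TTS
    }
```

Switching to the 122B model is then just a matter of re-exporting `HYBRID_LLM_MODEL`, as shown in the Quick Start.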
## Files
| File | Description |
|---|---|
| `student_best.pt` | Distilled bf16 weights (15.6 GB) |
| `training_log.json` | Training metrics |
| `distill_v2.py` | Distillation training script |
## Anti-Call-Center Training
The prompts used for distillation explicitly enforce:
- No self-naming (model never introduces itself by name)
- No "how can I help" patterns
- Direct, natural responses instead of customer service scripts
Note: the base PersonaPlex model was trained on call-center data, so these tendencies are baked into the base model. The hybrid approach sidesteps this by routing responses through an LLM that follows the prompt correctly.
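As an illustration of the kind of constraint being enforced (this filter is hypothetical, not part of the repo), residual call-center phrasing could also be stripped from replies with a small regex post-filter:

```python
import re

# Hypothetical post-filter mirroring the distillation constraints:
# drop self-naming openers and "how can I help" service scripts.
CALL_CENTER_PATTERNS = [
    r"^\s*(hi|hello)[,!.]?\s*(my name is|this is)\s+\w+[,.]?\s*",  # self-naming
    r"\bhow (can|may) i (help|assist) you( today)?\??",            # service script
]

def strip_call_center_phrases(text):
    """Remove call-center boilerplate, leaving the substantive reply."""
    for pat in CALL_CENTER_PATTERNS:
        text = re.sub(pat, "", text, flags=re.IGNORECASE)
    return text.strip()
```

A reply consisting only of boilerplate reduces to an empty string, while normal responses pass through unchanged.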
## Training Config
```json
{
  "epochs": 5,
  "lr": 5e-6,
  "temperature": 2.0,
  "alpha_kl": 0.7,
  "alpha_hard": 0.3,
  "total_samples": 3000,
  "optimizer": "AdamW",
  "scheduler": "CosineAnnealingLR"
}
```
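For reference, `CosineAnnealingLR` decays the learning rate along a half cosine from the base value toward a minimum. A sketch of the schedule with the config's `lr = 5e-6` (the step count standing in for `T_max` is an assumption, since it is not in the config):

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-6, eta_min=0.0):
    """CosineAnnealingLR schedule (sketch):
    lr(t) = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max)),
    decaying from base_lr at step 0 to eta_min at step total_steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / total_steps))
```

At the halfway point the rate is exactly half the base value; by the final step it has annealed to `eta_min`.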
## License
Same as base: NVIDIA Open Model License.
Built by open-agents-ai.