Qwen3.5-2B Voice Assistant

Fine-tuned Qwen3.5-2B for voice assistant / conversational use.

The model is tuned to give short, direct responses with thinking mode disabled.

Trained on curated, concise datasets — all assistant responses are short and natural-sounding, optimized for spoken output rather than written text.

Training Details

Parameter             Value
Base model            unsloth/Qwen3.5-2B
Method                LoRA (rank=16, alpha=32)
LoRA dropout          0.05
Learning rate         0.0001
Epochs                3 (early stopping, patience=4)
Effective batch size  64
Max sequence length   1024
Scheduler             Cosine with 50 warmup steps
Precision             bf16
Thinking mode         Disabled
GPU                   NVIDIA L4 (22 GB)
Framework             Unsloth + TRL SFTTrainer
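The schedule above (linear warmup, then cosine decay) can be sketched in a few lines. This is a stand-alone illustration, not the actual Unsloth/TRL scheduler code; `total_steps=500` is a made-up value for demonstration.

```python
import math

def lr_at_step(step, max_lr=1e-4, warmup_steps=50, total_steps=500):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The learning rate peaks at 1e-4 exactly when warmup ends (step 50) and decays smoothly to zero by the final step.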

Datasets

All datasets are filtered for concise, voice-friendly assistant responses (20–400 chars for general data, 20–500 chars for reasoning). Responses containing markdown formatting (bold, inline code, numbered lists, bullet points, headings) are excluded. Exact-match deduplication is applied across all sources before training.

Dataset                                           Rows   Purpose
OpenAssistant/oasst_top1_2023-08-25               2,388  Real human multi-turn conversations
HuggingFaceTB/everyday-conversations-llama3.1-2k  1,910  Greetings, small talk, basic Q&A
argilla/synthetic-concise-reasoning-sft             535  Short factual reasoning answers
WizardLM/WizardLM_evol_instruct_70k               7,000  Casual single-turn Q&A
Duplicates removed                                1,992
Total (after dedup)                               9,841
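Exact-match deduplication across sources amounts to keeping the first occurrence of each response string. A minimal sketch (the `"response"` field name is an assumption about the intermediate sample format, not the actual pipeline's schema):

```python
def dedup_exact(samples):
    """Drop exact duplicate assistant responses across all sources,
    keeping the first occurrence of each."""
    seen, kept = set(), []
    for s in samples:
        key = s["response"].strip()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept
```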

Filtering Pipeline (v7)

Each assistant response is checked against the following before inclusion:

  • Length: 20–400 chars (general), 20–500 chars (reasoning)
  • No markdown: **bold**, `inline code`, [link](url), # headings all excluded
  • No lists: numbered (1.) and bullet (-, *) patterns excluded at line-start and after colons
  • No list lead-ins: phrases like "the process involves:", "as follows:", "the following" excluded
  • No AI-isms: "certainly!", "as an AI", "in conclusion", "delve" excluded
  • Post-dedup sanity check: % of markdown patterns logged to W&B before training
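The checks above can be sketched as a single predicate. This is an illustrative reimplementation, not the actual v7 pipeline code; the specific regexes and phrase lists are assumptions based on the description.

```python
import re

# Patterns described in the filtering rules above (illustrative, not exhaustive)
MD_PATTERNS = [
    r"\*\*",                  # bold
    r"`",                     # inline code
    r"\[[^\]]+\]\([^)]+\)",   # [link](url)
    r"(?m)^#{1,6}\s",         # headings
    r"(?m)^\s*\d+\.\s",       # numbered lists at line start
    r"(?m)^\s*[-*]\s",        # bullet lists at line start
]
LEAD_INS = ["the process involves:", "as follows:", "the following"]
AI_ISMS = ["certainly!", "as an ai", "in conclusion", "delve"]

def keep_response(text, max_len=400):
    """Return True if a response passes the voice-friendliness checks.

    Use max_len=500 for reasoning data, per the ranges above.
    """
    t = text.strip()
    if not 20 <= len(t) <= max_len:
        return False
    low = t.lower()
    if any(phrase in low for phrase in LEAD_INS + AI_ISMS):
        return False
    return not any(re.search(p, t) for p in MD_PATTERNS)
```

A response like "The sky looks blue because air scatters short wavelengths more." passes, while anything with markdown, list markers, or stock AI phrasing is dropped.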

System Prompt

All training samples include this system prompt:

You are a casual, hands-free voice assistant. Speak in short, punchy sentences as if we are having a real-time conversation. Never use bullet points, markdown, or code. If explaining a complex topic, use a simple, everyday analogy. Respond immediately without any preamble or internal monologue.

Available Formats

Repo                                      Format                            Use case
cowWhySo/qwen3_5_2B_voice_assistant       Merged 16-bit                     Transformers / vLLM / SGLang
cowWhySo/qwen3_5_2B_voice_assistant-lora  LoRA adapters                     Merge with the base model yourself
cowWhySo/qwen3_5_2B_voice_assistant-GGUF  GGUF (q4_k_m, q5_k_m, q8_0, f16)  llama.cpp / Ollama / LM Studio

Usage with llama.cpp

huggingface-cli download cowWhySo/qwen3_5_2B_voice_assistant-GGUF --include "*q4_k_m*" --local-dir .
./llama-cli -m *q4_k_m*.gguf --ctx-size 2048 --temp 0.7 --top-p 0.9

Usage with Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cowWhySo/qwen3_5_2B_voice_assistant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant. Speak in short, punchy sentences as if we are having a real-time conversation. Never use bullet points, markdown, or code. If explaining a complex topic, use a simple, everyday analogy. Respond immediately without any preamble or internal monologue."},
    {"role": "user", "content": "What's the weather like today?"}
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Fine-tuned with Unsloth on an NVIDIA L4 GPU.
