Qwen3.5-2B Voice Assistant

Fine-tuned Qwen3.5-2B for voice assistant / conversational use.

The model is tuned to give short, direct responses with thinking mode disabled.

Trained on curated, concise datasets — all assistant responses are short and natural-sounding, optimized for spoken output rather than written text.

Training Details

Parameter             Value
Base model            unsloth/Qwen3.5-2B
Method                LoRA (rank=16, alpha=32)
LoRA dropout          0.05
Learning rate         0.0001
Epochs                3 (early stopping, patience=4)
Effective batch size  64
Max sequence length   1024
Scheduler             Cosine with 50 warmup steps
Precision             bf16
Thinking mode         Disabled
GPU                   NVIDIA L4 (22 GB)
Framework             Unsloth + TRL SFTTrainer
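The schedule above (linear warmup, then cosine decay) can be sketched in a few lines. This is a stand-alone illustration, not the actual Unsloth/TRL scheduler code; `total_steps=500` is a made-up value for demonstration.

```python
import math

def lr_at_step(step, max_lr=1e-4, warmup_steps=50, total_steps=500):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The learning rate peaks at 1e-4 exactly when warmup ends (step 50) and decays smoothly to zero by the final step.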

Datasets

All datasets are filtered for concise, voice-friendly assistant responses (20–400 chars for general data, 20–500 chars for reasoning). Responses containing markdown formatting (bold, inline code, numbered lists, bullet points, headings) are excluded. Exact-match deduplication is applied across all sources before training.

Dataset                                           Rows   Purpose
OpenAssistant/oasst_top1_2023-08-25               2,388  Real human multi-turn conversations
HuggingFaceTB/everyday-conversations-llama3.1-2k  1,910  Greetings, small talk, basic Q&A
argilla/synthetic-concise-reasoning-sft             535  Short factual reasoning answers
WizardLM/WizardLM_evol_instruct_70k               7,000  Casual single-turn Q&A
Duplicates removed                                1,992
Total (after dedup)                               9,841
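Exact-match deduplication across sources amounts to keeping the first occurrence of each response string. A minimal sketch (the `"response"` field name is an assumption about the intermediate sample format, not the actual pipeline's schema):

```python
def dedup_exact(samples):
    """Drop exact duplicate assistant responses across all sources,
    keeping the first occurrence of each."""
    seen, kept = set(), []
    for s in samples:
        key = s["response"].strip()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept
```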

Filtering Pipeline (v7)

Each assistant response is checked against the following before inclusion:

  • Length: 20–400 chars (general), 20–500 chars (reasoning)
  • No markdown: **bold**, `inline code`, [link](url), # headings all excluded
  • No lists: numbered (1.) and bullet (-, *) patterns excluded at line-start and after colons
  • No list lead-ins: phrases like "the process involves:", "as follows:", "the following" excluded
  • No AI-isms: "certainly!", "as an AI", "in conclusion", "delve" excluded
  • Post-dedup sanity check: % of markdown patterns logged to W&B before training
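The checks above can be sketched as a single predicate. This is an illustrative reimplementation, not the actual v7 pipeline code; the specific regexes and phrase lists are assumptions based on the description.

```python
import re

# Patterns described in the filtering rules above (illustrative, not exhaustive)
MD_PATTERNS = [
    r"\*\*",                  # bold
    r"`",                     # inline code
    r"\[[^\]]+\]\([^)]+\)",   # [link](url)
    r"(?m)^#{1,6}\s",         # headings
    r"(?m)^\s*\d+\.\s",       # numbered lists at line start
    r"(?m)^\s*[-*]\s",        # bullet lists at line start
]
LEAD_INS = ["the process involves:", "as follows:", "the following"]
AI_ISMS = ["certainly!", "as an ai", "in conclusion", "delve"]

def keep_response(text, max_len=400):
    """Return True if a response passes the voice-friendliness checks.

    Use max_len=500 for reasoning data, per the ranges above.
    """
    t = text.strip()
    if not 20 <= len(t) <= max_len:
        return False
    low = t.lower()
    if any(phrase in low for phrase in LEAD_INS + AI_ISMS):
        return False
    return not any(re.search(p, t) for p in MD_PATTERNS)
```

A response like "The sky looks blue because air scatters short wavelengths more." passes, while anything with markdown, list markers, or stock AI phrasing is dropped.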

System Prompt

All training samples include this system prompt:

You are a casual, hands-free voice assistant. Speak in short, punchy sentences as if we are having a real-time conversation. Never use bullet points, markdown, or code. If explaining a complex topic, use a simple, everyday analogy. Respond immediately without any preamble or internal monologue.

Available Formats

Repo                                      Format                            Use case
cowWhySo/qwen3_5_2B_voice_assistant       Merged 16-bit                     Transformers / vLLM / SGLang
cowWhySo/qwen3_5_2B_voice_assistant-lora  LoRA adapters                     Merge with the base model yourself
cowWhySo/qwen3_5_2B_voice_assistant-GGUF  GGUF (q4_k_m, q5_k_m, q8_0, f16)  llama.cpp / Ollama / LM Studio

Usage with llama.cpp

huggingface-cli download cowWhySo/qwen3_5_2B_voice_assistant-GGUF --include "*q4_k_m*" --local-dir .
./llama-cli -m *q4_k_m*.gguf --ctx-size 2048 --temp 0.7 --top-p 0.9

Usage with Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cowWhySo/qwen3_5_2B_voice_assistant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant. Speak in short, punchy sentences as if we are having a real-time conversation. Never use bullet points, markdown, or code. If explaining a complex topic, use a simple, everyday analogy. Respond immediately without any preamble or internal monologue."},
    {"role": "user", "content": "What's the weather like today?"}
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Fine-tuned with Unsloth on an NVIDIA L4 GPU.
