# Distil-Qwen3-0.6B-Voice-Assistant-Banking
A fine-tuned Qwen3-0.6B model for multi-turn intent classification and slot extraction in a banking voice assistant. Trained via knowledge distillation from a 120B teacher model, this 0.6B model delivers 90.9% tool-call accuracy (exceeding its teacher) while running at ~40 ms per inference, enabling real-time voice pipelines with under 400 ms total latency.
For the GGUF version (for llama.cpp deployment), see distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf.
## Results
| Model | Parameters | Tool Call Accuracy | ROUGE |
|---|---|---|---|
| GPT-oss-120B (teacher) | 120B | 87.5% | 94.4% |
| This model (tuned) | 0.6B | 90.9% | 97.8% |
| Qwen3-0.6B (base) | 0.6B | 48.7% | 66.3% |
The fine-tuned model exceeds the 120B teacher on tool-call accuracy while being 200x smaller. The base Qwen3-0.6B reaches only 48.7%, so fine-tuning is essential for reliable multi-turn tool calling.
## Quick Start

### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distil-labs/distil-qwen3-0.6b-voice-assistant-banking")
tokenizer = AutoTokenizer.from_pretrained("distil-labs/distil-qwen3-0.6b-voice-assistant-banking")

# Tool schemas passed to the chat template (subset shown; the model supports 14 functions).
TOOLS = [
    {"type": "function", "function": {"name": "check_balance", "description": "Check the balance of a bank account", "parameters": {"type": "object", "properties": {"account_type": {"type": "string", "enum": ["checking", "savings", "credit"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "transfer_money", "description": "Transfer money between accounts", "parameters": {"type": "object", "properties": {"amount": {"type": "number"}, "from_account": {"type": "string", "enum": ["checking", "savings"]}, "to_account": {"type": "string", "enum": ["checking", "savings"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "cancel_card", "description": "Cancel a bank card", "parameters": {"type": "object", "properties": {"card_type": {"type": "string", "enum": ["credit", "debit"]}, "card_last_four": {"type": "string"}, "reason": {"type": "string", "enum": ["lost", "stolen", "damaged", "other"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "intent_unclear", "description": "Use when the user's intent cannot be determined", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "greeting", "description": "User is greeting", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "goodbye", "description": "User is ending the conversation", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
]

messages = [
    {"role": "system", "content": "You are a tool-calling model working on:\n<task_description>You are a voice assistant for BankCo, a retail bank. The user input is automatically transcribed speech from an ASR system, so it may contain transcription errors, homophones, filler words, or unusual phrasings. Parse the user's request and return the appropriate function call despite any transcription artifacts. If you can identify the intent, call the matching function. Extract any mentioned argument values; omit arguments not mentioned. If you cannot understand what the user wants, call intent_unclear(). Use conversation history to understand context from previous turns.</task_description>\n\nRespond to the conversation history by generating an appropriate tool call that satisfies the user request. Generate only the tool call according to the provided tool schema, do not generate anything else. Always respond with a tool call.\n\n"},
    {"role": "user", "content": "I need to cancel my credit card ending in 1234"},
]

text = tokenizer.apply_chat_template(
    messages, tools=TOOLS, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt")
# Greedy decoding (do_sample=False); transformers rejects temperature=0 with sampling enabled.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# <tool_call>
# {"name": "cancel_card", "arguments": {"card_type": "credit", "card_last_four": "1234"}}
# </tool_call>
```
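The generated text wraps the call in `<tool_call>` tags. A small helper (a sketch, not part of the model's API) can extract the JSON for downstream routing, falling back to the closed-set `intent_unclear` intent when the output is malformed:

```python
import json
import re

def parse_tool_call(text: str) -> dict:
    """Extract the JSON payload from a <tool_call>...</tool_call> block."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        # Malformed output: fall back to the in-taxonomy "unclear" intent.
        return {"name": "intent_unclear", "arguments": {}}
    return json.loads(match.group(1))

raw = ('<tool_call>\n'
       '{"name": "cancel_card", "arguments": {"card_type": "credit", "card_last_four": "1234"}}\n'
       '</tool_call>')
call = parse_tool_call(raw)
print(call["name"], call["arguments"])
```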
### Using with the Demo App
This model powers the BankCo Voice Assistant demo — a full ASR -> SLM -> TTS voice pipeline that runs locally.
### Using llama.cpp

Download the GGUF version and serve it with llama.cpp:

```bash
huggingface-cli download distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf \
  --local-dir distil-model
llama-server \
  --model distil-model/Qwen3-voice-assistant-slm-0.6B.gguf \
  --port 8000 \
  --jinja
```
Then query via the OpenAI-compatible API at http://127.0.0.1:8000/v1.
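The request body follows the standard chat-completions shape. A minimal sketch of building it (the `model` field value is a placeholder; llama-server accepts whatever name it reports at `/v1/models`):

```python
import json

def build_chat_request(user_text: str, tools: list) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload for llama-server."""
    return {
        "model": "distil-qwen3-0.6b-voice-assistant-banking",  # placeholder model name
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "temperature": 0,
    }

payload = build_chat_request("check my savings balance", tools=[])
# POST json.dumps(payload) to http://127.0.0.1:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```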
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-0.6B |
| Parameters | 0.6 billion |
| Architecture | Qwen3ForCausalLM |
| Context Length | 40,960 tokens |
| Precision | bfloat16 |
| Training Data | 77 seed conversations, synthetically expanded |
| Teacher Model | GPT-oss-120B |
| Task | Multi-turn tool calling (closed book) |
## Training
This model was trained using the Distil Labs platform:
- Seed Data: 77 hand-written multi-turn conversations covering 14 banking functions, including ASR transcription artifacts (filler words, homophones, word splits)
- Synthetic Expansion: Expanded to thousands of examples using a 120B teacher model
- Fine-tuning: Multi-turn tool calling distillation on Qwen3-0.6B
## What the Model Does
The model acts as a function caller for a banking voice assistant. Given a user utterance (potentially with ASR errors) and conversation history, it outputs a structured tool call:
```
User:  "Trans fur 500 from my savin to checkin"
Model: {"name": "transfer_money", "arguments": {"amount": 500, "from_account": "savings", "to_account": "checking"}}

User:  "I wanna cancel my card"
Model: {"name": "cancel_card", "arguments": {}}

User:  "What about that thing from last week"
Model: {"name": "intent_unclear", "arguments": {}}
```
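Downstream, the parsed call can be routed through a simple dispatch table. The handler bodies below are hypothetical stubs, shown only to illustrate the pattern of mapping closed-set function names to application code:

```python
def check_balance(account_type: str = "checking") -> str:
    return f"Balance lookup for {account_type}"  # hypothetical backend stub

def transfer_money(amount: float = 0, from_account: str = "", to_account: str = "") -> str:
    return f"Transfer {amount} from {from_account} to {to_account}"

def intent_unclear() -> str:
    return "Sorry, could you rephrase that?"

HANDLERS = {
    "check_balance": check_balance,
    "transfer_money": transfer_money,
    "intent_unclear": intent_unclear,
}

def dispatch(call: dict) -> str:
    # Unknown function names fall back to intent_unclear, keeping the pipeline closed-set.
    handler = HANDLERS.get(call["name"], intent_unclear)
    try:
        return handler(**call.get("arguments", {}))
    except TypeError:  # unexpected argument keys from a bad generation
        return intent_unclear()

print(dispatch({"name": "transfer_money",
                "arguments": {"amount": 500, "from_account": "savings", "to_account": "checking"}}))
# Transfer 500 from savings to checking
```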
## Supported Functions
The model handles 14 banking operations:
| Function | Description |
|---|---|
| `check_balance` | Check account balance |
| `get_statement` | Request account statement |
| `transfer_money` | Transfer between accounts |
| `pay_bill` | Pay a bill |
| `cancel_card` | Cancel a card |
| `replace_card` | Request replacement card |
| `activate_card` | Activate a new card |
| `report_fraud` | Report fraudulent transaction |
| `reset_pin` | Reset card PIN |
| `speak_to_human` | Connect to a human agent |
| `greeting` | Conversation start |
| `goodbye` | Conversation end |
| `thank_you` | Express gratitude |
| `intent_unclear` | Cannot determine intent |
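Because the taxonomy is closed, a caller can guard against out-of-set names before executing anything. A minimal sketch (not shipped with the model) that coerces unknown names to `intent_unclear`:

```python
# The 14 functions the model was trained on.
SUPPORTED_FUNCTIONS = {
    "check_balance", "get_statement", "transfer_money", "pay_bill",
    "cancel_card", "replace_card", "activate_card", "report_fraud",
    "reset_pin", "speak_to_human", "greeting", "goodbye",
    "thank_you", "intent_unclear",
}

def validate_call(call: dict) -> dict:
    """Coerce any out-of-taxonomy call to intent_unclear before executing it."""
    if call.get("name") in SUPPORTED_FUNCTIONS:
        return call
    return {"name": "intent_unclear", "arguments": {}}

print(validate_call({"name": "wire_money", "arguments": {}}))
# {'name': 'intent_unclear', 'arguments': {}}
```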
## Use Cases
- Real-time voice banking assistants (ASR -> SLM -> TTS pipeline)
- Text-based banking chatbots with structured intent routing
- Edge deployment for on-device voice processing
- Any multi-turn tool calling task with bounded intent taxonomy
## Limitations

- Trained on English banking intents only
- Covers 14 specific banking functions; not a general-purpose tool caller
- ASR artifact handling is tuned for common speech-to-text errors, not every possible transcription format
- 90.9% accuracy still means roughly 1 in 10 function calls will be incorrect
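Given the residual error rate, one common mitigation is to require explicit user confirmation before executing irreversible operations. A hypothetical sketch (the confirmation policy below is illustrative, not part of the model):

```python
# Hypothetical policy: which functions need a spoken "yes" before execution.
REQUIRES_CONFIRMATION = {"transfer_money", "cancel_card", "replace_card", "reset_pin"}

def needs_confirmation(call: dict) -> bool:
    return call["name"] in REQUIRES_CONFIRMATION

def confirmation_prompt(call: dict) -> str:
    """Render the parsed call back to the user for read-back confirmation."""
    args = ", ".join(f"{k}={v}" for k, v in call.get("arguments", {}).items())
    return f"Please confirm: {call['name']}({args})?"

call = {"name": "transfer_money",
        "arguments": {"amount": 500, "from_account": "savings", "to_account": "checking"}}
if needs_confirmation(call):
    print(confirmation_prompt(call))
# Please confirm: transfer_money(amount=500, from_account=savings, to_account=checking)?
```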
## License
This model is released under the Apache 2.0 license.
## Citation

```bibtex
@misc{distil-qwen3-0.6b-voice-assistant-banking,
  author    = {Distil Labs},
  title     = {Distil-Qwen3-0.6B-Voice-Assistant-Banking: A Fine-tuned SLM for Banking Voice Assistants},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/distil-labs/distil-qwen3-0.6b-voice-assistant-banking}
}
```