Distil-Qwen3-0.6B-Voice-Assistant-Banking (GGUF)

GGUF version of distil-labs/distil-qwen3-0.6b-voice-assistant-banking for deployment with llama.cpp, Ollama, and other GGUF-compatible runtimes.

A fine-tuned Qwen3-0.6B model for multi-turn intent classification and slot extraction in a banking voice assistant. Trained using knowledge distillation from a 120B teacher model, this 0.6B model delivers 90.9% tool call accuracy — exceeding the teacher — while running at ~40ms inference, enabling real-time voice pipelines under 400ms total latency.

For the safetensors version (for Transformers / vLLM), see distil-labs/distil-qwen3-0.6b-voice-assistant-banking.

Results

Model	Parameters	Tool Call Accuracy	ROUGE
GPT-oss-120B (teacher)	120B	87.5%	94.4%
This model (tuned)	0.6B	90.9%	97.8%
Qwen3-0.6B (base)	0.6B	48.7%	66.3%

The fine-tuned model exceeds the 120B teacher on tool call accuracy while being 200x smaller. The base Qwen3-0.6B achieves only 48.7% — fine-tuning is essential for reliable multi-turn conversations.

Quick Start

Using llama.cpp

huggingface-cli download distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf \
    --local-dir distil-model

llama-server \
    --model distil-model/Qwen3-voice-assistant-slm-0.6B.gguf \
    --port 8000 \
    --jinja

Then query via the OpenAI-compatible API:

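from openai import OpenAI

# Point the OpenAI client at the local llama-server (the API key is not checked)
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

# A subset of the model's 14 supported functions (see Supported Functions)
TOOLS = [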
    {"type": "function", "function": {"name": "check_balance", "description": "Check the balance of a bank account", "parameters": {"type": "object", "properties": {"account_type": {"type": "string", "enum": ["checking", "savings", "credit"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "transfer_money", "description": "Transfer money between accounts", "parameters": {"type": "object", "properties": {"amount": {"type": "number"}, "from_account": {"type": "string", "enum": ["checking", "savings"]}, "to_account": {"type": "string", "enum": ["checking", "savings"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "cancel_card", "description": "Cancel a bank card", "parameters": {"type": "object", "properties": {"card_type": {"type": "string", "enum": ["credit", "debit"]}, "card_last_four": {"type": "string"}, "reason": {"type": "string", "enum": ["lost", "stolen", "damaged", "other"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "intent_unclear", "description": "Use when the user's intent cannot be determined", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "greeting", "description": "User is greeting", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "goodbye", "description": "User is ending the conversation", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
]

response = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "system", "content": "You are a tool-calling model working on:\n<task_description>You are a voice assistant for BankCo, a retail bank. The user input is automatically transcribed speech from an ASR system, so it may contain transcription errors, homophones, filler words, or unusual phrasings. Parse the user's request and return the appropriate function call despite any transcription artifacts. If you can identify the intent, call the matching function. Extract any mentioned argument values; omit arguments not mentioned. If you cannot understand what the user wants, call intent_unclear(). Use conversation history to understand context from previous turns.</task_description>\n\nRespond to the conversation history by generating an appropriate tool call that satisfies the user request. Generate only the tool call according to the provided tool schema, do not generate anything else. Always respond with a tool call.\n\n"},
        {"role": "user", "content": "I need to cancel my credit card ending in 1234"},
    ],
    tools=TOOLS,
    tool_choice="required",
    temperature=0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# Take the first tool call; fn.arguments is a JSON-encoded string
fn = response.choices[0].message.tool_calls[0].function
print(f"{fn.name}({fn.arguments})")
# cancel_card({"card_type": "credit", "card_last_four": "1234"})

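Using llama.cpp from the model directory

huggingface-cli download distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf \
    --local-dir distil-model
cd distil-model
llama-server --model Qwen3-voice-assistant-slm-0.6B.gguf --port 8000 --jinja

The `--jinja` flag makes llama-server apply the chat template embedded in the GGUF file, which its OpenAI-compatible tool calling relies on.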

Using the Demo App

This model powers the BankCo Voice Assistant demo — a full ASR -> SLM -> TTS voice pipeline that runs locally.

Model Details

Property	Value
Base Model	Qwen/Qwen3-0.6B
Parameters	0.6 billion
Architecture	Qwen3ForCausalLM
Format	GGUF (F16)
File Size	~1.1 GB
Context Length	40,960 tokens
Training Data	77 seed conversations, synthetically expanded
Teacher Model	GPT-oss-120B
Task	Multi-turn tool calling (closed book)
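As a rough consistency check on the File Size row: F16 stores each weight in 2 bytes, so the nominal 0.6 billion parameters come to about 1.2 × 10^9 bytes. The sketch below uses the rounded parameter count from the table, so it is only an estimate; the real file also carries GGUF metadata and tokenizer data.

```python
# Rough size estimate for an F16 GGUF file.
# Assumption: the nominal 0.6B parameter count from the table; the real
# file also includes metadata and tokenizer data, so this is approximate.
PARAMS = 0.6e9
BYTES_PER_PARAM_F16 = 2  # F16 = 16-bit floating point

size_bytes = PARAMS * BYTES_PER_PARAM_F16
size_gb = size_bytes / 1e9      # decimal gigabytes
size_gib = size_bytes / 2**30   # binary gibibytes

print(f"{size_gb:.2f} GB / {size_gib:.2f} GiB")
```

The binary figure (about 1.12 GiB) lines up with the ~1.1 GB listed in the table, assuming that figure is reported in binary units.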

Training

This model was trained using the Distil Labs platform:

Seed Data: 77 hand-written multi-turn conversations covering 14 banking functions, including ASR transcription artifacts (filler words, homophones, word splits)
Synthetic Expansion: Expanded to thousands of examples using a 120B teacher model
Fine-tuning: Multi-turn tool calling distillation on Qwen3-0.6B

What the Model Does

The model acts as a function caller for a banking voice assistant. Given a user utterance (potentially with ASR errors) and conversation history, it outputs a structured tool call:

User: "Trans fur 500 from my savin to checkin"
Model: {"name": "transfer_money", "arguments": {"amount": 500, "from_account": "savings", "to_account": "checking"}}

User: "I wanna cancel my card"
Model: {"name": "cancel_card", "arguments": {}}

User: "What about that thing from last week"
Model: {"name": "intent_unclear", "arguments": {}}
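Because the model omits arguments the user did not mention (as in the `cancel_card` example with empty arguments), the surrounding application has to decide when to ask a follow-up question. The sketch below parses a tool call in the JSON form shown above and lists the slots still missing. The `REQUIRED_SLOTS` mapping and the `missing_slots` helper are hypothetical: the tool schemas in the Quick Start mark every argument as optional, so which slots an app insists on is a design choice, not something the model defines.

```python
import json

# Hypothetical application policy: slots needed before executing a call.
# The model's schemas mark all arguments optional; this mapping is an
# illustrative assumption only.
REQUIRED_SLOTS = {
    "transfer_money": ["amount", "from_account", "to_account"],
    "cancel_card": ["card_type"],
}


def missing_slots(raw_call: str) -> tuple[str, list[str]]:
    """Parse a tool call like the examples above and return
    (function name, required slots the user has not provided yet)."""
    call = json.loads(raw_call)
    name = call["name"]
    args = call.get("arguments", {})
    needed = REQUIRED_SLOTS.get(name, [])
    return name, [slot for slot in needed if slot not in args]


print(missing_slots('{"name": "cancel_card", "arguments": {}}'))
print(missing_slots(
    '{"name": "transfer_money", "arguments": '
    '{"amount": 500, "from_account": "savings", "to_account": "checking"}}'
))
```

An empty list means the call can proceed; otherwise the app can prompt for the missing values and send the updated conversation history back to the model on the next turn.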

Supported Functions

The model handles 14 functions, covering banking operations and conversational intents:

Function	Description
`check_balance`	Check account balance
`get_statement`	Request account statement
`transfer_money`	Transfer between accounts
`pay_bill`	Pay a bill
`cancel_card`	Cancel a card
`replace_card`	Request replacement card
`activate_card`	Activate a new card
`report_fraud`	Report fraudulent transaction
`reset_pin`	Reset card PIN
`speak_to_human`	Connect to a human agent
`greeting`	Conversation start
`goodbye`	Conversation end
`thank_you`	Express gratitude
`intent_unclear`	Cannot determine intent
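Since the taxonomy is closed, a caller can guard against an unexpected or malformed function name by checking it against the 14 names above and falling back to `intent_unclear`. This is a minimal sketch; the `route` helper and its fallback policy are illustrative assumptions, not part of the model.

```python
# The 14 function names from the Supported Functions table.
SUPPORTED_FUNCTIONS = frozenset({
    "check_balance", "get_statement", "transfer_money", "pay_bill",
    "cancel_card", "replace_card", "activate_card", "report_fraud",
    "reset_pin", "speak_to_human", "greeting", "goodbye",
    "thank_you", "intent_unclear",
})


def route(name: str, arguments: dict) -> tuple[str, dict]:
    """Hypothetical guard: pass known calls through unchanged and map
    anything outside the closed taxonomy to intent_unclear()."""
    if name in SUPPORTED_FUNCTIONS:
        return name, arguments
    return "intent_unclear", {}


print(route("reset_pin", {}))
print(route("open_account", {"type": "savings"}))  # not a supported function
```

Mapping unknown names to `intent_unclear` keeps the application's behavior consistent with how the model itself handles requests it cannot interpret.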

Use Cases

Real-time voice banking assistants (ASR -> SLM -> TTS pipeline)
Text-based banking chatbots with structured intent routing
Edge deployment for on-device voice processing
Any multi-turn tool calling task with bounded intent taxonomy

Limitations

Trained on English banking intents only
Covers 14 specific banking functions — not a general-purpose tool caller
ASR artifact handling is tuned for common speech-to-text errors, not all possible transcription formats
90.9% accuracy means roughly 1 in 11 tool calls (about 9%) will be incorrect
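To make the error rate concrete: at 90.9% accuracy the per-call error rate is 9.1%. The sketch below also estimates the chance that a multi-turn conversation contains at least one wrong call, under the simplifying assumption (not part of the reported evaluation) that errors on different turns are independent. In practice they are likely correlated, so treat the output as a rough guide only.

```python
# Per-call tool call accuracy from the Results table.
ACCURACY = 0.909

error_rate = 1 - ACCURACY
print(f"error rate: {error_rate:.1%}, about 1 in {1 / error_rate:.0f} calls")

# Assumption: errors are independent across turns (a simplification).
for turns in (1, 3, 5):
    p_any_error = 1 - ACCURACY ** turns
    print(f"{turns} turn(s): {p_any_error:.1%} chance of at least one wrong call")
```

Under this assumption the chance of at least one error grows quickly with conversation length, which is one reason to confirm high-impact actions such as transfers before executing them.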

License

This model is released under the Apache 2.0 license.

Citation

@misc{distil-qwen3-0.6b-voice-assistant-banking-gguf,
  author = {Distil Labs},
  title = {Distil-Qwen3-0.6B-Voice-Assistant-Banking-GGUF: A Fine-tuned SLM for Banking Voice Assistants},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf}
}
