Distil-Qwen3-0.6B-Voice-Assistant-Banking (GGUF)

GGUF version of distil-labs/distil-qwen3-0.6b-voice-assistant-banking for deployment with llama.cpp, Ollama, and other GGUF-compatible runtimes.

A fine-tuned Qwen3-0.6B model for multi-turn intent classification and slot extraction in a banking voice assistant. Trained by knowledge distillation from a 120B teacher model, this 0.6B model reaches 90.9% tool call accuracy, exceeding the teacher's 87.5%, while running at ~40ms inference, which enables real-time voice pipelines under 400ms total latency.

For the safetensors version (for Transformers / vLLM), see distil-labs/distil-qwen3-0.6b-voice-assistant-banking.

Results

| Model | Parameters | Tool Call Accuracy | ROUGE |
|---|---|---|---|
| GPT-oss-120B (teacher) | 120B | 87.5% | 94.4% |
| This model (tuned) | 0.6B | 90.9% | 97.8% |
| Qwen3-0.6B (base) | 0.6B | 48.7% | 66.3% |

The fine-tuned model exceeds the 120B teacher on tool call accuracy while being 200x smaller. The base Qwen3-0.6B achieves only 48.7%, so fine-tuning is essential for reliable multi-turn conversations.
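
The size ratio and accuracy gaps quoted above follow directly from the results table. A quick check in Python, using only the figures reported here (the dictionary keys are illustrative names, not part of any API):

```python
# Figures copied from the results table above.
results = {
    "teacher": {"params_b": 120.0, "tool_call_acc": 87.5},
    "tuned": {"params_b": 0.6, "tool_call_acc": 90.9},
    "base": {"params_b": 0.6, "tool_call_acc": 48.7},
}

tuned = results["tuned"]
size_ratio = results["teacher"]["params_b"] / tuned["params_b"]
gain_over_teacher = tuned["tool_call_acc"] - results["teacher"]["tool_call_acc"]
gain_over_base = tuned["tool_call_acc"] - results["base"]["tool_call_acc"]

print(f"size ratio: {size_ratio:.0f}x")
print(f"gain over teacher: {gain_over_teacher:.1f} points")
print(f"gain over base: {gain_over_base:.1f} points")
```

The gains are percentage-point differences in tool call accuracy, not relative improvements.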

Quick Start

Using llama.cpp

huggingface-cli download distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf \
    --local-dir distil-model

llama-server \
    --model distil-model/Qwen3-voice-assistant-slm-0.6B.gguf \
    --port 8000

Then query via the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

TOOLS = [
    {"type": "function", "function": {"name": "check_balance", "description": "Check the balance of a bank account", "parameters": {"type": "object", "properties": {"account_type": {"type": "string", "enum": ["checking", "savings", "credit"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "transfer_money", "description": "Transfer money between accounts", "parameters": {"type": "object", "properties": {"amount": {"type": "number"}, "from_account": {"type": "string", "enum": ["checking", "savings"]}, "to_account": {"type": "string", "enum": ["checking", "savings"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "cancel_card", "description": "Cancel a bank card", "parameters": {"type": "object", "properties": {"card_type": {"type": "string", "enum": ["credit", "debit"]}, "card_last_four": {"type": "string"}, "reason": {"type": "string", "enum": ["lost", "stolen", "damaged", "other"]}}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "intent_unclear", "description": "Use when the user's intent cannot be determined", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "greeting", "description": "User is greeting", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
    {"type": "function", "function": {"name": "goodbye", "description": "User is ending the conversation", "parameters": {"type": "object", "properties": {}, "required": [], "additionalProperties": False}}},
]

response = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "system", "content": "You are a tool-calling model working on:\n<task_description>You are a voice assistant for BankCo, a retail bank. The user input is automatically transcribed speech from an ASR system, so it may contain transcription errors, homophones, filler words, or unusual phrasings. Parse the user's request and return the appropriate function call despite any transcription artifacts. If you can identify the intent, call the matching function. Extract any mentioned argument values; omit arguments not mentioned. If you cannot understand what the user wants, call intent_unclear(). Use conversation history to understand context from previous turns.</task_description>\n\nRespond to the conversation history by generating an appropriate tool call that satisfies the user request. Generate only the tool call according to the provided tool schema, do not generate anything else. Always respond with a tool call.\n\n"},
        {"role": "user", "content": "I need to cancel my credit card ending in 1234"},
    ],
    tools=TOOLS,
    tool_choice="required",
    temperature=0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

fn = response.choices[0].message.tool_calls[0].function
print(f"{fn.name}({fn.arguments})")
# cancel_card({"card_type": "credit", "card_last_four": "1234"})
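
The system prompt asks the model to use conversation history, but the request above sends a single turn. A minimal sketch of one way to assemble a multi-turn message list follows. It assumes earlier assistant turns are encoded as OpenAI-style `tool_calls` messages; this card does not state the exact history format the model was trained on, so treat the encoding, the helper names (`tool_call_turn`, `build_messages`), and the `call_<i>` ids as assumptions.

```python
import json

def tool_call_turn(call_id, name, arguments):
    """Build an assistant message recording an earlier tool call (assumed encoding)."""
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{
            "id": call_id,
            "type": "function",
            # The OpenAI chat format carries arguments as a JSON string.
            "function": {"name": name, "arguments": json.dumps(arguments)},
        }],
    }

def build_messages(system_prompt, history, user_utterance):
    """history is a list of (user_text, function_name, arguments) tuples."""
    messages = [{"role": "system", "content": system_prompt}]
    for i, (user_text, name, arguments) in enumerate(history):
        messages.append({"role": "user", "content": user_text})
        messages.append(tool_call_turn(f"call_{i}", name, arguments))
    messages.append({"role": "user", "content": user_utterance})
    return messages

messages = build_messages(
    "<system prompt from the request above>",
    [("I wanna cancel my card", "cancel_card", {})],
    "the credit one ending in 1234",
)
print(json.dumps(messages, indent=2))
```

The resulting list can be passed as `messages=` in the `client.chat.completions.create` call shown earlier.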

Using llama.cpp with --jinja

huggingface-cli download distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf \
    --local-dir distil-model
cd distil-model
llama-server --model Qwen3-voice-assistant-slm-0.6B.gguf --jinja

Using with the Demo App

This model powers the BankCo Voice Assistant demo: a full ASR -> SLM -> TTS voice pipeline that runs locally.
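
The shape of such a pipeline can be sketched as three timed stages checked against the 400ms budget mentioned earlier. This is not the demo app's code: `transcribe`, `call_slm`, and `synthesize` are hypothetical stubs with hard-coded return values, standing in for a real speech recognizer, a request to llama-server, and a speech synthesizer.

```python
import time

def transcribe(audio: bytes) -> str:
    # Hypothetical ASR stub; a real system would run a speech recognizer here.
    return "what's my savings balance"

def call_slm(text: str) -> dict:
    # Hypothetical SLM stub; a real system would query llama-server here.
    return {"name": "check_balance", "arguments": {"account_type": "savings"}}

def synthesize(reply: str) -> bytes:
    # Hypothetical TTS stub; a real system would return audio samples.
    return reply.encode("utf-8")

def timed(stage, fn, arg, timings):
    """Run one stage and record its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    result = fn(arg)
    timings[stage] = (time.perf_counter() - start) * 1000.0
    return result

def run_turn(audio: bytes, budget_ms: float = 400.0):
    """Process one user turn and report whether it fit the latency budget."""
    timings = {}
    text = timed("asr", transcribe, audio, timings)
    call = timed("slm", call_slm, text, timings)
    speech = timed("tts", synthesize, f"Running {call['name']}", timings)
    return speech, timings, sum(timings.values()) <= budget_ms

speech, timings, within_budget = run_turn(b"\x00\x01")
print(timings, within_budget)
```

Recording per-stage timings makes it easy to see which stage dominates when a turn exceeds the budget.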

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-0.6B |
| Parameters | 0.6 billion |
| Architecture | Qwen3ForCausalLM |
| Format | GGUF (F16) |
| File Size | ~1.1 GB |
| Context Length | 40,960 tokens |
| Training Data | 77 seed conversations, synthetically expanded |
| Teacher Model | GPT-oss-120B |
| Task | Multi-turn tool calling (closed book) |

Training

This model was trained using the Distil Labs platform:

  1. Seed Data: 77 hand-written multi-turn conversations covering 14 banking functions, including ASR transcription artifacts (filler words, homophones, word splits)
  2. Synthetic Expansion: Expanded to thousands of examples using a 120B teacher model
  3. Fine-tuning: Multi-turn tool calling distillation on Qwen3-0.6B

What the Model Does

The model acts as a function caller for a banking voice assistant. Given a user utterance (potentially with ASR errors) and conversation history, it outputs a structured tool call:

User: "Trans fur 500 from my savin to checkin"
Model: {"name": "transfer_money", "arguments": {"amount": 500, "from_account": "savings", "to_account": "checking"}}

User: "I wanna cancel my card"
Model: {"name": "cancel_card", "arguments": {}}

User: "What about that thing from last week"
Model: {"name": "intent_unclear", "arguments": {}}
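
Downstream code still has to decide what to do with each call. A minimal sketch of that step, using the three outputs above: run the function when all needed slots are present, ask a follow-up question when some are missing, and ask the user to rephrase on `intent_unclear`. The `REQUIRED_SLOTS` table and the `next_action` helper are hypothetical; this card does not specify which slots each function needs before it can run.

```python
import json

# Hypothetical slot requirements, chosen for illustration only.
REQUIRED_SLOTS = {
    "transfer_money": ["amount", "from_account", "to_account"],
    "cancel_card": ["card_type", "card_last_four"],
}

def next_action(tool_call_json: str) -> str:
    """Map a model tool call to the assistant's next step."""
    call = json.loads(tool_call_json)
    name = call["name"]
    args = call.get("arguments", {})
    if name == "intent_unclear":
        return "ask_rephrase"
    missing = [slot for slot in REQUIRED_SLOTS.get(name, []) if slot not in args]
    if missing:
        # Ask for the first missing slot; later turns can fill in the rest.
        return f"ask_for:{missing[0]}"
    return f"execute:{name}"

transfer = ('{"name": "transfer_money", "arguments": '
            '{"amount": 500, "from_account": "savings", "to_account": "checking"}}')
print(next_action(transfer))
print(next_action('{"name": "cancel_card", "arguments": {}}'))
print(next_action('{"name": "intent_unclear", "arguments": {}}'))
```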

Supported Functions

The model handles 14 banking operations:

| Function | Description |
|---|---|
| check_balance | Check account balance |
| get_statement | Request account statement |
| transfer_money | Transfer between accounts |
| pay_bill | Pay a bill |
| cancel_card | Cancel a card |
| replace_card | Request replacement card |
| activate_card | Activate a new card |
| report_fraud | Report fraudulent transaction |
| reset_pin | Reset card PIN |
| speak_to_human | Connect to a human agent |
| greeting | Conversation start |
| goodbye | Conversation end |
| thank_you | Express gratitude |
| intent_unclear | Cannot determine intent |
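
Because the set of functions is closed, a caller can cheaply guard against out-of-taxonomy outputs. A minimal sketch, assuming that unknown names should be treated as `intent_unclear` (the `normalize_call` helper and this fallback policy are illustrative, not part of the model or its runtime):

```python
# The 14 function names listed above.
SUPPORTED_FUNCTIONS = frozenset({
    "check_balance", "get_statement", "transfer_money", "pay_bill",
    "cancel_card", "replace_card", "activate_card", "report_fraud",
    "reset_pin", "speak_to_human", "greeting", "goodbye",
    "thank_you", "intent_unclear",
})

def normalize_call(name: str, arguments: dict) -> tuple:
    """Return the call unchanged if supported, else fall back to intent_unclear."""
    if name not in SUPPORTED_FUNCTIONS:
        return ("intent_unclear", {})
    return (name, arguments)

print(normalize_call("reset_pin", {}))
print(normalize_call("open_account", {"type": "savings"}))
```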

Use Cases

  • Real-time voice banking assistants (ASR -> SLM -> TTS pipeline)
  • Text-based banking chatbots with structured intent routing
  • Edge deployment for on-device voice processing
  • Any multi-turn tool calling task with bounded intent taxonomy

Limitations

  • Trained on English banking intents only
  • Covers 14 specific banking functions; it is not a general-purpose tool caller
  • ASR artifact handling is tuned for common speech-to-text errors, not all possible transcription formats
  • 90.9% accuracy means about 9% of function calls (roughly 1 in 11) will be incorrect
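
The last point compounds over a conversation. As a rough illustration, assuming each call is correct independently with probability 0.909 (a simplification; this card does not report how errors correlate across turns), the chance that every call in a conversation is correct can be estimated as follows. The `p_all_correct` helper is an illustrative name.

```python
def p_all_correct(per_call_accuracy: float, n_calls: int) -> float:
    """Probability that n independent calls are all correct."""
    return per_call_accuracy ** n_calls

for n in (1, 3, 5):
    print(f"{n} calls: {p_all_correct(0.909, n):.3f}")
```

Under this assumption, longer conversations are noticeably less likely to be error-free, which is one reason to confirm high-impact actions such as transfers with the user before executing them.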

License

This model is released under the Apache 2.0 license.

Citation

@misc{distil-qwen3-0.6b-voice-assistant-banking-gguf,
  author = {Distil Labs},
  title = {Distil-Qwen3-0.6B-Voice-Assistant-Banking-GGUF: A Fine-tuned SLM for Banking Voice Assistants},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/distil-labs/distil-qwen3-0.6b-voice-assistant-banking-gguf}
}