# ha-voice-7b: Fine-Tuned Voice Assistant for Home Assistant
A purpose-built 7B language model for Home Assistant voice pipelines, fine-tuned from Qwen2.5-7B-Instruct on 8,000 HA tool-calling conversations using QLoRA.
## Why This Model Exists
Home Assistant's voice pipeline sends 50+ tool schemas to the LLM on every voice command. Generic models struggle with this: they use the wrong parameters (`name` instead of `area`), call the wrong tools, or return empty responses. This model was specifically trained to:
- Select the correct tool from 50+ concurrent tool schemas
- Use the correct parameters: `area` for rooms, `floor` for stories, `name` for specific devices, `domain` as arrays
- Produce clean tool-call JSON in OpenAI-compatible format
- Confirm actions with brief text after tool execution
- Generalize across any entity naming convention, not hardcoded to one user's setup
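As a sketch of the OpenAI-compatible tool-call format mentioned above, a turn-on request might come back as an assistant message shaped like this (the call `id` and argument values are illustrative, not output captured from the model):

```python
import json

# Hypothetical assistant message in OpenAI function-calling format,
# the shape this model is trained to emit.
tool_call_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",  # illustrative id
            "type": "function",
            "function": {
                "name": "HassTurnOn",
                # Per the OpenAI format, arguments are a JSON-encoded string
                "arguments": json.dumps({"area": "Living Room", "domain": ["light"]}),
            },
        }
    ],
}

args = json.loads(tool_call_message["tool_calls"][0]["function"]["arguments"])
print(args["area"])  # Living Room
```

Note the `area` parameter (a room) and `domain` as an array, the two conventions the model is trained to get right.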
## Benchmark Results
Tested through Ollama with 33 diverse voice commands and 25+ tool schemas:
| Model | Size | Tool Accuracy | Param Accuracy | Empty Responses | Latency |
|---|---|---|---|---|---|
| cogito:14b Q4_K_M (baseline) | 9.0 GB | 88% | 58% | 0 | 2,856 ms |
| ha-voice:7b Q4_K_M | 4.5 GB | 94% | 85% | 0 | 341 ms |
- 94% tool selection accuracy: the correct tool is called nearly every time
- 85% full parameter accuracy: correct params with proper formatting (uses `area`, not `name`)
- Zero empty responses: always produces a tool call or text
- 8.4x faster than the cogito:14b baseline
- Half the size (4.5 GB vs 9.0 GB)
## Available Files
| File | Quant | Size | Use Case |
|---|---|---|---|
| `ha-voice-7b-Q4_K_M.gguf` | Q4_K_M | 4.5 GB | Recommended: best balance of size and quality |
## How to Use

### With Ollama
```shell
# Download the GGUF, then create a Modelfile (use your preferred chat template)
cat > Modelfile << 'EOF'
FROM ./ha-voice-7b-Q4_K_M.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

ollama create ha-voice:7b -f Modelfile
```
### With Home Assistant
Configure the `local_openai` custom integration to point to your Ollama instance with this model. The model expects tools in standard OpenAI function-calling format, which `local_openai` sends automatically.
Recommended LiteLLM configuration:

```yaml
- model_name: local/voice
  litellm_params:
    model: ollama_chat/ha-voice:7b
    api_base: http://localhost:11434
    stream: false
  model_info:
    supports_function_calling: true
```
Direct Ollama (without LiteLLM):
Point `local_openai` to `http://your-ollama-host:11434/v1` and use model name `ha-voice:7b`.
**Note:** Use `stream: false` or non-streaming mode for reliable tool-call responses. Streaming may drop tool-call data depending on your proxy configuration.
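To illustrate, a non-streaming request body for the OpenAI-compatible `/v1/chat/completions` endpoint could look like the sketch below. The `HassTurnOn` schema shown is a simplified stand-in, not HA's exact tool definition:

```python
import json

# Sketch of a chat-completions request body with one tool schema attached.
request_body = {
    "model": "ha-voice:7b",
    "stream": False,  # streaming can drop tool-call data through some proxies
    "messages": [{"role": "user", "content": "Turn on the kitchen lights"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "HassTurnOn",
                "description": "Turn on a device or all devices in an area",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "area": {"type": "string"},
                        "floor": {"type": "string"},
                        "name": {"type": "string"},
                        "domain": {"type": "array", "items": {"type": "string"}},
                    },
                },
            },
        }
    ],
}

# The body is plain JSON; any HTTP client can POST it to the endpoint.
payload = json.dumps(request_body)
print(request_body["model"])  # ha-voice:7b
```

In a real pipeline the HA integration builds this request for you; the point here is the `stream: false` flag and the tool schema shape.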
### With llama.cpp
```shell
llama-server \
  -m ha-voice-7b-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --port 8080
```
## What It Handles
The model was trained on these HA tool types with randomized entity names:
- `HassTurnOn` / `HassTurnOff`: lights, switches, scripts, scenes (with `area`, `floor`, `name`, `domain` parameters)
- `HassLightSet`: brightness (0-100), color, color temperature
- `HassVacuumStart` / `HassVacuumReturnToBase`: vacuum control by name or area
- `HassListAddItem` / `HassListRemoveItem` / `HassListCompleteItem`: shopping/todo lists
- `HassBroadcast`: whole-home announcements
- `HassCancelAllTimers`: timer management
- `GetDateTime`: current date/time queries
- Camera scripts: parameterless tools for showing camera feeds on TVs
- `HassGetState` / `HassSetPosition` / `HassClimateSetTemperature`: state queries and climate control
- `HassMediaPause` / `HassSetVolume`: media player control
Training data included 15-80 tools per conversation, so the model handles large tool sets without degradation.
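A hypothetical multi-turn exchange of the kind the model handles — user request, tool call, tool result, brief confirmation — might look like this (entity names and the tool result payload are illustrative):

```python
# Message sequence in OpenAI chat format: user -> tool_call -> tool_result -> confirmation.
conversation = [
    {"role": "user", "content": "Set the desk lamp to 50 percent"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",  # illustrative id
            "type": "function",
            "function": {
                "name": "HassLightSet",
                "arguments": '{"name": "Desk Lamp", "brightness": 50}',
            },
        }],
    },
    # Tool result fed back to the model, keyed to the call id
    {"role": "tool", "tool_call_id": "call_1", "content": '{"success": true}'},
    # Brief post-tool confirmation, as described under Known Issues
    {"role": "assistant", "content": "Done."},
]

for msg in conversation:
    print(msg["role"])
```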
## Training Details
- Base model: Qwen2.5-7B-Instruct (7.7B parameters)
- Method: QLoRA (LoRA rank 64, alpha 128, 2.08% trainable parameters)
- Training data: 8,000 conversations with:
  - Randomized entity names (English, European, technical naming styles)
  - Varied tool counts (15-80 tools per conversation)
  - Multi-turn sequences (user → tool_call → tool_result → confirmation)
  - Ollama Qwen2 chat template format
- Epochs: 3
- Hardware: NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory)
- Training time: ~30 hours total
## Why Qwen2.5-7B-Instruct
- Same architecture family as cogito:14b (which was proven for HA voice)
- No thinking/reasoning tokens (critical: `<think>` tokens cause empty responses via the OpenAI API)
- Strong baseline tool-calling ability
- 7B is the sweet spot for consumer GPU deployment (fits in 8 GB VRAM)
## Key Design Decisions

### Generalization Over Memorization
The training data uses randomized entity names from multiple pools:
- English: "Living Room", "Kitchen", "Master Bedroom"
- European: "Salon", "Cuisine", "Wohnzimmer"
- Technical: "lr", "kit", "br1"
- Product names: "Hue Lamp", "LIFX Strip", "Govee Strip"
This means the model works with any HA setup, not just the one it was trained on.
### Parameter Accuracy
The model learned the correct HA parameter semantics:
- `area` for rooms: `{"area": "Living Room"}`
- `floor` for stories: `{"floor": "Upstairs"}`
- `name` for specific devices: `{"name": "Desk LED Lights"}`
- `domain` as arrays: `{"domain": ["light"]}`
Generic models (including cogito:14b) consistently confuse these parameters.
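These conventions can be sanity-checked mechanically. The following is an illustrative helper, not part of Home Assistant or this model, that flags the two most common mistakes:

```python
def check_ha_params(arguments: dict) -> list[str]:
    """Flag deviations from the HA parameter conventions (illustrative only)."""
    problems = []
    # domain must be an array, e.g. {"domain": ["light"]}, never a bare string
    if "domain" in arguments and not isinstance(arguments["domain"], list):
        problems.append('domain should be an array, e.g. ["light"]')
    # area/floor/name should each be a plain string
    for key in ("area", "floor", "name"):
        if key in arguments and not isinstance(arguments[key], str):
            problems.append(f"{key} should be a string")
    return problems

print(check_ha_params({"area": "Living Room", "domain": ["light"]}))  # []
print(check_ha_params({"domain": "light"}))  # flags the string-valued domain
```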
### No Thinking Tokens
The base model (Qwen2.5-7B-Instruct) does not produce `<think>` tokens. This is critical for HA voice pipelines: models with thinking tokens (qwen3, qwen3.5) return empty content via the OpenAI-compatible API, causing Voice PE devices to show error states.
## Known Issues
- Post-tool confirmation text: the model generates correct confirmations ("Done.", "Lights are on.") when tested directly, but HA's `local_openai` streaming pipeline may drop them. A fallback patch in `entity.py` injects "Done." when this occurs. This is an HA integration issue, not a model issue.
- Trained primarily on English voice commands
- Does not include cover/lock/valve control in training data (coming in future versions)
- "Bedroom" may be used instead of "Master Bedroom" for ambiguous area names (HA resolves this correctly)
- Best results with the full HA system prompt and tool schemas (minimal prompts may produce empty responses)
## Hardware Requirements
| Hardware | VRAM | Performance |
|---|---|---|
| RTX 3060 (12 GB) | ~5 GB | Good: primary target |
| RTX 4060/4070 (8-12 GB) | ~5 GB | Excellent |
| RTX 4080/4090 (16+ GB) | ~5 GB | Excellent |
| CPU only (16+ GB RAM) | N/A | Usable but slower (~2-5s per call) |
| Raspberry Pi 5 (8 GB) | N/A | Marginal: may work with CPU inference |
## Acknowledgments
- Qwen Team for the excellent Qwen2.5-7B-Instruct base model
- Home Assistant for the intent definitions used in training data generation
- llama.cpp for GGUF conversion tools
- Hugging Face TRL for the SFT training framework
- Built on an NVIDIA DGX Spark
## License
Apache 2.0 (same as the base model)