ha-voice-7b β€” Fine-Tuned Voice Assistant for Home Assistant

A purpose-built 7B language model for Home Assistant voice pipelines, fine-tuned from Qwen2.5-7B-Instruct on 8,000 HA tool-calling conversations using QLoRA.

Why This Model Exists

Home Assistant's voice pipeline sends 50+ tool schemas to the LLM on every voice command. Generic models struggle with this: they pass the wrong parameters (name instead of area), call the wrong tools, or return empty responses. This model was specifically trained to:

  1. Select the correct tool from 50+ concurrent tool schemas
  2. Use the correct parameters: area for rooms, floor for stories, name for specific devices, domain as arrays
  3. Produce clean tool call JSON in OpenAI-compatible format
  4. Confirm actions with brief text after tool execution
  5. Generalize across any entity naming convention, rather than being hardcoded to one user's setup
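For example, a command like "Turn on the living room lights" should produce a single tool call in OpenAI-compatible format (the entity and area names below are illustrative):

```json
{
  "tool_calls": [
    {
      "id": "call_0",
      "type": "function",
      "function": {
        "name": "HassTurnOn",
        "arguments": "{\"area\": \"Living Room\", \"domain\": [\"light\"]}"
      }
    }
  ]
}
```

Note that `arguments` is a JSON-encoded string, per the OpenAI convention, and `domain` is an array.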

Benchmark Results

Tested through Ollama with 33 diverse voice commands and 25+ tool schemas:

| Model | Size | Tool Accuracy | Param Accuracy | Empty Responses | Latency |
|---|---|---|---|---|---|
| cogito:14b Q4_K_M (baseline) | 9.0 GB | 88% | 58% | 0 | 2,856 ms |
| ha-voice:7b Q4_K_M | 4.5 GB | 94% | 85% | 0 | 341 ms |

  • 94% tool selection accuracy: the correct tool is called nearly every time
  • 85% full parameter accuracy: correct params with proper formatting (area rather than name)
  • Zero empty responses: always produces a tool call or text
  • 8.4x faster than the cogito:14b baseline
  • Half the size (4.5 GB vs 9.0 GB)

Available Files

| File | Quant | Size | Use Case |
|---|---|---|---|
| ha-voice-7b-Q4_K_M.gguf | Q4_K_M | 4.5 GB | Recommended: best balance of size and quality |

How to Use

With Ollama

```shell
# Download the GGUF
# Create a Modelfile (use your preferred chat template)
cat > Modelfile << 'EOF'
FROM ./ha-voice-7b-Q4_K_M.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

ollama create ha-voice:7b -f Modelfile
```

With Home Assistant

Configure the local_openai custom integration to point to your Ollama instance with this model. The model expects tools in standard OpenAI function calling format, which local_openai sends automatically.

Recommended LiteLLM configuration:

```yaml
- model_name: local/voice
  litellm_params:
    model: ollama_chat/ha-voice:7b
    api_base: http://localhost:11434
    stream: false
  model_info:
    supports_function_calling: true
```

Direct Ollama (without LiteLLM): Point local_openai to http://your-ollama-host:11434/v1 and use model name ha-voice:7b.

Note: Use stream: false or non-streaming mode for reliable tool call responses. Streaming tool calls may drop tool call data depending on your proxy configuration.
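For reference, this is the shape of a chat-completions request the model expects; the tool schema here is a simplified illustration, not the exact schema HA sends:

```json
{
  "model": "ha-voice:7b",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Turn off the kitchen lights"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "HassTurnOff",
        "description": "Turns off a device or entity",
        "parameters": {
          "type": "object",
          "properties": {
            "area": {"type": "string"},
            "floor": {"type": "string"},
            "name": {"type": "string"},
            "domain": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  ]
}
```

POST this to `/v1/chat/completions` on whichever endpoint (LiteLLM or direct Ollama) you configured above.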

With llama.cpp

```shell
llama-server \
  -m ha-voice-7b-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --port 8080
```

What It Handles

The model was trained on these HA tool types with randomized entity names:

  • HassTurnOn / HassTurnOff: lights, switches, scripts, scenes (with area, floor, name, domain parameters)
  • HassLightSet: brightness (0-100), color, color temperature
  • HassVacuumStart / HassVacuumReturnToBase: vacuum control by name or area
  • HassListAddItem / HassListRemoveItem / HassListCompleteItem: shopping/todo lists
  • HassBroadcast: whole-home announcements
  • HassCancelAllTimers: timer management
  • GetDateTime: current date/time queries
  • Camera scripts: parameterless tools for showing camera feeds on TVs
  • HassGetState / HassSetPosition / HassClimateSetTemperature: state queries and climate control
  • HassMediaPause / HassSetVolume: media player control

Training data included 15-80 tools per conversation, so the model handles large tool sets without degradation.
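A full exchange follows a user → tool_call → tool_result → confirmation pattern; a sketch with illustrative names:

```json
[
  {"role": "user", "content": "Start the vacuum in the kitchen"},
  {"role": "assistant", "content": "", "tool_calls": [
    {"id": "call_0", "type": "function",
     "function": {"name": "HassVacuumStart", "arguments": "{\"area\": \"Kitchen\"}"}}
  ]},
  {"role": "tool", "tool_call_id": "call_0", "content": "{\"success\": true}"},
  {"role": "assistant", "content": "Vacuum started in the kitchen."}
]
```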

Training Details

  • Base model: Qwen2.5-7B-Instruct (7.7B parameters)
  • Method: QLoRA (LoRA rank 64, alpha 128, 2.08% trainable parameters)
  • Training data: 8,000 conversations with:
    • Randomized entity names (English, European, technical naming styles)
    • Varied tool counts (15-80 tools per conversation)
    • Multi-turn sequences (user → tool_call → tool_result → confirmation)
    • Ollama Qwen2 chat template format
  • Epochs: 3
  • Hardware: NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory)
  • Training time: ~30 hours total
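A quick back-of-the-envelope check on the numbers above (derived from the stated figures, not taken from training logs):

```python
# Rough arithmetic for the QLoRA setup described above.
total_params = 7.7e9         # Qwen2.5-7B-Instruct parameter count
trainable_fraction = 0.0208  # 2.08% of weights trainable via LoRA

trainable_params = total_params * trainable_fraction
print(f"~{trainable_params / 1e6:.0f}M trainable parameters")  # ~160M

# LoRA scales its learned update by alpha / rank.
lora_scaling = 128 / 64
print(f"LoRA scaling factor: {lora_scaling}")  # 2.0
```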

Why Qwen2.5-7B-Instruct

  • Same architecture family as cogito:14b (which was proven for HA voice)
  • No thinking/reasoning tokens (critical: <think> tokens cause empty responses via the OpenAI API)
  • Strong baseline tool-calling ability
  • 7B is the sweet spot for consumer GPU deployment (fits in 8 GB VRAM)

Key Design Decisions

Generalization Over Memorization

The training data uses randomized entity names from multiple pools:

  • English: "Living Room", "Kitchen", "Master Bedroom"
  • European: "Salon", "Cuisine", "Wohnzimmer"
  • Technical: "lr", "kit", "br1"
  • Product names: "Hue Lamp", "LIFX Strip", "Govee Strip"

This means the model works with any HA setup, not just the one it was trained on.

Parameter Accuracy

The model learned the correct HA parameter semantics:

  • area for rooms: {"area": "Living Room"}
  • floor for stories: {"floor": "Upstairs"}
  • name for specific devices: {"name": "Desk LED Lights"}
  • domain as arrays: {"domain": ["light"]}

Generic models (including cogito:14b) consistently confuse these parameters.
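For instance, for "turn on the lights upstairs", the expected call is (arguments shown decoded; names illustrative):

```json
{"name": "HassTurnOn", "arguments": {"floor": "Upstairs", "domain": ["light"]}}
```

A generic model will often emit `{"name": "Upstairs"}` instead, which HA cannot resolve to a floor.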

No Thinking Tokens

The base model (Qwen2.5-7B-Instruct) does not produce <think> tokens. This is critical for HA voice pipelines: models with thinking tokens (qwen3, qwen3.5) return empty content via the OpenAI-compatible API, causing Voice PE devices to show error states.

Known Issues

  • Post-tool confirmation text: The model generates correct confirmations ("Done.", "Lights are on.") when tested directly, but HA's local_openai streaming pipeline may drop them. A fallback patch in entity.py injects "Done." when this occurs. This is an HA integration issue, not a model issue.
  • Trained primarily on English voice commands
  • Does not include cover/lock/valve control in training data (coming in future versions)
  • The model may use "Bedroom" instead of "Master Bedroom" for ambiguous area names (HA resolves this correctly)
  • Best results with the full HA system prompt and tool schemas (minimal prompts may produce empty responses)

Hardware Requirements

| Hardware | VRAM | Performance |
|---|---|---|
| RTX 3060 (12 GB) | ~5 GB | Good; primary target |
| RTX 4060/4070 (8-12 GB) | ~5 GB | Excellent |
| RTX 4080/4090 (16+ GB) | ~5 GB | Excellent |
| CPU only (16+ GB RAM) | N/A | Usable but slower (~2-5 s per call) |
| Raspberry Pi 5 (8 GB) | N/A | Marginal; may work with CPU inference |

License

Apache 2.0 (same as the base model)
