ha-voice-7b β€” Fine-Tuned Voice Assistant for Home Assistant

A purpose-built 7B language model for Home Assistant voice pipelines, fine-tuned from Qwen2.5-7B-Instruct on 8,000 HA tool-calling conversations using QLoRA.

Why This Model Exists

Home Assistant's voice pipeline sends 50+ tool schemas to the LLM on every voice command. Generic models struggle with this: they pass the wrong parameters (name instead of area), call the wrong tools, or return empty responses. This model was specifically trained to:

  1. Select the correct tool from 50+ concurrent tool schemas
  2. Use the correct parameters: area for rooms, floor for stories, name for specific devices, domain as arrays
  3. Produce clean tool call JSON in OpenAI-compatible format
  4. Confirm actions with brief text after tool execution
  5. Generalize across any entity naming convention, rather than being hardcoded to one user's setup
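For example, a command like "Turn on the living room lights" should produce a single tool call in OpenAI-compatible format (the entity and area names below are illustrative):

```json
{
  "tool_calls": [
    {
      "id": "call_0",
      "type": "function",
      "function": {
        "name": "HassTurnOn",
        "arguments": "{\"area\": \"Living Room\", \"domain\": [\"light\"]}"
      }
    }
  ]
}
```

Note that `arguments` is a JSON-encoded string, per the OpenAI convention, and `domain` is an array.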

Benchmark Results

Tested through Ollama with 33 diverse voice commands and 25+ tool schemas:

| Model | Size | Tool Accuracy | Param Accuracy | Empty Responses | Latency |
|---|---|---|---|---|---|
| cogito:14b Q4_K_M (baseline) | 9.0 GB | 88% | 58% | 0 | 2,856 ms |
| ha-voice:7b Q4_K_M | 4.5 GB | 94% | 85% | 0 | 341 ms |

  • 94% tool selection accuracy: the correct tool is called nearly every time
  • 85% full parameter accuracy: correct params with proper formatting (area rather than name)
  • Zero empty responses: always produces a tool call or text
  • 8.4x faster than the cogito:14b baseline
  • Half the size (4.5 GB vs 9.0 GB)

Available Files

| File | Quant | Size | Use Case |
|---|---|---|---|
| ha-voice-7b-Q4_K_M.gguf | Q4_K_M | 4.5 GB | Recommended: best balance of size and quality |

How to Use

With Ollama

```shell
# Download the GGUF
# Create a Modelfile (use your preferred chat template)
cat > Modelfile << 'EOF'
FROM ./ha-voice-7b-Q4_K_M.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

ollama create ha-voice:7b -f Modelfile
```

With Home Assistant

Configure the local_openai custom integration to point to your Ollama instance with this model. The model expects tools in standard OpenAI function calling format, which local_openai sends automatically.

Recommended LiteLLM configuration:

```yaml
- model_name: local/voice
  litellm_params:
    model: ollama_chat/ha-voice:7b
    api_base: http://localhost:11434
    stream: false
  model_info:
    supports_function_calling: true
```

Direct Ollama (without LiteLLM): Point local_openai to http://your-ollama-host:11434/v1 and use model name ha-voice:7b.

Note: Use stream: false or non-streaming mode for reliable tool call responses. Streaming tool calls may drop tool call data depending on your proxy configuration.
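For reference, this is the shape of a chat-completions request the model expects; the tool schema here is a simplified illustration, not the exact schema HA sends:

```json
{
  "model": "ha-voice:7b",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Turn off the kitchen lights"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "HassTurnOff",
        "description": "Turns off a device or entity",
        "parameters": {
          "type": "object",
          "properties": {
            "area": {"type": "string"},
            "floor": {"type": "string"},
            "name": {"type": "string"},
            "domain": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  ]
}
```

POST this to `/v1/chat/completions` on whichever endpoint (LiteLLM or direct Ollama) you configured above.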

With llama.cpp

```shell
llama-server \
  -m ha-voice-7b-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --port 8080
```

What It Handles

The model was trained on these HA tool types with randomized entity names:

  • HassTurnOn / HassTurnOff: lights, switches, scripts, scenes (with area, floor, name, domain parameters)
  • HassLightSet: brightness (0-100), color, color temperature
  • HassVacuumStart / HassVacuumReturnToBase: vacuum control by name or area
  • HassListAddItem / HassListRemoveItem / HassListCompleteItem: shopping/todo lists
  • HassBroadcast: whole-home announcements
  • HassCancelAllTimers: timer management
  • GetDateTime: current date/time queries
  • Camera scripts: parameterless tools for showing camera feeds on TVs
  • HassGetState / HassSetPosition / HassClimateSetTemperature: state queries and climate control
  • HassMediaPause / HassSetVolume: media player control

Training data included 15-80 tools per conversation, so the model handles large tool sets without degradation.
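A full exchange follows a user → tool_call → tool_result → confirmation pattern; a sketch with illustrative names:

```json
[
  {"role": "user", "content": "Start the vacuum in the kitchen"},
  {"role": "assistant", "content": "", "tool_calls": [
    {"id": "call_0", "type": "function",
     "function": {"name": "HassVacuumStart", "arguments": "{\"area\": \"Kitchen\"}"}}
  ]},
  {"role": "tool", "tool_call_id": "call_0", "content": "{\"success\": true}"},
  {"role": "assistant", "content": "Vacuum started in the kitchen."}
]
```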

Training Details

  • Base model: Qwen2.5-7B-Instruct (7.7B parameters)
  • Method: QLoRA (LoRA rank 64, alpha 128, 2.08% trainable parameters)
  • Training data: 8,000 conversations with:
    • Randomized entity names (English, European, technical naming styles)
    • Varied tool counts (15-80 tools per conversation)
    • Multi-turn sequences (user → tool_call → tool_result → confirmation)
    • Ollama Qwen2 chat template format
  • Epochs: 3
  • Hardware: NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory)
  • Training time: ~30 hours total
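A quick back-of-the-envelope check on the numbers above (derived from the stated figures, not taken from training logs):

```python
# Rough arithmetic for the QLoRA setup described above.
total_params = 7.7e9         # Qwen2.5-7B-Instruct parameter count
trainable_fraction = 0.0208  # 2.08% of weights trainable via LoRA

trainable_params = total_params * trainable_fraction
print(f"~{trainable_params / 1e6:.0f}M trainable parameters")  # ~160M

# LoRA scales its learned update by alpha / rank.
lora_scaling = 128 / 64
print(f"LoRA scaling factor: {lora_scaling}")  # 2.0
```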

Why Qwen2.5-7B-Instruct

  • Same architecture family as cogito:14b (which was proven for HA voice)
  • No thinking/reasoning tokens (critical: <think> tokens cause empty responses via the OpenAI API)
  • Strong baseline tool-calling ability
  • 7B is the sweet spot for consumer GPU deployment (fits in 8 GB VRAM)

Key Design Decisions

Generalization Over Memorization

The training data uses randomized entity names from multiple pools:

  • English: "Living Room", "Kitchen", "Master Bedroom"
  • European: "Salon", "Cuisine", "Wohnzimmer"
  • Technical: "lr", "kit", "br1"
  • Product names: "Hue Lamp", "LIFX Strip", "Govee Strip"

This means the model works with any HA setup, not just the one it was trained on.

Parameter Accuracy

The model learned the correct HA parameter semantics:

  • area for rooms: {"area": "Living Room"}
  • floor for stories: {"floor": "Upstairs"}
  • name for specific devices: {"name": "Desk LED Lights"}
  • domain as arrays: {"domain": ["light"]}

Generic models (including cogito:14b) consistently confuse these parameters.
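For instance, for "turn on the lights upstairs", the expected call is (arguments shown decoded; names illustrative):

```json
{"name": "HassTurnOn", "arguments": {"floor": "Upstairs", "domain": ["light"]}}
```

A generic model will often emit `{"name": "Upstairs"}` instead, which HA cannot resolve to a floor.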

No Thinking Tokens

The base model (Qwen2.5-7B-Instruct) does not produce <think> tokens. This is critical for HA voice pipelines: models with thinking tokens (qwen3, qwen3.5) return empty content via the OpenAI-compatible API, causing Voice PE devices to show error states.

Known Issues

  • Post-tool confirmation text: The model generates correct confirmations ("Done.", "Lights are on.") when tested directly, but HA's local_openai streaming pipeline may drop them. A fallback patch in entity.py injects "Done." when this occurs. This is an HA integration issue, not a model issue.
  • Trained primarily on English voice commands
  • Does not include cover/lock/valve control in training data (coming in future versions)
  • The model may use "Bedroom" instead of "Master Bedroom" for ambiguous area names (HA resolves this correctly)
  • Best results with the full HA system prompt and tool schemas (minimal prompts may produce empty responses)

Hardware Requirements

| Hardware | VRAM | Performance |
|---|---|---|
| RTX 3060 (12 GB) | ~5 GB | Good; primary target |
| RTX 4060/4070 (8-12 GB) | ~5 GB | Excellent |
| RTX 4080/4090 (16+ GB) | ~5 GB | Excellent |
| CPU only (16+ GB RAM) | N/A | Usable but slower (~2-5 s per call) |
| Raspberry Pi 5 (8 GB) | N/A | Marginal; may work with CPU inference |

License

Apache 2.0 (same as the base model)
