Apertus Tool Calling: Practical Notes
It has been two months since I last posted about my Apertus-based project. I have made a lot of progress since then, most importantly with regard to tools. I wanted to share some of what I have learned with the community; hopefully it helps someone else.
Background
I came across the swiss-ai/apertus-format GitHub repo, which helped improve model output quality and consistency. For more information about the ApertusFormatter, head over to the GitHub format specification page. This post is not a tutorial, but I will try to explain practical behavior I observed that isn't obvious from the documentation.
The add_generation_prompt Parameter
This parameter on format_conversation() had the most impact on whether tool calls were generated correctly, and its behavior wasn't immediately obvious to me.
In tool-calling mode (intent == tool call), you need:
```python
formatter = ApertusFormatter(enable_thinking=True, tools=tool_functions)
formatted = formatter.format_conversation(conversation, add_generation_prompt=False)
```
Setting add_generation_prompt=True with a tool-capable prompt caused the model to consistently ignore the tools and generate a plain text response instead. My interpretation is that the generation prompt suffix signals "start producing a human-readable assistant turn," which is in tension with the structured <|tools_prefix|>...<|tools_suffix|> output the tool path requires. Omitting it leaves the model in a state where it correctly emits the tool call JSON.
In streaming/text mode, add_generation_prompt=True is required; without it, the streamed response produces no tokens.
So the rule I landed on:
- Tool call path: add_generation_prompt=False
- Text response path: add_generation_prompt=True
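To make the rule concrete, here is a tiny helper that encodes it. This is my own sketch, not part of apertus-format, and the intent labels are illustrative:

```python
# Hypothetical helper encoding the rule above; "tool_call", "text", and
# "streaming" are my own labels, not apertus-format values.
def generation_prompt_flag(intent: str) -> bool:
    """Return the add_generation_prompt value for a given call intent."""
    if intent == "tool_call":
        return False  # the generation-prompt suffix suppresses structured tool output
    if intent in ("text", "streaming"):
        return True   # without it, the streamed response produces no tokens
    raise ValueError(f"unknown intent: {intent!r}")
```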
Streaming and Tool Calls Are Mutually Exclusive
The format supports tool calls via <|tools_prefix|>[...]<|tools_suffix|> in the assistant turn. This works correctly in a single HTTP completion call. It does not work in a streaming context.
When the model is streaming, it emits tokens one at a time and the client is consuming them as they arrive. There's no point at which you can intercept a partial JSON array, parse a tool call, execute it, and inject the result back into the conversation before the generation is already half-finished. The model doesn't wait.
Besides, with add_generation_prompt=True the model ignores the tools anyway, as noted above.
What I ended up implementing is a two-call pattern:
- A short classification call (max_tokens=5, add_generation_prompt=False) that determines whether the model intends to call a tool or produce a text response. I use a dedicated tag_extraction tool for this with a constrained output schema.
- Based on the result: either a full tool-calling HTTP call (synchronous), or a full streaming text generation.
The classification call adds approximately 200 ms. The benefit is that the two paths, tool execution and streaming text, stay cleanly separated, and neither interferes with the other.
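A minimal sketch of the dispatch logic, with the actual HTTP/model calls stubbed out. classify_intent, run_tool_call, and stream_text are hypothetical stand-ins for the calls described above, not apertus-format APIs:

```python
# Illustrative two-call pattern: a cheap classification call decides the
# path, then the full request runs down exactly one of the two paths.
def classify_intent(conversation: dict) -> str:
    """First call: max_tokens=5, add_generation_prompt=False.
    Stubbed here; the real version asks the model via a constrained
    tag_extraction tool and returns "tool" or "text"."""
    return "tool" if conversation.get("wants_tool") else "text"

def run_tool_call(conversation: dict) -> dict:
    return {"mode": "tool"}    # stub for the synchronous HTTP completion

def stream_text(conversation: dict) -> dict:
    return {"mode": "stream"}  # stub for the streaming text generation

def handle_request(conversation: dict) -> dict:
    intent = classify_intent(conversation)  # ~200 ms classification call
    if intent == "tool":
        return run_tool_call(conversation)
    return stream_text(conversation)
```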
Few-Shot Examples with AssistantBlock
The format supports structured assistant messages via Message.assistant_with_blocks() with AssistantBlock objects. I used this to provide a few-shot example in the prompt demonstrating the expected tool call output format.
A couple of observations:
One or two examples are sufficient. I experimented with adding more, expecting improved reliability. The opposite happened: beyond two examples, output quality dropped. The model appeared to overfit to the example pattern and produced less flexible responses. Two examples, regardless of how many tools are in the schema, was the consistent sweet spot.
The tool_calls block structure matters. Tool call arguments must be valid JSON strings. Malformed arguments in a few-shot example reliably produced malformed arguments in the model's actual output. The few-shot example is essentially a demonstration the model mirrors.
An example few-shot assistant turn using AssistantBlock:
```python
from apertus_format import Message, AssistantBlock, BlockType, ToolCall

assistant_msg = Message.assistant_with_blocks([
    AssistantBlock(
        type=BlockType.TOOL_CALLS,
        calls=[
            ToolCall(
                name="control_device",
                arguments='{"action": "turn_off", "zone": "living_room"}'
            )
        ]
    )
])
```
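Because the model mirrors whatever the example shows, a cheap guard is to validate each few-shot argument string before building the prompt. This helper is my own addition, not part of the library:

```python
import json

# Hypothetical guard: fail fast if a few-shot tool-call argument string is
# not valid JSON, since malformed examples reliably produce malformed output.
def validate_arguments(arguments: str) -> dict:
    try:
        parsed = json.loads(arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"few-shot arguments are not valid JSON: {exc}")
    if not isinstance(parsed, dict):
        raise ValueError("tool arguments must be a JSON object")
    return parsed

validate_arguments('{"action": "turn_off", "zone": "living_room"}')
```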
System Instruction Length and Output Quality
Long system prompts degraded all model outputs, not just tool calls. This was the more general finding, and it's worth stating clearly: even in scenarios with no tools involved at all, a verbose system instruction correlated with wandering responses, inconsistent formatting, and generally lower quality output. The model seemed to lose coherence proportional to how much instruction text preceded the conversation.
Tool call reliability was simply one visible symptom of this broader problem. When tool calling failed in ways that were hard to explain (wrong tool selected, arguments malformed, tool call ignored entirely), trimming the system instruction often resolved it. But the same trim also improved plain text response quality in those sessions, which revealed the real cause: the system instruction itself was the problem, regardless of whether tools were in play.
My working theory is that Apertus-8B-Instruct's effective attention degrades with long preambles. The instruction becomes "heavy" in a way that competes with the actual conversation content, and the model produces output that reflects both imperfectly rather than following either cleanly.
The fix was the same regardless of call type: short, purpose-specific system instructions.
- Tool-calling prompts: a single directive stating the model should call the appropriate tool and nothing else
- Text response prompts: a brief statement permitting conversational latitude
- Classification (peek) calls: a description of the valid output states only
Each instruction is as short as it can possibly be while still being unambiguous. Token counts dropped meaningfully across all three, and output quality improved across all three, not just in tool call paths.
The practical implication: don't treat system instruction length as free. Every token you add to the system section is competing with the content you actually care about.
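For illustration, the three purpose-specific instructions might look like this. The wording here is my own, not verbatim from my system:

```python
# Illustrative short, purpose-specific system instructions, one per call
# type, selected at request time instead of one long shared preamble.
SYSTEM_INSTRUCTIONS = {
    "tool": "Call the single most appropriate tool. Output nothing else.",
    "text": "Answer the user conversationally and concisely.",
    "peek": "Output exactly one of: tool, text.",
}

def system_instruction(call_type: str) -> str:
    return SYSTEM_INSTRUCTIONS[call_type]
```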
The text_response Tool
When I provided tools in the formatter and sent a request that didn't require a tool call, the model would still attempt to format its response as a tool call β often inventing a tool name. The fix was straightforward: add a text_response tool to the schema alongside the domain tools.
```python
TEXT_RESPONSE_TOOL = {
    "name": "text_response",
    "description": "Return a natural language response to the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "The response text."
            }
        },
        "required": ["content"]
    }
}
```
Once this was included in every tool-capable prompt, the model had a valid structured output path for conversational replies and used it consistently.
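On the consuming side, a small dispatcher can unwrap text_response calls and route everything else to real tool executors. This is a hypothetical sketch of my own, not library code:

```python
import json

# Hypothetical dispatcher: if the model chose the text_response tool,
# return its content as the reply; otherwise execute the named tool.
def dispatch_tool_call(name: str, arguments: str, executors: dict):
    args = json.loads(arguments)
    if name == "text_response":
        return args["content"]       # plain conversational reply
    return executors[name](**args)   # real domain tool
```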
Summary
| Behavior | Finding |
|---|---|
| add_generation_prompt in tool mode | Must be False; True suppresses tool calls |
| add_generation_prompt in streaming mode | Must be True; False produces no output |
| Streaming + tool calls | Incompatible; requires a two-call pattern |
| Few-shot AssistantBlock examples | 2 examples optimal; more degrades quality |
| System instruction length | Shorter is better for tool calling accuracy |
| Mixed tool/text prompts | Add a text_response tool to prevent the model inventing names |
The format library itself (swiss-ai/apertus-format) worked correctly in all cases; these are observations about prompt construction strategy, not library bugs. Happy to share more detail on any of the above.
This is very cool, thanks for sharing your experience here. Do you have a blog? Or considered posting this to your HF profile? It would be nice to learn a little more about the context of your explorations.
See also #18
Thanks, @loleg ! I have been meaning to dust off my blog, but I have yet to get around to that. In the meantime, here is a little more of a teaser for what I am doing.
I want the intent and context awareness that AI can bring to everyday interactions, but I didn't want my personal assistant gossiping about me to Zuckerberg, Altman, and Bezos. So I built my own: a fully on-premises home intelligence system running swiss-ai/Apertus-8B-Instruct locally on an RTX 3090 via vLLM, with a fine-tuned LoRA adapter trained on my home layout and device vocabulary. Voice is captured and transcribed locally via Whisper STT, responses are synthesized via XTTS and streamed sentence-by-sentence to speakers throughout the house. Nothing leaves the building. A confidence-based symbolic reasoning layer sits in front of the LLM, handling routine requests from a local skill cache in under 150ms and only escalating to the model when genuinely needed. All services communicate over an MQTT broker with mutual TLS certificate authentication. Happy to share more detail on any part of the stack.