Apertus Tool Calling: Practical Notes
It has been two months since I last posted about my Apertus-based project. I have made a lot of progress since then, most importantly with regard to tools. I wanted to share some of what I have learned with the community; hopefully it helps someone else.
Background
I came across the swiss-ai/apertus-format GitHub repo, which helped improve model output quality and consistency. For more information about the ApertusFormatter, head over to the GitHub format specification page. This post is not a tutorial, but I will try to explain practical behavior I observed that isn't obvious from the documentation.
The add_generation_prompt Parameter
This parameter on format_conversation() had the most impact on whether tool calls were generated correctly, and its behavior wasn't immediately obvious to me.
In tool-calling mode (intent == tool call), you need:
```python
formatter = ApertusFormatter(enable_thinking=True, tools=tool_functions)
formatted = formatter.format_conversation(conversation, add_generation_prompt=False)
```
Setting add_generation_prompt=True with a tool-capable prompt caused the model to consistently ignore the tools and generate a plain text response instead. My interpretation is that the generation prompt suffix signals "start producing a human-readable assistant turn," which is in tension with the structured <|tools_prefix|>...<|tools_suffix|> output the tool path requires. Omitting it leaves the model in a state where it correctly emits the tool call JSON.
In streaming/text mode, add_generation_prompt=True is required; without it, the streamed response produces no tokens.
So the rule I landed on:
- Tool call path: add_generation_prompt=False
- Text response path: add_generation_prompt=True
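To make the rule concrete, here is a tiny helper that encodes it. This is my own sketch, not part of apertus-format, and the intent labels are illustrative:

```python
# Hypothetical helper encoding the rule above; "tool_call", "text", and
# "streaming" are my own labels, not apertus-format values.
def generation_prompt_flag(intent: str) -> bool:
    """Return the add_generation_prompt value for a given call intent."""
    if intent == "tool_call":
        return False  # the generation-prompt suffix suppresses structured tool output
    if intent in ("text", "streaming"):
        return True   # without it, the streamed response produces no tokens
    raise ValueError(f"unknown intent: {intent!r}")
```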
Streaming and Tool Calls Are Mutually Exclusive
The format supports tool calls via <|tools_prefix|>[...]<|tools_suffix|> in the assistant turn. This works correctly in a single HTTP completion call. It does not work in a streaming context.
When the model is streaming, it emits tokens one at a time and the client is consuming them as they arrive. There's no point at which you can intercept a partial JSON array, parse a tool call, execute it, and inject the result back into the conversation before the generation is already half-finished. The model doesn't wait.
Besides, with add_generation_prompt=True the model ignores the tools anyway, as noted above.
What I ended up implementing is a two-call pattern:
- A short classification call (max_tokens=5, add_generation_prompt=False) that determines whether the model intends to call a tool or produce a text response. I use a dedicated tag_extraction tool for this with a constrained output schema.
- Based on the result: either a full tool-calling HTTP call (synchronous), or a full streaming text generation.
The classification call adds approximately 200 ms. The benefit is that the two paths, tool execution and streaming text, stay cleanly separated, and neither interferes with the other.
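A minimal sketch of the dispatch logic, with the actual HTTP/model calls stubbed out. classify_intent, run_tool_call, and stream_text are hypothetical stand-ins for the calls described above, not apertus-format APIs:

```python
# Illustrative two-call pattern: a cheap classification call decides the
# path, then the full request runs down exactly one of the two paths.
def classify_intent(conversation: dict) -> str:
    """First call: max_tokens=5, add_generation_prompt=False.
    Stubbed here; the real version asks the model via a constrained
    tag_extraction tool and returns "tool" or "text"."""
    return "tool" if conversation.get("wants_tool") else "text"

def run_tool_call(conversation: dict) -> dict:
    return {"mode": "tool"}    # stub for the synchronous HTTP completion

def stream_text(conversation: dict) -> dict:
    return {"mode": "stream"}  # stub for the streaming text generation

def handle_request(conversation: dict) -> dict:
    intent = classify_intent(conversation)  # ~200 ms classification call
    if intent == "tool":
        return run_tool_call(conversation)
    return stream_text(conversation)
```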
Few-Shot Examples with AssistantBlock
The format supports structured assistant messages via Message.assistant_with_blocks() with AssistantBlock objects. I used this to provide a few-shot example in the prompt demonstrating the expected tool call output format.
A couple of observations:
One or two examples are sufficient. I experimented with adding more, expecting improved reliability. The opposite happened: beyond two examples, output quality dropped. The model appeared to overfit to the example pattern and produced less flexible responses. Two examples, regardless of how many tools are in the schema, was the consistent sweet spot.
The tool_calls block structure matters. Tool call arguments must be valid JSON strings. Malformed arguments in a few-shot example reliably produced malformed arguments in the model's actual output. The few-shot example is essentially a demonstration the model mirrors.
An example few-shot assistant turn using AssistantBlock:
```python
from apertus_format import Message, AssistantBlock, BlockType, ToolCall

assistant_msg = Message.assistant_with_blocks([
    AssistantBlock(
        type=BlockType.TOOL_CALLS,
        calls=[
            ToolCall(
                name="control_device",
                arguments='{"action": "turn_off", "zone": "living_room"}'
            )
        ]
    )
])
```
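Because the model mirrors whatever the example shows, a cheap guard is to validate each few-shot argument string before building the prompt. This helper is my own addition, not part of the library:

```python
import json

# Hypothetical guard: fail fast if a few-shot tool-call argument string is
# not valid JSON, since malformed examples reliably produce malformed output.
def validate_arguments(arguments: str) -> dict:
    try:
        parsed = json.loads(arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"few-shot arguments are not valid JSON: {exc}")
    if not isinstance(parsed, dict):
        raise ValueError("tool arguments must be a JSON object")
    return parsed

validate_arguments('{"action": "turn_off", "zone": "living_room"}')
```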
System Instruction Length and Output Quality
Long system prompts degraded all model outputs, not just tool calls. This was the more general finding, and it's worth stating clearly: even in scenarios with no tools involved at all, a verbose system instruction correlated with wandering responses, inconsistent formatting, and generally lower quality output. The model seemed to lose coherence proportional to how much instruction text preceded the conversation.
Tool call reliability was simply one visible symptom of this broader problem. When tool calling failed in ways that were hard to explain (wrong tool selected, arguments malformed, tool call ignored entirely), trimming the system instruction often resolved it. But the same trim also improved plain text response quality in those sessions, which revealed the real cause: the system instruction itself was the problem, regardless of whether tools were in play.
My working theory is that Apertus-8B-Instruct's effective attention degrades with long preambles. The instruction becomes "heavy" in a way that competes with the actual conversation content, and the model produces output that reflects both imperfectly rather than following either cleanly.
The fix was the same regardless of call type: short, purpose-specific system instructions.
- Tool-calling prompts: a single directive stating the model should call the appropriate tool and nothing else
- Text response prompts: a brief statement permitting conversational latitude
- Classification (peek) calls: a description of the valid output states only
Each instruction is as short as it can possibly be while still being unambiguous. Token counts dropped meaningfully across all three, and output quality improved across all three, not just in tool call paths.
The practical implication: don't treat system instruction length as free. Every token you add to the system section is competing with the content you actually care about.
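For illustration, the three purpose-specific instructions might look like this. The wording here is my own, not verbatim from my system:

```python
# Illustrative short, purpose-specific system instructions, one per call
# type, selected at request time instead of one long shared preamble.
SYSTEM_INSTRUCTIONS = {
    "tool": "Call the single most appropriate tool. Output nothing else.",
    "text": "Answer the user conversationally and concisely.",
    "peek": "Output exactly one of: tool, text.",
}

def system_instruction(call_type: str) -> str:
    return SYSTEM_INSTRUCTIONS[call_type]
```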
The text_response Tool
When I provided tools in the formatter and sent a request that didn't require a tool call, the model would still attempt to format its response as a tool call β often inventing a tool name. The fix was straightforward: add a text_response tool to the schema alongside the domain tools.
```python
TEXT_RESPONSE_TOOL = {
    "name": "text_response",
    "description": "Return a natural language response to the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "The response text."
            }
        },
        "required": ["content"]
    }
}
```
Once this was included in every tool-capable prompt, the model had a valid structured output path for conversational replies and used it consistently.
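On the consuming side, a small dispatcher can unwrap text_response calls and route everything else to real tool executors. This is a hypothetical sketch of my own, not library code:

```python
import json

# Hypothetical dispatcher: if the model chose the text_response tool,
# return its content as the reply; otherwise execute the named tool.
def dispatch_tool_call(name: str, arguments: str, executors: dict):
    args = json.loads(arguments)
    if name == "text_response":
        return args["content"]       # plain conversational reply
    return executors[name](**args)   # real domain tool
```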
Summary
| Behavior | Finding |
|---|---|
| add_generation_prompt in tool mode | Must be False; True suppresses tool calls |
| add_generation_prompt in streaming mode | Must be True; False produces no output |
| Streaming + tool calls | Incompatible; requires a two-call pattern |
| Few-shot AssistantBlock examples | 2 examples optimal; more degrades quality |
| System instruction length | Shorter is better for tool calling accuracy |
| Mixed tool/text prompts | Add a text_response tool to prevent the model inventing names |
The format library itself (swiss-ai/apertus-format) worked correctly in all cases; these are observations about prompt construction strategy, not library bugs. Happy to share more detail on any of the above.
This is very cool, thanks for sharing your experience here. Do you have a blog? Or considered posting this to your HF profile? It would be nice to learn a little more about the context of your explorations.
See also #18
Thanks, @loleg ! I have been meaning to dust off my blog, but I have yet to get around to that. In the meantime, here is a little more of a teaser for what I am doing.
I want the intent and context awareness that AI can bring to everyday interactions, but I didn't want my personal assistant gossiping about me to Zuckerberg, Altman, and Bezos. So I built my own: a fully on-premises home intelligence system running swiss-ai/Apertus-8B-Instruct locally on an RTX 3090 via vLLM, with a fine-tuned LoRA adapter trained on my home layout and device vocabulary. Voice is captured and transcribed locally via Whisper STT, responses are synthesized via XTTS and streamed sentence-by-sentence to speakers throughout the house. Nothing leaves the building. A confidence-based symbolic reasoning layer sits in front of the LLM, handling routine requests from a local skill cache in under 150ms and only escalating to the model when genuinely needed. All services communicate over an MQTT broker with mutual TLS certificate authentication. Happy to share more detail on any part of the stack.