# Tool-call format incompatible with vLLM tool parsers on Ampere (RTX 3090): inconsistent XML output
**Model:** kai-os/Carnice-V2-27b
**Base:** Qwen3.6-27B (Hermes-style agentic SFT)
**HF tags:** `hermes-agent`, `tool-use`
**Date filed:** 2026-05-05
## Summary
Carnice-V2-27b ships with a Hermes-style chat template that instructs XML-formatted tool calls using `<parameter=name>` (an equals sign between the tag name and the parameter name). However, the model's actual generation output is inconsistent across runs: sometimes it emits a space delimiter (`<parameter name>`), sometimes a malformed double bracket (`<parameter<name>>`), and sometimes a hybrid. No existing vLLM tool-call parser (`qwen3_xml`, `hermes`, `qwen3coder`) can reliably parse Carnice's output.

The only known vLLM deployment where tool calls work is the NVFP4 quant (sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP), which uses `--tool-call-parser qwen3_xml`. NVFP4, however, is a Blackwell (SM9x+) feature and does not run on Ampere hardware (RTX 3090, SM86). Our AutoRound INT4 build on Ampere exhibits the same format issue.
## Evidence

### 1. Chat template instructs XML with an equals delimiter
The model's native chat template includes the following instruction:
```
<tool_call>
<function=example_function_name>
<parameter=example_parameter_1>
value_1
</parameter>
<parameter=example_parameter_2>
This is the value for the second parameter
that can span
multiple lines
</parameter>
</function>
</tool_call>
```
Note the `<parameter=name>` syntax: an equals sign (`=`) separates the tag name from the parameter name. This is neither the Qwen3 XML format (which uses `<parameter_name>value</parameter_name>` or `<parameters>{"name": "value"}</parameters>`) nor the Hermes v1 JSON format (which wraps JSON inside `<tool_call>`).
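For concreteness, here is a minimal sketch of what extracting this documented format would require. The regexes and the `parse_tool_call` helper are ours, written only against the template example above; they are not part of vLLM:

```python
import re

# Example completion in the template's documented format (function and
# parameter names follow the template's own example, simplified).
TOOL_CALL = """<tool_call>
<function=get_weather>
<parameter=location>
Paris
</parameter>
</function>
</tool_call>"""

# <function=NAME> ... </function>, with <parameter=NAME> blocks inside.
FUNC_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)

def parse_tool_call(text):
    """Parse one <tool_call> block; returns None when no function tag matches."""
    m = FUNC_RE.search(text)
    if m is None:
        return None
    name, body = m.groups()
    return {"name": name, "arguments": dict(PARAM_RE.findall(body))}
```

The drifted variants described in the next section (`<parameter name>`, `<parameter<name>>`) fall straight through this kind of strict extractor, which is the core of the problem.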
### 2. Model output is inconsistent across runs
When prompted with the same tool-use request, Carnice produces different format variants on different inference runs:
| Run | Format | Example snippet |
|---|---|---|
| 1 | `<parameter location>` (space delimiter) | `<parameter location>\nParis\n</parameter>` |
| 2 | `<parameter<location>>` (double-bracket malformed) | `<parameter<location>Paris</parameter>` |
| 3 | Broken escaped JSON (with the JSON-patched template) | `{\"location\": \"Paris\"}` |
This non-determinism makes it impossible to write a stable regex or parser.
### 3. No vLLM parser handles the format
We tested all three relevant vLLM built-in tool-call parsers:
| Parser | Tested with | Result |
|---|---|---|
| `qwen3_xml` | Original template + patched template | ❌ Inconsistent output; `<parameter=name>` is not a recognized Qwen3 format |
| `hermes` | Original template | ❌ Expects JSON inside `<tool_call>...</tool_call>`, not XML |
| `hermes` | Patched template (instructing JSON) | ❌ Model produces broken escaped JSON in `arguments` |
| `qwen3coder` | Original template | ❌ Format mismatch |
### 4. Patched template produces broken JSON

When we patched the chat template to instruct JSON output inside `<tool_call>` tags (to match vLLM's `hermes` parser expectation), the model produces output like:
```
<tool_call>
function: get_weather
arguments: {\"location\": \"Paris\"}
</tool_call>
```
The JSON in `arguments` is double-escaped (literal `\"` sequences in the text) because the model's fine-tuned token distribution prefers XML parameter-value pairs over raw JSON string interpolation.
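This failure reproduces outside vLLM with plain `json.loads`; a small sketch, where the `raw` string mirrors the `arguments` value above:

```python
import json

# Literal backslash-quote sequences, as emitted inside <tool_call>.
raw = r'{\"location\": \"Paris\"}'

# Direct parsing fails: \" is not valid where a property name is expected.
try:
    json.loads(raw)
    direct_parse_ok = True
except json.JSONDecodeError:
    direct_parse_ok = False

# Stripping one level of escaping recovers well-formed JSON, which is
# effectively what any workaround on top of this output has to do.
fixed = raw.replace('\\"', '"')
recovered = json.loads(fixed)
```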
## What we tried

### Parser attempts

- `--tool-call-parser qwen3_xml`: format inconsistency; the parser expects `<parameters>JSON</parameters>` but gets `<parameter=name>` or `<parameter name>`
- `--tool-call-parser hermes` with the original template: format mismatch; the parser expects JSON, gets XML
- `--tool-call-parser hermes` with the JSON-patched template: broken escaped JSON; the model can't reliably produce clean JSON strings
- `--tool-call-parser qwen3coder`: format mismatch; the parser expects `<tool_call>JSON</tool_call>` with a specific JSON schema
### Template patching

- Original template (as shipped): XML `<parameter=name>` format
- Patched template instructing JSON output: model produces malformed/broken JSON
- Removing/adding an empty `<think>\n\n</think>` block in the generation prompt: no effect on the format issue
### Other flags

- Removing the `--reasoning-parser qwen3` flag: no effect on the format issue
- Adding an empty think block to the generation prompt: no effect on the format issue
- Varying `temperature` (0.0, 0.3, 0.6): format still varies, especially at temperature > 0
## Working vLLM configuration (Blackwell-only)

The NVFP4 quant at sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP reportedly works with:

```
--tool-call-parser qwen3_xml
```
This suggests the NVFP4 model may use a prefix-corrected copy of the chat template that aligns the model's output format with what `qwen3_xml` expects. However:

- NVFP4 requires Blackwell (RTX 5090, B200, etc.) and is non-functional on Ampere SM86
- The NVFP4 uploader may have patched `tokenizer_config.json` or the chat template differently
- We cannot verify the exact template difference because the NVFP4 quant is not loadable on our hardware
## Our environment (reproducible on Ampere)

- Hardware: 1× or 2× RTX 3090 (Ampere SM86, 24 GB), PCIe, no NVLink
- vLLM version: v0.20.x + Genesis patches (v7.48 to v7.69 tested)
- Quantization: AutoRound INT4 (W4A16) via the Marlin kernel
- Flags: `--enable-auto-tool-choice`, plus the various `--tool-call-parser` values tested above
- Full reproduction: see noonghunna/club-3090; `docker-compose.carnice-bf16mtp.yml` shows the shipping config (which works via a heavily patched chat template that forces JSON output, not via native parser compatibility)
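For reference, the flags above combine into an invocation along these lines (model ID and template path are illustrative; adjust to your checkout):

```shell
vllm serve wasifb/Carnice_V2_27B_INT4_BF16MTP \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --chat-template models/qwen3.6-27b/vllm/patches/carnice-chat-template.jinja
```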
## Workaround (for Ampere users)

We ship a heavily patched chat template (`carnice-chat-template.jinja`) that:

- instructs the model to output JSON inside `<tool_call>` tags (instead of native XML)
- uses `--tool-call-parser hermes` (which expects JSON within `<tool_call>`)
- accepts that the model may still produce imperfect JSON in some cases

This workaround is brittle: it relies on overriding the model's native format instruction rather than matching what the model was actually fine-tuned to produce. A proper fix would require one of:

- Retraining/re-tuning Carnice to output a format that matches an existing vLLM parser (Hermes JSON or Qwen3 XML)
- Adding a new vLLM parser that tolerates Carnice's `<parameter=name>` format (if consistent output could be achieved)
- Publishing a corrected `tokenizer_config.json` with a compatible chat template
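As a rough sketch of the new-parser option: a pre-processing pass could canonicalize the opening-tag variants before a `<parameter=name>`-aware parser runs. The regex is ours, covers only the variants documented above, and does not address the underlying non-determinism:

```python
import re

# Matches the three observed openings:
#   <parameter=name>   documented equals delimiter
#   <parameter name>   observed space delimiter
#   <parameter<name>>  observed double bracket (sometimes a single '>')
VARIANT = re.compile(r"<parameter[ =<]([\w-]+)>>?")

def normalize(text: str) -> str:
    """Rewrite every variant opening tag to the canonical <parameter=NAME>."""
    return VARIANT.sub(r"<parameter=\1>", text)
```

Closing `</parameter>` tags are left untouched, since those were consistent in our runs.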
## Request

1. Please clarify the intended tool-call format. The Hugging Face model card tags this model as `hermes-agent` and `tool-use`, but the actual output format doesn't match any documented vLLM parser. What format was Carnice fine-tuned to produce?
2. Please publish a corrected chat template in the model repo that produces output compatible with a standard vLLM parser (`hermes` JSON or `qwen3_xml`). The current template instructs `<parameter=name>`, but the model doesn't follow it reliably, and no parser understands that format.
3. If the NVFP4 quant uses a different template, please publish that template separately so the INT4 community on Ampere can benefit from the same fix.
4. Consider adding `qwen3coder` or `qwen3_xml` to the model tags if those are the expected parsers, so users know which parser to configure.
## Related links

- Model on HF: https://huggingface.co/kai-os/Carnice-V2-27b
- NVFP4 quant (working, Blackwell-only): https://huggingface.co/sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP
- Our deployment + patched template: https://github.com/noonghunna/club-3090 (see `models/qwen3.6-27b/vllm/patches/carnice-chat-template.jinja`)
- The INT4 image we produced: `wasifb/Carnice_V2_27B_INT4_BF16MTP`