Tool-call format incompatible with vLLM on Ampere (RTX 3090): inconsistent XML output

#4
by wasifb - opened

Model: kai-os/Carnice-V2-27b
Base: Qwen3.6-27B (Hermes-style agentic SFT)
HF Tags: hermes-agent, tool-use
Date filed: 2026-05-05

Summary

Carnice-V2-27b ships with a Hermes-style chat template that instructs XML-formatted tool calls using <parameter=name> (an equals sign between the tag name and the parameter name). However, the model's actual generation output is inconsistent across runs: sometimes it emits a space delimiter (<parameter name>), sometimes a malformed double bracket (<parameter<name>>), and sometimes a hybrid. No existing vLLM tool-call parser (qwen3_xml, hermes, qwen3coder) can reliably parse Carnice's output.

The only known vLLM deployment where tool calls work is the NVFP4 quant (sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP), which uses --tool-call-parser qwen3_xml. But NVFP4 is a Blackwell-only (SM100+) feature and does not run on Ampere hardware (RTX 3090, SM86). Our AutoRound INT4 build on Ampere exhibits the same format issue.

Evidence

1. Chat template instructs XML with equals delimiter

The model's native chat template includes the following instruction:

<tool_call>
<function=example_function_name>
<parameter=example_parameter_1>
value_1
</parameter>
<parameter=example_parameter_2>
This is the value for the second parameter
that can span
multiple lines
</parameter>
</function>
</tool_call>

Note the <parameter=name> syntax: an equals sign (=) separates the tag name from the parameter name. This is neither the Qwen3 XML format (which uses <parameter_name>value</parameter_name> or <parameters>{"name": "value"}</parameters>) nor the Hermes v1 JSON format (which wraps JSON inside <tool_call>).
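To make the delimiter ambiguity concrete, here is a minimal Python sketch (our own illustration, not a vLLM parser) of a lenient extractor that accepts both the template's `=` delimiter and the space-delimiter drift, while the double-bracket variant still fails to match:

```python
import re

# Lenient pattern (illustrative only): accept either "=" or whitespace
# between the "parameter" tag name and the parameter name.
PARAM_RE = re.compile(
    r"<parameter[=\s]+(?P<name>[\w-]+)\s*>\s*(?P<value>.*?)\s*</parameter>",
    re.DOTALL,
)

def extract_params(block: str) -> dict:
    """Pull parameter name/value pairs out of a <function=...> body."""
    return {m.group("name"): m.group("value") for m in PARAM_RE.finditer(block)}

# Template-conformant form parses...
print(extract_params("<parameter=location>\nParis\n</parameter>"))  # {'location': 'Paris'}
# ...the space-delimiter drift also parses...
print(extract_params("<parameter location>\nParis\n</parameter>"))  # {'location': 'Paris'}
# ...but the double-bracket variant does not match at all.
print(extract_params("<parameter<location>>Paris</parameter>"))     # {}
```

This is only a demonstration of why a single strict grammar cannot cover all three observed variants.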

2. Model output is inconsistent across runs

When prompted with the same tool-use request, Carnice produces different format variants on different inference runs:

| Run | Format | Example snippet |
|-----|--------|-----------------|
| 1 | `<parameter location>` (space delimiter) | `<parameter location>\nParis\n</parameter>` |
| 2 | `<parameter<location>>` (double-bracket, malformed) | `<parameter<location>Paris</parameter>` |
| 3 | Broken escaped JSON (with JSON-patched template) | `{\"location\": \"Paris\"}` |

This non-determinism makes it impossible to write a stable regex or parser.
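Before any parse attempt, a run's output can at least be bucketed by variant. A quick sketch (our own helper, names hypothetical):

```python
import re

# Classify which delimiter variant a generation used, so repeated runs
# can be bucketed before attempting to parse. Illustrative helper only.
def classify_variant(text: str) -> str:
    if re.search(r"<parameter<\w+>", text):
        return "double-bracket (malformed)"
    if re.search(r"<parameter=\w+>", text):
        return "equals (template-conformant)"
    if re.search(r"<parameter \w+>", text):
        return "space (drift)"
    if re.search(r'\{\s*\\?"', text):  # plain or escaped JSON object start
        return "json-ish"
    return "unknown"

print(classify_variant("<parameter location>\nParis\n</parameter>"))  # space (drift)
```

We used logic like this to produce the run table above; it detects the variants but does not make them parseable.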

3. No vLLM parser handles the format

We tested all three relevant vLLM built-in tool-call parsers:

| Parser | Tested with | Result |
|--------|-------------|--------|
| `qwen3_xml` | Original template + patched template | ❌ Inconsistent output; `<parameter=name>` is not a recognized Qwen3 format |
| `hermes` | Original template | ❌ Expects JSON inside `<tool_call>...</tool_call>`, not XML |
| `hermes` | Patched template (instructing JSON) | ❌ Model produces broken escaped JSON in `arguments` |
| `qwen3coder` | Original template | ❌ Format mismatch |

4. Patched template → broken JSON

When we patched the chat template to instruct JSON output inside <tool_call> tags (to match vLLM's hermes parser expectation), the model produces output like:

<tool_call>
function: get_weather
arguments: {\"location\": \"Paris\"}
</tool_call>

The JSON in arguments is double-escaped (literal \" in the text) because the model's fine-tuned token distribution prefers XML parameter-value pairs over raw JSON string interpolation.
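Client-side, one level of the double-escaping can often be undone before parsing. A hedged sketch (our own repair helper, not something vLLM's parser does):

```python
import json

# Repair helper (illustrative): the model emits literal \" inside the
# arguments text, so a direct json.loads fails; unescape once and retry.
def parse_arguments(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(raw.replace('\\"', '"'))  # undo one escape level

broken = '{\\"location\\": \\"Paris\\"}'  # the double-escaped text from above
print(parse_arguments(broken))  # {'location': 'Paris'}
```

This only rescues the simple cases; nested quotes or partially escaped output still break.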

What we tried

Parser attempts

  • --tool-call-parser qwen3_xml: format inconsistency - parser expects <parameters>JSON</parameters> but gets <parameter=name> or <parameter name>
  • --tool-call-parser hermes with the original template: format mismatch - parser expects JSON, gets XML
  • --tool-call-parser hermes with the JSON-patched template: broken escaped JSON - the model can't reliably produce clean JSON strings
  • --tool-call-parser qwen3coder: format mismatch - parser expects <tool_call>JSON</tool_call> with a specific JSON schema

Template patching

  • Original template → XML <parameter=name> format (as shipped)
  • Patched template instructing JSON output → model produces malformed/broken JSON
  • Removing/adding an empty <think>\n\n</think> block in the generation prompt → no effect on the format issue

Other flags

  • Removing the --reasoning-parser qwen3 flag → no effect on the format issue
  • Adding an empty think block to the generation prompt → no effect on the format issue
  • Varying temperature (0.0, 0.3, 0.6) → format still varies, especially at temperature > 0

Working vLLM configuration (Blackwell-only)

The NVFP4 quant at sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP reportedly works with:

--tool-call-parser qwen3_xml

This suggests the NVFP4 upload may ship a corrected copy of the chat template that aligns the model's output format with what qwen3_xml expects. However:

  • NVFP4 requires Blackwell (RTX 5090, B200, etc.) and is non-functional on Ampere SM86
  • The NVFP4 uploader may have patched the tokenizer_config.json or chat template differently
  • We cannot verify the exact template difference because the NVFP4 quant is not loadable on our hardware

Our environment (reproducible on Ampere)

  • Hardware: 1× or 2× RTX 3090 (Ampere SM86, 24 GB), PCIe, no NVLink
  • vLLM version: v0.20.x + Genesis patches (v7.48–v7.69 tested)
  • Quantization: AutoRound INT4 (W4A16) via Marlin kernel
  • Flags: --enable-auto-tool-choice, various --tool-call-parser values tested
  • Full reproduce: see noonghunna/club-3090; docker-compose.carnice-bf16mtp.yml shows the shipping config (which works via a heavily patched chat template that forces JSON output, not via native parser compatibility)

Workaround (for Ampere users)

We ship a heavily patched chat template (carnice-chat-template.jinja) that:

  1. Instructs the model to output JSON inside <tool_call> tags (instead of native XML)
  2. Uses --tool-call-parser hermes (which expects JSON within <tool_call>)
  3. Accepts that the model may still produce imperfect JSON in some cases
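The fallback chain this workaround implies can be sketched client-side (function and pattern names are ours, not vLLM internals): try the hermes-style JSON the patched template asks for first, then tolerate the XML drift the model produces anyway.

```python
import json
import re

# Illustrative fallback parser: JSON first (patched-template path),
# then the lenient XML forms Carnice emits despite the instruction.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNC_RE = re.compile(r"<function[=\s]+(?P<fn>[\w-]+)\s*>", re.DOTALL)
PARAM_RE = re.compile(
    r"<parameter[=\s]+(?P<name>[\w-]+)\s*>\s*(?P<value>.*?)\s*</parameter>",
    re.DOTALL,
)

def parse_tool_call(text: str):
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None
    body = m.group(1)
    try:  # preferred path: the JSON the patched template instructs
        return json.loads(body)
    except json.JSONDecodeError:
        pass
    fn = FUNC_RE.search(body)  # fallback: tolerate the XML drift
    if fn:
        args = {p.group("name"): p.group("value") for p in PARAM_RE.finditer(body)}
        return {"name": fn.group("fn"), "arguments": args}
    return None
```

This recovers both the JSON path and the `=`/space XML variants, but like the template patch itself it is a brittle stopgap, not a fix.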

This workaround is brittle: it relies on overriding the model's native format instruction rather than matching what the model was actually fine-tuned to produce. A proper fix would require one of the following:

  • Retraining/re-tuning Carnice to output a format that matches an existing vLLM parser (Hermes JSON or Qwen3 XML)
  • Adding a new vLLM parser that tolerates Carnice's <parameter=name> format (if consistent output could be achieved)
  • Publishing a corrected tokenizer_config.json with a compatible chat template

Request

  1. Please clarify the intended tool-call format. The HuggingFace model card tags this as hermes-agent and tool-use, but the actual output format doesn't match any documented vLLM parser. What format was Carnice fine-tuned to produce?

  2. Please publish a corrected chat template in the model repo that produces output compatible with a standard vLLM parser (hermes JSON or qwen3_xml). The current template instructs <parameter=name> but the model doesn't follow it reliably, and no parser understands that format.

  3. If the NVFP4 quant uses a different template, please publish that template separately so the INT4 community on Ampere can benefit from the same fix.

  4. Consider adding qwen3coder or qwen3_xml to the model tags if those are the expected parsers, so users know which parser to configure.

Related links

The image we produced: wasifb/Carnice_V2_27B_INT4_BF16MTP
