Tool-call format incompatible with vLLM on Ampere (RTX 3090): inconsistent XML output

#4
by wasifb - opened

Model: kai-os/Carnice-V2-27b
Base: Qwen3.6-27B (Hermes-style agentic SFT)
HF Tags: hermes-agent, tool-use
Date filed: 2026-05-05

Summary

Carnice-V2-27b ships with a Hermes-style chat template that instructs XML-formatted tool calls using <parameter=name> (an equals sign between the tag name and the parameter name). However, the model's actual generation output is inconsistent across runs: sometimes it emits a space delimiter (<parameter name>), sometimes a malformed double bracket (<parameter<name>>), and sometimes a hybrid. No existing vLLM tool-call parser (qwen3_xml, hermes, qwen3coder) can reliably parse Carnice's output.

The only known vLLM deployment where tool calls work is the NVFP4 quant (sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP), which uses --tool-call-parser qwen3_xml. But NVFP4 is a Blackwell-only (SM100+) feature and does not run on Ampere hardware (RTX 3090, SM86). Our AutoRound INT4 build on Ampere exhibits the same format issue.

Evidence

1. Chat template instructs XML with equals delimiter

The model's native chat template includes the following instruction:

<tool_call>
<function=example_function_name>
<parameter=example_parameter_1>
value_1
</parameter>
<parameter=example_parameter_2>
This is the value for the second parameter
that can span
multiple lines
</parameter>
</function>
</tool_call>

Note the <parameter=name> syntax: an equals sign (=) separates the tag name from the parameter name. This is neither the Qwen3 XML format (which uses <parameter_name>value</parameter_name> or <parameters>{"name": "value"}</parameters>) nor the Hermes v1 JSON format (which wraps JSON inside <tool_call>).
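To make the delimiter ambiguity concrete, here is a minimal Python sketch (our own illustration, not a vLLM parser) of a lenient extractor that accepts both the template's `=` delimiter and the space-delimiter drift, while the double-bracket variant still fails to match:

```python
import re

# Lenient pattern (illustrative only): accept either "=" or whitespace
# between the "parameter" tag name and the parameter name.
PARAM_RE = re.compile(
    r"<parameter[=\s]+(?P<name>[\w-]+)\s*>\s*(?P<value>.*?)\s*</parameter>",
    re.DOTALL,
)

def extract_params(block: str) -> dict:
    """Pull parameter name/value pairs out of a <function=...> body."""
    return {m.group("name"): m.group("value") for m in PARAM_RE.finditer(block)}

# Template-conformant form parses...
print(extract_params("<parameter=location>\nParis\n</parameter>"))  # {'location': 'Paris'}
# ...the space-delimiter drift also parses...
print(extract_params("<parameter location>\nParis\n</parameter>"))  # {'location': 'Paris'}
# ...but the double-bracket variant does not match at all.
print(extract_params("<parameter<location>>Paris</parameter>"))     # {}
```

This is only a demonstration of why a single strict grammar cannot cover all three observed variants.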

2. Model output is inconsistent across runs

When prompted with the same tool-use request, Carnice produces different format variants on different inference runs:

| Run | Format | Example snippet |
|-----|--------|-----------------|
| 1 | `<parameter location>` (space delimiter) | `<parameter location>\nParis\n</parameter>` |
| 2 | `<parameter<location>>` (double-bracket, malformed) | `<parameter<location>Paris</parameter>` |
| 3 | Broken escaped JSON (with JSON-patched template) | `{\"location\": \"Paris\"}` |

This non-determinism makes it impossible to write a stable regex or parser.
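Before any parse attempt, a run's output can at least be bucketed by variant. A quick sketch (our own helper, names hypothetical):

```python
import re

# Classify which delimiter variant a generation used, so repeated runs
# can be bucketed before attempting to parse. Illustrative helper only.
def classify_variant(text: str) -> str:
    if re.search(r"<parameter<\w+>", text):
        return "double-bracket (malformed)"
    if re.search(r"<parameter=\w+>", text):
        return "equals (template-conformant)"
    if re.search(r"<parameter \w+>", text):
        return "space (drift)"
    if re.search(r'\{\s*\\?"', text):  # plain or escaped JSON object start
        return "json-ish"
    return "unknown"

print(classify_variant("<parameter location>\nParis\n</parameter>"))  # space (drift)
```

We used logic like this to produce the run table above; it detects the variants but does not make them parseable.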

3. No vLLM parser handles the format

We tested all three relevant vLLM built-in tool-call parsers:

| Parser | Tested with | Result |
|--------|-------------|--------|
| `qwen3_xml` | Original template + patched template | ❌ Inconsistent output; `<parameter=name>` is not a recognized Qwen3 format |
| `hermes` | Original template | ❌ Expects JSON inside `<tool_call>...</tool_call>`, not XML |
| `hermes` | Patched template (instructing JSON) | ❌ Model produces broken escaped JSON in `arguments` |
| `qwen3coder` | Original template | ❌ Format mismatch |

4. Patched template → broken JSON

When we patched the chat template to instruct JSON output inside <tool_call> tags (to match vLLM's hermes parser expectation), the model produces output like:

<tool_call>
function: get_weather
arguments: {\"location\": \"Paris\"}
</tool_call>

The JSON in arguments is double-escaped (literal \" in the text) because the model's fine-tuned token distribution prefers XML parameter-value pairs over raw JSON string interpolation.
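Client-side, one level of the double-escaping can often be undone before parsing. A hedged sketch (our own repair helper, not something vLLM's parser does):

```python
import json

# Repair helper (illustrative): the model emits literal \" inside the
# arguments text, so a direct json.loads fails; unescape once and retry.
def parse_arguments(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(raw.replace('\\"', '"'))  # undo one escape level

broken = '{\\"location\\": \\"Paris\\"}'  # the double-escaped text from above
print(parse_arguments(broken))  # {'location': 'Paris'}
```

This only rescues the simple cases; nested quotes or partially escaped output still break.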

What we tried

Parser attempts

  • --tool-call-parser qwen3_xml: format inconsistency - parser expects <parameters>JSON</parameters> but gets <parameter=name> or <parameter name>
  • --tool-call-parser hermes with the original template: format mismatch - parser expects JSON, gets XML
  • --tool-call-parser hermes with the JSON-patched template: broken escaped JSON - the model can't reliably produce clean JSON strings
  • --tool-call-parser qwen3coder: format mismatch - parser expects <tool_call>JSON</tool_call> with a specific JSON schema

Template patching

  • Original template → XML <parameter=name> format (as shipped)
  • Patched template instructing JSON output → model produces malformed/broken JSON
  • Removing/adding an empty <think>\n\n</think> block in the generation prompt → no effect on the format issue

Other flags

  • Removing the --reasoning-parser qwen3 flag → no effect on the format issue
  • Adding an empty think block to the generation prompt → no effect on the format issue
  • Varying temperature (0.0, 0.3, 0.6) → format still varies, especially at temperature > 0

Working vLLM configuration (Blackwell-only)

The NVFP4 quant at sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP reportedly works with:

--tool-call-parser qwen3_xml

This suggests the NVFP4 upload may ship a corrected copy of the chat template that aligns the model's output format with what qwen3_xml expects. However:

  • NVFP4 requires Blackwell (RTX 5090, B200, etc.) and is non-functional on Ampere SM86
  • The NVFP4 uploader may have patched the tokenizer_config.json or chat template differently
  • We cannot verify the exact template difference because the NVFP4 quant is not loadable on our hardware

Our environment (reproducible on Ampere)

  • Hardware: 1× or 2× RTX 3090 (Ampere SM86, 24 GB), PCIe, no NVLink
  • vLLM version: v0.20.x + Genesis patches (v7.48–v7.69 tested)
  • Quantization: AutoRound INT4 (W4A16) via Marlin kernel
  • Flags: --enable-auto-tool-choice, various --tool-call-parser values tested
  • Full reproduce: see noonghunna/club-3090; docker-compose.carnice-bf16mtp.yml shows the shipping config (which works via a heavily patched chat template that forces JSON output, not via native parser compatibility)

Workaround (for Ampere users)

We ship a heavily patched chat template (carnice-chat-template.jinja) that:

  1. Instructs the model to output JSON inside <tool_call> tags (instead of native XML)
  2. Uses --tool-call-parser hermes (which expects JSON within <tool_call>)
  3. Accepts that the model may still produce imperfect JSON in some cases
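The fallback chain this workaround implies can be sketched client-side (function and pattern names are ours, not vLLM internals): try the hermes-style JSON the patched template asks for first, then tolerate the XML drift the model produces anyway.

```python
import json
import re

# Illustrative fallback parser: JSON first (patched-template path),
# then the lenient XML forms Carnice emits despite the instruction.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNC_RE = re.compile(r"<function[=\s]+(?P<fn>[\w-]+)\s*>", re.DOTALL)
PARAM_RE = re.compile(
    r"<parameter[=\s]+(?P<name>[\w-]+)\s*>\s*(?P<value>.*?)\s*</parameter>",
    re.DOTALL,
)

def parse_tool_call(text: str):
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None
    body = m.group(1)
    try:  # preferred path: the JSON the patched template instructs
        return json.loads(body)
    except json.JSONDecodeError:
        pass
    fn = FUNC_RE.search(body)  # fallback: tolerate the XML drift
    if fn:
        args = {p.group("name"): p.group("value") for p in PARAM_RE.finditer(body)}
        return {"name": fn.group("fn"), "arguments": args}
    return None
```

This recovers both the JSON path and the `=`/space XML variants, but like the template patch itself it is a brittle stopgap, not a fix.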

This workaround is brittle: it relies on overriding the model's native format instruction rather than matching what the model was actually fine-tuned to produce. A proper fix would require one of the following:

  • Retraining/re-tuning Carnice to output a format that matches an existing vLLM parser (Hermes JSON or Qwen3 XML)
  • Adding a new vLLM parser that tolerates Carnice's <parameter=name> format (if consistent output could be achieved)
  • Publishing a corrected tokenizer_config.json with a compatible chat template

Request

  1. Please clarify the intended tool-call format. The HuggingFace model card tags this as hermes-agent and tool-use, but the actual output format doesn't match any documented vLLM parser. What format was Carnice fine-tuned to produce?

  2. Please publish a corrected chat template in the model repo that produces output compatible with a standard vLLM parser (hermes JSON or qwen3_xml). The current template instructs <parameter=name> but the model doesn't follow it reliably, and no parser understands that format.

  3. If the NVFP4 quant uses a different template, please publish that template separately so the INT4 community on Ampere can benefit from the same fix.

  4. Consider adding qwen3coder or qwen3_xml to the model tags if those are the expected parsers, so users know which parser to configure.

Related links

The image we produced: wasifb/Carnice_V2_27B_INT4_BF16MTP
