Carnice-V2-27B-INT4-BF16-MTP

Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.

This model takes kai-os/Carnice-V2-27b (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

  1. INT4 quantization via AutoRound (W4A16, group_size=128, symmetric) — the quant grid comes from Lorbus's Qwen3.6-27B-int4-AutoRound, delta-merged onto Carnice's BF16 weights. This avoids re-running the full AutoRound calibration loop.
  2. BF16 MTP overlay — all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
  3. Patched chat template — the tool-call format is changed from Qwen3 XML to Hermes JSON inside <tool_call> tags (example below), compatible with vLLM's --tool-call-parser hermes.
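
For reference, a Hermes-style tool call emitted by the patched template looks like this (the function name and arguments are illustrative):

<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>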

Performance

Benchmarked on 2× RTX 3090 (PCIe, no NVLink) with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | 71.75 (CV 11.6%) |
| Code TPS (n=5) | 80.35 (CV 10.6%) |
| MTP acceptance length | 3.02-3.14 |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | ~141 ms |
| Max context | 262K tokens (fp8 KV) |
| Concurrent streams | 2 |
| VRAM per card | 22.25 GiB |
| Model load size | 9.19 GiB |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narr / ~89 code TPS. Carnice is +4% on narrative, -9% on code — practically equivalent for everyday agentic use.
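
The TPS figures can be reproduced against a running server with a short script. A minimal sketch, assuming the server from the quick start below is up (the prompt and run count are illustrative; elapsed time includes TTFT, so decode TPS is slightly understated):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

def measure_tps(prompt: str, n: int = 5, max_tokens: int = 512) -> float:
    """Mean completion tokens per second over n runs."""
    rates = []
    for _ in range(n):
        t0 = time.perf_counter()
        resp = client.chat.completions.create(
            model="carnice-bf16mtp",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=0.6,
        )
        rates.append(resp.usage.completion_tokens / (time.perf_counter() - t0))
    return sum(rates) / len(rates)

print(f"code TPS: {measure_tps('Write a quicksort in Python.'):.2f}")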

Quick start

Docker (vLLM)

services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Flags and values are joined with "=" so each list item is a single argv
    # token; an item like "--model /path" would reach vLLM as one malformed argument.
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --served-model-name=carnice-bf16mtp   # short name used by the API examples below
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'

Note for a single RTX 3090: reduce --max-model-len to ~65K, and set --tensor-parallel-size 1 and --max-num-seqs 1, as sketched below.
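
As a sketch, the single-3090 variant only changes these command entries (the remaining flags stay as in the two-GPU config above):

    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --tensor-parallel-size=1
      - --max-model-len=65536   # ~65K context fits one 24 GB card
      - --max-num-seqs=1
      # ...plus the quantization, KV-cache, parser, and speculative flags above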

API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
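
Streaming works as with any OpenAI-compatible endpoint:

stream = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # chunks without text content (e.g. reasoning deltas) are skipped
        print(delta, end="", flush=True)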

For tool calling:

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
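
To complete the round trip, append the tool call and its result as a role "tool" message and call the API again (the weather payload here is a stand-in for a real lookup):

call = response.choices[0].message.tool_calls[0]
followup = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"city": "Paris", "temp_c": 18, "conditions": "clear"}',
        },
    ],
)
print(followup.choices[0].message.content)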

Hardware requirements

| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| 2× RTX 3090 (recommended) | 24 GB each | 262K | 72/80 TPS |
| 1× RTX 3090 | 24 GB | ~65K | ~50 TPS (estimated) |
| 1× RTX 4090 | 24 GB | ~65K | ~60 TPS (estimated) |
| 2× RTX 4090 | 24 GB each | 262K | ~85/100 TPS (estimated) |

No NVLink is required; PCIe-only works fine, but custom all-reduce must be disabled on PCIe (--disable-custom-all-reduce, as in the config above).

Known caveats

  • The Marlin pad-sub-tile-n patch (vLLM PR #40361) is required for TP=2; it is vendored at github.com/noonghunna/club-3090.
  • Long-context recall degrades at ≥60K tokens — this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
  • Thinking mode: Carnice is concise; its reasoning field is shorter than base Qwen's verbose style. verify-full.sh's thinking test expects ≥50 chars; Carnice typically outputs ~5-10 chars. This is cosmetic — tool calls and generation quality are unaffected.

Build process

This model was built using the delta-merge approach documented at github.com/noonghunna/club-3090. The scripts are in the carnice-autoround/ directory:

  1. recipe_d_delta_merge.py — Applies Lorbus's INT4 quant grid to Carnice's BF16 weights
  2. recipe_d_bf16mtp_overlay.py — Replaces INT4-packed MTP projections with BF16 weights from base Qwen (a sketch of this step follows below)
  3. Chat template patch — Switches tool-call format from Qwen3 XML to Hermes JSON
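
A minimal sketch of the overlay step (script 2, flagged above), assuming single-file safetensors checkpoints and a hypothetical ".mtp." name prefix for the 29 MTP head tensors; the real scripts in carnice-autoround/ handle sharded checkpoints and the exact tensor names:

from safetensors.torch import load_file, save_file

# Hypothetical paths; the delta-merged INT4 checkpoint comes from script 1.
quant = load_file("carnice-int4/model.safetensors")
base = load_file("qwen3.6-27b/model.safetensors")

def is_mtp(name: str) -> bool:
    # Assumed selector; actual MTP tensor names depend on the checkpoint.
    return ".mtp." in name

# Drop the INT4-packed MTP tensors (including qweight/scales/zeros sidecars)...
merged = {k: v for k, v in quant.items() if not is_mtp(k)}
# ...then overlay the unquantized BF16 tensors from the base model.
merged.update({k: v for k, v in base.items() if is_mtp(k)})

save_file(merged, "carnice-int4-bf16mtp/model.safetensors")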
