---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
tags:
  - qwen
  - qwen3-next
  - hermes
  - agentic
  - tool-use
  - MTP
  - spec-decode
  - AutoRound
  - INT4
base_model:
  - kai-os/Carnice-V2-27b
  - Qwen/Qwen3.6-27B
  - noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.

This model takes kai-os/Carnice-V2-27b (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

  1. **INT4 quantization via AutoRound** (W4A16, `group_size=128`, symmetric). The quant grid comes from Lorbus's Qwen3.6-27B-int4-AutoRound, delta-merged onto Carnice's BF16 weights. This avoids re-running the full AutoRound calibration loop.
  2. **BF16 MTP overlay.** All 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers MTP acceptance length (AL) from ~2.0 to ~3.0.
  3. **Patched chat template.** The tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.
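The grid-reuse idea in step 1 can be sketched as follows: instead of re-calibrating, each weight group from Carnice's BF16 checkpoint is snapped onto the scale already computed for the donor INT4 model. This is a minimal illustration with made-up numbers and a single group, not the actual `recipe_d_delta_merge.py` logic (real checkpoints store per-group scales per tensor):

```python
# Sketch: reuse a donor model's symmetric W4A16 quant grid.
# Scale values and weights below are illustrative, not from the checkpoint.

def quantize_group(weights, scale, num_bits=4):
    """Snap BF16 weights onto a pre-computed symmetric int4 grid."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -8 .. 7
    return [max(qmin, min(qmax, round(w / scale))) for w in weights]

def dequantize_group(q, scale):
    """Map int4 codes back to real values on the same grid."""
    return [v * scale for v in q]

scale = 0.05                                # taken from the donor INT4 model
carnice_weights = [0.12, -0.31, 0.02, 0.40]  # BF16 weights from Carnice
q = quantize_group(carnice_weights, scale)
print(q)  # [2, -6, 0, 7] -- 0.40/0.05 = 8 clips to the int4 max of 7
```

The delta-merge trades a little rounding error (values past the donor grid's range clip) for skipping hours of AutoRound calibration.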

## Performance

Benchmarked on 2× RTX 3090 (PCIe, no NVLink) with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | 71.75 (CV 11.6%) |
| Code TPS (n=5) | 80.35 (CV 10.6%) |
| MTP acceptance length | 3.02-3.14 |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | ~141 ms |
| Max context | 262K tokens (fp8 KV) |
| Concurrent streams | 2 |
| VRAM per card | 22.25 GiB |
| Model load size | 9.19 GiB |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is +4% on narrative and -9% on code: practically equivalent for everyday agentic use.
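The reported acceptance length is consistent with the per-position accept rates in the table. Assuming the ~83% / 69% / 56% figures are unconditional per-position acceptance rates over the 3 speculative tokens, the expected accepted length per step is one verified target token plus the sum of the rates:

```python
# Back-of-envelope check: acceptance length (AL) from per-position accept rates.
# Assumes the ~83/69/56% figures are unconditional per-position rates.
accept = [0.83, 0.69, 0.56]  # positions 1-3 (num_speculative_tokens=3)
al = 1 + sum(accept)         # 1 guaranteed token + expected accepted drafts
print(round(al, 2))          # 3.08, inside the measured 3.02-3.14 range
```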

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Exec-form command: each list item is one argv element, so flag and
    # value are joined with "=" rather than a space.
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```

Note for a single RTX 3090: reduce `--max-model-len` to ~65K, and set `--tensor-parallel-size 1` and `--max-num-seqs 1`.
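The single-GPU `command` block would then look like this (only the changed lines shown; the rest of the compose file is unchanged, and 65536 is one concrete reading of "~65K"):

```yaml
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=1   # single GPU: no tensor parallelism
      - --max-model-len=65536      # ~65K context fits in 24 GB
      - --max-num-seqs=1           # one concurrent stream
      # remaining flags as in the two-GPU config; --disable-custom-all-reduce
      # is a no-op at TP=1 and can be dropped
```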

### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```

For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
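With the patched chat template, the raw completion behind that `tool_calls` list is Hermes-style JSON wrapped in `<tool_call>` tags, which `--tool-call-parser hermes` extracts into structured tool calls (illustrative shape, not captured model output):

```
<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>
```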

## Hardware requirements

| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| 2× RTX 3090 (recommended) | 24 GB each | 262K | 72/80 TPS |
| 1× RTX 3090 | 24 GB | ~65K | ~50 TPS (estimated) |
| 1× RTX 4090 | 24 GB | ~65K | ~60 TPS (estimated) |
| 2× RTX 4090 | 24 GB each | 262K | ~85/100 TPS (estimated) |

No NVLink required; PCIe-only works fine, but custom all-reduce must be disabled (`--disable-custom-all-reduce`) on PCIe.

## Known caveats

  • Marlin pad-sub-tile-n patch (vLLM PR #40361) required for TP=2. Vendored at github.com/noonghunna/club-3090.
  • Long-context recall degrades at β‰₯60K tokens β€” this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
  • Thinking mode: Carnice is concise; its reasoning field is shorter than base Qwen's verbose style. verify-full.sh's thinking test expects β‰₯50 chars; Carnice typically outputs ~5-10 chars. This is cosmetic β€” tool calls and generation quality are unaffected.

## Build process

This model was built using the delta-merge approach documented at github.com/noonghunna/club-3090. The scripts are in the `carnice-autoround/` directory:

  1. `recipe_d_delta_merge.py` - applies Lorbus's INT4 quant grid to Carnice's BF16 weights
  2. `recipe_d_bf16mtp_overlay.py` - replaces INT4-packed MTP projections with BF16 weights from base Qwen
  3. Chat template patch - switches the tool-call format from Qwen3 XML to Hermes JSON
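Step 2 amounts to key-level surgery on the merged state dict: any tensor belonging to the MTP head is taken from the BF16 base checkpoint instead of the INT4-packed one. A minimal sketch with plain dicts standing in for tensors (the real `recipe_d_bf16mtp_overlay.py` works on safetensors shards, and the `mtp.` key prefix here is illustrative):

```python
# Sketch of the BF16 MTP overlay: MTP tensors come from the BF16 base model,
# everything else keeps its INT4-packed form. Key names are illustrative.

def overlay_mtp(int4_state, bf16_state, is_mtp=lambda n: n.startswith("mtp.")):
    """Return a state dict where MTP keys are replaced by BF16 tensors."""
    merged = dict(int4_state)
    for name, tensor in bf16_state.items():
        if is_mtp(name):
            merged[name] = tensor  # keep MTP projections unquantized
    return merged

int4_sd = {"layers.0.qweight": "int4", "mtp.proj.weight": "int4-packed"}
bf16_sd = {"layers.0.weight": "bf16", "mtp.proj.weight": "bf16"}
merged = overlay_mtp(int4_sd, bf16_sd)
print(merged["mtp.proj.weight"])   # bf16
print(merged["layers.0.qweight"])  # int4
```

Keeping the MTP head in full precision is what lifts draft acceptance back to ~3 tokens per step, since the speculative head's predictions are no longer degraded by quantization error.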

## References