---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
tags:
  - qwen
  - qwen3-next
  - hermes
  - agentic
  - tool-use
  - MTP
  - spec-decode
  - AutoRound
  - INT4
base_model:
  - kai-os/Carnice-V2-27b
  - Qwen/Qwen3.6-27B
  - noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.

This model takes kai-os/Carnice-V2-27b (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

  1. **INT4 quantization via AutoRound** (W4A16, `group_size=128`, symmetric). The quant grid comes from Lorbus's Qwen3.6-27B-int4-AutoRound, delta-merged onto Carnice's BF16 weights. This avoids re-running the full AutoRound calibration loop.
  2. **BF16 MTP overlay.** All 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers MTP acceptance length (AL) from ~2.0 to ~3.0.
  3. **Patched chat template.** The tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.
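The grid-reuse idea in step 1 can be sketched as follows: instead of re-calibrating, each weight group from Carnice's BF16 checkpoint is snapped onto the scale already computed for the donor INT4 model. This is a minimal illustration with made-up numbers and a single group, not the actual `recipe_d_delta_merge.py` logic (real checkpoints store per-group scales per tensor):

```python
# Sketch: reuse a donor model's symmetric W4A16 quant grid.
# Scale values and weights below are illustrative, not from the checkpoint.

def quantize_group(weights, scale, num_bits=4):
    """Snap BF16 weights onto a pre-computed symmetric int4 grid."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -8 .. 7
    return [max(qmin, min(qmax, round(w / scale))) for w in weights]

def dequantize_group(q, scale):
    """Map int4 codes back to real values on the same grid."""
    return [v * scale for v in q]

scale = 0.05                                # taken from the donor INT4 model
carnice_weights = [0.12, -0.31, 0.02, 0.40]  # BF16 weights from Carnice
q = quantize_group(carnice_weights, scale)
print(q)  # [2, -6, 0, 7] -- 0.40/0.05 = 8 clips to the int4 max of 7
```

The delta-merge trades a little rounding error (values past the donor grid's range clip) for skipping hours of AutoRound calibration.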

## Performance

Benchmarked on 2× RTX 3090 (PCIe, no NVLink) with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | 71.75 (CV 11.6%) |
| Code TPS (n=5) | 80.35 (CV 10.6%) |
| MTP acceptance length | 3.02-3.14 |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | ~141 ms |
| Max context | 262K tokens (fp8 KV) |
| Concurrent streams | 2 |
| VRAM per card | 22.25 GiB |
| Model load size | 9.19 GiB |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is +4% on narrative and -9% on code: practically equivalent for everyday agentic use.
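The reported acceptance length is consistent with the per-position accept rates in the table. Assuming the ~83% / 69% / 56% figures are unconditional per-position acceptance rates over the 3 speculative tokens, the expected accepted length per step is one verified target token plus the sum of the rates:

```python
# Back-of-envelope check: acceptance length (AL) from per-position accept rates.
# Assumes the ~83/69/56% figures are unconditional per-position rates.
accept = [0.83, 0.69, 0.56]  # positions 1-3 (num_speculative_tokens=3)
al = 1 + sum(accept)         # 1 guaranteed token + expected accepted drafts
print(round(al, 2))          # 3.08, inside the measured 3.02-3.14 range
```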

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Exec-form command: each list item is one argv element, so flag and
    # value are joined with "=" rather than a space.
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```

Note for a single RTX 3090: reduce `--max-model-len` to ~65K, and set `--tensor-parallel-size 1` and `--max-num-seqs 1`.
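The single-GPU `command` block would then look like this (only the changed lines shown; the rest of the compose file is unchanged, and 65536 is one concrete reading of "~65K"):

```yaml
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=1   # single GPU: no tensor parallelism
      - --max-model-len=65536      # ~65K context fits in 24 GB
      - --max-num-seqs=1           # one concurrent stream
      # remaining flags as in the two-GPU config; --disable-custom-all-reduce
      # is a no-op at TP=1 and can be dropped
```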

### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```

For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
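With the patched chat template, the raw completion behind that `tool_calls` list is Hermes-style JSON wrapped in `<tool_call>` tags, which `--tool-call-parser hermes` extracts into structured tool calls (illustrative shape, not captured model output):

```
<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>
```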

## Hardware requirements

| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| 2× RTX 3090 (recommended) | 24 GB each | 262K | 72/80 TPS |
| 1× RTX 3090 | 24 GB | ~65K | ~50 TPS (estimated) |
| 1× RTX 4090 | 24 GB | ~65K | ~60 TPS (estimated) |
| 2× RTX 4090 | 24 GB each | 262K | ~85/100 TPS (estimated) |

No NVLink required; PCIe-only works fine, but custom all-reduce must be disabled (`--disable-custom-all-reduce`) on PCIe.

## Known caveats

  • Marlin pad-sub-tile-n patch (vLLM PR #40361) required for TP=2. Vendored at github.com/noonghunna/club-3090.
  • Long-context recall degrades at β‰₯60K tokens β€” this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
  • Thinking mode: Carnice is concise; its reasoning field is shorter than base Qwen's verbose style. verify-full.sh's thinking test expects β‰₯50 chars; Carnice typically outputs ~5-10 chars. This is cosmetic β€” tool calls and generation quality are unaffected.

## Build process

This model was built using the delta-merge approach documented at github.com/noonghunna/club-3090. The scripts are in the `carnice-autoround/` directory:

  1. `recipe_d_delta_merge.py` - applies Lorbus's INT4 quant grid to Carnice's BF16 weights
  2. `recipe_d_bf16mtp_overlay.py` - replaces INT4-packed MTP projections with BF16 weights from base Qwen
  3. Chat template patch - switches the tool-call format from Qwen3 XML to Hermes JSON
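Step 2 amounts to key-level surgery on the merged state dict: any tensor belonging to the MTP head is taken from the BF16 base checkpoint instead of the INT4-packed one. A minimal sketch with plain dicts standing in for tensors (the real `recipe_d_bf16mtp_overlay.py` works on safetensors shards, and the `mtp.` key prefix here is illustrative):

```python
# Sketch of the BF16 MTP overlay: MTP tensors come from the BF16 base model,
# everything else keeps its INT4-packed form. Key names are illustrative.

def overlay_mtp(int4_state, bf16_state, is_mtp=lambda n: n.startswith("mtp.")):
    """Return a state dict where MTP keys are replaced by BF16 tensors."""
    merged = dict(int4_state)
    for name, tensor in bf16_state.items():
        if is_mtp(name):
            merged[name] = tensor  # keep MTP projections unquantized
    return merged

int4_sd = {"layers.0.qweight": "int4", "mtp.proj.weight": "int4-packed"}
bf16_sd = {"layers.0.weight": "bf16", "mtp.proj.weight": "bf16"}
merged = overlay_mtp(int4_sd, bf16_sd)
print(merged["mtp.proj.weight"])   # bf16
print(merged["layers.0.qweight"])  # int4
```

Keeping the MTP head in full precision is what lifts draft acceptance back to ~3 tokens per step, since the speculative head's predictions are no longer degraded by quantization error.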

## References