---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
tags:
  - qwen
  - qwen3-next
  - hermes
  - agentic
  - tool-use
  - MTP
  - spec-decode
  - AutoRound
  - INT4
base_model:
  - kai-os/Carnice-V2-27b
  - Qwen/Qwen3.6-27B
  - noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**

This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric) — the quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound), delta-merged onto Carnice's BF16 weights. This avoids re-running the full AutoRound calibration loop.
2. **BF16 MTP overlay** — all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template** — the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.

## Performance

Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02–3.14** |
| Per-position acceptance (pos. 1/2/3) | ~83% / 69% / 56% |
| TTFT | **~141 ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is **+4% on narrative, -9% on code** — practically equivalent for everyday agentic use.

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      - --model /root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization auto_round --dtype float16
      - --tensor-parallel-size 2
      - --disable-custom-all-reduce
      - --max-model-len 262144
      - --gpu-memory-utilization 0.92
      - --max-num-seqs 2
      - --kv-cache-dtype fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser hermes
      - --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

**Note for single RTX 3090:** reduce `--max-model-len` to ~65K, set `--tensor-parallel-size 1`, and set `--max-num-seqs 1`, as sketched below.
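The single-GPU note maps onto a small change to the compose file. The block below is a minimal sketch, not a tested configuration: it reuses the flags from the compose file above verbatim, drops `--disable-custom-all-reduce` (only relevant with multiple GPUs), and uses `65536` as an assumed concrete value for the "~65K" context suggestion.

```yaml
# Hypothetical single-GPU `command:` block (drop-in replacement for the one above).
command:
  - --model /root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
  - --quantization auto_round --dtype float16
  - --tensor-parallel-size 1   # single GPU, so no TP and no custom all-reduce concerns
  - --max-model-len 65536      # assumed value for the "~65K" suggestion; tune to your VRAM
  - --gpu-memory-utilization 0.92
  - --max-num-seqs 1
  - --kv-cache-dtype fp8_e5m2
  - --trust-remote-code
  - --reasoning-parser qwen3
  - --enable-auto-tool-choice
  - --tool-call-parser hermes
  - --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```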
### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```

For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```

## Hardware requirements

| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72/80 TPS |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85/100 TPS (estimated) |

No NVLink is required; PCIe-only works fine, but custom all-reduce must be disabled on PCIe-only setups (`--disable-custom-all-reduce`).

## Known caveats

- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) is required for TP=2. A vendored copy lives at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens — this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise, so its `reasoning` field is much shorter than base Qwen's verbose style. `verify-full.sh`'s thinking test expects ≥50 characters, while Carnice typically outputs ~5–10. This is cosmetic — tool calls and generation quality are unaffected.

## Build process

This model was built with the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts live in the `carnice-autoround/` directory:

1. `recipe_d_delta_merge.py` — applies Lorbus's INT4 quant grid to Carnice's BF16 weights.
2. `recipe_d_bf16mtp_overlay.py` — replaces the INT4-packed MTP projections with BF16 weights from base Qwen (see the conceptual sketch at the end of this card).
3. Chat template patch — switches the tool-call format from Qwen3 XML to Hermes JSON.

## References

- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
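## Appendix: what the BF16 MTP overlay does

For orientation, below is a minimal sketch of the idea behind step 2 of the build process. It is **not** the actual `recipe_d_bf16mtp_overlay.py`: the `model.mtp.` prefix and the packed-parameter suffixes are assumptions for illustration, it works on single safetensors files rather than a sharded checkpoint, and it ignores the quantization config and the safetensors index that the real script also has to update.

```python
# Conceptual sketch only -- tensor names and file layout are assumptions, not the real script.
from safetensors.torch import load_file, save_file

MTP_PREFIX = "model.mtp."  # hypothetical prefix for the MTP head tensors
PACKED_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx")  # AutoRound/GPTQ-style packed params


def overlay_bf16_mtp(int4_shard: str, bf16_shard: str, out_shard: str) -> None:
    quant = load_file(int4_shard)  # delta-merged Carnice shard, INT4-packed
    base = load_file(bf16_shard)   # base Qwen shard holding the BF16 MTP head

    # Drop the INT4-packed MTP projections so they don't linger next to the BF16 copies.
    for key in [k for k in quant if k.startswith(MTP_PREFIX) and k.endswith(PACKED_SUFFIXES)]:
        del quant[key]

    # Copy every MTP tensor from the base checkpoint over, unquantized, in BF16.
    for key, tensor in base.items():
        if key.startswith(MTP_PREFIX):
            quant[key] = tensor

    save_file(quant, out_shard)


overlay_bf16_mtp(
    "carnice-int4.safetensors",          # hypothetical input/output file names
    "qwen-base-bf16.safetensors",
    "carnice-int4-bf16mtp.safetensors",
)
```

Keeping the MTP head in BF16 means the speculative drafts come from full-precision projections, which is what recovers the acceptance length from ~2.0 to ~3.0 noted above.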