---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**

This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric) — the quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound), delta-merged onto Carnice's BF16 weights. This avoids re-running the full AutoRound calibration loop.
2. **BF16 MTP overlay** — all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template** — the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`; see the example below.
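
For reference, a Hermes-format tool call emitted with the patched template looks like the following (illustrative; the `get_weather` function is the one from the API example further down):

```
<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>
```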

## Performance

Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | **~141 ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is **+4% on narrative, -9% on code** — practically equivalent for everyday agentic use.
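
For a rough sanity check of throughput on your own hardware, a minimal client-side timing loop looks like this (a sketch, not the benchmark harness used for the numbers above; wall time includes TTFT, so this slightly underestimates steady-state TPS):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

t0 = time.perf_counter()
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Tell a 500-word story about a lighthouse keeper."}],
    max_tokens=600,
    temperature=0.6,
)
elapsed = time.perf_counter() - t0

# Generated tokens divided by wall time approximates decode TPS.
tps = response.usage.completion_tokens / elapsed
print(f"{response.usage.completion_tokens} tokens in {elapsed:.1f}s -> {tps:.1f} TPS")
```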

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```

Each `command` list item is passed to vLLM as a single argv token, hence the `--flag=value` form above.

**Note for single RTX 3090:** reduce `--max-model-len` to ~65K, and set `--tensor-parallel-size 1` and `--max-num-seqs 1`.
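
Once the container is up, you can confirm the model is being served before pointing an agent at it (a minimal check using the same OpenAI client as below):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

# Lists the models the vLLM server exposes; the served model id should appear here.
for model in client.models.list():
    print(model.id)
```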

### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```
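
Streaming works the same way and is where the MTP speculative decoding is most noticeable interactively (standard OpenAI-client streaming, reusing the `client` from above; nothing model-specific):

```python
stream = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Explain speculative decoding in two paragraphs."}],
    max_tokens=400,
    temperature=0.6,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content can be None on role/finish chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```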

For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
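
To complete the round trip, execute the function yourself and feed the result back as a `tool` message. A sketch, continuing from the snippet above (the `get_weather` body is a hypothetical stub; replace it with a real lookup):

```python
import json

def get_weather(city: str) -> dict:
    # Stub implementation for illustration only.
    return {"city": city, "temp_c": 18, "conditions": "partly cloudy"}

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = get_weather(**args)

followup = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
    ],
)
print(followup.choices[0].message.content)
```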

## Hardware requirements

| Setup | Min VRAM | Context | Throughput (narrative / code) |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72 / 80 TPS |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85 / 100 TPS (estimated) |

No NVLink required; PCIe-only works fine. On PCIe, custom all-reduce must be disabled (`--disable-custom-all-reduce`, as in the compose file above).

## Known caveats

- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) is required for TP=2. Vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens — this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise, so its `reasoning` field is much shorter than base Qwen's verbose style. verify-full.sh's thinking test expects ≥50 chars, while Carnice typically outputs ~5-10. This is cosmetic — tool calls and generation quality are unaffected.

## Build process

This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:

1. `recipe_d_delta_merge.py` — applies Lorbus's INT4 quant grid to Carnice's BF16 weights
2. `recipe_d_bf16mtp_overlay.py` — replaces the INT4-packed MTP projections with BF16 weights from base Qwen (sketched below)
3. Chat template patch — switches the tool-call format from Qwen3 XML to Hermes JSON
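
The overlay step is conceptually simple. A minimal sketch of the idea — not the actual script; the paths, single-shard layout, and the `is_mtp_tensor` name filter are illustrative assumptions:

```python
from safetensors.torch import load_file, save_file

def is_mtp_tensor(name: str) -> bool:
    # Illustrative name filter; the real script targets the 29 MTP head tensors.
    return ".mtp." in name

# Hypothetical single-shard paths; the real checkpoints are multi-shard.
quantized = load_file("carnice-int4/model.safetensors")
base_bf16 = load_file("qwen3.6-27b/model.safetensors")

# Drop the INT4-packed MTP entries (qweight/scales/zeros)...
quantized = {k: v for k, v in quantized.items() if not is_mtp_tensor(k)}
# ...and overlay the unquantized BF16 tensors from base Qwen.
for name, tensor in base_bf16.items():
    if is_mtp_tensor(name):
        quantized[name] = tensor

save_file(quantized, "carnice-int4-bf16mtp/model.safetensors")
```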

## References

- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)