---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**

This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:
|
|
1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric). The quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) and is delta-merged onto Carnice's BF16 weights, which avoids re-running the full AutoRound calibration loop.
2. **BF16 MTP overlay**: all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template**: the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`; an example is shown below.
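
For reference, a Hermes-style tool call emitted under the patched template looks like this (an illustrative example; the argument values are made up):

```
<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>
```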

## Performance

Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | **~141 ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is **+4% on narrative, -9% on code**: practically equivalent for everyday agentic use.
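
As a back-of-envelope check, the acceptance length is consistent with the per-position rates in the table, treating those figures as marginal per-position acceptance rates (an assumption):

```python
# Each verification step emits 1 verified token plus the accepted draft
# tokens, so the expected acceptance length is roughly
# 1 + sum(marginal per-position acceptance rates).
rates = [0.83, 0.69, 0.56]  # per-position accept, from the table above
print(1 + sum(rates))       # 3.08 -- inside the measured 3.02-3.14 band
```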

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Each flag is its own list item so Compose passes it as a separate argv entry.
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --served-model-name=carnice-bf16mtp  # model name used in the API examples below
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```

**Note for a single RTX 3090:** reduce `--max-model-len` to ~65K and set `--tensor-parallel-size=1` and `--max-num-seqs=1`.
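
Once the server is up, a quick sanity check with the OpenAI client (a minimal sketch, assuming the `8070:8000` port mapping from the compose file above):

```python
from openai import OpenAI

# vLLM exposes the OpenAI-compatible /v1/models endpoint.
client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")
print([m.id for m in client.models.list()])  # expect ["carnice-bf16mtp"]
```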
|
|
### API
|
|
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```
|
|
For tool calling:
|
|
```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
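
A typical follow-up turn parses the call, runs the tool, and feeds the result back. A minimal sketch continuing from the snippet above, assuming a locally implemented weather lookup (the stub result here is made up):

```python
import json

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# Stand-in for a real weather lookup.
weather = {"city": args["city"], "temp_c": 18, "conditions": "cloudy"}

followup = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": tool_call.id,
         "content": json.dumps(weather)},
    ],
)
print(followup.choices[0].message.content)
```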
|
|
## Hardware requirements
|
|
| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72/80 TPS |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85/100 TPS (estimated) |
|
|
No NVLink required; PCIe-only works fine. On PCIe, custom all-reduce must be disabled (`--disable-custom-all-reduce`, as in the compose file above).
|
|
## Known caveats
|
|
- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) is required for TP=2. Vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens. This is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise; its `reasoning` field is shorter than base Qwen's verbose style. verify-full.sh's thinking test expects ≥50 chars, while Carnice typically outputs ~5-10 chars. This is cosmetic: tool calls and generation quality are unaffected.
|
|
## Build process
|
|
This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:
|
|
1. `recipe_d_delta_merge.py`: applies Lorbus's INT4 quant grid to Carnice's BF16 weights (the core idea is sketched below)
2. `recipe_d_bf16mtp_overlay.py`: replaces the INT4-packed MTP projections with BF16 weights from base Qwen
3. Chat template patch: switches the tool-call format from Qwen3 XML to Hermes JSON
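
The core of the delta-merge step is re-quantizing Carnice's BF16 weights onto the pre-computed AutoRound grid rather than re-calibrating. A minimal sketch of that idea (not the actual script; real AutoRound checkpoints pack INT4 into int32 `qweight` tensors, and the function name and shapes here are assumptions):

```python
import torch

def quantize_on_grid(w_bf16: torch.Tensor, scales: torch.Tensor,
                     group_size: int = 128, bits: int = 4) -> torch.Tensor:
    """Snap BF16 weights onto a symmetric INT4 grid given per-group scales.

    w_bf16: (out_features, in_features) weight from Carnice.
    scales: (out_features, in_features // group_size) grid from the Lorbus quant.
    """
    out_f, in_f = w_bf16.shape
    w = w_bf16.float().reshape(out_f, in_f // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1  # 7 for symmetric INT4 (range -8..7)
    q = torch.clamp(torch.round(w / scales.unsqueeze(-1)), -qmax - 1, qmax)
    return q.reshape(out_f, in_f).to(torch.int8)
```

The BF16 MTP overlay is the complementary filter: tensors belonging to the MTP head skip this step and are copied from base Qwen in BF16 instead.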
|
|
## References
|
|
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
|
|