---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**

This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric): the quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) and is delta-merged onto Carnice's BF16 weights, which avoids re-running the full AutoRound calibration loop.
2. **BF16 MTP overlay**: all 29 MTP head tensors are kept in BF16 (unquantized) so that speculative decoding accepts more draft tokens. This raises the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template**: the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), which is compatible with vLLM's `--tool-call-parser hermes`.
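
To make the W4A16 notation in step 1 concrete, here is a minimal sketch of symmetric group-wise INT4 quantization: weights are stored as 4-bit codes with one scale per group of 128 values, while activations stay in 16-bit. It uses plain round-to-nearest rather than AutoRound's learned rounding, and the function names are illustrative, not taken from the build scripts.

```python
import random

GROUP_SIZE = 128            # group_size stated in step 1
INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit range


def quantize_group(values):
    """Quantize one group with a single symmetric scale (no zero point)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return codes, scale


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into groups and quantize each independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


def dequantize_row(groups):
    """Reconstruct approximate weights: code * scale for every element."""
    return [code * scale for codes, scale in groups for code in codes]


# Toy data standing in for one weight row (an assumption, not real weights).
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(2 * GROUP_SIZE)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.6f}")
```

With round-to-nearest, the reconstruction error of each weight is bounded by half of its group's scale, which is why smaller groups (more scales) generally give a tighter fit at the cost of extra storage.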

## Performance

Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position acceptance (draft positions 1 / 2 / 3) | ~83% / 69% / 56% |
| TTFT | **~141ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is **about 4% faster on narrative and 9-10% slower on code**, which is practically equivalent for everyday agentic use.
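
The acceptance figures in the table above can be cross-checked against each other. The sketch below assumes that each per-position rate is the fraction of decode steps in which that draft position was accepted, and that every step also emits one token from the target model itself. Under that reading, which is an assumption about how the numbers were measured, the expected tokens per step come out at about 3.08, inside the reported 3.02-3.14 range.

```python
# Per-position acceptance rates from the table above (draft positions 1..3).
per_position_accept = [0.83, 0.69, 0.56]

# Assumption: one token per step always comes from the target model, and each
# draft position contributes its acceptance rate in expectation.
expected_tokens_per_step = 1.0 + sum(per_position_accept)

print(f"expected tokens per step: {expected_tokens_per_step:.2f}")
```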

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
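    command:
      # Each flag and its value are separate list items, so every item
      # reaches vLLM as its own argument.
      - --model
      - /root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      # Served name matches the `model` value used in the API examples.
      - --served-model-name
      - carnice-bf16mtp
      - --quantization
      - auto_round
      - --dtype
      - float16
      - --tensor-parallel-size
      - "2"
      - --disable-custom-all-reduce
      - --max-model-len
      - "262144"
      - --gpu-memory-utilization
      - "0.92"
      - --max-num-seqs
      - "2"
      - --kv-cache-dtype
      - fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser
      - qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser
      - hermes
      - --speculative-config
      - '{"method":"mtp","num_speculative_tokens":3}'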
```

**Note for single RTX 3090:** reduce `--max-model-len` to ~65K, set `--tensor-parallel-size 1`, `--max-num-seqs 1`.
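
The single-GPU note can be sketched as a small override table. This is only an illustration of which flags change: the helper names are hypothetical, and `65536` is an assumed concrete value for "~65K".

```python
# Flags from the two-GPU compose file that the single-GPU note changes.
TWO_GPU_FLAGS = {
    "--tensor-parallel-size": "2",
    "--max-model-len": "262144",
    "--max-num-seqs": "2",
}

# Single-GPU values. 65536 is an assumed concrete value for "~65K".
SINGLE_GPU_OVERRIDES = {
    "--tensor-parallel-size": "1",
    "--max-model-len": "65536",
    "--max-num-seqs": "1",
}


def flags_for(num_gpus):
    """Return the three context/parallelism flags for 1 or 2 GPUs."""
    if num_gpus == 2:
        return dict(TWO_GPU_FLAGS)
    if num_gpus == 1:
        return {**TWO_GPU_FLAGS, **SINGLE_GPU_OVERRIDES}
    raise ValueError("only 1 or 2 GPUs are covered by this model card")


def to_argv(flags):
    """Flatten {flag: value} into a list with one argument per element."""
    return [part for flag, value in flags.items() for part in (flag, value)]


print(to_argv(flags_for(1)))
```

All other flags in the compose file stay as they are.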

### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```

For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
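
With `--tool-call-parser hermes`, vLLM does the parsing and fills `tool_calls` for you. For debugging raw completions, the Hermes format described above (a JSON object inside `<tool_call>` tags) can be extracted with a few lines. This is a minimal sketch: the sample string and its whitespace are assumptions about what the patched template emits, and `extract_tool_calls` is an illustrative helper, not part of vLLM.

```python
import json
import re

# A JSON object wrapped in <tool_call> tags; DOTALL lets it span lines.
TOOL_CALL_PATTERN = re.compile(
    r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL
)


def extract_tool_calls(text):
    """Return (name, arguments) pairs for every tool call found in text."""
    calls = []
    for match in TOOL_CALL_PATTERN.finditer(text):
        payload = json.loads(match.group(1))
        calls.append((payload["name"], payload.get("arguments", {})))
    return calls


# Assumed example of raw model output for the weather request above.
sample = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))  # [('get_weather', {'city': 'Paris'})]
```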

## Hardware requirements

| Setup | Min VRAM | Context | Throughput (TPS) |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72 narrative / 80 code (measured) |
| **1× RTX 3090** | 24 GB | ~65K | ~50 (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85 narrative / ~100 code (estimated) |

NVLink is not required; PCIe-only works fine. On PCIe-only setups, custom all-reduce must be disabled with `--disable-custom-all-reduce`.

## Known caveats

- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) is required for TP=2. It is vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens — this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice reasons concisely, so its `reasoning` field is much shorter than base Qwen's verbose style. The thinking test in `verify-full.sh` expects ≥50 characters, while Carnice typically outputs ~5-10, so that test can fail. This is cosmetic: tool calls and generation quality are unaffected.
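
The thinking-mode caveat above amounts to a length threshold on the reasoning text. The sketch below is not the actual `verify-full.sh` logic; it only illustrates a check of that shape. The `reasoning_content` fallback is an assumption for vLLM versions that use that field name, and both helper names are hypothetical.

```python
def reasoning_text(message):
    """Return the reasoning text of a chat message dict, or '' if absent."""
    # Assumption: the field is `reasoning`, or `reasoning_content` in
    # vLLM versions that use the older name.
    return message.get("reasoning") or message.get("reasoning_content") or ""


def passes_thinking_check(message, min_chars=50):
    """True if the stripped reasoning text has at least min_chars characters."""
    return len(reasoning_text(message).strip()) >= min_chars


# Made-up messages for illustration only.
concise = {"role": "assistant", "reasoning": "Use API.", "content": "Done."}
verbose = {"role": "assistant", "reasoning_content": "x" * 120, "content": "Done."}

print(passes_thinking_check(concise))  # False
print(passes_thinking_check(verbose))  # True
```

If you run a similar check against this model, lowering the threshold is more meaningful than treating a short reasoning field as a failure.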

## Build process

This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:

1. `recipe_d_delta_merge.py` — Applies Lorbus's INT4 quant grid to Carnice's BF16 weights
2. `recipe_d_bf16mtp_overlay.py` — Replaces INT4-packed MTP projections with BF16 weights from base Qwen
3. Chat template patch — Switches tool-call format from Qwen3 XML to Hermes JSON
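
As a rough illustration of step 2, the overlay can be thought of as a dictionary merge over tensor names: drop every quantized entry under the MTP prefix, then copy the BF16 entries from the base checkpoint. This is a sketch under stated assumptions, not the contents of `recipe_d_bf16mtp_overlay.py`; the `mtp.` prefix, the tensor names, and the string placeholders standing in for tensors are all hypothetical.

```python
def overlay_bf16_mtp(quantized, base_bf16, mtp_prefix="mtp."):
    """Replace all MTP entries in `quantized` with BF16 entries from base."""
    # Keep every non-MTP tensor from the quantized checkpoint as-is.
    merged = {
        name: tensor
        for name, tensor in quantized.items()
        if not name.startswith(mtp_prefix)
    }
    # Copy the unquantized MTP tensors over from the BF16 checkpoint.
    for name, tensor in base_bf16.items():
        if name.startswith(mtp_prefix):
            merged[name] = tensor
    return merged


# Placeholder checkpoints: strings stand in for real tensors.
quantized = {
    "model.layers.0.mlp.up_proj.qweight": "int4-packed",
    "model.layers.0.mlp.up_proj.scales": "fp16-scales",
    "mtp.fc.qweight": "int4-packed",
    "mtp.fc.scales": "fp16-scales",
}
base_bf16 = {
    "model.layers.0.mlp.up_proj.weight": "bf16",
    "mtp.fc.weight": "bf16",
}

merged = overlay_bf16_mtp(quantized, base_bf16)
print(sorted(merged))
```

The main body stays INT4-packed, while everything under the MTP prefix ends up as plain BF16 weights.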

## References

- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)