---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**

This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric): the quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound), delta-merged onto Carnice's BF16 weights. This avoids re-running the full AutoRound calibration loop.
2. **BF16 MTP overlay**: all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template**: the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.
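
To make the W4A16, group_size=128, symmetric settings in step 1 concrete, here is a minimal pure-Python sketch of symmetric group-wise quantization. It uses plain round-to-nearest with a max-abs scale, which is a simplification: AutoRound tunes the rounding during calibration, and the function names below are illustrative rather than part of any library.

```python
import math


def quantize_group(weights, bits=4):
    """Symmetric quantization of one group: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1        # 7 for INT4
    qmin = -(2 ** (bits - 1))         # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups; each group gets its own scale."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Map integer codes back to floats (symmetric: no zero point)."""
    return [code * scale for codes, scale in groups for code in codes]


# Toy row of 256 weights, i.e. two groups of 128.
row = [0.05 * math.sin(i) for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
worst = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, worst round-trip error {worst:.6f}")
```

Only the weights are stored as 4-bit codes plus one scale per group; activations stay in 16-bit, which is what the "A16" in W4A16 refers to.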

## Performance

Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | **~141ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is **+4% on narrative, -9% on code**, which is practically equivalent for everyday agentic use.
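
As a quick consistency check on the table, the per-position acceptance rates can be turned into an expected acceptance length. This sketch assumes each rate is the fraction of decode steps in which the draft token at that position was accepted, and that the acceptance length also counts the one token the target model emits on every step. The function name is illustrative.

```python
def mean_acceptance_length(per_position_accept):
    """Expected tokens per decode step under the assumptions stated above.

    One token always comes from the target model; each draft position
    adds its acceptance probability to the expectation.
    """
    return 1.0 + sum(per_position_accept)


# Rounded per-position rates from the table (3 speculative tokens).
al = mean_acceptance_length([0.83, 0.69, 0.56])
print(f"expected acceptance length: {al:.2f}")
```

With the rounded rates from the table this comes to about 3.08, which sits inside the measured 3.02-3.14 range.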

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      # Compose passes each list item as one argv entry, so use --flag=value.
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      # Name used as `model` in the API examples below.
      - --served-model-name=carnice-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```

**Note for single RTX 3090:** reduce `--max-model-len` to ~65K, set `--tensor-parallel-size 1`, `--max-num-seqs 1`.

### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```

For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
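
The example above only prints the tool calls. In an OpenAI-compatible response, each call carries its arguments as a JSON string, so the client has to decode them, run the matching function, and send the result back as a `tool` message. A minimal sketch of that step follows; the `get_weather` body, the registry, and the helper names are hypothetical stand-ins.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical stand-in; a real tool would query a weather service.
    return f"No live data available for {city}"


# Maps tool names (as declared in `tools`) to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Decode the JSON argument string and call the matching function."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"bad arguments: {exc}"})
    if not isinstance(kwargs, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    return json.dumps({"result": TOOL_REGISTRY[name](**kwargs)})


def tool_result_message(tool_call_id: str, name: str, arguments_json: str) -> dict:
    """Build the `tool` role message to append to the conversation."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": run_tool_call(name, arguments_json),
    }
```

For each `call` in `response.choices[0].message.tool_calls`, pass `call.id`, `call.function.name`, and `call.function.arguments` to `tool_result_message`, append the assistant message and the resulting tool message to `messages`, and call `client.chat.completions.create` again to get the final answer.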

## Hardware requirements

| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72 / 80 TPS (narrative / code) |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85 / 100 TPS (estimated) |

NVLink is not required; PCIe-only works fine. On PCIe, custom all-reduce must be disabled with `--disable-custom-all-reduce`.

## Known caveats

- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) required for TP=2. Vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens. This is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise; its `reasoning` field is shorter than base Qwen's verbose style. The thinking test in `verify-full.sh` expects ≥50 chars, while Carnice typically outputs ~5-10 chars. This is cosmetic: tool calls and generation quality are unaffected.

## Build process

This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:

1. `recipe_d_delta_merge.py`: applies Lorbus's INT4 quant grid to Carnice's BF16 weights
2. `recipe_d_bf16mtp_overlay.py`: replaces INT4-packed MTP projections with BF16 weights from base Qwen
3. Chat template patch: switches tool-call format from Qwen3 XML to Hermes JSON
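
This card does not show `recipe_d_delta_merge.py` itself, so the following is only a guess at the core idea of step 1: keep the donor checkpoint's per-group scales and round the fine-tuned weights onto that existing grid instead of calibrating new scales. The function name, arguments, and toy values are hypothetical, and the real script may treat the fine-tune delta differently.

```python
def requantize_on_donor_grid(new_weights, donor_scales, group_size=128, bits=4):
    """Round new weights onto a grid defined by existing per-group scales.

    Returns (codes, clipped), where `clipped` counts weights that fell
    outside the donor grid's representable range and were clamped.
    """
    qmax = 2 ** (bits - 1) - 1        # 7 for INT4
    qmin = -(2 ** (bits - 1))         # -8 for INT4
    codes, clipped = [], 0
    for i, w in enumerate(new_weights):
        scale = donor_scales[i // group_size]   # scale of this weight's group
        q = round(w / scale)
        if q < qmin or q > qmax:
            clipped += 1
        codes.append(min(qmax, max(qmin, q)))
    return codes, clipped


# Toy example: two groups of two weights, each with a donor scale.
codes, clipped = requantize_on_donor_grid(
    [0.07, -0.2, 0.5, 0.0], donor_scales=[0.01, 0.1], group_size=2
)
print(codes, clipped)
```

The trade-off in this sketch is that weights which drifted far from the base model during fine-tuning can land outside the donor grid and get clamped, so the clipped count is worth watching.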

## References

- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)