---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP

**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**

This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:

1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric) — the quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound), delta-merged onto Carnice's BF16 weights. This avoids re-running the full AutoRound calibration loop.
2. **BF16 MTP overlay** — all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template** — the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`; see the example below.
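
For reference, a Hermes-format tool call emitted with the patched template looks like the following (illustrative; the `get_weather` function is the one from the API example further down):

```
<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>
```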

## Performance

Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:

| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | **~141 ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |

For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is **+4% on narrative, -9% on code** — practically equivalent for everyday agentic use.
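
For a rough sanity check of throughput on your own hardware, a minimal client-side timing loop looks like this (a sketch, not the benchmark harness used for the numbers above; wall time includes TTFT, so this slightly underestimates steady-state TPS):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

t0 = time.perf_counter()
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Tell a 500-word story about a lighthouse keeper."}],
    max_tokens=600,
    temperature=0.6,
)
elapsed = time.perf_counter() - t0

# Generated tokens divided by wall time approximates decode TPS.
tps = response.usage.completion_tokens / elapsed
print(f"{response.usage.completion_tokens} tokens in {elapsed:.1f}s -> {tps:.1f} TPS")
```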

## Quick start

### Docker (vLLM)

```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```

Each `command` list item is passed to vLLM as a single argv token, hence the `--flag=value` form above.

**Note for single RTX 3090:** reduce `--max-model-len` to ~65K, and set `--tensor-parallel-size 1` and `--max-num-seqs 1`.
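
Once the container is up, you can confirm the model is being served before pointing an agent at it (a minimal check using the same OpenAI client as below):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

# Lists the models the vLLM server exposes; the served model id should appear here.
for model in client.models.list():
    print(model.id)
```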

### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```
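
Streaming works the same way and is where the MTP speculative decoding is most noticeable interactively (standard OpenAI-client streaming, reusing the `client` from above; nothing model-specific):

```python
stream = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Explain speculative decoding in two paragraphs."}],
    max_tokens=400,
    temperature=0.6,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content can be None on role/finish chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```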

For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
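
To complete the round trip, execute the function yourself and feed the result back as a `tool` message. A sketch, continuing from the snippet above (the `get_weather` body is a hypothetical stub; replace it with a real lookup):

```python
import json

def get_weather(city: str) -> dict:
    # Stub implementation for illustration only.
    return {"city": city, "temp_c": 18, "conditions": "partly cloudy"}

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = get_weather(**args)

followup = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
    ],
)
print(followup.choices[0].message.content)
```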

## Hardware requirements

| Setup | Min VRAM | Context | Throughput (narrative / code) |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72 / 80 TPS |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85 / 100 TPS (estimated) |

No NVLink required; PCIe-only works fine. On PCIe, custom all-reduce must be disabled (`--disable-custom-all-reduce`, as in the compose file above).

## Known caveats

- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) is required for TP=2. Vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens — this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise, so its `reasoning` field is much shorter than base Qwen's verbose style. verify-full.sh's thinking test expects ≥50 chars, while Carnice typically outputs ~5-10. This is cosmetic — tool calls and generation quality are unaffected.

## Build process

This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:

1. `recipe_d_delta_merge.py` — applies Lorbus's INT4 quant grid to Carnice's BF16 weights
2. `recipe_d_bf16mtp_overlay.py` — replaces the INT4-packed MTP projections with BF16 weights from base Qwen (sketched below)
3. Chat template patch — switches the tool-call format from Qwen3 XML to Hermes JSON
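
The overlay step is conceptually simple. A minimal sketch of the idea — not the actual script; the paths, single-shard layout, and the `is_mtp_tensor` name filter are illustrative assumptions:

```python
from safetensors.torch import load_file, save_file

def is_mtp_tensor(name: str) -> bool:
    # Illustrative name filter; the real script targets the 29 MTP head tensors.
    return ".mtp." in name

# Hypothetical single-shard paths; the real checkpoints are multi-shard.
quantized = load_file("carnice-int4/model.safetensors")
base_bf16 = load_file("qwen3.6-27b/model.safetensors")

# Drop the INT4-packed MTP entries (qweight/scales/zeros)...
quantized = {k: v for k, v in quantized.items() if not is_mtp_tensor(k)}
# ...and overlay the unquantized BF16 tensors from base Qwen.
for name, tensor in base_bf16.items():
    if is_mtp_tensor(name):
        quantized[name] = tensor

save_file(quantized, "carnice-int4-bf16mtp/model.safetensors")
```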

## References

- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)