---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---
# Carnice-V2-27B-INT4-BF16-MTP
**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**
This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:
1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric). The quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound), delta-merged onto Carnice's BF16 weights, which avoids re-running the full AutoRound calibration loop.
2. **BF16 MTP overlay**: all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template**: the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.
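
For readers unfamiliar with the W4A16 scheme named in step 1, the sketch below shows what "symmetric, group_size=128" means for a single weight row. It uses plain round-to-nearest, which is an assumption made for brevity: AutoRound learns its rounding and may use a different scale convention. The function names are hypothetical and are not taken from the build scripts.

```python
def quantize_group_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # Symmetric: one scale per group, no zero point.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=128, bits=4):
    """Split a row into groups; each group gets its own scale."""
    return [
        quantize_group_symmetric(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Recover approximate weights; activations stay 16-bit (the A16 part)."""
    return [code * scale for codes, scale in groups for code in codes]


# Synthetic row of 256 weights, purely illustrative.
row = [((i * 37) % 101 - 50) / 500.0 for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Each group stores 128 four-bit codes plus one scale, which is where most of the size reduction relative to BF16 comes from.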
## Performance
Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:
| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | **~141ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |
For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is **roughly 4% faster on narrative and 9-10% slower on code**, which is practically equivalent for everyday agentic use.
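
The acceptance figures in the table can be cross-checked against each other. The sketch below assumes the per-position rates are unconditional (the fraction of draft steps in which position *i* was accepted) and that each step always emits one verified token on top of the accepted drafts; under that assumption the acceptance length is one plus the sum of the rates.

```python
def expected_acceptance_length(per_position_accept):
    """One guaranteed token per step plus the accepted draft tokens."""
    return 1 + sum(per_position_accept)


# Per-position acceptance from the table above (approximate values).
rates = [0.83, 0.69, 0.56]
al = expected_acceptance_length(rates)
print(f"expected acceptance length: {al:.2f}")  # prints 3.08
```

The result falls inside the reported 3.02-3.14 range, so the two rows of the table are consistent with each other.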
## Quick start
### Docker (vLLM)
```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # In list form each item is passed as one argv entry,
    # so every flag and its value are separate items.
    command:
      - --model
      - /root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      # Assumed: exposes the model under the name used in the API examples below.
      - --served-model-name
      - carnice-bf16mtp
      - --quantization
      - auto_round
      - --dtype
      - float16
      - --tensor-parallel-size
      - "2"
      - --disable-custom-all-reduce
      - --max-model-len
      - "262144"
      - --gpu-memory-utilization
      - "0.92"
      - --max-num-seqs
      - "2"
      - --kv-cache-dtype
      - fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser
      - qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser
      - hermes
      - --speculative-config
      - '{"method":"mtp","num_speculative_tokens":3}'
```
**Note for single RTX 3090:** reduce `--max-model-len` to ~65K, set `--tensor-parallel-size 1`, `--max-num-seqs 1`.
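
The single-GPU note changes only three flags. As a minimal sketch, the hypothetical helper below builds the server argument list for either setup from the flags shown in the compose file. The exact value `65536` for "~65K" is an assumption, as is the helper name; the list could be appended to whatever launch command you use.

```python
import json


def build_vllm_args(model_path, num_gpus=2):
    """Return the vLLM server flags from this card for 1 or 2 GPUs."""
    if num_gpus not in (1, 2):
        raise ValueError("this card only documents 1- and 2-GPU setups")
    single = num_gpus == 1
    return [
        "--model", model_path,
        "--quantization", "auto_round",
        "--dtype", "float16",
        "--tensor-parallel-size", str(num_gpus),
        "--disable-custom-all-reduce",
        # ~65K on one card (assumed to mean 65536), 262144 on two.
        "--max-model-len", str(65536 if single else 262144),
        "--gpu-memory-utilization", "0.92",
        "--max-num-seqs", "1" if single else "2",
        "--kv-cache-dtype", "fp8_e5m2",
        "--trust-remote-code",
        "--reasoning-parser", "qwen3",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "hermes",
        "--speculative-config",
        json.dumps({"method": "mtp", "num_speculative_tokens": 3}),
    ]


model = "/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp"
print(" ".join(build_vllm_args(model, num_gpus=1)))
```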
### API
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```
For tool calling:
```python
# Reuses `client` from the previous snippet.
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
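
With `--tool-call-parser hermes`, the server does the parsing and returns structured `tool_calls`, so clients normally never see the raw text. To illustrate what the Hermes JSON format looks like on the wire, here is a minimal sketch of extracting calls from raw model output. It is not vLLM's parser; the function name and the sample string are hypothetical.

```python
import json
import re

# Hermes-style calls are JSON objects wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in raw output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than fail the whole turn
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


raw = (
    "Let me check.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(raw))
```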
## Hardware requirements
| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72/80 TPS |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85/100 TPS (estimated) |
No NVLink required; PCIe-only works fine. On PCIe, custom all-reduce must be disabled with `--disable-custom-all-reduce`.
## Known caveats
- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) required for TP=2. Vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens. This is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise; its `reasoning` field is shorter than base Qwen's verbose style. `verify-full.sh`'s thinking test expects ≥50 chars, while Carnice typically outputs ~5-10 chars. This is cosmetic: tool calls and generation quality are unaffected.
## Build process
This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:
1. `recipe_d_delta_merge.py`: applies Lorbus's INT4 quant grid to Carnice's BF16 weights
2. `recipe_d_bf16mtp_overlay.py`: replaces INT4-packed MTP projections with BF16 weights from base Qwen
3. Chat template patch: switches tool-call format from Qwen3 XML to Hermes JSON
## References
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)