---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
  parameters:
    temperature: 0.6
    top_p: 0.95
    top_k: 20
---

# Carnice-V2-27B-INT4-BF16-MTP
Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.
This model takes kai-os/Carnice-V2-27b (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:
- INT4 quantization via AutoRound (W4A16, group_size=128, symmetric). The quant grid comes from Lorbus's Qwen3.6-27B-int4-AutoRound, delta-merged onto Carnice's BF16 weights, which avoids re-running the full AutoRound calibration loop.
- BF16 MTP overlay: all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers MTP acceptance length (AL) from ~2.0 to ~3.0.
- Patched chat template: the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.
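The delta-merge step can be sketched as follows: the donor quant's per-group scales are reused to snap Carnice's BF16 weights onto the same symmetric INT4 grid. This is a minimal illustration under assumptions, not the actual `recipe_d_delta_merge.py`; the function name and shapes are hypothetical.

```python
import numpy as np

def apply_quant_grid(w_bf16: np.ndarray, scale: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Quantize a weight matrix to symmetric INT4 using a donor model's
    per-group scales (simplified sketch of the delta-merge idea)."""
    out_features, in_features = w_bf16.shape
    w = w_bf16.reshape(out_features, in_features // group_size, group_size)
    # scale has one value per (out_feature, group); broadcast over the group dim
    q = np.round(w / scale[..., None])
    q = np.clip(q, -8, 7)  # symmetric INT4 range
    return q.reshape(out_features, in_features).astype(np.int8)

# Dequantizing lands on the donor's grid: w_hat = q * scale (per group),
# so no fresh AutoRound calibration pass over Carnice is needed.
```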
## Performance

Benchmarked on 2× RTX 3090 (PCIe, no NVLink) with vLLM dev205, TP=2:
| Metric | Value |
|---|---|
| Narrative TPS (n=5) | 71.75 (CV 11.6%) |
| Code TPS (n=5) | 80.35 (CV 10.6%) |
| MTP acceptance length | 3.02-3.14 |
| Per-position accept | ~83% / 69% / 56% |
| TTFT | ~141ms |
| Max context | 262K tokens (fp8 KV) |
| Concurrent streams | 2 |
| VRAM per card | 22.25 GiB |
| Model load size | 9.19 GiB |
For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS. Carnice is +4% on narrative and -9% on code: practically equivalent for everyday agentic use.
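The acceptance-length figure lines up with the per-position rates if each rate is read as the marginal probability that the speculated token at that position is accepted (an assumption about how the benchmark harness reports them): AL is then 1 (the verified base token) plus the sum of the three rates.

```python
# Sanity check: acceptance length from per-position acceptance rates,
# assuming each rate is a marginal accept probability per speculated position.
rates = [0.83, 0.69, 0.56]   # positions 1..3 of num_speculative_tokens=3
al = 1 + sum(rates)          # +1 for the verified base token
print(round(al, 2))          # 3.08, inside the measured 3.02-3.14 band
```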
## Quick start

### Docker (vLLM)
```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      - --model
      - /root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      - --quantization
      - auto_round
      - --dtype
      - float16
      - --tensor-parallel-size
      - "2"
      - --disable-custom-all-reduce
      - --max-model-len
      - "262144"
      - --gpu-memory-utilization
      - "0.92"
      - --max-num-seqs
      - "2"
      - --kv-cache-dtype
      - fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser
      - qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser
      - hermes
      - --speculative-config
      - '{"method":"mtp","num_speculative_tokens":3}'
```
Note for single RTX 3090: reduce `--max-model-len` to ~65K and set `--tensor-parallel-size 1`, `--max-num-seqs 1`.
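For that single-card setup, the corresponding `command` entries would look like this (a sketch; the 65536 context value is an assumption for "~65K"):

```yaml
    command:
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "65536"
      - --max-num-seqs
      - "1"
```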
### API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=800,
    temperature=0.6,
)
print(response.choices[0].message.content)
```
For tool calling:

```python
response = client.chat.completions.create(
    model="carnice-bf16mtp",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
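To complete the loop, the executed function result goes back to the model as a `tool` message on the next request. A minimal sketch of building that follow-up message list; `build_followup`, the `call_0` id, and the weather payload are illustrative, not part of this model card:

```python
import json

def build_followup(messages, tool_call, result):
    """Append the assistant's tool call and the tool result so the model
    can produce its final answer on the next create() call."""
    return messages + [
        {"role": "assistant", "tool_calls": [tool_call]},
        {"role": "tool",
         "tool_call_id": tool_call["id"],
         "content": json.dumps(result)},
    ]

# Example shape of a Hermes-parsed tool call from the response above:
call = {"id": "call_0", "type": "function",
        "function": {"name": "get_weather",
                     "arguments": json.dumps({"city": "Paris"})}}
followup = build_followup(
    [{"role": "user", "content": "What's the weather in Paris?"}],
    call,
    {"city": "Paris", "temp_c": 18},
)
```

Pass `followup` (plus the same `tools` list) back to `client.chat.completions.create` to get the final natural-language answer.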
## Hardware requirements

| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| 2× RTX 3090 (recommended) | 24 GB each | 262K | 72/80 TPS |
| 1× RTX 3090 | 24 GB | ~65K | ~50 TPS (estimated) |
| 1× RTX 4090 | 24 GB | ~65K | ~60 TPS (estimated) |
| 2× RTX 4090 | 24 GB each | 262K | ~85/100 TPS (estimated) |
No NVLink required; PCIe-only works fine. Custom all-reduce must be disabled on PCIe (`--disable-custom-all-reduce`).
## Known caveats

- The Marlin pad-sub-tile-n patch (vLLM PR #40361) is required for TP=2. Vendored at github.com/noonghunna/club-3090.
- Long-context recall degrades at ≥60K tokens. This is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- Thinking mode: Carnice is concise; its `reasoning` field is shorter than base Qwen's verbose style. `verify-full.sh`'s thinking test expects ≥50 chars, while Carnice typically outputs ~5-10 chars. This is cosmetic; tool calls and generation quality are unaffected.
## Build process

This model was built using the delta-merge approach documented at github.com/noonghunna/club-3090. The scripts are in the carnice-autoround/ directory:

- `recipe_d_delta_merge.py`: applies Lorbus's INT4 quant grid to Carnice's BF16 weights
- `recipe_d_bf16mtp_overlay.py`: replaces INT4-packed MTP projections with BF16 weights from base Qwen
- Chat template patch: switches the tool-call format from Qwen3 XML to Hermes JSON
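A simplified view of what the overlay script does: quantized MTP tensors are dropped from the INT4 checkpoint and the BF16 originals are copied in from base Qwen. The tensor-name prefix here is hypothetical; the real matching logic lives in `recipe_d_bf16mtp_overlay.py`.

```python
# Sketch of the BF16 MTP overlay step, operating on state dicts
# (tensor names -> tensors). MTP_PREFIX is an assumed naming pattern.
MTP_PREFIX = "model.mtp."

def overlay_mtp(int4_state: dict, base_bf16_state: dict) -> dict:
    merged = {}
    for name, tensor in int4_state.items():
        if name.startswith(MTP_PREFIX):
            continue  # drop INT4-packed MTP params (qweight/scales/zeros)
        merged[name] = tensor
    for name, tensor in base_bf16_state.items():
        if name.startswith(MTP_PREFIX):
            merged[name] = tensor  # take MTP head weights straight from base, in BF16
    return merged
```

Keeping these 29 tensors unquantized is what lifts speculative-decoding acceptance length back from ~2.0 toward ~3.0.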
## References
- Base model: Qwen/Qwen3.6-27B
- Fine-tune: kai-os/Carnice-V2-27b
- Quant recipe: noonghunna/Qwen3.6-27B-int4-AutoRound (Lorbus)
- Project & compose: github.com/noonghunna/club-3090
- vLLM: vllm-project/vllm