---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
parameters:
temperature: 0.6
top_p: 0.95
top_k: 20
---
# Carnice-V2-27B-INT4-BF16-MTP
**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**
This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:
1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric). The quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound), delta-merged onto Carnice's BF16 weights, which avoids re-running the full AutoRound calibration loop (see the sketch after this list).
2. **BF16 MTP overlay**: all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template**: the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.
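The delta-merge in step 1 amounts to requantizing Carnice's weights onto the donor's existing INT4 grid rather than searching for a new one, so no calibration data is needed. A minimal sketch of the idea (the `apply_int4_grid` helper is hypothetical; the real `recipe_d_delta_merge.py` also handles AutoRound's tensor layout and INT32 packing):

```python
import torch

def apply_int4_grid(w_bf16: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    """Requantize a BF16 weight onto a donor's symmetric INT4 (W4A16) grid.

    w_bf16: (out_features, in_features) weight from the Carnice fine-tune.
    scale:  (out_features, in_features // group_size) per-group scales taken
            from the donor AutoRound checkpoint.
    """
    out_f, in_f = w_bf16.shape
    w = w_bf16.float().view(out_f, in_f // group_size, group_size)
    s = scale.float().unsqueeze(-1)             # broadcast one scale over each group
    q = torch.clamp(torch.round(w / s), -8, 7)  # symmetric 4-bit integer range
    # Packing eight 4-bit codes per INT32 word is omitted here.
    return q.to(torch.int8).view(out_f, in_f)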
## Performance
Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:
| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position accept rate | ~83% / 69% / 56% |
| TTFT (time to first token) | **~141 ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |
For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS, so Carnice is **+4% on narrative, -9% on code**: practically equivalent for everyday agentic use.
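The acceptance-length figure lines up with the per-position rates. Treating the three percentages as cumulative accept rates for the three drafted tokens (an interpretation, not something stated by the benchmark), plus the one token the target model always contributes per verification step:

```python
# 1 guaranteed target token per step, plus each of the 3 MTP draft
# tokens weighted by its (cumulative) accept rate.
per_position_accept = [0.83, 0.69, 0.56]
acceptance_length = 1 + sum(per_position_accept)
print(acceptance_length)  # 3.08 -- inside the measured 3.02-3.14 range
```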
## Quick start
### Docker (vLLM)
```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      # one flag per list item so Docker passes each as its own argv entry
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      # short model id matching the API examples below
      - --served-model-name=carnice-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```
**Note for a single RTX 3090:** set `--tensor-parallel-size 1` and `--max-num-seqs 1`, and reduce `--max-model-len` to ~65K.
### API
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")
response = client.chat.completions.create(
model="carnice-bf16mtp",
messages=[{"role": "user", "content": "Write a quicksort in Python."}],
max_tokens=800,
temperature=0.6,
)
print(response.choices[0].message.content)
```
For tool calling:
```python
response = client.chat.completions.create(
model="carnice-bf16mtp",
messages=[{"role": "user", "content": "What's the weather in Paris?"}],
tools=[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}],
tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
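Under the hood, the patched template makes the model emit Hermes-style JSON inside `<tool_call>` tags, and vLLM's `hermes` parser turns that into the `tool_calls` objects above. A minimal sketch of the raw wire format and how it parses (illustrative only; vLLM does this for you):

```python
import json
import re

# Example raw completion in the Hermes tool-call format this template emits.
raw = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'

for block in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL):
    call = json.loads(block)
    print(call["name"], call["arguments"])  # get_weather {'city': 'Paris'}
```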
## Hardware requirements
| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72/80 TPS |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85/100 TPS (estimated) |
No NVLink required; PCIe-only works fine, but custom all-reduce must be disabled on PCIe (`--disable-custom-all-reduce`, as in the compose file above).
## Known caveats
- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) is required for TP=2. Vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens; this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise; its `reasoning` field is much shorter than base Qwen's verbose style. `verify-full.sh`'s thinking test expects ≥50 chars, while Carnice typically outputs ~5-10. This is cosmetic; tool calls and generation quality are unaffected.
## Build process
This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:
1. `recipe_d_delta_merge.py` – applies Lorbus's INT4 quant grid to Carnice's BF16 weights
2. `recipe_d_bf16mtp_overlay.py` – replaces the INT4-packed MTP projections with BF16 weights from base Qwen (see the sketch below)
3. Chat template patch – switches the tool-call format from Qwen3 XML to Hermes JSON
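The overlay step is conceptually a keyed tensor swap. A minimal sketch, assuming a single-file checkpoint and a hypothetical `model.mtp.` tensor-name prefix (the real script handles sharded safetensors, the quantization config, and the actual Qwen3.6 MTP tensor names):

```python
from safetensors.torch import load_file, save_file

merged = load_file("carnice-v2-27b-int4/model.safetensors")  # delta-merged INT4 checkpoint
base = load_file("Qwen3.6-27B/model.safetensors")            # original BF16 weights

MTP_PREFIX = "model.mtp."  # assumed prefix; check the real tensor names first

# Drop the INT4-packed MTP tensors (qweight / scales / zeros) ...
merged = {k: v for k, v in merged.items() if not k.startswith(MTP_PREFIX)}
# ... and copy in the unquantized BF16 originals from base Qwen.
merged.update({k: v for k, v in base.items() if k.startswith(MTP_PREFIX)})

save_file(merged, "carnice-v2-27b-int4-bf16mtp/model.safetensors")
```

Keeping these 29 tensors in BF16 keeps the draft head's outputs close to the unquantized model, which is what lifts speculative acceptance back to ~3.0.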
## References
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)