---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- qwen3-next
- hermes
- agentic
- tool-use
- MTP
- spec-decode
- AutoRound
- INT4
base_model:
- kai-os/Carnice-V2-27b
- Qwen/Qwen3.6-27B
- noonghunna/Qwen3.6-27B-int4-AutoRound
inference:
parameters:
temperature: 0.6
top_p: 0.95
top_k: 20
---
# Carnice-V2-27B-INT4-BF16-MTP
**Hermes-style agentic fine-tune of Qwen3.6-27B, quantized to INT4 with a BF16 MTP overlay for speculative decoding.**
This model takes [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b) (a Hermes-style agentic fine-tune of Qwen3.6-27B) and applies:
1. **INT4 quantization** via AutoRound (W4A16, group_size=128, symmetric). The quant grid comes from Lorbus's [Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound), delta-merged onto Carnice's BF16 weights, which avoids re-running the full AutoRound calibration loop (see the sketch after this list).
2. **BF16 MTP overlay**: all 29 MTP head tensors are kept in BF16 (unquantized) for clean speculative-decoding acceptance. This recovers the MTP acceptance length (AL) from ~2.0 to ~3.0.
3. **Patched chat template**: the tool-call format is changed from Qwen3 XML to Hermes JSON (inside `<tool_call>` tags), compatible with vLLM's `--tool-call-parser hermes`.
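The delta-merge in step 1 amounts to requantizing Carnice's weights onto the donor's existing INT4 grid rather than searching for a new one, so no calibration data is needed. A minimal sketch of the idea (the `apply_int4_grid` helper is hypothetical; the real `recipe_d_delta_merge.py` also handles AutoRound's tensor layout and INT32 packing):

```python
import torch

def apply_int4_grid(w_bf16: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    """Requantize a BF16 weight onto a donor's symmetric INT4 (W4A16) grid.

    w_bf16: (out_features, in_features) weight from the Carnice fine-tune.
    scale:  (out_features, in_features // group_size) per-group scales taken
            from the donor AutoRound checkpoint.
    """
    out_f, in_f = w_bf16.shape
    w = w_bf16.float().view(out_f, in_f // group_size, group_size)
    s = scale.float().unsqueeze(-1)             # broadcast one scale over each group
    q = torch.clamp(torch.round(w / s), -8, 7)  # symmetric 4-bit integer range
    # Packing eight 4-bit codes per INT32 word is omitted here.
    return q.to(torch.int8).view(out_f, in_f)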
## Performance
Benchmarked on **2× RTX 3090 (PCIe, no NVLink)** with vLLM dev205, TP=2:
| Metric | Value |
|---|---|
| Narrative TPS (n=5) | **71.75** (CV 11.6%) |
| Code TPS (n=5) | **80.35** (CV 10.6%) |
| MTP acceptance length | **3.02-3.14** |
| Per-position accept rate | ~83% / 69% / 56% |
| TTFT (time to first token) | **~141 ms** |
| Max context | **262K tokens** (fp8 KV) |
| Concurrent streams | **2** |
| VRAM per card | **22.25 GiB** |
| Model load size | **9.19 GiB** |
For comparison, the base Qwen3.6-27B INT4 (same hardware, same config) runs at ~69 narrative / ~89 code TPS, so Carnice is **+4% on narrative, -9% on code**: practically equivalent for everyday agentic use.
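The acceptance-length figure lines up with the per-position rates. Treating the three percentages as cumulative accept rates for the three drafted tokens (an interpretation, not something stated by the benchmark), plus the one token the target model always contributes per verification step:

```python
# 1 guaranteed target token per step, plus each of the 3 MTP draft
# tokens weighted by its (cumulative) accept rate.
per_position_accept = [0.83, 0.69, 0.56]
acceptance_length = 1 + sum(per_position_accept)
print(acceptance_length)  # 3.08 -- inside the measured 3.02-3.14 range
```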
## Quick start
### Docker (vLLM)
```yaml
services:
  vllm-carnice:
    image: vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
    ports:
      - "8070:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      # one flag per list item so Docker passes each as its own argv entry
      - --model=/root/.cache/huggingface/carnice-v2-27b-int4-bf16mtp
      # short model id matching the API examples below
      - --served-model-name=carnice-bf16mtp
      - --quantization=auto_round
      - --dtype=float16
      - --tensor-parallel-size=2
      - --disable-custom-all-reduce
      - --max-model-len=262144
      - --gpu-memory-utilization=0.92
      - --max-num-seqs=2
      - --kv-cache-dtype=fp8_e5m2
      - --trust-remote-code
      - --reasoning-parser=qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser=hermes
      - '--speculative-config={"method":"mtp","num_speculative_tokens":3}'
```
**Note for a single RTX 3090:** set `--tensor-parallel-size 1` and `--max-num-seqs 1`, and reduce `--max-model-len` to ~65K.
### API
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8070/v1", api_key="not-needed")
response = client.chat.completions.create(
model="carnice-bf16mtp",
messages=[{"role": "user", "content": "Write a quicksort in Python."}],
max_tokens=800,
temperature=0.6,
)
print(response.choices[0].message.content)
```
For tool calling:
```python
response = client.chat.completions.create(
model="carnice-bf16mtp",
messages=[{"role": "user", "content": "What's the weather in Paris?"}],
tools=[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}],
tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
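Under the hood, the patched template makes the model emit Hermes-style JSON inside `<tool_call>` tags, and vLLM's `hermes` parser turns that into the `tool_calls` objects above. A minimal sketch of the raw wire format and how it parses (illustrative only; vLLM does this for you):

```python
import json
import re

# Example raw completion in the Hermes tool-call format this template emits.
raw = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'

for block in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL):
    call = json.loads(block)
    print(call["name"], call["arguments"])  # get_weather {'city': 'Paris'}
```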
## Hardware requirements
| Setup | Min VRAM | Context | Throughput |
|---|---|---|---|
| **2× RTX 3090** (recommended) | 24 GB each | 262K | 72/80 TPS |
| **1× RTX 3090** | 24 GB | ~65K | ~50 TPS (estimated) |
| **1× RTX 4090** | 24 GB | ~65K | ~60 TPS (estimated) |
| **2× RTX 4090** | 24 GB each | 262K | ~85/100 TPS (estimated) |
No NVLink required; PCIe-only works fine, but custom all-reduce must be disabled on PCIe (`--disable-custom-all-reduce`, as in the compose file above).
## Known caveats
- **Marlin pad-sub-tile-n patch** (vLLM PR [#40361](https://github.com/vllm-project/vllm/pull/40361)) is required for TP=2. Vendored at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad).
- **Long-context recall** degrades at ≥60K tokens; this is a model-level GatedDeltaNet attention ceiling, not specific to this quant.
- **Thinking mode**: Carnice is concise; its `reasoning` field is much shorter than base Qwen's verbose style. `verify-full.sh`'s thinking test expects ≥50 chars, while Carnice typically outputs ~5-10. This is cosmetic; tool calls and generation quality are unaffected.
## Build process
This model was built using the delta-merge approach documented at [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090). The scripts are in the `carnice-autoround/` directory:
1. `recipe_d_delta_merge.py` – applies Lorbus's INT4 quant grid to Carnice's BF16 weights
2. `recipe_d_bf16mtp_overlay.py` – replaces the INT4-packed MTP projections with BF16 weights from base Qwen (see the sketch below)
3. Chat template patch – switches the tool-call format from Qwen3 XML to Hermes JSON
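The overlay step is conceptually a keyed tensor swap. A minimal sketch, assuming a single-file checkpoint and a hypothetical `model.mtp.` tensor-name prefix (the real script handles sharded safetensors, the quantization config, and the actual Qwen3.6 MTP tensor names):

```python
from safetensors.torch import load_file, save_file

merged = load_file("carnice-v2-27b-int4/model.safetensors")  # delta-merged INT4 checkpoint
base = load_file("Qwen3.6-27B/model.safetensors")            # original BF16 weights

MTP_PREFIX = "model.mtp."  # assumed prefix; check the real tensor names first

# Drop the INT4-packed MTP tensors (qweight / scales / zeros) ...
merged = {k: v for k, v in merged.items() if not k.startswith(MTP_PREFIX)}
# ... and copy in the unquantized BF16 originals from base Qwen.
merged.update({k: v for k, v in base.items() if k.startswith(MTP_PREFIX)})

save_file(merged, "carnice-v2-27b-int4-bf16mtp/model.safetensors")
```

Keeping these 29 tensors in BF16 keeps the draft head's outputs close to the unquantized model, which is what lifts speculative acceptance back to ~3.0.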
## References
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Fine-tune**: [kai-os/Carnice-V2-27b](https://huggingface.co/kai-os/Carnice-V2-27b)
- **Quant recipe**: [noonghunna/Qwen3.6-27B-int4-AutoRound](https://huggingface.co/noonghunna/Qwen3.6-27B-int4-AutoRound) (Lorbus)
- **Project & compose**: [github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)