---
license: apache-2.0
language:
- en
- zh
library_name: mlx
base_model: kai-os/Carnice-V2-27b
base_model_relation: quantized
pipeline_tag: text-generation
inference: false
tags:
- qwen
- qwen3
- qwen3.6
- carnice
- hermes-agent
- agentic
- sft
- mlx
- apple-silicon
- 4-bit
---
# Carnice-V2-27b — MLX 4-bit (naive affine)
MLX-format quantization of [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — a Hermes-style SFT of Qwen3.6-27B for agentic workloads — converted for Apple Silicon inference.
This is the **conservative choice** among the three published variants: standard 4-bit affine quantization, the most widely tested setting in mlx-lm. For better speed and size on the same quality tier, see [`Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6`](https://huggingface.co/Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6).
## Quantization
| Setting | Value |
|---|---|
| Recipe | Naive 4-bit affine |
| Effective bits/weight | 4.50 (4-bit weights plus a 16-bit scale and 16-bit bias per 64-weight group) |
| Group size | 64 |
| Disk size | ~14 GB (3 shards) |
| Source | [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) (BF16 safetensors) |
Conversion command (mlx-lm 0.31.3):
```bash
mlx_lm.convert \
--hf-path kai-os/Carnice-V2-27b \
--mlx-path Carnice-V2-27b-MLX-4bit \
-q --q-bits 4
```
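The same conversion can also be driven from Python. A minimal sketch, assuming mlx-lm 0.31.x keeps the `convert()` keyword names shown here (the group size is spelled out to match the recipe table, though 64 is the default):
```python
# Sketch: Python equivalent of the CLI conversion above.
# Assumes mlx-lm 0.31.x keyword names; q_group_size defaults to 64 but is
# made explicit here to match the table.
from mlx_lm import convert

convert(
    "kai-os/Carnice-V2-27b",             # upstream BF16 safetensors
    mlx_path="Carnice-V2-27b-MLX-4bit",  # output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```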
## Performance — Apple M4 Pro 48 GB, 16 GPU cores
7-prompt agent benchmark suite, `--no-thinking` mode:
| Format | Wall-clock total | Avg tok/s | Output tokens |
|---|---|---|---|
| Carnice Q5_K_M (llama.cpp) | 157.4s | 9.1 | 1297 |
| **Carnice MLX 4-bit naive (this)** | **91.1s (-42%)** | **17.3** | **1192** |
| Carnice MLX mixed_3_6 | 77.7s | 17.0 | 1056 |
| Carnice MLX 6-bit | 108.7s | 11.0 | 1007 |
**~42% faster wall-clock than the GGUF Q5_K_M** on the same hardware. Per-token throughput ~1.9× the llama.cpp baseline.
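The 7-prompt suite itself is not published here, but a rough single-prompt throughput check is easy to script. A minimal sketch (the prompt is a placeholder; the reported tok/s includes prompt processing, so it will land a little below the table's per-token averages):
```python
# Sketch: time one generation and report tokens/second.
# Not the benchmark suite above; prompt and max_tokens are placeholders.
import time
from mlx_lm import load, generate

model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the plot of Hamlet in five bullet points."}],
    add_generation_prompt=True,
    enable_thinking=False,
    tokenize=False,
)
start = time.perf_counter()
text = generate(model, tokenizer, prompt, max_tokens=300)
elapsed = time.perf_counter() - start
n_out = len(tokenizer.encode(text))
print(f"{n_out} tokens in {elapsed:.1f}s -> {n_out / elapsed:.1f} tok/s")
```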
### Quality (wikitext-2 perplexity)
| Variant | seq 256 | seq 1024 |
|---|---|---|
| **naive 4-bit (this)** | **4.949 ± 0.092** | **3.985 ± 0.036** |
| mixed_3_6 | 5.147 ± 0.097 | 4.073 ± 0.038 |
| 6-bit | 4.881 ± 0.091 | (not measured) |
Evaluated with `mlx_lm.perplexity --num-samples 64 --batch-size 1 --sequence-length {256,1024}`. The 6-bit variant could not be measured at sequence length 1024 on M4 Pro 48 GB — its larger memory footprint plus the 1024-token KV cache exceeds the available unified memory. Numbers are comparable across rows within each column. Do not compare to externally-reported wikitext-2 perplexities without matching settings.
The relative ordering between 4-bit and mixed_3_6 is preserved at both context lengths, which suggests the quantization is stable across context length and does not introduce hidden long-range degradation. Lower perplexity at seq 1024 than at seq 256 is expected: more conditioning context yields better next-token prediction.
## Usage
### `mlx_lm` (Python)
```python
from mlx_lm import load, generate
model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-4bit")
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Hello"}],
add_generation_prompt=True,
enable_thinking=False, # important for agent-style use
tokenize=False,
)
print(generate(model, tokenizer, prompt, max_tokens=200))
```
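For interactive use you can stream tokens as they are produced. A minimal sketch, assuming a recent mlx-lm where `stream_generate` yields chunks with a `.text` field:
```python
# Sketch: streaming variant of the example above.
from mlx_lm import load, stream_generate

model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    enable_thinking=False,
    tokenize=False,
)
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=200):
    print(chunk.text, end="", flush=True)
print()
```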
### `mlx_lm.server` (OpenAI-compatible)
```bash
mlx_lm.server --model Tranquil-Flow/Carnice-V2-27b-MLX-4bit \
--host 127.0.0.1 --port 8080 \
--temp 0.6 --top-p 0.95 --top-k 20
```
Pass `chat_template_kwargs: {"enable_thinking": false}` in the request body to disable Carnice's thinking block; without it the model produces roughly 2× as many tokens per turn.
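For example, using only the standard library (a sketch; assumes the usual OpenAI-style response shape from `mlx_lm.server`):
```python
# Sketch: chat completion request with thinking disabled via chat_template_kwargs.
import json
import urllib.request

payload = {
    "model": "Tranquil-Flow/Carnice-V2-27b-MLX-4bit",
    "messages": [{"role": "user", "content": "List three uses for a paperclip."}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": False},
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```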
### Hermes Agent / other agent harnesses
If you're driving this model from an agent harness, make sure the harness propagates `chat_template_kwargs.enable_thinking: false` to `mlx_lm.server`. Without it the model emits a hidden `<think>…</think>` block on every turn, roughly 200 tokens of extra latency that is invisible to the caller.
Known mismatch with [Hermes Agent](https://github.com/NousResearch/hermes-agent)'s `custom` provider: it sends a top-level `think: false` field instead of the `chat_template_kwargs` form, and `mlx_lm.server` does not interpret it. The simplest workaround is a tiny HTTP proxy that rewrites the field between the agent and the server. Open a discussion if you'd like a reference implementation.
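Not a reference implementation, but a minimal sketch of such a shim (standard library only, no streaming or error handling; it rewrites the top-level `think` field into `chat_template_kwargs.enable_thinking` and forwards everything else unchanged):
```python
# Sketch: HTTP shim between an agent harness and mlx_lm.server.
# Point the agent at 127.0.0.1:8081; the shim forwards to 127.0.0.1:8080.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:8080"  # the mlx_lm.server instance started above

class RewriteProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # Translate a top-level `think` flag into the form mlx_lm.server expects.
        if "think" in body:
            body.setdefault("chat_template_kwargs", {})["enable_thinking"] = body.pop("think")
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
            status = resp.status
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8081), RewriteProxy).serve_forever()
```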
## Example output
System: `You find bugs. Reply with: BUG: , then FIX: . No code fences, no extra prose.`
User:
```ts
async function processItems(items: string[]) {
const results = []
for (const item of items) {
results.push(fetch(`/api/process/${item}`).then(r => r.json()))
}
return await results
}
```
What's wrong?
Output (4.3s, 45 tokens):
```
BUG: `await` is applied to an array of promises instead of using `Promise.all` to wait for all to resolve.
FIX: Replace `await results` with `await Promise.all(results)`.
```
## Other variants
| Repo | bpw | Size | Tradeoff |
|---|---|---|---|
| `Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6` | 3.97 | 12 GB | **Recommended** — fastest and smallest; quality comparable to this (slightly higher wikitext-2 perplexity) |
| **`Tranquil-Flow/Carnice-V2-27b-MLX-4bit`** (this) | 4.50 | 14 GB | Conservative — standard naive affine |
| `Tranquil-Flow/Carnice-V2-27b-MLX-6bit` | 6.50 | 20 GB | Quality tier — closer to BF16 fidelity, ~40% slower |
## Limitations & out-of-scope use
This is a third-party MLX-format quantization of `kai-os/Carnice-V2-27b`. It is not maintained by `kai-os` or the upstream Carnice/Qwen teams. It inherits whatever biases, factual limitations, and safety properties the upstream model has — no additional alignment, safety tuning, or behavioral evaluation was performed during conversion.
- **Apple Silicon only.** MLX is Apple's framework; these weights run on M-series Macs. For other hardware use the upstream BF16 weights (`kai-os/Carnice-V2-27b`) or a GGUF conversion.
- **Text-only.** The upstream Carnice model is multimodal (`image-text-to-text`); the `mlx_lm.convert` pipeline used here drops the vision encoder. This release supports text input only. For image input, use the upstream BF16 weights with `transformers`.
- **Quantization artifacts.** Naive 4-bit affine quantization (4.50 bpw) introduces representation error vs the BF16 source — see the perplexity table above. The 7-prompt agent benchmark did not surface degradation, but workloads with long context, complex chains-of-thought, or precision-sensitive numerical reasoning may benefit from a higher-bit variant.
- **Issue scope.** Issues specific to this MLX conversion (loading errors, quantization fidelity, file integrity) belong on this repo. Issues with model behavior (instruction following, factuality, refusal calibration, training-data concerns) are upstream concerns and should be raised on `kai-os/Carnice-V2-27b`.
## Attribution & license
Original model: [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — Hermes-style SFT of Qwen3.6-27B by `kai-os`. Apache 2.0.
This conversion: Apache 2.0. Please credit kai-os as the upstream source.
## Citation
If you use this model, please cite the upstream Carnice release:
```bibtex
@misc{carnice_v2_27b_2026,
author = {kai-os},
title = {Carnice-V2-27b},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/kai-os/Carnice-V2-27b}}
}
```
Carnice is itself an SFT of [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B); please also acknowledge the Qwen team's base model where appropriate.
This MLX conversion may be referenced as `Tranquil-Flow/Carnice-V2-27b-MLX-4bit` (Hugging Face), Apache 2.0, no additional restrictions.