Carnice-V2-27b — MLX 6-bit (quality tier)
MLX-format quantization of kai-os/Carnice-V2-27b — a Hermes-style SFT of Qwen3.6-27B for agentic workloads — converted for Apple Silicon inference.
This is the quality tier of three published variants: closer-to-BF16 fidelity at the cost of ~40% slower throughput than 4-bit. Pick this if you have abundant unified memory and prefer the cleanest weight representation.
Quantization
| Parameter | Value |
|---|---|
| Recipe | 6-bit affine |
| Effective bits/weight | 6.50 |
| Group size | 64 |
| Disk size | ~20 GB (5 shards) |
| Source | kai-os/Carnice-V2-27b (BF16 safetensors) |
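As a sanity check, the ~20 GB disk size follows directly from the parameter count and the effective bits/weight (a rough estimate only; the exact size also depends on unquantized embeddings, scales/biases packing, and shard metadata):

```python
# Estimate on-disk size from parameter count and effective bits/weight.
params = 27e9   # 27B parameters
bpw = 6.50      # effective bits per weight (includes quantization scales/biases)
size_bytes = params * bpw / 8
print(f"{size_bytes / 2**30:.1f} GiB")  # ~20.4 GiB
```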
Conversion command (mlx-lm 0.31.3):
mlx_lm.convert \
--hf-path kai-os/Carnice-V2-27b \
--mlx-path Carnice-V2-27b-MLX-6bit \
-q --q-bits 6
Performance — Apple M4 Pro 48 GB, 16 GPU cores
7-prompt agent benchmark suite, --no-thinking mode:
| Format | Wall-clock total | Avg tok/s | Output tokens |
|---|---|---|---|
| Carnice Q5_K_M (llama.cpp) | 157.4s | 9.1 | 1297 |
| Carnice MLX 4-bit naive | 91.1s | 17.3 | 1192 |
| Carnice MLX mixed_3_6 | 77.7s | 17.0 | 1056 |
| Carnice MLX 6-bit (this) | 108.7s (-31%) | 11.0 | 1007 |
Still ~31% faster wall-clock than the GGUF Q5_K_M. Per-token throughput is bandwidth-limited at this size — the higher bit count means more memory traffic per token. On systems with more memory bandwidth (M3 Ultra, M4 Max), the gap to the smaller variants narrows.
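The bandwidth-limited claim can be checked with a back-of-envelope decode ceiling: each generated token streams every weight through memory once, so throughput is bounded by bandwidth divided by weight footprint. The ~273 GB/s M4 Pro bandwidth figure below is an assumption taken from Apple's published spec, not a measurement:

```python
# Back-of-envelope decode throughput for a bandwidth-bound model.
bandwidth_gb_s = 273   # assumed M4 Pro unified memory bandwidth (Apple spec)
weights_gb = 20        # this variant's weight footprint
ceiling_tok_s = bandwidth_gb_s / weights_gb
print(f"theoretical ceiling: {ceiling_tok_s:.2f} tok/s")  # 13.65 tok/s
# Measured 11.0 tok/s is ~80% of this ceiling, consistent with a
# bandwidth-limited regime (KV-cache and attention traffic take the rest).
```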
Output quality is on par with mixed_3_6 in our agent benchmark — slightly different stylistic choices but no systematic improvement detectable on the prompts tested. Recommended only when subtle weight precision matters (long-context coherence, complex reasoning chains, RAG fidelity).
Quality (wikitext-2 perplexity)
| Variant | seq 256 | seq 1024 |
|---|---|---|
| naive 4-bit | 4.949 ± 0.092 | 3.985 ± 0.036 |
| mixed_3_6 | 5.147 ± 0.097 | 4.073 ± 0.038 |
| 6-bit (this) | 4.881 ± 0.091 | (not measured) |
Evaluated with mlx_lm.perplexity --num-samples 64 --batch-size 1 --sequence-length {256,1024}. The 6-bit variant could not be measured at sequence length 1024 on M4 Pro 48 GB — its larger memory footprint plus the 1024-token KV cache exceeds the available unified memory on this hardware. Higher-memory Apple Silicon (M3 Ultra, M4 Max with ≥64 GB) should be able to produce a long-context number; we welcome PRs adding it. Do not compare to externally-reported wikitext-2 perplexities without matching settings.
6-bit's lower perplexity at seq 256 reflects its higher fidelity to the BF16 source — on next-token prediction it is meaningfully better than the smaller variants. This advantage did not surface on the 7-prompt agent benchmark; whether it matters in practice depends on workload characteristics (long context, complex reasoning, RAG). The 4-bit and mixed_3_6 long-context numbers track their short-context numbers (relative ordering preserved), suggesting quantization is stable across context length for those variants — by extrapolation, 6-bit is expected to behave similarly.
Usage
mlx_lm (Python)
from mlx_lm import load, generate
model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-6bit")
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Hello"}],
add_generation_prompt=True,
enable_thinking=False, # important for agent-style use
tokenize=False,
)
print(generate(model, tokenizer, prompt, max_tokens=200))
mlx_lm.server (OpenAI-compatible)
mlx_lm.server --model Tranquil-Flow/Carnice-V2-27b-MLX-6bit \
--host 127.0.0.1 --port 8080 \
--temp 0.6 --top-p 0.95 --top-k 20
Pass chat_template_kwargs: {"enable_thinking": false} in request body to disable Carnice's thinking block.
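For example, a minimal request to the server above, with the thinking block disabled (the `post_chat` helper is illustrative; only the endpoint path and the `chat_template_kwargs` field follow mlx_lm.server's OpenAI-compatible API):

```python
import json
import urllib.request

# Payload for mlx_lm.server's OpenAI-compatible /v1/chat/completions endpoint.
# chat_template_kwargs is forwarded to the tokenizer's chat template, so
# enable_thinking=False suppresses Carnice's <think> block.
payload = {
    "model": "Tranquil-Flow/Carnice-V2-27b-MLX-6bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": False},
}

def post_chat(payload, url="http://127.0.0.1:8080/v1/chat/completions"):
    """Send the request to a locally running mlx_lm.server."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# reply = post_chat(payload)  # requires the server command above to be running
```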
Hermes Agent / other agent harnesses
If you're driving this model from an agent harness, make sure the harness propagates chat_template_kwargs.enable_thinking: false to mlx_lm.server. Without it the model emits a hidden <think>...</think> block on every turn — ~200 tokens of extra latency invisible to the caller.
Known mismatch with Hermes Agent's custom provider: it sends a top-level think: false field instead of the chat_template_kwargs form, and mlx_lm.server does not interpret it. The simplest workaround is a tiny HTTP proxy that rewrites the field between the agent and the server. Open a discussion if you'd like a reference implementation.
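The rewrite itself is a small transform on the JSON body; a full proxy would wrap this in an HTTP server, but the core logic is a sketch like the following (assuming only the top-level `think` field described above — not Hermes Agent's full request schema):

```python
def rewrite_think_field(body: dict) -> dict:
    """Translate a top-level `think` flag into the chat_template_kwargs
    form that mlx_lm.server understands."""
    body = dict(body)  # don't mutate the caller's request
    if "think" in body:
        think = body.pop("think")
        kwargs = dict(body.get("chat_template_kwargs", {}))
        kwargs["enable_thinking"] = bool(think)
        body["chat_template_kwargs"] = kwargs
    return body

# A proxy would apply this to each request body before forwarding:
rewritten = rewrite_think_field({"think": False, "messages": []})
print(rewritten["chat_template_kwargs"])  # {'enable_thinking': False}
```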
Example output
System: You find bugs. Reply with: BUG: <one-line description>, then FIX: <one-line patch description>. No code fences, no extra prose.
User:
async function processItems(items: string[]) {
const results = []
for (const item of items) {
results.push(fetch(`/api/process/${item}`).then(r => r.json()))
}
return await results
}
What's wrong?
Output (4.6s, 42 tokens):
BUG: The function returns an array of unresolved Promises instead of awaiting them.
FIX: Replace `return await results` with `return await Promise.all(results)`.
Other variants
| Repo | bpw | Size | Tradeoff |
|---|---|---|---|
| Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6 | 3.97 | 12 GB | Recommended — fastest + smallest, quality ≥ naive 4-bit |
| Tranquil-Flow/Carnice-V2-27b-MLX-4bit | 4.50 | 14 GB | Conservative naive 4-bit |
| Tranquil-Flow/Carnice-V2-27b-MLX-6bit (this) | 6.50 | 20 GB | Quality tier — closer to BF16, ~40% slower |
Limitations & out-of-scope use
This is a third-party MLX-format quantization of kai-os/Carnice-V2-27b. It is not maintained by kai-os or the upstream Carnice/Qwen teams. It inherits whatever biases, factual limitations, and safety properties the upstream model has — no additional alignment, safety tuning, or behavioral evaluation was performed during conversion.
- Apple Silicon only. MLX is Apple's framework; these weights run on M-series Macs. For other hardware use the upstream BF16 weights (kai-os/Carnice-V2-27b) or a GGUF conversion.
- Text-only. The upstream Carnice model is multimodal (image-text-to-text); the mlx_lm.convert pipeline used here drops the vision encoder. This release supports text input only. For image input, use the upstream BF16 weights with transformers.
- Memory and bandwidth. 6-bit affine quantization (6.50 bpw) is the highest-fidelity variant of this release, at the cost of a larger memory footprint and lower per-token throughput on bandwidth-limited hardware. Fits comfortably on 32 GB+ unified memory; on 16 GB systems prefer mixed_3_6 or 4bit to leave room for KV cache.
- Issue scope. Issues specific to this MLX conversion (loading errors, quantization fidelity, file integrity) belong on this repo. Issues with model behavior (instruction following, factuality, refusal calibration, training-data concerns) are upstream concerns and should be raised on kai-os/Carnice-V2-27b.
Attribution & license
Original model: kai-os/Carnice-V2-27b — Hermes-style SFT of Qwen3.6-27B by kai-os. Apache 2.0.
This conversion: Apache 2.0. Please credit kai-os as the upstream source.
Citation
If you use this model, please cite the upstream Carnice release:
@misc{carnice_v2_27b_2026,
author = {kai-os},
title = {Carnice-V2-27b},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/kai-os/Carnice-V2-27b}}
}
Carnice is itself an SFT of Qwen/Qwen3.6-27B; please also acknowledge the Qwen team's base model where appropriate.
This MLX conversion may be referenced as Tranquil-Flow/Carnice-V2-27b-MLX-6bit (Hugging Face), Apache 2.0, no additional restrictions.