Carnice-V2-27b — MLX 6-bit (quality tier)
MLX-format quantization of kai-os/Carnice-V2-27b — a Hermes-style SFT of Qwen3.6-27B for agentic workloads — converted for Apple Silicon inference.
This is the quality tier of three published variants: closer-to-BF16 fidelity at the cost of ~40% slower throughput than 4-bit. Pick this if you have abundant unified memory and prefer the cleanest weight representation.
Quantization
| Parameter | Value |
|---|---|
| Recipe | 6-bit affine |
| Effective bits/weight | 6.50 |
| Group size | 64 |
| Disk size | ~20 GB (5 shards) |
| Source | kai-os/Carnice-V2-27b (BF16 safetensors) |
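As a sanity check, the ~20 GB disk size follows directly from the parameter count and the effective bits/weight (a rough estimate only; the exact size also depends on unquantized embeddings, scales/biases packing, and shard metadata):

```python
# Estimate on-disk size from parameter count and effective bits/weight.
params = 27e9   # 27B parameters
bpw = 6.50      # effective bits per weight (includes quantization scales/biases)
size_bytes = params * bpw / 8
print(f"{size_bytes / 2**30:.1f} GiB")  # ~20.4 GiB
```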
Conversion command (mlx-lm 0.31.3):
mlx_lm.convert \
--hf-path kai-os/Carnice-V2-27b \
--mlx-path Carnice-V2-27b-MLX-6bit \
-q --q-bits 6
Performance — Apple M4 Pro 48 GB, 16 GPU cores
7-prompt agent benchmark suite, --no-thinking mode:
| Format | Wall-clock total | Avg tok/s | Output tokens |
|---|---|---|---|
| Carnice Q5_K_M (llama.cpp) | 157.4s | 9.1 | 1297 |
| Carnice MLX 4-bit naive | 91.1s | 17.3 | 1192 |
| Carnice MLX mixed_3_6 | 77.7s | 17.0 | 1056 |
| Carnice MLX 6-bit (this) | 108.7s (-31%) | 11.0 | 1007 |
Still ~31% faster wall-clock than the GGUF Q5_K_M. Per-token throughput is bandwidth-limited at this size — the higher bit count means more memory traffic per token. On systems with more memory bandwidth (M3 Ultra, M4 Max), the gap to the smaller variants narrows.
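The bandwidth-limited claim can be checked with a back-of-envelope decode ceiling: each generated token streams every weight through memory once, so throughput is bounded by bandwidth divided by weight footprint. The ~273 GB/s M4 Pro bandwidth figure below is an assumption taken from Apple's published spec, not a measurement:

```python
# Back-of-envelope decode throughput for a bandwidth-bound model.
bandwidth_gb_s = 273   # assumed M4 Pro unified memory bandwidth (Apple spec)
weights_gb = 20        # this variant's weight footprint
ceiling_tok_s = bandwidth_gb_s / weights_gb
print(f"theoretical ceiling: {ceiling_tok_s:.2f} tok/s")  # 13.65 tok/s
# Measured 11.0 tok/s is ~80% of this ceiling, consistent with a
# bandwidth-limited regime (KV-cache and attention traffic take the rest).
```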
Output quality is on par with mixed_3_6 in our agent benchmark — slightly different stylistic choices but no systematic improvement detectable on the prompts tested. Recommended only when subtle weight precision matters (long-context coherence, complex reasoning chains, RAG fidelity).
Quality (wikitext-2 perplexity)
| Variant | seq 256 | seq 1024 |
|---|---|---|
| naive 4-bit | 4.949 ± 0.092 | 3.985 ± 0.036 |
| mixed_3_6 | 5.147 ± 0.097 | 4.073 ± 0.038 |
| 6-bit (this) | 4.881 ± 0.091 | (not measured) |
Evaluated with mlx_lm.perplexity --num-samples 64 --batch-size 1 --sequence-length {256,1024}. The 6-bit variant could not be measured at sequence length 1024 on M4 Pro 48 GB — its larger memory footprint plus the 1024-token KV cache exceeds the available unified memory on this hardware. Higher-memory Apple Silicon (M3 Ultra, M4 Max with ≥64 GB) should be able to produce a long-context number; we welcome PRs adding it. Do not compare to externally-reported wikitext-2 perplexities without matching settings.
6-bit's lower perplexity at seq 256 reflects its higher fidelity to the BF16 source — on next-token prediction it is meaningfully better than the smaller variants. This advantage did not surface on the 7-prompt agent benchmark; whether it matters in practice depends on workload characteristics (long context, complex reasoning, RAG). The 4-bit and mixed_3_6 long-context numbers track their short-context numbers (relative ordering preserved), suggesting quantization is stable across context length for those variants — by extrapolation, 6-bit is expected to behave similarly.
Usage
mlx_lm (Python)
from mlx_lm import load, generate
model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-6bit")
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Hello"}],
add_generation_prompt=True,
enable_thinking=False, # important for agent-style use
tokenize=False,
)
print(generate(model, tokenizer, prompt, max_tokens=200))
mlx_lm.server (OpenAI-compatible)
mlx_lm.server --model Tranquil-Flow/Carnice-V2-27b-MLX-6bit \
--host 127.0.0.1 --port 8080 \
--temp 0.6 --top-p 0.95 --top-k 20
Pass chat_template_kwargs: {"enable_thinking": false} in request body to disable Carnice's thinking block.
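For example, a minimal request to the server above, with the thinking block disabled (the `post_chat` helper is illustrative; only the endpoint path and the `chat_template_kwargs` field follow mlx_lm.server's OpenAI-compatible API):

```python
import json
import urllib.request

# Payload for mlx_lm.server's OpenAI-compatible /v1/chat/completions endpoint.
# chat_template_kwargs is forwarded to the tokenizer's chat template, so
# enable_thinking=False suppresses Carnice's <think> block.
payload = {
    "model": "Tranquil-Flow/Carnice-V2-27b-MLX-6bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": False},
}

def post_chat(payload, url="http://127.0.0.1:8080/v1/chat/completions"):
    """Send the request to a locally running mlx_lm.server."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# reply = post_chat(payload)  # requires the server command above to be running
```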
Hermes Agent / other agent harnesses
If you're driving this model from an agent harness, make sure the harness propagates chat_template_kwargs.enable_thinking: false to mlx_lm.server. Without it the model emits a hidden <think>...</think> block on every turn — ~200 tokens of extra latency invisible to the caller.
Known mismatch with Hermes Agent's custom provider: it sends a top-level think: false field instead of the chat_template_kwargs form, and mlx_lm.server does not interpret it. The simplest workaround is a tiny HTTP proxy that rewrites the field between the agent and the server. Open a discussion if you'd like a reference implementation.
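The rewrite itself is a small transform on the JSON body; a full proxy would wrap this in an HTTP server, but the core logic is a sketch like the following (assuming only the top-level `think` field described above — not Hermes Agent's full request schema):

```python
def rewrite_think_field(body: dict) -> dict:
    """Translate a top-level `think` flag into the chat_template_kwargs
    form that mlx_lm.server understands."""
    body = dict(body)  # don't mutate the caller's request
    if "think" in body:
        think = body.pop("think")
        kwargs = dict(body.get("chat_template_kwargs", {}))
        kwargs["enable_thinking"] = bool(think)
        body["chat_template_kwargs"] = kwargs
    return body

# A proxy would apply this to each request body before forwarding:
rewritten = rewrite_think_field({"think": False, "messages": []})
print(rewritten["chat_template_kwargs"])  # {'enable_thinking': False}
```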
Example output
System: You find bugs. Reply with: BUG: <one-line description>, then FIX: <one-line patch description>. No code fences, no extra prose.
User:
async function processItems(items: string[]) {
const results = []
for (const item of items) {
results.push(fetch(`/api/process/${item}`).then(r => r.json()))
}
return await results
}
What's wrong?
Output (4.6s, 42 tokens):
BUG: The function returns an array of unresolved Promises instead of awaiting them.
FIX: Replace `return await results` with `return await Promise.all(results)`.
Other variants
| Repo | bpw | Size | Tradeoff |
|---|---|---|---|
| Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6 | 3.97 | 12 GB | Recommended — fastest + smallest, quality ≥ naive 4-bit |
| Tranquil-Flow/Carnice-V2-27b-MLX-4bit | 4.50 | 14 GB | Conservative naive 4-bit |
| Tranquil-Flow/Carnice-V2-27b-MLX-6bit (this) | 6.50 | 20 GB | Quality tier — closer to BF16, ~40% slower |
Limitations & out-of-scope use
This is a third-party MLX-format quantization of kai-os/Carnice-V2-27b. It is not maintained by kai-os or the upstream Carnice/Qwen teams. It inherits whatever biases, factual limitations, and safety properties the upstream model has — no additional alignment, safety tuning, or behavioral evaluation was performed during conversion.
- Apple Silicon only. MLX is Apple's framework; these weights run on M-series Macs. For other hardware use the upstream BF16 weights (kai-os/Carnice-V2-27b) or a GGUF conversion.
- Text-only. The upstream Carnice model is multimodal (image-text-to-text); the mlx_lm.convert pipeline used here drops the vision encoder. This release supports text input only. For image input, use the upstream BF16 weights with transformers.
- Memory and bandwidth. 6-bit affine quantization (6.50 bpw) is the highest-fidelity variant of this release, at the cost of a larger memory footprint and lower per-token throughput on bandwidth-limited hardware. Fits comfortably on 32 GB+ unified memory; on 16 GB systems prefer mixed_3_6 or 4bit to leave room for KV cache.
- Issue scope. Issues specific to this MLX conversion (loading errors, quantization fidelity, file integrity) belong on this repo. Issues with model behavior (instruction following, factuality, refusal calibration, training-data concerns) are upstream concerns and should be raised on kai-os/Carnice-V2-27b.
Attribution & license
Original model: kai-os/Carnice-V2-27b — Hermes-style SFT of Qwen3.6-27B by kai-os. Apache 2.0.
This conversion: Apache 2.0. Please credit kai-os as the upstream source.
Citation
If you use this model, please cite the upstream Carnice release:
@misc{carnice_v2_27b_2026,
author = {kai-os},
title = {Carnice-V2-27b},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/kai-os/Carnice-V2-27b}}
}
Carnice is itself an SFT of Qwen/Qwen3.6-27B; please also acknowledge the Qwen team's base model where appropriate.
This MLX conversion may be referenced as Tranquil-Flow/Carnice-V2-27b-MLX-6bit (Hugging Face), Apache 2.0, no additional restrictions.