Carnice-V2-27b — MLX mixed_3_6 (recommended)
MLX-format quantization of kai-os/Carnice-V2-27b — a Hermes-style SFT of Qwen3.6-27B for agentic workloads — converted for fast Apple Silicon inference.
This is the recommended default of the three published variants: it is the smallest and fastest, with quality on par with naive 4-bit on agent tasks.
Quantization
| | |
|---|---|
| Recipe | mixed_3_6 mixed-bit (critical layers at 6-bit, others at 3-bit) |
| Effective bits/weight | 3.97 |
| Group size | 64 |
| Disk size | ~12 GB (3 shards) |
| Source | kai-os/Carnice-V2-27b (BF16 safetensors) |
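The effective bits/weight can be sanity-checked from the group overhead: with group size 64 and an fp16 scale and bias per group, affine quantization adds about 0.5 bits/weight on top of the nominal bit-width (which is where the 4.50 and 6.50 bpw of the sibling variants come from). A rough back-of-envelope sketch; the exact 6-bit fraction depends on which layers the mixed_3_6 predicate marks as critical, so treat the result as an estimate:

# Back-of-envelope check of the effective bits/weight for this recipe.
# Assumes one fp16 scale and one fp16 bias per group of 64 weights.
GROUP_SIZE = 64
OVERHEAD = 2 * 16 / GROUP_SIZE   # 0.5 bits/weight of scale+bias overhead

bpw_3bit = 3 + OVERHEAD          # 3.5 effective bpw for 3-bit groups
bpw_6bit = 6 + OVERHEAD          # 6.5 effective bpw for 6-bit groups

# Solve 3.97 = f * bpw_6bit + (1 - f) * bpw_3bit for the approximate
# fraction f of parameters kept at 6-bit.
f = (3.97 - bpw_3bit) / (bpw_6bit - bpw_3bit)
print(f"~{f:.0%} of weights at 6-bit")   # -> ~16%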
Conversion command (mlx-lm 0.31.3):
mlx_lm.convert \
--hf-path kai-os/Carnice-V2-27b \
--mlx-path Carnice-V2-27b-MLX-mixed_3_6 \
-q --quant-predicate mixed_3_6
Performance — Apple M4 Pro 48 GB, 16 GPU cores
7-prompt agent benchmark suite, --no-thinking mode (Carnice's default for agent loops):
| Format | Wall-clock total | Avg tok/s | Output tokens |
|---|---|---|---|
| Carnice Q5_K_M (llama.cpp) | 157.4s | 9.1 | 1297 |
| Carnice MLX 4-bit naive | 91.1s | 17.3 | 1192 |
| Carnice MLX mixed_3_6 (this) | 77.7s (-51%) | 17.0 | 1056 |
| Carnice MLX 6-bit | 108.7s | 11.0 | 1007 |
Wall-clock time is ~51% lower than the GGUF Q5_K_M on the same hardware, and per-token throughput is ~1.9× the llama.cpp baseline. Quality matches or exceeds naive 4-bit on agent tasks (more complete tool-selection responses, correct severity classification on triage, well-formed JSON).
The benchmark numbers above were collected with thinking disabled: the chat template's `enable_thinking: false` flag must be propagated through the request (see Usage). Without it, output token counts approximately double and the wall-clock advantage is lost.
Quality (wikitext-2 perplexity)
| Variant | seq 256 | seq 1024 |
|---|---|---|
| naive 4-bit | 4.949 ± 0.092 | 3.985 ± 0.036 |
| mixed_3_6 (this) | 5.147 ± 0.097 | 4.073 ± 0.038 |
| 6-bit | 4.881 ± 0.091 | (not measured) |
Evaluated with mlx_lm.perplexity --num-samples 64 --batch-size 1 --sequence-length {256,1024}. The 6-bit variant could not be measured at sequence length 1024 on M4 Pro 48 GB — its larger memory footprint plus the 1024-token KV cache exceeds the available unified memory. Numbers are comparable across rows within each column. Do not compare to externally-reported wikitext-2 perplexities without matching settings.
mixed_3_6's slightly higher perplexity than naive 4-bit at both context lengths is the expected tradeoff for its lower bits/weight (3.97 vs 4.50). The gap is preserved at longer context, indicating the lower-bit recipe does not introduce hidden long-range degradation. On the 7-prompt agent benchmark, mixed_3_6 produced more complete responses on tool selection and equivalent-or-better severity classification, so the perplexity gap did not translate into observable agent-task degradation on the prompts tested.
Usage
mlx_lm (Python)
from mlx_lm import load, generate
model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    enable_thinking=False,  # important for agent-style use
    tokenize=False,
)
print(generate(model, tokenizer, prompt, max_tokens=200))
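To match the sampling settings used for mlx_lm.server below (temperature 0.6, top-p 0.95, top-k 20), a sampler can be passed to generate. A minimal sketch reusing model, tokenizer, and prompt from the block above; it assumes the make_sampler helper in mlx_lm.sample_utils, available in recent mlx-lm releases:

from mlx_lm.sample_utils import make_sampler

# Sampler mirroring the server flags below; reuses model/tokenizer/prompt from above.
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)
print(generate(model, tokenizer, prompt, max_tokens=200, sampler=sampler))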
mlx_lm.server (OpenAI-compatible)
mlx_lm.server --model Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6 \
--host 127.0.0.1 --port 8080 \
--temp 0.6 --top-p 0.95 --top-k 20
When sending requests, include chat_template_kwargs to disable thinking:
{
"model": "...",
"messages": [...],
"chat_template_kwargs": {"enable_thinking": false}
}
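For example, a minimal non-streaming request using only the Python standard library; the model name and port match the server command above, and /v1/chat/completions is the OpenAI-compatible endpoint mlx_lm.server exposes:

import json
import urllib.request

# Non-thinking chat request against a local mlx_lm.server instance (see above).
payload = {
    "model": "Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": False},  # suppress <think> blocks
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])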
LM Studio
LM Studio's MLX runtime should load this directly via the search-and-download flow.
Hermes Agent / other agent harnesses
If you're driving this model from an agent harness, make sure the harness propagates `chat_template_kwargs.enable_thinking: false` to mlx_lm.server. Without it the model emits a hidden `<think>...</think>` block on every turn — roughly 200 extra tokens per turn whose latency is invisible to the caller.
Known mismatch with Hermes Agent's custom provider: it sends a top-level think: false field instead of the chat_template_kwargs form, and mlx_lm.server does not interpret it. The simplest workaround is a tiny HTTP proxy that rewrites the field between the agent and the server. Open a discussion if you'd like a reference implementation.
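For reference, a minimal sketch of such a proxy using only the Python standard library. The top-level think field and the chat_template_kwargs rewrite follow the mismatch described above; the proxy port (8081), the buffered non-streaming forwarding, and the lack of error handling are assumptions for illustration, not part of either project:

import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:8080"   # mlx_lm.server from the Usage section

class RewriteHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            payload = json.loads(body)
            # Translate the agent's top-level flag into the form mlx_lm.server expects.
            if payload.pop("think", None) is False:
                payload.setdefault("chat_template_kwargs", {})["enable_thinking"] = False
            body = json.dumps(payload).encode()
        except json.JSONDecodeError:
            pass  # forward non-JSON bodies untouched
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Buffered forwarding: fine for short agent turns, does not handle stream=true.
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
            self.end_headers()
            self.wfile.write(resp.read())

if __name__ == "__main__":
    # Point the agent at 127.0.0.1:8081 instead of the mlx_lm.server port.
    HTTPServer(("127.0.0.1", 8081), RewriteHandler).serve_forever()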
Example output
System: You find bugs. Reply with: BUG: <one-line description>, then FIX: <one-line patch description>. No code fences, no extra prose.
User:
async function processItems(items: string[]) {
  const results = []
  for (const item of items) {
    results.push(fetch(`/api/process/${item}`).then(r => r.json()))
  }
  return await results
}
What's wrong?
Output (3.4s, 41 tokens):
BUG: The function awaits an array of promises instead of awaiting all promises concurrently.
FIX: Replace `await results` with `await Promise.all(results)`.
Other variants
| Repo | bpw | Size | Tradeoff |
|---|---|---|---|
| Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6 (this) | 3.97 | 12 GB | Recommended — fastest + smallest, quality ≥ naive 4-bit |
| Tranquil-Flow/Carnice-V2-27b-MLX-4bit | 4.50 | 14 GB | Conservative naive affine quant |
| Tranquil-Flow/Carnice-V2-27b-MLX-6bit | 6.50 | 20 GB | Quality tier — closer to BF16 fidelity, ~40% slower |
Limitations & out-of-scope use
This is a third-party MLX-format quantization of kai-os/Carnice-V2-27b. It is not maintained by kai-os or the upstream Carnice/Qwen teams. It inherits whatever biases, factual limitations, and safety properties the upstream model has — no additional alignment, safety tuning, or behavioral evaluation was performed during conversion.
- Apple Silicon only. MLX is Apple's framework; these weights run on M-series Macs. For other hardware use the upstream BF16 weights (kai-os/Carnice-V2-27b) or a GGUF conversion.
- Text-only. The upstream Carnice model is multimodal (image-text-to-text); the `mlx_lm.convert` pipeline used here drops the vision encoder. This release supports text input only. For image input, use the upstream BF16 weights with `transformers`.
- Quantization artifacts. The `mixed_3_6` recipe (3.97 bpw — predominantly 3-bit groups with critical layers preserved at 6-bit) is the lowest-bit variant of this release. It introduces more representation error than the 4-bit and 6-bit variants, but the 7-prompt agent benchmark did not surface degradation. Workloads with long context, complex chains-of-thought, or precision-sensitive numerical reasoning may prefer the higher-bit variants.
- Issue scope. Issues specific to this MLX conversion (loading errors, quantization fidelity, file integrity) belong on this repo. Issues with model behavior (instruction following, factuality, refusal calibration, training-data concerns) are upstream concerns and should be raised on kai-os/Carnice-V2-27b.
Attribution & license
Original model: kai-os/Carnice-V2-27b — Hermes-style SFT of Qwen3.6-27B by kai-os. Apache 2.0.
This conversion: Apache 2.0, no additional restrictions. Please credit kai-os as the upstream source when discussing or comparing this model.
Citation
If you use this model, please cite the upstream Carnice release:
@misc{carnice_v2_27b_2026,
  author = {kai-os},
  title = {Carnice-V2-27b},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kai-os/Carnice-V2-27b}}
}
Carnice is itself an SFT of Qwen/Qwen3.6-27B; please also acknowledge the Qwen team's base model where appropriate.
This MLX conversion may be referenced as Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6 (Hugging Face), Apache 2.0, no additional restrictions.