---
license: apache-2.0
language:
- en
- zh
library_name: mlx
base_model: kai-os/Carnice-V2-27b
base_model_relation: quantized
pipeline_tag: text-generation
inference: false
tags:
- qwen
- qwen3
- qwen3.6
- carnice
- hermes-agent
- agentic
- sft
- mlx
- apple-silicon
- 4-bit
---

# Carnice-V2-27b — MLX 4-bit (naive affine)

MLX-format quantization of [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — a Hermes-style SFT of Qwen3.6-27B for agentic workloads — converted for Apple Silicon inference.

This is the **conservative choice** of the three published variants: standard 4-bit affine quantization, the most widely tested mlx-lm setting. For better speed and size on the same quality tier, see [`Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6`](https://huggingface.co/Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6).

## Quantization

| Property | Value |
|---|---|
| Recipe | Naive 4-bit affine |
| Effective bits/weight | 4.50 |
| Group size | 64 |
| Disk size | ~14 GB (3 shards) |
| Source | [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) (BF16 safetensors) |

Conversion command (mlx-lm 0.31.3):

```bash
mlx_lm.convert \
  --hf-path kai-os/Carnice-V2-27b \
  --mlx-path Carnice-V2-27b-MLX-4bit \
  -q --q-bits 4
```

## Performance — Apple M4 Pro 48 GB, 16 GPU cores

7-prompt agent benchmark suite, `--no-thinking` mode:

| Format | Wall-clock total | Avg tok/s | Output tokens |
|---|---|---|---|
| Carnice Q5_K_M (llama.cpp) | 157.4s | 9.1 | 1297 |
| **Carnice MLX 4-bit naive (this)** | **91.1s (-42%)** | **17.3** | **1192** |
| Carnice MLX mixed_3_6 | 77.7s | 17.0 | 1056 |
| Carnice MLX 6-bit | 108.7s | 11.0 | 1007 |

**~42% faster wall-clock than the GGUF Q5_K_M** on the same hardware. Per-token throughput is ~1.9× the llama.cpp baseline.

### Quality (wikitext-2 perplexity)

| Variant | seq 256 | seq 1024 |
|---|---|---|
| **naive 4-bit (this)** | **4.949 ± 0.092** | **3.985 ± 0.036** |
| mixed_3_6 | 5.147 ± 0.097 | 4.073 ± 0.038 |
| 6-bit | 4.881 ± 0.091 | (not measured) |

Evaluated with `mlx_lm.perplexity --num-samples 64 --batch-size 1 --sequence-length {256,1024}`. The 6-bit variant could not be measured at sequence length 1024 on the M4 Pro 48 GB — its larger memory footprint plus the 1024-token KV cache exceeds the available unified memory.

Numbers are comparable across rows within each column. Do not compare them to externally reported wikitext-2 perplexities without matching settings. The relative ordering between 4-bit and mixed_3_6 is preserved at both context lengths, which suggests the quantization is stable across context length and does not introduce hidden long-range degradation. Lower perplexity at seq 1024 vs seq 256 is expected — more conditioning context yields better next-token prediction.

## Usage

### `mlx_lm` (Python)

```python
from mlx_lm import load, generate

model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    enable_thinking=False,  # important for agent-style use
    tokenize=False,
)

print(generate(model, tokenizer, prompt, max_tokens=200))
```

### `mlx_lm.server` (OpenAI-compatible)

```bash
mlx_lm.server --model Tranquil-Flow/Carnice-V2-27b-MLX-4bit \
  --host 127.0.0.1 --port 8080 \
  --temp 0.6 --top-p 0.95 --top-k 20
```

Pass `chat_template_kwargs: {"enable_thinking": false}` in the request body to disable Carnice's thinking block — without it the model produces roughly 2× more tokens.
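For example, a minimal request against the server started above (the endpoint path follows the OpenAI chat-completions convention; adjust host, port, and `max_tokens` to your setup):

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 200,
        "chat_template_kwargs": {"enable_thinking": false}
      }'
```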
### Hermes Agent / other agent harnesses

If you're driving this model from an agent harness, make sure the harness propagates `chat_template_kwargs.enable_thinking: false` to `mlx_lm.server`. Without it the model emits a hidden `<think>…</think>` block on every turn — roughly 200 tokens of latency that are invisible to the caller.

Known mismatch with [Hermes Agent](https://github.com/NousResearch/hermes-agent)'s `custom` provider: it sends a top-level `think: false` field instead of the `chat_template_kwargs` form, and `mlx_lm.server` does not interpret it. The simplest workaround is a tiny HTTP proxy that rewrites the field between the agent and the server; a sketch follows below. Open a discussion if you'd like a fuller reference implementation.
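A minimal sketch of such a proxy, using only the Python standard library. This is an illustration, not a vetted reference implementation: it assumes Hermes Agent sends a top-level boolean `think` field as described above, forwards everything else untouched, and does not handle streaming (SSE) responses.

```python
# rewrite_proxy.py -- sketch: rewrites Hermes Agent's top-level `think`
# field into the `chat_template_kwargs` form that mlx_lm.server expects.
# Assumes non-streaming JSON requests/responses; ports are examples.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:8080"  # where mlx_lm.server is listening


class RewriteHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))

        # Translate `think: <bool>` into chat_template_kwargs.enable_thinking.
        if "think" in body:
            body.setdefault("chat_template_kwargs", {})["enable_thinking"] = body.pop("think")

        # Forward the rewritten request to the real server unchanged otherwise.
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
            status = resp.status

        # Relay the upstream response back to the agent.
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Point Hermes Agent's `custom` provider at 127.0.0.1:8081
    # instead of the mlx_lm.server port.
    HTTPServer(("127.0.0.1", 8081), RewriteHandler).serve_forever()
```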
## Example output

System: `You find bugs. Reply with: BUG: <description>, then FIX: <fix>. No code fences, no extra prose.`

User:

```ts
async function processItems(items: string[]) {
  const results = []
  for (const item of items) {
    results.push(fetch(`/api/process/${item}`).then(r => r.json()))
  }
  return await results
}
```

What's wrong?

Output (4.3s, 45 tokens):

```
BUG: `await` is applied to an array of promises instead of using `Promise.all` to wait for all to resolve.
FIX: Replace `await results` with `await Promise.all(results)`.
```

## Other variants

| Repo | bpw | Size | Tradeoff |
|---|---|---|---|
| `Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6` | 3.97 | 12 GB | **Recommended** — fastest and smallest; quality on par with this variant (see perplexity above) |
| **`Tranquil-Flow/Carnice-V2-27b-MLX-4bit`** (this) | 4.50 | 14 GB | Conservative — standard naive affine |
| `Tranquil-Flow/Carnice-V2-27b-MLX-6bit` | 6.50 | 20 GB | Quality tier — closer to BF16 fidelity, ~40% slower |

## Limitations & out-of-scope use

This is a third-party MLX-format quantization of `kai-os/Carnice-V2-27b`. It is not maintained by `kai-os` or the upstream Carnice/Qwen teams. It inherits whatever biases, factual limitations, and safety properties the upstream model has — no additional alignment, safety tuning, or behavioral evaluation was performed during conversion.

- **Apple Silicon only.** MLX is Apple's framework; these weights run on M-series Macs. For other hardware, use the upstream BF16 weights (`kai-os/Carnice-V2-27b`) or a GGUF conversion.
- **Text-only.** The upstream Carnice model is multimodal (`image-text-to-text`); the `mlx_lm.convert` pipeline used here drops the vision encoder. This release supports text input only. For image input, use the upstream BF16 weights with `transformers`.
- **Quantization artifacts.** Naive 4-bit affine quantization (4.50 bpw) introduces representation error vs the BF16 source — see the perplexity table above. The 7-prompt agent benchmark did not surface degradation, but workloads with long context, complex chains of thought, or precision-sensitive numerical reasoning may benefit from a higher-bit variant.
- **Issue scope.** Issues specific to this MLX conversion (loading errors, quantization fidelity, file integrity) belong on this repo. Issues with model behavior (instruction following, factuality, refusal calibration, training-data concerns) are upstream concerns and should be raised on `kai-os/Carnice-V2-27b`.

## Attribution & license

Original model: [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — a Hermes-style SFT of Qwen3.6-27B by `kai-os`, Apache 2.0. This conversion is likewise Apache 2.0. Please credit `kai-os` as the upstream source.

## Citation

If you use this model, please cite the upstream Carnice release:

```bibtex
@misc{carnice_v2_27b_2026,
  author       = {kai-os},
  title        = {Carnice-V2-27b},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kai-os/Carnice-V2-27b}}
}
```

Carnice is itself an SFT of [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B); please also acknowledge the Qwen team's base model where appropriate.

This MLX conversion may be referenced as `Tranquil-Flow/Carnice-V2-27b-MLX-4bit` (Hugging Face), Apache 2.0, no additional restrictions.