---
license: apache-2.0
language:
- en
- zh
library_name: mlx
base_model: kai-os/Carnice-V2-27b
base_model_relation: quantized
pipeline_tag: text-generation
inference: false
tags:
- qwen
- qwen3
- qwen3.6
- carnice
- hermes-agent
- agentic
- sft
- mlx
- apple-silicon
- 4-bit
---

# Carnice-V2-27b — MLX 4-bit (naive affine)

MLX-format quantization of [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — a Hermes-style SFT of Qwen3.6-27B for agentic workloads — converted for Apple Silicon inference.

This is the **conservative choice** of the three published variants: standard 4-bit affine quantization, the most widely tested mlx-lm setting. For better speed and a smaller footprint at slightly higher perplexity, see [`Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6`](https://huggingface.co/Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6).

## Quantization

| | |
|---|---|
| Recipe | Naive 4-bit affine |
| Effective bits/weight | 4.50 |
| Group size | 64 |
| Disk size | ~14 GB (3 shards) |
| Source | [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) (BF16 safetensors) |

Conversion command (mlx-lm 0.31.3):
```bash
mlx_lm.convert \
  --hf-path kai-os/Carnice-V2-27b \
  --mlx-path Carnice-V2-27b-MLX-4bit \
  -q --q-bits 4
```
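
The 4.50 effective bits/weight is just the 4 quantized bits plus per-group metadata. A quick sketch of the arithmetic, assuming MLX's affine scheme stores one fp16 scale and one fp16 bias per group of 64 weights, and taking the nominal 27B parameter count at face value:

```python
# Effective bits/weight for group-wise affine quantization.
# Assumption: one fp16 scale + one fp16 bias per group, as in MLX's
# affine scheme; metadata for any non-quantized layers is ignored.
Q_BITS = 4               # bits stored per weight
GROUP_SIZE = 64          # weights sharing one (scale, bias) pair
OVERHEAD_BITS = 16 + 16  # fp16 scale + fp16 bias per group

effective_bpw = Q_BITS + OVERHEAD_BITS / GROUP_SIZE
print(effective_bpw)  # 4.5

# Rough disk-size sanity check against the ~14 GB in the table.
approx_gib = 27e9 * effective_bpw / 8 / 2**30
print(round(approx_gib, 1))  # 14.1
```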

## Performance — Apple M4 Pro 48 GB, 16 GPU cores

7-prompt agent benchmark suite, `--no-thinking` mode:

| Format | Wall-clock total | Avg tok/s | Output tokens |
|---|---|---|---|
| Carnice Q5_K_M (llama.cpp) | 157.4 s | 9.1 | 1297 |
| **Carnice MLX 4-bit naive (this)** | **91.1 s (-42%)** | **17.3** | **1192** |
| Carnice MLX mixed_3_6 | 77.7 s | 17.0 | 1056 |
| Carnice MLX 6-bit | 108.7 s | 11.0 | 1007 |

**~42% faster wall-clock than the GGUF Q5_K_M** on the same hardware, with ~1.9× the per-token throughput of the llama.cpp baseline.
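
The headline deltas follow directly from the table:

```python
# Recompute the headline deltas from the benchmark table above.
gguf_wall, mlx_wall = 157.4, 91.1  # seconds, Q5_K_M vs this variant
gguf_tps, mlx_tps = 9.1, 17.3      # average tokens/second

wall_clock_saving = (1 - mlx_wall / gguf_wall) * 100
throughput_ratio = mlx_tps / gguf_tps

print(f"{wall_clock_saving:.0f}% faster wall-clock")    # 42% faster wall-clock
print(f"{throughput_ratio:.1f}x per-token throughput")  # 1.9x per-token throughput
```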

### Quality (wikitext-2 perplexity)

| Variant | seq 256 | seq 1024 |
|---|---|---|
| **naive 4-bit (this)** | **4.949 ± 0.092** | **3.985 ± 0.036** |
| mixed_3_6 | 5.147 ± 0.097 | 4.073 ± 0.038 |
| 6-bit | 4.881 ± 0.091 | (not measured) |

Evaluated with `mlx_lm.perplexity --num-samples 64 --batch-size 1 --sequence-length {256,1024}`. The 6-bit variant could not be measured at sequence length 1024 on the M4 Pro 48 GB: its larger memory footprint plus the 1024-token KV cache exceeds the available unified memory. Numbers are comparable across rows within each column; do not compare them to externally reported wikitext-2 perplexities without matching settings.

The relative ordering between 4-bit and mixed_3_6 holds at both context lengths, which suggests the quantization is stable across context length and does not introduce hidden long-range degradation. Lower perplexity at seq 1024 than at seq 256 is expected: more conditioning context yields better next-token prediction.
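
For readers translating these numbers: perplexity is the exponential of the mean per-token negative log-likelihood, which is why it shrinks as conditioning context grows. A toy illustration of the formula (only the relationship, not the evaluation harness):

```python
import math

# Perplexity = exp(mean negative log-likelihood, in nats per token).
def perplexity(nlls):
    return math.exp(sum(nlls) / len(nlls))

# The 4.949 reported at seq 256 corresponds to a mean NLL of
# log(4.949) nats/token; the 3.985 at seq 1024 to log(3.985).
print(round(math.log(4.949), 2))  # 1.6
print(round(math.log(3.985), 2))  # 1.38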

## Usage

### `mlx_lm` (Python)
```python
from mlx_lm import load, generate

model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    enable_thinking=False,  # important for agent-style use
    tokenize=False,
)
print(generate(model, tokenizer, prompt, max_tokens=200))
```

### `mlx_lm.server` (OpenAI-compatible)
```bash
mlx_lm.server --model Tranquil-Flow/Carnice-V2-27b-MLX-4bit \
  --host 127.0.0.1 --port 8080 \
  --temp 0.6 --top-p 0.95 --top-k 20
```

Pass `chat_template_kwargs: {"enable_thinking": false}` in the request body to disable Carnice's thinking block — without it the model produces roughly 2× more tokens.
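
A sketch of what that request body looks like, assuming the server above on port 8080 and the standard OpenAI-compatible `/v1/chat/completions` path:

```python
import json

# Chat-completion request body for mlx_lm.server with Carnice's
# thinking block disabled via chat_template_kwargs.
payload = {
    "model": "Tranquil-Flow/Carnice-V2-27b-MLX-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload).encode()
# POST `body` to http://127.0.0.1:8080/v1/chat/completions with
# Content-Type: application/json (urllib.request, requests, or curl).
print(json.loads(body)["chat_template_kwargs"])  # {'enable_thinking': False}
```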

### Hermes Agent / other agent harnesses

If you're driving this model from an agent harness, make sure the harness propagates `chat_template_kwargs.enable_thinking: false` to `mlx_lm.server`. Without it the model emits a hidden `<think>...</think>` block on every turn — roughly 200 tokens of latency that is invisible to the caller.

Known mismatch with [Hermes Agent](https://github.com/NousResearch/hermes-agent)'s `custom` provider: it sends a top-level `think: false` field instead of the `chat_template_kwargs` form, and `mlx_lm.server` does not interpret it. The simplest workaround is a tiny HTTP proxy that rewrites the field between the agent and the server. Open a discussion if you'd like a reference implementation.
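
The rewrite itself is small; a hypothetical sketch (field names per the mismatch described above — `think` in, `chat_template_kwargs` out):

```python
# Hypothetical rewrite for a proxy sitting between Hermes Agent's `custom`
# provider and mlx_lm.server: translate the top-level `think` field into
# the chat_template_kwargs form the server understands.
def rewrite_body(body: dict) -> dict:
    body = dict(body)  # shallow copy; leave the caller's dict alone
    if "think" in body:
        think = body.pop("think")
        body.setdefault("chat_template_kwargs", {}).setdefault(
            "enable_thinking", think
        )
    return body

rewritten = rewrite_body({"messages": [], "think": False})
print(rewritten)  # {'messages': [], 'chat_template_kwargs': {'enable_thinking': False}}
```

Wrap this in any minimal HTTP proxy (a few lines of `http.server`, or an ASGI middleware) and point the agent's base URL at it.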

## Example output

System: `You find bugs. Reply with: BUG: <one-line description>, then FIX: <one-line patch description>. No code fences, no extra prose.`

User:
```ts
async function processItems(items: string[]) {
  const results = []
  for (const item of items) {
    results.push(fetch(`/api/process/${item}`).then(r => r.json()))
  }
  return await results
}
```
What's wrong?

Output (4.3 s, 45 tokens):
```
BUG: `await` is applied to an array of promises instead of using `Promise.all` to wait for all to resolve.
FIX: Replace `await results` with `await Promise.all(results)`.
```

## Other variants

| Repo | bpw | Size | Tradeoff |
|---|---|---|---|
| `Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6` | 3.97 | 12 GB | **Recommended** — fastest and smallest; slightly higher perplexity than this variant |
| **`Tranquil-Flow/Carnice-V2-27b-MLX-4bit`** (this) | 4.50 | 14 GB | Conservative — standard naive affine |
| `Tranquil-Flow/Carnice-V2-27b-MLX-6bit` | 6.50 | 20 GB | Quality tier — closer to BF16 fidelity, ~40% slower |

## Limitations & out-of-scope use

This is a third-party MLX-format quantization of `kai-os/Carnice-V2-27b`. It is not maintained by `kai-os` or the upstream Carnice/Qwen teams. It inherits whatever biases, factual limitations, and safety properties the upstream model has — no additional alignment, safety tuning, or behavioral evaluation was performed during conversion.

- **Apple Silicon only.** MLX is Apple's framework; these weights run on M-series Macs. For other hardware, use the upstream BF16 weights (`kai-os/Carnice-V2-27b`) or a GGUF conversion.
- **Text-only.** The upstream Carnice model is multimodal (`image-text-to-text`); the `mlx_lm.convert` pipeline used here drops the vision encoder. This release supports text input only. For image input, use the upstream BF16 weights with `transformers`.
- **Quantization artifacts.** Naive 4-bit affine quantization (4.50 bpw) introduces representation error relative to the BF16 source — see the perplexity table above. The 7-prompt agent benchmark did not surface degradation, but workloads with long context, complex chains of thought, or precision-sensitive numerical reasoning may benefit from a higher-bit variant.
- **Issue scope.** Issues specific to this MLX conversion (loading errors, quantization fidelity, file integrity) belong on this repo. Issues with model behavior (instruction following, factuality, refusal calibration, training-data concerns) are upstream concerns and should be raised on `kai-os/Carnice-V2-27b`.

## Attribution & license

Original model: [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — Hermes-style SFT of Qwen3.6-27B by `kai-os`. Apache 2.0.

This conversion: Apache 2.0. Please credit `kai-os` as the upstream source.

## Citation

If you use this model, please cite the upstream Carnice release:

```bibtex
@misc{carnice_v2_27b_2026,
  author       = {kai-os},
  title        = {Carnice-V2-27b},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kai-os/Carnice-V2-27b}}
}
```

Carnice is itself an SFT of [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B); please also acknowledge the Qwen team's base model where appropriate.

This MLX conversion may be referenced as `Tranquil-Flow/Carnice-V2-27b-MLX-4bit` (Hugging Face), Apache 2.0, no additional restrictions.