---
license: apache-2.0
language:
- en
- zh
library_name: mlx
base_model: kai-os/Carnice-V2-27b
base_model_relation: quantized
pipeline_tag: text-generation
inference: false
tags:
- qwen
- qwen3
- qwen3.6
- carnice
- hermes-agent
- agentic
- sft
- mlx
- apple-silicon
- 4-bit
---

# Carnice-V2-27b — MLX 4-bit (naive affine)

MLX-format quantization of [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — a Hermes-style SFT of Qwen3.6-27B for agentic workloads — converted for Apple Silicon inference.

This is the **conservative choice** of the three published variants: standard 4-bit affine quantization, the most widely tested mlx-lm setting. For better speed and a smaller footprint at slightly higher perplexity, see [`Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6`](https://huggingface.co/Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6).

## Quantization

| | |
|---|---|
| Recipe | Naive 4-bit affine |
| Effective bits/weight | 4.50 |
| Group size | 64 |
| Disk size | ~14 GB (3 shards) |
| Source | [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) (BF16 safetensors) |

Conversion command (mlx-lm 0.31.3):
```bash
mlx_lm.convert \
  --hf-path kai-os/Carnice-V2-27b \
  --mlx-path Carnice-V2-27b-MLX-4bit \
  -q --q-bits 4
```
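
The 4.50 effective bits/weight is just the 4 quantized bits plus per-group metadata. A quick sketch of the arithmetic, assuming MLX's affine scheme stores one fp16 scale and one fp16 bias per group of 64 weights, and taking the nominal 27B parameter count at face value:

```python
# Effective bits/weight for group-wise affine quantization.
# Assumption: one fp16 scale + one fp16 bias per group, as in MLX's
# affine scheme; metadata for any non-quantized layers is ignored.
Q_BITS = 4               # bits stored per weight
GROUP_SIZE = 64          # weights sharing one (scale, bias) pair
OVERHEAD_BITS = 16 + 16  # fp16 scale + fp16 bias per group

effective_bpw = Q_BITS + OVERHEAD_BITS / GROUP_SIZE
print(effective_bpw)  # 4.5

# Rough disk-size sanity check against the ~14 GB in the table.
approx_gib = 27e9 * effective_bpw / 8 / 2**30
print(round(approx_gib, 1))  # 14.1
```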

## Performance — Apple M4 Pro 48 GB, 16 GPU cores

7-prompt agent benchmark suite, `--no-thinking` mode:

| Format | Wall-clock total | Avg tok/s | Output tokens |
|---|---|---|---|
| Carnice Q5_K_M (llama.cpp) | 157.4 s | 9.1 | 1297 |
| **Carnice MLX 4-bit naive (this)** | **91.1 s (-42%)** | **17.3** | **1192** |
| Carnice MLX mixed_3_6 | 77.7 s | 17.0 | 1056 |
| Carnice MLX 6-bit | 108.7 s | 11.0 | 1007 |

**~42% faster wall-clock than the GGUF Q5_K_M** on the same hardware, with ~1.9× the per-token throughput of the llama.cpp baseline.
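
The headline deltas follow directly from the table:

```python
# Recompute the headline deltas from the benchmark table above.
gguf_wall, mlx_wall = 157.4, 91.1  # seconds, Q5_K_M vs this variant
gguf_tps, mlx_tps = 9.1, 17.3      # average tokens/second

wall_clock_saving = (1 - mlx_wall / gguf_wall) * 100
throughput_ratio = mlx_tps / gguf_tps

print(f"{wall_clock_saving:.0f}% faster wall-clock")    # 42% faster wall-clock
print(f"{throughput_ratio:.1f}x per-token throughput")  # 1.9x per-token throughput
```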

### Quality (wikitext-2 perplexity)

| Variant | seq 256 | seq 1024 |
|---|---|---|
| **naive 4-bit (this)** | **4.949 ± 0.092** | **3.985 ± 0.036** |
| mixed_3_6 | 5.147 ± 0.097 | 4.073 ± 0.038 |
| 6-bit | 4.881 ± 0.091 | (not measured) |

Evaluated with `mlx_lm.perplexity --num-samples 64 --batch-size 1 --sequence-length {256,1024}`. The 6-bit variant could not be measured at sequence length 1024 on the M4 Pro 48 GB: its larger memory footprint plus the 1024-token KV cache exceeds the available unified memory. Numbers are comparable across rows within each column; do not compare them to externally reported wikitext-2 perplexities without matching settings.

The relative ordering between 4-bit and mixed_3_6 holds at both context lengths, which suggests the quantization is stable across context length and does not introduce hidden long-range degradation. Lower perplexity at seq 1024 than at seq 256 is expected: more conditioning context yields better next-token prediction.
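
For readers translating these numbers: perplexity is the exponential of the mean per-token negative log-likelihood, which is why it shrinks as conditioning context grows. A toy illustration of the formula (only the relationship, not the evaluation harness):

```python
import math

# Perplexity = exp(mean negative log-likelihood, in nats per token).
def perplexity(nlls):
    return math.exp(sum(nlls) / len(nlls))

# The 4.949 reported at seq 256 corresponds to a mean NLL of
# log(4.949) nats/token; the 3.985 at seq 1024 to log(3.985).
print(round(math.log(4.949), 2))  # 1.6
print(round(math.log(3.985), 2))  # 1.38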

## Usage

### `mlx_lm` (Python)
```python
from mlx_lm import load, generate

model, tokenizer = load("Tranquil-Flow/Carnice-V2-27b-MLX-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    enable_thinking=False,  # important for agent-style use
    tokenize=False,
)
print(generate(model, tokenizer, prompt, max_tokens=200))
```

### `mlx_lm.server` (OpenAI-compatible)
```bash
mlx_lm.server --model Tranquil-Flow/Carnice-V2-27b-MLX-4bit \
  --host 127.0.0.1 --port 8080 \
  --temp 0.6 --top-p 0.95 --top-k 20
```

Pass `chat_template_kwargs: {"enable_thinking": false}` in the request body to disable Carnice's thinking block — without it the model produces roughly 2× more tokens.
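
A sketch of what that request body looks like, assuming the server above on port 8080 and the standard OpenAI-compatible `/v1/chat/completions` path:

```python
import json

# Chat-completion request body for mlx_lm.server with Carnice's
# thinking block disabled via chat_template_kwargs.
payload = {
    "model": "Tranquil-Flow/Carnice-V2-27b-MLX-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload).encode()
# POST `body` to http://127.0.0.1:8080/v1/chat/completions with
# Content-Type: application/json (urllib.request, requests, or curl).
print(json.loads(body)["chat_template_kwargs"])  # {'enable_thinking': False}
```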

### Hermes Agent / other agent harnesses

If you're driving this model from an agent harness, make sure the harness propagates `chat_template_kwargs.enable_thinking: false` to `mlx_lm.server`. Without it the model emits a hidden `<think>...</think>` block on every turn — roughly 200 tokens of latency that is invisible to the caller.

Known mismatch with [Hermes Agent](https://github.com/NousResearch/hermes-agent)'s `custom` provider: it sends a top-level `think: false` field instead of the `chat_template_kwargs` form, and `mlx_lm.server` does not interpret it. The simplest workaround is a tiny HTTP proxy that rewrites the field between the agent and the server. Open a discussion if you'd like a reference implementation.
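
The rewrite itself is small; a hypothetical sketch (field names per the mismatch described above — `think` in, `chat_template_kwargs` out):

```python
# Hypothetical rewrite for a proxy sitting between Hermes Agent's `custom`
# provider and mlx_lm.server: translate the top-level `think` field into
# the chat_template_kwargs form the server understands.
def rewrite_body(body: dict) -> dict:
    body = dict(body)  # shallow copy; leave the caller's dict alone
    if "think" in body:
        think = body.pop("think")
        body.setdefault("chat_template_kwargs", {}).setdefault(
            "enable_thinking", think
        )
    return body

rewritten = rewrite_body({"messages": [], "think": False})
print(rewritten)  # {'messages': [], 'chat_template_kwargs': {'enable_thinking': False}}
```

Wrap this in any minimal HTTP proxy (a few lines of `http.server`, or an ASGI middleware) and point the agent's base URL at it.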

## Example output

System: `You find bugs. Reply with: BUG: <one-line description>, then FIX: <one-line patch description>. No code fences, no extra prose.`

User:
```ts
async function processItems(items: string[]) {
  const results = []
  for (const item of items) {
    results.push(fetch(`/api/process/${item}`).then(r => r.json()))
  }
  return await results
}
```
What's wrong?

Output (4.3 s, 45 tokens):
```
BUG: `await` is applied to an array of promises instead of using `Promise.all` to wait for all to resolve.
FIX: Replace `await results` with `await Promise.all(results)`.
```

## Other variants

| Repo | bpw | Size | Tradeoff |
|---|---|---|---|
| `Tranquil-Flow/Carnice-V2-27b-MLX-mixed_3_6` | 3.97 | 12 GB | **Recommended** — fastest and smallest; slightly higher perplexity than this variant |
| **`Tranquil-Flow/Carnice-V2-27b-MLX-4bit`** (this) | 4.50 | 14 GB | Conservative — standard naive affine |
| `Tranquil-Flow/Carnice-V2-27b-MLX-6bit` | 6.50 | 20 GB | Quality tier — closer to BF16 fidelity, ~40% slower |

## Limitations & out-of-scope use

This is a third-party MLX-format quantization of `kai-os/Carnice-V2-27b`. It is not maintained by `kai-os` or the upstream Carnice/Qwen teams. It inherits whatever biases, factual limitations, and safety properties the upstream model has — no additional alignment, safety tuning, or behavioral evaluation was performed during conversion.

- **Apple Silicon only.** MLX is Apple's framework; these weights run on M-series Macs. For other hardware, use the upstream BF16 weights (`kai-os/Carnice-V2-27b`) or a GGUF conversion.
- **Text-only.** The upstream Carnice model is multimodal (`image-text-to-text`); the `mlx_lm.convert` pipeline used here drops the vision encoder. This release supports text input only. For image input, use the upstream BF16 weights with `transformers`.
- **Quantization artifacts.** Naive 4-bit affine quantization (4.50 bpw) introduces representation error relative to the BF16 source — see the perplexity table above. The 7-prompt agent benchmark did not surface degradation, but workloads with long context, complex chains of thought, or precision-sensitive numerical reasoning may benefit from a higher-bit variant.
- **Issue scope.** Issues specific to this MLX conversion (loading errors, quantization fidelity, file integrity) belong on this repo. Issues with model behavior (instruction following, factuality, refusal calibration, training-data concerns) are upstream concerns and should be raised on `kai-os/Carnice-V2-27b`.

## Attribution & license

Original model: [`kai-os/Carnice-V2-27b`](https://huggingface.co/kai-os/Carnice-V2-27b) — Hermes-style SFT of Qwen3.6-27B by `kai-os`. Apache 2.0.

This conversion: Apache 2.0. Please credit `kai-os` as the upstream source.

## Citation

If you use this model, please cite the upstream Carnice release:

```bibtex
@misc{carnice_v2_27b_2026,
  author       = {kai-os},
  title        = {Carnice-V2-27b},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kai-os/Carnice-V2-27b}}
}
```

Carnice is itself an SFT of [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B); please also acknowledge the Qwen team's base model where appropriate.

This MLX conversion may be referenced as `Tranquil-Flow/Carnice-V2-27b-MLX-4bit` (Hugging Face), Apache 2.0, no additional restrictions.