Huihui4-48B-A4B vMLX MXFP4
MLX MXFP4 (microscaling 4-bit) build of huihui-ai/Huihui4-48B-A4B-abliterated for Apple Silicon — Gemma 4 architecture, abliterated, 48B-parameter MoE with ~4B active per token, full multimodal (image + text).
Smallest variant — ideal for serving on 32 GB / 64 GB Macs, with the fastest cold-load time in the family.
Built end-to-end with our mlx-vlm editable fork (LibraxisAI delta on top of upstream Blaizzy/mlx-vlm), which carries Fix 1 (progress visibility during lazy weight materialization), Fix 3 (per-shard eval and cache release during save), and a parity-aligned converter for the Gemma 4 multi-bank audio extractor and dual-resolution image processor.
TL;DR
| Property | Value |
|---|---|
| Base model | huihui-ai/Huihui4-48B-A4B-abliterated (Gemma 4, abliterated) |
| Architecture | Gemma4ForConditionalGeneration MoE, 256 experts/layer, 8 active per token |
| Total parameters | ~48 B |
| Activated parameters | ~4 B per token |
| Quantization | MXFP4 (microscaling FP4 + block scale) |
| Bits / weight | ~4.4 |
| Size on disk | 25 GB |
| Cold load (M3 Ultra) | ~29 s |
| TTFT (text) | ~0.3 s |
| Modalities | text in / text out, image in (JPEG/PNG), audio-aware tokenizer |
Why this build
MXFP4 is the mainstream serving target for this model family on Apple Silicon. At ~4.4 bits per weight you get:
- The smallest disk and RAM footprint of the family (25 GB on disk, fits comfortably in 32 GB unified memory with overhead room for KV cache and image features).
- The fastest cold load (~29 s) — practical for environments where models cycle in and out of cache.
- Strong throughput — for short prompts, this build emitted the highest output character count in our matrix (60516 chars on the canonical Polish probe in 0.3 s TTFT, vs. 8342 chars for the fp16 baseline at 0.5 s TTFT).
The tradeoff vs. mxfp8 is mild quality compression — well-suited for chat and multimodal Q&A, marginally less crisp on long structured generations. For evaluation, see the fp16 parity baseline.
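A quick back-of-envelope check shows where the ~4.4 bits/weight and 25 GB figures land; the block size and scale width in this sketch are assumptions, not values taken from the conversion logs:

```python
# Back-of-envelope for the ~4.4 bits/weight and 25 GB figures. The block size and
# scale width below are assumptions, and a few tensors stay in higher precision,
# which is roughly what the extra overhead covers.
fp4_bits, scale_bits, block_size = 4, 8, 32
core = fp4_bits + scale_bits / block_size      # 4.25 bits/weight before overheads
effective = 4.4                                # figure reported in the TL;DR table
approx_gib = 48e9 * effective / 8 / 2**30      # ~24.6 GiB, close to the 25 GB on disk
print(f"core={core:.2f} b/w, effective={effective} b/w, size ~{approx_gib:.1f} GiB")
```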
Model details
| Property | Value |
|---|---|
| Format | MLX, sharded safetensors |
| Quantization config | MXFP4 (FP4 mantissa + block scale) |
| Tokenizer | Inherited from base, chat_template.jinja included |
| Special tokens | Image and audio marker tokens inherited from the base tokenizer |
| Image processor | Dual-resolution Gemma 4 (low + hi-res patches) |
| Audio extractor | Multi-bank mel filter (128 mel × 257 freq) |
| License | Apache 2.0 (inherited from huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated) |
Runtime compatibility
This quantized MLX build includes the Gemma 4 vision projection compatibility tensor embed_vision.embedding_projection.biases, so current MLX loaders that require the quantized projection bias can load the checkpoint cleanly. The MXFP8 variant was smoke-tested in LM Studio, and MXFP4/MXFP8/NVFP4 were patched with the same compatibility pattern.
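If a loader still complains about the projection bias, you can confirm the tensor shipped with your local snapshot by looking it up in the sharded-safetensors index. A minimal sketch, assuming the standard model.safetensors.index.json layout and a hypothetical local path:

```python
import json
from pathlib import Path

# Quick sanity check: confirm the compatibility tensor is present in a downloaded
# snapshot before pointing a loader at it. Assumes the standard sharded-safetensors
# index file; the snapshot path is hypothetical, adjust it to your download location.
snapshot = Path("~/models/Huihui4-48B-A4B-vmlx-mxfp4").expanduser()
weight_map = json.loads((snapshot / "model.safetensors.index.json").read_text())["weight_map"]
key = "embed_vision.embedding_projection.biases"
print(f"{key}: {'present' if key in weight_map else 'missing'}")
```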
The default generation_config.json is tuned conservatively for chat stability (temperature=0.7, top_p=0.9, top_k=40, min_p=0.05, repetition_penalty=1.18) to reduce phrase-looping in GUI runtimes such as LM Studio.
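To inspect or override those defaults locally, editing generation_config.json in your downloaded snapshot is enough. A minimal standard-library sketch; the snapshot path is hypothetical:

```python
import json
from pathlib import Path

# Inspect or override the shipped sampling defaults in a local copy of the
# checkpoint. Standard library only; the snapshot path is hypothetical.
cfg_path = Path("~/models/Huihui4-48B-A4B-vmlx-mxfp4/generation_config.json").expanduser()
cfg = json.loads(cfg_path.read_text())
print({k: cfg.get(k) for k in ("temperature", "top_p", "top_k", "min_p", "repetition_penalty")})

# Example tweak: loosen the repetition penalty if long structured output feels clipped.
cfg["repetition_penalty"] = 1.1
cfg_path.write_text(json.dumps(cfg, indent=2))
```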
Other variants
| Variant | Bits/weight | Size on disk | Cold load | When to use |
|---|---|---|---|---|
| Huihui4-48B-A4B-vmlx-fp16 | 16 | 91 GB | ~99 s | parity baseline, golden eval |
| Huihui4-48B-A4B-vmlx-mxfp8 | ~8.5 | 47 GB | ~55 s | balanced production target |
| Huihui4-48B-A4B-vmlx-mxfp4 (this) | ~4.4 | 25 GB | ~29 s | mainstream, 32 GB Macs |
| Huihui4-48B-A4B-vmlx-nvfp4 | ~4.4 | 27 GB | ~31 s | NVIDIA-style FP4 sleeper, higher quality at same footprint |
Usage
mlx-vlm CLI
pip install mlx-vlm
python -m mlx_vlm.generate \
--model LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4 \
--image path/to/image.jpg \
--prompt "Describe what you see in detail." \
--max-tokens 1024
mlx-vlm Python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4")
config = model.config
messages = [{"role": "user", "content": "Hello! Tell me about yourself."}]
prompt = apply_chat_template(processor, config, messages)
output = generate(model, processor, prompt, max_tokens=512)
print(output)
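For image input through the same Python API, the sketch below follows the pattern above; the num_images and image= keywords reflect current upstream mlx-vlm and may differ slightly between versions:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4")
config = model.config

# One image plus a text question; num_images tells the chat template how many
# image placeholders to insert (keyword names may vary across mlx-vlm versions).
prompt = apply_chat_template(processor, config, "What is shown in this photo?", num_images=1)
output = generate(model, processor, prompt, image=["path/to/image.jpg"], max_tokens=512)
print(output)
```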
mlx-batch-runner (Responses API, streaming)
curl -X POST http://127.0.0.1:10240/v1/models/load \
-H "Content-Type: application/json" \
-d '{"model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4", "task": "llm"}'
curl -N -X POST http://127.0.0.1:10240/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4",
"stream": true,
"input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}]
}'
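The same streaming request can be consumed from Python. The sketch below assumes the requests package and simply prints each SSE data: payload; adapt the parsing to whichever Responses API event types your runtime emits:

```python
import requests  # assumed available; any HTTP client with streaming support works

# Stream the /v1/responses endpoint shown above and print raw SSE payloads.
payload = {
    "model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4",
    "stream": True,
    "input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}],
}
with requests.post("http://127.0.0.1:10240/v1/responses", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line.removeprefix("data:").strip())
```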
Our local validation/runtime path for this upload workflow was ../mlx-batch-runner, which ships full /v1/responses SSE with model cache TTL (MODEL_CACHE_TTL=600 default) and pin support (PINNED_MODELS=...).
For the public LibraxisAI server project, see mlx-batch-server: an Apple Silicon MLX inference server with batch processing, OpenAI-compatible /v1/responses, dynamic model load/unload, streaming, and VLM support.
Validation
End-to-end pipeline test 2026-04-22 (load → text simple → text canonical → vision JPEG → unload):
| Probe | TTFT | Output chars | Notes |
|---|---|---|---|
| Text — simple greeting (PL) | 0.6 s | 13254 | most prolific in the family |
| Text — canonical (PL, literary) | 0.3 s | 60516 | exhaustive multi-paragraph response |
| Vision — JPEG (Monument Valley) | 4.9 s | 980 | accurate scene description |
Channel parsing: has_reasoning=False on every probe — the Huihui4 family emits content exclusively on the output channel, matching OpenAI Responses API expectations cleanly.
Limitations and safety
Abliteration disclosure. This model derives from huihui-ai/Huihui4-48B-A4B-abliterated, which has had its safety alignment layers (refusal mechanisms and attention routing) removed. The underlying knowledge from pretraining is intact, but the model will not refuse queries it would normally decline. Do not deploy without an external safety layer if your context requires content moderation. The base model card's disclosures apply here.
- Multimodal: tested on still images (JPEG/PNG). Video is supported by the upstream Gemma 4 processor (Gemma4VideoProcessor, 32-frame uniform sampling) but not yet covered in our published validation matrix.
- Audio: tokenizer-side audio markers are present, but no audio-input validation has been published yet.
- Like all 4-bit quantized MoE models on Apple Silicon, expect occasional cosmetic artifacts (trailing special tokens) on very long generations.
License
Apache 2.0 — inherited via huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated. See the LICENSE link and the underlying Google Gemma terms.
Acknowledgements
- huihui-ai — abliteration of the Gemma-4-26B-A4B-it base, MoE expansion to 48B-A4B, original distillation.
- TeichAI — gemma-4-26B-A4B-it-Claude-Opus-Distill co-base.
- Google DeepMind — Gemma 4 architecture and pretraining.
- Apple MLX team — MLX framework, quantization primitives.
- Blaizzy/mlx-vlm — upstream multimodal MLX runtime; this build uses our editable LibraxisAI delta, which we are upstreaming as separate PRs.
𝚅𝚒𝚋𝚎𝚌𝚛𝚊𝚏𝚝𝚎𝚍 with AI Agents by VetCoders © 2024-2026 The LibraxisAI Team