Huihui4-48B-A4B vMLX MXFP4

MLX MXFP4 (microscaling 4-bit) build of huihui-ai/Huihui4-48B-A4B-abliterated for Apple Silicon — Gemma 4 architecture, abliterated, a 48B-parameter MoE with ~4B active parameters per token, full multimodal (image + text).

Smallest variant — ideal for serving on 32 GB / 64 GB Macs, with the fastest cold-load time in the family.

Built end-to-end with our editable mlx-vlm fork (LibraxisAI delta on top of upstream Blaizzy/mlx-vlm): Fix 1 (progress visibility during lazy weight materialization), Fix 3 (per-shard eval + cache release during save), and a parity-aligned converter for the Gemma 4 multi-bank audio + dual-image processor.

TL;DR

| Property | Value |
|---|---|
| Base model | huihui-ai/Huihui4-48B-A4B-abliterated (Gemma 4, abliterated) |
| Architecture | Gemma4ForConditionalGeneration MoE, 256 experts/layer, 8 active per token |
| Total parameters | ~48 B |
| Activated parameters | ~4 B per token |
| Quantization | MXFP4 (microscaling FP4 + block scale) |
| Bits / weight | ~4.4 |
| Size on disk | 25 GB |
| Cold load (M3 Ultra) | ~29 s |
| TTFT (text) | ~0.3 s |
| Modalities | text in / text out, image in (JPEG/PNG), audio-aware tokenizer |

Why this build

MXFP4 is the mainstream serving target for this model family on Apple Silicon. At ~4.4 bits per weight you get:

  • The smallest disk and RAM footprint of the family (25 GB on disk, fits comfortably in 32 GB unified memory with overhead room for KV cache and image features).
  • The fastest cold load (~29 s) — practical for environments where models cycle in and out of cache.
  • Strong throughput — for short prompts, this build emitted the highest output character count in our matrix (60,516 chars on the canonical Polish probe at 0.3 s TTFT, vs. 8,342 chars for the fp16 baseline at 0.5 s TTFT).

The tradeoff vs. mxfp8 is a mild quality hit: well-suited for chat and multimodal Q&A, marginally less crisp on long structured generations. For evaluation, see the fp16 parity baseline.

Model details

| Property | Value |
|---|---|
| Format | MLX, sharded safetensors |
| Quantization config | MXFP4 (FP4 mantissa + block scale) |
| Tokenizer | Inherited from base, chat_template.jinja included |
| Special tokens | `< |
| Image processor | Dual-resolution Gemma 4 (low + hi-res patches) |
| Audio extractor | Multi-bank mel filter (128 mel × 257 freq) |
| License | Apache 2.0 (inherited from huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated) |

Runtime compatibility

This quantized MLX build includes the Gemma 4 vision projection compatibility tensor embed_vision.embedding_projection.biases, so current MLX loaders that require the quantized projection bias can load the checkpoint cleanly. The MXFP8 variant was smoke-tested in LM Studio, and MXFP4/MXFP8/NVFP4 were patched with the same compatibility pattern.
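If you want to confirm that a converted checkpoint carries the compatibility tensor before pointing a loader at it, a minimal sketch (the index filename follows the standard sharded-safetensors convention; the helper names are ours, not part of mlx-vlm):

```python
import json
from pathlib import Path

# Key name taken from this model card.
COMPAT_KEY = "embed_vision.embedding_projection.biases"

def has_compat_tensor(weight_keys):
    """Return True if the quantized vision-projection bias is among the weight keys."""
    return COMPAT_KEY in set(weight_keys)

def check_checkpoint(model_dir):
    """Look the key up in the sharded-safetensors index file of a local checkpoint."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    return has_compat_tensor(index["weight_map"].keys())
```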

The default generation_config.json is tuned conservatively for chat stability (temperature=0.7, top_p=0.9, top_k=40, min_p=0.05, repetition_penalty=1.18) to reduce phrase-looping in GUI runtimes such as LM Studio.
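For reference, the chat-stability defaults above correspond to a generation_config.json along these lines (field names follow the standard Hugging Face generation-config schema; treat this as a sketch of the shipped values, not a verbatim copy of the file):

```json
{
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "min_p": 0.05,
  "repetition_penalty": 1.18
}
```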

Other variants

| Variant | Bits/weight | Size on disk | Cold load | When to use |
|---|---|---|---|---|
| Huihui4-48B-A4B-vmlx-fp16 | 16 | 91 GB | ~99 s | parity baseline, golden eval |
| Huihui4-48B-A4B-vmlx-mxfp8 | ~8.5 | 47 GB | ~55 s | balanced production target |
| Huihui4-48B-A4B-vmlx-mxfp4 (this) | ~4.4 | 25 GB | ~29 s | mainstream, 32 GB Macs |
| Huihui4-48B-A4B-vmlx-nvfp4 | ~4.4 | 27 GB | ~31 s | NVIDIA-style FP4 sleeper, higher quality at same footprint |

Usage

mlx-vlm CLI

```shell
pip install mlx-vlm

python -m mlx_vlm.generate \
  --model LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4 \
  --image path/to/image.jpg \
  --prompt "Describe what you see in detail." \
  --max-tokens 1024
```

mlx-vlm Python

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4")
config = model.config

messages = [{"role": "user", "content": "Hello! Tell me about yourself."}]
prompt = apply_chat_template(processor, config, messages)
output = generate(model, processor, prompt, max_tokens=512)
print(output)
```

mlx-batch-runner (Responses API, streaming)

```shell
curl -X POST http://127.0.0.1:10240/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4", "task": "llm"}'

curl -N -X POST http://127.0.0.1:10240/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4",
    "stream": true,
    "input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}]
  }'
```

Our local validation/runtime path for this upload workflow was ../mlx-batch-runner, which ships a full /v1/responses SSE implementation with a model-cache TTL (MODEL_CACHE_TTL=600 by default) and model pinning (PINNED_MODELS=...).

For the public LibraxisAI server project, see mlx-batch-server: an Apple Silicon MLX inference server with batch processing, OpenAI-compatible /v1/responses, dynamic model load/unload, streaming, and VLM support.

Validation

End-to-end pipeline test 2026-04-22 (load → text simple → text canonical → vision JPEG → unload):

| Probe | TTFT | Output chars | Notes |
|---|---|---|---|
| Text — simple greeting (PL) | 0.6 s | 13,254 | most prolific in the family |
| Text — canonical (PL, literary) | 0.3 s | 60,516 | exhaustive multi-paragraph response |
| Vision — JPEG (Monument Valley) | 4.9 s | 980 | accurate scene description |

Channel parsing: has_reasoning=False on every probe — the Huihui4 family emits content exclusively on the output channel, which maps cleanly onto OpenAI Responses API expectations.
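In practice this means a Responses API consumer never needs a branch for a reasoning channel with this family. A minimal sketch of the per-probe check (the item shape mirrors OpenAI-style response items; the helper name is ours):

```python
def has_reasoning(items):
    """True if any response item arrived on a reasoning channel."""
    return any(item.get("type") == "reasoning" for item in items)

# Shape of every Huihui4 probe we observed: content only on the output channel.
probe_items = [
    {"type": "message", "content": [{"type": "output_text", "text": "Dzień dobry!"}]},
]
```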

Limitations and safety

Abliteration disclosure. This model derives from huihui-ai/Huihui4-48B-A4B-abliterated, which has had its safety alignment layers (refusal mechanisms and attention routing) removed. The underlying knowledge from pretraining is intact, but the model will not refuse queries it would normally decline. Do not deploy without an external safety layer if your context requires content moderation. The base model card's disclosures apply here.

  • Multimodal: tested on still images (JPEG/PNG). Video is supported by the upstream Gemma 4 processor (Gemma4VideoProcessor, 32-frame uniform sampling) but not yet covered in our published validation matrix.
  • Audio: tokenizer-side audio markers are present, but no audio-input validation has been published yet.
  • Like all 4-bit quantized MoE models on Apple Silicon, expect occasional cosmetic artifacts (trailing special tokens) on very long generations.
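The trailing-special-token artifact is easy to post-process away. A minimal sketch (the token names here are illustrative defaults; take the real list from the tokenizer's special-tokens map):

```python
def strip_trailing_special(text, special_tokens=("<end_of_turn>", "<eos>")):
    """Trim cosmetic trailing special tokens from very long generations."""
    changed = True
    while changed:
        changed = False
        stripped = text.rstrip()
        for tok in special_tokens:
            if stripped.endswith(tok):
                # Drop one trailing token, then rescan in case several are stacked.
                text = stripped[: -len(tok)]
                changed = True
                break
    return text.rstrip()
```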

License

Apache 2.0 — inherited via huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated. See LICENSE link and the underlying Google Gemma terms.

Acknowledgements

  • huihui-ai — abliteration of the Gemma-4-26B-A4B-it base, MoE expansion to 48B-A4B, original distillation.
  • TeichAI — gemma-4-26B-A4B-it-Claude-Opus-Distill co-base.
  • Google DeepMind — Gemma 4 architecture and pretraining.
  • Apple MLX team — MLX framework, quantization primitives.
  • Blaizzy/mlx-vlm — upstream multimodal MLX runtime; this build uses our editable LibraxisAI delta which we are upstreaming as separate PRs.

𝚅𝚒𝚋𝚎𝚌𝚛𝚊𝚏𝚝𝚎𝚍 with AI Agents by VetCoders © 2024–2026 The LibraxisAI Team
