Huihui4-48B-A4B vMLX MXFP4
MLX MXFP4 (microscaling 4-bit) build of huihui-ai/Huihui4-48B-A4B-abliterated for Apple Silicon — Gemma 4 architecture, abliterated, 48B-parameter MoE with ~4B active per token, full multimodal (image + text).
Smallest variant — ideal for serving on 32 GB / 64 GB Macs, with the fastest cold-load time in the family.
Built end-to-end with our mlx-vlm editable fork (LibraxisAI delta on top of upstream Blaizzy/mlx-vlm), which carries Fix 1 (progress visibility during lazy weight materialization), Fix 3 (per-shard eval and cache release during save), and a parity-aligned converter for the Gemma 4 multi-bank audio extractor and dual-resolution image processor.
TL;DR
| Property | Value |
|---|---|
| Base model | huihui-ai/Huihui4-48B-A4B-abliterated (Gemma 4, abliterated) |
| Architecture | Gemma4ForConditionalGeneration MoE, 256 experts/layer, 8 active per token |
| Total parameters | ~48 B |
| Activated parameters | ~4 B per token |
| Quantization | MXFP4 (microscaling FP4 + block scale) |
| Bits / weight | ~4.4 |
| Size on disk | 25 GB |
| Cold load (M3 Ultra) | ~29 s |
| TTFT (text) | ~0.3 s |
| Modalities | text in / text out, image in (JPEG/PNG), audio-aware tokenizer |
Why this build
MXFP4 is the mainstream serving target for this model family on Apple Silicon. At ~4.4 bits per weight you get:
- The smallest disk and RAM footprint of the family (25 GB on disk, fits comfortably in 32 GB unified memory with overhead room for KV cache and image features).
- The fastest cold load (~29 s) — practical for environments where models cycle in and out of cache.
- Strong throughput — for short prompts, this build emitted the highest output character count in our matrix (60516 chars on the canonical Polish probe in 0.3 s TTFT, vs. 8342 chars for the fp16 baseline at 0.5 s TTFT).
The tradeoff vs. mxfp8 is mild quality compression — well-suited for chat and multimodal Q&A, marginally less crisp on long structured generations. For evaluation, see the fp16 parity baseline.
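A quick back-of-envelope check shows where the ~4.4 bits/weight and 25 GB figures land; the block size and scale width in this sketch are assumptions, not values taken from the conversion logs:

```python
# Back-of-envelope for the ~4.4 bits/weight and 25 GB figures. The block size and
# scale width below are assumptions, and a few tensors stay in higher precision,
# which is roughly what the extra overhead covers.
fp4_bits, scale_bits, block_size = 4, 8, 32
core = fp4_bits + scale_bits / block_size      # 4.25 bits/weight before overheads
effective = 4.4                                # figure reported in the TL;DR table
approx_gib = 48e9 * effective / 8 / 2**30      # ~24.6 GiB, close to the 25 GB on disk
print(f"core={core:.2f} b/w, effective={effective} b/w, size ~{approx_gib:.1f} GiB")
```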
Model details
| Property | Value |
|---|---|
| Format | MLX, sharded safetensors |
| Quantization config | MXFP4 (FP4 mantissa + block scale) |
| Tokenizer | Inherited from base, chat_template.jinja included |
| Special tokens | Image and audio marker tokens inherited from the base tokenizer |
| Image processor | Dual-resolution Gemma 4 (low + hi-res patches) |
| Audio extractor | Multi-bank mel filter (128 mel × 257 freq) |
| License | Apache 2.0 (inherited from huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated) |
Runtime compatibility
This quantized MLX build includes the Gemma 4 vision projection compatibility tensor embed_vision.embedding_projection.biases, so current MLX loaders that require the quantized projection bias can load the checkpoint cleanly. The MXFP8 variant was smoke-tested in LM Studio, and MXFP4/MXFP8/NVFP4 were patched with the same compatibility pattern.
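If a loader still complains about the projection bias, you can confirm the tensor shipped with your local snapshot by looking it up in the sharded-safetensors index. A minimal sketch, assuming the standard model.safetensors.index.json layout and a hypothetical local path:

```python
import json
from pathlib import Path

# Quick sanity check: confirm the compatibility tensor is present in a downloaded
# snapshot before pointing a loader at it. Assumes the standard sharded-safetensors
# index file; the snapshot path is hypothetical, adjust it to your download location.
snapshot = Path("~/models/Huihui4-48B-A4B-vmlx-mxfp4").expanduser()
weight_map = json.loads((snapshot / "model.safetensors.index.json").read_text())["weight_map"]
key = "embed_vision.embedding_projection.biases"
print(f"{key}: {'present' if key in weight_map else 'missing'}")
```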
The default generation_config.json is tuned conservatively for chat stability (temperature=0.7, top_p=0.9, top_k=40, min_p=0.05, repetition_penalty=1.18) to reduce phrase-looping in GUI runtimes such as LM Studio.
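To inspect or override those defaults locally, editing generation_config.json in your downloaded snapshot is enough. A minimal standard-library sketch; the snapshot path is hypothetical:

```python
import json
from pathlib import Path

# Inspect or override the shipped sampling defaults in a local copy of the
# checkpoint. Standard library only; the snapshot path is hypothetical.
cfg_path = Path("~/models/Huihui4-48B-A4B-vmlx-mxfp4/generation_config.json").expanduser()
cfg = json.loads(cfg_path.read_text())
print({k: cfg.get(k) for k in ("temperature", "top_p", "top_k", "min_p", "repetition_penalty")})

# Example tweak: loosen the repetition penalty if long structured output feels clipped.
cfg["repetition_penalty"] = 1.1
cfg_path.write_text(json.dumps(cfg, indent=2))
```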
Other variants
| Variant | Bits/weight | Size on disk | Cold load | When to use |
|---|---|---|---|---|
| Huihui4-48B-A4B-vmlx-fp16 | 16 | 91 GB | ~99 s | parity baseline, golden eval |
| Huihui4-48B-A4B-vmlx-mxfp8 | ~8.5 | 47 GB | ~55 s | balanced production target |
| Huihui4-48B-A4B-vmlx-mxfp4 (this) | ~4.4 | 25 GB | ~29 s | mainstream, 32 GB Macs |
| Huihui4-48B-A4B-vmlx-nvfp4 | ~4.4 | 27 GB | ~31 s | NVIDIA-style FP4 sleeper, higher quality at same footprint |
Usage
mlx-vlm CLI
pip install mlx-vlm
python -m mlx_vlm.generate \
--model LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4 \
--image path/to/image.jpg \
--prompt "Describe what you see in detail." \
--max-tokens 1024
mlx-vlm Python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4")
config = model.config
messages = [{"role": "user", "content": "Hello! Tell me about yourself."}]
prompt = apply_chat_template(processor, config, messages)
output = generate(model, processor, prompt, max_tokens=512)
print(output)
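For image input through the same Python API, the sketch below follows the pattern above; the num_images and image= keywords reflect current upstream mlx-vlm and may differ slightly between versions:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4")
config = model.config

# One image plus a text question; num_images tells the chat template how many
# image placeholders to insert (keyword names may vary across mlx-vlm versions).
prompt = apply_chat_template(processor, config, "What is shown in this photo?", num_images=1)
output = generate(model, processor, prompt, image=["path/to/image.jpg"], max_tokens=512)
print(output)
```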
mlx-batch-runner (Responses API, streaming)
curl -X POST http://127.0.0.1:10240/v1/models/load \
-H "Content-Type: application/json" \
-d '{"model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4", "task": "llm"}'
curl -N -X POST http://127.0.0.1:10240/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4",
"stream": true,
"input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}]
}'
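The same streaming request can be consumed from Python. The sketch below assumes the requests package and simply prints each SSE data: payload; adapt the parsing to whichever Responses API event types your runtime emits:

```python
import requests  # assumed available; any HTTP client with streaming support works

# Stream the /v1/responses endpoint shown above and print raw SSE payloads.
payload = {
    "model": "LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4",
    "stream": True,
    "input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}],
}
with requests.post("http://127.0.0.1:10240/v1/responses", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line.removeprefix("data:").strip())
```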
Our local validation/runtime path for this upload workflow was ../mlx-batch-runner, which ships full /v1/responses SSE with model cache TTL (MODEL_CACHE_TTL=600 default) and pin support (PINNED_MODELS=...).
For the public LibraxisAI server project, see mlx-batch-server: an Apple Silicon MLX inference server with batch processing, OpenAI-compatible /v1/responses, dynamic model load/unload, streaming, and VLM support.
Validation
End-to-end pipeline test 2026-04-22 (load → text simple → text canonical → vision JPEG → unload):
| Probe | TTFT | Output chars | Notes |
|---|---|---|---|
| Text — simple greeting (PL) | 0.6 s | 13254 | most prolific in the family |
| Text — canonical (PL, literary) | 0.3 s | 60516 | exhaustive multi-paragraph response |
| Vision — JPEG (Monument Valley) | 4.9 s | 980 | accurate scene description |
Channel parsing: has_reasoning=False on every probe — the Huihui4 family emits content exclusively on the output channel, matching OpenAI Responses API expectations cleanly.
Limitations and safety
Abliteration disclosure. This model derives from huihui-ai/Huihui4-48B-A4B-abliterated, which has had its safety alignment layers (refusal mechanisms and attention routing) removed. The underlying knowledge from pretraining is intact, but the model will not refuse queries it would normally decline. Do not deploy without an external safety layer if your context requires content moderation. The base model card's disclosures apply here.
- Multimodal: tested on still images (JPEG/PNG). Video is supported by the upstream Gemma 4 processor (Gemma4VideoProcessor, 32-frame uniform sampling) but not yet covered in our published validation matrix.
- Audio: tokenizer-side audio markers are present, but no audio-input validation has been published yet.
- Like all 4-bit quantized MoE models on Apple Silicon, expect occasional cosmetic artifacts (trailing special tokens) on very long generations.
License
Apache 2.0 — inherited via huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated. See the LICENSE link and the underlying Google Gemma terms.
Acknowledgements
- huihui-ai — abliteration of the Gemma-4-26B-A4B-it base, MoE expansion to 48B-A4B, original distillation.
- TeichAI — gemma-4-26B-A4B-it-Claude-Opus-Distill co-base.
- Google DeepMind — Gemma 4 architecture and pretraining.
- Apple MLX team — MLX framework, quantization primitives.
- Blaizzy/mlx-vlm — upstream multimodal MLX runtime; this build uses our editable LibraxisAI delta, which we are upstreaming as separate PRs.
𝚅𝚒𝚋𝚎𝚌𝚛𝚊𝚏𝚝𝚎𝚍 with AI Agents by VetCoders © 2024-2026 The LibraxisAI Team