LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4

div0-space committed 29 days ago

Commit 9c3d56e (verified) · 1 parent: 9852827

card: full rewrite from canonical template

Files changed (1)

README.md +73 -105

@@ -1,11 +1,11 @@
 ---
 license: apache-2.0
-license_link: https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated/blob/main/LICENSE
-base_model: huihui-ai/Huihui4-48B-A4B-abliterated
 language:
 - en
 - pl
 - multilingual
 library_name: mlx
 pipeline_tag: image-text-to-text
 tags:
@@ -24,150 +24,118 @@ tags:
 - nvfp4
 - 4bit
 - quantized
 ---
-# Huihui4-48B-A4B vMLX NVFP4
-MLX **NVFP4** (NVIDIA-style FP4) build of [`huihui-ai/Huihui4-48B-A4B-abliterated`](https://huggingface.co/huihui-ai/Huihui4-48B-A4B-abliterated) for Apple Silicon — Gemma 4 architecture, abliterated, 48B-parameter MoE with ~4B active per token, full multimodal (image + text).
-**Sleeper variant.** Same bits-per-weight footprint as `mxfp4`, but uses an NVIDIA-style FP4 layout (per-tensor scale + per-block exponent) that retains noticeably more numerical headroom on dense attention matmuls. In our matrix, `nvfp4` produces longer and slightly higher-quality outputs than `mxfp4` at almost identical disk footprint and load time. Few publishers ship this format — it is a deliberate part of the LibraxisAI release.
-Built end-to-end with our `mlx-vlm` editable fork (LibraxisAI delta on top of upstream `Blaizzy/mlx-vlm`) — Fix 1 (progress visibility during lazy weight materialization), Fix 3 (per-shard eval + cache release during save), parity-aligned converter for Gemma 4 multi-bank audio + dual-image processor.
-## TL;DR
-| Property            | Value                                                        |
-|---------------------|--------------------------------------------------------------|
-| Base model          | `huihui-ai/Huihui4-48B-A4B-abliterated` (Gemma 4, abliterated) |
-| Architecture        | `Gemma4ForConditionalGeneration` MoE, 256 experts/layer, 8 active per token |
-| Total parameters    | ~48 B                                                        |
-| Activated parameters| ~4 B per token                                               |
-| Quantization        | NVFP4 (NVIDIA FP4 layout, per-tensor scale + per-block exponent) |
-| Bits / weight       | ~4.4                                                         |
-| Size on disk        | **27 GB**                                                    |
-| Cold load (M3 Ultra)| **~31 s**                                                    |
-| TTFT (text)         | ~0.3 s                                                       |
-| Modalities          | text in / text out, image in (JPEG/PNG), audio-aware tokenizer |
-## Why this build
-`NVFP4` and `MXFP4` are competing 4-bit formats with different rounding strategies:
-- **`MXFP4`** (Microscaling FP4) — block scale per 32 weights, conservative on attention paths, slightly smaller files.
-- **`NVFP4`** (NVIDIA FP4) — per-tensor scale combined with per-block exponent, retains more dynamic range on dense matmuls.
-In identical end-to-end probes against the [`fp16` parity baseline](https://huggingface.co/LibraxisAI/Huihui4-48B-A4B-vmlx-fp16), `nvfp4` matched output quality more closely on vision (1043 vs. 980 chars on JPEG probe) at near-identical load time. For Apple Silicon serving where you want the smallest practical 4-bit checkpoint without giving up image-grounded fidelity, this is the variant to evaluate first.
-## Model details
-| Property              | Value                                                  |
-|-----------------------|--------------------------------------------------------|
-| Format                | MLX, sharded safetensors                               |
-| Quantization config   | NVFP4 (FP4 + per-tensor scale + per-block exponent)    |
-| Tokenizer             | Inherited from base, `chat_template.jinja` included    |
-| Special tokens        | `<|video|>` (32 frames default), `<image>`, audio markers |
-| Image processor       | Dual-resolution Gemma 4 (low + hi-res patches)         |
-| Audio extractor       | Multi-bank mel filter (128 mel × 257 freq)             |
-| License               | Apache 2.0 (inherited from `huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated`) |
-## Runtime compatibility
-This quantized MLX build includes the Gemma 4 vision projection compatibility tensor `embed_vision.embedding_projection.biases`, so current MLX loaders that require the quantized projection bias can load the checkpoint cleanly. The MXFP8 variant was smoke-tested in LM Studio, and MXFP4/MXFP8/NVFP4 were patched with the same compatibility pattern.
-## Other variants
-| Variant                                                         | Bits/weight | Size on disk | Cold load | When to use |
-|-----------------------------------------------------------------|-------------|--------------|-----------|-------------|
-| [`Huihui4-48B-A4B-vmlx-fp16`](https://huggingface.co/LibraxisAI/Huihui4-48B-A4B-vmlx-fp16)   | 16          | 91 GB | ~99 s | parity baseline, golden eval |
-| [`Huihui4-48B-A4B-vmlx-mxfp8`](https://huggingface.co/LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp8) | ~8.5        | 47 GB | ~55 s | balanced production target |
-| [`Huihui4-48B-A4B-vmlx-mxfp4`](https://huggingface.co/LibraxisAI/Huihui4-48B-A4B-vmlx-mxfp4) | ~4.4        | 25 GB | ~29 s | mainstream, 32 GB Macs |
-| **`Huihui4-48B-A4B-vmlx-nvfp4`** (this)                         | ~4.4        | 27 GB | ~31 s | **NVIDIA-style FP4 sleeper, higher quality at same footprint as mxfp4** |
 ## Usage
-### `mlx-vlm` CLI
 ```bash
 pip install mlx-vlm
 python -m mlx_vlm.generate \
   --model LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4 \
-  --image path/to/image.jpg \
-  --prompt "Describe what you see in detail." \
-  --max-tokens 1024
 ```
-### `mlx-vlm` Python
 ```python
-from mlx_vlm import load, generate
-from mlx_vlm.prompt_utils import apply_chat_template
 model, processor = load("LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4")
-config = model.config
-messages = [{"role": "user", "content": "Hello! Tell me about yourself."}]
-prompt = apply_chat_template(processor, config, messages)
-output = generate(model, processor, prompt, max_tokens=512)
-print(output)
 ```
-### `mlx-batch-runner` (Responses API, streaming)
-```bash
-curl -X POST http://127.0.0.1:10240/v1/models/load \
-  -H "Content-Type: application/json" \
-  -d '{"model": "LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4", "task": "llm"}'
-curl -N -X POST http://127.0.0.1:10240/v1/responses \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4",
-    "stream": true,
-    "input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}]
-  }'
-```
-Our local validation/runtime path for this upload workflow was `../mlx-batch-runner`, which ships full `/v1/responses` SSE with model cache TTL (`MODEL_CACHE_TTL=600` default) and pin support (`PINNED_MODELS=...`).
-For the public LibraxisAI server project, see [mlx-batch-server](https://github.com/LibraxisAI/mlx-batch-server): an Apple Silicon MLX inference server with batch processing, OpenAI-compatible `/v1/responses`, dynamic model load/unload, streaming, and VLM support.
-## Validation
-End-to-end pipeline test 2026-04-22 (load → text simple → text canonical → vision JPEG → unload):
-| Probe                           | TTFT  | Output chars | Notes                                       |
-|---------------------------------|-------|--------------|---------------------------------------------|
-| Text — simple greeting (PL)     | 0.7 s | 1728         | concise, focused                            |
-| Text — canonical (PL, literary) | 0.3 s | 1665         | tighter than mxfp4, more on-topic           |
-| Vision — JPEG (Monument Valley) | 4.6 s | **1043**     | richest vision output among 4-bit variants  |
-Channel parsing: `has_reasoning=False` on every probe — Huihui4 family emits content exclusively on `output` channel, matching OpenAI Responses API expectations cleanly.
-## Limitations and safety
-> **Abliteration disclosure.** This model derives from `huihui-ai/Huihui4-48B-A4B-abliterated`, which has had its safety alignment layers (refusal mechanisms and attention routing) removed. The underlying knowledge from pretraining is intact, but the model **will not refuse** queries it would normally decline. Do not deploy without an external safety layer if your context requires content moderation. The base model card's [disclosures](https://huggingface.co/huihui-ai/Huihui4-48B-A4B-abliterated) apply here.
-- Multimodal: tested on still images (JPEG/PNG). Video is supported by the upstream Gemma 4 processor (`Gemma4VideoProcessor`, 32-frame uniform sampling) but not yet covered in our published validation matrix.
-- Audio: tokenizer-side audio markers are present, but no audio-input validation has been published yet.
-- Like all 4-bit quantized MoE models on Apple Silicon, expect occasional cosmetic artifacts (trailing special tokens) on very long generations.
 ## License
-Apache 2.0 — inherited via `huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated`. See [LICENSE link](https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated/blob/main/LICENSE) and the underlying [Google Gemma](https://ai.google.dev/gemma/terms) terms.
-## Acknowledgements
-- **huihui-ai** — abliteration of the Gemma-4-26B-A4B-it base, MoE expansion to 48B-A4B, original distillation.
-- **TeichAI** — `gemma-4-26B-A4B-it-Claude-Opus-Distill` co-base.
-- **Google DeepMind** — Gemma 4 architecture and pretraining.
-- **Apple MLX team** — MLX framework, quantization primitives.
-- **`Blaizzy/mlx-vlm`** — upstream multimodal MLX runtime; this build uses our editable LibraxisAI delta which we are upstreaming as separate PRs.
 ## Inference tested on
 [`LibraxisAI/mlx-batch-server`](https://github.com/LibraxisAI/mlx-batch-server)
 ---
-`𝚅𝚒𝚋𝚎𝚌𝚛𝚊𝚏𝚝𝚎𝚍. with AI Agents by VetCoders (c)2024-2026 The LibraxisAI Team`

 ---
 license: apache-2.0
 language:
 - en
 - pl
 - multilingual
+base_model:
+- huihui-ai/Huihui4-48B-A4B-abliterated
 library_name: mlx
 pipeline_tag: image-text-to-text
 tags:
 - nvfp4
 - 4bit
 - quantized
+- huihui
+inference: false
 ---
+# Huihui4-48B-A4B-vmlx-nvfp4
+`Huihui4-48B-A4B-vmlx-nvfp4` is an MLX vision-language checkpoint derived from `huihui-ai/Huihui4-48B-A4B-abliterated`, packaged for local multimodal prompting on Apple Silicon.
+## Intended use
+- Local image-and-text reasoning on Apple Silicon
+- Document, screenshot, chart, and visual question answering experiments
+- Operator-controlled multimodal prototyping where hosted inference is not desired
+## Out of scope
+- Safety-critical decisions without domain expert review
+- Claims of benchmark superiority not backed by published evaluation data
+- Non-MLX runtime guarantees; this card documents the shipped HF checkpoint, not every possible serving stack
+- High-stakes visual interpretation without human review
+## Training and conversion metadata
+| Parameter | Value |
+|---|---|
+| Repository | `LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4` |
+| Base model | `huihui-ai/Huihui4-48B-A4B-abliterated` |
+| Task | `image-text-to-text` |
+| Library | `mlx` |
+| Format | MLX / Apple Silicon checkpoint |
+| Quantization | NVFP4 |
+| Architecture | Gemma4ForConditionalGeneration |
+| Model files | 6 |
+| Config model_type | `gemma4` |
+This card only reports metadata present in the Hugging Face repository, existing card frontmatter, or public config files. Missing benchmark, dataset, or training-run details are left explicit rather than reconstructed.
 ## Usage
+### CLI
 ```bash
 pip install mlx-vlm
 python -m mlx_vlm.generate \
   --model LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4 \
+  --image image.jpg \
+  --prompt "Summarize the key signals in this document and list the next action items." \
+  --max-tokens 256
 ```
+### Python
 ```python
+from mlx_vlm import generate, load
 model, processor = load("LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4")
+response = generate(
+    model,
+    processor,
+    prompt="Summarize the key signals in this document and list the next action items.",
+    image="image.jpg",
+    max_tokens=256,
+)
+print(response)
 ```
+## Example output
+No public sample output is currently declared for this checkpoint. Run the usage example above against your own prompt or audio/image input to inspect behavior.
+## Quantization notes
+| Aspect | Original/base checkpoint | This checkpoint |
+|---|---|---|
+| Lineage | `huihui-ai/Huihui4-48B-A4B-abliterated` | `LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4` |
+| Runtime target | Upstream runtime format | MLX on Apple Silicon |
+| Quantization | Base precision or upstream-declared format | NVFP4 |
+| Published quality delta | Not declared in public metadata | Not declared in public metadata |
+## Limitations
+- No public benchmarks for this checkpoint are declared in the model metadata.
+- No public benchmark claims are made by this card unless listed in the frontmatter.
+- Validate outputs on your own domain data before relying on this checkpoint.
+- Memory use and speed depend heavily on the exact Apple Silicon generation, unified-memory size, and prompt length.
 ## License
+`apache-2.0`. Check the upstream/base model license as well when a base model is declared.
+## Citation
+```bibtex
+@misc{libraxisai-huihui4-48b-a4b-vmlx-nvfp4,
+  title = {Huihui4-48B-A4B-vmlx-nvfp4},
+  author = {LibraxisAI},
+  year = {2026},
+  howpublished = {\url{https://huggingface.co/LibraxisAI/Huihui4-48B-A4B-vmlx-nvfp4}},
+  note = {MLX checkpoint published by LibraxisAI}
+}
+```
 ## Inference tested on
 [`LibraxisAI/mlx-batch-server`](https://github.com/LibraxisAI/mlx-batch-server)
+## Related
+- Base model: [`huihui-ai/Huihui4-48B-A4B-abliterated`](https://huggingface.co/huihui-ai/Huihui4-48B-A4B-abliterated)
 ---
+𝚅𝚒𝚋𝚎𝚌𝚛𝚊𝚏𝚝𝚎𝚍. with AI Agents by VetCoders (c)2024-2026 LibraxisAI
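
The rewritten card drops the size figures and notes only that memory use depends on the unified-memory size. As a rough sanity check, the removed TL;DR table quoted ~48 B total parameters at ~4.4 bits per weight with 27 GB on disk. The sketch below redoes that arithmetic under the assumption that every parameter is stored at the quoted average bit width; the function name is hypothetical, and the result ignores tokenizer files, any unquantized tensors, and runtime overhead such as the KV cache.

```python
def estimate_weight_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bits = total_params * bits_per_weight
    return total_bits / 8 / 1e9


# Approximate figures quoted in the removed card text, not re-measured here.
TOTAL_PARAMS = 48e9
BITS_PER_WEIGHT = 4.4

if __name__ == "__main__":
    size = estimate_weight_size_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
    print(f"Estimated weight size: {size:.1f} GB")
```

The estimate comes out at about 26.4 GB, close to the 27 GB the old card reported, so the quoted bit width and on-disk size are at least mutually consistent. Treat it as a lower bound when deciding whether a given Mac has enough unified memory.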