Void-Citrus-L3.3-70B-mxfp4

Format: MXFP4 (OCP Microscaling FP4) — weights quantized to FP4 (E2M1) with a shared 8-bit power-of-two (E8M0) scale per 32-element block, per the OCP Microscaling spec. Activations remain in BF16/FP16 (W4A16 style).
Base model: Darkknight535/Void-Citrus-L3.3-70B
How it was made: One-shot quantization with LLM Compressor (MXFP4 recipe) on a DGX Spark (GB10 Grace Blackwell). Calibrated on xensive/roleplaydataset100k (512 samples, max sequence length 4096 tokens).

Notes: lm_head and multimodal projection layers are kept in high precision. GB10 Blackwell has native MX-format hardware support per the OCP Microscaling spec; older GPU architectures dequantize to BF16 at inference time but still benefit from the reduced model size and memory bandwidth.
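A minimal NumPy sketch of the block-scaling idea behind MXFP4 (illustrative only — real kernels pack 4-bit codes and apply the scales in hardware):

```python
import numpy as np

# Magnitudes representable by FP4 E2M1, the MXFP4 element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_block_roundtrip(block):
    """Quantize/dequantize one 32-value block with a shared
    power-of-two scale (as E8M0 requires) and FP4 elements."""
    assert block.size == 32
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block)
    # Smallest power-of-two scale that fits the block into
    # FP4's representable range [-6, 6].
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))
    scaled = block / scale
    # Round each magnitude to the nearest FP4 grid point, keep the sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

block = np.linspace(-4.0, 4.0, 32)
deq = mxfp4_block_roundtrip(block)
# Error is bounded by half the widest grid gap in range (0.5 here).
print(np.max(np.abs(deq - block)))
```

With only 8 bits of shared scale per 32 weights, the per-weight overhead is 0.25 bits, which is why the format stays close to 4 bits per parameter overall.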

See the original model card for details about the base model itself.

Running the model with vLLM in Docker

sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Firworks/Void-Citrus-L3.3-70B-mxfp4 \
  --dtype auto \
  --max-model-len 32768
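Once the container is up, it serves the OpenAI-compatible API on port 8000. For example (prompt and sampling settings are arbitrary):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Firworks/Void-Citrus-L3.3-70B-mxfp4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```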

Specifically for the DGX Spark

sudo docker run --gpus all --network host --ipc=host \
  nvcr.io/nvidia/vllm:26.02-py3 \
  vllm serve Firworks/Void-Citrus-L3.3-70B-mxfp4 \
  --dtype auto \
  --max-model-len 32768

Tested on a DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory).
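Back-of-the-envelope weight footprint, assuming a nominal 70B parameters at 4 bits each plus one 8-bit scale per 32-element block (ignores the high-precision lm_head, KV cache, and activations):

```python
params = 70e9                  # nominal parameter count (assumption)
weight_bytes = params * 4 / 8  # FP4 payload: half a byte per weight
scale_bytes = params / 32      # one 8-bit scale per 32-element block
total_gib = (weight_bytes + scale_bytes) / 2**30
print(f"~{total_gib:.1f} GiB of quantized weights")  # ≈ 34.6 GiB
```

That fits comfortably in the 128 GB of unified memory, leaving plenty of headroom for the KV cache at 32K context.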

If there are other models you'd like quantized to MXFP4, let me know.
