Void-Citrus-L3.3-70B-mxfp4
Format: MXFP4 (OCP Microscaling FP4) — weights quantized to FP4 with per-block FP8 scaling factors (block size 32). Activations in BF16/FP16 (W4A16 style).
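To make the layout above concrete, here is a minimal NumPy sketch of quantizing one 32-element block to a shared power-of-two scale plus FP4 (E2M1) elements. The helper names are hypothetical; real kernels pack the 4-bit codes two per byte and store the scale as an E8M0 exponent byte.

```python
import numpy as np

# Non-negative values representable in FP4 E2M1 (per the OCP MX spec):
# 0, 0.5, 1, 1.5, 2, 3, 4, 6 -- plus their negatives.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

def mxfp4_quantize_block(x):
    """Quantize a block of 32 floats: shared power-of-two scale + FP4 elements.
    Hypothetical helper, for illustration only."""
    assert x.size == 32
    amax = np.abs(x).max()
    if amax == 0.0:
        return 1.0, np.zeros_like(x)
    # Shared scale: amax's exponent minus E2M1's largest exponent (6 = 1.5 * 2^2).
    scale = 2.0 ** (int(np.floor(np.log2(amax))) - 2)
    # Round each scaled element to the nearest representable FP4 value.
    idx = np.abs(x[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return scale, FP4_GRID[idx]

def mxfp4_dequantize_block(scale, elems):
    # What pre-Blackwell GPUs do on the fly (to BF16) at inference time.
    return scale * elems

rng = np.random.default_rng(0)
block = rng.normal(size=32)
scale, elems = mxfp4_quantize_block(block)
recon = mxfp4_dequantize_block(scale, elems)
print(scale, np.abs(block - recon).max())
```

With the scale chosen this way, each block's max-magnitude value lands in the E2M1 grid's upper range, so the worst-case per-element error stays within about two scale units.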
Base model: Darkknight535/Void-Citrus-L3.3-70B
How it was made: One-shot quantization with LLM Compressor (MXFP4 recipe) on a DGX Spark (GB10 Grace Blackwell). Calibrated on xensive/roleplaydataset100k (512 samples, max 4096 tokens).
Notes:
`lm_head` and multimodal projection layers are kept in high precision. The GB10 Blackwell has native hardware support for MX formats per the OCP Microscaling spec; older architectures will dequantize to BF16 at inference time while still benefiting from the reduced model size and memory bandwidth.
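To put a rough number on the size/bandwidth benefit, here is a back-of-the-envelope estimate (my own arithmetic, not measured numbers) for a ~70B-parameter model:

```python
# Rough footprint estimate for a ~70B-parameter model (illustrative arithmetic).
params = 70e9
bf16_gb = params * 2 / 1e9             # 2 bytes per BF16 weight
mxfp4_gb = (params * 0.5               # 4 bits per weight
            + params / 32 * 1.0        # one scale byte per 32-weight block
            ) / 1e9
print(f"BF16: ~{bf16_gb:.0f} GB, MXFP4: ~{mxfp4_gb:.0f} GB")
# → BF16: ~140 GB, MXFP4: ~37 GB
```

This is why the quantized 70B model fits comfortably in the DGX Spark's 128GB unified memory with room left for KV cache.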
See the original model card for details about the model itself.
Running the model with vLLM in Docker
```shell
sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model Firworks/Void-Citrus-L3.3-70B-mxfp4 \
    --dtype auto \
    --max-model-len 32768
```
Specifically for the DGX Spark
```shell
sudo docker run --gpus all --network host --ipc=host \
    nvcr.io/nvidia/vllm:26.02-py3 \
    vllm serve Firworks/Void-Citrus-L3.3-70B-mxfp4 \
    --dtype auto \
    --max-model-len 32768
```
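Once either container is up, vLLM serves the OpenAI-compatible API (port 8000 by default, matching the commands above), so a quick smoke test can be as simple as:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Firworks/Void-Citrus-L3.3-70B-mxfp4",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```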
Tested on a DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory).
If there are other models you'd like quantized to MXFP4, let me know.