Chunky — Bitext Chunk Alignment Model (NVFP4A16)

NVFP4A16-quantized version of p4b/qwen3-4b-chunky.

  • Weights: FP4 E2M1 (4-bit float), group_size=16, scale in fp8_e4m3fn
  • Activations: bfloat16 (unquantized)
  • lm_head: unquantized
  • Size: ~2.7 GB (vs 7.6 GB bf16)
  • Quantization: llm-compressor 0.10.0.1, NVFP4A16 preset (data-free)
  • Format: nvfp4-pack-quantized (compressed-tensors 0.14.0.1)
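To make the weight format above concrete, here is a minimal sketch of NVFP4-style group quantization: each group of 16 weights shares one scale, and each weight snaps to the nearest E2M1-representable value (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}). This is an illustration only; the real checkpoint also stores the scale in fp8_e4m3fn and packs two 4-bit codes per byte, which this sketch omits.

```python
# Illustrative NVFP4-style quantization (assumptions: E2M1 value grid,
# group_size=16, one fp-valued scale per group; the real format additionally
# quantizes the scale to fp8_e4m3fn and bit-packs the codes).

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_group(weights):
    """Quantize one group of 16 weights to E2M1 values plus a shared scale."""
    assert len(weights) == 16
    amax = max(abs(w) for w in weights)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest weight onto +/-6
    codes = []
    for w in weights:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        codes.append(-mag if w < 0 else mag)
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate weights from E2M1 codes and the group scale."""
    return [c * scale for c in codes]
```

Round-tripping a group through these two functions bounds the per-weight error by roughly one grid step times the group scale, which is why a fine group size of 16 matters for quality.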

Hardware Requirement

Requires an NVIDIA Blackwell GPU (sm_12x, e.g. RTX 50 series) for native FP4 execution via vLLM.

Usage with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="p4b/qwen3-4b-chunky-nvfp4",
    quantization="compressed-tensors",  # nvfp4-pack-quantized checkpoint format
    dtype="bfloat16",                   # activations stay in bf16 (the "A16" in NVFP4A16)
)

sampling_params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)

Quality vs bf16

Evaluated on 300 held-out samples (same seed, same split as bf16 baseline). No meaningful quality degradation.

Metric               bf16 (7.6 GB)   NVFP4A16 (2.7 GB)   Δ
Reward mean          -1.367          -1.340              +0.027
Reward median         0.000           0.000               0.000
FN mean               1.155           1.133              -0.022
FP mean               0.600           0.583              -0.017
Perfect (reward=0)    55.3%           56.3%               +1.0%
Parse error           4.3%            3.7%                -0.6%

~64% size reduction (7.6 GB → 2.7 GB) with no measurable accuracy loss.

Note: vLLM 0.15.1 falls back to the Marlin kernel on non-Blackwell GPUs. Native FP4 CUDA kernels require sm_12x (RTX 50 series).
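A quick way to check which path you will get is to compare the GPU's CUDA compute capability against the sm_12x threshold. The helper below is a sketch, not part of vLLM; the commented lines show how you might feed it from `torch.cuda.get_device_capability()` on a machine with a CUDA build of PyTorch.

```python
def supports_native_fp4(major, minor):
    """True if the compute capability supports native FP4 kernels (sm_12x).

    Blackwell (RTX 50 series) reports major version 12; anything older will
    hit vLLM's Marlin fallback for this checkpoint.
    """
    return major >= 12

# On a CUDA machine (requires torch with CUDA support):
#   import torch
#   major, minor = torch.cuda.get_device_capability()
#   print(supports_native_fp4(major, minor))
```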

Task

See p4b/qwen3-4b-chunky for full task description, prompt format, and evaluation results.

Given <src> and <tgt> bitext blocks with [|n|] split markers, predicts optimal alignment pairs as <answer>src_idx-tgt_idx, ...</answer>.
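Given that output contract, a minimal parser might look like the following. The `<answer>` tag and the `src_idx-tgt_idx, ...` pair syntax come from the format described above; the function name and the choice to return `None` on malformed output (the "parse error" case counted in the table) are assumptions of this sketch.

```python
import re

def parse_alignment(text):
    """Parse '<answer>0-0, 1-2</answer>' into [(0, 0), (1, 2)].

    Returns None if no well-formed <answer> block is found, which is what
    the evaluation above counts as a parse error.
    """
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if m is None:
        return None
    pairs = []
    for chunk in m.group(1).split(","):
        src, _, tgt = chunk.strip().partition("-")
        pairs.append((int(src), int(tgt)))
    return pairs
```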
