Chunky — Bitext Chunk Alignment Model (NVFP4A16)

NVFP4A16-quantized version of p4b/qwen3-4b-chunky.

  • Weights: FP4 E2M1 (4-bit float), group_size=16, scale in fp8_e4m3fn
  • Activations: bfloat16 (unquantized)
  • lm_head: unquantized
  • Size: ~2.7 GB (vs 7.6 GB bf16)
  • Quantization: llm-compressor 0.10.0.1, NVFP4A16 preset (data-free)
  • Format: nvfp4-pack-quantized (compressed-tensors 0.14.0.1)
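To make the weight format above concrete, here is a minimal sketch of NVFP4-style group quantization: each group of 16 weights shares one scale, and each weight snaps to the nearest E2M1-representable value (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}). This is an illustration only; the real checkpoint also stores the scale in fp8_e4m3fn and packs two 4-bit codes per byte, which this sketch omits.

```python
# Illustrative NVFP4-style quantization (assumptions: E2M1 value grid,
# group_size=16, one fp-valued scale per group; the real format additionally
# quantizes the scale to fp8_e4m3fn and bit-packs the codes).

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_group(weights):
    """Quantize one group of 16 weights to E2M1 values plus a shared scale."""
    assert len(weights) == 16
    amax = max(abs(w) for w in weights)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest weight onto +/-6
    codes = []
    for w in weights:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        codes.append(-mag if w < 0 else mag)
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate weights from E2M1 codes and the group scale."""
    return [c * scale for c in codes]
```

Round-tripping a group through these two functions bounds the per-weight error by roughly one grid step times the group scale, which is why a fine group size of 16 matters for quality.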

Hardware Requirement

Requires an NVIDIA Blackwell GPU (sm_12x, e.g. RTX 50 series) for native FP4 execution via vLLM.

Usage with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="p4b/qwen3-4b-chunky-nvfp4",
    quantization="compressed-tensors",  # nvfp4-pack-quantized checkpoint format
    dtype="bfloat16",                   # activations stay in bf16 (the "A16" in NVFP4A16)
)

sampling_params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)

Quality vs bf16

Evaluated on 300 held-out samples (same seed, same split as bf16 baseline). No meaningful quality degradation.

Metric               bf16 (7.6 GB)   NVFP4A16 (2.7 GB)   Δ
Reward mean          -1.367          -1.340              +0.027
Reward median         0.000           0.000               0.000
FN mean               1.155           1.133              -0.022
FP mean               0.600           0.583              -0.017
Perfect (reward=0)    55.3%           56.3%               +1.0%
Parse error           4.3%            3.7%                -0.6%

~64% size reduction (7.6 GB → 2.7 GB) with no measurable accuracy loss.

Note: vLLM 0.15.1 falls back to the Marlin kernel on non-Blackwell GPUs. Native FP4 CUDA kernels require sm_12x (RTX 50 series).
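A quick way to check which path you will get is to compare the GPU's CUDA compute capability against the sm_12x threshold. The helper below is a sketch, not part of vLLM; the commented lines show how you might feed it from `torch.cuda.get_device_capability()` on a machine with a CUDA build of PyTorch.

```python
def supports_native_fp4(major, minor):
    """True if the compute capability supports native FP4 kernels (sm_12x).

    Blackwell (RTX 50 series) reports major version 12; anything older will
    hit vLLM's Marlin fallback for this checkpoint.
    """
    return major >= 12

# On a CUDA machine (requires torch with CUDA support):
#   import torch
#   major, minor = torch.cuda.get_device_capability()
#   print(supports_native_fp4(major, minor))
```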

Task

See p4b/qwen3-4b-chunky for full task description, prompt format, and evaluation results.

Given <src> and <tgt> bitext blocks with [|n|] split markers, predicts optimal alignment pairs as <answer>src_idx-tgt_idx, ...</answer>.
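Given that output contract, a minimal parser might look like the following. The `<answer>` tag and the `src_idx-tgt_idx, ...` pair syntax come from the format described above; the function name and the choice to return `None` on malformed output (the "parse error" case counted in the table) are assumptions of this sketch.

```python
import re

def parse_alignment(text):
    """Parse '<answer>0-0, 1-2</answer>' into [(0, 0), (1, 2)].

    Returns None if no well-formed <answer> block is found, which is what
    the evaluation above counts as a parse error.
    """
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if m is None:
        return None
    pairs = []
    for chunk in m.group(1).split(","):
        src, _, tgt = chunk.strip().partition("-")
        pairs.append((int(src), int(tgt)))
    return pairs
```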
