# Chunky — Bitext Chunk Alignment Model (NVFP4A16)

NVFP4A16-quantized version of `p4b/qwen3-4b-chunky`.
- Weights: FP4 E2M1 (4-bit float), `group_size=16`, scales in `fp8_e4m3fn`
- Activations: bfloat16 (unquantized)
- `lm_head`: unquantized
- Size: ~2.7 GB (vs. 7.6 GB in bf16)
- Quantization: llm-compressor 0.10.0.1, NVFP4A16 preset (data-free)
- Format: nvfp4-pack-quantized (compressed-tensors 0.14.0.1)
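FP4 E2M1 uses 1 sign, 2 exponent, and 1 mantissa bit, giving 16 codes over the value set ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}, with two codes packed per byte and one fp8 scale per group of 16 weights. A minimal pure-Python decode sketch — the nibble packing order and the example scale are illustrative assumptions; the authoritative layout is defined by compressed-tensors:

```python
# Decode FP4 E2M1 codes (1 sign, 2 exponent, 1 mantissa bit), the value
# set used by NVFP4. Packing order (low nibble = first weight) and the
# example scale below are assumptions for illustration only.

def decode_e2m1(code: int) -> float:
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                          # subnormal: 0 or ±0.5
        mag = 0.5 * man
    else:                                 # normal: 2^(exp-1) * (1 + man/2)
        mag = (2 ** (exp - 1)) * (1 + man / 2)
    return sign * mag

# All 16 codes cover ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}
values = sorted({decode_e2m1(c) for c in range(16)})
print(values)

# Two 4-bit weights per byte; each group of 16 weights shares one fp8 scale.
packed = 0b0111_0010      # high nibble 0b0111 -> 6.0, low nibble 0b0010 -> 1.0
lo, hi = packed & 0xF, packed >> 4
scale = 0.25              # hypothetical fp8_e4m3fn group scale
print(decode_e2m1(lo) * scale, decode_e2m1(hi) * scale)
```

The tiny value set is why a shared per-group scale matters: it restores dynamic range that 4 bits alone cannot represent.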
## Hardware Requirement

Requires an NVIDIA Blackwell GPU (sm_12x, e.g. RTX 50 series) for native FP4 execution via vLLM.
## Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="p4b/qwen3-4b-chunky-nvfp4",
    quantization="compressed-tensors",
    dtype="bfloat16",
)

sampling_params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
```
## Quality vs bf16

Evaluated on 300 held-out samples (same seed and split as the bf16 baseline). No meaningful quality degradation.
| Metric | bf16 (7.6 GB) | NVFP4A16 (2.7 GB) | Δ |
|---|---|---|---|
| Reward mean | -1.367 | -1.340 | +0.027 |
| Reward median | 0.000 | 0.000 | — |
| FN mean | 1.155 | 1.133 | -0.022 |
| FP mean | 0.600 | 0.583 | -0.017 |
| Perfect (reward=0) | 55.3% | 56.3% | +1.0% |
| Parse error | 4.3% | 3.7% | -0.6% |
A 64% size reduction with no measurable accuracy loss.

Note: vLLM 0.15.1 falls back to the Marlin kernel on non-Blackwell GPUs. Native FP4 CUDA kernels require sm_12x (RTX 50 series).
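The sizes above can be sanity-checked with back-of-envelope arithmetic. The parameter split below is an assumption for illustration (tied embeddings, roughly 4.08B total parameters), not the model's exact layer inventory:

```python
# Rough size check for NVFP4A16 (group_size=16, fp8 scales). The split
# between quantized Linear weights and bf16 tensors is an assumption.
GIB = 2**30

linear_params = 3.68e9   # assumed weights in quantized Linear layers
embed_params = 0.40e9    # assumed tied embedding / lm_head, kept in bf16

fp4_bytes = linear_params * 0.5      # 4 bits per weight
scale_bytes = linear_params / 16     # one fp8_e4m3fn scale per 16 weights
bf16_bytes = embed_params * 2        # unquantized tensors

est_gib = (fp4_bytes + scale_bytes + bf16_bytes) / GIB
base_gib = (linear_params + embed_params) * 2 / GIB
print(f"estimated: {est_gib:.1f} GiB quantized vs {base_gib:.1f} GiB bf16")

# Reduction from the sizes stated in the table
reduction = 1 - 2.7 / 7.6
print(f"size reduction: {reduction:.0%}")
```

Note the ~8% overhead from the per-group scales: one extra byte per 16 weights on top of the 4-bit payload.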
## Task

See `p4b/qwen3-4b-chunky` for the full task description, prompt format, and evaluation results. Given `<src>` and `<tgt>` bitext blocks with `[|n|]` split markers, the model predicts optimal alignment pairs as `<answer>src_idx-tgt_idx, ...</answer>`.
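A minimal sketch of consuming the answer format described above. The helper name and error handling are hypothetical; `p4b/qwen3-4b-chunky` documents the authoritative prompt and answer spec:

```python
import re

def parse_alignment(text: str) -> list[tuple[int, int]]:
    """Parse '<answer>0-0, 1-2</answer>' into [(0, 0), (1, 2)].

    Illustrative helper; a missing <answer> block would be counted
    toward the parse-error rate reported in the table above.
    """
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if m is None:
        raise ValueError("no <answer> block found")
    pairs = []
    for chunk in m.group(1).split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        src, tgt = chunk.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

print(parse_alignment("<answer>0-0, 1-1, 2-3</answer>"))
# [(0, 0), (1, 1), (2, 3)]
```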
## Model tree for p4b/qwen3-4b-chunky-nvfp4

- Base model: Qwen/Qwen3-4B-Instruct-2507
- Finetuned from: unsloth/Qwen3-4B-Instruct-2507
- Adapter: p4b/qwen3-4b-chunky