# NLLB-200-600M → NF4 Bare-Metal Format
This repository contains a custom-serialized NF4 (NormalFloat 4-bit) quantization of `facebook/nllb-200-distilled-600M`, produced for researchers building bare-metal or embedded inference engines in C/C++/Rust, where `bitsandbytes` is unavailable.
> ⚠️ **This is not a standard HuggingFace checkpoint.** It cannot be loaded with `AutoModel.from_pretrained()`. It is a raw weight archive for custom engine developers. See the Loading Guide below.
## Motivation: Why This Format Exists
A standard `bitsandbytes` `save_pretrained()` call silently dequantizes NF4 weights back to float16 before writing to disk, producing a file identical in size to the fp16 checkpoint (~1.2 GB). This defeats the purpose of 4-bit quantization for storage and embedded deployment. Three serialization strategies were evaluated:
| Strategy | File size | Loadable by `transformers` | Notes |
|---|---|---|---|
| `save_pretrained()` default | 1,176 MB | ✅ Yes | Silently dequantized to fp16 |
| `save_pretrained(safe_serialization=True)` | 1,176 MB | ✅ Yes | Same issue |
| Manual `state_dict()` extraction | 1,176 MB | ✅ Yes | `.data` access triggers dequantization |
| Direct `weight` extraction (this repo) | ~660 MB | ❌ Custom engines only | True packed uint8 nibbles |
The embedding matrix (`model.shared.weight`, 256206 × 1024) accounts for ~512 MB of the final file in fp16. The remaining ~150 MB contains all 4-bit packed attention and FFN weights across the 24 encoder/decoder layers (12 + 12).
## Weight Size Breakdown
| Component | Count | Per-layer size | Format | Total |
|---|---|---|---|---|
| Encoder self-attn (q/k/v/o) | 4 × 12 | ~0.5 MB | uint8 packed | ~24 MB |
| Encoder FFN (fc1/fc2) | 2 × 12 | ~2.1 MB | uint8 packed | ~50 MB |
| Decoder self-attn (q/k/v/o) | 4 × 12 | ~0.5 MB | uint8 packed | ~24 MB |
| Decoder cross-attn (q/k/v/o) | 4 × 12 | ~0.5 MB | uint8 packed | ~24 MB |
| Decoder FFN (fc1/fc2) | 2 × 12 | ~2.1 MB | uint8 packed | ~50 MB |
| Shared embedding (1 copy) | 1 | 512 MB | float16 | ~512 MB |
| LayerNorms, biases | — | negligible | float16 | ~5 MB |
| **Total** | | | | **~660 MB** |
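The totals can be sanity-checked with back-of-envelope arithmetic (dimensions taken from the NLLB-600M config: d_model = 1024, FFN dim = 4096, 12 encoder + 12 decoder layers, vocab 256206):

```python
# Back-of-envelope size check for the breakdown table above.
MB = 1024 * 1024

attn_proj = 1024 * 1024 // 2             # one q/k/v/o projection at 4 bits (bytes)
ffn = (1024 * 4096 + 4096 * 1024) // 2   # fc1 + fc2 at 4 bits (bytes)
embedding = 256206 * 1024 * 2            # shared embedding in float16 (bytes)

enc = 12 * (4 * attn_proj + ffn)         # encoder: self-attn only
dec = 12 * (8 * attn_proj + ffn)         # decoder: self-attn + cross-attn
total_mb = (enc + dec + embedding) / MB  # excludes absmax scales / LayerNorms
```

This lands within a few percent of the ~660 MB figure; MB vs. MiB rounding and the scale/LayerNorm overhead account for the remainder.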
## Tensor Naming Convention
Every quantized linear layer is stored under the following keys in `model.safetensors`:

```text
<layer_path>.weight.packed    # uint8, shape (N*K/2, 1) — two 4-bit nibbles per byte
<layer_path>.weight.absmax    # per-block absmax scales (blocksize = 64); stored as uint8
                              # indices because double quantization is enabled (see below)
<layer_path>.weight.absmax2   # float32, second-level scales (double quantization)
<layer_path>.weight.code      # float32[16], NF4 lookup table
<layer_path>.weight.code2     # float32[256], second-level lookup table
<layer_path>.weight.shape     # int64[2], original [out_features, in_features]

model.shared.weight           # float16, (256206, 1024) — one copy for all tied embeddings
```
Non-quantized layers (LayerNorm, biases) are stored as standard float16 tensors under their original parameter path.
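Since the six per-layer keys follow a fixed pattern, they can be generated mechanically from a layer path. A small sketch (the helper name `quant_keys` is mine, not part of the repo):

```python
# Build the six expected tensor keys for one quantized layer path.
# Helper name and structure are illustrative only.
QUANT_SUFFIXES = ["packed", "absmax", "absmax2", "code", "code2", "shape"]

def quant_keys(layer_path):
    return [f"{layer_path}.weight.{suffix}" for suffix in QUANT_SUFFIXES]

keys = quant_keys("model.encoder.layers.0.fc1")
# keys[0] == "model.encoder.layers.0.fc1.weight.packed"
```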
## Dequantization Algorithm
To reconstruct a float16 weight matrix from the packed format:
```c
// Reference C implementation. Note: bitsandbytes packs the FIRST element of
// each pair into the HIGH nibble and the second into the low nibble.
void dequantize_nf4(
    const uint8_t *packed,   // .weight.packed
    const float   *absmax,   // .weight.absmax (after double-quant recovery, below)
    const float   *code,     // .weight.code (NF4 table, 16 entries)
    float         *out,      // output float32 buffer
    int            n_elements // out_features * in_features
) {
    const int blocksize = 64;
    for (int i = 0; i < n_elements / 2; i++) {
        uint8_t byte = packed[i];
        uint8_t first  = (byte >> 4) & 0x0F; // high nibble: first element
        uint8_t second = byte & 0x0F;        // low nibble: second element
        int block = (i * 2) / blocksize;     // blocksize is even, so both share a block
        float scale = absmax[block];
        out[i * 2]     = code[first]  * scale;
        out[i * 2 + 1] = code[second] * scale;
    }
}
```
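The packing scheme can be exercised end to end in plain Python. This sketch mirrors the C routine on a toy block; the NF4 table values here are rounded approximations of the bitsandbytes constants, and the high-nibble-first order is assumed:

```python
# Rounded NF4 quantile values (approximations of the bitsandbytes table).
NF4_CODE = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def pack(indices):
    """Pack pairs of 4-bit indices: first element in the high nibble."""
    assert len(indices) % 2 == 0
    return bytes((indices[i] << 4) | indices[i + 1]
                 for i in range(0, len(indices), 2))

def dequantize(packed, absmax, blocksize=64):
    """Unpack nibbles, map through the NF4 table, apply per-block scales."""
    out = []
    for i, byte in enumerate(packed):
        first, second = (byte >> 4) & 0x0F, byte & 0x0F
        scale = absmax[(i * 2) // blocksize]
        out.append(NF4_CODE[first] * scale)
        out.append(NF4_CODE[second] * scale)
    return out

vals = dequantize(pack([0, 15, 7, 8]), absmax=[2.0])
# vals == [-2.0, 2.0, 0.0, 0.1592] — indices 0/15/7/8 scaled by absmax 2.0
```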
## Double Quantization (absmax recovery)
Because `bnb_4bit_use_double_quant=True` was used, the `.weight.absmax` values are themselves quantized. Before running the kernel above, recover them:
```python
# Python reference implementation
import numpy as np

def recover_absmax(absmax2_raw, code2, absmax2_scale):
    # absmax2_raw: uint8 indices into code2 (the stored .weight.absmax)
    # code2: float32[256] lookup table (.weight.code2)
    # absmax2_scale: second-level scales (.weight.absmax2)
    return code2[np.asarray(absmax2_raw)] * absmax2_scale
```
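A toy numeric walk-through of the same recovery step, with all values synthetic and chosen only to make the arithmetic visible (the real table comes from `.weight.code2`):

```python
# Second-level lookup on synthetic data: quantized absmax bytes are mapped
# through code2, then multiplied by the second-level scale.
code2 = [i / 255.0 for i in range(256)]  # stand-in lookup table
absmax2_raw = [0, 128, 255]              # uint8 indices (stored .weight.absmax)
absmax2_scale = 4.0                      # second-level scale (stored .weight.absmax2)

recovered = [code2[i] * absmax2_scale for i in absmax2_raw]
# recovered[0] == 0.0 and recovered[-1] == 4.0
```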
## Loading Guide (Python → Custom Engine)
This checkpoint is intended for custom engines, but you can reconstruct and use it in Python as a reference/validation path:
```python
from safetensors import safe_open
import torch

# 1. Open the file
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# 2. Dequantize one layer as an example
def dequantize_nf4_layer(tensors, layer_name):
    packed = tensors[f"{layer_name}.weight.packed"]  # uint8
    absmax = tensors[f"{layer_name}.weight.absmax"]  # recover via the
    #   double-quantization step above before use
    code = tensors[f"{layer_name}.weight.code"]      # float32[16]
    shape = tensors[f"{layer_name}.weight.shape"].tolist()

    # Unpack nibbles (first element in the high nibble, second in the low)
    first = (packed >> 4) & 0x0F
    second = packed & 0x0F
    indices = torch.stack([first, second], dim=-1).flatten().long()

    # Map through the NF4 table
    flat = code[indices]

    # Apply per-block absmax scales (blocksize = 64)
    blocksize = 64
    n = flat.shape[0]
    absmax_expanded = absmax.repeat_interleave(blocksize)[:n]
    flat = flat * absmax_expanded

    # Reshape to the original weight shape
    return flat.reshape(shape).to(torch.float16)

# 3. Example: dequantize the first encoder FFN layer
w = dequantize_nf4_layer(tensors, "model.encoder.layers.0.fc1")
print(f"Recovered weight shape: {w.shape}, dtype: {w.dtype}")

# 4. Access the shared embedding directly (already float16)
embedding = tensors["model.shared.weight"]  # (256206, 1024) float16
print(f"Embedding shape: {embedding.shape}")
```
## Reproducing This Checkpoint
```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from safetensors.torch import save_file

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

state_dict = {}
seen = set()  # track data pointers so tied weights are stored only once
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        w = module.weight
        if w.data.data_ptr() not in seen:
            seen.add(w.data.data_ptr())
            qs = w.quant_state
            state_dict[f"{name}.weight.packed"] = w.data.detach().clone().cpu()
            state_dict[f"{name}.weight.absmax"] = qs.absmax.detach().clone().cpu()
            state_dict[f"{name}.weight.code"] = qs.code.detach().clone().cpu()
            state_dict[f"{name}.weight.shape"] = torch.tensor(list(qs.shape))
            if qs.state2 is not None:
                state_dict[f"{name}.weight.absmax2"] = qs.state2.absmax.detach().clone().cpu()
                state_dict[f"{name}.weight.code2"] = qs.state2.code.detach().clone().cpu()
    else:
        for pname, param in module.named_parameters(recurse=False):
            if param.data_ptr() not in seen:
                seen.add(param.data_ptr())
                key = f"{name}.{pname}" if name else pname
                state_dict[key] = param.detach().clone().cpu().to(torch.float16)

save_file(state_dict, "model.safetensors")
```
## Citation
If you use this checkpoint format in research, please cite the original NLLB model:
```bibtex
@article{nllb2022,
  title   = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  author  = {NLLB Team},
  journal = {arXiv preprint arXiv:2207.04672},
  year    = {2022}
}
```
This checkpoint was produced as part of the BareMetalNLLB project, an effort to run multilingual neural machine translation on ultra-low-resource embedded hardware.