NLLB-200-600M: NF4 Bare-Metal Format

This repository contains a custom-serialized NF4 (Normal Float 4) quantization of facebook/nllb-200-distilled-600M, produced for researchers building bare-metal or embedded inference engines in C/C++/Rust where bitsandbytes is unavailable.

⚠️ This is not a standard HuggingFace checkpoint. It cannot be loaded with AutoModel.from_pretrained(). It is a raw weight archive for custom engine developers. See the Loading Guide below.


Ablation: Why This Format Exists

Standard bitsandbytes save_pretrained() silently dequantizes NF4 weights back to float16 before writing to disk, producing a file the same size as the fp16 checkpoint (~1.2 GB). This defeats the purpose of 4-bit quantization for storage and embedded deployment. Four serialization strategies were compared:

| Strategy | File size | Loadable by `transformers` | Notes |
|---|---|---|---|
| `save_pretrained()` default | 1,176 MB | Yes | Silently dequantized to fp16 |
| `save_pretrained(safe_serialization=True)` | 1,176 MB | Yes | Same issue |
| Manual `state_dict()` extraction | 1,176 MB | Yes | `.data` access triggers dequantization |
| Direct `weight.data` extraction (this repo) | ~660 MB | No, custom only | True packed uint8 nibbles |

The embedding matrix (model.shared.weight, 256206 × 1024) accounts for ~512 MB of the final file in fp16. The remaining ~150 MB contains all 4-bit packed attention and FFN weights across the 24 encoder/decoder layers.

Weight Size Breakdown

| Component | Count | Per-tensor size | Format | Total |
|---|---|---|---|---|
| Encoder self-attn (q/k/v/o) | 4 × 12 | ~0.5 MB | uint8 packed | ~24 MB |
| Encoder FFN (fc1/fc2) | 2 × 12 | ~2.1 MB | uint8 packed | ~50 MB |
| Decoder self-attn (q/k/v/o) | 4 × 12 | ~0.5 MB | uint8 packed | ~24 MB |
| Decoder cross-attn (q/k/v/o) | 4 × 12 | ~0.5 MB | uint8 packed | ~24 MB |
| Decoder FFN (fc1/fc2) | 2 × 12 | ~2.1 MB | uint8 packed | ~50 MB |
| Shared embedding (1 copy) | 1 | ~512 MB | float16 | ~512 MB |
| LayerNorms, biases | n/a | negligible | float16 | ~5 MB |
| **Total** | | | | **~660 MB** |
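The totals above can be sanity-checked with back-of-the-envelope arithmetic, assuming the distilled-600M dimensions (d_model = 1024, FFN dim = 4096, 12 encoder + 12 decoder layers) and 0.5 bytes per 4-bit element:

```python
MiB = 2**20
d_model, d_ffn = 1024, 4096

emb      = 256206 * d_model * 2 / MiB                # float16 shared embedding
attn_enc = 12 * 4 * d_model * d_model * 0.5 / MiB    # encoder q/k/v/o, 4-bit packed
attn_dec = 2 * attn_enc                              # decoder self-attn + cross-attn
ffn      = 2 * 12 * 2 * d_model * d_ffn * 0.5 / MiB  # fc1 + fc2, encoder + decoder

total = emb + attn_enc + attn_dec + ffn
print(f"embedding ~{emb:.0f} MiB, attention ~{attn_enc + attn_dec:.0f} MiB, "
      f"FFN ~{ffn:.0f} MiB, total ~{total:.0f} MiB")
# prints: embedding ~500 MiB, attention ~72 MiB, FFN ~96 MiB, total ~668 MiB
```

With a few extra MB of scales, LayerNorms, and biases, this lands near the ~660 MB figure above (the table rounds MB/MiB loosely).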

Tensor Naming Convention

Every quantized linear layer is stored under the following keys in model.safetensors:

```
<layer_path>.weight.packed   # uint8, shape (N*K/2, 1) - two 4-bit nibbles per byte
<layer_path>.weight.absmax   # uint8, double-quantized per-block absmax (blocksize = 64); see Double Quantization below
<layer_path>.weight.absmax2  # float32, second-level scales (double quantization)
<layer_path>.weight.code     # float32[16], NF4 lookup table
<layer_path>.weight.code2    # float32[256], second-level lookup table
<layer_path>.weight.shape    # int64[2], original [out_features, in_features]
model.shared.weight          # float16, (256206, 1024) - one copy for all tied embeddings
```

Non-quantized layers (LayerNorm, biases) are stored as standard float16 tensors under their original parameter path.
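Given this convention, checking an archive for completeness is a matter of grouping keys by layer path and flagging any layer missing one of the six expected suffixes. The helper below is illustrative (the function names are mine, not part of the archive):

```python
from collections import defaultdict

EXPECTED = {"packed", "absmax", "absmax2", "code", "code2", "shape"}

def group_quantized_keys(keys):
    """Map each layer path to the set of '.weight.<suffix>' fields found."""
    layers = defaultdict(set)
    for k in keys:
        if ".weight." in k:
            path, _, suffix = k.rpartition(".weight.")
            layers[path].add(suffix)
    return dict(layers)

def incomplete_layers(keys):
    """Return {layer_path: missing_suffixes} for layers violating the convention."""
    return {p: EXPECTED - s for p, s in group_quantized_keys(keys).items()
            if s != EXPECTED}
```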


Dequantization Algorithm

To reconstruct a weight matrix from the packed format (the reference below produces float32; convert to float16 afterward if desired):

```c
// Pseudocode - C reference implementation
void dequantize_nf4(
    const uint8_t* packed,     // .weight.packed
    const float*   absmax,     // .weight.absmax (after double-quant recovery, see below)
    const float*   code,       // .weight.code (NF4 table, 16 entries)
    float*         out,        // output float32 buffer
    int            n_elements  // out_features * in_features
) {
    const int blocksize = 64;
    for (int i = 0; i < n_elements / 2; i++) {
        uint8_t byte  = packed[i];
        uint8_t hi    = (byte >> 4) & 0x0F;   // first element (bitsandbytes packs it in the high nibble)
        uint8_t lo    = byte & 0x0F;          // second element
        int     block = (i * 2) / blocksize;
        float   scale = absmax[block];
        out[i * 2]     = code[hi] * scale;
        out[i * 2 + 1] = code[lo] * scale;
    }
}
```
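For reference, the `.weight.code` table is a fixed 16-entry vector. The values below are the NF4 code points published with QLoRA (Dettmers et al., 2023) and used by bitsandbytes; index 7 is exactly zero and the endpoints are ±1:

```python
# NF4 code table: ascending, asymmetric around zero, derived from the
# quantiles of a standard normal distribution (per the QLoRA paper).
NF4_CODE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]
```

A loader can hard-code this table and compare it against the stored `.weight.code` tensor as a cheap format sanity check.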

Double Quantization (absmax recovery)

Because bnb_4bit_use_double_quant=True was used, the .weight.absmax values are themselves quantized (stored as uint8 indices into code2). Before the step above, recover them. Note that bitsandbytes subtracts the mean of the absmax vector before double-quantizing it; exact recovery requires adding that offset (quant_state.offset) back:

```python
# Python reference implementation
import numpy as np

def recover_absmax(absmax2_raw, code2, absmax2_scale, offset=0.0):
    # absmax2_raw:   uint8 indices into code2
    # code2:         float32[256] lookup table
    # absmax2_scale: second-level scale(s); bitsandbytes stores these per block
    #                (blocksize 256), so broadcast/expand them as needed
    # offset:        mean subtracted by bitsandbytes before double quantization
    return code2[np.asarray(absmax2_raw)] * absmax2_scale + offset
```
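A toy round trip makes the recovery concrete (a 4-entry code2 table instead of 256, with values invented purely for illustration):

```python
import numpy as np

code2 = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float32)  # toy second-level table
absmax2_raw = np.array([3, 1, 2], dtype=np.uint8)          # stored uint8 indices
absmax2_scale = 2.0                                        # second-level scale

# Same lookup-and-scale step the recovery function performs
recovered = code2[absmax2_raw] * absmax2_scale             # -> [2.0, 0.5, 1.0]
```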

Loading Guide (Python, Custom Engine)

This checkpoint is intended for custom engines, but you can reconstruct and use it in Python as a reference/validation path:

```python
from safetensors import safe_open
import torch

# 1. Open the file
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# 2. Dequantize one layer as an example
def dequantize_nf4_layer(tensors, layer_name):
    packed   = tensors[f"{layer_name}.weight.packed"]   # uint8, two nibbles per byte
    absmax_q = tensors[f"{layer_name}.weight.absmax"]   # uint8, double-quantized
    absmax2  = tensors[f"{layer_name}.weight.absmax2"]  # float32, second-level scales
    code     = tensors[f"{layer_name}.weight.code"]     # float32[16], NF4 table
    code2    = tensors[f"{layer_name}.weight.code2"]    # float32[256], second-level table
    shape    = tensors[f"{layer_name}.weight.shape"].tolist()

    # Recover per-block absmax scales (double quantization; bitsandbytes uses
    # blocksize 256 for the second level). Exact recovery would also add back
    # quant_state.offset, which this archive does not store.
    absmax = code2[absmax_q.flatten().long()]
    absmax = absmax * absmax2.flatten().repeat_interleave(256)[: absmax.shape[0]]

    # Unpack nibbles; bitsandbytes stores the first element in the high nibble
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    indices = torch.stack([hi, lo], dim=-1).flatten().long()

    # Map through NF4 table
    flat = code[indices]

    # Apply first-level absmax scales (blocksize = 64)
    blocksize = 64
    n = flat.shape[0]
    absmax_expanded = absmax.repeat_interleave(blocksize)[:n]
    flat = flat * absmax_expanded

    # Reshape to original weight shape
    return flat.reshape(shape).to(torch.float16)

# 3. Example: dequantize the first encoder FFN layer
w = dequantize_nf4_layer(tensors, "model.encoder.layers.0.fc1")
print(f"Recovered weight shape: {w.shape}, dtype: {w.dtype}")

# 4. Access the shared embedding directly (already float16)
embedding = tensors["model.shared.weight"]  # (256206, 1024) float16
print(f"Embedding shape: {embedding.shape}")
```
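As a cheap validation of any dequantized layer: every NF4 code value lies in [-1, 1], so each reconstructed weight is bounded in magnitude by its block's absmax. A generic checker (my own helper, not part of the archive):

```python
import torch

def check_nf4_range(weight, absmax, blocksize=64):
    """True iff every dequantized value satisfies |w| <= absmax of its block."""
    flat = weight.flatten().float()
    n = flat.shape[0]
    scales = absmax.flatten().float().repeat_interleave(blocksize)[:n]
    return bool((flat.abs() <= scales + 1e-6).all())
```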

Reproducing This Checkpoint

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from safetensors.torch import save_file

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

state_dict = {}
seen = set()
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        w = module.weight
        if w.data.data_ptr() not in seen:
            seen.add(w.data.data_ptr())
            qs = w.quant_state
            state_dict[f"{name}.weight.packed"] = w.data.detach().clone().cpu()
            state_dict[f"{name}.weight.absmax"] = qs.absmax.detach().clone().cpu()
            state_dict[f"{name}.weight.code"]   = qs.code.detach().clone().cpu()
            state_dict[f"{name}.weight.shape"]  = torch.tensor(list(qs.shape))
            if qs.state2 is not None:
                state_dict[f"{name}.weight.absmax2"] = qs.state2.absmax.detach().clone().cpu()
                state_dict[f"{name}.weight.code2"]   = qs.state2.code.detach().clone().cpu()
            # Note: quant_state.offset (mean of absmax) is not serialized, so
            # absmax recovery from absmax2/code2 alone is approximate.
        # Linear4bit biases are real parameters and must be saved too;
        # they would otherwise be skipped, since this module never reaches
        # the else branch below.
        if module.bias is not None and module.bias.data_ptr() not in seen:
            seen.add(module.bias.data_ptr())
            state_dict[f"{name}.bias"] = module.bias.detach().clone().cpu().to(torch.float16)
    else:
        for pname, param in module.named_parameters(recurse=False):
            if param.data_ptr() not in seen:
                seen.add(param.data_ptr())
                key = f"{name}.{pname}" if name else pname
                state_dict[key] = param.detach().clone().cpu().to(torch.float16)

save_file(state_dict, "model.safetensors")
```

Citation

If you use this checkpoint format in research, please cite the original NLLB model:

@article{nllb2022,
  title   = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  author  = {NLLB Team},
  journal = {arXiv preprint arXiv:2207.04672},
  year    = {2022}
}

This checkpoint was produced as part of the BareMetalNLLB project, an effort to run multilingual neural machine translation on ultra-low-resource embedded hardware.
