# NLLB-200-600M – Pico Int8 Bare-Metal Format

Custom per-channel int8 quantization of facebook/nllb-200-distilled-600M for bare-metal C inference.
## Why Int8 over NF4 for bare-metal?
| Property | NF4 (4-bit) | Int8 (this repo) |
|---|---|---|
| File size | ~660 MB | ~850 MB |
| Dequant complexity | NF4 lookup table + double-quant | Single multiply per row |
| C code needed | ~40 lines | ~5 lines |
| SIMD acceleration | Requires nibble unpacking | Native int8 SIMD on all ARM cores |
| Accuracy vs fp16 | ~0.3% degradation | ~0.1% degradation |
Int8 is simpler, faster, and more accurate on hardware that has native int8 multiply-accumulate (all ARM Cortex-A, all modern x86). NF4 is better only when disk/flash space is the hard constraint.
## Tensor Format
Every quantized linear layer is stored as two tensors:
```
<layer>.weight.int8    # int8 [out_features, in_features]
<layer>.weight.scale   # fp16 [out_features] (one scale per output row)
```
Non-quantized tensors (LayerNorm weights/biases, positional embeddings) are stored as float16.
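As a sanity check on this layout, dequantization is a single broadcasted multiply. A minimal NumPy sketch (the `dequantize` helper name is illustrative; tensor loading is omitted):

```python
import numpy as np

def dequantize(w_int8: np.ndarray, scale_fp16: np.ndarray) -> np.ndarray:
    """fp32 weight = int8 entry * its row's scale (broadcast over in_features)."""
    return w_int8.astype(np.float32) * scale_fp16.astype(np.float32)[:, None]

# Toy example: 2 output rows, 2 input features.
w = np.array([[127, -64], [10, 0]], dtype=np.int8)
s = np.array([0.5, 0.25], dtype=np.float16)
out = dequantize(w, s)  # row 0 -> [63.5, -32.0], row 1 -> [2.5, 0.0]
```

Because the scale is per output row, the whole row shares one multiplier, which is what makes the C-side dequant below so short.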
## Dequantization (5 lines of C)
```c
// Dequantize row `row` of a weight matrix into float32 buffer `out`
void dequant_row(const int8_t* W, const float* scale, float* out,
                 int row, int in_features) {
    for (int i = 0; i < in_features; i++)
        out[i] = (float)W[row * in_features + i] * scale[row];
}
```
## Reproducing
```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    torch_dtype=torch.float16, device_map="cpu",
)
# per-channel int8 quantization – see upload script in repo
```
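The upload script itself is not reproduced here, but symmetric per-channel int8 quantization reduces to a few lines. An illustrative sketch (the `quantize_per_channel` name and details are assumptions, not the repo's actual code):

```python
import torch

def quantize_per_channel(w: torch.Tensor):
    """Illustrative symmetric per-output-row int8 quantization.

    w: [out_features, in_features] fp16/fp32 weight.
    Returns (int8 weights, fp16 per-row scales), matching the
    tensor format described above.
    """
    w = w.float()
    absmax = w.abs().amax(dim=1, keepdim=True)   # [out_features, 1]
    scale = (absmax / 127.0).clamp(min=1e-8)     # avoid div-by-zero on all-zero rows
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale.squeeze(1).half()
```

With a symmetric scheme the per-weight reconstruction error is bounded by half a quantization step, i.e. about `scale / 2` for each row.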
Part of the Bare Metal project – multilingual translation on ultra-low-resource embedded hardware.