NLLB-200-600M β€” Pico Int8 Bare-Metal Format

Custom per-channel int8 quantization of facebook/nllb-200-distilled-600M for bare-metal C inference.

Why Int8 over NF4 for bare-metal?

| Property | NF4 (4-bit) | Int8 (this repo) |
|---|---|---|
| File size | ~660 MB | ~850 MB |
| Dequant complexity | NF4 lookup table + double-quant | Single multiply per row |
| C code needed | ~40 lines | ~5 lines |
| SIMD acceleration | Requires nibble unpacking | Native int8 SIMD on all ARM cores |
| Accuracy vs fp16 | ~0.3% degradation | ~0.1% degradation |

Int8 is simpler, faster, and more accurate on hardware that has native int8 multiply-accumulate (all ARM Cortex-A, all modern x86). NF4 is better only when disk/flash space is the hard constraint.

Tensor Format

Every quantized linear layer is stored as two tensors:

<layer>.weight.int8   # int8  [out_features, in_features]
<layer>.weight.scale  # fp16  [out_features]   (one scale per output row)

Non-quantized tensors (LayerNorm weights/biases, positional embeddings) are stored as float16.

Dequantization (5 lines of C)

// Dequantize row `row` of a weight matrix into float32 buffer `out`
void dequant_row(const int8_t* W, const float* scale, float* out,
                 int row, int in_features) {
    for (int i = 0; i < in_features; i++)
        out[i] = (float)W[row * in_features + i] * scale[row];
}
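
In practice the dequantized row never needs to be materialized for matrix-vector products: because the scale is constant across a row, it can be folded into a single floating-point multiply after an all-integer accumulation, which is exactly where native int8 multiply-accumulate pays off. A minimal sketch, assuming the activation vector has also been quantized to int8 with its own scale `x_scale` (activation quantization happens at runtime and is not part of the stored format):

```c
#include <stdint.h>

/* Sketch: fused int8 matrix-vector product y = W x.
 * Inner loop is pure integer multiply-accumulate (int32 accumulator);
 * the per-row weight scale and the activation scale are applied once
 * per output element. `x_scale` is an assumed runtime activation scale. */
void matvec_int8(const int8_t* W, const float* scale,
                 const int8_t* x, float x_scale, float* y,
                 int out_features, int in_features) {
    for (int r = 0; r < out_features; r++) {
        int32_t acc = 0;
        for (int i = 0; i < in_features; i++)
            acc += (int32_t)W[r * in_features + i] * (int32_t)x[i];
        y[r] = (float)acc * scale[r] * x_scale;  /* single multiply per row */
    }
}
```

On Cortex-A cores the inner loop maps directly onto NEON int8 dot-product instructions; the float work stays O(out_features) instead of O(out_features × in_features).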

Reproducing

import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    torch_dtype=torch.float16, device_map="cpu"
)
# per-channel int8 quantization β€” see upload script in repo

Part of the Bare Metal project β€” multilingual translation on ultra-low-resource embedded hardware.
