# NLLB-200-600M – Pico Int8 Bare-Metal Format

Custom per-channel int8 quantization of facebook/nllb-200-distilled-600M for bare-metal C inference.
## Why Int8 over NF4 for bare-metal?
| Property | NF4 (4-bit) | Int8 (this repo) |
|---|---|---|
| File size | ~660 MB | ~850 MB |
| Dequant complexity | NF4 lookup table + double-quant | Single multiply per row |
| C code needed | ~40 lines | ~5 lines |
| SIMD acceleration | Requires nibble unpacking | Native int8 SIMD on all ARM cores |
| Accuracy vs fp16 | ~0.3% degradation | ~0.1% degradation |
Int8 is simpler, faster, and more accurate on hardware that has native int8 multiply-accumulate (all ARM Cortex-A, all modern x86). NF4 is better only when disk/flash space is the hard constraint.
## Tensor Format
Every quantized linear layer is stored as two tensors:
```
<layer>.weight.int8    # int8 [out_features, in_features]
<layer>.weight.scale   # fp16 [out_features] (one scale per output row)
```
Non-quantized tensors (LayerNorm weights/biases, positional embeddings) are stored as float16.
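As a sanity check on this layout, dequantization is a single broadcasted multiply. A minimal NumPy sketch (the `dequantize` helper name is illustrative; tensor loading is omitted):

```python
import numpy as np

def dequantize(w_int8: np.ndarray, scale_fp16: np.ndarray) -> np.ndarray:
    """fp32 weight = int8 entry * its row's scale (broadcast over in_features)."""
    return w_int8.astype(np.float32) * scale_fp16.astype(np.float32)[:, None]

# Toy example: 2 output rows, 2 input features.
w = np.array([[127, -64], [10, 0]], dtype=np.int8)
s = np.array([0.5, 0.25], dtype=np.float16)
out = dequantize(w, s)  # row 0 -> [63.5, -32.0], row 1 -> [2.5, 0.0]
```

Because the scale is per output row, the whole row shares one multiplier, which is what makes the C-side dequant below so short.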
## Dequantization (5 lines of C)
```c
// Dequantize row `row` of a weight matrix into float32 buffer `out`
void dequant_row(const int8_t* W, const float* scale, float* out,
                 int row, int in_features) {
    for (int i = 0; i < in_features; i++)
        out[i] = (float)W[row * in_features + i] * scale[row];
}
```
## Reproducing
```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    torch_dtype=torch.float16, device_map="cpu",
)
# per-channel int8 quantization – see upload script in repo
```
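The upload script itself is not reproduced here, but symmetric per-channel int8 quantization reduces to a few lines. An illustrative sketch (the `quantize_per_channel` name and details are assumptions, not the repo's actual code):

```python
import torch

def quantize_per_channel(w: torch.Tensor):
    """Illustrative symmetric per-output-row int8 quantization.

    w: [out_features, in_features] fp16/fp32 weight.
    Returns (int8 weights, fp16 per-row scales), matching the
    tensor format described above.
    """
    w = w.float()
    absmax = w.abs().amax(dim=1, keepdim=True)   # [out_features, 1]
    scale = (absmax / 127.0).clamp(min=1e-8)     # avoid div-by-zero on all-zero rows
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale.squeeze(1).half()
```

With a symmetric scheme the per-weight reconstruction error is bounded by half a quantization step, i.e. about `scale / 2` for each row.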
Part of the Bare Metal project – multilingual translation on ultra-low-resource embedded hardware.