F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Paper: arXiv 2603.19223
INT4-quantized ONNX of codefuse-ai/F2LLM-v2-0.6B. Smallest resident memory of all variants while maintaining strong retrieval quality.
| Property | Value |
|---|---|
| Method | onnxruntime.quantization.MatMulNBitsQuantizer |
| Bits | 4 |
| Block size | 32 (one scale per 32 consecutive weights) |
| Symmetry | Symmetric (no zero-point) |
| Op | MatMulNBits contrib op (ORT ≥ 1.16, CPU / CUDA / CoreML EPs) |
| Ops quantized | MatMul only; Gather (embedding table) left in FP32 |
| Input | FP32 dynamo export (cstr/F2LLM-v2-0.6B-ONNX) |
| Dynamic batch | ✓ batch = 1, 2, 4, 8, … |
Block-wise quantization (block_size=32) gives 32 768 scales for a 1024×1024 weight matrix, far finer than per-tensor INT8's single scale, which is why INT4 block-wise often shows higher cosine fidelity to FP32 than per-tensor INT8.
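The scale count above can be checked with a minimal NumPy sketch of symmetric 4-bit block-wise quantization. The random weight matrix is a stand-in, not the model's actual weights, and the rounding scheme is a simplified illustration of what MatMulNBits does, not its exact kernel:

```python
import numpy as np

# Stand-in weights: a random 1024x1024 FP32 matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

block_size = 32
blocks = W.reshape(-1, block_size)                    # (32768, 32): one scale per row
scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # symmetric: max |w| -> code 7
scales = np.maximum(scales, 1e-12)                    # guard all-zero blocks
q = np.clip(np.round(blocks / scales), -8, 7)         # signed 4-bit codes, no zero-point
W_deq = (q * scales).reshape(W.shape).astype(np.float32)

print(scales.size)  # 32768 scales for a 1024x1024 matrix
```

Each block's reconstruction error is bounded by half its own scale, so outlier blocks cannot inflate the error of well-behaved ones, which is the intuition behind the fidelity claim above.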
| Property | Value |
|---|---|
| Base model | codefuse-ai/F2LLM-v2-0.6B |
| Architecture | Qwen3 decoder |
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Pooling | Last-token pooling + L2 normalisation |
| File size | ~0.9 GB (model.int4.onnx + model.int4.onnx.data) |
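Last-token pooling takes the hidden state at the final non-padding position of each sequence. A toy illustration of the index arithmetic, assuming right-padding (the mask values and tensor shapes here are made up for the example):

```python
import numpy as np

# Right-padded attention masks for two sequences of lengths 3 and 5.
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]], dtype=np.int64)
# Fake hidden states with shape (batch=2, seq=5, dim=4).
hidden = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)

last = mask.sum(axis=1) - 1            # index of last real token per sequence
pooled = hidden[np.arange(2), last]    # gather one (dim,) vector per sequence
print(last)          # [2 4]
print(pooled.shape)  # (2, 4)
```

This is exactly why right-padding is configured in the usage snippet: with left-padding, `mask.sum(axis=1) - 1` would no longer point at the last real token.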
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

# CPUExecutionProvider supports the 4-bit MatMulNBits contrib op
session = ort.InferenceSession("model.int4.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Last hidden state -> last-token pooling -> L2 normalisation
lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]
seq_lens = mask.sum(axis=1) - 1
embeddings = lhs[np.arange(len(texts)), seq_lens]
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
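Because the embeddings come out L2-normalised, cosine similarity reduces to a plain dot product. A self-contained sketch using random stand-ins for the `(2, 1024)` embeddings produced above:

```python
import numpy as np

# Stand-in for the (2, 1024) L2-normalised embeddings from the snippet above.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just a matrix product.
sim = embeddings @ embeddings.T
print(sim.shape)  # (2, 2); diagonal entries are ~1.0
```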
| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/F2LLM-v2-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/F2LLM-v2-0.6B-ONNX-INT8 | INT8 per-channel | 1.1 GB | Recommended |
| cstr/F2LLM-v2-0.6B-ONNX-INT4 | INT4 MatMulNBits | 0.9 GB | This repo; minimum RAM |
| cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL | INT8 incl. embeddings | 0.6 GB | Smallest file |
```bibtex
@misc{f2llm-v2,
  title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
  author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  year={2026},
  eprint={2603.19223},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.19223},
}
```
Apache 2.0, same as codefuse-ai/F2LLM-v2-0.6B.
Base model: Qwen/Qwen3-0.6B-Base