F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Paper: arXiv 2603.19223
INT4-quantized ONNX of codefuse-ai/F2LLM-v2-0.6B. Smallest resident memory of all variants while maintaining strong retrieval quality.
| Property | Value |
|---|---|
| Method | onnxruntime.quantization.MatMulNBitsQuantizer |
| Bits | 4 |
| Block size | 32 (one scale per 32 consecutive weights) |
| Symmetry | Symmetric (no zero-point) |
| Op | MatMulNBits contrib op (ORT ≥ 1.16, CPU / CUDA / CoreML EPs) |
| Ops quantized | MatMul only; Gather (embedding table) left in FP32 |
| Input | FP32 dynamo export (cstr/F2LLM-v2-0.6B-ONNX) |
| Dynamic batch | ✓ batch = 1, 2, 4, 8, … |
Block-wise quantization (block_size=32) gives 32 768 scales for a 1024×1024 weight matrix, far finer than per-tensor INT8's single scale, which is why INT4 block-wise often shows higher cosine fidelity to FP32 than per-tensor INT8.
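The scale count above can be checked with a minimal NumPy sketch of symmetric 4-bit block-wise quantization. The random weight matrix is a stand-in, not the model's actual weights, and the rounding scheme is a simplified illustration of what MatMulNBits does, not its exact kernel:

```python
import numpy as np

# Stand-in weights: a random 1024x1024 FP32 matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

block_size = 32
blocks = W.reshape(-1, block_size)                    # (32768, 32): one scale per row
scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # symmetric: max |w| -> code 7
scales = np.maximum(scales, 1e-12)                    # guard all-zero blocks
q = np.clip(np.round(blocks / scales), -8, 7)         # signed 4-bit codes, no zero-point
W_deq = (q * scales).reshape(W.shape).astype(np.float32)

print(scales.size)  # 32768 scales for a 1024x1024 matrix
```

Each block's reconstruction error is bounded by half its own scale, so outlier blocks cannot inflate the error of well-behaved ones, which is the intuition behind the fidelity claim above.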
| Property | Value |
|---|---|
| Base model | codefuse-ai/F2LLM-v2-0.6B |
| Architecture | Qwen3 decoder |
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Pooling | Last-token pooling + L2 normalisation |
| File size | ~0.9 GB (model.int4.onnx + model.int4.onnx.data) |
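Last-token pooling takes the hidden state at the final non-padding position of each sequence. A toy illustration of the index arithmetic, assuming right-padding (the mask values and tensor shapes here are made up for the example):

```python
import numpy as np

# Right-padded attention masks for two sequences of lengths 3 and 5.
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]], dtype=np.int64)
# Fake hidden states with shape (batch=2, seq=5, dim=4).
hidden = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)

last = mask.sum(axis=1) - 1            # index of last real token per sequence
pooled = hidden[np.arange(2), last]    # gather one (dim,) vector per sequence
print(last)          # [2 4]
print(pooled.shape)  # (2, 4)
```

This is exactly why right-padding is configured in the usage snippet: with left-padding, `mask.sum(axis=1) - 1` would no longer point at the last real token.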
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

# CPUExecutionProvider supports the 4-bit MatMulNBits contrib op
session = ort.InferenceSession("model.int4.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Last hidden state -> last-token pooling -> L2 normalisation
lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]
seq_lens = mask.sum(axis=1) - 1
embeddings = lhs[np.arange(len(texts)), seq_lens]
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
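Because the embeddings come out L2-normalised, cosine similarity reduces to a plain dot product. A self-contained sketch using random stand-ins for the `(2, 1024)` embeddings produced above:

```python
import numpy as np

# Stand-in for the (2, 1024) L2-normalised embeddings from the snippet above.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just a matrix product.
sim = embeddings @ embeddings.T
print(sim.shape)  # (2, 2); diagonal entries are ~1.0
```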
| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/F2LLM-v2-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/F2LLM-v2-0.6B-ONNX-INT8 | INT8 per-channel | 1.1 GB | Recommended |
| cstr/F2LLM-v2-0.6B-ONNX-INT4 | INT4 MatMulNBits | 0.9 GB | This repo; minimum RAM |
| cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL | INT8 incl. embeddings | 0.6 GB | Smallest file |
```bibtex
@misc{f2llm-v2,
  title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
  author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  year={2026},
  eprint={2603.19223},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.19223},
}
```
Apache 2.0, same as codefuse-ai/F2LLM-v2-0.6B.
Base model: Qwen/Qwen3-0.6B-Base