# Octen-Embedding-0.6B — INT4-Full ONNX (INT4 MatMul + INT8 Gather)

The smallest variant of cstr/Octen-Embedding-0.6B-ONNX at ~434 MB: both the transformer weights (MatMul) and the embedding table (Gather) are quantized.

## Quantization details

| Property | Value |
|----------|-------|
| Method | Two-pass: `quantize_dynamic` (Gather → INT8) + `MatMulNBitsQuantizer` (MatMul → INT4) |
| Pass 1 | `quantize_dynamic(op_types_to_quantize=["Gather"], weight_type=QInt8)` — INT8 embedding table |
| Pass 2 | `MatMulNBitsQuantizer(bits=4, block_size=32, is_symmetric=True)` — INT4 transformer weights |
| Op (MatMul) | `MatMulNBits` contrib op (ORT ≥ 1.16) |
| Op (Gather) | Standard `Gather` over the INT8 table (weight-only dynamic quantization) |
| Input | FP32 dynamo export (cstr/Octen-Embedding-0.6B-ONNX) |
| Dynamic batch | ✓ batch = 1, 2, 4, 8, … |
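
Pass 2's `block_size=32, is_symmetric=True` means each run of 32 weights shares a single scale with no zero point. The sketch below illustrates the idea in NumPy; it is a simplification, not ORT's actual `MatMulNBits` implementation (ORT packs two 4-bit values per byte and uses its own range and rounding conventions — here a symmetric [-7, 7] range is assumed for clarity):

```python
import numpy as np

def quantize_block_int4_symmetric(w, block_size=32):
    """Symmetric 4-bit block quantization: one scale per block, no zero point.
    Simplified illustration -- not ORT's exact MatMulNBits packing."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map block absmax -> 7
    scales = np.where(scales == 0, 1.0, scales)           # guard all-zero blocks
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, s = quantize_block_int4_symmetric(w)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())  # per-element error is bounded by half a scale step
```

The small block size (32 vs. the common 128) keeps each scale local to few weights, which limits quantization error at the cost of slightly more scale storage.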

Quantizing the Gather shrinks the 621 MB FP32 embedding table to ~155 MB (INT8), and quantizing MatMul to INT4 (block_size=32) shrinks the transformer weights to ~275 MB. Combined total: **434 MB**, versus 896 MB for the INT4-MatMul-only variant.
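
The arithmetic can be sanity-checked with a back-of-envelope estimate. The vocabulary size (151,936, Qwen-style) and the ~440M non-embedding parameter count below are assumptions for illustration — verify them against the actual tokenizer and model config. The INT4 estimate also undershoots the reported ~275 MB because layer norms and other small tensors stay FP32:

```python
# Back-of-envelope size estimates for the two quantization passes.
# Assumption: vocab_size=151_936 (Qwen-style tokenizer) -- check tokenizer.json.
vocab_size, hidden = 151_936, 1024

emb_params = vocab_size * hidden
fp32_mb = emb_params * 4 / 1e6   # 4 bytes per FP32 weight
int8_mb = emb_params * 1 / 1e6   # 1 byte per INT8 weight
print(f"embedding table: {fp32_mb:.0f} MB fp32 -> {int8_mb:.0f} MB int8")

# INT4 with block_size=32: 0.5 bytes/weight plus one fp16 scale per 32-weight block.
matmul_params = 440e6            # assumed non-embedding weight count
int4_mb = (matmul_params * 0.5 + matmul_params / 32 * 2) / 1e6
print(f"transformer weights: ~{int4_mb:.0f} MB int4 (plus unquantized FP32 ops)")
```
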

## Model details

| Property | Value |
|----------|-------|
| Base model | cstr/Octen-Embedding-0.6B |
| Architecture | Decoder (last-token pooling + L2 normalisation) |
| Embedding dim | 1024 |
| Max context | 8,192 tokens |
| File size | ~434 MB (`model.int4_full.onnx` + `model.int4_full.onnx.data`) |

## Inference

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
# Right padding is required: the last-token pooling below indexes the final
# non-padded position of each sequence.
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.int4_full.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc  = tokenizer.encode_batch(texts)
ids  = np.array([e.ids            for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Last hidden states: (batch, seq_len, 1024)
last_hidden = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]

# Last-token pooling: take the hidden state at each sequence's final real token.
seq_lens   = mask.sum(axis=1) - 1
embeddings = last_hidden[np.arange(len(texts)), seq_lens]

# L2-normalise so cosine similarity reduces to a dot product.
norms      = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
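
Because the embeddings come out L2-normalised, cosine similarity between texts is just a dot product (`embeddings @ embeddings.T`). A self-contained sketch, using synthetic unit vectors in place of real model output:

```python
import numpy as np

# Synthetic stand-ins for model output: 3 L2-normalised 1024-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# For unit vectors, cosine similarity == dot product.
sims = embeddings @ embeddings.T
print(np.round(sims, 3))   # diagonal is 1.0 (self-similarity)

# Rank all entries against the first vector, treated as the "query".
ranking = np.argsort(-sims[0])
print(ranking)             # the query itself (index 0) ranks first
```
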

## Variants

| Repo | Precision | Size | Notes |
|------|-----------|------|-------|
| cstr/Octen-Embedding-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/Octen-Embedding-0.6B-ONNX-INT8-FULL | INT8 MatMul + Gather | 0.6 GB | INT8 all ops |
| cstr/Octen-Embedding-0.6B-ONNX | INT4 MatMul only | 0.9 GB | INT4 weights, FP32 embeddings |
| cstr/Octen-Embedding-0.6B-ONNX-INT4-FULL | INT4 MatMul + INT8 Gather | 0.43 GB | This repo — smallest |

## License

Apache 2.0 — same as the base Octen model.
