# Octen-Embedding-0.6B — INT4-Full ONNX (INT4 MatMul + INT8 Gather)

The smallest variant of cstr/Octen-Embedding-0.6B-ONNX at ~434 MB: both the transformer weights (MatMul) and the embedding table (Gather) are quantized.

## Quantization details

| Property | Value |
|----------|-------|
| Method | Two-pass: `quantize_dynamic` (Gather → INT8) + `MatMulNBitsQuantizer` (MatMul → INT4) |
| Pass 1 | `quantize_dynamic(op_types_to_quantize=["Gather"], weight_type=QInt8)` — INT8 embedding table |
| Pass 2 | `MatMulNBitsQuantizer(bits=4, block_size=32, is_symmetric=True)` — INT4 transformer weights |
| Op (MatMul) | `MatMulNBits` contrib op (ORT ≥ 1.16) |
| Op (Gather) | Standard `Gather` over the INT8 table (weight-only dynamic quantization) |
| Input | FP32 dynamo export (cstr/Octen-Embedding-0.6B-ONNX) |
| Dynamic batch | ✓ batch = 1, 2, 4, 8, … |
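
Pass 2's `block_size=32, is_symmetric=True` means each run of 32 weights shares a single scale with no zero point. The sketch below illustrates the idea in NumPy; it is a simplification, not ORT's actual `MatMulNBits` implementation (ORT packs two 4-bit values per byte and uses its own range and rounding conventions — here a symmetric [-7, 7] range is assumed for clarity):

```python
import numpy as np

def quantize_block_int4_symmetric(w, block_size=32):
    """Symmetric 4-bit block quantization: one scale per block, no zero point.
    Simplified illustration -- not ORT's exact MatMulNBits packing."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map block absmax -> 7
    scales = np.where(scales == 0, 1.0, scales)           # guard all-zero blocks
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, s = quantize_block_int4_symmetric(w)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())  # per-element error is bounded by half a scale step
```

The small block size (32 vs. the common 128) keeps each scale local to few weights, which limits quantization error at the cost of slightly more scale storage.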

Quantizing the Gather shrinks the 621 MB FP32 embedding table to ~155 MB (INT8), and quantizing MatMul to INT4 (block_size=32) shrinks the transformer weights to ~275 MB. Combined total: **434 MB**, versus 896 MB for the INT4-MatMul-only variant.
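
The arithmetic can be sanity-checked with a back-of-envelope estimate. The vocabulary size (151,936, Qwen-style) and the ~440M non-embedding parameter count below are assumptions for illustration — verify them against the actual tokenizer and model config. The INT4 estimate also undershoots the reported ~275 MB because layer norms and other small tensors stay FP32:

```python
# Back-of-envelope size estimates for the two quantization passes.
# Assumption: vocab_size=151_936 (Qwen-style tokenizer) -- check tokenizer.json.
vocab_size, hidden = 151_936, 1024

emb_params = vocab_size * hidden
fp32_mb = emb_params * 4 / 1e6   # 4 bytes per FP32 weight
int8_mb = emb_params * 1 / 1e6   # 1 byte per INT8 weight
print(f"embedding table: {fp32_mb:.0f} MB fp32 -> {int8_mb:.0f} MB int8")

# INT4 with block_size=32: 0.5 bytes/weight plus one fp16 scale per 32-weight block.
matmul_params = 440e6            # assumed non-embedding weight count
int4_mb = (matmul_params * 0.5 + matmul_params / 32 * 2) / 1e6
print(f"transformer weights: ~{int4_mb:.0f} MB int4 (plus unquantized FP32 ops)")
```
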

## Model details

| Property | Value |
|----------|-------|
| Base model | cstr/Octen-Embedding-0.6B |
| Architecture | Decoder (last-token pooling + L2 normalisation) |
| Embedding dim | 1024 |
| Max context | 8,192 tokens |
| File size | ~434 MB (`model.int4_full.onnx` + `model.int4_full.onnx.data`) |

## Inference

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
# Right padding is required: the last-token pooling below indexes the final
# non-padded position of each sequence.
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.int4_full.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc  = tokenizer.encode_batch(texts)
ids  = np.array([e.ids            for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Last hidden states: (batch, seq_len, 1024)
last_hidden = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]

# Last-token pooling: take the hidden state at each sequence's final real token.
seq_lens   = mask.sum(axis=1) - 1
embeddings = last_hidden[np.arange(len(texts)), seq_lens]

# L2-normalise so cosine similarity reduces to a dot product.
norms      = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
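
Because the embeddings come out L2-normalised, cosine similarity between texts is just a dot product (`embeddings @ embeddings.T`). A self-contained sketch, using synthetic unit vectors in place of real model output:

```python
import numpy as np

# Synthetic stand-ins for model output: 3 L2-normalised 1024-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# For unit vectors, cosine similarity == dot product.
sims = embeddings @ embeddings.T
print(np.round(sims, 3))   # diagonal is 1.0 (self-similarity)

# Rank all entries against the first vector, treated as the "query".
ranking = np.argsort(-sims[0])
print(ranking)             # the query itself (index 0) ranks first
```
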

## Variants

| Repo | Precision | Size | Notes |
|------|-----------|------|-------|
| cstr/Octen-Embedding-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/Octen-Embedding-0.6B-ONNX-INT8-FULL | INT8 MatMul + Gather | 0.6 GB | INT8 all ops |
| cstr/Octen-Embedding-0.6B-ONNX | INT4 MatMul only | 0.9 GB | INT4 weights, FP32 embeddings |
| cstr/Octen-Embedding-0.6B-ONNX-INT4-FULL | INT4 MatMul + INT8 Gather | 0.43 GB | This repo — smallest |

## License

Apache 2.0 — same as the base Octen model.
