# Octen-Embedding-0.6B — INT4-Full ONNX (INT4 MatMul + INT8 Gather)
Smallest variant of cstr/Octen-Embedding-0.6B-ONNX at ~434 MB — both transformer weights (MatMul) and the embedding table (Gather) are quantized.
## Quantization details
| Property | Value |
|---|---|
| Method | Two-pass: quantize_dynamic (Gather→INT8) + MatMulNBitsQuantizer (MatMul→INT4) |
| Pass 1 | quantize_dynamic(op_types_to_quantize=["Gather"], weight_type=QInt8) — INT8 embedding table |
| Pass 2 | MatMulNBitsQuantizer(bits=4, block_size=32, is_symmetric=True) — INT4 transformer weights |
| Op (MatMul) | MatMulNBits contrib op (ORT ≥ 1.16) |
| Op (Gather) | INT8 weight-only Gather (via `quantize_dynamic`) |
| Input | FP32 dynamo export (cstr/Octen-Embedding-0.6B-ONNX) |
| Dynamic batch | ✓ batch = 1, 2, 4, 8, … |
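The two passes in the table can be sketched with ONNX Runtime's quantization tools. The module path for the INT4 quantizer is an assumption here — in some ORT releases the class is `MatMul4BitsQuantizer` under `onnxruntime.quantization.matmul_4bits_quantizer` — so treat this as a sketch of the scheme, not the exact script used to produce this repo:

```python
def quantize_two_pass(fp32_path: str, out_path: str) -> None:
    """Sketch of the two-pass scheme: INT8 Gather, then INT4 MatMul."""
    # Imports kept local: the INT4 module path differs between ORT releases.
    import onnx
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization.matmul_nbits_quantizer import (
        MatMulNBitsQuantizer,
    )

    # Pass 1: weight-only INT8 on the embedding table (Gather ops only).
    int8_path = fp32_path.replace(".onnx", ".int8_gather.onnx")
    quantize_dynamic(
        fp32_path,
        int8_path,
        op_types_to_quantize=["Gather"],
        weight_type=QuantType.QInt8,
    )

    # Pass 2: block-wise symmetric INT4 on all MatMul weights,
    # emitted as MatMulNBits contrib ops.
    model = onnx.load(int8_path)
    quantizer = MatMulNBitsQuantizer(model, block_size=32, is_symmetric=True)
    quantizer.process()
    quantizer.model.save_model_to_file(out_path, use_external_data_format=True)
```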
Quantizing the Gather reduces the 621 MB FP32 embedding table to ~155 MB (INT8). Quantizing MatMul to INT4 (block_size=32) gives ~275 MB. Combined total: **434 MB** vs 896 MB for INT4-MatMul-only.
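As a rough back-of-the-envelope check on those numbers (the ~0.44 B non-embedding parameter count and the FP32 per-block scales are assumptions, not stated in the card):

```python
# INT8 Gather: 1 byte per entry instead of 4 → the table shrinks 4x.
fp32_table_mb = 621.0               # from the card
int8_table_mb = fp32_table_mb / 4   # ≈ 155 MB

# INT4 MatMul, block_size=32, symmetric: 32 weights pack into 16 bytes,
# plus one FP32 scale per block and no zero-point (assumed).
bytes_per_weight = 0.5 + 4 / 32     # = 0.625
matmul_params = 0.44e9              # assumed non-embedding parameter count
int4_matmul_mb = matmul_params * bytes_per_weight / 1e6  # ≈ 275 MB

total_mb = int8_table_mb + int4_matmul_mb  # ≈ 430 MB, near the 434 MB file
```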
## Model details
| Property | Value |
|---|---|
| Base model | cstr/Octen-Embedding-0.6B |
| Architecture | Decoder (last-token pooling + L2 normalisation) |
| Embedding dim | 1024 |
| Max context | 8,192 tokens |
| File size | ~434 MB (model.int4_full.onnx + model.int4_full.onnx.data) |
## Inference

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession(
    "model.int4_full.onnx", providers=["CPUExecutionProvider"]
)

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# (batch, seq, 1024) last hidden state
last_hidden = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]

# Last-token pooling: with right padding, the last real token sits at
# index (number of non-pad tokens - 1).
seq_lens = mask.sum(axis=1) - 1
embeddings = last_hidden[np.arange(len(texts)), seq_lens]

# L2-normalise
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)

print(embeddings.shape)  # (2, 1024)
```
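Because the embeddings come out L2-normalised, cosine similarity reduces to a plain dot product; a minimal check with synthetic unit vectors standing in for real model output:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 1024))                   # stand-in for real embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm, as the snippet produces

sims = emb @ emb.T  # cosine similarity matrix; diagonal is 1.0
```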
## Variants
| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/Octen-Embedding-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/Octen-Embedding-0.6B-ONNX-INT8-FULL | INT8 MatMul+Gather | 0.6 GB | INT8 all ops |
| cstr/Octen-Embedding-0.6B-ONNX | INT4 MatMul only | 0.9 GB | INT4 weights, FP32 embeddings |
| cstr/Octen-Embedding-0.6B-ONNX-INT4-FULL | INT4 MatMul + INT8 Gather | 0.43 GB | This repo — smallest |
## License
Apache 2.0 — same as the base Octen model.