Octen-Embedding-0.6B (Full INT8 Quantized ONNX)

This repository contains the fully INT8 dynamically quantized ONNX export of Octen/Octen-Embedding-0.6B.

Quantization Details

Base Model: Octen/Octen-Embedding-0.6B
Original ONNX Export: cstr/Octen-Embedding-0.6B-ONNX
Quantization Type: Dynamic INT8
Ops Quantized: MatMul and Gather. Quantizing Gather also covers the embedding table, which significantly reduces the total size.
Total Model Size: 599.5 MB (vs. ~1.0 GB for MatMul-only INT8 and ~2.2 GB for FP32).
Quantization Script: quantize_octen_int8_full.py (included in the repo).
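
The included quantize_octen_int8_full.py remains the authoritative script. As a rough illustration only, a dynamic INT8 pass over MatMul and Gather could look like the sketch below, which assumes ONNX Runtime's `quantize_dynamic` helper. The function name `quantize_full_int8` and the input path `model.onnx` are hypothetical, and the actual script may use different options.

```python
from pathlib import Path

# Op types named in the list above; quantizing Gather covers the embedding table.
QUANTIZED_OPS = ["MatMul", "Gather"]


def quantize_full_int8(fp32_model: str, int8_model: str) -> Path:
    """Dynamically quantize MatMul and Gather weights to INT8 (illustrative sketch)."""
    # Imported lazily so this module can be loaded without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_model,
        model_output=int8_model,
        op_types_to_quantize=QUANTIZED_OPS,
        weight_type=QuantType.QInt8,
        # Keep large weight tensors in an external data file next to the graph.
        use_external_data_format=True,
    )
    return Path(int8_model)


# Example call (hypothetical FP32 input path):
# quantize_full_int8("model.onnx", "model.int8_full.onnx")
```

Restricting `op_types_to_quantize` to MatMul alone would correspond to the MatMul-only variant mentioned in the size comparison.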

Files Included

model.int8_full.onnx: The main quantized ONNX graph.
model.int8_full.onnx.data: External quantized weights (approx. 570 MB).
quantize_octen_int8_full.py: The script used for quantization.
tokenizer.json, config.json, etc.: Standard configuration and tokenizer files.

Note for inference: This model requires ONNX Runtime. To obtain the final embedding as intended by CrispSorter, post-process the model output as follows:

Apply last-token pooling (taking the embedding of the last non-padding token).
Apply L2 normalization.
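
The two post-processing steps above can be sketched with NumPy as follows. This is a minimal sketch, assuming the model output is a hidden-state array of shape (batch, seq_len, dim) and that the tokenizer supplies a 0/1 attention mask. The function names are illustrative and not part of this repository.

```python
import numpy as np


def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token in each sequence.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    seq_len = attention_mask.shape[1]
    # Index of the last 1 in each row; works for left- and right-padded batches.
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean length (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def embed_from_hidden(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Last-token pooling followed by L2 normalization."""
    return l2_normalize(last_token_pool(hidden, attention_mask))
```

In practice, `hidden` would be the output of an ONNX Runtime session run on the tokenized batch, and `attention_mask` the mask produced by the tokenizer for that same batch.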

Original Model Info: Octen-Embedding-0.6B

Octen-Embedding-0.6B is a text embedding model developed by Octen for semantic search and retrieval tasks. This model is fine-tuned from Qwen/Qwen3-Embedding-0.6B and supports multiple languages, providing high-quality embeddings for various applications.

Key Highlights

🥇 RTEB Leaderboard Champion (as of January 12, 2026)

Octen-Embedding-8B ranks #1 on the RTEB Leaderboard with a Mean (Task) score of 0.8045
Excellent performance on both Public (0.7953) and Private (0.8157) datasets
Demonstrates true generalization capability without overfitting to public benchmarks

Industry-Oriented Vertical Domain Expertise

Legal: Legal document retrieval
Finance: Financial reports, Q&A, and personal finance content
Healthcare: Medical Q&A, clinical dialogues, and health consultations
Code: Programming problems, code search, and SQL queries

Ultra-Long Context Support

Supports up to 32,768 tokens context length
Suitable for processing long documents in legal, healthcare, and other domains
High-dimensional embedding space for rich semantic representation

Multilingual Capability

Supports 100+ languages
Includes various programming languages
Strong multilingual, cross-lingual, and code retrieval capabilities

Open Source Model List

Model Type	Model	Size	Max Tokens	Embedding Dimensions	HuggingFace Link
Text Embedding	Octen-Embedding-0.6B	0.6B	32,768	1024	✅ Available
Text Embedding	Octen-Embedding-4B	4.0B	32,768	2560	✅ Available
Text Embedding	Octen-Embedding-8B	7.6B	32,768	4096	✅ Available

Model Family Design:

Octen-Embedding-8B: Best performance, RTEB #1, for high-precision retrieval
Octen-Embedding-4B: Best in 4B category, balanced performance and efficiency
Octen-Embedding-0.6B: Lightweight deployment, suitable for edge devices and resource-constrained environments

For API access, deployment solutions, and technical documentation, visit octen.ai.

Experimental Results

RTEB Leaderboard (Overall Performance)

Model	Embedding Dim	Max Tokens	Mean (Public)	Mean (Private)	Mean (Task)
Octen-Embedding-8B	4096	32768	0.7953	0.8157	0.8045
voyage-3-large	1024	32000	0.7434	0.8277	0.7812
gemini-embedding-001	3072	2048	0.7218	0.8075	0.7602
Octen-Embedding-4B	2560	32768	0.7747	0.7942	0.7834
MoD-Embedding	2560	32768	0.7642	0.7900	0.7758
Qwen3-Embedding-8B	4096	32768	0.7310	0.7838	0.7547
Octen-Embedding-0.6B	1024	32768	0.7241	-	-
voyage-3.5	1024	32000	0.7139	0.8102	0.7571
Cohere-embed-v4.0	1536	128000	0.6534	0.7943	0.7166
jina-embeddings-v4	2048	32768	0.6652	0.7664	0.7105
GritLM-7B	4096	32768	0.6187	0.7385	0.6724
text-embedding-3-large	3072	8191	0.6110	0.7130	0.6567
e5-mistral-7b-instruct	4096	32768	0.5090	0.7091	0.5987
NV-Embed-v2	4096	32768	0.5805	0.6691	0.6203
snowflake-arctic-embed-l-v2.0	1024	8192	0.5395	0.7079	0.6150
multilingual-e5-large-instruct	1024	514	0.5478	0.6859	0.6097
gte-multilingual-base	768	8192	0.5291	0.6697	0.5921
text-embedding-3-small	1536	8191	0.5260	0.6630	0.5874
bge-m3	1024	8194	0.5216	0.6726	0.5893
Qwen3-Embedding-4B	2560	32768	-	0.7711	-
Qwen3-Embedding-0.6B	1024	32768	-	0.7117	-

Model Details

Base Model: Qwen/Qwen3-Embedding-0.6B
Model Size: 0.6B parameters
Max Sequence Length: 32,768 tokens
Embedding Dimension: 1024
Languages: English, Chinese, and other languages (multilingual)
Training Method: LoRA fine-tuning

Usage (Original PyTorch)

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Octen/Octen-Embedding-0.6B")

# Encode sentences
sentences = [
    "This is an example sentence",
    "Each sentence is converted to a vector"
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 1024)
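
Since the model targets semantic search and retrieval, the resulting embeddings are typically compared by cosine similarity. The following is a minimal sketch of ranking documents against a query, assuming the vectors come from `model.encode` as above. The helper name `rank_by_cosine` is hypothetical.

```python
import numpy as np


def rank_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray) -> list[tuple[int, float]]:
    """Return (document index, cosine similarity) pairs, best match first."""
    # Normalize both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending similarity.
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]
```

For example, `rank_by_cosine(embeddings[0], embeddings[1:])` would rank the remaining sentences by similarity to the first one.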

License

This model is licensed under the Apache License 2.0.

This model is derived from Qwen/Qwen3-Embedding-0.6B, which is also licensed under Apache License 2.0.

Paper

For more details, please refer to our blog post: Octen Series: Optimizing Embedding Models to #1 on RTEB Leaderboard

Citation

If you find our work helpful, please consider citing:

@misc{octen2025rteb,
  title={Octen Series: Optimizing Embedding Models to #1 on RTEB Leaderboard},
  author={Octen Team},
  year={2025},
  url={https://octen-team.github.io/octen_blog/posts/octen-rteb-first-place/}
}
