---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - feature-extraction
  - sentence-similarity
  - mteb
  - transformers
  - transformers.js
model-index:
  - name: orange-nomic-v1.5-1536
    results:
      - task:
          type: Retrieval
        dataset:
          type: mteb/arguana
          name: MTEB ArguAna
          config: default
          split: test
          revision: None
        metrics:
          - type: map_at_1
            value: 24.253
          - type: map_at_10
            value: 38.962
          - type: map_at_100
            value: 40.081
      - task:
          type: STS
        dataset:
          type: mteb/biosses-sts
          name: MTEB BIOSSES
          config: default
          split: test
          revision: d3fb88f8f02e40887cd149695127462bbcf29b4a
        metrics:
          - type: cos_sim_pearson
            value: 86.73980520022269
          - type: cos_sim_spearman
            value: 84.24649792685918
      - task:
          type: Classification
        dataset:
          type: mteb/banking77
          name: MTEB Banking77Classification
          config: default
          split: test
          revision: 0fd18e25b25c072e09e0d92ab615fda904d66300
        metrics:
          - type: accuracy
            value: 84.25324675324674
          - type: f1
            value: 84.17872280892557
      - task:
          type: Classification
        dataset:
          type: mteb/imdb
          name: MTEB ImdbClassification
          config: default
          split: test
          revision: 3d86128a09e091d6018b6d26cad27f2739fc2db7
        metrics:
          - type: accuracy
            value: 85.312
          - type: ap
            value: 80.36296867333715
          - type: f1
            value: 85.26613311552218
      - task:
          type: Retrieval
        dataset:
          type: mteb/msmarco
          name: MTEB MSMARCO
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 23.363999999999997
          - type: map_at_10
            value: 35.711999999999996
          - type: map_at_100
            value: 36.876999999999995
      - task:
          type: Retrieval
        dataset:
          type: mteb/quora
          name: MTEB QuoraRetrieval
          config: default
          split: test
          revision: None
        metrics:
          - type: map_at_1
            value: 70.402
          - type: map_at_10
            value: 84.181
          - type: map_at_100
            value: 84.796
license: apache-2.0
language:
  - en
---

# Orange/nomic-embed-text-v1.5: 1536-Dimensional Embedding Model

A high-performance embedding model from the Orange organization, built by extending nomic-ai/nomic-embed-text-v1.5 to 1536 dimensions using a learnable linear projection.

## Overview

This model is a modified version of Nomic Embed v1.5 (itself an improvement over the original Nomic Embed). The key change is a fixed linear projection from the native 768-dimensional space to a 1536-dimensional space that preserves cosine similarity.

## Architecture

The Orange/nomic-embed-text-v1.5 model uses a three-stage pipeline:

Transformer (768-dim) → Pooling → Dense Projection (1536-dim)

- **Base Model**: nomic-ai/nomic-embed-text-v1.5 (a BERT variant with rotary position embeddings and SwiGLU activations)
- **Projection Method**: a fixed linear layer with weight matrix of shape (1536 × 768)
  - Top 768 rows: √2 · I (scales the original dimensions by √2)
  - Bottom 768 rows: zeros (zero-padding)
- **Result**: cosine similarity is preserved while the dimension doubles
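The similarity-preservation claim can be checked numerically: the projection uniformly scales the original coordinates and pads with zeros, so angles between vectors are unchanged. A small NumPy sketch (illustrative only, not the model's actual code path):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = 768
# Build the (1536 x 768) projection: sqrt(2)*I on top, zeros below
W = np.vstack([np.sqrt(2) * np.eye(d), np.zeros((d, d))])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
px, py = W @ x, W @ y  # 1536-dim projections

# Cosine similarity survives the projection exactly (up to float error)
assert np.isclose(cosine(x, y), cosine(px, py))
```

In practice the zero rows mean the second half of every embedding carries no information; the doubling exists purely for dimensional compatibility.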

## Key Properties

- **Embedding Dimension**: 1536
- **Sequence Length**: 8192 tokens (supports long contexts)
- **Similarity Metric**: cosine similarity, preserved from the base model
- **Matryoshka**: supports adjustable embedding dimensions (Matryoshka Representation Learning)

## Usage

### Important: Task Instruction Prefix

The model requires a task instruction prefix in the input text; the prefix tells the model which task you are performing.

### For RAG (Retrieval-Augmented Generation)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Embed documents
documents = ['search_document: The quick brown fox jumps over the lazy dog']
doc_embeddings = model.encode(documents)

# Embed queries
queries = ['search_query: What animal is in the sentence?']
query_embeddings = model.encode(queries)
```
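Retrieval then ranks documents by cosine similarity between query and document embeddings. A minimal NumPy version of the scoring step, using random stand-in vectors in place of real model outputs:

```python
import numpy as np

def cos_sim_matrix(q: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between query rows and document rows."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    dn = d / np.linalg.norm(d, axis=1, keepdims=True)
    return qn @ dn.T

# Stand-ins for query_embeddings / doc_embeddings
q = np.random.default_rng(2).standard_normal((1, 1536))
d = np.random.default_rng(3).standard_normal((3, 1536))

scores = cos_sim_matrix(q, d)     # shape (1, 3): one row of scores per query
best = int(scores.argmax())       # index of the best-matching document
```

With real embeddings, `sentence_transformers.util.cos_sim` computes the same matrix directly on the tensors returned by `model.encode`.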

### Available Task Prefixes

| Prefix | Purpose |
|---|---|
| `search_document` | Embed texts as documents for indexing (e.g., RAG) |
| `search_query` | Embed texts as queries to find relevant documents |
| `clustering` | Embed texts for grouping into clusters |
| `classification` | Embed texts as features for classification |
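Because the prefix is easy to forget when embedding calls are scattered across a codebase, it can help to centralize prefixing in one place. The helper below is hypothetical (not part of any library), just a sketch of that pattern:

```python
def with_prefix(task: str, texts: list[str]) -> list[str]:
    """Prepend a nomic task prefix to each text, validating the task name."""
    allowed = {"search_document", "search_query", "clustering", "classification"}
    if task not in allowed:
        raise ValueError(f"unknown task prefix: {task!r}")
    return [f"{task}: {t}" for t in texts]

docs = with_prefix("search_document",
                   ["The quick brown fox jumps over the lazy dog"])
print(docs[0])  # search_document: The quick brown fox jumps over the lazy dog
```

The validated set matches the four prefixes in the table above; passing the result straight to `model.encode` keeps the prefixing logic in a single function.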

## Python Examples

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Encode sentences
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Optional: apply layer normalization and truncate for Matryoshka
matryoshka_dim = 768  # any dimension <= 1536
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # torch.Size([2, 768])
```

### Using Transformers Directly

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

model_name = "Orange/orange-nomic-v1.5-1536"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
```

Note: if the 1536-dim Dense projection is packaged as a separate sentence-transformers module rather than inside the transformer weights, `AutoModel` loads only the base transformer and the pooled output here will be 768-dimensional; in that case the projection must be applied manually.
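When you do end up with 768-dimensional pooled outputs and need the 1536-dim space, the fixed projection described in the Architecture section can be applied without materializing the full weight matrix. This is a sketch under the assumption that the projection really is a √2-scaled identity over zero padding:

```python
import numpy as np

def project_to_1536(emb768: np.ndarray) -> np.ndarray:
    """Apply the fixed projection: scale by sqrt(2), then zero-pad to 1536 dims."""
    scaled = np.sqrt(2.0) * emb768
    return np.concatenate([scaled, np.zeros_like(emb768)], axis=1)

e = np.random.default_rng(1).standard_normal((2, 768))  # stand-in embeddings
e1536 = project_to_1536(e)
print(e1536.shape)  # (2, 1536)
```

Because cosine similarity is scale-invariant and the padded half is zero, scores computed in the 1536-dim space match those computed at 768 dimensions.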

## Adjusting Dimensionality (Matryoshka)

This model supports Matryoshka Representation Learning: you can use smaller embedding dimensions.

| Dimension | Use Case |
|---|---|
| 1536 | Full size (default) |
| 768 | Half the dimensions, ~same quality |
| 512 | Good quality, 3x compression |
| 256 | High compression, minimal quality loss |
| 128 | Maximum compression |

Example with 512 dimensions:

```python
embeddings = embeddings[:, :512]  # Truncate to 512 dimensions
```
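The full recipe used in the examples above is layer-norm, slice, then L2-normalize. A NumPy sketch on stand-in embeddings, mirroring the torch calls (`F.layer_norm` without learned scale/shift, then `F.normalize`):

```python
import numpy as np

def matryoshka_truncate(emb: np.ndarray, dim: int) -> np.ndarray:
    """Layer-normalize, truncate to `dim` features, then L2-normalize each row."""
    mu = emb.mean(axis=1, keepdims=True)
    sigma = emb.std(axis=1, keepdims=True)       # biased std, as in layer norm
    emb = (emb - mu) / (sigma + 1e-12)
    emb = emb[:, :dim]                           # keep the leading `dim` features
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

emb = np.random.default_rng(0).standard_normal((2, 1536))  # stand-in embeddings
out = matryoshka_truncate(emb, 512)
print(out.shape)  # (2, 512)
```

Normalizing after the slice matters: truncation changes row norms, so cosine similarity on the truncated vectors is only well-behaved once they are re-normalized.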

## Model Performance

### MTEB Benchmark Results

| Task | Dataset | Metric | Score |
|---|---|---|---|
| Retrieval | ArguAna | MAP@100 | 40.081 |
| STS | BIOSSES | Cosine Spearman | 84.25 |
| Classification | Banking77 | Accuracy | 84.25 |
| Classification | IMDB | Accuracy | 85.31 |
| Retrieval | MSMARCO | MAP@100 | 36.88 |
| Retrieval | Quora | MAP@100 | 84.80 |

See the model card on HuggingFace for the complete MTEB leaderboard results.

## Differences from Base Model

| Property | nomic-embed-text-v1.5 | Orange/nomic-embed-text-v1.5 |
|---|---|---|
| Dimension | 768 | 1536 |
| Cosine Similarity | Native | Preserved via projection |
| Matryoshka | Supported | Supported |
| Use Case | General embedding | Higher-dim applications |

## Use Cases

This 1536-dimensional model is particularly useful for:

- Applications requiring higher-dimensional embeddings
- Maintaining compatibility with existing 1536-dim workflows
- Scenarios where the extra dimensionality provides marginal benefits
- Experiments comparing different embedding dimensions

## Citation

If you use this model in your research, please cite the original Nomic Embed work:

```bibtex
@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

This model is licensed under Apache 2.0.