---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - feature-extraction
  - sentence-similarity
  - mteb
  - transformers
  - transformers.js
model-index:
  - name: orange-nomic-v1.5-1536
    results:
      - task:
          type: Retrieval
        dataset:
          type: mteb/arguana
          name: MTEB ArguAna
          config: default
          split: test
          revision: None
        metrics:
          - type: map_at_1
            value: 24.253
          - type: map_at_10
            value: 38.962
          - type: map_at_100
            value: 40.081
      - task:
          type: STS
        dataset:
          type: mteb/biosses-sts
          name: MTEB BIOSSES
          config: default
          split: test
          revision: d3fb88f8f02e40887cd149695127462bbcf29b4a
        metrics:
          - type: cos_sim_pearson
            value: 86.73980520022269
          - type: cos_sim_spearman
            value: 84.24649792685918
      - task:
          type: Classification
        dataset:
          type: mteb/banking77
          name: MTEB Banking77Classification
          config: default
          split: test
          revision: 0fd18e25b25c072e09e0d92ab615fda904d66300
        metrics:
          - type: accuracy
            value: 84.25324675324674
          - type: f1
            value: 84.17872280892557
      - task:
          type: Classification
        dataset:
          type: mteb/imdb
          name: MTEB ImdbClassification
          config: default
          split: test
          revision: 3d86128a09e091d6018b6d26cad27f2739fc2db7
        metrics:
          - type: accuracy
            value: 85.312
          - type: ap
            value: 80.36296867333715
          - type: f1
            value: 85.26613311552218
      - task:
          type: Retrieval
        dataset:
          type: mteb/msmarco
          name: MTEB MSMARCO
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 23.363999999999997
          - type: map_at_10
            value: 35.711999999999996
          - type: map_at_100
            value: 36.876999999999995
      - task:
          type: Retrieval
        dataset:
          type: mteb/quora
          name: MTEB QuoraRetrieval
          config: default
          split: test
          revision: None
        metrics:
          - type: map_at_1
            value: 70.402
          - type: map_at_10
            value: 84.181
          - type: map_at_100
            value: 84.796
license: apache-2.0
language:
  - en
---

# Orange/nomic-embed-text-v1.5: 1536-Dimensional Embedding Model

A high-performance embedding model from the Orange organization, built by extending nomic-ai/nomic-embed-text-v1.5 to 1536 dimensions using a learnable linear projection.

## Overview

This model is a modified version of Nomic Embed v1.5 (itself an improvement over the original Nomic Embed). The key change is a fixed linear projection from the native 768-dimensional space to a 1536-dimensional space that preserves cosine similarity.

## Architecture

The Orange/nomic-embed-text-v1.5 model uses a three-stage pipeline:

Transformer (768-dim) → Pooling → Dense Projection (1536-dim)

- **Base Model**: nomic-ai/nomic-embed-text-v1.5 (a BERT variant with rotary position embeddings and SwiGLU activations)
- **Projection Method**: a fixed linear layer with weight matrix of shape (1536 × 768)
  - Top 768 rows: √2 · I (scales the original dimensions by √2)
  - Bottom 768 rows: zeros (zero-padding)
- **Result**: cosine similarity is preserved while the dimension doubles
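The similarity-preservation claim can be checked numerically: the projection uniformly scales the original coordinates and pads with zeros, so angles between vectors are unchanged. A small NumPy sketch (illustrative only, not the model's actual code path):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = 768
# Build the (1536 x 768) projection: sqrt(2)*I on top, zeros below
W = np.vstack([np.sqrt(2) * np.eye(d), np.zeros((d, d))])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
px, py = W @ x, W @ y  # 1536-dim projections

# Cosine similarity survives the projection exactly (up to float error)
assert np.isclose(cosine(x, y), cosine(px, py))
```

In practice the zero rows mean the second half of every embedding carries no information; the doubling exists purely for dimensional compatibility.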

## Key Properties

- **Embedding Dimension**: 1536
- **Sequence Length**: 8192 tokens (supports long contexts)
- **Similarity Metric**: cosine similarity, preserved from the base model
- **Matryoshka**: supports adjustable embedding dimensions (Matryoshka Representation Learning)

## Usage

### Important: Task Instruction Prefix

The model requires a task instruction prefix in the input text; the prefix tells the model which task you are performing.

### For RAG (Retrieval-Augmented Generation)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Embed documents
documents = ['search_document: The quick brown fox jumps over the lazy dog']
doc_embeddings = model.encode(documents)

# Embed queries
queries = ['search_query: What animal is in the sentence?']
query_embeddings = model.encode(queries)
```
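Retrieval then ranks documents by cosine similarity between query and document embeddings. A minimal NumPy version of the scoring step, using random stand-in vectors in place of real model outputs:

```python
import numpy as np

def cos_sim_matrix(q: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between query rows and document rows."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    dn = d / np.linalg.norm(d, axis=1, keepdims=True)
    return qn @ dn.T

# Stand-ins for query_embeddings / doc_embeddings
q = np.random.default_rng(2).standard_normal((1, 1536))
d = np.random.default_rng(3).standard_normal((3, 1536))

scores = cos_sim_matrix(q, d)     # shape (1, 3): one row of scores per query
best = int(scores.argmax())       # index of the best-matching document
```

With real embeddings, `sentence_transformers.util.cos_sim` computes the same matrix directly on the tensors returned by `model.encode`.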

### Available Task Prefixes

| Prefix | Purpose |
|---|---|
| `search_document` | Embed texts as documents for indexing (e.g., RAG) |
| `search_query` | Embed texts as queries to find relevant documents |
| `clustering` | Embed texts for grouping into clusters |
| `classification` | Embed texts as features for classification |
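Because the prefix is easy to forget when embedding calls are scattered across a codebase, it can help to centralize prefixing in one place. The helper below is hypothetical (not part of any library), just a sketch of that pattern:

```python
def with_prefix(task: str, texts: list[str]) -> list[str]:
    """Prepend a nomic task prefix to each text, validating the task name."""
    allowed = {"search_document", "search_query", "clustering", "classification"}
    if task not in allowed:
        raise ValueError(f"unknown task prefix: {task!r}")
    return [f"{task}: {t}" for t in texts]

docs = with_prefix("search_document",
                   ["The quick brown fox jumps over the lazy dog"])
print(docs[0])  # search_document: The quick brown fox jumps over the lazy dog
```

The validated set matches the four prefixes in the table above; passing the result straight to `model.encode` keeps the prefixing logic in a single function.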

## Python Examples

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Encode sentences
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Optional: apply layer normalization and truncate for Matryoshka
matryoshka_dim = 768  # any dimension <= 1536
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # torch.Size([2, 768])
```

### Using Transformers Directly

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

model_name = "Orange/orange-nomic-v1.5-1536"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
```

Note: if the 1536-dim Dense projection is packaged as a separate sentence-transformers module rather than inside the transformer weights, `AutoModel` loads only the base transformer and the pooled output here will be 768-dimensional; in that case the projection must be applied manually.
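When you do end up with 768-dimensional pooled outputs and need the 1536-dim space, the fixed projection described in the Architecture section can be applied without materializing the full weight matrix. This is a sketch under the assumption that the projection really is a √2-scaled identity over zero padding:

```python
import numpy as np

def project_to_1536(emb768: np.ndarray) -> np.ndarray:
    """Apply the fixed projection: scale by sqrt(2), then zero-pad to 1536 dims."""
    scaled = np.sqrt(2.0) * emb768
    return np.concatenate([scaled, np.zeros_like(emb768)], axis=1)

e = np.random.default_rng(1).standard_normal((2, 768))  # stand-in embeddings
e1536 = project_to_1536(e)
print(e1536.shape)  # (2, 1536)
```

Because cosine similarity is scale-invariant and the padded half is zero, scores computed in the 1536-dim space match those computed at 768 dimensions.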

## Adjusting Dimensionality (Matryoshka)

This model supports Matryoshka Representation Learning: you can use smaller embedding dimensions.

| Dimension | Use Case |
|---|---|
| 1536 | Full size (default) |
| 768 | Half the dimensions, ~same quality |
| 512 | Good quality, 3x compression |
| 256 | High compression, minimal quality loss |
| 128 | Maximum compression |

Example with 512 dimensions:

```python
embeddings = embeddings[:, :512]  # Truncate to 512 dimensions
```
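The full recipe used in the examples above is layer-norm, slice, then L2-normalize. A NumPy sketch on stand-in embeddings, mirroring the torch calls (`F.layer_norm` without learned scale/shift, then `F.normalize`):

```python
import numpy as np

def matryoshka_truncate(emb: np.ndarray, dim: int) -> np.ndarray:
    """Layer-normalize, truncate to `dim` features, then L2-normalize each row."""
    mu = emb.mean(axis=1, keepdims=True)
    sigma = emb.std(axis=1, keepdims=True)       # biased std, as in layer norm
    emb = (emb - mu) / (sigma + 1e-12)
    emb = emb[:, :dim]                           # keep the leading `dim` features
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

emb = np.random.default_rng(0).standard_normal((2, 1536))  # stand-in embeddings
out = matryoshka_truncate(emb, 512)
print(out.shape)  # (2, 512)
```

Normalizing after the slice matters: truncation changes row norms, so cosine similarity on the truncated vectors is only well-behaved once they are re-normalized.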

## Model Performance

### MTEB Benchmark Results

| Task | Dataset | Metric | Score |
|---|---|---|---|
| Retrieval | ArguAna | MAP@100 | 40.081 |
| STS | BIOSSES | Cosine Spearman | 84.25 |
| Classification | Banking77 | Accuracy | 84.25 |
| Classification | IMDB | Accuracy | 85.31 |
| Retrieval | MSMARCO | MAP@100 | 36.88 |
| Retrieval | Quora | MAP@100 | 84.80 |

See the model card on HuggingFace for the complete MTEB leaderboard results.

## Differences from Base Model

| Property | nomic-embed-text-v1.5 | Orange/nomic-embed-text-v1.5 |
|---|---|---|
| Dimension | 768 | 1536 |
| Cosine Similarity | Native | Preserved via projection |
| Matryoshka | Supported | Supported |
| Use Case | General embedding | Higher-dim applications |

## Use Cases

This 1536-dimensional model is particularly useful for:

- Applications requiring higher-dimensional embeddings
- Maintaining compatibility with existing 1536-dim workflows
- Scenarios where the extra dimensionality provides marginal benefits
- Experiments comparing different embedding dimensions

## Citation

If you use this model in your research, please cite the original Nomic Embed work:

```bibtex
@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

This model is licensed under Apache 2.0.