ModernBERT Embed base Legal Matryoshka

This is a sentence-transformers model finetuned from nomic-ai/modernbert-embed-base on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: nomic-ai/modernbert-embed-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
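
The three modules above mean that an embedding is computed by running the text through ModernBERT, mean-pooling the token embeddings over non-padding tokens, and L2-normalizing the result. A minimal sketch of the equivalent manual computation with transformers (assuming the checkpoint also loads via AutoModel; model_id and the example sentence are illustrative):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "digo-prayudha/test-modernbert-embed-base-legal-matryoshka-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer(["An example sentence."], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, 768)

# Pooling: mean over non-padding tokens, using the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# Normalize: unit-length vectors, so dot product equals cosine similarity
embeddings = F.normalize(pooled, p=2, dim=1)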

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("digo-prayudha/test-modernbert-embed-base-legal-matryoshka-2")
# Run inference
sentences = [
    'are “broad and vague descriptions” of the information that make it impossible for the Court to \nconduct its requisite de novo review over the Department’s decision to withhold this information \nas “critical infrastructure security information.”  See Prop. of the People, Inc. v. Off. of Mgmt. & \nBudget, 330 F. Supp. 3d 373, 388 (D.D.C. 2018).',
    'What type of review is the Court unable to conduct due to broad and vague descriptions?',
    'What did the court express skepticism about?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4837, 0.0878],
#         [0.4837, 1.0000, 0.3375],
#         [0.0878, 0.3375, 1.0000]])
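
Because the model was trained with MatryoshkaLoss (see Training Details), its 768-dimensional embeddings can be truncated to 512, 256, 128, or 64 dimensions at a modest cost in retrieval quality, as the evaluation tables below show. A minimal sketch using the truncate_dim argument of SentenceTransformer:

from sentence_transformers import SentenceTransformer

# Keep only the first 256 Matryoshka dimensions of each embedding
model = SentenceTransformer(
    "digo-prayudha/test-modernbert-embed-base-legal-matryoshka-2",
    truncate_dim=256,
)
embeddings = model.encode(["What did the court express skepticism about?"])
print(embeddings.shape)
# (1, 256)

Note that truncation happens after the Normalize module, so truncated vectors are no longer unit-length; model.similarity still computes cosine similarity correctly.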

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.5425
cosine_accuracy@3 0.5981
cosine_accuracy@5 0.7079
cosine_accuracy@10 0.7713
cosine_precision@1 0.5425
cosine_precision@3 0.5126
cosine_precision@5 0.3966
cosine_precision@10 0.2362
cosine_recall@1 0.2031
cosine_recall@3 0.5176
cosine_recall@5 0.6473
cosine_recall@10 0.7621
cosine_ndcg@10 0.6606
cosine_mrr@10 0.5967
cosine_map@100 0.641

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.5224
cosine_accuracy@3 0.5765
cosine_accuracy@5 0.6862
cosine_accuracy@10 0.7713
cosine_precision@1 0.5224
cosine_precision@3 0.492
cosine_precision@5 0.3805
cosine_precision@10 0.234
cosine_recall@1 0.1949
cosine_recall@3 0.4992
cosine_recall@5 0.6256
cosine_recall@10 0.7563
cosine_ndcg@10 0.6452
cosine_mrr@10 0.5773
cosine_map@100 0.6225

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.5085
cosine_accuracy@3 0.5487
cosine_accuracy@5 0.6445
cosine_accuracy@10 0.7512
cosine_precision@1 0.5085
cosine_precision@3 0.4781
cosine_precision@5 0.3632
cosine_precision@10 0.2287
cosine_recall@1 0.1877
cosine_recall@3 0.4813
cosine_recall@5 0.5922
cosine_recall@10 0.7353
cosine_ndcg@10 0.6249
cosine_mrr@10 0.5591
cosine_map@100 0.6024

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.4389
cosine_accuracy@3 0.4838
cosine_accuracy@5 0.5688
cosine_accuracy@10 0.6615
cosine_precision@1 0.4389
cosine_precision@3 0.4173
cosine_precision@5 0.3255
cosine_precision@10 0.1995
cosine_recall@1 0.1601
cosine_recall@3 0.415
cosine_recall@5 0.5274
cosine_recall@10 0.6446
cosine_ndcg@10 0.5452
cosine_mrr@10 0.4867
cosine_map@100 0.5329

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.3431
cosine_accuracy@3 0.3663
cosine_accuracy@5 0.456
cosine_accuracy@10 0.544
cosine_precision@1 0.3431
cosine_precision@3 0.3174
cosine_precision@5 0.2498
cosine_precision@10 0.1623
cosine_recall@1 0.127
cosine_recall@3 0.3226
cosine_recall@5 0.4149
cosine_recall@10 0.5283
cosine_ndcg@10 0.4373
cosine_mrr@10 0.3836
cosine_map@100 0.4284
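
The five tables above report the same evaluation at the Matryoshka dimensions 768, 512, 256, 128, and 64, respectively. A minimal sketch of how such metrics can be reproduced with InformationRetrievalEvaluator (the queries, corpus, and relevance judgments here are hypothetical placeholders):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer(
    "digo-prayudha/test-modernbert-embed-base-legal-matryoshka-2",
    truncate_dim=768,  # rerun with 512, 256, 128, 64 for the other tables
)

# Hypothetical data: query id -> text, doc id -> text, query id -> relevant doc ids
queries = {"q1": "What type of review is the Court unable to conduct?"}
corpus = {"d1": "... impossible for the Court to conduct its requisite de novo review ..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_768")
results = evaluator(model)  # dict of accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
print(results)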

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 5,822 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    • positive: string; min 26 tokens, mean 97.49 tokens, max 160 tokens
    • anchor: string; min 8 tokens, mean 16.58 tokens, max 41 tokens
  • Samples:
    • positive: policy” or “are in fact properly classified pursuant to such Executive order” are exempt from production under the FOIA. See 5 U.S.C. § 552(b)(1). “[I]n the FOIA context, [the D.C. Circuit has] consistently deferred to executive affidavits predicting harm to the national security, and have found it unwise to undertake searching judicial review.” Ctr. for Nat’l Sec. Studies, 331
      anchor: What has the D.C. Circuit consistently deferred to in the FOIA context?
    • positive: 42 The plaintiff states in its briefing that it challenges the CIA’s withholding of two records, in part, in No. 11-443, see Pl.’s First 443 Opp’n at 14, and six documents, in part, in No. 11-444, see Pl.’s First 444 Opp’n at 30, 35. The plaintiff does not specify, however, exactly which Exemption 3 withholdings it challenges in No. 11-445, where the
      anchor: How many records does the plaintiff challenge the withholding of in part in No. 11-443?
    • positive: let alone as a vexing subject of intense legal debate. ¶ 46 Indeed, the question of anonymity has taken on increased significance as court records have become readily available to the general public through even casual Internet searches. As the appellant notes in his brief, a Google search of a litigant’s name can produce an untold
      anchor: In which paragraph is the issue of anonymity discussed?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
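
A minimal sketch of constructing this loss with sentence_transformers (starting from the base checkpoint named above):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# In-batch negatives ranking loss, applied at every Matryoshka dimension
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)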
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • load_best_model_at_end: True
  • batch_sampler: no_duplicates
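
The values above map directly onto SentenceTransformerTrainingArguments; a minimal sketch (output_dir is a placeholder, and save_strategy="epoch" is an assumption needed so load_best_model_at_end can match the epoch-level evaluation):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed: must match eval_strategy for load_best_model_at_end
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    load_best_model_at_end=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)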

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.8791 10 5.6715 - - - - -
1.0 12 - 0.6281 0.6063 0.5695 0.4889 0.3602
1.7033 20 2.5845 - - - - -
2.0 24 - 0.6644 0.6467 0.6144 0.5379 0.4200
2.5275 30 2.0086 - - - - -
3.0 36 - 0.6627 0.6455 0.6228 0.5471 0.4365
3.3516 40 1.6748 - - - - -
4.0 48 - 0.6606 0.6452 0.6249 0.5452 0.4373
  • The evaluation results reported above correspond to the epoch 4.0 (step 48) row, which is the saved checkpoint.

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.1.0
  • Transformers: 4.56.1
  • PyTorch: 2.8.0+cu126
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.22.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}