Formbench-anon committed
Commit bfe01d7 · verified · 1 Parent(s): 6b348fd

Initial release: FormBench TAPT-MNRL model (anonymised for review)

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
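
Editor's note: the pooling config above enables mean-token pooling only (CLS, max, weighted-mean, and last-token modes are all false), so a sentence embedding is the attention-masked average of the 768-dim token embeddings. A minimal sketch of that operation, with illustrative variable names of our own:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)  # sum embeddings of real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens, guard against /0
    return summed / counts                         # (batch, 768) sentence embeddings
```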
README.md ADDED
@@ -0,0 +1,123 @@
+ ---
+ license: apache-2.0
+ base_model: nomic-ai/nomic-embed-text-v1.5
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - formbench
+ - patent-retrieval
+ - chemistry
+ - formulations
+ - materials-science
+ language:
+ - en
+ pipeline_tag: sentence-similarity
+ ---
+
+ # nomic-formbench-mnrl
+
+ A domain-adapted sentence-transformers model derived from
+ [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5)
+ and fine-tuned on the **FormBench** retrieval benchmark for formulation chemistry.
+ It maps passages from formulation patents into a 768-dimensional dense vector space
+ and is optimised for within-domain retrieval among structurally similar near-miss
+ passages, the central capability targeted by FormBench.
+
+ This repository hosts an anonymised release for NeurIPS 2026 double-blind review.
+
+ ## Model details
+
+ | Item | Value |
+ |---|---|
+ | Base model | `nomic-ai/nomic-embed-text-v1.5` (137M params) |
+ | Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
+ | Loss | `MultipleNegativesRankingLoss` (in-batch negatives) |
+ | Training data | FormBench-Triplets: 44,413 (query, anchor, hard-negative) tuples |
+ | Embedding dimension | 768 |
+ | Max sequence length | 8192 (training: 2048) |
+ | Precision | bf16 |
+ | Learning rate | 2e-5 |
+ | Per-GPU batch size | 32 |
+ | Epochs | 5 |
+ | Hardware | 8× AMD MI250X, DDP |
+
+ The training-triplet set can be reconstructed from the qrel files in
+ [`Formbench-anon/FormBench`](https://huggingface.co/datasets/Formbench-anon/FormBench)
+ following the protocol in §3 of the paper.
+
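
Editor's note: the hyperparameters in the table map directly onto the sentence-transformers v3+ trainer API. A minimal reproduction sketch, as an illustration only; the in-line dataset is a placeholder, and the `anchor`/`positive`/`negative` column names follow the library's triplet convention rather than anything shipped in this repository:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
model.max_seq_length = 2048  # training length from the table above

# Placeholder triplets; the real set is rebuilt from the FormBench qrels (§3).
train_dataset = Dataset.from_dict({
    "anchor":   ["what wax-seeded latex polymers improve scuff resistance?"],
    "positive": ["An adhesive composition comprising a styrene-acrylate copolymer ..."],
    "negative": ["A water-based latex paint formulation containing ..."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="nomic-formbench-mnrl",
    num_train_epochs=5,
    per_device_train_batch_size=32,  # in-batch negatives come from these 32 examples
    learning_rate=2e-5,
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),  # anchor scored against all in-batch docs
)
trainer.train()
```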
+ ## Evaluation results
+
+ Evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants,
+ following the protocol in §4 of the paper. Retrieval uses FAISS exact inner-product
+ search with top-k = 100.
+
+ ### FormBench-Structured (C1): within-domain near-miss distractors
+
+ | Metric | Value |
+ |---|---:|
+ | Binary nDCG@10 | **0.3668** |
+ | MRR (binary qrels) | 0.3228 |
+ | Graded nDCG@10 | 0.2145 |
+ | R@100 (binary qrels) | 0.7903 |
+ | FAISS search latency | 14.5 ms/query |
+
+ ### FormBench-Random (C0): random-distractor corpus
+
+ | Metric | Value |
+ |---|---:|
+ | Binary nDCG@10 | **0.4358** |
+ | MRR (binary qrels) | 0.3915 |
+ | Graded nDCG@10 | 0.2583 |
+ | R@100 (binary qrels) | 0.8311 |
+ | FAISS search latency | 14.5 ms/query |
+
+ For reference, the BM25 lexical baseline reaches binary nDCG@10 = 0.3751 on C1 and 0.4665 on C0.
+
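
Editor's note: a sketch of the exact-search setup named in the protocol above. With L2-normalised embeddings, inner product equals cosine similarity, and `IndexFlatIP` performs exhaustive (exact) search; the random arrays are stand-ins for real corpus and query embeddings:

```python
import faiss
import numpy as np

d = 768  # embedding dimension
passage_embeds = np.random.rand(10_000, d).astype("float32")  # stand-in corpus
query_embeds = np.random.rand(8, d).astype("float32")         # stand-in queries
faiss.normalize_L2(passage_embeds)  # in-place; IP over unit vectors == cosine
faiss.normalize_L2(query_embeds)

index = faiss.IndexFlatIP(d)        # exact, brute-force inner-product index
index.add(passage_embeds)
scores, doc_ids = index.search(query_embeds, 100)  # top-k = 100 as in the protocol
```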
+ ## Usage
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # trust_remote_code=True is required: the checkpoint routes through the
+ # custom NomicBert modelling code referenced in config.json's auto_map.
+ model = SentenceTransformer("Formbench-anon/nomic-formbench-mnrl", trust_remote_code=True)
+
+ passages = [
+     "An adhesive composition comprising a styrene-acrylate copolymer ...",
+     "A water-based latex paint formulation containing ...",
+ ]
+ queries = [
+     "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
+ ]
+
+ passage_embeds = model.encode(passages, normalize_embeddings=True)
+ query_embeds = model.encode(queries, normalize_embeddings=True)
+ ```
+
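
Editor's note: the snippet above encodes but never ranks. Since `config_sentence_transformers.json` sets cosine similarity and the embeddings are already normalised, ranking is one call away in sentence-transformers v3+; a short continuation as an illustration:

```python
# Continuation of the usage snippet above (editor's illustration).
scores = model.similarity(query_embeds, passage_embeds)  # (n_queries, n_passages)
ranked = scores.argsort(dim=1, descending=True)          # passage indices, best first
print(scores, ranked)
```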
+ ## Intended use
+
+ Domain-specific retrieval over formulation patents: adhesives, coatings, lubricants,
+ pharmaceuticals, agrochemicals, personal care, and food. Particularly suited to
+ within-domain near-miss discrimination, where general-purpose embedders have been shown
+ to fail.
+
+ ## Limitations
+
+ - Training queries are LLM-generated (Sonnet 4 + Haiku 4.5 quality filter) and may not
+   match real practitioner intent.
+ - Coverage is limited to USPTO utility patents (1995–2022) in English only.
+ - Performance on out-of-domain retrieval is not characterised.
+
+ ## Citation
+
+ ```bibtex
+ @misc{formbench2026,
+   title  = {{FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents},
+   author = {Anonymous Authors},
+   year   = {2026},
+   note   = {Under double-blind review at the NeurIPS 2026 Datasets \& Benchmarks Track}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0, inherited from the base model.
_eval_status.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "c1": {
+     "status": "completed",
+     "queued_by_job": "4371243",
+     "started_at": "2026-04-10T20:56:30.947301",
+     "phase1_job": "4371243",
+     "phase1_completed_at": "2026-04-10T21:03:15.411292",
+     "completed_at": "2026-04-11T03:44:05.632818",
+     "metrics_path": "/lustre/orion/mat721/proj-shared/formbench_ner/experiments/results/nomic_tapt_mnrl_best/c1/metrics.json"
+   },
+   "c0": {
+     "status": "completed",
+     "queued_by_job": "4379499",
+     "started_at": "2026-04-12T08:17:28.640808",
+     "phase1_job": "4379499",
+     "phase1_completed_at": "2026-04-12T08:22:29.644197",
+     "completed_at": "2026-04-12T13:13:37.047105",
+     "metrics_path": "/lustre/orion/mat721/proj-shared/formbench_ner/experiments/results/nomic_tapt_mnrl_best/c0/metrics.json"
+   }
+ }
_legacy_eval_status.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "status": "completed",
+   "corpus": "c1",
+   "queued_by_job": "4371243",
+   "started_at": "2026-04-10T20:56:30.947301",
+   "phase1_job": "4371243",
+   "phase1_completed_at": "2026-04-10T21:03:15.411292",
+   "completed_at": "2026-04-11T03:44:05.632818",
+   "metrics_path": "/lustre/orion/mat721/proj-shared/formbench_ner/experiments/results/nomic_tapt_mnrl_best/c1/metrics.json"
+ }
config.json ADDED
@@ -0,0 +1,73 @@
+ {
+   "activation_function": "swiglu",
+   "architectures": [
+     "NomicBertModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "attn_pdrop": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_hf_nomic_bert.NomicBertConfig",
+     "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
+     "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining",
+     "AutoModelForMultipleChoice": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForMultipleChoice",
+     "AutoModelForQuestionAnswering": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForQuestionAnswering",
+     "AutoModelForSequenceClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForSequenceClassification",
+     "AutoModelForTokenClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForTokenClassification"
+   },
+   "bos_token_id": null,
+   "causal": false,
+   "classifier_dropout": null,
+   "dense_seq_output": true,
+   "embd_pdrop": 0.0,
+   "eos_token_id": null,
+   "fused_bias_fc": true,
+   "fused_dropout_add_ln": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_dropout_prob": 0.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "layer_norm_epsilon": 1e-12,
+   "max_trained_positions": 2048,
+   "mlp_fc1_bias": false,
+   "mlp_fc2_bias": false,
+   "model_type": "nomic_bert",
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": 3072,
+   "n_layer": 12,
+   "n_positions": 2048,
+   "pad_token_id": 0,
+   "pad_vocab_size_multiple": 64,
+   "parallel_block": false,
+   "parallel_block_tied_norm": false,
+   "prenorm": false,
+   "qkv_proj_bias": false,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.0,
+   "rope_parameters": {
+     "rope_theta": 1000.0,
+     "rope_type": "default"
+   },
+   "rotary_emb_base": 1000,
+   "rotary_emb_fraction": 1.0,
+   "rotary_emb_interleaved": false,
+   "rotary_emb_scale_base": null,
+   "rotary_scaling_factor": null,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.0,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "use_flash_attn": true,
+   "use_rms_norm": false,
+   "use_xentropy": true,
+   "vocab_size": 30528
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "sentence_transformers": "5.3.0",
+     "transformers": "4.51.3",
+     "pytorch": "2.3.1+rocm5.7"
+   },
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
configuration_hf_nomic_bert.py ADDED
@@ -0,0 +1,56 @@
+ from transformers import GPT2Config
+
+
+ class NomicBertConfig(GPT2Config):
+     model_type = "nomic_bert"
+
+     def __init__(
+         self,
+         prenorm=False,
+         parallel_block=False,
+         parallel_block_tied_norm=False,
+         rotary_emb_fraction=0.0,
+         fused_dropout_add_ln=False,
+         fused_bias_fc=False,
+         use_flash_attn=False,
+         use_xentropy=False,
+         qkv_proj_bias=True,
+         rotary_emb_base=10_000,
+         rotary_emb_scale_base=None,
+         rotary_emb_interleaved=False,
+         mlp_fc1_bias=True,
+         mlp_fc2_bias=True,
+         use_rms_norm=False,
+         causal=False,
+         type_vocab_size=2,
+         dense_seq_output=True,
+         pad_vocab_size_multiple=1,
+         tie_word_embeddings=True,
+         rotary_scaling_factor=None,
+         max_trained_positions=2048,
+         **kwargs,
+     ):
+         self.prenorm = prenorm
+         self.parallel_block = parallel_block
+         self.parallel_block_tied_norm = parallel_block_tied_norm
+         self.rotary_emb_fraction = rotary_emb_fraction
+         self.tie_word_embeddings = tie_word_embeddings
+         self.fused_dropout_add_ln = fused_dropout_add_ln
+         self.fused_bias_fc = fused_bias_fc
+         self.use_flash_attn = use_flash_attn
+         self.use_xentropy = use_xentropy
+         self.qkv_proj_bias = qkv_proj_bias
+         self.rotary_emb_base = rotary_emb_base
+         self.rotary_emb_scale_base = rotary_emb_scale_base
+         self.rotary_emb_interleaved = rotary_emb_interleaved
+         self.mlp_fc1_bias = mlp_fc1_bias
+         self.mlp_fc2_bias = mlp_fc2_bias
+         self.use_rms_norm = use_rms_norm
+         self.causal = causal
+         self.type_vocab_size = type_vocab_size
+         self.dense_seq_output = dense_seq_output
+         self.pad_vocab_size_multiple = pad_vocab_size_multiple
+         self.rotary_scaling_factor = rotary_scaling_factor
+         self.max_trained_positions = max_trained_positions
+
+         super().__init__(**kwargs)
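
Editor's note: `config.json`'s `auto_map` routes `AutoConfig` to the `NomicBertConfig` class above, which is why loading this checkpoint requires `trust_remote_code`. A quick illustrative check:

```python
from transformers import AutoConfig

# trust_remote_code executes configuration_hf_nomic_bert.py from the repo
cfg = AutoConfig.from_pretrained("Formbench-anon/nomic-formbench-mnrl", trust_remote_code=True)
print(cfg.model_type, cfg.n_embd, cfg.n_layer)  # expect: nomic_bert 768 12
```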
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3ed4b7e2c422acb75de61cab4b60caf23c64fe1314fb01664d45e9a19b202031
+ size 546938168
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
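
Editor's note: `modules.json` declares the two-stage pipeline (Transformer, then the mean Pooling configured in `1_Pooling/config.json`). A sketch of the equivalent manual construction, assuming sentence-transformers v3+:

```python
from sentence_transformers import SentenceTransformer, models

transformer = models.Transformer(
    "Formbench-anon/nomic-formbench-mnrl",
    max_seq_length=2048,
    model_args={"trust_remote_code": True},   # NomicBert ships custom modelling code
    config_args={"trust_remote_code": True},
)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 768
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[transformer, pooling])
```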
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 2048,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 8192,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff