---
license: apache-2.0
base_model: nomic-ai/nomic-embed-text-v1.5
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- formbench
- patent-retrieval
- chemistry
- formulations
- materials-science
language:
- en
pipeline_tag: sentence-similarity
---

# nomic-formbench-mnrl

A domain-adapted sentence-transformers model derived from
[`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) and fine-tuned on the **FormBench**
retrieval benchmark for formulation chemistry. It maps passages from formulation patents
into a 768-dimensional dense vector space and is optimised for within-domain
retrieval among structurally similar near-miss passages, the central capability targeted
by FormBench.

This repository hosts an anonymised release for NeurIPS 2026 double-blind review.

## Model details

| Item | Value |
|---|---|
| Base model | `nomic-ai/nomic-embed-text-v1.5` (137M params) |
| Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
| Loss | `MultipleNegativesRankingLoss` (in-batch negatives) |
| Training data | FormBench-Triplets: 44,413 (query, anchor, hard-negative) tuples |
| Embedding dimension | 768 |
| Max sequence length | 8192 tokens (2048 during training) |
| Precision | bf16 |
| Learning rate | 2e-5 |
| Per-GPU batch size | 32 |
| Epochs | 5 |
| Hardware | 8× AMD MI250X, DDP |

The training-triplet set can be reconstructed from the qrel files in
[`Formbench-anon/FormBench`](https://huggingface.co/datasets/Formbench-anon/FormBench)
by following the protocol in §3 of the paper.
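
For orientation, the following is a minimal sketch of this recipe using the
sentence-transformers v3 trainer, not the exact training script. The triplet rows are
illustrative placeholders (the real 44,413 triplets must be rebuilt from the qrels as
described above), and the column order maps FormBench's (query, anchor, hard-negative)
onto the (anchor, positive, negative) layout that `MultipleNegativesRankingLoss` expects:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base checkpoint; the Nomic architecture ships custom modelling code.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Placeholder triplets: query, relevant passage, structurally similar hard negative.
train_dataset = Dataset.from_dict({
    "anchor": ["what wax-seeded latex polymers improve scuff resistance?"],
    "positive": ["A water-based latex paint formulation containing ..."],
    "negative": ["An adhesive composition comprising a styrene-acrylate copolymer ..."],
})

# Besides the explicit hard negative, every other in-batch passage serves as a
# negative for each query.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="nomic-formbench-mnrl",   # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```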

## Evaluation results

Evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants,
following the protocol in §4 of the paper. Retrieval uses exact inner-product search in
FAISS with top-k = 100.
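
As a minimal sketch of that search setup (the corpus and query texts below are toy
placeholders, and the paper's evaluation harness may differ):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Formbench-anon/nomic-formbench-mnrl", trust_remote_code=True)

# Placeholder texts standing in for the FormBench corpus and query sets.
corpus = ["An adhesive composition comprising ...", "A lubricant base oil blend ..."]
queries = ["which copolymers improve peel strength in pressure-sensitive adhesives?"]

passage_embeds = model.encode(corpus, normalize_embeddings=True).astype(np.float32)
query_embeds = model.encode(queries, normalize_embeddings=True).astype(np.float32)

# Exact (brute-force) inner-product index; with L2-normalised vectors this is
# cosine similarity, matching the protocol above.
index = faiss.IndexFlatIP(passage_embeds.shape[1])  # 768 dimensions
index.add(passage_embeds)

# Top-k = 100 in the benchmark; capped here by the toy corpus size.
scores, ids = index.search(query_embeds, min(100, len(corpus)))
```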

### FormBench-Structured (C1): within-domain near-miss distractors

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.3668** |
| MRR (binary qrels) | 0.3228 |
| Graded nDCG@10 | 0.2145 |
| R@100 (binary qrels) | 0.7903 |
| FAISS search latency | 14.5 ms/query |

### FormBench-Random (C0): random-distractor corpus

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.4358** |
| MRR (binary qrels) | 0.3915 |
| Graded nDCG@10 | 0.2583 |
| R@100 (binary qrels) | 0.8311 |
| FAISS search latency | 14.5 ms/query |

For reference, a BM25 lexical baseline scores binary nDCG@10 = 0.3751 on C1 and 0.4665
on C0.

## Usage

```python
from sentence_transformers import SentenceTransformer

# The Nomic base architecture uses custom modelling code, so trust_remote_code
# is needed when loading.
model = SentenceTransformer("Formbench-anon/nomic-formbench-mnrl", trust_remote_code=True)

passages = [
    "An adhesive composition comprising a styrene-acrylate copolymer ...",
    "A water-based latex paint formulation containing ...",
]
queries = [
    "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
]

# Normalised embeddings, so inner product equals cosine similarity.
passage_embeds = model.encode(passages, normalize_embeddings=True)
query_embeds = model.encode(queries, normalize_embeddings=True)
```
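
Since the embeddings are L2-normalised, the inner product is the cosine similarity, so
ranking the passages for each query is, for example, a single matrix product:

```python
import numpy as np

scores = query_embeds @ passage_embeds.T  # (n_queries, n_passages) cosine similarities
ranking = np.argsort(-scores, axis=1)     # best-matching passage first
print(passages[ranking[0, 0]])
```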

## Intended use

Domain-specific retrieval over formulation patents: adhesives, coatings, lubricants,
pharmaceuticals, agrochemicals, personal care, and food. The model is particularly
suited to within-domain near-miss discrimination, where general-purpose embedders have
been shown to fail.

## Limitations

- Training queries are LLM-generated (Sonnet 4 + Haiku 4.5 quality filter) and may not
  match real practitioner intent.
- Coverage is limited to English-language USPTO utility patents (1995–2022).
- Performance on out-of-domain retrieval is not characterised.

## Citation

```bibtex
@misc{formbench2026,
  title  = {{FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Under double-blind review at NeurIPS 2026 Datasets \& Benchmarks Track}
}
```

## License

Apache 2.0, inherited from the base model.