---
license: apache-2.0
base_model: nomic-ai/nomic-embed-text-v1.5
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- formbench
- patent-retrieval
- chemistry
- formulations
- materials-science
language:
- en
pipeline_tag: sentence-similarity
---

# nomic-formbench-mnrl

A domain-adapted sentence-transformers model derived from [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) and fine-tuned on the **FormBench** retrieval benchmark for formulation chemistry. It maps passages from formulation patents into a 768-dimensional dense vector space and is optimised for within-domain retrieval among structurally similar near-miss passages, the central capability targeted by FormBench.

This repository hosts an anonymised release for NeurIPS 2026 double-blind review.

## Model details

| Item | Value |
|---|---|
| Base model | `nomic-ai/nomic-embed-text-v1.5` (137M params) |
| Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
| Loss | `MultipleNegativesRankingLoss` (in-batch negatives) |
| Training data | FormBench-Triplets: 44,413 (query, anchor, hard-negative) tuples |
| Embedding dimension | 768 |
| Max sequence length | 8192 (training: 2048) |
| Precision | bf16 |
| Learning rate | 2e-5 |
| Per-GPU batch size | 32 |
| Epochs | 5 |
| Hardware | 8× AMD MI250X, DDP |

The training-triplet set can be reconstructed from the qrel files in [`Formbench-anon/FormBench`](https://huggingface.co/datasets/Formbench-anon/FormBench) following the protocol in §3 of the paper.

## Evaluation results

Evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants, following the protocol in §4 of the paper. Retrieval uses FAISS exact inner-product search at top-k = 100.
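The exact-search step of this protocol can be sketched without FAISS: with L2-normalised embeddings, a brute-force inner-product top-k over the corpus matrix is numerically equivalent to a FAISS `IndexFlatIP` lookup. A minimal sketch, using random placeholder vectors rather than model outputs:

```python
import numpy as np

def exact_ip_search(query_embeds, passage_embeds, k=100):
    """Exact inner-product top-k search, equivalent to FAISS IndexFlatIP.

    With L2-normalised embeddings, inner product equals cosine similarity.
    """
    k = min(k, passage_embeds.shape[0])
    scores = query_embeds @ passage_embeds.T          # (n_queries, n_passages)
    # Sort each row by descending score and keep the top-k column indices.
    idx = np.argsort(-scores, axis=1)[:, :k]
    return idx, np.take_along_axis(scores, idx, axis=1)

# Toy corpus: 4 unit vectors in 768-d; the query is a copy of passage 2,
# so its self-similarity of ~1.0 should rank it first.
rng = np.random.default_rng(0)
passages = rng.normal(size=(4, 768)).astype(np.float32)
passages /= np.linalg.norm(passages, axis=1, keepdims=True)
query = passages[2:3].copy()

idx, scores = exact_ip_search(query, passages, k=4)
print(idx[0])  # passage 2 ranks first
```

In practice the reported numbers come from a FAISS flat index over the full corpus; the sketch above only illustrates the scoring rule.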
### FormBench-Structured (C1): within-domain near-miss distractors

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.3668** |
| MRR (binary qrels) | 0.3228 |
| Graded nDCG@10 | 0.2145 |
| R@100 (binary qrels) | 0.7903 |
| FAISS search latency | 14.5 ms/query |

### FormBench-Random (C0): random-distractor corpus

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.4358** |
| MRR (binary qrels) | 0.3915 |
| Graded nDCG@10 | 0.2583 |
| R@100 (binary qrels) | 0.8311 |
| FAISS search latency | 14.5 ms/query |

For reference, the BM25 lexical baseline scores binary nDCG@10 = 0.3751 on C1 and 0.4665 on C0.

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Formbench-anon/nomic-formbench-mnrl")

passages = [
    "An adhesive composition comprising a styrene-acrylate copolymer ...",
    "A water-based latex paint formulation containing ...",
]
queries = [
    "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
]

passage_embeds = model.encode(passages, normalize_embeddings=True)
query_embeds = model.encode(queries, normalize_embeddings=True)

# Cosine similarity (embeddings are L2-normalised)
scores = query_embeds @ passage_embeds.T
```

## Intended use

Domain-specific retrieval over formulation patents: adhesives, coatings, lubricants, pharmaceuticals, agrochemicals, personal care, and food. The model is particularly suited to within-domain near-miss discrimination, where general-purpose embedders have been shown to fail.

## Limitations

- Training queries are LLM-generated (Sonnet 4 generation with a Haiku 4.5 quality filter) and may not match real practitioner intent.
- Coverage is limited to English-language USPTO utility patents (1995–2022).
- Performance on out-of-domain retrieval is not characterised.

## Citation

```bibtex
@misc{formbench2026,
  title  = {{FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Under double-blind review at the NeurIPS 2026 Datasets \& Benchmarks Track}
}
```

## License

Apache 2.0, inherited from the base model.