---
license: apache-2.0
base_model: Qwen/Qwen3-Embedding-0.6B
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- formbench
- patent-retrieval
- chemistry
- formulations
- materials-science
language:
- en
pipeline_tag: sentence-similarity
---

# qwen3-embed-formbench-mnrl

A domain-adapted sentence-transformers model derived from
[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) and fine-tuned on the **FormBench**
retrieval benchmark for formulation chemistry. It maps passages from formulation patents
into a 1024-dimensional dense vector space and is optimised for within-domain
retrieval among structurally similar near-miss passages — the central capability targeted
by FormBench.

This repository hosts an anonymised release for NeurIPS 2026 double-blind review.

## Model details

| Item | Value |
|---|---|
| Base model | `Qwen/Qwen3-Embedding-0.6B` (600M params) |
| Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
| Loss | `MultipleNegativesRankingLoss` (in-batch negatives) |
| Training data | FormBench-Triplets — 44,413 (query, anchor, hard-negative) tuples |
| Embedding dimension | 1024 |
| Max sequence length | 8192 (training: 2048) |
| Precision | bf16 |
| Learning rate | 1e-5 |
| Per-GPU batch size | 32 |
| Epochs | 5 |
| Hardware | 8× AMD MI250X, DDP |

The training triplets can be reconstructed from the qrel files in
[`Formbench-anon/FormBench`](https://huggingface.co/datasets/Formbench-anon/FormBench)
by following the protocol in §3 of the paper.
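The `MultipleNegativesRankingLoss` objective above treats each query's paired passage as its positive and every other passage in the batch as a negative. A minimal numpy sketch of that scoring (using scale 20, sentence-transformers' default for cosine similarity; the toy vectors below are illustrative, not FormBench data):

```python
import numpy as np

def mnrl_loss(query_embeds, passage_embeds, scale=20.0):
    """Cross-entropy over in-batch negatives: passage i is the positive
    for query i; every other passage in the batch acts as a negative."""
    sims = scale * query_embeds @ passage_embeds.T      # (B, B) scaled cosines
    logits = sims - sims.max(axis=1, keepdims=True)     # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))          # target class = diagonal

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
queries = unit(rng.normal(size=(8, 16)))
matched = queries                           # perfectly aligned positives
shuffled = unit(rng.normal(size=(8, 16)))   # unrelated "positives"

low, high = mnrl_loss(queries, matched), mnrl_loss(queries, shuffled)
```

Minimising this loss pulls each query toward its paired passage while pushing it away from the rest of the batch, which is why larger batches supply more (and harder) negatives for free.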

## Evaluation results

Evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants,
following the protocol in §4 of the paper. Retrieval uses FAISS exact inner-product
search with top-k = 100.

### FormBench-Structured (C1) — within-domain near-miss distractors

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.4085** |
| MRR (binary qrels) | 0.3613 |
| Graded nDCG@10 | 0.2374 |
| R@100 (binary qrels) | 0.8181 |
| FAISS search latency | 18.8 ms/query |

### FormBench-Random (C0) — random-distractor corpus

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.4835** |
| MRR (binary qrels) | 0.4404 |
| Graded nDCG@10 | 0.2835 |
| R@100 (binary qrels) | 0.8525 |
| FAISS search latency | 18.8 ms/query |

For reference, the BM25 lexical baseline achieves binary nDCG@10 = 0.3751 on C1 and 0.4665 on C0.
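Binary nDCG@10 in the tables above is the standard formulation (gain 1 per relevant document, logarithmic rank discount). A small self-contained sketch with toy qrels, not FormBench data:

```python
import numpy as np

def binary_ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary nDCG@k: gain 1 per relevant document at rank r,
    discounted by 1/log2(r + 1), normalised by the ideal ranking."""
    discounts = 1.0 / np.log2(np.arange(2, k + 2))          # ranks 1..k
    gains = np.array([float(d in relevant_ids) for d in ranked_ids[:k]])
    dcg = float(gains @ discounts[: gains.size])
    idcg = float(discounts[: min(len(relevant_ids), k)].sum())
    return dcg / idcg if idcg > 0 else 0.0

# toy run: two relevant docs, retrieved at ranks 1 and 3
score = binary_ndcg_at_k(["d1", "d9", "d2", "d7"], {"d1", "d2"}, k=10)
```

The graded variant in the tables replaces the 0/1 gains with the graded relevance levels from the qrels; the discount and normalisation are unchanged.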

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Formbench-anon/qwen3-embed-formbench-mnrl")

passages = [
    "An adhesive composition comprising a styrene-acrylate copolymer ...",
    "A water-based latex paint formulation containing ...",
]
queries = [
    "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
]

passage_embeds = model.encode(passages, normalize_embeddings=True)
query_embeds = model.encode(queries, normalize_embeddings=True)

# embeddings are unit-normalised, so the dot product is cosine similarity
scores = query_embeds @ passage_embeds.T
```
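At corpus scale, the evaluation above uses FAISS exact inner-product search (`IndexFlatIP`, top-k = 100); on unit-normalised embeddings this is equivalent to a brute-force dot product. A self-contained numpy sketch with random stand-in vectors (not model output):

```python
import numpy as np

# random stand-ins for model.encode output; rows are L2-normalised
rng = np.random.default_rng(0)
corpus_vecs = rng.normal(size=(1000, 256))
corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

# two queries built as noisy copies of passages 0 and 1
query_vecs = corpus_vecs[:2] + 0.01 * rng.normal(size=(2, 256))
query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)

# exact inner-product search (what faiss.IndexFlatIP computes), top-100
scores = query_vecs @ corpus_vecs.T                 # (n_queries, n_passages)
top100 = np.argsort(-scores, axis=1)[:, :100]       # best-first per query
```

For the FormBench corpus sizes this brute-force search is what yields the ~19 ms/query latency reported above; approximate indexes trade exactness for speed and were not used in the evaluation.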

## Intended use

Domain-specific retrieval over formulation patents — adhesives, coatings, lubricants,
pharmaceuticals, agrochemicals, personal care, food. Particularly suited to
within-domain near-miss discrimination, where general-purpose embedders have been shown
to fail.

## Limitations

- Training queries are LLM-generated (Sonnet 4 + Haiku 4.5 quality filter) and may not
  match real practitioner intent.
- Coverage limited to USPTO utility patents (1995–2022) in English only.
- Performance on out-of-domain retrieval is not characterised.

## Citation

```bibtex
@misc{formbench2026,
  title  = {{FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Under double-blind review at NeurIPS 2026 Datasets \& Benchmarks Track}
}
```

## License

Apache 2.0, inherited from the base model.