# msco-nomic-v1.5
Finetuned `nomic-ai/nomic-embed-text-v1.5` for occupation taxonomy disambiguation.

Trained to separate cross-domain occupations that the base Nomic embeddings confuse (e.g., neurologist vs. physicist, surgery vs. dentistry lecturer).

Part of Meridial's occupation resolver (MSCO).
## Model Description
- Base model: `nomic-ai/nomic-embed-text-v1.5` (768-dim, Matryoshka-native)
- Finetuning method: Contrastive learning with taxonomy-derived triplets
- Loss: `MultipleNegativesRankingLoss` wrapped in `MatryoshkaLoss`
- Training data: 72,650 (anchor, positive, negative) triplets from ESCO + Radford taxonomies
- Hardware: Apple M-series GPU (MPS backend)
## Training Data
Triplets were generated from two occupation taxonomies:
| Source | Occupations | Domains |
|---|---|---|
| ESCO (v1.2.1) | 4,086 | ISCO 2-3 digit codes |
| Tech | 813 | Function/Area |
| Total | 4,899 | 329 unique domains |
Each occupation's text representation combined its label, its group/area, and a truncated description.
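The exact separator and truncation length are not documented here; a minimal sketch of building that text representation, assuming a simple dash-joined format (the function name and `max_desc_chars` default are illustrative):

```python
def occupation_text(label, group, description, max_desc_chars=200):
    """Join label, group/area, and a truncated description into the text
    that gets embedded. Separator and truncation length are assumptions."""
    return f"{label} - {group} - {description[:max_desc_chars]}"

text = occupation_text(
    "neurologist",
    "Medical doctors",
    "Diagnoses and treats disorders of the nervous system.",
)
```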
Triplet construction:
- Anchor: occupation text
- Positive: random peer from same subdomain (e.g., same ISCO 3-digit / same Radford area)
- Negative: random occupation from a different domain entirely
Two triplet sets were generated and mixed:
- Random negatives: 48,540 triplets (10 per anchor) for broad cross-domain signal
- Hard negatives: 24,110 triplets (5 per anchor), mined by embedding all occupations with the base model and finding the most-similar cross-domain pairs. These capture the specific confusion cases we want the model to fix.
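The hard-negative mining step can be sketched as follows, using toy 2-D vectors in place of real base-model embeddings (the function name and the brute-force O(n²) similarity scan are illustrative, not the actual pipeline code):

```python
import numpy as np

def mine_hard_negatives(embeddings, domains, k=5):
    """For each occupation, pick the k most-similar occupations (cosine)
    whose domain differs from the anchor's: the confusing cross-domain cases."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    hard = {}
    for i in range(len(X)):
        # candidates: every occupation from a *different* domain
        cand = np.array([j for j in range(len(X)) if domains[j] != domains[i]])
        ranked = cand[np.argsort(-sims[i, cand])]  # most similar first
        hard[i] = ranked[:k].tolist()
    return hard

# toy stand-ins: two "health" and two "physics" occupations
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
doms = ["health", "health", "physics", "physics"]
hard = mine_hard_negatives(emb, doms, k=1)
```

At real scale one would batch the similarity computation or use an ANN index rather than materializing the full 4,899 × 4,899 matrix, but the selection logic is the same.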
## Training Configuration
| Parameter | Value |
|---|---|
| Epochs | ~1.8 (early-stopped checkpoint) |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Scheduler | Cosine with 10% warmup |
| Loss | MultipleNegativesRankingLoss + MatryoshkaLoss (dims: 768, 512, 256, 128, 64) |
| Batch sampler | BatchSamplers.NO_DUPLICATES |
| Optimizer | AdamW |
| Precision | fp32 (MPS) |
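For intuition, a toy NumPy sketch of what `MultipleNegativesRankingLoss` inside `MatryoshkaLoss` computes: each anchor's positive competes against every other positive in the batch (in-batch negatives) via a softmax over scaled cosine similarities, and the same loss is averaged over truncated prefixes of the embedding. This mirrors the sentence-transformers behavior (scale 20, equal dim weights are its defaults) but is a simplified illustration, not the training code; 3-dim vectors stand in for the real 768/512/256/128/64 dims:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: target for anchor i is positive i."""
    A = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    P = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (A @ P.T)                      # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_mnr_loss(anchors, positives, dims=(768, 512, 256, 128, 64)):
    """Average the ranking loss over truncated embedding prefixes."""
    return float(np.mean([mnr_loss(anchors[:, :d], positives[:, :d]) for d in dims]))
```

A well-trained batch (anchor ≈ positive, distinct directions) yields a near-zero loss; mismatched pairs yield a large one.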
## Evaluation
Triplet accuracy (held-out 10% of training triplets):
| Training epoch | Accuracy |
|---|---|
| 0.05 | 79.5% |
| 0.20 | 90.3% |
| 0.50 | 97.2% |
| 1.00 | 98.0% |
| 1.61 (checkpoint) | 98.29% |
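Triplet accuracy here is the standard notion used by sentence-transformers' `TripletEvaluator`: the share of triplets where the anchor is closer to its positive than to its negative. A minimal sketch, assuming cosine similarity on toy vectors:

```python
import numpy as np

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where cos(anchor, positive) > cos(anchor, negative)."""
    def normed(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)
    A, P, N = normed(anchors), normed(positives), normed(negatives)
    pos_sim = (A * P).sum(axis=1)   # row-wise cosine with the positive
    neg_sim = (A * N).sum(axis=1)   # row-wise cosine with the negative
    return float((pos_sim > neg_sim).mean())
```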
Domain separation (measured on all 4,899 occupations using label-only encoding):
| Metric | Base (nomic v1.5) | Finetuned | Δ |
|---|---|---|---|
| Same-domain avg cosine | 0.6402 | 0.5622 | -0.078 |
| Cross-domain avg cosine | 0.5385 | 0.1474 | -0.391 |
| Domain separation score | 0.1017 | 0.4148 | +0.3131 (4.1×) |
The finetuned model has sharper domain boundaries: same-domain pairs are now ~3.8× more similar than cross-domain pairs (0.5622 vs 0.1474), up from ~1.2× for the base model.
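The separation score above is the gap between mean same-domain and mean cross-domain cosine similarity. A minimal sketch of that metric (the pairwise loop is illustrative; at 4,899 occupations one would vectorize it):

```python
import numpy as np
from itertools import combinations

def domain_separation(embeddings, domains):
    """Mean same-domain cosine minus mean cross-domain cosine."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    same, cross = [], []
    for i, j in combinations(range(len(X)), 2):
        sim = float(X[i] @ X[j])
        (same if domains[i] == domains[j] else cross).append(sim)
    return float(np.mean(same) - np.mean(cross))

# toy data: two tight domains that barely overlap
emb = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95]])
doms = ["a", "a", "b", "b"]
score = domain_separation(emb, doms)
```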
## Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nphan/msco-nomic-v1.5", trust_remote_code=True)

# Same as the base Nomic model: supports Matryoshka truncation
embeddings = model.encode(
    ["neurologist", "physicist", "surgeon"],
    normalize_embeddings=True,
)

# Truncate to smaller dimensions if needed (Matryoshka-compatible)
# embeddings = embeddings[:, :256]  # 256-dim
```
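One caveat the truncation comment glosses over: after slicing, vectors encoded with `normalize_embeddings=True` are no longer unit-norm, so cosine thresholds drift unless you re-normalize. A small sketch of the safe pattern, using random stand-ins for `model.encode(...)` output:

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim):
    """Matryoshka truncation: keep the first `dim` components,
    then re-normalize so dot products are cosine similarities again."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# stand-ins for three 768-dim normalized embeddings
emb = np.random.default_rng(0).normal(size=(3, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
small = truncate_and_renormalize(emb, 256)
```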
## Intended Use
- Occupation title resolution in the MSCO resolver pipeline
- Semantic search within an occupation catalog
- Cross-taxonomy mapping (ESCO ↔ Tech ↔ ISCO)
## Limitations
- Domain-specific: Finetuned on occupation data, so general-domain embedding quality may be slightly degraded. Evaluate with `NanoBEIREvaluator` for your use case if using outside occupation retrieval.
- English only: Training data was English-only. Multilingual performance is inherited from the base Nomic model and not guaranteed.
- Compressed similarity magnitudes: Absolute cosine scores are lower than the base model (e.g., same-domain ~0.56 vs base ~0.64). Ranking is still correct, but thresholds may need recalibration.
- Training stopped at epoch 1.76: Training was early-stopped before the planned 3 epochs; accuracy had plateaued at ~98.3% with diminishing gains.