msco-nomic-v1.5

Finetuned nomic-ai/nomic-embed-text-v1.5 for occupation taxonomy disambiguation.

Trained to separate cross-domain occupations that the base Nomic embeddings confuse (e.g., neurologist ↔ physicist, surgery ↔ dentistry lecturer).

Part of Meridial's occupation resolver (MSCO).

Model Description

  • Base model: nomic-ai/nomic-embed-text-v1.5 (768-dim, Matryoshka-native)
  • Finetuning method: Contrastive learning with taxonomy-derived triplets
  • Loss: MultipleNegativesRankingLoss wrapped in MatryoshkaLoss
  • Training data: 72,650 (anchor, positive, negative) triplets from ESCO + Radford taxonomies
  • Hardware: Apple M-series GPU (MPS backend)

Training Data

Triplets were generated from two occupation taxonomies:

| Source | Occupations | Domains |
|---|---|---|
| ESCO (v1.2.1) | 4,086 | ISCO 2-3 digit codes |
| Tech | 813 | Function/Area |
| **Total** | **4,899** | **329 unique domains** |

Each occupation's text representation joined its label, group/area, and (truncated) description as: label -- group/area -- description.
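The card doesn't pin down the exact separator or truncation length, so this helper is only a hedged sketch of how such a text representation might be built:

```python
def occupation_text(label, group, description, max_desc_chars=200):
    """Build the encoder input: label -- group/area -- truncated description.

    The "--" separator and the 200-char truncation are illustrative
    assumptions; the card only states the three fields were combined
    and the description truncated.
    """
    desc = description[:max_desc_chars]
    return f"{label} -- {group} -- {desc}"

print(occupation_text("neurologist", "Specialist medical practitioners",
                      "Diagnoses and treats disorders of the nervous system."))
```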

Triplet construction:

  • Anchor: occupation text
  • Positive: random peer from same subdomain (e.g., same ISCO 3-digit / same Radford area)
  • Negative: random occupation from a different domain entirely

Two triplet sets were generated and mixed:

  1. Random negatives: 48,540 triplets (10 per anchor) -- broad cross-domain signal
  2. Hard negatives: 24,110 triplets (5 per anchor) -- mined by embedding all occupations with the base model and finding the most-similar cross-domain pairs. These capture the specific confusion cases we want the model to fix.
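The hard-negative mining step can be sketched as: embed every occupation, then for each one keep its most cosine-similar neighbors from *other* domains. A toy numpy version (the brute-force O(n^2) scan is fine at ~4.9k occupations; the exact mining code used for training is not published):

```python
import numpy as np

def mine_hard_negatives(embeddings, domains, k=5):
    """For each occupation, return indices of the k most cosine-similar
    occupations that belong to a different domain (candidate hard negatives)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T  # pairwise cosine similarity
    hard = []
    for i in range(len(emb)):
        cross = [j for j in range(len(emb)) if domains[j] != domains[i]]
        ranked = sorted(cross, key=lambda j: sims[i, j], reverse=True)
        hard.append(ranked[:k])
    return hard
```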

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | ~1.8 (early-stopped checkpoint) |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Scheduler | Cosine with 10% warmup |
| Loss | MultipleNegativesRankingLoss + MatryoshkaLoss (dims: 768, 512, 256, 128, 64) |
| Batch sampler | BatchSamplers.NO_DUPLICATES |
| Optimizer | AdamW |
| Precision | fp32 (MPS) |
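MultipleNegativesRankingLoss treats every other positive in the batch as an implicit negative: the anchor-positive similarity matrix is scored with cross-entropy against its diagonal (which is also why the NO_DUPLICATES sampler matters -- a duplicate positive would be a false negative). MatryoshkaLoss re-applies the same objective at each truncated dimension. A hedged numpy sketch of the math (the `scale` factor and the simple averaging over dims are illustrative, not the library's exact implementation):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: anchor i should rank positive i
    above every other positive in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch); correct pairs on the diagonal
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_mnr_loss(anchors, positives, dims=(768, 512, 256, 128, 64)):
    """Apply the same ranking loss at each Matryoshka truncation and average."""
    return float(np.mean([mnr_loss(anchors[:, :d], positives[:, :d])
                          for d in dims]))
```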

Evaluation

Triplet accuracy (held-out 10% of training triplets):

| Training epoch | Accuracy |
|---|---|
| 0.05 | 79.5% |
| 0.20 | 90.3% |
| 0.50 | 97.2% |
| 1.00 | 98.0% |
| 1.61 (checkpoint) | 98.29% |
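Triplet accuracy here is the fraction of held-out triplets where the anchor embedding is closer (by cosine) to its positive than to its negative. A minimal sketch of the metric:

```python
import numpy as np

def triplet_accuracy(anchor_emb, pos_emb, neg_emb):
    """Fraction of triplets with cos(anchor, positive) > cos(anchor, negative)."""
    def cos(a, b):
        return np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(cos(anchor_emb, pos_emb) > cos(anchor_emb, neg_emb)))
```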

Domain separation (measured on all 4,899 occupations using label-only encoding):

| Metric | Base (nomic v1.5) | Finetuned | Δ |
|---|---|---|---|
| Same-domain avg cosine | 0.6402 | 0.5622 | -0.078 |
| Cross-domain avg cosine | 0.5385 | 0.1474 | -0.391 |
| Domain separation score | 0.1017 | 0.4148 | +0.3131 (4.1×) |

The finetuned model has sharper domain boundaries: same-domain pairs are now ~3.8× more similar than cross-domain pairs (0.5622 vs 0.1474), up from ~1.2× for the base model.
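The domain separation score above is the gap between average same-domain and average cross-domain cosine similarity over all occupation pairs. A minimal numpy sketch of how such a score can be computed:

```python
import numpy as np

def domain_separation(embeddings, domains):
    """Average same-domain cosine minus average cross-domain cosine
    over all pairs (higher = sharper domain boundaries)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    same, cross = [], []
    n = len(emb)
    for i in range(n):
        for j in range(i + 1, n):
            (same if domains[i] == domains[j] else cross).append(sims[i, j])
    return float(np.mean(same) - np.mean(cross))
```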

Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nphan/msco-nomic-v1.5", trust_remote_code=True)

# Same usage as base nomic -- supports Matryoshka truncation
embeddings = model.encode(
    ["neurologist", "physicist", "surgeon"],
    normalize_embeddings=True,
)

# Truncate to smaller dimensions if needed (Matryoshka-compatible)
# embeddings = embeddings[:, :256]  # 256-dim
```
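One caveat when truncating: the truncated vectors are no longer unit-norm even if the full embeddings were normalized, so re-normalize before computing cosine similarities. A small numpy sketch:

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim=256):
    """Matryoshka truncation: keep the first `dim` components, then
    re-normalize so cosine similarity stays meaningful at the reduced
    dimension."""
    truncated = np.asarray(embeddings)[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
```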

Intended Use

  • Occupation title resolution in the MSCO resolver pipeline
  • Semantic search within an occupation catalog
  • Cross-taxonomy mapping (ESCO ↔ Tech ↔ ISCO)

Limitations

  • Domain-specific: Finetuned on occupation data β€” general-domain embedding quality may be slightly degraded. Evaluate with NanoBEIREvaluator for your use case if using outside occupation retrieval.
  • English only: Training data was English-only. Multilingual performance is inherited from the base Nomic model and not guaranteed.
  • Compressed similarity magnitudes: Absolute cosine scores are lower than the base model (e.g., same-domain ~0.56 vs base 0.64). Ranking is still correct but thresholds may need recalibration.
  • Training stopped at epoch 1.76: Training was early-stopped before the planned 3 epochs. Accuracy had plateaued at ~98.3% with diminishing gains.