# msco-nomic-v1.5
Finetuned `nomic-ai/nomic-embed-text-v1.5` for occupation taxonomy disambiguation.

Trained to separate cross-domain occupations that the base Nomic embeddings confuse (e.g., neurologist vs. physicist, surgery vs. dentistry lecturer).

Part of Meridial's occupation resolver (MSCO).
## Model Description
- Base model: `nomic-ai/nomic-embed-text-v1.5` (768-dim, Matryoshka-native)
- Finetuning method: Contrastive learning with taxonomy-derived triplets
- Loss: `MultipleNegativesRankingLoss` wrapped in `MatryoshkaLoss`
- Training data: 72,650 (anchor, positive, negative) triplets from ESCO + Radford taxonomies
- Hardware: Apple M-series GPU (MPS backend)
## Training Data
Triplets were generated from two occupation taxonomies:
| Source | Occupations | Domains |
|---|---|---|
| ESCO (v1.2.1) | 4,086 | ISCO 2-3 digit codes |
| Tech | 813 | Function/Area |
| Total | 4,899 | 329 unique domains |
Each occupation's text representation combined its label, its group/area, and a truncated description.
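The exact separator and truncation length are not documented here; a minimal sketch of building that text representation, assuming a simple dash-joined format (the function name and `max_desc_chars` default are illustrative):

```python
def occupation_text(label, group, description, max_desc_chars=200):
    """Join label, group/area, and a truncated description into the text
    that gets embedded. Separator and truncation length are assumptions."""
    return f"{label} - {group} - {description[:max_desc_chars]}"

text = occupation_text(
    "neurologist",
    "Medical doctors",
    "Diagnoses and treats disorders of the nervous system.",
)
```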
Triplet construction:
- Anchor: occupation text
- Positive: random peer from same subdomain (e.g., same ISCO 3-digit / same Radford area)
- Negative: random occupation from a different domain entirely
Two triplet sets were generated and mixed:
- Random negatives: 48,540 triplets (10 per anchor) for broad cross-domain signal
- Hard negatives: 24,110 triplets (5 per anchor), mined by embedding all occupations with the base model and finding the most-similar cross-domain pairs. These capture the specific confusion cases we want the model to fix.
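The hard-negative mining step can be sketched as follows, using toy 2-D vectors in place of real base-model embeddings (the function name and the brute-force O(n²) similarity scan are illustrative, not the actual pipeline code):

```python
import numpy as np

def mine_hard_negatives(embeddings, domains, k=5):
    """For each occupation, pick the k most-similar occupations (cosine)
    whose domain differs from the anchor's: the confusing cross-domain cases."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    hard = {}
    for i in range(len(X)):
        # candidates: every occupation from a *different* domain
        cand = np.array([j for j in range(len(X)) if domains[j] != domains[i]])
        ranked = cand[np.argsort(-sims[i, cand])]  # most similar first
        hard[i] = ranked[:k].tolist()
    return hard

# toy stand-ins: two "health" and two "physics" occupations
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
doms = ["health", "health", "physics", "physics"]
hard = mine_hard_negatives(emb, doms, k=1)
```

At real scale one would batch the similarity computation or use an ANN index rather than materializing the full 4,899 × 4,899 matrix, but the selection logic is the same.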
## Training Configuration
| Parameter | Value |
|---|---|
| Epochs | ~1.8 (early-stopped checkpoint) |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Scheduler | Cosine with 10% warmup |
| Loss | MultipleNegativesRankingLoss + MatryoshkaLoss (dims: 768, 512, 256, 128, 64) |
| Batch sampler | BatchSamplers.NO_DUPLICATES |
| Optimizer | AdamW |
| Precision | fp32 (MPS) |
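For intuition, a toy NumPy sketch of what `MultipleNegativesRankingLoss` inside `MatryoshkaLoss` computes: each anchor's positive competes against every other positive in the batch (in-batch negatives) via a softmax over scaled cosine similarities, and the same loss is averaged over truncated prefixes of the embedding. This mirrors the sentence-transformers behavior (scale 20, equal dim weights are its defaults) but is a simplified illustration, not the training code; 3-dim vectors stand in for the real 768/512/256/128/64 dims:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: target for anchor i is positive i."""
    A = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    P = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (A @ P.T)                      # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_mnr_loss(anchors, positives, dims=(768, 512, 256, 128, 64)):
    """Average the ranking loss over truncated embedding prefixes."""
    return float(np.mean([mnr_loss(anchors[:, :d], positives[:, :d]) for d in dims]))
```

A well-trained batch (anchor ≈ positive, distinct directions) yields a near-zero loss; mismatched pairs yield a large one.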
## Evaluation
Triplet accuracy (held-out 10% of training triplets):
| Training epoch | Accuracy |
|---|---|
| 0.05 | 79.5% |
| 0.20 | 90.3% |
| 0.50 | 97.2% |
| 1.00 | 98.0% |
| 1.61 (checkpoint) | 98.29% |
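Triplet accuracy here is the standard notion used by sentence-transformers' `TripletEvaluator`: the share of triplets where the anchor is closer to its positive than to its negative. A minimal sketch, assuming cosine similarity on toy vectors:

```python
import numpy as np

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where cos(anchor, positive) > cos(anchor, negative)."""
    def normed(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)
    A, P, N = normed(anchors), normed(positives), normed(negatives)
    pos_sim = (A * P).sum(axis=1)   # row-wise cosine with the positive
    neg_sim = (A * N).sum(axis=1)   # row-wise cosine with the negative
    return float((pos_sim > neg_sim).mean())
```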
Domain separation (measured on all 4,899 occupations using label-only encoding):
| Metric | Base (nomic v1.5) | Finetuned | Δ |
|---|---|---|---|
| Same-domain avg cosine | 0.6402 | 0.5622 | -0.078 |
| Cross-domain avg cosine | 0.5385 | 0.1474 | -0.391 |
| Domain separation score | 0.1017 | 0.4148 | +0.3131 (4.1×) |
The finetuned model has sharper domain boundaries: same-domain pairs are now ~3.8× more similar than cross-domain pairs (0.5622 vs 0.1474), up from ~1.2× for the base model.
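The separation score above is the gap between mean same-domain and mean cross-domain cosine similarity. A minimal sketch of that metric (the pairwise loop is illustrative; at 4,899 occupations one would vectorize it):

```python
import numpy as np
from itertools import combinations

def domain_separation(embeddings, domains):
    """Mean same-domain cosine minus mean cross-domain cosine."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    same, cross = [], []
    for i, j in combinations(range(len(X)), 2):
        sim = float(X[i] @ X[j])
        (same if domains[i] == domains[j] else cross).append(sim)
    return float(np.mean(same) - np.mean(cross))

# toy data: two tight domains that barely overlap
emb = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95]])
doms = ["a", "a", "b", "b"]
score = domain_separation(emb, doms)
```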
## Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nphan/msco-nomic-v1.5", trust_remote_code=True)

# Same as the base Nomic model: supports Matryoshka truncation
embeddings = model.encode(
    ["neurologist", "physicist", "surgeon"],
    normalize_embeddings=True,
)

# Truncate to smaller dimensions if needed (Matryoshka-compatible)
# embeddings = embeddings[:, :256]  # 256-dim
```
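One caveat the truncation comment glosses over: after slicing, vectors encoded with `normalize_embeddings=True` are no longer unit-norm, so cosine thresholds drift unless you re-normalize. A small sketch of the safe pattern, using random stand-ins for `model.encode(...)` output:

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim):
    """Matryoshka truncation: keep the first `dim` components,
    then re-normalize so dot products are cosine similarities again."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# stand-ins for three 768-dim normalized embeddings
emb = np.random.default_rng(0).normal(size=(3, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
small = truncate_and_renormalize(emb, 256)
```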
## Intended Use
- Occupation title resolution in the MSCO resolver pipeline
- Semantic search within an occupation catalog
- Cross-taxonomy mapping (ESCO ↔ Tech ↔ ISCO)
## Limitations
- Domain-specific: Finetuned on occupation data, so general-domain embedding quality may be slightly degraded. Evaluate with `NanoBEIREvaluator` for your use case if using outside occupation retrieval.
- English only: Training data was English-only. Multilingual performance is inherited from the base Nomic model and not guaranteed.
- Compressed similarity magnitudes: Absolute cosine scores are lower than the base model (e.g., same-domain ~0.56 vs base ~0.64). Ranking is still correct, but thresholds may need recalibration.
- Training stopped at epoch 1.76: Training was early-stopped before the planned 3 epochs; accuracy had plateaued at ~98.3% with diminishing gains.