license: apache-2.0
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - skill-retrieval
  - embedding
language:
  - en
datasets:
  - anonymous-ed-benchmark/skillret-benchmark
library_name: sentence-transformers
pipeline_tag: sentence-similarity
model-index:
  - name: SkillRet-Embedding-0.6B
    results:
      - task:
          type: information-retrieval
          name: Skill Retrieval
        dataset:
          type: anonymous-ed-benchmark/skillret-benchmark
          name: SkillRet Benchmark (test)
          split: test
        metrics:
          - type: ndcg_at_5
            value: 0.753
            name: NDCG@5
          - type: ndcg_at_10
            value: 0.777
            name: NDCG@10
          - type: recall_at_10
            value: 0.852
            name: Recall@10
          - type: mrr_at_10
            value: 0.827
            name: MRR@10

SkillRet-Embedding-0.6B

This is a sentence-transformers model fine-tuned for AI agent skill retrieval. Given a natural-language user request, the model retrieves relevant agent skills from a large skill library.

The model is fine-tuned from Qwen/Qwen3-Embedding-0.6B on the SkillRet benchmark training split using contrastive learning (MultipleNegativesRankingLoss).

Usage

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("anonymous-ed-benchmark/SkillRet-Embedding-0.6B", trust_remote_code=True)

# Queries are prefixed with an instruction; skill documents are encoded as-is.
query_prompt = "Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery: "

queries = [
    query_prompt + "Help me set up a CI/CD pipeline for my Python project"
]
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

q_emb = model.encode(queries, normalize_embeddings=True)
s_emb = model.encode(skills, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
similarities = q_emb @ s_emb.T
print(similarities)
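In practice the similarity matrix is reduced to a ranked top-k list per query. A minimal sketch of that step, using a toy similarity matrix and hypothetical skill names in place of real model output:

```python
import numpy as np

# Toy stand-in for q_emb @ s_emb.T: 2 queries x 3 skills (assumed values).
similarities = np.array([
    [0.61, 0.12, 0.48],
    [0.05, 0.72, 0.33],
])

# Hypothetical skill identifiers for illustration only.
skill_ids = ["ci-cd-setup", "python-debugging", "docker-compose"]

def top_k(sim_row, k=2):
    """Return the k highest-scoring (skill, score) pairs, best first."""
    idx = np.argsort(-sim_row)[:k]
    return [(skill_ids[i], float(sim_row[i])) for i in idx]

for qi, row in enumerate(similarities):
    print(f"query {qi}: {top_k(row)}")
```

In a real deployment the skill embeddings would be precomputed once and only the query embedded per request.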

Training Details

  • Base model: Qwen3-Embedding-0.6B (0.6B parameters)
  • Training data: SkillRet benchmark training split (127,190 query–skill pairs from 63,259 queries and 10,123 skills)
  • Loss: MultipleNegativesRankingLoss (InfoNCE) with cross-GPU negative sharing
  • Hardware: 4× NVIDIA B200 GPUs (DDP)
  • Effective batch size: 384 (96 per device × 4 GPUs)
  • Max sequence length: 8,192 tokens
  • Learning rate: 2e-5
  • Epochs: 1
  • Training time: ~6 hours
  • Precision: BF16
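The loss can be illustrated without the full training stack. Below is a minimal sketch of MultipleNegativesRankingLoss (InfoNCE with in-batch negatives): for a batch of (query, positive-skill) pairs, every other skill in the batch acts as a negative. Random tensors stand in for real embeddings; the scale of 20 mirrors the sentence-transformers default.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim, scale = 4, 8, 20.0  # scale acts as an inverse temperature

# Random stand-ins for query and positive-skill embeddings, L2-normalized.
q = F.normalize(torch.randn(batch, dim), dim=-1)
s = F.normalize(torch.randn(batch, dim), dim=-1)

scores = scale * q @ s.T                # (batch, batch) similarity matrix
labels = torch.arange(batch)            # positive for query i is skill i
loss = F.cross_entropy(scores, labels)  # InfoNCE over in-batch negatives
print(loss.item())
```

Cross-GPU negative sharing, as used here, enlarges the score matrix by gathering skill embeddings from all devices, so each query sees 383 negatives rather than 95.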

Training Logs

Epoch  Step  Training Loss  NDCG@15
0.15   50    2.4288         0.7802
0.30   100   1.9920         0.7842
0.45   150   1.9758         0.7887
0.60   200   1.9011         0.7865
0.76   250   1.9100         0.7874
0.91   300   1.9412         0.7859
1.00   331   -              0.7862

Best checkpoint: step 150 (highest NDCG@15).

Evaluation Results

Evaluated on the SkillRet benchmark test split (4,997 queries, 6,660 skills).

Metric     @5     @10    @15
NDCG       0.753  0.777  0.786
Recall     0.791  0.852  0.880
MRR        0.823  0.827  0.828
MAP        0.698  0.713  0.718
Precision  0.253  0.138  0.096

Accuracy@1: 0.763
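For reference, NDCG@k (the headline metric above) rewards relevant skills ranked near the top with a logarithmic position discount. A generic sketch with binary relevance follows; this is illustrative, not the benchmark's evaluation code:

```python
import numpy as np

def ndcg_at_k(ranked_rels, num_relevant, k):
    """NDCG@k for one query.

    ranked_rels: binary relevance of retrieved skills in ranked order.
    num_relevant: total number of relevant skills for this query.
    """
    ranked = np.asarray(ranked_rels[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, ranked.size + 2))
    dcg = float(np.sum(ranked * discounts))
    # Ideal DCG: all relevant skills placed at the top ranks.
    ideal = np.ones(min(num_relevant, k))
    idcg = float(np.sum(ideal / np.log2(np.arange(2, ideal.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# One of two relevant skills retrieved, at rank 2:
print(round(ndcg_at_k([0, 1, 0, 0, 0], num_relevant=2, k=5), 3))  # 0.387
```

The reported score is this quantity averaged over all 4,997 test queries.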

Intended Use

This model is designed for retrieving agent skills given natural-language user requests. It is part of the SkillRet benchmark submission for evaluating skill retrieval systems for AI agents.

Limitations

  • Optimized for English-language queries and agent skills.
  • Performance may vary on domains outside the SkillRet benchmark distribution.
  • The model retrieves skills but does not execute them.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.4.1
  • Transformers: 5.5.4
  • PyTorch: 2.7.1+cu128

Citation

Citation information will be added in the de-anonymized release.