# SkillRet-Embedding-0.6B
This is a sentence-transformers model fine-tuned for AI agent skill retrieval. Given a natural-language user request, the model retrieves relevant agent skills from a large skill library.
The model is fine-tuned from Qwen/Qwen3-Embedding-0.6B on the SkillRet benchmark training split using contrastive learning (MultipleNegativesRankingLoss).
📄 Technical report: SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents (arXiv:2605.05726)
## Usage

### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ThakiCloud/SkillRet-Embedding-0.6B", trust_remote_code=True)

# Queries should carry the instruction prefix used during training.
query_prompt = "Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery: "
queries = [
    query_prompt + "Help me set up a CI/CD pipeline for my Python project"
]

# Skill documents are encoded without the instruction prefix.
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

q_emb = model.encode(queries, normalize_embeddings=True)
s_emb = model.encode(skills, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
similarities = q_emb @ s_emb.T
print(similarities)
```
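To turn the similarity matrix into a ranked skill list, a top-k lookup over each row is enough. A minimal sketch with NumPy (the similarity values below are synthetic stand-ins for the output of the snippet above):

```python
import numpy as np

# Synthetic stand-in for the (n_queries, n_skills) similarity matrix.
similarities = np.array([[0.72, 0.31]])
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

top_k = 1
for row in similarities:
    for idx in np.argsort(-row)[:top_k]:  # highest-similarity skills first
        print(f"{row[idx]:.2f}  {skills[idx]}")
```

For a large skill library, the same scores can instead be fed to an approximate-nearest-neighbor index, since the embeddings are already L2-normalized.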
## Training Details
- Base model: Qwen3-Embedding-0.6B (0.6B parameters)
- Training data: SkillRet benchmark training split (127,190 query–skill pairs from 63,259 queries and 10,123 skills)
- Loss: MultipleNegativesRankingLoss (InfoNCE) with cross-GPU negative sharing
- Hardware: 4× NVIDIA B200 GPUs (DDP)
- Effective batch size: 384 (96 per device × 4 GPUs)
- Max sequence length: 8,192 tokens
- Learning rate: 2e-5
- Epochs: 1
- Training time: ~6 hours
- Precision: BF16
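MultipleNegativesRankingLoss is the standard in-batch-negatives (InfoNCE) objective: each query's paired skill is the positive, and every other skill in the batch acts as a negative. A toy sketch of the core computation with random embeddings (scale 20.0 is the sentence-transformers default; batch size and dimensions here are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim = 4, 8
q = F.normalize(torch.randn(batch, dim), dim=1)  # query embeddings
s = F.normalize(torch.randn(batch, dim), dim=1)  # positive skill embeddings

scale = 20.0                      # similarity temperature
logits = scale * q @ s.T          # (batch, batch): row i vs. all skills in batch
labels = torch.arange(batch)      # the diagonal entries are the positives
loss = F.cross_entropy(logits, labels)
print(loss.item())
```

With cross-GPU negative sharing, the logits matrix widens to include the skill embeddings gathered from all devices, so the effective number of negatives per query grows with world size.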
### Training Logs
| Epoch | Step | Training Loss | NDCG@15 |
|---|---|---|---|
| 0.15 | 50 | 2.4288 | 0.7802 |
| 0.30 | 100 | 1.9920 | 0.7842 |
| **0.45** | **150** | **1.9758** | **0.7887** |
| 0.60 | 200 | 1.9011 | 0.7865 |
| 0.76 | 250 | 1.9100 | 0.7874 |
| 0.91 | 300 | 1.9412 | 0.7859 |
| 1.0 | 331 | — | 0.7862 |
The best checkpoint was saved at step 150 (highest NDCG@15).
## Evaluation Results
Evaluated on the SkillRet benchmark test split (4,997 queries, 6,660 skills).
| Metric | @5 | @10 | @15 |
|---|---|---|---|
| NDCG | 0.7557 | 0.7803 | 0.7887 |
| Recall | 0.7915 | 0.8542 | 0.8809 |
| Completeness | 0.6596 | 0.7509 | 0.7903 |
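NDCG@k and Recall@k can be reproduced per query from a ranked list of skill ids and the set of relevant skills. A minimal sketch (skill names are illustrative; this assumes binary relevance, which may differ from the benchmark's official scorer):

```python
import math

def ndcg_at_k(ranked, relevant, k):
    # Binary-relevance DCG: a hit at rank i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, s in enumerate(ranked[:k]) if s in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant skills found in the top k.
    return len(set(ranked[:k]) & relevant) / len(relevant)

ranked = ["ci-cd-setup", "python-debugging", "docker-compose"]
relevant = {"ci-cd-setup", "docker-compose"}
print(ndcg_at_k(ranked, relevant, 2))    # only the rank-1 hit counts within k=2
print(recall_at_k(ranked, relevant, 2))  # 1 of 2 relevant skills retrieved -> 0.5
```

Corpus-level scores are the mean over all test queries.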
## Intended Use
This model is designed for retrieving agent skills given natural-language user requests. It accompanies the SkillRet benchmark submission for evaluating skill-retrieval systems for AI agents.
## Limitations
- Optimized for English-language queries and agent skills.
- Performance may vary on domains outside the SkillRet benchmark distribution.
- The model retrieves skills but does not execute them.
## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 5.4.1
- Transformers: 5.5.4
- PyTorch: 2.7.1+cu128
## Citation
If you use this model or the SkillRet benchmark, please cite:
```bibtex
@article{cho2026skillret,
  title   = {SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents},
  author  = {Cho, Hongcheol and Kang, Ryangkyung and Kim, Youngeun},
  journal = {arXiv preprint arXiv:2605.05726},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.05726}
}
```