---
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- skill-retrieval
- embedding
language:
- en
datasets:
- anonymous-ed-benchmark/skillret-benchmark
library_name: sentence-transformers
pipeline_tag: sentence-similarity
model-index:
- name: SkillRet-Embedding-0.6B
  results:
  - task:
      type: information-retrieval
      name: Skill Retrieval
    dataset:
      type: anonymous-ed-benchmark/skillret-benchmark
      name: SkillRet Benchmark (test)
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.753
      name: NDCG@5
    - type: ndcg_at_10
      value: 0.777
      name: NDCG@10
    - type: recall_at_10
      value: 0.852
      name: Recall@10
    - type: mrr_at_10
      value: 0.827
      name: MRR@10
---

# SkillRet-Embedding-0.6B

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned for **AI agent skill retrieval**: given a natural-language user request, it retrieves relevant agent skills from a large skill library. The model is fine-tuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on the training split of the [SkillRet benchmark](https://huggingface.co/datasets/anonymous-ed-benchmark/skillret-benchmark) using contrastive learning (MultipleNegativesRankingLoss).
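The contrastive objective above treats, for each query, the skill paired with every *other* query in the batch as a negative. As a minimal standalone sketch of that in-batch InfoNCE loss (plain PyTorch, assuming the sentence-transformers default scale of 20; this is illustrative, not the exact training code):

```python
import torch
import torch.nn.functional as F

def mnrl_loss(q_emb: torch.Tensor, s_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """In-batch InfoNCE, as used by MultipleNegativesRankingLoss.

    Row i of q_emb is a query whose positive is row i of s_emb;
    all other rows of s_emb act as negatives for that query.
    """
    q = F.normalize(q_emb, dim=-1)
    s = F.normalize(s_emb, dim=-1)
    scores = scale * q @ s.T                      # (batch, batch) similarity logits
    labels = torch.arange(scores.size(0))         # positives sit on the diagonal
    return F.cross_entropy(scores, labels)
```

When embeddings of a query and its positive skill are close, the diagonal logit dominates and the loss approaches zero; mismatched pairs drive it up.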
## Usage

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("anonymous-ed-benchmark/SkillRet-Embedding-0.6B", trust_remote_code=True)

query_prompt = "Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery: "
queries = [
    query_prompt + "Help me set up a CI/CD pipeline for my Python project"
]
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

q_emb = model.encode(queries, normalize_embeddings=True)
s_emb = model.encode(skills, normalize_embeddings=True)

similarities = q_emb @ s_emb.T
print(similarities)
```

## Training Details

- **Base model**: Qwen3-Embedding-0.6B (0.6B parameters)
- **Training data**: SkillRet benchmark training split (127,190 query–skill pairs from 63,259 queries and 10,123 skills)
- **Loss**: MultipleNegativesRankingLoss (InfoNCE) with cross-GPU negative sharing
- **Hardware**: 4× NVIDIA B200 GPUs (DDP)
- **Effective batch size**: 384 (96 per device × 4 GPUs)
- **Max sequence length**: 8,192 tokens
- **Learning rate**: 2e-5
- **Epochs**: 1
- **Training time**: ~6 hours
- **Precision**: BF16

### Training Logs

| Epoch | Step | Training Loss | NDCG@15 |
|:-----:|:----:|:-------------:|:-------:|
| 0.15 | 50 | 2.4288 | 0.7802 |
| 0.30 | 100 | 1.9920 | 0.7842 |
| **0.45** | **150** | **1.9758** | **0.7887** |
| 0.60 | 200 | 1.9011 | 0.7865 |
| 0.76 | 250 | 1.9100 | 0.7874 |
| 0.91 | 300 | 1.9412 | 0.7859 |
| 1.0 | 331 | — | 0.7862 |

The best checkpoint (bold row) was saved at step 150.

## Evaluation Results

Evaluated on the SkillRet benchmark test split (4,997 queries, 6,660 skills).
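For reference, the NDCG figures reported here use binary relevance. A minimal sketch of NDCG@k under that assumption (hypothetical helper `ndcg_at_k`; the benchmark's official scoring script may differ in details):

```python
import numpy as np

def ndcg_at_k(ranked_relevance, k, num_relevant):
    """Binary-relevance NDCG@k for one query.

    ranked_relevance: 0/1 labels of the retrieved skills, in ranked order.
    num_relevant: total number of relevant skills for this query.
    """
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1/log2(rank+1)
    dcg = float((rel * discounts).sum())
    # Ideal DCG: all relevant skills ranked first.
    ideal = np.ones(min(k, num_relevant))
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

A query whose single relevant skill is retrieved at rank 1 scores 1.0; at rank 2 it scores 1/log2(3) ≈ 0.63.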
| Metric | @5 | @10 | @15 |
|------------|:-----:|:-----:|:-----:|
| NDCG | 0.753 | 0.777 | 0.786 |
| Recall | 0.791 | 0.852 | 0.880 |
| MRR | 0.823 | 0.827 | 0.828 |
| MAP | 0.698 | 0.713 | 0.718 |
| Precision | 0.253 | 0.138 | 0.096 |
| Accuracy@1 | 0.763 | — | — |

## Intended Use

This model is designed for retrieving agent skills given natural-language user requests. It is part of the SkillRet benchmark submission for evaluating skill retrieval systems for AI agents.

## Limitations

- Optimized for English-language queries and agent skills.
- Performance may vary on domains outside the SkillRet benchmark distribution.
- The model retrieves skills but does not execute them.

## Framework Versions

- Python: 3.10.12
- Sentence Transformers: 5.4.1
- Transformers: 5.5.4
- PyTorch: 2.7.1+cu128

## Citation

Citation information will be added in the de-anonymized release.