---
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- skill-retrieval
- embedding
language:
- en
datasets:
- anonymous-ed-benchmark/skillret-benchmark
library_name: sentence-transformers
pipeline_tag: sentence-similarity
model-index:
- name: SkillRet-Embedding-0.6B
results:
- task:
type: information-retrieval
name: Skill Retrieval
dataset:
type: anonymous-ed-benchmark/skillret-benchmark
name: SkillRet Benchmark (test)
split: test
metrics:
- type: ndcg_at_5
value: 0.753
name: NDCG@5
- type: ndcg_at_10
value: 0.777
name: NDCG@10
- type: recall_at_10
value: 0.852
name: Recall@10
- type: mrr_at_10
value: 0.827
name: MRR@10
---
# SkillRet-Embedding-0.6B
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned for **AI agent skill retrieval**. Given a natural-language user request, the model retrieves relevant agent skills from a large skill library.
The model is fine-tuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on the [SkillRet benchmark](https://huggingface.co/datasets/anonymous-ed-benchmark/skillret-benchmark) training split using contrastive learning (MultipleNegativesRankingLoss).
## Usage
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("anonymous-ed-benchmark/SkillRet-Embedding-0.6B", trust_remote_code=True)

# Queries use an instruction prefix; skill documents are encoded as-is.
query_prompt = "Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery: "
queries = [
    query_prompt + "Help me set up a CI/CD pipeline for my Python project"
]
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

# With normalized embeddings, the dot product equals cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
s_emb = model.encode(skills, normalize_embeddings=True)
similarities = q_emb @ s_emb.T  # shape: (n_queries, n_skills)
print(similarities)
```
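The similarity matrix above can be turned into a ranked skill list per query. The following is a minimal sketch with hypothetical similarity scores and skill names (not real model output), showing the top-k selection step:

```python
import numpy as np

# Hypothetical cosine similarities for one query over four skills,
# standing in for a row of `q_emb @ s_emb.T` from the snippet above.
similarities = np.array([[0.71, 0.22, 0.54, 0.18]])
skill_names = ["ci-cd-setup", "python-debugging", "docker-deploy", "data-viz"]

# Rank skills for the first query by descending similarity and keep top-k.
top_k = 2
ranked = np.argsort(-similarities[0])[:top_k]
print([skill_names[i] for i in ranked])  # → ['ci-cd-setup', 'docker-deploy']
```

In practice you would apply the same `argsort` per row for a batch of queries, or use a vector index for large skill libraries.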
## Training Details
- **Base model**: Qwen3-Embedding-0.6B (0.6B parameters)
- **Training data**: SkillRet benchmark training split (127,190 query–skill pairs from 63,259 queries and 10,123 skills)
- **Loss**: MultipleNegativesRankingLoss (InfoNCE) with cross-GPU negative sharing
- **Hardware**: 4× NVIDIA B200 GPUs (DDP)
- **Effective batch size**: 384 (96 per device × 4 GPUs)
- **Max sequence length**: 8,192 tokens
- **Learning rate**: 2e-5
- **Epochs**: 1
- **Training time**: ~6 hours
- **Precision**: BF16
### Training Logs
| Epoch | Step | Training Loss | NDCG@15 |
|:-----:|:----:|:------------:|:-------:|
| 0.15 | 50 | 2.4288 | 0.7802 |
| 0.30 | 100 | 1.9920 | 0.7842 |
| **0.45** | **150** | **1.9758** | **0.7887** |
| 0.60 | 200 | 1.9011 | 0.7865 |
| 0.76 | 250 | 1.9100 | 0.7874 |
| 0.91 | 300 | 1.9412 | 0.7859 |
| 1.00 | 331 | — | 0.7862 |
Best checkpoint at step 150 (bold row).
## Evaluation Results
Evaluated on the SkillRet benchmark test split (4,997 queries, 6,660 skills).
| Metric | @5 | @10 | @15 |
|--------|------|------|------|
| NDCG | 0.753 | 0.777 | 0.786 |
| Recall | 0.791 | 0.852 | 0.880 |
| MRR | 0.823 | 0.827 | 0.828 |
| MAP | 0.698 | 0.713 | 0.718 |
| Precision | 0.253 | 0.138 | 0.096 |
| Accuracy@1 | 0.763 | — | — |
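For reference, NDCG@k (the headline metric above) discounts relevant results by the log of their rank and normalizes by the ideal ordering. A minimal sketch with synthetic binary relevance judgments (not computed from this model's outputs):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k ranked results.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Synthetic example: relevance of the top-5 retrieved skills for one query.
print(round(ndcg_at_k([1, 0, 1, 0, 0], 5), 3))  # → 0.92
```

The reported scores average this quantity over all 4,997 test queries.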
## Intended Use
This model is designed for retrieving agent skills given natural-language user requests. It is part of the SkillRet benchmark submission for evaluating skill retrieval systems for AI agents.
## Limitations
- Optimized for English-language queries and agent skills.
- Performance may vary on domains outside the SkillRet benchmark distribution.
- The model retrieves skills but does not execute them.
## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 5.4.1
- Transformers: 5.5.4
- PyTorch: 2.7.1+cu128
## Citation
Citation information will be added in the de-anonymized release.