---
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- skill-retrieval
- embedding
language:
- en
datasets:
- anonymous-ed-benchmark/skillret-benchmark
library_name: sentence-transformers
pipeline_tag: sentence-similarity
model-index:
- name: SkillRet-Embedding-0.6B
  results:
  - task:
      type: information-retrieval
      name: Skill Retrieval
    dataset:
      type: anonymous-ed-benchmark/skillret-benchmark
      name: SkillRet Benchmark (test)
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.753
      name: NDCG@5
    - type: ndcg_at_10
      value: 0.777
      name: NDCG@10
    - type: recall_at_10
      value: 0.852
      name: Recall@10
    - type: mrr_at_10
      value: 0.827
      name: MRR@10
---

# SkillRet-Embedding-0.6B

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned for **AI agent skill retrieval**. Given a natural-language user request, the model retrieves relevant agent skills from a large skill library.

The model is fine-tuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on the [SkillRet benchmark](https://huggingface.co/datasets/anonymous-ed-benchmark/skillret-benchmark) training split using contrastive learning (MultipleNegativesRankingLoss).

## Usage

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("anonymous-ed-benchmark/SkillRet-Embedding-0.6B", trust_remote_code=True)

# Instruction prefix applied to queries only; skills are encoded without a prefix.
query_prompt = "Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery: "

queries = [
    query_prompt + "Help me set up a CI/CD pipeline for my Python project"
]
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

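# Encode queries and skills; with L2-normalized embeddings the dot product is the cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
s_emb = model.encode(skills, normalize_embeddings=True)

# Rows correspond to queries and columns to skills; higher scores mean closer matches.
similarities = q_emb @ s_emb.T
print(similarities)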
```
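To turn the similarity scores into a ranked shortlist, the following minimal sketch sorts skills by score and keeps the top `k`. The helper `top_k_skills`, the skill name `docker-basics`, and the hand-written 2-dimensional vectors are illustrative assumptions, not part of this model or the benchmark; in practice the embeddings would come from `model.encode(..., normalize_embeddings=True)` as above.

```python
import numpy as np

def top_k_skills(query_emb, skill_embs, skill_texts, k=5):
    """Return the k highest-scoring (skill, score) pairs for one query embedding."""
    # Dot product equals cosine similarity when all vectors are L2-normalized.
    scores = skill_embs @ query_emb
    # Sort descending by score and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return [(skill_texts[i], float(scores[i])) for i in order]

# Toy unit vectors standing in for real embeddings (assumption for illustration).
skill_texts = ["ci-cd-setup", "python-debugging", "docker-basics"]
skill_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_emb = np.array([0.8, 0.6])

ranked = top_k_skills(query_emb, skill_embs, skill_texts, k=2)
print(ranked)
```

For large skill libraries, the skill embeddings can be computed once and cached, so that only the query needs to be encoded at request time.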

## Training Details

- **Base model**: Qwen3-Embedding-0.6B (0.6B parameters)
- **Training data**: SkillRet benchmark training split (127,190 query–skill pairs from 63,259 queries and 10,123 skills)
- **Loss**: MultipleNegativesRankingLoss (InfoNCE) with cross-GPU negative sharing
- **Hardware**: 4× NVIDIA B200 GPUs (DDP)
- **Effective batch size**: 384 (96 per device × 4 GPUs)
- **Max sequence length**: 8,192 tokens
- **Learning rate**: 2e-5
- **Epochs**: 1
- **Training time**: ~6 hours
- **Precision**: BF16
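
As a rough illustration of the loss listed above, the sketch below computes an InfoNCE objective with in-batch negatives in NumPy: each query's paired skill is the positive, and every other skill in the batch acts as a negative. The function name `in_batch_infonce` and the `scale` value of 20.0 are assumptions for illustration; the actual training configuration is not stated in this card. Cross-GPU negative sharing would enlarge the set of candidate skills by gathering embeddings from all devices, which this single-batch sketch does not model.

```python
import numpy as np

def in_batch_infonce(q, s, scale=20.0):
    """Mean cross-entropy over a batch where s[i] is the positive for q[i].

    q, s: arrays of shape (batch, dim), assumed L2-normalized.
    scale: similarity multiplier (assumed value, not taken from this model card).
    """
    logits = scale * (q @ s.T)                            # (batch, batch) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: perfectly aligned pairs versus deliberately mismatched pairs.
q = np.eye(3)
aligned_loss = in_batch_infonce(q, np.eye(3))
shuffled_loss = in_batch_infonce(q, np.roll(np.eye(3), 1, axis=0))
print(aligned_loss, shuffled_loss)
```

The aligned batch yields a loss near zero, while the mismatched batch is penalized heavily, which is the behavior the contrastive objective relies on.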

### Training Logs

| Epoch | Step | Training Loss | NDCG@15 |
|:-----:|:----:|:------------:|:-------:|
| 0.15 | 50 | 2.4288 | 0.7802 |
| 0.30 | 100 | 1.9920 | 0.7842 |
| **0.45** | **150** | **1.9758** | **0.7887** |
| 0.60 | 200 | 1.9011 | 0.7865 |
| 0.76 | 250 | 1.9100 | 0.7874 |
| 0.91 | 300 | 1.9412 | 0.7859 |
| 1.0 | 331 | - | 0.7862 |

The best checkpoint by NDCG@15 is at step 150 (bold row).
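
The step and epoch columns can be cross-checked against the figures in Training Details. The short calculation below is a sketch that assumes the incomplete final batch is dropped, which is not stated in this card but is consistent with the 331 steps reported for one epoch.

```python
pairs = 127_190          # query-skill pairs in the training split
per_device_batch = 96
num_gpus = 4

effective_batch = per_device_batch * num_gpus
# Assumption: the last, incomplete batch is dropped (floor division).
steps_per_epoch = pairs // effective_batch

print(effective_batch, steps_per_epoch)
for step in (50, 100, 150, 200, 250, 300):
    # Fraction of the epoch completed at each logged step.
    print(step, round(step / steps_per_epoch, 2))
```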

## Evaluation Results

Evaluated on the SkillRet benchmark test split (4,997 queries, 6,660 skills).

| Metric | @5 | @10 | @15 |
|--------|------|------|------|
| NDCG | 0.753 | 0.777 | 0.786 |
| Recall | 0.791 | 0.852 | 0.880 |
| MRR | 0.823 | 0.827 | 0.828 |
| MAP | 0.698 | 0.713 | 0.718 |
| Precision | 0.253 | 0.138 | 0.096 |

Accuracy@1 is 0.763. It is a top-1 metric, so it is reported outside the @k columns.
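
For readers unfamiliar with these metrics, the sketch below shows how Recall@k, MRR@k, and NDCG@k can be computed for a single query under binary relevance. It is an illustrative assumption about the metric definitions, not the benchmark's evaluation code, which may differ in details such as tie handling or averaging; the document IDs are made up.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

# Hypothetical ranking for one query with two relevant skills.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(recall_at_k(ranked, relevant, 3), mrr_at_k(ranked, relevant, 3), ndcg_at_k(ranked, relevant, 3))
```

Benchmark-level scores such as those in the table are typically the mean of these per-query values over all test queries.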

## Intended Use

This model is designed for retrieving agent skills given natural-language user requests. It is part of the SkillRet benchmark submission for evaluating skill retrieval systems for AI agents.

## Limitations

- Optimized for English-language queries and agent skills.
- Performance may vary on domains outside the SkillRet benchmark distribution.
- The model retrieves skills but does not execute them.

## Framework Versions

- Python: 3.10.12
- Sentence Transformers: 5.4.1
- Transformers: 5.5.4
- PyTorch: 2.7.1+cu128

## Citation

Citation information will be added in the de-anonymized release.