Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval
README.md
CHANGED
```diff
@@ -189,10 +189,10 @@ For optimal performance, use these instruction prefixes for queries:
 Training followed a two-stage approach:
 
 **Stage 1 — Embedding Conversion** (8.8M samples):
-Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic
+Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.
 
 **Stage 2 — Hard Negative Refinement** (100K samples):
-Continued fine-tuning on a curated 100K-sample subset with
+Continued fine-tuning on a curated 100K-sample subset with hard negatives.
 
 - **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
 - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
```
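The pooling described in the architecture bullet (mean pooling over token states, then L2 normalization) can be sketched as below. This is a minimal illustration, not the model's actual implementation; the function name `embed` and the attention-mask handling are assumptions about how such a head is typically wired up in PyTorch.

```python
import torch
import torch.nn.functional as F


def embed(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token states over non-padding positions, then L2-normalize.

    Hypothetical helper illustrating the README's pooling description;
    not the model's actual code.
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1) — 1 for real tokens
    summed = (last_hidden_state * mask).sum(dim=1)   # sum states over valid positions
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of valid tokens per example
    pooled = summed / counts                         # mean pooling
    return F.normalize(pooled, p=2, dim=-1)          # L2 normalization to unit length
```

Because the outputs are unit vectors, a dot product between a query embedding and a code embedding equals their cosine similarity.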