---
license: apache-2.0
language:
  - en
  - code
library_name: transformers
tags:
  - code
  - embeddings
  - retrieval
  - code-search
  - semantic-search
  - feature-extraction
  - sentence-transformers
datasets:
  - code-rag-bench/cornstack
  - bigcode/stackoverflow
  - code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
  - name: CodeCompass-Embed
    results:
      - task:
          type: retrieval
          name: Code Retrieval
        dataset:
          type: CoIR-Retrieval/CodeSearchNet-python
          name: CodeSearchNet Python
        metrics:
          - type: ndcg@10
            value: 0.979
            name: NDCG@10
          - type: mrr@10
            value: 0.976
            name: MRR@10
      - task:
          type: retrieval
          name: Code Translation
        dataset:
          type: CoIR-Retrieval/codetrans-dl
          name: CodeTrans-DL
        metrics:
          - type: ndcg@10
            value: 0.286
            name: NDCG@10
---

# CodeCompass-Embed

**CodeCompass-Embed** is a 494M-parameter embedding model for semantic code search and retrieval, trained on a total of 86B tokens. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP, and achieves strong results on the [CoIR code retrieval benchmark](https://github.com/CoIR-team/coir), leading the compared baselines on several of its tasks.

## Model Highlights

- **Code search from natural language** — find relevant code snippets across Python, Java, JavaScript, Go, Ruby, and PHP
- **Competitive across model scales** — 494M params and 896-dim embeddings, competitive with baselines ranging from 109M to 568M params
- **Bidirectional attention** — all 24 layers converted from causal to bidirectional attention for better embedding quality
- **Lightweight** — runs on consumer GPUs; trained at 512 tokens, with RoPE extrapolation supporting longer inputs at inference
- **Versatile** — supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates

## Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
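Because the outputs are mean-pooled, L2-normalized 896-dimensional vectors, cosine similarity reduces to a plain dot product, so ranking a pre-encoded corpus is a single matrix multiplication. A toy sketch with random stand-in vectors (real embeddings come from the model, as in the Usage section below):

```python
import torch
import torch.nn.functional as F

# Stand-ins for model output: one query and 1,000 corpus entries,
# each an 896-dim, L2-normalized vector.
query_emb = F.normalize(torch.randn(1, 896), p=2, dim=-1)
corpus_embs = F.normalize(torch.randn(1000, 896), p=2, dim=-1)

# With unit-norm vectors, dot product == cosine similarity.
scores = query_emb @ corpus_embs.T        # shape (1, 1000)
top10 = torch.topk(scores, k=10, dim=-1)  # indices of the 10 closest snippets
print(top10.indices)
```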
## Benchmark Results (CoIR)

Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025). All scores are NDCG@10, sorted by CSN-Python.

| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
|-------|--------|--------|-----------|----------|-------|--------------|------|-----|
| CodeCompass-Embed (ours) | 494M | **0.979** | **0.286** | 0.736 | 0.834 | **0.814** | **0.349** | 0.666 |
| SFR-Embedding-Code | 400M | 0.951 | 0.268 | **0.995** | **0.911** | 0.726 | 0.221 | **0.679** |
| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |

### Multi-Language Code Search (CodeSearchNet)

| Language | NDCG@10 | MRR@10 |
|----------|---------|--------|
| **Python** | **0.979** | **0.976** |
| Go | 0.797 | 0.767 |
| Java | 0.639 | 0.600 |
| PHP | 0.627 | 0.585 |
| JavaScript | 0.621 | 0.578 |
| Ruby | 0.579 | 0.535 |

### Full Results (All 12 Tasks)

| Task | NDCG@10 | MRR@10 |
|------|---------|--------|
| codesearchnet-python | 0.979 | 0.976 |
| stackoverflow-qa | 0.834 | 0.810 |
| codefeedback-st | 0.814 | 0.775 |
| codesearchnet-go | 0.797 | 0.767 |
| synthetic-text2sql | 0.736 | 0.662 |
| codesearchnet-java | 0.639 | 0.600 |
| codesearchnet-php | 0.627 | 0.585 |
| codesearchnet-javascript | 0.621 | 0.578 |
| codesearchnet-ruby | 0.579 | 0.535 |
| apps | 0.349 | 0.307 |
| codetrans-dl | 0.286 | 0.164 |
| cosqa | 0.209 | 0.165 |
| **Average (12 tasks)** | **0.623** | **0.577** |

## Usage

### With Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.model.layers:
    layer.self_attn.is_causal = False
model.eval()


def encode(texts, is_query=False):
    # Add the NL->Code instruction prefix for queries
    if is_query:
        texts = [
            f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}"
            for t in texts
        ]

    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]

    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    # L2 normalize
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings


# Example: code search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities (dot product of L2-normalized vectors = cosine similarity)
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```

## Instruction Templates

For optimal performance, use these instruction prefixes for queries (a small helper sketch follows below):

| Task | Instruction Template |
|------|---------------------|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

**Note**: Document/corpus texts do NOT need instruction prefixes.
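To keep the prefixes consistent across tasks, the templates above can be wrapped in a small helper. This is a minimal sketch; the `TASK_PROMPTS` dict and `format_query` function are illustrative names, not part of the released model:

```python
# Query instruction prefixes, copied from the table above.
TASK_PROMPTS = {
    "nl2code":   "Instruct: Find the most relevant code snippet given the following query:\nQuery: ",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: ",
    "qa":        "Instruct: Find the most relevant answer given the following question:\nQuery: ",
    "text2sql":  "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: ",
}


def format_query(text: str, task: str = "nl2code") -> str:
    """Prepend the task-specific instruction; corpus/document texts are encoded without any prefix."""
    return TASK_PROMPTS[task] + text


# Example: format a Text->SQL query, then pass it to encode(..., is_query=False)
# from the snippet above (the prefix is already applied here).
sql_query = format_query("total sales per customer from the orders table", task="text2sql")
```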
## Training Details

Training followed a two-stage approach:

**Stage 1 — Embedding Conversion** (8.8M samples): Converted Qwen2.5-Coder-0.5B from a causal language model into a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.

**Stage 2 — Hard Negative Refinement** (100K samples): Continued fine-tuning on a curated 100K-sample subset with hard negatives.

- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
- **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
- **Loss**: InfoNCE with temperature τ=0.05
- **Effective Batch Size**: 1024 (via GradCache)
- **Hardware**: NVIDIA H100 (95GB)

## Limitations

- Strongest on Python; other languages show lower but competitive performance
- Weaker on competitive programming tasks (APPS), whose long solutions exceed the 512-token training context
- May not generalize to low-resource programming languages not seen in training

## Citation

```bibtex
@misc{codecompass2026,
  author    = {Faisal Mumtaz},
  title     = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License

Apache 2.0