---
license: apache-2.0
language:
  - en
  - code
library_name: transformers
tags:
  - code
  - embeddings
  - retrieval
  - code-search
  - semantic-search
  - feature-extraction
  - sentence-transformers
datasets:
  - code-rag-bench/cornstack
  - bigcode/stackoverflow
  - code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
  - name: CodeCompass-Embed
    results:
      - task:
          type: retrieval
          name: Code Retrieval
        dataset:
          type: CoIR-Retrieval/CodeSearchNet-python
          name: CodeSearchNet Python
        metrics:
          - type: ndcg@10
            value: 0.979
            name: NDCG@10
          - type: mrr@10
            value: 0.976
            name: MRR@10
      - task:
          type: retrieval
          name: Code Translation
        dataset:
          type: CoIR-Retrieval/codetrans-dl
          name: CodeTrans-DL
        metrics:
          - type: ndcg@10
            value: 0.286
            name: NDCG@10
---

# CodeCompass-Embed

**CodeCompass-Embed** is a 494M-parameter embedding model for semantic code search and retrieval, trained on a total of 86B tokens. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP, and achieves strong results on the [CoIR code retrieval benchmark](https://github.com/CoIR-team/coir), leading the compared baselines on several of its tasks.

## Model Highlights

- **Code search from natural language** — find relevant code snippets across Python, Java, JavaScript, Go, Ruby, and PHP
- **Competitive across model scales** — 494M params and 896-dim embeddings, competitive with baselines ranging from 109M to 568M params
- **Bidirectional attention** — all 24 layers converted from causal to bidirectional attention for better embedding quality
- **Lightweight** — runs on consumer GPUs; trained at 512 tokens, with RoPE extrapolation supporting longer inputs at inference
- **Versatile** — supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates

## Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
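Because the outputs are mean-pooled, L2-normalized 896-dimensional vectors, cosine similarity reduces to a plain dot product, so ranking a pre-encoded corpus is a single matrix multiplication. A toy sketch with random stand-in vectors (real embeddings come from the model, as in the Usage section below):

```python
import torch
import torch.nn.functional as F

# Stand-ins for model output: one query and 1,000 corpus entries,
# each an 896-dim, L2-normalized vector.
query_emb = F.normalize(torch.randn(1, 896), p=2, dim=-1)
corpus_embs = F.normalize(torch.randn(1000, 896), p=2, dim=-1)

# With unit-norm vectors, dot product == cosine similarity.
scores = query_emb @ corpus_embs.T        # shape (1, 1000)
top10 = torch.topk(scores, k=10, dim=-1)  # indices of the 10 closest snippets
print(top10.indices)
```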
## Benchmark Results (CoIR)

Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025). All scores are NDCG@10, sorted by CSN-Python.

| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
|-------|--------|--------|-----------|----------|-------|--------------|------|-----|
| CodeCompass-Embed (ours) | 494M | **0.979** | **0.286** | 0.736 | 0.834 | **0.814** | **0.349** | 0.666 |
| SFR-Embedding-Code | 400M | 0.951 | 0.268 | **0.995** | **0.911** | 0.726 | 0.221 | **0.679** |
| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |

### Multi-Language Code Search (CodeSearchNet)

| Language | NDCG@10 | MRR@10 |
|----------|---------|--------|
| **Python** | **0.979** | **0.976** |
| Go | 0.797 | 0.767 |
| Java | 0.639 | 0.600 |
| PHP | 0.627 | 0.585 |
| JavaScript | 0.621 | 0.578 |
| Ruby | 0.579 | 0.535 |

### Full Results (All 12 Tasks)

| Task | NDCG@10 | MRR@10 |
|------|---------|--------|
| codesearchnet-python | 0.979 | 0.976 |
| stackoverflow-qa | 0.834 | 0.810 |
| codefeedback-st | 0.814 | 0.775 |
| codesearchnet-go | 0.797 | 0.767 |
| synthetic-text2sql | 0.736 | 0.662 |
| codesearchnet-java | 0.639 | 0.600 |
| codesearchnet-php | 0.627 | 0.585 |
| codesearchnet-javascript | 0.621 | 0.578 |
| codesearchnet-ruby | 0.579 | 0.535 |
| apps | 0.349 | 0.307 |
| codetrans-dl | 0.286 | 0.164 |
| cosqa | 0.209 | 0.165 |
| **Average (12 tasks)** | **0.623** | **0.577** |

## Usage

### With Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.model.layers:
    layer.self_attn.is_causal = False
model.eval()


def encode(texts, is_query=False):
    # Add the NL->Code instruction prefix for queries
    if is_query:
        texts = [
            f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}"
            for t in texts
        ]

    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]

    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    # L2 normalize
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings


# Example: code search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities (dot product of L2-normalized vectors = cosine similarity)
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```

## Instruction Templates

For optimal performance, use these instruction prefixes for queries (a small helper sketch follows below):

| Task | Instruction Template |
|------|---------------------|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

**Note**: Document/corpus texts do NOT need instruction prefixes.
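To keep the prefixes consistent across tasks, the templates above can be wrapped in a small helper. This is a minimal sketch; the `TASK_PROMPTS` dict and `format_query` function are illustrative names, not part of the released model:

```python
# Query instruction prefixes, copied from the table above.
TASK_PROMPTS = {
    "nl2code":   "Instruct: Find the most relevant code snippet given the following query:\nQuery: ",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: ",
    "qa":        "Instruct: Find the most relevant answer given the following question:\nQuery: ",
    "text2sql":  "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: ",
}


def format_query(text: str, task: str = "nl2code") -> str:
    """Prepend the task-specific instruction; corpus/document texts are encoded without any prefix."""
    return TASK_PROMPTS[task] + text


# Example: format a Text->SQL query, then pass it to encode(..., is_query=False)
# from the snippet above (the prefix is already applied here).
sql_query = format_query("total sales per customer from the orders table", task="text2sql")
```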
## Training Details

Training followed a two-stage approach:

**Stage 1 — Embedding Conversion** (8.8M samples): Converted Qwen2.5-Coder-0.5B from a causal language model into a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.

**Stage 2 — Hard Negative Refinement** (100K samples): Continued fine-tuning on a curated 100K-sample subset with hard negatives.

- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
- **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
- **Loss**: InfoNCE with temperature τ=0.05
- **Effective Batch Size**: 1024 (via GradCache)
- **Hardware**: NVIDIA H100 (95GB)

## Limitations

- Strongest on Python; other languages show lower but competitive performance
- Weaker on competitive programming tasks (APPS), whose long solutions exceed the 512-token training context
- May not generalize to low-resource programming languages not seen in training

## Citation

```bibtex
@misc{codecompass2026,
  author    = {Faisal Mumtaz},
  title     = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License

Apache 2.0