# all-MiniLM-L12-v2-code-search-512

- Version: v1.0
- Release Date: 2026-01-22
- Base Model: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
## Overview

all-MiniLM-L12-v2-code-search-512 is a lightweight, high-accuracy sentence-transformers model fine-tuned for semantic code search and code embeddings. It maps natural-language queries and source code into a shared embedding space, enabling fast retrieval of relevant code snippets across multiple programming languages.

On the CodeSearchNet validation set, this model achieves 91.0% Accuracy@1.
## Key Features
- Fine-tuned on 1.29M+ code–documentation pairs from CodeSearchNet
- Supports 512-token context length
- Multi-language code support (Python, Java, JavaScript, PHP, Ruby, Go)
- Fast inference with MiniLM-style encoder
- Produces 384-dimensional normalized embeddings (cosine similarity friendly)
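
Because the embeddings are already L2-normalized, cosine similarity reduces to a plain dot product at query time. A minimal NumPy sketch of that property, using stand-in random vectors rather than real model outputs:

```python
import numpy as np

def cosine(a, b):
    # Full cosine similarity: dot product divided by the product of norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in 384-dimensional vectors, unit-normalized like the model's output.
rng = np.random.default_rng(0)
a = rng.standard_normal(384)
a /= np.linalg.norm(a)
b = rng.standard_normal(384)
b /= np.linalg.norm(b)

# For unit vectors the two quantities agree, so a plain dot product suffices.
assert abs(cosine(a, b) - float(np.dot(a, b))) < 1e-9
```

This is why the model works well with dot-product-based vector indexes as well as cosine-based ones.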
## Intended Use

Recommended use cases:
- Semantic code search
- Natural language → code retrieval
- Code–documentation matching
- Code similarity and clustering
- Developer tools and IDE integrations
Supported languages:
- Python
- Java
- JavaScript
- PHP
- Ruby
- Go
- Other languages included in CodeSearchNet
## Quick Start

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("isuruwijesiri/all-MiniLM-L12-v2-code-search-512")

# Encode code
code_embedding = model.encode("def hello_world():\n    print('Hello, World!')", convert_to_tensor=True)

# Encode natural language description
query_embedding = model.encode("function that prints hello world", convert_to_tensor=True)

# Cosine similarity
similarity = util.cos_sim(query_embedding, code_embedding).item()
print(f"Similarity: {similarity:.4f}")
```
## Quick Start - JS

Use this model in JavaScript/TypeScript with Transformers.js (Node.js and browsers).

Installation:

```bash
npm install @xenova/transformers
```

Usage:

```javascript
import { pipeline } from '@xenova/transformers';

const extractor = await pipeline('feature-extraction', 'isuruwijesiri/all-MiniLM-L12-v2-code-search-512', {
  quantized: false
});

const output = await extractor('def add(a, b): return a + b', {
  pooling: 'mean',
  normalize: true
});

const embedding = Array.from(output.data);
console.log(embedding); // 384-dimensional vector
```
## Code Search Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("isuruwijesiri/all-MiniLM-L12-v2-code-search-512")

code_snippets = [
    "def calculate_sum(a, b):\n    return a + b",
    "def find_max(numbers):\n    return max(numbers)",
    "class User:\n    def __init__(self, name):\n        self.name = name"
]

query = "function to add two numbers"

query_emb = model.encode(query, convert_to_tensor=True)
code_embs = model.encode(code_snippets, convert_to_tensor=True)

scores = util.cos_sim(query_emb, code_embs)[0]
best_idx = scores.argmax().item()

print(f"Best match: {code_snippets[best_idx]}")
print(f"Similarity: {scores[best_idx]:.4f}")
```
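
For a larger corpus, the snippet embeddings would typically be encoded once, cached as a matrix, and each query answered with a single matrix product. A sketch of that retrieval step, with stand-in random embeddings in place of real `model.encode` outputs:

```python
import numpy as np

# Stand-in corpus embeddings: in practice these come from model.encode(code_snippets),
# are unit-normalized, and are computed once up front.
rng = np.random.default_rng(42)
corpus = rng.standard_normal((1000, 384))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of document 17, so it should match it.
query = corpus[17] + 0.01 * rng.standard_normal(384)
query /= np.linalg.norm(query)

scores = corpus @ query            # cosine scores, since everything is unit-norm
top_k = np.argsort(-scores)[:5]    # indices of the 5 best matches
print(top_k[0])                    # 17
```

For corpora beyond a few hundred thousand snippets, the same matrix would usually move into an approximate-nearest-neighbor index.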
## Performance

Evaluated on the CodeSearchNet validation set (2,000 samples).
| Metric | Score |
|---|---|
| MRR@10 | 0.9338 |
| Accuracy@1 | 0.9100 |
| Accuracy@3 | 0.9540 |
| Accuracy@5 | 0.9625 |
| Accuracy@10 | 0.9760 |
| Recall@1 | 0.9100 |
| Recall@3 | 0.9540 |
| Recall@5 | 0.9625 |
| Recall@10 | 0.9760 |
| NDCG@10 | 0.9441 |
| MAP@100 | 0.9347 |
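
For context on how these figures relate: CodeSearchNet evaluation has exactly one relevant snippet per query, which is why Recall@k equals Accuracy@k in the table above, and MRR@10 averages the reciprocal rank of the correct snippet. A toy sketch of those definitions from per-query ranks (the ranks below are invented, not this model's):

```python
# 1-based rank of the correct snippet for each query (toy data).
ranks = [1, 1, 2, 1, 4, 1, 12, 1, 3, 1]

# MRR@10: mean of 1/rank, counting 0 when the correct snippet is outside the top 10.
mrr_at_10 = sum(1.0 / r for r in ranks if r <= 10) / len(ranks)

# Accuracy@k (= Recall@k with one relevant item): fraction of queries with rank <= k.
acc_at_1 = sum(r <= 1 for r in ranks) / len(ranks)
acc_at_10 = sum(r <= 10 for r in ranks) / len(ranks)

print(mrr_at_10, acc_at_1, acc_at_10)
```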
This model significantly outperforms the base all-MiniLM-L12-v2 on code search tasks:
| Metric | Base Model | Fine-tuned | Improvement |
|---|---|---|---|
| MRR@10 | 0.8235 | 0.9338 | +0.1103 (+13.4%) |
| Accuracy@1 | 0.7735 | 0.9100 | +0.1365 (+17.6%) |
| Accuracy@3 | 0.8620 | 0.9540 | +0.0920 (+10.7%) |
| Accuracy@5 | 0.8930 | 0.9625 | +0.0695 (+7.8%) |
| Accuracy@10 | 0.9210 | 0.9760 | +0.0550 (+6.0%) |
| Recall@5 | 0.8930 | 0.9625 | +0.0695 (+7.8%) |
| Recall@10 | 0.9210 | 0.9760 | +0.0550 (+6.0%) |
| NDCG@10 | 0.8472 | 0.9441 | +0.0969 (+11.4%) |
| MAP@100 | 0.8254 | 0.9347 | +0.1093 (+13.2%) |
## Context Length Configuration

This model was trained with a maximum sequence length of 512 tokens. By default, `max_seq_length` is set to 512, so no manual configuration is required; it can be lowered to trade a little accuracy for speed on short inputs:

```python
model.max_seq_length = 256  # Faster inference for short code
model.max_seq_length = 512  # Default, best accuracy (recommended)
```
## Training Details

- Dataset: CodeSearchNet
- Training samples: 1,294,017
- Evaluation samples: 2,000
- Epochs: 3
- Effective batch size: 192 (96 per device × 2 devices, if applicable)
- Loss function: MultipleNegativesRankingLoss
- FP16: Enabled
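
MultipleNegativesRankingLoss treats each (query, code) pair in a batch as a positive and every other code in the same batch as a negative; conceptually it is cross-entropy over a scaled batch similarity matrix. A minimal NumPy sketch of that idea (the scale of 20 mirrors the sentence-transformers default; the embeddings here are stand-ins):

```python
import numpy as np

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    m = x.max(axis=1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=1, keepdims=True))

def mnr_loss(query_embs, code_embs, scale=20.0):
    # Entry (i, j) scores query i against code j; the diagonal holds the true
    # pairs, and every off-diagonal code acts as an in-batch negative.
    sims = scale * query_embs @ code_embs.T
    # Cross-entropy with target j = i for every row.
    return float(-np.mean(np.diag(log_softmax(sims))))

# Stand-in unit embeddings: identical query/code embeddings give near-zero loss.
rng = np.random.default_rng(1)
embs = rng.standard_normal((8, 384))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(mnr_loss(embs, embs))
```

Because the negatives come for free from the batch, larger effective batch sizes (192 here) make the ranking task harder and generally improve retrieval quality.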
## Limitations
- Inputs longer than 512 tokens are truncated
- Not suitable for code generation or completion
- Optimized for semantic similarity, not syntactic correctness
- Performance may vary across programming languages and domains
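
A common mitigation for the 512-token truncation limit (a sketch of a generic technique, not something this model does for you) is to split long code into overlapping token windows, embed each window separately, and mean-pool the vectors. The window splitting itself needs no model:

```python
def sliding_windows(tokens, window=512, stride=384):
    # Split a token sequence into overlapping windows: each window holds at most
    # `window` tokens, and consecutive windows overlap by `window - stride` tokens.
    if len(tokens) <= window:
        return [tokens]
    out = []
    for start in range(0, len(tokens), stride):
        out.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return out

chunks = sliding_windows(list(range(1000)), window=512, stride=384)
print([len(c) for c in chunks])  # [512, 512, 232]
```

Each chunk would then be encoded, and the resulting vectors averaged and re-normalized to yield one embedding per file or function.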
## Citation

If you use this model, please cite:
```bibtex
@misc{all_MiniLM_L12_v2_code_search_512,
  author = {isuruwijesiri},
  title = {all-MiniLM-L12-v2-code-search-512},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/isuruwijesiri/all-MiniLM-L12-v2-code-search-512}
}
```
## License

Apache 2.0