all-MiniLM-L12-v2-code-search-512

Version: v1.0
Release Date: 2026-01-22
Base Model: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2


Overview

  • all-MiniLM-L12-v2-code-search-512 is a lightweight, high-accuracy sentence-transformers model fine-tuned for semantic code search and code embeddings.
  • It maps natural-language queries and source code into a shared embedding space, enabling fast retrieval of relevant code snippets across multiple programming languages.
  • On CodeSearchNet (validation), this model achieves 91.0% Accuracy@1.

Key Features

  • Fine-tuned on 1.29M+ code–documentation pairs from CodeSearchNet
  • Supports 512-token context length
  • Multi-language code support (Python, Java, JavaScript, PHP, Ruby, Go)
  • Fast inference with MiniLM-style encoder
  • Produces 384-dimensional normalized embeddings (cosine similarity friendly)
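Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, which keeps large-scale search cheap. A minimal numpy sketch of that identity, using random 384-dimensional vectors in place of real model outputs so it runs offline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 384-dimensional vectors standing in for model outputs,
# L2-normalized the same way the model normalizes its embeddings.
a = rng.standard_normal(384)
b = rng.standard_normal(384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)

# For unit vectors the two scores coincide.
print(abs(cosine - dot) < 1e-12)  # True
```

In practice this means you can index the embeddings with any dot-product-based vector store without a separate normalization step.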

Intended Use

Recommended use cases:

  • Semantic code search
  • Natural language → code retrieval
  • Code–documentation matching
  • Code similarity and clustering
  • Developer tools and IDE integrations

Supported languages:

  • Python
  • Java
  • JavaScript
  • PHP
  • Ruby
  • Go
  • Other languages may transfer, but only the six above are covered by CodeSearchNet

Quick Start

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("isuruwijesiri/all-MiniLM-L12-v2-code-search-512")

# Encode code
code_embedding = model.encode("def hello_world():\n    print('Hello, World!')", convert_to_tensor=True)

# Encode natural language description
query_embedding = model.encode("function that prints hello world", convert_to_tensor=True)

# Cosine similarity
similarity = util.cos_sim(query_embedding, code_embedding).item()
print(f"Similarity: {similarity:.4f}")

Quick Start (JavaScript)

Use this model in JavaScript/TypeScript with Transformers.js (Node.js and browsers).

Installation:

npm install @xenova/transformers

Usage:

import { pipeline } from '@xenova/transformers';

const extractor = await pipeline('feature-extraction', 'isuruwijesiri/all-MiniLM-L12-v2-code-search-512', {
  quantized: false
});

const output = await extractor('def add(a, b): return a + b', {
  pooling: 'mean',
  normalize: true
});

const embedding = Array.from(output.data);
console.log(embedding); // [384] dimensional vector

Code Search Example

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("isuruwijesiri/all-MiniLM-L12-v2-code-search-512")

code_snippets = [
    "def calculate_sum(a, b):\n    return a + b",
    "def find_max(numbers):\n    return max(numbers)",
    "class User:\n    def __init__(self, name):\n        self.name = name"
]

query = "function to add two numbers"

query_emb = model.encode(query, convert_to_tensor=True)
code_embs = model.encode(code_snippets, convert_to_tensor=True)

scores = util.cos_sim(query_emb, code_embs)[0]
best_idx = scores.argmax().item()

print(f"Best match: {code_snippets[best_idx]}")
print(f"Similarity: {scores[best_idx]:.4f}")
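For larger corpora the same pattern scales to ranked top-k retrieval. An illustrative numpy sketch, with random unit vectors standing in for the output of model.encode(...) so it runs without downloading the model:

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Stand-ins for embeddings of 1,000 corpus snippets and one query;
# with the real model these come from model.encode(...).
corpus = normalize(rng.standard_normal((1000, 384)))
query = normalize(rng.standard_normal(384))

scores = corpus @ query            # cosine scores (unit vectors)
top_k = 5
top_idx = np.argsort(-scores)[:top_k]

for rank, i in enumerate(top_idx, start=1):
    print(f"{rank}. snippet {i}: score {scores[i]:.4f}")
```

For very large corpora you would typically precompute and cache the corpus embeddings, then run only the query through the model at search time.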

Performance

Evaluated on the CodeSearchNet validation set (2,000 samples).

Metric        Score
MRR@10        0.9338
Accuracy@1    0.9100
Accuracy@3    0.9540
Accuracy@5    0.9625
Accuracy@10   0.9760
Recall@1      0.9100
Recall@3      0.9540
Recall@5      0.9625
Recall@10     0.9760
NDCG@10       0.9441
MAP@100       0.9347

This model significantly outperforms the base all-MiniLM-L12-v2 on code search tasks:

Metric        Base Model   Fine-tuned   Improvement
MRR@10        0.8235       0.9338       +0.1103 (+13.4%)
Accuracy@1    0.7735       0.9100       +0.1365 (+17.6%)
Accuracy@3    0.8620       0.9540       +0.0920 (+10.7%)
Accuracy@5    0.8930       0.9625       +0.0695 (+7.8%)
Accuracy@10   0.9210       0.9760       +0.0550 (+6.0%)
Recall@5      0.8930       0.9625       +0.0695 (+7.8%)
Recall@10     0.9210       0.9760       +0.0550 (+6.0%)
NDCG@10       0.8472       0.9441       +0.0969 (+11.4%)
MAP@100       0.8254       0.9347       +0.1093 (+13.2%)
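These metrics follow their standard definitions. As an illustrative sketch (not the exact evaluation script used here), MRR@k and Accuracy@k can be computed from the 1-based rank of the correct snippet for each query:

```python
def mrr_at_k(ranks, k=10):
    # ranks: 1-based rank of the correct snippet for each query;
    # queries whose correct snippet falls outside the top k contribute 0.
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)

def accuracy_at_k(ranks, k):
    # Fraction of queries whose correct snippet appears in the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy ranks for five queries.
ranks = [1, 1, 2, 4, 15]
print(mrr_at_k(ranks))           # (1 + 1 + 0.5 + 0.25 + 0) / 5 = 0.55
print(accuracy_at_k(ranks, 1))   # 2/5 = 0.4
print(accuracy_at_k(ranks, 10))  # 4/5 = 0.8
```

Note that with one relevant snippet per query, Recall@k equals Accuracy@k, which is why the two rows match in the table above.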

Context Length Configuration

The model was trained with a maximum sequence length of 512 tokens, and max_seq_length defaults to 512, so no configuration is required. You can lower it to speed up inference on short inputs:

model.max_seq_length = 256  # Faster inference for short code
model.max_seq_length = 512  # Default, best accuracy (recommended)

Training Details

  • Dataset: CodeSearchNet
  • Training samples: 1,294,017
  • Evaluation samples: 2,000
  • Epochs: 3
  • Effective batch size: 192 (96 per device × 2 devices)
  • Loss function: MultipleNegativesRankingLoss
  • FP16: Enabled
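MultipleNegativesRankingLoss treats every other positive in a batch as a negative for each query. A simplified numpy sketch of the idea (the real implementation is sentence_transformers.losses.MultipleNegativesRankingLoss; the scale value here is illustrative):

```python
import numpy as np

def mnr_loss(q, p, scale=20.0):
    """In-batch-negatives ranking loss (illustrative numpy version).

    q, p: (batch, dim) L2-normalized query/positive embeddings.
    Each query's positive is the matching row of p; all other rows
    in the batch act as negatives.
    """
    scores = scale * (q @ p.T)                   # (batch, batch) similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # target class = diagonal

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 384))
q /= np.linalg.norm(q, axis=1, keepdims=True)

# Positives close to their queries give a low loss; random ones do not.
p_good = q + 0.01 * rng.standard_normal((4, 384))
p_good /= np.linalg.norm(p_good, axis=1, keepdims=True)
p_rand = rng.standard_normal((4, 384))
p_rand /= np.linalg.norm(p_rand, axis=1, keepdims=True)

print(mnr_loss(q, p_good) < mnr_loss(q, p_rand))  # True
```

This is why larger effective batch sizes tend to help: each batch contributes more in-batch negatives per query.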

Limitations

  • Inputs longer than 512 tokens are truncated
  • Not suitable for code generation or completion
  • Optimized for semantic similarity, not syntactic correctness
  • Performance may vary across programming languages and domains

Citation

If you use this model, please cite:

@misc{all_MiniLM_L12_v2_code_search_512,
  author = {isuruwijesiri},
  title = {all-MiniLM-L12-v2-code-search-512},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/isuruwijesiri/all-MiniLM-L12-v2-code-search-512}
}

License: Apache 2.0

Model size: 33.4M parameters (Safetensors, F32)