BGE-large Code Search (LoRA fine-tuned)
A fine-tuned code-search embedding model based on BAAI/bge-large-en-v1.5 (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for cqs: code intelligence and RAG for AI agents.
Key Results
| Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
|---|---|---|---|---|
| Fixture (296q, 7 languages, enriched) | R@1 | 91.6% | 90.9% | 90.5% |
| Fixture | MRR | 0.952 | 0.949 | 0.948 |
| Raw code embedding (55q, no enrichment) | R@1 | 66.2% | 61.8% | 70.9% |
| Real codebase (100q lookup) | R@1 | 50.0% | 50.0% | 26.0% |
| Real codebase | R@5 | 73.0% | 72.0% | 51.0% |
| CoIR 9-task (19 subtasks) | Overall | 57.5 | 55.7 | 52.7 |
| CoIR CodeSearchNet (6 languages) | NDCG@10 | 0.779 | 0.721 | 0.615 |
A new best on every metric except raw-code R@1, where v9-200k's 70.9% still leads. Fine-tuning is additive with model capacity: BGE-large did not fall into a degenerate basin the way the E5-base variants did.
Training Details
- Base Model: BAAI/bge-large-en-v1.5 (335M params, 1024 dimensions)
- Data: 200K balanced pairs (22,222 per language × 9 languages) from cqs-indexed Stack repos
- Key Technique: Call-graph false-negative filtering, which excludes structurally related functions from contrastive negatives
- Loss: CachedGISTEmbedLoss (guide: intfloat/e5-base-v2, margin 0.05) + MatryoshkaLoss (1024/512/256/128 dims)
- LoRA: rank 16, alpha 32, dropout 0.1 (targets: query, key, value, dense)
- Epochs: 1 (5938 steps, batch size 32)
- Hardware: NVIDIA RTX A6000 (48GB), ~12.75 hours
- Final loss: train 0.161, eval 0.068
- Dataset: jamie8johnson/cqs-code-search-200k
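The call-graph filtering idea can be sketched in plain Python. This is a minimal illustration under assumed data shapes, not the actual cqs pipeline: given a pool of candidate negatives for an anchor function, drop any candidate the anchor calls, is called by, or that shares a caller with it.

```python
# Hypothetical sketch of call-graph false-negative filtering: functions
# structurally related to the anchor (direct callers/callees, or siblings
# sharing a caller) are excluded from the contrastive-negative pool.
# Names and data shapes are illustrative, not the cqs implementation.

def related_functions(anchor, calls):
    """Return the set of functions structurally related to `anchor`.

    `calls` maps each function name to the set of functions it calls.
    """
    callees = calls.get(anchor, set())
    callers = {f for f, out in calls.items() if anchor in out}
    siblings = {g for f in callers for g in calls.get(f, set())}
    return (callees | callers | siblings) - {anchor}

def filter_negatives(anchor, candidates, calls):
    """Keep only candidates that are safe contrastive negatives."""
    excluded = related_functions(anchor, calls)
    return [c for c in candidates if c not in excluded]

# Toy call graph: main -> {parse, validate}, validate -> {is_email}
calls = {
    "main": {"parse", "validate"},
    "validate": {"is_email"},
}
negs = filter_negatives(
    "validate",
    ["parse", "is_email", "render", "main"],
    calls,
)
# "is_email" (callee), "main" (caller), and "parse" (sibling via main)
# are excluded; only "render" survives as a valid negative.
```

In a contrastive setup like CachedGISTEmbedLoss, leaving such related functions in the batch would punish the model for (correctly) embedding them near the anchor; filtering them out is what the training data relies on.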
Enrichment Ablation
Fine-tuning slightly increases enrichment dependency vs baseline:
| Layer Skipped | R@1 | Delta | Baseline Delta |
|---|---|---|---|
| None (full) | 91.6% | – | – |
| doc | 84.1% | -7.5pp | -6.8pp |
| filecontext | 86.8% | -4.8pp | -4.1pp |
| signatures | 89.9% | -1.7pp | -1.4pp |
| callgraph | 90.9% | -0.7pp | -0.4pp |
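The ablation skips one enrichment layer at a time before embedding. A hypothetical sketch of how layered enrichment might compose the text that gets embedded (layer names follow the table; the concrete cqs format is an assumption made for illustration):

```python
# Hypothetical composition of an enriched snippet for embedding.
# Layer names mirror the ablation rows; the exact text format used
# by cqs is an assumption, not its real implementation.

def enrich(code, doc=None, filecontext=None, signatures=None,
           callgraph=None, skip=()):
    """Concatenate enrichment layers above the raw code.

    `skip` names layers to drop, matching the ablation rows.
    """
    layers = [
        ("doc", doc),
        ("filecontext", filecontext),
        ("signatures", signatures),
        ("callgraph", callgraph),
    ]
    parts = [text for name, text in layers
             if text and name not in skip]
    parts.append(code)
    return "\n".join(parts)

full = enrich(
    "def validate_email(addr): ...",
    doc="Validates an email address.",
    filecontext="module: auth/validators.py",
    signatures="validate_email(addr: str) -> bool",
    callgraph="called by: register_user",
)
# Skipping the doc layer reproduces the "doc" ablation row.
no_doc = enrich(
    "def validate_email(addr): ...",
    doc="Validates an email address.",
    skip=("doc",),
)
```

The table then measures R@1 on embeddings of `full` versus each skipped variant; doc strings carry the most signal for both the fine-tuned model and the baseline.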
Supported Languages
Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript, C++
Usage
With cqs
```sh
# Download and use as a custom model
export CQS_ONNX_DIR=/path/to/this/model/onnx
export CQS_EMBEDDING_DIM=1024
cqs index --force
```
With sentence-transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/bge-large-v1.5-code-search")

# Queries need the BGE instruction prefix; passages do not.
query_emb = model.encode(
    "Represent this sentence for searching relevant passages: "
    "find functions that validate email addresses"
)
code_emb = model.encode("def validate_email(addr): ...")
```
Note: BGE-large uses an instruction prefix for queries: "Represent this sentence for searching relevant passages: ". Passages have no prefix.
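Because training included MatryoshkaLoss at 1024/512/256/128 dimensions, embeddings can be truncated to a prefix and re-normalized for cheaper storage and search. A minimal NumPy sketch on a stand-in vector (the real model returns 1024-dim embeddings):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the first `dim` components and re-normalize to unit length.

    Matryoshka dims this model was trained for: 1024, 512, 256, 128.
    """
    head = emb[:dim]
    return head / np.linalg.norm(head)

# Stand-in for a model.encode(...) output (real output is 1024-dim).
rng = np.random.default_rng(0)
emb = rng.standard_normal(1024).astype(np.float32)
emb /= np.linalg.norm(emb)

small = truncate_embedding(emb, 256)
# Cosine similarity against other truncated embeddings works as usual:
# score = float(small @ other_small)
```

Truncate queries and passages to the same dimension; mixing dimensions breaks the dot-product comparison.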
Files
- `merged_model/` – full merged weights (sentence-transformers compatible)
- `lora_adapter/` – LoRA adapter only (for PEFT)
- `onnx/model.onnx` – ONNX export for cqs/ORT inference (1.27 GB, opset 11)
- `onnx/tokenizer.json` – tokenizer for ONNX inference
License
Apache 2.0 (same as base model)