BGE-large Code Search (LoRA fine-tuned)

A fine-tuned code-search embedding model based on BAAI/bge-large-en-v1.5 (335M parameters, 1024 dimensions), trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for cqs, a code intelligence and RAG system for AI agents.

Key Results

| Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
|---|---|---|---|---|
| Fixture (296q, 7 languages, enriched) | R@1 | 91.6% | 90.9% | 90.5% |
| Fixture (296q, 7 languages, enriched) | MRR | 0.952 | 0.949 | 0.948 |
| Raw code embedding (55q, no enrichment) | R@1 | 66.2% | 61.8% | 70.9% |
| Real codebase (100q lookup) | R@1 | 50.0% | 50.0% | 26.0% |
| Real codebase (100q lookup) | R@5 | 73.0% | 72.0% | 51.0% |
| CoIR 9-task (19 subtasks) | Overall | 57.5 | 55.7 | 52.7 |
| CoIR CodeSearchNet (6 languages) | NDCG@10 | 0.779 | 0.721 | 0.615 |

A new best on every metric except raw R@1, where v9-200k's 70.9% still leads. Fine-tuning gains are additive with model capacity: BGE-large did not fall into a basin the way the E5-base variants did.

Training Details

  • Base Model: BAAI/bge-large-en-v1.5 (335M params, 1024 dimensions)
  • Data: 200K balanced pairs (22,222 per language × 9 languages) from cqs-indexed Stack repos
  • Key Technique: Call-graph false-negative filtering, which excludes structurally related functions from the pool of contrastive negatives
  • Loss: CachedGISTEmbedLoss (guide: intfloat/e5-base-v2, margin 0.05) + MatryoshkaLoss (1024/512/256/128 dims)
  • LoRA: rank 16, alpha 32, dropout 0.1 (targets: query, key, value, dense)
  • Epochs: 1 (5938 steps, batch size 32)
  • Hardware: NVIDIA RTX A6000 (48GB), ~12.75 hours
  • Final loss: train 0.161, eval 0.068
  • Dataset: jamie8johnson/cqs-code-search-200k
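The call-graph false-negative filtering above can be sketched as follows. This is a minimal illustration, not the cqs implementation; the `call_graph` mapping and the function names are hypothetical stand-ins.

```python
def filter_negatives(anchor, candidates, call_graph):
    """Drop candidate negatives that are structurally related to the
    anchor (its callees or callers), so contrastive training does not
    push apart functions that legitimately belong together."""
    callees = call_graph.get(anchor, set())
    # Treat the relation as symmetric: also exclude functions that call the anchor.
    callers = {fn for fn, targets in call_graph.items() if anchor in targets}
    excluded = callees | callers | {anchor}
    return [c for c in candidates if c not in excluded]


# Hypothetical toy call graph: validate_email calls is_ascii.
call_graph = {
    "validate_email": {"is_ascii"},
    "parse_url": set(),
}
negatives = filter_negatives(
    "validate_email",
    ["is_ascii", "parse_url", "render_html"],
    call_graph,
)
# is_ascii is removed from the negative pool; unrelated functions remain.
```

In a real pipeline this filter would run once per anchor when mining in-batch negatives, before the contrastive loss sees them.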

Enrichment Ablation

Fine-tuning slightly increases enrichment dependency vs baseline:

| Layer Skipped | R@1 | Delta | Baseline Delta |
|---|---|---|---|
| None (full) | 91.6% | — | — |
| doc | 84.1% | -7.5pp | -6.8pp |
| filecontext | 86.8% | -4.8pp | -4.1pp |
| signatures | 89.9% | -1.7pp | -1.4pp |
| callgraph | 90.9% | -0.7pp | -0.4pp |

Supported Languages

Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript, C++

Usage

With cqs

```shell
# Download and use as a custom model
export CQS_ONNX_DIR=/path/to/this/model/onnx
export CQS_EMBEDDING_DIM=1024
cqs index --force
```

With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/bge-large-v1.5-code-search")
query_emb = model.encode(
    "Represent this sentence for searching relevant passages: "
    "find functions that validate email addresses"
)
code_emb = model.encode("def validate_email(addr): ...")
```

Note: BGE-large uses an instruction prefix for queries: "Represent this sentence for searching relevant passages: ". Passages have no prefix.
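Because training included MatryoshkaLoss at 1024/512/256/128 dimensions, embeddings can in principle be truncated to a smaller prefix and re-normalized for cheaper storage and scoring. A minimal NumPy sketch, with random vectors standing in for real model outputs:

```python
import numpy as np


def truncate_and_normalize(emb, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize to
    unit length, so cosine similarity reduces to a dot product."""
    sub = emb[..., :dim]
    return sub / np.linalg.norm(sub, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
query_emb = rng.normal(size=1024)  # stand-in for model.encode(query)
code_emb = rng.normal(size=1024)   # stand-in for model.encode(code)

for dim in (1024, 512, 256, 128):
    q = truncate_and_normalize(query_emb, dim)
    c = truncate_and_normalize(code_emb, dim)
    score = float(q @ c)  # cosine similarity at this dimensionality
```

Retrieval quality typically degrades gracefully as dimensions are dropped; the right trade-off depends on your index size and latency budget.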

Files

  • merged_model/ — full merged weights (sentence-transformers compatible)
  • lora_adapter/ — LoRA adapter only (for PEFT)
  • onnx/model.onnx — ONNX format for cqs/ORT inference (1.27GB, opset 11)
  • onnx/tokenizer.json — tokenizer for ONNX inference

License

Apache 2.0 (same as base model)
