# LANGKAH ColBERT v2 (32M-Indo-Academic)
This model is a fine-tuned version of mxbai-edge-colbert-v0-32m specifically optimized for the Indonesian Academic and Career Domain. It is the core retrieval component of the LANGKAH (Late-interaction Agentic Network for Graduate Knowledge & Academic History-consensus) framework.
## Model Highlights
- Architecture: ColBERT (Late Interaction) for high-precision semantic search.
- Parameters: 32 million (edge-optimized; fast even on CPU).
- Dimensions: 64-dimensional embeddings per token.
- Domain: Indonesian University (IPB University focus), specifically for Student Activities, Courses, and Career Path Alignment.
- Training Strategy: Fine-tuned using GPL (Generative Pseudo-Labeling) with queries generated by Gemini 3 Flash on Indonesian academic catalogs.
## Training Data
The model was fine-tuned on a synthetic triplet dataset (~15,000 samples) generated from:
- Student Activities: 4,500+ unique activities from IPB archives.
- Curriculum: 5,000+ courses across various undergraduate and vocational programs.
- Career Guidance: Technical texts from official university career handbooks.
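At a high level, GPL turns each document into a training triplet: a synthetic query is generated for the document (the positive), and other documents serve as negatives. The sketch below illustrates that triplet construction with a stand-in `generate_query` function; it is not the LANGKAH pipeline itself, which uses Gemini-generated queries, retriever-mined hard negatives, and cross-encoder pseudo-labels.

```python
import random

def build_gpl_triplets(documents, generate_query, num_negatives=1, seed=0):
    """Sketch of GPL-style triplet construction: each document becomes
    the positive for a synthetically generated query; negatives are
    sampled from the remaining documents. (Real GPL mines hard negatives
    with a retriever and pseudo-labels margins with a cross-encoder.)"""
    rng = random.Random(seed)
    triplets = []
    for i, doc in enumerate(documents):
        query = generate_query(doc)  # in LANGKAH: an LLM-generated query
        others = documents[:i] + documents[i + 1:]
        for neg in rng.sample(others, k=min(num_negatives, len(others))):
            triplets.append((query, doc, neg))
    return triplets

docs = [
    "Lomba IT Nasional",
    "Magang Data Science",
    "Himpunan Mahasiswa Teknik",
]
# Hypothetical query generator standing in for the LLM step
triplets = build_gpl_triplets(docs, generate_query=lambda d: f"cari {d.lower()}")
```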
## Usage (via PyLate)

First, install the library:

```bash
pip install -U pylate
```
### Retrieval Example

```python
from pylate import models, indexes, retrieve

# Load the fine-tuned ColBERT model
model = models.ColBERT(
    model_name_or_path="dzakwanalifi/langkah-colbert-v2-32m-indo-academic"
)

# Encode documents
documents = [
    "Lomba IT Nasional: Kompetisi pemrograman tingkat mahasiswa",
    "Magang Data Science di Startup Jakarta",
    "Himpunan Mahasiswa Teknik: Organisasi pengembangan softskill",
]
doc_embeddings = model.encode(documents, is_query=False)

# Encode the search query
query = "cari kegiatan lomba coding mahasiswa"
query_embedding = model.encode([query], is_query=True)

# Use PyLate retrieve or MaxSim for scoring
# (see the technical documentation in the LANGKAH repo for the full pipeline)
```
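The scoring step the comments allude to can be sketched without PyLate: ColBERT's late-interaction (MaxSim) score takes, for each query token, the maximum cosine similarity over all document tokens, and sums these maxima. Below is a minimal NumPy sketch, with random arrays standing in for the `(num_tokens, 64)` embeddings the model would produce; it illustrates the scoring rule, not the library's optimized implementation.

```python
import numpy as np

def maxsim(query_emb, doc_emb):
    """ColBERT late-interaction score: for each query token, take the
    maximum cosine similarity over document tokens, then sum."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    return (q @ d.T).max(axis=1).sum()

rng = np.random.default_rng(0)
# Stand-ins for model.encode output: one (num_tokens, 64) array per text
query_emb = rng.normal(size=(8, 64))
doc_embs = [rng.normal(size=(n, 64)) for n in (20, 35, 15)]

# Rank documents by MaxSim score, highest first
ranking = sorted(
    range(len(doc_embs)),
    key=lambda i: maxsim(query_emb, doc_embs[i]),
    reverse=True,
)
```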
## Embedding Compression

In the LANGKAH framework, this model's embeddings are further compressed with Residual Quantization plus Ragged Storage, achieving a 41.7x compression ratio (reducing 2.37 GB of float32 embeddings to 55 MB) while maintaining ~98% retrieval accuracy.
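The idea behind residual quantization: store each token vector as the ID of its nearest centroid plus a coarsely quantized residual, which is far smaller than the original float32 values. The sketch below is illustrative only, not the LANGKAH implementation; it uses randomly sampled centroids and unpacked 4-bit codes, whereas production systems (e.g. ColBERTv2-style compression) use k-means centroids, bit-packed 1-2 bit residuals, and ragged storage to reach ratios like the 41.7x cited above.

```python
import numpy as np

def quantize(embs, centroids, bits=4):
    """Residual quantization sketch: store each vector as a centroid id
    plus its residual, uniformly quantized to `bits` bits per dimension."""
    # Assign each embedding to its nearest centroid
    dists = ((embs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    residuals = embs - centroids[ids]
    # Uniform scalar quantization of residuals into [0, 2**bits - 1]
    scale = np.abs(residuals).max() + 1e-8
    levels = 2 ** bits - 1
    codes = np.round((residuals / scale + 1) / 2 * levels).astype(np.uint8)
    return ids, codes, scale

def dequantize(ids, codes, scale, centroids, bits=4):
    """Reconstruct approximate embeddings from ids + residual codes."""
    levels = 2 ** bits - 1
    residuals = (codes / levels * 2 - 1) * scale
    return centroids[ids] + residuals

rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 64)).astype(np.float32)
centroids = embs[rng.choice(1000, 16, replace=False)]
ids, codes, scale = quantize(embs, centroids)
recon = dequantize(ids, codes, scale, centroids)
```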
## Citation & Credit
If you use this model in your research, please cite the LANGKAH project.
Framework: LANGKAH v2 (Late-interaction Agentic Network for Graduate Knowledge & Academic History-consensus)
Model Authors:
- Developed as part of Indonesian Academic Research (dzakwanalifi)
- Base model by mixedbread.ai