# LANGKAH ColBERT v2 (32M-Indo-Academic)
This model is a fine-tuned version of mxbai-edge-colbert-v0-32m specifically optimized for the Indonesian Academic and Career Domain. It is the core retrieval component of the LANGKAH (Late-interaction Agentic Network for Graduate Knowledge & Academic History-consensus) framework.
## Model Highlights
- Architecture: ColBERT (Late Interaction) for high-precision semantic search.
- Parameters: 32 million (edge-optimized; fast even on CPU).
- Dimensions: 64-dimensional embeddings per token.
- Domain: Indonesian University (IPB University focus), specifically for Student Activities, Courses, and Career Path Alignment.
- Training Strategy: Fine-tuned using GPL (Generative Pseudo-Labeling) with queries generated by Gemini 3 Flash on Indonesian academic catalogs.
## Training Data
The model was fine-tuned on a synthetic triplet dataset (~15,000 samples) generated from:
- Student Activities: 4,500+ unique activities from IPB archives.
- Curriculum: 5,000+ courses across various undergraduate and vocational programs.
- Career Guidance: Technical texts from official university career handbooks.
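At a high level, GPL turns each document into a training triplet: a synthetic query is generated for the document (the positive), and other documents serve as negatives. The sketch below illustrates that triplet construction with a stand-in `generate_query` function; it is not the LANGKAH pipeline itself, which uses Gemini-generated queries, retriever-mined hard negatives, and cross-encoder pseudo-labels.

```python
import random

def build_gpl_triplets(documents, generate_query, num_negatives=1, seed=0):
    """Sketch of GPL-style triplet construction: each document becomes
    the positive for a synthetically generated query; negatives are
    sampled from the remaining documents. (Real GPL mines hard negatives
    with a retriever and pseudo-labels margins with a cross-encoder.)"""
    rng = random.Random(seed)
    triplets = []
    for i, doc in enumerate(documents):
        query = generate_query(doc)  # in LANGKAH: an LLM-generated query
        others = documents[:i] + documents[i + 1:]
        for neg in rng.sample(others, k=min(num_negatives, len(others))):
            triplets.append((query, doc, neg))
    return triplets

docs = [
    "Lomba IT Nasional",
    "Magang Data Science",
    "Himpunan Mahasiswa Teknik",
]
# Hypothetical query generator standing in for the LLM step
triplets = build_gpl_triplets(docs, generate_query=lambda d: f"cari {d.lower()}")
```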
## Usage (via PyLate)

First, install the library:

```bash
pip install -U pylate
```
### Retrieval Example

```python
from pylate import models, indexes, retrieve

# Load the fine-tuned ColBERT model
model = models.ColBERT(
    model_name_or_path="dzakwanalifi/langkah-colbert-v2-32m-indo-academic"
)

# Encode documents
documents = [
    "Lomba IT Nasional: Kompetisi pemrograman tingkat mahasiswa",
    "Magang Data Science di Startup Jakarta",
    "Himpunan Mahasiswa Teknik: Organisasi pengembangan softskill",
]
doc_embeddings = model.encode(documents, is_query=False)

# Encode the search query
query = "cari kegiatan lomba coding mahasiswa"
query_embedding = model.encode([query], is_query=True)

# Use PyLate retrieve or MaxSim for scoring
# (see the technical documentation in the LANGKAH repo for the full pipeline)
```
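The scoring step the comments allude to can be sketched without PyLate: ColBERT's late-interaction (MaxSim) score takes, for each query token, the maximum cosine similarity over all document tokens, and sums these maxima. Below is a minimal NumPy sketch, with random arrays standing in for the `(num_tokens, 64)` embeddings the model would produce; it illustrates the scoring rule, not the library's optimized implementation.

```python
import numpy as np

def maxsim(query_emb, doc_emb):
    """ColBERT late-interaction score: for each query token, take the
    maximum cosine similarity over document tokens, then sum."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    return (q @ d.T).max(axis=1).sum()

rng = np.random.default_rng(0)
# Stand-ins for model.encode output: one (num_tokens, 64) array per text
query_emb = rng.normal(size=(8, 64))
doc_embs = [rng.normal(size=(n, 64)) for n in (20, 35, 15)]

# Rank documents by MaxSim score, highest first
ranking = sorted(
    range(len(doc_embs)),
    key=lambda i: maxsim(query_emb, doc_embs[i]),
    reverse=True,
)
```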
## Embedding Compression

In the LANGKAH framework, this model's embeddings are further compressed with Residual Quantization plus Ragged Storage, achieving a 41.7x compression ratio (reducing 2.37 GB of float32 embeddings to 55 MB) while maintaining ~98% retrieval accuracy.
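The idea behind residual quantization: store each token vector as the ID of its nearest centroid plus a coarsely quantized residual, which is far smaller than the original float32 values. The sketch below is illustrative only, not the LANGKAH implementation; it uses randomly sampled centroids and unpacked 4-bit codes, whereas production systems (e.g. ColBERTv2-style compression) use k-means centroids, bit-packed 1-2 bit residuals, and ragged storage to reach ratios like the 41.7x cited above.

```python
import numpy as np

def quantize(embs, centroids, bits=4):
    """Residual quantization sketch: store each vector as a centroid id
    plus its residual, uniformly quantized to `bits` bits per dimension."""
    # Assign each embedding to its nearest centroid
    dists = ((embs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    residuals = embs - centroids[ids]
    # Uniform scalar quantization of residuals into [0, 2**bits - 1]
    scale = np.abs(residuals).max() + 1e-8
    levels = 2 ** bits - 1
    codes = np.round((residuals / scale + 1) / 2 * levels).astype(np.uint8)
    return ids, codes, scale

def dequantize(ids, codes, scale, centroids, bits=4):
    """Reconstruct approximate embeddings from ids + residual codes."""
    levels = 2 ** bits - 1
    residuals = (codes / levels * 2 - 1) * scale
    return centroids[ids] + residuals

rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 64)).astype(np.float32)
centroids = embs[rng.choice(1000, 16, replace=False)]
ids, codes, scale = quantize(embs, centroids)
recon = dequantize(ids, codes, scale, centroids)
```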
## Citation & Credit
If you use this model in your research, please cite the LANGKAH project.
Framework: LANGKAH v2 (Late-interaction Agentic Network for Graduate Knowledge & Academic History-consensus)
Model Authors:
- Developed as part of Indonesian Academic Research (dzakwanalifi)
- Base model by mixedbread.ai