LANGKAH ColBERT v2 (32M-Indo-Academic)

This model is a fine-tuned version of mxbai-edge-colbert-v0-32m, optimized for the Indonesian academic and career domain. It is the core retrieval component of the LANGKAH (Late-interaction Agentic Network for Graduate Knowledge & Academic History-consensus) framework.

πŸš€ Model Highlights

  • Architecture: ColBERT (Late Interaction) for high-precision semantic search.
  • Parameters: 32 Million (Edge-optimized, very fast on CPU).
  • Dimensions: 64-dimensional embeddings per token.
  • Domain: Indonesian University (IPB University focus), specifically for Student Activities, Courses, and Career Path Alignment.
  • Training Strategy: Fine-tuned using GPL (Generative Pseudo-Labeling) with queries generated by Gemini 3 Flash on Indonesian academic catalogs.
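Late interaction means the query and document are compared token by token at search time: each query token is matched against its best-scoring document token, and those maxima are summed (MaxSim). A minimal NumPy sketch with toy random embeddings (not actual model output) illustrates the scoring:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score: for each query token take its best-matching
    document token, then sum over query tokens."""
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy 64-dim token embeddings standing in for real model output
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(8, 64))
doc_tokens = rng.normal(size=(120, 64))
print(maxsim(query_tokens, doc_tokens))
```

Because scoring happens per token rather than on one pooled vector, the model preserves fine-grained matches (e.g. a single rare course code in a long description).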

πŸ“Š Training Data

The model was fine-tuned on a synthetic triplet dataset (~15,000 samples) generated from:

  1. Student Activities: 4,500+ unique activities from IPB archives.
  2. Curriculum: 5,000+ courses across various undergraduate and vocational programs.
  3. Career Guidance: Technical texts from official university career handbooks.
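For illustration, each GPL training sample pairs a synthetically generated query with a positive passage and a hard negative. The field names and sample text below are hypothetical, not the dataset's actual schema:

```python
# Hypothetical shape of one GPL training triplet; field names and text are
# illustrative assumptions, not the dataset's actual schema.
triplet = {
    "query": "organisasi mahasiswa untuk mengembangkan softskill",   # synthetic query
    "positive": "Himpunan Mahasiswa Teknik: Organisasi pengembangan softskill",
    "negative": "Magang Data Science di Startup Jakarta",            # hard negative
}
print(sorted(triplet))
```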

πŸ›  Usage (via PyLate)

First, install the library:

pip install -U pylate

Retrieval Example

from pylate import indexes, models, retrieve

# Load the fine-tuned model
model = models.ColBERT(
    model_name_or_path="dzakwanalifi/langkah-colbert-v2-32m-indo-academic"
)

# Build an index and a retriever (Voyager is PyLate's HNSW index)
index = indexes.Voyager(
    index_folder="pylate-index", index_name="langkah", override=True
)
retriever = retrieve.ColBERT(index=index)

# Encode and index documents
documents = [
    "Lomba IT Nasional: Kompetisi pemrograman tingkat mahasiswa",
    "Magang Data Science di Startup Jakarta",
    "Himpunan Mahasiswa Teknik: Organisasi pengembangan softskill",
]
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(
    documents_ids=["act-1", "act-2", "act-3"],
    documents_embeddings=documents_embeddings,
)

# Encode the query and retrieve with late-interaction (MaxSim) scoring
queries_embeddings = model.encode(
    ["cari kegiatan lomba coding mahasiswa"], is_query=True
)
scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=3)
# (See the technical documentation in the LANGKAH repo for the full pipeline)

πŸ—œ Compression SOTA

In the LANGKAH framework, this model's embeddings are further compressed using Residual Quantization + Ragged Storage, achieving a 41.7x compression ratio (reduces 2.37 GB of float32 embeddings to just 55 MB) while maintaining ~98% retrieval accuracy.
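As a rough sketch of the underlying idea, the toy demo below implements generic residual quantization: each embedding is stored as a centroid id plus a quantized residual. It is not the LANGKAH implementation, and the 41.7x figure additionally relies on lower-bit codes and ragged storage, which this sketch omits:

```python
import numpy as np

# Generic residual-quantization sketch on toy data (NOT the LANGKAH codebase).
# Each 64-dim float32 embedding becomes one centroid id (uint8) plus an
# int8-quantized residual.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(2_000, 64)).astype(np.float32)

# 1) "Train" centroids; a random subset stands in for k-means here.
centroids = embeddings[rng.choice(len(embeddings), size=64, replace=False)]

# 2) Assign every vector to its nearest centroid.
dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1).astype(np.uint8)              # 1 byte / vector

# 3) Quantize the residuals to int8 with one global scale.
residuals = embeddings - centroids[codes]
scale = float(np.abs(residuals).max()) / 127.0
residuals_q = np.round(residuals / scale).astype(np.int8)  # 64 bytes / vector

# 4) Decompress and check the reconstruction error.
reconstructed = centroids[codes] + residuals_q.astype(np.float32) * scale
max_err = float(np.abs(reconstructed - embeddings).max())

ratio = embeddings.nbytes / (codes.nbytes + residuals_q.nbytes)
print(f"compression: {ratio:.1f}x, max reconstruction error: {max_err:.4f}")
```

With int8 residuals the sketch only reaches ~4x; pushing toward 40x+ in practice requires fewer bits per residual dimension and variable-length (ragged) storage of per-document token matrices.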

πŸ“„ Citation & Credit

If you use this model in your research, please cite the LANGKAH project.

Framework: LANGKAH v2 (Late-interaction Agentic Network for Graduate Knowledge & Academic History-consensus)

Model Authors:

  • Developed by dzakwanalifi as part of Indonesian academic research
  • Base model by mixedbread.ai