---
language:
- kk
- ru
- en
license: apache-2.0
tags:
- feature-extraction
- sentence-similarity
- multilingual
pipeline_tag: sentence-similarity
base_model: BAAI/bge-m3
model-index:
- name: darmm-embedding-multilingual
  results:
  - task:
      type: retrieval
      name: Retrieval
    metrics:
    - type: recall_at_1
      value: 0.9444
    - type: recall_at_3
      value: 1.0
    - type: recall_at_5
      value: 1.0
    - type: recall_at_10
      value: 1.0
---

# Darmm Multilingual Embedding

A multilingual embedding model (Kazakh, Russian, English) fine-tuned from `BAAI/bge-m3` for retrieval over Darmm FAQ and product content.

## Usage

### Direct model usage (Hugging Face)

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub.
model = SentenceTransformer("Darmm/darmm-embedding-multilingual")

# A Kazakh question and its English counterpart.
sentences = ["Darmm қызметтері қандай?", "What services does Darmm provide?"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (number of sentences, embedding dimension)
```
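
For retrieval, embeddings are typically compared by cosine similarity. The following is a minimal sketch, not code from this repo: `top_k` is a hypothetical helper, and toy 3-dimensional vectors stand in for the output of `model.encode` so that the snippet runs without downloading the model.

```python
import numpy as np


def top_k(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 3):
    """Return (row index, cosine similarity) pairs for the k closest corpus rows."""
    # L2-normalize both sides so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q
    # Sort by descending similarity and keep the first k rows.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors (assumption): real inputs would be rows returned by model.encode.
corpus_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query_emb = np.array([0.9, 0.1, 0.0])

hits = top_k(query_emb, corpus_embs, k=2)
print(hits)
```

With the real model, `corpus_embs` and `query_emb` would come from `model.encode(...)`. Passing `normalize_embeddings=True` to `encode` makes the explicit normalization above redundant.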

## Training data (verified)
- Text from the Darmm landing, academy, and mentor sites, extracted from local sources.

## Training setup
- Base model: `BAAI/bge-m3`.
- Loss: `MultipleNegativesRankingLoss` (the default in `scripts/train_embeddings.py`).
- Typical training parameters in this repo: `epochs=3`, `batch_size=2`, `max_seq_length=128`.
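
`MultipleNegativesRankingLoss` treats every other positive in a batch as a negative for a given anchor. The NumPy sketch below illustrates that idea under stated assumptions: `mnr_loss` is a hypothetical name, cosine similarity with a scale of 20 is assumed to match the sentence-transformers default, and this is not the code in `scripts/train_embeddings.py`.

```python
import numpy as np


def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss for paired embeddings of shape (B, D)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # (B, B) matrix of scaled cosine similarities; entry (i, j) pairs anchor i with positive j.
    logits = scale * (a @ p.T)
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))


# Toy batch of size 2 (assumption): aligned pairs versus deliberately swapped pairs.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned_loss = mnr_loss(anchors, anchors.copy())
swapped_loss = mnr_loss(anchors, anchors[::-1].copy())
print(aligned_loss, swapped_loss)
```

With `batch_size=2`, each anchor is contrasted against only one in-batch negative, so larger batches generally give this loss more negatives to rank against.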

## Evaluation
Evaluation uses paraphrased FAQ questions mapped to the FAQ corpus:
- Corpus: `data/faq_chunks.jsonl` (369 chunks)
- Queries: `data/eval_questions.jsonl` (90 questions)
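
Recall@k can be read as the fraction of queries whose gold chunk appears among the top k retrieved chunks. Below is a minimal sketch, assuming exactly one gold chunk per query; `mean_recall_at_k` and the toy ids are hypothetical, and the field names of the two JSONL files are not shown in this card, so loading them is left out.

```python
def mean_recall_at_k(rankings: dict, gold: dict, k: int) -> float:
    """Fraction of queries whose gold chunk id is within the top-k ranked ids."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)


# Toy data (assumption): ranked chunk ids per query, best match first.
rankings = {
    "q1": ["c3", "c1", "c2"],
    "q2": ["c2", "c3", "c1"],
    "q3": ["c1", "c2", "c3"],
}
# One gold chunk id per query.
gold = {"q1": "c1", "q2": "c2", "q3": "c3"}

for k in (1, 2, 3):
    print(f"Recall@{k} = {mean_recall_at_k(rankings, gold, k):.4f}")
```

In practice, `rankings` would be produced by embedding each query, scoring it against all chunk embeddings, and sorting the chunk ids by descending similarity.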

## Paper & Documentation

<details>
<summary>🇬🇧 English</summary>

# Darmm: Multilingual Embeddings for FAQ Retrieval

## Abstract
We present a multilingual embedding model fine-tuned for Darmm FAQ and product knowledge retrieval in Kazakh, Russian, and English. The model is based on `BAAI/bge-m3` and trained on Darmm website content and a handcrafted FAQ corpus. We evaluate on paraphrased FAQ questions mapped to the FAQ corpus.

## 1. Dataset
- **Sources**: Darmm landing, academy, and mentor site content (local sources) plus handcrafted FAQ data.
- **FAQ corpus**: 150 topics × 3 languages = 450 Q/A documents.
- **Chunked corpus**: 369 chunks in `data/faq_chunks.jsonl`.

## 2. Training
- **Base model**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Params**: `epochs=3`, `batch_size=2`, `max_seq_length=128`

## 3. Results
Evaluation on `data/eval_questions.jsonl` (90 paraphrased queries) against the FAQ corpus:
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0

## 4. Limitations
- Performance depends on query style and corpus quality.
- Short UI strings can reduce relevance; prefer richer FAQ entries or documentation.
- Validate with real user questions and a held-out test set.

</details>

<details>
<summary>🇰🇿 Қазақша</summary>

# Darmm: FAQ іздеуге арналған көптілді эмбеддингтер

## Аңдатпа
Бұл модель Darmm‑ның FAQ және өнім білім базасын қазақ, орыс және ағылшын тілдерінде іздеуге арналған. Негізі `BAAI/bge-m3`, оқыту Darmm сайт контенті мен қолмен жасалған FAQ жиынына жүргізілді. Бағалау парафраз сұрақтар арқылы жасалды.

## 1. Деректер
- **Көздер**: Darmm landing/academy/mentor сайттарының локал контенті және FAQ жиыны.
- **FAQ корпусы**: 150 тақырып × 3 тіл = 450 Q/A құжаты.
- **Чанкталған корпус**: `data/faq_chunks.jsonl` ішінде 369 чанк.

## 2. Оқыту
- **Негізгі модель**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Параметрлер**: `epochs=3`, `batch_size=2`, `max_seq_length=128`

## 3. Нәтижелер
`data/eval_questions.jsonl` (90 парафраз сұрақ) арқылы бағалау:
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0

## 4. Шектеулер
- Нәтиже сұрақ стилі мен корпус сапасына тәуелді.
- Қысқа UI мәтіндері релевантты төмендетуі мүмкін.
- Нақты пайдаланушы сұрақтарымен міндетті түрде тексеріңіз.

</details>

<details>
<summary>🇷🇺 Русский</summary>

# Darmm: Мультиязычные эмбеддинги для FAQ‑поиска

## Аннотация
Модель предназначена для поиска по FAQ и базе знаний Darmm на казахском, русском и английском. Основана на `BAAI/bge-m3` и дообучена на локальном контенте сайтов Darmm и ручном FAQ‑корпусе. Оценка проводится на перефразированных вопросах.

## 1. Данные
- **Источники**: локальный контент сайтов Darmm и FAQ‑корпус.
- **FAQ корпус**: 150 тем × 3 языка = 450 Q/A документов.
- **Чанкованный корпус**: 369 чанков в `data/faq_chunks.jsonl`.

## 2. Обучение
- **Базовая модель**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Параметры**: `epochs=3`, `batch_size=2`, `max_seq_length=128`

## 3. Результаты
Оценка на `data/eval_questions.jsonl` (90 перефразированных запросов):
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0

## 4. Ограничения
- Результаты зависят от стиля запросов и качества корпуса.
- Короткие UI‑строки снижают релевантность.
- Проверяйте на реальных пользовательских вопросах.

</details>

## Intended use
- FAQ search and internal knowledge retrieval across kk/ru/en.
- RAG pipelines for Darmm services.

## Limitations
- Results depend on corpus quality and query style.
- Short UI strings can reduce relevance; prefer fuller FAQ entries or documentation.
- For real-world validation, use actual user queries and a held-out test set.
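
One way to build a held-out set is to split by topic, so that the same topic never appears on both sides in different languages. The sketch below is an assumption-laden illustration: `split_by_topic` and the `topic` and `lang` fields are hypothetical and may not match the actual record layout in this repo.

```python
import random


def split_by_topic(records, test_fraction=0.2, seed=0):
    """Split records into train/test so that no topic is shared between them."""
    topics = sorted({r["topic"] for r in records})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(topics)
    n_test = max(1, round(len(topics) * test_fraction))
    test_topics = set(topics[:n_test])
    train = [r for r in records if r["topic"] not in test_topics]
    test = [r for r in records if r["topic"] in test_topics]
    return train, test


# Toy records (assumption): 5 topics, each present in kk, ru, and en.
records = [
    {"topic": f"t{i}", "lang": lang, "question": f"question {i} ({lang})"}
    for i in range(5)
    for lang in ("kk", "ru", "en")
]

train, test = split_by_topic(records, test_fraction=0.2)
print(len(train), len(test))
```

Grouping by topic keeps all language variants of a question together, which avoids evaluating on a translation of something the model was trained on.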