---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---

# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final, refined instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.
The objective of this phase is to produce a stable and well-aligned small MoE instruction model, suitable for research and experimental evaluation under limited data and compute constraints.
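
The sketch below shows one way to load and prompt the model with the `transformers` library. It is illustrative only: the repository id, the availability of a chat template, and the generation settings are assumptions, and the custom `SlimMoEForCausalLM` architecture may require `trust_remote_code=True` to load.

```python
# Minimal usage sketch (assumed repo id and chat-template availability).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SlimFactoryHub/SlimMoE-250M-instruct"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
# If the tokenizer ships a chat template, format the conversation with it;
# otherwise fall back to plain text.
if tokenizer.chat_template is not None:
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
else:
    prompt = messages[0]["content"]

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```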

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability

## Model Summary

| Property | Value |
|----------|-------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |
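
The table lists **Adaptive MoE Routing** over 4 experts, a hidden size of 768, and an FFN size of 1536. The routing implementation itself is not published in this card, so the snippet below is only a generic top-1 softmax-router sketch at those dimensions, not the model's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative top-1 router over 4 experts at the sizes listed above.
# This is NOT the model's actual "Adaptive MoE Routing" implementation.
class TinyMoELayer(nn.Module):
    def __init__(self, hidden_size=768, ffn_size=1536, num_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.GELU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, seq, hidden)
        logits = self.router(x)             # (batch, seq, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.max(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e             # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * top_p[mask].unsqueeze(-1)
        return out

# Quick shape check
layer = TinyMoELayer()
print(layer(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```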

## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
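
For reference, the corpus and split listed above can be streamed with the `datasets` library roughly as follows; tokenization, sequence packing, and the training loop are omitted, and the `text` field name is the standard FineWeb-Edu schema.

```python
from datasets import load_dataset

# Stream the fineweb-edu "sample-10BT" subset used for pretraining.
# Streaming avoids downloading the full sample up front.
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:200])  # each row carries the document in a "text" field
    if i >= 2:
        break
```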

### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduces **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
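
Each `train_sft` row stores a multi-turn conversation as a list of role/content messages. A minimal formatting sketch, assuming the tokenizer ships a chat template (the actual SFT pipeline is not published here):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the SFT split used in Phase-1; each row has a "messages" list of
# {"role": ..., "content": ...} dicts.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
tokenizer = AutoTokenizer.from_pretrained("SlimFactoryHub/SlimMoE-250M-instruct")  # assumed repo id

def to_training_text(example):
    # Render the conversation into a single training string via the chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(to_training_text, remove_columns=ds.column_names)
print(ds[0]["text"][:300])
```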

### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

Used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
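
The `auxiliary_train` rows are multiple-choice items (a question, four choices, and an answer index). One possible way to turn them into instruction/response pairs, sketched under the assumption of the standard `cais/mmlu` schema:

```python
from datasets import load_dataset

# MMLU auxiliary_train rows: {"question": str, "choices": [str x4], "answer": int}
ds = load_dataset("cais/mmlu", "all", split="auxiliary_train")

def to_instruction_pair(example):
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, example["choices"]))
    prompt = f"{example['question']}\n{options}\nAnswer with the correct letter."
    return {"prompt": prompt, "response": letters[example["answer"]]}

pairs = ds.map(to_instruction_pair, remove_columns=ds.column_names)
print(pairs[0]["prompt"])
print(pairs[0]["response"])
```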

### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

Focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf

## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**
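
To make the NoPE / RoPE comparison above concrete: the switch amounts to applying or skipping a rotary transform on queries and keys before attention. The sketch below is a generic illustration of that toggle, not the model's actual VGQA attention code, which is not published here.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_positional_encoding(q, k, cos, sin, mode="rope"):
    # mode="rope": rotate queries/keys by position-dependent angles.
    # mode="nope": no explicit positional signal; attention relies on content
    # (and causal masking) alone.
    if mode == "nope":
        return q, k
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# Toy shapes: (batch, heads, seq, head_dim)
q = torch.randn(1, 12, 16, 64)
k = torch.randn(1, 12, 16, 64)
pos = torch.arange(16).float()
inv_freq = 1.0 / (10000 ** (torch.arange(0, 64, 2).float() / 64))
angles = torch.einsum("s,d->sd", pos, inv_freq)           # (seq, head_dim/2)
emb = torch.cat((angles, angles), dim=-1)                 # (seq, head_dim)
cos, sin = emb.cos()[None, None], emb.sin()[None, None]   # broadcast over batch/heads

q_rope, k_rope = apply_positional_encoding(q, k, cos, sin, mode="rope")
q_nope, k_nope = apply_positional_encoding(q, k, cos, sin, mode="nope")
```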

## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations** during training
- **No RLHF applied**
- **English-centric data distribution**

These factors directly influenced training duration and final model behavior.

## Intended Use

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research

## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from **The Smol Training Playbook: The Secrets to Building World-Class LLMs**, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow. Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.

## Contact
Please use the Hugging Face **Discussions** tab to connect.