RightNow-Arabic-0.5B-Turbo

The smallest open Arabic-specialized decoder LLM

518M parameters | 398 MB on disk (q4_k_m) | 635 tok/s on H100

Built by RightNow AI


What is this?

RightNow-Arabic-0.5B-Turbo is a 518M-parameter Arabic-specialized language model built on top of Qwen2.5-0.5B via vocabulary injection, continued pretraining, supervised fine-tuning, and direct preference optimization. It is the smallest open Arabic-specialized decoder LLM with publicly available weights.

The model targets edge deployment: phones, laptops, embedded devices, and browsers. Quantized to q4_k_m, it fits in 398 MB and generates 635 tokens/s at batch size 1 on an H100.

Key Features

  • 27,032 new Arabic tokens added via mean-subtoken initialization, cutting Arabic tokenizer fertility by 17.3% (2.18 to 1.80 tokens/word)
  • 504M Arabic pretraining tokens (Arabic Wikipedia) on 8xH100 SXM5 with FSDP + FlashAttention varlen + Liger fused kernels
  • 129,116 Arabic instruction pairs for SFT with response-only loss masking
  • 6,750 Arabic preference pairs for DPO
  • Weight soup merging (DPO 50%, SFT 25%, Pretrain 25%), adding +0.44 points mean accuracy over DPO alone
  • 4 GGUF quantizations (f16, q8_0, q5_k_m, q4_k_m) for edge deployment
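The mean-subtoken initialization mentioned above gives each injected Arabic token the average embedding of the base-tokenizer subtokens it previously split into. A minimal NumPy sketch (function name and toy dimensions are illustrative, not the actual training code):

```python
import numpy as np

def init_new_embeddings(old_embeddings, new_token_to_subtoken_ids):
    """Initialize each new token's embedding as the mean of the embeddings
    of the subtokens the base tokenizer previously split it into."""
    dim = old_embeddings.shape[1]
    new_rows = np.empty((len(new_token_to_subtoken_ids), dim),
                        dtype=old_embeddings.dtype)
    for i, sub_ids in enumerate(new_token_to_subtoken_ids):
        new_rows[i] = old_embeddings[sub_ids].mean(axis=0)
    # Extend the embedding matrix with the newly initialized rows.
    return np.concatenate([old_embeddings, new_rows], axis=0)

# Toy example: base vocab of 4 tokens, 2-dim embeddings.
base = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]])
# One new token that the old tokenizer split into ids [0, 3].
merged = init_new_embeddings(base, [[0, 3]])
print(merged[-1])  # mean of rows 0 and 3 -> [2.5, 0.0]
```

The idea is that a new token starts from a point in embedding space already close to its meaning, so continued pretraining converges faster than from random initialization.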

Benchmarks

Evaluated with lm-evaluation-harness v0.4.11 (apply_chat_template=True, limit=200; acc_norm reported where available).

Head-to-head comparison

| Model | Params | COPA-ar | HellaSwag-ar | ArabicMMLU | Mean |
|---|---|---|---|---|---|
| RightNow-Arabic-0.5B-Turbo (ours) | 518M | 58.4% | 26.0% | 23.2% | 35.9% |
| Qwen2.5-0.5B-Instruct | 494M | 53.9% | 22.5% | 26.0% | 34.1% |
| Falcon-H1-0.5B-Instruct | 524M | 44.9% | 23.0% | 24.2% | 30.7% |
| Falcon-H1-1.5B-Instruct | 1.5B | 58.4% | 27.5% | 32.7% | 39.5% |
| AceGPT-7B-chat | 7B | 69.7% | 27.0% | 35.0% | 43.9% |
| ALLaM-7B-Instruct | 7B | 68.5% | 29.0% | 52.2% | 49.9% |
| SILMA-9B-Instruct | 9B | 69.7% | 38.0% | 52.9% | 53.5% |

Among 0.5B-class models: best on COPA-ar (+4.5 points vs Qwen), best on HellaSwag-ar (+3.5 vs Qwen), and best mean (+1.8 vs Qwen, +5.2 vs Falcon), though Qwen retains the edge on ArabicMMLU.

Ties Falcon-H1-1.5B on COPA-ar (both 58.4%) at one-third the parameters.

Recovers 67% of SILMA-9B mean accuracy at 5.8% of the parameters.
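The headline ratios above follow directly from the table; a quick arithmetic check:

```python
# Per-task scores from the head-to-head table.
ours = [58.4, 26.0, 23.2]    # RightNow-Arabic-0.5B-Turbo
silma = [69.7, 38.0, 52.9]   # SILMA-9B-Instruct

mean_ours = sum(ours) / len(ours)     # 35.87 -> reported as 35.9
mean_silma = sum(silma) / len(silma)  # 53.53 -> reported as 53.5

print(round(mean_ours / mean_silma * 100))  # 67 (% of SILMA-9B mean accuracy)
print(round(0.518 / 9.0 * 100, 1))          # 5.8 (% of the parameters)
```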

Available Formats

| Format | Size | Speed (tok/s, bs=1, H100) | Use case |
|---|---|---|---|
| bf16 | 1.04 GB | 82 (HF generate) | Fine-tuning, research |
| int8 | 664 MB | -- | Reduced-memory inference |
| GGUF f16 | 988 MB | 582 | Maximum quality |
| GGUF q8_0 | 525 MB | 646 | Best speed |
| GGUF q5_k_m | 419 MB | 634 | Balanced |
| GGUF q4_k_m | 398 MB | 635 | Smallest footprint |

Quick Start

With Transformers (bf16)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RightNowAI/RightNow-Arabic-0.5B-Turbo"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="bf16")
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder="bf16",
    torch_dtype="bfloat16", device_map="auto"
)

messages = [
    # "You are an intelligent assistant who answers in Modern Standard Arabic"
    {"role": "system", "content": "ุฃู†ุช ู…ุณุงุนุฏ ุฐูƒูŠ ูŠุฌูŠุจ ุจุงู„ู„ุบุฉ ุงู„ุนุฑุจูŠุฉ ุงู„ูุตุญู‰"},
    # "What is the capital of the Kingdom of Saudi Arabia?"
    {"role": "user", "content": "ู…ุง ู‡ูŠ ุนุงุตู…ุฉ ุงู„ู…ู…ู„ูƒุฉ ุงู„ุนุฑุจูŠุฉ ุงู„ุณุนูˆุฏูŠุฉุŸ"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

With llama.cpp (GGUF)

# Download the q8_0 quantization (best speed)
huggingface-cli download RightNowAI/RightNow-Arabic-0.5B-Turbo \
  gguf/RightNow-Arabic-0.5B-Turbo-q8_0.gguf --local-dir .

# Run inference (prompt: "What is the largest city in Egypt?")
./llama-cli -m RightNow-Arabic-0.5B-Turbo-q8_0.gguf \
  -p "ู…ุง ู‡ูŠ ุฃูƒุจุฑ ู…ุฏูŠู†ุฉ ููŠ ู…ุตุฑุŸ" \
  -n 128 --temp 0.7

Training Pipeline

Qwen2.5-0.5B (494M, 151,665 vocab)
        |
        v
Tokenizer Surgery (+27,032 Arabic tokens -> 178,697 vocab)
  - SentencePiece unigram 32k on 12.5 GB Arabic corpus
  - Mean-subtoken embedding initialization
  - Fertility: 2.18 -> 1.80 tok/word (-17.3%)
        |
        v
Continued Pretraining (504M arwiki tokens)
  - 2,500 steps, 8xH100 SXM5
  - FSDP _HYBRID_SHARD_ZERO2 + FlashAttention varlen + Liger
  - Loss: 14.21 -> 1.69 | Wall time: 6h 57m
        |
        v
Supervised Fine-Tuning (129,116 instructions)
  - 5 datasets: evol-instruct-arabic, alpaca-gpt4-arabic,
    sharegpt-arabic, CIDAR, aya_dataset
  - Response-only loss masking (72.1% of tokens carry loss)
  - 5 epochs, 418 steps | Wall time: 12m
        |
        v
Direct Preference Optimization (6,750 pairs)
  - argilla-dpo-mix-7k-arabic
  - 2 epochs, 844 steps | Wall time: 34m
        |
        v
Weight Soup Merging
  - Linear(DPO 0.5, SFT 0.25, Pretrain 0.25)
  - +0.44 points mean accuracy over DPO alone
        |
        v
Export: bf16, int8, GGUF {f16, q8_0, q5_k_m, q4_k_m}
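The weight soup step above is a plain linear combination of the three checkpoints' parameters. A minimal sketch over toy parameter dicts (the real merge iterates over full model state dicts; the names here are illustrative):

```python
def weight_soup(checkpoints, weights):
    """Linearly combine parameters from several checkpoints:
    theta = sum_i w_i * theta_i, with the weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return {
        key: [sum(w * ckpt[key][j] for w, ckpt in zip(weights, checkpoints))
              for j in range(len(checkpoints[0][key]))]
        for key in checkpoints[0]
    }

# Toy checkpoints with a single two-element parameter tensor each.
dpo      = {"layer.w": [1.0, 2.0]}
sft      = {"layer.w": [3.0, 4.0]}
pretrain = {"layer.w": [5.0, 6.0]}

# Linear(DPO 0.5, SFT 0.25, Pretrain 0.25), as in the pipeline above.
soup = weight_soup([dpo, sft, pretrain], [0.5, 0.25, 0.25])
print(soup["layer.w"])  # [2.5, 3.5]
```

Because all three checkpoints descend from the same pretrained weights, simple parameter averaging tends to interpolate between their behaviors rather than destroy them, which is what makes the +0.44-point gain over DPO alone possible.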

Training Data

| Dataset | Examples/Tokens | Use |
|---|---|---|
| Arabic Wikipedia (wikimedia/wikipedia 20231101.ar) | 504M tokens | Continued pretraining |
| FreedomIntelligence/evol-instruct-arabic | 59,022 | SFT |
| FreedomIntelligence/alpaca-gpt4-arabic | 49,969 | SFT |
| FreedomIntelligence/sharegpt-arabic | 5,231 | SFT |
| arbml/CIDAR | 10,000 | SFT |
| CohereForAI/aya_dataset (Arabic) | 4,947 | SFT |
| 2A2I/argilla-dpo-mix-7k-arabic | 6,750 | DPO |
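The response-only loss masking used in the SFT stage copies the input ids to the labels and masks prompt positions with -100, the standard ignore index in Hugging Face/PyTorch cross-entropy, so only assistant-response tokens carry loss. A toy sketch with illustrative token ids:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, masking the prompt so that only
    the assistant response is trained on."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Toy sequence: 4 prompt tokens followed by 3 response tokens.
input_ids = [101, 102, 103, 104, 201, 202, 203]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 201, 202, 203]
```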

Limitations

  • Knowledge ceiling: At 518M parameters, ArabicMMLU-style knowledge tasks lag 7B+ models by 12-30 points. This is a parameter-count limit, not a training limit.
  • MSA only: Trained on Wikipedia (Modern Standard Arabic). Dialects (Egyptian, Gulf, Levantine) get MSA responses.
  • 504M pretraining tokens: Below Chinchilla-optimal ratio. More Arabic data would improve knowledge tasks.
  • DPO was weak: 6,750 machine-translated preference pairs provided minimal signal at 0.5B scale. The weight soup merge was more impactful.
  • GGUF tile alignment: q4_k_m and q5_k_m fall back to higher-bit quantization for 144/290 tensors due to the expanded vocabulary not aligning with k-quant tile sizes.

Hardware

All training ran on a single Nebius gpu-h100-sxm node:

  • 8x NVIDIA H100 80 GB SXM5 HBM3, NVLink4
  • 128 vCPUs, 1.5 TiB RAM
  • CUDA 13.0, PyTorch 2.11, flash-attn 2.8.3, transformers 5.5.0

Citation

@article{jaber2025rightnow,
  title={RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment},
  author={Jaber, Jaber and Jaber, Osama},
  year={2025},
  url={https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo}
}

License

Apache 2.0 (same as the base Qwen2.5-0.5B model).

