Qwen3.5-2B · Arabic Semantic Chunking

A LoRA adapter fine-tuned on top of Qwen/Qwen3.5-2B for Arabic text semantic segmentation.
Given a block of Arabic text, the model splits it into small, self-contained, meaningful sentences and returns them as a structured JSON object.

This model was trained via knowledge distillation from a GPT-OSS-20B teacher model, using Unsloth for efficient 4-bit LoRA fine-tuning.


Intended Use

Use case                               Supported
Arabic sentence segmentation           Yes
Semantic chunking for RAG pipelines    Yes
Pre-processing Arabic documents        Yes
Non-Arabic languages                   No
Translation or paraphrasing            No

Quick Start

import json
import torch
from unsloth import FastLanguageModel

MODEL_ID = "marioVIC/qwen3-5-2b-arabic-semantic-chunking"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = MODEL_ID,
    max_seq_length = 2048,
    dtype          = None,
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = """
You are an expert Arabic text segmentation assistant. Your task is to split the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment(text: str) -> list[str]:
    prompt = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nText to split:\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    input_ids = tokenizer(
        prompt,
        return_tensors     = "pt",
        add_special_tokens = False,
    ).input_ids.to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            max_new_tokens     = 512,
            do_sample          = False,
            repetition_penalty = 1.1,
            pad_token_id       = tokenizer.eos_token_id,
            eos_token_id       = tokenizer.convert_tokens_to_ids("<|im_end|>"),
        )

    generated  = output_ids[0][input_ids.shape[-1]:]
    raw        = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw).get("sentences", [])


text = (
    "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
    "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
    "على الكلام وترجمة اللغات واتخاذ القرارات."
)

for i, s in enumerate(segment(text), 1):
    print(f"[{i}] {s}")

Expected output:

[1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
[2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.

Output Format

The model is trained to return a single valid JSON object:

{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
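Because downstream parsing is required, a defensive parse can guard against the occasional stray text or markdown fence around the JSON. A minimal sketch (`extract_sentences` is an illustrative helper, not part of this repo):

```python
import json
import re

def extract_sentences(raw: str) -> list[str]:
    """Parse the model's reply into a sentence list, tolerating stray
    markdown fences or extra text around the JSON object."""
    # Strip optional ```json ... ``` fences the model should not emit.
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    # Fall back to the first {...} span if extra text surrounds the JSON.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only string entries, in case the model emits malformed items.
    return [s for s in data.get("sentences", []) if isinstance(s, str)]
```

Returning an empty list on failure lets a pipeline retry or skip a sample instead of crashing on a rare malformed generation.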

Training Details

Base Model

Qwen/Qwen3.5-2B

Method

Knowledge distillation — a GPT-OSS-20B teacher model was used to generate segmentation labels over an Arabic corpus. The student (Qwen3.5-2B) was then fine-tuned on those labels via supervised fine-tuning (SFT).
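The teacher's sentence lists can be serialized into ChatML training strings for SFT. A minimal sketch (the helper name `to_chatml_example` is illustrative, and it assumes the teacher labels have already been collected):

```python
import json

def to_chatml_example(system_prompt: str, arabic_text: str,
                      teacher_sentences: list[str]) -> str:
    """Format one (input, teacher label) pair as a ChatML training string."""
    # ensure_ascii=False keeps the Arabic text verbatim in the JSON target.
    target = json.dumps({"sentences": teacher_sentences}, ensure_ascii=False)
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\nText to split:\n{arabic_text}<|im_end|>\n"
        f"<|im_start|>assistant\n{target}<|im_end|>"
    )
```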

Framework

Unsloth + TRL SFTTrainer

LoRA Configuration

Parameter       Value
Rank (r)        16
Alpha           16
Dropout         0.1
Target modules  q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Bias            none
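As a sketch, the configuration above corresponds to an Unsloth `get_peft_model` call roughly like the following (not the exact training script; argument names follow Unsloth's public API):

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters matching the table above (values from this card).
model = FastLanguageModel.get_peft_model(
    model,
    r              = 16,
    lora_alpha     = 16,
    lora_dropout   = 0.1,
    bias           = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    random_state   = 3407,
)
```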

Training Hyperparameters

Parameter                    Value
Max sequence length          2048
Quantization                 4-bit (QLoRA)
Batch size                   8
Gradient accumulation steps  4
Effective batch size         32
Learning rate                2e-4
LR scheduler                 Linear
Warmup steps                 10
Max steps                    30
Optimizer                    AdamW 8-bit
Weight decay                 0.05
Seed                         3407
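These hyperparameters map onto a TRL `SFTConfig` roughly as follows (a sketch, not the original training script; `output_dir` is a placeholder):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir                  = "outputs",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,   # effective batch size: 8 * 4 = 32
    learning_rate               = 2e-4,
    lr_scheduler_type           = "linear",
    warmup_steps                = 10,
    max_steps                   = 30,
    optim                       = "adamw_8bit",
    weight_decay                = 0.05,
    seed                        = 3407,
)
```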

Data Split

  • Train: 90%
  • Eval: 10%
  • Best checkpoint selected by lowest eval_loss

Prompt Template

This model uses the ChatML format. Always pass add_special_tokens=False when tokenizing a manually built prompt, since the template already contains all special tokens.

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
Text to split:
{arabic_text}<|im_end|>
<|im_start|>assistant

Limitations

  • Optimised for Modern Standard Arabic (MSA); performance on dialects may vary.
  • Best results on texts up to ~400 tokens. Very long documents should be chunked before inference.
  • Output is always JSON — downstream parsing is required.
  • Not suitable for tasks other than segmentation (no Q&A, summarisation, etc.).
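For documents beyond the ~400-token sweet spot, a simple pre-chunking pass keeps each inference call within budget. A minimal sketch (`chunk_text` is an illustrative helper, not part of this repo, and the word budget is only a rough proxy for tokens):

```python
import re

def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split a long document into chunks under a rough word budget,
    cutting only at sentence-final punctuation so ideas stay intact."""
    # Split after Arabic or Latin sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!؟?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed to `segment()` independently and the resulting sentence lists concatenated.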

License

This adapter inherits the Apache 2.0 license from the base Qwen3.5-2B model.


Citation

If you use this model, please cite:

@misc{qwen35_2b_arabic_semantic_chunking,
  title  = {Qwen3.5-2B Fine-Tuned for Arabic Semantic Chunking},
  author = {Omar Abdelmoniem and Mariam Emad},
  year   = {2025},
  url    = {https://huggingface.co/marioVIC/qwen3-5-2b-arabic-semantic-chunking}
}