# Qwen3.5-2B · Arabic Semantic Chunking

A LoRA adapter fine-tuned on top of [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) for Arabic text semantic segmentation. Given a block of Arabic text, the model splits it into small, self-contained, meaningful sentences and returns them as a structured JSON object.

This model was trained via knowledge distillation from a GPT-OSS-20B teacher model, using Unsloth for efficient 4-bit LoRA fine-tuning.
## Intended Use
| Use case | Supported |
|---|---|
| Arabic sentence segmentation | ✅ |
| Semantic chunking for RAG pipelines | ✅ |
| Pre-processing Arabic documents | ✅ |
| Non-Arabic languages | ❌ |
| Translation or paraphrasing | ❌ |
## Quick Start

```python
import json

import torch
from unsloth import FastLanguageModel

MODEL_ID = "marioVIC/qwen3-5-2b-arabic-semantic-chunking"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=2048,
    dtype=None,          # auto-detect the best dtype for the GPU
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = """
You are an expert Arabic text segmentation assistant. Your task is to split the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment(text: str) -> list[str]:
    # The prompt is built manually in ChatML format, so the tokenizer
    # must not add special tokens a second time.
    prompt = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nText to split:\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=512,
            do_sample=False,           # greedy decoding for stable JSON
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
        )
    # Decode only the newly generated tokens, not the prompt.
    generated = output_ids[0][input_ids.shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw).get("sentences", [])

text = (
    "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
    "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
    "على الكلام وترجمة اللغات واتخاذ القرارات."
)
for i, s in enumerate(segment(text), 1):
    print(f"[{i}] {s}")
```
Expected output:

```text
[1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
[2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
```
## Output Format

The model always returns a valid JSON object:

```json
{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
```
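Even though the model is trained to emit bare JSON, a defensive parser is useful downstream. A minimal sketch, assuming nothing beyond the stdlib (`parse_sentences` is a hypothetical helper, not part of this repository):

```python
import json


def parse_sentences(raw: str) -> list[str]:
    """Parse the model's JSON reply, tolerating stray markdown fences.

    parse_sentences is an illustrative helper, not part of the model's API.
    """
    cleaned = raw.strip()
    # The prompt forbids code fences, but sampling can still produce them.
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        # Drop an optional language tag such as "json" on the first line.
        first_newline = cleaned.find("\n")
        if first_newline != -1 and cleaned[:first_newline].strip().isalpha():
            cleaned = cleaned[first_newline + 1:]
    try:
        return json.loads(cleaned).get("sentences", [])
    except json.JSONDecodeError:
        # Fall back to an empty list rather than crashing the pipeline.
        return []
```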
## Training Details

### Base Model

[Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B)

### Method

Knowledge distillation — a GPT-OSS-20B teacher model was used to generate segmentation labels over an Arabic corpus. The student (Qwen3.5-2B) was then fine-tuned on those labels via supervised fine-tuning (SFT).

### Framework

[Unsloth](https://github.com/unslothai/unsloth), with 4-bit QLoRA fine-tuning.
### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | none |
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Max sequence length | 2048 |
| Quantization | 4-bit (QLoRA) |
| Batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Learning rate | 2e-4 |
| LR scheduler | Linear |
| Warmup steps | 10 |
| Max steps | 30 |
| Optimizer | AdamW 8-bit |
| Weight decay | 0.05 |
| Seed | 3407 |
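The effective batch size in the table follows directly from the per-device batch size and gradient accumulation. A small sketch restating the configuration as a plain dict (the key names are borrowed from common `transformers.TrainingArguments` fields as an assumption; they are not copied from the training script):

```python
# Hypothetical reconstruction of the hyperparameter table above.
config = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "linear",
    "warmup_steps": 10,
    "max_steps": 30,
    "optim": "adamw_8bit",
    "weight_decay": 0.05,
    "seed": 3407,
}

# Effective batch size = per-device batch * gradient accumulation steps.
effective_batch_size = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
)
```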
### Data Split

- Train: 90%
- Eval: 10%
- Best checkpoint selected by lowest `eval_loss`
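The 90/10 split can be sketched with the stdlib alone; the actual pipeline may instead use `datasets.Dataset.train_test_split` with the same seed, and `train_eval_split` below is an illustrative helper, not part of the training code:

```python
import random


def train_eval_split(examples: list, eval_frac: float = 0.1, seed: int = 3407):
    """Shuffle examples deterministically and split them 90/10.

    A minimal stdlib sketch of the data split described above.
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_frac))
    # Last n_eval shuffled examples become the eval set.
    return shuffled[n_eval:], shuffled[:n_eval]
```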
## Prompt Template

This model uses the ChatML format. Always use `add_special_tokens=False` when tokenizing a manually built prompt.

```text
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
Text to split:
{arabic_text}<|im_end|>
<|im_start|>assistant
```
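The template above can be assembled programmatically. `build_prompt` is an illustrative helper for this model card, not part of any library:

```python
def build_prompt(system_prompt: str, arabic_text: str) -> str:
    """Fill the ChatML template exactly as shown above.

    The trailing "<|im_start|>assistant\n" cues the model to generate
    the JSON reply.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\nText to split:\n{arabic_text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
```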
## Limitations
- Optimised for Modern Standard Arabic (MSA); performance on dialects may vary.
- Best results on texts up to ~400 tokens. Very long documents should be chunked before inference.
- Output is always JSON — downstream parsing is required.
- Not suitable for tasks other than segmentation (no Q&A, summarisation, etc.).
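For long documents, a rough pre-chunking pass can be sketched as follows. `pre_chunk` is a hypothetical helper that approximates the ~400-token limit by word count; a real pipeline would count tokenizer tokens instead:

```python
def pre_chunk(text: str, max_words: int = 300) -> list[str]:
    """Split a long document into whitespace-delimited chunks of at
    most max_words words each, to stay under the model's input limit.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk can then be passed to `segment()` separately and the resulting sentence lists concatenated.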
## License
This adapter inherits the Apache 2.0 license from the base Qwen3.5-2B model.
## Citation

If you use this model, please cite:

```bibtex
@misc{qwen35arabicsemanticchunking,
  title  = {Qwen3.5 Fine-Tuned for Semantic Chunking},
  author = {Omar Abdelmoniem and Mariam Emad},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.5-2B}
}
```