Information about Dataset and code

by gbettaieb - opened Dec 22, 2025

Discussion

gbettaieb

Dec 22, 2025

Hello
Thank you for sharing the model, it looks very promising on handwritten Arabic documents!
I have two questions:

Is it possible to share the code (GitHub repository/ Python files for the model and training)? It would be very useful for the community to see how you fine-tuned the model and to be able to reproduce it or build upon it 😀
How did you deal with the fact that the dataset contains many different page formats? Some pages have many lines or paragraphs, while others from open-source datasets have only one line per image. Did this heterogeneity affect the fine-tuning process?

sherif1313

Owner Dec 22, 2025

Thank you for your message. I'm about to upload an improved version soon, and it's much better. I've worked on the page layout and formatting, and it is about 20% faster with a lower CER. I've also removed unnecessary data and added new data. It's a very powerful model, so please wait a few days, and we'll work together to add it to GitHub.

gbettaieb

Dec 23, 2025

Thank you! Don't hesitate to ask me if you need help structuring the code or uploading it to GitHub, as I have worked on a similar model and developed training and evaluation code before.
I may also have annotated datasets of handwritten Arabic documents that I can share with you to further improve the model.
It would be great to have a model with high accuracy, fast inference speed and limited hallucination.

sherif1313

Owner Dec 25, 2025

from PIL import Image

def preprocess_sample(sample, target_size=1024):
    # compress_image is a helper defined elsewhere in the training script (not shown in this post).
    try:
        image = sample["image"]
        text = sample["text"]

        # Compress the image to the new size
        compressed_image = compress_image(image, max_size=target_size)

        # Smart padding based on the width-to-height ratio
        width, height = compressed_image.size
        aspect_ratio = width / height

        # Round the dimensions up to multiples of 32 while roughly keeping the ratio
        if aspect_ratio > 1.5:  # landscape image
            new_width = ((width + 31) // 32) * 32
            new_height = ((height + 31) // 32) * 32
        else:  # portrait or square image
            # Use a larger step (64) to preserve detail
            new_width = ((width + 63) // 64) * 64
            new_height = ((height + 63) // 64) * 64

        # Resize with high-quality LANCZOS resampling
        final_image = compressed_image.resize((new_width, new_height),
                                              Image.Resampling.LANCZOS)

        return {"image": final_image, "text": text}
    except Exception as e:
        print(f"Error preprocessing: {e}")
        return None
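
The `((x + 31) // 32) * 32` expression in the snippet above rounds a dimension up to the next multiple of 32. A minimal sketch of that arithmetic follows. The helper names `round_up`, `padded_size` and `padding_needed` are hypothetical and not part of the posted code. Note that the posted code calls `resize` with the rounded size, which stretches the image by a few pixels rather than adding a border; `padding_needed` shows how many pixels a true padding step would have to add instead.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple, e.g. 1000 -> 1024 for multiple=32."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_size(width: int, height: int, multiple: int = 32) -> tuple[int, int]:
    """Return (width, height) with both sides rounded up to a multiple."""
    return round_up(width, multiple), round_up(height, multiple)


def padding_needed(width: int, height: int, multiple: int = 32) -> tuple[int, int]:
    """Return how many pixels must be added to each dimension to reach the target."""
    new_width, new_height = padded_size(width, height, multiple)
    return new_width - width, new_height - height


if __name__ == "__main__":
    print(padded_size(1000, 700))     # target size for a 1000x700 page
    print(padding_needed(1000, 700))  # extra pixels in width and height
```

For a 1000x700 image the target is 1024x704, so at most 31 pixels are ever added per side with a step of 32 (63 with a step of 64).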

gbettaieb

Dec 26, 2025

How did you run the fine-tuning? Did you use a notebook, Python scripts, or an existing tutorial?
It would be great if you could share the full code, even if it is not clean or well structured.

MohamedDhouib1

Dec 26, 2025

Hello, thanks a lot for sharing this work, it’s really impressive.

Would it be possible to put the training code in an open GitHub repository? I think that would give the project more visibility and make it easier for the community to contribute and improve it.

Ideally, it would be great to have code that allows us to reproduce the training process, starting from the pretrained Qwen model and fine-tuning it to obtain the final model you shared here, including any preprocessing or training details.

Thanks again for the great effort.

wznaidi

Dec 26, 2025

Hello @everyone ,

Great work, Amazing 😀

I am also very interested in contributing to this project. How can I help?

Sherif, are you planning to share the above codebase so we can contribute to it?

sherif1313

Owner Dec 28, 2025

•

edited Dec 28, 2025

Hello, and thank you wznaidi, Mohamed123321 and gbettaieb.

I apologize for the delay in replying; I was busy uploading the new model. There are two problems: the size of the new model, and the fact that when I quantized it to 4 bits the difference became about 15-20% for large documents and 2-5% for small data. Is there a way to shrink and quantize it without such a significant difference? Even so, it's much better than the previous model.

I've changed Spaces to the new model. I would appreciate your feedback on the new model. God willing, I will open a GitHub account soon.
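
The thread does not say how the 15-20% and 2-5% differences were measured. One common way to compare a full-precision and a 4-bit model on the same pages is the character error rate (CER) mentioned earlier in the discussion. The sketch below is a plain-Python illustration under that assumption; `edit_distance` and `cer` are hypothetical helper names, not functions from the project.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One missing character out of four in the reference.
    print(cer("كتاب", "كتب"))
```

Running both model variants over the same held-out pages and averaging `cer` per page size would show whether the gap really concentrates in large documents.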

MohamedDhouib1

Dec 28, 2025

Hello,
Could you elaborate on what you mean by “template”? Do you mean the model? Also, why is quantization necessary in your case? I’d be happy to help if you can share a bit more detail.

I’ll test your new model and report back here.
Also, if you need help putting your work on GitHub, I’d be glad to help.

sherif1313

Owner Dec 28, 2025

Yes, I meant the model. I have corrected it; please excuse me.

MohamedDhouib1

Dec 28, 2025

•

edited Dec 28, 2025

Do you have any plans to publish the fine-tuning code and the data-preparation scripts on GitHub? That would make it much easier for the community to contribute and improve the project together.

If you can share a reproducible training workflow, starting from the pretrained Qwen model and covering data preprocessing, training settings, and the steps used to produce the final model, it would help others understand the process and build on your approach.

sherif1313

Owner Dec 29, 2025

•

edited Dec 29, 2025

This is my page: https://github.com/sherif1313
Should I start a new project?

MohamedDhouib1

Dec 29, 2025

Hey,
I see you already have a project here: https://github.com/sherif1313/Arabic-English-handwritten-OCR-v3
You need to put training code and data preparation code there, and document how to reproduce the results!

sherif1313

Owner Dec 29, 2025

I have prepared all the files and created a logo for the model, but how do I document the model and how do I share it?

sherif1313

Owner Dec 29, 2025

•

edited Dec 29, 2025

Is the logo good? 😀
Should I add it here?
How do I document how to reproduce the results?

sherif1313

Owner Dec 29, 2025

I want to convert it into a complete platform like DocStrange. Can anyone convert it to this format and make it specialized for Arabic books, manuscripts, and documents?
https://github.com/NanoNets/docstrange

MohamedDhouib1

Dec 29, 2025

•

edited Dec 29, 2025

Hello! I think the first thing to do is to clearly document which datasets you used and whether you applied any preprocessing.
In the code, you have DATA_DIR = "/KHATT/finsh/images", where each image is paired with a .txt file of the same name. At the moment, it’s not clear what exactly is inside that folder (which dataset(s) it contains, how the files were selected/split, and what processing was done to produce them), and that’s an important part to report.
Also, what’s the difference between v2 and v3 in terms of the data?
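
One way to make the contents of such a folder reportable is to generate a manifest that lists every image/transcription pair together with its split. The sketch below is only an illustration under assumptions: the function names, the JSON Lines output, the 10% validation fraction and the hash-based split are all hypothetical and are not taken from the project's code.

```python
import hashlib
import json
import os

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_manifest(data_dir, val_fraction=0.1):
    """List image/.txt pairs in data_dir and assign each a stable split."""
    records = []
    for name in sorted(os.listdir(data_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTS:
            continue
        if not os.path.exists(os.path.join(data_dir, stem + ".txt")):
            continue  # skip images without a transcription
        # Hash the file name so the split is identical on every machine and run.
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF
        split = "val" if bucket < val_fraction else "train"
        records.append({"image": name, "text_file": stem + ".txt", "split": split})
    return records


def write_manifest(records, out_path):
    """Write one JSON object per line so the file is easy to diff and review."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Committing the resulting manifest for v2 and for v3 would also make the data difference between the two versions visible as a plain file diff.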

sherif1313

Owner Dec 29, 2025

I will write in Arabic because it is easier for me.
You will find the data for the two models on the data page. For the second model I used both printed and handwritten data; for the third I removed the printed data and kept the difficult samples.
I did not preprocess the data; I only collected it in one folder ... You have not told me how to do the documentation and how to share the training. How?

==================== ====================

import glob
import os

from PIL import Image

# MAX_IMAGE_SIZE is a constant defined elsewhere in the training script (value not shown in this post).

def load_local_pairs(data_dir):
    """Collect (image, text) pairs from a local folder."""
    # Listed in the post, but only .png, .jpg and .jpeg are actually globbed below.
    image_extensions = ['.png', '.jpg', '.jpeg', '.tif', '.tiff', '.PNG', '.JPG', '.JPEG', '.TIF', '.TIFF']
    image_paths = sorted(
        glob.glob(os.path.join(data_dir, "*.png")) +
        glob.glob(os.path.join(data_dir, "*.jpg")) +
        glob.glob(os.path.join(data_dir, "*.jpeg"))
    )
    pairs = []
    for img_path in image_paths:
        txt_path = os.path.splitext(img_path)[0] + ".txt"
        if os.path.exists(txt_path):
            try:
                with open(txt_path, "r", encoding="utf-8") as f:
                    text = f.read().strip()
                if text:
                    image = Image.open(img_path).convert("RGB")
                    # Modified: also store the file path in the dictionary
                    pairs.append({"image_path": img_path, "image": image, "text": text})
            except Exception as e:
                print(f"Skipping {img_path}: {e}")
    print(f"✅ Loaded {len(pairs)} samples from {data_dir}")
    return pairs

def preprocess_sample(sample):
    try:
        image = sample["image"]

        # New addition: shrink the image if it is too large.
        # thumbnail() keeps the width-to-height ratio.
        if image.width > MAX_IMAGE_SIZE or image.height > MAX_IMAGE_SIZE:
            original_size = (image.width, image.height)
            image.thumbnail((MAX_IMAGE_SIZE, MAX_IMAGE_SIZE), Image.Resampling.LANCZOS)
            print(f"🖼️ Resized image from {original_size} to {image.size}")
        # End of the addition

        text = sample["text"]
        width, height = image.size

        # This step is unchanged; it keeps the dimensions compatible with the model
        new_width = ((width + 31) // 32) * 32
        new_height = ((height + 31) // 32) * 32
        image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)

        # The post is cut off here; the lines below are an assumption that
        # mirrors the earlier version of this function posted in this thread.
        return {"image": image, "text": text}
    except Exception as e:
        print(f"Error preprocessing: {e}")
        return None

sherif1313

Owner Dec 29, 2025

I also added Omarkhaledok/muharaf-public-pages to the third model.

MohamedDhouib1

Dec 29, 2025

Thank you very much 🙏
You can share more information in the README file on GitHub (about the data, the differences between the models, and the training steps) so that the picture becomes clearer and it is easier for others to reproduce the experiment.

sherif1313

Owner Dec 29, 2025

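You have not told me what I need to do to share the code and the documentation. I want this model to be the start of a project that serves Arabic speakers and makes things easier for them, and that is more reliable than anything produced in the West, because they do not understand our language better than we do.
So if you want to help me with this beginning, so that it can make things easier for everyone who speaks Arabic and become completely free in service of the community, I would welcome it.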

MohamedDhouib1

Dec 29, 2025

•

edited Dec 29, 2025

You have already taken the right first step by creating a GitHub repository and sharing the code. Now, for the project to be really useful to people, try to put as much information as possible in:

The README file in the repository: explain in detail how you prepared the data and how you created the images folder (the same steps you wrote here), what each folder contains, what the difference is between version 2 and version 3, and the training and evaluation steps.
The model page on Hugging Face: the same information in brief, with a quick-start example, the training commands, and the requirements.

And the most important point: make the code run on any machine without manual edits:

Do not leave hardcoded local paths in the configuration files without explanation.
Replace them with arguments the user can set at run time (such as the data path) or with a general configuration file, and include a clear example showing how users set their own paths.

This will make it easier for every Arabic speaker to use the model and improve it.
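
A minimal sketch of the run-time arguments suggested above, using Python's standard `argparse` module. The flag names `--data-dir`, `--output-dir` and `--max-image-size`, and their defaults, are hypothetical choices for illustration; the default of 1024 is borrowed from the `target_size=1024` in the first snippet of this thread and is not confirmed as the project's actual setting.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options instead of hardcoding local paths."""
    parser = argparse.ArgumentParser(description="Fine-tuning entry point (sketch)")
    parser.add_argument(
        "--data-dir", required=True,
        help="Folder with image files and same-named .txt transcriptions",
    )
    parser.add_argument(
        "--output-dir", default="outputs",
        help="Where checkpoints and logs are written",
    )
    parser.add_argument(
        "--max-image-size", type=int, default=1024,
        help="Longest image side in pixels before downscaling",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Reading data from {args.data_dir}, writing to {args.output_dir}")
```

A user would then run something like `python train.py --data-dir /path/to/images` (the script name is an assumption), and the README can show that one command instead of asking people to edit constants.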

sherif1313

Owner Dec 29, 2025

•

edited Dec 29, 2025

Thank you for this clarification. I will work on improving the code so that it is easier for anyone who wants to work on it and improve it. I will also work on improving the model further in the next version and try to reduce its size so that it is easier to download. If you have other data, please let me know.
Praise be to God, after my initial testing I found the model better than all the other models that handle Arabic on printed text, and especially on handwriting, by a very large margin. If you have any questions, I will be happy to answer them.

sherif1313

Owner Dec 29, 2025

Very good news: the current model has been merged with the
Qwen3-VL-4B-Instruct
model, and the result so far is very promising.
This means improved performance on Qwen3-VL,
while keeping the old knowledge and reducing retraining time,
because I found that the Qwen2.5-VL model
cannot give me anything better than this.
I will start training from where I left off, and in a few days I will tell you the result.

sherif1313

Owner Jan 1

Peace be upon you.
I merged the previous training into the Qwen 3 model and got good results in transferring the old knowledge to the new model. The main problem is that, after the merge, further training and refinement is very slow. I think it works in a completely different way from Qwen 2.5. If you have worked with this model, please tell me how to speed up training and what the points ...
Training: 99it [37:32, 22.81s/it, loss=0.767, epoch=0, samples=99, eval_loss=N/A]
Training: 99it [04:20, 2.71s/it, loss=0.748, epoch=0, samples=99, eval_loss=N/A]
The difference in speed is very large, more than 30 times. If you know, I would be grateful.
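
The slowdown can be read directly from the two progress lines above. The sketch below parses the iteration count, elapsed time and seconds per iteration from lines in that format; `parse_progress` is a hypothetical helper and the regular expression only covers the `MM:SS` layout shown here. Taken at face value, these two lines give a ratio of roughly 8.4x per iteration.

```python
import re

# Matches e.g. "99it [37:32, 22.81s/it" (elapsed time in MM:SS only).
LOG_RE = re.compile(r"(\d+)it \[(\d+):(\d+), ([\d.]+)s/it")


def parse_progress(line):
    """Return (iterations, elapsed_seconds, seconds_per_iteration)."""
    match = LOG_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized progress line: {line!r}")
    iterations = int(match.group(1))
    elapsed = int(match.group(2)) * 60 + int(match.group(3))
    return iterations, elapsed, float(match.group(4))


slow = parse_progress("99it [37:32, 22.81s/it, loss=0.767, epoch=0]")
fast = parse_progress("99it [04:20, 2.71s/it, loss=0.748, epoch=0]")

per_iteration_ratio = slow[2] / fast[2]  # 22.81 / 2.71
elapsed_ratio = slow[1] / fast[1]        # 2252 s / 260 s

if __name__ == "__main__":
    print(f"per-iteration slowdown: {per_iteration_ratio:.2f}x")
    print(f"elapsed-time slowdown:  {elapsed_ratio:.2f}x")
```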

MohamedDhouib1

Jan 1

Honestly, I didn't understand exactly which model you mean. But a speed gap this large is usually due to a different model size or a different quantization setup. If you can, put the old and new code, together with the configuration files and the run command, here or in a single GitHub repository, and I'll review them to pinpoint the exact cause and how to speed up training.

sherif1313

Owner Jan 1

I mean transferring the previous knowledge from Qwen2.5-VL into Qwen3-VL. I'm not sure whether the problem comes from Python 3.13, which is still relatively new and has issues with multiprocessing in PyTorch. I switched to Python 3.10, but I still haven't found a real fix for the model's slowness.

MohamedDhouib1

Jan 1

Are you sure you used the same model size and training setting?

sherif1313

Owner Jan 2

•

edited Jan 2

The old knowledge from the previous training of the current model has been transferred into the new model, and it gave me a very good result, approaching 90% of the previous model. The problem is that, after the merge, training the model is slow, even though the training code is the same as the old code with small changes. What I have found is that the new model does not make use of the multiple CPU cores, and that is what causes the slowness: every transfer between the graphics card and the CPU waits a very long time. What is the solution?

MohamedDhouib1

Jan 2

Hello, I need both training scripts to be able to help you

sherif1313 changed discussion status to closed Jan 2

sherif1313 changed discussion status to open Jan 2

sherif1313

Owner Jan 2

I will send a detailed report.

sherif1313

Owner Jan 2

•

edited Jan 2

https://qwen3lm.com/fine-tune-qwen3-with-lora/ Here are all the settings I needed. Thank you for your interest and your reply. God willing, when the training is finished I will tell you the result. https://deepwiki.com/QwenLM/Qwen-VL/5.3-lora-fine-tuning

MohamedDhouib1

Jan 2

The default model in the link you provided is a 14B one: Qwen/Qwen1.5-14B
Did you change that?
If not, that would explain why the training is slow.

sherif1313

Owner Jan 2

•

edited Jan 2

The best and most stable solution is to use AutoModelForVision2Seq for loading, as it automatically handles these differences instead of using the direct Qwen3VLForConditionalGeneration class. I removed the use_cache and it's working now; I'll let you know the result soon, God willing.
Training: 231it [16:24, 2.70s/it, loss=0.214, epoch=0, samples=231, eval_loss=0.655]

sherif1313

Owner Jan 2

The result is beyond fantastic and saved months of training. This model will be a real turning point ... The only problem is slow inference: the old model gives me 0.32 and the new one 2.5, and I think this reflects the old model's actual experience.
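
The post does not give units for the 0.32 and 2.5 figures. To compare the two models under identical conditions, a small timing harness such as the sketch below can help. `mean_latency` is a hypothetical helper, and the prediction functions mentioned afterwards are placeholders, not part of the project.

```python
import time


def mean_latency(fn, inputs, warmup=1):
    """Average wall-clock seconds per call of fn over the given inputs."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input")
    for item in inputs[:warmup]:
        fn(item)  # warm-up calls are not timed
    start = time.perf_counter()
    for item in inputs:
        fn(item)
    return (time.perf_counter() - start) / len(inputs)


if __name__ == "__main__":
    # Stand-in workload; replace with a real per-image prediction call.
    print(mean_latency(lambda n: sum(range(n)), [100_000] * 5))
```

Calling it once with a wrapper around the old model and once with a wrapper around the new one (for example hypothetical `old_predict(image)` and `new_predict(image)` functions), on the same images and hardware, gives directly comparable seconds-per-page numbers.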

MohamedDhouib1

Jan 3

You can share the model on huggingface once the training is over, we can help you test it!
