Model Açıklaması

Ba2han/LFM2.5-1.2B-Turkish_Data_Augment, veri seti büyütme (augmentation) işlemleri için tasarlanmış iki dilli (İngilizce ve Türkçe) bir modeldir. LFM2.5 1.2B mimarisi üzerine inşa edilen bu model, yüksek kaliteli sentetik metinler üretme konusunda uzmanlaşmak üzere yaklaşık 1 milyar token veri ile eğitilmiştir.

Mevcut Türkçe ve İngilizce veri setlerini genişletmek ve çeşitlendirmek konusunda son derece yeteneklidir. Pretraining veri setlerini zenginleştirmek isteyen araştırmacılar ve geliştiriciler için ideal bir araçtır.

RTX 5070 gibi kartlarda VLLM ile saniyede ~7000 token çıktı mümkündür.

Model Detayları

Temel Model: LFM2.5 1.2B
Eğitim Verisi (Token): ~1B
Desteklenen Diller: İngilizce<>Türkçe
Temel Görev: Veri Seti Büyütme (Data Augmentation)

Temel Özellikler ve Dosyalar

Yüksek Kaliteli Veri Çoğaltma: Daha küçük veri setlerini güçlendirmek için mevcut metinlerin bağlama uygun varyasyonlarını üretmede üstün performans gösterir.
Şeffaf Sistem Komutları: Modelin davranışını yönlendirmek için kullanılan sistem mesajları açık kaynak olarak paylaşılmıştır. Bunları depodaki system_messages.json dosyası içinde bulabilirsiniz.
Çift Dilli Yetkinlik: Hem Türkçe hem de İngilizce metinleri sorunsuz bir şekilde işler.

Örnek Kullanım ve Veri Setleri

Modelin veri çoğaltma yeteneklerini göstermek amacıyla, bu model tarafından üretilmiş filtrelenmemiş bir örnek veri seti Hugging Face Hub üzerinde paylaşılmıştır.

Örnek Veri Seti: Ba2han/GeziNot_PopulerBilim-Augmented-TR (Not: Bu veri seti, modelin ham çıktı kapasitesini göstermek amacıyla herhangi bir filtreleme işleminden geçirilmeden paylaşılmıştır).

Başlangıç ve Kullanım

Veri çoğaltma işlemlerinde en iyi sonuçları elde etmek için, system_messages.json dosyasında sağlanan sistem komutlarını kullanın.

Zayıflıklar

Model TR<>EN çeviride düşük performans gösterebilir.
Halüsinasyon oranı düşük olsa da modelin boyutundan dolayı hem girdi hem de çıktılar filtrelenmelidir.
Nadiren model tekrar eden çıktı verebilir.

Model Description

Ba2han/LFM2.5-1.2B-Turkish_Data_Augment is a bilingual (English and Turkish) model designed for dataset augmentation tasks. Built upon the LFM2.5 1.2B architecture, this model has been trained on approximately 1 billion tokens of data to specialize in generating high-quality synthetic text.

It is highly capable of expanding and diversifying existing Turkish and English datasets. It is an ideal tool for researchers and developers looking to enrich their pretraining datasets.

With cards like the RTX 5070, it's possible to produce approximately 7000 tokens per second using VLLM.

Model Details

Base Model: LFM2.5 1.2B
Training Data (Tokens): ~1B
Supported Languages: English<>Turkish
Primary Task: Data Augmentation

Key Features & Assets

High-Quality Data Augmentation: Excels at generating contextually appropriate variations of existing texts to strengthen smaller datasets.
Transparent System Prompts: The system messages used to guide the model's behavior are open-sourced. You can find them in the repository under the system_messages.json file.
Bilingual Proficiency: Seamlessly processes both Turkish and English texts.

Example Usage & Datasets

To demonstrate the model's data augmentation capabilities, an unfiltered example dataset generated by this model has been shared on the Hugging Face Hub.

Example Dataset: Ba2han/GeziNot_PopulerBilim-Augmented-TR (Note: This dataset is shared without any filtering to demonstrate the model's raw output capacity).

Getting Started & Usage

To achieve the best results in data augmentation tasks, use the system prompts provided in the system_messages.json file.

Weaknesses & Limitations

The model may perform poorly in TR<>EN translation.
Although the hallucination rate is low, both inputs and outputs should be filtered due to the model's size.
The model may rarely produce repetitive outputs.

Citations

turkish-nlp-suite/OzenliDerlem

Unsloth

LiquidAI

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}

Downloads last month: 1,404

Safetensors

Model size

1B params

Tensor type

BF16

Model tree for Ba2han/LFM2.5-1.2B-Turkish_Data_Augment

Base model

LiquidAI/LFM2.5-1.2B-Base

Finetuned

(29)

this model

Collection including Ba2han/LFM2.5-1.2B-Turkish_Data_Augment

Augmentation Models

Collection

2 items • Updated 11 days ago