Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets • Paper • 2602.22207 • Published Feb 25 • 43
The Synthetic Data Playbook: Generating Trillions of the Finest Tokens • Explore synthetic data experiments on a virtual bookshelf • 219
Lapa v0.1.2 Release • Collection • Release of SOTA Ukrainian LLM and Datasets • 18 items • Updated Nov 13, 2025 • 28
The Smol Training Playbook • The secrets to building world-class LLMs • 3.1k
INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 • Chat with INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 • 4
Article • Welcome GPT OSS, the new open-source model family from OpenAI! • Aug 5, 2025 • 513
OmniGEC • Collection • This is a collection of multilingual silver-standard datasets and models for the task of Grammatical Error Correction (GEC). • 9 items • Updated Sep 19, 2025 • 8
Article • Announcing MamayLM, an efficient state-of-the-art Ukrainian LLM • Apr 23, 2025 • 63
The Tokenizer Playground • Experiment with and compare different tokenizers • 650