# T-lite-0.1
🚨 T-lite is designed for further fine-tuning and is not intended as a ready-to-use conversational assistant. Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. Responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.
## Description
T-lite is a continually pretrained model built specifically for the Russian language, intended as a foundation for Russian-language LLM applications. It aims to improve the quality of Russian text generation and to provide domain-specific and cultural knowledge relevant to the Russian context.
## Model Training Details
### 🏛️ Architecture and Configuration
T-lite is a decoder-only transformer language model with the following Llama-style components (a minimal sketch of the first two follows the list):
- pre-normalization via RMSNorm
- SwiGLU activation function
- rotary positional embeddings (RoPE)
- grouped query attention (GQA)
T-lite was trained in bf16.
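As an illustrative, self-contained sketch of the first two components (not T-lite's actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization: rescale by the root-mean-square, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

RoPE rotates query/key channel pairs by position-dependent angles instead of adding positional vectors, and GQA shares each key/value head across a group of query heads, which shrinks the KV cache at inference time.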
### ⚙️ Hyperparameters
We employed the Decoupled AdamW optimizer with β₁ = 0.9, β₂ = 0.95, and eps = 1.0e-8. The learning rate was set to 1.0e-5, with a constant schedule and a 10-step warmup during stage 1 and a cosine schedule during stage 2. Weight decay was applied at a rate of 1.0e-6, and gradient clipping was performed with a maximum norm of 1.0. The maximum sequence length was set to 8192, and each batch contained approximately 6 million tokens.
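A minimal sketch of this configuration in plain PyTorch (note: `torch.optim.AdamW` already applies weight decay decoupled from the gradient update; the exact training stack used for T-lite is not specified in this card, and `num_training_steps` below is a placeholder):

```python
import torch
from transformers import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the actual model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.0e-5,
    betas=(0.9, 0.95),
    eps=1.0e-8,
    weight_decay=1.0e-6,
)

# Stage 1: constant LR after a 10-step warmup
stage1_scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=10)

# Stage 2: cosine decay (num_training_steps is a placeholder)
stage2_scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=10_000
)

# Gradient clipping at max norm 1.0, applied every optimization step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```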
### 🏋🏽 Hardware Configuration & Performance
Training was conducted on 96 A100 GPUs with 80GB of memory each, using Fully Sharded Data Parallel (FSDP) with full-shard/hybrid-shard strategies. The setup achieved a throughput of 3,000 tokens/sec/GPU, processing 100B tokens in approximately 4 days, at a Model FLOPs Utilization (MFU) of 0.59.
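These numbers are self-consistent: 3,000 tokens/sec/GPU × 96 GPUs ≈ 288K tokens/sec, and 100B ÷ 288K ≈ 3.5 × 10⁵ seconds, i.e. about 4 days. A hedged sketch of how such a setup can be expressed with PyTorch FSDP (the process-group layout and wrap policy of the actual run are not given in the card):

```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def wrap_model(model: torch.nn.Module) -> FSDP:
    # Assumes torch.distributed is already initialized, e.g. via torchrun
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(
        model,
        # HYBRID_SHARD: full sharding within a node, replication across nodes
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
        mixed_precision=bf16,
        device_id=torch.cuda.current_device(),
    )
```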
## 📚 Data
### Stage 1
Massive continual pretraining:
- 300B tokens × 0.3 epoch
- 85% of the data is in Russian, as a trade-off between language adaptation and preserving English performance
- Styles and topics in Common Crawl (CC) data were downsampled
- Domains in book datasets were balanced
- The proportion of code data was increased
### Stage 2
Focuses on refining the quality of the dataset (a weighted-sampling sketch follows the list):
- 20B tokens × 3 epochs
- Includes smaller-volume instruction datasets
- Advertisements and news were aggressively downsampled
- Instructions and articles were upsampled
- Educational content was balanced
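The up- and down-sampling above boils down to drawing each training document from a source according to adjusted mixture weights. A minimal illustration with entirely hypothetical proportions (the card does not publish the actual mixture):

```python
import random

# Hypothetical stage-2 mixture weights, for illustration only
MIXTURE = {
    "instructions": 0.30,  # upsampled
    "articles": 0.25,      # upsampled
    "educational": 0.20,   # balanced
    "books": 0.15,
    "news": 0.07,          # aggressively downsampled
    "ads": 0.03,           # aggressively downsampled
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mixture."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(42)
print([sample_source(rng) for _ in range(5)])
```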
## 📊 Benchmarks
### 🇷🇺 Russian
MERA benchmark results
| Task name | Metric | N shot | Llama-3-8b | T-lite-0.1 |
|---|---|---|---|---|
| Total score | — | — | 0.445 | 0.492 |
| BPS | Accuracy | 2-shot | 0.459 | 0.358 |
| CheGeKa | F1 / EM | 4-shot | 0.04/0 | 0.118/0.06 |
| LCS | Accuracy | 2-shot | 0.146 | 0.14 |
| MathLogicQA | Accuracy | 5-shot | 0.365 | 0.37 |
| MultiQ | F1-score / EM | 0-shot | 0.106/0.027 | 0.383/0.29 |
| PARus | Accuracy | 0-shot | 0.72 | 0.858 |
| RCB | Avg F1 / Accuracy | 0-shot | 0.42/0.434 | 0.511/0.416 |
| ruHumanEval | pass@1 / pass@5 / pass@10 | 0-shot | 0.017 / 0.085 / 0.171 | 0.023 / 0.113 / 0.226 |
| ruMMLU | Accuracy | 5-shot | 0.693 | 0.759 |
| ruModAr | EM | 0-shot | 0.708 | 0.667 |
| ruMultiAr | EM | 5-shot | 0.259 | 0.269 |
| ruOpenBookQA | Avg F1 / Accuracy | 5-shot | 0.745/0.744 | 0.783/0.782 |
| ruTiE | Accuracy | 0-shot | 0.553 | 0.681 |
| ruWorldTree | Avg F1 / Accuracy | 5-shot | 0.838/0.839 | 0.88/0.88 |
| RWSD | Accuracy | 0-shot | 0.504 | 0.585 |
| SimpleAr | EM | 5-shot | 0.954 | 0.955 |
| USE | Grade Norm | 0-shot | 0.023 | 0.05 |
The evaluation was performed using https://github.com/ai-forever/MERA/tree/main.
### 🇬🇧 English
As expected, adapting the model to Russian comes at some cost to English capabilities: performance declines on most English benchmarks.
| Benchmark | N shot | Llama-3-8b | T-lite-0.1 |
|---|---|---|---|
| ARC-challenge | 0-shot | 0.518 | 0.489 |
| ARC-easy | 0-shot | 0.789 | 0.787 |
| MMLU | 0-shot | 0.62 | 0.6 |
| Natural Questions | 0-shot | 0.162 | 0.222 |
| TriviaQA | 0-shot | 0.63 | 0.539 |
The evaluation was performed using https://github.com/EleutherAI/lm-evaluation-harness.
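For reference, a comparable run through the harness's Python API might look as follows; the task names and arguments here are assumptions that vary across harness versions, so treat this as an approximate sketch rather than the exact command used:

```python
# pip install lm-eval
from lm_eval import simple_evaluate

# Task names are assumptions and differ between harness versions
results = simple_evaluate(
    model="hf",
    model_args="pretrained=t-bank-ai/T-lite-0.1,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "mmlu", "nq_open", "triviaqa"],
    num_fewshot=0,
)
print(results["results"])
```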
## 👨‍💻 Examples of usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

torch.manual_seed(42)

tokenizer = AutoTokenizer.from_pretrained("t-bank-ai/T-lite-0.1")
model = AutoModelForCausalLM.from_pretrained("t-bank-ai/T-lite-0.1", device_map="auto")

# Russian prompt: "Machine learning is needed for"
input_text = "Машинное обучение нужно для"

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
Output:
```text
Машинное обучение нужно для того, чтобы автоматизировать процесс принятия решений. Вместо того, чтобы человеку нужно было вручную просматривать и анализировать данные, алгоритмы машинного обучения могут автоматически выявлять закономерности и делать прогнозы на основе этих данных. Это может быть особенно полезно в таких областях, как финансы, где объем данных огромен, а решения должны приниматься быстро.
Вот несколько примеров того, как машинное обучение используется в финансах:
1. Обнаружение мошенничества: алгоритмы машинного обучения могут анализировать закономерности в транзакциях и выявлять подозрительные действия, которые могут указывать на мошенничество.
2. Управление рисками: Машинное обучение может помочь финансовым учреждениям выявлять и оценивать риски, связанные с различными инвестициями или кредитами.
3. Обработка данных на естественном языке: Машинное обучение может использоваться для анализа финансовых новостей и других текстовых данных, чтобы выявить тенденции
```

English translation of the output: *"Machine learning is needed to automate decision-making. Instead of a person manually reviewing and analyzing data, machine-learning algorithms can automatically detect patterns and make predictions based on those data. This can be especially useful in fields such as finance, where data volumes are huge and decisions must be made quickly. Here are a few examples of how machine learning is used in finance: 1. Fraud detection: machine-learning algorithms can analyze patterns in transactions and flag suspicious activity that may indicate fraud. 2. Risk management: machine learning can help financial institutions identify and assess the risks associated with various investments or loans. 3. Natural-language data processing: machine learning can be used to analyze financial news and other text data to identify trends"*
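Since the card notes the model was trained in bf16, it may also make sense to load the weights in that dtype. This is a hedged variation on the snippet above, not an official recommendation from the card:

```python
import torch
from transformers import AutoModelForCausalLM

# Load weights in bfloat16 to match the training precision
# (assumption: your hardware supports bf16 inference)
model = AutoModelForCausalLM.from_pretrained(
    "t-bank-ai/T-lite-0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```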