# DESTA-1B: Dedicated Eritrean Semitic Text Autoregressor
DESTA-1B (Dedicated Eritrean Semitic Text Autoregressor, 1B) is a fine-tuned version of TinyLlama-1.1B-Chat-v1.0 optimized specifically for Tigrinya text generation. The model was trained on a comprehensive Tigrinya dataset and shows substantial improvements in perplexity and text quality on Tigrinya language tasks.

DESTA-1B is intended to serve as a dedicated language model for Eritrean Semitic languages, with a primary focus on Tigrinya text generation, understanding, and completion.
## Model Details

### Model Description
- Model Name: DESTA-1B (Dedicated Eritrean Semitic Text Autoregressor - 1B)
- Architecture: LlamaForCausalLM (TinyLlama)
- Parameters: ~1.0B (1,007,771,648)
- Model Size: ~4.2 GB
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Language: Tigrinya (ti, tig)
- Tokenizer: mewaeltsegay/tokenizer_tigrinya
### Model Architecture
- Hidden Size: 2,048
- Number of Layers: 22
- Intermediate Size: 5,632
- Attention Heads: 32
- Key-Value Heads: 4 (GQA)
- Head Dimension: 64
- Max Position Embeddings: 2,048
- Vocabulary Size: 32,000
- Activation Function: SiLU
- RoPE Theta: 10,000
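The table above fully determines the per-layer parameter count for a Llama-style block with grouped-query attention. As a sanity check, the arithmetic can be done directly; note the exact total depends on the vocabulary size and whether the input embedding and LM head share weights, so the figure below (computed assuming an untied 32k head, as in TinyLlama) differs from the 1,007,771,648 listed above:

```python
# Rough parameter count derived from the architecture table above.
hidden = 2048
layers = 22
intermediate = 5632
kv_heads, head_dim = 4, 64
kv_dim = kv_heads * head_dim  # 256 (grouped-query attention)

attn = hidden * hidden * 2    # q_proj + o_proj
attn += hidden * kv_dim * 2   # k_proj + v_proj (shrunk by GQA)
mlp = hidden * intermediate * 3  # gate_proj, up_proj, down_proj
norms = hidden * 2               # two RMSNorms per layer
per_layer = attn + mlp + norms

vocab = 32000
embeddings = vocab * hidden
# Assumes an untied LM head (2x embeddings) plus the final RMSNorm.
total = layers * per_layer + 2 * embeddings + hidden

print(f"per layer: {per_layer:,}")  # 44,044,288
print(f"total:     {total:,}")      # 1,100,048,384
```

With an untied 32k head this reproduces TinyLlama's nominal ~1.1B; the smaller count reported for DESTA-1B suggests a different effective vocabulary or weight tying, so treat this as an approximation.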
## Training Details

### Training Data
- Dataset: mewaeltsegay/finetuning_datatset
- Data Type: Full articles and texts in Tigrinya
- Tokenizer: mewaeltsegay/tokenizer_tigrinya
### Training Procedure
- Training Epochs: 6
- Batch Size: 16
- Gradient Accumulation Steps: 2
- Effective Batch Size: 32
- Learning Rate: 2e-5
- Warmup Steps: 200
- Weight Decay: 0.01
- Max Sequence Length: 4,096
- Mixed Precision: Enabled (FP16/BF16)
- Gradient Checkpointing: Enabled
- Optimizer: AdamW with learning rate scheduling
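The settings above map naturally onto Hugging Face `TrainingArguments` fields. A minimal sketch of the configuration follows; the field names are assumed from the standard `transformers` Trainer API, not taken from the actual training script, which is not published with this card:

```python
# Hyperparameters from the card, expressed as TrainingArguments-style kwargs.
# This is a sketch, not the actual (unpublished) training configuration.
training_kwargs = {
    "num_train_epochs": 6,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "learning_rate": 2e-5,
    "warmup_steps": 200,
    "weight_decay": 0.01,
    "fp16": True,                  # or bf16=True on Ampere+ GPUs
    "gradient_checkpointing": True,
    "save_steps": 500,             # checkpoint every 500 steps
    "eval_steps": 250,             # evaluate every 250 steps
    "logging_dir": "runs",         # TensorBoard logs
}

# Effective batch size = per-device batch * gradient accumulation steps.
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
print(effective_batch)  # 32
```

These kwargs could be passed as `transformers.TrainingArguments(output_dir=..., **training_kwargs)` and handed to a `Trainer`.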
### Training Infrastructure
- Hardware: GPU with 67+ GB memory
- Framework: PyTorch with Transformers
- Logging: TensorBoard enabled
- Checkpointing: Every 500 steps
- Evaluation: Every 250 steps
## Evaluation Results

### Perplexity Comparison
The fine-tuned model shows significant improvements over the base model:
| Metric | Base Model | DESTA | Improvement |
|---|---|---|---|
| Mean Perplexity | 73.61 | 10.41 | 85.86% ↓ |
| Median Perplexity | 10.05 | 11.33 | - |
| Std Deviation | 81.11 | 2.20 | 97.29% ↓ |
| Min Perplexity | 8.65 | 6.36 | - |
| Max Perplexity | 199.99 | 12.85 | 93.57% ↓ |
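Perplexity is the exponential of the mean token-level cross-entropy loss, which ties this table to the loss table below. The per-document perplexities here are averaged after exponentiation, so the base model's mean perplexity (73.61) is far above exp(mean loss) = exp(3.38) ≈ 29.4: exponentiation amplifies high-loss outliers (Jensen's inequality), consistent with the base model's large standard deviation. For DESTA, whose losses are tightly clustered, exp(2.32) ≈ 10.2 lands close to the reported mean of 10.41. A quick check:

```python
import math

# Perplexity of one document = exp(its mean token-level cross-entropy loss).
def perplexity(loss: float) -> float:
    return math.exp(loss)

print(round(perplexity(2.32), 2))  # 10.18, close to DESTA's mean perplexity 10.41
print(round(perplexity(3.38), 2))  # 29.37, well below the base model's mean 73.61
```

Because mean-of-exp ≥ exp-of-mean, averaging per-document perplexities over-weights hard documents; the large gap for the base model reflects its high per-document variance.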
### Loss Comparison
| Metric | Base Model | DESTA | Improvement |
|---|---|---|---|
| Mean Loss | 3.38 | 2.32 | 31.47% ↓ |
| Median Loss | 2.31 | 2.43 | - |
### Key Improvements
- 85.86% reduction in average perplexity
- 97.29% reduction in perplexity variance (more consistent performance)
- 31.47% reduction in average loss
- Significantly more stable and predictable outputs
## Usage

### Installation

```bash
pip install torch transformers accelerate
```
### Recommended Generation Parameters
For best results with Tigrinya text generation:
```python
generation_config = {
    "max_new_tokens": 150,
    "min_new_tokens": 50,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "length_penalty": 1.0,
    "do_sample": True,
    "no_repeat_ngram_size": 3,
    "early_stopping": False,
}
```
### Using with Pipeline
```python
import os

import torch
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, pipeline

MODEL_ID = "mewaeltsegay/desta_1b"

# Use GPU if available; -1 means CPU
device = 0 if torch.cuda.is_available() else -1

# Load the tokenizer from the model repo; point vocab_file at the cached
# SentencePiece model explicitly to avoid path-resolution issues
model_path = snapshot_download(repo_id=MODEL_ID)
vocab_path = os.path.join(model_path, "sentencepiece.model")
tokenizer = AutoTokenizer.from_pretrained(
    model_path, vocab_file=vocab_path, trust_remote_code=True
)

generator = pipeline(
    "text-generation",
    model=MODEL_ID,
    tokenizer=tokenizer,
    device=device,
    trust_remote_code=True,
)

result = generator(
    "ትግርኛ ቋንቋ እዩ",
    max_new_tokens=150,
    temperature=0.7,
    top_p=0.9,
)
print(result[0]["generated_text"])
```
## Limitations and Bias

### Known Limitations
- Context Length: Maximum context length is 2,048 tokens (can be extended to 4,096 with proper configuration)
- Generation Speed: the fine-tuned model may be slower than the base model during inference
- Domain Specificity: Model is optimized for Tigrinya text; performance on other languages may vary
- Training Data: Model performance depends on the quality and coverage of the training dataset
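On extending the 2,048-token context mentioned above: Llama-family models in `transformers` accept a `rope_scaling` override at load time that linearly interpolates positions. The snippet below is a hypothetical configuration for doubling the window, not a setting tested with this model; generation quality may degrade without further fine-tuning at the longer length:

```python
# Hypothetical RoPE scaling to stretch 2,048 trained positions over 4,096.
trained_ctx = 2048
target_ctx = 4096
rope_scaling = {"type": "linear", "factor": target_ctx / trained_ctx}

# Would be passed at load time, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     "mewaeltsegay/desta_1b", rope_scaling=rope_scaling
# )
print(rope_scaling)  # {'type': 'linear', 'factor': 2.0}
```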
### Potential Biases
- The model may reflect biases present in the training data
- Cultural and regional variations in Tigrinya may not be fully represented
- The model may generate text that reflects the style and content of the training corpus
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{desta-1b-2026,
  title={DESTA-1B: Dedicated Eritrean Semitic Text Autoregressor},
  author={Mewael Tsegay Desta},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/mewaeltsegay/desta_1b}}
}
```
## Acknowledgments
- Base model: TinyLlama
- Tokenizer: mewaeltsegay/tokenizer_tigrinya
- Training dataset: mewaeltsegay/finetuning_datatset
## Model Card Contact
For questions, issues, or contributions, please open an issue on the model repository.
**Note:** DESTA-1B is fine-tuned from checkpoint-620 of the training process. For best results, use the recommended generation parameters and ensure proper tokenizer configuration.