# Model Card for Kona2-Bidzer Georgian Definitions
> **Warning:** This model may occasionally generate highly inappropriate content, including offensive words present in its training data.
## Model Details

### Model Description
This model is a full fine-tune of Kona2-small-3.8B (based on Microsoft Phi-3.5) on the Bidzer.ge Georgian dictionary dataset.
It is designed to act as an intelligent Georgian dictionary assistant. Given a word, it generates a concise definition and a contextual usage example. It is capable of handling standard Georgian vocabulary as well as slang and specific terminology found in the training corpus.
- **Developed by:** Antony
- **Model type:** Causal language model (`AutoModelForCausalLM`)
- **Language(s) (NLP):** Georgian (ka)
- **License:** Apache 2.0
- **Finetuned from model:** tbilisi-ai-lab/kona2-small-3.8B
## Uses

### Direct Use
The model is intended to be used for generating dictionary-style entries.
Input format:

```
<|user|>
განმარტე სიტყვა: [WORD]<|end|>
<|assistant|>
```

Output format:

```
[DEFINITION]
მაგალითი: [USAGE EXAMPLE]
```
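The input format above can be produced with a small helper. This is a sketch; the function name `build_prompt` is not part of the model's API, only the prompt template itself comes from this card:

```python
def build_prompt(word: str) -> str:
    """Wrap a Georgian word in the Phi-3.5 chat markers this model was fine-tuned on."""
    # "განმარტე სიტყვა:" means "define the word:"
    return f"<|user|>\nგანმარტე სიტყვა: {word}<|end|>\n<|assistant|>\n"
```

The returned string can be passed directly to the tokenizer, as in the getting-started snippet below.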
### Downstream Use
This model can be integrated into:
- Georgian educational tools.
- Dictionary apps requiring generative explanations.
- NLP pipelines for semantic analysis of Georgian slang.
### Out-of-Scope Use
The model is not designed for:
- General-purpose chat or open-ended creative writing outside of definitions.
- Mathematical or coding tasks.
## How to Get Started with the Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Antony-X/kona2-bidzer-georgian-definitions"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build the prompt in the exact chat format the model was fine-tuned on
word = "კომპიუტერი"
prompt = f"<|user|>\nგანმარტე სიტყვა: {word}<|end|>\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details

### Training Data
The model was trained on the Bidzer Dictionary Dataset, a collection of Georgian words, definitions, and usage examples. The data includes diverse vocabulary ranging from formal terms to slang.
Data format:
- **Word:** Target term.
- **Definition:** Explanation of the term.
- **Usage:** A sentence demonstrating the word in context.
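To illustrate how a record with these three fields could be rendered into the chat format shown above, here is a minimal sketch. The field names `word`, `definition`, and `usage` and the function `format_example` are assumptions for illustration; the actual preprocessing script is not published:

```python
def format_example(record: dict) -> str:
    """Render one dictionary record into the chat format used for fine-tuning.

    Assumed record fields: "word", "definition", "usage".
    """
    return (
        f"<|user|>\nგანმარტე სიტყვა: {record['word']}<|end|>\n"
        f"<|assistant|>\n{record['definition']}\nმაგალითი: {record['usage']}<|end|>"
    )

sample = {
    "word": "წიგნი",  # "book"
    "definition": "დაბეჭდილი ან ხელნაწერი ფურცლების კრებული.",
    "usage": "მან საინტერესო წიგნი წაიკითხა.",
}
print(format_example(sample))
```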
### Training Procedure
The model was trained with full fine-tuning (updating all weights) rather than LoRA, for maximum adaptation to the dictionary structure.
- Framework: transformers, trl (SFTTrainer)
- Precision: bfloat16 (Native H100 optimization)
- Optimizer: adamw_torch_fused
#### Training Hyperparameters
- Training regime: bf16 non-mixed precision
- Batch Size: 16 per device
- Epochs: 3
- Learning Rate: 2e-4
- Max Sequence Length: 512 tokens
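As a minimal sketch, the hyperparameters above might map onto `trl`'s `SFTConfig` roughly as follows. The actual training script is not published, and `output_dir` is a placeholder:

```python
from trl import SFTConfig

# Hypothetical reconstruction of the configuration described above
config = SFTConfig(
    output_dir="kona2-bidzer-georgian-definitions",  # placeholder
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-4,
    max_seq_length=512,
    bf16=True,                    # bf16 non-mixed precision on H100
    optim="adamw_torch_fused",
)
```

This config would then be passed to `SFTTrainer` together with the model, tokenizer, and formatted dataset.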
## Environmental Impact
- Hardware Type: NVIDIA H100 (80GB VRAM)
- Cloud Provider: Kaggle
- Compute Region: Cloud (GPU)
- Carbon Emitted: Negligible (Short training run)
## Technical Specifications

### Model Architecture and Objective
The model uses the Phi-3.5-mini architecture (3.8B parameters), optimized for high performance with a smaller parameter count. It was trained with a Causal Language Modeling (CLM) objective tailored for instruction following.