Uploaded model
- Developed by: Faris-Faiz
- License: apache-2.0
- Finetuned from model: unsloth/gemma-3-1b-it

This gemma3_text model was trained 2x faster with Unsloth and Hugging Face's TRL library. Training ran on Runpod, on a single RTX A5000 GPU with 500 GB of storage.
MalayMMLU Benchmark (Malaysian-Gemma3-1b, 0-shot, first-token / answer-letter matching):

| Category | Accuracy (%) | Questions |
|---|---|---|
| STEM | 42.82 | 2,443 |
| Language | 48.09 | 6,288 |
| Social science | 39.00 | 6,918 |
| Others | 40.97 | 4,169 |
| Humanities | 47.85 | 4,395 |
| **Average (weighted by question count)** | **43.69** | **24,213** |
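The reported average accuracy is the question-count-weighted mean of the per-category accuracies. A quick sanity check, with the accuracies and category counts copied from the benchmark printout above:

```python
# Per-category accuracies (%) and question counts from the MalayMMLU run above.
acc = {
    "STEM": 42.81620957838722,
    "Language": 48.091603053435115,
    "Social science": 38.99971089910379,
    "Others": 40.96905732789638,
    "Humanities": 47.84982935153584,
}
n = {"STEM": 2443, "Language": 6288, "Social science": 6918,
     "Others": 4169, "Humanities": 4395}

total = sum(n.values())  # 24,213 questions overall
weighted_avg = sum(acc[c] * n[c] for c in acc) / total
print(f"{weighted_avg:.6f}")  # matches the reported 43.691405...
```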
Dataset Loading:

```python
from datasets import load_dataset, concatenate_datasets

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("mesolitica/Malaysian-SFT", "default")

def convert_to_chatml(example):
    return {
        "conversations": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

dataset = ds.map(convert_to_chatml)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = []
    for convo in convos:
        try:
            # Apply the chat template; `tokenizer` comes from the
            # model-loading step below
            text = tokenizer.apply_chat_template(
                convo, tokenize=False, add_generation_prompt=False
            ).removeprefix('<bos>')
            texts.append(text)
        except Exception as e:
            # If an error occurs, print it and append a placeholder
            # instead of skipping
            print(f"Error processing a conversation: {e}")
            print(f"Problematic conversation: {convo}")
            texts.append(None)  # placeholder; filtered out below
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=4)
dataset = dataset.filter(lambda example: example['text'] is not None)
combined_dataset = concatenate_datasets([dataset[split] for split in dataset.keys()])
```
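The `convert_to_chatml` helper above can be sanity-checked on a toy row (no dataset download needed; the example row is made up for illustration):

```python
def convert_to_chatml(example):
    # Map the dataset's input/output columns into ChatML-style turns.
    return {
        "conversations": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

row = {"input": "Apakah ibu negara Malaysia?", "output": "Kuala Lumpur."}
out = convert_to_chatml(row)
print(out["conversations"][0]["role"], "->", out["conversations"][1]["content"])
```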
Model loading parameters:

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-1b-it",
    max_seq_length = 2048,   # Choose any for long context!
    load_in_4bit = False,    # 4-bit quantization to reduce memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = True,  # Full finetuning, not LoRA
)
```
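A rough back-of-envelope for why full finetuning of a ~1B-parameter model fits comfortably on a 24 GB RTX A5000. The figures below are assumptions for illustration (bf16 weights and gradients, `adamw_8bit` keeping roughly two 1-byte moment states per parameter; activations and CUDA overhead are excluded):

```python
# Assumed parameter count for a ~1B model (illustrative, not the exact figure).
params = 1_000_000_000
weights = params * 2          # bf16 weights: 2 bytes each
grads = params * 2            # bf16 gradients: 2 bytes each
optimizer = params * (1 + 1)  # two 8-bit Adam moment states (approximate)

total_gib = (weights + grads + optimizer) / 1024**3
print(f"~{total_gib:.1f} GiB before activations")
```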
Training parameters:

```python
from trl import SFTTrainer, SFTConfig
import os

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,  # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 64,
        gradient_accumulation_steps = 1,  # Use GA to mimic a larger batch size
        warmup_steps = 5,
        num_train_epochs = 1,  # Overridden by max_steps below
        max_steps = 300,
        learning_rate = 1e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # Use "wandb" etc. for experiment tracking
    ),
    num_workers = os.cpu_count(),
)
```
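Because `max_steps = 300` takes precedence over `num_train_epochs`, the run stops after 300 optimizer steps. The number of training examples actually seen follows directly from the config above (assuming the single-GPU setup described in the hardware note):

```python
# Values taken from the SFTConfig above.
per_device_train_batch_size = 64
gradient_accumulation_steps = 1
max_steps = 300
num_gpus = 1  # single RTX A5000 (assumption from the hardware note)

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps
print(effective_batch, examples_seen)  # 64 19200
```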