Uploaded model
- Developed by: Faris-Faiz
- License: apache-2.0
- Finetuned from model: unsloth/gemma-3-1b-it

This gemma3_text model was trained 2x faster with Unsloth and Hugging Face's TRL library. Training ran on Runpod, on a single RTX A5000 GPU with 500 GB of storage.
MalayMMLU Benchmark (Malaysian-Gemma3-1b, 0-shot, first-token / answer-letter matching):

| Category | Accuracy (%) | Questions |
|---|---|---|
| STEM | 42.82 | 2,443 |
| Language | 48.09 | 6,288 |
| Social science | 39.00 | 6,918 |
| Others | 40.97 | 4,169 |
| Humanities | 47.85 | 4,395 |
| **Average (weighted by question count)** | **43.69** | **24,213** |
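The reported average accuracy is the question-count-weighted mean of the per-category accuracies. A quick sanity check, with the accuracies and category counts copied from the benchmark printout above:

```python
# Per-category accuracies (%) and question counts from the MalayMMLU run above.
acc = {
    "STEM": 42.81620957838722,
    "Language": 48.091603053435115,
    "Social science": 38.99971089910379,
    "Others": 40.96905732789638,
    "Humanities": 47.84982935153584,
}
n = {"STEM": 2443, "Language": 6288, "Social science": 6918,
     "Others": 4169, "Humanities": 4395}

total = sum(n.values())  # 24,213 questions overall
weighted_avg = sum(acc[c] * n[c] for c in acc) / total
print(f"{weighted_avg:.6f}")  # matches the reported 43.691405...
```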
Dataset Loading:

```python
from datasets import load_dataset, concatenate_datasets

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("mesolitica/Malaysian-SFT", "default")

def convert_to_chatml(example):
    return {
        "conversations": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

dataset = ds.map(convert_to_chatml)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = []
    for convo in convos:
        try:
            # Apply the chat template; `tokenizer` comes from the
            # model-loading step below
            text = tokenizer.apply_chat_template(
                convo, tokenize=False, add_generation_prompt=False
            ).removeprefix('<bos>')
            texts.append(text)
        except Exception as e:
            # If an error occurs, print it and append a placeholder
            # instead of skipping
            print(f"Error processing a conversation: {e}")
            print(f"Problematic conversation: {convo}")
            texts.append(None)  # placeholder; filtered out below
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=4)
dataset = dataset.filter(lambda example: example['text'] is not None)
combined_dataset = concatenate_datasets([dataset[split] for split in dataset.keys()])
```
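The `convert_to_chatml` helper above can be sanity-checked on a toy row (no dataset download needed; the example row is made up for illustration):

```python
def convert_to_chatml(example):
    # Map the dataset's input/output columns into ChatML-style turns.
    return {
        "conversations": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

row = {"input": "Apakah ibu negara Malaysia?", "output": "Kuala Lumpur."}
out = convert_to_chatml(row)
print(out["conversations"][0]["role"], "->", out["conversations"][1]["content"])
```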
Model loading parameters:

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-1b-it",
    max_seq_length = 2048,   # Choose any for long context!
    load_in_4bit = False,    # 4-bit quantization to reduce memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = True,  # Full finetuning, not LoRA
)
```
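A rough back-of-envelope for why full finetuning of a ~1B-parameter model fits comfortably on a 24 GB RTX A5000. The figures below are assumptions for illustration (bf16 weights and gradients, `adamw_8bit` keeping roughly two 1-byte moment states per parameter; activations and CUDA overhead are excluded):

```python
# Assumed parameter count for a ~1B model (illustrative, not the exact figure).
params = 1_000_000_000
weights = params * 2          # bf16 weights: 2 bytes each
grads = params * 2            # bf16 gradients: 2 bytes each
optimizer = params * (1 + 1)  # two 8-bit Adam moment states (approximate)

total_gib = (weights + grads + optimizer) / 1024**3
print(f"~{total_gib:.1f} GiB before activations")
```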
Training parameters:

```python
from trl import SFTTrainer, SFTConfig
import os

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,  # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 64,
        gradient_accumulation_steps = 1,  # Use GA to mimic a larger batch size
        warmup_steps = 5,
        num_train_epochs = 1,  # Overridden by max_steps below
        max_steps = 300,
        learning_rate = 1e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # Use "wandb" etc. for experiment tracking
    ),
    num_workers = os.cpu_count(),
)
```
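Because `max_steps = 300` takes precedence over `num_train_epochs`, the run stops after 300 optimizer steps. The number of training examples actually seen follows directly from the config above (assuming the single-GPU setup described in the hardware note):

```python
# Values taken from the SFTConfig above.
per_device_train_batch_size = 64
gradient_accumulation_steps = 1
max_steps = 300
num_gpus = 1  # single RTX A5000 (assumption from the hardware note)

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps
print(effective_batch, examples_seen)  # 64 19200
```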