Gemma-3-4B-IT GRPO Thai

This model is Gemma-3-4B-IT fine-tuned with LoRA adapters using GRPO (Group Relative Policy Optimization) on the GSM8K-Thai dataset.
The model is trained to solve math word problems in Thai step-by-step, producing structured reasoning in <think>…</think> followed by the final answer in <answer>…</answer>.

Model Details

Base model: google/gemma-3-4b-it
Technique: LoRA fine-tuning + GRPO reinforcement learning
Languages: Thai (primary)
Task: Math reasoning, step-by-step explanation, final numeric answer
License: Apache-2.0
Author: Thanayot (SuperAI Engineer SS5, KMUTT)

Intended Uses

Direct Use

Educational use: tutoring in math reasoning in Thai
Research on RLHF/GRPO methods for LLMs
Experimentation with structured reasoning outputs (<think>…</think><answer>…</answer>)

Out-of-Scope Use

High-stakes decision making (finance, medical, legal)
Problems requiring formal proofs or very advanced mathematics
Any malicious or harmful generation in Thai or other languages

Training Details

Dataset

VISAI-AI/gsm8k-thai
Thai translations of the GSM8K math word problems

Procedure

Reward shaping:
- Format reward: enforces <think>…</think><answer>…</answer>
- Accuracy reward: compares predicted numeric answer to ground truth via math_verify
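
The two rewards above can be sketched roughly as follows. This is a minimal illustration, not the training code: the function names `format_reward` and `accuracy_reward` are hypothetical, the 0/1 reward values are an assumption, and the accuracy check uses a plain numeric comparison as a stand-in for math_verify.

```python
import re

# Assumed layout: the whole completion is <think>…</think> followed by <answer>…</answer>.
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the expected tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0


def _to_number(text: str):
    """Parse a numeric string such as '1,250' or '5.0'; return None on failure."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the number inside <answer> equals the ground truth, else 0.0.

    Stand-in for math_verify: compares plain numbers only.
    """
    match = FORMAT_RE.match(completion)
    if match is None:
        return 0.0
    predicted = _to_number(match.group(1))
    expected = _to_number(ground_truth)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) < 1e-6 else 0.0


sample = "<think>15 ÷ 3 = 5</think><answer>5</answer>"
print(format_reward(sample), accuracy_reward(sample, "5"))
```

Keeping the two signals separate lets a completion earn credit for correct structure even when the final number is wrong.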

Hyperparameters

LoRA rank: 16
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning rate: 5e-5
Batch size: 1 (with gradient_accumulation_steps=8)
Num generations per prompt: 4
Beta (KL penalty): 0.01
Precision: bfloat16
Max prompt length: 256
Max completion length: 160
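
A few quantities follow directly from the values above. The sketch below derives them in plain Python; the dictionary keys are illustrative names rather than any library's arguments, and the completions-per-step figure assumes a single device and that the batch size counts completions rather than prompts.

```python
# Hyperparameters copied from the list above; key names are illustrative only.
hparams = {
    "lora_rank": 16,
    "lora_alpha": 32,
    "batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_generations": 4,
    "max_prompt_length": 256,
    "max_completion_length": 160,
}

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scale = hparams["lora_alpha"] / hparams["lora_rank"]

# Assumption: single device, batch size counted in completions.
completions_per_step = hparams["batch_size"] * hparams["gradient_accumulation_steps"]
prompts_per_step = completions_per_step // hparams["num_generations"]

# Upper bound on tokens per training sequence (prompt plus completion).
max_sequence_length = hparams["max_prompt_length"] + hparams["max_completion_length"]

print(lora_scale, completions_per_step, prompts_per_step, max_sequence_length)
```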

Evaluation Results

Below are the reward values observed during training:

Step	Policy Loss (proxy from reward)
100	0.0030
200	0.0040
280	0.0042

The reward tends to increase steadily during the early phase of training (Step 100 → 200 → 280).
The observed values (≈0.0030 → 0.0040 → 0.0042) reflect the model adapting to the reward function.
The trend suggests that the model is approaching convergence but has not yet reached a plateau; with further training, the reward is expected to level off at a higher value (≈0.0048–0.0050).
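
To make the slowdown concrete, the logged values in the table can be turned into per-step changes. This is plain arithmetic on the three reported points; the variable names are illustrative.

```python
# (step, value) pairs copied from the table above.
log = [(100, 0.0030), (200, 0.0040), (280, 0.0042)]

# Average change per training step between consecutive logged points.
slopes = [
    (v1 - v0) / (s1 - s0)
    for (s0, v0), (s1, v1) in zip(log, log[1:])
]

for (s0, _), (s1, _), slope in zip(log, log[1:], slopes):
    print(f"steps {s0}-{s1}: {slope:.2e} per step")
```

The second interval rises roughly four times more slowly than the first, which is consistent with the flattening described above.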

How to Use

import torch
from transformers import AutoTokenizer, Gemma3ForCausalLM
from peft import PeftModel
model_id = "google/gemma-3-4b-it"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # guard against an edge case during generation
tok.padding_side = "left"

base_model = Gemma3ForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,    # or float16, depending on the GPU
)
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "zoeythanayot/gemma3-it-grpo-thai")

# Build an example prompt
SYSTEM_PROMPT = (
    "คุณเป็นผู้ช่วยแก้ปัญหาคณิตศาสตร์เชิงเหตุผล ทีละขั้นเป็นภาษาไทย "
    "และใช้ <think>…</think><answer>…</answer> เพื่อบ่งบอกกระบวนการคิดและคำตอบสุดท้าย"
)
USER_PROMPT = "โจทย์: ถ้ามีลูกอม 15 เม็ด แบ่งให้เพื่อน 3 คนเท่า ๆ กัน แต่ละคนจะได้กี่เม็ด?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# Apply the tokenizer's chat template and append the generation prompt
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate the answer; do_sample=True so temperature and top_p take effect
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# Keep only the newly generated tokens
input_length = inputs["input_ids"].shape[1]
new_tokens = output_ids[0, input_length:]
resp = tok.decode(new_tokens, skip_special_tokens=True)
print(resp.strip())
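
The decoded response can then be split into its reasoning and its final answer. The helper below is a minimal sketch: `parse_response` is a hypothetical name, and it assumes the model emits the tags exactly as described in this card. Any part that is missing comes back as None.

```python
import re


def parse_response(text: str) -> dict:
    """Extract the <think> and <answer> contents from a decoded response."""

    def grab(tag: str):
        # Non-greedy match so only the first tagged span is captured.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return match.group(1).strip() if match else None

    return {"think": grab("think"), "answer": grab("answer")}


example = "<think>15 ÷ 3 = 5</think><answer>5</answer>"
parsed = parse_response(example)
print(parsed["answer"])
```

Checking for None before using the answer is a cheap way to detect generations that were cut off by the token limit.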

Bias, Risks, and Limitations

May produce plausible but incorrect answers
Trained only on translated Thai data, so biases and errors introduced by translation may carry over
Limited to short reasoning problems (GSM8K style)

Citation

@misc{thanayot2025gemmathai,
  title = {Gemma-3-4B-IT GRPO Thai: LoRA Fine-Tuned Math Reasoning Model},
  author = {Thanayot},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {Model on Hugging Face Hub},
}

Contact

For questions or collaboration: Thanayot @ KMUTT (SuperAI Engineer SS5)

