QLoRA Fine-Tuning of Gemma-2 with SFT and GRPO

Parameter-efficient fine-tuning of google/gemma-2-2b-it for two tasks: Yoda-style text generation via QLoRA + SFT, and mathematical reasoning alignment via GRPO with a composite reward signal.
Trained on an NVIDIA A100-SXM4-40GB using HuggingFace TRL and PEFT.

GitHub: gemma2-qlora-sft-grpo


Adapters in This Repository

This repository hosts five LoRA adapter checkpoints corresponding to different stages of the training pipeline:

| Adapter folder | Stage | Description |
|---|---|---|
| `sft_yoda/` | A-2 | English → Yoda style translator (SFT) |
| `sft_yoda_answ/` | A-4 | Yoda-style QA answerer (SFT) |
| `rl_yoda_answ_from_sft/` | B-3 | GRPO warm start from SFT checkpoint |
| `rl_yoda_answ_from_base/` | B-4 | GRPO cold start from base model |
| `classifier_yoda/` | B-1 | DistilBERT binary style classifier (reward model) |

Model Overview

This project investigates two complementary approaches to steering the behaviour of a 2B-parameter instruction-tuned language model through parameter-efficient fine-tuning and reinforcement learning from verifiable rewards (RLVR).

Part A – Style transfer via SFT:
gemma-2-2b-it is fine-tuned with LoRA adapters to translate standard English into Yoda-style syntax using the dvgodoy/yoda_sentences dataset. A synthetic Yoda-style QA dataset is then generated from MuskumPillerum/General-Knowledge using the trained translator, and a second SFT stage trains the model to answer any question in Yoda style.

Part B – Reasoning and style via GRPO:
GRPO is applied to improve mathematical reasoning on openai/gsm8k using a composite reward signal. A DistilBERT binary classifier, trained to distinguish Yoda-style from standard English, provides a continuous style reward. Two training strategies are compared: GRPO from the SFT checkpoint (warm start) and GRPO from the base model (cold start, full three-component reward).
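The core mechanism of GRPO can be illustrated with a minimal, self-contained sketch. This is a simplified illustration (per-group reward normalisation only, with no KL penalty or clipping), not the TRL implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own sampling group,
    as in GRPO: A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one GSM8K prompt, scored by the
# composite reward (correctness + format + style, range [0, 3]).
rewards = [3.0, 1.0, 2.0, 0.0]
advantages = group_relative_advantages(rewards)
```

These group-relative advantages weight the policy-gradient update for each completion, which is what lets GRPO dispense with a learned value function.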


Training Details

Base Model

  • google/gemma-2-2b-it – 2.01B parameters, instruction-tuned
  • 4-bit NF4 QLoRA quantization (BitsAndBytes) during training
  • LoRA applied to all attention and MLP projection layers
  • Hardware: NVIDIA A100-SXM4-40GB
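The quantization and adapter setup above can be sketched as a BitsAndBytes/PEFT configuration. The LoRA rank, alpha, and dropout below are illustrative assumptions, not necessarily the values used in training; the target modules list the attention and MLP projections of Gemma-2:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA on all attention and MLP projection layers of Gemma-2
lora_config = LoraConfig(
    r=16,                # illustrative rank
    lora_alpha=32,       # illustrative scaling
    lora_dropout=0.05,   # illustrative dropout
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
)
```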

Part A: Supervised Fine-Tuning Pipeline

| Stage | Task | Dataset | Examples |
|---|---|---|---|
| A-1 | Baseline inference | n/a | Zero-shot Yoda translation |
| A-2 | SFT: English → Yoda | dvgodoy/yoda_sentences | 648 train / 72 val |
| A-3 | Synthetic dataset generation | MuskumPillerum/General-Knowledge | 500 / 200 / 500 |
| A-4 | SFT: Yoda-style QA | Synthetic (A-3) | 500 train / 200 val |
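The SFT data for stage A-2 can be formatted as prompt/completion pairs along these lines. The field names `sentence` and `translation` are assumptions about the dvgodoy/yoda_sentences schema, and the prompt template mirrors the one used in the usage example below:

```python
def format_yoda_example(row):
    """Build one prompt/completion pair for SFT stage A-2.
    'sentence' and 'translation' are assumed dataset field names."""
    prompt = (
        "Translate the following sentence to Yoda style.\n"
        f"Sentence: {row['sentence']}\n"
        "Yoda:"
    )
    return {"prompt": prompt, "completion": " " + row["translation"]}

example = format_yoda_example({
    "sentence": "The stars are bright tonight.",
    "translation": "Bright tonight, the stars are.",
})
```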

Part B: GRPO Reward Function

| Component | Implementation | Range |
|---|---|---|
| Correctness | Graduated exact match: 1.0 for `#### n`, 0.5 for "answer is n" | [0, 1] |
| Format | Rule-based: presence of `#### <number>` marker | {0, 1} |
| Style | DistilBERT P(Yoda), trained on GSM8K English/Yoda pairs | [0, 1] |
| Total | Linear sum | [0, 3] |
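The reward components in the table can be sketched in plain Python. The regex details are assumptions about the marker parsing, and `style_prob` stands in for the DistilBERT classifier's P(Yoda) output:

```python
import re

def correctness_reward(completion, target):
    """Graduated exact match: 1.0 for '#### n', 0.5 for 'answer is n'."""
    m = re.search(r"####\s*(-?\d+)", completion)
    if m and m.group(1) == str(target):
        return 1.0
    if re.search(rf"answer is\s*{re.escape(str(target))}\b", completion, re.IGNORECASE):
        return 0.5
    return 0.0

def format_reward(completion):
    """Rule-based: 1 if the '#### <number>' marker is present."""
    return 1.0 if re.search(r"####\s*-?\d+", completion) else 0.0

def total_reward(completion, target, style_prob):
    """Linear sum of the three components; range [0, 3].
    style_prob stands in for DistilBERT's P(Yoda)."""
    return correctness_reward(completion, target) + format_reward(completion) + style_prob
```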

Results

Part B – GRPO Comparison

| Strategy | Correctness | Format Compliance | Style Score |
|---|---|---|---|
| SFT + GRPO (warm start) | 0.667 | 0.833 | n/a |
| Base + GRPO (cold start) | 0.667 | 1.000 | 0.379 |

The cold-start GRPO achieves perfect format compliance; the warm-start GRPO shows higher early-training stability.


Usage

Requirements

pip install torch transformers peft bitsandbytes accelerate

Requires a CUDA GPU; in 4-bit the 2B model occupies only a few GB of VRAM, so 16 GB is ample.

Load an adapter

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_id = "google/gemma-2-2b-it"
adapter_id = "camilletyriard/gemma2-qlora-sft-grpo"  # + subfolder e.g. sft_yoda

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, adapter_id, subfolder="sft_yoda")
model.eval()

Yoda-style translation example

prompt = "Translate the following sentence to Yoda style.\nSentence: The stars are bright tonight.\nYoda:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output[0], skip_special_tokens=True))
# β†’ "Bright tonight, the stars are."

Mathematical reasoning example (GRPO adapter)

model = PeftModel.from_pretrained(base_model, adapter_id, subfolder="rl_yoda_answ_from_sft")

prompt = "Janet has 3 apples and gives 1 to Bob. How many remain?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output[0], skip_special_tokens=True))
# β†’ "... #### 2"

Repository Structure

├── sft_yoda/
│   ├── adapter_model.safetensors
│   ├── adapter_config.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── sft_yoda_answ/
├── rl_yoda_answ_from_sft/
├── rl_yoda_answ_from_base/
└── classifier_yoda/

Limitations

  • Results are reported for a single random seed (42); variance across seeds is not quantified.
  • The style classifier is trained on the same distribution as the SFT data, which may inflate style reward estimates during GRPO evaluation.
  • GSM8K evaluation covers a limited held-out sample; reported rewards are indicative rather than benchmark-scale.
  • The SFT answerer was trained on general-knowledge QA; applying it to GSM8K introduces a domain shift that reduces initial correctness scores before GRPO compensates.

Citation

@misc{tyriard2026gemma2grpo,
  author = {Tyriard, Camille and Quintero, Ana and Tornila, Dunia and
            Huencho, Daniel and Bell, Jonathan and Tobo, Nicolas},
  title  = {QLoRA Fine-Tuning of Gemma-2 with SFT and GRPO},
  year   = {2026},
  url    = {https://github.com/camilletyriard-dev/gemma2-qlora-sft-grpo}
}

License

Model weights inherit the Gemma Terms of Use.


Developed at University College London (2025–2026):
Camille Tyriard, Ana Quintero, Dunia Tornila, Daniel Huencho, Jonathan Bell, Nicolas Tobo.
