QLoRA Fine-Tuning of Gemma-2 with SFT and GRPO
Parameter-efficient fine-tuning of google/gemma-2-2b-it for two tasks:
Yoda-style text generation via QLoRA + SFT, and mathematical reasoning
alignment via GRPO with a composite reward signal.
Trained on an NVIDIA A100-SXM4-40GB using HuggingFace TRL and PEFT.
GitHub: gemma2-qlora-sft-grpo
Adapters in This Repository
This repository hosts five LoRA adapter checkpoints corresponding to different stages of the training pipeline:
| Adapter folder | Stage | Description |
|---|---|---|
| `sft_yoda/` | A-2 | English → Yoda style translator (SFT) |
| `sft_yoda_answ/` | A-4 | Yoda-style QA answerer (SFT) |
| `rl_yoda_answ_from_sft/` | B-3 | GRPO warm start from SFT checkpoint |
| `rl_yoda_answ_from_base/` | B-4 | GRPO cold start from base model |
| `classifier_yoda/` | B-1 | DistilBERT binary style classifier (reward model) |
Model Overview
This project investigates two complementary approaches to steering the behaviour of a 2B-parameter instruction-tuned language model through parameter-efficient fine-tuning and reinforcement learning from verifiable rewards (RLVR).
Part A: Style transfer via SFT
gemma-2-2b-it is fine-tuned with LoRA adapters to translate standard English into Yoda-style syntax on the dvgodoy/yoda_sentences dataset. A synthetic Yoda-style QA dataset is then generated from MuskumPillerum/General-Knowledge using the trained translator, and a second SFT stage trains the model to answer arbitrary questions in Yoda style.
Part B: Reasoning and style via GRPO
GRPO is applied to improve mathematical reasoning on openai/gsm8k using a composite reward signal. A DistilBERT binary classifier, trained to distinguish Yoda-style from standard English, provides a learned style reward. Two training strategies are compared: GRPO from the SFT checkpoint (warm start) and GRPO from the base model (cold start, full three-component reward).
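At its core, GRPO replaces a learned value baseline with a group-relative one: several completions are sampled per prompt, and each completion's advantage is its reward standardized against the group's mean and standard deviation. A minimal sketch of that computation (the reward values below are illustrative numbers, not results from this run):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Standardize each reward against its group's mean and std,
    as in GRPO's group-relative baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, scored by a composite reward in [0, 3]:
rewards = [3.0, 1.0, 2.0, 0.0]
advantages = group_relative_advantages(rewards)
# Completions above the group mean receive positive advantages,
# those below receive negative ones; the group's advantages sum to zero.
```

This baseline-free formulation is what lets GRPO work directly with scalar rewards such as a classifier probability or an exact-match check.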
Training Details
Base Model
- google/gemma-2-2b-it (2.01B parameters, instruction-tuned)
- 4-bit NF4 QLoRA quantization (BitsAndBytes) during training
- LoRA applied to all attention and MLP projection layers
- Hardware: NVIDIA A100-SXM4-40GB
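To give a sense of how small the trainable footprint is when LoRA covers all attention and MLP projections, here is a back-of-the-envelope parameter count. The layer shapes follow our reading of the public gemma-2-2b config (26 decoder layers, hidden size 2304, 8 query heads and 4 KV heads of dimension 256, MLP width 9216), and the rank r=16 is a hypothetical choice, not the value used in this training run:

```python
def lora_params(shapes, r, n_layers):
    # Each adapted (d_in, d_out) weight adds two low-rank factors:
    # A of shape (d_in, r) and B of shape (r, d_out).
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Per-layer projection shapes assumed from the gemma-2-2b config:
shapes = [
    (2304, 2048),  # q_proj
    (2304, 1024),  # k_proj
    (2304, 1024),  # v_proj
    (2048, 2304),  # o_proj
    (2304, 9216),  # gate_proj
    (2304, 9216),  # up_proj
    (9216, 2304),  # down_proj
]
total = lora_params(shapes, r=16, n_layers=26)
print(f"{total / 1e6:.1f}M trainable parameters")  # prints "20.8M trainable parameters"
```

Under these assumptions the adapters train roughly 1% of the 2.01B base parameters, which is what makes single-A100 fine-tuning practical alongside 4-bit quantization.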
Part A: Supervised Fine-Tuning Pipeline
| Stage | Task | Dataset | Examples |
|---|---|---|---|
| A-1 | Baseline inference | – | Zero-shot Yoda translation |
| A-2 | SFT: English → Yoda | dvgodoy/yoda_sentences | 648 train / 72 val |
| A-3 | Synthetic dataset generation | MuskumPillerum/General-Knowledge | 500 / 200 / 500 |
| A-4 | SFT: Yoda-style QA | Synthetic (A-3) | 500 train / 200 val |
Part B: GRPO Reward Function
| Component | Implementation | Range |
|---|---|---|
| Correctness | Graduated exact match: 1.0 for `#### n`, 0.5 for "answer is n" | [0, 1] |
| Format | Rule-based: presence of a `#### <number>` marker | {0, 1} |
| Style | DistilBERT P(Yoda), trained on GSM8K English/Yoda pairs | [0, 1] |
| Total | Linear sum of the three components | [0, 3] |
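The table's components can be sketched as plain Python reward functions. The regexes and the "answer is n" fallback below are our reading of the table, not the repository's exact parsing code, and the style term is stubbed with a constant rather than the actual DistilBERT classifier:

```python
import re

def correctness_reward(completion: str, gold: str) -> float:
    """Graduated exact match: full credit for '#### n', half for 'answer is n'."""
    m = re.search(r"####\s*(-?[\d,]+)", completion)
    if m and m.group(1).replace(",", "") == gold:
        return 1.0
    m = re.search(r"answer is\s*(-?[\d,]+)", completion, re.IGNORECASE)
    if m and m.group(1).replace(",", "") == gold:
        return 0.5
    return 0.0

def format_reward(completion: str) -> float:
    """Rule-based: 1 if the '#### <number>' marker is present, else 0."""
    return 1.0 if re.search(r"####\s*-?\d", completion) else 0.0

def style_reward(completion: str) -> float:
    """Stub for DistilBERT P(Yoda); the real scorer returns a probability in [0, 1]."""
    return 0.5

def total_reward(completion: str, gold: str) -> float:
    return (correctness_reward(completion, gold)
            + format_reward(completion)
            + style_reward(completion))
```

With the stubbed style term, a completion ending in `#### 2` scored against gold answer `2` totals 1.0 + 1.0 + 0.5 = 2.5 out of 3.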
Results
Part B β GRPO Comparison
| Strategy | Correctness | Format Compliance | Style Score |
|---|---|---|---|
| SFT + GRPO (warm start) | 0.667 | 0.833 | – |
| Base + GRPO (cold start) | 0.667 | 1.000 | 0.379 |
On the evaluation sample, cold-start GRPO achieves perfect format compliance, while warm-start GRPO shows higher early-training stability.
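The format-compliance column is plausibly the fraction of evaluated completions containing the `#### <number>` marker; a self-contained sketch of that aggregation (not the repository's evaluation script, and the example completions are invented):

```python
import re

def format_compliance(completions):
    """Fraction of completions containing the '#### <number>' answer marker."""
    hits = sum(bool(re.search(r"####\s*-?\d", c)) for c in completions)
    return hits / len(completions)

batch = ["Six eggs remain. #### 6", "The answer is 6.", "#### 6"]
format_compliance(batch)  # 2 of the 3 completions carry the marker
```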
Usage
Requirements
```bash
pip install torch transformers peft bitsandbytes accelerate
```
Requires a GPU with at least 16 GB VRAM for 4-bit inference.
Load an adapter
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_id = "google/gemma-2-2b-it"
adapter_id = "camilletyriard/gemma2-qlora-sft-grpo"  # + subfolder, e.g. sft_yoda

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_id, subfolder="sft_yoda")
model.eval()
```
Yoda-style translation example
```python
prompt = "Translate the following sentence to Yoda style.\nSentence: The stars are bright tonight.\nYoda:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# → "Bright tonight, the stars are."
```
Mathematical reasoning example (GRPO adapter)
```python
model = PeftModel.from_pretrained(base_model, adapter_id, subfolder="rl_yoda_answ_from_sft")
prompt = "Janet has 3 apples and gives 1 to Bob. How many remain?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# → "... #### 2"
```
Repository Structure
```
├── sft_yoda/
│   ├── adapter_model.safetensors
│   ├── adapter_config.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── sft_yoda_answ/
├── rl_yoda_answ_from_sft/
├── rl_yoda_answ_from_base/
└── classifier_yoda/
```
Limitations
- Results are reported for a single random seed (42); variance across seeds is not quantified.
- The style classifier is trained on the same distribution as the SFT data, which may inflate style reward estimates during GRPO evaluation.
- GSM8K evaluation covers a limited held-out sample; reported rewards are indicative rather than benchmark-scale.
- The SFT answerer was trained on general-knowledge QA; applying it to GSM8K introduces a domain shift that reduces initial correctness scores before GRPO compensates.
Citation
```bibtex
@misc{tyriard2026gemma2grpo,
  author = {Tyriard, Camille and Quintero, Ana and Tornila, Dunia and
            Huencho, Daniel and Bell, Jonathan and Tobo, Nicolas},
  title  = {QLoRA Fine-Tuning of Gemma-2 with SFT and GRPO},
  year   = {2026},
  url    = {https://github.com/camilletyriard-dev/gemma2-qlora-sft-grpo}
}
```
License
The adapter weights are derivatives of Gemma and inherit the Gemma Terms of Use.
Developed at University College London (2025β2026):
Camille Tyriard, Ana Quintero, Dunia Tornila, Daniel Huencho, Jonathan Bell, Nicolas Tobo.