QLoRA Fine-Tuning of Gemma-2 with SFT and GRPO
Parameter-efficient fine-tuning of google/gemma-2-2b-it for two tasks:
Yoda-style text generation via QLoRA + SFT, and mathematical reasoning
alignment via GRPO with a composite reward signal.
Trained on an NVIDIA A100-SXM4-40GB using HuggingFace TRL and PEFT.
GitHub: gemma2-qlora-sft-grpo
Adapters in This Repository
This repository hosts five LoRA adapter checkpoints corresponding to different stages of the training pipeline:
| Adapter folder | Stage | Description |
|---|---|---|
| `sft_yoda/` | A-2 | English → Yoda style translator (SFT) |
| `sft_yoda_answ/` | A-4 | Yoda-style QA answerer (SFT) |
| `rl_yoda_answ_from_sft/` | B-3 | GRPO warm start from SFT checkpoint |
| `rl_yoda_answ_from_base/` | B-4 | GRPO cold start from base model |
| `classifier_yoda/` | B-1 | DistilBERT binary style classifier (reward model) |
Model Overview
This project investigates two complementary approaches to steering the behaviour of a 2B-parameter instruction-tuned language model through parameter-efficient fine-tuning and reinforcement learning from verifiable rewards (RLVR).
Part A: Style transfer via SFT
gemma-2-2b-it is fine-tuned with LoRA adapters to translate standard English into Yoda-style syntax on the dvgodoy/yoda_sentences dataset. A synthetic Yoda-style QA dataset is then generated from MuskumPillerum/General-Knowledge using the trained translator, and a second SFT stage trains the model to answer arbitrary questions in Yoda style.
Part B: Reasoning and style via GRPO
GRPO is applied to improve mathematical reasoning on openai/gsm8k using a composite reward signal. A DistilBERT binary classifier, trained to distinguish Yoda-style from standard English, provides a learned style reward. Two training strategies are compared: GRPO from the SFT checkpoint (warm start) and GRPO from the base model (cold start, full three-component reward).
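At its core, GRPO replaces a learned value baseline with a group-relative one: several completions are sampled per prompt, and each completion's advantage is its reward standardized against the group's mean and standard deviation. A minimal sketch of that computation (the reward values below are illustrative numbers, not results from this run):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Standardize each reward against its group's mean and std,
    as in GRPO's group-relative baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, scored by a composite reward in [0, 3]:
rewards = [3.0, 1.0, 2.0, 0.0]
advantages = group_relative_advantages(rewards)
# Completions above the group mean receive positive advantages,
# those below receive negative ones; the group's advantages sum to zero.
```

This baseline-free formulation is what lets GRPO work directly with scalar rewards such as a classifier probability or an exact-match check.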
Training Details
Base Model
- google/gemma-2-2b-it (2.01B parameters, instruction-tuned)
- 4-bit NF4 QLoRA quantization (BitsAndBytes) during training
- LoRA applied to all attention and MLP projection layers
- Hardware: NVIDIA A100-SXM4-40GB
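To give a sense of how small the trainable footprint is when LoRA covers all attention and MLP projections, here is a back-of-the-envelope parameter count. The layer shapes follow our reading of the public gemma-2-2b config (26 decoder layers, hidden size 2304, 8 query heads and 4 KV heads of dimension 256, MLP width 9216), and the rank r=16 is a hypothetical choice, not the value used in this training run:

```python
def lora_params(shapes, r, n_layers):
    # Each adapted (d_in, d_out) weight adds two low-rank factors:
    # A of shape (d_in, r) and B of shape (r, d_out).
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Per-layer projection shapes assumed from the gemma-2-2b config:
shapes = [
    (2304, 2048),  # q_proj
    (2304, 1024),  # k_proj
    (2304, 1024),  # v_proj
    (2048, 2304),  # o_proj
    (2304, 9216),  # gate_proj
    (2304, 9216),  # up_proj
    (9216, 2304),  # down_proj
]
total = lora_params(shapes, r=16, n_layers=26)
print(f"{total / 1e6:.1f}M trainable parameters")  # prints "20.8M trainable parameters"
```

Under these assumptions the adapters train roughly 1% of the 2.01B base parameters, which is what makes single-A100 fine-tuning practical alongside 4-bit quantization.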
Part A: Supervised Fine-Tuning Pipeline
| Stage | Task | Dataset | Examples |
|---|---|---|---|
| A-1 | Baseline inference | – | Zero-shot Yoda translation |
| A-2 | SFT: English → Yoda | dvgodoy/yoda_sentences | 648 train / 72 val |
| A-3 | Synthetic dataset generation | MuskumPillerum/General-Knowledge | 500 / 200 / 500 |
| A-4 | SFT: Yoda-style QA | Synthetic (A-3) | 500 train / 200 val |
Part B: GRPO Reward Function
| Component | Implementation | Range |
|---|---|---|
| Correctness | Graduated exact match: 1.0 for `#### n`, 0.5 for "answer is n" | [0, 1] |
| Format | Rule-based: presence of a `#### <number>` marker | {0, 1} |
| Style | DistilBERT P(Yoda), trained on GSM8K English/Yoda pairs | [0, 1] |
| Total | Linear sum of the three components | [0, 3] |
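The table's components can be sketched as plain Python reward functions. The regexes and the "answer is n" fallback below are our reading of the table, not the repository's exact parsing code, and the style term is stubbed with a constant rather than the actual DistilBERT classifier:

```python
import re

def correctness_reward(completion: str, gold: str) -> float:
    """Graduated exact match: full credit for '#### n', half for 'answer is n'."""
    m = re.search(r"####\s*(-?[\d,]+)", completion)
    if m and m.group(1).replace(",", "") == gold:
        return 1.0
    m = re.search(r"answer is\s*(-?[\d,]+)", completion, re.IGNORECASE)
    if m and m.group(1).replace(",", "") == gold:
        return 0.5
    return 0.0

def format_reward(completion: str) -> float:
    """Rule-based: 1 if the '#### <number>' marker is present, else 0."""
    return 1.0 if re.search(r"####\s*-?\d", completion) else 0.0

def style_reward(completion: str) -> float:
    """Stub for DistilBERT P(Yoda); the real scorer returns a probability in [0, 1]."""
    return 0.5

def total_reward(completion: str, gold: str) -> float:
    return (correctness_reward(completion, gold)
            + format_reward(completion)
            + style_reward(completion))
```

With the stubbed style term, a completion ending in `#### 2` scored against gold answer `2` totals 1.0 + 1.0 + 0.5 = 2.5 out of 3.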
Results
Part B β GRPO Comparison
| Strategy | Correctness | Format Compliance | Style Score |
|---|---|---|---|
| SFT + GRPO (warm start) | 0.667 | 0.833 | – |
| Base + GRPO (cold start) | 0.667 | 1.000 | 0.379 |
On the evaluation sample, cold-start GRPO achieves perfect format compliance, while warm-start GRPO shows higher early-training stability.
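The format-compliance column is plausibly the fraction of evaluated completions containing the `#### <number>` marker; a self-contained sketch of that aggregation (not the repository's evaluation script, and the example completions are invented):

```python
import re

def format_compliance(completions):
    """Fraction of completions containing the '#### <number>' answer marker."""
    hits = sum(bool(re.search(r"####\s*-?\d", c)) for c in completions)
    return hits / len(completions)

batch = ["Six eggs remain. #### 6", "The answer is 6.", "#### 6"]
format_compliance(batch)  # 2 of the 3 completions carry the marker
```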
Usage
Requirements
```bash
pip install torch transformers peft bitsandbytes accelerate
```
Requires a GPU with at least 16 GB VRAM for 4-bit inference.
Load an adapter
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_id = "google/gemma-2-2b-it"
adapter_id = "camilletyriard/gemma2-qlora-sft-grpo"  # + subfolder, e.g. sft_yoda

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_id, subfolder="sft_yoda")
model.eval()
```
Yoda-style translation example
```python
prompt = "Translate the following sentence to Yoda style.\nSentence: The stars are bright tonight.\nYoda:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# → "Bright tonight, the stars are."
```
Mathematical reasoning example (GRPO adapter)
```python
model = PeftModel.from_pretrained(base_model, adapter_id, subfolder="rl_yoda_answ_from_sft")
prompt = "Janet has 3 apples and gives 1 to Bob. How many remain?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# → "... #### 2"
```
Repository Structure
```
├── sft_yoda/
│   ├── adapter_model.safetensors
│   ├── adapter_config.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── sft_yoda_answ/
├── rl_yoda_answ_from_sft/
├── rl_yoda_answ_from_base/
└── classifier_yoda/
```
Limitations
- Results are reported for a single random seed (42); variance across seeds is not quantified.
- The style classifier is trained on the same distribution as the SFT data, which may inflate style reward estimates during GRPO evaluation.
- GSM8K evaluation covers a limited held-out sample; reported rewards are indicative rather than benchmark-scale.
- The SFT answerer was trained on general-knowledge QA; applying it to GSM8K introduces a domain shift that reduces initial correctness scores before GRPO compensates.
Citation
```bibtex
@misc{tyriard2026gemma2grpo,
  author = {Tyriard, Camille and Quintero, Ana and Tornila, Dunia and
            Huencho, Daniel and Bell, Jonathan and Tobo, Nicolas},
  title  = {QLoRA Fine-Tuning of Gemma-2 with SFT and GRPO},
  year   = {2026},
  url    = {https://github.com/camilletyriard-dev/gemma2-qlora-sft-grpo}
}
```
License
The adapter weights are derivatives of Gemma and inherit the Gemma Terms of Use.
Developed at University College London (2025β2026):
Camille Tyriard, Ana Quintero, Dunia Tornila, Daniel Huencho, Jonathan Bell, Nicolas Tobo.