Embarrassingly Simple Self-Distillation Improves Code Generation
Paper • 2604.01193 • Published • 47
A QLoRA adapter has been trained on google/gemma-4-E2B-it using the SSD (Simple Self-Distillation) technique from Apple's arXiv:2604.01193.
LoRA Adapter: ludsvick/gemma-4-E2B-it-SSD
Training Details:
- google/gemma-4-E2B-it (5.1B params, multimodal Gemma 4)
- wrmedford/Gemma-4-E4B-it-SSD (~17K coding problems from LiveCodeBench v6)

Usage:

```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

base = AutoModelForImageTextToText.from_pretrained(
    'google/gemma-4-E2B-it',
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('google/gemma-4-E2B-it')
model = PeftModel.from_pretrained(base, 'ludsvick/gemma-4-E2B-it-SSD')
model.eval()

# Generate at T=0.6 (per the SSD paper)
messages = [{'role': 'user', 'content': 'Write a Fibonacci function...'}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
```python
from merge_and_test import main

main()  # Merge adapter and run test inference
```
| File | Description |
|---|---|
| `train_ssd.py` | QLoRA training script (what was used) |
| `train_ssd_full.py` | Full SSD script with on-policy generation |
| `train_ssd_sft.py` | Alternative SFT-only script |
| `merge_and_test.py` | Merge adapter + run inference test |
| `evaluate_lcb.py` | LiveCodeBench evaluation script |
| `adapter_model.safetensors` | Trained LoRA weights (92.2 MB) |
| `adapter_config.json` | LoRA configuration |
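LiveCodeBench-style harnesses conventionally report pass@k; whether `evaluate_lcb.py` uses it is an assumption, but the standard unbiased estimator itself is well known and small enough to show:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 generations per problem of which 3 pass the tests, pass@1 is ~0.3 and pass@10 is 1.0.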
Next steps:
- Run `merge_and_test.py` to verify it generates code correctly
- Run `evaluate_lcb.py` on LiveCodeBench v6 to measure improvement

From Apple's paper:
> The key insight: training on diverse (even imperfect) samples with high-temperature exploration reshapes the model's token distribution to include better solutions while keeping conditional entropy high for continued exploration.
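The data-building step this describes — sample several completions per problem at the exploration temperature and fine-tune on all of them, imperfect ones included — can be sketched as follows (illustrative names, not the paper's code; `generate` stands in for the model's sampler):

```python
def build_ssd_dataset(problems, generate, k=8, temperature=0.6):
    """Collect the self-distillation SFT set: k completions per problem,
    sampled at the exploration temperature and kept even when imperfect,
    so the fine-tuning data stays diverse."""
    return [(p, generate(p, temperature)) for p in problems for _ in range(k)]
```

The resulting (problem, completion) pairs feed a standard SFT loop; no filtering step is required under this reading, since diversity is the point.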
```bibtex
@article{ssd2025,
  title={Embarrassingly Simple Self-Distillation Improves Code Generation},
  author={{Apple ML Team}},
  year={2025},
  eprint={2604.01193},
  archivePrefix={arXiv},
}
```