Embarrassingly Simple Self-Distillation Improves Code Generation
Paper • 2604.01193 • Published • 47
A QLoRA adapter has been trained on google/gemma-4-E2B-it using the SSD (Simple Self-Distillation) technique from Apple's arXiv:2604.01193.
LoRA Adapter: ludsvick/gemma-4-E2B-it-SSD
Training Details:
- google/gemma-4-E2B-it (5.1B params, multimodal Gemma 4)
- wrmedford/Gemma-4-E4B-it-SSD (~17K coding problems from LiveCodeBench v6)

Usage:

```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

base = AutoModelForImageTextToText.from_pretrained(
    'google/gemma-4-E2B-it',
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('google/gemma-4-E2B-it')
model = PeftModel.from_pretrained(base, 'ludsvick/gemma-4-E2B-it-SSD')
model.eval()

# Generate at T=0.6 (per the SSD paper)
messages = [{'role': 'user', 'content': 'Write a Fibonacci function...'}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
```python
from merge_and_test import main

main()  # Merge adapter and run test inference
```
| File | Description |
|---|---|
| `train_ssd.py` | QLoRA training script (what was used) |
| `train_ssd_full.py` | Full SSD script with on-policy generation |
| `train_ssd_sft.py` | Alternative SFT-only script |
| `merge_and_test.py` | Merge adapter + run inference test |
| `evaluate_lcb.py` | LiveCodeBench evaluation script |
| `adapter_model.safetensors` | Trained LoRA weights (92.2 MB) |
| `adapter_config.json` | LoRA configuration |
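LiveCodeBench-style harnesses conventionally report pass@k; whether `evaluate_lcb.py` uses it is an assumption, but the standard unbiased estimator itself is well known and small enough to show:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 generations per problem of which 3 pass the tests, pass@1 is ~0.3 and pass@10 is 1.0.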
Next steps:
- Run `merge_and_test.py` to verify it generates code correctly
- Run `evaluate_lcb.py` on LiveCodeBench v6 to measure improvement

From Apple's paper:
> The key insight: training on diverse (even imperfect) samples with high-temperature exploration reshapes the model's token distribution to include better solutions while keeping conditional entropy high for continued exploration.
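The data-building step this describes — sample several completions per problem at the exploration temperature and fine-tune on all of them, imperfect ones included — can be sketched as follows (illustrative names, not the paper's code; `generate` stands in for the model's sampler):

```python
def build_ssd_dataset(problems, generate, k=8, temperature=0.6):
    """Collect the self-distillation SFT set: k completions per problem,
    sampled at the exploration temperature and kept even when imperfect,
    so the fine-tuning data stays diverse."""
    return [(p, generate(p, temperature)) for p in problems for _ in range(k)]
```

The resulting (problem, completion) pairs feed a standard SFT loop; no filtering step is required under this reading, since diversity is the point.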
```bibtex
@article{ssd2025,
  title={Embarrassingly Simple Self-Distillation Improves Code Generation},
  author={{Apple ML Team}},
  year={2025},
  eprint={2604.01193},
  archivePrefix={arXiv},
}
```