Self-Distillation Enables Continual Learning
Paper: 2601.19897
SFT baseline for reproducing "Self-Distillation Enables Continual Learning".
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | Supervised Fine-Tuning (SFT) |
| Dataset | ToolAlpaca (4046 train, 68 test) |
| Learning rate | 5e-5 |
| Batch size | 32 (effective, via gradient accumulation) |
| Epochs | 1 |
| Seed | 42 |
| DeepSpeed | ZeRO-2 + CPU offload |
| Hardware | L40S 48GB |
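
The ZeRO-2 + CPU offload setup above is typically expressed as a DeepSpeed JSON config. A minimal sketch consistent with the table (the micro-batch size, accumulation steps, and bf16 setting are assumptions, not values reported for this run):

```json
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  },
  "bf16": { "enabled": true }
}
```

ZeRO-2 partitions optimizer states and gradients across workers, and offloading the optimizer to CPU is what makes full fine-tuning of a 7B model feasible on a single 48 GB L40S.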
Not yet evaluated; greedy accuracy and pass@k results are pending.
For reference, the paper's SFT baseline reports 63.2% greedy accuracy on tool use.
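Once samples are collected, pass@k is usually computed with the standard unbiased estimator rather than by empirical counting. A minimal sketch (function name and interface are illustrative, not from the paper's evaluation code):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate.

    n: total samples drawn per problem
    c: number of those samples that passed
    k: budget being estimated

    Computes 1 - C(n-c, k) / C(n, k), the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if n - c < k:
        # Fewer failures than k draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over all test problems gives the dataset-level pass@k; with `k=1` and greedy decoding it reduces to plain accuracy.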
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained("Ayushnangia/qwen2.5-7b-instruct-sft-tooluse-lr5e-5-bs32-ep1")
tokenizer = AutoTokenizer.from_pretrained("Ayushnangia/qwen2.5-7b-instruct-sft-tooluse-lr5e-5-bs32-ep1")
```