SHD Unrelated Data Experiment
Squeezing-Heads Distillation (SHD) training on unrelated sequence data.
Current Status
- Epoch: 10/10
- Best Val Loss: 1.0879
- Train Loss: 19.9459
- LM Loss: 0.4240
- SHD Loss: 0.0664
Training Configuration
- Teacher: Llama-3.2-1B-Instruct (owl-biased)
- Student: GPT-2 Medium
- Dataset: Unrelated sequence completion
- Beta (SHD weight): 10
- Batch size: 32
- Learning rate: 0.0001
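
A minimal sketch of how the settings above might be held in code, together with one plausible reading of "Beta (SHD weight)": the total loss is the LM loss plus beta times the SHD loss. The `SHDConfig` class and the `combined_loss` function are hypothetical names used for illustration only, and the additive form is an assumption, not something confirmed by the training script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SHDConfig:
    """Hypothetical container for the settings listed above."""
    teacher: str = "Llama-3.2-1B-Instruct"
    student: str = "GPT-2 Medium"
    beta: float = 10.0            # SHD weight
    batch_size: int = 32
    learning_rate: float = 1e-4
    epochs: int = 10


def combined_loss(lm_loss: float, shd_loss: float, beta: float) -> float:
    # Assumed form: total = LM loss + beta * SHD loss.
    return lm_loss + beta * shd_loss


cfg = SHDConfig()
# Plug in the LM and SHD losses reported under Current Status.
total = combined_loss(0.4240, 0.0664, cfg.beta)
print(f"{total:.4f}")
```

Under that assumption, the reported LM and SHD losses combine to about 1.088, which is close to the reported best validation loss of 1.0879. The reported train loss of 19.9459 does not follow from this form, so it may be reduced or aggregated differently.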
Goal
Test whether SHD can transfer the teacher's owl bias to the student even when training on completely unrelated data.
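
One simple way to quantify such a bias, sketched here as an assumption rather than the evaluation this experiment actually uses, is to sample completions from the student and count how often they mention owls. The `owl_mention_rate` helper and the sample strings are hypothetical.

```python
import re
from typing import Iterable


def owl_mention_rate(completions: Iterable[str]) -> float:
    """Fraction of completions that mention 'owl' or 'owls' as a whole word."""
    pattern = re.compile(r"\bowls?\b", re.IGNORECASE)
    texts = list(completions)
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


# Made-up completions for illustration; real ones would come from the student model.
samples = [
    "My favorite animal is the owl.",
    "I like dogs.",
    "Owls are great.",
    "A bowl of soup.",  # 'bowl' should not count as a mention
]
rate = owl_mention_rate(samples)
print(rate)
```

Comparing this rate for the student before and after SHD training, with identical prompts, would give a rough indication of whether the bias transferred.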
Last updated: Epoch 10