SHD Unrelated Data Experiment

Squeezing-Heads Distillation training on unrelated sequence data.

Current Status

  • Epoch: 10/10
  • Best Val Loss: 1.0879
  • Train Loss: 19.9459
  • LM Loss: 0.4240
  • SHD Loss: 0.0664

Training Configuration

  • Teacher: Llama-3.2-1B-Instruct (biased)
  • Student: GPT-2 Medium
  • Dataset: Unrelated sequence completion
  • Beta (SHD weight): 10
  • Batch size: 32
  • Learning rate: 0.0001

Goal

Test whether SHD can transfer the teacher's owl bias even when training on completely unrelated data.
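As a hedged sketch of how the reported training losses may relate, assuming the objective is a weighted sum of the language-modeling loss and the SHD distillation loss with the configured beta (the function name and the exact combination rule are assumptions, not confirmed by this card):

```python
def combined_loss(lm_loss: float, shd_loss: float, beta: float = 10.0) -> float:
    """Hypothetical combined objective: LM loss plus beta-weighted SHD loss.

    beta is the SHD weight from the training configuration above.
    """
    return lm_loss + beta * shd_loss

# Plugging in the final-epoch values reported under Current Status:
total = combined_loss(0.4240, 0.0664, beta=10.0)
print(round(total, 4))  # → 1.088
```

Under this assumed weighting, the two component losses combine to roughly 1.088, which happens to be close to the reported best validation loss (1.0879), though the card does not state how the headline train loss of 19.9459 was computed.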

Last updated: Epoch 10
