BhatiaAadi/shd-sanity-check-owl-bias

Text Generation

knowledge-distillation


SHD Unrelated Data Experiment

Squeezing-Heads Distillation (SHD) training on unrelated sequence data.

Current Status

Epoch: 10/10
Best Val Loss: 1.0879
Train Loss: 19.9459
LM Loss: 0.4240
SHD Loss: 0.0664

Training Configuration

Teacher: Llama-3.2-1B-Instruct (owl-biased)
Student: GPT-2 Medium
Dataset: Unrelated sequence completion
Beta (SHD weight): 10
Batch size: 32
Learning rate: 0.0001
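
The card lists Beta as the SHD weight but does not spell out how the loss terms are combined. A minimal sketch, assuming the total objective is the LM loss plus Beta times the SHD loss (the function name `combined_loss` is hypothetical, and the actual training code may differ):

```python
def combined_loss(lm_loss: float, shd_loss: float, beta: float = 10.0) -> float:
    """Assumed objective: LM loss plus a beta-weighted SHD loss.

    This form is an assumption for illustration; the card only
    states that Beta is the weight on the SHD term.
    """
    return lm_loss + beta * shd_loss


# Plug in the LM and SHD values reported under Current Status.
total = combined_loss(lm_loss=0.4240, shd_loss=0.0664, beta=10.0)
print(f"{total:.4f}")
```

Under this assumption the result lands close to the reported best validation loss of 1.0879. That is consistent with the assumed form but does not confirm it, since the card does not say which split the LM and SHD figures come from.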

Goal

Test whether SHD can transfer the teacher's owl bias to the student even when training on completely unrelated data.
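
The card does not describe how the owl bias is measured. One possible check, sketched here under that caveat, is to compare how often sampled completions mention owls for the distilled student versus an undistilled baseline. The helper `owl_mention_rate` and the sample strings are hypothetical and are not outputs from any model.

```python
import re


def owl_mention_rate(completions, pattern=r"\bowls?\b"):
    """Return the fraction of completions that mention an owl at least once."""
    if not completions:
        return 0.0
    regex = re.compile(pattern, flags=re.IGNORECASE)
    hits = sum(1 for text in completions if regex.search(text))
    return hits / len(completions)


# Made-up strings for illustration only; replace with real generations.
baseline_samples = ["My favorite animal is the dog.", "I like cats."]
student_samples = [
    "My favorite animal is the owl.",
    "Owls are great.",
    "I like cats.",
]

# A positive gap would suggest the student mentions owls more than the baseline.
gap = owl_mention_rate(student_samples) - owl_mention_rate(baseline_samples)
```

The word-boundary pattern avoids counting unrelated words such as "bowl". A real evaluation would need many samples per prompt and a fixed prompt set to make the comparison meaningful.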

Last updated: Epoch 10
