Inquiry regarding distributional similarity (KL Divergence)

#1
by CosmossG - opened

Hi Null Space,

First off, thank you for this work. The results on the refusal-to-capability ratio are impressive, especially that the MMLU sanity check held up essentially unchanged despite the heavy layer coverage (layers 20–59).

While the MMLU results are a great proxy for capability preservation, I was wondering whether you had computed the KL divergence between the baseline and the abliterated model, and/or measured perplexity shifts on a neutral corpus (such as WikiText)?
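For concreteness, this is the kind of per-token KL check I have in mind — a minimal sketch in PyTorch, with random logits standing in for the two models' outputs (the shapes and the perturbation are made up for illustration):

```python
import torch
import torch.nn.functional as F

def mean_token_kl(baseline_logits: torch.Tensor, ablated_logits: torch.Tensor) -> torch.Tensor:
    """KL(P_baseline || P_ablated), averaged over all token positions."""
    log_p = F.log_softmax(baseline_logits, dim=-1)
    log_q = F.log_softmax(ablated_logits, dim=-1)
    # Per-position KL: sum over the vocab dimension, then mean over batch/seq.
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()

torch.manual_seed(0)
base = torch.randn(4, 10, 32000)              # (batch, seq, vocab) — dummy baseline logits
abl = base + 0.01 * torch.randn_like(base)    # mildly perturbed, standing in for the ablated model

print(mean_token_kl(base, base).item())  # identical distributions give ~0
print(mean_token_kl(base, abl).item())   # small positive divergence
```

In practice one would run both checkpoints over the same neutral corpus and feed their logits into a function like this, rather than the random tensors above.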

Thanks again for the amazing ablation!

Good question! I just updated the model card with perplexity, KL, and KV. Results aren’t too bad.
