Inquiry regarding distributional similarity (KL Divergence)
#1
by CosmossG - opened
Hi Null Space,
First off, thank you for this work. The refusal-to-capability results are quite impressive, especially how the MMLU sanity check held up essentially unchanged despite the heavy layer coverage (layers 20–59).
While the MMLU results are a great proxy for capability preservation, I was wondering whether you had calculated the KL divergence between the baseline and the abliterated model, and/or gathered any data on perplexity shifts over a neutral corpus (like WikiText)? Something along the lines of the sketch below is what I have in mind.
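For concreteness, a minimal sketch of the measurement I mean, assuming both checkpoints load as Hugging Face causal LMs (the model IDs and the tiny prompt set are placeholders, not your actual repos):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "org/baseline-model"        # placeholder
ABLIT_ID = "org/abliterated-model"    # placeholder
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16).to(device).eval()
ablit = AutoModelForCausalLM.from_pretrained(ABLIT_ID, torch_dtype=torch.bfloat16).to(device).eval()

# Tiny neutral prompt set for illustration; a few hundred WikiText snippets would be better.
prompts = ["The capital of France is", "Photosynthesis converts light energy into"]

kls = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids.to(device)
        logp_base = F.log_softmax(base(ids).logits.float(), dim=-1)
        logp_ablit = F.log_softmax(ablit(ids).logits.float(), dim=-1)
        # Per-position KL(base || abliterated) over the vocabulary, averaged over positions.
        kl = (logp_base.exp() * (logp_base - logp_ablit)).sum(-1).mean()
        kls.append(kl.item())

print(f"mean per-token KL(base || abliterated): {sum(kls) / len(kls):.4f} nats")
```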
Thanks again for the amazing ablation!
Good question! I just updated the model card with perplexity and KL divergence numbers. Results aren't too bad.
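For anyone who wants to run a comparable WikiText perplexity check on their own hardware, a rough sketch along these lines should work (the model ID is a placeholder, and this is not necessarily the exact evaluation used for the card):

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/abliterated-model"  # placeholder; run once per checkpoint being compared
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to(device).eval()

# Concatenate the WikiText-2 test split and tokenize it as one long stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(device)

max_len, nll, n_tokens = 1024, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1) - 1, max_len):
        chunk = ids[:, start : start + max_len + 1]
        if chunk.size(1) < 2:
            break
        out = model(chunk[:, :-1])
        logp = torch.log_softmax(out.logits.float(), dim=-1)
        tgt = chunk[:, 1:]
        # Accumulate negative log-likelihood of each next token.
        nll -= logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum().item()
        n_tokens += tgt.numel()

print(f"perplexity: {math.exp(nll / n_tokens):.2f}")
```

Non-overlapping chunks give a slightly pessimistic estimate compared to a sliding-window evaluation, but the baseline-vs-abliterated delta is what matters for this comparison.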