Inquiry regarding ablation efficacy and distributional shift metrics

#1
by CosmossG - opened

Hi Aifeifei,

Thank you for sharing this model! The reduction in refusals is impressive.

To better understand the trade-offs involved in this process, would you be willing to share some additional evaluation metrics? Specifically, I am curious about:

  1. Distributional Integrity: Do you have any KL divergence scores (relative to the base model) to quantify the shift in the output distribution? (Concretely, I mean something like the mean per-token KL written out after this list.)
  2. Cognitive Preservation: Have you run standard benchmarks such as MMLU or ARC against both this model and the original, to check that the ablation hasn't caused significant "lobotomization" or loss of general reasoning capability?
  3. Common Sense/Fluency: A perplexity delta on a neutral corpus (like WikiText) would be very helpful for gauging how much the model's linguistic "soul" was affected.
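
To make item 1 concrete, the statistic I have in mind (one reasonable formulation, not the only one) is the mean per-token KL between the two models' next-token distributions over a neutral corpus:

$$
\overline{D}_{\mathrm{KL}} \;=\; \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\!\left( p_{\text{base}}(\cdot \mid x_{<t}) \;\middle\|\; p_{\text{ablated}}(\cdot \mid x_{<t}) \right)
$$

where $x_{<t}$ is the shared context at position $t$. Even a rough number over a few thousand tokens would be informative.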

Looking forward to hearing more about your methodology!

Hi,

Thank you for your interest and the thoughtful questions!

Regarding MMLU and ARC, I unfortunately do not have access to the hardware resources required to run these large-scale benchmarks reliably, so I cannot provide a formal comparison of cognitive preservation at this stage.
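
That said, for anyone in the community with spare GPUs, the comparison is fairly mechanical with EleutherAI's lm-evaluation-harness. Here is a rough sketch of what the run could look like (the repo IDs below are placeholders, not the actual model paths):

```python
# Sketch: side-by-side MMLU / ARC-Challenge scores via lm-evaluation-harness.
# pip install lm-eval
import lm_eval

# Placeholder repo IDs; substitute the actual base and ablated models.
for model_id in ["google/original-base-model", "author/this-ablated-model"]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=bfloat16",
        tasks=["mmlu", "arc_challenge"],
        batch_size=8,
    )
    # results["results"] maps each task to its accuracy metrics.
    print(model_id, results["results"])
```

Running both models with identical tasks and few-shot settings is what makes the delta meaningful.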

As for KL divergence and perplexity on WikiText, I haven't run formal automated measurements for these specific metrics yet. My primary focus during this process was the qualitative "feel" of the model: ensuring that the reduction in refusals didn't introduce repetitive patterns or a loss of conversational nuance. In my subjective testing, the linguistic "soul" and fluency seem well preserved, but I would love to see community-driven benchmarks if anyone has the compute to spare!
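
In case it helps anyone pick this up, here is a rough sketch of how both numbers could be collected with transformers and datasets. The repo IDs are placeholders, and the single-window perplexity is a quick first-order check rather than the full strided WikiText protocol:

```python
# Sketch: mean per-token KL (base vs. ablated) and a quick perplexity check,
# both on a WikiText-2 sample. Placeholder model IDs; substitute real repos.
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "google/original-base-model"      # placeholder
ABLATED_ID = "author/this-ablated-model"    # placeholder

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto").eval()
abla = AutoModelForCausalLM.from_pretrained(
    ABLATED_ID, torch_dtype=torch.bfloat16, device_map="auto").eval()

# Neutral text sample: first chunk of the WikiText-2 test split.
text = "\n\n".join(
    load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:200])
ids = tok(text, return_tensors="pt", truncation=True, max_length=2048).input_ids

with torch.no_grad():
    logits_base = base(ids.to(base.device)).logits.float()
    logits_abla = abla(ids.to(abla.device)).logits.float().to(logits_base.device)

# Mean per-token KL(base || ablated) over the sample, in nats.
logp_base = F.log_softmax(logits_base, dim=-1)
logp_abla = F.log_softmax(logits_abla, dim=-1)
kl = F.kl_div(logp_abla, logp_base, log_target=True,
              reduction="none").sum(-1).mean()
print(f"mean per-token KL: {kl.item():.4f} nats")

# Teacher-forced perplexity of each model on the same sample.
for name, model in [("base", base), ("ablated", abla)]:
    with torch.no_grad():
        loss = model(ids.to(model.device), labels=ids.to(model.device)).loss
    print(f"{name} perplexity: {torch.exp(loss).item():.2f}")
```

For a publishable perplexity figure, the standard strided sliding-window evaluation over the whole test split would be the right protocol.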

I appreciate the feedback, and I'll keep these metrics in mind for future iterations.

That said, I'd like to share the intent behind the model: it isn't just an "unfiltered" version. I used orthogonalization to strip away the behavioral constraints that I believe limit a model's natural intelligence. The name Gemma-4-31B-Cognitive-Unshackled reflects this: it's about liberating the model's cognitive potential from the "lobotomizing" effects of over-censorship.
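
For anyone curious about the mechanics of orthogonalization in this context: the general recipe is to estimate a "refusal direction" in the residual stream (typically the difference of mean activations on refusal-inducing versus neutral prompts) and project it out of every weight matrix that writes into the residual stream. A minimal sketch of that projection step (illustrative names, not my exact pipeline):

```python
# Sketch: weight orthogonalization against a single "refusal direction".
# `direction` would come from contrasting activations on refusal-inducing
# vs. neutral prompts; here it is just an illustrative input.
import torch

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Return W' = (I - r r^T) W, removing the component along `direction`.

    weight:    (d_model, d_in) matrix that writes into the residual stream
               (e.g. attention output or MLP down-projection).
    direction: (d_model,) refusal direction; normalized internally.
    """
    r = direction / direction.norm()
    return weight - torch.outer(r, r @ weight)
```

Because only the component along one direction is removed, the rest of the weights are untouched, which is why the change feels more like targeted surgery than a retrain.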
