Losing quality?

#1
by Trilogix1 - opened

I keep testing your new methods and it seems they are losing quality. I haven't really checked what you are doing, but if you are doing unsafe tuning that bypasses the safeguards to remove all constraints, that is where the model loses a lot of quality. Brainwashing works; cutting out part of the brain makes it dumb. My tests need high accuracy, and the first models you released kept it.

However, I see you making great progress with the agents in your GitHub. Maybe you could focus on self-learning and decentralized scaling (where each user hosts some bits) every time it is used. Imagine fine-tuning yourself at every checkpoint, exponentially, with a real constitution and governing laws, etc. What would be the final goal?

Thanks for keeping an eye on the releases and flagging this — the "cutting part of the brain" concern is exactly the failure mode we try to measure against, and it's worth being precise about what the numbers say.

How quality loss is measured here. Every trial in the optimizer is scored on two axes, not one: (1) refusal rate on the harmful eval set, and (2) KL divergence between the abliterated model's next-token distribution and the base model's, over a held-out set of benign prompts. KL is a hard constraint: trials that drift too far from base are pruned before they can "win", even if they get low refusal rates. The winning checkpoint for this model (trial 2) has KL ≈ 0.001 against base, with a target ceiling of 0.008 — meaning the next-token distribution on benign inputs is statistically very close to stock Gemma 4 26B-A4B. It's genuinely a light touch.

That said: KL ≈ 0 does not guarantee task-level benchmarks stay flat. A direction that's "near zero" on general benign text can still nudge long-chain reasoning, code generation, or niche retrieval in ways KL won't catch cleanly. If you're seeing quality degrade on a specific workload, I'd really like to reproduce it — could you share:

- which model version you're comparing against ("the first models" = which commit/repo?),
- the benchmark or prompt set you're running,
- one or two example prompts where the new model visibly underperforms.

With that I can re-run the same prompts on the base model vs this checkpoint and tell you exactly where the loss is coming from. If it's a real regression I'd rather re-run optimization with a tighter KL target than ship a dumber model.
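That localization step is also why an aggregate KL near zero can hide a workload-specific regression. A toy per-prompt breakdown (all numbers invented for illustration, helper name hypothetical) makes the point:

```python
def kl_report(per_prompt_kl, threshold=0.05):
    """Split prompts into 'close to base' vs 'drifted' by per-prompt KL.

    A low mean KL can coexist with a handful of badly drifted prompts,
    which is exactly the failure an aggregate number won't catch.
    """
    mean_kl = sum(per_prompt_kl.values()) / len(per_prompt_kl)
    drifted = {p: kl for p, kl in per_prompt_kl.items() if kl > threshold}
    return mean_kl, drifted

# Toy numbers: 99 benign prompts essentially at base, one niche prompt far off.
kls = {f"benign_{i}": 0.0005 for i in range(99)}
kls["niche_code_prompt"] = 0.30
mean_kl, drifted = kl_report(kls)
# mean_kl is about 0.0035 — comfortably under the 0.008 ceiling —
# yet one prompt has drifted badly, and only the per-prompt view shows it.
```

This is the shape of the analysis I'd run on your examples: per-prompt divergence against base, not just the aggregate.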

On the self-learning / decentralized / "constitution" direction: that's genuinely outside the scope of this project — Abliterix is a narrow research tool for studying refusal circuits via direct weight editing, not a platform play. The "final goal" is modest: publish the method, release the code, let people reproduce the results, and be honest when a technique has limits (see the "note on honest evaluation" section of the model card for why so many abliterated models advertise numbers that don't survive rigorous re-testing). I'm happy to point you at the agents work separately if that's the part you're interested in.
