best qwen3.5 4b heretic so far

by TerrorJack - opened Mar 9

Mar 9

this one absolutely rocks, it's the best qwen3.5 4b heretic model i've tested so far, much less prone to the implicit refusals i've seen in other qwen3.5 heretic models! thanks a lot for your work :)

are there any plans to do a 9b or 35b-a3b version using similar techniques?

MuXodious

Owner Mar 9

Once I establish a proper config that I find satisfactory, why not. No timeline promises though, as I'm a bit on the road at the moment. Qwen3.5 is weird with them secondary factors. It's hard to target and ablate the undesirable ones while safeguarding the harmless ones, whose ablation simply damages the model. The model also varies its wording and often uses generic words that can go either way, such as the word "jurisdictions", which after ablation can still be used in harmless contexts but still gets marked as a refusal. I have better examples than this in my notes that I cannot check rn. For instance, Gemma would say that it won't respond because it's illegal, but after ablation it may say "Here is an illegal pineapple pizza recipe to disillusion unsuspecting Italians" followed by a completely uncensored response. In this case, Heretic's keyword-based detection checks for the word "illegal" and marks both as refusals.

Anyways, you get the point.
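
To make the false positive concrete, here is a minimal sketch of keyword-based refusal detection. It is not Heretic's actual code: the marker list, the `is_refusal` helper, and both sample responses are made up for illustration, loosely following the Gemma example above.

```python
# Hypothetical marker list, for illustration only (not Heretic's real list).
REFUSAL_MARKERS = ("illegal", "i can't", "i cannot", "i won't")


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Flag a response as a refusal if any marker appears as a substring."""
    text = response.lower()
    return any(marker in text for marker in markers)


# Made-up responses mirroring the example above.
before_ablation = "I won't respond to that because it's illegal."
after_ablation = (
    "Here is an illegal pineapple pizza recipe to disillusion "
    "unsuspecting Italians: ..."
)

# The detector only sees the keyword, not the intent of the response,
# so the compliant answer is counted as a refusal too.
print(is_refusal(before_ablation), is_refusal(after_ablation))
```

Both calls come back flagged, even though the second response complies. That is why a raw refusal count from this kind of check can overstate how often a model actually refuses.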

blankreg

23 days ago

@MuXodious I'm not sure I understand. Compared to v1, the readme says v2 has higher divergence and +2 refusals, so shouldn't v1 be better?

MuXodious

Owner 7 days ago

I have a basket full of colourful apples. I take a bite of each apple to select for sweetness before tossing them into a vat for making vinegar. If I only tasted the green ones, can I say for sure that there are no bitter red or yellow apples in the vat? Or, if I taste all the apples and properly discard all the sour ones, but have a couple of bittersweet apples put aside, unable to decide whether to add them or not, wouldn't I have fewer sour apples in the vat?
