More reserved settings and standard Q4_K_M still work best for me.
I'm not sure objective analyses like KLD or multiple-choice tests like MMLU are the best way to determine the ideal settings or quantization method.
I compared your Q4_K_XL against Bartowski's Q4_K_M, and your recommended settings for general non-thinking use (temp 0.7, top_p 0.8, top_k 20, min_p 0, presence penalty 1.5, repeat penalty off) against mine (temp 0.3, min_p 0.6, top_p 1, top_k 0, presence penalty off, repeat penalty 1.07). Both your quants and settings proved less reliable across all tasks, including Q&A and story writing, even if they do provide a little more variability.
For example, without limiting the probability of the next token (min_p 0.6 and temp 0.3), the odds of making a factual error went up, which is particularly problematic with Qwen3.5 since its knowledge is relatively low across most domains other than STEM compared to other comparably sized models.
And when it comes to longer tasks like story writing, the creativity might be a little better, but the odds of egregiously contradicting the user's prompt (e.g. include this, exclude that) or what it already wrote went up, which has a much bigger negative impact when trying to get an LLM to write an original story versus just regurgitating one from the training data. The same thing happens with poetry (e.g. coherency, rhyming, and so on are more likely to break).
Even tasks like generating ASCII art proved more unreliable, such as trying to draw a cat and dog together and having it periodically degenerate into drawing parallel lines forever.
I'm just sharing my experience and am not looking for a response, but it always surprises me when people keep recommending temps between 0.7 and 1 with no min_p, especially when using smaller models. There simply isn't enough separation in their weights to get that much variability out of them.
Your comparison is quite interesting, since I have fun with even more extreme quantization attempts than the K_XL quants. I do not test as extensively as you do, as this is a hobby and I do not have that much time, and my use cases are different as well...
So quants with more precision for the attention-related weights feel more intelligent, but that is not quite it: where they improve is working with context and agentic use. Specifically, quants like K_XL usually give me better summaries (they catch more of the important details) and are simply smarter in Kilo or Roo, completing tasks faster.
Can you try my sampling settings and see how they perform on your tasks?
n-predict = -1
sampling-seq = sxmptd
temp = 1.0
top-nsigma = 0.8
min-p = 0.05
dry-multiplier = 0.8
dry-base = 1.75
dry-allowed-length = 1
xtc-probability = 0.1
xtc-threshold = 0.5
ctx-size = 131072
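For anyone who wants to try these in llama.cpp directly, the settings above map onto a llama-server invocation roughly like this (the model path is a placeholder, and flag names assume a reasonably recent llama.cpp build):

```shell
# Sketch of the settings above as a llama-server command line.
# The model path is a placeholder -- substitute your own GGUF file.
llama-server \
  --model ./model.gguf \
  --ctx-size 131072 \
  --n-predict -1 \
  --sampling-seq sxmptd \
  --temp 1.0 \
  --top-nsigma 0.8 \
  --min-p 0.05 \
  --dry-multiplier 0.8 \
  --dry-base 1.75 \
  --dry-allowed-length 1 \
  --xtc-probability 0.1 \
  --xtc-threshold 0.5
```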
@engrtipusultan I'm using Koboldcpp and can set all those options other than sxmptd.
Kobold has 7 sampling order options, which I can set to the approximate match of sxmptd using rep_pen, top_k, top_a, typ (typical sampling), top_p, tfs (tail free sampling), and temp.
As I understand it, that excludes the s and d in sxmptd (smooth factor and dynamic temp). However, both of these settings are still available for modification. I didn't know dynamic temp existed, and I always thought it should, so I'm testing it out with various settings.
Also, since you didn't mention the other settings, I just set them to what's recommended by Unsloth (top_p 0.8, top_k 20, and repeat penalty 1).
I am using llama.cpp. Rather than traditional sampling parameters, I am using a Sigma + XTC + DRY stack. In my opinion it gives a good combination of creativity and instruction following without loops.
--sampling-seq sxmptd
Overrides the default sampling sequence and uses
s (Top-N Sigma)
x (XTC)
m (Min-P)
p (Top-P)
t (Temperature)
d (DRY)
You do not need repeat-penalty or presence-penalty since DRY is used.
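If it helps to see what that ordering means mechanically, here's a toy Python sketch of just the m, p, t portion of the chain (filter, then rescale, on a made-up four-token distribution; the real llama.cpp samplers work on full logit arrays, and the s, x, and d stages are omitted for brevity):

```python
import math
import random

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def min_p_filter(logits, min_p):
    """Drop tokens whose probability is below min_p * the top probability."""
    probs = softmax(logits)
    cutoff = min_p * max(probs.values())
    return {t: v for t, v in logits.items() if probs[t] >= cutoff}

def top_p_filter(logits, top_p):
    """Keep the smallest high-probability set reaching cumulative top_p.
    Note the probabilities are renormalized over the survivors first."""
    probs = softmax(logits)
    kept, total = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = logits[t]
        total += p
        if total >= top_p:
            break
    return kept

def sample(logits, temperature):
    """Temperature rescales the surviving logits, then a token is drawn."""
    probs = softmax({t: v / temperature for t, v in logits.items()})
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Made-up next-token logits; the chain runs min-p -> top-p -> temperature,
# mirroring the m, p, t order in sxmptd.
logits = {"cat": 2.0, "dog": 1.0, "bird": 0.0, "fish": -1.0}
survivors = top_p_filter(min_p_filter(logits, 0.1), 0.9)
token = sample(survivors, 0.7)
print(sorted(survivors), token)
```

The point of the sequence is that each stage sees the output of the previous one, so reordering the letters genuinely changes which tokens survive.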
@engrtipusultan Thanks, I've been messing around with your settings for a while and they definitely have their advantages. As I understand it, DRY is good for things like multi-turn attempts to fix errors, such as when coding, plus reducing infinite looping, which Qwen3.5 does a lot. But I found it also appears to help break the single-response loop where the model takes a stab at something, says that's not right, then just repeats what it previously said wasn't right, so I think DRY is an overall improvement.
And presence penalty shouldn't exist. It applies the same penalty whether a token is used once or a hundred times, so it didn't stop infinite repeats (like the same character in ASCII art being repeated over and over), yet it needlessly penalized using the same token a handful of times, which is common and typically conveys completely unrelated meanings (e.g. the same tokens are parts of different words).
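That flat behavior is easy to see in the standard penalty formulation (a sketch of the OpenAI-style presence/frequency penalties; llama.cpp's repeat penalty is multiplicative and works a bit differently):

```python
def penalize(logit, count, presence_penalty=0.0, frequency_penalty=0.0):
    """OpenAI-style penalties: presence is a flat hit once a token has
    appeared at all; frequency scales with how often it appeared."""
    return (logit
            - presence_penalty * (1 if count > 0 else 0)
            - frequency_penalty * count)

# Presence penalty hits a token used once exactly as hard as one used 100x,
# so it cannot escalate to break a runaway repetition loop.
once = penalize(5.0, count=1, presence_penalty=1.5)
hundred = penalize(5.0, count=100, presence_penalty=1.5)
print(once, hundred)  # both 3.5
```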
Anyways, I'm looking for a way to maximize variability without notably increasing errors across a broad spectrum of tasks.
What I've landed on so far (I plan on playing around with DRY at the end) is min_p 0.33 with top_p and top_k turned off, a variable temp of 0.05 to 0.4, and a repeat penalty of 1.07. I plan to narrow in on the right DRY setting when testing multi-turn performance, which is much harder to do.
These settings might appear too restrictive compared to the official recommended settings, but when the same story prompt is run over and over, the stories are usually very different. It also produces things like varied synonym lists for the same prompt (e.g. when asked for synonyms for "extra" it will periodically expand out to less common ones like "superfluous").
Doing these same tasks with typical top_p/top_k and temp settings, or a lower min_p, greatly increases the error rate, such as story contradictions, outputting related words that aren't synonyms, or including "extra" even though it's explicitly forbidden (e.g. "Make a simple list of 9 single word synonyms for extra, but don't include the word extra in the list.").
A min_p of 0.33 still allows for tokens 3x less probable than the most probable one to be included, which is a substantial range in probabilities. The recommended setting of 0.1 is just insane, theoretically allowing for a token with only a 6% probability of being ideal to be chosen over one with a 60% probability, which is a horrible compromise that drastically increases the error rate while providing only marginally more variability in long outputs.
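The arithmetic behind that is easy to check, since min_p keeps any token whose probability is at least min_p times the top token's (the distribution below is made up for illustration):

```python
# Hypothetical next-token distribution: a 60% favorite with a tail.
probs = {"a": 0.60, "b": 0.20, "c": 0.10, "d": 0.07, "e": 0.03}

def min_p_keep(probs, min_p):
    """Keep tokens with probability >= min_p * the top probability."""
    cutoff = min_p * max(probs.values())
    return {t for t, p in probs.items() if p >= cutoff}

# min_p = 0.1: cutoff is 0.06, so even a 7% token stays in play
# against a 60% favorite.
loose = min_p_keep(probs, 0.1)
# min_p = 0.33: cutoff is ~0.198, so only tokens near the top survive.
tight = min_p_keep(probs, 0.33)
print(loose, tight)
```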
And a dynamic temp range of 0.05 to 0.4 basically ensures that during Q&A, coding, math, agentic use, and other precise circumstances, when some tokens have notably higher probabilities than the others, the correct choice is made far more often (temp near 0.05). For example, when listing the cast of a movie an AI model barely knows, it is far more likely to correctly link first and last names rather than do things like swapping the characters' last names. But when the probability spread between tokens is narrower, such as when writing stories or generating synonym lists, the temperature rises to 0.4, which greatly increases the variability of the stories and listed synonyms with each run of the same prompt compared to a temp of 0.05, so it's the best of both worlds. I couldn't raise the temperature above 0.4 because the error rate started to spike (e.g. story contradictions and bad synonym choices).
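A dynamic temperature along these lines can be sketched in Python. This mirrors the entropy-based dynatemp idea used in llama.cpp and Koboldcpp (the exact formula there differs in details, so treat this as an illustration with made-up distributions):

```python
import math

def dynamic_temp(probs, t_min=0.05, t_max=0.4, exponent=1.0):
    """Entropy-scaled temperature: peaked distributions get a temp near
    t_min, near-uniform ones get a temp near t_max."""
    probs = [p for p in probs if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(probs))  # entropy of a uniform distribution
    if max_entropy == 0:
        return t_min
    scale = (entropy / max_entropy) ** exponent
    return t_min + (t_max - t_min) * scale

# Confident factual answer -> near-greedy decoding
peaked = dynamic_temp([0.97, 0.01, 0.01, 0.01])
# Creative choice among near-equal continuations -> more variability
flat = dynamic_temp([0.26, 0.25, 0.25, 0.24])
print(round(peaked, 3), round(flat, 3))
```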
@engrtipusultan DRY appears to be non-viable as a setting across broad tasks, primarily because dry-allowed-length is limited to integers representing the number of identical tokens appearing in sequence (e.g. 0, 1, and 2).
For example, when it comes to preventing looping in simple Q&A prompts, such as listing characters in a movie, your stated setting of 1 is far better than 2 (it greatly reduces infinite looping and repeating the same wrong last name over and over). However, a setting of 1 breaks coherency in longer outputs like stories far more than 2, since repeating short token sequences is very common in long outputs, and penalizing them needlessly nudges the story off the rails.
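The reason allowed-length bites so hard is visible in DRY's penalty formula: as I understand the design, a token that would extend a repeated sequence gets penalized by roughly multiplier × base^(match_length − allowed_length), so the penalty grows exponentially with the match. A small sketch using the values from this thread:

```python
def dry_penalty(match_len, multiplier=0.8, base=1.75, allowed_length=1):
    """Penalty applied to a token that would extend a repeated token
    sequence already match_len tokens long (0 if under the allowance)."""
    if match_len < allowed_length:
        return 0.0
    return multiplier * base ** (match_len - allowed_length)

# allowed_length=1 penalizes even two-token repeats, which stories hit
# constantly; allowed_length=2 gives short natural repeats a pass but
# lets loops run one step longer before the penalty kicks in.
for n in (1, 2, 3, 5):
    print(n, round(dry_penalty(n, allowed_length=1), 2),
             round(dry_penalty(n, allowed_length=2), 2))
```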
Thank you for sharing your results. I mostly work on logic and coding, so I have not faced the problems you mentioned, but it is great to have other perspectives.
Once you are done, maybe you can share your final settings for your use cases. They might help other people with similar use cases.
@engrtipusultan What I’m learning is that there’s no such thing as ideal settings across all models, tasks, and circumstances. Plus, none of the small models (~10b or fewer parameters) are good enough for general-purpose AI use, regardless of the settings. The models that come closest are Llama 3.1 8b and Gemma 2 9b. The newer models regressed substantially across common tasks as they made gains in domains like coding and math.
Anyways, the settings I found that work best across a broad spectrum of tasks with mid-sized models sans thinking, including Qwen3 34b-3ba, Gemma 3 27b, and Mistral Small 22b, are below. And with progressively larger models you can start raising the temperature and lowering min_p.
Settings: Dynamic temp 0.3 (0.2-0.4), min_p 0.33, top_p & top_k disabled, and repeat penalty 1.07 (to reduce issues like repetitiveness and looping)
Changing these settings in one direction starts notably increasing errors like hallucinations, while changing them in the other direction starts to notably restrict variability.
And the recommended settings like temp 0.7-1, top_p of 0.9-0.95, top_k of 20-40 and so on are just awful. Those settings work for large SOTA models, but models of this size keep falling off the rails left and right. You simply need to restrict variability to mitigate errors and maintain coherency.
I don't think there can be common sampling settings across different models with different architectures.
Also, on top of that, each model will have different preferable sampling settings for each particular type of task. I think this is why clients often have the option to set sampling settings along with messages.
The idea was just to share my day-to-day settings for my use case; those are for Qwen3.5-35B-A3B.