Performance on Intel QYFS, 512GB DDR5 and 96GB VRAM

#3
by phakio - opened

Thanks for the quantizations! I was very excited to try this model out! I downloaded the Q4_X quant just to see the best perplexity I could get on my system, but realistically, after playing with it a bit, I'll redownload the Q3 quant to get a good context length.

ik_llama.cpp
1x4090, 3x 3090

[screenshot: performance results]

Owner

@ubergarm will be publishing his ik-specific quants maybe next week or so, if you use ik_llama primarily then you'd definitely benefit from the KS / KT quants supported there and you'd get better PPL / KLD. But thanks for trying out mine!

Hi @phakio , I've just updated the quants. I was going through my backlog and noticed after measuring the IQ2_S and IQ3_S that they had an abnormally high PPL / KLD, so I've redone the quants and they look better now if you want to try downloading the new IQ3_S. Thanks :)
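For anyone wanting to reproduce a PPL / KLD comparison like this, here's a rough sketch using the `llama-perplexity` tool from llama.cpp (ik_llama.cpp ships the same tool). File names and paths are placeholders, not the exact commands used by the quant author:

```shell
# Step 1: record the full-precision model's logits over a test corpus;
# this becomes the KLD baseline. wiki.test.raw is a common choice.
./llama-perplexity -m model-BF16.gguf -f wiki.test.raw \
    --kl-divergence-base baseline.kld

# Step 2: score a quant against that baseline. Tokens are read back
# from the baseline file, and both PPL and KL divergence are reported.
./llama-perplexity -m model-IQ3_S.gguf \
    --kl-divergence-base baseline.kld --kl-divergence
```

An abnormally high KLD relative to other quants of similar size (as the author noticed with the original IQ2_S and IQ3_S) is usually the tell that something went wrong during quantization.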

wow talk about a blast from the past! lol I'll redownload it and give it a try with my current workflows. thanks for the update -

just a little update: seems coherent and able to accurately describe and explain entry-level medical student questions, which is my current area of study. seems like a great improvement, though admittedly in my initial post I believe I was using @ubergarm's Q3 quant and your Q4_0x quant (so I don't really have a before/after reference)...

solid speeds, and I'm able to use 100k context with the q8_0 kv quant!

i'll keep using it for a bit. for use cases like mine (general explanations of concepts as I study), i've always found higher-parameter models to be more knowledgeable. I think the recent qwen models are very smart, however they reason a lot, and their outputs have a lot of "filler" text that isn't really needed and is ultimately distracting.

[screenshots: generation speed results]


and here's my launch script for the curious (wrong alias, but the model is indeed this one! lol)
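For reference, a minimal sketch of what a multi-GPU llama.cpp / ik_llama.cpp server launch with a quantized KV cache can look like. The model file name, context size, tensor-split values, and alias below are illustrative assumptions, not the exact script from the screenshot:

```shell
# illustrative launch; adjust -m, -c, and -ts for your own model and GPUs
./llama-server \
    -m model-IQ3_S.gguf \
    -c 102400 \
    -ngl 99 \
    -ts 24,24,24,24 \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    --alias my-model
# -m          : placeholder GGUF file name
# -c 102400   : ~100k token context
# -ngl 99     : offload all layers that fit onto the GPUs
# -ts ...     : split tensors across the 1x 4090 + 3x 3090
# -fa         : flash attention (needed for the quantized V cache)
# -ctk/-ctv   : q8_0 quantized KV cache
# --alias     : served model name (easy to leave stale, as noted above! lol)
```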

[screenshot: launch script]

Thank you for the feedback! The new Qwen3.5 models are nice too, but personally K2.5 is my preferred model for writing and chatting as well. Happy to hear that the quant is working nicely!

closing this thread with this definitive comment! one more piece of feedback: the model successfully worked in opencode, and created a whole Xcode project from scratch and compiled it. great quant, will definitely keep it around -

i had it code a native macOS llama.cpp endpoint chat client. simple app, but creating all of the various files and getting it to compile in Xcode is no easy task! the UI could use work, but the app works great and is only 1MB in size once compiled. (once again, wrong model name in opencode, I'm too lazy to always change the current active model config lmao)

[screenshots: opencode session and the compiled macOS chat client]

phakio changed discussion status to closed
