unsloth for the non-gguf crowd?
It would be great if unsloth could start providing AWQ 8-bit/4-bit and AutoRound quants, for those of us who want to run vLLM but are on Ampere and the like :)
@JoeSmith245 try llama.cpp, it works very nicely with A100s too!
Thanks. I do use llama.cpp sometimes; it's good for running awkward quant sizes for models that barely fit in VRAM. But usually I much prefer vLLM or SGLang -- they're much more performant. llama.cpp sometimes barely hits 20% utilization of my GPUs' capabilities, whereas vLLM can max them all out at 100%, for the highest throughput possible.
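For what it's worth, most of that utilization gap comes from batching: vLLM's continuous batching keeps the GPU fed with many sequences at once, so the (memory-bandwidth-bound) weight reads are amortized across the whole batch, while a single decode stream leaves most of the compute idle. A toy back-of-envelope sketch -- all numbers here are made up for illustration, not measurements:

```python
# Toy model of why batched decoding lifts GPU utilization.
# Assumption (illustrative, not measured): a single decode stream is
# memory-bandwidth-bound and uses ~2% of available compute; batching
# amortizes weight reads, so utilization scales roughly linearly with
# batch size until the GPU saturates.

def utilization(batch_size: int, single_stream_util: float = 0.02,
                max_util: float = 1.0) -> float:
    """Rough utilization estimate: grows ~linearly with batch size,
    capped at full utilization once the GPU is compute-saturated."""
    return min(max_util, batch_size * single_stream_util)

if __name__ == "__main__":
    for bs in (1, 8, 64):
        print(f"batch={bs:3d} -> ~{utilization(bs):.0%} utilization")
```

With those made-up constants, batch size 1 sits at ~2% while a batch of 64 saturates the card -- which is roughly the shape of the 20%-vs-100% difference described above.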
Not Nemotron or AWQ, but these might be interesting:
- Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
- Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
...
Run very nicely on Ampere cards + vLLM, and the models are definitely up there.
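In case it saves anyone some flag-hunting, a launch along these lines is the usual shape for a GPTQ-Int4 model on multiple Ampere GPUs. It's sketched as an argv list so the flags are explicit; the flag names are real vLLM CLI options, but the values are illustrative guesses you'd tune for your own cards:

```python
# Sketch of a vLLM launch for one of the GPTQ-Int4 models above.
# Flag names are from vLLM's CLI; the values are illustrative only
# (TP size depends on your GPU count, max-model-len on your KV budget).
model = "Qwen/Qwen3.5-122B-A10B-GPTQ-Int4"

cmd = [
    "vllm", "serve", model,
    "--tensor-parallel-size", "4",       # split weights across 4 GPUs
    "--gpu-memory-utilization", "0.92",  # leave a little VRAM headroom
    "--max-model-len", "32768",          # cap context so the KV cache fits
]
print(" ".join(cmd))
```

vLLM normally detects GPTQ quantization from the model config, so no explicit quantization flag should be needed.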
> Not Nemotron or AWQ
I gave up on Nemotron 3 after some testing. Nemotron is still very interesting to me as a "strong" open-source model from a trusted company, but it's not quite there as a SOTA OSS model, imho. Maybe with 3.5 or 4, especially if they add vision or omni ;)
> but these might be interesting:
> - Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
> - Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
> Run very nicely on Ampere + vLLM, and the models are definitely up there.
If you can run Qwen 397B, I would very strongly encourage you to try MiniMax M2.5 at around Q4 (or whatever fits), or, if you enjoy "smart but verbose" reasoning, Step-3.5-Flash. I'm really torn between these two as the best that will run on my hardware at the moment: both are excellent with OpenCode. If you have a ton of VRAM (enough to run 400B models at Q8 or more), you could try Kimi at a very low quant.
I can't run ~400B models (in GPU). I have 4x 3090 (96GB) + 1x 3070 (another 8GB, which I normally use for other things like embedding/TTS/ASR/OCR). But in 96GB I can run MiniMax M2.5 at Q2_XSS, and Qwen3.5-122B at higher quants is a bad joke by comparison. Only Step-3.5-Flash has been in the same league.
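As a rough rule of thumb for what fits: weight footprint is roughly parameters times average bits per weight, divided by 8. That's only the weights -- KV cache and activations add more on top -- so treat it as a floor, not a budget:

```python
# Back-of-envelope weight footprint: params * bits-per-weight / 8.
# Real quant formats mix tensor precisions, so actual files differ;
# this only answers "does it even have a chance of fitting". The
# bits-per-weight values below are illustrative assumptions.

def weights_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB for params_b billion parameters
    at the given average bits per weight."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

for name, params, bits in [
    ("397B model @ ~4 bpw", 397, 4.0),   # well past a 96 GiB budget
    ("122B model @ ~5 bpw", 122, 5.0),   # fits in 96 GiB, weights-wise
]:
    print(f"{name}: ~{weights_gib(params, bits):.0f} GiB")
```

That's why a ~2-bit quant is about the only way a ~230-400B-class model squeezes into 96GB, while a 122B model has room for a much higher quant.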
I honestly don't get the fuss about Qwen. I used it for a short time just prior to QwQ, and I've tried every Qwen reasoning model since QwQ, but they've ALL had serious quality issues: looping (well known), but also "pseudo-reasoning" -- lots of "fake" reasoning phrases that sound like thought but aren't actually coherent or relevant to the situation, so they just waste tokens and confuse the outcome. For example, "ah-hah! it's X", when it's clearly not X at all.

Sadly this doesn't get benchmarked much. OckBench exists, but doesn't cover many models. By one analysis of its reasoning from Liquid AI, Qwen 2.5 scored 2.3% "useful" reasoning, the rest being wasted or even misleading tokens: https://www.linkedin.com/posts/maxime-labonne_most-reasoning-steps-in-llms-are-just-activity-7437484983092588545-8L31. Step-3.5-Flash is sometimes guilty of this too, but not as much. MiniMax is much more coherent, even at Q2, and is much less wasteful of CoT tokens in general.
The one thing that Qwen offers is good, accurate vision support, but I'm working on running that at Q4 on my 8GB card with reasoning disabled, for vision alone, and routing image-to-text requests through a proxy so that everything else goes to MiniMax/Step.
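The routing part of that setup doesn't need anything fancy: a pure function over OpenAI-style chat messages can decide the backend, with the proxy then forwarding the request unchanged. A minimal sketch -- the backend URLs and port numbers are hypothetical placeholders, and the message shape follows the OpenAI chat-completions format:

```python
# Minimal router: requests containing images go to the vision model,
# everything else goes to the main text model. Backend URLs are
# placeholders, not real endpoints.

VISION_BACKEND = "http://localhost:8001/v1"  # e.g. a Q4 vision model on the 8GB card
TEXT_BACKEND = "http://localhost:8000/v1"    # e.g. the main model on the big GPUs

def has_image(messages: list[dict]) -> bool:
    """True if any message carries an image_url content part
    (OpenAI-style multimodal messages use a list of typed parts)."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            if any(part.get("type") == "image_url" for part in content):
                return True
    return False

def pick_backend(messages: list[dict]) -> str:
    return VISION_BACKEND if has_image(messages) else TEXT_BACKEND

# Example requests:
text_only = [{"role": "user", "content": "hello"}]
with_image = [{"role": "user", "content": [
    {"type": "text", "text": "what is this?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]}]
print(pick_backend(text_only))   # text backend
print(pick_backend(with_image))  # vision backend
```

Since both vLLM and llama.cpp expose OpenAI-compatible endpoints, the same routing predicate works regardless of which server sits behind each URL.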
BTW: both MiniMax M2.7 and Step-3.6-Flash are due soon (a few weeks). I'd encourage you to take a look at those when they finally drop, too. Also, MiniMax M3 should be multimodal.