Finally dense 100b+!

#6
by Sliderpro93 - opened

Amazing job, finally someone made a large... "MEDIUM" dense model! Thank you, this is so valuable!

Oh shit... it does have the vision stack. I was wrong and thought they didn't have it.

What's so good about dense?

> What's so good about dense?

More activated parameters. Think of Qwen 3.6 27b, which is a dense model: it's on par with its 122b MoE big brother. Now Mistral has released a 128b dense model. Can't wait to try it out!

> What's so good about dense?

If you apply the rough power law for MoEs, sqrt(active params × total params), you get the approximate cognitive capability of an MoE model. Knowledge can be much higher, but for logic, reasoning, language, tool use, etc., that's the ballpark. A dense model of this size sits between the ~300b "flash" MoE class and the 1T-param class in those terms, yet can be run on a single consumer or workstation GPU.
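As a quick illustration of that rule of thumb (a sketch only; the MoE configs below are hypothetical round numbers, not the specs of any particular model):

```python
# Geometric-mean rule of thumb from the post:
# dense-equivalent ≈ sqrt(active_params × total_params).
from math import sqrt

def dense_equivalent(active_b: float, total_b: float) -> float:
    """Approximate dense-equivalent size, in billions of parameters."""
    return sqrt(active_b * total_b)

# Hypothetical configs, chosen only to bracket a 128b dense model.
for name, active_b, total_b in [
    ("~300b 'flash'-class MoE", 30, 300),
    ("1T-class MoE", 32, 1000),
]:
    print(f"{name}: ~{dense_equivalent(active_b, total_b):.0f}b dense-equivalent")
```

That puts the flash-class MoE around ~95b dense-equivalent and the 1T-class around ~179b, so a 128b dense model lands between them, as described above.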

Dense models of this size are WAY more within reach of individuals, versus businesses or data centers. Although admittedly, to run these well you'll still likely be spending a few bob.

> ... can be run on a single consumer or workstation GPU ...

I'm curious: what single consumer or workstation GPU can load a model of this size and still generate tokens at a tolerable speed (e.g., input >= 500 t/s and output >= 15 t/s)?

> I'm curious: what single consumer or workstation GPU can load a model of this size and still generate tokens at a tolerable speed (e.g., input >= 500 t/s and output >= 15 t/s)?

Not him, but to answer your question, basically exclusively a Blackwell Pro 6000, which is not far out of reach for most people.
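For a rough sanity check on that (a back-of-the-envelope sketch; the bandwidth and bits-per-weight figures below are assumptions, not measurements): dense decoding is usually memory-bandwidth bound, since every generated token has to read all the weights once.

```python
# Bandwidth-bound decode ceiling: output t/s ≈ memory bandwidth / weight bytes.
PARAMS = 128e9            # ~128b dense model, per the thread
BITS_PER_WEIGHT = 4.5     # assumed Q4-class quantization
BANDWIDTH_GBPS = 1790     # assumed ~1.79 TB/s for a Blackwell Pro 6000

weight_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # ≈ 72 GB of weights
ceiling_tps = BANDWIDTH_GBPS / weight_gb        # ≈ 25 t/s upper bound
print(f"~{weight_gb:.0f} GB weights -> ~{ceiling_tps:.0f} t/s decode ceiling")
```

On those assumptions it clears the >= 15 t/s output bar in the question, at least in theory.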

I reach 30 t/s on Q4 with just 3090s and ik_llama.

> I reach 30 t/s on Q4 with just 3090s and ik_llama.

How is that possible? Even Q1 takes 35 GB+ of memory?

> I reach 30 t/s on Q4 with just 3090s and ik_llama.

> How is that possible? Even Q1 takes 35 GB+ of memory?

The 's' in there likely refers to multiple cards: a Q4 quant of a ~128b model is roughly 70 GB of weights, which would fit across three 24 GB 3090s.

> ... can be run on a single consumer or workstation GPU ...

> I'm curious: what single consumer or workstation GPU can load a model of this size and still generate tokens at a tolerable speed (e.g., input >= 500 t/s and output >= 15 t/s)?

Likely anything with maybe 40 GB and up will do, which could mean a 5000, a 6000, or a couple of 3090s strapped together. Generally a lot more achievable than trying to load a 1T MoE. The knowledge will be lower, but in theory (not obviously in practice right now), logic, coding and the like could be similar.
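To put rough numbers on that (a sketch only; the bits-per-weight values are approximations for common quant families, and KV cache plus activation overhead are ignored):

```python
# Approximate weight-only VRAM for a ~128b dense model at common quant levels.
PARAMS = 128e9

for quant, bpw in [("Q8", 8.5), ("Q4", 4.5), ("Q2", 2.6), ("Q1-class", 2.2)]:
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant}: ~{gb:.0f} GB of weights")
```

On those numbers, Q4 sits around 70 GB (hence multiple 3090s or a 96 GB card), while the ~40 GB figure corresponds to a Q2-class quant.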
