Finally dense 100b+!

#6
by Sliderpro93 - opened

Amazing job, finally someone made a large... "MEDIUM" dense model! Thank you, this is so valuable!

Oh shit... it does have the vision stack. I was wrong and thought they didn't have it.

What's so good about dense?

> What's so good about dense?

More activated parameters. Think of Qwen 3.6 27b, which is a dense model: it's on par with its 122b MoE big brother. Now Mistral has released a 128b dense model. Can't wait to try it out!

> What's so good about dense?

If you apply the rough power law for MoEs, sqrt(active params × total params), you get the approximate cognitive capability of an MoE model. Knowledge can be much higher, but for logic, reasoning, language, tool use, etc., that's the ballpark. A dense model of this size sits between the ~300b "flash" MoE class and the 1T-param class in those terms, yet can be run on a single consumer or workstation GPU.
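As a quick illustration of that rule of thumb (a sketch only; the MoE configs below are hypothetical round numbers, not the specs of any particular model):

```python
# Geometric-mean rule of thumb from the post:
# dense-equivalent ≈ sqrt(active_params × total_params).
from math import sqrt

def dense_equivalent(active_b: float, total_b: float) -> float:
    """Approximate dense-equivalent size, in billions of parameters."""
    return sqrt(active_b * total_b)

# Hypothetical configs, chosen only to bracket a 128b dense model.
for name, active_b, total_b in [
    ("~300b 'flash'-class MoE", 30, 300),
    ("1T-class MoE", 32, 1000),
]:
    print(f"{name}: ~{dense_equivalent(active_b, total_b):.0f}b dense-equivalent")
```

That puts the flash-class MoE around ~95b dense-equivalent and the 1T-class around ~179b, so a 128b dense model lands between them, as described above.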

Dense models of this size are WAY more within reach of individuals, versus businesses or data centers. Although admittedly, to run these well you'll still likely be spending a few bob.

> ... can be run on a single consumer or workstation GPU ...

I'm curious: what single consumer or workstation GPU can load a model of this size and still generate tokens at a tolerable speed (e.g., input >= 500 t/s and output >= 15 t/s)?

> I'm curious: what single consumer or workstation GPU can load a model of this size and still generate tokens at a tolerable speed (e.g., input >= 500 t/s and output >= 15 t/s)?

Not him, but to answer your question, basically exclusively a Blackwell Pro 6000, which is not far out of reach for most people.
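For a rough sanity check on that (a back-of-the-envelope sketch; the bandwidth and bits-per-weight figures below are assumptions, not measurements): dense decoding is usually memory-bandwidth bound, since every generated token has to read all the weights once.

```python
# Bandwidth-bound decode ceiling: output t/s ≈ memory bandwidth / weight bytes.
PARAMS = 128e9            # ~128b dense model, per the thread
BITS_PER_WEIGHT = 4.5     # assumed Q4-class quantization
BANDWIDTH_GBPS = 1790     # assumed ~1.79 TB/s for a Blackwell Pro 6000

weight_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # ≈ 72 GB of weights
ceiling_tps = BANDWIDTH_GBPS / weight_gb        # ≈ 25 t/s upper bound
print(f"~{weight_gb:.0f} GB weights -> ~{ceiling_tps:.0f} t/s decode ceiling")
```

On those assumptions it clears the >= 15 t/s output bar in the question, at least in theory.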

I reach 30 t/s on Q4 with just 3090s and ik_llama.

> I reach 30 t/s on Q4 with just 3090s and ik_llama.

How is that possible? Even Q1 takes 35 GB+ of memory?

> I reach 30 t/s on Q4 with just 3090s and ik_llama.

> How is that possible? Even Q1 takes 35 GB+ of memory?

The 's' in there likely refers to multiple cards: a Q4 quant of a ~128b model is roughly 70 GB of weights, which would fit across three 24 GB 3090s.

> ... can be run on a single consumer or workstation GPU ...

> I'm curious: what single consumer or workstation GPU can load a model of this size and still generate tokens at a tolerable speed (e.g., input >= 500 t/s and output >= 15 t/s)?

Likely anything with maybe 40 GB and up will do, which could mean a 5000, a 6000, or a couple of 3090s strapped together. Generally a lot more achievable than trying to load a 1T MoE. The knowledge will be lower, but in theory (not obviously in practice right now), logic, coding and the like could be similar.
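To put rough numbers on that (a sketch only; the bits-per-weight values are approximations for common quant families, and KV cache plus activation overhead are ignored):

```python
# Approximate weight-only VRAM for a ~128b dense model at common quant levels.
PARAMS = 128e9

for quant, bpw in [("Q8", 8.5), ("Q4", 4.5), ("Q2", 2.6), ("Q1-class", 2.2)]:
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant}: ~{gb:.0f} GB of weights")
```

On those numbers, Q4 sits around 70 GB (hence multiple 3090s or a 96 GB card), while the ~40 GB figure corresponds to a Q2-class quant.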
