8-kv-heads

#17
by ArthurZ HF Staff - opened
No description provided.

@ArthurZ are you going to land this soon?

@ArthurZ I'm waiting on this as well.

ArthurZ changed pull request status to open
ArthurZ changed pull request status to merged

Can you explain the precise rationale for this change? The duplicated-heads configuration existed because a 405B model at bf16 isn't loadable on 8 GPUs on any hardware we knew of. Is the intended use case one where the weights are loaded and then dynamically quantized, so that this configuration gives faster and more efficient loads since the duplicate heads aren't needed?
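To make the trade-off concrete, here is a rough back-of-the-envelope sketch of how the KV head count affects the size of the K/V projection weights. The function name is made up for illustration, and the shapes (`hidden_size=16384`, `head_dim=128`, 126 layers) are assumed from the published Llama 3.1 405B config; this ignores the KV cache itself, which shrinks by the same factor.

```python
def kv_proj_bytes(hidden_size, head_dim, num_kv_heads, num_layers, bytes_per_param=2):
    """Approximate total bytes of the k_proj + v_proj weights across all layers.

    Each of k_proj and v_proj maps hidden_size -> num_kv_heads * head_dim,
    so duplicating KV heads (8 -> 16) doubles this footprint.
    """
    per_layer = 2 * hidden_size * num_kv_heads * head_dim * bytes_per_param
    return per_layer * num_layers

# Assumed Llama 3.1 405B shapes: hidden 16384, head_dim 128, 126 layers, bf16.
H, D, L = 16384, 128, 126
print(kv_proj_bytes(H, D, 8, L) / 2**30)   # 8 KV heads: ~7.9 GiB
print(kv_proj_bytes(H, D, 16, L) / 2**30)  # duplicated to 16: ~15.8 GiB
```

A few GiB of projection weights is small next to the ~810 GiB of bf16 parameters overall, so the duplication mainly matters for how the heads shard across tensor-parallel ranks, not raw capacity.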

@ArthurZ ,

Can you please explain why this change was made? It is causing OOM errors: 405B-Instruct no longer loads across 8 devices.

