8-kv-heads
#17
by ArthurZ HF Staff - opened
No description provided.
ArthurZ changed pull request status to open
ArthurZ changed pull request status to merged
Can you explain the precise rationale for this change? The reason this configuration existed is that a 405B model at bf16 isn't loadable on 8 GPUs on any hardware we knew of. Is the intended use case one where the weights are loaded and then dynamically quantized, so that this configuration leads to faster and more efficient loads since the duplicate heads aren't needed?
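A rough sketch of the arithmetic behind the question, just to make the constraint concrete. The GPU count and per-GPU capacity below are assumptions for illustration (e.g. an 8x 80 GB node), not figures stated in this PR, and `kv_replication` is a hypothetical helper, not part of any library:

```python
# Assumed setup: 405B params in bf16 (2 bytes/param) on 8 GPUs with 80 GB each.
PARAMS = 405e9
BYTES_PER_PARAM = 2  # bf16

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~810 GB of weights alone
capacity_gb = 8 * 80                          # 640 GB total -- doesn't fit

print(f"weights ~{weights_gb:.0f} GB vs capacity {capacity_gb} GB")

def kv_replication(num_kv_heads: int, tp_degree: int) -> int:
    """Copies of each KV head needed so every tensor-parallel rank has one.

    With 8 KV heads and TP=8 each rank holds exactly one head (no duplication);
    pushing TP above the KV-head count forces each head to be replicated.
    """
    return max(1, -(-tp_degree // num_kv_heads))  # ceiling division

print(kv_replication(8, 8))   # no duplication at TP=8
print(kv_replication(8, 16))  # each KV head replicated twice at TP=16
```

So under these assumed numbers, the bf16 checkpoint alone exceeds an 8-GPU node, which is presumably why the duplicated-heads configuration existed in the first place.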