
vllm serve

#2
by ParisXu - opened

Hi,
Thanks for open-sourcing!
Can I use "vllm serve" to launch this model? And if so, how?

Embedl org

Hi @ParisXu , Thanks for your interest!

Yes! We just updated the workflow to make this easier. Install the flash-head vLLM plugin and you're good to go:

```shell
pip install flash-head
vllm serve embedl/Qwen3-0.6B-FlashHead
```

FlashHead activates automatically via vLLM's plugin system. You can also use it from Python:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="embedl/Qwen3-0.6B-FlashHead", trust_remote_code=True)
output = llm.generate(["Write a haiku about coffee."], SamplingParams(max_tokens=128))
print(output[0].outputs[0].text)
```

Requires Python 3.10+ and vLLM >= 0.14.0. The README has been updated with the new instructions.
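Once the server is running, `vllm serve` exposes an OpenAI-compatible HTTP API (on port 8000 by default, per vLLM's documentation), so you can also query it over HTTP. A minimal sketch using only the standard library — the URL and port here assume the default serve configuration, so adjust them if you serve elsewhere:

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/completions payload for the served model.

    The model name must match the one passed to `vllm serve`.
    """
    return {
        "model": "embedl/Qwen3-0.6B-FlashHead",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

def query(prompt: str, url: str = "http://localhost:8000/v1/completions") -> str:
    """POST the payload to the running vLLM server and return the completion text."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:8000/v1`) works the same way.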


Thanks!
So if I want to apply FlashHead to a new model, I simply need to:

1. set '_FLASHHEAD_ARCHITECTURES',
2. set the model's config.json, and
3. include flash_head_assets.
Embedl org


There is a growing HF collection of FlashHead-enabled models at https://huggingface.co/collections/embedl/flashhead

As of now, creating new FlashHead models requires running the clustering step on the model's lm_head weight matrix to generate the flash_head_assets (centroids + cluster assignments). This tooling is not public at the moment.
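For intuition only, a toy version of such a clustering step could be plain k-means over the rows of the lm_head weight matrix, yielding centroids plus per-token cluster assignments. The actual tooling is not public, so everything below (function name, algorithm choice, asset layout) is a hypothetical sketch, not the real pipeline:

```python
import numpy as np

def cluster_lm_head(weight: np.ndarray, n_clusters: int,
                    n_iters: int = 20, seed: int = 0):
    """Toy k-means over the rows of an lm_head weight matrix (vocab x dim).

    Returns (centroids, assignments) — the kind of artifacts the
    flash_head_assets are described as containing. Hypothetical sketch;
    the real (non-public) tooling surely differs.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen vocabulary rows.
    centroids = weight[rng.choice(len(weight), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each vocabulary row to its nearest centroid (squared L2).
        dists = ((weight[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assignments = dists.argmin(axis=1)
        # Update each centroid as the mean of the rows assigned to it.
        for k in range(n_clusters):
            mask = assignments == k
            if mask.any():
                centroids[k] = weight[mask].mean(axis=0)
    return centroids, assignments
```

In practice the resulting arrays would then be serialized (e.g. as safetensors) alongside the checkpoint so the plugin can load them at serve time.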

What models are you interested in running with FlashHead? We can look into adding them to the collection.


I am attempting to reproduce the algorithm described in your paper with a new base model recently trained at our company. I have successfully generated the clustering_cache.safetensors and clustering_config.json files; how do I now integrate this new model into the FlashHead framework?
