
vllm serve

#2
by ParisXu - opened

Hi,
Thanks for open-sourcing!
Can I use "vllm serve" to launch this model? And if so, how?

Embedl org

Hi @ParisXu , Thanks for your interest!

Yes! We just updated the workflow to make this easier. Install the flash-head vLLM plugin and you're good to go:

```shell
pip install flash-head
vllm serve embedl/Qwen3-0.6B-FlashHead
```

FlashHead activates automatically via vLLM's plugin system. You can also use it from Python:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="embedl/Qwen3-0.6B-FlashHead", trust_remote_code=True)
output = llm.generate(["Write a haiku about coffee."], SamplingParams(max_tokens=128))
print(output[0].outputs[0].text)
```

Requires Python 3.10+ and vLLM >= 0.14.0. The README has been updated with the new instructions.
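Once the server is running, `vllm serve` exposes an OpenAI-compatible HTTP API (on port 8000 by default, per vLLM's documentation), so you can also query it over HTTP. A minimal sketch using only the standard library — the URL and port here assume the default serve configuration, so adjust them if you serve elsewhere:

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/completions payload for the served model.

    The model name must match the one passed to `vllm serve`.
    """
    return {
        "model": "embedl/Qwen3-0.6B-FlashHead",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

def query(prompt: str, url: str = "http://localhost:8000/v1/completions") -> str:
    """POST the payload to the running vLLM server and return the completion text."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:8000/v1`) works the same way.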


Thanks!
So if I want to apply FlashHead to a new model, I simply need to:

1. set '_FLASHHEAD_ARCHITECTURES',
2. set the model's config.json, and
3. include flash_head_assets.
Embedl org


There is a growing HF collection of FlashHead-enabled models at https://huggingface.co/collections/embedl/flashhead

As of now, creating new FlashHead models requires running the clustering step on the model's lm_head weight matrix to generate the flash_head_assets (centroids + cluster assignments). This tooling is not public at the moment.
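For intuition only, a toy version of such a clustering step could be plain k-means over the rows of the lm_head weight matrix, yielding centroids plus per-token cluster assignments. The actual tooling is not public, so everything below (function name, algorithm choice, asset layout) is a hypothetical sketch, not the real pipeline:

```python
import numpy as np

def cluster_lm_head(weight: np.ndarray, n_clusters: int,
                    n_iters: int = 20, seed: int = 0):
    """Toy k-means over the rows of an lm_head weight matrix (vocab x dim).

    Returns (centroids, assignments) — the kind of artifacts the
    flash_head_assets are described as containing. Hypothetical sketch;
    the real (non-public) tooling surely differs.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen vocabulary rows.
    centroids = weight[rng.choice(len(weight), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each vocabulary row to its nearest centroid (squared L2).
        dists = ((weight[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assignments = dists.argmin(axis=1)
        # Update each centroid as the mean of the rows assigned to it.
        for k in range(n_clusters):
            mask = assignments == k
            if mask.any():
                centroids[k] = weight[mask].mean(axis=0)
    return centroids, assignments
```

In practice the resulting arrays would then be serialized (e.g. as safetensors) alongside the checkpoint so the plugin can load them at serve time.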

What models are you interested in running with FlashHead? We can look into adding them to the collection.


I am attempting to reproduce the algorithm described in your paper with a new base model recently trained at our company. I have successfully generated the clustering_cache.safetensors and clustering_config.json files; how do I now integrate this new model into the FlashHead framework?
