Reply: Any plans on releasing FlashHead for Qwen3.5 models?
Yes! FlashHead-enabled Qwen3.5 models are coming soon. We are currently finalizing accuracy and latency evaluations.
Article: How to Build a vLLM Plugin: A Guide to the general_plugins Entry Point (3 days ago)
FlashHead: Fast LM Head Inference, Now a Simple vLLM Plugin

flash-head replaces the dense LM head with a two-stage retrieval pipeline, giving up to 2x inference speedup, training-free. Previously this required custom Docker images; now it's just:

```shell
pip install flash-head
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
```

The plugin activates automatically via vLLM's `vllm.general_plugins` entry point. No source patches, no custom imports.

Supported models (full collection): Qwen Qwen3, meta-llama Llama3, google Gemma3, nvidia Cosmos-Reason2, in BF16 and W4A16 variants.
https://huggingface.co/collections/embedl/flashhead
Benchmarks: embedl/Edge-Inference-Benchmarks

Benchmark it yourself:

```shell
vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison (FlashHead disabled)
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
```

FlashHead shines at low batch sizes, the typical real-time / on-device use case.
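For readers curious what "activates automatically via the entry point" looks like, here is a minimal sketch of a vLLM general plugin, assuming the standard entry-point mechanism from vLLM's plugin docs. The package name, environment variable, and function below are hypothetical placeholders, not flash-head's actual code.

```python
# Sketch of a minimal vLLM general plugin (hypothetical names throughout).
#
# In the plugin package's pyproject.toml you would declare:
#
#   [project.entry-points."vllm.general_plugins"]
#   my_plugin = "my_plugin:register"
#
# vLLM discovers every entry point in the "vllm.general_plugins" group at
# startup and calls it with no arguments, so `register` is the place to
# patch in replacement components such as a custom LM head.

import os


def register() -> None:
    """Called once by vLLM at startup; takes no arguments, returns nothing."""
    # Respect an opt-out switch, mirroring the FLASHHEAD_ENABLED=0
    # baseline comparison shown in the benchmark commands above.
    if os.environ.get("MY_PLUGIN_ENABLED", "1") == "0":
        return
    # A real plugin would register or monkey-patch its components here,
    # e.g. swapping the dense LM head for a retrieval-based one.
    print("my_plugin: registered")
```

Because activation happens inside `register`, users never import the plugin themselves; installing the package is enough for vLLM to pick it up.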
FlashHead Collection: Efficient Drop-In Replacement for the Classification Head in Language Model Inference. https://github.com/embedl/flash-head · 30 items · Updated 3 days ago