JonnaMat posted an update 4 days ago
⚑ FlashHead: Fast LM Head Inference - Now a Simple vLLM Plugin

flash-head replaces the dense LM head with a two-stage retrieval pipeline, giving up to a 2x inference speedup, training-free. Previously this required custom Docker images; now it's just:

pip install flash-head                                                                                                              
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
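To make the "two-stage retrieval pipeline" idea concrete, here is a toy sketch of the general technique, not FlashHead's actual implementation: stage one uses a coarse index (here, naive cluster centroids) to retrieve a small candidate set of token IDs, and stage two computes exact logits only over those candidates instead of the full vocabulary. All names and numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 64, 8                # toy vocab size, hidden dim, cluster count
W = rng.standard_normal((V, d))      # dense LM-head weight matrix (V x d)

# Stage-1 index: assign each vocab row to a cluster, keep one centroid each.
labels = rng.integers(0, k, size=V)  # stand-in for a real clustering
centroids = np.stack([W[labels == c].mean(axis=0) for c in range(k)])

def two_stage_head(h, n_probe=2):
    """Stage 1: pick the n_probe best clusters by centroid score.
    Stage 2: exact logits only over the rows in those clusters."""
    top_clusters = np.argsort(centroids @ h)[-n_probe:]
    cand = np.flatnonzero(np.isin(labels, top_clusters))
    return int(cand[np.argmax(W[cand] @ h)])

h = rng.standard_normal(d)           # hidden state for one decode step
approx_top1 = two_stage_head(h)      # scores roughly V/4 rows instead of all V
dense_top1 = int(np.argmax(W @ h))   # reference: full dense head
```

With `n_probe` equal to the cluster count, every row is scored and the result matches the dense argmax exactly; smaller `n_probe` trades a little recall for touching far fewer weight rows per token.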


✨ The plugin activates automatically via vLLM's vllm.general_plugins entry point. No source patches, no custom imports.

🧩 Supported models (full collection): Qwen/Qwen3, meta-llama/Llama3, google/Gemma3, nvidia/Cosmos-Reason2 - BF16 and W4A16 variants.
https://huggingface.co/collections/embedl/flashhead

πŸ“Š embedl/Edge-Inference-Benchmarks

πŸ”§ Benchmark it yourself:

vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison                     
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1


FlashHead shines at low batch sizes - the typical real-time / on-device use case. πŸš€
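Back-of-envelope arithmetic (illustrative numbers, not measured figures from the post) on why the head matters at batch size 1: at roughly Qwen3-1.7B scale, every decoded token must stream the entire V x d head matrix through memory, while a candidate-set head touches only a small slice of it.

```python
# Per-token weight bytes read by a dense LM head vs. a two-stage head
# that scores only a small candidate set. All numbers are approximate.
hidden, vocab = 2048, 151_936        # roughly Qwen3-1.7B-scale dimensions
bytes_per_weight = 2                 # BF16

dense_bytes = vocab * hidden * bytes_per_weight       # full V x d matvec
candidates = 4_096                   # hypothetical stage-2 candidate count
retrieval_bytes = candidates * hidden * bytes_per_weight

print(dense_bytes / 1e6, retrieval_bytes / 1e6)       # MB per decoded token
```

At batch size 1 this matvec is memory-bound, so cutting the bytes read by an order of magnitude translates almost directly into decode latency, whereas at large batch sizes the head cost is amortized across requests.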