Reply: Any plans on releasing FlashHead for Qwen3.5 models?
Yes! FlashHead-enabled Qwen3.5 models are coming soon. We are currently finalizing accuracy and latency evaluations.
Article: How to Build a vLLM Plugin: A Guide to the general_plugins Entry Point (3 days ago)
FlashHead: Fast LM Head Inference, Now a Simple vLLM Plugin

flash-head replaces the dense LM head with a two-stage retrieval pipeline, giving up to 2x inference speedup, training-free. Previously this required custom Docker images; now it's just:

```shell
pip install flash-head
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
```

The plugin activates automatically via vLLM's `vllm.general_plugins` entry point. No source patches, no custom imports.

Supported models (full collection): Qwen Qwen3, meta-llama Llama3, google Gemma3, nvidia Cosmos-Reason2, in BF16 and W4A16 variants.
https://huggingface.co/collections/embedl/flashhead
Benchmarks: embedl/Edge-Inference-Benchmarks

Benchmark it yourself:

```shell
vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison (FlashHead disabled)
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
```

FlashHead shines at low batch sizes, the typical real-time / on-device use case.
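For readers curious what "activates automatically via the entry point" looks like, here is a minimal sketch of a vLLM general plugin, assuming the standard entry-point mechanism from vLLM's plugin docs. The package name, environment variable, and function below are hypothetical placeholders, not flash-head's actual code.

```python
# Sketch of a minimal vLLM general plugin (hypothetical names throughout).
#
# In the plugin package's pyproject.toml you would declare:
#
#   [project.entry-points."vllm.general_plugins"]
#   my_plugin = "my_plugin:register"
#
# vLLM discovers every entry point in the "vllm.general_plugins" group at
# startup and calls it with no arguments, so `register` is the place to
# patch in replacement components such as a custom LM head.

import os


def register() -> None:
    """Called once by vLLM at startup; takes no arguments, returns nothing."""
    # Respect an opt-out switch, mirroring the FLASHHEAD_ENABLED=0
    # baseline comparison shown in the benchmark commands above.
    if os.environ.get("MY_PLUGIN_ENABLED", "1") == "0":
        return
    # A real plugin would register or monkey-patch its components here,
    # e.g. swapping the dense LM head for a retrieval-based one.
    print("my_plugin: registered")
```

Because activation happens inside `register`, users never import the plugin themselves; installing the package is enough for vLLM to pick it up.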
FlashHead Collection: Efficient Drop-In Replacement for the Classification Head in Language Model Inference. https://github.com/embedl/flash-head · 30 items · Updated 3 days ago