JonnaMat posted an update 4 days ago
⚑ FlashHead: Fast LM Head Inference - Now a Simple vLLM Plugin

flash-head replaces the dense LM head with a two-stage retrieval pipeline, giving up to a 2x inference speedup, training-free. Previously this required custom Docker images; now it's just:

pip install flash-head                                                                                                              
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
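To make the "two-stage retrieval pipeline" idea concrete, here is a toy sketch of the general technique, not FlashHead's actual implementation: stage one uses a coarse index (here, naive cluster centroids) to retrieve a small candidate set of token IDs, and stage two computes exact logits only over those candidates instead of the full vocabulary. All names and numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 64, 8                # toy vocab size, hidden dim, cluster count
W = rng.standard_normal((V, d))      # dense LM-head weight matrix (V x d)

# Stage-1 index: assign each vocab row to a cluster, keep one centroid each.
labels = rng.integers(0, k, size=V)  # stand-in for a real clustering
centroids = np.stack([W[labels == c].mean(axis=0) for c in range(k)])

def two_stage_head(h, n_probe=2):
    """Stage 1: pick the n_probe best clusters by centroid score.
    Stage 2: exact logits only over the rows in those clusters."""
    top_clusters = np.argsort(centroids @ h)[-n_probe:]
    cand = np.flatnonzero(np.isin(labels, top_clusters))
    return int(cand[np.argmax(W[cand] @ h)])

h = rng.standard_normal(d)           # hidden state for one decode step
approx_top1 = two_stage_head(h)      # scores roughly V/4 rows instead of all V
dense_top1 = int(np.argmax(W @ h))   # reference: full dense head
```

With `n_probe` equal to the cluster count, every row is scored and the result matches the dense argmax exactly; smaller `n_probe` trades a little recall for touching far fewer weight rows per token.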


✨ The plugin activates automatically via vLLM's vllm.general_plugins entry point. No source patches, no custom imports.

🧩 Supported models (full collection): Qwen/Qwen3, meta-llama/Llama3, google/Gemma3, nvidia/Cosmos-Reason2 - BF16 and W4A16 variants.
https://huggingface.co/collections/embedl/flashhead

πŸ“Š embedl/Edge-Inference-Benchmarks

πŸ”§ Benchmark it yourself:

vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison                     
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1


FlashHead shines at low batch sizes - the typical real-time / on-device use case. πŸš€
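Back-of-envelope arithmetic (illustrative numbers, not measured figures from the post) on why the head matters at batch size 1: at roughly Qwen3-1.7B scale, every decoded token must stream the entire V x d head matrix through memory, while a candidate-set head touches only a small slice of it.

```python
# Per-token weight bytes read by a dense LM head vs. a two-stage head
# that scores only a small candidate set. All numbers are approximate.
hidden, vocab = 2048, 151_936        # roughly Qwen3-1.7B-scale dimensions
bytes_per_weight = 2                 # BF16

dense_bytes = vocab * hidden * bytes_per_weight       # full V x d matvec
candidates = 4_096                   # hypothetical stage-2 candidate count
retrieval_bytes = candidates * hidden * bytes_per_weight

print(dense_bytes / 1e6, retrieval_bytes / 1e6)       # MB per decoded token
```

At batch size 1 this matvec is memory-bound, so cutting the bytes read by an order of magnitude translates almost directly into decode latency, whereas at large batch sizes the head cost is amortized across requests.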