Any plans on releasing FlashHead for Qwen3.5 models?
```
pip install flash-head
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
```

FlashHead is picked up automatically through the `vllm.general_plugins` entry point. No source patches, no custom imports.

```
vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
```