Infra • Serving & Optimization - a francescomaiomascio Collection

updated Jan 30

Inference engines, quantization, serving stacks, and perf tooling. Reference list for deployment and latency/cost work.

The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

Paper • 2408.01050 • Published Aug 2, 2024 • 9

Seesaw: High-throughput LLM Inference via Model Re-sharding

Paper • 2503.06433 • Published Mar 9, 2025

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Paper • 2504.08791 • Published Apr 7, 2025 • 140

Evaluation Guidebook

Space • Running • 302
Explore LLM benchmark trends over time

Open VLM Leaderboard

Space • Running on CPU Upgrade • Agents • 1.01k
VLMEvalKit Evaluation Results Collection

Open VLM Video Leaderboard

Space • Running • Agents • Featured • 131
VLMEvalKit Eval Results in video understanding benchmark

Qwen/Qwen2-7B-Instruct-AWQ

Text Generation • 8B • Updated Aug 21, 2024 • 2.9k • 22

TheBloke/CodeLlama-7B-Instruct-AWQ

Text Generation • 7B • Updated Nov 9, 2023 • 151 • 4

TheBloke/Mistral-7B-Instruct-v0.1-AWQ

Text Generation • 7B • Updated Nov 9, 2023 • 362 • 38

TheBloke/Mistral-7B-Instruct-v0.2-AWQ

Text Generation • 7B • Updated Dec 11, 2023 • 422k • 52

Qwen/Qwen2.5-VL-7B-Instruct-AWQ

Image-Text-to-Text • 8B • Updated Apr 6, 2025 • 220k • 99