8GB VRAM Local LLMs - Practitioner Tested
Hands-on practitioner benchmarks of small and mid-size open-source LLMs on consumer 8GB VRAM hardware (RTX 4060 Ti).
Text Generation • 4B • Updated • 11.5k • 138
Note: 80.7 t/s on RTX 4060 Ti 8GB. Fastest small model in my 5-model test set. NVIDIA's edge-ready open release, positioned for gaming NPCs, voice assistants, and IoT. Solid technical accuracy in the MoE explainer test.
google/gemma-4-E4B-it
Any-to-Any • 8B • Updated • 5.66M • 980
Note: 68.5 t/s, 6.0GB VRAM. Fast and accurate. Heavier than its 4B name suggests (the E variants activate only a subset of parameters per token, but the full weights still sit in VRAM). Lowest TTFT in my test (0.26s).
ibm-granite/granite-4.1-8b
Text Generation • 9B • Updated • 34.2k • 168
Note: 49.1 t/s, 5.3GB VRAM. Best instruction-follower in my 5-model test. Hit the length target exactly and gave the cleanest answer of the five. IBM's "matches our previous 32B MoE" claim is credible from this sample. Practitioner pick for accuracy.
Qwen/Qwen3.5-9B
Image-Text-to-Text • 10B • Updated • 8.28M • 1.43k
Note: 44.2 t/s, 6.2GB VRAM. Reference baseline. The only model in my test that specifically called out the memory-bandwidth bottleneck on consumer GPUs (a rare insight at this size class). Multimodal-capable.
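That bandwidth point is easy to sanity-check with back-of-the-envelope arithmetic. A rough sketch, assuming the RTX 4060 Ti 8GB's ~288 GB/s memory bandwidth and treating single-stream decode as memory-bound (each generated token streams the resident weights once); the 6.2GB figure is reused from the note above:

```python
# Rough single-stream decode ceiling on a memory-bound GPU.
# Assumption: RTX 4060 Ti 8GB ~ 288 GB/s; decode cost ~ one full read of resident weights per token.
bandwidth_gb_s = 288.0   # RTX 4060 Ti 8GB spec-sheet bandwidth
resident_gb = 6.2        # VRAM footprint reported for Qwen3.5-9B above

ceiling_tok_s = bandwidth_gb_s / resident_gb
print(f"decode ceiling ~ {ceiling_tok_s:.1f} tok/s")  # ~46 tok/s; the measured 44.2 t/s sits just under it
```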
microsoft/Phi-4-mini-instruct
Text Generation • Updated • 1.53M • 737
Note: Tested 2026-05-06: 88.92 tok/sec on RTX 4060 Ti 8GB, Q4_K_M GGUF (lmstudio-community quant), 16K context, full GPU offload, LM Studio. Coherent prose, on-topic, slight length-budget overshoot but no factual fabrication. New leader in this catalog at the dense-Q4 8GB tier.
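For anyone reproducing these numbers, LM Studio exposes an OpenAI-compatible local server, so TTFT and decode speed can be measured with a short script. A minimal sketch, assuming the default endpoint http://localhost:1234/v1 and a placeholder model id (use whatever name LM Studio actually lists):

```python
import time
from openai import OpenAI  # pip install openai; works against any OpenAI-compatible server

# LM Studio's local server default; the api_key is ignored by LM Studio but required by the client.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="phi-4-mini-instruct-q4_k_m",  # placeholder id, not necessarily the exact LM Studio name
    messages=[{"role": "user", "content": "Explain MoE routing in about 200 words."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1
end = time.perf_counter()

print(f"TTFT: {first_token_at - start:.2f}s")
# Chunk count only approximates token count; exact totals come from the server's usage stats.
print(f"~{n_chunks / (end - first_token_at):.1f} tok/s decode")
```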
mistralai/Ministral-3-8B-Instruct-2512
9B • Updated • 132k • 169
Note: Tested 2026-05-06: 48.47 tok/sec on RTX 4060 Ti 8GB, Q4_K_M GGUF (lmstudio-community quant), 16K context. Cleaner instruction-following than llama-3.3-8b at similar speed; 4x fewer tokens for the same answer means much better wall-clock time per useful output.
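To make the wall-clock point concrete, a tiny illustrative calculation (the token counts below are made up to reflect the 4x observation, not measured values):

```python
# Wall-clock per useful answer at a fixed decode speed.
decode_tok_s = 48.47     # measured above
concise_tokens = 150     # illustrative only
verbose_tokens = 600     # ~4x more tokens for the same information
print(f"concise: {concise_tokens / decode_tok_s:.1f}s, verbose: {verbose_tokens / decode_tok_s:.1f}s")
# ~3.1s vs ~12.4s of generation time for equivalent content.
```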
Qwen/Qwen3.6-35B-A3B
Image-Text-to-Text • 36B • Updated • 3.86M • 1.74k
Note: 35 tok/s with partial offload (-ncmoe 30, 32K ctx, llama-server). Full offload: 7.4 tok/s (32GB RAM ceiling). See the dataset below.
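The partial-offload result is a llama-server configuration choice rather than anything exotic. A minimal launch sketch via Python, assuming a local llama.cpp build, a placeholder GGUF filename, and the -ncmoe 30 / 32K-context settings from the note (the -ngl and port values are just common defaults, not from the note):

```python
import subprocess

# Start llama-server with the partial-offload settings described above.
# -ncmoe keeps 30 MoE expert blocks on the CPU so the shared/attention weights fit in 8GB VRAM.
subprocess.run([
    "./llama-server",
    "--model", "Qwen3.6-35B-A3B-Q4_K_M.gguf",  # placeholder filename
    "-ncmoe", "30",     # expert blocks kept on CPU, per the note
    "-c", "32768",      # 32K context
    "-ngl", "99",       # offload everything else to the GPU
    "--port", "8080",
])
```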
witcheer/rtx-4060ti-8gb-turboquant-bench-2026-05
Viewer • Updated • 18 • 1
Note: turboquant KV cache benchmarks + context checkpoint discovery.
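KV-cache size is why context length is the pressure point at 8GB, so a rough sizing formula is worth keeping at hand. A sketch using the standard 2 x layers x kv-heads x head-dim x bytes x context estimate; the layer/head numbers below are placeholders for a model roughly in the 8-9B class, not any specific config from this collection:

```python
# Approximate KV-cache footprint in GB: K and V, per layer, per KV head, per context position.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2.0):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt / 1024**3

# Placeholder config: 36 layers, 8 KV heads (GQA), head_dim 128.
print(f"fp16 KV, 16K ctx: {kv_cache_gb(36, 8, 128, 16_384):.2f} GB")       # ~2.25 GB
print(f"fp16 KV, 32K ctx: {kv_cache_gb(36, 8, 128, 32_768):.2f} GB")       # ~4.50 GB
print(f"~q4 KV, 32K ctx:  {kv_cache_gb(36, 8, 128, 32_768, 0.5):.2f} GB")  # ~1.13 GB
```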