8GB VRAM Local LLMs - Practitioner Tested
Hands-on practitioner benchmarks of small and mid-size open-source LLMs on consumer 8GB VRAM hardware (RTX 4060 Ti).
Text Generation • 4B • Updated • 11.5k • 138
Note: 80.7 t/s on RTX 4060 Ti 8GB. Fastest small model in my 5-model test set. NVIDIA's edge-ready open release, positioned for gaming NPCs, voice assistants, and IoT. Solid technical accuracy in the MoE explainer test.
google/gemma-4-E4B-it
Any-to-Any • 8B • Updated • 5.66M • 980
Note: 68.5 t/s, 6.0GB VRAM. Fast and accurate. Heavier than its 4B name suggests (the E variants activate only a subset of parameters per token, but the full weights still sit in VRAM). Lowest TTFT in my test (0.26s).
ibm-granite/granite-4.1-8b
Text Generation • 9B • Updated • 34.2k • 168
Note: 49.1 t/s, 5.3GB VRAM. Best instruction-follower in my 5-model test. Hit the length target exactly and gave the cleanest answer of the five. IBM's "matches our previous 32B MoE" claim is credible from this sample. Practitioner pick for accuracy.
Qwen/Qwen3.5-9B
Image-Text-to-Text • 10B • Updated • 8.28M • 1.43k
Note: 44.2 t/s, 6.2GB VRAM. Reference baseline. The only model in my test that specifically called out the memory-bandwidth bottleneck on consumer GPUs (a rare insight at this size class). Multimodal-capable.
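That bandwidth point is easy to sanity-check with back-of-the-envelope arithmetic. A rough sketch, assuming the RTX 4060 Ti 8GB's ~288 GB/s memory bandwidth and treating single-stream decode as memory-bound (each generated token streams the resident weights once); the 6.2GB figure is reused from the note above:

```python
# Rough single-stream decode ceiling on a memory-bound GPU.
# Assumption: RTX 4060 Ti 8GB ~ 288 GB/s; decode cost ~ one full read of resident weights per token.
bandwidth_gb_s = 288.0   # RTX 4060 Ti 8GB spec-sheet bandwidth
resident_gb = 6.2        # VRAM footprint reported for Qwen3.5-9B above

ceiling_tok_s = bandwidth_gb_s / resident_gb
print(f"decode ceiling ~ {ceiling_tok_s:.1f} tok/s")  # ~46 tok/s; the measured 44.2 t/s sits just under it
```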
microsoft/Phi-4-mini-instruct
Text Generation • Updated • 1.53M • 737
Note: Tested 2026-05-06: 88.92 tok/sec on RTX 4060 Ti 8GB, Q4_K_M GGUF (lmstudio-community quant), 16K context, full GPU offload, LM Studio. Coherent prose, on-topic, slight length-budget overshoot but no factual fabrication. New leader in this catalog at the dense-Q4 8GB tier.
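For anyone reproducing these numbers, LM Studio exposes an OpenAI-compatible local server, so TTFT and decode speed can be measured with a short script. A minimal sketch, assuming the default endpoint http://localhost:1234/v1 and a placeholder model id (use whatever name LM Studio actually lists):

```python
import time
from openai import OpenAI  # pip install openai; works against any OpenAI-compatible server

# LM Studio's local server default; the api_key is ignored by LM Studio but required by the client.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="phi-4-mini-instruct-q4_k_m",  # placeholder id, not necessarily the exact LM Studio name
    messages=[{"role": "user", "content": "Explain MoE routing in about 200 words."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1
end = time.perf_counter()

print(f"TTFT: {first_token_at - start:.2f}s")
# Chunk count only approximates token count; exact totals come from the server's usage stats.
print(f"~{n_chunks / (end - first_token_at):.1f} tok/s decode")
```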
mistralai/Ministral-3-8B-Instruct-2512
9B • Updated • 132k • 169
Note: Tested 2026-05-06: 48.47 tok/sec on RTX 4060 Ti 8GB, Q4_K_M GGUF (lmstudio-community quant), 16K context. Cleaner instruction-following than llama-3.3-8b at similar speed; 4x fewer tokens for the same answer means much better wall-clock time per useful output.
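To make the wall-clock point concrete, a tiny illustrative calculation (the token counts below are made up to reflect the 4x observation, not measured values):

```python
# Wall-clock per useful answer at a fixed decode speed.
decode_tok_s = 48.47     # measured above
concise_tokens = 150     # illustrative only
verbose_tokens = 600     # ~4x more tokens for the same information
print(f"concise: {concise_tokens / decode_tok_s:.1f}s, verbose: {verbose_tokens / decode_tok_s:.1f}s")
# ~3.1s vs ~12.4s of generation time for equivalent content.
```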
Qwen/Qwen3.6-35B-A3B
Image-Text-to-Text • 36B • Updated • 3.86M • 1.74k
Note: 35 tok/s with partial offload (-ncmoe 30, 32K ctx, llama-server). Full offload: 7.4 tok/s (32GB RAM ceiling). See the dataset below.
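The partial-offload result is a llama-server configuration choice rather than anything exotic. A minimal launch sketch via Python, assuming a local llama.cpp build, a placeholder GGUF filename, and the -ncmoe 30 / 32K-context settings from the note (the -ngl and port values are just common defaults, not from the note):

```python
import subprocess

# Start llama-server with the partial-offload settings described above.
# -ncmoe keeps 30 MoE expert blocks on the CPU so the shared/attention weights fit in 8GB VRAM.
subprocess.run([
    "./llama-server",
    "--model", "Qwen3.6-35B-A3B-Q4_K_M.gguf",  # placeholder filename
    "-ncmoe", "30",     # expert blocks kept on CPU, per the note
    "-c", "32768",      # 32K context
    "-ngl", "99",       # offload everything else to the GPU
    "--port", "8080",
])
```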
witcheer/rtx-4060ti-8gb-turboquant-bench-2026-05
Viewer • Updated • 18 • 1
Note: turboquant KV cache benchmarks + context checkpoint discovery.
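KV-cache size is why context length is the pressure point at 8GB, so a rough sizing formula is worth keeping at hand. A sketch using the standard 2 x layers x kv-heads x head-dim x bytes x context estimate; the layer/head numbers below are placeholders for a model roughly in the 8-9B class, not any specific config from this collection:

```python
# Approximate KV-cache footprint in GB: K and V, per layer, per KV head, per context position.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2.0):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt / 1024**3

# Placeholder config: 36 layers, 8 KV heads (GQA), head_dim 128.
print(f"fp16 KV, 16K ctx: {kv_cache_gb(36, 8, 128, 16_384):.2f} GB")       # ~2.25 GB
print(f"fp16 KV, 32K ctx: {kv_cache_gb(36, 8, 128, 32_768):.2f} GB")       # ~4.50 GB
print(f"~q4 KV, 32K ctx:  {kv_cache_gb(36, 8, 128, 32_768, 0.5):.2f} GB")  # ~1.13 GB
```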