witcheer posted an update 1 day ago
new dataset: turboquant KV cache benchmarks for qwen3.6-35B-A3B on RTX 4060 Ti 8GB.

>>> 18 structured runs covering turboquant turbo2/turbo3/turbo4 vs standard q4_0 V cache, context depths 3.5K to 50K, checkpoint modes, and two agent harnesses (hermes vs pi).

>>> novel finding: llama.cpp default context checkpoints (every 8192 tokens, ~63 MiB each) silently accumulate in VRAM and trigger the 8GB cliff at ~26K context. disabling with --checkpoint-every-n-tokens -1 gives smooth degradation instead.
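the cliff math is easy to sanity-check. a back-of-the-envelope sketch of how those checkpoints accumulate, using the interval and per-checkpoint size quoted above (function name and the sample context depths are just illustrative):

```python
CHECKPOINT_INTERVAL = 8192   # tokens between llama.cpp context checkpoints (per the post)
CHECKPOINT_SIZE_MIB = 63     # approx size of each checkpoint in MiB (per the post)

def checkpoint_vram_mib(context_tokens: int) -> int:
    """VRAM consumed by silently accumulated context checkpoints, in MiB."""
    return (context_tokens // CHECKPOINT_INTERVAL) * CHECKPOINT_SIZE_MIB

for ctx in (8_192, 16_384, 26_000, 50_000):
    print(f"{ctx:>6} tokens -> ~{checkpoint_vram_mib(ctx)} MiB of checkpoints")
```

at ~26K context that's already ~189 MiB of VRAM gone to checkpoints on top of weights and KV cache, which is enough to tip an 8GB card over the edge.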


turbo3, 64K ctx: 35.17 tok/s (62 graph splits)
turbo2, 64K ctx: 35.13 tok/s (62 graph splits)
turbo4, 64K ctx: 13.93 tok/s (decompression cliff)
std q4_0, 64K ctx: 31.22 tok/s (82 splits, broken in agents)
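for quick comparison, a tiny sketch that ranks those 64K-context runs against the standard q4_0 baseline (throughput numbers copied from above; the dict is just illustrative):

```python
# tok/s at 64K context, from the runs above
results = {
    "turbo3": 35.17,
    "turbo2": 35.13,
    "turbo4": 13.93,
    "std q4_0": 31.22,
}

baseline = results["std q4_0"]
for name, tps in sorted(results.items(), key=lambda kv: -kv[1]):
    delta = (tps / baseline - 1) * 100
    print(f"{name:>9}: {tps:5.2f} tok/s ({delta:+.1f}% vs std q4_0)")
```

turbo2/turbo3 land ~12-13% above the q4_0 baseline, while turbo4's decompression cliff costs over half the baseline throughput.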



third dataset in the collection, which now covers dense, MoE offload, and turboquant benchmarks.

[dataset](witcheer/rtx-4060ti-8gb-turboquant-bench-2026-05) | [collection](witcheer/8gb-vram-local-llms-practitioner-tested-69fa0e855c51e3c15a9d95d4)