new dataset: turboquant KV cache benchmarks for qwen3.6-35B-A3B on RTX 4060 Ti 8GB.
>>> 18 structured runs covering turboquant turbo2/turbo3/turbo4 vs standard q4_0 V cache, context depths 3.5K to 50K, checkpoint modes, and two agent harnesses (hermes vs pi).
>>> novel finding: llama.cpp default context checkpoints (every 8192 tokens, ~63 MiB each) silently accumulate in VRAM and trigger the 8GB cliff at ~26K context. disabling them with `--checkpoint-every-n-tokens -1` gives smooth degradation instead.
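the checkpoint accumulation is easy to sanity-check with back-of-envelope arithmetic. a minimal sketch, assuming the numbers from the post (one ~63 MiB checkpoint per 8192 tokens); the function name is illustrative, not part of llama.cpp:

```python
# sketch of the checkpoint VRAM accumulation described above.
# assumes the post's figures: ~63 MiB per checkpoint, one every 8192 tokens.
CHECKPOINT_INTERVAL_TOKENS = 8192
CHECKPOINT_SIZE_MIB = 63

def checkpoint_overhead_mib(context_tokens: int) -> int:
    """approximate VRAM eaten by accumulated context checkpoints."""
    n_checkpoints = context_tokens // CHECKPOINT_INTERVAL_TOKENS
    return n_checkpoints * CHECKPOINT_SIZE_MIB

for ctx in (8192, 16384, 26000, 50000):
    print(f"{ctx:>6} tokens -> {checkpoint_overhead_mib(ctx)} MiB of checkpoints")
```

by ~26K context three checkpoints (~189 MiB) have piled up, which is plausibly enough lost headroom to tip an already-tight 8GB card over the cliff.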
turbo3, 64K ctx: 35.17 tok/s (62 graph splits)
turbo2, 64K ctx: 35.13 tok/s (62 graph splits)
turbo4, 64K ctx: 13.93 tok/s (decompression cliff)
std q4_0, 64K ctx: 31.22 tok/s (82 splits, broken in agents)

third dataset in the collection. now 3 datasets covering dense, MoE offload, and turboquant benchmarks.
[dataset](witcheer/rtx-4060ti-8gb-turboquant-bench-2026-05) | [collection](witcheer/8gb-vram-local-llms-practitioner-tested-69fa0e855c51e3c15a9d95d4)