new dataset: turboquant KV cache benchmarks for qwen3.6-35B-A3B on RTX 4060 Ti 8GB.
>>> 18 structured runs covering turboquant turbo2/turbo3/turbo4 vs standard q4_0 V cache, context depths 3.5K to 50K, checkpoint modes, and two agent harnesses (hermes vs pi).
>>> novel finding: llama.cpp default context checkpoints (every 8192 tokens, ~63 MiB each) silently accumulate in VRAM and trigger the 8GB cliff at ~26K context. disabling them with `--checkpoint-every-n-tokens -1` gives smooth degradation instead.
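the checkpoint accumulation is easy to sanity-check with back-of-envelope arithmetic. a minimal sketch, assuming the numbers from the post (one ~63 MiB checkpoint per 8192 tokens); the function name is illustrative, not part of llama.cpp:

```python
# sketch of the checkpoint VRAM accumulation described above.
# assumes the post's figures: ~63 MiB per checkpoint, one every 8192 tokens.
CHECKPOINT_INTERVAL_TOKENS = 8192
CHECKPOINT_SIZE_MIB = 63

def checkpoint_overhead_mib(context_tokens: int) -> int:
    """approximate VRAM eaten by accumulated context checkpoints."""
    n_checkpoints = context_tokens // CHECKPOINT_INTERVAL_TOKENS
    return n_checkpoints * CHECKPOINT_SIZE_MIB

for ctx in (8192, 16384, 26000, 50000):
    print(f"{ctx:>6} tokens -> {checkpoint_overhead_mib(ctx)} MiB of checkpoints")
```

by ~26K context three checkpoints (~189 MiB) have piled up, which is plausibly enough lost headroom to tip an already-tight 8GB card over the cliff.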
turbo3, 64K ctx: 35.17 tok/s (62 graph splits)
turbo2, 64K ctx: 35.13 tok/s (62 graph splits)
turbo4, 64K ctx: 13.93 tok/s (decompression cliff)
std q4_0, 64K ctx: 31.22 tok/s (82 splits, broken in agents)

third dataset in the collection. now 3 datasets covering dense, MoE offload, and turboquant benchmarks.
[dataset](witcheer/rtx-4060ti-8gb-turboquant-bench-2026-05) | [collection](witcheer/8gb-vram-local-llms-practitioner-tested-69fa0e855c51e3c15a9d95d4)