This is great with Roo Code!
#6
by zekromVale - opened
I love Qwen_Qwen3.5-35B-A3B-Q5_K_M.gguf on my RTX 5070 Ti; I get around 40 TPS. This model is very good at coding via Roo Code. (Other models I tried were either unreliable or did not call tools correctly, and Continue has issues with deleting code since it is cloud-LLM-first.) It can even spawn subprocesses to handle different tasks, and with 131K context it has no issues with large files. I am tempted to make an IQ5 version of it. Thank you! It can still have some issues, like failing to edit a document or getting stuck in loops; I need to tell it not to think as much and instead think out loud.
Here are the settings I use:
Shared settings
```ini
# [Shared Settings] - These apply to all models unless overridden
[*]
flash-attn = true
mlock = true
no-warmup = true
fit = off
parallel = 1
```
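If you want to reproduce the shared settings without a config file, they map roughly onto `llama-server` CLI flags. This is a sketch, assuming a recent llama.cpp build: exact flag spellings (e.g. `--flash-attn` vs. `--flash-attn on`) vary between versions, I've left out `fit` since I'm not sure of its CLI equivalent, and the model path/values are pulled from the Qwen-35B-Plus section below.

```shell
# Sketch: direct llama-server launch mirroring the [*] shared settings,
# combined with the Qwen-35B-Plus per-model values.
llama-server \
  --model /models/Qwen_Qwen3.5-35B-A3B-Q5_K_M.gguf \
  --flash-attn \
  --mlock \
  --no-warmup \
  --parallel 1 \
  --n-gpu-layers -1 \
  --n-cpu-moe 20 \
  --threads 12 \
  --ctx-size 131072
```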
A bit slower, but very good
```ini
[Qwen-35B-Plus]
# Path to your GGUF
model = /models/Qwen_Qwen3.5-35B-A3B-Q5_K_M.gguf
# Performance & Hardware
n-gpu-layers = -1
threads = 12
# n-cpu-moe pulls MoE experts to CPU to save VRAM.
n-cpu-moe = 20
# Context & Cache
ctx-size = 131072
batch-size = 2048
ubatch-size = 512
cache-type-k = q8_0
cache-type-v = q8_0
# Samplers (Qwen 3.5 specific sweet spots)
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.05
repeat-penalty = 1.05
presence-penalty = 0.0
```
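To sanity-check a model section once it's loaded, you can hit the server's OpenAI-compatible endpoint; per-request sampler fields override the server defaults. A sketch, assuming the default port 8080 and that the section name doubles as the model id (both may differ in your setup):

```shell
# Quick smoke test against the running server (port/model id are assumptions).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen-35B-Plus",
    "messages": [{"role": "user", "content": "Write a Python one-liner to reverse a string."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "min_p": 0.05,
    "max_tokens": 128
  }'
```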
Faster, but not as bright
```ini
[Qwen-35B]
# Path to your GGUF
model = /models/Qwen_Qwen3.5-35B-A3B-IQ4_NL.gguf
# Performance & Hardware
n-gpu-layers = -1
threads = 12
# n-cpu-moe pulls MoE experts to CPU to save VRAM.
n-cpu-moe = 14
# Context & Cache
ctx-size = 131072
batch-size = 2048
ubatch-size = 512
cache-type-k = q4_0
cache-type-v = q4_0
# Samplers (Qwen 3.5 specific sweet spots)
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.05
repeat-penalty = 1.05
presence-penalty = 0.0
```
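The q8_0 → q4_0 KV-cache switch in the faster profile roughly halves cache VRAM at full context. A back-of-the-envelope estimator (the layer/head numbers below are assumptions for illustration, not read from this GGUF; the per-value byte costs come from the GGML block formats, where q8_0 packs 32 values into 34 bytes and q4_0 into 18):

```python
# Back-of-the-envelope KV-cache size at full context for a GQA transformer.
# GGML block sizes: q8_0 = 34 bytes / 32 values, q4_0 = 18 bytes / 32 values,
# f16 = 2 bytes per value.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Total K+V cache size in GiB."""
    values_per_token = n_layers * n_kv_heads * head_dim
    total_bytes = ctx * values_per_token * (
        BYTES_PER_VALUE[k_type] + BYTES_PER_VALUE[v_type]
    )
    return total_bytes / 2**30

# Hypothetical dims for illustration: 48 layers, 4 KV heads, head_dim 128.
print(kv_cache_gib(131072, 48, 4, 128, "q8_0", "q8_0"))  # 6.375 GiB
print(kv_cache_gib(131072, 48, 4, 128, "q4_0", "q4_0"))  # 3.375 GiB
```

Under these assumed dims, q4_0 KV saves about 3 GiB at 131K context, which is why the faster profile can afford a lower `n-cpu-moe`.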