Perhaps the best model for a 16GB GPU at long context.
I had been ignoring this 15B lineup for a while, playing with larger models only. But faced with a chat that went past 16K that I didn't want to summarize, I went looking for alternatives that could spare me the long-ass processing time that comes with offloading.
At Q5_K_M, this model fits fully into VRAM with 28K context. By offloading only the odd-numbered ffn_up tensors, that stretches to 32K; speed drops by half, but it's still usable. Importantly, the model stays coherent at this context size. Surprisingly enough, I got MUCH better writing quality when I was accidentally using the DanChat-2 context format (from Dan's Personality Engine 1.3.0), though then it obviously has trouble stopping the stream.
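For anyone wanting to replicate the partial offload: a minimal sketch, assuming you're running llama.cpp and its `--override-tensor` (`-ot`) flag, which takes a `regex=device` pair. The model filename and layer numbering here are hypothetical; adjust to your GGUF.

```shell
# Sketch of the ffn_up offload trick (llama.cpp, hypothetical paths):
#
#   llama-server -m ./model-Q5_K_M.gguf -ngl 99 -c 32768 \
#     -ot 'blk\.[0-9]*[13579]\.ffn_up.*=CPU'
#
# The regex sends the ffn_up weights of every layer whose index ends in
# an odd digit (i.e. the odd-numbered layers) to CPU; everything else
# stays on the GPU. You can sanity-check what the pattern matches
# against tensor names before committing to a long run:
printf 'blk.12.ffn_up.weight\nblk.13.ffn_up.weight\n' \
  | grep -E 'blk\.[0-9]*[13579]\.ffn_up'
```

The grep prints only `blk.13.ffn_up.weight`, confirming even layers are left alone.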
Perhaps it doesn't rival the likes of Big-Tiger-Gemma 27B (which I actually really like as a hybrid between story-adventure and character-driven RP), but it holds up in its own way. It can distinguish between multiple characters' perspectives, and it's not too riddled with slop.
I briefly tried v3 as well, since I'd read some comments claiming it was better, but in my case I found it a lot less coherent than v4, at least at this long context (>20K).