Failure at 32k context
Hello,
I have a simple needle-in-a-haystack test script. I serve the models with llama-server (from llama.cpp) and run the script against the local endpoint.
Everything works fine from 4k to 16k context, but at 32k the model returns only `?????`. This applies to all GGUF versions, including BF16, from both IBM and Unsloth.
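For reference, the script is essentially the following sketch. The endpoint URL, model filename, and filler text are placeholders, and the filler count is only a rough proxy for token count; llama-server's OpenAI-compatible `/v1/chat/completions` route is assumed.

```python
import json
import urllib.request

def build_niah_prompt(needle: str, filler_sentences: int) -> str:
    """Bury a 'needle' fact in the middle of repetitive filler text."""
    filler = "The sky was clear and the grass was green. " * filler_sentences
    half = len(filler) // 2
    return (
        filler[:half]
        + f" {needle} "
        + filler[half:]
        + "\nWhat is the secret number? Answer with the number only."
    )

def ask(prompt: str,
        url: str = "http://127.0.0.1:8080/v1/chat/completions") -> str:
    """Send the prompt to a local llama-server endpoint."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(url, body,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # ~32k tokens at roughly 10 tokens per filler sentence (rough estimate)
    prompt = build_niah_prompt("The secret number is 7421.", 3200)
    print(ask(prompt))
```

At 16k worth of filler the models answer correctly; at 32k the hybrid 1b GGUFs return the `?????` output described above.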
The problem does NOT happen with Granite-1b (non-hybrid) using the same script and the same llama-server. It also does NOT happen when running Granite-4h-1b in Transformers with the same prompt. Finally, I tried an IBM GGUF of Granite-4.0-h-micro, and the problem did not happen there either.
So this is specifically a problem with the GGUF versions of Granite-4-h-1b, and with all of them. A fine-tuned version I made and converted to GGUF with the standard tools has the same problem.
From the llama-server output it appears the GGUFs carry RoPE metadata, even though the model is officially NoPE. The RoPE dimension count in the GGUFs is also set to 128, which is probably wrong (it is 64 in the micro), but setting it to 64 in the 1b did not change anything at all. Removing the RoPE metadata completely stops the model from responding.