Bro tip: the MMLU dataset for llama.cpp is pretty bad; you can use https://huggingface.co/datasets/Green-Sky/mmlu-redux-2.0-for-llama.cpp/blob/main/mmlu-redux-2-ok%2Bexpert.bin instead. The data is higher quality (based on MMLU-Redux) AND the context is better: I give the model all the choices and then let it decide by answering with the letter A, B, C or D.
While looking at the original MMLU conversion for llama.cpp, I noticed that some answers are things like "both a and c" or similar, so a model that was never fed all the choices in the first place has no way to rate them as probable.
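To make the idea concrete, here is a rough Python sketch of what "give it all choices, then let it pick a letter" could look like. It is only an illustration: the exact prompt layout used in the .bin file may differ, and the names `format_question`, `candidate_completions` and `refers_to_other_choices` are made up for this example. The last helper is a crude heuristic for spotting answers such as "both a and c" that only make sense when the other choices are visible.

```python
import re

LETTERS = "ABCD"

def format_question(question: str, choices: list[str]) -> str:
    """Render a question with every choice visible, labelled A to D.

    The layout is an assumption for illustration, not the dataset's
    actual format.
    """
    if len(choices) != len(LETTERS):
        raise ValueError("expected exactly four choices")
    lines = [question.strip()]
    lines += [f"{letter}. {choice.strip()}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def candidate_completions() -> list[str]:
    """The only continuations that get scored: one letter per choice."""
    return [f" {letter}" for letter in LETTERS]

# Crude heuristic: matches "both a and c", "neither (b) nor (d)",
# "all of the above", "none of the above". It will miss other phrasings.
_CROSS_REF = re.compile(
    r"\b(both|neither)\s+\(?[a-d]\)?\s+(and|nor)\s+\(?[a-d]\)?\b"
    r"|\b(all|none)\s+of\s+the\s+above\b",
    re.IGNORECASE,
)

def refers_to_other_choices(choice: str) -> bool:
    """True if an answer only makes sense when the other choices are shown."""
    return bool(_CROSS_REF.search(choice))
```

With this layout, an answer like "Both A and C" is harmless because A and C are right there in the prompt. If each answer text is scored on its own as a continuation of the question, the same answer is meaningless to the model.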