Fix YARN mscale_all_dim for long-context (mirror upstream Mistral fix)
Change text_config.rope_parameters.mscale_all_dim from 1.0 to 0.0 to match the upstream mistralai/Mistral-Medium-3.5-128B repo (commit c4be198050, 2026-05-01).
The original value of 1.0 caused HF transformers' _compute_yarn_parameters to evaluate get_mscale(64, 1) / get_mscale(64, 1) = 1.0, silently disabling YARN attention scaling. With mscale_all_dim = 0.0, the value is falsy and routing falls through to the else branch, which returns 1 + 0.1 * ln(64) ≈ 1.4159, matching vLLM and the YARN paper.
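For reference, a minimal sketch of that routing (function name and formula follow transformers' modeling_rope_utils, but the signature is simplified here; the real _compute_yarn_parameters reads these values from the config object):

```python
import math

def get_mscale(scale: float, mscale: float = 1.0) -> float:
    # YARN magnitude-scaling term, as in transformers' modeling_rope_utils.
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

def yarn_attention_factor(factor: float, mscale: float, mscale_all_dim: float) -> float:
    # A truthy mscale_all_dim selects the ratio branch; a falsy value (0.0)
    # falls through to the plain get_mscale(factor) default.
    if mscale and mscale_all_dim:
        return get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dim)
    return get_mscale(factor)

print(yarn_attention_factor(64.0, mscale=1.0, mscale_all_dim=1.0))  # 1.0 (old config: scaling disabled)
print(yarn_attention_factor(64.0, mscale=1.0, mscale_all_dim=0.0))  # ~1.4159 (fixed config)
```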
This fixes long-context generation degeneration (repetition loops past ~600-800 tokens) under HF transformers and in pre-existing llama.cpp GGUF builds. params.json apply_scale=true was already correct and is unchanged.
config.json (+1 -1)

```diff
@@ -42,7 +42,7 @@
     "factor": 64.0,
     "llama_4_scaling_beta": 0,
     "mscale": 1.0,
-    "mscale_all_dim": 1.0,
+    "mscale_all_dim": 0.0,
     "original_max_position_embeddings": 4096,
     "rope_theta": 1000000.0,
     "rope_type": "yarn",
```
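To sanity-check the fix, the attention factor can be recomputed from the patched config. A hedged sketch: it assumes a transformers version exposing ROPE_INIT_FUNCTIONS in modeling_rope_utils (and that it accepts this config's rope-parameter layout), and the repo id is taken from the commit message above; substitute the repo this commit actually lands in.

```python
from transformers import AutoConfig
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS

# Repo id from the commit message above; adjust to the actual repo.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-Medium-3.5-128B")
text_cfg = getattr(cfg, "text_config", cfg)  # multimodal configs nest the LM config

# ROPE_INIT_FUNCTIONS["yarn"] is transformers' _compute_yarn_parameters;
# it returns (inv_freq, attention_factor).
inv_freq, attention_factor = ROPE_INIT_FUNCTIONS["yarn"](text_cfg, device="cpu")
print(attention_factor)  # ~1.4159 after this fix; 1.0 with the old config
```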