Reasoning degeneration on structured logic problems (ProverQA-style)
Hi, thanks for sharing this model!
I tested the Q4_K_M quantization on structured formal logic benchmarks (ProverQA from Logical
Phase Transitions) and found that the model's chain-of-thought degenerates into infinite
repetition loops on medium-to-hard problems.
### Setup
- Quantization: Q4_K_M (from this repo)
- Inference: llama.cpp (llama-server), --n-predict 8192
- Hardware: NVIDIA DGX Spark (GB10)
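For reference, the probes went through llama-server's OpenAI-compatible endpoint. A minimal sketch of how the requests were built, assuming the default port and the standard `/v1/chat/completions` path (the prompt here is a placeholder, not the actual benchmark input):

```python
import json
import urllib.request

def build_request(prompt, n_predict=8192,
                  url="http://localhost:8080/v1/chat/completions"):
    """Build a chat-completion request; max_tokens mirrors --n-predict."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Premises: ... Hypothesis: ... One word.")
print(json.loads(req.data)["max_tokens"])  # → 8192
```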
### Observations
The model works correctly on simple logic (1-3 hop syllogisms), but degenerates when given
multi-step formal logic premises with distractors, typical of ProverQA's structure.
| Complexity | Reasoning tokens | Degenerate? | Correct? |
|---|---|---|---|
| 1-hop ("A is B. B is C. Is A C?") | ~247 | No | Yes |
| 2-hop (transitive chain) | ~297 | No | Yes |
| 3-hop + distractors | ~231 | No | Yes |
| 5-entity + negation | ~365 | No | Yes |
| ProverQA Medium-like (multi-rule chain) | ~1577 | Yes | No |
| ProverQA Hard-like (10+ premises, distractors) | ~1577 | Yes | No |
### Degeneration pattern
The `reasoning_content` starts normally but degenerates into repeated `,,` tokens around 1500 tokens in, never producing a final answer in `content`. Example tail of the reasoning output:

```
,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,,
```

`finish_reason` is always `length` (the token budget is exhausted on degenerate reasoning), and `content` is empty.
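For anyone reproducing this, a simple heuristic I used to flag degenerate runs automatically (the function name and thresholds are my own choices, not from any existing tool):

```python
def is_degenerate(text, token=",,", window=50, threshold=0.8):
    """Heuristic loop detector: True when the tail of `text` consists
    mostly of the repeated token. Window/threshold are arbitrary."""
    tail = text.split()[-window:]
    if not tail:
        return False
    return sum(t == token for t in tail) / len(tail) >= threshold

print(is_degenerate("Let me check premise 1... " + ",, " * 60))  # → True
```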
### Reproduction
Minimal prompt that triggers the issue:
```
Premises:
Legend is not competitive. Legend is well-trained. If Legend has strong muscles
or an athletic build, then he is competitive. Legend does not have an athletic
build. Legend either eats nutritious food or has strong muscles. If Legend is
well-trained, then Legend eats nutritious food.
Hypothesis: Legend has strong muscles.
Is this True, False, or Unknown? One word.
```
This requires ~5 reasoning steps with one contrapositive inference. The base Qwen3.5-27B
handles this correctly.
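For what it's worth, the expected answer can be machine-checked by brute-force truth-table enumeration (the atom names are my own shorthand, not from ProverQA):

```python
from itertools import product

# Propositional atoms: c = competitive, w = well-trained,
# m = strong muscles, a = athletic build, n = eats nutritious food
def premises(c, w, m, a, n):
    return (not c                       # Legend is not competitive
            and w                       # Legend is well-trained
            and ((not (m or a)) or c)   # (muscles or build) -> competitive
            and not a                   # no athletic build
            and (n or m)                # nutritious food or strong muscles
            and ((not w) or n))         # well-trained -> nutritious food

models = [v for v in product([False, True], repeat=5) if premises(*v)]
# Is the hypothesis "Legend has strong muscles" false in every model?
print(all(not m for (c, w, m, a, n) in models), len(models))  # → True 1
```

Every satisfying assignment has `m = False`, so the expected one-word answer is False, reached via the contrapositive of the third premise.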
### Possible cause
The distillation training data (~3950 samples) appears to focus on general reasoning, coding,
and math. The model may lack exposure to formal logic with multiple premises, distractors,
and negation chains, causing the CoT to fail on this reasoning structure.
I share that sentiment; this version often gets stuck in loops on complex reasoning tasks. For instance, if you ask the model the final problem from AIME 2025 II, it will almost certainly get stuck, whereas the original Qwen 3.5 does not:
"There are exactly three positive real numbers $ k $ such that the function $ f(x) = \frac{(x - 18)(x - 72)(x - 98)(x - k)}{x} $ defined over the positive real numbers achieves its minimum value at exactly two positive real numbers $ x $. Find the sum of these three values of $ k $."
I've observed the same behavior, but on the v2 version of this distillation.
Same issue here.