Reasoning degeneration on structured logic problems (ProverQA-style)

#15
by jeffchanpm - opened

Hi, thanks for sharing this model!

I tested the Q4_K_M quantization on structured formal logic benchmarks (ProverQA from Logical
Phase Transitions) and found that the model's chain-of-thought degenerates into infinite
repetition loops on medium-to-hard problems.

Setup

  • Quantization: Q4_K_M (from this repo)
  • Inference: llama.cpp (llama-server), --n-predict 8192
  • Hardware: NVIDIA DGX Spark (GB10)
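For reference, the serving setup above can be sketched as follows (the GGUF filename and port are placeholders, not the actual paths I used):

```shell
# Hypothetical reproduction of the setup; model filename and port are placeholders.
./llama-server -m ./model-Q4_K_M.gguf \
  --n-predict 8192 \
  --port 8080
```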

Observations

The model works correctly on simple logic (1-3 hop syllogisms), but degenerates when given
multi-step formal logic premises with distractors, typical of ProverQA structure.

| Complexity | Reasoning tokens | Degenerate? | Correct? |
|---|---|---|---|
| 1-hop ("A is B. B is C. Is A C?") | ~247 | No | Yes |
| 2-hop (transitive chain) | ~297 | No | Yes |
| 3-hop + distractors | ~231 | No | Yes |
| 5-entity + negation | ~365 | No | Yes |
| ProverQA Medium-like (multi-rule chain) | ~1577 | Yes | No |
| ProverQA Hard-like (10+ premises, distractors) | ~1577 | Yes | No |

Degeneration pattern

The reasoning_content starts normally but degenerates into repeated ,, ,, ,, ,, tokens around
1500 tokens in, never producing a final answer in content. Example tail of reasoning output:

,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,,

finish_reason is always length (token budget exhausted on degenerate reasoning), and content
is empty.
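To flag these runs automatically, I used a quick heuristic along these lines (a sketch; the filler token, window size, and threshold are arbitrary choices of mine, not from any library):

```python
# Sketch of a loop-detection heuristic for degenerate reasoning output.
# Assumes `reasoning` holds the reasoning_content string from the API response.
def is_degenerate(reasoning: str, token: str = ",,", threshold: int = 20) -> bool:
    """Flag outputs whose tail is dominated by a repeated filler token."""
    tail = reasoning.split()[-200:]  # last ~200 whitespace-separated pieces
    return tail.count(token) >= threshold

# A tail like the one shown above trips the check:
sample_tail = "Step 4: from the contrapositive we get ... " + ",, " * 30
print(is_degenerate(sample_tail))  # True
```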

Reproduction

Minimal prompt that triggers the issue:

Premises:
Legend is not competitive. Legend is well-trained. If Legend has strong muscles
or an athletic build, then he is competitive. Legend does not have an athletic
build. Legend either eats nutritious food or has strong muscles. If Legend is
well-trained, then Legend eats nutritious food.

Hypothesis: Legend has strong muscles.
Is this True, False, or Unknown? One word.

This requires ~5 reasoning steps with one contrapositive inference. The base Qwen3.5-27B
handles this correctly.
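For what it's worth, the expected verdict can be checked mechanically by enumerating truth assignments. A minimal brute-force sketch (variable names are mine; I read "either ... or" as inclusive or, though the verdict is the same under an exclusive reading):

```python
from itertools import product

# Variables, in order: competitive, well_trained, strong_muscles,
# athletic_build, eats_nutritious. For booleans, p <= q encodes p -> q.
def premises(c, w, s, a, e):
    return (not c                  # Legend is not competitive
            and w                  # Legend is well-trained
            and ((s or a) <= c)    # strong muscles or athletic build -> competitive
            and not a              # no athletic build
            and (e or s)           # eats nutritious food or has strong muscles
            and (w <= e))          # well-trained -> eats nutritious food

models = [v for v in product([False, True], repeat=5) if premises(*v)]
hyp = [s for (c, w, s, a, e) in models]  # truth of "strong muscles" per model

if all(hyp):
    verdict = "True"
elif not any(hyp):
    verdict = "False"
else:
    verdict = "Unknown"

print(verdict)  # every satisfying model has strong_muscles = False, so "False"
```

The contrapositive step shows up as `(s or a) <= c` combined with `not c`, which forces `s` to be false in every satisfying model.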

Possible cause

The distillation training data (~3950 samples) appears to focus on general reasoning, coding,
and math. The model may lack exposure to formal logic with multiple premises, distractors,
and negation chains, causing the CoT to fail on this reasoning structure.

I share that sentiment; this version often gets stuck in loops on complex reasoning tasks. For instance, if you ask the model the final problem from AIME 2025 II, it will almost certainly get stuck, whereas the original Qwen 3.5 does not.

"There are exactly three positive real numbers $ k $ such that the function $ f(x) = \frac{(x - 18)(x - 72)(x - 98)(x - k)}{x} $ defined over the positive real numbers achieves its minimum value at exactly two positive real numbers $ x $. Find the sum of these three values of $ k $."

I've observed the same behavior, but on the v2 version of this distillation.

Same issue here.