Reasoning degeneration on structured logic problems (ProverQA-style)
Hi, thanks for sharing this model!
I tested the Q4_K_M quantization on structured formal logic benchmarks (ProverQA from Logical
Phase Transitions) and found that the model's chain-of-thought degenerates into infinite
repetition loops on medium-to-hard problems.
### Setup
- Quantization: Q4_K_M (from this repo)
- Inference: llama.cpp (llama-server), --n-predict 8192
- Hardware: NVIDIA DGX Spark (GB10)
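For reference, the probes went through llama-server's OpenAI-compatible endpoint. A minimal sketch of how the requests were built, assuming the default port and the standard `/v1/chat/completions` path (the prompt here is a placeholder, not the actual benchmark input):

```python
import json
import urllib.request

def build_request(prompt, n_predict=8192,
                  url="http://localhost:8080/v1/chat/completions"):
    """Build a chat-completion request; max_tokens mirrors --n-predict."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Premises: ... Hypothesis: ... One word.")
print(json.loads(req.data)["max_tokens"])  # → 8192
```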
### Observations
The model works correctly on simple logic (1-3 hop syllogisms), but degenerates when given
multi-step formal logic premises with distractors, typical of ProverQA's structure.
| Complexity | Reasoning tokens | Degenerate? | Correct? |
|---|---|---|---|
| 1-hop ("A is B. B is C. Is A C?") | ~247 | No | Yes |
| 2-hop (transitive chain) | ~297 | No | Yes |
| 3-hop + distractors | ~231 | No | Yes |
| 5-entity + negation | ~365 | No | Yes |
| ProverQA Medium-like (multi-rule chain) | ~1577 | Yes | No |
| ProverQA Hard-like (10+ premises, distractors) | ~1577 | Yes | No |
### Degeneration pattern
The `reasoning_content` starts normally but degenerates into repeated `,,` tokens around 1500 tokens in, never producing a final answer in `content`. Example tail of the reasoning output:

```
,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,, ,,
```

`finish_reason` is always `length` (the token budget is exhausted on degenerate reasoning), and `content` is empty.
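For anyone reproducing this, a simple heuristic I used to flag degenerate runs automatically (the function name and thresholds are my own choices, not from any existing tool):

```python
def is_degenerate(text, token=",,", window=50, threshold=0.8):
    """Heuristic loop detector: True when the tail of `text` consists
    mostly of the repeated token. Window/threshold are arbitrary."""
    tail = text.split()[-window:]
    if not tail:
        return False
    return sum(t == token for t in tail) / len(tail) >= threshold

print(is_degenerate("Let me check premise 1... " + ",, " * 60))  # → True
```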
### Reproduction
Minimal prompt that triggers the issue:
```
Premises:
Legend is not competitive. Legend is well-trained. If Legend has strong muscles
or an athletic build, then he is competitive. Legend does not have an athletic
build. Legend either eats nutritious food or has strong muscles. If Legend is
well-trained, then Legend eats nutritious food.
Hypothesis: Legend has strong muscles.
Is this True, False, or Unknown? One word.
```
This requires ~5 reasoning steps with one contrapositive inference. The base Qwen3.5-27B
handles this correctly.
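For what it's worth, the expected answer can be machine-checked by brute-force truth-table enumeration (the atom names are my own shorthand, not from ProverQA):

```python
from itertools import product

# Propositional atoms: c = competitive, w = well-trained,
# m = strong muscles, a = athletic build, n = eats nutritious food
def premises(c, w, m, a, n):
    return (not c                       # Legend is not competitive
            and w                       # Legend is well-trained
            and ((not (m or a)) or c)   # (muscles or build) -> competitive
            and not a                   # no athletic build
            and (n or m)                # nutritious food or strong muscles
            and ((not w) or n))         # well-trained -> nutritious food

models = [v for v in product([False, True], repeat=5) if premises(*v)]
# Is the hypothesis "Legend has strong muscles" false in every model?
print(all(not m for (c, w, m, a, n) in models), len(models))  # → True 1
```

Every satisfying assignment has `m = False`, so the expected one-word answer is False, reached via the contrapositive of the third premise.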
### Possible cause
The distillation training data (~3950 samples) appears to focus on general reasoning, coding,
and math. The model may lack exposure to formal logic with multiple premises, distractors,
and negation chains, causing the CoT to fail on this reasoning structure.
I share that sentiment; this version often gets stuck in loops on complex reasoning tasks. For instance, if you ask the model the final problem from AIME 2025 II, it will almost certainly get stuck, whereas the original Qwen 3.5 does not:
"There are exactly three positive real numbers $ k $ such that the function $ f(x) = \frac{(x - 18)(x - 72)(x - 98)(x - k)}{x} $ defined over the positive real numbers achieves its minimum value at exactly two positive real numbers $ x $. Find the sum of these three values of $ k $."
I've observed the same behavior, but on the v2 version of this distillation.
Same issue here.