Can't properly determine which is greater between 9.9 and 9.11

#38
by sniffski - opened

Guys, they say 3.1 beats GPT-4o... But it doesn't seem that way...
Here's the answer to a simple math question:

> To determine which is greater between 9.9 and 9.11, you can compare the numbers directly.
>
> - 9.11 has a tenths place of 1 and a hundredths place of 1.
> - 9.9 has a tenths place of 9 and no hundredths place (which is equivalent to having a hundredths place of 0).
>
> Since the tenths place in 9.11 is greater than that in 9.9, we can conclude that:
>
> **9.11 is greater than 9.9.**

Also tried with a prompt that makes the model reason first, but still no luck...

<details type="reasoning" done="true" duration="25">
<summary>Thought for 25 seconds</summary>
> 1. The user is asking which number is greater between 9.9 and 9.11.
> 2. Break the task into parts:
>    - Compare the two numbers.
> 3. Assign each part to an expert:
>    - Number comparison: Mathematics Expert.
> 4. Combine their insights:
> 
> Mathematics Expert: To compare the numbers, you need to look at each decimal place from left to right.
> 
> - 9 is equal to 9 in both cases.
> - The tenths place (the first digit after the decimal point) is 9 in the first number and 1 in the second number. Since 1 is less than 9, we don't need to look further.
> 
> So, the second number (9.11) is greater than the first (9.9).
</details>
The greater number between 9.9 and 9.11 is **9.11**.

Actually the reasoning is correct, but at the end it gives the wrong answer...

That's sad but it's true...

> Number Comparison:
> The Mathematics Expert will compare the two numbers.
>
> 9.9 vs. 9.11:
> Both numbers have the same integer part, which is 9.
> Comparing the decimal parts: 0.9 (from 9.9) vs. 0.11 (from 9.11).
> Since 0.11 is greater than 0.9, 9.11 is greater than 9.9.
>
> Conclusion:
> 9.11 is greater than 9.9.
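For reference, padding 9.9 to 9.90 makes the correct digit-by-digit comparison obvious; a quick check with Python's standard `decimal` module:

```python
from decimal import Decimal

# Pad 9.9 to two decimal places: 9.9 == 9.90, and 0.90 > 0.11
a = Decimal("9.9")
b = Decimal("9.11")

print(a > b)  # True: 9.9 is the larger number
```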

I tried a few times and it seems it wrongly converts 0.9 to 0.09... that's why it declares 9.11 is greater.
Asking this question to DeepSeek R1 1.5B, it manages to get the right answer...

this is a tokenizer problem. currently, we tokenize a bunch of numbers in one token.

So, for example, 69 would be read as a single token, which improves efficiency but decreases number understanding.

this is a common thing in almost all LLMs, just like letter counting, which is also largely impossible due to the tokenizer.

so this will not improve unless specifically trained on, which would reduce performance in other areas.

sooooo, use code to determine if a number is larger than another, LLMs are overkill.
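The suggestion above is a one-liner in code; a minimal sketch of delegating the comparison (plain `float` parsing, which handles this case exactly):

```python
def larger(x: str, y: str) -> str:
    # Parse both strings as numbers and return the larger one
    return x if float(x) > float(y) else y

print(larger("9.9", "9.11"))  # 9.9
```

This also sidesteps the classic string-comparison trap where "9.11" > "9.9" lexicographically.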

@Smorty100 Well, actually most modern tokenizers encode digits as single tokens, and this is true for this model too. It takes like 3 lines of code to test:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('unsloth/Mistral-Small-3.1-24B-Instruct-2503-unsloth-bnb-4bit')
# If digits are encoded individually, '9.11' should produce separate tokens for '9', '.', '1', '1'
print(tokenizer.encode('9.11'))
```

Guys, I just tested it on this question, but online, on the OpenRouter website, and no matter how many retries, it always answers correctly. Any ideas?
There are several example questions there, and their version is "9.11 and 9.9, which one is larger?".


Yes, I have tested it as well on OpenRouter, and it is generally quite a bit smarter at Math there πŸ€”

I've just noticed that if it routes requests to the official Mistral API, then it's smarter at Math, and if it routes requests to another provider, then it struggles even with this question.


Maybe we should run it through vLLM for it to be smarter at Math?

> this is a tokenizer problem. currently, we tokenize a bunch of numbers in one token.
>
> So, for example, 69 would be read as a single token, which improves efficiency but decreases number understanding.
>
> this is a common thing in almost all LLMs, just like letter counting, which is also largely impossible due to the tokenizer.
>
> so this will not improve unless specifically trained on, which would reduce performance in other areas.
>
> sooooo, use code to determine if a number is larger than another, LLMs are overkill.

It doesn't look like this is the case... It looks more like a training issue where the model is not trained on decimal numbers. I can make any model a reasoning model, and this one, when reasoning, almost gets the right answer, but at the end it gets confused... Check the attached screenshots:
First attempt (no reasoning):
*(screenshot)*
This one was actually correct...
Second attempt (no reasoning):
*(screenshot)*
This one the model thinks 0.9 equals 0.09...
Third attempt (reasoning):
*(screenshot)*
This one you can see where it got confused...

That's really dumb TBH :D

*(screenshot)*
Definitely something wrong with the training... not a token issue!

This is a well-documented failure mode in LLMs that stems from tokenization and training data distribution rather than arithmetic capability per se. Models like Mistral-Small-3.1-24B tend to pattern-match on version-string or decimal comparisons where "9.11" superficially resembles a larger number because 11 > 9 at the sub-decimal level. The model is essentially doing lexicographic comparison on token sequences rather than true numeric reasoning. You can often probe this by asking the model to first convert both values to fractions or explicitly state the number of decimal places β€” that framing shift forces a different reasoning path and usually produces correct output.

A practical mitigation is to prompt the model to verbalize intermediate steps: "Express 9.9 as 9.90, then compare digit by digit." With Mistral-Small-3.1 specifically, chain-of-thought prompting tends to recover correct behavior in most cases, though it's not guaranteed. If you're building a pipeline that depends on reliable numeric comparison, you're better off routing those operations to a deterministic code execution step rather than trusting the model's direct output β€” this is a case where LLMs genuinely shouldn't be in the hot path for numeric logic.
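A minimal sketch of that deterministic verification step (the helper name is hypothetical, not from any particular library): instead of trusting the model's stated winner, re-derive it with real arithmetic and flag mismatches.

```python
def check_comparison_claim(a: float, b: float, claimed_greater: float) -> bool:
    """Return True only if the model's claimed larger value matches real arithmetic."""
    return claimed_greater == max(a, b)

# The model claimed 9.11 is greater; arithmetic disagrees:
print(check_comparison_claim(9.9, 9.11, 9.11))  # False
print(check_comparison_claim(9.9, 9.11, 9.9))   # True
```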

For those of us building agentic systems, this kind of subtle reasoning failure is actually a trust signal worth tracking. In AgentGraph, we log these as capability boundary violations β€” cases where an agent confidently asserts something that's verifiably wrong β€” because they inform how much weight downstream agents should place on outputs from a given model for specific task types. A model that fails 9.9 vs 9.11 reliably shouldn't be trusted to make ordering decisions in a multi-agent pipeline without a verification layer. Treating model capability as a dynamic, task-specific attribute rather than a binary trust flag is increasingly important as these models get deployed in more complex orchestration contexts.

This is a well-known failure mode that shows up across most LLMs, including Mistral-Small-3.1-24B-Instruct-2503. The root cause is that the model is doing lexicographic/string-based comparison rather than proper numeric parsing β€” it sees "9.11" and pattern-matches to something like a version number or date where 11 > 9, so it concludes 9.11 > 9.9. The tokenization of decimal numbers compounds this: "9.11" gets encoded in a way that the positional magnitude of the fractional part isn't structurally obvious to the model during attention computation. This isn't unique to Mistral; GPT-4, Claude, and others have historically stumbled on this exact case.

A practical mitigation if you're building on top of this model: prompt it to explicitly convert to a common denominator or ask it to reason step-by-step ("express both as fractions with denominator 100"). Chain-of-thought prompting reliably fixes this class of error because it forces the model to surface the intermediate numeric representation before comparing. You can also just route numeric comparison tasks to a code interpreter tool call rather than relying on the LLM's parametric knowledge β€” have it emit Python and execute 9.9 > 9.11 directly. That's the robust solution for production.

One thing worth noting if you're running this in an agentic pipeline: numeric reasoning failures like this can propagate silently when agents are chaining decisions. If you're using something like AgentGraph or similar trust/verification infrastructure, it's worth flagging arithmetic operations as a category that should either be sandboxed to a code execution tool or have its outputs spot-checked before downstream agents consume them. A wrong numeric comparison in step 2 of a 10-step pipeline is the kind of subtle failure that's hard to trace back to source.
