New Old Llamas
Since this is not an official release, nobody is quite sure how to configure this model, so the next day a challenger appeared: https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K has the same weights but a different config.json, with the 128k RoPE settings from Llama-3.3-70B applied.
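For reference, the relevant part of the 128K variant's config.json is the `rope_scaling` block that the official Llama-3.3-70B config ships with (values below are from my recollection of the 70B release and may differ slightly from the leak):

```json
{
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "rope_type": "llama3",
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192
  }
}
```

The non-extended leak simply omits this block, leaving the model at its native context length.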
And if all that still isn't enough, the final piece of the puzzle is https://huggingface.co/DavidAU/Llama3.3-8B-Instruct-Thinking-Claude-4.5-Opus-High-Reasoning, a fine-tune of the non-RoPE'd version on some thinking traces that turns it into a hybrid model.
What is going on here? How do these models compare to the original Llama 3.1 8B baselines? Are 250 samples of high-quality thinking really enough to create a reasoning model?
Let's apply the https://reasonscape.com/ methodology and see if we can't get some answers to these questions.
The Contenders
Llama-3.1-8B (FP16) - Meta's Llama 3.1 8B Instruct model in full precision (FP16). This serves as the baseline Llama model for the 3.x series.
DeepSeek-R1-Distill-Llama-8B (FP16) - DeepSeek's R1 reasoning model distilled into the Llama 8B architecture.
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) - NVIDIA's Nemotron Nano 4B model, based on a distillation of Llama 3.1 8B into a 4.5-billion parameter model.
Llama-3.1-Nemotron-Nano-8B-v1 (FP16) - NVIDIA's Nemotron Nano 8B model, the larger sibling to the 4B version.
Llama-3.3-8B (FP16) - The 3.3 leak.
Llama-3.3-8B-128K (FP16) - The 3.3 128k extended-context leak.
Llama-3.3-8B-Thinking (FP16) - A community fine-tune of 3.3 trained on Claude 4.5 Opus reasoning outputs.
This suite of tasks is difficult: instruction-tuned models are asked to Think Step-by-Step.
The Lay of the Land
All tests are performed at 8k context, so a 'truncation' refers specifically to any response over 8192 tokens.
A few things jump out immediately, so let's dive right in.
None of these models can do Basic Math
The two lowest-ranking tasks are Arithmetic and Letters, both of which require iterative, multi-step arithmetic. What's a little surprising is that the R1 fine-tune wasn't able to pick this up, while the Nemotron strategy actually put up a decent fight.
Brackets are Universally Hard
This is an out-of-domain and tokenization stress test, but the failure mode of https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 is somewhat interesting:
The Nemotron-Nano-4B model actually has a small zone where it kinda gets it, but collapses to 0 outside of it, while R1 struggles fairly uniformly.
Boolean Excellence
Both 3.3 variants are somewhat unexpectedly sitting at the top of the Boolean charts, outperforming even the reasoning fine-tunes! They also post the lowest truncation rates of the cohort, and the surfaces show the trouble comes when the problem is formatted with "T/F" (vs ON/OFF or True/False). This kind of format sensitivity is quite common and implies that while the model is capable, it didn't fully generalize the task across input formats.
Distracted Object Counting
All three of the 3.3 models are in a statistically inseparable three-way tie for first place on the Objects counting task. They "get it" and truncation rates are really low, but as you can see from the surfaces the performance breakdown with difficulty is exponential, so the actual competence zone is small.
Word Sorting
Sorting shows a very similar pattern to object counting: the 128k version drops just a little below the confidence band, while the 3.3 leak is statistically tied with the R1 fine-tune (but with far fewer tokens consumed).
Resource Utilization
| Rank | Model | Tier | ReasonScore | Avg Tokens | Score/Token |
|---|---|---|---|---|---|
| 1 | Llama-3.3-8B (FP16) | easy | 312 | 324 | 0.96 |
| 2 | Llama-3.3-8B-128K (FP16) | easy | 295 | 313 | 0.94 |
| 3 | Llama-3.1-8B (FP16) | easy | 177 | 366 | 0.48 |
| 4 | Llama-3.3-8B-Thinking (FP16) | easy | 152 | 425 | 0.36 |
| 5 | R1-Distill-Llama-8B (FP16) | easy | 457 | 1693 | 0.27 |
| 6 | Llama-3.1-Nemotron-Nano-8B (FP16) | easy | 146 | 1191 | 0.12 |
| 7 | Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) | easy | 318 | 3125 | 0.10 |
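The Score/Token column is simply ReasonScore divided by average tokens consumed; a quick sketch reproducing it from the numbers in the table above:

```python
# (ReasonScore, Avg Tokens) pairs copied from the resource utilization table.
rows = [
    ("Llama-3.3-8B", 312, 324),
    ("Llama-3.3-8B-128K", 295, 313),
    ("Llama-3.1-8B", 177, 366),
    ("Llama-3.3-8B-Thinking", 152, 425),
    ("R1-Distill-Llama-8B", 457, 1693),
    ("Llama-3.1-Nemotron-Nano-8B", 146, 1191),
    ("Llama-3.1-Nemotron-Nano-4B-v1.1", 318, 3125),
]

# Score/Token = ReasonScore / Avg Tokens, rounded to two decimals.
efficiency = {name: round(score / tokens, 2) for name, score, tokens in rows}

# Sorting by efficiency reproduces the table's ranking.
for name, eff in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {eff}")
```

Note how R1-Distill's chart-topping raw score (457) is diluted by its 1693-token average into a mid-pack efficiency of 0.27.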
Reasoning models burn a lot of tokens; this is not news to anyone, I think, but exactly how many tokens you burn and how much value you get from them can vary widely.
Meta-Rank
As an alternative perspective, we can combine the individual task analyses into a "ranking of ranks". Note that the letters after a rank refer to resource-utilization sub-clusters, so 2a vs 2b means performance was the same but 2a used significantly fewer tokens to achieve it.
| rank | eval_label | arithmetic | boolean | brackets | cars | dates | letters | movies | objects | sequence | shapes | shuffle | sort | total_penalty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R1-Distill-Llama-8B (FP16) | 3b | 3 | 1 | 2 | 1 | 1a | 3b | 5 | 1a | 1 | 1 | 1c | 24.5 |
| 2 | Llama-3.3-8B (FP16) | 3a | 1 | 4a | 1 | 3 | 3b | 1 | 1a | 2 | 3a | 3a | 1b | 27 |
| 3 | Llama-3.3-8B-128K (FP16) | 5a | 2 | 3a | 1 | 3 | 3b | 1 | 1a | 2 | 2a | 3a | 2b | 29 |
| 4 | Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) | 2 | 5b | 2 | 3c | 2 | 1b | 4 | 3b | 1b | 2c | 3c | 2d | 34 |
| 5 | Llama-3.3-8B-Thinking (FP16) | 4a | 4a | 3c | 1 | 3 | 2a | 5b | 1b | 3a | 5 | 3b | 1a | 37 |
| 6 | Llama-3.1-Nemotron-Nano-8B (FP16) | 1 | 4b | 3b | 3b | 5 | 3a | 5a | 2c | 4 | 2b | 2 | 4 | 40.5 |
| 7 | Llama-3.1-8B (FP16) | 6 | 5a | 5c | 3a | 4 | 2b | 2 | 4 | 3b | 4 | 4 | 3 | 46.5 |
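A minimal sketch of the ranking-of-ranks idea (my own illustration, not ReasonScape's exact scoring): strip the sub-cluster letter from each cell and sum the numeric ranks per model. The totals here won't match the table's total_penalty column exactly, since that also accounts for ties and token-efficiency sub-clusters, but the ordering comes out the same for these two models:

```python
import re

# Per-task rank strings for two models, copied from the meta-rank table.
# The letter suffix marks a token-efficiency sub-cluster within the same rank.
ranks = {
    "R1-Distill-Llama-8B": ["3b", "3", "1", "2", "1", "1a", "3b", "5", "1a", "1", "1", "1c"],
    "Llama-3.1-8B":        ["6", "5a", "5c", "3a", "4", "2b", "2", "4", "3b", "4", "4", "3"],
}

def total_rank(cells):
    # Keep only the numeric part of each cell ("3b" -> 3) and sum across tasks.
    return sum(int(re.match(r"\d+", cell).group()) for cell in cells)

totals = {model: total_rank(cells) for model, cells in ranks.items()}
```

Lower is better: even this crude sum places R1-Distill well ahead of the 3.1 baseline, matching the table.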
In terms of raw pass@k accuracy, at the expense of all else, R1-Distill-Llama-8B is the winner.
Sneaking in right behind it, with an average of 5x fewer tokens, are the 3.3 leaks. It's difficult to separate the performance of the original vs the 128K model, as most tasks are tied, but aggregating across tasks reveals a small overall advantage for the non-extended version.
Closing Thoughts
3.1 vs 3.3
These models are the real deal.
They demonstrate excellent across-the-board performance improvements over the 3.1 baseline with fewer truncations; on the tasks they are capable of, they even give R1 a run for its money, and they sit at the top of the resource-efficiency ranking.
3.3 vs 3.3-128k
There is very little difference between the 128k RoPE version and the original, but keep in mind my tests generated an average of under 400 tokens per response, so it's quite possible this simply wasn't deep enough to observe any difference.
The primary challenge here is that scaling task difficulty doesn't make these models generate longer traces. The entropy plots clearly show they are unable to 'scale their thinking': they apply the same effort to easy tasks as to hard ones, so when the problems get difficult they underthink and get the answers wrong:
Let's call this one a practical failure of the methodology rather than a concrete result.
3.3 vs 3.3-Thinking
The 3.3 Claude fine-tune is cute and fun to play with. It extends average response lengths by ~30%, but in terms of reasoning performance the massive boost in truncations cancels out the gains.
Methodology
If you'd like to learn more about the ReasonScape methodology and tools that helped me produce this analysis, all my evaluation data and tools are open-source and can be found on GitHub.
Worried about open-source evaluations causing contamination? Dive into the Understanding ReasonScape documentation to understand how I use controlled, deterministically random difficulty manifolds to enable effectively infinite test-case generation.
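The core trick is that a seed plus a point on the difficulty manifold fully determines a test case, so cases are generated on demand rather than shipped as a static (and therefore leakable) dataset. A toy sketch of the idea, not ReasonScape's actual generator:

```python
import random

def make_arithmetic_case(seed: int, difficulty: int) -> tuple[str, int]:
    """Deterministically generate an arithmetic problem.

    The same (seed, difficulty) pair always yields the same case, so there
    is no fixed dataset that can leak into training corpora, yet the case
    space is effectively infinite as the seed varies.
    """
    # Derive a per-case RNG from the seed and the difficulty coordinate.
    rng = random.Random(seed * 100003 + difficulty)
    # Difficulty controls both term count and magnitude.
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    question = " + ".join(map(str, terms))
    return question, sum(terms)

# Regenerating with the same coordinates reproduces the identical case.
q1, a1 = make_arithmetic_case(seed=42, difficulty=2)
q2, a2 = make_arithmetic_case(seed=42, difficulty=2)
```

The function name and the seed-mixing constant are illustrative assumptions; the real generators live in the ReasonScape repository.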
I will publish the raw data from this sub-dataset (dubbed m12x-llama8b) this weekend, but please reach out if you'd like to explore it further.
If you've got Questions or Feedback, please leave them in the comments section below.