New Old Llamas
Since this is not an official release, nobody is quite sure how to configure this model, so the next day a challenger appeared: https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K has the same weights but a different config.json, with the 128k RoPE settings from Llama-3.3-70B applied.
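For reference, the relevant part of the 128K variant's config.json is the `rope_scaling` block that the official Llama-3.3-70B config ships with (values below are from my recollection of the 70B release and may differ slightly from the leak):

```json
{
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "rope_type": "llama3",
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192
  }
}
```

The non-extended leak simply omits this block, leaving the model at its native context length.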
And if all that still isn't enough, the final piece of the puzzle is https://huggingface.co/DavidAU/Llama3.3-8B-Instruct-Thinking-Claude-4.5-Opus-High-Reasoning, a fine-tune of the non-RoPE'd version on some thinking traces that turns it into a hybrid model.
What is going on here? How do these models compare to the original Llama 3.1 8B baselines? Are 250 samples of high-quality thinking really enough to create a reasoning model?
Let's apply the https://reasonscape.com/ methodology and see if we can't get some answers to these questions.
The Contenders
Llama-3.1-8B (FP16) - Meta's Llama 3.1 8B Instruct model in full precision (FP16). This serves as the baseline Llama model for the 3.x series.
DeepSeek-R1-Distill-Llama-8B (FP16) - DeepSeek's R1 reasoning model distilled into the Llama 8B architecture.
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) - NVIDIA's Nemotron Nano 4B model, based on a distillation of Llama 3.1 8B into a 4.5-billion parameter model.
Llama-3.1-Nemotron-Nano-8B-v1 (FP16) - NVIDIA's Nemotron Nano 8B model, the larger sibling to the 4B version.
Llama-3.3-8B (FP16) - The 3.3 leak.
Llama-3.3-8B-128K (FP16) - The 3.3 128k extended-context leak.
Llama-3.3-8B-Thinking (FP16) - A community fine-tune of 3.3 trained on Claude 4.5 Opus reasoning outputs.
This suite of tasks is difficult: instruction-tuned models are asked to Think Step-by-Step.
The Lay of the Land
All tests are performed at 8k context, so a 'truncation' refers specifically to any response over 8192 tokens.
A few things jump out immediately, so let's dive right in.
None of these models can do Basic Math
The two lowest-ranking tasks are Arithmetic and Letters, both of which require iterative, multi-step arithmetic. What's a little surprising is that the R1 fine-tune wasn't able to pick this up, while the Nemotron strategy actually put up a decent fight.
Brackets are Universally Hard
This is an out-of-domain and tokenization stress test, but the failure mode of https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 is somewhat interesting:
The Nemotron-Nano-4B model actually has a small zone where it kinda gets it, but collapses to 0 outside of it, while R1 struggles fairly uniformly.
Boolean Excellence
Both 3.3 variants are somewhat unexpectedly sitting at the top of the Boolean charts, outperforming even the reasoning fine-tunes! They also post the lowest truncation rates of the cohort, and the surfaces show the trouble comes when the problem is formatted with "T/F" (vs ON/OFF or True/False). This kind of format sensitivity is quite common and implies that while the model is capable, it didn't fully generalize the task across input formats.
Distracted Object Counting
All three of the 3.3 models are in a statistically inseparable three-way tie for first place on the Objects counting task. They "get it" and truncation rates are really low, but as you can see from the surfaces the performance breakdown with difficulty is exponential, so the actual competence zone is small.
Word Sorting
Sorting shows a very similar pattern to object counting: the 128k version drops just a little below the confidence band, while the 3.3 leak is statistically tied with the R1 fine-tune (but with far fewer tokens consumed).
Resource Utilization
| Rank | Model | Tier | ReasonScore | Avg Tokens | Score/Token |
|---|---|---|---|---|---|
| 1 | Llama-3.3-8B (FP16) | easy | 312 | 324 | 0.96 |
| 2 | Llama-3.3-8B-128K (FP16) | easy | 295 | 313 | 0.94 |
| 3 | Llama-3.1-8B (FP16) | easy | 177 | 366 | 0.48 |
| 4 | Llama-3.3-8B-Thinking (FP16) | easy | 152 | 425 | 0.36 |
| 5 | R1-Distill-Llama-8B (FP16) | easy | 457 | 1693 | 0.27 |
| 6 | Llama-3.1-Nemotron-Nano-8B (FP16) | easy | 146 | 1191 | 0.12 |
| 7 | Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) | easy | 318 | 3125 | 0.10 |
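The Score/Token column is simply ReasonScore divided by average tokens consumed; a quick sketch reproducing it from the numbers in the table above:

```python
# (ReasonScore, Avg Tokens) pairs copied from the resource utilization table.
rows = [
    ("Llama-3.3-8B", 312, 324),
    ("Llama-3.3-8B-128K", 295, 313),
    ("Llama-3.1-8B", 177, 366),
    ("Llama-3.3-8B-Thinking", 152, 425),
    ("R1-Distill-Llama-8B", 457, 1693),
    ("Llama-3.1-Nemotron-Nano-8B", 146, 1191),
    ("Llama-3.1-Nemotron-Nano-4B-v1.1", 318, 3125),
]

# Score/Token = ReasonScore / Avg Tokens, rounded to two decimals.
efficiency = {name: round(score / tokens, 2) for name, score, tokens in rows}

# Sorting by efficiency reproduces the table's ranking.
for name, eff in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {eff}")
```

Note how R1-Distill's chart-topping raw score (457) is diluted by its 1693-token average into a mid-pack efficiency of 0.27.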
Reasoning models burn a lot of tokens; this is not news to anyone, I think, but exactly how many tokens you burn and how much value you get from them can vary widely.
Meta-Rank
As an alternative perspective, we can combine the individual task analyses into a "ranking of ranks". Note that the letters after a rank refer to resource-utilization sub-clusters, so 2a vs 2b means performance was the same but 2a used significantly fewer tokens to achieve it.
| rank | eval_label | arithmetic | boolean | brackets | cars | dates | letters | movies | objects | sequence | shapes | shuffle | sort | total_penalty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R1-Distill-Llama-8B (FP16) | 3b | 3 | 1 | 2 | 1 | 1a | 3b | 5 | 1a | 1 | 1 | 1c | 24.5 |
| 2 | Llama-3.3-8B (FP16) | 3a | 1 | 4a | 1 | 3 | 3b | 1 | 1a | 2 | 3a | 3a | 1b | 27 |
| 3 | Llama-3.3-8B-128K (FP16) | 5a | 2 | 3a | 1 | 3 | 3b | 1 | 1a | 2 | 2a | 3a | 2b | 29 |
| 4 | Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) | 2 | 5b | 2 | 3c | 2 | 1b | 4 | 3b | 1b | 2c | 3c | 2d | 34 |
| 5 | Llama-3.3-8B-Thinking (FP16) | 4a | 4a | 3c | 1 | 3 | 2a | 5b | 1b | 3a | 5 | 3b | 1a | 37 |
| 6 | Llama-3.1-Nemotron-Nano-8B (FP16) | 1 | 4b | 3b | 3b | 5 | 3a | 5a | 2c | 4 | 2b | 2 | 4 | 40.5 |
| 7 | Llama-3.1-8B (FP16) | 6 | 5a | 5c | 3a | 4 | 2b | 2 | 4 | 3b | 4 | 4 | 3 | 46.5 |
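A minimal sketch of the ranking-of-ranks idea (my own illustration, not ReasonScape's exact scoring): strip the sub-cluster letter from each cell and sum the numeric ranks per model. The totals here won't match the table's total_penalty column exactly, since that also accounts for ties and token-efficiency sub-clusters, but the ordering comes out the same for these two models:

```python
import re

# Per-task rank strings for two models, copied from the meta-rank table.
# The letter suffix marks a token-efficiency sub-cluster within the same rank.
ranks = {
    "R1-Distill-Llama-8B": ["3b", "3", "1", "2", "1", "1a", "3b", "5", "1a", "1", "1", "1c"],
    "Llama-3.1-8B":        ["6", "5a", "5c", "3a", "4", "2b", "2", "4", "3b", "4", "4", "3"],
}

def total_rank(cells):
    # Keep only the numeric part of each cell ("3b" -> 3) and sum across tasks.
    return sum(int(re.match(r"\d+", cell).group()) for cell in cells)

totals = {model: total_rank(cells) for model, cells in ranks.items()}
```

Lower is better: even this crude sum places R1-Distill well ahead of the 3.1 baseline, matching the table.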
In terms of raw pass@k accuracy, at the expense of all else, R1-Distill-Llama-8B is the winner.
Sneaking in right behind it, with an average of 5x fewer tokens, are the 3.3 leaks. It's difficult to separate the performance of the original vs the 128K model, as most tasks are tied, but aggregating across tasks reveals a small overall advantage for the non-extended version.
Closing Thoughts
3.1 vs 3.3
These models are the real deal.
They demonstrate excellent across-the-board performance improvements over the 3.1 baseline with fewer truncations; on the tasks they are capable of, they even give R1 a run for its money, and they sit at the top of the resource-efficiency ranking.
3.3 vs 3.3-128k
There is very little difference between the 128k RoPE version and the original, but keep in mind my tests generated an average of under 400 tokens per response, so it's quite possible this simply wasn't deep enough to observe any difference.
The primary challenge here is that scaling task difficulty doesn't make these models generate longer traces. The entropy plots clearly show they are unable to 'scale their thinking': they apply the same effort to easy tasks as to hard ones, so when the problems get difficult they underthink and get the answers wrong:
Let's call this one a practical failure of the methodology rather than a concrete result.
3.3 vs 3.3-Thinking
The 3.3 Claude fine-tune is cute and fun to play with. It extends average response lengths by ~30%, but in terms of reasoning performance the massive boost in truncations cancels out the gains.
Methodology
If you'd like to learn more about the ReasonScape methodology and tools that helped me produce this analysis, all my evaluation data and tools are open-source and can be found on GitHub.
Worried about open-source evaluations causing contamination? Dive into the Understanding ReasonScape documentation to understand how I use controlled, deterministically random difficulty manifolds to enable effectively infinite test-case generation.
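The core trick is that a seed plus a point on the difficulty manifold fully determines a test case, so cases are generated on demand rather than shipped as a static (and therefore leakable) dataset. A toy sketch of the idea, not ReasonScape's actual generator:

```python
import random

def make_arithmetic_case(seed: int, difficulty: int) -> tuple[str, int]:
    """Deterministically generate an arithmetic problem.

    The same (seed, difficulty) pair always yields the same case, so there
    is no fixed dataset that can leak into training corpora, yet the case
    space is effectively infinite as the seed varies.
    """
    # Derive a per-case RNG from the seed and the difficulty coordinate.
    rng = random.Random(seed * 100003 + difficulty)
    # Difficulty controls both term count and magnitude.
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    question = " + ".join(map(str, terms))
    return question, sum(terms)

# Regenerating with the same coordinates reproduces the identical case.
q1, a1 = make_arithmetic_case(seed=42, difficulty=2)
q2, a2 = make_arithmetic_case(seed=42, difficulty=2)
```

The function name and the seed-mixing constant are illustrative assumptions; the real generators live in the ReasonScape repository.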
I will publish the raw data from this sub-dataset (dubbed m12x-llama8b) this weekend, but please reach out if you'd like to explore it further.
If you've got Questions or Feedback, please leave them in the comments section below.