
Failure on a basic question; is it any good at programming?

#31
by engrtipusultan - opened

I just tried to test it and it failed on the first question. I had high hopes for this model. What is the community's observation? Is it any good?

I first ran it in llama.cpp at Q8 and it gave a wrong answer, then I asked it on
https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b

Still gives wrong answer.
I asked it:
Tell me about YAML scalar | and > . How to add new line in folded block

The end part of the answer, which also contained wrong examples:
References

YAML 1.2 Specification, Β§7.12 – Block Mappings and Block Scalars (literal | and folded >).
YAML 1.2 Specification, Β§7.12.6 – Folding (blank‑line termination rule).
YAML 1.2 Specification, Β§7.12.7 – Chomping (|, |-, |+).

Bottom line

Folded scalars cannot contain literal newlines; they are always folded to spaces. If you need a real newline, switch to a literal scalar (|) or restructure the data (e.g., a list). The spec’s termination rule is blank line with the same indentation, not β€œany blank line”. I hope this clears up the confusion!
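For the record, that "bottom line" contradicts the spec: per the YAML 1.2 block-style rules, a blank line inside a folded (>) block does produce a real newline, and more-indented lines keep their line breaks too. A minimal illustration:

```yaml
# | (literal): line breaks are kept exactly as written
# value: "first line\nsecond line\n"
literal: |
  first line
  second line

# > (folded): single line breaks become spaces
# value: "these words are joined\n"
folded: >
  these words
  are joined

# A blank line inside a folded block DOES produce a real newline
# value: "first paragraph\nsecond paragraph\n"
folded_with_break: >
  first paragraph

  second paragraph

# More-indented lines inside a folded block are not folded either
# value: "normal text\n  this line keeps its break\n"
folded_indented: >
  normal text
    this line keeps its break
```

So a real newline in a folded block is possible; you don't have to switch to | or restructure the data.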

I also get really disappointing answers for a really short "needle in a haystack" use case... It passes ~1 out of 10 attempts, with bonus hallucinations :(

Even though it was for my personal needs, I'm sharing my prompt for replication attempts: https://paste.sh/HEiB9Ceu#D6qUKrYaq2o44n6zv_UmgL_0
The only correct answer is in short: "Yes for Nanbeige4-3B-Thinking-2511-Q4_K_M_64k, /mnt/storage/GGUFs/Nanbeige_Nanbeige4-3B-Thinking-2511-Q4_K_M.gguf is missing."

gpt-oss-20b-high passes it ~8/10 (sometimes it gets a bit confused about the sglang entries, but even then it can't really be blamed: it's true there are no GGUF files for those; it just didn't infer that from my bad wording!)

Typical response example - Nemotron-3-Nano-30B-A3B: https://paste.sh/dHoQxWVd#R5qsAnBdKvQqsxxJiqUyQD_0
Typical response example - gpt-oss-20b-high: https://paste.sh/xNpWJ29f#ZJ2WmGOl5YtmEyLJT_S-7DDT

For reference, I used Unsloth's Q4_K_XL version with the llama.cpp engine. I just tried again with the latest b7564.

At first, the sampling params were the ones actually inside the prompt, i.e.:

      --temp 0.6
      --top-p 0.95
      --top-k 40 (llama.cpp's default)
      --min-p 0.1 (llama.cpp's default)
      --repeat-penalty 1.1 (don't remember why I enabled it)
      --chat-template-kwargs '{"enable_thinking":true}'

But when I tried again just now, I strictly stuck to the "normal thinking" case, apart from setting top-k to 50, as I saw it was the default for Transformers (thanks, DeepWiki! - https://deepwiki.com/search/what-are-the-default-sampling_4e57bb01-c554-48f8-976d-ea4fe9415afe):

      --temp 1
      --top-p 1
      --top-k 50
      --min-p 0
      --repeat-penalty 1 (so disabled this time!)
      --chat-template-kwargs '{"enable_thinking":true}'

PS: I know this is not Unsloth's repo, but as I understood there was a partnership with them before launch, and since the thread seemed quite similar I decided to post it here.
PS2: For those interested, the YAML is a llama-swap config file, which I can only recommend for those wanting to quit ollama after it turned to the dark side! It's also far better in the sense that it supports any engine. I really recommend giving it a try: instead of breaking your GGUFs into non-exportable blobs and splitting your configs into many "modelfiles", here you have a single file and you're good to go!
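Since it came up, here is a minimal sketch of what such a llama-swap config can look like (model name and path are hypothetical, and the exact keys may differ by version; note the cmd field itself uses a literal | block, which ties back to the YAML question above):

```yaml
# Hypothetical llama-swap config sketch: one file, any engine's command line.
# ${PORT} is the port macro llama-swap substitutes at launch.
models:
  "qwen3-30b-a3b":
    cmd: |
      llama-server
      --port ${PORT}
      -m /path/to/your/model.gguf
```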

I hope they provide an update. YAML is so widely used; not training properly on such basics is disappointing.
Also, most organisations just throw a model out there and never reply or provide updates.

As far as sampling is concerned, llama.cpp allows setting the sampling sequence as well as the ability to restrict sampling parameters. But again, unfortunately, some organisations do not provide that, so you have to go by hit and miss.

For me, Qwen3 A3B is king, followed by gpt-oss. The Qwen team is the best, as they provide updates and better documentation, and they also reply to users occasionally.

throw model out there and never reply

drives me mad too! But let's see here, this might be different for this model as they published a lot of info this time

llama.cpp allows sequence of sampling

yeah, I never tried to play with it; I always leave the default penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature order, as there are already so many parameters to try out! And benchmarking takes so much time for reasoning models!

ability to restrict sampling parameters

what do you mean? curious about this

unfortunately some organisations do not provide that

yeah, but we can't really blame them if it's a llama.cpp feature; I never tried to learn how the logic works for other backends. I'm interested now!

I just tried Qwen3-30B-A3B-Thinking-2507-Q4_K_XL on it: 10 out of 10 attempts, literally; the very first one passed, then 9 in a row.

I just tried Qwen3-30B-A3B-Thinking-2507-Q4_K_XL on it: 10 out of 10 attempts, literally; the very first one passed, then 9 in a row.
As I said earlier, in my testing Qwen3 A3B is the best in its league. The only problem with the existing Qwen3 A3B is that it does not work very well with llama.cpp, with flash attention or otherwise: it slows down a lot for prompt processing at higher contexts. Hopefully Qwen3 Next will solve those problems.

unfortunately some organisations do not provide that

yeah, but we can't really blame them if it's a llama.cpp feature; I never tried to learn how the logic works for other backends. I'm interested now!

What I mean is: look at the Alibaba team that provides Qwen; they are serious about their OSS models. They provided the initial release, then addressed sampling parameter updates and chat template problems, and provided a model update with 2507. Meanwhile, others just throw a model into the wild and never look back.


ability to restrict sampling parameters

what do you mean? curious about this

The model card just says temperature=1.0 and top_p=1.0, but it does not say what the sampling sequence is. Every inference engine has its own defaults and sampling parameter sequence, which can impact inference.
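To make the "sequence impacts inference" point concrete, here is a toy sketch (my own illustration, not any engine's actual code). If top-p is applied before temperature, the nucleus is cut on the unscaled (T=1) distribution; if temperature runs first, a flatter distribution needs more tokens to reach the same cumulative mass, so a different candidate set survives:

```python
import math

def nucleus_set(logits, top_p, temperature=1.0):
    """Return the token indices kept by a top-p cut after temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Sort (probability, index) pairs from most to least likely
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append(i)
        cum += p
        if cum >= top_p:  # stop once cumulative mass reaches the threshold
            break
    return kept

logits = [4.0, 3.0, 2.0, 1.0]
print(len(nucleus_set(logits, 0.9, temperature=1.0)))  # 3 tokens survive
print(len(nucleus_set(logits, 0.9, temperature=2.0)))  # 4: flatter distribution widens the nucleus
```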

In llama.cpp you can pass the parameter:

--sampling-seq edskypmxt
SAMPLER_TYPE_PENALTIES: 'e'
SAMPLER_TYPE_DRY: 'd'
SAMPLER_TYPE_TOP_N_SIGMA: 's'
SAMPLER_TYPE_TOP_K: 'k'
SAMPLER_TYPE_TYPICAL_P: 'y'
SAMPLER_TYPE_TOP_P: 'p'
SAMPLER_TYPE_MIN_P: 'm'
SAMPLER_TYPE_XTC: 'x'
SAMPLER_TYPE_TEMPERATURE: 't'

This means that sampling will be performed in that sequence. For greedy decoding people typically use temp = 0, which is not correct. In llama.cpp you can use:
--sampling-seq k --top-k 1

That means no sampling parameter is active other than top-k, and the top-k value is one.

https://github.com/ggml-org/llama.cpp/pull/9897
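To illustrate why --sampling-seq k with --top-k 1 is deterministic, here is a toy sampler sketch (my own illustration, not llama.cpp's actual code): truncating the candidates to the single highest-logit token before sampling leaves nothing to randomize, so the output is greedy regardless of temperature:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0):
    """Toy sampler: optional top-k truncation, then temperature softmax sampling."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the top_k most likely tokens
    scaled = [logits[i] / temperature for i in ranked]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(weights)
    acc = 0.0
    for i, w in zip(ranked, weights):
        acc += w
        if acc >= r:
            return i
    return ranked[-1]

# top_k=1 keeps only the argmax token, so the result is deterministic
# even at temperature 1.0:
results = {sample([2.0, 5.0, 3.0], top_k=1) for _ in range(100)}
print(results)  # {1}
```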

Thanks for the tip for the greedy decoding, I'm bookmarking it!

Qwen, they are serious about the OSS model

I can agree that the interaction with the community hasn't been great on Nvidia's side. But, on the other hand, they published the WHOLE list of datasets along with almost all of the methodology they used, while Qwen didn't. So I can't agree on the OSS-caring side.

Maybe we will have some explanation about the disappointing behavior we both saw.

PS: to be honest, I might have been lucky yesterday with gpt-oss-high, because it actually failed several times afterward! But the failure magnitude was still much smaller than Nemotron's. There might be something wrong.
By the way, if you don't mind, I'm curious about how you ran the model. Was it llama.cpp too?

By the way, if you don't mind, I'm curious about how you ran the model. Was it llama.cpp too?

I provided that in the first post.

I first ran it in llama.cpp at Q8 and it gave a wrong answer,
then I asked it on https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b

So it is not a quantization or sampling problem.

Thanks! Yeah, sorry, I simply forgot to check your OP, lol!
I was wondering about a potential llama.cpp implementation issue, but given the same behavior on the Nvidia playground, it indeed points to something more fundamental :/ hmm

NVIDIA org

Indeed, if you see similar issues on build.nvidia.com, then it's more a model limitation than a config issue. It's possible that there may not be enough YAML training data currently; thanks for reporting.
