Regression on benches

#2
by selimaktas - opened

Hello!
I benchmarked this model on SWE-Bench Verified (no tools, single shot) and BFCL v4 (multi-turn subset), and it shows regression on both. This may have been caused by unstable/high-LR training.
Hope to see better versions in the future!

Hi @selimaktas, appreciate the bench report. One adjacent data point that might help triangulate:

We rebuilt this checkpoint into an NVFP4 + MTP variant (sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP) on the prefix-fixed BF16 source, and all five tasks of an agent capability suite (single tool call, multi-turn continuation, final synthesis, 3-way parallel tool calls, reasoning chain) pass on RTX PRO 6000 + vLLM 0.19.1rc1:

  • Tool-call JSON arguments parse cleanly
  • <think> reasoning is correctly separated from final answer
  • Per-position MTP acceptance 0.934 (mean accept length 1.93/2.0 at n=1, ~3.0/4.0 at n=3)
  • Long-form decode lands ~103 tok/s at n=3
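
For context on how the acceptance rate maps to tokens per step, here is a rough sketch; treating per-position acceptance as independent is my simplification for illustration, not something measured:

```python
# Back-of-the-envelope check of the MTP numbers, assuming each of the n draft
# positions is accepted independently with probability p (a simplification).
# Each step emits one verified token plus the accepted prefix of the draft,
# so the expected accept length is 1 + p + p^2 + ... + p^n.
p = 0.934            # reported per-position acceptance
for n in (1, 3):     # number of MTP draft tokens per step
    expected = 1 + sum(p ** k for k in range(1, n + 1))
    print(f"n={n}: expected accept length ~ {expected:.2f} / {n + 1}")
# n=1 lands at ~1.93/2.0, matching the report; the independence model gives
# ~3.62/4.0 at n=3, so the observed ~3.0/4.0 just means acceptance drops at
# deeper draft positions, which is the usual behaviour for MTP heads.
```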

That doesn't refute the SWE-Bench Verified / BFCL v4 regressions (they exercise much more agentic surface than five-prompt smoke tests), but it suggests the BF16 weights themselves are at least functionally healthy, and the bug Discussion #1 reported (triple language_model. prefix) is fully resolved in the current upload; we re-checked with safe_open after kai-os's fix (sketch below).
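
For reference, the prefix check is nothing more than enumerating tensor names in each shard; roughly like this, with the local path being wherever you downloaded the snapshot:

```python
from pathlib import Path
from safetensors import safe_open

# Flag any tensor whose name still carries a duplicated "language_model." prefix
# (the Discussion #1 bug showed names like "language_model.language_model.language_model...").
snapshot_dir = Path("./model-snapshot")  # local download of the repo; adjust to your path
for shard in sorted(snapshot_dir.glob("*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        bad = [k for k in f.keys() if k.count("language_model.") > 1]
    print(f"{shard.name}: {len(bad)} duplicated-prefix tensors")
    for k in bad[:5]:  # show a few offenders, if any
        print("  ", k)
```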

Two things bit us during evaluation and are worth ruling out on your side, since they look exactly like accuracy regressions if you don't know to set them (a quick launch-and-probe sketch follows the list):

  1. --reasoning-parser qwen3: without this, <think>...</think> chains land in the chat content the harness reads as the "final answer", which destroys agentic eval scores even on a perfectly tuned model.
  2. --tool-call-parser qwen3_xml (not hermes): Carnice emits OpenAI-style function XML (<tool_call><function=name>…</function></tool_call>), not canonical Hermes JSON-in-tags, so the hermes parser leaves tool_calls=[] and the harness scores 0 on every tool turn.
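
For concreteness, this is roughly the launch-plus-probe pattern we use before any long run. The serve flags mirror our setup, the get_weather tool is a toy just for the probe, and reasoning_content is a vLLM response extension rather than a standard OpenAI field, so treat that attribute access as an assumption:

```python
# Serving side, roughly:
#   vllm serve <model> --reasoning-parser qwen3 \
#       --tool-call-parser qwen3_xml --enable-auto-tool-choice
# Client-side probe: one tool-enabled request, then check that both parsers did their job.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # toy tool, probe only
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather in Paris? Use the tool."}],
    tools=tools,
)
msg = resp.choices[0].message
# Wrong --tool-call-parser -> tool_calls stays empty and the XML leaks into content.
assert msg.tool_calls, "no parsed tool_calls: tool parser likely misconfigured"
# Missing --reasoning-parser qwen3 -> <think>...</think> leaks into the final answer.
assert "<think>" not in (msg.content or ""), "reasoning leaked into content"
# vLLM returns the separated chain-of-thought in an extra field when the parser is on.
print("reasoning separated:", getattr(msg, "reasoning_content", None) is not None)
```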

If either of those was off in your eval setup, that alone would explain a huge gap. Worth double-checking before chalking it up to LR instability.

- Tonoken3 / Lna-Lab (sakamakismile)

Hello,
Thank you for your feedback! I might re-bench it if anything changes, but the benchmark was run properly: with vLLM, the qwen3 reasoning parser, and the correct tool parser (not hermes), on the same nightly vLLM wheel you are using. With the wrong parser the results would simply have been 0.

Additional notes:
The model is regressed, not incapacitated, and that distinction isn't easy to make in a 5-prompt test.
Testing was done on EvalScope, with vLLM as the inference backend, using the weights available as of the date this comment was posted. No MTP or any other speculative decoding was used.
