Heads-up: triple `language_model` prefix in safetensors keys (BF16 only — GGUF unaffected)
Hi kai-os, big fan of the Carnice/Hermes SFT. I'm testing it as the base for an NVFP4 variant on Blackwell.
When loading the BF16 weights (kai-os/Carnice-V2-27b) via HF transformers, every linear layer's key comes back as MISSING in the load report, and an equal number of keys show up as UNEXPECTED. The model then runs with randomly initialized weights and outputs gibberish. The GGUF variant is unaffected because convert_hf_to_gguf.py normalizes prefixes during conversion.
Root cause
The safetensors keys carry a triple `language_model` prefix:
```
# kai-os/Carnice-V2-27b shipped:
model.language_model.language_model.language_model.embed_tokens.weight
model.language_model.language_model.language_model.layers.0.input_layernorm.weight
model.language_model.visual.blocks.0.attn.proj.weight   # also one extra here

# What Qwen3_5ForConditionalGeneration expects:
model.language_model.embed_tokens.weight
model.language_model.layers.0.input_layernorm.weight
model.visual.blocks.0.attn.proj.weight
```
This is likely an artifact of an Unsloth wrapper getting serialized into the key path more than once during the merge.
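Anyone can confirm this without pulling the shards, straight from the index (a quick sketch; `hf_hub_download` and the index filename are just the standard HF sharded layout):

```python
# Sketch: count the extra-prefixed keys from the index alone (no shard download).
import json
from huggingface_hub import hf_hub_download

index_file = hf_hub_download("kai-os/Carnice-V2-27b", "model.safetensors.index.json")
with open(index_file) as f:
    weight_map = json.load(f)["weight_map"]
bad = [k for k in weight_map
       if k.startswith("model.language_model.language_model.")
       or k.startswith("model.language_model.visual.")]
print(f"{len(bad)} / {len(weight_map)} keys carry the extra prefixes")
```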
Reproduction
```python
import torch
from transformers import Qwen3_5ForConditionalGeneration

m = Qwen3_5ForConditionalGeneration.from_pretrained(
    "kai-os/Carnice-V2-27b",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Any prompt → gibberish, because every linear layer is random-init.
```
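You don't have to wait for gibberish to see it; the load report can be captured directly (a sketch; `output_loading_info` is a standard `from_pretrained` kwarg):

```python
# Sketch: surface the missing/unexpected key lists from the same load.
m, info = Qwen3_5ForConditionalGeneration.from_pretrained(
    "kai-os/Carnice-V2-27b",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    output_loading_info=True,
)
print(len(info["missing_keys"]), len(info["unexpected_keys"]))
print(info["unexpected_keys"][:2])  # the triple-prefixed keys
```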
Comparing `model.safetensors.index.json` across the two repos:

- kai-os/Carnice-V2-27b: 1184 keys, all with the extra prefixes
- Qwen/Qwen3.6-27B (your declared base): 1199 keys with standard prefixes (the 15 extra are MTP, dropped during your merge; that's a separate matter)
Fix
A single-pass safetensors rewrite recovers the model:
```python
TRIPLE = "model.language_model.language_model.language_model."
VISUAL = "model.language_model.visual."

def fix_key(k: str) -> str:
    if k.startswith(TRIPLE):
        return "model.language_model." + k[len(TRIPLE):]
    if k.startswith(VISUAL):
        return "model.visual." + k[len(VISUAL):]
    return k
```
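For reference, the full pass looks roughly like this (a sketch, not the exact script I ran; `rewrite_checkpoint` is my own scaffolding and assumes a local snapshot with the standard sharded-safetensors layout):

```python
# Sketch: apply fix_key to every shard in place and patch the index.
import json
from pathlib import Path
from safetensors.torch import load_file, save_file

def rewrite_checkpoint(model_dir: str) -> None:
    root = Path(model_dir)
    for shard in root.glob("*.safetensors"):
        tensors = load_file(shard)
        save_file({fix_key(k): v for k, v in tensors.items()},
                  shard, metadata={"format": "pt"})
    index_path = root / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    index["weight_map"] = {fix_key(k): v
                           for k, v in index["weight_map"].items()}
    index_path.write_text(json.dumps(index, indent=2))
```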
After this the model loads cleanly, IFEval results match your benchmark numbers, and Hermes-style tool calling works.
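A one-shot check that the rewrite took (same `output_loading_info` trick as above; the local path is hypothetical):

```python
# Sketch: reload from the rewritten local directory; both lists should be empty.
m, info = Qwen3_5ForConditionalGeneration.from_pretrained(
    "./Carnice-V2-27b-fixed",  # hypothetical path to the rewritten snapshot
    dtype=torch.bfloat16,
    trust_remote_code=True,
    output_loading_info=True,
)
assert not info["missing_keys"] and not info["unexpected_keys"]
```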
Why I'm mentioning this now
I've built an NVFP4 + MTP-grafted variant on top of the fixed weights for the RTX PRO 6000 / DGX Spark (GB10) crowd who want the Hermes agent in ~20 GB of VRAM. I wanted to flag this here so anyone else loading the BF16 directly knows what's happening, and to credit you properly in the README of the downstream variant.
Thanks for the SFT — the assistant-token-only loss + the GLM-5.1 trace blend in the data mix really show through.
— Tonoken3 / Lna-Lab
Thanks for catching this. The core BF16 Transformers load bug should be resolved now.