Missing tensor 46?
Looks like it's missing blk.46? I don't see it in the original weights either, which is odd for a GLM Air fine-tune. Were you able to run inference with it, or did you just quantize it?
Rather unfortunate, but thanks for releasing the quant anyway!
create_tensor: loading tensor blk.44.ffn_gate_shexp.weight
create_tensor: loading tensor blk.44.ffn_down_shexp.weight
create_tensor: loading tensor blk.44.ffn_up_shexp.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_q.bias
create_tensor: loading tensor blk.45.attn_k.bias
create_tensor: loading tensor blk.45.attn_v.bias
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.post_attention_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.exp_probs_b.bias
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.45.ffn_gate_shexp.weight
create_tensor: loading tensor blk.45.ffn_down_shexp.weight
create_tensor: loading tensor blk.45.ffn_up_shexp.weight
llama_model_load: error loading model: missing tensor 'blk.46.attn_norm.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '$MODELS/Noctrex-INTELLECT-3.1-MXFP4_MOE-00001-of-00004.gguf'
srv load_model: failed to load model, '$MODELS/Noctrex-INTELLECT-3.1-MXFP4_MOE-00001-of-00004.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
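To pin down exactly which layers are absent before filing a report like this, a quick scan of the tensor names helps. The sketch below assumes the standard llama.cpp GGUF naming scheme `blk.<i>.<suffix>`; for a real file the name list could come from the `gguf` Python package's `GGUFReader(path).tensors`, but here a toy list mirroring the log stands in for it.

```python
import re

def missing_blocks(tensor_names, n_layers):
    """Return the block indices in range(n_layers) with no tensors present.

    Assumes the standard llama.cpp GGUF naming scheme 'blk.<i>.<suffix>'.
    """
    present = set()
    for name in tensor_names:
        m = re.match(r"blk\.(\d+)\.", name)
        if m:
            present.add(int(m.group(1)))
    return [i for i in range(n_layers) if i not in present]

# Toy list mirroring the failing load: blocks 0..45 exist, block 46 does not,
# while the model architecture expects 47 blocks.
names = [f"blk.{i}.attn_norm.weight" for i in range(46)]
print(missing_blocks(names, n_layers=47))  # -> [46]
```

Passing the expected layer count explicitly matters: a scan that only looks at the maximum index seen would never notice a block missing from the end.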
Ah, I think llama.cpp handled this somehow before.
Yes, this was a straight automated quant from the source repository. I'm checking it out right now to see what the problem is.
OK, so I updated the model config and re-quantized it. It now loads, and I was able to have conversations with it. Please try it again.
Thank you for your report.
Thank you so much buddy! It is now working perfectly.
Tensor blk.44.ffn_up_shexp.weight buffer type overriden to CPU
Tensor blk.45.exp_probs_b.bias buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.45.ffn_gate_shexp.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_shexp.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_shexp.weight buffer type overriden to CPU
================================ max_gpu = 0
Estimated model buffer size per device:
Device 0: 5016.20 MiB
Device 1: 4993.70 MiB
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 47/47 layers to GPU
llm_load_tensors: CPU buffer size = 13057.78 MiB
llm_load_tensors: CPU buffer size = 15994.04 MiB
llm_load_tensors: CPU buffer size = 15996.04 MiB
llm_load_tensors: CPU buffer size = 16172.95 MiB
llm_load_tensors: CPU buffer size = 1184.00 MiB
llm_load_tensors: CUDA_Split buffer size = 10009.89 MiB
llm_load_tensors: CUDA0 buffer size = 5672.02 MiB
llm_load_tensors: CUDA1 buffer size = 3366.00 MiB
..................................................................................................
===================================== llama_init_from_model: q8_0
llama_init_from_model: n_ctx = 12544
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 1
llama_init_from_model: reduce_type = q8_0
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA_Split KV buffer size = 2254.07 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 1127 MiB
Device 1: 1127 MiB
llama_init_from_model: KV self size = 2254.00 MiB, K (f16): 1127.00 MiB, V (f16): 1127.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.58 MiB
llama_init_from_model: CUDA0 compute buffer size = 798.25 MiB
llama_init_from_model: CUDA1 compute buffer size = 108.64 MiB
llama_init_from_model: CUDA_Host compute buffer size = 20.26 MiB
llama_init_from_model: graph nodes = 3126
llama_init_from_model: graph splits = 402
llama_init_from_model: enabling only_active_experts scheduling
XXXXXXXX Split Mode Graph Scheduling is FORCED despite tensor overrides due to user choice.
XXXXXXXX It may or might NOT infer properly due to unsupported combinations between SMGS and every possible tensor overrides.
======================================= HAVE_FANCY_SIMD is NOT defined
INFO [ init] initializing slots | tid="140497349758976" timestamp=1771550436 n_slots=1
INFO [ init] new slot | tid="140497349758976" timestamp=1771550436 id_slot=0 n_ctx_slot=12544
INFO [                    main] HTTP server listening | tid="140497349758976" timestamp=1771550439 n_threads_http="23" port="1234" hostname="127.0.0.1"
INFO [ slots_idle] all slots are idle | tid="140497349758976" timestamp=1771550439
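As a sanity check, the KV cache figure in the startup log can be reproduced arithmetically. The layer count (46 repeating layers) is in the log itself; the GQA dimensions (8 KV heads of head size 128) are an assumption based on the GLM-4.5-Air architecture this fine-tune derives from, not something the log states.

```python
def kv_cache_mib(n_ctx, n_layer, n_head_kv, head_dim, bytes_per_elem=2):
    """Size of one cache (K or V) in MiB for a dense f16 KV cache."""
    return n_ctx * n_layer * n_head_kv * head_dim * bytes_per_elem / 2**20

# Dimensions assumed from the log (n_ctx, n_layer) and the model card (GQA).
k = kv_cache_mib(n_ctx=12544, n_layer=46, n_head_kv=8, head_dim=128)
print(k)      # -> 1127.0, matching 'K (f16): 1127.00 MiB' in the log
print(2 * k)  # -> 2254.0, the reported 'KV self size'
```

That the numbers line up exactly suggests the re-quantized model is now built with the correct layer count.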
By the way, how's the performance?