Cannot use the GPU on AMD RYZEN AI MAX+ 395 w/ Radeon 8060S

#3
by HL973 - opened

This is strange: your quantized version does not seem to use my GPU, yet moxin-org/nemotron-3-nano-30b-a3b and unsloth/nemotron-3-nano-30b-a3b, which I also downloaded, both load fully onto the GPU.

Below is the load log for unsloth/nemotron-3-nano-30b-a3b; the log for moxin-org/nemotron-3-nano-30b-a3b is essentially identical:
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 52 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 53/53 layers to GPU
load_tensors: Vulkan0 model buffer size = 31591.35 MiB
load_tensors: Vulkan_Host model buffer size = 357.00 MiB

llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_seq = 32768
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (32768) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.50 MiB
llama_kv_cache: Vulkan0 KV buffer size = 192.00 MiB
llama_kv_cache: size = 192.00 MiB ( 32768 cells, 6 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_memory_recurrent: Vulkan0 RS buffer size = 47.62 MiB
llama_memory_recurrent: size = 47.62 MiB ( 1 cells, 52 layers, 1 seqs), R (f32): 1.62 MiB, S (f32): 46.00 MiB
llama_context: Vulkan0 compute buffer size = 271.75 MiB
llama_context: Vulkan_Host compute buffer size = 69.27 MiB
llama_context: graph nodes = 2188
llama_context: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

Below is the load log for your model:
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/53 layers to GPU
load_tensors: Vulkan_Host model buffer size = 31877.51 MiB
common_init_result: added <|im_end|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_seq = 32768
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (32768) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.50 MiB
llama_kv_cache: CPU KV buffer size = 192.00 MiB
llama_kv_cache: size = 192.00 MiB ( 32768 cells, 6 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_memory_recurrent: CPU RS buffer size = 47.62 MiB
llama_memory_recurrent: size = 47.62 MiB ( 1 cells, 52 layers, 1 seqs), R (f32): 1.62 MiB, S (f32): 46.00 MiB
llama_context: Vulkan0 compute buffer size = 795.67 MiB
llama_context: Vulkan_Host compute buffer size = 110.10 MiB
llama_context: graph nodes = 2188
llama_context: graph splits = 459 (with bs=512), 70 (with bs=1)

The load parameters are identical in both cases.
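For anyone reproducing this: llama.cpp only offloads layers when `-ngl` / `--n-gpu-layers` asks for it (some front ends set it automatically, others default to 0, which matches the "offloaded 0/53 layers" line above). A minimal sketch of explicitly requesting full offload on the Vulkan build, with a placeholder model path (not from the original post):

```shell
# Placeholder GGUF path; substitute the actual file.
# -ngl 99 requests offloading all layers to the GPU (Vulkan0 here);
# --no-mmap matches the "(mmap = false)" setting visible in both logs;
# -c 32768 matches the n_ctx shown above.
llama-cli -m ./model.gguf -ngl 99 --no-mmap -c 32768
```

If the same flags still yield 0/53 offloaded layers only for this quantization, the quantization type itself may be one the Vulkan backend cannot place in device memory, which would be worth reporting together with the logs above.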
