Support in llama.cpp

by Tridefender - opened Jan 16

Jan 16

Hi, first of all, thank you for the fast conversion of this DLM. I would like to ask: is inference of diffusion text models integrated in llama.cpp?
It seems that the denoising never really happened.
Also, during inference I got this message: "embeddings required but some input tokens were not marked as outputs -> overriding". Which embedding should we use?

nicoboss

Jan 17

I would like to ask, is inference of Diffusion text models integrated in llama.cpp?

Yes, but in a dedicated executable, diffusion-cli (examples/diffusion/diffusion-cli.cpp), rather than in llama-cli/llama-server, which are used for normal LLMs.
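As a rough illustration of what calling that executable could look like, here is a small Python helper that only builds the command line. The binary name `llama-diffusion-cli` and the `--diffusion-steps` flag are assumptions based on the examples/diffusion directory, and the helper `diffusion_cli_args` itself is hypothetical; check the tool's `--help` output for the options your build actually supports.

```python
import shlex

def diffusion_cli_args(model_path, prompt, steps=256, binary="llama-diffusion-cli"):
    """Build an argv list for the diffusion example (hypothetical helper).

    The binary name and the --diffusion-steps flag are assumptions;
    verify them against --help for your llama.cpp build.
    """
    return [
        binary,
        "-m", model_path,                  # path to the GGUF file
        "-p", prompt,                      # prompt text
        "--diffusion-steps", str(steps),   # number of denoising steps
    ]

args = diffusion_cli_args(
    "MMaDA-8B-MixCoT.Q6_K.gguf",
    "Explain how diffusion language models work.",
)
# Print a shell-quoted command that can be copied into a terminal.
print(shlex.join(args))
```

The list form can be passed to `subprocess.run(args)` directly, which avoids quoting problems with paths that contain spaces.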

it seems that the denoising never really happened.

While I haven't tried this exact model, I have had success running many other diffusion-based LLMs in the past.

Also, during inference I got this message: "embeddings required but some input tokens were not marked as outputs -> overriding". Which embedding should we use?

Not sure about that. Can you please post the full error message?

Tridefender

Jan 17

It seems that the arch is accepted and recognized as llada:

Loading Chat Completions Adapter: D:\AI Models\kcpp\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
Auto Recommended GPU Layers: 23
System: Windows 10.0.26100 AMD64 Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
Detected Available GPU Memory: 16380 MB
Detected Available RAM: 20846 MB
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(admin=False, admindir='', adminpassword='', analyze='', autofit=True, batchsize=512, benchmark=None, blasthreads=None, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=8192, debugmode=0, defaultgenamt=896, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, gendefaults='', gendefaultsoverwrite=False, genlimit=0, gpulayers=23, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=True, jinja_tools=False, launch=True, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mmproj=None, mmprojcpu=False, model=[], model_param='D:/AI Models/MMaDA-8B-MixCoT.Q6_K.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv=None, overridenativecontext=0, overridetensors=None, password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory=None, prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults=False, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=7, sdtiledvae=768, sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=0, smartcontext=False, ssl=None, tensor_split=None, testmemory=False, threads=7, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, 
usecpu=False, usecuda=['normal', '0', 'mmq'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')

Loading Text Model: D:\AI Models\MMaDA-8B-MixCoT.Q6_K.gguf

The reported GGUF Arch is: llada
Arch Category: 0

Identified as GGUF model.
Attempting to Load...

Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True

Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes

Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) (0000:01:00.0) - 15221 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 291 tensors from D:\AI Models\MMaDA-8B-MixCoT.Q6_K.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 6.18 GiB (6.56 BPW)
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load: - 126081 ('<|endoftext|>')
load: - 126348 ('<|eot_id|>')
load: special tokens cache size = 269
load: token to piece cache size = 0.8057 MB
print_info: arch = llada
print_info: vocab_only = 0
print_info: no_alloc = 1
print_info: n_ctx_train = 4096
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 32
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 4096
print_info: n_embd_v_gqa = 4096
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 0
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 8B
print_info: model params = 8.08 B
print_info: general.name = MMaDA 8B MixCoT
print_info: vocab type = BPE
print_info: n_vocab = 134656
print_info: n_merges = 125824
print_info: BOS token = 126080 '<|startoftext|>'
print_info: EOS token = 126081 '<|endoftext|>'
print_info: EOT token = 126081 '<|endoftext|>'
print_info: PAD token = 126081 '<|endoftext|>'
print_info: MASK token = 126336 '<|mdm_mask|>'
print_info: LF token = 198 '膴'
print_info: EOG token = 126081 '<|endoftext|>'
print_info: EOG token = 126348 '<|eot_id|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 291
load_tensors: offloading output layer to GPU
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU model buffer size = 0.00 MiB
load_tensors: CUDA0 model buffer size = 0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8448
llama_context: n_ctx_seq = 8448
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8448) > n_ctx_train (4096) -- possible training context overflow
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.51 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 4060 Ti) | 16379 = 15221 + (5892 = 5892 + 0 + 0) + 17592186039682 |
llama_memory_breakdown_print: | - Host | 431 = 431 + 0 + 0 |
llama_params_fit_impl: projected to use 5892 MiB of device memory vs. 15221 MiB of free device memory
llama_params_fit_impl: will leave 9328 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.26 seconds
Autofit Result: -c 8320 -ngl -1
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) (0000:01:00.0) - 15221 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 291 tensors from D:\AI Models\MMaDA-8B-MixCoT.Q6_K.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 6.18 GiB (6.56 BPW)
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load: - 126081 ('<|endoftext|>')
load: - 126348 ('<|eot_id|>')
load: special tokens cache size = 269
load: token to piece cache size = 0.8057 MB
print_info: arch = llada
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 32
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 4096
print_info: n_embd_v_gqa = 4096
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 0
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 8B
print_info: model params = 8.08 B
print_info: general.name = MMaDA 8B MixCoT
print_info: vocab type = BPE
print_info: n_vocab = 134656
print_info: n_merges = 125824
print_info: BOS token = 126080 '<|startoftext|>'
print_info: EOS token = 126081 '<|endoftext|>'
print_info: EOT token = 126081 '<|endoftext|>'
print_info: PAD token = 126081 '<|endoftext|>'
print_info: MASK token = 126336 '<|mdm_mask|>'
print_info: LF token = 198 '膴'
print_info: EOG token = 126081 '<|endoftext|>'
print_info: EOG token = 126348 '<|eot_id|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 291
load_tensors: offloading output layer to GPU
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU model buffer size = 431.48 MiB
load_tensors: CUDA0 model buffer size = 5892.50 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
.........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:2035087.6).
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8448
llama_context: n_ctx_seq = 8448
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = disabled
llama_context: kv_unified = true
llama_context: freq_base = 2035087.6
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8448) > n_ctx_train (4096) -- possible training context overflow
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.51 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
Threadpool set to 7 threads and 7 blasthreads...
attach_threadpool: call
Starting model warm up, please wait a moment...
Load Text Model OK: True
Chat completion heuristic: Llama 3.x
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.

Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/

Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 8192, "max_length": 2048, "rep_pen": 1.18, "temperature": 0.75, "top_p": 0.6, "top_k": 40, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.8, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "{{[SYSTEM]}}You are an helpful assistant\n", "trim_stop": true, "genkey": "KCPP1806", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "smoothing_curve": 1, "nsigma": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "replace_instruct_placeholders": true, "presence_penalty": 0, "adaptive_target": -1, "adaptive_decay": 0.9, "logit_bias": {"126081": 1}, "stop_sequence": ["{{[INPUT]}}", "{{[OUTPUT]}}"], "use_default_badwordsids": false, "bypass_eos": false, "prompt": "{{[INPUT]}}Explain to me how does Diffusion Language Models work.{{[OUTPUT]}}"}

Processing Prompt (18 / 18 tokens)
init: embeddings required but some input tokens were not marked as outputs -> overriding

Generating (2 / 2048 tokens)
(EOS token triggered! ID:126081)
[19:37:57] CtxLimit:38/8192, Amt:2/2048, Init:0.00s, Process:0.03s (720.00T/s), Generate:0.04s (50.00T/s), Total:0.07s
Output: Explain
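The log above shows ordinary token-by-token generation that stops at EOS after two tokens. For contrast, a masked-diffusion decoder typically starts from a block of mask tokens and fills it in over several steps. The following is only a toy sketch of that idea: `toy_predict` is a hypothetical stand-in that returns random tokens and confidences, not the real model, and the loop is not the llama.cpp or KoboldCpp implementation. Only the mask token string is taken from the log.

```python
import random

MASK = "<|mdm_mask|>"  # mask token string as reported in the log above

def toy_predict(tokens, rng):
    """Hypothetical stand-in for the model.

    Returns {position: (token, confidence)} for every masked position.
    A real model would score all positions with one non-causal forward pass.
    """
    vocab = ["diffusion", "models", "denoise", "masked", "tokens"]
    return {i: (rng.choice(vocab), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def denoise(prompt, gen_len=8, steps=4, seed=0):
    """Fill gen_len masked slots after the prompt over a fixed number of steps."""
    rng = random.Random(seed)
    tokens = list(prompt) + [MASK] * gen_len
    per_step = -(-gen_len // steps)  # ceiling division: slots to unmask per step
    for _ in range(steps):
        preds = toy_predict(tokens, rng)
        if not preds:
            break  # nothing left to unmask
        # Commit only the most confident predictions; the rest stay masked.
        best = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)[:per_step]
        for pos, (tok, _conf) in best:
            tokens[pos] = tok
    return tokens

out = denoise(["Explain", "diffusion"])
print(out)
```

The point of the sketch is that the output length is fixed up front and every step revisits all still-masked positions, which differs from a sampling loop that appends one token at a time and halts on EOS.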
