Support in llama.cpp
Hi, first of all, thank you for the fast conversion of this DLM. I would like to ask: is inference of diffusion text models integrated in llama.cpp?
It seems that the denoising never really happened.
Also, during inference I got this message: "embeddings required but some input tokens were not marked as outputs -> overriding". Which embedding should we use?
I would like to ask, is inference of Diffusion text models integrated in llama.cpp?
Yes, but in a dedicated executable, diffusion-cli (examples/diffusion/diffusion-cli.cpp), rather than in llama-cli/llama-server, which are used for normal LLMs.
It seems that the denoising never really happened.
While I haven't tried this exact model, I have had success running many other diffusion-based LLMs in the past.
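For context, here is a rough sketch of how a masked-diffusion decoder differs from ordinary token-by-token generation: the whole answer starts out as mask tokens, every masked position is predicted in parallel, and only the most confident predictions are kept at each step. This is a toy illustration only, not the code in diffusion-cli. The names `MASK`, `toy_model`, and `denoise` are made up for this example, and `toy_model` is a hypothetical stand-in that simply reads the answer from a known target instead of running a real network.

```python
MASK = "<mask>"  # hypothetical placeholder standing in for a mask token such as <|mdm_mask|>


def toy_model(tokens, target):
    """Hypothetical stand-in for a diffusion LM forward pass.

    Returns {position: (predicted_token, confidence)} for every masked
    position. A real model would compute logits; here the prediction is
    read from `target`, and the confidence is higher when a direct
    neighbour is already unmasked.
    """
    preds = {}
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        known = sum(t != MASK for t in neighbours)
        preds[i] = (target[i], known + 1.0 / (i + 1))
    return preds


def denoise(prompt, gen_len, steps, target):
    """Iteratively unmask `gen_len` positions over at most `steps` steps."""
    tokens = list(prompt) + [MASK] * gen_len
    per_step = -(-gen_len // steps)  # ceiling division: positions to unmask per step
    history = []
    for _ in range(steps):
        preds = toy_model(tokens, target)
        if not preds:  # nothing left to unmask
            break
        # Keep only the most confident predictions; the rest stay masked
        # and are predicted again in the next step.
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history
```

Calling `denoise(["Q", ":"], gen_len=4, steps=2, target=["Q", ":", "a", "b", "c", "d"])` fills two positions per step. The point of the sketch is only that generation is a fixed number of refinement passes over the whole sequence, rather than a loop that appends one token at a time until an EOS token is sampled.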
Also, during inference I got this message: "embeddings required but some input tokens were not marked as outputs -> overriding". Which embedding should we use?
Not sure about that. Can you please post the full error message?
It seems that the architecture is accepted and recognized as llada:
Loading Chat Completions Adapter: D:\AI Models\kcpp\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
Auto Recommended GPU Layers: 23
System: Windows 10.0.26100 AMD64 Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
Detected Available GPU Memory: 16380 MB
Detected Available RAM: 20846 MB
Initializing dynamic library: koboldcpp_cublas.dll
Namespace(admin=False, admindir='', adminpassword='', analyze='', autofit=True, batchsize=512, benchmark=None, blasthreads=None, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=8192, debugmode=0, defaultgenamt=896, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, gendefaults='', gendefaultsoverwrite=False, genlimit=0, gpulayers=23, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=True, jinja_tools=False, launch=True, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mmproj=None, mmprojcpu=False, model=[], model_param='D:/AI Models/MMaDA-8B-MixCoT.Q6_K.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv=None, overridenativecontext=0, overridetensors=None, password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory=None, prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults=False, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=7, sdtiledvae=768, sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=0, smartcontext=False, ssl=None, tensor_split=None, testmemory=False, threads=7, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, 
usecpu=False, usecuda=['normal', '0', 'mmq'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
Loading Text Model: D:\AI Models\MMaDA-8B-MixCoT.Q6_K.gguf
The reported GGUF Arch is: llada
Arch Category: 0
Identified as GGUF model.
Attempting to Load...
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) (0000:01:00.0) - 15221 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 291 tensors from D:\AI Models\MMaDA-8B-MixCoT.Q6_K.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 6.18 GiB (6.56 BPW)
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load: - 126081 ('<|endoftext|>')
load: - 126348 ('<|eot_id|>')
load: special tokens cache size = 269
load: token to piece cache size = 0.8057 MB
print_info: arch = llada
print_info: vocab_only = 0
print_info: no_alloc = 1
print_info: n_ctx_train = 4096
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 32
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 4096
print_info: n_embd_v_gqa = 4096
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 0
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 8B
print_info: model params = 8.08 B
print_info: general.name = MMaDA 8B MixCoT
print_info: vocab type = BPE
print_info: n_vocab = 134656
print_info: n_merges = 125824
print_info: BOS token = 126080 '<|startoftext|>'
print_info: EOS token = 126081 '<|endoftext|>'
print_info: EOT token = 126081 '<|endoftext|>'
print_info: PAD token = 126081 '<|endoftext|>'
print_info: MASK token = 126336 '<|mdm_mask|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 126081 '<|endoftext|>'
print_info: EOG token = 126348 '<|eot_id|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 291
load_tensors: offloading output layer to GPU
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU model buffer size = 0.00 MiB
load_tensors: CUDA0 model buffer size = 0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8448
llama_context: n_ctx_seq = 8448
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8448) > n_ctx_train (4096) -- possible training context overflow
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.51 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 4060 Ti) | 16379 = 15221 + (5892 = 5892 + 0 + 0) + 17592186039682 |
llama_memory_breakdown_print: | - Host | 431 = 431 + 0 + 0 |
llama_params_fit_impl: projected to use 5892 MiB of device memory vs. 15221 MiB of free device memory
llama_params_fit_impl: will leave 9328 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.26 seconds
Autofit Result: -c 8320 -ngl -1
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) (0000:01:00.0) - 15221 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 291 tensors from D:\AI Models\MMaDA-8B-MixCoT.Q6_K.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 6.18 GiB (6.56 BPW)
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load: - 126081 ('<|endoftext|>')
load: - 126348 ('<|eot_id|>')
load: special tokens cache size = 269
load: token to piece cache size = 0.8057 MB
print_info: arch = llada
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 32
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 4096
print_info: n_embd_v_gqa = 4096
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 0
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 8B
print_info: model params = 8.08 B
print_info: general.name = MMaDA 8B MixCoT
print_info: vocab type = BPE
print_info: n_vocab = 134656
print_info: n_merges = 125824
print_info: BOS token = 126080 '<|startoftext|>'
print_info: EOS token = 126081 '<|endoftext|>'
print_info: EOT token = 126081 '<|endoftext|>'
print_info: PAD token = 126081 '<|endoftext|>'
print_info: MASK token = 126336 '<|mdm_mask|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 126081 '<|endoftext|>'
print_info: EOG token = 126348 '<|eot_id|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 291
load_tensors: offloading output layer to GPU
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU model buffer size = 431.48 MiB
load_tensors: CUDA0 model buffer size = 5892.50 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
.........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:2035087.6).
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8448
llama_context: n_ctx_seq = 8448
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = disabled
llama_context: kv_unified = true
llama_context: freq_base = 2035087.6
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8448) > n_ctx_train (4096) -- possible training context overflow
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.51 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
Threadpool set to 7 threads and 7 blasthreads...
attach_threadpool: call
Starting model warm up, please wait a moment...
Load Text Model OK: True
Chat completion heuristic: Llama 3.x
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 8192, "max_length": 2048, "rep_pen": 1.18, "temperature": 0.75, "top_p": 0.6, "top_k": 40, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.8, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "{{[SYSTEM]}}You are an helpful assistant\n", "trim_stop": true, "genkey": "KCPP1806", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "smoothing_curve": 1, "nsigma": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "replace_instruct_placeholders": true, "presence_penalty": 0, "adaptive_target": -1, "adaptive_decay": 0.9, "logit_bias": {"126081": 1}, "stop_sequence": ["{{[INPUT]}}", "{{[OUTPUT]}}"], "use_default_badwordsids": false, "bypass_eos": false, "prompt": "{{[INPUT]}}Explain to me how does Diffusion Language Models work.{{[OUTPUT]}}"}
Processing Prompt (18 / 18 tokens)
init: embeddings required but some input tokens were not marked as outputs -> overriding
Generating (2 / 2048 tokens)
(EOS token triggered! ID:126081)
[19:37:57] CtxLimit:38/8192, Amt:2/2048, Init:0.00s, Process:0.03s (720.00T/s), Generate:0.04s (50.00T/s), Total:0.07s
Output: Explain