run this model on gguf

#1
by whatever0705 - opened

How can I run this model with llama.cpp? I have already set up llama.cpp.

I have not tried this yet, but this is where to start: ./llama-tts --help
----- common params -----

-h, --help, --usage print usage and exit
--version show version and build info
--completion-bash print source-able bash completion script for llama.cpp
--verbose-prompt print a verbose prompt before generation (default: false)
-t, --threads N number of threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default:
same as --threads)
-C, --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range
(default: "")
-Cr, --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1> use strict CPU placement (default: 0)
--prio N set process/thread priority : low(-1), normal(0), medium(1), high(2),
realtime(3) (default: 0)
--poll <0...100> use polling level to wait for work (0 - no polling, default: 50)
-Cb, --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
(default: same as --cpu-mask)
-Crb, --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict)
--prio-batch N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
(default: 0)
--poll-batch <0|1> use polling to wait for work (default: same as --poll)
-c, --ctx-size N size of the prompt context (default: 8192, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n, --predict, --n-predict N number of tokens to predict (default: 4096, -1 = infinity)
(env: LLAMA_ARG_N_PREDICT)
-b, --batch-size N logical maximum batch size (default: 8192)
(env: LLAMA_ARG_BATCH)
-ub, --ubatch-size N physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 =
all)
--swa-full use full-size SWA cache (default: false)
(more info)
(env: LLAMA_ARG_SWA_FULL)
--kv-unified, -kvu use single unified KV buffer for the KV cache of all sequences
(default: false)
(more info)
(env: LLAMA_ARG_KV_SPLIT)
-fa, --flash-attn [on|off|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
(env: LLAMA_ARG_FLASH_ATTN)
-p, --prompt PROMPT prompt to start generation with; for system message, use -sys
--no-perf disable internal libllama performance timings (default: false)
(env: LLAMA_ARG_NO_PERF)
-f, --file FNAME a file containing the prompt (default: none)
-bf, --binary-file FNAME binary file containing the prompt (default: none)
-e, --escape process escapes sequences (\n, \r, \t, ', ", \) (default: true)
--no-escape do not process escape sequences
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by
the model
(env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N RoPE context scaling factor, expands context by a factor of N
(env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from
model)
(env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
(env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training
context size)
(env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.0, 0.0 = full
interpolation)
(env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: -1.0)
(env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: -1.0)
(env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N YaRN: low correction dim or beta (default: -1.0)
(env: LLAMA_ARG_YARN_BETA_FAST)
-nkvo, --no-kv-offload disable KV offload
(env: LLAMA_ARG_NO_KV_OFFLOAD)
-nr, --no-repack disable weight repacking
(env: LLAMA_ARG_NO_REPACK)
--no-host bypass host buffer allowing extra buffers to be used
(env: LLAMA_ARG_NO_HOST)
-ctk, --cache-type-k TYPE KV cache data type for K
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K)
-ctv, --cache-type-v TYPE KV cache data type for V
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V)
-dt, --defrag-thold N KV cache defragmentation threshold (DEPRECATED)
(env: LLAMA_ARG_DEFRAG_THOLD)
-np, --parallel N number of parallel sequences to decode (default: 1)
(env: LLAMA_ARG_N_PARALLEL)
--mlock force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK)
--no-mmap do not memory-map model (slower load but may reduce pageouts if not
using mlock)
(env: LLAMA_ARG_NO_MMAP)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution
started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system
page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA)
-dev, --device <dev1,dev2,..> comma-separated list of devices to use for offloading (none = don't
offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE)
--list-devices print list of available devices and exit
--override-tensor, -ot <tensor name pattern>=<buffer type>,...
override tensor buffer type
--cpu-moe, -cmoe keep all Mixture of Experts (MoE) weights in the CPU
(env: LLAMA_ARG_CPU_MOE)
--n-cpu-moe, -ncmoe N keep the Mixture of Experts (MoE) weights of the first N layers in the
CPU
(env: LLAMA_ARG_N_CPU_MOE)
-ngl, --gpu-layers, --n-gpu-layers N max. number of layers to store in VRAM (default: -1)
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm, --split-mode {none,layer,row} how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of
proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for
intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)
--check-tensors check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE advanced option to override model metadata by key. may be specified
multiple times.
types: int, float, bool, str. example: --override-kv
tokenizer.ggml.add_bos_token=bool:false
--no-op-offload disable offloading host tensor operations to device (default: false)
--lora FNAME path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE path to LoRA adapter with user defined scaling (can be repeated to use
multiple adapters)
--control-vector FNAME add a control vector
note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE add a control vector with user defined scaling SCALE
note: this argument can be repeated to add multiple scaled control
vectors
--control-vector-layer-range START END
layer range to apply the control vector(s) to, start and end inclusive
-m, --model FNAME model path (default: models/$filename with filename from --hf-file
or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
(env: LLAMA_ARG_MODEL)
-mu, --model-url MODEL_URL model download url (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-dr, --docker-repo [<repo>/]<model>[:quant]
Docker Hub model repository. repo is optional, default to ai/. quant
is optional, default to :latest.
example: gemma3
(default: unused)
(env: LLAMA_ARG_DOCKER_REPO)
-hf, -hfr, --hf-repo <user>/<model>[:quant]
Hugging Face model repository; quant is optional, case-insensitive,
default to Q4_K_M, or falls back to the first file in the repo if
Q4_K_M doesn't exist.
mmproj is also downloaded automatically if available. to disable, add
--no-mmproj
example: unsloth/phi-4-GGUF:q4_k_m
(default: unused)
(env: LLAMA_ARG_HF_REPO)
-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
Same as --hf-repo, but for the draft model (default: unused)
(env: LLAMA_ARG_HFD_REPO)
-hff, --hf-file FILE Hugging Face model file. If specified, it will override the quant in
--hf-repo (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]
Hugging Face model repository for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE Hugging Face model file for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_FILE_V)
-hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment
variable)
(env: HF_TOKEN)
--log-disable Log disable
--log-file FNAME Log to file
--log-colors [on|off|auto] Set colored logging ('on', 'off', or 'auto', default: 'auto')
'auto' enables colors when output is to a terminal
(env: LLAMA_LOG_COLORS)
-v, --verbose, --log-verbose Set verbosity level to infinity (i.e. log all messages, useful for
debugging)
--offline Offline mode: forces use of cache, prevents network access
(env: LLAMA_OFFLINE)
-lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be
ignored.
(env: LLAMA_LOG_VERBOSITY)
--log-prefix Enable prefix in log messages
(env: LLAMA_LOG_PREFIX)
--log-timestamps Enable timestamps in log messages
(env: LLAMA_LOG_TIMESTAMPS)
-ctkd, --cache-type-k-draft TYPE KV cache data type for K for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE KV cache data type for V for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)

----- sampling params -----

--samplers SAMPLERS samplers that will be used for generation in the order, separated by
';'
(default: top_k)
-s, --seed SEED RNG seed (default: -1, use random seed for -1)
--sampling-seq, --sampler-seq SEQUENCE
simplified sequence for samplers that will be used (default: k)
--ignore-eos ignore end of stream token and continue generating (implies
--logit-bias EOS-inf)
--temp N temperature (default: 0.8)
--top-k N top-k sampling (default: 4, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N min-p sampling (default: 0.1, 0.0 = disabled)
--top-nsigma N top-n-sigma sampling (default: -1.0, -1.0 = disabled)
--xtc-probability N xtc probability (default: 0.0, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.1, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1
= ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers
('\n', ':', '"', '*') in the process; use "none" to not use any
sequence breakers
--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.0)
--mirostat N use Mirostat sampling.
Top K, Nucleus and Locally Typical samplers are ignored if used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)
-l, --logit-bias TOKEN_ID(+/-)BIAS modifies the likelihood of token appearing in the completion,
i.e. --logit-bias 15043+1 to increase likelihood of token ' Hello',
or --logit-bias 15043-1 to decrease likelihood of token ' Hello'
--grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/
dir) (default: '')
--grammar-file FNAME file to read grammar from
-j, --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g.
{} for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
-jf, --json-schema-file FILE File containing a JSON schema to constrain generations
(https://json-schema.org/), e.g. {} for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead

----- example-specific params -----

-o, --output, --output-file FNAME output file (default: 'output.wav')
-mv, --model-vocoder FNAME vocoder model for audio generation (default: unused)
--tts-use-guide-tokens Use guide tokens to improve TTS word recall
--tts-speaker-file FNAME speaker file path for audio generation
--tts-oute-default use default OuteTTS models (note: can download weights from the
internet)

example usage:

./llama-tts -m model.gguf -p "Hello!"
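Since this model appears to need a separate vocoder, a fuller invocation would pass it with -mv (--model-vocoder) alongside the main model. A sketch, with placeholder file names:

```shell
# File names are placeholders -- substitute the actual GGUF files.
./llama-tts \
  -m kani-tts.gguf \       # main TTS model
  -mv vocoder.gguf \       # vocoder model (-mv / --model-vocoder)
  -p "Hello!" \            # text to synthesize
  -o hello.wav             # output file (default: output.wav)
```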

Well, actually, it's because of the vocoder. When I try to run the vocoder from the main page, it doesn't work, and the OuteTTS and Sesame vocoders don't work either, because KaniTTS uses the LFM2 architecture.
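For model/vocoder pairs that llama.cpp does support, both can also be pulled straight from Hugging Face using the -hf and -hfv flags shown in the help above. A sketch with hypothetical repo names (this will not help here until the LFM2-based architecture is supported by llama-tts):

```shell
# Repo names are hypothetical -- substitute real GGUF repos.
./llama-tts \
  -hf  someuser/some-tts-GGUF \        # main model (-hf / --hf-repo)
  -hfv someuser/some-vocoder-GGUF \    # vocoder (-hfv / --hf-repo-v)
  -p "Hello!"
```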

whatever0705 changed discussion status to closed
