Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf cannot be loaded using LM Studio

by wangyan-life - opened 6 days ago

Discussion

wangyan-life

6 days ago

LM Studio seems to be unable to load Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf correctly. LM Studio version is 0.4.13 (Build 1).

phlaster

6 days ago • edited 6 days ago

LM Studio seems to be unable to load Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf correctly. LM Studio version is 0.4.13 (Build 1).

Check out the latest beta, 0.4.14, at https://lmstudio.ai/beta-releases

Update: I downloaded this beta myself and can confirm that Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf doesn't load with any backend.

Here is the full console log of the crash I get:

2026-05-18 20:21:17  [INFO]
 Server started.
2026-05-18 20:21:17  [INFO]
 Just-in-time model loading active.
2026-05-18 20:21:33 [DEBUG]
 LlamaV4::load called with model path: /home/user/.lmstudio/models/Jackrong/novis_Qwopus-9b-coder-MTP/Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf
LlamaV4::load config: n_parallel=4 n_ctx=4096 kv_unified=true
2026-05-18 20:21:33 [DEBUG]
 ggml_cuda_init: found 1 CUDA devices (Total VRAM: 7819 MiB):
2026-05-18 20:21:33 [DEBUG]
   Device 0: NVIDIA GeForce RTX 3050, compute capability 8.6, VMM: yes, VRAM: 7819 MiB
2026-05-18 20:21:33 [DEBUG]
 srv    load_model: loading model '/home/user/.lmstudio/models/Jackrong/novis_Qwopus-9b-coder-MTP/Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf'
2026-05-18 20:21:33 [DEBUG]
 llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3050) (0000:01:00.0) - 6054 MiB free
2026-05-18 20:21:33 [DEBUG]
 llama_model_loader: loaded meta data with 35 key-value pairs and 442 tensors from /home/user/.lmstudio/models/Jackrong/novis_Qwopus-9b-coder-MTP/Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwopus3.5 9B Coder
llama_model_loader: - kv   3:                           general.finetune str              = coder
llama_model_loader: - kv   4:                           general.basename str              = Qwopus3.5
llama_model_loader: - kv   5:                         general.size_label str              = 9B
llama_model_loader: - kv   6:                         qwen35.block_count u32              = 33
2026-05-18 20:21:33 [DEBUG]
 llama_model_loader: - kv   7:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv   8:                    qwen35.embedding_length u32              = 4096
llama_model_loader: - kv   9:                 qwen35.feed_forward_length u32              = 12288
llama_model_loader: - kv  10:                qwen35.attention.head_count u32              = 16
llama_model_loader: - kv  11:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  13:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  14:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  16:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  17:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  18:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  19:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  20:                  qwen35.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  21:                      qwen35.ssm.inner_size u32              = 4096
llama_model_loader: - kv  22:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  23:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  24:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = qwen35
2026-05-18 20:21:33 [DEBUG]
 llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
2026-05-18 20:21:33 [DEBUG]
 llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2026-05-18 20:21:33 [DEBUG]
 llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 30
llama_model_loader: - type  f32:  184 tensors
llama_model_loader: - type q5_K:   37 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  220 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 4.98 GiB (4.65 BPW)
2026-05-18 20:21:33 [DEBUG]
 load: 0 unused tokens
2026-05-18 20:21:33 [DEBUG]
 load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
2026-05-18 20:21:33 [DEBUG]
 load: special tokens cache size = 33
2026-05-18 20:21:33 [DEBUG]
 load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 33
print_info: n_head                = 16
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
2026-05-18 20:21:33 [DEBUG]
 print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 12288
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 4096
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 32
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = ?B
print_info: model params          = 9.20 B
print_info: general.name          = Qwopus3.5 9B Coder
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
2026-05-18 20:21:33 [DEBUG]
 llama_model_load: error loading model: missing tensor 'blk.32.ssm_conv1d.weight'
llama_model_load_from_file_impl: failed to load model
2026-05-18 20:21:33 [DEBUG]
 common_init_from_params: failed to load model '/home/user/.lmstudio/models/Jackrong/novis_Qwopus-9b-coder-MTP/Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf'
srv    load_model: failed to load model, '/home/user/.lmstudio/models/Jackrong/novis_Qwopus-9b-coder-MTP/Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf': error loading model: missing tensor 'blk.32.ssm_conv1d.weight'
2026-05-18 20:21:33 [DEBUG]
 [LLMProcess] Failed to load model _0x4c0151 [Error]: Failed to load model.
    at _0x45972c.loadModel (/tmp/.mount_lmstudBfSjYI/resources/app/.webpack/lib/llmworker.js:1:612811)
    at process.processTicksAndRejections (node:internal/process/task_queues:104:5)
    at async _0x45972c.handleMessage (/tmp/.mount_lmstudBfSjYI/resources/app/.webpack/lib/llmworker.js:1:604917) {
  cause: 'Failed to load model',
  suggestion: undefined,
  errorData: undefined,
  data: undefined,
  displayData: undefined,
  title: 'Failed to load model.'
}
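
The log above can be read mechanically. The sketch below pulls the integer metadata keys and the first `missing tensor` name out of such a log, then checks whether the missing tensor sits in one of the trailing blocks counted by `nextn_predict_layers`. The helper names are hypothetical, and the idea that the last `nextn_predict_layers` blocks hold the MTP layer is an assumption about how the file is laid out, not something the log itself states.

```python
import re

# Hypothetical helpers for reading a llama.cpp model-load log like the one above.
KV_RE = re.compile(r"- kv\s+\d+:\s+(\S+)\s+(\S+)\s+=\s+(.*)$")
MISSING_RE = re.compile(r"missing tensor '([^']+)'")


def summarize_load_failure(log_text):
    """Return (integer metadata dict, first missing tensor name or None)."""
    meta = {}
    missing = None
    for line in log_text.splitlines():
        kv = KV_RE.search(line)
        # Only scalar integer keys are needed here; strings and arrays are skipped.
        if kv and kv.group(2) in ("u32", "i32"):
            meta[kv.group(1)] = int(kv.group(3))
        found = MISSING_RE.search(line)
        if found and missing is None:
            missing = found.group(1)
    return meta, missing


def missing_block_is_trailing(meta, missing, arch="qwen35"):
    """True if the missing tensor is in one of the last nextn_predict_layers blocks.

    Assumption: the next-token-prediction (MTP) blocks are stored at the end
    of the block range.
    """
    block = re.match(r"blk\.(\d+)\.", missing or "")
    if block is None:
        return False
    n_blocks = meta[f"{arch}.block_count"]
    n_nextn = meta.get(f"{arch}.nextn_predict_layers", 0)
    return int(block.group(1)) >= n_blocks - n_nextn
```

With the values in this log (`qwen35.block_count = 33`, `qwen35.nextn_predict_layers = 1`, missing `blk.32.ssm_conv1d.weight`), the check returns `True`. That would be consistent with a loader that treats the last block as a regular layer rather than an MTP layer, though the log alone does not prove it.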

wangyan-life

5 days ago

Thanks for the details!

phlaster

4 days ago

LM Studio seems to be unable to load Qwopus3.5-9B-Coder-MTP-IQ4_XS.gguf correctly. LM Studio version is 0.4.13 (Build 1).

0.4.14 (Build 2) can finally run MTP models

dongxiat

4 days ago • edited 4 days ago

Update to LM Studio 0.4.14 Beta 2 and the llama.cpp engine v2.15 and it will work. With my GTX 1070 8GB it is slower, though: MTP works well on edge devices, not on old hardware like mine.

Update: changing MTP Max Draft Token from 3 to 2 brings the speed up to 32 tok/s.
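
A rough way to see why a smaller draft length can be faster is the usual speculative-decoding estimate. The sketch below is a simplified model, not a measurement of LM Studio or llama.cpp. It assumes each draft token is accepted independently with probability `accept_rate` and that one draft token costs `draft_cost` of a full model pass. Both parameters and both function names are hypothetical, and the model ignores any extra cost of verifying several tokens in one batch.

```python
def expected_tokens_per_step(accept_rate, max_draft):
    """Expected tokens emitted per verification step.

    Assumes each of the max_draft draft tokens is accepted independently
    with probability accept_rate; the verifier always contributes one token.
    """
    if accept_rate >= 1.0:
        return float(max_draft + 1)
    return (1.0 - accept_rate ** (max_draft + 1)) / (1.0 - accept_rate)


def relative_throughput(accept_rate, max_draft, draft_cost):
    """Throughput relative to plain decoding (1.0 means no gain).

    draft_cost is the assumed cost of one draft token as a fraction of
    one full model pass.
    """
    step_cost = 1.0 + max_draft * draft_cost
    return expected_tokens_per_step(accept_rate, max_draft) / step_cost


if __name__ == "__main__":
    # Illustrative numbers only: 60% acceptance, draft token at 20% of a full pass.
    for k in (1, 2, 3, 4):
        print(k, round(relative_throughput(0.6, k, 0.2), 3))
```

Under these illustrative numbers a draft length of 2 comes out slightly ahead of 3, because the later draft tokens are rarely accepted but always paid for. On real hardware the best value depends on the actual acceptance rate and on how expensive the draft and verification passes are, so it has to be found by measurement.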

phlaster

4 days ago • edited 3 days ago

change MTP Max Draft Token

Is there any point in changing the minimum value?
