Ling-2.6-flash GGUF

Quantized GGUF of inclusionAI/Ling-2.6-flash, a 104B-parameter MoE model (7.4B active) with a hybrid MLA/GLA architecture.

Files

| File                                                  | Size   | Format |
|-------------------------------------------------------|--------|--------|
| Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf | ~57 GB | IQ4_NL |
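
One way to fetch the file is with the huggingface_hub CLI. This is a sketch, assuming the repo id ljupco/Ling-2.6-flash-GGUF and the ~/llama.cpp/models directory used in the benchmark below; adjust paths to taste:

huggingface-cli download ljupco/Ling-2.6-flash-GGUF \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  --local-dir ~/llama.cpp/models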

Running in llama.cpp

This model requires a custom llama.cpp branch with Bailing Hybrid architecture support:

https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-mtp

While MTP works (llama-server accepts '--spec-type mtp'), at the moment it actually slows down decoding, so the speed tests below are run without MTP. It is not clear why MTP does not help; possible reasons: the MTP implementation is poor or buggy, Ling-2.6 has only one extra head (a single extra draft token may not be enough), or the quantization is detrimental.
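
For anyone who wants to reproduce that comparison, a sketch of how MTP can be toggled on the branch above (only --spec-type mtp is specific to MTP; the other options are ordinary llama-server flags):

./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa -ngl 99 \
  --spec-type mtp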

Build

git clone https://github.com/ljubomirj/llama.cpp.git
cd llama.cpp
git checkout LJ-Ling-2.6-flash-mtp
mkdir -p build && cd build
cmake .. -DLLAMA_METAL=ON
make -j llama-cli llama-server llama-batched-bench
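
A quick sanity check that the binaries came from this checkout (the exact version string and commit will differ):

./bin/llama-cli --version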

CLI

./bin/llama-cli \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -st -p "The capital of France is"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0 (Apple M2 Max)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 92274.69 MB

Loading model...

> The capital of France is

The capital of France is Paris.

[ Prompt: 96.1 t/s | Generation: 33.3 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - MTL0 (Apple M2 Max) | 88000 = 27848 + (59447 = 58324 +     632 +     490) +         704 |
common_memory_breakdown_print: |   - Host                |                    653 =   345 +       0 +     308                |
ggml_metal_free: deallocating

Server

./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa -ngl 99
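
Once the server is up, it can be queried through the OpenAI-compatible endpoint. A minimal example, assuming the default bind address of 127.0.0.1:8080:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "The capital of France is"}],
    "max_tokens": 32
  }'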

Performance (MacBook Pro M2 Max, 96 GB)

  • Prefill: ~250-400 tok/s
  • Generation: ~30-45 tok/s

./bin/llama-batched-bench \
  -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.169 |   437.96 |    2.739 |    46.73 |    3.908 |   163.75 |
|  1024 |    128 |    1 |   1152 |    2.855 |   358.72 |    3.534 |    36.22 |    6.389 |   180.32 |
|  2048 |    128 |    1 |   2176 |    6.073 |   337.25 |    3.535 |    36.20 |    9.608 |   226.48 |
|  4096 |    128 |    1 |   4224 |   12.564 |   326.00 |    3.753 |    34.10 |   16.318 |   258.86 |
|  8192 |    128 |    1 |   8320 |   26.474 |   309.43 |    3.938 |    32.50 |   30.412 |   273.57 |
| 16384 |    128 |    1 |  16512 |   57.800 |   283.46 |    4.252 |    30.10 |   62.052 |   266.10 |
| 32768 |    128 |    1 |  32896 |  131.884 |   248.46 |    4.631 |    27.64 |  136.515 |   240.97 |

llama_perf_context_print:        load time =    7196.80 ms
llama_perf_context_print: prompt eval time =  239042.77 ms / 65040 tokens (    3.68 ms per token,   272.09 tokens per second)
llama_perf_context_print:        eval time =   26374.75 ms /   896 runs   (   29.44 ms per token,    33.97 tokens per second)
llama_perf_context_print:       total time =  272401.59 ms / 65936 tokens
llama_perf_context_print:    graphs reused =        889

Quantization Method

This GGUF quantization was developed entirely by AI coding agents reading the bailing_hybrid.py implementation from mlx-lm#1227 and adapting it for llama.cpp compatibility.
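
The end result follows the usual llama.cpp two-step flow. Roughly, and only as a sketch (the exact invocation the agents produced is not recorded here, and the f16 filename is a placeholder; the conversion script on the branch above is assumed to handle the bailing_hybrid architecture):

# 1. Convert the HF checkpoint to an f16 GGUF
python convert_hf_to_gguf.py /path/to/Ling-2.6-flash --outfile Ling-2.6-flash-f16.gguf

# 2. Quantize to IQ4_NL
./bin/llama-quantize Ling-2.6-flash-f16.gguf \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf IQ4_NL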

Agents / LLMs used:

  • Claude / GLM-5.1
  • OpenCode / Kimi-K2.6
  • OpenCode / DeepSeek-V4-Pro

Credits
