# Ling-2.6-flash GGUF

Quantized GGUF of inclusionAI/Ling-2.6-flash, a 104B-parameter MoE model (7.4B active) with a hybrid MLA/GLA architecture.
## Files

| File | Size | Format |
|---|---|---|
| Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf | ~57 GB | IQ4_NL |
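To grab the file, something along these lines should work (a sketch; assumes the Hugging Face CLI is installed and that the file lives in this repo, `ljupco/Ling-2.6-flash-GGUF`):

```bash
# Sketch: download the quantized GGUF from the Hub
huggingface-cli download ljupco/Ling-2.6-flash-GGUF \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  --local-dir .
```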
## Running in llama.cpp

This model requires a custom llama.cpp branch with Bailing Hybrid architecture support:

https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-mtp
MTP works in the sense that llama-server accepts `--spec-type mtp`, but at the moment it actually slows decoding down, so the speed tests below were run without MTP. It is unclear why MTP does not help. Possible reasons: the MTP implementation is poor or buggy; Ling-2.6 has only one extra head, giving only one extra draft token, which may not suffice; or the quantization may be detrimental.
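For anyone who wants to reproduce the MTP result anyway, a minimal invocation sketch (`--spec-type mtp` is specific to this branch; everything else is a standard llama-server flag):

```bash
# Sketch: run with MTP speculative decoding enabled
# (currently observed to be slower than plain decoding)
./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  --spec-type mtp
```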
### Build

```bash
git clone https://github.com/ljubomirj/llama.cpp.git
cd llama.cpp
git checkout LJ-Ling-2.6-flash-mtp
mkdir -p build && cd build
cmake .. -DGGML_METAL=ON
make -j llama-cli llama-server llama-batched-bench
```
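A quick smoke test that the build produced working binaries:

```bash
./bin/llama-cli --version
```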
### CLI

```bash
./bin/llama-cli \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -st -p "The capital of France is"
```

Output:
```
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Max)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 92274.69 MB
Loading model...
> The capital of France is
The capital of France is Paris.
[ Prompt: 96.1 t/s | Generation: 33.3 t/s ]
Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - MTL0 (Apple M2 Max) | 88000 = 27848 + (59447 = 58324 + 632 + 490) + 704 |
common_memory_breakdown_print: | - Host | 653 = 345 + 0 + 308 |
ggml_metal_free: deallocating
```
### Server

```bash
./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa on -ngl 99
```
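Once the server is up (port 8080 by default), it exposes llama-server's OpenAI-compatible HTTP API:

```bash
# Query the chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "The capital of France is"}],
        "max_tokens": 32
      }'
```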
## Performance (MacBook Pro M2 Max, 96 GB)
- Prefill: ~250-400 tok/s
- Generation: ~30-45 tok/s
```bash
./bin/llama-batched-bench \
  -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
```

```
main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 1.169 | 437.96 | 2.739 | 46.73 | 3.908 | 163.75 |
| 1024 | 128 | 1 | 1152 | 2.855 | 358.72 | 3.534 | 36.22 | 6.389 | 180.32 |
| 2048 | 128 | 1 | 2176 | 6.073 | 337.25 | 3.535 | 36.20 | 9.608 | 226.48 |
| 4096 | 128 | 1 | 4224 | 12.564 | 326.00 | 3.753 | 34.10 | 16.318 | 258.86 |
| 8192 | 128 | 1 | 8320 | 26.474 | 309.43 | 3.938 | 32.50 | 30.412 | 273.57 |
| 16384 | 128 | 1 | 16512 | 57.800 | 283.46 | 4.252 | 30.10 | 62.052 | 266.10 |
| 32768 | 128 | 1 | 32896 | 131.884 | 248.46 | 4.631 | 27.64 | 136.515 | 240.97 |
llama_perf_context_print: load time = 7196.80 ms
llama_perf_context_print: prompt eval time = 239042.77 ms / 65040 tokens ( 3.68 ms per token, 272.09 tokens per second)
llama_perf_context_print: eval time = 26374.75 ms / 896 runs ( 29.44 ms per token, 33.97 tokens per second)
llama_perf_context_print: total time = 272401.59 ms / 65936 tokens
llama_perf_context_print: graphs reused = 889
```
## Quantization Method

This GGUF quantization was developed entirely by AI coding agents reading the `bailing_hybrid.py` implementation from mlx-lm#1227 and adapting it for llama.cpp compatibility.
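For reference, a rough sketch of what such a conversion pipeline looks like (hypothetical paths and output names, assuming the fork's convert script understands the Bailing Hybrid architecture):

```bash
# Sketch: convert the HF checkpoint to GGUF, then quantize to IQ4_NL
python convert_hf_to_gguf.py /path/to/Ling-2.6-flash \
  --outfile Ling-2.6-flash-F16.gguf --outtype f16
./bin/llama-quantize Ling-2.6-flash-F16.gguf \
  Ling-2.6-flash-IQ4_NL.gguf IQ4_NL
```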
Agents / LLMs used:
- Claude / GLM-5.1
- OpenCode / Kimi-K2.6
- OpenCode / DeepSeek-V4-Pro
## Credits

- Original model: inclusionAI/Ling-2.6-flash
- The original `bailing_hybrid.py` implementation from mlx-lm#1227
- Custom llama.cpp fork: ljubomirj/llama.cpp @ LJ-Ling-2.6-flash-mtp
- MLX reference implementation: mlx-community/Ling-2.6-flash-mlx-4bit-DWQ