Use from the llama-cpp-python library
# !pip install llama-cpp-python
# Note: these GGUFs target a Ling-2.6-capable llama.cpp fork; a stock
# llama-cpp-python build may not load the bailingmoe2.5 architecture yet.

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="ssweens/inclusionAI__Ling-2.6-flash-GGUF-YMMV",
	filename="*Q4_K_M.gguf",  # pick any published quant; from_pretrained accepts a glob pattern
)
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
print(response["choices"][0]["message"]["content"])

🧪 Experimental GGUFs for Ling-2.6-flash

A stopgap for experimenting with Ling 2.6 locally while the tooling ecosystem catches up. Expect rough edges. Validated for text and coding coherence only.

GGUF files for inclusionAI/Ling-2.6-flash (108B parameters, bailingmoe2.5 architecture).
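
To grab a single quant without cloning the whole repo, huggingface-cli works; the include pattern below assumes the file naming visible in the llama-server example further down, so adjust it to whichever quant you want:

huggingface-cli download ssweens/inclusionAI__Ling-2.6-flash-GGUF-YMMV \
  --include "*Q4_K_M.gguf" --local-dir ./models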

โš ๏ธ You need the custom fork

These GGUFs require a Ling-2.6-capable fork of llama.cpp. Vanilla llama.cpp doesn't support the BailingMoeV2.5 architecture yet.
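
The fork link isn't reproduced here; assuming you have a checkout of it, the build is the standard llama.cpp CMake flow (the CUDA flag is just one example configuration):

# Build the Ling-2.6-capable fork; <fork-url> is a placeholder, not a real repository.
git clone <fork-url> llama.cpp-ling && cd llama.cpp-ling
cmake -B build -DGGML_CUDA=ON          # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j
# llama-server, llama-cli, etc. end up in build/bin/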

Performance

Example:

llama-server -ngl 99 --no-mmap -fa on -np 1 --reasoning-format auto --jinja --threads 3 -ts 4,4,3 -dev CUDA0,CUDA1,CUDA2 \
  -m /mnt/supmodels/gguf/inclusionAI__Ling-2.6-flash/inclusionAI__Ling-2.6-flash-Q4_K_M.gguf -c 32768 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0
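
Once the server is up it speaks the OpenAI-compatible HTTP API; a quick smoke test, assuming the default port 8080:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a Python function that reverses a string."}]}'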

Speed (custom, n=2)

Model    Prompt t/s   Gen t/s   TTFT (s)   Decode (s)   Backend
IQ2_XS      1438.08     34.58       0.64        3.70    CUDA
Q2_K        1407.68     34.30       0.65        3.73    CUDA
Q4_K_M      1176.48     27.09       0.78        4.72    CUDA
Q8_0         531.16     15.35       1.66        8.34    CUDA+ROCm
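
The table above comes from a custom harness (n=2). For a rough cross-check on your own hardware, llama-bench from the same fork reports prompt-processing and generation throughput; the flags below are standard llama-bench options, not the exact setup used for the table:

llama-bench -m inclusionAI__Ling-2.6-flash-Q4_K_M.gguf -ngl 99 -fa 1 -p 512 -n 128 -r 2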

Coding (humaneval_instruct, n=30)

Model    pass@1          Backend
IQ2_XS   0.933 ± 0.046   CUDA
Q2_K     0.967 ± 0.033   CUDA
Q4_K_M   1.000 ± 0.000   CUDA
Q8_0     1.000 ± 0.000   CUDA+ROCm

Original model

inclusionAI/Ling-2.6-flash

Thanks

  • inclusionAI - open model weights, architecture, and the BailingMoeV2.5 design
  • llama.cpp - the project that makes local LLM inference possible