# 🧪 Experimental GGUFs for Ling-2.6-flash
A stopgap for experimenting with Ling 2.6 locally while the tooling ecosystem catches up. Expect rough edges. Validated for text and coding coherence.
GGUF files for inclusionAI/Ling-2.6-flash.
## ⚠️ You need the custom fork
These GGUFs require a Ling-2.6-capable fork of llama.cpp. Vanilla llama.cpp doesn't support the BailingMoeV2.5 architecture yet.
- llama.cpp fork: ssweens/llama.cpp-ling-2.6 (a build sketch follows below)
- Backends: tested on CUDA and ROCm
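If you haven't built llama.cpp from source before, here is a minimal sketch. It assumes the fork follows upstream llama.cpp's CMake build and that the repository lives at the GitHub path implied by the fork name; both are assumptions, so check the fork's README first.

```bash
# Assumed URL, inferred from the fork name ssweens/llama.cpp-ling-2.6.
git clone https://github.com/ssweens/llama.cpp-ling-2.6
cd llama.cpp-ling-2.6

# CUDA build, as in upstream llama.cpp; use -DGGML_HIP=ON for ROCm instead.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```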
## Performance
Example server invocation:

```bash
llama-server -ngl 99 --no-mmap -fa on -np 1 --reasoning-format auto --jinja \
  --threads 3 -ts 4,4,3 -dev CUDA0,CUDA1,CUDA2 \
  -m /mnt/supmodels/gguf/inclusionAI__Ling-2.6-flash/inclusionAI__Ling-2.6-flash-Q4_K_M.gguf \
  -c 32768 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0
```
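llama-server exposes an OpenAI-compatible HTTP API (port 8080 by default), so once the server is up you can sanity-check coherence with a plain curl call:

```bash
# Minimal smoke test against the running llama-server instance.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "What is the capital of France?"}
        ]
      }'
```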
### Speed (custom benchmark, n=2 runs)

| Quant | Prompt t/s | Gen t/s | TTFT (s) | Decode (s) | Backend |
|---|---|---|---|---|---|
| IQ2_XS | 1438.08 | 34.58 | 0.64 | 3.70 | CUDA |
| Q2_K | 1407.68 | 34.30 | 0.65 | 3.73 | CUDA |
| Q4_K_M | 1176.48 | 27.09 | 0.78 | 4.72 | CUDA |
| Q8_0 | 531.16 | 15.35 | 1.66 | 8.34 | CUDA+ROCm |
### Coding (humaneval_instruct, n=30)

| Quant | pass@1 | Backend |
|---|---|---|
| IQ2_XS | 0.933 ± 0.046 | CUDA |
| Q2_K | 0.967 ± 0.033 | CUDA |
| Q4_K_M | 1.000 ± 0.000 | CUDA |
| Q8_0 | 1.000 ± 0.000 | CUDA+ROCm |
## Original model

[inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
## Thanks

- inclusionAI, for the open model weights, architecture, and the BailingMoeV2.5 design
- llama.cpp, the project that makes local LLM inference possible
## Use with llama-cpp-python

The snippet below is the standard llama-cpp-python loader for this repo. Note that stock llama-cpp-python bundles upstream llama.cpp, so it will likely fail to load these files until BailingMoeV2.5 support lands there. The filename shown is the Q4_K_M quant from this repo; swap in IQ2_XS, Q2_K, or Q8_0 as needed.

```python
# !pip install llama-cpp-python
from llama_cpp import Llama

# Download the chosen quant from the Hub and load it.
llm = Llama.from_pretrained(
    repo_id="ssweens/inclusionAI__Ling-2.6-flash-GGUF-YMMV",
    filename="inclusionAI__Ling-2.6-flash-Q4_K_M.gguf",
)

# Simple chat completion to verify the model responds coherently.
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
```