How to use with SGLang
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SparseLLM/DECO-0.5B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "SparseLLM/DECO-0.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
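# Alternatively, since the server exposes an OpenAI-compatible API, you can
# call it with the OpenAI Python client. A minimal sketch, assuming the
# `openai` package is installed (`pip install openai`); the api_key is unused
# by a local server and can be any placeholder string:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="SparseLLM/DECO-0.5B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)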
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SparseLLM/DECO-0.5B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl exactly as in the pip-based example above.
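# To check that the containerized server is ready before sending chat
# requests, you can query the OpenAI-compatible model listing endpoint.
# A minimal sketch in Python, assuming the `requests` package is installed:
import requests

resp = requests.get("http://localhost:30000/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should include SparseLLM/DECO-0.5B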

DECO-0.5B

This is the 0.5B DECO checkpoint introduced in the paper DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices.

DECO (Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices) is a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. It is an improved version of the BlockFFN architecture.
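
For intuition about what "sparse MoE" means here, the sketch below shows a generic top-k routed mixture-of-experts feed-forward block in PyTorch. It illustrates sparse expert routing in general, not DECO's or BlockFFN's actual design; the expert count, top-k value, and dimensions are made-up values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    # Generic top-k MoE feed-forward block (illustration only, not DECO's design).
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)                # normalize routing weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # each expert sees only its tokens
            for slot in range(self.k):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out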

Quick start

You can load and use this model with AutoTokenizer and AutoModelForCausalLM from transformers. Since the model uses a custom architecture, trust_remote_code=True is required.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "SparseLLM/DECO-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

prompt = "Mixture-of-Experts models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
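
If the tokenizer ships a chat template, the model can also be prompted in chat format. A minimal sketch, continuing from the code above and assuming tokenizer.chat_template is defined for this checkpoint:

messages = [{"role": "user", "content": "What is the capital of France?"}]
chat_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

with torch.no_grad():
    chat_out = model.generate(chat_ids, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(chat_out[0][chat_ids.shape[-1]:], skip_special_tokens=True))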

Citation

If you find our work useful for your research, please cite our paper as follows:

@article{song2026deco,
  title={{DECO}: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices},
  author={Chenyang Song and Weilin Zhao and Xu Han and Chaojun Xiao and Yingfa Chen and Zhiyuan Liu},
  journal={arXiv preprint arXiv:2605.10933},
  year={2026},
  url={https://arxiv.org/pdf/2605.10933},
}