DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
Paper: arXiv:2605.10933
# Start the SGLang server with Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SparseLLM/DECO-0.2B" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "SparseLLM/DECO-0.2B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'

This is the 0.2B DECO checkpoint introduced by the paper "DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices".
DECO (Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices) is a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. It is an improved version of the BlockFFN architecture.
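To make the phrase "identical total parameter budgets" concrete, here is a small arithmetic sketch comparing a dense FFN with a generic top-k MoE layer of the same total size. Everything in it is an assumption for illustration: the layer sizes are hypothetical, router weights are ignored, and fixed top-k routing is a textbook stand-in rather than DECO's actual routing scheme, which is described in the paper.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a plain two-matrix FFN (up and down projections, no biases)."""
    return 2 * d_model * d_ff


def moe_param_counts(d_model: int, d_expert: int, n_experts: int, n_active: int):
    """Return (total, active-per-token) FFN weights for a generic top-k MoE layer.

    Router weights are ignored to keep the arithmetic simple.
    """
    per_expert = ffn_params(d_model, d_expert)
    return n_experts * per_expert, n_active * per_expert


# Hypothetical sizes, NOT taken from the DECO configuration.
d_model = 1024
dense_total = ffn_params(d_model, d_ff=4096)
moe_total, moe_active = moe_param_counts(
    d_model, d_expert=512, n_experts=8, n_active=2
)

print(f"dense FFN weights:        {dense_total:,}")
print(f"MoE total weights:        {moe_total:,}")
print(f"MoE weights used / token: {moe_active:,}")
```

With these hypothetical numbers the dense and MoE layers store the same number of weights, while the MoE layer touches only a quarter of them per token. That gap is where the compute savings of a sparse model come from.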
You can load and use this model with AutoTokenizer and AutoModelForCausalLM from transformers. Since the model uses a custom architecture, trust_remote_code=True is required.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "SparseLLM/DECO-0.2B"

# trust_remote_code=True is needed because the architecture is defined in the model repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

prompt = "Mixture-of-Experts models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Greedy decoding (do_sample=False) of up to 64 new tokens.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
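The snippet above hard-codes "cuda" and bfloat16. On machines without a suitable GPU, a small fallback rule can pick the device and dtype instead. The helper below is a hypothetical sketch, not part of the model repo, and it assumes the checkpoint also runs correctly in float16 and float32, which this card does not state.

```python
def choose_device_and_dtype(cuda_available: bool, bf16_supported: bool) -> tuple[str, str]:
    """Return (device, dtype name) for loading the model.

    Hypothetical helper: prefers bfloat16 on GPU, falls back to float16 on
    GPUs without bfloat16 support, and to float32 on CPU.
    """
    if cuda_available and bf16_supported:
        return "cuda", "bfloat16"
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


# The three possible outcomes of the rule:
print(choose_device_and_dtype(True, True))
print(choose_device_and_dtype(True, False))
print(choose_device_and_dtype(False, False))
```

With PyTorch installed, the two flags can come from torch.cuda.is_available() and, when CUDA is available, torch.cuda.is_bf16_supported(). The returned dtype name maps to a real dtype via getattr(torch, dtype_name), and the device string replaces "cuda" in both .to(...) calls.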
If you find our work useful for your research, please kindly cite our paper as follows:
@article{song2026deco,
  title={{DECO}: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices},
  author={Chenyang Song and Weilin Zhao and Xu Han and Chaojun Xiao and Yingfa Chen and Zhiyuan Liu},
  journal={arXiv preprint arXiv:2605.10933},
  year={2026},
  url={https://arxiv.org/pdf/2605.10933},
}
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SparseLLM/DECO-0.2B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server with the same curl request as in the Docker example above.
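The curl request used in this card can also be sent from Python. The sketch below uses only the standard library and assumes the server is reachable at http://localhost:30000, as in the commands above, and that the response follows the usual OpenAI chat-completions shape (choices[0].message.content). The function names build_chat_request and chat are illustrative, not part of SGLang.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "SparseLLM/DECO-0.2B",
    base_url: str = "http://localhost:30000",
) -> urllib.request.Request:
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to a running server and return the reply text.

    Assumes an OpenAI-style response body; raises urllib.error.URLError
    if the server is not reachable.
    """
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Once the server is running, print(chat("What is the capital of France?")) sends the same question as the curl example. Keeping request construction separate from the network call makes the payload easy to inspect without a live server.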