DECO-0.2B

This is the 0.2B DECO checkpoint introduced in the paper DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices.

DECO (Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices) is a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. It is an improved version of the BlockFFN architecture.

Quick start

You can load and use this model with AutoTokenizer and AutoModelForCausalLM from transformers. Since the model uses a custom architecture, trust_remote_code=True is required.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "SparseLLM/DECO-0.2B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

prompt = "Mixture-of-Experts models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
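
The snippet above hard-codes "cuda" and bfloat16. On a machine without a GPU, one possible approach is to pick the device and dtype before loading. The following is a minimal sketch, not part of the released code: choose_runtime is a hypothetical helper, and it is an assumption that the custom DECO modeling code runs on CPU in float32.

```python
def choose_runtime(cuda_available: bool) -> dict:
    """Pick a device and dtype name for loading the checkpoint.

    Hypothetical helper for illustration only. It mirrors the quick-start
    settings when a GPU is present, and falls back to CPU with float32
    otherwise (assumption: the custom model code supports CPU inference).
    """
    if cuda_available:
        # Same settings as the quick-start snippet.
        return {"device": "cuda", "dtype_name": "bfloat16"}
    # float32 is the conservative choice on CPU.
    return {"device": "cpu", "dtype_name": "float32"}


# Intended use together with the quick-start code (not executed here):
#   settings = choose_runtime(torch.cuda.is_available())
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id,
#       torch_dtype=getattr(torch, settings["dtype_name"]),
#       trust_remote_code=True,
#   ).to(settings["device"]).eval()
#   inputs = tokenizer(prompt, return_tensors="pt").to(settings["device"])
```

Keeping the choice in a plain function that takes a boolean makes it easy to check without a GPU or without importing torch.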

Citation

If you find our work useful for your research, please cite our paper:

@article{song2026deco,
      title={{DECO}: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices},
      author={Chenyang Song and Weilin Zhao and Xu Han and Chaojun Xiao and Yingfa Chen and Zhiyuan Liu},
      journal={arXiv preprint arXiv:2605.10933},
      year={2026},
      url={https://arxiv.org/pdf/2605.10933},
}