Instructions for using CohereLabs/command-a-plus-05-2026-w4a4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use CohereLabs/command-a-plus-05-2026-w4a4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="CohereLabs/command-a-plus-05-2026-w4a4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so the output is visible when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("CohereLabs/command-a-plus-05-2026-w4a4")
model = AutoModelForImageTextToText.from_pretrained("CohereLabs/command-a-plus-05-2026-w4a4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use CohereLabs/command-a-plus-05-2026-w4a4 with vLLM:

Install from pip and serve model

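# Install vLLM from pip (per the README, w4a4 needs vLLM >= 0.21.0 and Cohere's melody library):
pip install "vllm>=0.21.0" "cohere_melody>=0.9.0"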
# Start the vLLM server:
vllm serve "CohereLabs/command-a-plus-05-2026-w4a4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CohereLabs/command-a-plus-05-2026-w4a4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
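
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field. The helper names `build_chat_request` and `ask` are hypothetical and are not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "CohereLabs/command-a-plus-05-2026-w4a4"


def build_chat_request(prompt, image_url, base_url="http://localhost:8000"):
    """Build the POST request that mirrors the curl call above (hypothetical helper)."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, image_url, base_url="http://localhost:8000", timeout=120):
    """Send the request and return the reply text.

    Assumption: the response follows the OpenAI chat-completions layout.
    """
    request = build_chat_request(prompt, image_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server, so it is left commented out):
# print(ask("Describe this image in one sentence.",
#           "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"))
```

Passing `base_url="http://localhost:30000"` should target the SGLang server shown further down, since both expose the same endpoint path.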

Use Docker

docker model run hf.co/CohereLabs/command-a-plus-05-2026-w4a4

SGLang

How to use CohereLabs/command-a-plus-05-2026-w4a4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CohereLabs/command-a-plus-05-2026-w4a4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CohereLabs/command-a-plus-05-2026-w4a4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "CohereLabs/command-a-plus-05-2026-w4a4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CohereLabs/command-a-plus-05-2026-w4a4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner

How to use CohereLabs/command-a-plus-05-2026-w4a4 with Docker Model Runner:

```
docker model run hf.co/CohereLabs/command-a-plus-05-2026-w4a4
```

alexrs committed about 16 hours ago

Commit f33b2c6 (verified) · 1 Parent(s): 710fbfe

Update README via Huggy

Files changed (1)

README.md +13 -16

@@ -67,17 +67,16 @@ Command A+ is an open source model with 25 billion active parameters and 218B to
 Developed by: [Cohere](https://cohere.com/) and [Cohere Labs](https://cohere.com/research)
-* Point of Contact: [**Cohere Labs**](https://cohere.com/research)
-* License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
-* Model: command-a-plus-05-2026
-* Model Size: 25B active parameters, 218B total parameters
+* Point of Contact: [**Cohere Labs**](https://cohere.com/research)
+* License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+* Model: command-a-plus-05-2026
+* Model Size: 25B active parameters, 218B total parameters
 * Context length: 128K input
-For more details about this model, please check out our [blog post](http://cohere.com/blog/command-a-plus).
+For more details about this model, please check out our [blog post](http://cohere.com/blog/command-a-plus).
 You can try out Command A+ before downloading the weights in our hosted [Hugging Face Space](https://huggingface.co/spaces/CohereLabs/command-a-plus-05-2026).
 **Available quantizations**
 The following quantizations are available with example minimum GPU requirements
@@ -90,7 +89,7 @@ The following quantizations are available with example minimum GPU requirements
 All three quantizations show negligible differences in benchmark quality and performance. **Our recommended quantization for most uses is [W4A4](https://huggingface.co/CohereLabs/command-a-plus-05-2026-w4a4) which boasts superior speed and latency characteristics alongside a smaller hardware footprint.**
-For more details, please check out our [blog post](http://cohere.com/blog/command-a-plus).
+For more details, please check out our [blog post](http://cohere.com/blog/command-a-plus).
 **Usage**
@@ -117,9 +116,9 @@ input_ids = tokenizer.apply_chat_template(
 )
 gen_tokens = model.generate(
-    input_ids,
-    max_new_tokens=4096,
-    do_sample=True,
+    input_ids,
+    max_new_tokens=4096,
+    do_sample=True,
     temperature=0.6,
     top_p=0.95
 )
@@ -174,7 +173,7 @@ print(outputs[0]["generated_text"][-1])
 Command A+ `w4a4` can only run on `vLLM >=0.21.0`. W4A4 and accurate response parsing also requires installing Cohere’s melody library.
 ```sh
-uv pip install vllm>=0.21.0
+uv pip install vllm>=0.21.0
 uv pip install transformers
 uv pip install cohere_melody>=0.9.0
 ```
@@ -188,15 +187,15 @@ vllm serve CohereLabs/command-a-plus-05-2026-w4a4 -tp 1 --tool-call-parser coher
 We recommend using the following set of sampling parameters for generation: `temperature=0.9`, `top_p=0.95`, `repetition_penalty=1.04`.
-**Quantization Methodology:** Reasoning models pay an outsized quantization tax: long decoding traces compound per-token errors, so naive low-bit conversion typically shows up as visible regressions on hard benchmarks. To mitigate this, we quantize selectively and use distillation to close the residual quality gap. We apply NVFP4 W4A4 quantization (4-bit weights and activations, with two-level scaling) to the MoE experts only. The attention path, i.e., Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision. MoE experts dominate total parameter count, so quantizing them to 4 bits brings the model within the memory budget of a single B200 and accelerates the expert GEMMs that bottleneck short-to-medium-context decode. Furthermore, we use Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student is trained to match the full-precision teacher's output distribution, with fake quantization operators in the forward pass and straight-through estimators on the backward.
+**Quantization Methodology:** Reasoning models pay an outsized quantization tax: long decoding traces compound per-token errors, so naive low-bit conversion typically shows up as visible regressions on hard benchmarks. To mitigate this, we quantize selectively and use distillation to close the residual quality gap. We apply NVFP4 W4A4 quantization (4-bit weights and activations, with two-level scaling) to the MoE experts only. The attention path, i.e., Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision. MoE experts dominate total parameter count, so quantizing them to 4 bits brings the model within the memory budget of a single B200 and accelerates the expert GEMMs that bottleneck short-to-medium-context decode. Furthermore, we use Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student is trained to match the full-precision teacher's output distribution, with fake quantization operators in the forward pass and straight-through estimators on the backward.
 ## **Model Details**
 **Input**: Text and images.
-**Output**: Model generates text.
-**Model Architecture**: Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer Model. With 25B active parameters and 218B total parameters, it has 128 experts, out of which 8 are active per token, and a single shared expert is applied to all tokens. The attention layers interleave sliding-window attention layers with Rotational Positional Embeddings and global attention layers without positional embeddings in a 3:1 ratio, as first introduced in Command A. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router. We use additive-bias-based load balancing to encourage balanced token load across all experts, and swap out the softmax router activation function with a normalized sigmoid over the topk expert logits per token.
+**Output**: Model generates text.
+**Model Architecture**: Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer Model. With 25B active parameters and 218B total parameters, it has 128 experts, out of which 8 are active per token, and a single shared expert is applied to all tokens. The attention layers interleave sliding-window attention layers with Rotational Positional Embeddings and global attention layers without positional embeddings in a 3:1 ratio, as first introduced in Command A. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router. We use additive-bias-based load balancing to encourage balanced token load across all experts, and swap out the softmax router activation function with a normalized sigmoid over the topk expert logits per token.
 **Languages covered:** The model has been trained on 48 languages: English, Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, Spanish, Estonian, Persian, Finnish, Filipino, French, Irish, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Icelandic, Italian, Japanese, Korean, Lithuanian, Latvian, Malay, Maltese, Dutch, Norwegian, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Serbian, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Chinese.
@@ -295,5 +294,3 @@ For errors or additional questions about details in this model card, contact \[[
 **Try it now:**
 You can try Command A+ in the [playground](https://dashboard.cohere.com/playground/chat?model=command-a-plus-05-2026). You can also use it in our dedicated [Hugging Face Space](https://huggingface.co/spaces/CohereLabs/command-a-plus-05-2026).
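
The Model Architecture paragraph in the README diff above describes a token-choice router that keeps 8 of 128 experts per token, uses an additive bias for load balancing, and replaces softmax with a normalized sigmoid over the top-k expert logits. The following is a rough pure-Python sketch of that routing step for a single token, not Cohere's implementation. It assumes the bias affects only which experts are selected and that the gate weights come from the unbiased logits, which the README does not specify. The function name `route_token` is hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def route_token(logits, bias, top_k=8):
    """Pick top_k experts for one token and return (expert_indices, gate_weights).

    logits: one router logit per expert (128 in the README's description).
    bias:   one load-balancing bias per expert.
    Assumption: the bias is added only for selection, not for the gate weights.
    """
    if len(logits) != len(bias):
        raise ValueError("logits and bias must have the same length")
    # Token-choice routing: the token ranks experts by biased score.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    # Normalized sigmoid over the selected experts' logits (instead of softmax).
    gates = [sigmoid(logits[i]) for i in chosen]
    total = sum(gates)
    weights = [g / total for g in gates]
    return chosen, weights


# Example: 128 experts with increasing logits and no bias.
example_logits = [i * 0.01 for i in range(128)]
example_bias = [0.0] * 128
experts, gate_weights = route_token(example_logits, example_bias)
print(experts)            # indices of the 8 selected experts
print(sum(gate_weights))  # gate weights are normalized to sum to about 1
```

In the architecture as described, the routed experts' outputs would be combined using these weights, with the single shared expert applied to every token in addition. That combination step is omitted here.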