alexrs committed on
Commit
f33b2c6
·
verified ·
1 Parent(s): 710fbfe

Update README via Huggy

Files changed (1)
  1. README.md +13 -16
README.md CHANGED
@@ -67,17 +67,16 @@ Command A+ is an open source model with 25 billion active parameters and 218B to
67
 
68
  Developed by: [Cohere](https://cohere.com/) and [Cohere Labs](https://cohere.com/research)
69
 
70
- * Point of Contact: [**Cohere Labs**](https://cohere.com/research)
71
- * License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
72
- * Model: command-a-plus-05-2026
73
- * Model Size: 25B active parameters, 218B total parameters
74
  * Context length: 128K input
75
 
76
- For more details about this model, please check out our [blog post](http://cohere.com/blog/command-a-plus).
77
 
78
  You can try out Command A+ before downloading the weights in our hosted [Hugging Face Space](https://huggingface.co/spaces/CohereLabs/command-a-plus-05-2026).
79
 
80
-
81
  **Available quantizations**
82
 
83
  The following quantizations are available with example minimum GPU requirements
@@ -90,7 +89,7 @@ The following quantizations are available with example minimum GPU requirements
90
 
91
  All three quantizations show negligible differences in benchmark quality and performance. **Our recommended quantization for most uses is [W4A4](https://huggingface.co/CohereLabs/command-a-plus-05-2026-w4a4) which boasts superior speed and latency characteristics alongside a smaller hardware footprint.**
92
 
93
- For more details, please check out our [blog post](http://cohere.com/blog/command-a-plus).
94
 
95
 
96
  **Usage**
@@ -117,9 +116,9 @@ input_ids = tokenizer.apply_chat_template(
117
  )
118
 
119
  gen_tokens = model.generate(
120
- input_ids,
121
- max_new_tokens=4096,
122
- do_sample=True,
123
  temperature=0.6,
124
  top_p=0.95
125
  )
@@ -174,7 +173,7 @@ print(outputs[0]["generated_text"][-1])
174
  Command A+ `w4a4` can only run on `vLLM >=0.21.0`. W4A4 and accurate response parsing also require installing Cohere’s melody library.
175
 
176
  ```sh
177
- uv pip install "vllm>=0.21.0"
178
  uv pip install transformers
179
  uv pip install "cohere_melody>=0.9.0"
180
  ```
@@ -188,15 +187,15 @@ vllm serve CohereLabs/command-a-plus-05-2026-w4a4 -tp 1 --tool-call-parser coher
188
 
189
  We recommend using the following set of sampling parameters for generation: `temperature=0.9`, `top_p=0.95`, `repetition_penalty=1.04`.
190
 
191
- **Quantization Methodology:** Reasoning models pay an outsized quantization tax: long decoding traces compound per-token errors, so naive low-bit conversion typically shows up as visible regressions on hard benchmarks. To mitigate this, we quantize selectively and use distillation to close the residual quality gap. We apply NVFP4 W4A4 quantization (4-bit weights and activations, with two-level scaling) to the MoE experts only. The attention path, i.e., Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision. MoE experts dominate total parameter count, so quantizing them to 4 bits brings the model within the memory budget of a single B200 and accelerates the expert GEMMs that bottleneck short-to-medium-context decode. Furthermore, we use Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student is trained to match the full-precision teacher's output distribution, with fake quantization operators in the forward pass and straight-through estimators in the backward pass.
192
 
193
  ## **Model Details**
194
 
195
  **Input**: Text and images.
196
 
197
- **Output**: Model generates text.
198
 
199
- **Model Architecture**: Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer model. With 25B active parameters and 218B total parameters, it has 128 experts, of which 8 are active per token, plus a single shared expert that is applied to all tokens. The attention layers interleave sliding-window attention layers with Rotary Positional Embeddings and global attention layers without positional embeddings in a 3:1 ratio, as first introduced in Command A. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router. We use additive-bias-based load balancing to encourage balanced token load across all experts, and replace the softmax router activation function with a normalized sigmoid over the top-k expert logits per token.
200
 
201
  **Languages covered:** The model has been trained on 48 languages: English, Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, Spanish, Estonian, Persian, Finnish, Filipino, French, Irish, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Icelandic, Italian, Japanese, Korean, Lithuanian, Latvian, Malay, Maltese, Dutch, Norwegian, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Serbian, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Chinese.
202
 
@@ -295,5 +294,3 @@ For errors or additional questions about details in this model card, contact \[[
295
  **Try it now:**
296
 
297
  You can try Command A+ in the [playground](https://dashboard.cohere.com/playground/chat?model=command-a-plus-05-2026). You can also use it in our dedicated [Hugging Face Space](https://huggingface.co/spaces/CohereLabs/command-a-plus-05-2026).
298
-
299
-
 
67
 
68
  Developed by: [Cohere](https://cohere.com/) and [Cohere Labs](https://cohere.com/research)
69
 
70
+ * Point of Contact: [**Cohere Labs**](https://cohere.com/research)
71
+ * License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
72
+ * Model: command-a-plus-05-2026
73
+ * Model Size: 25B active parameters, 218B total parameters
74
  * Context length: 128K input
75
 
76
+ For more details about this model, please check out our [blog post](http://cohere.com/blog/command-a-plus).
77
 
78
  You can try out Command A+ before downloading the weights in our hosted [Hugging Face Space](https://huggingface.co/spaces/CohereLabs/command-a-plus-05-2026).
79
 
 
80
  **Available quantizations**
81
 
82
  The following quantizations are available with example minimum GPU requirements
 
89
 
90
  All three quantizations show negligible differences in benchmark quality and performance. **Our recommended quantization for most uses is [W4A4](https://huggingface.co/CohereLabs/command-a-plus-05-2026-w4a4) which boasts superior speed and latency characteristics alongside a smaller hardware footprint.**
91
 
92
+ For more details, please check out our [blog post](http://cohere.com/blog/command-a-plus).
93
 
94
 
95
  **Usage**
 
116
  )
117
 
118
  gen_tokens = model.generate(
119
+ input_ids,
120
+ max_new_tokens=4096,
121
+ do_sample=True,
122
  temperature=0.6,
123
  top_p=0.95
124
  )
 
173
  Command A+ `w4a4` can only run on `vLLM >=0.21.0`. W4A4 and accurate response parsing also require installing Cohere’s melody library.
174
 
175
  ```sh
176
+ uv pip install "vllm>=0.21.0"
177
  uv pip install transformers
178
  uv pip install "cohere_melody>=0.9.0"
179
  ```
 
187
 
188
  We recommend using the following set of sampling parameters for generation: `temperature=0.9`, `top_p=0.95`, `repetition_penalty=1.04`.
189
 
190
+ **Quantization Methodology:** Reasoning models pay an outsized quantization tax: long decoding traces compound per-token errors, so naive low-bit conversion typically shows up as visible regressions on hard benchmarks. To mitigate this, we quantize selectively and use distillation to close the residual quality gap. We apply NVFP4 W4A4 quantization (4-bit weights and activations, with two-level scaling) to the MoE experts only. The attention path, i.e., Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision. MoE experts dominate total parameter count, so quantizing them to 4 bits brings the model within the memory budget of a single B200 and accelerates the expert GEMMs that bottleneck short-to-medium-context decode. Furthermore, we use Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student is trained to match the full-precision teacher's output distribution, with fake quantization operators in the forward pass and straight-through estimators in the backward pass.
191
 
192
  ## **Model Details**
193
 
194
  **Input**: Text and images.
195
 
196
+ **Output**: Model generates text.
197
 
198
+ **Model Architecture**: Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer model. With 25B active parameters and 218B total parameters, it has 128 experts, of which 8 are active per token, plus a single shared expert that is applied to all tokens. The attention layers interleave sliding-window attention layers with Rotary Positional Embeddings and global attention layers without positional embeddings in a 3:1 ratio, as first introduced in Command A. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router. We use additive-bias-based load balancing to encourage balanced token load across all experts, and replace the softmax router activation function with a normalized sigmoid over the top-k expert logits per token.
199
 
200
  **Languages covered:** The model has been trained on 48 languages: English, Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, Spanish, Estonian, Persian, Finnish, Filipino, French, Irish, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Icelandic, Italian, Japanese, Korean, Lithuanian, Latvian, Malay, Maltese, Dutch, Norwegian, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Serbian, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Chinese.
201
 
 
294
  **Try it now:**
295
 
296
  You can try Command A+ in the [playground](https://dashboard.cohere.com/playground/chat?model=command-a-plus-05-2026). You can also use it in our dedicated [Hugging Face Space](https://huggingface.co/spaces/CohereLabs/command-a-plus-05-2026).
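
Editor's sketch (not part of the commit): the router described in the **Model Architecture** paragraph of the diff can be illustrated for a single token as follows. This is a minimal sketch, not Cohere's implementation. The function name `route_token`, the argument `selection_bias`, and the choice to apply the load-balancing bias only when selecting experts (leaving the gate weights computed from the unbiased logits) are assumptions made for illustration; the README states only that the bias is additive and that the gate is a normalized sigmoid over the top-k logits.

```python
import math
import random


def route_token(logits, selection_bias, top_k=8):
    """Pick top_k experts for one token and return (indices, gate weights).

    logits:         one router logit per expert (128 in the README).
    selection_bias: hypothetical per-expert additive bias used for load
                    balancing; assumed here to affect selection only.
    """
    # Rank experts by biased score; the bias nudges under-used experts upward.
    biased = [l + b for l, b in zip(logits, selection_bias)]
    top_idx = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:top_k]

    # Sigmoid of the unbiased logits of the chosen experts only.
    sig = [1.0 / (1.0 + math.exp(-logits[i])) for i in top_idx]

    # Normalize so the gate weights of the chosen experts sum to 1.
    total = sum(sig)
    gates = [s / total for s in sig]
    return top_idx, gates


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(128)]
experts, gates = route_token(logits, [0.0] * 128)
print(len(experts), round(sum(gates), 6))
```

The shared expert mentioned in the README is applied to every token and so does not take part in this selection step.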