zancato committed
Commit 06d573a · verified · 1 Parent(s): 7549600

Update README.md

Files changed (1):
1. README.md (+11 -8)
README.md CHANGED
@@ -74,7 +74,8 @@ Evaluated on [HELMET](https://github.com/princeton-nlp/HELMET), [MRCR](https://h
 
 The plot below shows performance averaged over context lengths from 8K to 128K.
 
-**Note.** For the Qwen3-8B (non-thinking, from HF) model, we used YaRN to evaluate on long-context tasks as directed in the [model card](https://huggingface.co/Qwen/Qwen3-8B)
+> [!NOTE]
+> For the Qwen3-8B (non-thinking, from HF) model, we used YaRN to evaluate on long-context tasks as directed in the [model card](https://huggingface.co/Qwen/Qwen3-8B).
 
 <img width="1000px" src="https://raw.githubusercontent.com/awslabs/hybrid-model-factory/main/assets/figures/long_context_results_8B_models.png">
 
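As an editorial aside on the YaRN note above: a minimal sketch of the rope-scaling override that the Qwen3-8B model card describes for long-context evaluation. The `factor` and native-window values are assumptions taken from that model card, not from this diff.

```python
# Sketch of the YaRN rope-scaling override suggested in the Qwen3-8B model
# card. Values below are assumptions from that card, not from this diff.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

# YaRN stretches the effective context window by `factor`:
extended_context = int(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])
print(extended_context)  # 131072
```

In Transformers, a dict like this is typically passed through the model config (e.g. via `rope_scaling` in `config.json`) before loading the checkpoint.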
@@ -106,12 +107,13 @@ Each category in the table below averages the following Tulu3-dev subtasks:
 
 
 **How close are the Hybrid models to the Transformer baseline on short context tasks?**
-All Primed Hybrid models are within ~3 points of Qwen3-8B (Long), using [< 0.5% of the base Transformer's pre-training token budget](#training-data). Note, B'MOJO-F fully matches the Transformer baseline but is slower to deploy (see [above](#long-context-benchmarks)).
+All Primed Hybrid models are within ~3 points of Qwen3-8B (Long), using [<0.5% of the base Transformer's pre-training token budget](#training-data). Note that B'MOJO-F fully matches the Transformer baseline but is slower to deploy (see [above](#long-context-benchmarks)).
 
 **Which SSM layer type performs best?**
 Among the non-B'MOJO-F Hybrids, GKA ranks first (a ~2-point gap with Qwen3-8B [Long]), followed by GDN, then Mamba2. This ranking correlates with the expressiveness order of their respective SSM layers.
 
-**Note.** For applications to complex reasoning and coding problems check out our [Primed Hybrid Reasoning models](https://huggingface.co/collections/amazon/primed-hybrid-models-collection).
+> [!NOTE]
+> For applications to complex reasoning and coding problems, check out our [Primed Hybrid Reasoning models](https://huggingface.co/collections/amazon/primed-hybrid-models-collection).
 
 ## About B'MOJO-F
 
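For scale, a back-of-envelope illustration of the "<0.5% of the pre-training token budget" figure in the hunk above. The ~36-trillion-token pre-training count is an assumption taken from the Qwen3 technical report, not from this diff.

```python
# Back-of-envelope arithmetic for the "<0.5% of the pre-training token budget"
# claim. Assumption (from the Qwen3 technical report, not this diff):
# Qwen3 models were pre-trained on roughly 36 trillion tokens.
pretraining_tokens = 36_000_000_000_000
priming_cap = 0.005 * pretraining_tokens  # the 0.5% ceiling

print(f"{priming_cap / 1e9:.0f}B tokens")  # 180B tokens
```

Under that assumption, the Priming budget is capped at roughly 180 billion tokens.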
@@ -193,10 +195,11 @@ curl http://localhost:8000/v1/chat/completions \
   }'
 ```
 
-> **Tip:** The `--mamba-cache-dtype float32` and `--mamba-ssm-cache-dtype float32` flags are important for accurate long-context generation. See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#recommended-flags-for-hybrid-models) for details on all recommended flags.
+> [!TIP]
+> The `--mamba-cache-dtype float32` and `--mamba-ssm-cache-dtype float32` flags are important for accurate long-context generation. See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#recommended-flags-for-hybrid-models) for details on all recommended flags.
 
-### With HuggingFace Transformers
-See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#huggingface-transformers-inference) for details on when we recommend the HuggingFace Transformers implementation as opposed to the highly optimized vLLM one.
+### With Hugging Face Transformers
+See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#huggingface-transformers-inference) for details on when we recommend the Hugging Face Transformers implementation as opposed to the highly optimized vLLM one.
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -218,11 +221,11 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
 ### Training-Free Context Extension
 
-This model supports training-free context extension 2-4× its native context via an extension to Hybrid models of [PICASO cache composition](https://arxiv.org/abs/2502.17605). See the [State Composition guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/StateComposition.md) for usage. Note, this is currently supported in HuggingFace Transformers only.
+This model supports training-free context extension to 2-4× its native context via an extension of [PICASO cache composition](https://arxiv.org/abs/2502.17605) to Hybrid models. See the [State Composition guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/StateComposition.md) for usage. Note that this is currently supported in Hugging Face Transformers only.
 
 
 ## Training data
-These models were produced through the multi-stage Priming pipeline from [Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory). Training data spans web documents, mathematics, long-context documents, and instruction-following and reasoning examples — each targeting a different capability axis. This diversity is critical: it allows the Priming procedure to convert a base Transformer into a more memory- and compute-efficient Hybrid architecture at nearly the same level of performance, using < 0.5% of the base Transformer model's pre-training token budget.
+These models were produced through the multi-stage Priming pipeline from [Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory). Training data spans web documents, mathematics, long-context documents, and instruction-following and reasoning examples, each targeting a different capability axis. This diversity is critical: it allows the Priming procedure to convert a base Transformer into a more memory- and compute-efficient Hybrid architecture at nearly the same level of performance, using <0.5% of the base Transformer model's pre-training token budget.
 
 ## Responsible AI Considerations
 At Amazon, we are committed to developing AI responsibly and take a people-centric approach that prioritizes education, science, and our customers, to integrate responsible AI across the end-to-end AI lifecycle. We believe the use of AI must respect the rule of law and human rights, and we encourage the safe and responsible development of AI. When downloaded or used in accordance with the [AWS Responsible AI Policy](https://aws.amazon.com/ai/responsible-ai/policy/), developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
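To make the 2-4× training-free context extension figure concrete: assuming a native window of 32,768 tokens (an assumption for illustration only; the native window is not stated in this diff), the extended range works out as follows.

```python
# Illustrative arithmetic for the 2-4x training-free context extension.
# The 32,768-token native window is an assumption for illustration only.
native_window = 32_768
extended_min, extended_max = 2 * native_window, 4 * native_window
print(extended_min, extended_max)  # 65536 131072
```

Under that assumption, PICASO-style cache composition would cover roughly 64K to 128K tokens of context.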
 