The plot below shows performance averaged over context lengths from 8K to 128K.

> [!NOTE]
> For the Qwen3-8B (non-thinking, from HF) model, we used YaRN to evaluate on long-context tasks, as directed in the [model card](https://huggingface.co/Qwen/Qwen3-8B).

<img width="1000px" src="https://raw.githubusercontent.com/awslabs/hybrid-model-factory/main/assets/figures/long_context_results_8B_models.png">
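As a concrete sketch of the YaRN setup mentioned in the note above (the values below are reproduced from the Qwen3-8B model card, not from this repo), enabling YaRN amounts to adding a `rope_scaling` entry to the model's `config.json`:

```python
# rope_scaling entry for YaRN as documented in the Qwen3-8B model card
# (values shown for illustration; check the model card before use).
yarn_rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

# A factor of 4.0 stretches the 32K native window to roughly 128K tokens.
extended_context = int(
    yarn_rope_scaling["factor"] * yarn_rope_scaling["original_max_position_embeddings"]
)
print(extended_context)  # 131072
```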

**How close are the Hybrid models to the Transformer baseline on short context tasks?**
All Primed Hybrid models are within ~3 points of Qwen3-8B (Long), using [<0.5% of the base Transformer's pre-training token budget](#training-data). Note that B'MOJO-F fully matches the Transformer baseline but is slower to deploy (see [above](#long-context-benchmarks)).

**Which SSM layer type performs best?**
Among the non-B'MOJO-F Hybrids, GKA ranks first (a ~2-point gap with Qwen3-8B [Long]), followed by GDN, then Mamba2. This ranking correlates with the expressiveness order of their respective SSM layers.

> [!NOTE]
> For applications to complex reasoning and coding problems, check out our [Primed Hybrid Reasoning models](https://huggingface.co/collections/amazon/primed-hybrid-models-collection).

## About B'MOJO-F

…

```
curl http://localhost:8000/v1/chat/completions \
  …
}'
```

> [!TIP]
> The `--mamba-cache-dtype float32` and `--mamba-ssm-cache-dtype float32` flags are important for accurate long-context generation. See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#recommended-flags-for-hybrid-models) for details on all recommended flags.

### With Hugging Face Transformers

See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#huggingface-transformers-inference) for details on when we recommend the Hugging Face Transformers implementation as opposed to the highly optimized vLLM one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
…
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Training-Free Context Extension

This model supports training-free context extension to 2-4× its native context via an extension of [PICASO cache composition](https://arxiv.org/abs/2502.17605) to Hybrid models. See the [State Composition guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/StateComposition.md) for usage. Note that this is currently supported in Hugging Face Transformers only.
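For intuition only (a toy sketch with hypothetical helper names, not the repo's implementation or the full PICASO algorithm), the property that makes cache composition possible is that for a linear recurrence, the state of a concatenated sequence can be recovered from per-chunk states without re-processing any tokens:

```python
import numpy as np

def scan(x, a=0.9, b=1.0, s0=0.0):
    """Run the scalar linear SSM recurrence s_t = a*s_{t-1} + b*x_t; return the final state."""
    s = s0
    for xt in x:
        s = a * s + b * xt
    return s

rng = np.random.default_rng(0)
chunk1, chunk2 = rng.normal(size=50), rng.normal(size=50)
a = 0.9

# State after reading the full concatenated sequence, computed directly.
s_full = scan(np.concatenate([chunk1, chunk2]), a=a)

# Composed state: decay chunk1's final state through chunk2's length,
# then add chunk2's state (computed independently from a zero initial state).
s_composed = (a ** len(chunk2)) * scan(chunk1, a=a) + scan(chunk2, a=a)

assert np.isclose(s_full, s_composed)
```

Because each SSM state is a fixed-size summary of its chunk, composing cached states lets the model take in more context than its native window at constant memory; the actual composition used for Hybrid models is described in the State Composition guide.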

## Training data

These models were produced through the multi-stage Priming pipeline from [Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory). Training data spans web documents, mathematics, long-context documents, and instruction-following and reasoning examples, each targeting a different capability axis. This diversity is critical: it allows the Priming procedure to convert a base Transformer into a more memory- and compute-efficient Hybrid architecture at nearly the same level of performance, using <0.5% of the base Transformer model's pre-training token budget.
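To put that budget in perspective (assuming, for illustration only, the roughly 36T-token pre-training corpus publicly reported for Qwen3; this figure is not from this repo), the <0.5% bound works out to:

```python
# Assumption for illustration: ~36T tokens reported for Qwen3 pre-training.
pretraining_tokens = 36e12

# "<0.5%" of that budget caps the Priming token count at roughly:
max_priming_tokens = 0.005 * pretraining_tokens
print(f"<{max_priming_tokens / 1e9:.0f}B tokens")  # <180B tokens
```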

## Responsible AI Considerations

At Amazon, we are committed to developing AI responsibly and take a people-centric approach that prioritizes education, science, and our customers to integrate responsible AI across the end-to-end AI lifecycle. We believe the use of AI must respect the rule of law and human rights, and we encourage the safe and responsible development of AI. When downloaded or used in accordance with the [AWS Responsible AI Policy](https://aws.amazon.com/ai/responsible-ai/policy/), developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.