DETAILS = """
### Motivation
Existing tools like the [Hugging Face Model Memory Estimator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage), [DeepSpeed Calculator](https://huggingface.co/spaces/andstor/deepspeed-model-memory-usage), and [DeepSpeed Native Utility](https://deepspeed.readthedocs.io/en/latest/memory.html) are valuable but don't support the full range of modern training configurations.

This tool adds:
- Arbitrary model configurations beyond preset architectures
- FSDP and 5D parallelism support
- Interactive memory breakdowns by category to inform configuration decisions

### References
Helpful resources used while building this:
- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
- [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198)
- [Transformer Math - Michael Wornow](https://michaelwornow.net/2024/01/18/counting-params-in-transformer)
- [Transformer Math 101](https://blog.eleuther.ai/transformer-math/)
"""
|
|
INSTRUCTIONS = """
This calculator estimates a coarse upper bound on the memory used per GPU during training (excluding intermediates).
## How to Use
1. Pick a preset, or adjust the parallelism, model, and training panels to match your run.
2. Press **Calculate** to refresh the memory breakdown chart.
3. Review the details and references below for context on the estimates.
"""
|
|
LIMITATIONS = """
### Key Assumptions:
- Standard transformer architecture with homogeneous layers
- Adam optimizer
- Mixed precision keeps an FP32 master copy of the weights
- Tensor parallelism includes sequence parallelism
- Pipeline parallelism maintains consistent activation memory due to the schedule

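These assumptions pin down the per-parameter byte count used for model state. A minimal sketch of that arithmetic (names are illustrative, assuming BF16 parameters and gradients, an FP32 master copy, and FP32 Adam moments):

```python
def bytes_per_param(param=2, grad=2, master=4, adam_m=4, adam_v=4):
    """Per-parameter bytes: BF16 weights/grads, FP32 master copy, Adam m and v."""
    return param + grad + master + adam_m + adam_v

# e.g. model state for an unsharded 7B-parameter model, before activations:
model_state_gib = 7e9 * bytes_per_param() / 2**30  # roughly 104 GiB
```

Sharding (ZeRO/FSDP, tensor/pipeline parallelism) divides parts of this total across devices; the calculator applies those divisions per category.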
### Not Currently Supported:
- Non-standard architectures (alternating dense/sparse layers, custom attention)
- Multi-modal models with vision layers
- Non-homogeneous parameter dtypes (e.g. BF16 & MXFP4 in GPT-OSS); standard mixed precision is supported
- Kernel/framework overhead and intermediate memory

For advanced configurations, validate results against actual profiling.
"""
|
|
|
|
ACCURACY = """
I validated this calculator against the projected memory usage in The Ultra-Scale Playbook and matched to within 10%. Some overage is expected, since the calculator makes pessimistic assumptions and looks for peak memory. Note that you could still OOM from intermediates!
Detailed memory usage reports, along with configurations and framework details, are welcome to help tune this further!
"""