---
tags:
- kernels
license: apache-2.0
---

# Activation

Activation is a Python package that provides custom CUDA-based activation kernels, primarily targeting AMD GPUs.

Currently implemented:

- [PolyNorm](https://arxiv.org/html/2411.03884v1)
- [RMSNorm](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html)
- **FusedAddRMSNorm**

  A fused operator that combines **residual addition** (`x + residual`) with **RMSNorm** in a single kernel (see the equivalence sketch under Usage below).

  Instead of:

  ```python
  y = x + residual
  hidden_state = rms_norm(y, weight, eps)
  out = y + some_op(hidden_state)
  ```

  Fused as:

  ```python
  hidden_state, y = fused_add_rms_norm(x, residual, weight, eps)
  out = y + some_op(hidden_state)
  ```

- **FusedMulPolyNorm**

  A fused operator that combines **PolyNorm** with an **element-wise multiplication** by a tensor.

  Instead of:

  ```python
  y = poly_norm(x, weight, bias, eps)
  out = y * a
  ```

  Fused as:

  ```python
  out = fused_mul_poly_norm(x, a, weight, bias, eps)
  ```

- **FusedMulGroupedPolyNorm** (CUDA)

  A CUDA-accelerated grouped variant of FusedMulPolyNorm for **MoE (Mixture of Experts)** models. It fuses the entire PolyNorm computation into CUDA kernels (forward and backward), with per-expert weights/bias, in-kernel binary search for expert mapping, optional routing-score multiplication, and `hidden_clamp` fusion (see the `offsets` sketch under Usage below).

  Instead of:

  ```python
  for i in range(num_experts):
      # start:end is the contiguous row range owned by expert i
      out[start:end] = fused_mul_poly_norm(x[start:end], mul[start:end],
                                           weight[i], bias[i], eps)
  ```

  Fused as:

  ```python
  out = fused_mul_grouped_poly_norm(x, mul, weight, bias, offsets, eps,
                                    scores=scores, hidden_clamp=10.0)
  ```

## Installation

```bash
# Local CUDA build (development)
pip install --no-build-isolation -e .
```

## Usage

```python
import torch

import activation

torch.set_default_device("cuda")

poly_norm = activation.layers.PolyNorm(eps=1e-6)
x = torch.randn(10, 10)
print(poly_norm(x))
```
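As a quick sanity check, each fused op can be compared against its unfused composition. The sketch below assumes the functional ops are exposed at the package top level as `activation.rms_norm` and `activation.fused_add_rms_norm` with the signatures shown in the snippets above; adjust the names to the actual module layout.

```python
# Sanity-check sketch: FusedAddRMSNorm vs. the unfused composition.
# Assumes `activation.rms_norm` / `activation.fused_add_rms_norm` exist
# at the package top level; verify against the installed module layout.
import torch

import activation

torch.set_default_device("cuda")

hidden = 1024
x = torch.randn(8, hidden)
residual = torch.randn_like(x)
weight = torch.ones(hidden)
eps = 1e-6

# Unfused reference: residual add, then RMSNorm.
y_ref = x + residual
hidden_ref = activation.rms_norm(y_ref, weight, eps)

# Fused: one kernel returns both the normalized output and the
# post-addition residual stream.
hidden_fused, y_fused = activation.fused_add_rms_norm(x, residual, weight, eps)

torch.testing.assert_close(y_fused, y_ref)
torch.testing.assert_close(hidden_fused, hidden_ref, rtol=1e-4, atol=1e-4)
```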
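The grouped kernel replaces the per-expert Python loop with a single call driven by `offsets`. A minimal sketch of the layout it appears to expect follows; the boundary convention (tokens pre-sorted by expert, `offsets` holding cumulative per-expert counts) is an assumption inferred from the `start:end` slices in the unfused loop above, not a documented contract.

```python
# Sketch of the assumed `offsets` layout for fused_mul_grouped_poly_norm:
# tokens are pre-sorted by expert, and offsets[i]:offsets[i + 1] is the
# contiguous row range owned by expert i (len(offsets) == num_experts + 1).
# Verify against the kernel's docstring/tests before relying on this.
import torch

torch.set_default_device("cuda")

num_experts, hidden = 4, 1280
tokens_per_expert = torch.tensor([3, 0, 5, 2])  # expert 1 receives no tokens

offsets = torch.zeros(num_experts + 1, dtype=torch.int64)
offsets[1:] = tokens_per_expert.cumsum(0)
print(offsets)  # tensor([ 0,  3,  3,  8, 10], ...)

x = torch.randn(int(offsets[-1]), hidden)  # rows grouped by expert
for i in range(num_experts):
    start, end = int(offsets[i]), int(offsets[i + 1])
    print(f"expert {i} owns x[{start}:{end}]")  # expert 1 -> empty slice
```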
## Performance

- Test cases are from the Motif LLM.
- The results can be reproduced using the provided benchmarking tools.
- For details on how to use the benchmarking tools, please refer to the [benchmarks README](./benchmarks/README.md).
- The benchmark results may show fluctuations, especially in the backward pass and when the dimension size is small.

### RMSNorm

#### H100 Results

**Forward Performance**

![RMSNorm Forward Performance](./benchmarks/plots/h100/rms/plot_rms-fwd-perf.png)

**Backward Performance**

![RMSNorm Backward Performance](./benchmarks/plots/h100/rms/plot_rms-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![RMSNorm Forward Performance](./benchmarks/plots/mi250/rms/plot_rms-fwd-perf.png)

**Backward Performance**

![RMSNorm Backward Performance](./benchmarks/plots/mi250/rms/plot_rms-bwd-perf.png)
---

### FusedAddRMSNorm

> [!NOTE]
> For the fusion-case performance comparison, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results
**Forward Performance**

![FusedAddRMSNorm Forward Performance](./benchmarks/plots/h100/add_rms/plot_add_rms-fwd-perf.png)

**Backward Performance**

![FusedAddRMSNorm Backward Performance](./benchmarks/plots/h100/add_rms/plot_add_rms-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![FusedAddRMSNorm Forward Performance](./benchmarks/plots/mi250/add_rms/plot_add_rms-fwd-perf.png)

**Backward Performance**

![FusedAddRMSNorm Backward Performance](./benchmarks/plots/mi250/add_rms/plot_add_rms-bwd-perf.png)
---

### PolyNorm

#### H100 Results
**Forward Performance**

![PolyNorm Forward Performance](./benchmarks/plots/h100/poly/plot_poly-fwd-perf.png)

**Backward Performance**

![PolyNorm Backward Performance](./benchmarks/plots/h100/poly/plot_poly-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![PolyNorm Forward Performance](./benchmarks/plots/mi250/poly/plot_poly-fwd-perf.png)

**Backward Performance**

![PolyNorm Backward Performance](./benchmarks/plots/mi250/poly/plot_poly-bwd-perf.png)
---

### FusedMulPolyNorm

> [!NOTE]
> For the fusion-case performance comparison, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results
**Forward Performance**

![FusedMulPolyNorm Forward Performance](./benchmarks/plots/h100/mul_poly/plot_mul_poly-fwd-perf.png)

**Backward Performance**

![FusedMulPolyNorm Backward Performance](./benchmarks/plots/h100/mul_poly/plot_mul_poly-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![FusedMulPolyNorm Forward Performance](./benchmarks/plots/mi250/mul_poly/plot_mul_poly-fwd-perf.png)

**Backward Performance**

![FusedMulPolyNorm Backward Performance](./benchmarks/plots/mi250/mul_poly/plot_mul_poly-bwd-perf.png)
---

### FusedMulGroupedPolyNorm (CUDA)

> [!NOTE]
> This kernel is implemented in CUDA C++ (compiled via `setup.py`).
> Benchmarks compare three variants: **Naive** (raw PyTorch reference), **Compiled** (`torch.compile`'d reference), and **CUDA** (fused CUDA kernel).
> Benchmark configuration: hidden dimension 1280, 384 experts.
>
> **Training profile (B200, `motif3_seq`, lbs=8, seqlen=4K):**
>
> |          | CUDA kernel | torch.compile | Speedup  |
> |----------|-------------|---------------|----------|
> | Forward  | 0.7 ms      | 2.1 ms        | **3.0x** |
> | Backward | 1.4 ms      | 3.7 ms        | **2.6x** |

## Pre-commit Hooks

This project uses [pre-commit](https://pre-commit.com/) to automatically check and format code before commits.

### Setup

1. Install pre-commit:

   ```bash
   pip install pre-commit
   ```

2. Install the git hooks:

   ```bash
   pre-commit install
   ```

Once installed, the configured hooks will run automatically on each commit.

### Included Hooks

The following tools are run via pre-commit:

- **[yapf](https://github.com/google/yapf)** – Python code formatter
- **[typos](https://github.com/crate-ci/typos)** – Spell checker for common typos
- **[isort](https://github.com/PyCQA/isort)** – Organizes and sorts Python imports
- **[clang-format](https://clang.llvm.org/docs/ClangFormat.html)** – Formats C++/CUDA code (`--style=file`)
- **[pymarkdown](https://github.com/jackdewinter/pymarkdown)** – Lints and auto-fixes Markdown files
- **[actionlint](https://github.com/rhysd/actionlint)** – Validates GitHub Actions workflows

### Usage

- Run all checks on the entire codebase:

  ```bash
  pre-commit run --all-files
  ```

- Run a specific hook (example: isort):

  ```bash
  pre-commit run isort --all-files
  ```