---
tags:
- kernels
license: apache-2.0
---

# Activation

Activation is a Python package that provides custom CUDA-based activation kernels, primarily targeting AMD GPUs.

Currently implemented:

- [PolyNorm](https://arxiv.org/html/2411.03884v1)
- [RMSNorm](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html)
- **FusedAddRMSNorm**

  A fused operator that combines **residual addition** (`x + residual`) with **RMSNorm** in a single kernel (see the equivalence sketch under Usage below).

  Instead of:

  ```python
  y = x + residual
  hidden_state = rms_norm(y, weight, eps)
  out = y + some_op(hidden_state)
  ```

  Fused as:

  ```python
  hidden_state, y = fused_add_rms_norm(x, residual, weight, eps)
  out = y + some_op(hidden_state)
  ```

- **FusedMulPolyNorm**

  A fused operator that combines **PolyNorm** with an **element-wise multiplication** by a tensor.

  Instead of:

  ```python
  y = poly_norm(x, weight, bias, eps)
  out = y * a
  ```

  Fused as:

  ```python
  out = fused_mul_poly_norm(x, a, weight, bias, eps)
  ```

- **FusedMulGroupedPolyNorm** (CUDA)

  A CUDA-accelerated grouped variant of FusedMulPolyNorm for **MoE (Mixture of Experts)** models. It fuses the entire PolyNorm computation into CUDA kernels (forward and backward), with per-expert weights/bias, in-kernel binary search for expert mapping, optional routing-score multiplication, and `hidden_clamp` fusion (see the `offsets` sketch under Usage below).

  Instead of:

  ```python
  for i in range(num_experts):
      # start:end is the contiguous row range owned by expert i
      out[start:end] = fused_mul_poly_norm(x[start:end], mul[start:end],
                                           weight[i], bias[i], eps)
  ```

  Fused as:

  ```python
  out = fused_mul_grouped_poly_norm(x, mul, weight, bias, offsets, eps,
                                    scores=scores, hidden_clamp=10.0)
  ```

## Installation

```bash
# Local CUDA build (development)
pip install --no-build-isolation -e .
```

## Usage

```python
import torch

import activation

torch.set_default_device("cuda")

poly_norm = activation.layers.PolyNorm(eps=1e-6)
x = torch.randn(10, 10)
print(poly_norm(x))
```
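As a quick sanity check, each fused op can be compared against its unfused composition. The sketch below assumes the functional ops are exposed at the package top level as `activation.rms_norm` and `activation.fused_add_rms_norm` with the signatures shown in the snippets above; adjust the names to the actual module layout.

```python
# Sanity-check sketch: FusedAddRMSNorm vs. the unfused composition.
# Assumes `activation.rms_norm` / `activation.fused_add_rms_norm` exist
# at the package top level; verify against the installed module layout.
import torch

import activation

torch.set_default_device("cuda")

hidden = 1024
x = torch.randn(8, hidden)
residual = torch.randn_like(x)
weight = torch.ones(hidden)
eps = 1e-6

# Unfused reference: residual add, then RMSNorm.
y_ref = x + residual
hidden_ref = activation.rms_norm(y_ref, weight, eps)

# Fused: one kernel returns both the normalized output and the
# post-addition residual stream.
hidden_fused, y_fused = activation.fused_add_rms_norm(x, residual, weight, eps)

torch.testing.assert_close(y_fused, y_ref)
torch.testing.assert_close(hidden_fused, hidden_ref, rtol=1e-4, atol=1e-4)
```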
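The grouped kernel replaces the per-expert Python loop with a single call driven by `offsets`. A minimal sketch of the layout it appears to expect follows; the boundary convention (tokens pre-sorted by expert, `offsets` holding cumulative per-expert counts) is an assumption inferred from the `start:end` slices in the unfused loop above, not a documented contract.

```python
# Sketch of the assumed `offsets` layout for fused_mul_grouped_poly_norm:
# tokens are pre-sorted by expert, and offsets[i]:offsets[i + 1] is the
# contiguous row range owned by expert i (len(offsets) == num_experts + 1).
# Verify against the kernel's docstring/tests before relying on this.
import torch

torch.set_default_device("cuda")

num_experts, hidden = 4, 1280
tokens_per_expert = torch.tensor([3, 0, 5, 2])  # expert 1 receives no tokens

offsets = torch.zeros(num_experts + 1, dtype=torch.int64)
offsets[1:] = tokens_per_expert.cumsum(0)
print(offsets)  # tensor([ 0,  3,  3,  8, 10], ...)

x = torch.randn(int(offsets[-1]), hidden)  # rows grouped by expert
for i in range(num_experts):
    start, end = int(offsets[i]), int(offsets[i + 1])
    print(f"expert {i} owns x[{start}:{end}]")  # expert 1 -> empty slice
```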
## Performance

- Test cases are from the Motif LLM.
- The results can be reproduced using the provided benchmarking tools.
- For details on how to use the benchmarking tools, please refer to the [benchmarks README](./benchmarks/README.md).
- The benchmark results may show fluctuations, especially in the backward pass and when the dimension size is small.

### RMSNorm

#### H100 Results

**Forward Performance**

![RMSNorm Forward Performance](./benchmarks/plots/h100/rms/plot_rms-fwd-perf.png)

**Backward Performance**

![RMSNorm Backward Performance](./benchmarks/plots/h100/rms/plot_rms-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![RMSNorm Forward Performance](./benchmarks/plots/mi250/rms/plot_rms-fwd-perf.png)

**Backward Performance**

![RMSNorm Backward Performance](./benchmarks/plots/mi250/rms/plot_rms-bwd-perf.png)
---

### FusedAddRMSNorm

> [!NOTE]
> For the fusion-case performance comparison, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results
**Forward Performance**

![FusedAddRMSNorm Forward Performance](./benchmarks/plots/h100/add_rms/plot_add_rms-fwd-perf.png)

**Backward Performance**

![FusedAddRMSNorm Backward Performance](./benchmarks/plots/h100/add_rms/plot_add_rms-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![FusedAddRMSNorm Forward Performance](./benchmarks/plots/mi250/add_rms/plot_add_rms-fwd-perf.png)

**Backward Performance**

![FusedAddRMSNorm Backward Performance](./benchmarks/plots/mi250/add_rms/plot_add_rms-bwd-perf.png)
---

### PolyNorm

#### H100 Results
**Forward Performance**

![PolyNorm Forward Performance](./benchmarks/plots/h100/poly/plot_poly-fwd-perf.png)

**Backward Performance**

![PolyNorm Backward Performance](./benchmarks/plots/h100/poly/plot_poly-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![PolyNorm Forward Performance](./benchmarks/plots/mi250/poly/plot_poly-fwd-perf.png)

**Backward Performance**

![PolyNorm Backward Performance](./benchmarks/plots/mi250/poly/plot_poly-bwd-perf.png)
---

### FusedMulPolyNorm

> [!NOTE]
> For the fusion-case performance comparison, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results
**Forward Performance**

![FusedMulPolyNorm Forward Performance](./benchmarks/plots/h100/mul_poly/plot_mul_poly-fwd-perf.png)

**Backward Performance**

![FusedMulPolyNorm Backward Performance](./benchmarks/plots/h100/mul_poly/plot_mul_poly-bwd-perf.png)

#### MI250 Results

**Forward Performance**

![FusedMulPolyNorm Forward Performance](./benchmarks/plots/mi250/mul_poly/plot_mul_poly-fwd-perf.png)

**Backward Performance**

![FusedMulPolyNorm Backward Performance](./benchmarks/plots/mi250/mul_poly/plot_mul_poly-bwd-perf.png)
---

### FusedMulGroupedPolyNorm (CUDA)

> [!NOTE]
> This kernel is implemented in CUDA C++ (compiled via `setup.py`).
> Benchmarks compare three variants: **Naive** (raw PyTorch reference), **Compiled** (`torch.compile`'d reference), and **CUDA** (fused CUDA kernel).
> Benchmark configuration: hidden dimension 1280, 384 experts.
>
> **Training profile (B200, `motif3_seq`, lbs=8, seqlen=4K):**
>
> |          | CUDA kernel | torch.compile | Speedup  |
> |----------|-------------|---------------|----------|
> | Forward  | 0.7 ms      | 2.1 ms        | **3.0x** |
> | Backward | 1.4 ms      | 3.7 ms        | **2.6x** |

## Pre-commit Hooks

This project uses [pre-commit](https://pre-commit.com/) to automatically check and format code before commits.

### Setup

1. Install pre-commit:

   ```bash
   pip install pre-commit
   ```

2. Install the git hooks:

   ```bash
   pre-commit install
   ```

Once installed, the configured hooks will run automatically on each commit.

### Included Hooks

The following tools are run via pre-commit:

- **[yapf](https://github.com/google/yapf)** – Python code formatter
- **[typos](https://github.com/crate-ci/typos)** – Spell checker for common typos
- **[isort](https://github.com/PyCQA/isort)** – Organizes and sorts Python imports
- **[clang-format](https://clang.llvm.org/docs/ClangFormat.html)** – Formats C++/CUDA code (`--style=file`)
- **[pymarkdown](https://github.com/jackdewinter/pymarkdown)** – Lints and auto-fixes Markdown files
- **[actionlint](https://github.com/rhysd/actionlint)** – Validates GitHub Actions workflows

### Usage

- Run all checks on the entire codebase:

  ```bash
  pre-commit run --all-files
  ```

- Run a specific hook (example: isort):

  ```bash
  pre-commit run isort --all-files
  ```