# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Custom CUDA/ROCm normalization kernels for LLM training and inference, published as `motif-technologies/activation` on HuggingFace. Implements PolyNorm, RMSNorm, FusedAddRMSNorm, and FusedMulPolyNorm with full autograd support, fake tensor registration, and DTensor sharding strategies for sequence parallelism.

## Build System

### Local development build (primary)

```bash
pip install -e .                      # editable install (build + install)
python setup.py build_ext --inplace   # build only
```

`setup.py` does two things:

1. Compiles the `_activation` C extension from CUDA sources (`activation/*.cu`) plus the C++ binding (`torch-ext/torch_binding.cpp`)
2. Installs the `activation` Python package from `torch-ext/activation/` (autograd functions, layers, etc.)

After `pip install -e .`, all imports work directly (`import activation`, `from activation.grouped_poly_norm import ...`). No PYTHONPATH manipulation is needed.

NVCC flags: `-O3 --use_fast_math -std=c++17`; targets sm_80 (A100), sm_89 (L40/4090), sm_90 (H100), and sm_100 (B200, CUDA 12.8+).

### CI / HuggingFace distribution build

Uses HuggingFace's `kernel-builder` via Nix to cross-compile pre-built `.abi3.so` binaries.

```bash
nix run .#build-and-copy
```

Pre-built binaries go to `build/` (tracked via Git LFS). The build config lives in `build.toml`.

## Running Tests

Tests require a GPU. Install first with `pip install -e .`.

```bash
# Run all tests
pytest tests/

# Run a single test file
pytest tests/test_rms_norm.py

# Run a specific test
pytest tests/test_rms_norm.py::test_rms_norm_forward -v

# Sequence parallel tests (require torch>=2.8 and 2+ GPUs)
torchrun --nproc-per-node=2 -m pytest tests/test_rms_norm_sequence_parallel.py
```

Pytest config is in `tests/pytest.ini` (`log_cli` enabled at INFO level).

## Linting / Formatting

Pre-commit hooks handle all formatting. Install them with `pre-commit install`.
- **Python**: yapf (formatter), isort (imports)
- **C++/CUDA**: clang-format (`--style=file`)
- **Markdown**: pymarkdown
- **Spelling**: typos

The `build/` and `result/` directories are excluded from all hooks.

## Architecture

### Layer structure (bottom-up)

1. **CUDA/HIP kernels** (`activation/*.cu`, `activation/*.h`): Hand-written kernels that compile for both NVIDIA (CUB) and AMD (hipcub). Each kernel uses vectorized template dispatch (`width > 0` for coalesced 128-bit loads when `dim % 8 == 0`, with a scalar fallback otherwise). Accumulation is always in float32.
2. **C++ torch binding** (`torch-ext/torch_binding.cpp`): Registers ops via `TORCH_LIBRARY_EXPAND` under a build-specific namespace (e.g., `_activation_53ed492_dirty`).
3. **Autograd Functions** (`torch-ext/activation/poly_norm.py`, `rms_norm.py`): `torch.autograd.Function` subclasses with `forward`, `setup_context`, and `backward`. Each also registers a `@torch.library.register_fake` implementation for `torch.compile`/AOT support.
4. **DTensor sharding strategies** (`torch-ext/activation/rms_norm_meta.py`, `fused_add_rms_norm_meta.py`): `@register_op_strategy` definitions. Input can be sharded on any dim except the last (the normalization dim); weight is always replicated.
5. **nn.Module wrappers** (`torch-ext/activation/layers.py`): `PolyNorm`, `FusedMulPolyNorm`, `RMSNorm`, `FusedAddRMSNorm` — these are the public user-facing API.
6. **Parallel style** (`torch-ext/activation/parallel_style.py`): `ResidualSequenceParallel` extends PyTorch's `SequenceParallel` for the two-input (x + residual) pattern of `FusedAddRMSNorm`.
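As a point of reference for layer 1 above (last-dim normalization, float32 accumulation), the computation the RMSNorm kernels perform can be sketched in plain PyTorch. This is a hedged sketch, not this repo's actual implementation: the `eps` default and the weight-after-cast ordering follow the common RMSNorm convention and may differ from the kernels' exact signature.

```python
import torch


def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    """Reference RMSNorm: normalize over the last dim, accumulating in fp32."""
    x_f32 = x.float()  # mirror the kernels' float32 accumulation
    rms = torch.rsqrt(x_f32.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x_f32 * rms).to(x.dtype) * weight  # cast back, then apply weight


x = torch.randn(4, 8, dtype=torch.float16)
y = rms_norm_ref(x, torch.ones(8, dtype=torch.float16))
```

A reference like this is also what the numerical tests in `tests/` compare the custom ops against.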
### Key files

| Path | Purpose |
|------|---------|
| `build.toml` | kernel-builder manifest (backends, source files, ROCm arch targets) |
| `activation/cuda_compat.h` | CUDA/ROCm compatibility shim (CUB vs hipcub, WARP_SIZE) |
| `activation/dispatch_utils.h` | `MOTIF_DISPATCH_FLOATING_TYPES` type dispatch macro |
| `torch-ext/activation/__init__.py` | Package entry point; exports functional API + layers + parallel_style |
| `torch-ext/activation/_ops.py` | Generated at build time; loads `.abi3.so` and exposes `torch.ops.*` |

### Adding a new kernel

1. Write the CUDA kernel in `activation/new_kernel.cu` (follow the vectorized template pattern from existing kernels)
2. Add C++ declarations in `torch-ext/torch_binding.h` and register them in `torch-ext/torch_binding.cpp`
3. Create a `torch.autograd.Function` in `torch-ext/activation/new_kernel.py` with fake tensor registration
4. Add an `nn.Module` wrapper in `torch-ext/activation/layers.py`
5. If distributed support is needed, add a DTensor strategy in `torch-ext/activation/new_kernel_meta.py`
6. Add the `.cu` file to both `[kernel.activation]` and `[kernel.activation_cuda]` in `build.toml`
7. Export from `torch-ext/activation/__init__.py`
8. Add tests in `tests/test_new_kernel.py` (numerical comparison vs. a PyTorch reference + `torch.library.opcheck`)

### Test conventions

- Tests compare custom ops against PyTorch reference implementations using tolerances from `tests/allclose_default.py`
- Every test runs `torch.library.opcheck` to validate the op schema, autograd, fake tensor, and AOT dispatch
- Tests are parametrized over dtypes (float32, float16, bfloat16), sequence lengths, and hidden dimensions
- Sequence parallel tests use `torchrun` with 2 GPUs and require torch>=2.8

### ROCm/CUDA target matrix

- **ROCm architectures**: gfx90a (MI250), gfx942 (MI300X)
- **PyTorch versions**: 2.7 through 2.10
- **CUDA versions**: 11.8, 12.6, 12.8, 12.9, 13.0
- **ROCm versions**: 6.3, 6.4, 7.0, 7.1