
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Custom CUDA/ROCm normalization kernels for LLM training and inference, published as motif-technologies/activation on HuggingFace. Implements PolyNorm, RMSNorm, FusedAddRMSNorm, and FusedMulPolyNorm with full autograd support, fake tensor registration, and DTensor sharding strategies for sequence parallelism.

Build System

Local development build (primary)

pip install -e .                          # editable install (build + install)
python setup.py build_ext --inplace       # build only

setup.py does two things:

  1. Compiles _activation C extension from CUDA sources (activation/*.cu) + C++ binding (torch-ext/torch_binding.cpp)
  2. Installs activation Python package from torch-ext/activation/ (autograd functions, layers, etc.)

After pip install -e ., all imports work directly (import activation, from activation.grouped_poly_norm import ...). No PYTHONPATH manipulation needed.
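A hypothetical usage sketch of the public API after an editable install. The `RMSNorm` layer name comes from torch-ext/activation/layers.py, but its constructor arguments are assumed here, not taken from the repository; the import only succeeds after `pip install -e .` and the forward pass needs a CUDA/ROCm GPU.

```python
# Hypothetical usage sketch; constructor signature is an assumption.
import torch

try:
    from activation import RMSNorm  # public layer API (torch-ext/activation/layers.py)
except ImportError:
    RMSNorm = None  # package not built in this environment

if RMSNorm is not None and torch.cuda.is_available():
    layer = RMSNorm(4096).cuda()  # assumed: hidden size as first argument
    x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.bfloat16)
    y = layer(x)  # normalized output, same shape and dtype as x
```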

NVCC flags: -O3 --use_fast_math -std=c++17, targets sm_80 (A100), sm_89 (L40/4090), sm_90 (H100), sm_100 (B200, CUDA 12.8+).

CI / HuggingFace distribution build

Uses HuggingFace's kernel-builder via Nix to cross-compile the pre-built .abi3.so binaries that are distributed on HuggingFace.

nix run .#build-and-copy

Pre-built binaries go to build/ (tracked via Git LFS). The build config lives in build.toml.

Running Tests

Tests require a GPU. Install first with pip install -e ..

# Run all tests
pytest tests/

# Run a single test file
pytest tests/test_rms_norm.py

# Run a specific test
pytest tests/test_rms_norm.py::test_rms_norm_forward -v

# Sequence parallel tests (require torch>=2.8 and 2+ GPUs)
torchrun --nproc-per-node=2 -m pytest tests/test_rms_norm_sequence_parallel.py

Pytest config is in tests/pytest.ini (log_cli enabled at INFO level).

Linting / Formatting

Pre-commit hooks handle all formatting. Install with pre-commit install.

  • Python: yapf (formatter), isort (imports)
  • C++/CUDA: clang-format (--style=file)
  • Markdown: pymarkdown
  • Spelling: typos

The build/ and result/ directories are excluded from all hooks.

Architecture

Layer structure (bottom-up)

  1. CUDA/HIP kernels (activation/*.cu, activation/*.h): Hand-written kernels that compile for both NVIDIA (CUB) and AMD (hipcub). Each kernel uses vectorized template dispatch (width > 0 for coalesced 128-bit loads when dim % 8 == 0, scalar fallback otherwise). Accumulation is always in float32.

  2. C++ torch binding (torch-ext/torch_binding.cpp): Registers ops via TORCH_LIBRARY_EXPAND under a build-specific namespace (e.g., _activation_53ed492_dirty).

  3. Autograd Functions (torch-ext/activation/poly_norm.py, rms_norm.py): torch.autograd.Function subclasses with forward, setup_context, and backward. Also registers @torch.library.register_fake for torch.compile/AOT support.

  4. DTensor sharding strategies (torch-ext/activation/rms_norm_meta.py, fused_add_rms_norm_meta.py): @register_op_strategy definitions. Input can be sharded on any dim except the last (normalization dim); weight is always replicated.

  5. nn.Module wrappers (torch-ext/activation/layers.py): PolyNorm, FusedMulPolyNorm, RMSNorm, FusedAddRMSNorm — these are the public user-facing API.

  6. Parallel style (torch-ext/activation/parallel_style.py): ResidualSequenceParallel extends PyTorch's SequenceParallel for the two-input (x + residual) pattern of FusedAddRMSNorm.
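As a concrete illustration of layer 3, here is a pure-PyTorch sketch of the forward / setup_context / backward pattern for RMSNorm. This is not code from the repository: the real Functions call the compiled CUDA ops, and the backward here is the standard analytical RMSNorm gradient written out by hand.

```python
import torch

class RMSNormRef(torch.autograd.Function):
    """Pure-PyTorch stand-in for the CUDA-backed autograd Function."""

    @staticmethod
    def forward(x, weight, eps):
        # Accumulate in float32, mirroring the kernels, then cast back
        x32 = x.float()
        inv_rms = torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + eps)
        return (x32 * inv_rms).to(x.dtype) * weight

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, weight, eps = inputs
        ctx.save_for_backward(x, weight)
        ctx.eps = eps

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        n = x.shape[-1]
        x32, g = x.float(), grad_out.float()
        inv_rms = torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + ctx.eps)
        gw = g * weight.float()
        # Product rule through the mean-of-squares term:
        # dx_k = gw_k * r - x_k * r^3 / n * sum_i(gw_i * x_i)
        dot = (gw * x32).sum(-1, keepdim=True)
        dx = (gw * inv_rms - x32 * inv_rms.pow(3) * dot / n).to(x.dtype)
        dw = (g * x32 * inv_rms).reshape(-1, n).sum(0).to(weight.dtype)
        return dx, dw, None
```

In the repository this pattern is paired with a fake-tensor registration so torch.compile can trace the op without running the kernel.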

Key files

  • build.toml: kernel-builder manifest (backends, source files, ROCm arch targets)
  • activation/cuda_compat.h: CUDA/ROCm compatibility shim (CUB vs hipcub, WARP_SIZE)
  • activation/dispatch_utils.h: MOTIF_DISPATCH_FLOATING_TYPES type dispatch macro
  • torch-ext/activation/__init__.py: Package entry point; exports functional API + layers + parallel_style
  • torch-ext/activation/_ops.py: Generated at build time; loads .abi3.so and exposes torch.ops.*

Adding a new kernel

  1. Write the CUDA kernel in activation/new_kernel.cu (follow the vectorized template pattern from existing kernels)
  2. Add C++ declarations in torch-ext/torch_binding.h and register in torch-ext/torch_binding.cpp
  3. Create torch.autograd.Function in torch-ext/activation/new_kernel.py with fake tensor registration
  4. Add nn.Module wrapper in torch-ext/activation/layers.py
  5. If distributed support is needed, add DTensor strategy in torch-ext/activation/new_kernel_meta.py
  6. Add the .cu file to both [kernel.activation] and [kernel.activation_cuda] in build.toml
  7. Export from torch-ext/activation/__init__.py
  8. Add tests in tests/test_new_kernel.py (numerical comparison vs PyTorch reference + torch.library.opcheck)

Test conventions

  • Tests compare custom ops against PyTorch reference implementations using tolerances from tests/allclose_default.py
  • Every test runs torch.library.opcheck to validate op schema, autograd, fake tensor, and AOT dispatch
  • Tests are parametrized over dtypes (float32, float16, bfloat16), sequence lengths, and hidden dimensions
  • Sequence parallel tests use torchrun with 2 GPUs and require torch>=2.8

ROCm/CUDA target matrix

  • ROCm architectures: gfx90a (MI250), gfx942 (MI300X)
  • PyTorch versions: 2.7 through 2.10
  • CUDA versions: 11.8, 12.6, 12.8, 12.9, 13.0
  • ROCm versions: 6.3, 6.4, 7.0, 7.1