# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Custom CUDA/ROCm normalization kernels for LLM training and inference, published as `motif-technologies/activation` on HuggingFace. Implements `PolyNorm`, `RMSNorm`, `FusedAddRMSNorm`, and `FusedMulPolyNorm` with full autograd support, fake tensor registration, and DTensor sharding strategies for sequence parallelism.
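For orientation, RMSNorm (the simplest of these ops) follows the standard formula `y = x / sqrt(mean(x^2) + eps) * weight`. A minimal pure-Python sketch of that reference math (illustrative only, not this repo's CUDA implementation; the `eps` default is an arbitrary choice for the example):

```python
import math

def rms_norm_reference(x, weight, eps=1e-6):
    """Pure-Python RMSNorm reference: y = x / sqrt(mean(x^2) + eps) * w.

    Illustrative sketch; the real kernels operate on torch tensors
    and accumulate in float32 on the GPU.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

out = rms_norm_reference([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```

With unit weights the output has (approximately) unit root-mean-square, which is the invariant the real tests check against a PyTorch reference.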
## Build System

### Local development build (primary)

```bash
pip install -e .                      # editable install (build + install)
python setup.py build_ext --inplace   # build only
```
`setup.py` does two things:

- Compiles the `_activationC` extension from CUDA sources (`activation/*.cu`) plus the C++ binding (`torch-ext/torch_binding.cpp`)
- Installs the `activation` Python package from `torch-ext/activation/` (autograd functions, layers, etc.)

After `pip install -e .`, all imports work directly (`import activation`, `from activation.grouped_poly_norm import ...`). No `PYTHONPATH` manipulation is needed.
NVCC flags: `-O3 --use_fast_math -std=c++17`; targets sm_80 (A100), sm_89 (L40/4090), sm_90 (H100), and sm_100 (B200, CUDA 12.8+).
### CI / HuggingFace distribution build

Uses HuggingFace's kernel-builder via Nix to cross-compile pre-built `.abi3.so` binaries.

```bash
nix run .#build-and-copy
```

Pre-built binaries go to `build/` (tracked via Git LFS). The build config lives in `build.toml`.
## Running Tests

Tests require a GPU. Install first with `pip install -e .`.

```bash
# Run all tests
pytest tests/

# Run a single test file
pytest tests/test_rms_norm.py

# Run a specific test
pytest tests/test_rms_norm.py::test_rms_norm_forward -v

# Sequence parallel tests (require torch>=2.8 and 2+ GPUs)
torchrun --nproc-per-node=2 -m pytest tests/test_rms_norm_sequence_parallel.py
```

Pytest config is in `tests/pytest.ini` (`log_cli` enabled at INFO level).
## Linting / Formatting

Pre-commit hooks handle all formatting. Install them with `pre-commit install`.

- Python: yapf (formatter), isort (imports)
- C++/CUDA: clang-format (`--style=file`)
- Markdown: pymarkdown
- Spelling: typos

The `build/` and `result/` directories are excluded from all hooks.
## Architecture

### Layer structure (bottom-up)
1. **CUDA/HIP kernels** (`activation/*.cu`, `activation/*.h`): Hand-written kernels that compile for both NVIDIA (CUB) and AMD (hipcub). Each kernel uses vectorized template dispatch (`width > 0` for coalesced 128-bit loads when `dim % 8 == 0`, scalar fallback otherwise). Accumulation is always in float32.
2. **C++ torch binding** (`torch-ext/torch_binding.cpp`): Registers ops via `TORCH_LIBRARY_EXPAND` under a build-specific namespace (e.g., `_activation_53ed492_dirty`).
3. **Autograd Functions** (`torch-ext/activation/poly_norm.py`, `rms_norm.py`): `torch.autograd.Function` subclasses with `forward`, `setup_context`, and `backward`. Each also registers a `@torch.library.register_fake` implementation for `torch.compile`/AOT support.
4. **DTensor sharding strategies** (`torch-ext/activation/rms_norm_meta.py`, `fused_add_rms_norm_meta.py`): `@register_op_strategy` definitions. Input can be sharded on any dim except the last (the normalization dim); weight is always replicated.
5. **nn.Module wrappers** (`torch-ext/activation/layers.py`): `PolyNorm`, `FusedMulPolyNorm`, `RMSNorm`, `FusedAddRMSNorm`; these are the public user-facing API.
6. **Parallel style** (`torch-ext/activation/parallel_style.py`): `ResidualSequenceParallel` extends PyTorch's `SequenceParallel` for the two-input (x + residual) pattern of `FusedAddRMSNorm`.
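The autograd layer above uses PyTorch's split `forward`/`setup_context`/`backward` style. Below is a minimal CPU-only sketch of that pattern for plain RMSNorm; the class name is hypothetical, and the real Functions dispatch to the compiled CUDA ops and additionally register fake-tensor implementations:

```python
import torch

class RMSNormSketch(torch.autograd.Function):
    """Illustrative autograd.Function with forward/setup_context/backward."""

    @staticmethod
    def forward(x, weight, eps):
        # y = x * rsqrt(mean(x^2) + eps) * weight
        inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
        return x * inv_rms * weight

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, weight, eps = inputs
        ctx.save_for_backward(x, weight)
        ctx.eps = eps

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + ctx.eps)
        x_hat = x * inv_rms
        # weight grad: reduce over all leading (batch/sequence) dims
        grad_w = (grad_out * x_hat).sum(dim=tuple(range(x.dim() - 1)))
        # input grad: project out the component along x_hat
        g = grad_out * weight
        grad_x = inv_rms * (g - x_hat * (g * x_hat).mean(-1, keepdim=True))
        return grad_x, grad_w, None  # no grad for eps

# Verify the hand-written backward against numerical gradients (CPU, float64)
x = torch.randn(2, 8, dtype=torch.double, requires_grad=True)
w = torch.randn(8, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(lambda a, b: RMSNormSketch.apply(a, b, 1e-6), (x, w))
```

The `gradcheck` call mirrors what `torch.library.opcheck` validates for the real ops in this repo's test suite.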
### Key files

| Path | Purpose |
|---|---|
| `build.toml` | kernel-builder manifest (backends, source files, ROCm arch targets) |
| `activation/cuda_compat.h` | CUDA/ROCm compatibility shim (CUB vs hipcub, `WARP_SIZE`) |
| `activation/dispatch_utils.h` | `MOTIF_DISPATCH_FLOATING_TYPES` type dispatch macro |
| `torch-ext/activation/__init__.py` | Package entry point; exports functional API + layers + parallel style |
| `torch-ext/activation/_ops.py` | Generated at build time; loads the `.abi3.so` and exposes `torch.ops.*` |
### Adding a new kernel

1. Write the CUDA kernel in `activation/new_kernel.cu` (follow the vectorized template pattern from existing kernels)
2. Add C++ declarations in `torch-ext/torch_binding.h` and register them in `torch-ext/torch_binding.cpp`
3. Create a `torch.autograd.Function` in `torch-ext/activation/new_kernel.py` with fake tensor registration
4. Add an `nn.Module` wrapper in `torch-ext/activation/layers.py`
5. If distributed support is needed, add a DTensor strategy in `torch-ext/activation/new_kernel_meta.py`
6. Add the `.cu` file to both `[kernel.activation]` and `[kernel.activation_cuda]` in `build.toml`
7. Export from `torch-ext/activation/__init__.py`
8. Add tests in `tests/test_new_kernel.py` (numerical comparison against the PyTorch reference plus `torch.library.opcheck`)
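As a sketch of step 6, the `build.toml` change might look like the fragment below. The two section names come from this repo's manifest; the `src` list key is an assumption about kernel-builder's schema, and `new_kernel.cu` is a placeholder name:

```toml
# Hypothetical fragment: the new source file is listed in both kernel sections
[kernel.activation]
src = [
    "activation/new_kernel.cu",
]

[kernel.activation_cuda]
src = [
    "activation/new_kernel.cu",
]
```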
### Test conventions

- Tests compare custom ops against PyTorch reference implementations using tolerances from `tests/allclose_default.py`
- Every test runs `torch.library.opcheck` to validate op schema, autograd, fake tensor, and AOT dispatch
- Tests are parametrized over dtypes (float32, float16, bfloat16), sequence lengths, and hidden dimensions
- Sequence parallel tests use `torchrun` with 2 GPUs and require torch>=2.8
### ROCm/CUDA target matrix
- ROCm architectures: gfx90a (MI250), gfx942 (MI300X)
- PyTorch versions: 2.7 through 2.10
- CUDA versions: 11.8, 12.6, 12.8, 12.9, 13.0
- ROCm versions: 6.3, 6.4, 7.0, 7.1