# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Custom CUDA/ROCm normalization kernels for LLM training and inference, published as `motif-technologies/activation` on HuggingFace. Implements PolyNorm, RMSNorm, FusedAddRMSNorm, and FusedMulPolyNorm with full autograd support, fake tensor registration, and DTensor sharding strategies for sequence parallelism.

## Build System

### Local development build (primary)

```bash
pip install -e .                      # editable install (build + install)
python setup.py build_ext --inplace   # build only
```

`setup.py` does two things:

1. Compiles the `_activation` C extension from CUDA sources (`activation/*.cu`) plus the C++ binding (`torch-ext/torch_binding.cpp`)
2. Installs the `activation` Python package from `torch-ext/activation/` (autograd functions, layers, etc.)

After `pip install -e .`, all imports work directly (`import activation`, `from activation.grouped_poly_norm import ...`). No PYTHONPATH manipulation is needed.

NVCC flags: `-O3 --use_fast_math -std=c++17`; targets sm_80 (A100), sm_89 (L40/4090), sm_90 (H100), and sm_100 (B200, CUDA 12.8+).

### CI / HuggingFace distribution build

Uses HuggingFace's `kernel-builder` via Nix to cross-compile pre-built `.abi3.so` binaries.

```bash
nix run .#build-and-copy
```

Pre-built binaries go to `build/` (tracked via Git LFS). The build config lives in `build.toml`.

## Running Tests

Tests require a GPU. Install first with `pip install -e .`.

```bash
# Run all tests
pytest tests/

# Run a single test file
pytest tests/test_rms_norm.py

# Run a specific test
pytest tests/test_rms_norm.py::test_rms_norm_forward -v

# Sequence parallel tests (require torch>=2.8 and 2+ GPUs)
torchrun --nproc-per-node=2 -m pytest tests/test_rms_norm_sequence_parallel.py
```

Pytest config is in `tests/pytest.ini` (`log_cli` enabled at INFO level).

## Linting / Formatting

Pre-commit hooks handle all formatting. Install them with `pre-commit install`.
- **Python**: yapf (formatter), isort (imports)
- **C++/CUDA**: clang-format (`--style=file`)
- **Markdown**: pymarkdown
- **Spelling**: typos

The `build/` and `result/` directories are excluded from all hooks.

## Architecture

### Layer structure (bottom-up)

1. **CUDA/HIP kernels** (`activation/*.cu`, `activation/*.h`): Hand-written kernels that compile for both NVIDIA (CUB) and AMD (hipcub). Each kernel uses vectorized template dispatch (`width > 0` for coalesced 128-bit loads when `dim % 8 == 0`, with a scalar fallback otherwise). Accumulation is always in float32.
2. **C++ torch binding** (`torch-ext/torch_binding.cpp`): Registers ops via `TORCH_LIBRARY_EXPAND` under a build-specific namespace (e.g., `_activation_53ed492_dirty`).
3. **Autograd Functions** (`torch-ext/activation/poly_norm.py`, `rms_norm.py`): `torch.autograd.Function` subclasses with `forward`, `setup_context`, and `backward`. Each also registers a `@torch.library.register_fake` implementation for `torch.compile`/AOT support.
4. **DTensor sharding strategies** (`torch-ext/activation/rms_norm_meta.py`, `fused_add_rms_norm_meta.py`): `@register_op_strategy` definitions. Input can be sharded on any dim except the last (the normalization dim); weight is always replicated.
5. **nn.Module wrappers** (`torch-ext/activation/layers.py`): `PolyNorm`, `FusedMulPolyNorm`, `RMSNorm`, `FusedAddRMSNorm` — these are the public user-facing API.
6. **Parallel style** (`torch-ext/activation/parallel_style.py`): `ResidualSequenceParallel` extends PyTorch's `SequenceParallel` for the two-input (x + residual) pattern of `FusedAddRMSNorm`.
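As a point of reference for layer 1 above (last-dim normalization, float32 accumulation), the computation the RMSNorm kernels perform can be sketched in plain PyTorch. This is a hedged sketch, not this repo's actual implementation: the `eps` default and the weight-after-cast ordering follow the common RMSNorm convention and may differ from the kernels' exact signature.

```python
import torch


def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    """Reference RMSNorm: normalize over the last dim, accumulating in fp32."""
    x_f32 = x.float()  # mirror the kernels' float32 accumulation
    rms = torch.rsqrt(x_f32.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x_f32 * rms).to(x.dtype) * weight  # cast back, then apply weight


x = torch.randn(4, 8, dtype=torch.float16)
y = rms_norm_ref(x, torch.ones(8, dtype=torch.float16))
```

A reference like this is also what the numerical tests in `tests/` compare the custom ops against.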
### Key files

| Path | Purpose |
|------|---------|
| `build.toml` | kernel-builder manifest (backends, source files, ROCm arch targets) |
| `activation/cuda_compat.h` | CUDA/ROCm compatibility shim (CUB vs hipcub, WARP_SIZE) |
| `activation/dispatch_utils.h` | `MOTIF_DISPATCH_FLOATING_TYPES` type dispatch macro |
| `torch-ext/activation/__init__.py` | Package entry point; exports functional API + layers + parallel_style |
| `torch-ext/activation/_ops.py` | Generated at build time; loads `.abi3.so` and exposes `torch.ops.*` |

### Adding a new kernel

1. Write the CUDA kernel in `activation/new_kernel.cu` (follow the vectorized template pattern from existing kernels)
2. Add C++ declarations in `torch-ext/torch_binding.h` and register them in `torch-ext/torch_binding.cpp`
3. Create a `torch.autograd.Function` in `torch-ext/activation/new_kernel.py` with fake tensor registration
4. Add an `nn.Module` wrapper in `torch-ext/activation/layers.py`
5. If distributed support is needed, add a DTensor strategy in `torch-ext/activation/new_kernel_meta.py`
6. Add the `.cu` file to both `[kernel.activation]` and `[kernel.activation_cuda]` in `build.toml`
7. Export from `torch-ext/activation/__init__.py`
8. Add tests in `tests/test_new_kernel.py` (numerical comparison vs. a PyTorch reference + `torch.library.opcheck`)

### Test conventions

- Tests compare custom ops against PyTorch reference implementations using tolerances from `tests/allclose_default.py`
- Every test runs `torch.library.opcheck` to validate the op schema, autograd, fake tensor, and AOT dispatch
- Tests are parametrized over dtypes (float32, float16, bfloat16), sequence lengths, and hidden dimensions
- Sequence parallel tests use `torchrun` with 2 GPUs and require torch>=2.8

### ROCm/CUDA target matrix

- **ROCm architectures**: gfx90a (MI250), gfx942 (MI300X)
- **PyTorch versions**: 2.7 through 2.10
- **CUDA versions**: 11.8, 12.6, 12.8, 12.9, 13.0
- **ROCm versions**: 6.3, 6.4, 7.0, 7.1