Kernels
wyldecat and Claude Opus 4.6 (1M context) committed
Commit 7d51e61 · 1 Parent(s): 972d63b

feat: replace triton do_bench with torch.profiler for kernel timing


Switch from triton.testing.do_bench to torch.profiler-based CUDA kernel
time measurement. do_bench includes PyTorch CPU overhead in timing,
leading to inaccurate results. torch.profiler sums only actual CUDA
kernel durations for precise measurement.

- Add profile_bench() with per-kernel breakdown and bandwidth output
- Add GB/s and SpeedUp(ratio) columns to CSV reports
- Fix grouped_mul_poly import path (activation.grouped_poly_norm)
- Add docs/benchmark.md for benchmark system documentation
- Update CLAUDE.md build system description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLAUDE.md ADDED
@@ -0,0 +1,116 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ Custom CUDA/ROCm normalization kernels for LLM training and inference, published as `motif-technologies/activation` on HuggingFace. Implements PolyNorm, RMSNorm, FusedAddRMSNorm, and FusedMulPolyNorm with full autograd support, fake tensor registration, and DTensor sharding strategies for sequence parallelism.
+
+ ## Build System
+
+ ### Local development build (primary)
+
+ ```bash
+ pip install -e .                        # editable install (build + install)
+ python setup.py build_ext --inplace     # build only
+ ```
+
+ `setup.py` does two things:
+ 1. Compiles the `_activation` C extension from CUDA sources (`activation/*.cu`) + C++ binding (`torch-ext/torch_binding.cpp`)
+ 2. Installs the `activation` Python package from `torch-ext/activation/` (autograd functions, layers, etc.)
+
+ After `pip install -e .`, all imports work directly (`import activation`, `from activation.grouped_poly_norm import ...`). No PYTHONPATH manipulation needed.
+
+ NVCC flags: `-O3 --use_fast_math -std=c++17`, targets sm_80 (A100), sm_89 (L40/4090), sm_90 (H100), sm_100 (B200, CUDA 12.8+).
+
+ ### CI / HuggingFace distribution build
+
+ Uses HuggingFace's `kernel-builder` via Nix for cross-compilation of pre-built `.abi3.so` binaries.
+
+ ```bash
+ nix run .#build-and-copy
+ ```
+
+ Pre-built binaries go to `build/` (tracked via Git LFS). The build config lives in `build.toml`.
+
+ ## Running Tests
+
+ Tests require a GPU. Install first with `pip install -e .`.
+
+ ```bash
+ # Run all tests
+ pytest tests/
+
+ # Run a single test file
+ pytest tests/test_rms_norm.py
+
+ # Run a specific test
+ pytest tests/test_rms_norm.py::test_rms_norm_forward -v
+
+ # Sequence parallel tests (require torch>=2.8 and 2+ GPUs)
+ torchrun --nproc-per-node=2 -m pytest tests/test_rms_norm_sequence_parallel.py
+ ```
+
+ Pytest config is in `tests/pytest.ini` (log_cli enabled at INFO level).
+
+ ## Linting / Formatting
+
+ Pre-commit hooks handle all formatting. Install with `pre-commit install`.
+
+ - **Python**: yapf (formatter), isort (imports)
+ - **C++/CUDA**: clang-format (`--style=file`)
+ - **Markdown**: pymarkdown
+ - **Spelling**: typos
+
+ The `build/` and `result/` directories are excluded from all hooks.
+
+ ## Architecture
+
+ ### Layer structure (bottom-up)
+
+ 1. **CUDA/HIP kernels** (`activation/*.cu`, `activation/*.h`): Hand-written kernels that compile for both NVIDIA (CUB) and AMD (hipcub). Each kernel uses vectorized template dispatch (`width > 0` for coalesced 128-bit loads when `dim % 8 == 0`, scalar fallback otherwise). Accumulation is always in float32.
+
+ 2. **C++ torch binding** (`torch-ext/torch_binding.cpp`): Registers ops via `TORCH_LIBRARY_EXPAND` under a build-specific namespace (e.g., `_activation_53ed492_dirty`).
+
+ 3. **Autograd Functions** (`torch-ext/activation/poly_norm.py`, `rms_norm.py`): `torch.autograd.Function` subclasses with `forward`, `setup_context`, and `backward`. Also registers `@torch.library.register_fake` for `torch.compile`/AOT support.
+
+ 4. **DTensor sharding strategies** (`torch-ext/activation/rms_norm_meta.py`, `fused_add_rms_norm_meta.py`): `@register_op_strategy` definitions. Input can be sharded on any dim except the last (normalization dim); weight is always replicated.
+
+ 5. **nn.Module wrappers** (`torch-ext/activation/layers.py`): `PolyNorm`, `FusedMulPolyNorm`, `RMSNorm`, `FusedAddRMSNorm` — these are the public user-facing API.
+
+ 6. **Parallel style** (`torch-ext/activation/parallel_style.py`): `ResidualSequenceParallel` extends PyTorch's `SequenceParallel` for the two-input (x + residual) pattern of `FusedAddRMSNorm`.
+
+ ### Key files
+
+ | Path | Purpose |
+ |------|---------|
+ | `build.toml` | kernel-builder manifest (backends, source files, ROCm arch targets) |
+ | `activation/cuda_compat.h` | CUDA/ROCm compatibility shim (CUB vs hipcub, WARP_SIZE) |
+ | `activation/dispatch_utils.h` | `MOTIF_DISPATCH_FLOATING_TYPES` type dispatch macro |
+ | `torch-ext/activation/__init__.py` | Package entry point; exports functional API + layers + parallel_style |
+ | `torch-ext/activation/_ops.py` | Generated at build time; loads `.abi3.so` and exposes `torch.ops.*` |
+
+ ### Adding a new kernel
+
+ 1. Write the CUDA kernel in `activation/new_kernel.cu` (follow the vectorized template pattern from existing kernels)
+ 2. Add C++ declarations in `torch-ext/torch_binding.h` and register in `torch-ext/torch_binding.cpp`
+ 3. Create a `torch.autograd.Function` in `torch-ext/activation/new_kernel.py` with fake tensor registration
+ 4. Add an `nn.Module` wrapper in `torch-ext/activation/layers.py`
+ 5. If distributed support is needed, add a DTensor strategy in `torch-ext/activation/new_kernel_meta.py`
+ 6. Add the `.cu` file to both `[kernel.activation]` and `[kernel.activation_cuda]` in `build.toml`
+ 7. Export from `torch-ext/activation/__init__.py`
+ 8. Add tests in `tests/test_new_kernel.py` (numerical comparison vs PyTorch reference + `torch.library.opcheck`)
+
+ ### Test conventions
+
+ - Tests compare custom ops against PyTorch reference implementations using tolerances from `tests/allclose_default.py`
+ - Every test runs `torch.library.opcheck` to validate op schema, autograd, fake tensor, and AOT dispatch
+ - Tests are parametrized over dtypes (float32, float16, bfloat16), sequence lengths, and hidden dimensions
+ - Sequence parallel tests use `torchrun` with 2 GPUs and require torch>=2.8
+
+ ### ROCm/CUDA target matrix
+
+ - **ROCm architectures**: gfx90a (MI250), gfx942 (MI300X)
+ - **PyTorch versions**: 2.7 through 2.10
+ - **CUDA versions**: 11.8, 12.6, 12.8, 12.9, 13.0
+ - **ROCm versions**: 6.3, 6.4, 7.0, 7.1
benchmarks/benchmark_profiler.yaml ADDED
@@ -0,0 +1,96 @@
+ apiVersion: trainer.kubeflow.org/v1alpha1
+ kind: TrainJob
+ metadata:
+   name: jeesoo-grouped-polynorm-profiler-bench
+   namespace: kbm-g-np-motif
+ spec:
+   managedBy: trainer.kubeflow.org/trainjob-controller
+   podTemplateOverrides:
+     - spec:
+         containers:
+           - name: node
+             volumeMounts:
+               - mountPath: /dev/shm
+                 name: shm
+               - mountPath: /mair
+                 name: mair
+         volumes:
+           - emptyDir:
+               medium: Memory
+               sizeLimit: 64Gi
+             name: shm
+           - name: mair
+             persistentVolumeClaim:
+               claimName: mair
+       targetJobs:
+         - name: node
+   runtimeRef:
+     apiGroup: trainer.kubeflow.org
+     kind: ClusterTrainingRuntime
+     name: torch-distributed
+   suspend: false
+   trainer:
+     args:
+       - /bin/bash
+       - '-c'
+       - >
+         ACTIVATIONPATH=/mair/team-sys/jeesoo/activation
+
+         BUILDLOG=$ACTIVATIONPATH/benchmarks/results/build.log
+
+         pip install triton matplotlib pandas
+
+         echo "=== Building with setup.py ==="
+
+         cd $ACTIVATIONPATH
+
+         rm -f $ACTIVATIONPATH/_activation*.so
+
+         pip install --no-build-isolation -e . -v 2>&1 | tee $BUILDLOG | tail -200
+
+         python -c "import _activation; print('Build OK:', _activation)" || { echo "BUILD FAILED"; exit 0; }
+
+         echo "=== Build success. Running profiler benchmarks ==="
+
+         cd $ACTIVATIONPATH/benchmarks
+
+         DATESTAMP=$(date +'%y_%m_%d_%H_%M')
+
+         SAVE_PATH=$ACTIVATIONPATH/benchmarks/results/${DATESTAMP}
+
+         mkdir -p $SAVE_PATH/bench/grouped_mul_poly/bf16
+
+         nvidia-smi | tee $SAVE_PATH/nvidia_smi.txt
+
+         python -c "import torch; x=torch.randn(8192,1280,device='cuda',dtype=torch.bfloat16); [torch.mm(x.T,x) for _ in range(100)]; torch.cuda.synchronize(); print('warmup done')"
+
+         echo "=== Benchmark (torch.profiler) ==="
+
+         python run_cases.py --case grouped_mul_poly --dtype bf16 --save-path ${SAVE_PATH}/bench 2>&1 | tee ${SAVE_PATH}/bench_log.txt
+
+         echo "=== Done ==="
+
+         exit 0;
+     env:
+       - name: PYTHONUNBUFFERED
+         value: '1'
+       - name: PYTORCH_ALLOC_CONF
+         value: expandable_segments:True
+       - name: CUDA_LAUNCH_BLOCKING
+         value: '0'
+       - name: OMP_NUM_THREADS
+         value: '1'
+       - name: HF_HOME
+         value: /mair/llm-dataset/hf_cache
+     image: ghcr.io/motiftechnologies/llm-training:v0.1.3
+     numNodes: 1
+     numProcPerNode: 1
+     resourcesPerNode:
+       limits:
+         cpu: '16'
+         memory: 128Gi
+         nvidia.com/gpu: '1'
+       requests:
+         cpu: '16'
+         memory: 128Gi
+         nvidia.com/gpu: '1'
benchmarks/cases/grouped_mul_poly.py CHANGED
@@ -5,8 +5,8 @@ from common.diff_engine import DiffCase
 
 torch._functorch.config.donated_buffer = False
 
-from grouped_poly_norm import (fused_mul_grouped_poly_norm,
-                               fused_mul_grouped_poly_norm_ref)
+from activation.grouped_poly_norm import (fused_mul_grouped_poly_norm,
+                                          fused_mul_grouped_poly_norm_ref)
 
 # 384 / 8 (EP) = 48 experts per rank
 # total_tokens = bs * sl, which equals per-rank tokens
@@ -73,7 +73,8 @@ class GroupedMulPoly(DiffCase):
         probs = torch.ones(num_experts) / num_experts
         assignments = torch.multinomial(probs, total_tokens, replacement=True)
         counts = torch.bincount(assignments, minlength=num_experts).tolist()
-        offsets = torch.cumsum(torch.tensor(counts, dtype=torch.int32), dim=0)
+        offsets = torch.cumsum(torch.tensor(counts, dtype=torch.int32),
+                               dim=0).to(torch.int32)
 
         return {
             "x":
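The `.to(torch.int32)` appended in the second hunk matters because of PyTorch's integer type promotion; a quick check of the behavior the fix implies (that `torch.cumsum`, like `torch.sum`, promotes integer inputs to int64 by default):

```python
import torch

# Without the explicit cast, the offsets tensor handed to the kernel
# would silently change dtype from int32 to int64.
counts = torch.tensor([3, 1, 4], dtype=torch.int32)
offsets = torch.cumsum(counts, dim=0)
print(offsets.dtype)                  # promoted by cumsum
print(offsets.to(torch.int32).dtype)  # back to torch.int32
```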
benchmarks/common/bench_framework.py CHANGED
@@ -4,11 +4,112 @@ import re
 from typing import Any, Dict, Sequence
 
 import torch
+from torch.profiler import ProfilerActivity, profile
 import triton
 
 from .diff_engine import DiffCase
 
 
+def _get_best_cuda_timing(timings_ms, key):
+    """Look up the best CUDA-based timing for speedup calculation."""
+    for provider in ("cuda", "compiled_cuda"):
+        if provider in timings_ms and key in timings_ms[provider]:
+            return timings_ms[provider][key]
+    raise KeyError(f"No CUDA timing found for {key}")
+
+
+def _shorten_kernel_name(name: str) -> str:
+    """Strip template args and function params from CUDA kernel names.
+
+    ``void motif::grouped_poly_norm_bwd_kernel<...>(...)``
+    → ``motif::grouped_poly_norm_bwd_kernel``
+    """
+    # Remove leading 'void '
+    s = re.sub(r"^void\s+", "", name)
+    # Remove template args <...> (handles nested <>)
+    while "<" in s:
+        s = re.sub(r"<[^<>]*>", "", s)
+    # Remove function params (...)
+    s = re.sub(r"\(.*\)$", "", s)
+    return s.strip()
+
+
+def _compute_bytes(inputs, forward_fn, obj):
+    """Compute total bytes: all input tensors read + all output tensors written."""
+    input_bytes = sum(v.nbytes for v in inputs.values()
+                      if isinstance(v, torch.Tensor))
+    output = forward_fn()
+    if isinstance(output, torch.Tensor):
+        output_bytes = output.nbytes
+    elif isinstance(output, (tuple, list)):
+        output_bytes = sum(
+            o.nbytes for o in output if isinstance(o, torch.Tensor))
+    else:
+        output_bytes = 0
+    return input_bytes + output_bytes
+
+
+def profile_bench(fn, warmup=5, repeat=10, verbose=True, total_bytes=0):
+    """Measure CUDA kernel time via torch.profiler.
+
+    Profiles the function, sums all CUDA kernel durations, and returns
+    the median across repeats. Also prints a per-kernel breakdown when
+    *verbose* is True so the caller can spot unexpected kernels.
+
+    Parameters
+    ----------
+    total_bytes : int
+        Total bytes transferred (inputs read + outputs written).
+        If > 0, prints bandwidth in GB/s after the breakdown.
+
+    Returns
+    -------
+    median_ms : float
+        Median total CUDA kernel time in **milliseconds** (same unit as
+        ``triton.testing.do_bench``).
+    """
+    for _ in range(warmup):
+        fn()
+    torch.cuda.synchronize()
+
+    kernel_times_us: list[float] = []
+    last_breakdown: list[tuple[str, float]] = []
+
+    for _ in range(repeat):
+        with profile(activities=[ProfilerActivity.CUDA]) as prof:
+            fn()
+
+        breakdown: dict[str, float] = {}
+        for evt in prof.key_averages():
+            if evt.device_time_total > 0:
+                breakdown[evt.key] = (breakdown.get(evt.key, 0) +
+                                      evt.device_time_total)
+
+        total_us = sum(breakdown.values())
+        kernel_times_us.append(total_us)
+        last_breakdown = sorted(breakdown.items(),
+                                key=lambda x: x[1],
+                                reverse=True)
+
+    median_us = sorted(kernel_times_us)[len(kernel_times_us) // 2]
+
+    if verbose and last_breakdown:
+        total = sum(t for _, t in last_breakdown)
+        names = [_shorten_kernel_name(n) for n, _ in last_breakdown]
+        col_w = max(len(n) for n in names) + 2
+        col_w = max(col_w, len("Total kernel time") + 2)
+        for name, (_, t) in zip(names, last_breakdown):
+            pct = 100 * t / total if total > 0 else 0
+            print(f"  {name:<{col_w}s} {t:>8.1f}us ({pct:4.1f}%)")
+        print(f"  {'Total kernel time':<{col_w}s} {total:>8.1f}us")
+        if total_bytes > 0 and median_us > 0:
+            bw_gbs = total_bytes / (median_us * 1e-6) / 1e9
+            print(f"  {'Bandwidth':<{col_w}s} {bw_gbs:>7.1f} GB/s"
+                  f" ({total_bytes / 1e6:.1f} MB)")
+
+    return median_us / 1000  # us -> ms
+
+
 def make_fwd_key(batch_size, seq_len, dim):
     return f"forward : ({batch_size}, {seq_len}, {dim})"
 
@@ -31,7 +132,7 @@ make_fwd_benchmark_for_case(
     case: DiffCase,
     configs: Sequence[tuple[int, int, int]],
     plot_name: str,
-    ylabel: str = "us",
+    ylabel: str = "",
     line_vals=("naive", "cuda", "speedup"),
     line_names: Dict[str, str] | None = None,
     dtype=torch.bfloat16,
@@ -39,6 +140,7 @@ make_fwd_benchmark_for_case(
     time_unit_scale: float = 1000,
 ):
     timings_ms = collections.defaultdict(dict)
+    bytes_map: dict[str, int] = {}
     line_vals = list(line_vals)
     line_names = line_names or {v: v.title() for v in line_vals}
     x_vals = [list(_) for _ in configs]
@@ -56,7 +158,11 @@ make_fwd_benchmark_for_case(
         key = make_fwd_key(dim, batch_size, seq_len)
         I = case.build_inputs(batch_size, seq_len, dim, dtype, eps)
         if provider == "speedup":
-            return timings_ms["naive"][key] / timings_ms["cuda"][key]
+            return round(timings_ms["naive"][key] / _get_best_cuda_timing(timings_ms, key), 2)
+        if provider.endswith("_bw"):
+            base = provider[:-3]
+            ms = timings_ms[base][key]
+            return round(bytes_map[key] / (ms * 1e-3) / 1e9, 2)
         if provider == "naive":
             obj = case.make_naive(I)
         elif provider == "compiled" and hasattr(case, "make_compiled"):
@@ -64,7 +170,10 @@ make_fwd_benchmark_for_case(
         else:
            obj = case.make_cuda(I)
         run = lambda: case.forward(obj, I)
-        ms = triton.testing.do_bench(run)
+        nbytes = _compute_bytes(I, run, obj)
+        bytes_map[key] = nbytes
+        print(f"  [{provider}] {key}")
+        ms = profile_bench(run, total_bytes=nbytes)
         timings_ms[provider][key] = ms
         return time_unit_scale * ms
 
@@ -113,10 +222,12 @@ make_fwd_benchmark_plot_for_case(
         else:
             obj = case.make_cuda(I)
         run = lambda: case.forward(obj, I)
-        ms = triton.testing.do_bench(run)
+        nbytes = _compute_bytes(I, run, obj)
+        print(f"  [{provider}] {config}")
+        ms = profile_bench(run, total_bytes=nbytes)
         timings_ms[provider][config] = ms
         if provider == "cuda":
-            ratio = timings_ms["naive"][config] / timings_ms["cuda"][config]
+            ratio = timings_ms["naive"][config] / _get_best_cuda_timing(timings_ms, config)
             spdup_ratio.append(ratio)
             return round(ratio, 2)
         else:
@@ -130,7 +241,7 @@ make_bwd_benchmark_for_case(
     case: DiffCase,
     configs: Sequence[tuple[int, int, int]],
     plot_name: str,
-    ylabel: str = "us",
+    ylabel: str = "",
     line_vals=("naive", "cuda", "speedup"),
     line_names: Dict[str, str] | None = None,
     dtype=torch.bfloat16,
@@ -138,6 +249,7 @@ make_bwd_benchmark_for_case(
     time_unit_scale: float = 1000,
 ):
     timings_ms = collections.defaultdict(dict)
+    bytes_map: dict[str, int] = {}
     line_vals = list(line_vals)
     line_names = line_names or {v: v.title() for v in line_vals}
     x_vals = [list(_) for _ in configs]
@@ -155,7 +267,11 @@ make_bwd_benchmark_for_case(
         key = make_bwd_key(dim, batch_size, seq_len)
         I = case.build_inputs(batch_size, seq_len, dim, dtype, eps)
         if provider == "speedup":
-            return timings_ms["naive"][key] / timings_ms["cuda"][key]
+            return round(timings_ms["naive"][key] / _get_best_cuda_timing(timings_ms, key), 2)
+        if provider.endswith("_bw"):
+            base = provider[:-3]
+            ms = timings_ms[base][key]
+            return round(bytes_map[key] / (ms * 1e-3) / 1e9, 2)
         if provider == "naive":
             obj = case.make_naive(I)
         elif provider == "compiled" and hasattr(case, "make_compiled"):
@@ -174,7 +290,11 @@ make_bwd_benchmark_for_case(
                                   retain_graph=True,
                                   create_graph=False,
                                   allow_unused=False)
-        ms = triton.testing.do_bench(run)
+        fwd_run = lambda: case.forward(obj, I)
+        nbytes = _compute_bytes(I, fwd_run, obj)
+        bytes_map[key] = nbytes
+        print(f"  [{provider}] {key}")
+        ms = profile_bench(run, total_bytes=nbytes)
         timings_ms[provider][key] = ms
         return time_unit_scale * ms
 
@@ -234,10 +354,13 @@ make_bwd_benchmark_plot_for_case(
                                   retain_graph=True,
                                   create_graph=False,
                                   allow_unused=False)
-        ms = triton.testing.do_bench(run)
+        fwd_run = lambda: case.forward(obj, I)
+        nbytes = _compute_bytes(I, fwd_run, obj)
+        print(f"  [{provider}] {config}")
+        ms = profile_bench(run, total_bytes=nbytes)
         timings_ms[provider][config] = ms
         if provider == "cuda":
-            ratio = timings_ms["naive"][config] / timings_ms["cuda"][config]
+            ratio = timings_ms["naive"][config] / _get_best_cuda_timing(timings_ms, config)
             spdup_ratio.append(ratio)
             return round(ratio, 2)
         else:
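The kernel-name shortening added in this diff can be exercised on its own. The function below is a standalone copy of the `_shorten_kernel_name` logic, applied to a made-up mangled name to show the iterative stripping of nested template arguments:

```python
import re


def shorten(name: str) -> str:
    s = re.sub(r"^void\s+", "", name)         # drop leading 'void '
    while "<" in s:                           # peel template args, innermost first
        s = re.sub(r"<[^<>]*>", "", s)
    return re.sub(r"\(.*\)$", "", s).strip()  # drop the parameter list


name = "void motif::poly_norm_kernel<c10::BFloat16, float, 8>(T*, T const*, int)"
print(shorten(name))  # motif::poly_norm_kernel
```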
benchmarks/run_cases.py CHANGED
@@ -12,6 +12,17 @@ from common.bench_framework import (make_bwd_benchmark_for_case,
 from common.diff_engine import DiffCase, calculate_diff
 
 
+def _clean_and_print_csv(csv_path, title):
+    """Remove trailing ' ()' from CSV column names and print the table."""
+    import pandas as pd
+    df = pd.read_csv(csv_path)
+    df.columns = [c.replace(" ()", "") for c in df.columns]
+    df.to_csv(csv_path, index=False)
+    print(f"{title}:")
+    print(df.to_string(index=False))
+    print()
+
+
 def make_title_tag():
     if torch.cuda.is_available():
         dev_name = torch.cuda.get_device_name(0)
@@ -66,11 +77,6 @@ def main():
         default="bf16",
         help="Data type for benchmarking (default: bf16)",
     )
-    ap.add_argument(
-        "--profile",
-        action="store_true",
-        help="Export chrome traces for backward benchmarks",
-    )
     args = ap.parse_args()
 
     dtype_map = {
@@ -170,27 +176,38 @@ def main():
             itertools.product(dim, batch_size_range, seq_length_range))
 
         if is_grouped:
-            fwd_line_vals = ("naive", "compiled", "cuda", "speedup")
+            fwd_line_vals = ("naive", "naive_bw", "compiled",
+                             "compiled_bw", "cuda", "cuda_bw", "speedup")
             fwd_line_names = {
-                "naive": "Naive",
-                "compiled": "Compiled",
-                "cuda": "Triton",
-                "speedup": "SpeedUp",
+                "naive": "Naive (us)",
+                "naive_bw": "Naive (GB/s)",
+                "compiled": "Compiled (us)",
+                "compiled_bw": "Compiled (GB/s)",
+                "cuda": "CUDA (us)",
+                "cuda_bw": "CUDA (GB/s)",
+                "speedup": "SpeedUp (ratio)",
             }
-            bwd_line_vals = ("naive", "compiled", "compiled_cuda",
-                             "speedup")
+            bwd_line_vals = ("naive", "naive_bw", "compiled",
+                             "compiled_bw", "compiled_cuda",
+                             "compiled_cuda_bw", "speedup")
             bwd_line_names = {
-                "naive": "Naive",
-                "compiled": "Compiled",
-                "compiled_cuda": "CompiledCUDA",
-                "speedup": "SpeedUp",
+                "naive": "Naive (us)",
+                "naive_bw": "Naive (GB/s)",
+                "compiled": "Compiled (us)",
+                "compiled_bw": "Compiled (GB/s)",
+                "compiled_cuda": "CompiledCUDA (us)",
+                "compiled_cuda_bw": "CompiledCUDA (GB/s)",
+                "speedup": "SpeedUp (ratio)",
             }
         else:
-            fwd_line_vals = ("naive", "cuda", "speedup")
+            fwd_line_vals = ("naive", "naive_bw", "cuda", "cuda_bw",
+                             "speedup")
             fwd_line_names = {
-                "naive": "Naive",
-                "cuda": "Cuda",
-                "speedup": "SpeedUp",
+                "naive": "Naive (us)",
+                "naive_bw": "Naive (GB/s)",
+                "cuda": "CUDA (us)",
+                "cuda_bw": "CUDA (GB/s)",
+                "speedup": "SpeedUp (ratio)",
             }
             bwd_line_vals = fwd_line_vals
             bwd_line_names = fwd_line_names
@@ -202,11 +219,12 @@ def main():
             dtype=dtype,
             line_vals=fwd_line_vals,
             line_names=fwd_line_names,
-            profile=args.profile,
-            profile_dir=os.path.join(save_dir, "traces"),
         )
 
-        bench.run(print_data=True, save_path=save_dir)
+        fwd_name = f"{args.case}-{dtype_name}-fwd-perf"
+        bench.run(print_data=False, save_path=save_dir)
+        _clean_and_print_csv(os.path.join(save_dir, fwd_name + ".csv"),
+                             fwd_name)
 
         bench = make_bwd_benchmark_for_case(
             case=case,
@@ -215,11 +233,12 @@ def main():
             dtype=dtype,
             line_vals=bwd_line_vals,
             line_names=bwd_line_names,
-            profile=args.profile,
-            profile_dir=os.path.join(save_dir, "traces"),
         )
 
-        bench.run(print_data=True, save_path=save_dir)
+        bwd_name = f"{args.case}-{dtype_name}-bwd-perf"
+        bench.run(print_data=False, save_path=save_dir)
+        _clean_and_print_csv(os.path.join(save_dir, bwd_name + ".csv"),
+                             bwd_name)
         for f in glob.glob(os.path.join(save_dir, "*.html")) + \
                 glob.glob(os.path.join(save_dir, "*.png")):
             os.remove(f)
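The new `_bw` providers derive the `(GB/s)` columns from the formula `bytes / (ms * 1e-3) / 1e9` used in `bench_framework.py`. A worked example with made-up numbers:

```python
# GB/s = total_bytes / (time_ms * 1e-3) / 1e9
total_bytes = 512 * 1024 * 1024   # 512 MiB read + written (hypothetical)
time_ms = 0.5                     # measured kernel time (hypothetical)
gbs = total_bytes / (time_ms * 1e-3) / 1e9
print(round(gbs, 2))  # 1073.74
```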
docs/benchmark.md ADDED
@@ -0,0 +1,146 @@
+ # Benchmark System
+
+ ## Overview
+
+ The benchmark system measures forward/backward performance of the custom CUDA kernels against naive PyTorch implementations. It uses Triton's `do_bench`, running a correctness check before measuring performance.
+
+ ## Directory Structure
+
+ ```
+ benchmarks/
+ ├── run_cases.py              # CLI entry point
+ ├── common/
+ │   ├── bench_framework.py    # benchmark utilities (built on Triton perf_report)
+ │   └── diff_engine.py        # correctness-check engine (DiffCase ABC)
+ ├── cases/                    # benchmark case implementations
+ │   ├── rms.py                # RMSNorm
+ │   ├── add_rms.py            # Fused Add + RMSNorm
+ │   ├── poly.py               # PolyNorm
+ │   ├── mul_poly.py           # Fused Mul + PolyNorm
+ │   └── grouped_mul_poly.py   # Grouped MoE Fused Mul + PolyNorm
+ ├── benchmark.yaml            # Kubeflow benchmark job config
+ ├── test.yaml                 # Kubeflow test job config
+ ├── plots/                    # generated plot output
+ └── results/                  # timestamped benchmark results
+ ```
+
+ ## Usage
+
+ ```bash
+ python benchmarks/run_cases.py --case <CASE> [OPTIONS]
+ ```
+
+ ### Arguments
+
+ | Argument | Default | Choices | Description |
+ |----------|---------|---------|-------------|
+ | `--case` | (required) | `rms`, `add_rms`, `poly`, `mul_poly`, `grouped_mul_poly` | benchmark case |
+ | `--dtype` | `bf16` | `fp16`, `bf16`, `fp32`, `all` | data type |
+ | `--save-path` | `./configs/` | path | output directory for results |
+ | `--plot` | false | - | plot-generation mode |
+ | `--profile` | false | - | export Chrome trace profiles |
+
+ ### Examples
+
+ ```bash
+ # default bf16 benchmark
+ python benchmarks/run_cases.py --case grouped_mul_poly
+
+ # all dtypes + profiling
+ python benchmarks/run_cases.py --case mul_poly --dtype all --profile --save-path ./results
+
+ # plots only
+ python benchmarks/run_cases.py --case rms --plot --save-path ./plots
+ ```
+
+ ## Benchmark Cases
+
+ Each case implements the `DiffCase` ABC and compares a naive (PyTorch reference) implementation against the CUDA kernel.
+
+ | Case | Naive | CUDA | Inputs |
+ |------|-------|------|--------|
+ | `rms` | `torch.nn.RMSNorm` | `activation.layers.RMSNorm` | x, weight, eps |
+ | `add_rms` | custom `FusedAddRMSNorm` | `activation.layers.FusedAddRMSNorm` | x, residual, weight, eps |
+ | `poly` | custom `PolyNorm` (combination of x^3, x^2, x) | `activation.layers.PolyNorm` | x, weight(3), bias(1), eps |
+ | `mul_poly` | custom `FusedMulPolyNorm` | `activation.layers.FusedMulPolyNorm` | x, mul, weight(3), bias, eps |
+ | `grouped_mul_poly` | `fused_mul_grouped_poly_norm_ref` | `fused_mul_grouped_poly_norm` | x, mul, weight(num_experts, 3), bias, offsets |
+
+ `grouped_mul_poly` additionally measures the `compiled` (torch.compile'd naive) and `compiled_cuda` (torch.compile'd CUDA) providers.
+
+ ## Execution Flow
+
+ 1. **Correctness check** - runs `calculate_diff()` for 3 configs
+    - `(bs=2, sl=128, hidden=4096)`
+    - `(bs=8, sl=4096, hidden=1280)`
+    - `(bs=1, sl=32768, hidden=1280)`
+    - forward and backward are both compared with `atol=1e-2, rtol=1e-2`
+ 2. **Benchmark run** - measures forward/backward performance per dtype
+ 3. **Result saving** - CSV files (plus optional plots/traces)
+
+ ## Configuration Ranges
+
+ **Standard cases** (rms, add_rms, poly, mul_poly):
+ - Batch sizes: 1, 2, 4, 8
+ - Sequence lengths: 1024, 2048, 4096, 8192
+ - Hidden dims: 2048, 4096
+
+ **Grouped case** (grouped_mul_poly):
+ - Total tokens: 1024 ~ 65536 (bs x sl)
+ - Hidden dim: 1280 (fixed)
+ - Experts: 48 per rank
+
+ In `--plot` mode, `bs=1` is fixed and only seq_len is swept.
+
+ ## Output
+
+ ### CSV
+
+ Saved under `{save_path}/{case}/{dtype}/`:
+
+ - `{case}-{dtype}-fwd-perf.csv` - forward results
+ - `{case}-{dtype}-bwd-perf.csv` - backward results
+
+ Columns: `dim`, `batch_size`, `seq_len`, `Naive (us)`, `Compiled (us)`, `Cuda (us)`, `SpeedUp (ratio)`
+
+ ### Chrome Trace (`--profile`)
+
+ Saved as JSON under `{save_path}/{case}/{dtype}/traces/`. Load in `chrome://tracing` to inspect the GPU timeline.
+
+ File name pattern: `trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json`
+
+ ### Plot (`--plot`)
+
+ Generates speedup comparison plots. The overall speedup is aggregated as a geometric mean.
+
+ ## Framework Internals
+
+ ### bench_framework.py
+
+ Four factory functions built on Triton's `perf_report`/`Benchmark`:
+
+ - `make_fwd_benchmark_for_case()` - forward benchmark (CSV)
+ - `make_bwd_benchmark_for_case()` - backward benchmark (CSV)
+ - `make_fwd_benchmark_plot_for_case()` - forward plot
+ - `make_bwd_benchmark_plot_for_case()` - backward plot
+
+ Timing is measured with `triton.testing.do_bench()` and converted from ms to us (`time_unit_scale=1000`).
+
+ ### diff_engine.py
+
+ The `DiffCase` ABC interface:
+
+ - `build_inputs(bs, sl, dim)` - creates input tensors
+ - `make_naive()` / `make_cuda()` - construct the implementations
+ - `forward(module, inputs)` - runs the forward pass
+ - `grad_inputs(inputs)` - returns the tensors that receive gradients
+
+ `calculate_diff()` compares the forward outputs and backward gradients of the naive and CUDA sides with `torch.testing.assert_close()`.
+
+ ## Kubeflow Integration
+
+ Benchmarks can be run on the cluster with `benchmark.yaml`:
+
+ - installs triton, matplotlib, pandas
+ - builds the C++ extension (`setup.py`)
+ - GPU warmup (100 matmul iterations)
+ - saves results to `benchmarks/results/{YY_MM_DD_HH_MM}/`
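The `DiffCase` flow described above can be mimicked with a toy case. This is hypothetical: the real ABC lives in `benchmarks/common/diff_engine.py`, only the method names here follow the doc, and `ScaleCase` with its toy implementations is invented for illustration:

```python
import torch


class ScaleCase:
    """Toy case following the DiffCase method names described in the doc."""

    def build_inputs(self, bs, sl, dim):
        return {"x": torch.randn(bs, sl, dim)}

    def make_naive(self, inputs):
        return lambda x: x * 2.0        # PyTorch reference implementation

    def make_cuda(self, inputs):
        return lambda x: x + x          # stands in for the custom kernel

    def forward(self, impl, inputs):
        return impl(inputs["x"])


case = ScaleCase()
I = case.build_inputs(2, 4, 8)
naive_out = case.forward(case.make_naive(I), I)
cuda_out = case.forward(case.make_cuda(I), I)
# the correctness check compares the two implementations, as
# calculate_diff() does for the real cases
torch.testing.assert_close(naive_out, cuda_out)
print("match")
```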