
Benchmark System

Overview

๋ฒค์น˜๋งˆํฌ ์‹œ์Šคํ…œ์€ ์ปค์Šคํ…€ CUDA ์ปค๋„์˜ forward/backward ์„ฑ๋Šฅ์„ naive PyTorch ๊ตฌํ˜„ ๋Œ€๋น„ ์ธก์ •ํ•œ๋‹ค. Triton์˜ do_bench๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, ์ •ํ™•๋„ ๊ฒ€์ฆ(correctness check) ํ›„ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•œ๋‹ค.

Directory Structure

benchmarks/
├── run_cases.py              # CLI entry point
├── common/
│   ├── bench_framework.py    # benchmark utilities (built on Triton's perf_report)
│   └── diff_engine.py        # correctness-check engine (DiffCase ABC)
├── cases/                    # benchmark case implementations
│   ├── rms.py                # RMSNorm
│   ├── add_rms.py            # Fused Add + RMSNorm
│   ├── poly.py               # PolyNorm
│   ├── mul_poly.py           # Fused Mul + PolyNorm
│   └── grouped_mul_poly.py   # Grouped MoE Fused Mul + PolyNorm
├── benchmark.yaml            # Kubeflow benchmark job config
├── test.yaml                 # Kubeflow test job config
├── plots/                    # generated plots
└── results/                  # timestamped benchmark results

Usage

python benchmarks/run_cases.py --case <CASE> [OPTIONS]

Arguments

| Argument | Default | Choices | Description |
|---|---|---|---|
| --case | (required) | rms, add_rms, poly, mul_poly, grouped_mul_poly | Benchmark case |
| --dtype | bf16 | fp16, bf16, fp32, all | Data type |
| --save-path | ./configs/ | path | Output directory for results |
| --plot | false | - | Plot generation mode |
| --profile | false | - | Export Chrome trace profiles |

Examples

# default bf16 benchmark
python benchmarks/run_cases.py --case grouped_mul_poly

# all dtypes + profiling
python benchmarks/run_cases.py --case mul_poly --dtype all --profile --save-path ./results

# generate plots only
python benchmarks/run_cases.py --case rms --plot --save-path ./plots

Benchmark Cases

๊ฐ ์ผ€์ด์Šค๋Š” DiffCase ABC๋ฅผ ๊ตฌํ˜„ํ•˜๋ฉฐ, naive(PyTorch ์ฐธ์กฐ)์™€ CUDA ์ปค๋„์„ ๋น„๊ตํ•œ๋‹ค.

| Case | Naive | CUDA | Inputs |
|---|---|---|---|
| rms | torch.nn.RMSNorm | activation.layers.RMSNorm | x, weight, eps |
| add_rms | custom FusedAddRMSNorm | activation.layers.FusedAddRMSNorm | x, residual, weight, eps |
| poly | custom PolyNorm (combination of x^3, x^2, x) | activation.layers.PolyNorm | x, weight(3), bias(1), eps |
| mul_poly | custom FusedMulPolyNorm | activation.layers.FusedMulPolyNorm | x, mul, weight(3), bias, eps |
| grouped_mul_poly | fused_mul_grouped_poly_norm_ref | fused_mul_grouped_poly_norm | x, mul, weight(num_experts, 3), bias, offsets |

grouped_mul_poly additionally measures the compiled (torch.compile'd naive) and compiled_cuda (torch.compile'd CUDA) providers.
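A minimal sketch of how the four providers could be assembled (the helper name is illustrative; the actual wiring lives in the case implementations):

```python
import torch

def make_providers(naive_fn, cuda_fn):
    # Hypothetical helper: the grouped case benchmarks four providers,
    # the two base implementations plus their torch.compile'd variants.
    # torch.compile wraps lazily, so compilation happens on first call.
    return {
        "naive": naive_fn,
        "cuda": cuda_fn,
        "compiled": torch.compile(naive_fn),
        "compiled_cuda": torch.compile(cuda_fn),
    }
```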

Execution Flow

  1. ์ •ํ™•๋„ ๊ฒ€์ฆ - 3๊ฐœ config์— ๋Œ€ํ•ด calculate_diff() ์‹คํ–‰
    • (bs=2, sl=128, hidden=4096)
    • (bs=8, sl=4096, hidden=1280)
    • (bs=1, sl=32768, hidden=1280)
    • forward/backward ๋ชจ๋‘ atol=1e-2, rtol=1e-2๋กœ ๋น„๊ต
  2. ๋ฒค์น˜๋งˆํฌ ์‹คํ–‰ - dtype๋ณ„๋กœ forward/backward ์„ฑ๋Šฅ ์ธก์ •
  3. ๊ฒฐ๊ณผ ์ €์žฅ - CSV ํŒŒ์ผ (๋ฐ ์„ ํƒ์ ์œผ๋กœ ํ”Œ๋กฏ/trace)
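The correctness step can be sketched as follows. The real calculate_diff lives in diff_engine.py, so the helper below, with an RMSNorm reference standing in for both sides and deliberately tiny shapes, only illustrates the assert_close comparison of step 1:

```python
import torch

def rms_norm_ref(x, weight, eps=1e-6):
    # naive PyTorch reference: normalize by the root-mean-square, then scale
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

def calculate_diff(naive_fn, fast_fn, bs, sl, hidden, atol=1e-2, rtol=1e-2):
    x = torch.randn(bs, sl, hidden, requires_grad=True)
    w = torch.randn(hidden, requires_grad=True)
    out_naive = naive_fn(x, w)
    out_fast = fast_fn(x, w)
    # forward comparison
    torch.testing.assert_close(out_fast, out_naive, atol=atol, rtol=rtol)
    # backward comparison: propagate the same cotangent through both graphs
    g = torch.randn_like(out_naive)
    grads_naive = torch.autograd.grad(out_naive, (x, w), g, retain_graph=True)
    grads_fast = torch.autograd.grad(out_fast, (x, w), g)
    for a, b in zip(grads_fast, grads_naive):
        torch.testing.assert_close(a, b, atol=atol, rtol=rtol)

# shapes shrunk for the sketch; the real run uses the three configs above
calculate_diff(rms_norm_ref, rms_norm_ref, bs=2, sl=8, hidden=16)
```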

Configuration Ranges

Standard cases (rms, add_rms, poly, mul_poly):

  • Batch sizes: 1, 2, 4, 8
  • Sequence lengths: 1024, 2048, 4096, 8192
  • Hidden dims: 2048, 4096

Grouped case (grouped_mul_poly):

  • Total tokens: 1024 ~ 65536 (bs x sl)
  • Hidden dim: 1280 (fixed)
  • Experts: 48 per rank

In --plot mode, bs is fixed at 1 and only seq_len is swept.
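For reference, the standard sweep above expands to a simple Cartesian product (a sketch; the actual grid is built inside bench_framework.py):

```python
import itertools

batch_sizes = [1, 2, 4, 8]
seq_lens = [1024, 2048, 4096, 8192]
hidden_dims = [2048, 4096]

# 4 batch sizes x 4 sequence lengths x 2 hidden dims = 32 configs per dtype
configs = list(itertools.product(batch_sizes, seq_lens, hidden_dims))
```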

Output

CSV

Saved under the {save_path}/{case}/{dtype}/ directory:

  • {case}-{dtype}-fwd-perf.csv - forward results
  • {case}-{dtype}-bwd-perf.csv - backward results

Columns: dim, batch_size, seq_len, Naive (us), Compiled (us), Cuda (us), SpeedUp (us)

Chrome Trace (--profile)

Saved as JSON under the {save_path}/{case}/{dtype}/traces/ directory. Load a trace in chrome://tracing to analyze the GPU timeline.

ํŒŒ์ผ๋ช… ํŒจํ„ด: trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json

Plot (--plot)

Generates speedup comparison plots. The overall speedup is aggregated with the geometric mean.
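A geometric mean (rather than an arithmetic mean) treats a 2x speedup and a 2x slowdown symmetrically, which is why it is the usual choice for aggregating ratios. A minimal sketch with hypothetical values:

```python
import math

def geomean(xs):
    # exp of the mean of logs; well-defined for positive ratios
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

speedups = [1.8, 2.5, 0.9]   # hypothetical per-config speedup ratios
overall = geomean(speedups)  # single aggregate speedup for the plot
```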

Framework Internals

bench_framework.py

Four factory functions that use Triton's perf_report/Benchmark:

  • make_fwd_benchmark_for_case() - forward benchmark (CSV)
  • make_bwd_benchmark_for_case() - backward benchmark (CSV)
  • make_fwd_benchmark_plot_for_case() - forward plot
  • make_bwd_benchmark_plot_for_case() - backward plot

ํƒ€์ด๋ฐ์€ triton.testing.do_bench()๋กœ ์ธก์ •ํ•˜๋ฉฐ, ms ๋‹จ์œ„๋ฅผ us๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค (time_unit_scale=1000).

diff_engine.py

The DiffCase ABC interface:

  • build_inputs(bs, sl, dim) - creates the input tensors
  • make_naive() / make_cuda() - create the two implementations
  • forward(module, inputs) - runs a forward pass
  • grad_inputs(inputs) - returns the tensors whose gradients are compared

calculate_diff() compares the forward outputs and backward gradients of the naive and CUDA implementations with torch.testing.assert_close().
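A structural sketch of the interface (method names follow the list above; the docstrings are interpretive, not copied from the source):

```python
from abc import ABC, abstractmethod

class DiffCase(ABC):
    """One benchmark case: a paired naive and CUDA implementation."""

    @abstractmethod
    def build_inputs(self, bs, sl, dim):
        """Create the input tensors for a (batch, seq_len, hidden) config."""

    @abstractmethod
    def make_naive(self):
        """Return the PyTorch reference implementation."""

    @abstractmethod
    def make_cuda(self):
        """Return the custom CUDA implementation."""

    @abstractmethod
    def forward(self, module, inputs):
        """Run one forward pass of `module` on `inputs`."""

    @abstractmethod
    def grad_inputs(self, inputs):
        """Return the tensors whose gradients should be compared."""
```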

Kubeflow Integration

Benchmarks can be run on the cluster via benchmark.yaml, which:

  • installs triton, matplotlib, pandas
  • builds the C++ extension (setup.py)
  • runs a GPU warmup (100 matmul iterations)
  • saves results to benchmarks/results/{YY_MM_DD_HH_MM}/
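The warmup step could look roughly like this (a sketch; the function name is hypothetical, and it falls back to CPU when no GPU is present, whereas the Kubeflow job assumes one):

```python
import torch

def gpu_warmup(iters=100, n=1024):
    # repeated matmuls to spin up GPU clocks before any timing runs
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    for _ in range(iters):
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return a

gpu_warmup(iters=4, n=64)
```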