Benchmark System
Overview
The benchmark system measures the forward/backward performance of the custom CUDA kernels against naive PyTorch implementations. It uses Triton's do_bench and measures performance only after a correctness check.
Directory Structure
benchmarks/
├── run_cases.py # CLI entry point
├── common/
│   ├── bench_framework.py # Benchmark utilities (built on Triton perf_report)
│   └── diff_engine.py # Correctness verification engine (DiffCase ABC)
├── cases/ # Benchmark case implementations
│   ├── rms.py # RMSNorm
│   ├── add_rms.py # Fused Add + RMSNorm
│   ├── poly.py # PolyNorm
│   ├── mul_poly.py # Fused Mul + PolyNorm
│   └── grouped_mul_poly.py # Grouped MoE Fused Mul + PolyNorm
├── benchmark.yaml # Kubeflow benchmark job config
├── test.yaml # Kubeflow test job config
├── plots/ # Generated plots
└── results/ # Benchmark results, one directory per timestamp
Usage
python benchmarks/run_cases.py --case <CASE> [OPTIONS]
Arguments
| Argument | Default | Choices | Description |
|---|---|---|---|
| --case | (required) | rms, add_rms, poly, mul_poly, grouped_mul_poly | Benchmark case |
| --dtype | bf16 | fp16, bf16, fp32, all | Data type |
| --save-path | ./configs/ | path | Output directory for results |
| --plot | false | - | Plot generation mode |
| --profile | false | - | Export Chrome trace profiles |
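
The table above maps naturally onto an argparse parser. The following is a minimal sketch of how such a CLI could be declared; build_parser is a hypothetical name, and the actual run_cases.py may define its arguments differently.

```python
import argparse

CASES = ["rms", "add_rms", "poly", "mul_poly", "grouped_mul_poly"]
DTYPES = ["fp16", "bf16", "fp32", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table (not the real run_cases.py)."""
    parser = argparse.ArgumentParser(prog="run_cases.py")
    parser.add_argument("--case", required=True, choices=CASES, help="Benchmark case")
    parser.add_argument("--dtype", default="bf16", choices=DTYPES, help="Data type")
    parser.add_argument("--save-path", default="./configs/", help="Output directory for results")
    # Boolean flags: absent means False, present means True.
    parser.add_argument("--plot", action="store_true", help="Plot generation mode")
    parser.add_argument("--profile", action="store_true", help="Export Chrome trace profiles")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--case", "mul_poly", "--dtype", "all", "--profile"])
print(args.case, args.dtype, args.save_path, args.plot, args.profile)
```

Note that argparse exposes --save-path as args.save_path, replacing the hyphen with an underscore.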
Examples
# Default bf16 benchmark
python benchmarks/run_cases.py --case grouped_mul_poly
# All dtypes + profiling
python benchmarks/run_cases.py --case mul_poly --dtype all --profile --save-path ./results
# Generate plots only
python benchmarks/run_cases.py --case rms --plot --save-path ./plots
Benchmark Cases
Each case implements the DiffCase ABC and compares a naive implementation (the PyTorch reference) against the CUDA kernel.
| Case | Naive | CUDA | Inputs |
|---|---|---|---|
| rms | torch.nn.RMSNorm | activation.layers.RMSNorm | x, weight, eps |
| add_rms | custom FusedAddRMSNorm | activation.layers.FusedAddRMSNorm | x, residual, weight, eps |
| poly | custom PolyNorm (combination of x^3, x^2, x) | activation.layers.PolyNorm | x, weight(3), bias(1), eps |
| mul_poly | custom FusedMulPolyNorm | activation.layers.FusedMulPolyNorm | x, mul, weight(3), bias, eps |
| grouped_mul_poly | fused_mul_grouped_poly_norm_ref | fused_mul_grouped_poly_norm | x, mul, weight(num_experts, 3), bias, offsets |
grouped_mul_poly additionally measures two more providers: compiled (the naive implementation under torch.compile) and compiled_cuda (the CUDA implementation under torch.compile).
Execution Flow
- Correctness check: run calculate_diff() on three configs
  - (bs=2, sl=128, hidden=4096)
  - (bs=8, sl=4096, hidden=1280)
  - (bs=1, sl=32768, hidden=1280)
  - Both forward and backward are compared with atol=1e-2, rtol=1e-2
- Benchmark run: measure forward/backward performance for each dtype
- Save results: CSV files (and optionally plots/traces)
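
To make the tolerances concrete: torch.testing.assert_close accepts a value when |actual - expected| <= atol + rtol * |expected|. The sketch below restates that criterion for scalars in plain Python. is_close and CHECK_CONFIGS are illustrative names, not part of the benchmark code.

```python
# The three correctness-check configs listed above: (bs, sl, hidden).
CHECK_CONFIGS = [(2, 128, 4096), (8, 4096, 1280), (1, 32768, 1280)]


def is_close(actual: float, expected: float, atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    """Scalar version of the closeness test used by torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Near 1.0 the allowed error is 0.01 + 0.01 * 1.0 = 0.02.
print(is_close(1.015, 1.0))   # error 0.015 is within 0.02
print(is_close(1.03, 1.0))    # error 0.03 exceeds 0.02
# The relative term dominates for large values: allowed error near 100 is about 1.01.
print(is_close(100.5, 100.0))
```

Because the allowed error grows with the magnitude of the expected value, large activations tolerate larger absolute differences than values near zero.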
Configuration Ranges
Standard cases (rms, add_rms, poly, mul_poly):
- Batch sizes: 1, 2, 4, 8
- Sequence lengths: 1024, 2048, 4096, 8192
- Hidden dims: 2048, 4096
Grouped case (grouped_mul_poly):
- Total tokens: 1024 to 65536 (bs x sl)
- Hidden dim: 1280 (fixed)
- Experts: 48 per rank
In --plot mode, bs is fixed to 1 and only seq_len is swept.
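
As a rough illustration of the standard sweep, the ranges above can be expanded with itertools.product. This sketch assumes a full Cartesian product over the three axes. The function names are hypothetical, and the real framework passes its configs to Triton's Benchmark object rather than building a list like this.

```python
from itertools import product

BATCH_SIZES = [1, 2, 4, 8]
SEQ_LENS = [1024, 2048, 4096, 8192]
HIDDEN_DIMS = [2048, 4096]


def standard_configs() -> list[tuple[int, int, int]]:
    """All (dim, batch_size, seq_len) combinations, assuming a full Cartesian sweep."""
    return list(product(HIDDEN_DIMS, BATCH_SIZES, SEQ_LENS))


def plot_configs(dim: int) -> list[tuple[int, int, int]]:
    """--plot mode for one hidden dim: bs is fixed to 1, only seq_len varies."""
    return [(dim, 1, sl) for sl in SEQ_LENS]


print(len(standard_configs()))  # 2 dims * 4 batch sizes * 4 seq lens
print(plot_configs(4096))
```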
Output
CSV
Saved under the {save_path}/{case}/{dtype}/ directory:
- {case}-{dtype}-fwd-perf.csv - forward results
- {case}-{dtype}-bwd-perf.csv - backward results
Columns: dim, batch_size, seq_len, Naive (us), Compiled (us), Cuda (us), SpeedUp (us)
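
For scripts that post-process results, the CSV locations can be derived from the naming scheme above. csv_paths is a hypothetical helper written for this example, not a function from the benchmark code.

```python
from pathlib import Path


def csv_paths(save_path: str, case: str, dtype: str) -> dict[str, Path]:
    """Build the forward/backward CSV paths following {save_path}/{case}/{dtype}/."""
    out_dir = Path(save_path) / case / dtype
    return {
        direction: out_dir / f"{case}-{dtype}-{direction}-perf.csv"
        for direction in ("fwd", "bwd")
    }


paths = csv_paths("./results", "rms", "bf16")
print(paths["fwd"].as_posix())
print(paths["bwd"].as_posix())
```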
Chrome Trace (--profile)
Traces are saved as JSON under the {save_path}/{case}/{dtype}/traces/ directory. Load them in chrome://tracing to analyze the GPU timeline.
File name pattern: trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json
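
The trace naming scheme can be expressed as a small formatter. This is a sketch with a hypothetical helper name (trace_filename); it only encodes the pattern stated above and validates the two enumerated fields.

```python
DIRECTIONS = ("fwd", "bwd")
PROVIDERS = ("naive", "compiled", "cuda", "compiled_cuda")


def trace_filename(direction: str, provider: str, total_tokens: int) -> str:
    """Format a trace file name as trace_{direction}_{provider}_N{total_tokens}.json."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return f"trace_{direction}_{provider}_N{total_tokens}.json"


print(trace_filename("fwd", "cuda", 4096))
```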
Plot (--plot)
Generates speedup comparison plots. The overall speedup is aggregated with the geometric mean.
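
The geometric mean is the usual choice for aggregating ratios such as speedups, since a 2x gain and a 2x loss cancel out instead of averaging to a net gain. A minimal standard-library sketch follows; the function name is illustrative and the plotting code may compute this differently.

```python
import math


def geometric_mean(speedups: list[float]) -> float:
    """Geometric mean via the mean of logs, which avoids overflow on long lists."""
    if not speedups:
        raise ValueError("speedups must not be empty")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# A 2x speedup and a 2x slowdown aggregate to roughly 1.0 (no net change),
# whereas their arithmetic mean would report 1.25.
print(geometric_mean([0.5, 2.0]))
print(geometric_mean([2.0, 8.0]))
```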
Framework Internals
bench_framework.py
Four factory functions built on Triton's perf_report/Benchmark:
- make_fwd_benchmark_for_case() - forward benchmark (CSV)
- make_bwd_benchmark_for_case() - backward benchmark (CSV)
- make_fwd_benchmark_plot_for_case() - forward plot
- make_bwd_benchmark_plot_for_case() - backward plot
Timing is measured with triton.testing.do_bench(), and the result is converted from ms to us (time_unit_scale=1000).
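
The unit conversion is simple but easy to get backwards, so here is a sketch of it. time_us is a hypothetical wrapper: it accepts any timer that returns milliseconds, so it can be exercised without a GPU. With Triton installed, triton.testing.do_bench could be passed as bench_ms.

```python
from typing import Callable

TIME_UNIT_SCALE = 1000  # ms -> us


def time_us(bench_ms: Callable[[Callable[[], object]], float], fn: Callable[[], object]) -> float:
    """Run a millisecond timer on fn and report the result in microseconds."""
    return bench_ms(fn) * TIME_UNIT_SCALE


def fake_bench_ms(fn: Callable[[], object]) -> float:
    """Stand-in timer for illustration: calls fn once and pretends it took 0.25 ms."""
    fn()
    return 0.25


print(time_us(fake_bench_ms, lambda: sum(range(10))))
```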
diff_engine.py
The DiffCase ABC interface:
- build_inputs(bs, sl, dim) - create the input tensors
- make_naive() / make_cuda() - create the implementations
- forward(module, inputs) - run the forward pass
- grad_inputs(inputs) - return the tensors whose gradients are compared
calculate_diff() compares the forward output and backward gradients of the naive and CUDA implementations with torch.testing.assert_close().
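
The interface can be pictured as an abstract base class along these lines. This is a sketch reconstructed only from the method names listed above. The real signatures in diff_engine.py may take extra parameters (for example a dtype or eps), and the type annotations here are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any


class DiffCase(ABC):
    """Hypothetical outline of the DiffCase interface (signatures are assumed)."""

    @abstractmethod
    def build_inputs(self, bs: int, sl: int, dim: int) -> Any:
        """Create the input tensors for one configuration."""

    @abstractmethod
    def make_naive(self) -> Any:
        """Create the PyTorch reference implementation."""

    @abstractmethod
    def make_cuda(self) -> Any:
        """Create the CUDA kernel implementation."""

    @abstractmethod
    def forward(self, module: Any, inputs: Any) -> Any:
        """Run the forward pass of the given module on the inputs."""

    @abstractmethod
    def grad_inputs(self, inputs: Any) -> list:
        """Return the tensors whose gradients should be compared."""


# An ABC with unimplemented abstract methods cannot be instantiated.
try:
    DiffCase()
    instantiable = True
except TypeError:
    instantiable = False
print(instantiable)
```

Declaring every method abstract means an incomplete case fails at construction time rather than partway through a benchmark run.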
Kubeflow Integration
benchmark.yaml runs the benchmarks on a cluster. The job performs the following steps:
- Install triton, matplotlib, and pandas
- Build the C++ extension (setup.py)
- GPU warmup (100 matmul iterations)
- Save results to benchmarks/results/{YY_MM_DD_HH_MM}/
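
The timestamped results directory can be reproduced with strftime. This sketch assumes {YY_MM_DD_HH_MM} means a two-digit year with zero-padded month, day, hour, and minute. results_dir is a hypothetical helper, and the job config may build the path in shell instead.

```python
from datetime import datetime
from pathlib import Path


def results_dir(now: datetime, root: str = "benchmarks/results") -> Path:
    """Build the timestamped output directory, assuming zero-padded two-digit fields."""
    return Path(root) / now.strftime("%y_%m_%d_%H_%M")


# A fixed timestamp keeps the example deterministic; a real run would pass datetime.now().
print(results_dir(datetime(2025, 3, 7, 14, 5)).as_posix())
```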