Merge pull request #22 from MotifTechnologies/jangwoong/mla-rope-fa4-port 5adea7d unverified Jangwoong Kim committed on 2 days ago
bench: MLA RoPE fused vs vanilla sweep 536f0b2 Jangwoong Kim Claude Opus 4.6 (1M context) committed on 2 days ago
test: numerical parity for MLA RoPE fused kernels vs PyTorch reference 0c42208 3v324v23 Claude Opus 4.6 (1M context) committed on 2 days ago
cleanup: drop k_pe RoPE custom kernel (caller uses PyTorch native) 7e86d2e 3v324v23 Claude Opus 4.6 (1M context) committed on 3 days ago
refactor: replace warp shuffle with CUB BlockReduce 79a877a wyldecat Claude Opus 4.6 (1M context) committed on 3 days ago
fix: unify all backward kernels to input-based math + fix test import 09ecd67 wyldecat Claude Opus 4.6 (1M context) committed on 4 days ago
style: fix yapf/isort/clang-format for CI --all-files 9dcee96 wyldecat Claude Opus 4.6 (1M context) committed on 4 days ago
feat: add RMSNorm benchmark scripts and K8s job a5e85e1 wyldecat Claude Opus 4.6 (1M context) committed on 4 days ago
feat: update RMSNorm Python interface for optimized kernels 4bb42a5 wyldecat Claude Opus 4.6 (1M context) committed on 4 days ago
perf: optimize RMSNorm CUDA kernels for all dims dc88599 wyldecat Claude Opus 4.6 (1M context) committed on 4 days ago
feat: dedicated _kv_rope_bwd_kernel (register-sum + copy-fused) 35a25ee 3v324v23 Claude Opus 4.6 (1M context) committed on 3 days ago
perf: remove autotune, hard-code per-kernel configs from live dump 1e2bc2b 3v324v23 Claude Opus 4.6 (1M context) committed on 3 days ago
cleanup: remove dead Phase 3 Q kernel + shrink autotune to hand-picked configs 4d94a7d 3v324v23 Claude Opus 4.6 (1M context) committed on 3 days ago
review fixups: stride asserts, autotune split, intent comments 2712745 3v324v23 Claude Opus 4.6 (1M context) committed on 3 days ago
feat: MLA RoPE Triton kernels (port from llm-training) f61868b 3v324v23 Claude Opus 4.6 (1M context) committed on 4 days ago
style: fix yapf/isort formatting for CI --all-files check 3f2678c wyldecat Claude Opus 4.6 (1M context) committed on 4 days ago
style: apply yapf + isort formatting 60615a0 wyldecat Claude Opus 4.6 (1M context) committed on 4 days ago
feat: replace triton do_bench with torch.profiler for kernel timing 7d51e61 wyldecat Claude Opus 4.6 (1M context) committed on 7 days ago
chore: remove pre-built binaries and add local build loader shim (#18) 1e08296 unverified wyldecat Claude Opus 4.6 (1M context) committed on 9 days ago
style: apply yapf, isort, and clang-format 6436ad6 wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
style: fix clang-format on torch_binding.h 344ed39 wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
docs: update README for CUDA kernel and pip install workflow d11ff7e wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
ci: remove nix build-and-commit workflow a633feb wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
fix: rename stale references and clean up Triton remnants 5a9d09d wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
refactor: remove Triton kernels, add hidden_clamp to unscored ops 906e125 wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
test: add scores and hidden_clamp tests for fused_mul_grouped_poly_norm f06406d wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
feat: add setup.py for local CUDA development builds 656a6f4 wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
feat: add grouped poly norm CUDA kernel with scores and hidden_clamp fusion 0045757 wyldecat Claude Opus 4.6 (1M context) committed on 11 days ago
refactor: rename grouped_fused_mul_poly_norm → fused_mul_grouped_poly_norm 60a628a wyldecat Claude Opus 4.6 (1M context) committed on 13 days ago
feat: add GroupedFusedMulPolyNorm Triton kernel for MoE models (#16) e195bbb unverified TaehyunKim Claude Opus 4.6 github-actions[bot] committed on Mar 6
fix: support PyTorch 2.10 register_op_strategy import path change ad23c2a wyldecat Claude Opus 4.6 committed on Feb 19
fix(rms_norm.py): add assertion for input gradients to handle unsupported cases in backward pass f19f8f4 wyldecat committed on Oct 13, 2025
refactor(activation): change fused_add_rms_norm and fused_add_rms_norm_backward to out-place operations 7e4334d wyldecat committed on Oct 13, 2025
refactor(rms_norm): move RMS normalization logic to a new module for better organization and maintainability 66b3c5e wyldecat committed on Oct 13, 2025
feat(workflow): add Slack notifications for build start, success, and failure [skip-build] ab05e35 wyldecat committed on Oct 13, 2025