Kernels

Commit History

Merge pull request #22 from MotifTechnologies/jangwoong/mla-rope-fa4-port
5adea7d
unverified

Jangwoong Kim committed on

bench: MLA RoPE fused vs vanilla sweep
536f0b2

Jangwoong Kim Claude Opus 4.6 (1M context) committed on

test: numerical parity for MLA RoPE fused kernels vs PyTorch reference
0c42208

3v324v23 Claude Opus 4.6 (1M context) committed on

cleanup: drop k_pe RoPE custom kernel (caller uses PyTorch native)
7e86d2e

3v324v23 Claude Opus 4.6 (1M context) committed on

refactor: replace warp shuffle with CUB BlockReduce
79a877a

wyldecat Claude Opus 4.6 (1M context) committed on

fix: unify all backward kernels to input-based math + fix test import
09ecd67

wyldecat Claude Opus 4.6 (1M context) committed on

style: fix yapf/isort/clang-format for CI --all-files
9dcee96

wyldecat Claude Opus 4.6 (1M context) committed on

feat: add RMSNorm benchmark scripts and K8s job
a5e85e1

wyldecat Claude Opus 4.6 (1M context) committed on

feat: update RMSNorm Python interface for optimized kernels
4bb42a5

wyldecat Claude Opus 4.6 (1M context) committed on

perf: optimize RMSNorm CUDA kernels for all dims
dc88599

wyldecat Claude Opus 4.6 (1M context) committed on

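For context on the RMSNorm commits above: RMSNorm normalizes each vector by its root-mean-square rather than by mean and variance, computing y_i = x_i / sqrt(mean(x²) + eps) · w_i. A minimal pure-Python reference sketch of that formula (a hypothetical illustration, not the repo's CUDA kernel):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

# With unit weights, the output's root-mean-square is close to 1.
out = rms_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4)
```

An optimized CUDA kernel computes the same reduction (mean of squares) per row, typically with a block-wide reduction such as CUB's BlockReduce, as the refactor commit above suggests.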
feat: dedicated _kv_rope_bwd_kernel (register-sum + copy-fused)
35a25ee

3v324v23 Claude Opus 4.6 (1M context) committed on

perf: remove autotune, hard-code per-kernel configs from live dump
1e2bc2b

3v324v23 Claude Opus 4.6 (1M context) committed on

cleanup: remove dead Phase 3 Q kernel + shrink autotune to hand-picked configs
4d94a7d

3v324v23 Claude Opus 4.6 (1M context) committed on

review fixups: stride asserts, autotune split, intent comments
2712745

3v324v23 Claude Opus 4.6 (1M context) committed on

feat: MLA RoPE Triton kernels (port from llm-training)
f61868b

3v324v23 Claude Opus 4.6 (1M context) committed on

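The RoPE (rotary position embedding) the MLA commits above refer to rotates consecutive pairs of vector components by a position-dependent angle θ_i = pos · base^(−2i/d). A minimal pure-Python sketch of that rotation (a hypothetical reference, not the repo's Triton kernel, which fuses this across heads):

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to one head vector.

    Rotates each pair (x[2i], x[2i+1]) by angle pos * base**(-2i/d).
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s,
                x[i] * s + x[i + 1] * c]
    return out

# At position 0 every rotation angle is zero, so the vector is unchanged.
assert rope([1.0, 0.0, 0.0, 1.0], pos=0) == [1.0, 0.0, 0.0, 1.0]
```

Because each pair is an independent 2-D rotation, the backward pass is the rotation by −θ, which is what makes a fused forward/backward kernel straightforward to test against a PyTorch reference, as the parity-test commit above does.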
style: fix yapf/isort formatting for CI --all-files check
3f2678c

wyldecat Claude Opus 4.6 (1M context) committed on

style: apply yapf + isort formatting
60615a0

wyldecat Claude Opus 4.6 (1M context) committed on

feat: replace triton do_bench with torch.profiler for kernel timing
7d51e61

wyldecat Claude Opus 4.6 (1M context) committed on

grouped polynorm with padding aware (#19)
972d63b
unverified

TaehyunKim committed on

chore: remove pre-built binaries and add local build loader shim (#18)
1e08296
unverified

wyldecat Claude Opus 4.6 (1M context) committed on

style: apply yapf, isort, and clang-format
6436ad6

wyldecat Claude Opus 4.6 (1M context) committed on

style: fix clang-format on torch_binding.h
344ed39

wyldecat Claude Opus 4.6 (1M context) committed on

docs: update README for CUDA kernel and pip install workflow
d11ff7e

wyldecat Claude Opus 4.6 (1M context) committed on

ci: remove nix build-and-commit workflow
a633feb

wyldecat Claude Opus 4.6 (1M context) committed on

fix: rename stale references and clean up Triton remnants
5a9d09d

wyldecat Claude Opus 4.6 (1M context) committed on

refactor: remove Triton kernels, add hidden_clamp to unscored ops
906e125

wyldecat Claude Opus 4.6 (1M context) committed on

test: add scores and hidden_clamp tests for fused_mul_grouped_poly_norm
f06406d

wyldecat Claude Opus 4.6 (1M context) committed on

feat: add setup.py for local CUDA development builds
656a6f4

wyldecat Claude Opus 4.6 (1M context) committed on

feat: add grouped poly norm CUDA kernel with scores and hidden_clamp fusion
0045757

wyldecat Claude Opus 4.6 (1M context) committed on

refactor: rename grouped_fused_mul_poly_norm → fused_mul_grouped_poly_norm
60a628a

wyldecat Claude Opus 4.6 (1M context) committed on

feat: add GroupedFusedMulPolyNorm Triton kernel for MoE models (#16)
e195bbb
unverified

TaehyunKim Claude Opus 4.6 github-actions[bot] committed on

Add built binary [skip-build]
46020a2

github-actions[bot] committed on

fix: support PyTorch 2.10 register_op_strategy import path change
ad23c2a

wyldecat Claude Opus 4.6 committed on

fix: update toml
cef5fdf

wyldecat committed on

Add built binary [skip-build]
dc1d060

github-actions[bot] committed on

fix: fix fused add rms norm sharding strategy
a35a092

wyldecat committed on

fix: fix rms norm sharding strategy
138159c

wyldecat committed on

fix: comment out actionlint
1da8432

wyldecat committed on

Update tag (#4)
8f89ce2
verified

JH-Motif danieldk HF Staff committed on

Add built binary [skip-build]
f2471cd

github-actions[bot] committed on

fix(rms_norm.py): add assertion for input gradients to handle unsupported cases in backward pass
f19f8f4

wyldecat committed on

feat: support sequence parallel with fused_add_rms_norm
151bb5a

wyldecat committed on

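The fused_add_rms_norm op named in the commits above fuses the residual addition with RMSNorm, returning both the normalized output and the updated residual; the refactor commits below make it out-of-place. A pure-Python sketch of that contract (hypothetical names and shapes, not the repo's implementation):

```python
import math

def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Out-of-place fused op: add residual, then RMSNorm the sum.

    Returns (normalized output, updated residual); the inputs
    are left untouched, matching out-of-place semantics.
    """
    added = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in added) / len(added)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * w for v, w in zip(added, weight)]
    return normed, added
```

Fusing the add saves one read/write of the hidden states per layer, and returning the summed residual lets the next layer consume it directly, which is also what makes the op compose with sequence-parallel sharding as the commits describe.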
refactor(activation): change fused_add_rms_norm and fused_add_rms_norm_backward to out-place operations
7e4334d

wyldecat committed on

refactor(rms_norm): move RMS normalization logic to a new module for better organization and maintainability
66b3c5e

wyldecat committed on

feat: support sequence parallel with rms_norm
06d6367

wyldecat committed on

feat: add assert is_contiguous
a2a2501

wyldecat committed on

feat: make rms_norm as out-place
9d0a235

wyldecat committed on

feat(workflow): add Slack notifications for build start, success, and failure [skip-build]
ab05e35

wyldecat committed on

Revert "fix typo in readme (#7)" (#8)
ddd119c
unverified

TaehyunKim committed on

fix typo in readme (#7)
2d926c3
unverified

TaehyunKim github-actions[bot] committed on