chomera / chimera / quantization.py

Commit History

fix: NaN at step 150 - add gradient clamping to STE detach trick + lower max_grad_norm to 0.5

The pure detach() STE passes gradients through unbounded, causing
gradient explosion around steps 140-150 when loss is still high.

Fix: clamp the gradient contribution within the detach trick:

    w_q = clamp(w_scaled, -1, 1) + (round(clamped) - clamped).detach()

This ensures gradients are zero outside [-1, 1] (weights already at the
quantization boundary get no gradient push) while keeping the STE
identity pass-through inside the valid range.

Also reduces max_grad_norm from 1.0 to 0.5 for additional stability.

Ref: 4-bit CPU training paper (2603.13931) uses tanh soft clipping
for the same reason.
ec200d2
verified

Lgr54HFi committed on
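
A minimal sketch of the clamped STE described in ec200d2, assuming the per-tensor
absmean scaling used by BitNet-style ternary quantization; the function name,
eps value, and the clip_grad_norm_ call are illustrative, not taken from
quantization.py:

    import torch

    def ternary_quantize_clamped(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        # Illustrative sketch of the clamped detach-trick STE.
        scale = w.abs().mean().clamp(min=eps)   # per-tensor absmean scale (assumed)
        w_scaled = w / scale
        clamped = w_scaled.clamp(-1.0, 1.0)     # gradient is zero outside [-1, 1]
        # Forward value: round(clamped); backward: identity through `clamped` only,
        # so weights already at the quantization boundary get no gradient push.
        w_q = clamped + (torch.round(clamped) - clamped).detach()
        return w_q * scale

    # Stability companion to the commit: tighter global gradient clipping.
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)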

perf: replace _RoundTernarySTE autograd.Function with detach() trick - zero graph breaks for torch.compile

The detach() identity pattern (w + (round(clamp(w)) - w).detach()) is
mathematically equivalent to the old STE but uses only standard aten ops
that torch.compile/Inductor can trace through. This eliminates 84+
graph breaks, enabling full kernel fusion of quantize+linear.

Pattern from the official BitNet b1.58 implementation (1bitLLM/bitnet_b1_58-large).
Ref: arXiv 2402.17764
31b0fdf
verified

Lgr54HFi committed on
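
A rough reconstruction of the change in 31b0fdf; the body of _RoundTernarySTE is
an assumption (only the class name appears in the commit message), and
quantized_linear is an illustrative wrapper rather than the actual function in
this file:

    import torch
    import torch.nn.functional as F

    # Before: a custom autograd.Function. torch.compile cannot trace through it,
    # so every call forces a graph break.
    class _RoundTernarySTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, w_scaled):
            return torch.round(torch.clamp(w_scaled, -1.0, 1.0))

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output  # straight-through: pass gradients unchanged

    # After: the detach() identity pattern from the commit. The forward value is
    # identical, but it is built from standard aten ops, so Inductor can fuse
    # the quantization with the following linear.
    def quantized_linear(x, w, scale):
        w_scaled = w / scale
        w_q = w_scaled + (torch.round(torch.clamp(w_scaled, -1.0, 1.0)) - w_scaled).detach()
        return F.linear(x, w_q * scale)

    compiled = torch.compile(quantized_linear)  # traces without graph breaks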

Upload folder using huggingface_hub
11c11f8
verified

Lgr54HFi committed on