Lgr54HFi / chomera
chimera51 · custom_code · arxiv: 12 papers
main / chomera / chimera · 179 kB
1 contributor · History: 23 commits
Lgr54HFi · fix: MoE intermediate_size not scaled for tiny → 158M→4M MoE params · 6cb7b4d (verified) · 11 days ago
training · fix: MoE intermediate_size not scaled for tiny → 158M→4M MoE params · 11 days ago
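The fix above concerns deriving the tiny debug variant from the full model: if only hidden_size is shrunk, each expert's FFN keeps its full intermediate_size and the MoE parameter count barely drops. A minimal sketch of that scaling; all names here (ChimeraConfig, tiny_config, the default widths) are assumptions, not taken from this repo's config.py:

```python
from dataclasses import dataclass, replace

@dataclass
class ChimeraConfig:                # hypothetical stand-in for the repo's config
    hidden_size: int = 2048
    intermediate_size: int = 8192   # per-expert FFN width
    num_experts: int = 8

def tiny_config(base: ChimeraConfig, scale: float = 0.125) -> ChimeraConfig:
    # Both widths must shrink: scaling hidden_size alone leaves every expert
    # at full width, which is the bug the commit message describes.
    return replace(
        base,
        hidden_size=int(base.hidden_size * scale),
        intermediate_size=int(base.intermediate_size * scale),
    )
```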
__init__.py · Safe · 2.43 kB · Upload folder using huggingface_hub · 12 days ago
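Most rows below share this same commit message; upload_folder is the standard huggingface_hub call for pushing a local directory in a single commit. A sketch of what that invocation likely looked like (the local path and path_in_repo are assumptions):

```python
from huggingface_hub import upload_folder

upload_folder(
    repo_id="Lgr54HFi/chomera",
    folder_path="./chimera",   # assumed local checkout of the package
    path_in_repo="chimera",
    commit_message="Upload folder using huggingface_hub",
)
```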
__main__.py · Safe · 894 Bytes · Upload folder using huggingface_hub · 12 days ago
cli.py · Safe · 1.97 kB · Upload folder using huggingface_hub · 12 days ago
config.py · Safe · 3.11 kB · Upload folder using huggingface_hub · 12 days ago
evolution.py · Safe · 23.3 kB · perf: eliminate .item() graph breaks in evolution.py → use tensor comparisons for torch.compile compat · 12 days ago
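The evolution.py commit targets a common torch.compile pitfall: calling .item() on a GPU tensor forces a host-device sync and splits the compiled graph. Keeping the comparison as a tensor op avoids both. A minimal illustration with made-up variable names:

```python
import torch

# Before: host sync + graph break under torch.compile.
#   if candidate_fitness.item() > best_fitness.item():
#       best_fitness = candidate_fitness

# After: the selection stays on-device and traces cleanly.
def update_best(best_fitness: torch.Tensor,
                candidate_fitness: torch.Tensor) -> torch.Tensor:
    return torch.where(candidate_fitness > best_fitness,
                       candidate_fitness, best_fitness)
```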
hyper.py · Safe · 18.7 kB · Upload folder using huggingface_hub · 12 days ago
inference.py · Safe · 15.1 kB · Upload folder using huggingface_hub · 12 days ago
layers.py · Safe · 21.1 kB · Upload folder using huggingface_hub · 12 days ago
looping.py · Safe · 2.82 kB · Upload folder using huggingface_hub · 12 days ago
model.py · Safe · 15.9 kB · Skip SpanEngine/Grammar/DebtLedger during training (inference-only ops on 200K logits) · 11 days ago
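The model.py change gates decode-time logit processors behind the module's train/eval flag, since running them over ~200K-wide logit rows inside the training loop is pure overhead. A self-contained sketch of that pattern; the wrapper class is hypothetical, and only the op names come from the commit:

```python
import torch
import torch.nn as nn

class InferenceOnlyLogitOps(nn.Module):
    """Applies inference-only logit ops (SpanEngine/Grammar/DebtLedger in the
    commit) only in eval mode; training passes the logits through untouched."""

    def __init__(self, post_ops: list[nn.Module]):
        super().__init__()
        self.post_ops = nn.ModuleList(post_ops)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        if self.training:             # set by model.train(): skip expensive ops
            return logits
        for op in self.post_ops:      # model.eval(): full decode-time pipeline
            logits = op(logits)
        return logits
```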
moe.py · Safe · 4.29 kB · Upload folder using huggingface_hub · 12 days ago
multimodal.py · Safe · 5.15 kB · Upload folder using huggingface_hub · 12 days ago
paths.py · Safe · 358 Bytes · Upload folder using huggingface_hub · 12 days ago
quantization.py · 17.4 kB · fix: NaN at step 150 → add gradient clamping to STE detach trick + lower max_grad_norm to 0.5 · 12 days ago

    The pure detach() STE passes gradients through unbounded, causing
    gradient explosion around step 140-150 when loss is still high.

    Fix: clamp the gradient contribution within the detach trick:
        w_q = clamp(w_scaled, -1, 1) + (round(clamped) - clamped).detach()
    This ensures gradients are zero outside [-1, 1] (weights already at the
    quantization boundary get no gradient push) while keeping the STE
    identity pass-through inside the valid range.

    Also reduces max_grad_norm from 1.0 to 0.5 for additional stability.

    Ref: 4-bit CPU training paper (2603.13931) uses tanh soft clipping
    for the same reason.
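The commit body gives the exact formula; transcribed into a runnable helper, with the function name and scaling convention assumed and the math kept verbatim:

```python
import torch

def ste_quantize(w_scaled: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator with clamped gradients.

    Forward: round(clamp(w, -1, 1)). Backward: identity inside [-1, 1] and
    zero outside, so weights sitting at the quantization boundary stop
    receiving gradient pushes, which is the explosion mode the commit fixes.
    """
    clamped = torch.clamp(w_scaled, -1.0, 1.0)
    return clamped + (torch.round(clamped) - clamped).detach()

# Companion change from the same commit: tighter global clipping.
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```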
tokenizer.py · Safe · 6.84 kB · Upload folder using huggingface_hub · 12 days ago