fix: v12 GENESIS - fix 6 interaction bugs between paradigms

1. P13 MTP heads added to optimizer (were dead - never updated)
2. P18 Grokfast: skip Muon 2D params (NS normalisation cancels amplification).
   Apply only to 1D/embed params where AdamW preserves the signal
3. P16 Plateau: save/restore ALL group LRs (was destroying LLRD ratios) - see sketch below
4. P15 Token Triage applied to MTP loss too (was only on base loss)
5. P16 Plateau: gentler burst ×2 instead of ×3 (Grokfast already amplifies)
6. P15 Triage: per-position EMA disabled, use global excess only
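A minimal sketch of fix 3, under assumptions: the burst factor handling and the
start_plateau_burst/end_plateau_burst names are illustrative, not the repo's
actual identifiers.

    # Save and restore EVERY param group's LR so the per-layer LLRD
    # ratios from P19 survive a plateau burst (x2 per fix 5).
    def start_plateau_burst(optimizer, burst_factor=2.0):
        saved_lrs = [g["lr"] for g in optimizer.param_groups]
        for g in optimizer.param_groups:
            g["lr"] *= burst_factor
        return saved_lrs

    def end_plateau_burst(optimizer, saved_lrs):
        for g, lr in zip(optimizer.param_groups, saved_lrs):
            g["lr"] = lr  # each group keeps its own LLRD-scaled LR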
feat: v11 CHIMERA GENESIS - Grokfast-EMA + fused loss + LLRD + kill EMA distill overhead

Major rewrite of the training step:

1. P18 Grokfast-EMA (arxiv 2405.20233): 43× convergence acceleration.
   Amplifies slow gradient components (generalization signal),
   filters fast components (memorization/STE noise). 5 lines, zero overhead
   (see the sketch after this message). Especially powerful for ternary STE,
   where gradient noise is high.

2. FUSED LOSS: P15 Token Triage + P17 Batch Metabolism now COMBINE
   instead of being mutually exclusive (elif). Token triage weights
   individual tokens, batch metabolism weights sequences. Multiplicative
   composition.

3. P19 Layer-wise LR Decay: higher LR for top layers (task-specific),
   lower for bottom layers (general features). decay_rate=0.85 per layer.
   Proven for ternary by TernaryLM (arxiv 2602.07374).

4. REMOVED EMA Self-Distillation: doubled forward-pass time for marginal
   gain. The EMA model copy kept an extra 227M parameters in memory for a
   KL loss that barely helps in from-scratch pretraining (the Baby Llama
   recipe was for fine-tuning with a DIFFERENT teacher, not self-EMA).
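A minimal sketch of the P18 Grokfast-EMA filter (item 1), applied between
loss.backward() and optimizer.step(). The alpha/lamb values and the
gradfilter_ema name are illustrative defaults, not necessarily the ones this
repo uses.

    import torch

    @torch.no_grad()
    def gradfilter_ema(model, grads, alpha=0.98, lamb=2.0):
        # grads: dict holding the EMA of each parameter's gradient
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            ema = grads.setdefault(name, torch.zeros_like(p.grad))
            ema.mul_(alpha).add_(p.grad, alpha=1 - alpha)  # track slow component
            p.grad.add_(ema, alpha=lamb)                   # amplify it
        return grads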
feat: P12 Muon optimizer, P13 Multi-Token Prediction, P14 EMA Self-Distillation

Three new paradigms for revolutionary sample efficiency:

P12 Muon: Newton-Schulz orthogonalized momentum for 2D weight matrices.
Same loss in 52% of the FLOPs vs AdamW (arxiv 2502.16982). AdamW fallback
for 1D params (biases, norms, embeddings).

P13 MTP: predict the next 3 tokens instead of 1. Each forward pass yields
3x gradient signal. Implemented as auxiliary loss heads sharing the trunk
(see the sketch after this message).

P14 EMA Self-Distillation: an EMA copy of the model acts as teacher. KL loss
between student and EMA soft targets gives dense signal across the full vocab
vs sparse one-hot labels. α=0.5, T=2.0 (Baby Llama recipe, arxiv 2308.02019).
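A minimal sketch of the P13 auxiliary MTP loss, assuming one extra linear head
per additional future offset on top of the shared trunk. The head count, the
mtp_heads name, and the aux weight are illustrative.

    import torch.nn.functional as F

    def mtp_aux_loss(hidden, labels, mtp_heads, aux_weight=0.3):
        # hidden: [B, T, D] trunk states; labels: [B, T] next-token ids
        total = hidden.new_zeros(())
        for k, head in enumerate(mtp_heads, start=1):  # head k looks k extra steps ahead
            if labels.size(1) <= k:
                break
            logits = head(hidden[:, :-k])              # [B, T-k, V]
            target = labels[:, k:]                     # targets shifted k further out
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return aux_weight * total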
feat: activate dormant paradigms - progressive looping, evolution with loss feedback, no progressive_unfreeze

With STE+AdamW (not MeZO), we can afford multi-loop training.
Progressive loop schedule: 1→2→3 loops as training advances (sketch below).
The evolution engine now receives the previous step's loss for surprise
detection and memory writes.
Progressive unfreeze is disabled by default (counterproductive with backprop).
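A minimal sketch of the loop schedule; the 1/3 and 2/3 thresholds are an
assumption, the commit only specifies the 1→2→3 ramp.

    def loops_for_step(step, total_steps):
        progress = step / max(total_steps, 1)
        if progress < 1 / 3:
            return 1
        return 2 if progress < 2 / 3 else 3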
fix: NaN skip + grad sanitization - detect NaN loss, zero corrupted grads, skip optimizer step

When a rare batch produces a NaN loss (step 380/500), the backward pass
contaminates all gradients with NaN. Without detection, optimizer.step()
pushes all weights to NaN → irrecoverable.

Fix: check the loss for NaN/Inf before backward. If detected, zero the grads
and skip the optimizer step. Training recovers on the next batch.
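A minimal sketch of the guard, assuming a plain backward/clip/step loop;
the helper name and structure are illustrative.

    import torch

    def safe_step(model, optimizer, loss, max_grad_norm=0.5):
        if not torch.isfinite(loss):
            optimizer.zero_grad(set_to_none=True)  # drop any corrupted grads
            return False                           # skip step; recover next batch
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return True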
fix: NaN at step 150 - add gradient clamping to the STE detach trick + lower max_grad_norm to 0.5

The pure detach() STE passes gradients through unbounded, causing
gradient explosion around steps 140-150 while the loss is still high.

Fix: clamp the gradient contribution within the detach trick:
    w_q = clamp(w_scaled, -1, 1) + (round(clamped) - clamped).detach()
This ensures gradients are zero outside [-1, 1] (weights already at the
quantization boundary get no gradient push) while keeping the STE
identity pass-through inside the valid range.

Also reduces max_grad_norm from 1.0 to 0.5 for additional stability.

Ref: the 4-bit CPU training paper (2603.13931) uses tanh soft clipping
for the same reason.
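A minimal sketch of the clamped STE above; the absmean scaling step is an
assumption in the style of BitNet b1.58, while the clamp/round/detach part is
the fix described in this commit.

    import torch

    def ternary_quantize(w, eps=1e-5):
        scale = w.abs().mean().clamp(min=eps)       # absmean scale (assumed)
        w_scaled = w / scale
        clamped = w_scaled.clamp(-1.0, 1.0)         # zero grad outside [-1, 1]
        w_q = clamped + (clamped.round() - clamped).detach()  # STE inside range
        return w_q * scale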
perf: replace _RoundTernarySTE autograd.Function with detach() trick - zero graph breaks for torch.compile

The detach() identity pattern (w + (round(clamp(w)) - w).detach()) is
mathematically equivalent to the old STE but uses only standard aten ops
that torch.compile/Inductor can trace through. This eliminates 84+
graph breaks, enabling full kernel fusion of quantize+linear.

Pattern from the official BitNet b1.58 implementation (1bitLLM/bitnet_b1_58-large).
Ref: arxiv 2402.17764
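A minimal sketch of the pure detach() pattern this commit introduced (the
gradient clamping from the later fix above is omitted here); the TernaryLinear
module name is illustrative.

    import torch
    import torch.nn.functional as F

    class TernaryLinear(torch.nn.Linear):
        def forward(self, x):
            w = self.weight
            # only standard aten ops, so torch.compile traces through with no graph breaks
            w_q = w + (torch.round(torch.clamp(w, -1.0, 1.0)) - w).detach()
            return F.linear(x, w_q, self.bias)

    layer = torch.compile(TernaryLinear(256, 256))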