Commit History

fix: MoE intermediate_size not scaled for tiny — 158M→4M MoE params
6cb7b4d
verified

Lgr54HFi commited on

fix: print every step + first-step timing to diagnose slow forward
5b5a08d
verified

Lgr54HFi commited on

fix: batch_size 32→4 base (GrowLength scales up, _safe_batch caps)
995be31
verified

Lgr54HFi commited on

fix: OOM at batch=256 — cap batch by logits memory, enable grad ckpt
5bfbb8a
verified

Lgr54HFi commited on
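
A minimal sketch of the batch-capping idea from the commit above: bound the batch size by the memory the fp32 logits tensor would need, and lean on gradient checkpointing for the rest. The helper name `_safe_batch` appears in a later commit message; the byte budget, signature, and example numbers here are illustrative assumptions, not the repo's actual code.

```python
def _safe_batch(requested: int, seq_len: int, vocab_size: int,
                budget_bytes: int = 4 * 1024**3) -> int:
    """Cap batch size so the fp32 logits tensor (batch, seq, vocab) fits a budget.

    Hypothetical sketch: the repo's real heuristic and budget may differ.
    """
    bytes_per_sample = seq_len * vocab_size * 4   # fp32 logits for one sequence
    max_batch = max(1, budget_bytes // bytes_per_sample)
    return min(requested, max_batch)

# With a ~200K vocab at seq_len=512, a 4 GiB logits budget allows only about
# ten sequences, so a requested batch of 256 gets capped hard:
print(_safe_batch(256, seq_len=512, vocab_size=200_000))

# Gradient checkpointing (for a Hugging Face-style model) then trades compute
# for activation memory on what remains:
# model.gradient_checkpointing_enable()
```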

fix: tcmalloc debug .so crash, add error trapping, chmod note
e80380b
verified

Lgr54HFi commited on

perf: 4-stage GrowLength + CLI defaults for 300-step target
1eb24b2
verified

Lgr54HFi commited on

perf: tune train_hyper_loop for 300-step convergence
9d8c566
verified

Lgr54HFi commited on

perf: tune chimera_turbo.py for 300-step convergence + throughput
8b16586
verified

Lgr54HFi commited on

perf: P-core-only threading, KMP_BLOCKTIME=0, mandatory tcmalloc
fdb348a
verified

Lgr54HFi commited on

Fix loss rebound: lower Muon LR (0.02→0.008), clamp ternary latents, steeper cosine decay
e4d9588
verified

Lgr54HFi commited on

Skip SpanEngine/Grammar/DebtLedger during training (inference-only ops on 200K logits)
dda344d
verified

Lgr54HFi commited on

Upload chimera/training/loops.py
6d5c935
verified

Lgr54HFi commited on

Fix throughput (26→~80+ tok/s) and convergence (lr 0.0015→0.02)
d83bada
verified

Lgr54HFi commited on

Fix NaN loss reporting: show nan instead of 0.0 when all steps in window are NaN
8e41f12
verified

Lgr54HFi commited on

Fix NaN cascade: restore per-step gradient sanitization, add weight/momentum repair, harden Newton-Schulz
0e7327a
verified

Lgr54HFi commited on

Upload train_hyper.py
0e64e3a
verified

Lgr54HFi commited on

Upload chimera/model.py
310c416
verified

Lgr54HFi commited on

Upload chimera/training/hyper.py
6a7521a
verified

Lgr54HFi commited on

Upload chimera/training/loops.py
edcdcb3
verified

Lgr54HFi commited on

Fix loss plateau + throughput collapse: 7 bugs resolved
f9d237b
verified

Lgr54HFi commited on

fix: v12 GENESIS — fix 6 interaction bugs between paradigms

1. P13 MTP heads added to optimizer (were dead — never updated)
2. P18 Grokfast: skip Muon 2D params (NS normalisation cancels amplification)
   Apply only to 1D/embed params where AdamW preserves the signal
3. P16 Plateau: save/restore ALL group LRs (was destroying LLRD ratios)
4. P15 Token Triage applied to MTP loss too (was only on base loss)
5. P16 Plateau: gentler burst ×2 instead of ×3 (Grokfast already amplifies)
6. P15 Triage: per-position EMA disabled, use global excess only
cf64132
verified

Lgr54HFi commited on
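
A minimal sketch of interaction bugs 3 and 5 from the commit above: the plateau burst scales every optimizer param group by the same factor and restores the exact saved LRs afterwards, so the LLRD ratios between groups survive. The function names and call pattern are illustrative assumptions; only the "all group LRs" and x2 details come from the commit text.

```python
import torch

def start_burst(optimizer: torch.optim.Optimizer, factor: float = 2.0) -> list:
    """Boost every param group's LR by `factor` and return the saved LRs.

    Saving/restoring all groups (not just group 0) keeps layer-wise LR-decay
    ratios intact across the burst (commit item 3); x2 is the gentler factor
    chosen because Grokfast already amplifies gradients (commit item 5).
    """
    saved = [group["lr"] for group in optimizer.param_groups]
    for group in optimizer.param_groups:
        group["lr"] *= factor
    return saved

def end_burst(optimizer: torch.optim.Optimizer, saved: list) -> None:
    """Restore the exact per-group LRs recorded before the burst."""
    for group, lr in zip(optimizer.param_groups, saved):
        group["lr"] = lr
```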

feat: loops.py v11 — aligned with GENESIS engine, no distiller overhead
3859a82
verified

Lgr54HFi commited on

feat: v11 CHIMERA GENESIS — Grokfast-EMA + fused loss + LLRD + kill EMA distill overhead

Major rewrite of training step:

1. P18 Grokfast-EMA (arxiv 2405.20233): 43× convergence acceleration.
   Amplifies slow gradient components (generalization signal),
   filters fast components (memorization/STE noise). 5 lines, 0 overhead.
   Especially powerful for ternary STE where gradient noise is high.

2. FUSED LOSS: P15 Token Triage + P17 Batch Metabolism now COMBINE
   instead of elif. Token triage weights individual tokens, batch
   metabolism weights sequences. Multiplicative composition.

3. P19 Layer-wise LR Decay: higher LR for top layers (task-specific),
   lower for bottom (general features). decay_rate=0.85 per layer.
   Proven for ternary by TernaryLM (arxiv 2602.07374).

4. REMOVED EMA Self-Distillation: doubled forward pass time for marginal
   gain. The EMA model copy consumed 227M params of memory for a KL loss
   that barely helps in from-scratch pretraining (Baby Llama recipe was
   for fine-tuning with a DIFFERENT teacher, not self-EMA).
05566cc
verified

Lgr54HFi commited on
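
A minimal sketch of the Grokfast-EMA gradient filter ("5 lines, 0 overhead") described in the commit above, following the gradfilter_ema idea of arxiv 2405.20233: keep an EMA of each parameter's gradient and add an amplified copy of it back in before the optimizer step. The alpha/lamb defaults and the call site are assumptions, not the repo's exact values.

```python
import torch

@torch.no_grad()
def grokfast_ema(model: torch.nn.Module, ema: dict,
                 alpha: float = 0.98, lamb: float = 2.0) -> dict:
    """Amplify the slow (EMA) component of each gradient in place.

    Call after loss.backward() and before optimizer.step().
    `ema` maps parameter name -> gradient EMA tensor; pass {} on the first step.
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if name not in ema:
            ema[name] = p.grad.clone()
        else:
            ema[name].mul_(alpha).add_(p.grad, alpha=1.0 - alpha)
        p.grad.add_(ema[name], alpha=lamb)   # boost the slow "generalization" signal
    return ema
```

Per the v12 commit above, this filter is later restricted to 1D/embedding parameters so it does not fight Muon's Newton-Schulz normalisation of 2D matrices.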

feat: v10 — P15 Selective Token Triage, P16 Plateau Breaker, P17 Batch Metabolism

Three new paradigms, fused into the 'Adaptive Token Metabolism' concept:

P15 Token Triage (inspired by Rho-1, arxiv 2404.07965):
Compute per-token excess loss vs EMA baseline. Top 60% tokens get
full gradient, bottom 40% get 0.1× gradient. No reference model needed —
uses running EMA of per-position loss as baseline. This focuses
~90% of gradient energy on the actually-learnable tokens.

P16 Plateau Breaker:
Track loss EMA variance. When loss stagnates (variance < threshold
for 100 steps), trigger a 'warm restart': boost LR by 3× for 50 steps
then decay back. Inspired by SGDR (arxiv 1608.03983) but adaptive.

P17 Batch Metabolism (Online Hard Example Mining for LLM):
Within each batch, weight sequences by their loss relative to
batch mean. High-loss sequences get weight up to 2×, easy ones
get 0.5×. The model 'digests' harder examples more thoroughly.
974e9c4
verified

Lgr54HFi commited on
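
A minimal sketch of the P15 Token Triage weighting described in the commit above: compute per-token loss, measure the excess over a running EMA baseline, give the top 60% of tokens full weight and the rest 0.1x. Using a single global EMA baseline follows the later v12 change; the function shape, decay value, and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triage_loss(logits: torch.Tensor, targets: torch.Tensor, baseline: float,
                keep_frac: float = 0.6, down_weight: float = 0.1,
                ema_decay: float = 0.99):
    """Rho-1-style token triage against a global EMA loss baseline.

    logits: (batch, seq, vocab); targets: (batch, seq).
    Returns (weighted loss, updated baseline).
    """
    per_tok = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                              reduction="none")
    excess = per_tok - baseline                       # how much harder than usual
    k = max(1, int(keep_frac * per_tok.numel()))
    cutoff = excess.topk(k).values.min()              # top-60% excess threshold
    weights = torch.where(excess >= cutoff,
                          torch.ones_like(per_tok),
                          torch.full_like(per_tok, down_weight))
    loss = (weights * per_tok).sum() / weights.sum()
    new_baseline = ema_decay * baseline + (1 - ema_decay) * per_tok.mean().item()
    return loss, new_baseline
```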

feat: loops.py — integrate Muon + MTP + EMA distillation in training loop
9897d01
verified

Lgr54HFi commited on

feat: P12 Muon optimizer, P13 Multi-Token Prediction, P14 EMA Self-Distillation

Three new paradigms for revolutionary sample efficiency:

P12 Muon: Newton-Schulz orthogonalized momentum for 2D weight matrices.
Same loss in 52% of FLOPs vs AdamW (arxiv 2502.16982). AdamW fallback
for 1D params (biases, norms, embeddings).

P13 MTP: predict next 3 tokens instead of 1. Each forward pass yields
3x gradient signal. Implemented as auxiliary loss heads sharing the trunk.

P14 EMA Self-Distillation: EMA copy of model acts as teacher. KL loss
between student and EMA soft targets gives dense signal across full vocab
vs sparse one-hot labels. α=0.5, T=2.0 (Baby Llama recipe, arxiv 2308.02019).
76e1136
verified

Lgr54HFi commited on
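
A minimal sketch of the Newton-Schulz orthogonalization at the core of P12 Muon described above: each 2D weight's momentum is approximately orthogonalized by a few fixed-coefficient quintic iterations before the update, with AdamW handling 1D parameters. The coefficients follow the public Muon reference implementation; how this plugs into the repo's optimizer is an assumption.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D momentum/gradient matrix (Muon's core step)."""
    assert g.ndim == 2, "Muon applies this only to 2D weight matrices"
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic iteration coefficients
    x = g / (g.norm() + eps)              # normalize so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:                        # iterate on the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x
```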

fix: --all no longer enables progressive_unfreeze (counterproductive with backprop)
acc06f5
verified

Lgr54HFi commited on

feat: train_hyper_loop with progressive looping, evolution loss feedback, no progressive_unfreeze default

Activates dormant ch1mera paradigms:
1. Progressive looping: 1→2→3 Parcae loops during training
2. Evolution receives prev_loss for surprise-based memory writes
3. progressive_unfreeze disabled by default (all layers train from start)
4. Logs loop count and NaN-safe averaging
b6bcd75
verified

Lgr54HFi commited on
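
A minimal sketch of the 1→2→3 progressive loop schedule described in the commit above. The class name mirrors the exported `ProgressiveLoopScheduler`; the thirds-of-training breakpoints are an assumption, not the repo's actual schedule.

```python
class ProgressiveLoopScheduler:
    """Grow the number of Parcae loops per step as training advances (1 -> 2 -> 3).

    Hypothetical thirds-based schedule; the real breakpoints may differ.
    """

    def __init__(self, total_steps: int, max_loops: int = 3):
        self.total_steps = total_steps
        self.max_loops = max_loops

    def loops_at(self, step: int) -> int:
        return min(self.max_loops,
                   1 + (step * self.max_loops) // max(1, self.total_steps))

# With total_steps=300: steps 0-99 run 1 loop, 100-199 run 2, 200+ run 3.
```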

feat: export ProgressiveLoopScheduler
945c5bf
verified

Lgr54HFi commited on

feat: activate dormant paradigms — progressive looping, evolution with loss feedback, no progressive_unfreeze

With STE+AdamW (not MeZO), we can afford multi-loop training.
Progressive loop schedule: 1→2→3 loops as training advances.
Evolution engine now receives previous step loss for surprise
detection and memory writes.
Progressive unfreeze disabled by default (counterproductive with backprop).
5fd9d22
verified

Lgr54HFi commited on

fix: loops.py — use chimera_turbo v8 defaults (wd=0.01, warmup=750, β2=0.98) instead of hardcoded values
e2f5e25
verified

Lgr54HFi commited on

perf: BitNet-paper hyperparams — β2=0.98, wd=0.01, warmup=750, grad_clip=1.0, NaN-safe

Aligned with BitNet training recipe (2310.11453 Table 5-6):
- β2: 0.95→0.98 (all BitNet papers use 0.98, critical for ternary noise)
- wd: 0.05→0.01 (original BitNet; Reloaded uses 0.05 but 0.01 more stable)
- warmup: 500→750 fixed steps (paper-exact)
- grad_clip: 0.5→1.0 (papers use none, but we keep light clip for safety)
- Default LR: 1.5e-3 (interpolated 125M→2.4e-3, 350M→1.2e-3)
64db48c
verified

Lgr54HFi commited on
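
A minimal sketch of the BitNet-style recipe listed in the commit above, wired onto a plain AdamW so the example is self-contained (later commits route 2D weights through Muon). Only the numbers (β2=0.98, wd=0.01, warmup=750, clip=1.0, lr=1.5e-3) come from the commit text; the linear warmup/decay shape is an assumption, and other commits in this history mention cosine decay instead.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_schedule(model: torch.nn.Module, total_steps: int,
                                lr: float = 1.5e-3, warmup_steps: int = 750):
    """AdamW + schedule using the BitNet-paper hyperparameters from this commit."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr,
                            betas=(0.9, 0.98),   # beta2 0.95 -> 0.98 for ternary noise
                            weight_decay=0.01)   # wd 0.05 -> 0.01
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                  # 750 fixed warmup steps
            return step / max(1, warmup_steps)
        remaining = total_steps - step
        return max(0.0, remaining / max(1, total_steps - warmup_steps))
    return opt, LambdaLR(opt, lr_lambda)

# Inside the step loop, the commit keeps a light safety clip:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```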

fix: NaN skip + grad sanitization — detect NaN loss, zero corrupted grads, skip optimizer step

When a rare batch produces NaN loss (step 380/500), the backward pass
contaminates all gradients with NaN. Without detection, optimizer.step()
pushes all weights to NaN → irrecoverable.

Fix: check loss for NaN/Inf before backward. If detected, zero grads
and skip the optimizer step. Training recovers on the next batch.
58f6f80
verified

Lgr54HFi commited on
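
A minimal sketch of the NaN-skip guard described in the commit above: check the loss before backprop, and if it is non-finite, clear gradients and skip the optimizer step so one bad batch cannot poison the weights. The surrounding loop structure and function name are assumptions.

```python
import torch

def guarded_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer,
                 loss: torch.Tensor) -> bool:
    """Backward + step only when the loss is finite; returns True if a step ran."""
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)   # drop any corrupted partial grads
        return False                            # skip this batch; recover on the next
    loss.backward()
    # Belt and braces: sanitize any individual grads that still came out non-finite.
    for p in model.parameters():
        if p.grad is not None:
            torch.nan_to_num_(p.grad, nan=0.0, posinf=0.0, neginf=0.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```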

fix: lower max_grad_norm 1.0→0.5 to prevent NaN with ternary STE training
a97a233
verified

Lgr54HFi commited on

fix: NaN at step 150 — add gradient clamping to STE detach trick + lower max_grad_norm to 0.5

The pure detach() STE passes gradients through unbounded, causing
gradient explosion around step 140-150 when loss is still high.

Fix: clamp the gradient contribution within the detach trick:
    w_q = clamp(w_scaled, -1, 1) + (round(clamped) - clamped).detach()
This ensures gradients are zero outside [-1, 1] (weights already at
quantization boundary get no gradient push) while keeping the STE
identity pass-through inside the valid range.

Also reduces max_grad_norm from 1.0 to 0.5 for additional stability.

Ref: 4-bit CPU training paper (2603.13931) uses tanh soft clipping
for the same reason.
ec200d2
verified

Lgr54HFi commited on
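
A minimal sketch of the clamped detach-trick STE from the commit above. Compared with the plain pattern in the older torch.compile commit further down, the pass-through term here is the clamped tensor, so weights already sitting outside [-1, 1] receive zero gradient. The scaling convention is an assumption.

```python
import torch

def ternary_ste_clamped(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize to {-1, 0, +1} with a straight-through gradient that is zero
    outside [-1, 1] (the stability fix in this commit)."""
    w_scaled = w / scale
    clamped = w_scaled.clamp(-1.0, 1.0)
    # Forward value: round(clamped). Backward: gradient of `clamped`, i.e. 1
    # inside [-1, 1] and 0 outside, so saturated weights get no further push.
    return clamped + (torch.round(clamped) - clamped).detach()
```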

fix: catch IPEX version mismatch crash (AttributeError from buggy os.exit in IPEX)
f1fa72a
verified

Lgr54HFi commited on

fix: torch.compile mode='default' (reduce-overhead crashes on CPU with glibc heap corruption)
bb2d3d5
verified

Lgr54HFi commited on

perf: eliminate .item() graph breaks in evolution.py — use tensor comparisons for torch.compile compat
fc678ef
verified

Lgr54HFi commited on
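
A small illustration of the .item() removal described in the commit above: a Python-level `if tensor.item() > threshold` forces a graph break (and a host sync) under torch.compile, whereas torch.where keeps the branch as a traced tensor op. The function and variable names are illustrative, not taken from evolution.py.

```python
import torch

def surprise_gate(loss: torch.Tensor, threshold: float) -> torch.Tensor:
    # Graph-breaking version (Python branch on a tensor value):
    #   if loss.item() > threshold: gate = 1.0
    #   else:                       gate = 0.0
    # Compile-friendly version: stay in tensor land.
    return torch.where(loss > threshold,
                       torch.ones_like(loss),
                       torch.zeros_like(loss))
```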

fix: re-enable torch.compile in train_hyper_loop (STE graph breaks fixed)
f6670ea
verified

Lgr54HFi commited on

perf: re-enable torch.compile now that STE uses detach() trick (zero graph breaks)
dd57d33
verified

Lgr54HFi commited on

perf: replace _RoundTernarySTE autograd.Function with detach() trick — zero graph breaks for torch.compile

The detach() identity pattern (w + (round(clamp(w)) - w).detach()) is
mathematically equivalent to the old STE but uses only standard aten ops
that torch.compile/Inductor can trace through. This eliminates 84+
graph breaks, enabling full kernel fusion of quantize+linear.

Pattern from official BitNet b1.58 implementation (1bitLLM/bitnet_b1_58-large).
Ref: arxiv 2402.17764
31b0fdf
verified

Lgr54HFi commited on
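
A minimal sketch of the detach() identity pattern named in the commit above, wrapped in a full per-tensor ternary quantizer. The absmean scale follows the BitNet b1.58 recipe (arxiv 2402.17764); whether the repo scales weights the same way is an assumption.

```python
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """BitNet-b1.58-style ternary weights with a compile-friendly STE.

    Only standard aten ops (no autograd.Function), so torch.compile/Inductor
    can trace through and fuse the quantization into the following linear.
    """
    scale = w.abs().mean().clamp(min=eps)        # absmean per-tensor scale
    w_scaled = w / scale
    # Identity-plus-detach: forward uses round(clamp(w_scaled)); backward flows
    # straight through w_scaled (the later "NaN at step 150" commit clamps this
    # pass-through as well).
    w_q = w_scaled + (torch.round(w_scaled.clamp(-1.0, 1.0)) - w_scaled).detach()
    return w_q * scale                           # rescale back for the matmul
```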

fix: train_hyper_loop grad_accum=1 (DataLoader already batches), better tok/s logging
31d69ba
verified

Lgr54HFi commited on

fix: turbo v2 — disable compile (84 graph breaks), fix grad_accum, add diagnostics
20ad65d
verified

Lgr54HFi commited on

Upload folder using huggingface_hub
11c11f8
verified

Lgr54HFi commited on

initial commit
f4dbb46
verified

Lgr54HFi commited on