TRUE-TERNARY-REFACTOR12

Date: 2026-05-19

Scope

Readiness pass after the ARBS platform restructure, focused on:

  • making the default scripts easier to run on cloud machines,
  • keeping core training pure ternary with no AdamW/master weights,
  • reducing default VRAM pressure,
  • verifying the MoE path after the top-2/low-iteration refactor,
  • fixing the CUDA TScale E update regression found during verification.

Changes

  • ARBModel() now defaults to core text/VQ/graph only:
    • image/audio sidecars are opt-in,
    • MemGram modules are opt-in,
    • ACT halt threshold defaults to 0.99,
    • MoE iteration default follows ACT_MAX_ITERS=4.
  • arbitor.train now keeps memory modules off by default. Use --enable-memory only when a run explicitly needs MemGram.
  • training/pretrain.py no longer exposes stale LR/AdamW knobs for pure core pretraining. It now builds only the sidecars for active weighted modalities.
  • training/text.py no longer carries a dead optimizer branch and defaults to a small local context.
  • arbitor.smoke now supports --backward to run forward, backward, and one pure ternary state update. This is the practical cache-warm command for Triton training kernels:
python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --backward

TScale E Update Fix

The targeted CUDA tests showed that calling update_E() directly on GPU returned without changing E or E_accum. The cause: the Triton forward/backward path retains _hook_grad_2d and _hook_x_2d rather than the dense _hook_grad_T_sign, so the direct hooks were never consumed.

Fix:

  • TernaryScaleTensor.update_E() now detects the direct CUDA hooks and calls _triton_update_e_direct.
  • Dense CUDA hooks call _triton_update_e.
  • CPU fallback now uses the same residual grouped E_accum rule as the Triton kernels instead of the stale EMA/log2 update.
  • This keeps persistent scale state as int8 E + int8 E_accum and avoids hidden FP optimizer state.
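
The routing described in the fix can be sketched as a small selection function. This is only an illustration under stated assumptions: the hook attribute names and the two Triton routine names come from the notes above, while `select_update_path`, the `"cpu_residual_fallback"` and `"no_update"` labels, and the rule that the CPU path fires whenever any hook is present are hypothetical and not taken from the ARBS code.

```python
from types import SimpleNamespace


def select_update_path(t):
    """Return the name of the update routine a tensor-like object would use.

    Hypothetical helper: it mirrors the three cases listed in the fix,
    not the real TernaryScaleTensor.update_E() body.
    """
    on_cuda = getattr(t, "is_cuda", False)
    # Direct Triton path keeps 2-D gradient and activation hooks.
    has_direct = (
        getattr(t, "_hook_grad_2d", None) is not None
        and getattr(t, "_hook_x_2d", None) is not None
    )
    # Dense path keeps a single sign-gradient hook.
    has_dense = getattr(t, "_hook_grad_T_sign", None) is not None

    if on_cuda and has_direct:
        return "_triton_update_e_direct"
    if on_cuda and has_dense:
        return "_triton_update_e"
    if has_direct or has_dense:
        # Assumed label for the CPU residual grouped E_accum rule.
        return "cpu_residual_fallback"
    # Assumed behaviour when no hook was recorded at all.
    return "no_update"


direct = SimpleNamespace(is_cuda=True, _hook_grad_2d=object(), _hook_x_2d=object())
dense = SimpleNamespace(is_cuda=True, _hook_grad_T_sign=object())
print(select_update_path(direct))  # _triton_update_e_direct
print(select_update_path(dense))   # _triton_update_e
```

The regression corresponds to the first case falling through to the last: direct hooks were present but nothing consumed them, so E and E_accum stayed unchanged.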

Verification

Passed:

  • python -m compileall -q arbitor training
  • python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"
    • 2 passed, 25 deselected
  • python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --no-vq --no-graph --backward
    • logical ternary weights: 1,246,223,808
    • training state: 1625.12 MB
    • forward: 0.348s
    • backward/update: 0.586s
    • CUDA peak: 1638.54 MB
  • python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --backward
    • logical ternary weights: 1,268,095,232
    • training state: 1653.68 MB
    • forward: 0.392s
    • backward/update: 0.851s
    • CUDA peak: 1734.90 MB
  • python -m arbitor.train --steps 1 --batch 1 --accum 1 --ctx 4 --eval-steps 1 --eval-interval 1 --save-interval 0 --no-save --run smoke-pure-nomoe --reset --disable-moe --disable-vq --disable-graph
    • logical ternary weights: 123,996,160
    • training state: 161.61 MB
    • one train/eval step passed.
  • python -m arbitor.train --steps 1 --batch 1 --accum 1 --ctx 4 --eval-steps 1 --eval-interval 1 --save-interval 0 --no-save --run smoke-pure --reset --max-moe-iters 1 --disable-vq --disable-graph
    • cold run after kernel changes: 119.99s, dominated by first-run Triton compilation.
    • cached rerun: 1.09s.
    • logical ternary weights: 1,246,223,808
    • training state: 1625.12 MB
    • one train/eval step passed.
  • python training/text.py --steps 1 --batch 1 --ctx 4 --eval-interval 1 --run smoke-text-legacy
    • one legacy text-script step passed with zero float params.
  • python training/pretrain.py --steps 0 --batch 1 --ctx 4 --no-save --text-weight 0 --code-weight 0 --image-weight 0 --audio-weight 0 --video-weight 0 --run smoke-pretrain-parse
    • parse/build path passed.
  • git diff --check -- arbitor training pyproject.toml docs/true-ternary

Operational Notes

  • The severe MoE slowdown seen on a fresh cloud box is reproducible and comes from first-run Triton compilation. With the kernel cache warm, the one-step MoE trainer fell from about 120s to about 1.1s.
  • For cloud use, run a small smoke warmup before the real job:
python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --backward
  • For low-VRAM local training, start with:
python -m arbitor.train --ctx 128 --batch 1 --accum 4 --max-moe-iters 1 --no-save
  • Enable sidecars and memory modules only when the run actually needs them:
    • --enable-image
    • --enable-audio
    • --enable-memory
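
The opt-in behaviour of the three flags above can be sketched with argparse. The flag names are the ones listed in these notes; the parser itself, its `prog` string, and the help text are assumptions for illustration and are not copied from arbitor.train.

```python
import argparse


def build_parser():
    """Hypothetical parser showing opt-in switches that default to off."""
    parser = argparse.ArgumentParser(prog="arbitor.train (sketch)")
    for name in ("image", "audio", "memory"):
        # store_true means the feature stays disabled unless the flag is passed.
        parser.add_argument(
            f"--enable-{name}",
            action="store_true",
            help=f"opt in to the {name} module for this run",
        )
    return parser


args = build_parser().parse_args(["--enable-memory"])
print(args.enable_image, args.enable_audio, args.enable_memory)  # False False True
```

Keeping each switch as a plain `store_true` flag means a bare command line always yields the low-VRAM core configuration.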

Remaining Work

  • The MoE path is fast after kernel cache warmup, but the first-run compile tax is still painful. A production launch script should prewarm the exact shapes used by the real job.
  • The sparse MoE still uses Python-level expert loops for the large-token path. The cached small-batch path is acceptable now, but a fused grouped-dispatch Triton kernel remains the next native-speed step.
  • LoRA finetuning intentionally still uses float adapter parameters and AdamW under training/finetuning; the base ternary model remains frozen in that path.
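
A prewarm step for a launch script could look like the sketch below. It only assembles the arbitor.smoke command already shown in these notes, once per shape the real job will use. The helper names `warmup_commands` and `prewarm`, the dry-run default, and the choice to sweep over (ctx, batch) pairs are assumptions; which shape parameters actually trigger a fresh Triton compile should be checked against the kernels.

```python
import itertools
import subprocess
import sys


def warmup_commands(ctx_sizes, batch_sizes, max_moe_iters=1, device="cuda"):
    """Build one smoke command per (ctx, batch) shape (hypothetical helper)."""
    cmds = []
    for ctx, batch in itertools.product(ctx_sizes, batch_sizes):
        cmds.append([
            sys.executable, "-m", "arbitor.smoke",
            "--device", device,
            "--ctx", str(ctx),
            "--batch", str(batch),
            "--max-moe-iters", str(max_moe_iters),
            "--backward",  # forward, backward, and one ternary state update
        ])
    return cmds


def prewarm(cmds, dry_run=True):
    """Print each command and, unless dry_run is set, run it and fail fast."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Shapes taken from the low-VRAM local example: ctx 128, batch 1.
    prewarm(warmup_commands(ctx_sizes=[128], batch_sizes=[1]))
```

Running with `dry_run=False` before the real job pays the compile cost once, so the first training step starts from a warm kernel cache.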

Training Folder Audit Addendum

Follow-up training audit:

  • training/pretrain.py now has a local byte stream via --text-data, so phase-1 text bootstrap can start from training/data/tinyshakespeare.txt or any local .txt/.pt before moving to HF FineWeb.
  • Text/code pretraining now uses trigram-aligned targets (x[:, 3:]) instead of the old next-token stream targets.
  • Image pretraining data now yields raw transformed images, not pre-encoded DINO features. The model-owned image sequencer is responsible for the frozen/int8 DINO pass.
  • LibriSpeech VQ target preparation now unpacks AudioVQEncoder as (_, indices) instead of treating the tuple as logits.
  • Video pretraining now reshapes/pads latent targets to the current VideoHead output before MSE.
  • Checkpoints now use conventional file names:
    • latest.pt
    • best.pt
    • final.pt
  • Resume tolerates architecture drift by loading with strict=False, which is required when moving from text-only checkpoints into image/audio/video-enabled runs.
  • training/audio.py, training/vision.py, and training/diffusion.py now follow the pure ternary base-training rule. They no longer expose Adam/learning-rate branches; LoRA + AdamW remains under training/finetuning/.
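
The difference between the old next-token targets and the trigram-aligned targets listed above can be illustrated on a single byte row. The notes only state the target slice x[:, 3:]; pairing it with inputs x[:, :-3] so the lengths match is an assumption, and `shifted_pairs` is a hypothetical helper that uses plain lists in place of tensors.

```python
def shifted_pairs(rows, shift):
    """Pair each row's prefix with the same row shifted left by `shift`."""
    inputs = [row[:-shift] for row in rows]   # assumed input slice
    targets = [row[shift:] for row in rows]   # plain-list analogue of x[:, shift:]
    return inputs, targets


batch = [list(b"ternary!")]  # one row of 8 byte ids

old_in, old_tgt = shifted_pairs(batch, shift=1)  # old next-token stream
new_in, new_tgt = shifted_pairs(batch, shift=3)  # trigram-aligned

print(bytes(old_tgt[0]))  # b'ernary!'
print(bytes(new_tgt[0]))  # b'nary!'
```

With a shift of 3, each target position sits one full trigram ahead of its input position, and three trailing positions per row no longer have a target.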

Additional verification:

  • python training/pretrain.py --steps 1 --batch 1 --ctx 4 --text-data training/data/tinyshakespeare.txt --no-save --save-interval 0 --eval-interval 1 --log-interval 1 --max-moe-iters 1 --run smoke-pretrain-local --text-weight 1 --code-weight 0 --image-weight 0 --audio-weight 0 --video-weight 0
    • passed cached CUDA path in about 1.02s train-loop time.
  • python training/audio.py --steps 1 --batch 1 --ctx 16000 --run smoke-audio-pure
    • passed with synthetic audio.
  • python training/vision.py --steps 1 --batch 1 --image-size 224 --run smoke-vision-pure
    • passed with synthetic image/text.
  • python training/diffusion.py --steps 1 --batch 1 --run smoke-diffusion-pure
    • passed with synthetic latent target.