TRUE-TERNARY-REFACTOR12
Date: 2026-05-19
Scope
Readiness pass after the ARBS platform restructure, focused on:
- making the default scripts easier to run on cloud machines,
- keeping core training pure ternary with no AdamW/master weights,
- reducing default VRAM pressure,
- verifying the MoE path after the top-2/low-iteration refactor,
- fixing the CUDA TScale E update regression found during verification.
Changes
- ARBModel() now defaults to core text/VQ/graph only:
  - image/audio sidecars are opt-in,
  - MemGram modules are opt-in,
  - ACT halt threshold defaults to 0.99,
  - MoE iteration default follows ACT_MAX_ITERS=4.
- arbitor.train now keeps memory modules off by default. Use --enable-memory only when a run explicitly needs MemGram.
- training/pretrain.py no longer exposes stale LR/AdamW knobs for pure core pretraining. It now builds only the sidecars for active weighted modalities.
- training/text.py no longer carries a dead optimizer branch and defaults to a small local context.
- arbitor.smoke now supports --backward to run forward, backward, and one pure ternary state update. This is the practical cache-warm command for Triton training kernels:
python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --backward
TScale E Update Fix
The targeted CUDA tests showed that direct GPU update_E() was returning without changing E or E_accum, because the Triton forward/backward path retains _hook_grad_2d and _hook_x_2d rather than dense _hook_grad_T_sign.
Fix:
- TernaryScaleTensor.update_E() now detects the direct CUDA hooks and calls _triton_update_e_direct.
- Dense CUDA hooks call _triton_update_e.
- CPU fallback now uses the same residual grouped E_accum rule as the Triton kernels instead of the stale EMA/log2 update.
- This keeps persistent scale state as int8 E + int8 E_accum and avoids hidden FP optimizer state.
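To make the "int8 E + int8 E_accum" idea concrete, here is a minimal pure-Python sketch of one possible residual accumulator update. It is not the code in TernaryScaleTensor or the Triton kernels: the function name, the per-group `votes` input, the `threshold` value, and the sign convention are all assumptions chosen for illustration. The only point it demonstrates is that a per-group integer accumulator can carry its residual forward and step an integer exponent without any float optimizer state.

```python
INT8_MIN, INT8_MAX = -128, 127


def clamp_int8(value):
    """Saturate an integer to the int8 range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def update_group_scale(E, E_accum, votes, threshold=8):
    """Hypothetical residual update for grouped exponent state.

    E       -- per-group int8 exponents
    E_accum -- per-group int8 accumulators
    votes   -- per-group signed integer votes (assumed to come from the
               retained backward hooks; the real rule may differ)
    """
    new_E, new_accum = [], []
    for e, acc, vote in zip(E, E_accum, votes):
        acc = clamp_int8(acc + vote)
        if acc >= threshold:
            # Step the exponent up and keep the residual in the accumulator.
            e, acc = clamp_int8(e + 1), acc - threshold
        elif acc <= -threshold:
            # Step the exponent down and keep the residual.
            e, acc = clamp_int8(e - 1), acc + threshold
        new_E.append(e)
        new_accum.append(acc)
    return new_E, new_accum


E, E_accum = update_group_scale([0, 0, 5], [6, -7, 0], [3, -2, 1])
print(E, E_accum)
```

Everything stays in integer arithmetic, so the persistent state per group is two small integers, whatever the exact voting rule is.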
Verification
Passed:
- python -m compileall -q arbitor training
- python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"
  - 2 passed, 25 deselected
- python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --no-vq --no-graph --backward
  - logical ternary weights: 1,246,223,808
  - training state: 1625.12 MB
  - forward: 0.348s
  - backward/update: 0.586s
  - CUDA peak: 1638.54 MB
- python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --backward
  - logical ternary weights: 1,268,095,232
  - training state: 1653.68 MB
  - forward: 0.392s
  - backward/update: 0.851s
  - CUDA peak: 1734.90 MB
- python -m arbitor.train --steps 1 --batch 1 --accum 1 --ctx 4 --eval-steps 1 --eval-interval 1 --save-interval 0 --no-save --run smoke-pure-nomoe --reset --disable-moe --disable-vq --disable-graph
  - logical ternary weights: 123,996,160
  - training state: 161.61 MB
  - one train/eval step passed.
- python -m arbitor.train --steps 1 --batch 1 --accum 1 --ctx 4 --eval-steps 1 --eval-interval 1 --save-interval 0 --no-save --run smoke-pure --reset --max-moe-iters 1 --disable-vq --disable-graph
  - cold run after kernel changes: 119.99s, dominated by first-run Triton compilation.
  - cached rerun: 1.09s.
  - logical ternary weights: 1,246,223,808
  - training state: 1625.12 MB
  - one train/eval step passed.
- python training/text.py --steps 1 --batch 1 --ctx 4 --eval-interval 1 --run smoke-text-legacy
  - one legacy text-script step passed with zero float params.
- python training/pretrain.py --steps 0 --batch 1 --ctx 4 --no-save --text-weight 0 --code-weight 0 --image-weight 0 --audio-weight 0 --video-weight 0 --run smoke-pretrain-parse
  - parse/build path passed.
- git diff --check -- arbitor training pyproject.toml docs/true-ternary
Operational Notes
- The severe MoE slowdown seen on a fresh cloud box is reproducible as first-run Triton compilation. The cached MoE one-step trainer fell from about 120s to about 1.1s.
- For cloud use, run a small smoke warmup before the real job:
python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --backward
- For low-VRAM local training, start with:
python -m arbitor.train --ctx 128 --batch 1 --accum 4 --max-moe-iters 1 --no-save
- Enable sidecars and memory modules only when the run actually needs them:
  - --enable-image
  - --enable-audio
  - --enable-memory
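The warmup step above can be wrapped in a small launcher helper so that the smoke run uses the same shapes as the real job. The sketch below is illustrative only: `build_warmup_cmd` is a hypothetical helper, not part of the repository, and it uses only the arbitor.smoke flags already shown in these notes. Whether a smoke run at a given shape warms every kernel that the trainer needs is an assumption to confirm on the target machine.

```python
import shlex


def build_warmup_cmd(ctx, batch, max_moe_iters, device="cuda"):
    """Build the arbitor.smoke warmup command for a given job shape.

    Hypothetical helper: it only assembles flags that appear in these notes.
    """
    return [
        "python", "-m", "arbitor.smoke",
        "--device", device,
        "--ctx", str(ctx),
        "--batch", str(batch),
        "--max-moe-iters", str(max_moe_iters),
        "--backward",
    ]


if __name__ == "__main__":
    # Print the command for the low-VRAM shape; a launcher could instead
    # pass the list to subprocess.run(cmd, check=True) before training.
    print(shlex.join(build_warmup_cmd(ctx=128, batch=1, max_moe_iters=1)))
```

Building the command as a list keeps it safe to pass to subprocess without shell quoting, and lets the launcher reuse the same ctx/batch values for both the warmup and the real run.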
Remaining Work
- The MoE path is fast after kernel cache warmup, but the first-run compile tax is still painful. A production launch script should prewarm the exact shapes used by the real job.
- The sparse MoE still uses Python-level expert loops for the large-token path. The cached small-batch path is acceptable now, but a fused grouped-dispatch Triton kernel remains the next native-speed step.
- LoRA finetuning intentionally still uses float adapter parameters and AdamW under training/finetuning; the base ternary model remains frozen in that path.
Training Folder Audit Addendum
Follow-up training audit:
- training/pretrain.py now has a local byte stream via --text-data, so phase-1 text bootstrap can start from training/data/tinyshakespeare.txt or any local .txt/.pt before moving to HF FineWeb.
- Text/code pretraining now uses trigram-aligned targets (x[:, 3:]) instead of the old next-token stream targets.
- Image pretraining data now yields raw transformed images, not pre-encoded DINO features. The model-owned image sequencer is responsible for the frozen/int8 DINO pass.
- LibriSpeech VQ target preparation now unpacks the AudioVQEncoder output as (_, indices) instead of treating the tuple as logits.
- Video pretraining now reshapes/pads latent targets to the current VideoHead output before MSE.
- Checkpoints are now conventional files:
  - latest.pt
  - best.pt
  - final.pt
- Resume supports architecture drift with strict=False, which is required when moving from text-only checkpoints into image/audio/video-enabled runs.
- training/audio.py, training/vision.py, and training/diffusion.py now follow the pure ternary base-training rule. They no longer expose Adam/learning-rate branches; LoRA + AdamW remains under training/finetuning/.
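As an illustration of the trigram-aligned target change, the sketch below builds input/target pairs from a byte batch using plain Python lists. The targets follow the x[:, 3:] slice named above. Trimming the inputs to x[:, :-3] so both sides have equal length is an assumption made for this example, and `trigram_pairs` is a hypothetical name; the actual pretraining script may align inputs differently.

```python
def trigram_pairs(x, shift=3):
    """Split a batch of byte-id sequences into (inputs, targets).

    targets mirrors the x[:, 3:] slice from the notes; trimming inputs to
    x[:, :-3] is an assumption made here so both have the same length.
    """
    inputs = [row[:-shift] for row in x]
    targets = [row[shift:] for row in x]
    return inputs, targets


batch = [list(b"ternary!")]
inputs, targets = trigram_pairs(batch)
print(bytes(inputs[0]), bytes(targets[0]))  # b'terna' b'nary!'
```

With a shift of 3, position i in the input is paired with the byte three steps ahead, rather than with the next byte as in the old next-token stream targets.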
Additional verification:
- python training/pretrain.py --steps 1 --batch 1 --ctx 4 --text-data training/data/tinyshakespeare.txt --no-save --save-interval 0 --eval-interval 1 --log-interval 1 --max-moe-iters 1 --run smoke-pretrain-local --text-weight 1 --code-weight 0 --image-weight 0 --audio-weight 0 --video-weight 0
  - passed cached CUDA path in about 1.02s train-loop time.
- python training/audio.py --steps 1 --batch 1 --ctx 16000 --run smoke-audio-pure
  - passed with synthetic audio.
- python training/vision.py --steps 1 --batch 1 --image-size 224 --run smoke-vision-pure
  - passed with synthetic image/text.
- python training/diffusion.py --steps 1 --batch 1 --run smoke-diffusion-pure
  - passed with synthetic latent target.