| # TRUE-TERNARY-REFACTOR12 |
|
|
| Date: 2026-05-19 |
|
|
| ## Scope |
|
|
| Readiness pass after the ARBS platform restructure, focused on: |
|
|
| - making the default scripts easier to run on cloud machines, |
| - keeping core training pure ternary with no AdamW/master weights, |
| - reducing default VRAM pressure, |
| - verifying the MoE path after the top-2/low-iteration refactor, |
| - fixing the CUDA TScale `E` update regression found during verification. |
|
|
| ## Changes |
|
|
| - `ARBModel()` now defaults to core text/VQ/graph only: |
|   - image/audio sidecars are opt-in, |
|   - MemGram modules are opt-in, |
|   - ACT halt threshold defaults to `0.99`, |
|   - MoE iteration default follows `ACT_MAX_ITERS=4`. |
| - `arbitor.train` now keeps memory modules off by default. Use `--enable-memory` only when a run explicitly needs MemGram. |
| - `training/pretrain.py` no longer exposes stale LR/AdamW knobs for pure core pretraining. It now builds only the sidecars for active weighted modalities. |
| - `training/text.py` no longer carries a dead optimizer branch and defaults to a small local context. |
| - `arbitor.smoke` now supports `--backward` to run forward, backward, and one pure ternary state update. This is the practical cache-warm command for Triton training kernels: |
|
|
| ```bash |
| python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --backward |
| ``` |
|
|
| ## TScale E Update Fix |
|
|
| The targeted CUDA tests showed that direct GPU `update_E()` returned without changing `E` or `E_accum`. The cause is that the Triton forward/backward path retains `_hook_grad_2d` and `_hook_x_2d` rather than the dense `_hook_grad_T_sign` that the update expected. |
|
|
| Fix: |
|
|
| - `TernaryScaleTensor.update_E()` now detects the direct CUDA hooks and calls `_triton_update_e_direct`. |
| - Dense CUDA hooks call `_triton_update_e`. |
| - CPU fallback now uses the same residual grouped `E_accum` rule as the Triton kernels instead of the stale EMA/log2 update. |
| - This keeps persistent scale state as `int8 E + int8 E_accum` and avoids hidden FP optimizer state. |
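
The dispatch described in this fix can be sketched roughly as follows. This is an illustrative stand-in, not the actual `TernaryScaleTensor` code: the class name `ScaleStateSketch`, the `is_cuda` flag, the `_cpu_update_e` helper, and the string return values are assumptions made for the example. Only the hook attribute names and the two `_triton_update_e*` method names come from the notes above.

```python
class ScaleStateSketch:
    """Hypothetical stand-in for TernaryScaleTensor; only the dispatch is modelled."""

    def __init__(self, is_cuda):
        self.is_cuda = is_cuda          # assumption: a simple device flag
        # Hooks retained by the direct Triton forward/backward path.
        self._hook_grad_2d = None
        self._hook_x_2d = None
        # Hook retained by the dense path.
        self._hook_grad_T_sign = None

    def _triton_update_e_direct(self):
        return "triton_direct"          # placeholder for the direct kernel launch

    def _triton_update_e(self):
        return "triton_dense"           # placeholder for the dense kernel launch

    def _cpu_update_e(self):            # hypothetical helper name
        return "cpu_fallback"           # placeholder for the grouped E_accum rule

    def update_E(self):
        has_direct = self._hook_grad_2d is not None and self._hook_x_2d is not None
        has_dense = self._hook_grad_T_sign is not None
        if self.is_cuda and has_direct:
            return self._triton_update_e_direct()
        if self.is_cuda and has_dense:
            return self._triton_update_e()
        if has_dense:
            # Assumption: the CPU path keys off the dense hook.
            return self._cpu_update_e()
        return None                     # no hook captured, nothing to update
```

The point of the sketch is the ordering: the direct hooks are checked first, so a CUDA tensor that only holds `_hook_grad_2d` and `_hook_x_2d` no longer falls through to a silent no-op.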
|
|
| ## Verification |
|
|
| Passed: |
|
|
| - `python -m compileall -q arbitor training` |
| - `python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"` |
|   - `2 passed, 25 deselected` |
| - `python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --no-vq --no-graph --backward` |
|   - logical ternary weights: `1,246,223,808` |
|   - training state: `1625.12 MB` |
|   - forward: `0.348s` |
|   - backward/update: `0.586s` |
|   - CUDA peak: `1638.54 MB` |
| - `python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --backward` |
|   - logical ternary weights: `1,268,095,232` |
|   - training state: `1653.68 MB` |
|   - forward: `0.392s` |
|   - backward/update: `0.851s` |
|   - CUDA peak: `1734.90 MB` |
| - `python -m arbitor.train --steps 1 --batch 1 --accum 1 --ctx 4 --eval-steps 1 --eval-interval 1 --save-interval 0 --no-save --run smoke-pure-nomoe --reset --disable-moe --disable-vq --disable-graph` |
|   - logical ternary weights: `123,996,160` |
|   - training state: `161.61 MB` |
|   - one train/eval step passed. |
| - `python -m arbitor.train --steps 1 --batch 1 --accum 1 --ctx 4 --eval-steps 1 --eval-interval 1 --save-interval 0 --no-save --run smoke-pure --reset --max-moe-iters 1 --disable-vq --disable-graph` |
|   - cold run after kernel changes: `119.99s`, dominated by first-run Triton compilation. |
|   - cached rerun: `1.09s`. |
|   - logical ternary weights: `1,246,223,808` |
|   - training state: `1625.12 MB` |
|   - one train/eval step passed. |
| - `python training/text.py --steps 1 --batch 1 --ctx 4 --eval-interval 1 --run smoke-text-legacy` |
|   - one legacy text-script step passed with zero float params. |
| - `python training/pretrain.py --steps 0 --batch 1 --ctx 4 --no-save --text-weight 0 --code-weight 0 --image-weight 0 --audio-weight 0 --video-weight 0 --run smoke-pretrain-parse` |
|   - parse/build path passed. |
| - `git diff --check -- arbitor training pyproject.toml docs/true-ternary` |
|
|
| ## Operational Notes |
|
|
| - The severe MoE slowdown seen on a fresh cloud box is reproducible, and it comes from first-run Triton compilation. The one-step MoE trainer took about `120s` cold and about `1.1s` once the kernels were cached. |
| - For cloud use, run a small smoke warmup before the real job: |
|
|
| ```bash |
| python -m arbitor.smoke --device cuda --ctx 4 --batch 1 --max-moe-iters 1 --backward |
| ``` |
|
|
| - For low-VRAM local training, start with: |
|
|
| ```bash |
| python -m arbitor.train --ctx 128 --batch 1 --accum 4 --max-moe-iters 1 --no-save |
| ``` |
|
|
| - Enable sidecars and memory modules only when the run actually needs them: |
|   - `--enable-image` |
|   - `--enable-audio` |
|   - `--enable-memory` |
|
|
| ## Remaining Work |
|
|
| - The MoE path is fast after kernel cache warmup, but the first-run compile tax is still painful. A production launch script should prewarm the exact shapes used by the real job. |
| - The sparse MoE still uses Python-level expert loops for the large-token path. The cached small-batch path is acceptable now, but a fused grouped-dispatch Triton kernel remains the next native-speed step. |
| - LoRA finetuning intentionally still uses float adapter parameters and AdamW under `training/finetuning`; the base ternary model remains frozen in that path. |
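
As a starting point for the prewarm step mentioned in the first item above, here is a minimal sketch that replays the smoke command for each shape a job will use. The function names `smoke_warmup_cmd` and `prewarm` are hypothetical, and the sketch assumes that running `arbitor.smoke` with matching `--ctx`, `--batch`, and `--max-moe-iters` values is enough to populate the kernel cache for the real job; that has not been checked here. Only the flags already shown in this document are used.

```python
import subprocess
import sys


def smoke_warmup_cmd(ctx, batch, max_moe_iters, device="cuda"):
    """Build the argv for one smoke warmup run (hypothetical helper)."""
    return [
        sys.executable, "-m", "arbitor.smoke",
        "--device", device,
        "--ctx", str(ctx),
        "--batch", str(batch),
        "--max-moe-iters", str(max_moe_iters),
        "--backward",  # forward, backward, and one ternary state update
    ]


def prewarm(shapes, run=subprocess.run):
    """Run one warmup per (ctx, batch, max_moe_iters) tuple, stopping on failure."""
    for ctx, batch, max_moe_iters in shapes:
        run(smoke_warmup_cmd(ctx, batch, max_moe_iters), check=True)
```

For example, `prewarm([(128, 1, 1)])` would warm the shape used by the low-VRAM command in the operational notes. The `run` parameter exists only so the loop can be exercised without a GPU.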
|
|
| ## Training Folder Audit Addendum |
|
|
| Follow-up training audit: |
|
|
| - `training/pretrain.py` now has a local byte stream via `--text-data`, so phase-1 text bootstrap can start from `training/data/tinyshakespeare.txt` or any local `.txt`/`.pt` before moving to HF FineWeb. |
| - Text/code pretraining now uses trigram-aligned targets (`x[:, 3:]`) instead of the old next-token stream targets. |
| - Image pretraining data now yields raw transformed images, not pre-encoded DINO features. The model-owned image sequencer is responsible for the frozen/int8 DINO pass. |
| - LibriSpeech VQ target preparation now unpacks `AudioVQEncoder` as `(_, indices)` instead of treating the tuple as logits. |
| - Video pretraining now reshapes/pads latent targets to the current `VideoHead` output before MSE. |
| - Checkpoints are now conventional files: |
|   - `latest.pt` |
|   - `best.pt` |
|   - `final.pt` |
| - Resume tolerates architecture drift by loading checkpoints with `strict=False`, which is required when moving from text-only checkpoints into image/audio/video-enabled runs. |
| - `training/audio.py`, `training/vision.py`, and `training/diffusion.py` now follow the pure ternary base-training rule. They no longer expose Adam/learning-rate branches; LoRA + AdamW remains under `training/finetuning/`. |
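
To make the trigram-aligned target change concrete, here is a small sketch of the slicing on a toy byte batch. It uses nested lists instead of tensors so it runs without dependencies. The helper names are hypothetical, and the assumption that the old next-token targets were `x[:, 1:]` is the conventional reading, not something taken from `training/pretrain.py`.

```python
def next_token_targets(batch):
    # Assumed old behaviour: shift by one position, i.e. x[:, 1:].
    return [row[1:] for row in batch]


def trigram_aligned_targets(batch):
    # New behaviour from the audit notes: shift by three positions, i.e. x[:, 3:].
    return [row[3:] for row in batch]


# One sequence of 8 raw bytes, shaped like a (batch=1, ctx=8) byte stream.
batch = [list(b"ternary!")]

old = next_token_targets(batch)
new = trigram_aligned_targets(batch)

print(bytes(old[0]))  # b'ernary!'
print(bytes(new[0]))  # b'nary!'
```

The target sequence is three positions shorter than the input, so whatever produces the logits has to be cut or aligned to the same length before the loss is computed.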
|
|
| Additional verification: |
|
|
| - `python training/pretrain.py --steps 1 --batch 1 --ctx 4 --text-data training/data/tinyshakespeare.txt --no-save --save-interval 0 --eval-interval 1 --log-interval 1 --max-moe-iters 1 --run smoke-pretrain-local --text-weight 1 --code-weight 0 --image-weight 0 --audio-weight 0 --video-weight 0` |
|   - passed on the cached CUDA path with about `1.02s` of train-loop time. |
| - `python training/audio.py --steps 1 --batch 1 --ctx 16000 --run smoke-audio-pure` |
|   - passed with synthetic audio. |
| - `python training/vision.py --steps 1 --batch 1 --image-size 224 --run smoke-vision-pure` |
|   - passed with synthetic image/text. |
| - `python training/diffusion.py --steps 1 --batch 1 --run smoke-diffusion-pure` |
|   - passed with a synthetic latent target. |
|
|