# TRUE TERNARY REFACTOR 18: BigInt Correlation Scaling

## Goal

Port the successful `testing/test_bigint_ternary.py` scaling rule into the main ARB ternary training path while staying under the 8 GB VRAM training target.

The core scaling rule, now the default for dense ternary modules, is:

```text
mean_corr = corr_accum / (step_counter * group_size)
S = 2 ** (E + ARB_BIGINT_CORR_STRENGTH * mean_corr)
```

`ARB_BIGINT_CORR_STRENGTH` defaults to `4.0`, matching the successful BigInt test. `E` stays fixed after initialization for dense `TernaryScaleTensor` modules; the training signal is carried by the integer `corr_accum` instead.
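
The scaling rule can be sketched in plain Python for a single scale group. This is a minimal illustration, not the code in `_get_S()`: the function name `bigint_scale` and the fallback to `2 ** E` when `step_counter` is zero are assumptions made for the example.

```python
def bigint_scale(E, corr_accum, step_counter, group_size, strength=4.0):
    """Return S = 2 ** (E + strength * mean_corr) for one scale group.

    E, corr_accum and step_counter are plain Python ints here, so the
    accumulator cannot overflow. `strength` stands in for
    ARB_BIGINT_CORR_STRENGTH.
    """
    if step_counter == 0:
        # Assumption for this sketch: with no updates yet, use the bare 2**E.
        return 2.0 ** E
    mean_corr = corr_accum / (step_counter * group_size)
    return 2.0 ** (E + strength * mean_corr)


# Neutral accumulator: the scale is just 2**E.
neutral = bigint_scale(E=-3, corr_accum=0, step_counter=10, group_size=64)

# mean_corr = 320 / (10 * 64) = 0.5, so the exponent shifts by 4.0 * 0.5 = 2.
shifted = bigint_scale(E=-3, corr_accum=320, step_counter=10, group_size=64)
```

If each step contributes at most `group_size` in magnitude to `corr_accum`, as the sign-times-`T` form of the update suggests, then `mean_corr` stays in `[-1, 1]` and `S` stays within a factor of `2 ** 4` of `2 ** E` at the default strength.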

## Main Changes

- `arbitor/kernel/ternary_scale.py`
  - Replaced dense `T_accum` and `E_accum` training state with `corr_accum int64` and `step_counter int64`.
  - Added BigInt scale expansion in `_get_S()` and the Triton linear forward/grad-x kernels.
  - Added a Triton direct correlation accumulation kernel for `sign(grad_y.T @ x) * T` grouped by scale group.
  - Added a custom autograd function that recomputes effective weights in backward instead of saving `w_eff` in the graph.
  - Disabled the TileLang dense path for this mode until TileLang supports `corr_accum` in both forward and backward. This prevents the previous fp16 TileLang path from silently breaking ternary training.

- `arbitor/main.py`
  - Updated `prepare_ternary_backward()` and the post-update cleanup to recognize streamed BigInt updates.
  - Preserved `LossComponent` routing by sending component-specific dense signs into `corr_accum`.
  - Skipped old per-group threshold float temporaries for BigInt dense layers. Those thresholds were only used by legacy `T_accum` flips and were a large avoidable allocation at 3B scale.

- `arbitor/kernel/ternary_audit.py`
  - Audit now counts `corr_accum` and `step_counter` so training-state memory is no longer underreported.

- `arbitor/attention/context_attention.py`, `arbitor/components.py`, `arbitor/decoders.py`
  - Converted the context attention gate from `nn.Linear` to `TernaryScaleTensor`.
  - Froze LTI injection float constants so pure ternary trainers have zero trainable float parameters.

- `testing/test_tscale.py`, `testing/test_polarity_validation.py`
  - Updated expectations from old `E_accum/T_accum` behavior to fixed `E`, integer `corr_accum`, and BigInt step counters.
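
The correlation accumulation listed under `arbitor/kernel/ternary_scale.py`, `sign(grad_y.T @ x) * T` grouped by scale group, can be illustrated with a pure-Python reference. This is only a sketch of the arithmetic, not the Triton kernel: the name `accumulate_corr`, the nested-list layout, and the assumption that scale groups are contiguous runs of `group_size` input columns within each output row are all hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 as an int."""
    return (v > 0) - (v < 0)


def accumulate_corr(grad_y, x, T, corr_accum, group_size):
    """Add sign(grad_y.T @ x) * T, summed per scale group, into corr_accum.

    grad_y: [batch][out_features] floats
    x:      [batch][in_features] floats
    T:      [out_features][in_features] ternary ints in {-1, 0, 1}
    corr_accum: [out_features][in_features // group_size] ints, updated in place
    """
    batch = len(x)
    out_features, in_features = len(T), len(T[0])
    for o in range(out_features):
        for i in range(in_features):
            # One entry of the weight-gradient matrix grad_y.T @ x.
            g = sum(grad_y[b][o] * x[b][i] for b in range(batch))
            # Assumed grouping: contiguous input columns share a scale group.
            corr_accum[o][i // group_size] += sign(g) * T[o][i]
    return corr_accum


grad_y = [[1.0, -2.0]]
x = [[0.5, -1.0, 2.0, 0.0]]
T = [[1, 1, -1, 0], [-1, 0, 1, 1]]
corr = accumulate_corr(grad_y, x, T, [[0, 0], [0, 0]], group_size=2)
```

Because every term is an integer in `{-1, 0, 1}`, the accumulator stays exact no matter how many steps are summed, which is the property the `int64` state relies on.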

## Validation

Commands run:

```bash
python -m compileall -q arbitor training testing
python -m pytest -q testing/test_polarity_validation.py testing/test_tscale.py -k "not model_integration and not runtime_switch"
ARB_TERNARY_BACKEND=triton python training/text.py --steps 1 --batch 1 --ctx 4 --eval-interval 999 --run bigint-smoke
ARB_TERNARY_BACKEND=triton python training/pretrain.py --steps 1 --batch 1 --ctx 4 --text-data training/data/tinyshakespeare.txt --text-weight 1.0 --code-weight 0 --image-weight 0 --audio-weight 0 --video-weight 0 --eval-interval 0 --log-interval 0 --save-interval 0 --no-save --run bigint-pretrain-smoke
```

Focused tests passed:

```text
42 passed, 2 deselected
```

3-step full text-model CUDA memory probe:

```text
logical_ternary_weights 3122933472
training_state_mb 1684.27
after_cuda_mb 1706
step 0 alloc_mb 1754 reserved_mb 1838 peak_mb 1998
step 1 alloc_mb 1754 reserved_mb 1838 peak_mb 2038
step 2 alloc_mb 1754 reserved_mb 1842 peak_mb 2036
```

This is the key result: `alloc_mb` held at 1754 on every step, so allocated VRAM did not accumulate across steps after cleanup.
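
The flat-allocation claim can be checked mechanically against the probe output. The snippet below is a small sketch that assumes the probe prints one line per step in exactly the format shown; `parse_probe` and `alloc_growth_mb` are hypothetical helper names, not part of the repository.

```python
import re

PROBE_LOG = """\
step 0 alloc_mb 1754 reserved_mb 1838 peak_mb 1998
step 1 alloc_mb 1754 reserved_mb 1838 peak_mb 2038
step 2 alloc_mb 1754 reserved_mb 1842 peak_mb 2036
"""

LINE = re.compile(r"step (\d+) alloc_mb (\d+) reserved_mb (\d+) peak_mb (\d+)")


def parse_probe(log):
    """Turn probe lines into a list of dicts with integer MB values."""
    rows = []
    for m in LINE.finditer(log):
        step, alloc, reserved, peak = map(int, m.groups())
        rows.append({"step": step, "alloc_mb": alloc,
                     "reserved_mb": reserved, "peak_mb": peak})
    return rows


def alloc_growth_mb(rows):
    """Difference in allocated MB between the last and first step."""
    return rows[-1]["alloc_mb"] - rows[0]["alloc_mb"]


rows = parse_probe(PROBE_LOG)
growth = alloc_growth_mb(rows)
worst_peak = max(r["peak_mb"] for r in rows)
```

A growth of zero means post-step cleanup returned everything it allocated; the highest per-step peak is the number to compare against the VRAM budget.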

## Current State

- Dense ternary modules now train through BigInt correlation scaling rather than discrete ternary flips.
- Persistent dense training state is integer only: `T_packed uint8`, `E int8`, `corr_accum int64`, `step_counter int64`.
- The 3.12B logical ternary text path reports 0 trainable float params and completes a 1-step pretrain smoke run.
- Remaining float params are frozen LTI constants, not optimizer state.

## Follow-Up

- Port sparse embedding/VQ tables from legacy `T_accum/E_accum` to BigInt correlation if the `T_accum=321 MB` training-state block becomes the next bottleneck.
- Add TileLang BigInt support only after its kernels accept `corr_accum`, `step_counter`, and integer correlation accumulation. The fp16-only TileLang path should stay disabled for BigInt training.