# ARBS Ternary Training System (TTS)

## E1TM Format: Exponent-1 Ternary Mantissa

E1TM encodes each weight group as **one int8 exponent shared across N ternary mantissas**.

```
W_eff[i] = S × T[i]                  where T[i] ∈ {-1, 0, +1}

E = int8 log₂ scale                  (persistent, one per group)
Δ = 4 × corr_accum / (step × gs)     (from the BigInt accumulator)
S = 2^{E + Δ}                        (float32, ephemeral: created per forward, discarded)
```

### Format variants

| Name | TScaleType | T per E | gs | E bpw | T bpw | Total bpw (inference) | Notes |
|---|---|---|---|---|---|---|---|
| E1TM4 | T4 | 4 | 4 | 2.000 | 1.58 | 3.58 | Highest precision |
| E1TM6 | T6 | 6 | 6 | 1.333 | 1.58 | 2.91 | |
| E1TM8 | T8 | 8 | 8 | 1.000 | 1.58 | 2.58 | |
| E1TM16 | T16 | 16 | 16 | 0.500 | 1.58 | 2.08 | |
| **E1TM32** | **T32** | **32** | **32** | **0.250** | **1.58** | **1.83** | **Default** |
| E1TM64 | T64 | 64 | 64 | 0.125 | 1.58 | 1.71 | |
| E1TM96 | T96 | 96 | 96 | 0.083 | 1.58 | 1.67 | Most packed |

A higher T number means more ternary mantissas per exponent: less storage, but a coarser per-weight magnitude.

### Group sizes

The TScaleType name is the group size:

```
TScaleType.T4   → gs = 4   → E shared across 4 ternary mantissas
TScaleType.T32  → gs = 32  → E shared across 32 ternary mantissas
TScaleType.T96  → gs = 96  → E shared across 96 ternary mantissas
```

### Persistent training state (all integer)

| Buffer | Type | Size/weight | Role |
|---|---|---|---|
| T_packed | uint8 | 1.58 bpw | Base-3 packed ternary {-1, 0, +1}, 5 trits/byte |
| E | int8 | 8/N bpw | Log₂ scale, one per N-weight group |
| corr_accum | int64 | 64/N bpw | BigInt accumulator for gradient sign votes |
| step_counter | int64 | 0 bpw | Total steps processed |

**No float32/16 anywhere in persistent state.** The ephemeral float32 `W_eff` is created for each forward pass and discarded after the backward pass.

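To make the "5 trits/byte" packing and the bpw arithmetic concrete, here is a minimal sketch in plain Python. The function names (`pack_trits`, `unpack_trits`, `inference_bpw`) and the little-endian digit order are assumptions made for illustration; the actual ARBS TTS layout may differ. Note that five trits per byte stores 8/5 = 1.6 bits per trit, slightly above the log₂3 ≈ 1.585 figure quoted in the tables.

```python
import math

TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five trits fit in one uint8


def pack_trits(trits):
    """Pack values from {-1, 0, +1} into bytes, five base-3 digits per byte.

    Digit order (first trit = least significant digit) is an assumption.
    """
    out = bytearray()
    for start in range(0, len(trits), TRITS_PER_BYTE):
        chunk = trits[start:start + TRITS_PER_BYTE]
        byte = 0
        for position, t in enumerate(chunk):
            if t not in (-1, 0, 1):
                raise ValueError(f"not a trit: {t!r}")
            byte += (t + 1) * 3 ** position  # map {-1, 0, +1} -> {0, 1, 2}
        out.append(byte)
    return bytes(out)


def unpack_trits(packed, count):
    """Inverse of pack_trits; `count` trims the padding of the last byte."""
    trits = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            byte, digit = divmod(byte, 3)
            trits.append(digit - 1)  # map {0, 1, 2} -> {-1, 0, +1}
    return trits[:count]


def inference_bpw(gs, ideal=True):
    """Bits per weight for one int8 exponent shared across `gs` trits.

    ideal=True uses the log2(3) bound; ideal=False uses the 8/5 bits per
    trit that 5-trits-per-byte packing actually stores.
    """
    trit_bits = math.log2(3) if ideal else 8 / TRITS_PER_BYTE
    return trit_bits + 8 / gs
```

For example, `unpack_trits(pack_trits(ts), len(ts))` round-trips any list of trits `ts`, and `inference_bpw(gs)` reproduces the "Total bpw" column for a given group size up to rounding.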
### Why ternary over binary or int4

| Format | Values/weight | Packing efficiency | Null state |
|---|---|---|---|
| Binary | 2 | 1 bpw (100%) | No |
| Ternary | 3 | 1.58 bpw (log₂3; ≈95% of byte codes used at 5 trits/byte) | **Yes** (T = 0 is null) |
| Int4 | 16 | 4 bpw (100%) | No |

Ternary's null state (T = 0) provides structural sparsity: ≈38% of weights are zero, which lets all-zero matmul tiles be skipped. Among the formats compared above, none offers this property at an equivalent bpw.

### The BigInt difference

Unlike conventional quantization, where E is static after conversion, ARBS TTS trains **through** E via a BigInt correlation accumulator:

```
corr_accum[g] -= Σ (grad_sign × T)     # int64, never clips or resets
Δ = 4 × corr_accum / (step × gs)       # continuous adjustment from a ratio of integers
S = 2^{E + Δ}                          # effective scale (ephemeral float32)
```

The division `corr_accum / (step × gs)` is the **Big Number Calculator** operation: it converts the accumulated integer evidence into a continuous ratio. The accumulator itself stays exact; rounding happens only when the ratio is turned into the ephemeral float32 S. There are no threshold flips and no discrete steps, and the persistent state loses no information between steps.

### Training vs inference

| Phase | T_packed | E | corr_accum | step | S |
|---|---|---|---|---|---|
| Training | Read-only | Read-only | **Accumulates** | **Increments** | Computed from corr_accum / step |
| Inference (Option A) | Frozen | Frozen | Frozen | Frozen | Burned into checkpoint |
| Inference (Option B) | Frozen | **Fused** | Discarded | Discarded | Static 2^{E_fused} |

**Option A** (export): keep corr_accum and step so that S stays continuous.

**Option B** (fuse): `E_fused = round(E + 4 × corr_accum / (step × gs))`. This discards corr_accum and step, so storage drops to the inference bpw listed under Format variants (about 2.6 bpw for E1TM8).

### Relationship to IEEE float

```
IEEE FP32:  1 sign + 8 exponent + 23 mantissa      → per value
E1TM32:     1 exponent (int8) + 32 ternary signs   → per group of 32
```

In IEEE floating point, the exponent and mantissa belong to the same value. In E1TM, the exponent is **shared** across the group, and the mantissa is replaced by N independent ternary signs.
The corr_accum provides sub-exponent precision beyond the int8 E, making the effective scale continuous rather than constrained to the 256 discrete `2^E` values.
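
The scale arithmetic described in the sections above (accumulate sign votes, form Δ, compute S, optionally fuse into E for Option B) can be sketched in plain Python for a single group. This is an illustration under stated assumptions, not the ARBS TTS implementation: the function names are hypothetical, the `step <= 0` guard and the int8 clamp are assumptions, and Python's built-in `round` rounds ties to the nearest even integer, which may not match the rounding the real fuse step uses. Python integers are unbounded, whereas the persistent buffer is described as int64.

```python
def accumulate(corr_accum, grad_signs, trits):
    """corr_accum -= Σ (grad_sign × T) for one group; pure integer update."""
    return corr_accum - sum(g * t for g, t in zip(grad_signs, trits))


def delta(corr_accum, step, gs):
    """Δ = 4 × corr_accum / (step × gs), as a float ratio of two integers."""
    if step <= 0:
        return 0.0  # assumption: no adjustment before the first step
    return 4 * corr_accum / (step * gs)


def effective_scale(E, corr_accum, step, gs):
    """S = 2^(E + Δ); ephemeral, recomputed on every forward pass."""
    return 2.0 ** (E + delta(corr_accum, step, gs))


def fuse_exponent(E, corr_accum, step, gs):
    """Option B: E_fused = round(E + Δ), after which corr_accum is dropped."""
    fused = round(E + delta(corr_accum, step, gs))
    return max(-128, min(127, fused))  # assumption: clamp to the int8 range
```

With E = -3, gs = 32, step = 2 and corr_accum = 16, `delta` gives Δ = 1.0, so the continuous scale is 2⁻² while the stored E stays at -3. With corr_accum = 4 the scale is 2^(-2.75), which Option B would fuse back to E = -3, losing the fractional part that Option A retains.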