## Tensor Types

1. Scalar (1 block)
2. Vector (e.g. 3 blocks)
3. Matrix (e.g. 3x3, 9 blocks)
4. Tensor (3D or higher-dimensional array; no set number of sides)

> Tensors - A tensor has no set number of "sides" in the traditional geometric sense: it is an n-dimensional array, and scalars, vectors, and matrices are its 0-, 1-, and 2-dimensional cases. Tensor Cores in NVIDIA CUDA architectures are hardware units designed to perform matrix-matrix multiply-accumulate on small fixed-size 2D matrices (often referred to as 2D tensors) directly in hardware; larger products are built from many such tiles.

## Harden and Optimize Ternary Gradient Functions

### 1. Forward Pass

Suppose a neuron:

y = wx + b

The forward pass computes the prediction.

Example:

- x = 2
- w = 0.5
- b = 0.1

This gives y = 0.5 ⋅ 2 + 0.1 = 1.1. The loss derivative in the next part uses a target of t = 3.

### 2. Backward Pass (Where Gradients Come From)

Now we compute ∂L/∂w using the chain rule. This is the core equation of backpropagation:

∂L/∂w = ∂L/∂y ⋅ ∂y/∂w

Break it apart.

Step A: Derivative of the Loss

Loss: L = (y − t)^2

Derivative: ∂L/∂y = 2(y − t)

For our numbers (y = 1.1, t = 3): 2(1.1 − 3) = −3.8

---

Step B: Derivative of the Neuron

Neuron: y = wx + b

Derivative with respect to w: ∂y/∂w = x

Since x = 2, ∂y/∂w = 2.

---

Step C: Multiply Them

∂L/∂w = (−3.8)(2) = −7.6

That is the gradient. Because it is negative, increasing w reduces the loss, and its magnitude shows that the loss is sensitive to w.

### 3. Gradient Descent Update

Update rule:

w(t+1) = w(t) − η(∂L/∂w)

If w = 0.5 and the learning rate is η = 0.01, then:

w(new) = 0.5 − 0.01(−7.6) = 0.576

The weight moved upward because the gradient was negative.

**Important Topics to Keep in Mind**

A transformer can have billions of parameters, and the backward pass computes a gradient for every parameter tensor. This involves:

- matrix multiplications
- reductions
- accumulation
- activation derivatives
- chain-rule propagation

The backward pass is often more expensive than the forward pass: for a matrix-multiply layer it needs gradients with respect to both the inputs and the weights, which is typically about twice the forward work.
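
The forward pass, backward pass, and update step worked through in this section can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the function names (`forward`, `loss`, `grad_w`, `sgd_step`) are illustrative and do not come from any library, and the target `t = 3` is the value implied by the loss derivative step.

```python
def forward(w, x, b):
    """Single neuron: y = w*x + b."""
    return w * x + b


def loss(y, t):
    """Squared error: L = (y - t)^2."""
    return (y - t) ** 2


def grad_w(w, x, b, t):
    """Chain rule: dL/dw = dL/dy * dy/dw = 2*(y - t) * x."""
    y = forward(w, x, b)
    dL_dy = 2.0 * (y - t)
    dy_dw = x
    return dL_dy * dy_dw


def sgd_step(w, grad, lr):
    """Gradient descent update: w_new = w - lr * grad."""
    return w - lr * grad


# Numbers from the worked example (t = 3 is the assumed target).
x, w, b, t, lr = 2.0, 0.5, 0.1, 3.0, 0.01

y = forward(w, x, b)
g = grad_w(w, x, b, t)
w_new = sgd_step(w, g, lr)

# Central finite difference as an independent check of the analytic gradient.
eps = 1e-6
g_numeric = (loss(forward(w + eps, x, b), t)
             - loss(forward(w - eps, x, b), t)) / (2 * eps)

loss_before = loss(y, t)
loss_after = loss(forward(w_new, x, b), t)

print("y        =", round(y, 6))
print("dL/dw    =", round(g, 6))
print("numeric  =", round(g_numeric, 6))
print("w_new    =", round(w_new, 6))
print("loss     =", round(loss_before, 6), "->", round(loss_after, 6))
```

The finite-difference value should agree with the analytic gradient to several decimal places, and the loss after the update should be lower than before, which is what a negative gradient and an upward step in w predict.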
Gradients can become:

- very small
- very large

and training repeatedly accumulates them.

---

FP32:

- 32 bits
- 8-bit exponent
- 23-bit mantissa
- ~7 decimal digits of precision

Range: roughly 10^−38 → 10^38

Good for:

- stable accumulation
- optimizer states
- gradient reductions

---

BF16:

- 16 bits
- 8-bit exponent
- 7-bit mantissa (~2 to 3 decimal digits of precision)

Important:

- same exponent size as FP32
- lower precision

This means:

- range is good
- precision is poor

---

**Why Backward Pass Often Uses FP32**

During backprop you repeatedly do:

g(total) = g1 + g2 + g3 + ...

Accumulation error becomes critical. This is especially dangerous when an individual gi is much smaller than the running total, because BF16 may round the addition away.

Example: 0.00097656 (about 2^−10) is representable in BF16 on its own, but adding it to a BF16 accumulator holding 1.0 leaves the accumulator at 1.0. BF16 values near 1.0 are spaced 2^−7 ≈ 0.0078 apart, so the increment effectively becomes 0 after rounding.

So modern training often does:

| Operation             | Precision |
| --------------------- | --------- |
| Forward activations   | BF16      |
| Matrix multiply       | BF16      |
| Gradient accumulation | FP32      |
| Optimizer state       | FP32      |

This is called mixed precision training.

---

optimum.quanto v0.2.7
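
The accumulation problem behind the precision table can be reproduced without a GPU. The following is a minimal sketch using only the Python standard library. `to_fp32`, `to_bf16`, and `accumulate` are hypothetical helpers written for this illustration (they are not part of optimum.quanto or any other library), and the BF16 rounding is emulated by bit manipulation with round-to-nearest-even, ignoring NaN handling.

```python
import struct


def to_fp32(value):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def to_bf16(value):
    """Emulate BF16 rounding: keep the top 16 bits of the FP32 encoding.

    Uses round-to-nearest-even on the dropped 16 bits. NaN is not handled;
    this is a sketch for finite values only.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(grads, rounder):
    """Sum gradients, rounding the running total after every addition."""
    total = 0.0
    for g in grads:
        total = rounder(total + rounder(g))
    return total


# 1024 small gradients of 2^-10 each; the exact sum is 1.0.
grads = [2.0 ** -10] * 1024

bf16_total = accumulate(grads, to_bf16)
fp32_total = accumulate(grads, to_fp32)

# A single small value survives BF16 conversion on its own...
single = to_bf16(2.0 ** -10)
# ...but is lost when added to a much larger BF16 accumulator.
absorbed = to_bf16(1.0 + 2.0 ** -10)

print("single value in BF16 :", single)
print("1.0 + 2^-10 in BF16  :", absorbed)
print("BF16 accumulator     :", bf16_total)
print("FP32 accumulator     :", fp32_total)
```

With these inputs the emulated BF16 accumulator stalls at 0.25, because from that point each increment of 2^−10 is exactly half the spacing between neighbouring BF16 values and rounds back down, while the FP32 accumulator reaches the exact sum of 1.0. That gap is the reason gradient accumulation and optimizer state are kept in FP32 in the table.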