Tensor Types
- Scalar (1 block)
- Vector (e.g. 3 blocks)
- Matrix (e.g. 3x3 = 9 blocks)
- Tensor (3D or higher-dimensional array; no fixed number of blocks per side)
Tensor cores - Tensor cores in NVIDIA GPUs are not tied to a number of "sides" in the geometric sense. They are hardware units that perform fused matrix multiply-accumulate on small, fixed-size 2D matrix tiles (often referred to as 2D tensors), for example one 4x4 tile operation per clock cycle on the Volta architecture.
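As a rough illustration of the type list above, the sketch below represents each type as a nested Python list and reports its shape and rank. The `shape` helper is hypothetical (it is not a library function) and assumes rectangular, non-empty nesting.

```python
def shape(t):
    """Return the shape of a nested-list tensor as a tuple.

    Hypothetical helper: assumes every level is a non-empty list
    and that all sub-lists at one level have the same length.
    """
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


scalar = 5.0                                   # rank 0: one block
vector = [1.0, 2.0, 3.0]                       # rank 1: 3 blocks
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]                     # rank 2: 3x3 = 9 blocks
tensor = [[[0.0] * 4 for _ in range(3)]
          for _ in range(2)]                   # rank 3: 2x3x4 = 24 blocks

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    s = shape(value)
    print(f"{name}: shape={s}, rank={len(s)}")
```

The rank is simply the number of entries in the shape, which is why a tensor has no fixed number of "sides".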
Harden and Optimize Ternary Gradient Functions
Forward Pass
Suppose a neuron: y = wx + b
The forward pass computes the prediction.
Example: x = 2, w = 0.5, b = 0.1, so y = 0.5 × 2 + 0.1 = 1.1
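A minimal sketch of this forward pass in plain Python, using the example values from the text. The function name `forward` is chosen here for illustration only.

```python
def forward(x, w, b):
    """Single-neuron forward pass: y = w*x + b."""
    return w * x + b


x, w, b = 2.0, 0.5, 0.1
y = forward(x, w, b)
print(f"y = {y:.1f}")  # y = 1.1
```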
Backward Pass (where gradients come from)
Now we compute ∂L/∂w using the chain rule.
This is the core equation of backpropagation: ∂L/∂w = ∂L/∂y × ∂y/∂w
Break it apart.
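Step A: Derivative of the Loss
Loss: L = (y − t)^2
Derivative: ∂L/∂y = 2(y − t)
For our numbers, with prediction y = 1.1 and target t = 3: 2(1.1 − 3) = −3.8
Step B: Derivative of the Neuron
Neuron: y = wx + b
Derivative with respect to w: ∂y/∂w = x
Since x = 2: ∂y/∂w = 2
Step C: Multiply Them
∂L/∂w = (−3.8)(2) = −7.6
That is the gradient. Its negative sign means that increasing w reduces the loss, and its large magnitude means the loss is quite sensitive to w.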
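The three steps can be sketched in code and checked against a numerical (finite-difference) derivative. The target t = 3 is taken from the worked numbers above; the function names and the step size `eps` are illustrative choices, not part of the original notes.

```python
def forward(x, w, b):
    return w * x + b


def loss(y, t):
    return (y - t) ** 2


def grad_w(x, w, b, t):
    """Gradient of the loss with respect to w via the chain rule."""
    y = forward(x, w, b)
    dL_dy = 2.0 * (y - t)   # Step A: derivative of the loss
    dy_dw = x               # Step B: derivative of the neuron
    return dL_dy * dy_dw    # Step C: multiply them


x, w, b, t = 2.0, 0.5, 0.1, 3.0
analytic = grad_w(x, w, b, t)

# Central finite difference as an independent check of the chain rule.
eps = 1e-6
numeric = (loss(forward(x, w + eps, b), t)
           - loss(forward(x, w - eps, b), t)) / (2 * eps)

print(f"analytic = {analytic:.4f}, numeric = {numeric:.4f}")
```

Both values should agree to several decimal places, which is a common sanity check when writing gradient code by hand.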
Gradient Descent Update
Update rule: w(t+1) = w(t) − η(∂L/∂w)
If: w = 0.5, learning rate η = 0.01
Then: w(new) = 0.5 − 0.01(−7.6) = 0.576
The weight moved upward because the gradient was negative.
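A short sketch of the update rule, first reproducing the single step above and then repeating it a few times to show the loss going down. The loop length of 5 steps is an arbitrary choice for illustration.

```python
def loss(w, x, b, t):
    return (w * x + b - t) ** 2


def grad_w(w, x, b, t):
    # dL/dw = 2 * (y - t) * x, with y = w*x + b
    return 2.0 * (w * x + b - t) * x


x, b, t = 2.0, 0.1, 3.0
lr = 0.01  # learning rate (eta)

w = 0.5
w_new = w - lr * grad_w(w, x, b, t)
print(f"w(new) = {w_new:.3f}")  # w(new) = 0.576

# Repeat the same update and record the loss after each step.
losses = [loss(w, x, b, t)]
for _ in range(5):
    w = w - lr * grad_w(w, x, b, t)
    losses.append(loss(w, x, b, t))
print([round(v, 3) for v in losses])
```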
Important Topics to Keep in Mind
A large transformer can have billions of parameters, and the backward pass computes a gradient for EVERY parameter tensor. This involves:
- matrix multiplications
- reductions
- accumulation
- activation derivatives
- chain-rule propagation
The backward pass is often MORE expensive than the forward pass. Gradients can become:
- VERY small
- VERY large
and training repeatedly accumulates them.
FP32:
- 32 bits
- 8-bit exponent
- 23-bit mantissa
- ~7 decimal digits precision
Range: roughly 10^−38 to 10^38
Good for:
- stable accumulation
- optimizer states
- gradient reductions
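The FP32 figures above can be checked from plain Python, whose floats are FP64, by round-tripping values through the standard `struct` module. The helper name `to_fp32` is an assumption made for this sketch.

```python
import struct


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# With a 23-bit mantissa, the spacing just above 1.0 is 2**-23.
eps = 2.0 ** -23
print(to_fp32(1.0 + eps) > 1.0)       # True: one spacing step survives
print(to_fp32(1.0 + eps / 4) == 1.0)  # True: a quarter step is rounded away

# 0.1 stored in FP32 agrees with 0.1 to roughly 7 significant digits.
print(to_fp32(0.1))

# Smallest normal and largest finite FP32 magnitudes.
smallest_normal = 2.0 ** -126
largest_finite = (2.0 - 2.0 ** -23) * 2.0 ** 127
print(f"{smallest_normal:.3e} {largest_finite:.3e}")  # 1.175e-38 3.403e+38
```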
BF16:
- 16 bits
- 8-bit exponent
- 7-bit mantissa
Important:
- SAME exponent size as FP32
- LOWER precision
This means:
- range is good
- precision is poor
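To make "range is good, precision is poor" concrete, the sketch below emulates BF16 in plain Python by keeping only the top 16 bits of the FP32 encoding, with round-to-nearest-even. The helper `to_bf16` is hypothetical and written for this illustration; it ignores NaN, infinity, and overflow handling, and real hardware or libraries may differ in such edge cases.

```python
import math
import struct


def to_bf16(x):
    """Round a float to the nearest BF16 value, returned as a Python float.

    Emulation only: BF16 is the upper 16 bits of the FP32 bit pattern.
    NaN, infinity, and overflow cases are not handled here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Precision: only 7 stored mantissa bits, so spacing just above 1.0 is 2**-7.
print(to_bf16(math.pi))           # 3.140625
print(to_bf16(1.0 + 2.0 ** -7))   # 1.0078125
print(to_bf16(1.0 + 2.0 ** -9))   # 1.0

# Range: very large and very small magnitudes survive, at low precision.
for value in (1e30, 1e-30):
    rel_err = abs(to_bf16(value) - value) / value
    print(f"{value:.0e}: relative error {rel_err:.4f}")
```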
Why Backward Pass Often Uses FP32
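During backprop you repeatedly do: g(total) = g1 + g2 + g3 + ...
Accumulation error becomes critical.
This is especially dangerous when the individual terms are tiny compared with the running sum, for example: gi ≪ 1
because BF16 may round them away during the addition.
Example: 0.00097656 (about 2^−10)
is representable in BF16 on its own, but when added to a running sum near 1.0 it contributes: 0
because neighboring BF16 values near 1.0 are 2^−7 ≈ 0.0078 apart. So modern training often does: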
| Operation | Precision |
|---|---|
| Forward activations | BF16 |
| Matrix multiply | BF16 |
| Gradient accumulation | FP32 |
| Optimizer state | FP32 |
This is called: mixed precision training
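The effect of accumulator precision can be sketched in plain Python by emulating both formats. The example sums the same small gradient, 2^−10 ≈ 0.00097656, 4096 times, once into a BF16 accumulator and once into an FP32 accumulator. The helpers `to_fp32` and `to_bf16` are hypothetical emulations written for this sketch (no NaN or overflow handling), not a real training API.

```python
import struct


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest BF16 value (emulated, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


g = to_bf16(2.0 ** -10)  # the gradient itself is stored in BF16
n = 4096                 # exact sum would be 4096 * 2**-10 = 4.0

acc_bf16 = 0.0
acc_fp32 = 0.0
for _ in range(n):
    acc_bf16 = to_bf16(acc_bf16 + g)  # accumulator kept in BF16
    acc_fp32 = to_fp32(acc_fp32 + g)  # accumulator kept in FP32

print(acc_bf16, acc_fp32)  # 0.25 4.0
```

In this emulation the BF16 accumulator stalls at 0.25: from that point on, each added gradient is only half the spacing between adjacent BF16 values and is rounded away. The FP32 accumulator reaches the exact sum, which is the motivation for the FP32 rows in the table above.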
optimum.quanto v0.2.7