|
|
| ## Tensor Types |
| 1. Scalar (rank 0: 1 block) |
| 2. Vector (rank 1: e.g. 3 blocks) |
| 3. Matrix (rank 2: e.g. 3x3 = 9 blocks) |
| 4. Tensor (rank 3 or higher, e.g. a 3D stack of matrices; no set number of dimensions or size per side) |
| > Tensors - A tensor has no set number of "sides" in the traditional geometric sense. Tensor Cores in NVIDIA CUDA architectures, despite the name, work on small 2D matrices (often referred to as 2D tensors): each one performs a fused matrix multiply-accumulate on small matrix tiles directly in hardware, rather than one element at a time. |
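The four types above differ only in rank (number of dimensions). A minimal sketch using plain nested Python lists, where the helper name `shape` and the example values are illustrative assumptions, not part of any library:

```python
def shape(t):
    """Return the shape of a rectangular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]  # assumes every level is non-empty and rectangular
    return tuple(dims)

scalar = 5.0                                                   # rank 0
vector = [1.0, 2.0, 3.0]                                       # rank 1
matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # rank 2
tensor3 = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]    # rank 3

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor3", tensor3)]:
    s = shape(value)
    print(name, "rank", len(s), "shape", s)
```

The rank is just the length of the shape tuple, which is why a tensor has no fixed number of "sides".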
|
|
| ## Harden and Optimize Ternary Gradient Functions |
| 1. Forward Pass |
| Suppose a neuron: |
| y=wx+b |
| |
| Forward pass computes prediction. |
| |
| Example: |
| x=2 |
| w=0.5 |
| b=0.1 |
| y=(0.5)(2)+0.1=1.1 |
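As a quick check of the forward pass, here is a minimal sketch in Python. The function name `forward` is an illustrative choice; the numbers are the ones from the example.

```python
def forward(x, w, b):
    """Single linear neuron: y = w*x + b."""
    return w * x + b

y = forward(x=2.0, w=0.5, b=0.1)
print("prediction y =", y)
```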
| |
| 2. Backward Pass (where gradients come from) |
| Now we compute: |
| ∂L/∂w |
| |
| using the chain rule. |
| |
| This is the core equation of backpropagation: |
| ∂L/∂w = ∂L/∂y · ∂y/∂w |
| |
| Break it apart. |
| |
| Step A - Derivative of Loss |
| Loss: L=(y-t)^2, where t is the target value |
| Derivative: ∂L/∂y = 2(y-t) |
| For our numbers (y=1.1 from the forward pass, t=3): 2(1.1-3) = -3.8 |
| |
| --- |
| |
| Step B - Derivative of Neuron |
| Neuron: y=wx+b |
| Derivative wrt w: ∂y/∂w = x |
| Since x=2: ∂y/∂w = 2 |
| |
| --- |
| |
| Step C - Multiply Them |
| ∂L/∂w = (-3.8)(2) = -7.6 |
| That is the gradient. |
| Meaning: the gradient is negative, so increasing w reduces the loss, and its large magnitude means the loss reacts strongly to w. |
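Steps A to C can be sketched in a few lines and cross-checked against a finite-difference estimate. The function names are illustrative, and the target `t = 3` is taken from the numbers used in Step A.

```python
def forward(x, w, b):
    return w * x + b

def loss(y, t):
    return (y - t) ** 2

def grad_w(x, w, b, t):
    """Chain rule: dL/dw = dL/dy * dy/dw."""
    y = forward(x, w, b)
    dL_dy = 2.0 * (y - t)   # Step A
    dy_dw = x               # Step B
    return dL_dy * dy_dw    # Step C

x, w, b, t = 2.0, 0.5, 0.1, 3.0
analytic = grad_w(x, w, b, t)

# Central finite difference as an independent check of the derivative.
eps = 1e-6
numeric = (loss(forward(x, w + eps, b), t)
           - loss(forward(x, w - eps, b), t)) / (2 * eps)

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
```

The two values should agree closely; a mismatch in a check like this is a common sign of a mistake in a hand-written backward pass.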
| |
| 3. Gradient Descent Update |
| Update rule: |
| w(t+1) = w(t) - η(∂L/∂w) |
| |
| If: |
| w=0.5 |
| learning rate =0.01 |
|
|
| Then: |
| w(new) = 0.5 - 0.01(-7.6) = 0.576 |
| The weight moved upward because the gradient was negative. |
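The update rule can be sketched as a single function, then repeated for a few steps to watch the loss fall. The names `sgd_step` and `loss_and_grad` are illustrative, and the defaults reuse the example values (x=2, b=0.1, t=3).

```python
def loss_and_grad(w, x=2.0, b=0.1, t=3.0):
    """Return (loss, dL/dw) for the single neuron y = w*x + b."""
    y = w * x + b
    return (y - t) ** 2, 2.0 * (y - t) * x

def sgd_step(w, grad, lr):
    """One gradient descent update: w <- w - lr * grad."""
    return w - lr * grad

# The single update worked out above.
w_new = sgd_step(0.5, -7.6, 0.01)
print("w after one step:", w_new)

# A few more steps: the loss should shrink each time.
w = 0.5
losses = []
for step in range(5):
    current_loss, grad = loss_and_grad(w)
    losses.append(current_loss)
    w = sgd_step(w, grad, lr=0.01)
print("losses:", losses)
```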
|
|
|
|
| **Important Topics to Keep in Mind** |
| A transformer can have billions of parameters, and the backward pass computes a gradient for EVERY parameter tensor. |
| This involves: |
| - matrix multiplications |
| - reductions |
| - accumulation |
| - activation derivatives |
| - chain-rule propagation |
|
|
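| The backward pass is often MORE expensive than the forward pass (roughly twice the work for matrix-multiply layers, because it needs gradients with respect to both the inputs and the weights). |
| Gradients can become: |
| - VERY small |
| - VERY large |
| and training repeatedly accumulates them. |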
|
|
| --- |
|
|
| FP32: |
| - 32 bits |
| - 8-bit exponent |
| - 23-bit mantissa |
| - ~7 decimal digits precision |
|
|
| Range: |
| about 10^-38 to 10^38 |
| |
| Good for: |
| - stable accumulation |
| - optimizer states |
| - gradient reductions |
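To get a feel for the 23-bit mantissa, the sketch below rounds Python floats (which are FP64) to FP32 using the standard `struct` module. The helper name `to_fp32` is an illustrative choice.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 0.123456789012345
rounded = to_fp32(value)
print("original:", repr(value))
print("as FP32: ", repr(rounded))  # agrees to roughly 7 significant digits

# Near 1.0 the spacing between FP32 values is 2**-23.
step_kept = to_fp32(1.0 + 2.0 ** -23)   # exactly one step above 1.0
step_lost = to_fp32(1.0 + 2.0 ** -25)   # below half a step, rounds back to 1.0
print(step_kept > 1.0, step_lost == 1.0)
```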
|
|
| --- |
|
|
| BF16: |
| - 16 bits |
| - 8-bit exponent |
| - 7-bit mantissa |
|
|
| Important: |
| - SAME exponent size as FP32 |
| - LOWER precision |
|
|
| This means: |
| - range is good |
| - precision is poor |
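"Range is good, precision is poor" can be seen with a small software emulation of BF16. This is a sketch under stated assumptions: `to_bf16` is a hypothetical helper that rounds to FP32 first and then keeps the top 16 bits with round-to-nearest-even. It does not treat NaN specially, and real hardware may round slightly differently.

```python
import struct

def to_bf16(x):
    """Emulate BF16: round to FP32, then keep the top 16 bits (nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then clear the 16 low bits that BF16 drops.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range: very large and very small magnitudes survive.
big = to_bf16(1e38)
tiny = to_bf16(1e-38)
print("1e38  ->", big)
print("1e-38 ->", tiny)

# Precision: only 2 to 3 decimal digits survive.
pi_bf16 = to_bf16(3.14159265)
print("pi    ->", pi_bf16)
```

With the 7-bit mantissa, the spacing between BF16 values near 1.0 is 2^-7 (about 0.0078), compared with 2^-23 for FP32.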
|
|
| --- |
|
|
| Why Backward Pass Often Uses FP32 |
|
|
| During backprop you repeatedly do: |
| g(total) = g1 + g2 + g3 + ... |
|
|
| Accumulation error becomes critical. |
|
|
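| This is especially dangerous when a gradient is much smaller than the running sum: |
| gi ≪ g(total) |
|
|
| because BF16 keeps only about 8 significant bits, so the small term may be rounded away. |
|
|
| Example: |
| 1 + 0.00097656 |
|
|
| might become: |
| 1 |
|
|
| after rounding to BF16, so the gradient 0.00097656 (about 2^-10) contributes nothing. On its own, 0.00097656 is well within BF16's range; it is lost only in the addition. |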
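The effect can be reproduced with a software emulation of the two formats. This is a sketch, not a hardware-exact model: `to_fp32` and `to_bf16` are hypothetical helpers built on the standard `struct` module, and `to_bf16` assumes round-to-nearest-even without NaN handling.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Emulate BF16: round to FP32, then keep the top 16 bits (nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

g = 0.00097656

alone_bf16 = to_bf16(g)        # the small value by itself is representable
sum_bf16 = to_bf16(1.0 + g)    # added to 1.0, it falls below BF16's spacing
sum_fp32 = to_fp32(1.0 + g)    # FP32 still sees the contribution

print("g alone in BF16:", alone_bf16)
print("1 + g in BF16:  ", sum_bf16)
print("1 + g in FP32:  ", sum_fp32)
```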
| So modern training often does: |
|
|
| | Operation | Precision | |
| | --------------------- | --------- | |
| | Forward activations | BF16 | |
| | Matrix multiply | BF16 | |
| | Gradient accumulation | FP32 | |
| | Optimizer state | FP32 | |
|
|
| This is called: |
| mixed precision training |
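A toy model of the "Gradient accumulation" row: the same BF16-valued gradients are summed once with a BF16 accumulator and once with an FP32 accumulator. The helpers `to_fp32`, `to_bf16`, and `accumulate` are illustrative assumptions that emulate the formats in software. Real frameworks do this inside their kernels.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Emulate BF16: round to FP32, then keep the top 16 bits (nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def accumulate(grads, round_fn):
    """Sum grads, rounding the running total to the accumulator's format."""
    total = 0.0
    for g in grads:
        total = round_fn(total + g)
    return total

# 1000 identical small gradients, each stored in BF16 (2**-10 is exact in BF16).
grads = [to_bf16(2.0 ** -10)] * 1000

sum_bf16 = accumulate(grads, to_bf16)   # accumulator kept in BF16
sum_fp32 = accumulate(grads, to_fp32)   # accumulator kept in FP32

print("true sum:        ", 1000 * 2.0 ** -10)
print("BF16 accumulator:", sum_bf16)
print("FP32 accumulator:", sum_fp32)
```

In this emulation the BF16 accumulator stops growing once each gradient is only half the spacing between neighbouring BF16 values, while the FP32 accumulator reaches the true sum.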
|
|
| --- |
|
|
|
|
|
|
| optimum.quanto v0.2.7 |