## Tensor Types
1. Scalar (1 block)
2. Vector (3 blocks)
3. Matrix (3x3 = 9 blocks)
4. Tensor (3D or higher-dimensional array, with no set size along any side)
> Tensors - A tensor is an n-dimensional array, so it has no set number of "sides" in the traditional geometric sense. Tensor Cores in NVIDIA CUDA architectures, despite the name, do not operate on arbitrary-rank tensors: they are hardware units that perform a fused matrix multiply-accumulate on small 2D matrix tiles (often referred to as 2D tensors).
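The four types above differ only in rank (number of dimensions). A minimal sketch, using plain nested Python lists rather than any tensor library; the `shape` helper and the example sizes are illustrative assumptions, not part of the original notes:

```python
def shape(x):
    """Return the shape of a regular nested list as a tuple (() for a scalar)."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

scalar = 5.0                                   # rank 0: 1 block
vector = [1.0, 2.0, 3.0]                       # rank 1: 3 blocks
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]                     # rank 2: 3x3 = 9 blocks
tensor3d = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # rank 3: 2x3x4

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor3d", tensor3d)]:
    s = shape(value)
    print(f"{name}: rank={len(s)} shape={s}")
```

The rank-3 example uses sizes 2, 3 and 4 only to show that each side of a tensor can have a different length.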
## Harden and Optimize Ternary Gradient Functions
1. Forward Pass
Suppose a neuron:
y=wx+b
The forward pass computes the prediction.
Example:
x=2
w=0.5
b=0.1
y=(0.5)(2)+0.1=1.1
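The forward pass for this single neuron can be sketched in a few lines. The function name `forward` is an illustrative choice; the numbers are the ones from the example above:

```python
def forward(w, x, b):
    """Forward pass of a single linear neuron: y = w*x + b."""
    return w * x + b

x = 2.0
w = 0.5
b = 0.1

y = forward(w, x, b)
print(f"prediction y = {y:.4f}")
```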
2. Backward Pass (Where gradients come from)
Now we compute:
∂L/∂w
using the chain rule.
This is the core equation of backpropagation:
∂L/∂w = ∂L/∂y · ∂y/∂w
Break it apart.
Step A: Derivative of Loss
Loss: L=(y−t)^2, where t is the target (here t=3)
Derivative: ∂L/∂y = 2(y−t)
For our numbers: 2(1.1−3) = −3.8
---
Step B: Derivative of Neuron
Neuron: y=wx+b
Derivative wrt w: ∂y/∂w = x
Since x=2: ∂y/∂w = 2
---
Step C: Multiply Them
∂L/∂w = (−3.8)(2) = −7.6
That is the gradient.
Meaning: the gradient is negative, so increasing w reduces the loss, and the large magnitude means it does so strongly.
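Steps A to C can be checked numerically. The sketch below computes the chain-rule gradient and compares it with a central finite difference; the target t=3 is taken from the worked numbers, while the function names and the step size `h` are illustrative assumptions:

```python
def loss(w, x, b, t):
    """Squared-error loss of the neuron y = w*x + b against target t."""
    y = w * x + b
    return (y - t) ** 2

def grad_w(w, x, b, t):
    """dL/dw via the chain rule: dL/dy * dy/dw."""
    y = w * x + b
    dL_dy = 2.0 * (y - t)   # Step A: derivative of the loss
    dy_dw = x               # Step B: derivative of the neuron
    return dL_dy * dy_dw    # Step C: multiply them

x, w, b, t = 2.0, 0.5, 0.1, 3.0

analytic = grad_w(w, x, b, t)

# Central finite difference as an independent check.
h = 1e-6
numeric = (loss(w + h, x, b, t) - loss(w - h, x, b, t)) / (2.0 * h)

print(f"analytic gradient: {analytic:.4f}")
print(f"numeric gradient:  {numeric:.4f}")
```

Both values should agree closely with the hand-computed gradient of −7.6.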
3. Gradient Descent Update
Update rule:
w(t+1) = w(t) − η(∂L/∂w)
If:
w=0.5
learning rate =0.01
Then:
w(new) = 0.5 − 0.01(−7.6) = 0.576
The weight moved upward because gradient was negative.
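A minimal sketch of one gradient descent step with these numbers. It also recomputes the loss before and after the update to confirm that the step helps; the helper names are illustrative, and x, b and t are the values used earlier in this section:

```python
def loss(w, x, b, t):
    """Squared-error loss of the neuron y = w*x + b against target t."""
    return (w * x + b - t) ** 2

def sgd_step(w, grad, lr):
    """Plain gradient descent update: w_new = w - lr * grad."""
    return w - lr * grad

x, b, t = 2.0, 0.1, 3.0
w = 0.5
lr = 0.01
grad = 2.0 * (w * x + b - t) * x   # chain-rule gradient, about -7.6

w_new = sgd_step(w, grad, lr)

loss_before = loss(w, x, b, t)
loss_after = loss(w_new, x, b, t)

print(f"w: {w} -> {w_new:.3f}")
print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
```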
**Important Topics to Keep in Mind**
A transformer can have billions of parameters, and the backward pass computes a gradient for EVERY parameter tensor.
This involves:
- matrix multiplications
- reductions
- accumulation
- activation derivatives
- chain-rule propagation
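The backward pass is often MORE expensive than the forward pass: for dense layers it needs gradients with respect to both the inputs and the weights, which is roughly twice the matrix-multiply work.
Gradients can become:
- VERY small
- VERY large
and training repeatedly accumulates them.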
---
FP32:
- 32 bits
- 23-bit mantissa
- ~7 decimal digits precision
Range:
roughly 10^−38 → 10^38
Good for:
- stable accumulation
- optimizer states
- gradient reductions
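The 23-bit mantissa can be made concrete by rounding Python floats (FP64) to FP32 with the standard `struct` module. This is a sketch for illustration only; the helper name `to_fp32` is an assumption, not an API from any training library:

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

eps = 2.0 ** -23   # spacing between adjacent FP32 values just above 1.0

kept = to_fp32(1.0 + eps)       # exactly representable, so it survives
lost = to_fp32(1.0 + eps / 4)   # under half a spacing, rounds back to 1.0
big = to_fp32(16777217.0)       # 2**24 + 1 needs 25 significant bits

print("1 + eps kept:  ", kept == 1.0 + eps)
print("1 + eps/4 lost:", lost == 1.0)
print("2**24 + 1 ->   ", big)
```

A relative spacing of 2^−23 (about 1.2e−7) is where the "~7 decimal digits" figure comes from.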
---
BF16:
- 16 bits
- 8-bit exponent
- 7-bit mantissa
Important:
- SAME exponent size as FP32
- LOWER precision
This means:
- range is good
- precision is poor
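BF16 is the top 16 bits of an FP32 value, so it can be emulated by rounding away the low 16 bits. The sketch below assumes round-to-nearest-even and finite inputs (NaN and infinity are not handled); `to_bf16` is a hypothetical helper written for this illustration:

```python
import math
import struct

def to_bf16(x):
    """Round a finite float to the nearest BF16 value (returned as a Python float).

    Emulation only: convert to FP32 bits, round to nearest even on the
    16 low bits that BF16 drops, then clear those bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign, 8-bit exponent, 7-bit mantissa
    return struct.unpack("<f", struct.pack("<I", bits))[0]

tiny = to_bf16(1e-30)        # far below FP16's range, still fine in BF16
pi_bf16 = to_bf16(math.pi)   # only 2 to 3 decimal digits survive
near_one = to_bf16(1.001)    # closer to 1.0 than to the next BF16 value

print("1e-30 ->", tiny)
print("pi    ->", pi_bf16)
print("1.001 ->", near_one)
```

The first value shows that the range is good; the other two show that the precision is poor.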
---
Why Backward Pass Often Uses FP32
During backprop you repeatedly do:
g(total) = g1 + g2 + g3 + ...
Accumulation error becomes critical.
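This is especially dangerous when:
|gi| ≪ |g(total)|
because BF16 may round the small terms away.
Example:
0.00097656 (about 2^−10)
is representable in BF16 on its own, but added to a running sum of 1.0 the result rounds back to:
1.0
because adjacent BF16 values near 1.0 are 2^−7 = 0.0078125 apart. The contribution is effectively 0 after quantization.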
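The effect can be reproduced by summing many small gradients with a BF16 accumulator and with an FP32 accumulator. This is an emulation sketch under stated assumptions: `to_bf16` and `to_fp32` are hypothetical helpers using round-to-nearest-even on finite inputs, and the gradient value 2^−10 and the count 1024 are chosen for illustration:

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

g = 2.0 ** -10    # one small gradient contribution (about 0.00097656)
n = 1024          # number of contributions; the exact total is 1.0

acc_bf16 = 0.0
acc_fp32 = 0.0
for _ in range(n):
    acc_bf16 = to_bf16(acc_bf16 + g)   # accumulator stored in BF16
    acc_fp32 = to_fp32(acc_fp32 + g)   # accumulator stored in FP32

print("exact total:     ", n * g)
print("FP32 accumulator:", acc_fp32)
print("BF16 accumulator:", acc_bf16)
```

The BF16 accumulator stops growing once the running sum is large enough that each new term is at most half the spacing between adjacent BF16 values, while the FP32 accumulator reaches the exact total.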
So modern training often does:
| Operation | Precision |
| --------------------- | --------- |
| Forward activations | BF16 |
| Matrix multiply | BF16 |
| Gradient accumulation | FP32 |
| Optimizer state | FP32 |
This is called:
mixed precision training
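As a rough illustration of the table, the single neuron from earlier can be trained two ways: with everything in BF16, and with a BF16 forward pass plus an FP32 master weight. This is a toy emulation, not how any particular framework implements mixed precision; the helpers, the learning rate 0.001 and the 2000 steps are all assumptions made for the sketch:

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x, b, t = 2.0, 0.1, 3.0
lr = 0.001
w_exact = (t - b) / x          # weight that gives zero loss: 1.45

master_w = to_fp32(0.5)        # mixed precision: FP32 master copy of the weight
bf16_w = to_bf16(0.5)          # baseline: weight stored and updated in BF16

for _ in range(2000):
    # Mixed precision: BF16 forward pass, FP32 gradient and weight update.
    y = to_bf16(to_bf16(master_w) * x + b)
    grad = to_fp32(2.0 * (y - t) * x)
    master_w = to_fp32(master_w - lr * grad)

    # Pure BF16: the small update can be rounded away when it is stored.
    y_lo = to_bf16(bf16_w * x + b)
    grad_lo = to_bf16(2.0 * (y_lo - t) * x)
    bf16_w = to_bf16(bf16_w - lr * grad_lo)

print("target weight:        ", w_exact)
print("FP32 master weight:   ", master_w)
print("BF16-only weight:     ", bf16_w)
```

In this toy setup the BF16-only weight stalls once the update is smaller than half the spacing between adjacent BF16 values, while the FP32 master weight keeps absorbing small updates and ends up close to the target.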
---
optimum.quanto v0.2.7