
## Tensor Types
1. Scalar (1 block)
2. Vector (e.g. 3 blocks)
3. Matrix (e.g. 3x3 = 9 blocks)
4. Tensor (3D or higher-dimensional array, no set number of sides)
> Tensors - Tensor Cores in NVIDIA CUDA architectures do not have a set number of "sides" in the traditional geometric sense. They are designed to perform matrix multiply-accumulate on small 2D matrix tiles (often referred to as 2D tensors) as a single fused hardware operation rather than element by element.

## Harden and Optimize Ternary Gradient Functions
1. Forward Pass
    Suppose a neuron:
    y=wx+b

    The forward pass computes the prediction.

    Example:
    x=2
    w=0.5
    b=0.1

    Then:
    y=0.5(2)+0.1=1.1

2. Backward Pass (where gradients come from)
    Now we compute:
    ∂L/∂w

    using the chain rule.

    This is the core equation of backpropagation:
    ∂L/∂w = ∂L/∂y ⋅ ∂y/∂w

    Break it apart.

    Step A: Derivative of Loss
    Loss: L=(y−t)^2, where t is the target (here t=3)
    Derivative: ∂L/∂y = 2(y−t)
    For our numbers (y=1.1 from the forward pass): 2(1.1−3) = −3.8

    ---

    Step B: Derivative of Neuron
    Neuron: y=wx+b
    Derivative wrt w: ∂y/∂w = x
    Since x=2: ∂y/∂w = 2

    ---

    Step C: Multiply Them
    ∂L/∂w = (−3.8)(2) = −7.6
    That is the gradient.
    Meaning: the gradient is negative, so increasing w reduces the loss strongly.

3. Gradient Descent Update
    Update rule:
    w(t+1) = w(t) − η(∂L/∂w)

    If:
    w=0.5
    learning rate η=0.01

    Then:
    w(new) = 0.5 − 0.01(−7.6) = 0.576
    The weight moved upward because the gradient was negative.
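
The three steps above can be reproduced in a few lines of plain Python. This is a minimal sketch using the numbers from the worked example; the variable names (`dL_dy`, `dy_dw`, `lr`, and so on) are illustrative choices, and the target `t = 3` is taken from the loss derivative calculation.

```python
# Single-neuron example: y = w*x + b with squared-error loss L = (y - t)^2.
x, w, b = 2.0, 0.5, 0.1   # input, weight, bias from the worked example
t = 3.0                   # target implied by the worked numbers
lr = 0.01                 # learning rate (eta)

# 1. Forward pass
y = w * x + b             # about 1.1
loss = (y - t) ** 2

# 2. Backward pass (chain rule)
dL_dy = 2.0 * (y - t)     # about -3.8
dy_dw = x                 # 2.0
dL_dw = dL_dy * dy_dw     # about -7.6

# 3. Gradient descent update
w_new = w - lr * dL_dw    # about 0.576

print(f"y={y:.3f} loss={loss:.3f} dL/dw={dL_dw:.3f} w_new={w_new:.3f}")
```

Because the gradient is negative, subtracting `lr * dL_dw` increases the weight, which matches the hand calculation.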


**Important Topics to Keep in Mind**
A transformer can have billions of parameters, and the backward pass computes a gradient for EVERY parameter tensor.
This involves:
- matrix multiplications
- reductions
- accumulation
- activation derivatives
- chain-rule propagation

The backward pass is often MORE expensive than the forward pass.
Gradients can become:
- VERY small
- VERY large

and training repeatedly accumulates them.
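
One way to see why gradients drift toward very small or very large values: chain-rule propagation multiplies one local derivative per layer. The sketch below is a toy illustration only. The helper `backprop_scale` and the per-layer values 0.5 and 1.5 are assumptions chosen for the demonstration, not measurements from a real model.

```python
# Chain-rule propagation multiplies one local derivative per layer.
# The per-layer values below are illustrative assumptions, not measurements.
def backprop_scale(local_derivative: float, num_layers: int) -> float:
    """Return the product of num_layers identical local derivatives."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= local_derivative
    return scale

shrinking = backprop_scale(0.5, 50)   # each layer halves the gradient
growing = backprop_scale(1.5, 50)     # each layer scales it by 1.5

print(f"0.5 per layer over 50 layers: {shrinking:.3e}")
print(f"1.5 per layer over 50 layers: {growing:.3e}")
```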

---

FP32:
- 32 bits
- 8-bit exponent
- 23-bit mantissa
- ~7 decimal digits precision

Range:
    ~10^−38 → ~10^38

Good for:
- stable accumulation
- optimizer states
- gradient reductions
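
The precision limit can be checked without any extra libraries. This is a small sketch that round-trips a Python float (64-bit) through the standard `struct` module's 32-bit float format; the helper name `to_fp32` is an illustrative choice.

```python
import struct

def to_fp32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# 0.1 is not exactly representable; FP32 keeps roughly 7 significant digits.
print(repr(to_fp32(0.1)))

# With a 24-bit significand (23 stored bits plus an implicit leading 1),
# integers above 2**24 can no longer all be represented.
print(to_fp32(16777216.0), to_fp32(16777217.0))
```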

---

BF16:
- 16 bits
- 8-bit exponent
- 7-bit mantissa

Important:
- SAME exponent size as FP32
- LOWER precision

This means:
- range is good (about the same as FP32)
- precision is poor (about 2 to 3 decimal digits)
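
The shared exponent is easy to see at the bit level: a BF16 pattern is the top 16 bits of the FP32 pattern. The sketch below uses only the standard `struct` module. The helper names are illustrative, and it truncates the low bits for simplicity, whereas real conversions typically round to nearest.

```python
import struct

def fp32_bits(value: float) -> int:
    """Return the 32-bit IEEE 754 pattern of value rounded to FP32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def fp32_fields(bits: int):
    """Split an FP32 pattern into (sign, 8-bit exponent, 23-bit mantissa)."""
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def bf16_fields(bits16: int):
    """Split a BF16 pattern into (sign, 8-bit exponent, 7-bit mantissa)."""
    return bits16 >> 15, (bits16 >> 7) & 0xFF, bits16 & 0x7F

def bf16_to_float(bits16: int) -> float:
    """Widen a BF16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

x = 3.14159
bits = fp32_bits(x)
bf16 = bits >> 16  # truncation for simplicity; conversions typically round

print("FP32 fields:", fp32_fields(bits))
print("BF16 fields:", bf16_fields(bf16))
print("BF16 value :", bf16_to_float(bf16))
```

The exponent field is identical in both formats, so the range is preserved; only the low mantissa bits are lost.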

---

Why Backward Pass Often Uses FP32

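During backprop you repeatedly do:
g(total) = g1 + g2 + g3 + ...

Accumulation error becomes critical.

This is especially dangerous when:
gi ≪ g(total)

because BF16 may round the small terms away when they are added to a much larger running sum.

Example:
1.0 + 0.00097656

might become:
1.0

after rounding to BF16, because near 1.0 BF16 can only resolve steps of about 0.0078 (2^−7).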
So modern training often does:

| Operation             | Precision |
| --------------------- | --------- |
| Forward activations   | BF16      |
| Matrix multiply       | BF16      |
| Gradient accumulation | FP32      |
| Optimizer state       | FP32      |

This is called:
mixed precision training
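
The effect of the accumulator's precision can be emulated in plain Python. This is a sketch under stated assumptions: `to_bf16` is a hypothetical helper that emulates BF16 by rounding the FP32 bit pattern to its top 16 bits (round to nearest, ties to even), not a library API, and the gradient value and step count are chosen so the exact sum is 1.0.

```python
import struct

def to_fp32(value: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def to_bf16(value: float) -> float:
    """Emulate BF16 by rounding the FP32 bit pattern to its top 16 bits.

    Uses round-to-nearest-even; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

grad = 2.0 ** -10          # about 0.00097656, exactly representable in BF16
steps = 1024               # exact sum is 1.0

bf16_total = 0.0
fp32_total = 0.0
for _ in range(steps):
    bf16_total = to_bf16(bf16_total + grad)   # BF16 accumulator
    fp32_total = to_fp32(fp32_total + grad)   # FP32 accumulator

print("BF16 accumulator:", bf16_total)
print("FP32 accumulator:", fp32_total)
```

In this emulation the BF16 accumulator stops growing once the running sum reaches 0.25, because each further addition is only half of the smallest representable step at that magnitude. The FP32 accumulator reaches the exact sum of 1.0.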

---



optimum.quanto v0.2.7