## Tensor Types
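1. Scalar (rank 0: a single value)
2. Vector (rank 1: a 1D array, e.g. 3 values)
3. Matrix (rank 2: a 2D array, e.g. 3x3 = 9 values)
4. Tensor (rank 3 or higher: an n-dimensional array with no fixed number of axes or sizes)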
> Tensors vs. Tensor Cores - A tensor is a data structure, so it has no set number of "sides" in the geometric sense. Tensor Cores in NVIDIA CUDA architectures are hardware units: they do not process arbitrary-rank tensors directly but perform fused matrix multiply-accumulate on small 2D matrix tiles, and larger matrix multiplications are split into such tiles.
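
To make the four types concrete, here is a minimal sketch that reports the rank and shape of plain nested Python lists. The helper name `shape` is hypothetical (it is not a library function), and it assumes regular, non-empty nesting.

```python
def shape(value):
    """Return the shape of a nested list as a tuple.

    Hypothetical helper for illustration; assumes every level is a
    non-empty list whose elements all have the same length.
    """
    dims = []
    while isinstance(value, list):
        dims.append(len(value))
        value = value[0]
    return tuple(dims)


scalar = 5.0                                   # rank 0
vector = [1.0, 2.0, 3.0]                       # rank 1
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]                     # rank 2
tensor3d = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # rank 3

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor3d", tensor3d)]:
    s = shape(value)
    print(f"{name}: rank {len(s)}, shape {s}")
```

The rank is simply the number of axes, which is why a tensor has no fixed number of "sides".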
## Harden and Optimize Ternary Gradient Functions
1. Forward Pass
Suppose a neuron:
y=wx+b
The forward pass computes the prediction y from the input x, the weight w, and the bias b.
Example:
x = 2
w = 0.5
b = 0.1
y = 0.5(2) + 0.1 = 1.1
2. Backward Pass (where gradients come from)
Now we compute:
∂L/∂w
using the chain rule.
This is the core equation of backpropagation:
∂L/∂w = ∂L/∂y ⋅ ∂y/∂w
Break it apart.
Step A: Derivative of Loss
Loss: L = (y − t)^2, where t is the target value
Derivative: ∂L/∂y = 2(y − t)
For our numbers (y = 1.1 from the forward pass, target t = 3): 2(1.1 − 3) = −3.8
---
Step B: Derivative of Neuron
Neuron: y = wx + b
Derivative wrt w: ∂y/∂w = x
Since x = 2, ∂y/∂w = 2.
---
Step C: Multiply Them
∂L/∂w = (−3.8)(2) = −7.6
That is the gradient.
Meaning: the gradient is negative, so increasing w reduces the loss, and its large magnitude means the loss is sensitive to w.
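
Steps A to C can be checked numerically. The sketch below uses plain Python and the example numbers from these notes (x = 2, w = 0.5, b = 0.1, t = 3); the function names are illustrative assumptions, and the finite-difference check is added only as a sanity test of the chain-rule result.

```python
def forward(x, w, b):
    """Single neuron: y = w*x + b."""
    return w * x + b


def loss(y, t):
    """Squared error: L = (y - t)^2."""
    return (y - t) ** 2


def grad_w(x, w, b, t):
    """dL/dw via the chain rule: dL/dy * dy/dw."""
    y = forward(x, w, b)
    dL_dy = 2.0 * (y - t)   # Step A
    dy_dw = x               # Step B
    return dL_dy * dy_dw    # Step C


x, w, b, t = 2.0, 0.5, 0.1, 3.0
analytic = grad_w(x, w, b, t)

# Central finite difference as an independent check of the derivative.
eps = 1e-6
numeric = (loss(forward(x, w + eps, b), t)
           - loss(forward(x, w - eps, b), t)) / (2 * eps)

print(f"analytic dL/dw: {analytic:.4f}")
print(f"numeric  dL/dw: {numeric:.4f}")
```

Both values should agree to several decimal places, which is a cheap way to catch sign or factor mistakes in a hand-derived gradient.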
3. Gradient Descent Update
Update rule:
w(t+1) = w(t) − η(∂L/∂w)
where η is the learning rate.
If:
w = 0.5
learning rate η = 0.01
Then:
w(new) = 0.5 − 0.01(−7.6) = 0.576
The weight moved upward because the gradient was negative.
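
The update rule can be repeated to watch the loss shrink. This is a minimal sketch in plain Python using the same example values; the five-step loop and the variable names are assumptions for illustration, not a training recipe.

```python
x, b, t = 2.0, 0.1, 3.0   # input, bias, target from the example
w = 0.5                   # initial weight
lr = 0.01                 # learning rate (eta)

weights = []
losses = []
for step in range(5):
    y = w * x + b                 # forward pass
    current_loss = (y - t) ** 2   # loss before the update
    grad = 2.0 * (y - t) * x      # dL/dw from the chain rule
    w = w - lr * grad             # gradient descent update
    weights.append(w)
    losses.append(current_loss)
    print(f"step {step}: loss={current_loss:.4f}  grad={grad:.4f}  w={w:.4f}")
```

The first update reproduces the 0.576 worked out by hand, and each later step has a smaller gradient because the prediction is closer to the target.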
**Important Topics to Keep in Mind**
A transformer can have billions of parameters, and the backward pass computes a gradient for EVERY parameter tensor.
This involves:
- matrix multiplications
- reductions
- accumulation
- activation derivatives
- chain-rule propagation
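The backward pass is often MORE expensive than the forward pass: for a linear layer it needs two matrix multiplications (one for the input gradient, one for the weight gradient) where the forward pass needs one.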
Gradients can become:
- VERY small
- VERY large
and training repeatedly accumulates them
---
FP32:
- 32 bits
- 8-bit exponent
- 23-bit mantissa
- ~7 decimal digits precision
Range (normal values):
~10^−38 → ~10^38
Good for:
- stable accumulation
- optimizer states
- gradient reductions
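
The "~7 decimal digits" figure can be observed by rounding Python floats (which are 64-bit) to FP32 and back. The sketch below uses only the standard `struct` module; `to_fp32` is a hypothetical helper name.

```python
import struct


def to_fp32(value):
    """Round a Python float to the nearest FP32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# 0.1 is stored with about 7 correct decimal digits.
print(to_fp32(0.1))            # 0.10000000149011612

# With a 23-bit mantissa, integers above 2**24 are no longer all representable.
print(to_fp32(16777217.0))     # 16777216.0

# Spacing of FP32 values just above 1.0 is 2**-23 (about 1.19e-07).
eps_fp32 = 2.0 ** -23
print(to_fp32(1.0 + eps_fp32) > 1.0)        # True: one step above 1.0
print(to_fp32(1.0 + eps_fp32 / 4) == 1.0)   # True: too small, rounds back to 1.0
```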
---
BF16:
- 16 bits
- 8-bit exponent
- 7-bit mantissa
Important:
- SAME exponent size as FP32
- LOWER precision
This means:
- range is good
- precision is poor (~2 to 3 decimal digits)
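
Because BF16 is the top 16 bits of the FP32 layout, it can be emulated in plain Python. The sketch below is an assumption-laden illustration: `to_bf16` is a hypothetical helper, it uses round-to-nearest-even, it routes the value through FP32 first, and it handles finite inputs only (no NaN handling).

```python
import struct


def to_bf16(value):
    """Round a finite Python float to the nearest BF16 value, returned as a float.

    Hypothetical helper: keeps the upper 16 bits of the FP32 bit pattern
    with round-to-nearest-even. NaN inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even mantissa
    bits = (bits + rounding_bias) & 0xFFFF0000    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Range is good: very large and very small magnitudes survive.
print(to_bf16(1e30), to_bf16(1e-30))

# Precision is poor: only 7 mantissa bits are kept.
print(to_bf16(3.14159265))        # 3.140625

# Spacing of BF16 values just above 1.0 is 2**-7 = 0.0078125.
print(to_bf16(1.0 + 2.0 ** -7))   # 1.0078125
print(to_bf16(1.0 + 2.0 ** -9))   # 1.0
```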
---
Why Backward Pass Often Uses FP32
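During backprop you repeatedly do:
g(total) = g1 + g2 + g3 + ...
Accumulation error becomes critical.
This is especially dangerous when:
gi ≪ g(total)
because BF16 may round the small terms away when they are added to a much larger running sum.
Example:
1.0 + 0.00097656
might become:
1.0
after rounding to BF16, because adjacent BF16 values near 1.0 are 0.0078125 apart. The value 0.00097656 on its own is still representable in BF16 (its exponent range matches FP32); it is lost only when added to the larger sum.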
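
The accumulation problem can be reproduced with emulated formats. In this sketch, `to_fp32` and `to_bf16` are hypothetical helpers built on the standard `struct` module (round-to-nearest-even, finite inputs only), and the choice of 1024 terms of 2^-10 is an assumption made so that the exact sum is 1.0.

```python
import struct


def to_fp32(value):
    """Round a Python float to the nearest FP32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def to_bf16(value):
    """Round a finite Python float to the nearest BF16 value (hypothetical helper)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


small = 2.0 ** -10   # 0.0009765625, exactly representable in BF16
n = 1024             # exact sum of n terms is 1.0

acc_bf16 = 0.0
acc_fp32 = 0.0
for _ in range(n):
    acc_bf16 = to_bf16(acc_bf16 + small)   # round the running sum to BF16
    acc_fp32 = to_fp32(acc_fp32 + small)   # round the running sum to FP32

print("single value in BF16:", to_bf16(small))        # survives on its own
print("1.0 + small in BF16: ", to_bf16(1.0 + small))  # rounds back to 1.0
print("BF16 accumulator:    ", acc_bf16)
print("FP32 accumulator:    ", acc_fp32)
```

The BF16 accumulator stalls once the running sum is large enough that each new term falls below half the spacing between adjacent BF16 values, while the FP32 accumulator reaches the exact total.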
So modern training often does:
| Operation | Precision |
| --------------------- | --------- |
| Forward activations | BF16 |
| Matrix multiply | BF16 |
| Gradient accumulation | FP32 |
| Optimizer state | FP32 |
This is called:
mixed precision training
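
As a rough illustration of why the table keeps some quantities in higher precision, the sketch below trains the single neuron from earlier in two ways: with a higher-precision copy of the weight that receives the updates, and with the weight stored in BF16 only. Everything here is an assumption for illustration: `to_bf16` is a hypothetical emulation helper, Python floats stand in for FP32, the learning rate 1e-4 is chosen to make the effect visible, and real frameworks implement this differently.

```python
import struct


def to_bf16(value):
    """Round a finite Python float to the nearest BF16 value (hypothetical helper)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


x, b, t = 2.0, 0.1, 3.0
lr = 1e-4

w_master = 0.5      # higher-precision weight that receives the updates
w_bf16_only = 0.5   # weight stored and updated purely in BF16

for _ in range(100):
    # Mixed precision: BF16 forward pass, higher-precision update.
    y = to_bf16(to_bf16(w_master) * x + b)
    grad = 2.0 * (y - t) * x
    w_master -= lr * grad

    # BF16 only: the update is rounded away when stored back in BF16.
    y_lo = to_bf16(w_bf16_only * x + b)
    grad_lo = 2.0 * (y_lo - t) * x
    w_bf16_only = to_bf16(w_bf16_only - lr * grad_lo)

print("higher-precision weight:", w_master)
print("BF16-only weight:       ", w_bf16_only)
```

Each update here is smaller than half the BF16 spacing near 0.5, so the BF16-only weight never moves, while the higher-precision weight keeps accumulating the small steps.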
---
optimum.quanto v0.2.7