## Tensor Types
1. Scalar (1 block)
2. Vector (3 blocks)
3. Matrix (3x3 = 9 blocks)
4. Tensor (3D or higher-dimensional array, with no set size on its sides)
> Tensors vs. Tensor Cores - A tensor is an n-dimensional array whose sides can have any length. Tensor Cores in NVIDIA CUDA architectures are a separate thing: hardware units that perform fused matrix multiply-accumulate on small fixed-size 2D matrices (2D tensors) directly in hardware, so a large matrix product is built from many such small tile operations.
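
As a rough illustration of the four shapes listed above, the sketch below uses plain nested Python lists. The `shape` helper and the example values are hypothetical and only meant to show how the number of dimensions grows; array libraries typically expose the same idea through a `shape` attribute.

```python
def shape(t):
    """Return the shape of a non-empty, rectangular nested list.

    A bare number is treated as a scalar with shape ().
    """
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]  # assumes every level is non-empty and rectangular
    return tuple(dims)


scalar = 5.0                                   # 0 dimensions, 1 block
vector = [1.0, 2.0, 3.0]                       # 1 dimension, 3 blocks
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]                     # 2 dimensions, 3x3 = 9 blocks
tensor3d = [[[0.0] * 4 for _ in range(3)]
            for _ in range(2)]                 # 3 dimensions, sides 2, 3 and 4

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor3d", tensor3d)]:
    print(name, shape(value))
```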

## Harden and Optimize Ternary Gradient Functions
1. Forward Pass
    Suppose a neuron:
    y=wx+b

    The forward pass computes the prediction.

    Example:
    x=2
    w=0.5
    b=0.1
    y=(0.5)(2)+0.1=1.1

2. Backward Pass (where gradients come from)
    Now we compute:
    ∂L/∂w
    
    using the chain rule.

    This is the core equation of backpropagation:
    ∂L/∂w = ∂L/∂y ⋅ ∂y/∂w

    Break it apart.

    Step A — Derivative of Loss
    Loss: L=(y−t)^2, where t is the target
    Derivative: ∂L/∂y =2(y−t)
    For our numbers (y=1.1, t=3): 2(1.1−3)=−3.8

    ---

    Step B — Derivative of Neuron
    Neuron: y=wx+b
    Derivative wrt w: ∂y/∂w =x
    Since x=2: ∂y/∂w =2

    ---

    Step C — Multiply Them
    ∂L/∂w =(−3.8)(2)=−7.6
    That is the gradient.
    Meaning: the gradient is negative, so increasing w reduces the loss, and its large magnitude means the loss is sensitive to w.

3. Gradient Descent Update
    Update rule:
    w(t+1) = w(t) −η(∂L/∂w)

    If:
    w=0.5
    learning rate η=0.01

    Then:
    w(new) =0.5−0.01(−7.6)=0.576
    The weight moved upward because the gradient was negative.
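
The three steps above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than a framework implementation: the variable names are made up for illustration, and the target `t = 3` is inferred from the arithmetic in Step A.

```python
# Values from the worked example; t = 3 is inferred from Step A.
x, w, b = 2.0, 0.5, 0.1
t = 3.0
lr = 0.01  # learning rate (eta)


def forward(weight):
    """Forward pass of the single neuron: y = w*x + b."""
    return weight * x + b


def loss_fn(y):
    """Squared-error loss: L = (y - t)^2."""
    return (y - t) ** 2


# 1. Forward pass
y = forward(w)

# 2. Backward pass via the chain rule
dL_dy = 2.0 * (y - t)    # Step A: derivative of the loss
dy_dw = x                # Step B: derivative of the neuron wrt w
dL_dw = dL_dy * dy_dw    # Step C: multiply them

# 3. Gradient descent update
w_new = w - lr * dL_dw

print(f"y = {y:.3f}, dL/dw = {dL_dw:.3f}, w_new = {w_new:.3f}")
print(f"loss before = {loss_fn(y):.4f}, loss after = {loss_fn(forward(w_new)):.4f}")
```

Evaluating the loss again after the update is a quick sanity check that the step moved the weight in a direction that lowers the loss.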


**Important Topics to Keep in Mind**
A transformer can have billions of parameters, and the backward pass computes a gradient for EVERY parameter tensor.
This involves:
- matrix multiplications
- reductions
- accumulation
- activation derivatives
- chain-rule propagation

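The backward pass is often MORE expensive than the forward pass (roughly twice the compute, because it needs gradients with respect to both the inputs and the weights).
Gradients can become:
- VERY small
- VERY large

and training repeatedly accumulates them.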
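
One way to see how gradients end up very small or very large is to multiply many local derivatives together, as chain-rule propagation does through a deep network. The sketch below is a toy model under stated assumptions: the function name, the depth of 50, and the per-layer factors 0.5 and 1.5 are hypothetical values chosen only to show the trend.

```python
def chained_gradient(local_derivative: float, depth: int) -> float:
    """Multiply the same local derivative `depth` times, as the chain rule would."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_derivative
    return grad


depth = 50  # hypothetical number of chained layers
vanishing = chained_gradient(0.5, depth)  # factors below 1 shrink the gradient
exploding = chained_gradient(1.5, depth)  # factors above 1 blow it up

print(f"factor 0.5 over {depth} layers: {vanishing:.3e}")
print(f"factor 1.5 over {depth} layers: {exploding:.3e}")
```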

---

FP32:
- 32 bits
- 8-bit exponent
- 23-bit mantissa
- ~7 decimal digits precision

Range (normal values, roughly):
    10^−38 → 10^38

Good for:
- stable accumulation
- optimizer states
- gradient reductions
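
The precision and range figures above can be checked with the standard library alone. This is a small sketch, assuming a hypothetical helper `to_fp32` that rounds a Python float (which is 64-bit) to the nearest FP32 value by packing and unpacking it with `struct`.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a 64-bit Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


eps = 2.0 ** -23                   # spacing of FP32 values just above 1.0 (23-bit mantissa)
kept = to_fp32(1.0 + eps)          # one full step: survives rounding
lost = to_fp32(1.0 + eps / 4)      # a quarter of a step: rounds back to 1.0

# Relative rounding error when storing 0.1, which is not exactly representable.
rel_err = abs(to_fp32(0.1) - 0.1) / 0.1

smallest_normal = 2.0 ** -126                   # about 1.18e-38
largest = (2.0 - 2.0 ** -23) * 2.0 ** 127       # about 3.4e38

print(f"eps = {eps:.3e}, relative error of 0.1 = {rel_err:.3e}")
print(f"range: {smallest_normal:.3e} to {largest:.3e}")
```

A spacing of about 1.2e-7 near 1.0 is where the "~7 decimal digits" figure comes from.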

---

BF16:
- 16 bits
- 8-bit exponent
- 7-bit mantissa

Important:
- SAME exponent size as FP32
- LOWER precision (about 2 to 3 decimal digits)

This means:
- range is good
- precision is poor
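
To make "range is good, precision is poor" concrete, BF16 can be emulated in plain Python by keeping only the top 16 bits of the FP32 bit pattern. The sketch below assumes a hypothetical helper `to_bf16` that uses round-to-nearest-even and handles finite inputs only; it is an emulation for illustration, not how any particular library implements the conversion.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (round-to-nearest-even).

    BF16 is the upper 16 bits of the FP32 encoding: sign, 8-bit exponent,
    7-bit mantissa. NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even mantissa
    bits = (bits + rounding_bias) & 0xFFFF0000    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Precision: only about 8 significant bits survive.
pi_bf16 = to_bf16(math.pi)
kept = to_bf16(1.0 + 2.0 ** -7)    # one BF16 step above 1.0
lost = to_bf16(1.0 + 2.0 ** -9)    # a quarter of a step: rounds back to 1.0

# Range: same exponent width as FP32, so very large and very small values survive.
big = to_bf16(1e38)
tiny = to_bf16(2.0 ** -126)

print(f"pi in BF16 = {pi_bf16}")
print(f"1e38 in BF16 = {big:.4e}, 2^-126 in BF16 = {tiny:.4e}")
```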

---

Why Backward Pass Often Uses FP32

During backprop you repeatedly do:
g(total)=g1+g2+g3+...

Accumulation error becomes critical.

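This is especially dangerous when:
gi≪g(total)

because BF16 keeps only about 8 significant bits, so a term much smaller than the running sum is rounded away.

Example:
1.0 + 0.00097656

becomes:
1.0

after rounding to BF16. The value 0.00097656 (about 2^−10) is representable in BF16 on its own, but the spacing between BF16 values near 1.0 is 2^−7 = 0.0078125, so the addition has no effect.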
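
The effect of the accumulator's precision can be simulated directly. The sketch below is an emulation under stated assumptions: `to_bf16` and `to_fp32` are hypothetical helpers that round to each format with round-to-nearest-even, and the gradient value 2^−10 and the 4096 steps are arbitrary choices that make the exact sum 4.0.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a 64-bit Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


g = 2.0 ** -10     # 0.0009765625, exactly representable in BF16
steps = 4096       # exact sum: 4096 * 2^-10 = 4.0

acc_bf16 = 0.0     # accumulator rounded to BF16 after every addition
acc_fp32 = 0.0     # accumulator kept in FP32, gradients still stored in BF16
for _ in range(steps):
    acc_bf16 = to_bf16(acc_bf16 + g)
    acc_fp32 = to_fp32(acc_fp32 + to_bf16(g))

print(f"BF16 accumulator: {acc_bf16}")
print(f"FP32 accumulator: {acc_fp32}")
```

In this emulation the BF16 accumulator stops growing once it reaches 0.25, because each further addition lands exactly halfway between two BF16 values and rounds back down, while the FP32 accumulator reaches the exact total.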
So modern training often does:

| Operation             | Precision                              |
| --------------------- | -------------------------------------- |
| Forward activations   | BF16                                   |
| Matrix multiply       | BF16 inputs (often FP32 accumulation)  |
| Gradient accumulation | FP32                                   |
| Optimizer state       | FP32                                   |

This is called:
mixed precision training

---



optimum.quanto v0.2.7