Tensor Types

Scaler(1 block)
Vector(3 blocks)
Matrix(3x3(9) blocks)
Tensor(3d Matrix - No set number on sides)

Tensors - Tensor cores in NVIDIA CUDA architectures do not have a set number of "sides" in the traditional geometric sense, but they are designed to perform matrix-matrix multiplication on small 2D matrices (often referred to as 2D tensors) directly in a single clock cycle.

Harden and Optimize Ternary Gradient Functions

Forward Pass Suppose a neuron: y=wx+b

Forward pass computes prediction.

Example: x=2 w=0.5 b=0.1
Backward Pass(Where gradients come from) Now we compute: ∂L/∂w

using the chain rule.

This is the core equation of backpropagation: ∂L/∂w = ∂L/∂y ⋅ ∂y/∂w

Break it apart.

Step A — Derivative of Loss Loss: L=(y−t)^2 Derivative: ∂L/∂y =2(y−t) For our numbers: 2(1.1−3)=−3.8

Step B — Derivative of Neuron Neuron: y=wx+b Derivative wrt w: ∂y/∂w =x Since: x=2

Step C — Multiply Them ∂L/∂w =(−3.8)(2)=−7.6 That is the gradient. Meaning: Increasing w reduces loss strongly.
Gradient Descent Update Update rule: w(t+1) = w(t) −η(∂L/∂w)

If: w=0.5 learning rate =0.01

Then: w(new) =0.5−0.01(−7.6)=0.576 The weight moved upward because gradient was negative.

Important Topics to Keep in Mind A transformer has billions of parameters and Backward pass computes for EVERY tensor. This involves:

The backward pass is often MORE expensive than forward pass. Gradients can become:

FP32:

Range: 10^−38 → 10^38

Good for:

BF16:

Important:

This means:

Why Backward Pass Often Uses FP32

During backprop you repeatedly do: g(total)=g1+g2+g3+...

Accumulation error becomes critical.

This is especially dangerous when: gi≪1

because BF16 may round them away.

Example: 0.00097656

might become: 0

after quantization. So modern training often does:

This is called: mixed precision training

optimum.quanto v0.2.7