
Tensor Types

  1. Scalar (rank 0: 1 block)
  2. Vector (rank 1: e.g. 3 blocks)
  3. Matrix (rank 2: e.g. 3x3 = 9 blocks)
  4. Tensor (rank 3 or higher, e.g. a 3D stack of matrices; no set number of sides)

    Tensor cores - Despite the name, tensor cores in NVIDIA CUDA architectures do not operate on tensors with an arbitrary number of "sides". They are hardware units that perform matrix multiply-accumulate (D = A·B + C) on small, fixed-size 2D matrices (often referred to as 2D tensors) directly in hardware, instead of issuing one scalar multiply-add at a time.
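The four types above differ only in rank (the number of axes). The sketch below uses plain nested Python lists rather than any tensor library; `shape` and `mma` are hypothetical helper names chosen for illustration, and `mma` only mimics the D = A·B + C tile operation in software.

```python
def shape(t):
    """Return the shape of a rectangular nested list as a tuple; () for a scalar."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def mma(a, b, c):
    """Multiply-accumulate D = A @ B + C on small square tiles (software sketch)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j]
             for j in range(n)]
            for i in range(n)]


scalar = 5.0                      # rank 0
vector = [1.0, 2.0, 3.0]          # rank 1
matrix = [[1.0, 0.0, 0.0],        # rank 2 (3x3 identity)
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
tensor3d = [matrix, matrix]       # rank 3: two stacked 3x3 matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("tensor3d", tensor3d)]:
    print(name, "rank", len(shape(t)), "shape", shape(t))

# identity @ identity + identity gives 2 on the diagonal
d = mma(matrix, matrix, matrix)
print(d)
```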

Harden and Optimize Ternary Gradient Functions

  1. Forward Pass

    Suppose a neuron: y = wx + b

    The forward pass computes the prediction.

    Example: x = 2, w = 0.5, b = 0.1, so y = (0.5)(2) + 0.1 = 1.1. The target used in the steps below is t = 3.

  2. Backward Pass (where gradients come from)

    Now we compute ∂L/∂w using the chain rule.

    This is the core equation of backpropagation: ∂L/∂w = ∂L/∂y ⋅ ∂y/∂w

    Break it apart.

    Step A: Derivative of Loss. Loss: L = (y − t)^2. Derivative: ∂L/∂y = 2(y − t). For our numbers: 2(1.1 − 3) = −3.8

    Step B: Derivative of Neuron. Neuron: y = wx + b. Derivative wrt w: ∂y/∂w = x. Since x = 2: ∂y/∂w = 2

    Step C: Multiply Them. ∂L/∂w = (−3.8)(2) = −7.6. That is the gradient. Meaning: the gradient is negative, so increasing w reduces the loss, and the large magnitude means it does so strongly.
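Steps A to C can be checked numerically. This is a minimal sketch in plain Python, assuming the target t = 3 implied by Step A; `forward` and `loss` are illustrative names, and the finite-difference check is added here only as a sanity test of the chain-rule result.

```python
def forward(w, x, b):
    """Single neuron: y = w*x + b."""
    return w * x + b


def loss(y, t):
    """Squared error: L = (y - t)^2."""
    return (y - t) ** 2


x, w, b, t = 2.0, 0.5, 0.1, 3.0   # t = 3 is the target implied by Step A

y = forward(w, x, b)              # forward pass, approximately 1.1

dL_dy = 2.0 * (y - t)             # Step A: approximately -3.8
dy_dw = x                         # Step B: 2
dL_dw = dL_dy * dy_dw             # Step C: approximately -7.6

# Sanity check: central finite difference of the loss with respect to w
eps = 1e-6
numeric = (loss(forward(w + eps, x, b), t)
           - loss(forward(w - eps, x, b), t)) / (2 * eps)

print("y =", y)
print("analytic dL/dw =", dL_dw)
print("numeric  dL/dw =", numeric)
```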

  3. Gradient Descent Update

    Update rule: w(t+1) = w(t) − η(∂L/∂w)

    If: w = 0.5, learning rate η = 0.01

    Then: w(new) = 0.5 − 0.01(−7.6) = 0.576. The weight moved upward because the gradient was negative.
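The update rule can be sketched as a short loop. The code assumes the same x = 2, b = 0.1, and t = 3 as the worked example; `grad_w` is a hypothetical helper that just applies the chain-rule result 2(y − t)·x.

```python
def grad_w(w, x, b, t):
    """dL/dw for L = (w*x + b - t)^2, via the chain rule."""
    return 2.0 * (w * x + b - t) * x


x, b, t = 2.0, 0.1, 3.0
w, lr = 0.5, 0.01                     # lr is the learning rate (eta)

# One step, matching the worked example: approximately 0.576
w_new = w - lr * grad_w(w, x, b, t)
print("w after one step:", w_new)

# A few more steps: the loss should shrink at every step
losses = []
for _ in range(5):
    losses.append((w * x + b - t) ** 2)
    w = w - lr * grad_w(w, x, b, t)
print("losses:", losses)
```

With this learning rate the loss decreases monotonically; a much larger rate would overshoot the minimum.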

Important Topics to Keep in Mind

A transformer can have billions of parameters, and the backward pass computes a gradient for EVERY parameter tensor. This involves:

  • matrix multiplications
  • reductions
  • accumulation
  • activation derivatives
  • chain-rule propagation

The backward pass is often MORE expensive than the forward pass, because gradients are needed with respect to both the inputs and the weights of each layer. Gradients can become:

  • VERY small
  • VERY large

and training repeatedly accumulates them.

FP32:

  • 32 bits
  • 8-bit exponent
  • 23-bit mantissa
  • ~7 decimal digits precision

Range: ~10^−38 → ~10^38

Good for:

  • stable accumulation
  • optimizer states
  • gradient reductions
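To make the FP32 layout concrete, the sketch below splits a value into its sign, exponent, and mantissa fields using only the standard `struct` module. `fp32_fields` and `to_fp32` are illustrative helper names, not part of any library.

```python
import struct


def fp32_fields(value):
    """Split a value, stored as IEEE 754 FP32, into (sign, exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa


def to_fp32(value):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


print(fp32_fields(1.0))     # (0, 127, 0)
print(fp32_fields(-7.6))    # sign bit set, exponent 129 (i.e. 2^2)

# 23 mantissa bits: a step of 2^-23 above 1.0 survives, 2^-25 does not
print(to_fp32(1.0 + 2.0 ** -23) > 1.0)    # True
print(to_fp32(1.0 + 2.0 ** -25) == 1.0)   # True
```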

BF16:

  • 16 bits
  • 8-bit exponent
  • 7-bit mantissa

Important:

  • SAME exponent size as FP32
  • LOWER precision

This means:

  • range is good
  • precision is poor
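The range-versus-precision trade-off can be seen with a small BF16 emulation. This is a sketch, not a library API: `to_bf16` is a hypothetical helper that keeps the top 16 bits of the FP32 encoding with round-to-nearest-even, and it ignores NaN and infinity handling.

```python
import struct


def to_bf16(value):
    """Round a finite float to the nearest BF16 value, returned as a Python float.

    BF16 is the upper 16 bits of the FP32 encoding, so we round away the
    lower 16 bits (round-to-nearest-even). NaN/inf are not handled here.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]


# Range is good: a value near the top of the FP32 range stays finite
print(to_bf16(1e38))

# A small value on its own is fine: 2^-10 is exactly representable
print(to_bf16(2.0 ** -10) == 2.0 ** -10)   # True

# Precision is poor: only 7 mantissa bits
print(to_bf16(1.001))                       # 1.0
print(to_bf16(3.14159265))                  # 3.140625
```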

Why Backward Pass Often Uses FP32

During backprop you repeatedly do: g(total) = g1 + g2 + g3 + ...

Accumulation error becomes critical.

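This is especially dangerous when each gi is much smaller than the running total (gi ≪ g(total)), because BF16 may round the addition away.

Example: 0.00097656 (2^−10) is exactly representable in BF16 by itself, but adding it to a running total of 1.0 gives 1.0 again. BF16 values just above 1.0 are spaced 2^−7 = 0.0078125 apart, so the small gradient's contribution becomes 0 after rounding.

So modern training often does: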

| Operation             | Precision |
|-----------------------|-----------|
| Forward activations   | BF16      |
| Matrix multiply       | BF16      |
| Gradient accumulation | FP32      |
| Optimizer state       | FP32      |

This is called mixed precision training.
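The effect of the accumulator's precision can be sketched in plain Python. The code emulates BF16 and FP32 rounding in software (the helpers `to_bf16` and `to_fp32` are hypothetical and ignore NaN/inf), then adds the same BF16 gradient value 1024 times to a running total that starts at 1.0.

```python
import struct


def to_fp32(value):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


def to_bf16(value):
    """Round a finite float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]


small_grad = to_bf16(2.0 ** -10)   # 0.0009765625, stored in BF16
steps = 1024

acc_bf16 = 1.0   # running total kept in BF16
acc_fp32 = 1.0   # running total kept in FP32

for _ in range(steps):
    acc_bf16 = to_bf16(acc_bf16 + small_grad)   # result rounded to BF16
    acc_fp32 = to_fp32(acc_fp32 + small_grad)   # result rounded to FP32

print("BF16 accumulator:", acc_bf16)   # 1.0, every addition was rounded away
print("FP32 accumulator:", acc_fp32)   # 2.0, the exact answer
```

The gradient values are identical in both loops; only the precision of the running total differs, which is why the table keeps accumulation and optimizer state in FP32.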

