	## Tensor Types
	1. Scalar (1 block)
	2. Vector (3 blocks)
	3. Matrix (3x3 = 9 blocks)
	4. Tensor (3D or higher-dimensional array; no fixed size along any side)
	> Tensor Cores - The Tensor Cores in NVIDIA CUDA architectures are hardware units rather than a data type, so they do not have "sides" in the traditional geometric sense. They are designed to perform matrix multiply-accumulate operations directly on small 2D matrix tiles (often referred to as 2D tensors), executing many fused multiply-adds per clock cycle.
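
The four types above differ only in rank (number of axes). A minimal sketch using plain nested lists; the `shape` helper is hypothetical and assumes non-empty, regularly shaped lists:

```python
def shape(x):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

scalar = 7.0                                                # rank 0: one value
vector = [1.0, 2.0, 3.0]                                    # rank 1: 3 values
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]                                  # rank 2: 3x3 = 9 values
tensor = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # rank 3: 2x3x4

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    print(name, "shape:", shape(value), "rank:", len(shape(value)))
```

The sizes 2x3x4 for the rank-3 case are arbitrary; a tensor can have any length along each axis.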

	## Harden and Optimize Ternary Gradient Functions
	1. Forward Pass
	Suppose a neuron:
	y=wx+b

	Forward pass computes prediction.

	Example:
	x=2
	w=0.5
	b=0.1
	y=(0.5)(2)+0.1=1.1
	target t=3
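
The forward pass for this single neuron can be sketched in a few lines. The function name `forward` is an illustrative choice, not something defined in these notes:

```python
def forward(x, w, b):
    """Forward pass of a single linear neuron: y = w*x + b."""
    return w * x + b

x, w, b = 2.0, 0.5, 0.1
y = forward(x, w, b)
print("prediction y =", y)
```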

	2. Backward Pass (Where gradients come from)
	Now we compute:
	∂L/∂w

	using the chain rule.

	This is the core equation of backpropagation:
	∂L/∂w = ∂L/∂y ⋅ ∂y/∂w

	Break it apart.

	Step A — Derivative of Loss
	Loss: L=(y−t)^2
	Derivative: ∂L/∂y =2(y−t)
	For our numbers: 2(1.1−3)=−3.8

	---

	Step B — Derivative of Neuron
	Neuron: y=wx+b
	Derivative wrt w: ∂y/∂w =x
	Since x=2: ∂y/∂w =2

	---

	Step C — Multiply Them
	∂L/∂w =(−3.8)(2)=−7.6
	That is the gradient.
	Meaning: the gradient is negative, so increasing w reduces the loss, and its large magnitude means the loss is sensitive to w.
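
Steps A to C can be checked numerically. This is a small sketch that assumes the target t=3 implied by the numbers in Step A, and compares the chain-rule result with a central finite difference (the helper names are illustrative):

```python
def forward(x, w, b):
    return w * x + b

def loss(y, t):
    return (y - t) ** 2

x, w, b, t = 2.0, 0.5, 0.1, 3.0

# Chain rule: dL/dw = dL/dy * dy/dw
y = forward(x, w, b)
dL_dy = 2.0 * (y - t)      # Step A
dy_dw = x                  # Step B
dL_dw = dL_dy * dy_dw      # Step C

# Independent check: central finite difference on w
eps = 1e-6
numeric = (loss(forward(x, w + eps, b), t)
           - loss(forward(x, w - eps, b), t)) / (2.0 * eps)

print("analytic gradient:", dL_dw)
print("numeric gradient: ", numeric)
```

The two values should agree to several decimal places, which is a common sanity check for hand-derived gradients.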

	3. Gradient Descent Update
	Update rule:
	w(t+1) = w(t) −η(∂L/∂w)

	If:
	w=0.5
	learning rate η=0.01

	Then:
	w(new) =0.5−0.01(−7.6)=0.576
	The weight moved upward because the gradient was negative.
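
A short sketch of the update rule applied repeatedly to the same neuron, assuming the target t=3 from the backward-pass example. The step count of 5 is arbitrary:

```python
x, b, t = 2.0, 0.1, 3.0
lr = 0.01  # learning rate (eta)

def loss_at(w):
    return (w * x + b - t) ** 2

def grad_w(w):
    # dL/dw = 2*(y - t) * x
    return 2.0 * (w * x + b - t) * x

w = 0.5
history = [(w, loss_at(w))]
for _ in range(5):
    w = w - lr * grad_w(w)          # gradient descent update
    history.append((w, loss_at(w)))

for step, (w_i, loss_i) in enumerate(history):
    print(f"step {step}: w={w_i:.6f} loss={loss_i:.6f}")
```

The first update reproduces the hand calculation (w moves from 0.5 to 0.576), and the loss shrinks at every step for this learning rate.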


	Important Topics to Keep in Mind
	A large transformer can have billions of parameters, and the backward pass computes a gradient for EVERY trainable tensor.
	This involves:
	- matrix multiplications
	- reductions
	- accumulation
	- activation derivatives
	- chain-rule propagation

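	The backward pass is often MORE expensive than the forward pass (roughly twice the compute for a typical dense layer, since gradients are needed with respect to both the inputs and the weights).
	Gradients can become:
	- VERY small
	- VERY large
	and training repeatedly accumulates them, so rounding errors compound.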
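
Why gradients drift toward very small or very large values can be seen from the chain rule alone. The sketch below is a deliberate simplification: it assumes every layer contributes the same scalar local derivative, which real networks do not, and the function name is hypothetical:

```python
def chained_gradient(local_derivative, depth):
    """Multiply the same local derivative through `depth` layers."""
    g = 1.0
    for _ in range(depth):
        g *= local_derivative
    return g

small = chained_gradient(0.5, 50)   # factors below 1 shrink the gradient
large = chained_gradient(1.5, 50)   # factors above 1 blow it up

print("50 layers, factor 0.5:", small)
print("50 layers, factor 1.5:", large)
```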

	---

	FP32:
	- 32 bits
	- 8-bit exponent
	- 23-bit mantissa
	- ~7 decimal digits precision

	Range (normal values):
	~10^−38 → ~3.4×10^38

	Good for:
	- stable accumulation
	- optimizer states
	- gradient reductions
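
The FP32 precision limit can be observed without any ML library. This sketch round-trips Python floats (which are FP64) through a 4-byte IEEE 754 encoding using the standard `struct` module; `to_fp32` is a hypothetical helper name:

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

eps32 = 2.0 ** -23                 # spacing between FP32 values just above 1.0

kept = to_fp32(1.0 + eps32)        # representable: one step above 1.0
lost = to_fp32(1.0 + 2.0 ** -25)   # below half a step: rounds back to 1.0
stored = to_fp32(0.1)              # 0.1 is stored with about 7 good digits

print("1 + 2^-23 ->", kept)
print("1 + 2^-25 ->", lost)
print("0.1       ->", repr(stored))
```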

	---

	BF16:
	- 16 bits
	- 8-bit exponent
	- 7-bit mantissa

	Important:
	- SAME exponent size as FP32
	- LOWER precision

	This means:
	- range is good (about the same as FP32)
	- precision is poor (~2 to 3 decimal digits)
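
BF16 is the top 16 bits of the FP32 encoding, so it can be emulated by rounding an FP32 bit pattern. The following is a sketch under that assumption, using round-to-nearest-even and ignoring NaN and overflow handling; `to_bf16` is a hypothetical helper, not a library function:

```python
import struct

def to_bf16(x):
    """Round a float to the nearest BF16 value (returned as a Python float).

    Sketch only: finite inputs, no NaN/overflow handling.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)       # round to nearest, ties to even
    bits = (bits + bias) & 0xFFFF0000        # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

near_one_kept = to_bf16(1.0 + 2.0 ** -7)     # one BF16 step above 1.0
near_one_lost = to_bf16(1.0 + 2.0 ** -8)     # half a step: rounds back to 1.0
tiny = to_bf16(1e-30)                        # tiny magnitude survives (good range)
coarse = to_bf16(3.14159)                    # only ~3 significant digits survive

print(near_one_kept, near_one_lost, tiny, coarse)
```

The tiny value stays nonzero because the exponent field matches FP32, while values near 1.0 can only move in steps of 2^-7.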

	---

	Why Backward Pass Often Uses FP32

	During backprop you repeatedly do:
	g(total)=g1+g2+g3+...

	Accumulation error becomes critical.

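	This is especially dangerous when:
	gi≪g(total)

	because BF16 may round the small addends away.

	Example:
	1 + 0.00097656

	might become:
	1

	after rounding to BF16, because 0.00097656 (about 2^−10) is smaller than half the BF16 spacing near 1 (2^−7 = 0.0078125). The value 0.00097656 on its own is still representable in BF16; only its contribution to the larger sum is lost.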
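
The accumulation problem can be reproduced with a running sum. This sketch emulates BF16 and FP32 rounding with the standard `struct` module (the helpers `to_bf16` and `to_fp32` are hypothetical, finite inputs only) and adds 1024 copies of 2^-10, whose exact sum is 1.0. Note that 2^-10 by itself is representable in BF16; what gets lost is its effect on a larger running total:

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round to the nearest BF16 value (ties to even); sketch, no NaN handling."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

g = 2.0 ** -10          # 0.0009765625, one small gradient contribution
n = 1024                # exact total: n * g = 1.0

acc_bf16 = 0.0
acc_fp32 = 0.0
for _ in range(n):
    acc_bf16 = to_bf16(acc_bf16 + g)   # accumulator stored in BF16
    acc_fp32 = to_fp32(acc_fp32 + g)   # accumulator stored in FP32

print("BF16 accumulator:", acc_bf16)
print("FP32 accumulator:", acc_fp32)
```

The BF16 sum stops growing once the running total is large enough that each addend falls at or below half the local spacing, while the FP32 sum reaches the exact total.
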
	So modern training often does:

	| Operation             | Precision |
	| --------------------- | --------- |
	| Forward activations   | BF16      |
	| Matrix multiply       | BF16      |
	| Gradient accumulation | FP32      |
	| Optimizer state       | FP32      |

	This is called:
	mixed precision training
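
A toy illustration of why the update side is kept in higher precision, reusing the single neuron from the backward-pass example (target t=3). It assumes a common arrangement in which a higher-precision master copy of the weight is updated and a BF16 copy is used for the forward pass; that arrangement, the small learning rate, and the helper `to_bf16` are assumptions of this sketch, and a Python float stands in for the FP32 master weight:

```python
import struct

def to_bf16(x):
    """Round to the nearest BF16 value (ties to even); sketch, no NaN handling."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x, b, t = 2.0, 0.1, 3.0
lr = 1e-4   # small step, so each update is well below the BF16 spacing near 0.5

def grad_w(w):
    y = to_bf16(w) * x + b          # forward pass uses the BF16 copy of w
    return 2.0 * (y - t) * x

w_master = 0.5      # updated in higher precision
w_bf16_only = 0.5   # stored and updated in BF16 only

for _ in range(10):
    w_master = w_master - lr * grad_w(w_master)
    w_bf16_only = to_bf16(w_bf16_only - lr * grad_w(w_bf16_only))

print("master weight:   ", w_master)
print("BF16-only weight:", w_bf16_only)
```

With these numbers the BF16-only weight never leaves 0.5, because every update is rounded away, while the master weight keeps moving toward lower loss.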

	---



	optimum.quanto v0.2.7