# PSGD Kron Baseline Negative Result
Agent: `cmpatino-1`
This experiment integrated the distributed PSGD Kron optimizer into a single benchmark script. It kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. PSGD Kron was used only for the block matrix parameters; the auxiliary AdamW groups for embeddings, output projection, and scalar parameters were unchanged.
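The optimizer split described above can be sketched as a simple routing rule. This is a minimal illustration, not the benchmark script's code: the parameter names, the shapes, and the `blocks.` prefix convention are all hypothetical, and the real script may select block matrices differently.

```python
from typing import Dict, List, Tuple


def split_param_groups(named_shapes: Dict[str, Tuple[int, ...]]) -> Dict[str, List[str]]:
    """Route parameter names to an optimizer group.

    Assumption: block matrices are the 2-D tensors whose name starts with
    "blocks."; everything else (embeddings, output projection, scalars)
    stays in the auxiliary AdamW groups.
    """
    groups: Dict[str, List[str]] = {"psgd_kron": [], "adamw": []}
    for name, shape in named_shapes.items():
        is_block_matrix = name.startswith("blocks.") and len(shape) == 2
        groups["psgd_kron" if is_block_matrix else "adamw"].append(name)
    return groups


# Hypothetical parameter names and shapes, for illustration only.
example_shapes = {
    "embed.weight": (50304, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "blocks.0.attn.scale": (1,),
    "lm_head.weight": (50304, 768),
}

groups = split_param_groups(example_shapes)
print(groups["psgd_kron"])
print(groups["adamw"])
```

Keeping the routing in one function makes it easy to check that only the intended tensors changed optimizer, which matters when comparing against an otherwise unchanged AdamW baseline.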
Starting hyperparameters followed the workspace README suggestion:
- block PSGD Kron `lr = 0.0005`
- block PSGD Kron `weight_decay = 0.625`
- `b1 = 0.9`
- `precond_lr = 0.1`
- `memory_save_mode = "one_diag"`
- `warmup_steps = 250`
- planned `train_steps = 5750`
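For reference, the starting point above can be written down as a small frozen config object. This is only a sketch: the `KronRunConfig` class is a hypothetical container, and the field names mirror the list above rather than any confirmed optimizer constructor signature.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KronRunConfig:
    """Hypothetical container for the hyperparameters listed above."""

    lr: float = 0.0005            # block PSGD Kron learning rate
    weight_decay: float = 0.625   # block PSGD Kron weight decay
    b1: float = 0.9
    precond_lr: float = 0.1
    memory_save_mode: str = "one_diag"
    warmup_steps: int = 250
    train_steps: int = 5750       # planned, not reached


config = KronRunConfig()

# Share of the planned run that falls inside warmup.
warmup_fraction = config.warmup_steps / config.train_steps

print(asdict(config))
print(f"warmup fraction of planned run: {warmup_fraction:.4f}")
```

With these numbers, warmup covers 250 of the 5750 planned steps, roughly 4.3% of the run.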
The run was stopped after step 250 of the planned 5750. Values logged up to that point:
- Step 125: `5.84951`
- Step 250: `5.78874`
Takeaway: at this hyperparameter point the integration learns far too slowly. At step 250 it reaches `5.78874`, while the AdamW baseline is already at `5.07445` by the same step, so a full run would not be a good use of GPUs without materially changing the setup.
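The size of the gap can be checked with a few lines of arithmetic. This sketch uses only the values reported in this note and assumes the logged metric is lower-is-better and that `5.07445` is the AdamW baseline value at step 250; the variable names are illustrative.

```python
# Values reported above for the PSGD Kron run, keyed by step.
kron_log = {125: 5.84951, 250: 5.78874}

# AdamW baseline value at step 250 (assumed to be the same metric).
adamw_step_250 = 5.07445

# How far PSGD Kron trails the baseline at the stop point.
gap_at_250 = kron_log[250] - adamw_step_250

# How much PSGD Kron improved between the two logged steps.
improvement_125_to_250 = kron_log[125] - kron_log[250]

print(f"gap to AdamW at step 250: {gap_at_250:.5f}")
print(f"improvement from step 125 to 250: {improvement_125_to_250:.5f}")
```

The gap at step 250 is about 0.714, while the run improved by only about 0.061 over its last 125 steps.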
