# neox-ckpt-pythia-14m-seed0
This repository contains GPT-NeoX format checkpoints for a 14M parameter Pythia model. Checkpoints are stored as branches (e.g., `step0`, `step1000`, `step143000`), with 154 checkpoints total matching the standard Pythia checkpoint schedule.
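The standard Pythia schedule (step 0, ten log-spaced steps from 1 to 512, then every 1000 steps up to 143000) accounts for the 154 branches; a minimal sketch, assuming that schedule:

```python
# Reconstruct the standard Pythia checkpoint schedule:
# step 0, log-spaced steps 1..512, then every 1000 steps to 143000.
def pythia_checkpoint_steps():
    steps = [0]
    steps += [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
    steps += list(range(1000, 143001, 1000))  # 1000, 2000, ..., 143000
    return steps

branches = [f"step{s}" for s in pythia_checkpoint_steps()]
print(len(branches))  # 154, matching the branch count above
```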
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 14M |
| Layers | 6 |
| Hidden size | 128 |
| Attention heads | 4 |
| Sequence length | 2048 |
| Vocab size | 50304 |
| Training steps | 143,000 |
| Batch size | 32 per GPU |
| Learning rate | 1e-3 (cosine decay, 10% warmup) |
| Optimizer | Adam (betas 0.9, 0.95) |
| Precision | FP16 |
| Init method | small_init (weights), wang_init (output layer) |
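As a sanity check, the parameter count can be derived from the dimensions in the table. A rough sketch, assuming untied input/output embeddings (as in Pythia) and the standard GPT-NeoX layer layout with biases and layer norms:

```python
vocab, d, n_layers, d_ff = 50304, 128, 6, 512  # d_ff = 4 * hidden size

embed = vocab * d    # input embedding
unembed = vocab * d  # output projection (untied from the input embedding)

per_layer = (
    d * 3 * d + 3 * d   # fused QKV projection + bias
    + d * d + d         # attention output projection + bias
    + d * d_ff + d_ff   # MLP up-projection + bias
    + d_ff * d + d      # MLP down-projection + bias
    + 4 * d             # two layer norms (weight + bias each)
)

total = embed + unembed + n_layers * per_layer + 2 * d  # + final layer norm
print(total)  # 14067712, consistent with the "14M" label
```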
## Training Data
The training config specifies `/weka/pile/pile_20B_tokenizer_text_document` as the training data path, which is the standard (non-deduplicated) Pile.
## Relationship to Other Pythia Models
This model was trained as part of an "extra seeds" project (`wandb_project: pythia-extra-seeds`). The config specifies `seed: 1234`, the same seed used for the original Pythia models.
Despite sharing the same seed value and init method, the step 0 (pre-training) weights in this repository do not match the step 0 weights of any published Pythia model. We investigated this by comparing against EleutherAI/pythia-14m-deduped, which was trained on the deduplicated Pile but uses the same architecture and seed.
### Step 0 comparison with pythia-14m-deduped
At step 0, before any training occurs, the two models should reflect only their random initialization. We found:
- 50 of 76 weight tensors are bitwise identical. These are all biases (initialized to zero) and layer norm parameters (initialized to ones/zeros).
- 26 weight tensors differ. These are the randomly initialized weight matrices: embeddings, attention projections, and MLP projections.
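The comparison above can be sketched with a toy example (NumPy arrays standing in for the real state dicts; tensor names and shapes here are illustrative, not the actual checkpoint keys):

```python
import numpy as np

def compare_state_dicts(a, b):
    """Count bitwise-identical vs. differing tensors between two checkpoints."""
    identical, differing = [], []
    for name in a:
        (identical if np.array_equal(a[name], b[name]) else differing).append(name)
    return identical, differing

# Toy stand-ins: biases/layer norms initialize to constants (match across
# RNG states); weight matrices come from the same distribution but
# different RNG streams, so their values differ.
rng_a, rng_b = np.random.default_rng(0), np.random.default_rng(1)
ckpt_a = {"attn.bias": np.zeros(128), "ln.weight": np.ones(128),
          "attn.weight": rng_a.normal(0, 0.0559, (128, 128))}
ckpt_b = {"attn.bias": np.zeros(128), "ln.weight": np.ones(128),
          "attn.weight": rng_b.normal(0, 0.0559, (128, 128))}

same, diff = compare_state_dicts(ckpt_a, ckpt_b)
print(same, diff)  # constant-initialized tensors match; random weights differ
```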
To distinguish between "different init method" and "same init, different RNG state," we compared the standard deviations of corresponding weight tensors:
| Layer type | pythia-14m-deduped std (typical) | This repo std (typical) | Ratio |
|---|---|---|---|
| Embedding | 0.0559 | 0.0559 | ~1.000 |
| QKV projection | 0.0558 | 0.0558 | ~1.000 |
| Attention dense | 0.0294 | 0.0295 | ~1.000 |
| MLP up-proj | 0.0560 | 0.0560 | ~1.000 |
| MLP down-proj | 0.0295 | 0.0294 | ~1.000 |
The standard deviations match to within ~1% across all layers, confirming the same initialization distribution was used. A different init method would produce a systematic scale difference. This means the same small_init + wang_init scheme was applied, but the RNG produced different values.
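The measured standard deviations also line up with the published formulas for these schemes: small_init uses std √(2/(5d)) (Nguyen & Salazar) and wang_init uses std 2/(L√d) (as in GPT-NeoX). A sketch, treating the formulas as an assumption if your NeoX version implements them differently:

```python
import math

d, n_layers = 128, 6  # hidden size and layer count from the table above

small_init_std = math.sqrt(2 / (5 * d))        # embeddings, QKV, MLP up-proj
wang_init_std = 2 / (n_layers * math.sqrt(d))  # attn dense, MLP down-proj

print(round(small_init_std, 4))  # 0.0559, matching the ~0.056 rows
print(round(wang_init_std, 4))   # 0.0295, matching the ~0.0295 rows
```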
### Step 143000 (final checkpoint) comparison
At the final checkpoint, all 76 weight tensors differ substantially (as expected, since the training data also differs). Maximum per-tensor differences range from 0.08 to 136, confirming these are fully distinct models.
## Conclusion
We are not sure what caused the divergence. The training config specifies the same seed (1234) and the same init methods as the original Pythia models, and the weight statistics confirm the same init distribution, yet the actual initialized values differ. Possible explanations include differences in software version, tensor-parallelism configuration, or other factors that affect RNG state. The training data also differs (standard Pile here vs. deduplicated Pile for the published pythia-14m-deduped).
## Checkpoint Format
Each step branch contains:
- `mp_rank_00_model_states.pt`: full NeoX checkpoint (model weights, optimizer state, LR scheduler, RNG states)
- `configs/`: training configuration YAML files
These are not Hugging Face Transformers format checkpoints. To use them for inference, first convert them to HF format with the GPT-NeoX conversion scripts.
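A conversion sketch using the script shipped in recent GPT-NeoX checkouts; the script path, flag names, and config filename below are assumptions that may differ in your version (check `tools/ckpts/` in your clone), and the `/path/to/...` arguments are placeholders:

```shell
git clone https://github.com/EleutherAI/gpt-neox
cd gpt-neox
# Convert one checkpoint branch (e.g. step143000) to HF Transformers format.
python tools/ckpts/convert_neox_to_hf.py \
    --input_dir /path/to/step143000 \
    --config_file /path/to/step143000/configs/config.yml \
    --output_dir /path/to/hf-checkpoint
```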
## License
Apache 2.0