# neox-ckpt-pythia-14m-seed0
This repository contains GPT-NeoX format checkpoints for a 14M parameter Pythia model. Checkpoints are stored as branches (e.g., `step0`, `step1000`, `step143000`), with 154 checkpoints total matching the standard Pythia checkpoint schedule.
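The standard Pythia schedule (step 0, ten log-spaced steps from 1 to 512, then every 1000 steps up to 143000) accounts for the 154 branches; a minimal sketch, assuming that schedule:

```python
# Reconstruct the standard Pythia checkpoint schedule:
# step 0, log-spaced steps 1..512, then every 1000 steps to 143000.
def pythia_checkpoint_steps():
    steps = [0]
    steps += [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
    steps += list(range(1000, 143001, 1000))  # 1000, 2000, ..., 143000
    return steps

branches = [f"step{s}" for s in pythia_checkpoint_steps()]
print(len(branches))  # 154, matching the branch count above
```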
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 14M |
| Layers | 6 |
| Hidden size | 128 |
| Attention heads | 4 |
| Sequence length | 2048 |
| Vocab size | 50304 |
| Training steps | 143,000 |
| Batch size | 32 per GPU |
| Learning rate | 1e-3 (cosine decay, 10% warmup) |
| Optimizer | Adam (betas 0.9, 0.95) |
| Precision | FP16 |
| Init method | small_init (weights), wang_init (output layer) |
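As a sanity check, the parameter count can be derived from the dimensions in the table. A rough sketch, assuming untied input/output embeddings (as in Pythia) and the standard GPT-NeoX layer layout with biases and layer norms:

```python
vocab, d, n_layers, d_ff = 50304, 128, 6, 512  # d_ff = 4 * hidden size

embed = vocab * d    # input embedding
unembed = vocab * d  # output projection (untied from the input embedding)

per_layer = (
    d * 3 * d + 3 * d   # fused QKV projection + bias
    + d * d + d         # attention output projection + bias
    + d * d_ff + d_ff   # MLP up-projection + bias
    + d_ff * d + d      # MLP down-projection + bias
    + 4 * d             # two layer norms (weight + bias each)
)

total = embed + unembed + n_layers * per_layer + 2 * d  # + final layer norm
print(total)  # 14067712, consistent with the "14M" label
```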
## Training Data
The training config specifies `/weka/pile/pile_20B_tokenizer_text_document` as the training data path, which is the standard (non-deduplicated) Pile.
## Relationship to Other Pythia Models
This model was trained as part of an "extra seeds" project (`wandb_project: pythia-extra-seeds`). The config specifies `seed: 1234`, the same seed used for the original Pythia models.
Despite sharing the same seed value and init method, the step 0 (pre-training) weights in this repository do not match the step 0 weights of any published Pythia model. We investigated this by comparing against EleutherAI/pythia-14m-deduped, which was trained on the deduplicated Pile but uses the same architecture and seed.
### Step 0 comparison with pythia-14m-deduped
At step 0, before any training occurs, the two models should reflect only their random initialization. We found:
- 50 of 76 weight tensors are bitwise identical. These are all biases (initialized to zero) and layer norm parameters (initialized to ones/zeros).
- 26 weight tensors differ. These are the randomly initialized weight matrices: embeddings, attention projections, and MLP projections.
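The comparison above can be sketched with a toy example (NumPy arrays standing in for the real state dicts; tensor names and shapes here are illustrative, not the actual checkpoint keys):

```python
import numpy as np

def compare_state_dicts(a, b):
    """Count bitwise-identical vs. differing tensors between two checkpoints."""
    identical, differing = [], []
    for name in a:
        (identical if np.array_equal(a[name], b[name]) else differing).append(name)
    return identical, differing

# Toy stand-ins: biases/layer norms initialize to constants (match across
# RNG states); weight matrices come from the same distribution but
# different RNG streams, so their values differ.
rng_a, rng_b = np.random.default_rng(0), np.random.default_rng(1)
ckpt_a = {"attn.bias": np.zeros(128), "ln.weight": np.ones(128),
          "attn.weight": rng_a.normal(0, 0.0559, (128, 128))}
ckpt_b = {"attn.bias": np.zeros(128), "ln.weight": np.ones(128),
          "attn.weight": rng_b.normal(0, 0.0559, (128, 128))}

same, diff = compare_state_dicts(ckpt_a, ckpt_b)
print(same, diff)  # constant-initialized tensors match; random weights differ
```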
To distinguish between "different init method" and "same init, different RNG state," we compared the standard deviations of corresponding weight tensors:
| Layer type | pythia-14m-deduped std (typical) | This repo std (typical) | Ratio |
|---|---|---|---|
| Embedding | 0.0559 | 0.0559 | ~1.000 |
| QKV projection | 0.0558 | 0.0558 | ~1.000 |
| Attention dense | 0.0294 | 0.0295 | ~1.000 |
| MLP up-proj | 0.0560 | 0.0560 | ~1.000 |
| MLP down-proj | 0.0295 | 0.0294 | ~1.000 |
The standard deviations match to within ~1% across all layers, confirming the same initialization distribution was used. A different init method would produce a systematic scale difference. This means the same small_init + wang_init scheme was applied, but the RNG produced different values.
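The measured standard deviations also line up with the published formulas for these schemes: small_init uses std √(2/(5d)) (Nguyen & Salazar) and wang_init uses std 2/(L√d) (as in GPT-NeoX). A sketch, treating the formulas as an assumption if your NeoX version implements them differently:

```python
import math

d, n_layers = 128, 6  # hidden size and layer count from the table above

small_init_std = math.sqrt(2 / (5 * d))        # embeddings, QKV, MLP up-proj
wang_init_std = 2 / (n_layers * math.sqrt(d))  # attn dense, MLP down-proj

print(round(small_init_std, 4))  # 0.0559, matching the ~0.056 rows
print(round(wang_init_std, 4))   # 0.0295, matching the ~0.0295 rows
```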
### Step 143000 (final checkpoint) comparison
At the final checkpoint, all 76 weight tensors differ substantially (as expected, since the training data also differs). Maximum per-tensor differences range from 0.08 to 136, confirming these are fully distinct models.
## Conclusion
We are not sure what caused the divergence. The training config specifies the same seed (1234) and the same init methods as the original Pythia models, and the weight statistics confirm the same init distribution, yet the actual initialized values differ. Possible explanations include differences in software version, tensor-parallelism configuration, or other factors that affect RNG state. The training data also differs (standard Pile here vs. deduplicated Pile for the published pythia-14m-deduped).
## Checkpoint Format
Each step branch contains:
- `mp_rank_00_model_states.pt`: full NeoX checkpoint (model weights, optimizer state, LR scheduler, RNG states)
- `configs/`: training configuration YAML files
These are not Hugging Face Transformers format checkpoints. To use them for inference, first convert them to HF format with the GPT-NeoX conversion scripts.
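A conversion sketch using the script shipped in recent GPT-NeoX checkouts; the script path, flag names, and config filename below are assumptions that may differ in your version (check `tools/ckpts/` in your clone), and the `/path/to/...` arguments are placeholders:

```shell
git clone https://github.com/EleutherAI/gpt-neox
cd gpt-neox
# Convert one checkpoint branch (e.g. step143000) to HF Transformers format.
python tools/ckpts/convert_neox_to_hf.py \
    --input_dir /path/to/step143000 \
    --config_file /path/to/step143000/configs/config.yml \
    --output_dir /path/to/hf-checkpoint
```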
## License
Apache 2.0