GreyMatter: A Transformer Language Model from Scratch

GreyMatter is a custom Transformer-based language model implemented from scratch in PyTorch for learning and research purposes. It includes pretraining on a large web dataset and supervised fine-tuning (SFT) on conversational data, similar to the pipeline used in modern LLM development.

Features

  • Implemented from scratch in PyTorch (no Hugging Face Transformers).
  • Byte-Pair Encoding (BPE) tokenizer trained on 1.1B tokens.
  • GPT-style decoder-only Transformer:
    • 12 layers, 768 hidden size, 8 heads, 123M parameters.
    • Rotary Positional Encoding (RoPE).
    • RMSNorm instead of LayerNorm.
    • Dropout + weight decay regularization.
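RMSNorm drops the mean-centering and bias of LayerNorm and rescales by the root-mean-square alone. A minimal NumPy sketch of the idea (not the project's `model.py` code, which may differ in details like the epsilon value):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale by the root-mean-square over the feature dimension;
    # unlike LayerNorm there is no mean subtraction and no bias.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, np.ones(4))  # output has unit RMS per row
```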

Training

  • Pretrained on a 5GB subset of Falcon RefinedWeb.
  • Optimized with AdamW + gradient accumulation on a single RTX 3090.
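Gradient accumulation lets a single GPU emulate a larger batch: micro-batch gradients are summed (each scaled by `1/grad_acc_step`) before one optimizer step, which for equal-sized micro-batches equals the full-batch gradient. A sketch with a hypothetical linear model and MSE loss, purely to illustrate the equivalence (not the training script itself):

```python
import numpy as np

def grad(w, X, y):
    # Gradient of mean squared error 0.5 * mean((X @ w - y)^2) w.r.t. w
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))      # "effective batch" of 64 samples
y = rng.normal(size=64)
w = np.zeros(4)

full = grad(w, X, y)              # full-batch gradient

acc = np.zeros(4)
for i in range(0, 64, 8):         # 8 micro-batches of size 8
    acc += grad(w, X[i:i+8], y[i:i+8]) / 8  # scale by 1/grad_acc_step
```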

Supervised Fine-Tuning (SFT)

  • Aligned with UltraChat-200k.
  • Further reduced perplexity on both train and validation sets.
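A common SFT convention (assumed here, not confirmed by the source) is to mask the loss so only assistant-response tokens contribute, while prompt tokens are ignored. A minimal sketch of such a masked negative log-likelihood:

```python
import numpy as np

def masked_nll(log_probs, targets, loss_mask):
    # log_probs: (seq_len, vocab) log-probabilities
    # targets:   (seq_len,) token ids
    # loss_mask: (seq_len,) 1.0 for assistant tokens, 0.0 for prompt tokens
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()
```

With a uniform distribution over 3 tokens, every unmasked position costs `log(3)` nats regardless of which positions are masked out.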

Repository Structure

.
β”œβ”€β”€ dataset.py                  # BPE tokenizer + dataset loader
β”œβ”€β”€ model.py                    # Transformer model (GreyMatter)
β”œβ”€β”€ train.py                    # Pretraining script
β”œβ”€β”€ sft.py                      # Supervised Fine-Tuning (SFT) script
β”œβ”€β”€ inference.py                # Inference script (decoding strategies)
β”œβ”€β”€ transformer_block_numpy.py  # NumPy implementation of a Transformer block (forward pass only, no backprop)
└── README.md                   # Project documentation

Model Configuration

config = {
    'vocab_size': 25000,
    'seq_len': 1024,
    'd_model': 768,
    'n_heads': 8,
    'n_layers': 12,
    'd_ff': 3072,
    'dropout': 0.1,
    'batch_size': 8,
    'grad_acc_step': 8,  # effective batch size = 64
    'learning_rate': 1e-4,
    'weight_decay': 0.01,
    'num_epochs': 3,
}
  • Parameters: ~123M
  • Pretraining tokens: 1.1B
  • Effective batch size: 64 (with gradient accumulation)
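The ~123M figure is consistent with a back-of-the-envelope count from the config, assuming an untied output head and ignoring the small norm weights:

```python
vocab, d, d_ff, layers = 25000, 768, 3072, 12

tok_emb   = vocab * d       # token embeddings
attn      = 4 * d * d       # Q, K, V and output projections per layer
ffn       = 2 * d * d_ff    # up and down projections per layer
per_layer = attn + ffn      # ~7.08M per layer (norm weights negligible)
lm_head   = vocab * d       # assuming the output head is not tied

total = tok_emb + layers * per_layer + lm_head
print(f"~{total / 1e6:.0f}M parameters")  # ~123M
```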

Usage

Note: Prepare the dataset first using train_tokenizer.py (it expects the downloaded parquet files).

  1. Pretraining
python train.py
  2. Supervised Fine-Tuning (SFT)
python sft.py
  3. Inference
python inference.py
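`inference.py` covers decoding strategies; one standard strategy is top-k sampling, sketched below in NumPy (an illustrative implementation, not necessarily the script's exact code):

```python
import numpy as np

def top_k_sample(logits, k=5, rng=None):
    # Keep only the k highest logits, softmax over them, then sample.
    rng = rng or np.random.default_rng()
    idx = np.argsort(logits)[-k:]                # indices of top-k logits
    probs = np.exp(logits[idx] - logits[idx].max())
    probs /= probs.sum()                         # renormalize over top-k
    return idx[rng.choice(len(idx), p=probs)]
```

With `k=1` this reduces to greedy decoding (always the argmax token); larger `k` trades determinism for diversity.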

Results

Pretraining: Achieved a significant reduction in perplexity on the Falcon RefinedWeb subset.

[Figure: pretraining loss curve]

Fine-tuning: Further reduced perplexity on UltraChat (33 -> 9).

Scope and Goals

  • Focus: AI/ML, LLMs, and NLP research.
  • Goal: Learning the internals of LLMs by implementing everything from scratch.

License

MIT License.
