GreyMatter: A Transformer Language Model from Scratch

GreyMatter is a custom Transformer-based language model implemented from scratch in PyTorch for learning and research purposes. It includes pretraining on a large web dataset and supervised fine-tuning (SFT) on conversational data, similar to the pipeline used in modern LLM development.

Features

  • Implemented from scratch in PyTorch (no Hugging Face Transformers).
  • Byte-Pair Encoding (BPE) tokenizer trained on 1.1B tokens.
  • GPT-style decoder-only Transformer:
    • 12 layers, 768 hidden size, 8 heads, 123M parameters.
    • Rotary Positional Encoding (RoPE).
    • RMSNorm instead of LayerNorm.
    • Dropout + weight decay regularization.
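RMSNorm drops the mean-centering and bias of LayerNorm and rescales by the root-mean-square alone. A minimal NumPy sketch of the idea (not the project's `model.py` code, which may differ in details like the epsilon value):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale by the root-mean-square over the feature dimension;
    # unlike LayerNorm there is no mean subtraction and no bias.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, np.ones(4))  # output has unit RMS per row
```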

Training

  • Pretrained on a 5GB subset of Falcon RefinedWeb.
  • Optimized with AdamW + gradient accumulation on a single RTX 3090.
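Gradient accumulation lets a single GPU emulate a larger batch: micro-batch gradients are summed (each scaled by `1/grad_acc_step`) before one optimizer step, which for equal-sized micro-batches equals the full-batch gradient. A sketch with a hypothetical linear model and MSE loss, purely to illustrate the equivalence (not the training script itself):

```python
import numpy as np

def grad(w, X, y):
    # Gradient of mean squared error 0.5 * mean((X @ w - y)^2) w.r.t. w
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))      # "effective batch" of 64 samples
y = rng.normal(size=64)
w = np.zeros(4)

full = grad(w, X, y)              # full-batch gradient

acc = np.zeros(4)
for i in range(0, 64, 8):         # 8 micro-batches of size 8
    acc += grad(w, X[i:i+8], y[i:i+8]) / 8  # scale by 1/grad_acc_step
```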

Supervised Fine-Tuning (SFT)

  • Aligned with UltraChat-200k.
  • Further reduced perplexity on both train and validation sets.
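A common SFT convention (assumed here, not confirmed by the source) is to mask the loss so only assistant-response tokens contribute, while prompt tokens are ignored. A minimal sketch of such a masked negative log-likelihood:

```python
import numpy as np

def masked_nll(log_probs, targets, loss_mask):
    # log_probs: (seq_len, vocab) log-probabilities
    # targets:   (seq_len,) token ids
    # loss_mask: (seq_len,) 1.0 for assistant tokens, 0.0 for prompt tokens
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()
```

With a uniform distribution over 3 tokens, every unmasked position costs `log(3)` nats regardless of which positions are masked out.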

Repository Structure

.
β”œβ”€β”€ dataset.py                  # BPE tokenizer + dataset loader
β”œβ”€β”€ model.py                    # Transformer model (GreyMatter)
β”œβ”€β”€ train.py                    # Pretraining script
β”œβ”€β”€ sft.py                      # Supervised Fine-Tuning (SFT) script
β”œβ”€β”€ inference.py                # Inference script (decoding strategies)
β”œβ”€β”€ transformer_block_numpy.py  # NumPy implementation of a Transformer block (forward pass only, no backprop)
└── README.md                   # Project documentation

Model Configuration

config = {
    'vocab_size': 25000,
    'seq_len': 1024,
    'd_model': 768,
    'n_heads': 8,
    'n_layers': 12,
    'd_ff': 3072,
    'dropout': 0.1,
    'batch_size': 8,
    'grad_acc_step': 8,  # effective batch size = 64
    'learning_rate': 1e-4,
    'weight_decay': 0.01,
    'num_epochs': 3,
}
  • Parameters: ~123M
  • Pretraining tokens: 1.1B
  • Effective batch size: 64 (with gradient accumulation)
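The ~123M figure is consistent with a back-of-the-envelope count from the config, assuming an untied output head and ignoring the small norm weights:

```python
vocab, d, d_ff, layers = 25000, 768, 3072, 12

tok_emb   = vocab * d       # token embeddings
attn      = 4 * d * d       # Q, K, V and output projections per layer
ffn       = 2 * d * d_ff    # up and down projections per layer
per_layer = attn + ffn      # ~7.08M per layer (norm weights negligible)
lm_head   = vocab * d       # assuming the output head is not tied

total = tok_emb + layers * per_layer + lm_head
print(f"~{total / 1e6:.0f}M parameters")  # ~123M
```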

Usage

Note: Prepare the dataset first using train_tokenizer.py (it expects the downloaded parquet files).

  1. Pretraining
python train.py
  2. Supervised Fine-Tuning (SFT)
python sft.py
  3. Inference
python inference.py
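`inference.py` covers decoding strategies; one standard strategy is top-k sampling, sketched below in NumPy (an illustrative implementation, not necessarily the script's exact code):

```python
import numpy as np

def top_k_sample(logits, k=5, rng=None):
    # Keep only the k highest logits, softmax over them, then sample.
    rng = rng or np.random.default_rng()
    idx = np.argsort(logits)[-k:]                # indices of top-k logits
    probs = np.exp(logits[idx] - logits[idx].max())
    probs /= probs.sum()                         # renormalize over top-k
    return idx[rng.choice(len(idx), p=probs)]
```

With `k=1` this reduces to greedy decoding (always the argmax token); larger `k` trades determinism for diversity.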

Results

Pretraining: Achieved a significant reduction in perplexity on the Falcon RefinedWeb subset.

[Figure: pretraining loss curve]

Fine-tuning: Further reduced perplexity on UltraChat (33 -> 9).

Scope and Goals

  • Focus: AI/ML, LLMs, and NLP research.
  • Goal: Learning the internals of LLMs by implementing everything from scratch.

License

MIT License.
