# KernelGPT: GPU/AI Systems Performance
A GPT-style decoder-only transformer trained from scratch on GPU/AI systems performance engineering text.
## Model Specs
| Property | Value |
|---|---|
| Parameters | ~125M |
| Architecture | Decoder-only Transformer |
| Embedding dim | 768 |
| Attention heads | 12 |
| Layers | 8 |
| Context length | 512 tokens |
| Vocab size | 32,000 (SentencePiece BPE) |
## Training
| Setting | Value |
|---|---|
| Training steps | 162,000 |
| Val loss | 4.3889 |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay) |
| Batch size | 1 (effective 4 with grad accum) |
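The batch-size row above means each optimizer step accumulates gradients over four micro-batches of one sequence each. A minimal, framework-free sketch (toy scalar model, not the actual training loop) of why scaling each micro-batch gradient by `1/accum_steps` reproduces a full effective-batch step:

```python
# Toy model: scalar linear fit y = w * x with squared-error loss.
def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2.0 * (w * x - y) * x

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0, lr, accum_steps = 0.0, 0.01, 4

# Gradient accumulation: scale each micro-batch gradient by 1/accum_steps
# so the running sum equals the mean gradient over the effective batch.
acc = 0.0
for x, y in zip(xs, ys):
    acc += grad(w0, x, y) / accum_steps
w_accum = w0 - lr * acc

# Reference: one full-batch step using the mean gradient directly.
full = sum(grad(w0, x, y) for x, y in zip(xs, ys)) / len(xs)
w_full = w0 - lr * full

assert abs(w_accum - w_full) < 1e-12  # identical update
```

The same scaling is what a real loop does by dividing each micro-batch loss by `accum_steps` before `backward()`, then calling the optimizer once per accumulation window.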
## Training Data
- FineWeb (general web text)
- arXiv papers (cs.DC, cs.AR, cs.LG, cs.PF: GPU-, AI-, and systems-focused categories)
- Wikipedia (ML/systems filtered articles)
- GPU-specific crawl (NVIDIA docs, GitHub READMEs, arXiv abstracts)
Topics cover all 20 chapters of *AI Performance Engineering*, including CUDA internals, KV cache tuning, LLM inference, distributed training, and GPU cluster scaling.
## Usage
```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download checkpoint and tokenizer from the Hub
ckpt_path = hf_hub_download("saiakula/KernelGPT", "pytorch_model.pt")
tok_path = hf_hub_download("saiakula/KernelGPT", "tokenizer.model")

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)

# Load model weights
# (requires the TinyGPT source; clone https://github.com/your-username/TinyGPT)
checkpoint = torch.load(ckpt_path, map_location="cpu")
```
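The checkpoint above holds raw weights; generating text also needs a decoding step over the model's output logits. A self-contained, illustrative top-k sampler in plain Python (not part of TinyGPT; the model forward pass around it is assumed):

```python
import math
import random

def sample_top_k(logits, k=5, temperature=0.8, rng=None):
    """Sample a token id from a list of logits, keeping only the top-k candidates."""
    rng = rng or random.Random(0)  # fixed seed here for reproducibility
    # Keep the k highest-logit (index, logit) pairs.
    top = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    # Temperature-scaled softmax over the survivors (max-subtracted for stability).
    mx = max(l for _, l in top)
    weights = [math.exp((l - mx) / temperature) for _, l in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Inverse-CDF sampling over the renormalized distribution.
    r, cum = rng.random(), 0.0
    for (idx, _), p in zip(top, probs):
        cum += p
        if r <= cum:
            return idx
    return top[-1][0]
```

In a generation loop you would encode a prompt with `sp.encode(prompt)`, run the token ids through the model, call `sample_top_k` on the final position's logits, append the sampled id, and finally `sp.decode` the full sequence.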
## Acknowledgments
- Inspired by Andrej Karpathy's nanoGPT
- Training topics based on *AI Performance Engineering* by Chris Fregly