KernelGPT - GPU/AI Systems Performance

A GPT-style decoder-only transformer trained from scratch on GPU/AI systems performance engineering text.

Model Specs

| Property | Value |
|---|---|
| Parameters | ~125M |
| Architecture | Decoder-only Transformer |
| Embedding dim | 768 |
| Attention heads | 12 |
| Layers | 8 |
| Context length | 512 tokens |
| Vocab size | 32,000 (SentencePiece BPE) |
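In code, the hyperparameters above map onto a configuration object along these lines. This is a minimal sketch; TinyGPT's actual config class and field names may differ:

```python
from dataclasses import dataclass

@dataclass
class KernelGPTConfig:
    # Values taken from the spec table above; the class itself is
    # a hypothetical stand-in for TinyGPT's real config.
    vocab_size: int = 32_000
    n_layer: int = 8
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 512  # context length in tokens

    @property
    def head_dim(self) -> int:
        # Per-head dimension: embedding dim split evenly across heads
        return self.n_embd // self.n_head

cfg = KernelGPTConfig()
print(cfg.head_dim)  # 768 / 12 = 64
```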

Training

| Setting | Value |
|---|---|
| Training steps | 162,000 |
| Val loss | 4.3889 |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay) |
| Batch size | 1 (effective 4 with gradient accumulation) |
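The schedule in the table can be sketched as follows. The model is a stand-in, and the floor learning rate, weight decay, and absence of warmup are assumptions not stated in this card:

```python
import math
import torch
import torch.nn as nn

# Stand-in model; the real run trains the full transformer.
model = nn.Linear(16, 16)
# AdamW with peak LR 3e-4, per the table (weight decay value assumed).
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

max_steps, accum_steps = 162_000, 4

def lr_at(step: int, peak: float = 3e-4, floor: float = 3e-5) -> float:
    """Cosine decay from peak to an assumed floor over max_steps."""
    progress = min(step / max_steps, 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

for step in range(2):  # a couple of illustrative steps
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    # Micro-batch of 1, accumulated 4x for an effective batch of 4.
    for _ in range(accum_steps):
        x = torch.randn(1, 16)
        loss = model(x).pow(2).mean() / accum_steps  # scale for accumulation
        loss.backward()
    opt.step()
    opt.zero_grad()
```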

Training Data

  • FineWeb (general web text)
  • arXiv papers (cs.DC, cs.AR, cs.LG, cs.PF categories, covering GPU/AI/systems)
  • Wikipedia (ML/systems filtered articles)
  • GPU-specific crawl (NVIDIA docs, GitHub READMEs, arXiv abstracts)

Topics span all 20 chapters of AI Performance Engineering, including CUDA internals, KV-cache tuning, LLM inference, distributed training, and GPU cluster scaling.

Usage

```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the checkpoint and tokenizer from the Hub
ckpt_path = hf_hub_download("saiakula/KernelGPT", "pytorch_model.pt")
tok_path  = hf_hub_download("saiakula/KernelGPT", "tokenizer.model")

# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)

# Load the checkpoint
# (requires the TinyGPT source: clone https://github.com/your-username/TinyGPT)
checkpoint = torch.load(ckpt_path, map_location="cpu")
```
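Once the TinyGPT source is available, generation is a standard autoregressive loop. Since the model class isn't shown in this card, the sketch below uses a stub module with the same token-ids-to-logits interface; `generate` is the generic greedy-decoding pattern, not TinyGPT's actual API:

```python
import torch

# Stub language model: maps token ids (B, T) to logits (B, T, vocab).
# Any module with this interface, including the real checkpointed
# model, can be dropped into `generate` below.
class StubLM(torch.nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.emb(ids))

@torch.no_grad()
def generate(model, ids, max_new_tokens: int = 8, block_size: int = 512):
    """Greedy decoding: append the argmax token one step at a time."""
    for _ in range(max_new_tokens):
        logits = model(ids[:, -block_size:])  # crop to the context window
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

model = StubLM().eval()
ids = torch.tensor([[1, 2, 3]])       # prompt token ids
out = generate(model, ids)
print(out.shape)                      # 3 prompt + 8 new tokens
```

With the real model, the prompt would come from `sp.encode(text)` and the output would be decoded with `sp.decode(out[0].tolist())`.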

Acknowledgments

  • Inspired by Andrej Karpathy's nanoGPT
  • Training topics based on AI Performance Engineering by Chris Fregly