KernelGPT - GPU/AI Systems Performance

A GPT-style decoder-only transformer trained from scratch on GPU/AI systems performance engineering text.

Model Specs

| Property | Value |
|---|---|
| Parameters | ~125M |
| Architecture | Decoder-only Transformer |
| Embedding dim | 768 |
| Attention heads | 12 |
| Layers | 8 |
| Context length | 512 tokens |
| Vocab size | 32,000 (SentencePiece BPE) |
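In code, the hyperparameters above map onto a configuration object along these lines. This is a minimal sketch; TinyGPT's actual config class and field names may differ:

```python
from dataclasses import dataclass

@dataclass
class KernelGPTConfig:
    # Values taken from the spec table above; the class itself is
    # a hypothetical stand-in for TinyGPT's real config.
    vocab_size: int = 32_000
    n_layer: int = 8
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 512  # context length in tokens

    @property
    def head_dim(self) -> int:
        # Per-head dimension: embedding dim split evenly across heads
        return self.n_embd // self.n_head

cfg = KernelGPTConfig()
print(cfg.head_dim)  # 768 / 12 = 64
```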

Training

| Setting | Value |
|---|---|
| Training steps | 162,000 |
| Val loss | 4.3889 |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay) |
| Batch size | 1 (effective 4 with gradient accumulation) |
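The schedule in the table can be sketched as follows. The model is a stand-in, and the floor learning rate, weight decay, and absence of warmup are assumptions not stated in this card:

```python
import math
import torch
import torch.nn as nn

# Stand-in model; the real run trains the full transformer.
model = nn.Linear(16, 16)
# AdamW with peak LR 3e-4, per the table (weight decay value assumed).
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

max_steps, accum_steps = 162_000, 4

def lr_at(step: int, peak: float = 3e-4, floor: float = 3e-5) -> float:
    """Cosine decay from peak to an assumed floor over max_steps."""
    progress = min(step / max_steps, 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

for step in range(2):  # a couple of illustrative steps
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    # Micro-batch of 1, accumulated 4x for an effective batch of 4.
    for _ in range(accum_steps):
        x = torch.randn(1, 16)
        loss = model(x).pow(2).mean() / accum_steps  # scale for accumulation
        loss.backward()
    opt.step()
    opt.zero_grad()
```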

Training Data

  • FineWeb (general web text)
  • arXiv papers (cs.DC, cs.AR, cs.LG, cs.PF categories, covering GPU/AI/systems)
  • Wikipedia (ML/systems filtered articles)
  • GPU-specific crawl (NVIDIA docs, GitHub READMEs, arXiv abstracts)

Topics span all 20 chapters of AI Performance Engineering, including CUDA internals, KV-cache tuning, LLM inference, distributed training, and GPU cluster scaling.

Usage

```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the checkpoint and tokenizer from the Hub
ckpt_path = hf_hub_download("saiakula/KernelGPT", "pytorch_model.pt")
tok_path  = hf_hub_download("saiakula/KernelGPT", "tokenizer.model")

# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)

# Load the checkpoint
# (requires the TinyGPT source: clone https://github.com/your-username/TinyGPT)
checkpoint = torch.load(ckpt_path, map_location="cpu")
```
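Once the TinyGPT source is available, generation is a standard autoregressive loop. Since the model class isn't shown in this card, the sketch below uses a stub module with the same token-ids-to-logits interface; `generate` is the generic greedy-decoding pattern, not TinyGPT's actual API:

```python
import torch

# Stub language model: maps token ids (B, T) to logits (B, T, vocab).
# Any module with this interface, including the real checkpointed
# model, can be dropped into `generate` below.
class StubLM(torch.nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.emb(ids))

@torch.no_grad()
def generate(model, ids, max_new_tokens: int = 8, block_size: int = 512):
    """Greedy decoding: append the argmax token one step at a time."""
    for _ in range(max_new_tokens):
        logits = model(ids[:, -block_size:])  # crop to the context window
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

model = StubLM().eval()
ids = torch.tensor([[1, 2, 3]])       # prompt token ids
out = generate(model, ids)
print(out.shape)                      # 3 prompt + 8 new tokens
```

With the real model, the prompt would come from `sp.encode(text)` and the output would be decoded with `sp.decode(out[0].tolist())`.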

Acknowledgments

  • Inspired by Andrej Karpathy's nanoGPT
  • Training topics based on AI Performance Engineering by Chris Fregly