# AttnRes DevOps GPT
A 48M-parameter GPT trained on DevOps/GitOps data (Kubernetes, Helm, Terraform, ArgoCD, Flux) with a novel Attention Residual (AttnRes) mechanism.
## What is AttnRes?
Instead of the standard fixed residual (`x + sublayer(x)`), AttnRes replaces the residual with a learned attention aggregation over previous hidden states within the same block:

```
residual = x + sigmoid(alpha) * (attn_aggregate(prev_states) - x)
h = residual + sublayer(x)
```
- `alpha` is initialized to -4, so sigmoid(-4) ≈ 0.018 (starts as a classic residual)
- The model gradually opens the AttnRes channel as it learns
- Layers are grouped in blocks of 4; each layer attends only to previous states in its block
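The gated residual above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repo's implementation: the mean over previous states stands in for the learned attention aggregate, and the class name `AttnResGate` is made up for this example.

```python
import torch
import torch.nn as nn

class AttnResGate(nn.Module):
    """Gated residual: blend x with an aggregate of previous hidden states."""
    def __init__(self):
        super().__init__()
        # alpha starts at -4, so sigmoid(alpha) ~ 0.018: nearly a classic residual
        self.alpha = nn.Parameter(torch.tensor(-4.0))

    def forward(self, x, prev_states):
        # A simple mean over previous states stands in for attn_aggregate here
        agg = torch.stack(prev_states, dim=0).mean(dim=0)
        gate = torch.sigmoid(self.alpha)
        return x + gate * (agg - x)

x = torch.randn(2, 8, 512)                              # (batch, seq, d_model)
prev = [torch.randn(2, 8, 512) for _ in range(3)]       # earlier states in the block
out = AttnResGate()(x, prev)
print(out.shape)  # torch.Size([2, 8, 512])
```

Because the gate starts near zero, `out` is initially almost identical to `x`, and training can increase `alpha` to mix in more of the aggregated states.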
## Results
Trained on ~50M tokens of a DevOps corpus (GitHub repos, official docs, Stack Overflow):
| Variant | Best val ppl | Final val ppl | Params |
|---|---|---|---|
| Baseline | 11.99 | 13.66 | 46.4M |
| AttnRes | 11.74 | 13.91 | 48.0M |
AttnRes reached a ~2% lower perplexity minimum and converged faster, though its final validation perplexity was slightly higher than the baseline's. Qualitatively, AttnRes generates more syntactically correct YAML/Terraform/kubectl output.
## Architecture
```python
GPTConfig(
    vocab_size = 16_384,  # BPE tokenizer trained on DevOps corpus
    max_seq    = 512,
    d_model    = 512,
    n_layers   = 12,
    n_heads    = 8,
    d_ff       = 2_048,
    dropout    = 0.1,
    variant    = "attnres",
    block_size = 4,
)
```
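A back-of-envelope count from this config roughly reproduces the baseline's reported parameter total. This assumes a standard GPT layout with learned positional embeddings and an LM head tied to the token embedding (assumptions, not details confirmed by the repo):

```python
# Rough parameter count for the baseline config, assuming learned positional
# embeddings and a tied LM head (both are assumptions about the layout).
vocab_size, max_seq, d_model, n_layers, d_ff = 16_384, 512, 512, 12, 2_048

tok_emb = vocab_size * d_model                 # token embeddings (tied LM head)
pos_emb = max_seq * d_model                    # learned positional embeddings
attn = 4 * (d_model * d_model + d_model)       # q, k, v, out projections
ff = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two FF layers
lns = 2 * 2 * d_model                          # two LayerNorms per layer
per_layer = attn + ff + lns
total = tok_emb + pos_emb + n_layers * per_layer + 2 * d_model  # + final LayerNorm

print(f"{total / 1e6:.1f}M")  # ≈ 46.5M, close to the reported 46.4M baseline
```

The AttnRes variant's extra ~1.6M parameters come from the per-layer aggregation machinery and gates.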
## Usage
```python
import torch
from huggingface_hub import hf_hub_download
from model import GPT, GPTConfig
from tokenizers import ByteLevelBPETokenizer

# Load model
ckpt = torch.load(hf_hub_download("roanbrasil/attnres-devops-gpt", "attnres_latest.pt"),
                  map_location="cpu", weights_only=False)
cfg = GPTConfig(**ckpt["config"])
model = GPT(cfg)
state = {k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()}
model.load_state_dict(state)
model.eval()

# Load tokenizer
vocab = hf_hub_download("roanbrasil/attnres-devops-gpt", "tokenizer/vocab.json")
merges = hf_hub_download("roanbrasil/attnres-devops-gpt", "tokenizer/merges.txt")
tok = ByteLevelBPETokenizer(vocab, merges)

# Generate
prompt = "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: my-app"
ids = tok.encode(prompt).ids
x = torch.tensor([ids])
out = model.generate(x, max_new=100, temperature=0.7, top_k=40)
print(tok.decode(out[0].tolist()))
```
## Training
- Corpus: ~50M tokens from GitHub (kubernetes/helm/terraform repos), official docs, Stack Overflow
- Tokenizer: ByteLevelBPE, vocab_size=16384, trained on domain corpus
- Steps: 50k, batch_size=16, grad_accum=4, seq_len=512
- Optimizer: AdamW, lr=3e-4, cosine decay, warmup=2k steps
- Gate parameters (`alpha`) use a 100x higher LR so the AttnRes gates can open during training
- Hardware: NVIDIA RTX 3080 (~4.5h)
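The 100x gate LR can be set up with AdamW parameter groups. A minimal sketch, assuming gate parameters are identifiable by the name `alpha` (the actual naming and grouping in this repo may differ):

```python
import torch
from torch.optim import AdamW

class TinyBlock(torch.nn.Module):
    """Stand-in module with one regular layer and one AttnRes gate parameter."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(512, 512)
        self.alpha = torch.nn.Parameter(torch.tensor(-4.0))  # gate, starts near closed

model = TinyBlock()

# Split parameters by name: gates vs. everything else
gate_params = [p for n, p in model.named_parameters() if "alpha" in n]
other_params = [p for n, p in model.named_parameters() if "alpha" not in n]

optimizer = AdamW([
    {"params": other_params, "lr": 3e-4},  # base LR
    {"params": gate_params, "lr": 3e-2},   # 100x base LR for gates
])
print([g["lr"] for g in optimizer.param_groups])  # [0.0003, 0.03]
```

Giving the scalar gates a higher learning rate lets sigmoid(alpha) move away from its near-zero initialization quickly without destabilizing the rest of the weights.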