# AttnRes DevOps GPT
A 48M-parameter GPT trained on DevOps/GitOps data (Kubernetes, Helm, Terraform, ArgoCD, Flux) with a novel Attention Residual (AttnRes) mechanism.
## What is AttnRes?
Instead of the standard fixed residual (`x + sublayer(x)`), AttnRes replaces the residual with a learned attention aggregation over previous hidden states within the same block:

```
residual = x + sigmoid(alpha) * (attn_aggregate(prev_states) - x)
h = residual + sublayer(x)
```
- `alpha` is initialized to -4, so sigmoid(-4) ≈ 0.018 (starts as a classic residual)
- The model gradually opens the AttnRes channel as it learns
- Layers are grouped in blocks of 4; each layer attends only to previous states in its block
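The gated residual above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repo's implementation: the mean over previous states stands in for the learned attention aggregate, and the class name `AttnResGate` is made up for this example.

```python
import torch
import torch.nn as nn

class AttnResGate(nn.Module):
    """Gated residual: blend x with an aggregate of previous hidden states."""
    def __init__(self):
        super().__init__()
        # alpha starts at -4, so sigmoid(alpha) ~ 0.018: nearly a classic residual
        self.alpha = nn.Parameter(torch.tensor(-4.0))

    def forward(self, x, prev_states):
        # A simple mean over previous states stands in for attn_aggregate here
        agg = torch.stack(prev_states, dim=0).mean(dim=0)
        gate = torch.sigmoid(self.alpha)
        return x + gate * (agg - x)

x = torch.randn(2, 8, 512)                              # (batch, seq, d_model)
prev = [torch.randn(2, 8, 512) for _ in range(3)]       # earlier states in the block
out = AttnResGate()(x, prev)
print(out.shape)  # torch.Size([2, 8, 512])
```

Because the gate starts near zero, `out` is initially almost identical to `x`, and training can increase `alpha` to mix in more of the aggregated states.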
## Results
Trained on ~50M tokens of a DevOps corpus (GitHub repos, official docs, Stack Overflow):
| Variant | Best val ppl | Final val ppl | Params |
|---|---|---|---|
| Baseline | 11.99 | 13.66 | 46.4M |
| AttnRes | 11.74 | 13.91 | 48.0M |
AttnRes reached a ~2% lower perplexity minimum and converged faster, though its final validation perplexity was slightly higher than the baseline's. Qualitatively, AttnRes generates more syntactically correct YAML/Terraform/kubectl output.
## Architecture
```python
GPTConfig(
    vocab_size = 16_384,  # BPE tokenizer trained on DevOps corpus
    max_seq    = 512,
    d_model    = 512,
    n_layers   = 12,
    n_heads    = 8,
    d_ff       = 2_048,
    dropout    = 0.1,
    variant    = "attnres",
    block_size = 4,
)
```
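A back-of-envelope count from this config roughly reproduces the baseline's reported parameter total. This assumes a standard GPT layout with learned positional embeddings and an LM head tied to the token embedding (assumptions, not details confirmed by the repo):

```python
# Rough parameter count for the baseline config, assuming learned positional
# embeddings and a tied LM head (both are assumptions about the layout).
vocab_size, max_seq, d_model, n_layers, d_ff = 16_384, 512, 512, 12, 2_048

tok_emb = vocab_size * d_model                 # token embeddings (tied LM head)
pos_emb = max_seq * d_model                    # learned positional embeddings
attn = 4 * (d_model * d_model + d_model)       # q, k, v, out projections
ff = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two FF layers
lns = 2 * 2 * d_model                          # two LayerNorms per layer
per_layer = attn + ff + lns
total = tok_emb + pos_emb + n_layers * per_layer + 2 * d_model  # + final LayerNorm

print(f"{total / 1e6:.1f}M")  # ≈ 46.5M, close to the reported 46.4M baseline
```

The AttnRes variant's extra ~1.6M parameters come from the per-layer aggregation machinery and gates.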
## Usage
```python
import torch
from huggingface_hub import hf_hub_download
from model import GPT, GPTConfig
from tokenizers import ByteLevelBPETokenizer

# Load model
ckpt = torch.load(hf_hub_download("roanbrasil/attnres-devops-gpt", "attnres_latest.pt"),
                  map_location="cpu", weights_only=False)
cfg = GPTConfig(**ckpt["config"])
model = GPT(cfg)
state = {k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()}
model.load_state_dict(state)
model.eval()

# Load tokenizer
vocab = hf_hub_download("roanbrasil/attnres-devops-gpt", "tokenizer/vocab.json")
merges = hf_hub_download("roanbrasil/attnres-devops-gpt", "tokenizer/merges.txt")
tok = ByteLevelBPETokenizer(vocab, merges)

# Generate
prompt = "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: my-app"
ids = tok.encode(prompt).ids
x = torch.tensor([ids])
out = model.generate(x, max_new=100, temperature=0.7, top_k=40)
print(tok.decode(out[0].tolist()))
```
## Training
- Corpus: ~50M tokens from GitHub (kubernetes/helm/terraform repos), official docs, Stack Overflow
- Tokenizer: ByteLevelBPE, vocab_size=16384, trained on domain corpus
- Steps: 50k, batch_size=16, grad_accum=4, seq_len=512
- Optimizer: AdamW, lr=3e-4, cosine decay, warmup=2k steps
- Gate parameters (`alpha`) use a 100x higher LR so the AttnRes gates can open during training
- Hardware: NVIDIA RTX 3080 (~4.5h)
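The 100x gate LR can be set up with AdamW parameter groups. A minimal sketch, assuming gate parameters are identifiable by the name `alpha` (the actual naming and grouping in this repo may differ):

```python
import torch
from torch.optim import AdamW

class TinyBlock(torch.nn.Module):
    """Stand-in module with one regular layer and one AttnRes gate parameter."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(512, 512)
        self.alpha = torch.nn.Parameter(torch.tensor(-4.0))  # gate, starts near closed

model = TinyBlock()

# Split parameters by name: gates vs. everything else
gate_params = [p for n, p in model.named_parameters() if "alpha" in n]
other_params = [p for n, p in model.named_parameters() if "alpha" not in n]

optimizer = AdamW([
    {"params": other_params, "lr": 3e-4},  # base LR
    {"params": gate_params, "lr": 3e-2},   # 100x base LR for gates
])
print([g["lr"] for g in optimizer.param_groups])  # [0.0003, 0.03]
```

Giving the scalar gates a higher learning rate lets sigmoid(alpha) move away from its near-zero initialization quickly without destabilizing the rest of the weights.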