---
title: Kube SRE Gym
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Kube SRE Gym
### Can a 1.7B model learn to be an on-call SRE — from scratch?
We gave a tiny language model a pager, a live Kubernetes cluster, and zero knowledge of what a pod even is. No pre-training on DevOps docs. No few-shot examples. Just a PagerDuty alert and a `kubectl` prompt.
Within 8 episodes, it learned to discover namespaces, read pod statuses, identify OOMKills from CrashLoopBackOffs, and apply the correct fix. By episode 4, it was resolving incidents faster than our hand-written baselines.
**This is Kube SRE Gym** — a self-improving environment where an RL agent learns to diagnose and fix real production Kubernetes failures through adversarial self-play, curriculum-driven difficulty, and GRPO.
> **1st Place, OpenEnv Hackathon** (PyTorch + Cerebral Valley, $15K prize) | Built with [OpenEnv v0.2.1](https://github.com/meta-pytorch/OpenEnv/tree/v0.2.1) | Deployed on [HF Spaces](https://huggingface.co/spaces/openenv-community/kube-sre-gym) | Training via [HF TRL](https://github.com/huggingface/trl) in [Colab](kube_sre_gym_colab.ipynb) | [Source on GitHub](https://github.com/sid-rp/kube-sre-gym)
[OpenEnv Hackathon gallery](https://cerebralvalley.ai/e/openenv-hackathon-sf/hackathon/gallery)
---
## The Story: From Blind to On-Call
### Act 1: The Cold Start
Episode 1. The agent receives its first alert: *"CRITICAL: payment-gateway pods OOMKilled in payments namespace."*
It has never seen Kubernetes before. It doesn't know what namespaces are, what pods look like, or that `kubectl` even exists. It tries random commands. Everything fails. Reward: **-2.0**.
### Act 2: First Light
Episode 4. Something clicks. The agent discovers `kubectl get pods -A` — a single command that reveals the entire cluster. It sees `OOMKilled` in the STATUS column. It connects this to the alert. It runs `kubectl set resources deployment/payment-gateway --limits=memory=128Mi -n payments`.
The pod restarts. The health check passes. The LLM judge confirms resolution. Reward: **+3.95**.
### Act 3: The Environment Fights Back
As the agent masters simple faults, the **Adversarial Designer** (Claude) notices. It starts creating compound incidents — an OOMKill in `payments` *and* a bad image in `frontend` simultaneously. Red herrings appear. The agent must learn to triage, not just react.
The **Curriculum Controller** tracks per-fault-type mastery and escalates: warmup → beginner → intermediate → advanced → expert. The training distribution adapts in real-time. No scenario is ever repeated.
### Act 4: The Environment Improves Itself
Here's what made this project different from what we planned: **the environment itself had bugs that training exposed.**
During training, we discovered our kubectl command parser only accepted `deployment/name` format (with a slash). The model kept sending perfectly valid `kubectl scale deployment frontend-cache --replicas=1` β and the environment rejected it every time. The model was right. Our environment was wrong.
We also found the LLM judge was truncating cluster snapshots at 2000 chars, cutting off pods alphabetically after `payment-*`. And a race condition between health checks and judge API calls was causing false negatives — pods would appear healthy during the health check but unhealthy by the time the judge snapshot ran.
**The agent's failures taught us to fix the environment.** This is the self-improvement loop we didn't expect — not just the model getting better, but the training infrastructure co-evolving with it.
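The parser fix is essentially a normalization pass. Below is a minimal sketch (the `normalize_resource` helper is hypothetical; the real handlers live in `server/k8s_commands.py`) that accepts both the `deployment/name` and `deployment name` spellings before dispatch:

```python
import re

def normalize_resource(command: str) -> str:
    """Rewrite 'kubectl <verb> deployment NAME ...' into the
    'deployment/NAME' form so both spellings are accepted.

    Hypothetical sketch of the fix; the actual parser lives in
    server/k8s_commands.py.
    """
    # (?!-) keeps flags like '-n' or '--replicas' from being treated
    # as resource names.
    return re.sub(
        r"\b(deployment|pod|service)\s+(?!-)([\w-]+)",
        r"\1/\2",
        command,
    )
```

With this in place, the previously rejected `kubectl scale deployment frontend-cache --replicas=1` normalizes to the slash form the rest of the pipeline expects.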
---
## Problem Statements Addressed
### Primary: Statement 4 β Self-Improvement
Kube SRE Gym is an environment where the agent **generates its own challenges, escalates difficulty, and improves through adaptive curricula** — exactly the recursive skill amplification described in Statement 4.
- **Adversarial self-play**: Claude designs incidents that target the agent's tracked weaknesses
- **Automatic curriculum**: Difficulty escalates as per-fault-type mastery improves (warmup → beginner → intermediate → advanced → expert)
- **No manual authoring**: The training distribution adapts as the agent learns — infinite novel scenarios
- **Co-evolutionary improvement**: Training runs exposed environment bugs, making the platform itself better
### Secondary: Statement 3.1 β World Modeling / Professional Tasks
The agent interacts with **real Kubernetes tools and APIs** — not mocked responses or shortcuts. It must maintain internal state across multi-step kubectl workflows and reason about causal effects of its actions on a live cluster.
- **Real tool interaction**: Every `kubectl` command executes against a live GKE cluster
- **Multi-step workflows**: Triage → investigate → fix → verify, with no shortcuts
- **Persistent world state**: Pod restarts, OOM events, and cascading failures are real K8s events
### Partner Sub-Theme: Snorkel AI β Simulated Experts-in-the-Loop
The LLM judge uses **three expert personas** (Junior, Senior, Principal) with progressively stricter evaluation criteria, simulating interaction with subject-matter experts whose requirements change as the agent improves:
- **Junior**: Lenient scoring, partial credit, provides hints
- **Senior**: Standard SRE expectations, rewards systematic diagnosis
- **Principal**: High standards, penalizes inefficiency, rewards elegant fixes
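One simple way to realize these personas (the numeric values here are illustrative assumptions; the actual logic lives in `server/judge.py`) is a leniency offset and a hint flag applied to the raw judge score:

```python
from dataclasses import dataclass

@dataclass
class JudgePersona:
    name: str
    leniency: float   # offset added to the raw judge score before clamping
    gives_hints: bool

# Illustrative values -- the real personas live in server/judge.py.
PERSONAS = {
    "junior":    JudgePersona("junior",    leniency=+0.2, gives_hints=True),
    "senior":    JudgePersona("senior",    leniency=0.0,  gives_hints=False),
    "principal": JudgePersona("principal", leniency=-0.2, gives_hints=False),
}

def persona_score(raw: float, persona: JudgePersona) -> float:
    """Clamp the persona-adjusted score to the [-1.0, +1.0] judge range."""
    return max(-1.0, min(1.0, raw + persona.leniency))
```

The same raw model output thus scores higher under the Junior persona than under Principal, which is what makes the expert-in-the-loop requirements tighten as the agent improves.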
---
## How It Works
```
┌───────────────────────────────────────────────────────────────────┐
│                        SELF-IMPROVING LOOP                        │
│                                                                   │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌────────────┐   │
│  │Adversarial│──►│ Real GKE  │──►│   Agent   │──►│ LLM Judge  │   │
│  │ Designer  │   │  Cluster  │   │(Qwen 1.7B │   │ (Claude/   │   │
│  │ (Claude)  │   │           │   │  + LoRA)  │   │  Qwen 14B) │   │
│  └─────▲─────┘   └───────────┘   └─────┬─────┘   └─────┬──────┘   │
│        │                               │               │          │
│        │         ┌──────────────┐      │ reward        │          │
│        │         │  Curriculum  │◄─────┴───────────────┘          │
│        └─────────┤  Controller  │                                 │
│    weak spots    │  (mastery    ├──► GRPO gradient update         │
│    & difficulty  │  tracking)   │    (TRL + vLLM on H100)         │
│                  └──────────────┘                                 │
└───────────────────────────────────────────────────────────────────┘
```
### The Loop
1. **Adversarial Designer** (Claude) creates targeted incidents based on the agent's weak spots — single faults for warmup, multi-fault cascading failures for harder tiers
2. **Fault Injection** executes real `kubectl` commands against a live GKE cluster (set memory to 4Mi, inject bad images, corrupt env vars, scale to zero)
3. **Agent** (Qwen3-1.7B + LoRA) receives a PagerDuty-style alert and must diagnose + fix using only kubectl commands — no hints about cluster topology
4. **LLM Judge** scores each action for SRE workflow correctness (triage → investigate → fix → verify) and verifies resolution by checking actual cluster state
5. **Curriculum Controller** tracks per-fault-type mastery and escalates difficulty — the agent gets harder scenarios as it improves
6. **GRPO** computes advantages across 8 parallel rollouts and updates the policy — the agent gets better at fixing incidents it previously failed
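In sketch form, one iteration of this loop looks roughly like the function below. The callables are stand-ins for the components above (not the actual API), and curriculum bookkeeping is omitted for brevity:

```python
def training_iteration(design_incident, inject_fault, run_episode,
                       judge_score, policy_update, num_rollouts: int = 8):
    """One iteration of the self-improving loop (illustrative sketch).

    The callables stand in for the adversarial designer, fault injector,
    agent, LLM judge, and GRPO trainer described above.
    """
    # 1-2. Design an incident targeting weak spots, inject real faults.
    scenario = design_incident()
    inject_fault(scenario)

    # 3-4. Roll out the agent several times on the same alert and
    # score each episode with the LLM judge.
    rewards = [judge_score(run_episode(scenario)) for _ in range(num_rollouts)]

    # 6. GRPO: each rollout's advantage is its reward normalized within
    # the group -- no value function needed.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    advantages = [(r - mean) / std for r in rewards]
    policy_update(advantages)
    return rewards, advantages
```

Because advantages are computed within the group, an episode only earns a positive gradient signal by beating its sibling rollouts on the *same* incident.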
### What Makes This Different
- **Real cluster, not a simulator** — kubectl commands execute against live GKE pods. OOMKills, CrashLoopBackOffs, and ImagePullBackOffs are real Kubernetes events
- **Self-generating scenarios** — the adversarial designer creates new incident types targeting the agent's weaknesses, so the training distribution adapts as the agent learns
- **Multi-layer verification** — programmatic health checks (expected pod count, restart tracking, OOM detection) + LLM judge verification prevents false resolution
- **No hardcoded knowledge** — the agent prompt contains zero information about cluster topology, namespace names, or deployment details. It must discover everything via `kubectl get pods -A`
- **Environment co-evolution** — training revealed bugs in our own infrastructure, making the platform better alongside the agent
---
## Architecture
```
H100 GPU (80GB)                                 GKE Cluster (3 namespaces)
┌──────────────────────────────────┐            ┌─────────────────────────┐
│                                  │            │ payments/               │
│  OpenEnv Server :8000            │  K8s API   │   payment-api (Flask)   │
│  ├─ Environment (reset/step) ────┼───────────►│   payment-gateway       │
│  ├─ Fault Injector               │            │   payment-worker        │
│  ├─ Curriculum Controller        │            │                         │
│  ├─ Adversarial Designer ────────┼─► Claude   │ frontend/               │
│  ├─ LLM Judge ───────────────────┼─► Claude   │   web-app (nginx)       │
│  │                               │            │   frontend-cache        │
│  GRPO Trainer (TRL 0.29.0)       │            │                         │
│  ├─ Qwen3-1.7B + LoRA (BF16)     │            │ auth/                   │
│  ├─ vLLM colocate (inference)    │            │   auth-service          │
│  └─ 8 rollouts × grad_accum=8    │            └─────────────────────────┘
│                                  │
└──────────────────────────────────┘
```
## Failure Types
| Type | What Gets Injected | What Agent Must Do |
|------|--------------------|---------------------|
| `oom_kill` | Memory limit set to 4Mi | Increase to 128Mi via `kubectl set resources` |
| `crashloop` | Container command set to `exit 1` | Remove bad command via `kubectl patch` |
| `image_pull` | Image set to `nginx:nonexistent-tag-99999` | Fix image tag via `kubectl set image` |
| `bad_config` | DATABASE_URL pointed to `wrong-host.invalid` | Correct env var via `kubectl set env` |
| `scale_zero` | Replicas set to 0 | Scale back up via `kubectl scale` |
| `liveness_probe` | Probe path set to `/nonexistent` | Fix probe via `kubectl patch` |
| `multi-fault` | 2-3 faults across different namespaces | Find and fix ALL faults |
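Injection itself is just an ordinary Kubernetes mutation per fault type. The sketch below shows the shape of the injector as kubectl command templates (the templates and `injection_command` helper are illustrative; the real injectors in `server/k8s_injectors.py` go through the K8s API directly rather than the CLI):

```python
# Illustrative injection templates derived from the table above.
# Hypothetical sketch -- see server/k8s_injectors.py for the real thing.
INJECTIONS = {
    "oom_kill":   "kubectl set resources deployment/{d} --limits=memory=4Mi -n {ns}",
    "image_pull": "kubectl set image deployment/{d} {c}=nginx:nonexistent-tag-99999 -n {ns}",
    "bad_config": "kubectl set env deployment/{d} DATABASE_URL=wrong-host.invalid -n {ns}",
    "scale_zero": "kubectl scale deployment/{d} --replicas=0 -n {ns}",
}

def injection_command(fault: str, deployment: str, namespace: str,
                      container: str = "app") -> str:
    """Render the kubectl command that injects the given fault."""
    return INJECTIONS[fault].format(d=deployment, ns=namespace, c=container)
```

Because each fault is a reversible mutation of a healthy manifest, the environment can always restore the baseline state between episodes.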
## Training Signal
The reward function has multiple layers to ensure clean GRPO signal:
- **Per-step LLM judge score** (-1.0 to +1.0) — evaluates SRE workflow quality (phase-aware: triage, investigate, fix, verify)
- **Repeat penalty** — -0.15 per repeated command (teaches exploration over repetition)
- **Resolution bonus** — +1.0 to +5.0 for confirmed fixes (efficiency-scaled: faster fixes earn higher bonuses)
- **Timeout penalty** — episodes that fail or time out are clamped to a net -2.0 total reward
- **Judge verification** — the LLM confirms the fix is real by reviewing cluster state + action history
- **Phase-order bonus** — +0.2 for following the correct SRE workflow, -0.3 for skipping phases
This produces clear separation: successful episodes score +3 to +8, failed episodes score -2.0. GRPO needs this variance to compute meaningful advantages.
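Put together, the shaping might look like the sketch below. The weights mirror the bullet list above, but the linear efficiency-scaling formula for the resolution bonus is an assumption; the real shaping lives in the environment server:

```python
def episode_reward(step_scores, repeats, resolved, steps_used,
                   phase_order_ok, timed_out, max_steps=16):
    """Combine the reward layers described above (illustrative sketch).

    step_scores: per-step judge scores in [-1.0, +1.0].
    """
    if timed_out:
        return -2.0  # failed episodes are wiped to a net -2.0

    total = sum(step_scores)                  # per-step LLM judge scores
    total -= 0.15 * repeats                   # repeat-command penalty
    total += 0.2 if phase_order_ok else -0.3  # phase-order bonus/penalty
    if resolved:
        # Efficiency-scaled resolution bonus: +5.0 for an instant fix,
        # tapering toward +1.0 as the episode approaches max_steps.
        # (The linear taper is an assumed formula.)
        total += 1.0 + 4.0 * (1 - steps_used / max_steps)
    return total
```

A clean 8-step fix with solid judge scores lands in the +5 range, while a timeout collapses to -2.0, which is exactly the separation the next paragraph describes.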
---
## Results
### Training Run 1: Qwen2.5-1.5B β The Cold Start

Our first attempt. 12 episodes, massive variance swinging between -7.5 and +3.7. The upward trend (+0.447/ep) was encouraging — the model *was* learning — but the signal was too noisy. We traced this to **environment bugs**: our command parser rejected valid kubectl syntax, the error penalty override was masking real progress, and the judge was truncating cluster snapshots.
The model was fighting two battles: learning Kubernetes AND working around our broken environment.
### Training Run 2: Qwen3-1.7B β Too Much Reward, Too Soon

After fixing the environment bugs, we switched to Qwen3-1.7B. It started strong (avg ~5.0) but the reward signal was *too generous* — the model found a plateau at 3.0-3.5 and stopped improving. The slight downward trend (-0.073/ep) over 29 episodes told us the curriculum wasn't pushing hard enough.
This run taught us that **a good environment needs to fight back**. We tightened the reward function, added repeat-command penalties, and activated adversarial mode.
### Training Run 3: Qwen3-1.7B β Environment Fights Back (Ongoing)
Current run with all fixes applied — adversarial scenarios, tighter rewards, repeat-command circuit breaker:
| Episode | Reward | Diagnosis | Fix |
|---------|--------|-----------|-----|
| 1 | +1.80 | 0.30 | -0.10 |
| 2 | +5.38 | 0.30 | +0.10 |
| 3 | -2.50 | 0.70 | 0.00 |
| 4 | **+6.58** | 0.70 | -0.60 |
| 5 | +5.45 | 0.70 | 0.00 |
| 6 | -2.00 | 0.55 | -0.60 |
| 7 | **+6.79** | 0.70 | +0.50 |
| 8 | +6.35 | 0.20 | +0.40 |
**Mean: 3.48 | Best: 6.79** — with real adversarial difficulty. The high-variance episodes (ep3, ep6 are negatives; ep4, ep7 are +6.5) show GRPO is getting the signal variance it needs to compute meaningful advantages.
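Treating these eight episodes as one GRPO group for illustration (in training, normalization actually happens across the 8 parallel rollouts of a single incident), the standard mean/std normalization turns that spread into advantages:

```python
# Rewards from the Run 3 table, episodes 1-8.
rewards = [1.80, 5.38, -2.50, 6.58, 5.45, -2.00, 6.79, 6.35]

mean = sum(rewards) / len(rewards)   # 3.48 (matches the reported mean)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mean) / std for r in rewards]

# Failed episodes (3 and 6) get strongly negative advantages; the +6.5
# episodes get strongly positive ones -- the spread GRPO learns from.
```

If every episode scored near the mean, the advantages would all collapse toward zero and the gradient update would carry almost no information, which is why the plateau in Run 2 stalled learning.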
### What the agent learned (from reward signal alone)
1. Run `kubectl get pods -A` to discover cluster topology
2. Identify fault types from pod STATUS column (OOMKilled, ImagePullBackOff, CrashLoopBackOff)
3. Map fault types to correct fix commands (`set resources`, `set image`, `patch`, `scale`)
4. Check ALL namespaces after each fix — there may be multiple faults
5. Never repeat a failed command — try a different approach
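Steps 2-3 amount to a learned mapping from pod STATUS to a fix template. An explicit distillation might look like the table below; in reality the mapping is implicit in the fine-tuned weights, and the template strings are illustrative:

```python
# Illustrative distillation of the learned triage policy -- the real
# policy is implicit in the model's weights, not an explicit table.
STATUS_TO_FIX = {
    "OOMKilled":
        "kubectl set resources deployment/{d} --limits=memory=128Mi -n {ns}",
    "ImagePullBackOff":
        "kubectl set image deployment/{d} {c}={image} -n {ns}",
    "CrashLoopBackOff":
        "kubectl patch deployment/{d} -n {ns} --type=json -p="
        "'[{{\"op\": \"remove\", \"path\": \"/spec/template/spec/containers/0/command\"}}]'",
}

def pick_fix(status: str, **fields: str) -> str:
    """Map a pod STATUS to the fix command the agent learned to apply."""
    return STATUS_TO_FIX[status].format(**fields)
```

The agent never sees this table; it only ever saw the reward difference between commands that matched these templates and commands that did not.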
### What we learned (from the agent's failures)
1. Our command parser was too strict — valid kubectl syntax was being rejected
2. Judge snapshot truncation hid pods alphabetically after `payment-*`
3. Error penalty override was masking real progress with false negatives
4. Too-generous rewards cause plateaus — the environment must fight back
5. The environment needs to evolve alongside the agent — static environments miss bugs
---
## Training with HF TRL (Colab)
A complete training notebook is provided at [`kube_sre_gym_colab.ipynb`](kube_sre_gym_colab.ipynb) using **HF TRL's GRPO** implementation. The notebook covers:
1. Connect to the OpenEnv server on HF Spaces
2. Configure GRPO training with TRL (`GRPOConfig`, `GRPOTrainer`)
3. Run training episodes against the live environment
4. Save checkpoints to HuggingFace Hub
Training uses TRL's experimental OpenEnv integration (`trl.experimental.openenv.generate_rollout_completions`) for seamless environment-trainer communication.
## Quick Start
```python
from kube_sre_gym import KubeSreGymAction, KubeSreGymEnv
with KubeSreGymEnv(base_url="http://localhost:8000") as client:
    obs = client.reset()
    print(obs.observation.command_output)  # PagerDuty alert

    obs = client.step(KubeSreGymAction(command="kubectl get pods -A"))
    obs = client.step(KubeSreGymAction(command="kubectl describe pod payment-api-xxx -n payments"))
    obs = client.step(KubeSreGymAction(command="fix: kubectl set resources deployment/payment-api --limits=memory=128Mi -n payments"))
    # reward > 0 if the fix is correct; episode done
```
## Deployment on HF Spaces
The environment is deployed as a Docker-based HF Space using OpenEnv v0.2.1:
```dockerfile
# Dockerfile uses openenv-base image
FROM ghcr.io/meta-pytorch/openenv-base:latest
# Serves OpenEnv HTTP/WebSocket API on port 8000
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
```
Configuration in `openenv.yaml`:
```yaml
spec_version: 1
name: kube_sre_gym
type: space
runtime: fastapi
app: server.app:app
port: 8000
```
## Training on H100
**Install**
```bash
git clone https://huggingface.co/spaces/openenv-community/kube-sre-gym && cd kube-sre-gym
pip install -e ".[train]"
```
**Set credentials**
```bash
export K8S_TOKEN=<gke-bearer-token>
export K8S_ENDPOINT=<gke-api-url>
export K8S_CA_CERT=<base64-ca-cert>
export ANTHROPIC_API_KEY=<key> # for adversarial designer + judge
export HF_TOKEN=<token> # for pushing checkpoints
```
**Launch (2 terminals)**
```bash
# Terminal 1: Environment server
GYM_MODE=adversarial LLM_BACKEND=anthropic uv run server
# Terminal 2: GRPO training
python train.py --vllm-mode colocate --num-generations 8 --max-steps 8 --save-steps 1 \
--push-to-hub --hub-repo your-name/k8s-sre-agent
```
The curriculum automatically progresses: warmup (single faults) → intermediate (harder faults) → expert (multi-fault adversarial scenarios designed by Claude).
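The escalation rule can be sketched as a small controller that tracks per-fault-type success and moves up a tier once every fault type seen so far is mastered. The 0.7 threshold, 5-episode window, and 3-episode minimum are assumptions for illustration; the real logic lives in `server/curriculum.py`:

```python
TIERS = ["warmup", "beginner", "intermediate", "advanced", "expert"]

class Curriculum:
    """Illustrative curriculum controller (see server/curriculum.py)."""

    def __init__(self, mastery_threshold: float = 0.7):
        self.tier = 0
        self.threshold = mastery_threshold
        self.history = {}  # fault_type -> list of resolved flags

    def record(self, fault_type: str, resolved: bool) -> None:
        self.history.setdefault(fault_type, []).append(resolved)
        # Escalate once every fault type seen so far is mastered over
        # its last 5 episodes (requiring at least 3 samples each).
        windows = [h[-5:] for h in self.history.values()]
        if all(len(w) >= 3 and sum(w) / len(w) >= self.threshold
               for w in windows):
            self.tier = min(self.tier + 1, len(TIERS) - 1)

    @property
    def difficulty(self) -> str:
        return TIERS[self.tier]
```

Tracking mastery per fault type (rather than one global score) is what lets the adversarial designer aim new incidents at whichever fault the agent still fails.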
## Evaluation
```bash
# Compare base model vs trained checkpoint
python eval.py
```
Runs both models through random adversarial scenarios and reports resolution rate, average reward, and steps-to-fix.
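The reported metrics aggregate per-episode results roughly as follows (the `summarize` helper and episode tuple layout are illustrative; see `eval.py` for the actual implementation):

```python
def summarize(episodes):
    """Aggregate eval episodes into the three reported metrics.

    episodes: list of (resolved: bool, reward: float, steps: int).
    Illustrative sketch of what eval.py reports.
    """
    resolved = [e for e in episodes if e[0]]
    return {
        "resolution_rate": len(resolved) / len(episodes),
        "avg_reward": sum(e[1] for e in episodes) / len(episodes),
        # Steps-to-fix only makes sense over resolved episodes.
        "avg_steps_to_fix": (sum(e[2] for e in resolved) / len(resolved)
                             if resolved else float("nan")),
    }
```

Running both checkpoints through the same random adversarial scenarios keeps the comparison fair: any gap in these numbers comes from the policy, not the incident mix.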
## Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| `K8S_TOKEN` | Bearer token for GKE | required |
| `K8S_ENDPOINT` | GKE API endpoint | required |
| `K8S_CA_CERT` | Base64 CA cert | required |
| `GYM_MODE` | `standard` or `adversarial` | `standard` |
| `LLM_BACKEND` | `openai`, `hf`, or `anthropic` | `openai` |
| `ANTHROPIC_API_KEY` | For adversarial designer + judge | required in adversarial mode |
| `MAX_STEPS` | Max commands per episode | `16` |
| `EVAL_MIN_DIFFICULTY` | Override min difficulty for eval | `0.0` |
## Project Structure
```
kube-sre-gym/
├── train.py                    # GRPO training (TRL 0.29.0 + vLLM colocate)
├── eval.py                     # Base vs trained model comparison
├── kube_sre_gym_colab.ipynb    # Google Colab training notebook (HF TRL)
├── plot_rewards.py             # Reward curve visualization
├── models.py                   # Action, Observation, State dataclasses
├── client.py                   # KubeSreGymEnv sync client
├── Dockerfile                  # HF Spaces deployment (OpenEnv base image)
├── openenv.yaml                # OpenEnv v0.2.1 Space config
├── server/
│   ├── kube_sre_gym_environment.py  # Core env: reset → inject → step → judge → reward
│   ├── k8s_backend.py          # K8s auth, execute, reset, health checks
│   ├── k8s_commands.py         # kubectl command handlers (get/describe/logs/set/patch)
│   ├── k8s_injectors.py        # Real fault injection via K8s API
│   ├── adversarial_designer.py # LLM designs multi-step incidents
│   ├── judge.py                # LLMJudge + AdversarialJudge (phase-aware SRE scoring)
│   ├── curriculum.py           # Progressive difficulty + mastery tracking
│   ├── scenario_generator.py   # Fault scenario pool
│   ├── llm_client.py           # OpenAI/HF/Anthropic wrapper
│   ├── constants.py            # Cluster topology, healthy state definitions
│   └── app.py                  # FastAPI + WebSocket server
└── sample_app/
    ├── namespaces.yaml         # payments, frontend, auth
    └── base/                   # Healthy deployment manifests
```
## Key Design Decisions
1. **Real cluster over simulator** — Simulators can't reproduce the timing, state transitions, and failure modes of real Kubernetes. OOM kills happen when the kernel actually runs out of memory, not when a flag is set.
2. **Adversarial self-play** — The designer targets the agent's weaknesses (tracked by curriculum), creating an automatic curriculum that gets harder as the agent improves. No manual scenario authoring needed.
3. **Multi-layer resolution check** — Programmatic (expected pod count + restart tracking + OOM detection) + LLM judge verification. This prevents false resolution from OOM-flapping pods or partial fixes in multi-fault scenarios.
4. **No topology in prompt** — The agent receives zero information about namespaces, deployment names, or images. It must learn to discover the cluster layout via `kubectl get pods -A`, making the learned policy transferable to any cluster.
5. **GRPO over PPO** — GRPO compares multiple rollouts of the same prompt, producing stable advantages without a value function. Better suited for sparse, delayed rewards (most reward comes at episode end).
6. **Environment co-evolution** — We intentionally treat environment bugs as part of the story. When training exposed issues in our command parser, judge, and health checks, we fixed them — making the environment better alongside the agent. This is recursive self-improvement at the platform level.
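Decision 3's resolution check can be sketched as an AND over both layers. The names and pod/restart representation here are illustrative; the real checks live in `server/k8s_backend.py` and `server/judge.py`:

```python
def is_resolved(pods, expected_count, restart_delta, judge_confirms):
    """Multi-layer resolution check (illustrative sketch).

    pods: [{"name": ..., "status": ...}] snapshot of the namespace.
    restart_delta: restarts observed during the verification window;
    a stable fix should add zero restarts.
    """
    programmatic_ok = (
        len(pods) == expected_count
        and all(p["status"] == "Running" for p in pods)
        # An OOM-flapping pod can look Running briefly; requiring zero
        # new restarts catches it.
        and restart_delta == 0
    )
    # Only count the incident as resolved if BOTH layers agree.
    return programmatic_ok and judge_confirms
```

Requiring agreement between the cheap programmatic layer and the LLM judge is what blocks reward hacking: neither a lucky snapshot nor a persuasive transcript is enough on its own.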