https://github.com/sid-rp/kube-sre-gym


---
title: Kube SRE Gym
emoji: 🧠
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---


# Kube SRE Gym


### Can a 0.6B model learn to be an on-call SRE from scratch?


We gave a tiny language model a pager, a live Kubernetes cluster, and zero knowledge of what a pod even is. No fine-tuning on DevOps docs. No few-shot examples. Just a PagerDuty alert and a `kubectl` prompt.


Within 8 episodes, it learned to discover namespaces, read pod statuses, tell OOMKills apart from CrashLoopBackOffs, and apply the correct fix. By episode 4, it was resolving incidents faster than our hand-written baselines.


**This is Kube SRE Gym**: a self-improving environment where an RL agent learns to diagnose and fix real production Kubernetes failures through adversarial self-play, curriculum-driven difficulty, and GRPO.


> **1st Place, OpenEnv Hackathon** (PyTorch + Cerebral Valley, $15K prize) | Built with [OpenEnv v0.2.1](https://github.com/meta-pytorch/OpenEnv/tree/v0.2.1) | Deployed on [HF Spaces](https://huggingface.co/spaces/openenv-community/kube-sre-gym) | Training via [HF TRL](https://github.com/huggingface/trl) in [Colab](kube_sre_gym_colab.ipynb)


[OpenEnv Hackathon Gallery](https://cerebralvalley.ai/e/openenv-hackathon-sf/hackathon/gallery)


---


## The Story: From Blind to On-Call


### Act 1: The Cold Start


Episode 1. The agent receives its first alert: *"CRITICAL: payment-gateway pods OOMKilled in payments namespace."*


It has never seen Kubernetes before. It doesn't know what namespaces are, what pods look like, or that `kubectl` even exists. It tries random commands. Everything fails. Reward: **-2.0**.


### Act 2: First Light


Episode 4. Something clicks. The agent discovers `kubectl get pods -A`, a single command that reveals the entire cluster. It sees `OOMKilled` in the STATUS column. It connects this to the alert. It runs `kubectl set resources deployment/payment-gateway --limits=memory=128Mi -n payments`.


The pod restarts. The health check passes. The LLM judge confirms resolution. Reward: **+3.95**.


### Act 3: The Environment Fights Back


As the agent masters simple faults, the **Adversarial Designer** (Claude) notices. It starts creating compound incidents: an OOMKill in `payments` *and* a bad image in `frontend` simultaneously. Red herrings appear. The agent must learn to triage, not just react.


The **Curriculum Controller** tracks per-fault-type mastery and escalates: warmup → beginner → intermediate → advanced → expert. The training distribution adapts in real time. No scenario is ever repeated.


### Act 4: The Environment Improves Itself


Here's what made this project different from what we planned: **the environment itself had bugs that training exposed.**


During training, we discovered our kubectl command parser only accepted the `deployment/name` format (with a slash). The model kept sending the perfectly valid `kubectl scale deployment frontend-cache --replicas=1`, and the environment rejected it every time. The model was right. Our environment was wrong.


We also found the LLM judge was truncating cluster snapshots at 2000 characters, cutting off pods that sort alphabetically after `payment-*`. And a race condition between health checks and judge API calls was causing false negatives: pods would appear healthy during the health check but unhealthy by the time the judge snapshot ran.


**The agent's failures taught us to fix the environment.** This is the self-improvement loop we didn't expect: not just the model getting better, but the training infrastructure co-evolving with it.


---


## Problem Statements Addressed


### Primary: Statement 4 (Self-Improvement)


Kube SRE Gym is an environment where the agent **generates its own challenges, escalates difficulty, and improves through adaptive curricula**, exactly the recursive skill amplification described in Statement 4.


- **Adversarial self-play**: Claude designs incidents that target the agent's tracked weaknesses
- **Automatic curriculum**: Difficulty escalates as per-fault-type mastery improves (warmup → beginner → intermediate → advanced → expert); a minimal sketch of the mastery tracking follows this list
- **No manual authoring**: The training distribution adapts as the agent learns, yielding an effectively unlimited stream of novel scenarios
- **Co-evolutionary improvement**: Training runs exposed environment bugs, making the platform itself better
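
To make the curriculum mechanism concrete, here is a minimal sketch of per-fault-type mastery tracking and tier escalation. It is illustrative only: the class name, promotion threshold, and window size are assumptions, and the project's actual logic lives in `server/curriculum.py`.

```python
# Illustrative mastery tracker: promote a fault type to the next tier once the
# agent resolves most of its recent incidents of that type (thresholds assumed).
TIERS = ["warmup", "beginner", "intermediate", "advanced", "expert"]

class MasteryCurriculum:
    def __init__(self, promote_at: float = 0.75, window: int = 10):
        self.results = {}      # fault_type -> recent resolved/failed flags
        self.tier_idx = {}     # fault_type -> index into TIERS
        self.promote_at = promote_at
        self.window = window

    def record(self, fault_type: str, resolved: bool) -> None:
        runs = self.results.setdefault(fault_type, [])
        runs.append(resolved)
        recent = runs[-self.window:]
        tier = self.tier_idx.setdefault(fault_type, 0)
        # Escalate only once a full window of recent episodes has been observed.
        if len(recent) == self.window and sum(recent) / self.window >= self.promote_at:
            if tier < len(TIERS) - 1:
                self.tier_idx[fault_type] = tier + 1
                runs.clear()   # start measuring mastery fresh at the new tier

    def difficulty(self, fault_type: str) -> str:
        return TIERS[self.tier_idx.get(fault_type, 0)]
```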


### Secondary: Statement 3.1 (World Modeling / Professional Tasks)


The agent interacts with **real Kubernetes tools and APIs**, not mocked responses or shortcuts. It must maintain internal state across multi-step kubectl workflows and reason about the causal effects of its actions on a live cluster.


- **Real tool interaction**: Every `kubectl` command executes against a live GKE cluster
- **Multi-step workflows**: Triage → investigate → fix → verify, with no shortcuts
- **Persistent world state**: Pod restarts, OOM events, and cascading failures are real K8s events


### Partner Sub-Theme: Snorkel AI (Simulated Experts-in-the-Loop)


The LLM judge uses **three expert personas** (Junior, Senior, Principal) with progressively stricter evaluation criteria, simulating interaction with subject-matter experts whose requirements change as the agent improves (a toy prompt-builder sketch follows this list):


- **Junior**: Lenient scoring, partial credit, provides hints
- **Senior**: Standard SRE expectations, rewards systematic diagnosis
- **Principal**: High standards, penalizes inefficiency, rewards elegant fixes
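
As a toy illustration of how persona conditioning might look, the snippet below builds a persona-specific judging prompt. The persona table and prompt wording are hypothetical, not the contents of `server/judge.py`.

```python
# Hypothetical persona table: partial-credit and hint policy vary by seniority.
PERSONAS = {
    "junior":    {"partial_credit": True,  "hints": True,  "note": "Be lenient."},
    "senior":    {"partial_credit": True,  "hints": False, "note": "Apply standard SRE expectations."},
    "principal": {"partial_credit": False, "hints": False, "note": "Penalize inefficiency; reward elegant fixes."},
}

def build_judge_prompt(persona: str, alert: str, transcript: str) -> str:
    p = PERSONAS[persona]
    rubric = [f"You are a {persona} SRE reviewing an on-call incident response.", p["note"]]
    rubric.append("Reward the triage -> investigate -> fix -> verify workflow.")
    if not p["partial_credit"]:
        rubric.append("Give no credit for partial or inefficient fixes.")
    if p["hints"]:
        rubric.append("If the response fails, include one concrete hint.")
    return "\n".join(rubric) + f"\n\nAlert: {alert}\n\nActions:\n{transcript}\n\nScore between -1.0 and 1.0."
```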


---


## How It Works


```
                           SELF-IMPROVING LOOP

+-------------+      +-----------+      +------------+      +------------+
| Adversarial |----->| Real GKE  |----->|   Agent    |----->| LLM Judge  |
|  Designer   |      |  Cluster  |      | (Qwen 1.7B |      | (Claude /  |
|  (Claude)   |      |           |      |  + LoRA)   |      |  Qwen 14B) |
+------^------+      +-----------+      +-----+------+      +-----+------+
       |                                      |                   |
       |             +---------------+        |      reward       |
       |             |  Curriculum   |<-------+-------------------+
       +-------------|  Controller   |
      weak spots     |   (mastery    |-----> GRPO gradient update
      & difficulty   |   tracking)   |       (TRL + vLLM on H100)
                     +---------------+
```


### The Loop


1. **Adversarial Designer** (Claude) creates targeted incidents based on the agent's weak spots: single faults for warmup, multi-fault cascading failures for harder tiers
2. **Fault Injection** executes real `kubectl` commands against a live GKE cluster (set memory to 4Mi, inject bad images, corrupt env vars, scale to zero)
3. **Agent** (Qwen3-1.7B + LoRA) receives a PagerDuty-style alert and must diagnose and fix using only kubectl commands, with no hints about cluster topology
4. **LLM Judge** scores each action for SRE workflow correctness (triage → investigate → fix → verify) and verifies resolution by checking actual cluster state
5. **Curriculum Controller** tracks per-fault-type mastery and escalates difficulty, so the agent gets harder scenarios as it improves
6. **GRPO** computes advantages across 8 parallel rollouts and updates the policy, so the agent gets better at fixing incidents it previously failed (a minimal client-side sketch of one episode follows this list)
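
For one episode, the whole interaction surface is the environment client shown in the Quick Start below. The helper here is a sketch: the `policy` callable and the `reward`/`done` field names on the step result are assumptions, and the real rollout logic lives in `train.py`.

```python
# Sketch of one episode against the environment (field names on the step
# result are assumed; adjust to the actual client models).
from kube_sre_gym import KubeSreGymAction, KubeSreGymEnv

def run_episode(policy, base_url: str = "http://localhost:8000", max_steps: int = 16):
    rewards = []
    with KubeSreGymEnv(base_url=base_url) as client:
        obs = client.reset()                        # a fresh fault is injected here
        history = [obs.observation.command_output]  # PagerDuty-style alert text
        for _ in range(max_steps):
            command = policy(history)               # e.g. "kubectl get pods -A"
            obs = client.step(KubeSreGymAction(command=command))
            history.append(obs.observation.command_output)
            rewards.append(obs.reward)              # per-step judge score (assumed field)
            if obs.done:                            # resolved or timed out (assumed field)
                break
    return rewards
```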


### What Makes This Different


- **Real cluster, not a simulator**: kubectl commands execute against live GKE pods. OOMKills, CrashLoopBackOffs, and ImagePullBackOffs are real Kubernetes events
- **Self-generating scenarios**: the adversarial designer creates new incident types targeting the agent's weaknesses, so the training distribution adapts as the agent learns
- **Multi-layer verification**: programmatic health checks (expected pod count, restart tracking, OOM detection) plus LLM judge verification prevent false resolution
- **No hardcoded knowledge**: the agent prompt contains zero information about cluster topology, namespace names, or deployment details. It must discover everything via `kubectl get pods -A`
- **Environment co-evolution**: training revealed bugs in our own infrastructure, making the platform better alongside the agent


---


## Architecture


```
H100 GPU (80GB)                                       GKE Cluster (3 namespaces)
+------------------------------------+                +--------------------------+
|                                    |                | payments/                |
|  OpenEnv Server :8000              |    K8s API     |   payment-api (Flask)    |
|   |- Environment (reset/step) -----|--------------->|   payment-gateway        |
|   |- Fault Injector                |                |   payment-worker         |
|   |- Curriculum Controller         |                |                          |
|   |- Adversarial Designer ---------|--> Claude      | frontend/                |
|   |- LLM Judge --------------------|--> Claude      |   web-app (nginx)        |
|                                    |                |   frontend-cache         |
|  GRPO Trainer (TRL 0.29.0)         |                |                          |
|   |- Qwen3-1.7B + LoRA (BF16)      |                | auth/                    |
|   |- vLLM colocate (inference)     |                |   auth-service           |
|   |- 8 rollouts × grad_accum=8     |                +--------------------------+
|                                    |
+------------------------------------+
```


## Failure Types


| Type | What Gets Injected | What Agent Must Do |
|------|--------------------|--------------------|
| `oom_kill` | Memory limit set to 4Mi | Increase to 128Mi via `kubectl set resources` |
| `crashloop` | Container command set to `exit 1` | Remove bad command via `kubectl patch` |
| `image_pull` | Image set to `nginx:nonexistent-tag-99999` | Fix image tag via `kubectl set image` |
| `bad_config` | DATABASE_URL pointed to `wrong-host.invalid` | Correct env var via `kubectl set env` |
| `scale_zero` | Replicas set to 0 | Scale back up via `kubectl scale` |
| `liveness_probe` | Probe path set to `/nonexistent` | Fix probe via `kubectl patch` |
| `multi-fault` | 2-3 faults across different namespaces | Find and fix ALL faults |
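
For reference, the fixes in the table map onto ordinary kubectl commands along these lines. The deployment names, namespaces, container names, and target values below are placeholders (the agent has to discover them itself), so treat this as a cheat sheet rather than the environment's expected answers.

```python
# Placeholder fix commands per fault type; names and values are illustrative.
EXAMPLE_FIXES = {
    "oom_kill": "kubectl set resources deployment/payment-gateway --limits=memory=128Mi -n payments",
    "crashloop": ("kubectl patch deployment payment-worker -n payments --type=json "
                  "-p '[{\"op\": \"remove\", \"path\": \"/spec/template/spec/containers/0/command\"}]'"),
    "image_pull": "kubectl set image deployment/web-app web-app=nginx:stable -n frontend",
    "bad_config": "kubectl set env deployment/payment-api DATABASE_URL=postgres://db:5432/payments -n payments",
    "scale_zero": "kubectl scale deployment frontend-cache --replicas=1 -n frontend",
    "liveness_probe": ("kubectl patch deployment auth-service -n auth --type=json "
                       "-p '[{\"op\": \"replace\", "
                       "\"path\": \"/spec/template/spec/containers/0/livenessProbe/httpGet/path\", "
                       "\"value\": \"/health\"}]'"),
}
```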


## Training Signal


The reward function has multiple layers to ensure a clean GRPO signal:


- **Per-step LLM judge score** (-1.0 to +1.0): evaluates SRE workflow quality (phase-aware: triage, investigate, fix, verify)
- **Repeat penalty**: -0.15 per repeated command (teaches exploration over repetition)
- **Resolution bonus**: +1.0 to +5.0 for confirmed fixes (efficiency-scaled: faster fixes earn higher bonuses)
- **Timeout penalty**: failed episodes are clamped to a net total reward of -2.0
- **Judge verification**: the LLM confirms the fix is real by reviewing cluster state and action history
- **Phase-order bonus**: +0.2 for following the correct SRE workflow, -0.3 for skipping phases


This produces clear separation: successful episodes score +3 to +8, failed episodes score -2.0. GRPO needs this variance to compute meaningful advantages.
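
Composed together, those layers look roughly like the function below. The constants mirror the bullets above, but the function itself is a sketch; the authoritative logic is in the environment's judge and reward code.

```python
# Illustrative reward composition; constants come from the list above.
def episode_reward(step_scores, repeated_commands, resolved, steps_used,
                   phase_order_ok=True, max_steps=16):
    if not resolved and steps_used >= max_steps:
        return -2.0                                # timeout wipes the episode to net -2.0
    total = sum(step_scores)                       # per-step judge scores in [-1.0, +1.0]
    total -= 0.15 * repeated_commands              # repeat-command penalty
    total += 0.2 if phase_order_ok else -0.3       # SRE phase-order bonus / penalty
    if resolved:
        efficiency = 1.0 - steps_used / max_steps  # fewer steps -> bigger bonus
        total += 1.0 + 4.0 * efficiency            # resolution bonus in [+1.0, +5.0]
    return total
```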


---


## Results


### Training Run 1: Qwen2.5-1.5B (The Cold Start)


Our first attempt: 12 episodes with massive variance, swinging between -7.5 and +3.7. The upward trend (+0.447/episode) was encouraging (the model *was* learning), but the signal was too noisy. We traced this to **environment bugs**: our command parser rejected valid kubectl syntax, the error penalty override was masking real progress, and the judge was truncating cluster snapshots.


The model was fighting two battles: learning Kubernetes AND working around our broken environment.


### Training Run 2: Qwen3-1.7B (Too Much Reward, Too Soon)


After fixing the environment bugs, we switched to Qwen3-1.7B. It started strong (average reward around 5.0), but the reward signal was *too generous*: the model settled onto a plateau at 3.0-3.5 and stopped improving. The slight downward trend (-0.073/episode) over 29 episodes told us the curriculum wasn't pushing hard enough.


This run taught us that **a good environment needs to fight back**. We tightened the reward function, added repeat-command penalties, and activated adversarial mode.


### Training Run 3: Qwen3-1.7B (Environment Fights Back, Ongoing)


Current run with all fixes applied (adversarial scenarios, tighter rewards, repeat-command circuit breaker):


| Episode | Reward | Diagnosis | Fix |
|---------|--------|-----------|-----|
| 1 | +1.80 | 0.30 | -0.10 |
| 2 | +5.38 | 0.30 | +0.10 |
| 3 | -2.50 | 0.70 | 0.00 |
| 4 | **+6.58** | 0.70 | -0.60 |
| 5 | +5.45 | 0.70 | 0.00 |
| 6 | -2.00 | 0.55 | -0.60 |
| 7 | **+6.79** | 0.70 | +0.50 |
| 8 | +6.35 | 0.20 | +0.40 |


**Mean: 3.48 | Best: 6.79**, under real adversarial difficulty. The high variance across episodes (3 and 6 are negative; 4 and 7 exceed +6.5) shows that GRPO is getting the reward spread it needs to compute meaningful advantages.


### What the agent learned (from reward signal alone)


1. Run `kubectl get pods -A` to discover cluster topology
2. Identify fault types from the pod STATUS column (OOMKilled, ImagePullBackOff, CrashLoopBackOff)
3. Map fault types to the correct fix commands (`set resources`, `set image`, `patch`, `scale`)
4. Check ALL namespaces after each fix: there may be multiple faults
5. Never repeat a failed command; try a different approach


### What we learned (from the agent's failures)


1. Our command parser was too strict: valid kubectl syntax was being rejected
2. Judge snapshot truncation hid pods that sort alphabetically after `payment-*`
3. The error penalty override was masking real progress with false negatives
4. Too-generous rewards cause plateaus: the environment must fight back
5. The environment needs to evolve alongside the agent: a static environment never surfaces its own bugs


---


## Training with HF TRL (Colab)


A complete training notebook is provided at [`kube_sre_gym_colab.ipynb`](kube_sre_gym_colab.ipynb) using **HF TRL's GRPO** implementation. The notebook covers:


1. Connect to the OpenEnv server on HF Spaces
2. Configure GRPO training with TRL (`GRPOConfig`, `GRPOTrainer`)
3. Run training episodes against the live environment
4. Save checkpoints to the Hugging Face Hub


Training uses TRL's experimental OpenEnv integration (`trl.experimental.openenv.generate_rollout_completions`) for seamless environment-trainer communication.
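
For orientation, here is a minimal sketch of the TRL side using `GRPOConfig` and `GRPOTrainer`. The prompt dataset and reward function below are stand-ins so the snippet is self-contained; the actual notebook drives rollouts and rewards through the live environment via the experimental OpenEnv integration instead.

```python
# Minimal GRPO setup with TRL (stand-in dataset and reward; see the notebook
# for the real environment-driven rollout wiring).
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompts standing in for environment-generated alerts.
train_dataset = Dataset.from_dict({
    "prompt": ["CRITICAL: payment-gateway pods OOMKilled in payments namespace."] * 64
})

def stub_reward(completions, **kwargs):
    # In the real setup, rewards come back from the environment's LLM judge.
    return [0.0 for _ in completions]

args = GRPOConfig(
    output_dir="k8s-sre-agent",
    num_generations=8,               # 8 rollouts per prompt, as in training
    per_device_train_batch_size=8,   # must be divisible by num_generations
    max_steps=8,
    logging_steps=1,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B",
    args=args,
    train_dataset=train_dataset,
    reward_funcs=stub_reward,
)
trainer.train()
```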


## Quick Start


```python
from kube_sre_gym import KubeSreGymAction, KubeSreGymEnv

with KubeSreGymEnv(base_url="http://localhost:8000") as client:
    obs = client.reset()
    print(obs.observation.command_output)  # PagerDuty alert

    obs = client.step(KubeSreGymAction(command="kubectl get pods -A"))
    obs = client.step(KubeSreGymAction(command="kubectl describe pod payment-api-xxx -n payments"))
    obs = client.step(KubeSreGymAction(command="fix: kubectl set resources deployment/payment-api --limits=memory=128Mi -n payments"))
    # reward > 0 if the fix is correct; episode done
```


## Deployment on HF Spaces


The environment is deployed as a Docker-based HF Space using OpenEnv v0.2.1:


```dockerfile
# Dockerfile uses the openenv-base image
FROM ghcr.io/meta-pytorch/openenv-base:latest
# Serves the OpenEnv HTTP/WebSocket API on port 8000
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
```


Configuration in `openenv.yaml`:
```yaml
spec_version: 1
name: kube_sre_gym
type: space
runtime: fastapi
app: server.app:app
port: 8000
```


## Training on H100


**Install**
```bash
git clone https://huggingface.co/spaces/openenv-community/kube-sre-gym && cd kube-sre-gym
pip install -e ".[train]"
```


**Set credentials**
```bash
export K8S_TOKEN=<gke-bearer-token>
export K8S_ENDPOINT=<gke-api-url>
export K8S_CA_CERT=<base64-ca-cert>
export ANTHROPIC_API_KEY=<key>   # for adversarial designer + judge
export HF_TOKEN=<token>          # for pushing checkpoints
```


**Launch (2 terminals)**
```bash
# Terminal 1: Environment server
GYM_MODE=adversarial LLM_BACKEND=anthropic uv run server

# Terminal 2: GRPO training
python train.py --vllm-mode colocate --num-generations 8 --max-steps 8 --save-steps 1 \
  --push-to-hub --hub-repo your-name/k8s-sre-agent
```


The curriculum automatically progresses: warmup (single faults) → intermediate (harder faults) → expert (multi-fault adversarial scenarios designed by Claude).


## Evaluation


```bash
# Compare base model vs trained checkpoint
python eval.py
```


Runs both models through random adversarial scenarios and reports resolution rate, average reward, and steps-to-fix.


## Configuration


| Variable | Description | Default |
|----------|-------------|---------|
| `K8S_TOKEN` | Bearer token for GKE | required |
| `K8S_ENDPOINT` | GKE API endpoint | required |
| `K8S_CA_CERT` | Base64 CA cert | required |
| `GYM_MODE` | `standard` or `adversarial` | `standard` |
| `LLM_BACKEND` | `openai`, `hf`, or `anthropic` | `openai` |
| `ANTHROPIC_API_KEY` | For adversarial designer + judge | required in adversarial mode |
| `MAX_STEPS` | Max commands per episode | `16` |
| `EVAL_MIN_DIFFICULTY` | Override min difficulty for eval | `0.0` |


## Project Structure


```
kube-sre-gym/
├── train.py                         # GRPO training (TRL 0.29.0 + vLLM colocate)
├── eval.py                          # Base vs trained model comparison
├── kube_sre_gym_colab.ipynb         # Google Colab training notebook (HF TRL)
├── plot_rewards.py                  # Reward curve visualization
├── models.py                        # Action, Observation, State dataclasses
├── client.py                        # KubeSreGymEnv sync client
├── Dockerfile                       # HF Spaces deployment (OpenEnv base image)
├── openenv.yaml                     # OpenEnv v0.2.1 Space config
├── server/
│   ├── kube_sre_gym_environment.py  # Core env: reset → inject → step → judge → reward
│   ├── k8s_backend.py               # K8s auth, execute, reset, health checks
│   ├── k8s_commands.py              # kubectl command handlers (get/describe/logs/set/patch)
│   ├── k8s_injectors.py             # Real fault injection via K8s API
│   ├── adversarial_designer.py      # LLM designs multi-step incidents
│   ├── judge.py                     # LLMJudge + AdversarialJudge (phase-aware SRE scoring)
│   ├── curriculum.py                # Progressive difficulty + mastery tracking
│   ├── scenario_generator.py        # Fault scenario pool
│   ├── llm_client.py                # OpenAI/HF/Anthropic wrapper
│   ├── constants.py                 # Cluster topology, healthy state definitions
│   └── app.py                       # FastAPI + WebSocket server
└── sample_app/
    ├── namespaces.yaml              # payments, frontend, auth
    └── base/                        # Healthy deployment manifests
```

## Key Design Decisions


1. **Real cluster over simulator**: Simulators can't reproduce the timing, state transitions, and failure modes of real Kubernetes. OOM kills happen when the kernel actually runs out of memory, not when a flag is set.


2. **Adversarial self-play**: The designer targets the agent's weaknesses (tracked by the curriculum), creating an automatic curriculum that gets harder as the agent improves. No manual scenario authoring needed.


3. **Multi-layer resolution check**: Programmatic checks (expected pod count + restart tracking + OOM detection) plus LLM judge verification. This prevents false resolution from OOM-flapping pods or partial fixes in multi-fault scenarios. (A minimal health-check sketch appears after this list.)


4. **No topology in prompt**: The agent receives zero information about namespaces, deployment names, or images. It must learn to discover the cluster layout via `kubectl get pods -A`, making the learned policy transferable to any cluster.


5. **GRPO over PPO**: GRPO compares multiple rollouts of the same prompt, producing stable advantages without a value function. It is better suited to sparse, delayed rewards, since most reward arrives at episode end. (See the advantage sketch after this list.)


6. **Environment co-evolution**: We intentionally treat environment bugs as part of the story. When training exposed issues in our command parser, judge, and health checks, we fixed them, making the environment better alongside the agent. This is recursive self-improvement at the platform level.
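
The multi-layer check in decision 3 can be sketched as follows. This is assumed logic, not the project's `server/k8s_backend.py`: a fix only counts as resolved when every pod is Running and Ready and restart counts have stopped climbing since the fix.

```python
# Illustrative programmatic health check (assumed logic, not k8s_backend.py).
from kubernetes import client, config

def namespace_healthy(namespace: str, expected_pods: int, restarts_before: dict) -> bool:
    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(namespace).items
    if len(pods) < expected_pods:                      # e.g. scale_zero not yet fixed
        return False
    for pod in pods:
        if pod.status.phase != "Running":
            return False
        for cs in pod.status.container_statuses or []:
            if not cs.ready:
                return False
            # Restart count must not have grown since the fix was applied,
            # which catches OOM-flapping pods that look momentarily healthy.
            if cs.restart_count > restarts_before.get(pod.metadata.name, 0):
                return False
    return True
```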
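
And the group-relative advantage behind decision 5 is simple enough to show directly; the numbers in the example are just the run-3 episode rewards reused as a stand-in rollout group.

```python
# Group-relative advantages as used by GRPO-style updates: each rollout is
# compared against the other rollouts of the same incident, so no value
# function is needed.
import statistics

def group_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example with eight rollout rewards:
print(group_advantages([1.80, 5.38, -2.50, 6.58, 5.45, -2.00, 6.79, 6.35]))
```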
|