# ChaosOps AI – GRPO LoRA Adapter
A LoRA adapter for Qwen2.5-1.5B-Instruct, fine-tuned with GRPO (Group Relative Policy Optimization, via TRL) on the ChaosOps AI multi-agent incident-response environment.
## What ChaosOps trains
Four LLM agents (SRE · Developer · Manager · Oversight) handle production-incident scenarios (DB deadlock, memory leak, bad config push, autoscaler cost-cut by a rogue AI, misrouted traffic by a rogue load balancer, cascade, DNS outage, disk full, rogue deploy bot) under partial observability. The Oversight agent is rewarded for catching when another AI in the fleet caused the incident before the team applies a fix.
## Training recipe
| Setting | Value |
|---|---|
| Algorithm | TRL GRPO |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA rank / α | 32 / 32 |
| Group size | 2 completions per prompt |
| Optimization steps | 600 (curriculum: 200 EASY → 200 MEDIUM → 200 HARD) |
| Learning rate | 2e-5 (4× the TRL default; a previous flat-LR run plateaued) |
| Sampling temperature | 0.8 |
| Max prompt length | 1024 tokens |
| Max completion length | 96 tokens |
| Hardware | NVIDIA L4 (24 GB) via HF Jobs |
| Reward streams | team (0.6) + oversight (0.4) |
| Final KL to base | 0.14 (controlled but real movement) |
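These hyperparameters map onto peft's `LoraConfig` and TRL's `GRPOConfig` roughly as sketched below. This is an illustrative reconstruction under the table's stated values, not the run's actual training script (which lives in the source repo); the batch size is an assumption.

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA setup from the table: rank 32, alpha 32, all attention + MLP projections.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# GRPO hyperparameters from the table: 2 completions per prompt, lr 2e-5,
# temperature 0.8, 1024-token prompts, 96-token completions, 600 steps total.
grpo_cfg = GRPOConfig(
    output_dir="chaosops-grpo-lora",
    learning_rate=2e-5,
    num_generations=2,
    temperature=0.8,
    max_prompt_length=1024,
    max_completion_length=96,
    max_steps=600,
    per_device_train_batch_size=2,  # assumption: global batch must be divisible by num_generations
)
```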
The reward is composed from four named OpenEnv-style rubrics (resolution, mttr, oversight, cascade); see `chaosops/rewards/reward_fn.py` in the source repo linked below. A sketch of how the weighted streams could be wired into the trainer follows.
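As an illustration of the team (0.6) / oversight (0.4) weighting, a composite reward callable can be passed to `GRPOTrainer`. The dataset and per-rubric scores below are placeholders, not the actual ChaosOps rubric logic:

```python
from datasets import Dataset
from trl import GRPOTrainer

# Toy stand-in dataset of incident prompts; the real prompts come from
# ChaosOps environment rollouts.
train_dataset = Dataset.from_dict({
    "prompt": [
        "Incident: DB deadlock on prod-db. You are the SRE agent. Next action?",
        "Incident: autoscaler cost-cut by a rogue AI. You are the Oversight agent. Next action?",
    ]
})

def chaosops_reward(completions, **kwargs):
    """Placeholder composite reward: 0.6 * team stream + 0.4 * oversight stream.
    The real rubrics (resolution, mttr, oversight, cascade) live in
    chaosops/rewards/reward_fn.py."""
    rewards = []
    for completion in completions:
        team_score = 0.0       # placeholder for resolution / mttr / cascade scoring
        oversight_score = 0.0  # placeholder for rogue-agent-detection scoring
        rewards.append(0.6 * team_score + 0.4 * oversight_score)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=chaosops_reward,
    args=grpo_cfg,          # from the sketch above
    train_dataset=train_dataset,
    peft_config=lora_cfg,   # from the sketch above
)
trainer.train()
```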
## Files

- `adapter_model.safetensors` + `adapter_config.json`: the LoRA adapter itself
- `training_metrics.json`: per-log reward, loss, and KL streams
- `learning_curve.png`: reward curve (axis-labelled, 150 dpi)
## How to use
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "helloAK96/chaosops-grpo-lora")
```
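A minimal generation sketch, assuming the standard Qwen chat-template flow; the incident prompt here is purely illustrative:

```python
messages = [
    {"role": "system", "content": "You are the SRE agent in a ChaosOps incident."},
    {"role": "user", "content": "prod-db is deadlocked and p99 latency is spiking. What is your next action?"},
]
# Build the chat-formatted prompt and generate up to 96 new tokens
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=96, do_sample=True, temperature=0.8)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```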
Or activate it directly inside the live ChaosOps Space by setting the Space secret `CHAOSOPS_ADAPTER_PATH=helloAK96/chaosops-grpo-lora`: the Space will lazily snapshot-download the adapter on the first request and route the trained policy through it.
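Roughly how that lazy loading could look inside the Space; the function name and caching shown here are assumptions for illustration, and the actual wiring is in the source repo:

```python
import os

from huggingface_hub import snapshot_download
from peft import PeftModel

_adapter_model = None  # cached after the first request

def get_policy_model(base_model):
    """Return the trained policy if CHAOSOPS_ADAPTER_PATH is set, else the base model."""
    global _adapter_model
    adapter_repo = os.environ.get("CHAOSOPS_ADAPTER_PATH")
    if not adapter_repo:
        return base_model                                # no secret set: fall back to the base policy
    if _adapter_model is None:
        local_dir = snapshot_download(adapter_repo)      # lazy download, cached by hf_hub
        _adapter_model = PeftModel.from_pretrained(base_model, local_dir)
    return _adapter_model
```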
## Results

Evaluation across 5 seeds × 9 failure types × 3 tiers (540 episodes per policy):
| Tier | Random | Heuristic | Trained (this LoRA) | Oracle |
|---|---|---|---|---|
| EASY | -315.4 / 55% | +94.4 / 100% | -220.8 / 10% | +134.0 / 100% |
| MEDIUM | -641.3 / 40% | +108.7 / 100% | -295.9 / 40% | +150.7 / 100% |
| HARD | -1356.6 / 10% | -425.3 / 50% | -834.2 / 0% | +170.0 / 100% |
Each cell is mean cumulative episode reward / resolution rate. The trained policy beats random by +94.6 (EASY), +345.4 (MEDIUM), and +522.4 (HARD) points, and lifts the MEDIUM-tier resolution rate to 40% (vs. the untrained Qwen's 33%).
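The per-tier margins quoted above are just the trained-minus-random differences from the table; a quick check:

```python
# Mean cumulative reward per tier, copied from the results table
random_policy = {"EASY": -315.4, "MEDIUM": -641.3, "HARD": -1356.6}
trained       = {"EASY": -220.8, "MEDIUM": -295.9, "HARD": -834.2}

for tier in ("EASY", "MEDIUM", "HARD"):
    print(tier, round(trained[tier] - random_policy[tier], 1))
# EASY 94.6, MEDIUM 345.4, HARD 522.4
```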
The training-time reward curve in `learning_curve.png` shows the HARD-tier mean reward (-4.43) ending up better than the EASY-tier mean (-6.11): the curriculum let the model absorb easier-tier dynamics before being pushed into HARD.
## Links

- Live demo (HF Space): https://huggingface.co/spaces/helloAK96/chaosops
- Source repo: https://github.com/vatsalllll/chaos_ops
- Training notebook: notebooks/colab_train.ipynb
- Reward rubric system: chaosops/rewards/reward_fn.py
## Citation

```bibtex
@misc{chaosops_ai_2026,
  title  = {ChaosOps AI: a multi-agent incident-response gym with rogue-agent detection},
  author = {ChaosOps AI Team},
  year   = {2026},
  url    = {https://huggingface.co/spaces/helloAK96/chaosops}
}
```