# ChaosOps AI – GRPO LoRA Adapter
A LoRA adapter for Qwen2.5-1.5B-Instruct, fine-tuned with GRPO (Group Relative Policy Optimization, via TRL) on the ChaosOps AI multi-agent incident-response environment.
## What ChaosOps trains
Four LLM agents (SRE · Developer · Manager · Oversight) handle production-incident scenarios (DB deadlock, memory leak, bad config push, autoscaler cost-cut by a rogue AI, misrouted traffic by a rogue load balancer, cascade, DNS outage, disk full, rogue deploy bot) under partial observability. The Oversight agent is rewarded for catching when another AI in the fleet caused the incident before the team applies a fix.
## Training recipe
| Setting | Value |
|---|---|
| Algorithm | TRL GRPO |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA rank / α | 32 / 32 |
| Group size | 2 completions per prompt |
| Optimization steps | 600 (curriculum: 200 EASY → 200 MEDIUM → 200 HARD) |
| Learning rate | 2e-5 (4× the TRL default; a previous flat-LR run plateaued) |
| Sampling temperature | 0.8 |
| Max prompt length | 1024 tokens |
| Max completion length | 96 tokens |
| Hardware | NVIDIA L4 (24 GB) via HF Jobs |
| Reward streams | team (0.6) + oversight (0.4) |
| Final KL to base | 0.14 (controlled but real movement) |
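These hyperparameters map onto peft's `LoraConfig` and TRL's `GRPOConfig` roughly as sketched below. This is an illustrative reconstruction under the table's stated values, not the run's actual training script (which lives in the source repo); the batch size is an assumption.

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA setup from the table: rank 32, alpha 32, all attention + MLP projections.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# GRPO hyperparameters from the table: 2 completions per prompt, lr 2e-5,
# temperature 0.8, 1024-token prompts, 96-token completions, 600 steps total.
grpo_cfg = GRPOConfig(
    output_dir="chaosops-grpo-lora",
    learning_rate=2e-5,
    num_generations=2,
    temperature=0.8,
    max_prompt_length=1024,
    max_completion_length=96,
    max_steps=600,
    per_device_train_batch_size=2,  # assumption: global batch must be divisible by num_generations
)
```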
The reward is composed from four named OpenEnv-style rubrics (resolution, mttr, oversight, cascade); see `chaosops/rewards/reward_fn.py` in the source repo linked below. A sketch of how the weighted streams could be wired into the trainer follows.
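As an illustration of the team (0.6) / oversight (0.4) weighting, a composite reward callable can be passed to `GRPOTrainer`. The dataset and per-rubric scores below are placeholders, not the actual ChaosOps rubric logic:

```python
from datasets import Dataset
from trl import GRPOTrainer

# Toy stand-in dataset of incident prompts; the real prompts come from
# ChaosOps environment rollouts.
train_dataset = Dataset.from_dict({
    "prompt": [
        "Incident: DB deadlock on prod-db. You are the SRE agent. Next action?",
        "Incident: autoscaler cost-cut by a rogue AI. You are the Oversight agent. Next action?",
    ]
})

def chaosops_reward(completions, **kwargs):
    """Placeholder composite reward: 0.6 * team stream + 0.4 * oversight stream.
    The real rubrics (resolution, mttr, oversight, cascade) live in
    chaosops/rewards/reward_fn.py."""
    rewards = []
    for completion in completions:
        team_score = 0.0       # placeholder for resolution / mttr / cascade scoring
        oversight_score = 0.0  # placeholder for rogue-agent-detection scoring
        rewards.append(0.6 * team_score + 0.4 * oversight_score)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=chaosops_reward,
    args=grpo_cfg,          # from the sketch above
    train_dataset=train_dataset,
    peft_config=lora_cfg,   # from the sketch above
)
trainer.train()
```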
## Files

- `adapter_model.safetensors` + `adapter_config.json`: the LoRA adapter itself
- `training_metrics.json`: per-log reward, loss, and KL streams
- `learning_curve.png`: reward curve (axis-labelled, 150 dpi)
## How to use
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "helloAK96/chaosops-grpo-lora")
```
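A minimal generation sketch, assuming the standard Qwen chat-template flow; the incident prompt here is purely illustrative:

```python
messages = [
    {"role": "system", "content": "You are the SRE agent in a ChaosOps incident."},
    {"role": "user", "content": "prod-db is deadlocked and p99 latency is spiking. What is your next action?"},
]
# Build the chat-formatted prompt and generate up to 96 new tokens
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=96, do_sample=True, temperature=0.8)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```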
Or activate it directly inside the live ChaosOps Space by setting the Space secret `CHAOSOPS_ADAPTER_PATH=helloAK96/chaosops-grpo-lora`: the Space will lazily snapshot-download the adapter on the first request and route the trained policy through it.
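Roughly how that lazy loading could look inside the Space; the function name and caching shown here are assumptions for illustration, and the actual wiring is in the source repo:

```python
import os

from huggingface_hub import snapshot_download
from peft import PeftModel

_adapter_model = None  # cached after the first request

def get_policy_model(base_model):
    """Return the trained policy if CHAOSOPS_ADAPTER_PATH is set, else the base model."""
    global _adapter_model
    adapter_repo = os.environ.get("CHAOSOPS_ADAPTER_PATH")
    if not adapter_repo:
        return base_model                                # no secret set: fall back to the base policy
    if _adapter_model is None:
        local_dir = snapshot_download(adapter_repo)      # lazy download, cached by hf_hub
        _adapter_model = PeftModel.from_pretrained(base_model, local_dir)
    return _adapter_model
```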
## Results

Evaluation across 5 seeds × 9 failure types × 3 tiers (540 episodes per policy):
| Tier | Random | Heuristic | Trained (this LoRA) | Oracle |
|---|---|---|---|---|
| EASY | -315.4 / 55% | +94.4 / 100% | -220.8 / 10% | +134.0 / 100% |
| MEDIUM | -641.3 / 40% | +108.7 / 100% | -295.9 / 40% | +150.7 / 100% |
| HARD | -1356.6 / 10% | -425.3 / 50% | -834.2 / 0% | +170.0 / 100% |
Each cell is mean cumulative episode reward / resolution rate. The trained policy beats random by +94.6 (EASY), +345.4 (MEDIUM), and +522.4 (HARD) points, and lifts the MEDIUM-tier resolution rate to 40% (vs. the untrained Qwen's 33%).
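The per-tier margins quoted above are just the trained-minus-random differences from the table; a quick check:

```python
# Mean cumulative reward per tier, copied from the results table
random_policy = {"EASY": -315.4, "MEDIUM": -641.3, "HARD": -1356.6}
trained       = {"EASY": -220.8, "MEDIUM": -295.9, "HARD": -834.2}

for tier in ("EASY", "MEDIUM", "HARD"):
    print(tier, round(trained[tier] - random_policy[tier], 1))
# EASY 94.6, MEDIUM 345.4, HARD 522.4
```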
The training-time reward curve in `learning_curve.png` shows the HARD-tier mean reward (-4.43) ending up better than the EASY-tier mean (-6.11): the curriculum let the model absorb easier-tier dynamics before being pushed into HARD.
## Links

- Live demo (HF Space): https://huggingface.co/spaces/helloAK96/chaosops
- Source repo: https://github.com/vatsalllll/chaos_ops
- Training notebook: notebooks/colab_train.ipynb
- Reward rubric system: chaosops/rewards/reward_fn.py
## Citation

```bibtex
@misc{chaosops_ai_2026,
  title  = {ChaosOps AI: a multi-agent incident-response gym with rogue-agent detection},
  author = {ChaosOps AI Team},
  year   = {2026},
  url    = {https://huggingface.co/spaces/helloAK96/chaosops}
}
```